





presented to the University of Waterloo
in fulfillment of the




Waterloo, Ontario, Canada, 2019
© Matthew Amy 2019
Examining Committee Membership
The following served on the Examining Committee for this thesis. The decision of the
Examining Committee is by majority vote.
External Examiner: Simon Perdrix
Chargé de recherche, CNRS
Supervisor: Michele Mosca
Professor, Dept. of Combinatorics & Optimization
Internal Member: Prabhakar Ragde
Professor, David R. Cheriton School of Computer Science
Internal Member: Richard Cleve
Professor, David R. Cheriton School of Computer Science
Internal-External Member: Jon Yard
Associate Professor, Dept. of Combinatorics & Optimization
Author’s declaration
This thesis consists of material all of which I authored or co-authored. This is a true copy
of the thesis, including any required final revisions, as accepted by my examiners.
I understand that my thesis may be made electronically available to the public.
Abstract
The design and compilation of correct, efficient quantum circuits is integral to the
future operation of quantum computers. This thesis makes contributions to the problems of
optimizing and verifying quantum circuits, with an emphasis on the development of formal
models for such purposes. We also present software implementations of these methods,
which together form a full stack of tools for the design of optimized, formally verified
quantum oracles.
On the optimization side, we study methods for the optimization of RZ and CNOT gates
in Clifford+RZ circuits. We develop a general, efficient optimization algorithm called phase
folding, which computes the phase polynomial of a circuit and uses it to reduce the number
of RZ gates without increasing any other metric. This algorithm can further be combined with synthesis techniques for
CNOT-dihedral operators to optimize circuits with respect to particular costs. We then
study the optimal synthesis problem for CNOT-dihedral operators from the perspectives of
RZ and CNOT gate optimization. In the case of RZ gate optimization, we show that the
optimal synthesis problem is polynomial-time equivalent to minimum-distance decoding in
certain Reed-Muller codes. For the CNOT optimization problem, we show that the optimal
synthesis problem is at least as hard as a combinatorial problem related to Gray codes.
In both cases, we develop heuristics for the optimal synthesis problem, which together
with phase folding reduce T counts by 42% and CNOT counts by 22% across a suite of
real-world benchmarks.
From the perspective of formal verification, we make two contributions. The first is
the development of a formal model of quantum circuits with ancillary bits based on the
Feynman path integral, along with a concrete verification algorithm. The path integral
model, with some syntactic sugar, further doubles as a natural specification language for
quantum computations. Our experiments show some practical circuits with up to hundreds
of qubits can be efficiently verified. Our second contribution is a formally verified, optimizing
compiler for reversible circuits. The compiler compiles a classical, irreversible language to
reversible circuits, with a formal, machine-checked proof of correctness written in the proof
assistant F*. The compiler is structured as a partial evaluator, allowing verification to be
carried out significantly faster than previous results.
Acknowledgements
Having spent a great deal of my adult life in graduate studies, I’m very grateful for the
friends, researchers and support staff who have made the process a pleasure.
I first wish to thank my advisor, Michele Mosca, for providing me with guidance, support
and useful ideas throughout my time at Waterloo. I also wish to thank my committee
members Jon Yard, Prabhakar Ragde, Richard Cleve and Simon Perdrix for reading a thesis
that turned out significantly longer than intended, as well as for their helpful comments and
conversations over the years. I especially wish to thank Simon for travelling to Waterloo in
the worst week of Winter to attend my defence.
Over the many years I spent as a graduate student, I had the privilege of working with
talented researchers and fellow students who have impacted me greatly as a researcher. For
that I wish to thank Julien Ross, Vlad Gheorghiu, Martin Roetteler, Vadym Kliuchnikov,
Olivia Di Matteo, Mathias Soeken, Alex Parent, John Schanck, Parsiad Azimzadeh, Nathan
Killoran and Dmitri Maslov.
The many years I spent at Waterloo would have been far less bearable and far more
damaging to my mental stability were it not for the friends I made there. I wish to thank
in particular Parsiad Azimzadeh, Alexandre Laplante, Vincent Launchbury, Kyle Robinson,
Vlad Gheorghiu, Olivia Di Matteo, Alex Parent, Vincent Russo, Sebastian Verschoor, Mária
Kieferová and the memory of Alex S. Forskan for keeping me sane.
Finally, I wish to thank my parents for all of their immeasurable support and for always
letting me stay inside to play with computers, and Rebecca for her patience, love, and for





List of Tables
List of Figures
I Introduction
1 Introduction
1.1 Quantum circuit optimization
1.1.1 Gate sets
1.1.2 Cost functions
1.2 Verification
1.2.1 Functional verification
1.2.2 Compiler verification
1.3 Contributions
1.4 Outline
2 Quantum computation
2.1 The standard model
2.2 Quantum circuits
2.3 Reversible computation
3 Path integrals
3.1 The Path Integral formulation
3.1.1 Sum-over-paths actions
3.1.2 Composition
3.1.3 Representation
3.2 Clifford+RZ path integrals
3.2.1 Annotated circuits
3.2.2 CNOT-dihedral circuits
3.3 Phase polynomials
II Optimization
4 Quantum Phase folding
4.1 Overview
4.1.1 Branching gates
4.1.2 Abstraction
4.2 The phase-folding algorithm
4.2.1 Phase analysis
4.2.2 Phase-folding
4.2.3 Examples
4.3 CNOT-dihedral resynthesis
4.3.1 Phase range analysis
4.4 Experiments
4.5 Related work
5 T-count optimization
5.1 Coding theory
5.2 Decoding phase polynomials
5.2.1 Coding interpretation
5.2.2 Connection to Reed-Muller codes
5.3 Applications
5.3.1 Upper bounds on phase gate counts
5.3.2 Optimization algorithm
5.3.3 Complexity of phase gate optimization
5.4 Generators of Cnm
5.4.1 Rotations of odd order
5.4.2 Rotations of order $2^k$
5.5 Experiments
5.5.1 Evaluation
5.6 Related work
6 CNOT-count optimization
6.1 Parity networks
6.1.1 From CNOT-minimization to parity network synthesis
6.2 Complexity of parity network minimization
6.2.1 Fixed-target minimal parity network
6.2.2 Minimal parity network with encoded inputs
6.2.3 Discussion
6.3 A heuristic synthesis algorithm
6.3.1 Examples
6.3.2 Synthesis with encoded inputs
6.4 Evaluation
6.4.1 Benchmarks
6.5 Related work
III Verification
7 Functional verification
7.1 The path integral model
7.1.1 Composing path integrals
7.1.2 The path integral semantics
7.1.3 Computational efficiency
7.2 A calculus for path integrals
7.2.1 Motivation
7.2.2 Reduction rules
7.2.3 Examples
7.3 Completeness
7.3.1 Heuristics
7.3.2 Clifford completeness
7.4 Case studies
7.4.1 Translation validation
7.4.2 Verifying quantum algorithms
7.5 Related work
8 Verified compilation
8.1 Languages
8.1.1 The Source
8.1.2 Boolean expressions
8.1.3 Target architecture
8.2 Compilation
8.2.1 Boolean expression compilation
8.2.2 Revs compilation
8.2.3 Eager cleanup
8.3 Parameter inference
8.4 Verification
8.4.1 Boolean expression compilation
8.4.2 Revs compilation
8.5 Experiments
8.6 Related work
IV Conclusion
9 Conclusion
9.1 Future work
References
Appendices
A Correctness of Phase-folding
List of Tables
4.1 Phase-folding optimization results
5.1 T-count optimization results
6.1 CNOT-count optimization results
7.1 Optimization validation results
7.2 Results of verifying formally specified quantum algorithms.
8.1 Bit and gate counts for Revs and ReVerC
List of Figures
1.1 A quantum circuit diagram
1.2 Transversal implementation of a logical X gate in the 5-qubit code.
1.3 The Bravyi-Kitaev 15-to-1 A-state distillation circuit.
2.1 An example of a quantum circuit
2.2 Standard quantum gates
2.3 A reversible circuit uncomputing all temporary ancillas.
3.1 The paths of a particle from point A to B.
3.2 An annotated circuit implementing the Toffoli gate.
5.1 Relationship between circuits, sum-over-paths actions, and unitaries
5.2 Evaluation vectors for monomials of n variables.
6.1 Circuit implementing the doubly-controlled Z gate CCZ synthesized with Algorithm 6.1
6.2 Annotated parity network for the set $S = \{(y, 1) \mid y \in \mathbb{F}_2^3\}$
6.3 Average CNOT counts of parity networks computed by Algorithm 6.1 and brute force minimization.
6.4 Average CNOT counts of parity networks synthesized with Algorithm 6.1 for sets of parities on 10 bits.
7.1 Path integral reduction rules
7.2 Circuits for the Quantum Hidden Shift algorithm.
8.1 Syntax of Revs.
8.2 Implementation of an n-bit adder.
8.3 Revs implementation of SHA-256
8.4 Operational semantics of Revs.







Since the conceptualization of a quantum mechanical computer by Yuri Manin and Richard
Feynman in the early 1980s, researchers have strived to build such machines. Motivated
by the promise of exponentially faster solutions to important problems such as simulation
of quantum systems [Fey82,Llo96] and later integer factorization [Sho94], researchers in
the 1980s and 1990s laid the groundwork for physical implementations of quantum computers
via technologies including NMR [CFH97, GC97], trapped ions [CZ95], quantum dots
[LD98] and quantum optics [KLM01], as well as theoretical results on universal primitives
[Tof80,Deu85] and error correction [Sho95]. Despite such advancements, the lack of scalable,
reliable quantum computers throughout the 2000s led to some pessimism towards quantum
computation outside of the physics community, and as a result few scalable tools for the
design and manipulation of quantum computations were developed.
We are now in an era where large-scale quantum computers appear not only feasible but
a relative, if still distant, certainty. As a result, the development of software design tools for
quantum computers has rapidly accelerated. Quantum programming languages and the associated
compilers in particular have seen extensive work, with a large array of programming
environments having been developed – for instance Quipper [GLR+13], ScaffCC [JPK+15],
Quil/PyQuil [SCZ16], ProjectQ [SHT18], Q|SI〉 [LWZ+17], Q# [SGT+18] and Strawberry
Fields [KIQ+18] to name just a few. Such compilers allow a programmer to implement
a quantum algorithm in a high-level language or API, which is then compiled to a form
suitable for running on physical hardware, either real or otherwise. This low-level form is
typically described as a quantum circuit – a straight-line program sequentially applying
quantum logic gates and measurements to individual quantum bits (qubits). In some cases,
portable assembly-style languages representing quantum circuits have themselves been
developed and adopted as a target language for compilers [SCZ16,CBSG17].
A complementary problem to the physical operation of quantum computing hardware,
which has likewise driven the development of quantum compilers, is that of resource
estimation [IAR13] – estimating the amount of time and space resources required
to implement quantum algorithms on realistic hardware. Motivated by the need to
understand the power of quantum computing in order to make policy decisions and
assess cryptographic security in a post-quantum setting, quantum resource estimates
have become increasingly common tools for establishing the practicality of different algorithms
[GLRS16,AMG+16,SVM+17,ABL+18]. For such applications, large-scale, reliable
compilation tools are particularly important, as instance sizes are typically on the order of
thousands or millions of qubits and gates.
In this thesis we study two problems related to the compilation of quantum circuits:
optimization and verification. A major theme running through this thesis is the development
and use of formal methods for the analysis of quantum circuits, by which we mean the
use of mathematical models to rigorously prove properties of quantum circuits related
to optimization and verification. We focus on scalability and practicality – in particular,
alongside theoretical investigations we develop concrete algorithms which are automated,
light-weight and scalable. The techniques developed here are implemented in a circuit-level
compiler infrastructure, Feynman, inspired by LLVM [LA04], as well as ReVerC, a
compiler from classical programs to reversible circuits. Together, they form a toolchain for
the verified compilation of oracles – classical functions implemented on quantum hardware.
We now briefly outline these contributions.
1.1 Quantum circuit optimization
The quantum circuit model forms the standard model of quantum computation [NC00], used
ubiquitously to describe and implement quantum algorithms. In order to reduce the concrete
resources needed to implement quantum algorithms, it is important to optimize circuits
for various properties, depending on the particular architecture or level of abstraction. For
instance, to fit a quantum circuit simultaneously using n qubits – the basic units of quantum
information – onto a computer with only m < n available qubits, optimization must be
performed to reduce the number of qubits in simultaneous use. Likewise, optimization may be
performed to use extra available physical qubits to reduce other resources such as execution








Figure 1.1: A quantum circuit diagram
Such optimizations can have a significant impact on the size of a quantum computer needed
to implement an algorithm, and moreover to outperform its classical counterpart.
In the standard model, the state of an n-qubit register is defined as a unit vector in
the $2^n$-dimensional Hilbert space $\mathbb{C}^{2^n}$, and quantum gates are defined as elements of the
unitary group $U(2^n)$ for some n, acting on a state via matrix-vector multiplication. For
the purposes of this thesis, a circuit over a particular gate set $G \subseteq U(2^n)$ will generally
be a sequence of gates of G or their inverses – that is, a circuit is some element of the free
group 〈G〉. Quantum circuits are displayed diagrammatically, as shown in Figure 1.1, with
horizontal lines carrying qubits from left to right and labelled boxes representing gates
acting non-trivially on a subset of qubits. A circuit C ∈ 〈G〉 is further said to implement
the unitary operator $\llbracket C \rrbracket$, where $\llbracket \cdot \rrbracket : \langle G \rangle \to U(2^n)$ is the unique homomorphism into $U(2^n)$
sending a gate U to itself. We call two circuits implementing the same unitary operator
equivalent.
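As a minimal sketch of this semantics, the following pure-Python snippet evaluates a word over a small single-qubit gate set to the matrix it implements and compares two circuits for equivalence. The gate set, the `"^-1"` suffix for inverse gates, and the function names `semantics` and `equivalent` are illustrative assumptions rather than anything defined in this thesis.

```python
import math

# Hypothetical single-qubit gate set G, as 2x2 complex matrices.
R = 1 / math.sqrt(2)
GATES = {
    "H": [[R, R], [R, -R]],
    "S": [[1, 0], [0, 1j]],
}

def matmul(a, b):
    """Product of two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def dagger(m):
    """Conjugate transpose, which is the inverse of a unitary matrix."""
    n = len(m)
    return [[complex(m[j][i]).conjugate() for j in range(n)] for i in range(n)]

def semantics(circuit):
    """Unitary implemented by a word over G and its inverses.

    A trailing "^-1" marks an inverse gate. Gates are applied left to
    right, so each new gate multiplies the running product on the left.
    """
    u = [[1, 0], [0, 1]]
    for name in circuit:
        if name.endswith("^-1"):
            gate = dagger(GATES[name[:-3]])
        else:
            gate = GATES[name]
        u = matmul(gate, u)
    return u

def equivalent(c1, c2, tol=1e-9):
    """True if both circuits implement the same unitary (up to rounding)."""
    u1, u2 = semantics(c1), semantics(c2)
    return all(abs(u1[i][j] - u2[i][j]) < tol
               for i in range(2) for j in range(2))
```

For instance, `equivalent(["H", "H"], [])` and `equivalent(["S", "S^-1"], [])` both hold, whereas `["H", "S"]` and `["S", "H"]` are not equivalent. Concatenating two words corresponds to composing their operators, which is the homomorphism property in miniature.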
We define the problem of quantum circuit optimization over a gate set G and cost
function φ : 〈G〉 → ℝ as that of finding an equivalent circuit over G minimizing φ.
Problem 1.1.1 (Circuit optimization for G, φ). Given C ∈ 〈G〉, find C′ ∈ 〈G〉 such that
$\llbracket C \rrbracket = \llbracket C' \rrbracket$ and C′ minimizes φ(C′).
An important distinction is whether an algorithm solves Problem 1.1.1 exactly – i.e.
finds some circuit C′ with a provably minimal cost – or as a heuristic, returning some circuit
C′ which is better, but not necessarily optimal, with respect to φ. As circuit optimization is
typically intractable,³ many of the methods we develop are heuristic optimizations.
Two closely related problems are those of optimal synthesis and optimal mapping, defined
below.
³ Classical logic minimization, which is contained in the optimization problem over certain gate sets, was
recently shown to be $\Sigma_2^P$-complete [BU11] and hence strictly more difficult than NP, barring a collapse in
the polynomial hierarchy.
Problem 1.1.2 (Optimal synthesis for G, φ). Given $U \in U(2^n)$, find C ∈ 〈G〉 such that
$\llbracket C \rrbracket = U$ minimizing φ(C).
Problem 1.1.3 (Optimal mapping for G, G′, φ). Given C ∈ 〈G〉, find C′ ∈ 〈G′〉 such that
$\llbracket C \rrbracket = \llbracket C' \rrbracket$ minimizing φ(C′).
While all three problems are closely related and frequently interchangeable, we distinguish
them as each serves a distinct purpose in a quantum compiler toolchain. Optimal synthesis
is typically reserved for a few qubits (e.g., [DN06,KMM13b,AMMR13,KMM13a, Sel15,
RS16,PS14,BRS15]) as the size of the unitary representation scales exponentially in n,
or for restricted gate sets (e.g., [PMH08]) which admit efficient representations. On the
other hand, optimal mapping (assuming G ≠ G′) is only applicable in compilation contexts
when moving from a higher-level gate set to a lower-level (e.g., [ASD14,AADS16,Mas16])
or machine-specific one (e.g., [ZPW18]). In either case, strict circuit optimizations are
typically combined with per-gate mappings to achieve effective results (e.g., [NRS+18]).
1.1.1 Gate sets
An important factor in the circuit optimization problem is the choice of gate set. The gate
sets we consider in this thesis are largely motivated by fault-tolerant quantum computing,
which we briefly outline.
Due to the high error rate of physical qubits and gates – on the order of $10^{-3}$ – it
is widely believed that quantum computers will require some form of error correction to
execute non-trivial algorithms to reasonable accuracy. This is typically done by using
an error-correcting code (ECC), which encodes the state of a logical qubit in the state
of multiple physical qubits. As the no-cloning theorem [WZ82] states that a quantum
state cannot be copied, in contrast to classical error correction where information can be
redundantly encoded by simply repeating each bit, quantum ECCs operate by encoding
basis vectors of $\mathbb{C}^{2^n}$ according to an ECC.
As decoding a quantum state to apply gates would expose the state to un-correctable
errors, fault-tolerant quantum computation combines an ECC with a discrete set of gates
which can be performed directly on encoded values without first decoding. Such logical
gates are typically implemented as circuits, though in modern planar codes such as surface
codes they may be implemented in various other ways [FMMC12,HFDM12]. In either case,
the Clifford group, generated by the Hadamard (H), controlled-NOT (CNOT) and π/4









Figure 1.2: Transversal implementation of a logical X gate in the 5-qubit code.










$$\mathrm{CNOT} = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{pmatrix}$$






The generators of the Clifford group also have the convenient property that they can usually
be implemented transversally; given an n-bit code, the logical operation U is implemented
transversally with n physical copies of U , as in Figure 1.2.
As the Clifford group is not universal for quantum computing [Got98], at least one
additional operation is needed to be able to approximate any unitary transformation to
arbitrary accuracy. A typical choice is the π/8 phase gate ($T = \mathrm{diag}(1, e^{i\pi/4})$), giving the
Clifford+T gate set. Contrary to the Clifford group, the logical T gate is not transversal in
most codes, and is instead implemented via magic state distillation and gate teleportation.
Indeed, it is known that for any universal set of gates, at least one must be non-transversal in
standard (stabilizer) codes [EK09].











gates for any θ ∈ R. We call this gate set the
Clifford+RZ set, which strictly contains Clifford+T since S = RZ(π/2) and T = RZ(π/4).
We collectively refer to RZ gates as phase gates or Z-axis rotations. Such gates have
applications to fault-tolerance schemes distilling higher-order rotations [CAB12], and are
used in most standard quantum algorithms, including the Quantum Fourier Transform


















Figure 1.3: The Bravyi-Kitaev 15-to-1 A-state distillation circuit.
One other gate set we consider is the CNOT-dihedral gate set, consisting of CNOT, X
and RZ gates of arbitrary angle. The name CNOT-dihedral arises from the fact that for
any θ of order m in $\mathbb{R}/2\pi\mathbb{Z}$, X and RZ(θ) generate a subgroup of $U(2^n)$ which is isomorphic
to the dihedral group of order 2m, up to global phase. That is, the following equations
suffice to generate all equalities over X and RZ(θ) gates:
$$X^2 = I, \qquad R_Z(\theta)^m = I, \qquad X R_Z(\theta) X R_Z(\theta) = e^{i\theta} I.$$
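The three relations can be checked numerically on single-qubit matrices. The sketch below assumes the convention RZ(θ) = diag(1, e^{iθ}), which is consistent with S = RZ(π/2) and T = RZ(π/4) as used above, and takes θ = 2π/m so that θ has order m. The helper names are hypothetical.

```python
import cmath
import math

X = [[0, 1], [1, 0]]

def rz(theta):
    """Assumed convention: RZ(theta) = diag(1, e^{i*theta})."""
    return [[1, 0], [0, cmath.exp(1j * theta)]]

def matmul(a, b):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def power(m, k):
    """k-th power of a 2x2 matrix by repeated multiplication."""
    result = [[1, 0], [0, 1]]
    for _ in range(k):
        result = matmul(result, m)
    return result

def is_scalar(m, c, tol=1e-9):
    """True if m equals c times the identity, up to rounding."""
    target = [[c, 0], [0, c]]
    return all(abs(m[i][j] - target[i][j]) < tol
               for i in range(2) for j in range(2))

def check_dihedral_relations(m):
    """Check X^2 = I, RZ^m = I and X RZ X RZ = e^{i*theta} I for theta = 2*pi/m."""
    theta = 2 * math.pi / m
    r = rz(theta)
    xr = matmul(X, r)
    return (
        is_scalar(matmul(X, X), 1),
        is_scalar(power(r, m), 1),
        is_scalar(matmul(xr, xr), cmath.exp(1j * theta)),
    )
```

Calling `check_dihedral_relations(8)` covers the case θ = π/4, i.e. the T gate. The third relation is where the qualifier "up to global phase" enters: X RZ(θ) X equals RZ(θ)⁻¹ only after discarding the scalar e^{iθ}.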
1.1.2 Cost functions
As with gate sets, the cost functions we consider in this thesis are largely motivated by
fault-tolerance. In particular, since magic state distillation is orders of magnitude more
expensive than transversal gates [BK05] (see also Figure 1.3), in the Clifford+T gate set the
Clifford group is typically considered “free,” and the number of T gates determines the cost
of the circuit. Likewise, RZ gates of smaller angles are implemented either via Clifford+T
approximation or, for angles of the form $1/2^k$, magic state distillation – in either case, the
resulting cost dwarfs the cost of Clifford gates, or even an individual T gate [CO16].
The main types of cost functions we consider are U-counts – that is, the number of $U$
or $U^\dagger$ gates in a circuit.
Definition 1.1.4. For a gate U ∈ G, the U-count of a circuit C ∈ 〈G〉 is given by
$\phi_U(C)$ = # of $U$ or $U^\dagger$ gates in C.
In the case of RZ-counts, depending on the application we consider either the total
number of RZ gates, or the number of RZ gates of highest multiplicative order, as such
gates are the most difficult to implement fault-tolerantly.
In this thesis we develop methods for the RZ-count optimization of Clifford+RZ circuits.
In particular, we provide a heuristic optimization for the Clifford+RZ gate set and the
RZ-count cost function. The heuristic allows further optimization of Clifford+RZ circuits
by solving the optimal synthesis problem for the subset of CNOT-dihedral circuits. We then
solve the optimal synthesis problem for CNOT-dihedral circuits and RZ(θ)-count for any θ
by giving a polynomial-time equivalence between optimal synthesis and minimum-distance
decoding of certain Reed-Muller codes, and further give an efficient heuristic.
Another metric that is particularly important for near-term quantum computers is
CNOT-count. Near-term, or NISQ [Pre18], devices operate without error correction, and
hence are able to efficiently implement any single-qubit gates, including RZ gates of arbitrary
angle. On the other hand, two-qubit gates – usually the CNOT gate – are commonly the
most costly in terms of both time and fidelity. To perform optimization for such applications,
we study the optimal synthesis problem for CNOT-dihedral circuits with respect to CNOT-
count. We characterize the problem as reducing to a certain combinatorial problem we call
parity network synthesis in particular cases, and give an efficient heuristic based on this
characterization.
Other cost functions which are relevant to fault-tolerant circuit design are gate depths,
and in particular the T-depth, meaning the minimum number of stages of parallel T gates
in a circuit. We do not directly study T-depth optimization in this thesis, but note that
previous related work by the author [AMM14] examined this problem.
1.2 Verification
Inexorably linked with the concept of optimization is that of verification. To achieve good
performance, software is optimized using complex algorithms which are difficult to program
correctly, and which are sometimes not even valid in all cases. The -Ofast optimizations of the GCC
compiler, for instance, are explicitly not valid for all standard-compliant C programs [GCC16],
and many more bugs have been found in the less aggressive optimizations [LAS14]. Similar
bugs occur in optimizing quantum compilers, such as the following semantically different
circuits compiled from the same program by the Revs compiler [PRS15], with and without
optimizations enabled.
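[Circuit diagrams not reproduced. Without optimization: a six-wire circuit with inputs x0, x1, 0, 0, 0, 0 and outputs x0, x1, 0, 0, x0, x0 ⊕ x1. With optimization: a four-wire circuit with inputs x0, x1, 0, 0 and outputs x0, x1, 0, 0.]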
In classical computing, for most non-critical applications testing suffices to ensure
correctness. In the quantum domain, however, testing is not realistic in the vast majority
of cases due to the classical difficulty of simulating quantum circuits, as well as the lack
of general purpose, large-scale quantum computers. For this reason formal verification –
formal proof of correctness or other properties with respect to the underlying mathematical
model such as linear algebra – is crucial for asserting the correctness of quantum circuits
and, by extension, resource analyses.
A question which arises in the context of quantum circuit verification is which properties
are useful to verify? While classically, verification typically focuses on safety properties such
as buffer overflow and out of bounds memory access, such properties are largely irrelevant
to pure quantum circuits as they lack branching control and memory. In effect, pure
quantum circuits cannot fail or crash; they are either correct or incorrect.
We call the problem of verifying the correctness of a circuit with respect to a unitary
operator functional verification, in line with its use in electronic design automation.
Problem 1.2.1 (Functional verification). Given a unitary $U \in U(2^n)$ and a circuit C,
determine if $\llbracket C \rrbracket = U$.
For more complex quantum computations such as communication and cryptographic
protocols, safety properties are relevant and verification methods for those have been devised
(e.g., [AC04,GNP08,BCM08,FYY13,FHTZ15]). This thesis only considers the question of
functional correctness, as it is directly relevant to compiling pure quantum circuits.
1.2.1 Functional verification
In order to verify the functionality of a circuit, the unitary U must first be specified in
some formal mathematical model, for instance a matrix; moreover, the specification must
itself be verified to ensure that it correctly specifies the intended functionality. Whether
this verification is performed via testing of the design or automated generation from some
higher-level specification language, human insight is necessarily required to ensure that the
specification is correct.
Consider, for example, integer addition of two registers. Even in the case of a one-bit
register, the unitary matrix is
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0

,
which is effectively incomprehensible by humans, and hence unsuitable as a specification –
in classical logic design, such a specification corresponds to a truth table, which is typically
generated from a human-readable logical expression. An alternative is a circuit written




Functional verification against circuit-like specifications is called equivalence checking, and
has seen extensive research (e.g., [WGMD09,YM10,CD11,DL13,GD17,RPZ17]). However,
when viewed as a direct (human-level) specification, higher-level circuits are still error-prone.
As quantum programming languages typically take the form of circuit description languages,
human-readable specifications of quantum circuits and algorithms are not yet a reality.
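To make the contrast between a logical specification and its matrix concrete, the sketch below generates the permutation matrix of an in-place modular adder from a one-line description. The convention |x〉|y〉 ↦ |x〉|x + y mod 2^k〉, the big-endian basis ordering, and the names adder_matrix and is_permutation are assumptions made for illustration; they are not taken from the thesis and need not reproduce the matrix printed above entry for entry.

```python
def adder_matrix(k):
    """Permutation matrix of |x>|y> -> |x>|(x + y) mod 2^k> on two k-bit registers."""
    size = 2 ** k          # number of values held by one register
    dim = size * size      # dimension 2^(2k) of the joint state space
    matrix = [[0] * dim for _ in range(dim)]
    for x in range(size):
        for y in range(size):
            col = x * size + y                 # index of the input basis state
            row = x * size + (x + y) % size    # index of the output basis state
            matrix[row][col] = 1
    return matrix


def is_permutation(matrix):
    """A 0/1 matrix is a permutation matrix iff every row and column sums to 1."""
    rows_ok = all(sum(row) == 1 for row in matrix)
    cols_ok = all(sum(col) == 1 for col in zip(*matrix))
    return rows_ok and cols_ok
```

The specification is the single expression (x + y) % size; the 2^(2k) × 2^(2k) matrix is derived from it, much as a truth table is derived from a logical expression. Under these conventions, adder_matrix(1) is the CNOT matrix.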
On the verification side, a mathematical model which is as abstract as possible – in the
sense that implementation details are abstracted away – while remaining efficient to represent
and manipulate is needed in order to scale to useful sizes. A model which has recently
been applied to the verification of quantum circuits is the ZX-calculus [CD11,DL13,GD17].
Such graphical models abstract away many syntactic details of circuits, including structural
properties of the underlying symmetric monoidal category and motivating the tagline only the
topology matters, while still remaining efficiently representable. However, re-writing graphs
is generally difficult [BGK+16] and these systems typically require human intervention to
complete proofs due to a lack of natural direction for proofs to proceed [DL13,GD17].
We develop a formal model – based on the Feynman path integral, which also factors
heavily in our optimization work – for quantum circuits which doubles (with some syntactic
sugar) as a natural specification language. Using this mathematical model, we give a
functional verification algorithm and demonstrate its ability to verify large quantum circuits
directly against their functional specifications. Specifically, we provide an automated
quantum circuit verification method which allows verification against a logical instead of a
circuit-like description of a quantum computation.
1.2.2 Compiler verification
In contrast to functional verification, which is useful for asserting the correctness of a
circuit with respect to a human-readable, logical specification, it is often desirable to simply
ensure that a compiler has not introduced errors into a program which is otherwise assumed
to be correct. While functional verification can double for this purpose via equivalence
checking between a source program and the compiled circuit, such methods are typically not
scalable and moreover need to be applied individually to every compiled circuit. Instead, it is
preferable to directly validate that the compiler itself is correct and never introduces errors.
When this certification is performed formally, the process is called compiler verification.
The idea of formally proving the correctness of a compiler was developed in the late
60’s by McCarthy and Painter [MP67], leading up to the first machine-checked proofs
of correctness in the early 70’s by Diffie, Milner and Weyhrauch [MW72]. However, the
advancements in proof assistants have recently led to an explosion in popularity of the
technique, spurred by Leroy’s formally verified, optimizing C compiler CompCert [Ler06].
The novel insight of Leroy, facilitated by these advancements, was to write the compiler,
specification, and formal proof all in the same language – for instance, Coq [Coq17] –
allowing the implementation of the compiler to be directly verified, as opposed to just the
verification algorithm as in [MW72].
As an illustration, an implementation of the factorial function, together with a proof
that the factorial function always returns a positive integer on non-negative integers, is
given in the proof assistant F⋆ [SHK+16] below.

val factorial : nat -> nat
let rec factorial x = match x with
  | 0 -> 1
  | _ -> x * factorial (x - 1)

val factorial_is_pos: (x:nat) -> Lemma (factorial x > 0)
let rec factorial_is_pos x = match x with
  | 0 -> () (* factorial 0 = 1 > 0 *)
  | _ -> factorial_is_pos (x - 1)
         (* factorial (x - 1) > 0 => x * factorial (x - 1) > 0 *)

The type of factorial_is_pos specifies
that the program computing the factorial function maps natural numbers to positive
integers, while the implementation of factorial_is_pos functions as a proof which is
machine checked by the F⋆ compiler. In contrast to proof assistants based more directly
on constructive logic such as Coq, the F⋆ compiler uses an SMT solver to machine check
proofs, allowing proofs to be partially specified as above.
We follow the methodology laid out by Leroy [Ler06] to develop a formally verified
optimizing compiler for reversible circuits. The compiler ReVerC compiles a classical
irreversible language Revs [PRS15] to reversible circuits. A notable aspect of our compiler
is that it is built around partial evaluation, which offloads most of the proof obligations to
a tiny two-instruction language. In practice, this allowed the formal proof to be carried out
with a low (see, e.g., [Ler06]) ratio of approximately 1:1 lines of code and programmer time
compared to the development of the compiler proper.
1.3 Contributions
In summary, the main research contributions of this thesis are:
• An algorithm for optimizing quantum circuits which merges Z-axis rotation gates
and groups them into CNOT-dihedral sub-circuits for further optimization.
• A proof of the polynomial-time equivalence of RZ-count minimization of CNOT-
dihedral circuits and minimum distance decoding of certain Reed-Muller codes, along
with an associated optimization algorithm.
• A characterization of CNOT-count minimization of CNOT-dihedral circuits as minimal
parity network synthesis. The minimal parity network synthesis problem is shown to
be NP-complete for two restricted cases, and an efficient heuristic is given.
• A mathematical model for the formal specification of quantum circuits, along with a
verification algorithm which is complete for Clifford circuits and scales to Clifford+RZ
circuits with hundreds of qubits.
• A formally verified reversible circuit compiler which ties together both verification
and optimization by optimizing for the total space usage of compiled circuits.
The contributions above are published chronologically in [AM16,ARS17,AAM18,Amy18].
In addition to the work described here, other publications completed during my doctoral
studies include [AADS16,AASD16,AMG+16,ACJR17,KIQ+18].
1.4 Outline
This thesis is divided into three parts, which are briefly outlined below.
• Part I (chapters 1 to 3) covers background material necessary for this thesis.
In Chapter 2 we describe the circuit model of quantum computing and its standard
interpretation in linear algebraic terms. Chapter 3 introduces the path integral model
of quantum circuits, which underlies many of the theoretical results in this thesis.
• Part II (chapters 4 to 6) covers optimization algorithms for quantum circuits. Chapter 4
introduces a general optimization algorithm called phase folding to merge Z-axis
rotations in arbitrary quantum circuits. The remaining chapters examine the problem
of optimal synthesis of CNOT-dihedral circuits, which arises as a sub-problem of
the phase folding algorithm, from the perspective of RZ-count (Chapter 5) and
CNOT-count (Chapter 6) optimization.
• Part III (chapters 7 and 8) studies the problem of verifying compiled quantum
circuits. In Chapter 7 a framework for the functional specification and verification
of quantum circuits is developed, along with a concrete verification algorithm for
circuits in the Clifford hierarchy. Chapter 8 culminates with the formal verification of
an optimizing reversible circuit compiler, compiler circuits which are provably correct




In this chapter we review the relevant background on quantum computation and the
standard linear algebraic model of quantum circuits. We begin by covering the standard
model of quantum computation, then cover relevant aspects of quantum circuits followed
by discussions of reversible computation.
2.1 The standard model
In the standard model of quantum computation, the state of a qubit – a unit of quantum
information – is described as a unit vector in $\mathbb{C}^2$. We use Dirac (Bra-Ket) notation, where
|ψ〉 denotes a (column) vector and 〈ψ| denotes its conjugate transpose |ψ〉†; the vector space inner
product is further written as 〈ϕ|ψ〉. As is customary we fix a basis {|0〉 = e1, |1〉 = e2} of
$\mathbb{C}^2$ called the computational basis. The states |0〉 and |1〉 are known as the classical states,
and the state of a qubit |ψ〉 is said to be a superposition of classical states if
\[ |\psi\rangle = \alpha|0\rangle + \beta|1\rangle, \]
where α and β are non-zero complex numbers. As the state of a qubit is required to be a
unit vector in $\mathbb{C}^2$, we have that $|\alpha|^2 + |\beta|^2 = 1$.
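The state of n qubits is likewise given by a unit vector in the $2^n$-dimensional complex
vector space $\mathbb{C}^{2^n}$. We denote the states of the computational basis by |x〉 = |x1x2 . . . xn〉
where $x \in \mathbb{F}_2^n$ and $\mathbb{F}_2 = (\{0, 1\}, \oplus, \cdot)$ is the binary field – when confusion is unlikely to
arise, we may also use |i〉 to refer to the classical state corresponding to the n-digit binary
expansion of i. States of distinct subsystems $|\psi_1\rangle \in \mathbb{C}^{2^n}$ and $|\psi_2\rangle \in \mathbb{C}^{2^m}$ may be combined
by taking their tensor product $|\psi_1\rangle \otimes |\psi_2\rangle \in \mathbb{C}^{2^{n+m}}$, defined as the standard vector space
tensor product:
\[
\begin{pmatrix} a_1 \\ \vdots \\ a_{2^n} \end{pmatrix} \otimes \begin{pmatrix} b_1 \\ \vdots \\ b_{2^m} \end{pmatrix}
= \begin{pmatrix} a_1 b_1 \\ \vdots \\ a_1 b_{2^m} \\ a_2 b_1 \\ \vdots \\ a_{2^n} b_{2^m} \end{pmatrix}.
\]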
The tensor product of two basis states |x〉, |y〉 is defined as the basis state labelled by
concatenation xy of x and y:
|x〉 ⊗ |y〉 = |xy〉 = |x1x2 . . . xny1y2 . . . ym〉.
A multi-qubit state |ψ〉 which can be written as a tensor product of one-qubit states is said
to be separable. Conversely, a state which is not separable is said to be entangled.
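The two statements above can be checked numerically on small examples. The following is a minimal sketch; the helper names kron_vec, basis_state and is_separable_2q are hypothetical, and the separability test relies on the standard fact that a two-qubit state a|00〉 + b|01〉 + c|10〉 + d|11〉 is a product state exactly when ad − bc = 0.

```python
import math


def kron_vec(a, b):
    """Tensor (Kronecker) product of two state vectors given as lists."""
    return [ai * bj for ai in a for bj in b]


def basis_state(bits):
    """Computational basis vector |bits> for a bit string such as '10'."""
    vec = [0] * (2 ** len(bits))
    vec[int(bits, 2)] = 1
    return vec


def is_separable_2q(state, tol=1e-9):
    """Two-qubit product-state test: the 2x2 amplitude matrix must have rank 1."""
    a, b, c, d = state
    return abs(a * d - b * c) < tol


# The Bell state (|00> + |11>)/sqrt(2), a standard example of an entangled state.
BELL = [1 / math.sqrt(2), 0, 0, 1 / math.sqrt(2)]
```

For example, kron_vec(basis_state('10'), basis_state('1')) coincides with basis_state('101'), matching the concatenation rule for basis states.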







We typically drop the ⊗ and simply write |ψ1〉|ψ2〉 to refer to the tensor product of |ψ1〉
and |ψ2〉. As the tensor product is symmetric – i.e., there exists a natural isomorphism
$\sigma_{n,m}$ between $\mathbb{C}^n \otimes \mathbb{C}^m$ and $\mathbb{C}^m \otimes \mathbb{C}^n$ – we assume the order of subspaces can be arbitrarily
rearranged.
A quantum state evolves according to some unitary transformation $U : \mathbb{C}^{2^n} \to \mathbb{C}^{2^n}$.
By unitary we mean an invertible linear operator U on $\mathbb{C}^{2^n}$ such that $U^\dagger = U^{-1}$, where
$U^\dagger$ is the conjugate-transpose of U, obtained by transposing U and taking the complex
conjugate of each entry. Equivalently, U is a linear operator that preserves the Euclidean
norm, and hence maps unit vectors to unit vectors. We write U(d) to denote the set of
unitary operators on a complex vector space of dimension d. As with states, unitaries
$U \in U(2^n)$, $V \in U(2^m)$ operating on distinct subsystems may be combined with the vector
space tensor product $U \otimes V \in U(2^{n+m})$, where it can be readily verified that
\[ (U \otimes V)(|\psi_1\rangle \otimes |\psi_2\rangle) = U|\psi_1\rangle \otimes V|\psi_2\rangle. \]
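The identity above, together with the definition of unitarity, can be verified numerically for small operators. The sketch below uses plain Python lists as matrices; the helper names and the choice of H and S as test gates are assumptions for illustration only.

```python
import math


def kron_vec(a, b):
    """Tensor product of two vectors."""
    return [x * y for x in a for y in b]


def kron_mat(a, b):
    """Tensor (Kronecker) product of two matrices given as lists of rows."""
    return [
        [a[i][j] * b[k][l] for j in range(len(a[0])) for l in range(len(b[0]))]
        for i in range(len(a))
        for k in range(len(b))
    ]


def matvec(m, v):
    """Matrix-vector product."""
    return [sum(row[j] * v[j] for j in range(len(v))) for row in m]


def is_unitary(m, tol=1e-9):
    """Check U^dagger U = I by taking inner products of the columns of U."""
    dim = len(m)
    for i in range(dim):
        for j in range(dim):
            entry = sum(complex(m[k][i]).conjugate() * m[k][j] for k in range(dim))
            expected = 1 if i == j else 0
            if abs(entry - expected) > tol:
                return False
    return True


S2 = 1 / math.sqrt(2)
H = [[S2, S2], [S2, -S2]]   # Hadamard gate
S = [[1, 0], [0, 1j]]       # phase gate
```

Applying kron_mat(H, S) to a product state and applying H and S separately before taking kron_vec give the same vector, up to floating-point error.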
A qubit or system of qubits may also be measured in some orthonormal basis $\{|b_i\rangle\}$.
Given a state $|\psi\rangle = \sum_i \alpha_i |b_i\rangle$, measurement returns the outcome $b_i$
and leaves the qubit in the state $|b_i\rangle$ with probability $|\alpha_i|^2$. If only a subset of a quantum
state is measured, then when $|\psi\rangle = \sum_i \alpha_i |b_i\rangle|\psi'_i\rangle$ where each $|\psi'_i\rangle$ is a normalized quantum
state, measurement returns the outcome $b_i$ and leaves the system in the state $|b_i\rangle|\psi'_i\rangle$ with
probability $|\alpha_i|^2$. Measurement is typically performed in the computational basis, with
measurement in other bases being emulated by applying unitary transformations first.
Figure 2.1: An example of a quantum circuit, implementing the Quantum Fourier Transform.
2.2 Quantum circuits
Quantum computations are typically described via circuit diagrams in analogy to classical
computing. Such diagrams carry qubits along horizontal lines called wires into unitary
gates and measurements which transform their state. Figure 2.1 gives an example of a
quantum circuit diagram. We restrict our attention to pure quantum circuits – those not
containing measurements.
Formally, given a finite set of gates $G \subseteq \bigcup_n U(2^n)$ called a gate set, each acting on a
non-zero number of qubits called its arity and denoted ar(U), U ∈ G, we can construct
circuits C over G recursively as terms of the form
\[ C := I_n,\ n \in \mathbb{N} \mid \sigma_{n,m},\ n, m \in \mathbb{N} \mid U \in G \mid C^\dagger \mid C ; C' \mid C \otimes C'. \]
The notation ; denotes left-to-right functional composition, in contrast to the standard
right-to-left composition ◦. We extend arity to circuits by defining
ar(In) = n
ar(C†) = ar(C)
ar(C ; C ′) = ar(C)
ar(C ⊗ C ′) = ar(C) + ar(C ′).
We say a circuit is well-formed if for every sub-term C ; C ′, ar(C) = ar(C ′).
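The term grammar, arity function and well-formedness condition translate directly into a small abstract syntax tree. The following is a sketch only: the class names are hypothetical, and the arity n + m assigned to the swap σn,m is an assumption inferred from its description as swapping the first n wires with the last m, since the list above does not state it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Identity:
    n: int


@dataclass(frozen=True)
class Swap:
    n: int
    m: int


@dataclass(frozen=True)
class Gate:
    name: str
    arity: int


@dataclass(frozen=True)
class Dagger:
    body: object


@dataclass(frozen=True)
class Seq:          # C ; C'
    first: object
    second: object


@dataclass(frozen=True)
class Tensor:       # C (x) C'
    top: object
    bottom: object


def arity(c):
    """Arity of a circuit term, following the recursive definition."""
    if isinstance(c, Identity):
        return c.n
    if isinstance(c, Swap):
        return c.n + c.m  # assumption: the swap acts on n + m wires
    if isinstance(c, Gate):
        return c.arity
    if isinstance(c, Dagger):
        return arity(c.body)
    if isinstance(c, Seq):
        return arity(c.first)
    if isinstance(c, Tensor):
        return arity(c.top) + arity(c.bottom)
    raise TypeError(f"not a circuit term: {c!r}")


def well_formed(c):
    """Every sequential composition must join sub-circuits of equal arity."""
    if isinstance(c, (Identity, Swap, Gate)):
        return True
    if isinstance(c, Dagger):
        return well_formed(c.body)
    if isinstance(c, Seq):
        return (well_formed(c.first) and well_formed(c.second)
                and arity(c.first) == arity(c.second))
    if isinstance(c, Tensor):
        return well_formed(c.top) and well_formed(c.bottom)
    raise TypeError(f"not a circuit term: {c!r}")
```

For instance, (H ⊗ I1) ; CNOT is well-formed with arity 2, whereas H ; CNOT is rejected because the arities 1 and 2 do not match.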
Diagrammatically, we write the identity circuit In as n horizontal lines, and an n-qubit




We typically drop the subscript from I. The tensor product U ⊗ V is written as vertical
composition, while the (left-to-right) functional composition U ; V is written as horizontal




As is typical we assume certain structural properties of quantum circuits hold – in
particular, that † and ; form a group, ⊗ is associative, and that for any U,U ′, V, V ′ the
bifunctorial law
(U ; V )⊗ (U ′ ; V ′) = (U ⊗ U ′) ; (V ⊗ V ′)
holds. Further, we assume that the σn,m gate, swapping the first n wires of a diagram with
the last m, satisfies
σn,m ; (Im ⊗ C) = (C ⊗ Im) ; σn,m.
We use this swap operator to allow the application of multi-qubit gates to non-sequential
qubits – in particular, we write $U_{i_1,i_2,\ldots,i_k}$ to denote a gate U of arity k applied
to qubits $i_1$ through $i_k$ of an n-qubit circuit. Concretely, the notation $U_{i_1,i_2,\ldots,i_k}$ corresponds to some circuit
$\sigma^\dagger ; (U \otimes I_{n-k}) ; \sigma$, where σ is some sequence of swaps mapping $1 \mapsto i_1, \ldots, k \mapsto i_k$.
Using the above assumptions – corresponding to a symmetric monoidal groupoid – any
n-qubit circuit over G can be written as an element of the freely generated group 〈Gn〉,
where Gn is the n-qubit completion of G, consisting of the images Ui1,i2,...,ik of each gate on
n qubits. We write 〈G〉 to refer to well-formed circuits over G and use either the symmetric
monoidal (G, ;,⊗, †) or group (Gn, ;, †) presentations depending on which is most convenient
in the particular context.
Semantics The semantics of a circuit defines its meaning as a mathematical object of
some domain. Typically, the semantics of a unitary circuit is taken as the unitary matrix
obtained by interpreting ⊗, ;, and † as their vector-space equivalents – we call this the
standard interpretation.
Figure 2.2: Standard quantum gates & their matrix interpretations
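Definition 2.2.1 (standard interpretation). The standard interpretation of a circuit
$C \in \langle G \rangle$ for some $G \subseteq \bigcup_{n \in \mathbb{N}} U(2^n)$ is denoted $\llbracket C \rrbracket \in U(2^{ar(C)})$, defined recursively as
\[
\begin{aligned}
\llbracket I_n \rrbracket &= I_{2^n} \\
\llbracket \sigma_{n,m} \rrbracket &: |x\rangle|y\rangle \mapsto |y\rangle|x\rangle \\
\llbracket U \rrbracket &= U \\
\llbracket C^\dagger \rrbracket &= \llbracket C \rrbracket^\dagger \\
\llbracket C ; C' \rrbracket &= \llbracket C' \rrbracket \llbracket C \rrbracket \\
\llbracket C \otimes C' \rrbracket &= \llbracket C \rrbracket \otimes \llbracket C' \rrbracket
\end{aligned}
\]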
Standard gates Standard gates we will use throughout this thesis are listed in Figure 2.2.
We now describe several important gate sets formed from subsets of the listed gates.
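The well-known Clifford group on n qubits, $\mathcal{C}_n$, arises as the normalizer of the n-qubit
Pauli group $\mathcal{P}_n = \langle X_i, Y_i, Z_i \mid i \in [n] \rangle$. In particular,
\[ \mathcal{C}_n = \{ U \in U(2^n) \mid U \mathcal{P}_n U^{-1} \subseteq \mathcal{P}_n \}. \]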
A minimal generating set for Cn is {H,S,CNOT}, though for practical reasons we typically
include X in the Clifford gate set.
By adjoining the T gate to the Clifford gate set, we generate the Clifford+T group, on n
qubits denoted $\mathcal{C}^3_n$. Again, the Clifford+T group arises as the normalizer of the Clifford group
– this recursive construction is called the Clifford hierarchy, where $\mathcal{C}^2_n = \mathcal{C}_n$ and $\mathcal{C}^1_n = \mathcal{P}_n$.
While for k > 3, $\mathcal{C}^k_n$ does not form a group [CGK17], it is known that $R_k = R_Z(2\pi/2^k) \in \mathcal{C}^k_n$
but not $\mathcal{C}^{k-1}_n$ for any k, hence we refer to circuits over $\{H, \mathrm{CNOT}, X, R_k\}$ as Clifford
hierarchy circuits. Note that for any k, $R_k^{2^k} = I$, and in particular $2^k$ is the least such
exponent, hence we say that $R_k$ has order $2^k$ in $U(2^n)$. Equivalently, since $R_Z(2\pi) = I$,
we say that $\theta = 2\pi/2^k$ has order $2^k$ in the additive group $\mathbb{R}/2\pi\mathbb{Z}$. If we adjoin a Z-axis
rotation of any non even-power order to the Clifford group, we get a universal gate set
which is not contained in the Clifford hierarchy. We generally call such circuits Clifford+RZ
circuits.
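The order of the rotation gates can be checked numerically. The sketch below assumes the convention RZ(θ) = diag(1, e^{iθ}), which matches the action RZ(θ) : |x〉 ↦ e^{iθx}|x〉 used later in this thesis, so that only the single non-trivial diagonal entry needs to be tracked; the function names are hypothetical.

```python
import cmath


def rz_phase(theta):
    """Non-trivial diagonal entry of RZ(theta) = diag(1, e^{i theta})."""
    return cmath.exp(1j * theta)


def order_of_rk(k, max_order=1024, tol=1e-9):
    """Least n >= 1 with R_k^n = I, where R_k = RZ(2*pi / 2^k)."""
    phase = rz_phase(2 * cmath.pi / 2 ** k)
    power = 1 + 0j
    for n in range(1, max_order + 1):
        power *= phase
        if abs(power - 1) < tol:
            return n
    return None  # no order found within the search bound
```

With this convention R1 = Z, R2 = S and R3 = T, and the computed orders are 2, 4 and 8 respectively.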
One other relevant gate set is the CNOT-dihedral gate set {CNOT, X,RZ(θ) | θ ∈ R}.
In the standard interpretation – and indeed in any symmetric monoidal groupoid – the
image of all n-qubit circuits over {CNOT, X,RZ(θ)} for a fixed θ of order k is isomorphic
to the dihedral group of order 2k. We call circuits generated by {CNOT, X,RZ(θ)} with
RZ(θ) of order k the CNOT-dihedral group of order 2k.
Two other gates which are relevant include the Toffoli gate TOF, and the doubly-
controlled Z gate CCZ, given as matrices below:
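\[
\mathrm{TOF} =
\begin{pmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0
\end{pmatrix},
\qquad
\mathrm{CCZ} =
\begin{pmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & -1
\end{pmatrix}.
\]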
The TOF and CCZ gates, as well as the CNOT gate, are instances of controlled gates. For
instance the CNOT gate computes the transformation
CNOT(α|0〉|ψ〉+ β|1〉|ψ′〉) = α|0〉|ψ〉+ β|1〉(X|ψ′〉),
effectively applying the X gate to the second qubit – called the target – if the first – the
control – is in the state |1〉, and the identity transformation otherwise. Likewise, the TOF
and CCZ gates apply X and Z, respectively, “controlled” on the value of the first two
qubits. Diagrammatically, we write a solid dot (•) on the control wire, joined by a vertical line to a box labelled U on the target wire,
to denote a controlled-U gate with the top qubit acting as the control and the bottom qubit
as the target. In the context of controlled-X gates (e.g., CNOT, TOF) we typically denote
the X gate as ⊕.
2.3 Reversible computation
An important subset of quantum circuits are reversible circuits. Such circuits implement
invertible mappings from computational basis states to other computational basis states,
and hence correspond to classical computations which can be inverted by reversing the
computation. We say a gate is a reversible gate if its standard interpretation is a permutation
matrix.
Example 2.3.1. The X, CNOT and TOF gates are all reversible gates. As functions on
computational basis states,
X : |x〉 ↦ |1 ⊕ x〉, CNOT : |x〉|y〉 ↦ |x〉|x ⊕ y〉, TOF : |x〉|y〉|z〉 ↦ |x〉|y〉|z ⊕ xy〉,
where ⊕ denotes addition in F2.
It can be readily observed that not all classical functions from n to m bits are reversible,
as for instance the AND function x∧y is not invertible. Even restricted to classical functions
of the form {0, 1}n → {0, 1}n, not all such functions can be reversed. However, with the
addition of ancillas – extra qubits initialized in the |0〉 state – we can recover the full power
of classical computing by emulating the functionally complete NAND and FANOUT gates
with X, CNOT and TOF gates:
(I ⊗ I ⊗ X)TOF : |x〉|y〉|0〉 ↦ |x〉|y〉|1 ⊕ xy〉
CNOT : |x〉|0〉 ↦ |x〉|x〉
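The emulation of NAND and FANOUT can be checked exhaustively by treating each reversible gate as a function on bits. This is a minimal sketch with hypothetical function names; the ancilla is modelled simply as an extra argument fixed to 0.

```python
from itertools import product


def x_gate(x):
    return (1 ^ x,)


def cnot(x, y):
    return (x, x ^ y)


def toffoli(x, y, z):
    return (x, y, z ^ (x & y))


def nand_with_ancilla(x, y):
    """(I (x) I (x) X) TOF applied to |x>|y>|0>: Toffoli first, then X on the last bit."""
    a, b, c = toffoli(x, y, 0)
    (c,) = x_gate(c)
    return (a, b, c)


def fanout_with_ancilla(x):
    """CNOT applied to |x>|0> copies the classical bit x."""
    return cnot(x, 0)


def is_permutation(gate, n):
    """A gate on n bits is reversible iff it maps distinct inputs to distinct outputs."""
    inputs = list(product((0, 1), repeat=n))
    outputs = {gate(*bits) for bits in inputs}
    return len(outputs) == len(inputs)
```

The third output bit of nand_with_ancilla is 1 ⊕ xy, the NAND of x and y, while the inputs are preserved on the first two bits.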
A consequence of using ancillas to implement arbitrary classical circuits is that the
space usage is linear in the size of the circuit. While we could simply measure temporary
values after they are no longer needed to project the ancilla back into the |0〉 or |1〉 state,
thus freeing the ancilla up to be reused later, if the qubit is entangled with the rest of the
system it may affect the state. On the other hand, if entangled ancillas are left around they
may decohere on their own, reducing the fidelity of the entire state.
[Circuit diagram: compute and uncompute stages acting on inputs x and y and three ancillas, each ancilla returned to 0, with the final wire mapping 0 to (x ∧ y) ∧ (x ⊕ y).]
Figure 2.3: A reversible circuit uncomputing all temporary ancillas.
Bennett [Ben73] showed that rather than measuring a temporary value, temporary bits
can be returned to the zero-state by copying out the result and then running the circuit
in reverse – in effect uncomputing the temporary values and freeing any allocated ancillas.
Figure 2.3 gives an example of this compute-copy-uncompute sequence for the Boolean
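expression (x ∧ y) ∧ (x ⊕ y). In general, Bennett’s method can be summarized by the
circuit below, which computes f using temporary ancillas, copies the result onto a fresh ancilla, and then uncomputes.
[Circuit diagram: inputs x1, . . . , xn and zero-initialized ancillas pass through a compute stage, a copy of the result, and an uncompute stage; the inputs and ancillas are restored and the final wire maps 0 to f(x1, . . . , xn).]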
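The compute-copy-uncompute pattern can be simulated classically. The sketch below encodes a reversible circuit as a list of tuples (controls..., target), so that a 1-tuple is an X gate, a 2-tuple a CNOT and a 3-tuple a Toffoli; this encoding, the wire numbering and the function names are assumptions for illustration and are not taken from Figure 2.3.

```python
def apply_gates(gates, bits):
    """Apply multiply-controlled X gates, each given as (controls..., target)."""
    state = list(bits)
    for gate in gates:
        *controls, target = gate
        if all(state[c] for c in controls):
            state[target] ^= 1
    return state


def bennett(compute, result_wire, output_wire):
    """Compute, copy the result out with a CNOT, then run the computation in reverse.

    Every gate used here is self-inverse, so reversing the list inverts the circuit.
    """
    return compute + [(result_wire, output_wire)] + compute[::-1]


# Wires: 0 = x, 1 = y, 2 = x AND y, 3 = x XOR y, 4 = (x AND y) AND (x XOR y), 5 = output.
COMPUTE = [(0, 1, 2), (0, 3), (1, 3), (2, 3, 4)]
```

Running bennett(COMPUTE, 4, 5) on inputs x, y with four zero-initialized wires leaves the three temporary ancillas at 0 and the value of the expression on the output wire, whereas running COMPUTE alone can leave ancillas dirty.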
The choice of when to clean temporary bits allows a trade-off between time and
space [Ben89]. For instance, the circuit below on the left computes the product wxyz with
only one final round of cleanup, while the circuit on the right applies intermediate cleanup
of the product xy. Though still not optimal, the result is a reduction of one ancilla, at the
cost of doubling the number of compute-uncompute cycles for the first sub-expression xy.
Such trade-offs are an instance of pebble games, and it can be noted that given a sequence
of k intermediate expressions, if each expression is uncomputed immediately after use, the
ith sub-expression will be computed or uncomputed $2^{k-i}$ times.
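The exponential recomputation cost of eager cleanup can be reproduced with a short recursion. The sketch assumes the simplest dependency structure, a chain in which the ith expression is built from the (i − 1)th alone and the inputs are free; count_operations is a hypothetical name.

```python
from collections import Counter


def count_operations(k):
    """Count how often each of k chained sub-expressions is computed or uncomputed
    when every intermediate value is cleaned up immediately after use."""
    counts = Counter()

    def compute(i):
        if i == 0:          # the circuit inputs need no computation
            return
        compute(i - 1)      # bring expression i - 1 into an ancilla
        counts[i] += 1      # compute expression i from it
        uncompute(i - 1)    # immediately clean up expression i - 1

    def uncompute(i):
        if i == 0:
            return
        compute(i - 1)      # expression i - 1 is needed again to reverse step i
        counts[i] += 1      # uncompute expression i
        uncompute(i - 1)

    compute(k)
    return counts
```

For k = 5 the counts for expressions 1 through 5 are 16, 8, 4, 2 and 1, in agreement with the 2^{k−i} bound.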
[Circuit diagrams: on the left, the product wxyz computed from inputs w, x, y, z with a single final round of cleanup; on the right, the same product computed with intermediate cleanup of xy.]
Classical linear operators A special subclass of reversible computations are those
which are linear or otherwise affine permutations. Specifically, an n-qubit reversible circuit
implements a linear (resp. affine) permutation if it maps a basis state |x〉 to |Ax〉 for
some invertible linear (resp. affine) operator A on $\mathbb{F}_2^n$. We denote by $\mathrm{GL}(n, \mathbb{F}_2)$ the general
linear group of invertible n × n matrices over $\mathbb{F}_2$, and likewise $\mathrm{GA}(n, \mathbb{F}_2) = \mathbb{F}_2^n \rtimes \mathrm{GL}(n, \mathbb{F}_2)$
denotes the general affine group, where $A_b \in \mathrm{GA}(n, \mathbb{F}_2)$ acts on $x \in \mathbb{F}_2^n$ as $Ax + b$ (note
that $b \in \mathbb{F}_2^n$ gives the affine subspace). The CNOT and X gates implement linear and
affine permutations, respectively, while the Toffoli gate is a non-linear permutation.
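These classifications can be confirmed by brute force. The sketch below applies an affine map Ax + b over F2 and tests affineness using the identity f(u) ⊕ f(v) ⊕ f(u ⊕ v) = f(0), which holds for all u, v exactly when f is affine; the matrix chosen for CNOT follows |x〉|y〉 ↦ |x〉|x ⊕ y〉, and all names are hypothetical.

```python
from itertools import product


def apply_affine(A, b, x):
    """Compute Ax + b over F_2 for a matrix A (list of rows) and vectors b, x."""
    return tuple(
        (sum(a * xi for a, xi in zip(row, x)) + bi) % 2
        for row, bi in zip(A, b)
    )


CNOT_A, CNOT_B = [[1, 0], [1, 1]], [0, 0]   # linear: (x, y) -> (x, x + y)
X_A, X_B = [[1]], [1]                       # affine: x -> x + 1


def toffoli(x, y, z):
    return (x, y, z ^ (x & y))


def is_affine(f, n):
    """f on n bits is affine iff f(u) + f(v) + f(u + v) = f(0) for all u, v."""
    zero = f(*([0] * n))
    for u in product((0, 1), repeat=n):
        for v in product((0, 1), repeat=n):
            w = tuple(a ^ b for a, b in zip(u, v))
            total = tuple(a ^ b ^ c for a, b, c in zip(f(*u), f(*v), f(*w)))
            if total != zero:
                return False
    return True
```

The Toffoli gate fails the test, for example at u = (1, 0, 0) and v = (0, 1, 0), which reflects the quadratic term xy in its output.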
It will sometimes be useful to refer to non-invertible linear or affine operators on $\mathbb{F}_2$. We
denote by $L(\mathbb{F}_2^n, \mathbb{F}_2^m)$ the space of linear operators from $\mathbb{F}_2^n \to \mathbb{F}_2^m$ – that is, m by n Boolean




In this chapter, we introduce the path integral model of quantum computation, which has
proven useful for computational methods in Quantum Field Theory and underlies much
of the formal reasoning in this thesis. In particular, we give an algebraic account of the
sum-over-paths action of a quantum circuit, which can then be analyzed and manipulated
as a mathematical object. We begin by describing the background and intuition of the
path integral formulation, then introduce the path integral representations of Clifford+RZ
circuits we will use in various forms throughout this thesis. We close with a discussion
of representations of the phase function of a path integral, particularly by their Fourier
expansions which have a close connection to the structure of Clifford+RZ circuits.
3.1 The Path Integral formulation
The path integral formulation of quantum mechanics, formalized by Feynman in the
40’s [FH65], serves as an alternate, equivalent formulation of quantum mechanics to the
standard model. In a general sense, the idea is to describe the probability amplitude of
the transition from one state to another (say, the position of a particle) by a sum over all
possible transitions between intermediate states or paths leading to that state. Figure 3.1
shows possible trajectories of a particle moving from states A to B; in the path integral
formulation, the final amplitude is the equal-weighted sum of the phase – $e^{i\theta}$ for some θ ∈ R
– acquired along each such path.
Figure 3.1: The paths of a particle from point A to B.
As a quantum process, the action of a quantum circuit can likewise be described as
a path integral. As quantum computations are
typically discretized and modelled as operators on a finite Hilbert space ($\mathbb{C}^{2^n}$), a discrete
sum is used rather than a continuous integral.
Viewing the computational basis states as the physical states of a qubit system, a unitary
operator sends a particular state to a sum of states, each with a particular amplitude –
this process can be viewed as an initial state (possibly in a superposition) taking some
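superposition of paths to other states. For instance, consider the Hadamard gate H,
\[
H = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix},
\qquad
H|x\rangle = \frac{1}{\sqrt{2}} \left( |0\rangle + (-1)^x |1\rangle \right), \quad x \in \{0, 1\}.
\]
In effect, the Hadamard gate sends a single basis state |0〉 or |1〉, along two distinct,
equal-weight paths to states |0〉 and |1〉 with phases 1 and $(-1)^x$, respectively. Visually, we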









The effect of interference between paths becomes clear when composing the outputs
with a second Hadamard gate, corresponding to the circuit HH, which in the standard









To compute the amplitude with which the state |0〉 transitions to state |0〉 – i.e. 〈0|HH|0〉
– we simply sum the phases along each path from |0〉 to |0〉 and scale by $\frac{1}{2}$, since each
path has amplitude $\frac{1}{2}$. It can be easily verified that the total amplitude is
\[ \tfrac{1}{2}(1 + 1) = 1. \]
Likewise, if we sum the phases along all paths from |0〉 to |1〉, the result is an amplitude
of $\frac{1}{2}(1 - 1) = 0$ as expected. In this case, the paths leading to |1〉 destructively interfere
while the paths to |0〉 constructively interfere.
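The interference calculation for HH can be carried out mechanically by enumerating paths. This is a minimal sketch with hypothetical names: each Hadamard contributes two paths of weight 1/√2, so a path through two Hadamards has weight 1/2 and only its sign needs to be tracked.

```python
def hadamard_paths(x):
    """Paths taken by H from |x>: (phase, output) pairs, each of weight 1/sqrt(2)."""
    return [((-1) ** (x * out), out) for out in (0, 1)]


def hh_amplitude(start, end):
    """Amplitude <end|HH|start> as a sum of phases over all two-step paths."""
    total = 0
    for phase1, mid in hadamard_paths(start):
        for phase2, out in hadamard_paths(mid):
            if out == end:
                total += phase1 * phase2
    return total / 2  # every path carries weight (1/sqrt(2))^2 = 1/2
```

The two paths from |0〉 to |0〉 add to give amplitude 1, while the two paths from |0〉 to |1〉 carry opposite signs and cancel.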
3.1.1 Sum-over-paths actions
While path integrals can be used to express a particular transition amplitude – that is, 〈x′|U|x〉 –
as a sum of all internal paths from the input state to the output state, a general unitary
operator involves the transition amplitudes between all input and output states. Rather
than give a separate path integral for each transition, it is possible to give a self-contained
path integral account of unitary dynamics as a collection of paths Π between basis states,
together with a set of amplitudes φ : Π → C. This set of paths and amplitudes naturally
defines the linear operator
\[ U = \sum_{\pi : x \to x' \in \Pi} \phi(\pi)\, |x'\rangle\langle x|, \]
where π : x → x′ denotes that π is a path from |x〉 to |x′〉. Writing the operator above by
its action on a basis state,
\[ U : |x\rangle \mapsto \sum_{\pi : x \to x' \in \Pi_x} \phi(\pi)\, |x'\rangle, \]
we call it the sum-over-paths action defined by Π and φ, where $\Pi_x \subseteq \Pi$ gives the paths
starting at |x〉.
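Any unitary matrix $U \in U(2^n)$ can also be described as a sum-over-paths action by
letting Π contain one path $\pi_{x,x'} : x \to x'$ for each $x, x' \in \mathbb{F}_2^n$, and $\phi(\pi_{x,x'}) = \langle x'|U|x\rangle$. Then
\[ U : |x\rangle \mapsto \sum_{x' \in \mathbb{F}_2^n} \phi(\pi_{x,x'})\, |x'\rangle = \sum_{x' \in \mathbb{F}_2^n} \langle x'|U|x\rangle\, |x'\rangle. \]
In this way, the path integral and linear algebraic views are complementary; either can be
encoded in the other.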
3.1.2 Composition
One of the advantages of path integrals for computational problems is that an evaluation
strategy for quantum amplitudes is not fixed a priori. In particular, functional composition
can be defined by taking the familiar relational composition of paths
Π ; Π′ = {(π ; π′) : x→ x′ | ∃x′′.π : x→ x′′ ∈ Π ∧ π′ : x′′ → x′ ∈ Π′}.
That is, the resulting collection of paths is given by connecting the outputs of paths in Π
with the inputs of paths in Π′. The amplitude of each path is likewise the product of the
amplitudes of each segment,
φ ; φ′ : π ; π′ ↦ φ(π) · φ′(π′).
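The sum-over-paths action defined by composing path integrals in this way unsurprisingly
coincides with the composition of their respective linear operators U, V :
\[ VU : |x\rangle \mapsto \sum_{\pi : x \to x'' \in \Pi_x} \; \sum_{\pi' : x'' \to x' \in \Pi'_{x''}} \phi(\pi)\, \phi'(\pi')\, |x'\rangle. \]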
In effect, the composed sum-over-paths action above delays all evaluation of amplitudes,
in contrast to composition in the standard model (matrix multiplication) which directly
evaluates the sum of interfering paths. This idea of lazy evaluation of amplitudes was
previously used to show the containment of the bounded-error quantum polynomial time
complexity class (BQP) in PSPACE by Bernstein and Vazirani [BV97], PP by Adleman,
DeMarrais, and Huang [ADH97] and later AWPP by Fortnow and Rogers [FR99], allowing
analysis of the complexity of quantum amplitudes.
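The difference between composing paths and multiplying matrices can be seen in a small experiment. The sketch below stores a sum-over-paths action as a list of (input, output, amplitude) triples; this representation and the function names are assumptions for illustration. Composition only joins paths and multiplies amplitudes; the interfering sums are evaluated later, and only on request.

```python
import math


def paths_from_matrix(u):
    """One path (x, x_out, amplitude) per non-zero transition <x_out|U|x>."""
    dim = len(u)
    return [(x, xo, u[xo][x]) for x in range(dim) for xo in range(dim) if u[xo][x] != 0]


def compose(first, second):
    """Relational composition: join paths with matching endpoints, multiply amplitudes."""
    return [
        (x, x_out, amp1 * amp2)
        for (x, x_mid, amp1) in first
        for (x_in, x_out, amp2) in second
        if x_mid == x_in
    ]


def matrix_from_paths(paths, dim):
    """Evaluate the sums: add the amplitudes of all paths sharing both endpoints."""
    matrix = [[0] * dim for _ in range(dim)]
    for x, x_out, amp in paths:
        matrix[x_out][x] += amp
    return matrix


S2 = 1 / math.sqrt(2)
H = [[S2, S2], [S2, -S2]]
S = [[1, 0], [0, 1j]]
```

Composing the paths of H with themselves yields eight unevaluated paths, whereas the matrix product HH has already collapsed them into the four entries of the identity.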
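Tensor composition is similarly intuitive in the path integral view, corresponding to a
product¹ of all paths Π × Π′ and their amplitudes φ × φ′ : Π × Π′ → C:
\[ U \otimes V : |x\rangle|y\rangle \mapsto \sum_{\pi : x \to x' \in \Pi_x} \; \sum_{\pi' : y \to y' \in \Pi'_y} \phi(\pi)\, \phi'(\pi')\, |x'\rangle|y'\rangle. \]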
The other major advantage of the path integral formulation appears when restricted to
paths which have algebraically “nice” amplitudes. In these cases, the sum-over-paths action
of a quantum process may be given by a simple algebraic description.
For instance, recall that the Hadamard gate maps an initial state |x〉 to the equal-weight
superposition of states |x′〉, x′ ∈ {0, 1}, together with a phase in {1,−1}. We call a gate
where each path has equal weight balanced; in this case, only the phase of a path is needed
¹ More formally, as the collection of paths may not generally be separable we need a monoidal product.
to compute its transition amplitude. Over gate sets consisting solely of balanced gates,
every path contributes equal amplitude but with a different phase, hence the sum-over-paths






Since the Hadamard gate induces a branching into two output states, |0〉 and |1〉, the
paths in Πx may also be indexed by the particular branch taken and the output state given
as a function of this branch. Writing the action of H as a sum over the branches x′ ∈ F2
with output state x′ and writing the phase as a function of the initial state x and x′, we
arrive at the familiar expression
\[ H : |x\rangle \mapsto \frac{1}{\sqrt{2}} \sum_{x' \in \mathbb{F}_2} (-1)^{x x'} |x'\rangle. \]
We call x′ a path variable, as its values index the different computational paths taken
simultaneously by the Hadamard gate.
The relational composition of sets of paths indexed by path variables can likewise be
indexed by all of the component variables. However, as some paths are not feasible – for
instance, the composition of π : 0→ 0 and π′ : 1→ 1 – care must be taken to ensure the
correct paths are composed. Intuitively, this can be achieved by setting the initial state of
the latter path to be equal to the output (as a function of the path variables) of the former.
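For instance, the Hadamard gate can be composed with itself as below:
\[ HH : |x\rangle \mapsto \frac{1}{2} \sum_{x'' \in \mathbb{F}_2} \sum_{x' \in \mathbb{F}_2} (-1)^{x x'' + x'' x'} |x'\rangle. \]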
We call the path variable x′′ an internal variable, as the output of a path does not depend
on x′′; equivalently, paths which differ only in values of internal variables interfere. We call
path variables which are not internal external variables.
The use of variables to index paths appears in Dawson et al. [DHM+05], where the fact
that paths over {H,TOF} are balanced and have phase in {1,−1} was used to reduce the
amplitude function to a Boolean polynomial. Using this sum-over-paths action, computing
the transition amplitudes of a quantum circuit reduces to computing the gap of a Boolean
polynomial – the difference between the number of solutions to P (x) = 0 and P (x) = 1.
As the problem of computing the gap of a Boolean polynomial is contained in PP, they
gave a significantly simpler proof of the containment BQP ⊆ PP. This proof was later
simplified further by Montanaro [Mon17], who showed that the phase function over the
universal {H,Z,CZ,CCZ} gate set can be described by a degree 3 Boolean polynomial.
3.2 Clifford+RZ path integrals
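We now consider circuits over a concrete gate set, Clifford+RZ, and give a representation
for their sum-over-paths action. Recall that we define the Clifford+RZ gate set as
{H, CNOT, X, RZ(θ) | θ ∈ R}. Each gate in this set can be defined by a sum-over-paths
action as below:
\[
\begin{aligned}
H &: |x\rangle \mapsto \frac{1}{\sqrt{2}} \sum_{y \in \mathbb{F}_2} e^{i\pi x y} |y\rangle \\
\mathrm{CNOT} &: |x\rangle|y\rangle \mapsto |x\rangle|x \oplus y\rangle \\
X &: |x\rangle \mapsto |1 \oplus x\rangle \\
R_Z(\theta) &: |x\rangle \mapsto e^{i\theta x} |x\rangle
\end{aligned}
\]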
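Each of the above actions is balanced, with phase and output given by a pseudo-Boolean
function and an affine function of the input and path variables, respectively. In particular,
each has the form
\[ |x\rangle \mapsto \frac{1}{\sqrt{2^m}} \sum_{y \in \mathbb{F}_2^m} e^{i f(x,y)} |A(x,y) + b\rangle, \]
which is completely specified by $m \in \mathbb{N}$, $f : \mathbb{F}_2^{n+m} \to \mathbb{R}$, and $A_b \in A(\mathbb{F}_2^{n+m}, \mathbb{F}_2^n)$. Note
that we denote the application of f and A to the concatenation of x and y by f(x,y) and
A(x,y), respectively. Actions of this form are closed under composition. Given a second action
\[ |x\rangle \mapsto \frac{1}{\sqrt{2^{m'}}} \sum_{y' \in \mathbb{F}_2^{m'}} e^{i f'(x,y')} |A'(x,y') + b'\rangle, \]
applying it to the output of the first gives, by linearity,
\[
\begin{aligned}
|x\rangle &\mapsto \frac{1}{\sqrt{2^{m+m'}}} \sum_{y \in \mathbb{F}_2^m} \sum_{y' \in \mathbb{F}_2^{m'}} e^{i f(x,y)}\, e^{i f'(A(x,y) + b,\, y')}\, |A'(A(x,y) + b,\, y') + b'\rangle \\
&= \frac{1}{\sqrt{2^{m+m'}}} \sum_{y \in \mathbb{F}_2^m} \sum_{y' \in \mathbb{F}_2^{m'}} e^{i f''(x,y,y')}\, |A''(x,y,y') + b''\rangle,
\end{aligned}
\]
where f′′ and A′′ are a pseudo-Boolean and an affine function of the input and m + m′ path
variables, as both compose with affine Boolean transformations. In particular,
\[ A'' = A'(A \oplus I_{m'}), \qquad b'' = A'b + b', \]
\[ f''(x,y,y') = f(x,y) + f'(A(x,y) + b,\, y'). \]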









The tensor product of two such path integrals is similarly expressible, and so it follows
that any Clifford+RZ circuit is expressible as a phase function and affine transformation
over a set of path variables.
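Proposition 3.2.1. The action of an n-qubit Clifford+RZ circuit C is expressible by a set of
m path variables, phase function $f : \mathbb{F}_2^{n+m} \to \mathbb{R}$ and affine transformation $A_b \in A(\mathbb{F}_2^{n+m}, \mathbb{F}_2^n)$.
In particular,
\[ C : |x\rangle \mapsto \frac{1}{\sqrt{2^m}} \sum_{y \in \mathbb{F}_2^m} e^{i f(x,y)} |A(x,y) + b\rangle. \]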
In general, a Clifford+RZ circuit C may have more than one representation of the form in
Proposition 3.2.1 – in particular, the identity circuit HH is expressible as either
\[
HH : |x\rangle \mapsto \frac{1}{2} \sum_{y_1,y_2 \in \mathbb{F}_2} (-1)^{xy_1 + y_1y_2}\, |y_2\rangle
\qquad \text{or} \qquad
HH : |x\rangle \mapsto |x\rangle,
\]
due to the interference of paths. When we refer to the sum-over-paths action of a circuit,
we usually refer to the canonical action, corresponding to the complete composition of
each individual gate’s paths. We show below in Section 3.2.1 an intuitive, standard
[DHM+05,Mon17] method of calculating the canonical action of a Clifford+RZ circuit.
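As a quick sanity check of the interference claimed for HH, the following Python sketch sums the four paths of the first expression explicitly. The function name `hh_amplitudes` is ours and purely illustrative; the only assumption is the normalization factor 1/2 for two Hadamard gates.

```python
from itertools import product


def hh_amplitudes(x):
    """Amplitudes <y2|HH|x> obtained by summing over the paths (y1, y2)."""
    amps = {0: 0.0, 1: 0.0}
    for y1, y2 in product((0, 1), repeat=2):
        sign = (-1) ** (x * y1 + y1 * y2)  # phase (-1)^{x*y1 + y1*y2}
        amps[y2] += sign / 2               # normalization (1/sqrt(2))^2
    return amps


for x in (0, 1):
    print(x, hh_amplitudes(x))
```

For each input x, the two paths ending in y2 ≠ x cancel and the two paths ending in y2 = x add, which recovers the second (path-free) expression.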
Affine gates Without the X gate, the canonical sum-over-paths action of a Clifford+RZ
circuit has an output state which is a strictly linear function of the input and path variables.
We can nevertheless represent the X gate in this way, by noting that X = HZH which has
the canonical sum-over-paths action
\[
HZH : |x\rangle \mapsto \frac{1}{2} \sum_{y_1,y_2 \in \mathbb{F}_2} (-1)^{xy_1 + y_1 + y_1y_2}\, |y_2\rangle.
\]
As a result, we could represent path integrals over {H, CNOT, X, RZ} using linear rather
than affine transformations by directly defining the sum-over-paths action of X as
\[
X : |x\rangle \mapsto \frac{1}{2} \sum_{y_1,y_2 \in \mathbb{F}_2} (-1)^{xy_1 + y_1 + y_1y_2}\, |y_2\rangle.
\]
This demonstrates the fact that there is no fixed path integral representation of a given
unitary gate or circuit, and so different representations may be more useful than others
for certain circuit analyses. We use this representation as it allows us to model X gates
without using path variables, which will be important for both optimization and verification.
The phase group While the phase function f for a circuit over Clifford+RZ is described
as an R-valued function, it can be observed that if the angles of RZ gates are restricted
to some subgroup G of the additive group R, then f is a pseudo-Boolean function with
codomain contained in G. Moreover, there exists a natural equivalence of phase functions,
as
\[
e^{i f(\mathbf{x},\mathbf{y})} = e^{i f'(\mathbf{x},\mathbf{y})}
\]
whenever f(x,y) = f′(x,y) mod 2π for all x, y. In this case we say that f ∼ f′.
It will be convenient at times to use a group isomorphic to either G or G/2πZ to
represent the phase function. For instance, the Clifford+T gate set restricts the phase
group to (π/4)Z; further, since (π/4)Z/2πZ ≃ Z_8, the phase function for a Clifford+T circuit
can be represented by some Z_8-valued function. Likewise, for any phase gate Z_k = RZ(2π/k) of
order k ≥ 4, we can represent the action of a Clifford+Z_k circuit with a Z_k-valued phase
function.
3.2.1 Annotated circuits
The canonical sum-over-paths action of a Clifford+RZ circuit C can be easily calculated by
constructing an annotated circuit [DHM+05,Mon17]. This method is particularly useful for
calculations by hand and hence is useful for the presentation of results in this thesis. We
discuss more explicitly computational methods later in Chapter 7.
We begin by labelling the n inputs of the circuit x1, x2, . . . , xn, and label each output
of every gate with an affine combination of inputs and fresh path variables according to
its sum-over-paths action. In particular, as the CNOT gate maps |x⟩|y⟩ to |x⟩|x ⊕ y⟩, the
output for the control bit of a CNOT gate has the same label as its input, while the target
bit is labelled with the sum of the input labels. Likewise, the X gate adds an affine factor of
1 to its input label, and RZ has output label the same as its input. The H gate introduces
a new path variable yi and places it along its output.
To construct the path integral from the annotated circuit, the phase contributions of
all gates are summed to give the phase function, and the labels on the output wires
are taken as the output function. For a gate with phase function f(x,y′) and input label
A(x,y) + b, the gate contributes a phase of f(A(x,y) + b, y′).
[Figure 3.2: An annotated circuit implementing the Toffoli gate.]
As an example, Figure 3.2 shows a Clifford+T circuit implementing the Toffoli gate,
\[
\mathrm{TOF} : |x_1\rangle|x_2\rangle|x_3\rangle \mapsto |x_1\rangle|x_2\rangle|x_3 \oplus x_1x_2\rangle.
\]
The state of each qubit as a linear combination of the inputs x_1, x_2, x_3 ∈ F_2 and path
variables y_1, y_2 ∈ F_2 has been annotated after each gate. Annotations are only shown in
cases where the state of the qubit is changed. Summing up the phase contributions due to
T, T†, and H gates gives the phase function, written in mixed arithmetic as
\[
f(\mathbf{x},\mathbf{y}) = \frac{\pi}{4}\bigl(4x_3y_1 + x_1 + x_2 + 7(x_1 \oplus y_1) + 7(x_2 \oplus y_1) + (x_1 \oplus x_2 \oplus y_1) + 7(x_1 \oplus x_2) + y_1 + 4y_1y_2\bigr).
\]
As the circuit returns the first two qubits to their initial states, and the third qubit to the
value of y_2, we see that
\[
A = \begin{pmatrix} 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 \end{pmatrix}
\]
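That this phase function and output function together describe the Toffoli gate can be checked by brute force. The sketch below is illustrative only: the function names are ours, phases are stored as exponents of ω = e^{iπ/4} modulo 8, and we assume the normalization 1/2 coming from the two Hadamard gates of the circuit.

```python
from itertools import product


def toffoli_phase_exponent(x1, x2, x3, y1, y2):
    """Exponent of omega = e^{i*pi/4} (mod 8) of the path labelled (y1, y2)."""
    return (4 * x3 * y1 + x1 + x2
            + 7 * (x1 ^ y1) + 7 * (x2 ^ y1)
            + (x1 ^ x2 ^ y1) + 7 * (x1 ^ x2)
            + y1 + 4 * y1 * y2) % 8


def toffoli_amplitude(x, out):
    """Amplitude <out| C |x> of the path sum with output |x1 x2 y2>."""
    x1, x2, x3 = x
    if (out[0], out[1]) != (x1, x2):
        return 0.0
    y2 = out[2]                      # the third output wire carries y2
    total = 0
    for y1 in (0, 1):                # remaining free path variable
        e = toffoli_phase_exponent(x1, x2, x3, y1, y2)
        if e not in (0, 4):
            raise ValueError("unexpected non-real phase")
        total += 1 if e == 0 else -1  # omega^0 = 1, omega^4 = -1
    return total / 2                  # two Hadamard gates: (1/sqrt(2))^2


for x in product((0, 1), repeat=3):
    image = [o for o in product((0, 1), repeat=3) if toffoli_amplitude(x, o)]
    print(x, "->", image)
```

Every path phase turns out to be ±1, and the two paths to a given output cancel unless y2 = x3 ⊕ x1x2.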





A special case of path integrals occurs when no branching gates appear in a circuit – that
is, gates which map a classical state to a superposition of classical states. In this case, the
sum-over-paths action of the circuit maps a single input to a single output, corresponding
uniquely to the composition of each gate’s classical action and phase.
Circuits over the CNOT-dihedral gate set {CNOT, X, RZ(θ)} belong to this special case.
It can hence be observed that for such circuits, the output state is a pure affine function of
the input basis state, and furthermore the phase is a pseudo-Boolean function of the input.
Additionally, since quantum circuits are reversible, the mapping from input to output states
is one-to-one and hence is given by an invertible affine function A_b ∈ GA(n, F_2), where
GA(n, F_2) ≃ F_2^n ⋊ GL(n, F_2) is the general affine group of invertible affine operators from
F_2^n to F_2^n. We sum this up in the following proposition.
Proposition 3.2.2. The action of an n-qubit CNOT-dihedral circuit C is expressible by a
phase function f : F_2^n → R and affine transformation A_b ∈ GA(n, F_2). In particular,
\[
⟦C⟧ : |\mathbf{x}\rangle \mapsto e^{i f(\mathbf{x})}\, |A\mathbf{x} + \mathbf{b}\rangle.
\]
Again, restricting to CNOT-phase circuits – i.e. circuits over {CNOT, RZ(θ)} – gives a
simpler presentation in terms of general linear operators A ∈ GL(n,F2), but the addition
of X gates allows more non-branching circuits to be represented.
A simple property of the sum-over-paths action of CNOT-dihedral circuits, which will
be important for optimal synthesis later, is that the representation as a phase function
f : F_2^n → R and invertible affine function A_b ∈ GA(n, F_2) is unique, up to equivalences
mod 2π in the phase.
Proposition 3.2.3. Given two CNOT-dihedral circuits C, C′ with sum-over-paths actions
(f, A_b), (f′, A′_{b′}) respectively, ⟦C⟧ = ⟦C′⟧ if and only if A_b = A′_{b′} and f = f′ + g for
some g : F_2^n → 2πZ.
Proof. A simple consequence of the fact that affine transformations in GA(n, F_2) are
invertible, hence for any A_b, A′_{b′} ∈ GA(n, F_2), Ax + b = A′x + b′ for all x ∈ F_2^n if and only
if A = A′ and b = b′. Then for any x,
\[
e^{i f(\mathbf{x})}\, |A\mathbf{x} + \mathbf{b}\rangle = e^{i f'(\mathbf{x})}\, |A'\mathbf{x} + \mathbf{b}'\rangle
\]
if and only if f(x) = f′(x) mod 2π. Hence f = f′ + g for some g : F_2^n → 2πZ.
As a trivial corollary, the set of n-qubit (infinite order) CNOT-dihedral circuits is
isomorphic to
\[
\mathbb{R}^{2^n} / 2\pi\mathbb{Z}^{2^n} \times \mathrm{GA}(n, \mathbb{F}_2).
\]
In Chapter 5 we further characterize the CNOT-dihedral circuits of any finite order in this
way – i.e. as R^{2^n}/G × GA(n, F_2) for some G ◁ R^{2^n}.
3.3 Phase polynomials
In the preceding section, the phase function of a path integral was described simply as some
pseudo-Boolean function. In this section, we discuss two concrete representations of the
phase functions: as multilinear polynomials and their Fourier expansions.
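It is a well known fact (see, e.g., [O’D14]) that any pseudo-Boolean function can be
uniquely written as a multilinear polynomial
\[
f(\mathbf{x}) = \sum_{\mathbf{y} \in \mathbb{F}_2^n} a_{\mathbf{y}}\, \mathbf{x}^{\mathbf{y}}, \qquad a_{\mathbf{y}} \in \mathbb{R},
\]
where x^y = x_1^{y_1} x_2^{y_2} ··· x_n^{y_n} is a multi-index, and by multilinear we mean each y_i ∈ F_2. More
generally, any function f : F_2^n → G for some Abelian group G – i.e. any phase group – can
also be uniquely represented in the above form.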
There exists however a more direct relationship between the structure of a Clifford+RZ
circuit and the expression of the phase function as a sum of parities of the n variables.
Denoting by χ_y : F_2^n → F_2 the parity function
\[
\chi_{\mathbf{y}}(\mathbf{x}) = \mathbf{y}^T\mathbf{x} = x_1y_1 \oplus x_2y_2 \oplus \cdots \oplus x_ny_n
\]
for an indicator vector y ∈ F_2^n, it is known that any pseudo-Boolean function f : F_2^n → R
can be uniquely written as
\[
f(\mathbf{x}) = \sum_{\mathbf{y} \in \mathbb{F}_2^n} \tilde{f}(\mathbf{y})\, (-1)^{\chi_{\mathbf{y}}(\mathbf{x})},
\]
called the Fourier expansion of f. The coefficients f̃(y) are known as the Fourier coefficients
of f, with the set of all 2^n coefficients collectively referred to as the Fourier spectrum.
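It will be convenient to instead write an expansion directly as functions of parities,
rather than their image in {1, −1}. In particular, it can be observed that for any x, y ∈ F_2^n,
\[
\tilde{f}(\mathbf{y})\, (-1)^{\chi_{\mathbf{y}}(\mathbf{x})} = \tilde{f}(\mathbf{y}) - 2\tilde{f}(\mathbf{y})\, \chi_{\mathbf{y}}(\mathbf{x}).
\]
Writing f̂(0) = f̃(0) + Σ_{y ∈ F_2^n∖{0}} f̃(y) and f̂(y) = −2f̃(y) for y ≠ 0, we obtain a unique
representation directly over the parity functions.
Proposition 3.3.1. For any pseudo-Boolean function f : F_2^n → R, there exists a unique
expansion of f of the form
\[
f(\mathbf{x}) = \hat{f}(\mathbf{0}) + \sum_{\mathbf{y} \in \mathbb{F}_2^n \setminus \{\mathbf{0}\}} \hat{f}(\mathbf{y})\, \chi_{\mathbf{y}}(\mathbf{x}).
\]
Uniqueness follows from the fact that the mapping of coefficients f̃ ↦ f̂ is an isomorphism, with
\[
\tilde{f}(\mathbf{0}) = \frac{2\hat{f}(\mathbf{0}) + \sum_{\mathbf{y} \in \mathbb{F}_2^n \setminus \{\mathbf{0}\}} \hat{f}(\mathbf{y})}{2},
\qquad
\tilde{f}(\mathbf{y}) = -\frac{1}{2}\hat{f}(\mathbf{y}) \ \text{ for } \mathbf{y} \neq \mathbf{0}.
\]
Note that χ_0(x) = 0 for all x ∈ F_2^n, hence the additional constant term f̂(0).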
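The change of basis from Fourier coefficients to parity coefficients is easy to carry out mechanically. The following Python sketch is illustrative only: the function names are ours, and the Fourier coefficients are computed with the standard formula f̃(y) = 2^{-n} Σ_x f(x)(−1)^{χ_y(x)}, which is not stated in the text above. Exact rational arithmetic is used so that results can be compared exactly.

```python
from fractions import Fraction
from itertools import product


def parity(y, x):
    """chi_y(x): the parity of the bits of x selected by y."""
    return sum(a & b for a, b in zip(y, x)) % 2


def fourier_coefficients(f, n):
    """Fourier spectrum of f, one coefficient per y in F_2^n."""
    points = list(product((0, 1), repeat=n))
    return {y: sum(Fraction(f(x)) * (-1) ** parity(y, x) for x in points) / 2 ** n
            for y in points}


def parity_coefficients(f_tilde, n):
    """Map the Fourier coefficients to the parity-basis coefficients."""
    zero = (0,) * n
    f_hat = {y: -2 * c for y, c in f_tilde.items() if y != zero}
    f_hat[zero] = sum(f_tilde.values())  # f~(0) plus all other f~(y)
    return f_hat


def evaluate(f_hat, x):
    """Evaluate the expansion f^(0) + sum_{y != 0} f^(y) * chi_y(x)."""
    zero = (0,) * len(x)
    return f_hat[zero] + sum(c * parity(y, x)
                             for y, c in f_hat.items() if y != zero)


# Example: the AND function xy = (x + y - (x XOR y)) / 2.
f_hat = parity_coefficients(fourier_coefficients(lambda x: x[0] * x[1], 2), 2)
print(f_hat)
```

For the AND function this yields coefficients 1/2, 1/2 and −1/2 on the parities x, y and x ⊕ y, which is the identity underlying the phase of the Hadamard gate, πxy.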
This expression of f, which we call a phase polynomial, has a deep connection to the
structure of a Clifford+RZ circuit with phase function f. Specifically, each phase gate in a
Clifford+RZ circuit contributes to exactly one term (up to constant factors) of the phase
polynomial. Moreover, the phase function f(x, y) = πxy of the Hadamard gate can be
written as the three-term phase polynomial
\[
\pi xy = \frac{\pi}{2}x + \frac{\pi}{2}y - \frac{\pi}{2}(x \oplus y).
\]
Hence, if we define the support of f̂ as supp(f̂) = {y ∈ F_2^{n+m} | f̂(y) ≠ 0} and the cardinality
of C to be the number of gates, we have the following important fact:
Proposition 3.3.2. For any Clifford+RZ circuit C, if C has (canonical) sum-over-paths
action
\[
⟦C⟧ : |\mathbf{x}\rangle \mapsto \frac{1}{\sqrt{2^m}} \sum_{\mathbf{y} \in \mathbb{F}_2^m} e^{i f(\mathbf{x},\mathbf{y})}\, |A(\mathbf{x},\mathbf{y}) + \mathbf{b}\rangle,
\]
then |supp(f̂)| ∈ O(|C|).
As the affine transformation A_b has size polynomial in |C|, it follows that the
sum-over-paths action of a Clifford+RZ circuit C can be represented in space polynomial in
|C|.
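Example 3.3.3. Recall the annotated circuit in Figure 3.2. Removing the Hadamard gates
gives the circuit
[Circuit diagram: the circuit of Figure 3.2 with its Hadamard gates removed.]
which has phase polynomial
\[
f(\mathbf{x}) = \frac{\pi}{4}\bigl(x_1 + x_2 + 7(x_1 \oplus x_3) + 7(x_2 \oplus x_3) + (x_1 \oplus x_2 \oplus x_3) + 7(x_1 \oplus x_2) + x_3\bigr).
\]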
As each label is some parity χy(x) of the input bits x, the effect of a phase gate is to
apply a phase rotation conditioned on the value of that parity. The phase polynomial
representation hence directly represents the phase factors applied throughout a circuit.
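A phase polynomial of this kind can be stored as a map from parities to coefficients and evaluated directly. The sketch below is a minimal illustration under our own conventions: parities are bitmasks with bit 0 standing for x1, coefficients are multiples of π/4 taken modulo 8, and the names `PHASE_POLY`, `parity` and `exponent` are hypothetical. It also checks that the polynomial of Example 3.3.3 agrees modulo 2π with the phase π·x1x2x3 of a doubly-controlled Z gate.

```python
# Parity (bitmask over x1, x2, x3; bit 0 is x1) -> coefficient in units of pi/4.
PHASE_POLY = {
    0b001: 1,  # x1
    0b010: 1,  # x2
    0b101: 7,  # x1 xor x3
    0b110: 7,  # x2 xor x3
    0b111: 1,  # x1 xor x2 xor x3
    0b011: 7,  # x1 xor x2
    0b100: 1,  # x3
}


def parity(mask, x):
    """Value of the parity selected by mask on the input bits x."""
    return bin(mask & x).count("1") % 2


def exponent(x):
    """Total phase on |x>, as an exponent of omega = e^{i*pi/4} modulo 8."""
    return sum(c * parity(mask, x) for mask, c in PHASE_POLY.items()) % 8


for x in range(8):
    print(format(x, "03b")[::-1], exponent(x))
```

Each dictionary entry corresponds to one phase gate of the circuit, and the only input that acquires a nontrivial phase is x1 = x2 = x3 = 1.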
While every pseudo-Boolean function into R has both a unique multilinear and Fourier-
expanded representation, over R/2πZ only the multilinear presentation is unique. This







Optimization is a crucial part of the quantum circuit design process, particularly to help
best utilize the limited resources available and to accurately assess the crossover points for
quantum algorithms. In this chapter, we introduce an efficient circuit optimization heuristic
for reducing the number of phase gates
in a quantum circuit, which we call quantum phase-folding. Recall that phase gates
outside of the set {Z, S, S†} are typically the most expensive gates to implement
fault-tolerantly, and hence it is highly desirable for fault-tolerant quantum computing to
reduce the number of such gates.
Our algorithm builds on the intuition developed in Chapter 3: it tracks only which
phase gates apply to the same set of computational
paths, corresponding to the same terms of the phase polynomial, and hence can be merged.
As only which paths a phase has been applied to matters, the algorithm applies to gates
with arbitrary – even indeterminate – phases. By using abstraction, we further extend
applicability to circuits which contain uninterpreted gates – gates whose interpretation
as a unitary matrix is not known. Since only phase gates are merged, the algorithm is
effectively non-increasing in all other metrics, including depth and gate counts, hence
making it suitable as a standard quantum compiler optimization. To allow additional more
targeted optimizations, a simple modification to the algorithm further breaks the circuit into
(overlapping) CNOT-dihedral sub-circuits, which can then be optimally or sub-optimally
synthesized; we explore this problem in the following two chapters.
This chapter builds upon work which originally appeared in [AMM14], in particular
by separating T -count and T -depth optimization into phase-folding and CNOT-dihedral
synthesis, respectively, and extending the algorithm to uninterpreted gates. A proof of
correctness is also given and the overall run-time complexity is reduced.
4.1 Overview
Consider the following circuit:
[Circuit diagram: a two-qubit circuit of CNOT gates and two T gates, in which each T gate
is applied to the state |x1 ⊕ x2⟩.]
Its phase polynomial is
\[
f(\mathbf{x}) = \frac{\pi}{4}(x_1 \oplus x_2) + \frac{\pi}{4}(x_1 \oplus x_2) = \frac{\pi}{2}(x_1 \oplus x_2).
\]
As a result, while the circuit contains two phase gates, only a single combined phase rotation
is performed with angle π/2, conditioned on the value of x_1 ⊕ x_2.
Rather than applying T gates twice to the state |x_1 ⊕ x_2⟩ at different points in the
circuit, a single T^2 = S gate could instead be applied. As the input to each T gate is
|x_1 ⊕ x_2⟩, this can be done simply by replacing one T gate with an S gate and removing
the other. If we examine the evolution of each initial state through the application of each
gate, we obtain the following paths:
[Path diagram: the evolution of each basis state through the gate sequence
CNOT1,2 T2 CNOT1,2 CNOT1,2 T1 CNOT1,2.]
Note that ω = e^{iπ/4}. Here the red and blue paths – corresponding to the solutions of
x_1 ⊕ x_2 = 1 – each acquire a total phase of ω^2 = i, which moreover can be placed at any
point along either path.
This basic intuition of merging phases along the same path can be readily applied
to CNOT-dihedral circuits by simply computing the phase polynomial and applying one
total phase rotation of angle f̂(y) for each y ∈ supp(f̂). The rest of this section deals
with extending this intuition to universal gate sets including branching gates or otherwise
arbitrary gates.
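For circuits over CNOT and phase gates alone, the merging described above amounts to a single pass that tracks the parity held by each qubit. The following Python sketch is a minimal illustration under our own assumptions: gates are encoded as tuples, angles are multiples of π/4 stored modulo 8, parities are bitmasks with bit i standing for the i-th input, and the example circuit is a hypothetical two-qubit circuit in which both T gates act on x1 ⊕ x2.

```python
def phase_polynomial(n, gates):
    """Phase polynomial of a circuit over CNOT and R_Z gates.

    gates is a list of ("cnot", control, target) and ("rz", k, qubit)
    tuples, where the rotation angle is k * pi/4. Returns the map
    parity -> merged coefficient (mod 8) and the final parity of each qubit.
    """
    state = [1 << i for i in range(n)]      # qubit i initially holds x_i
    poly = {}
    for gate in gates:
        if gate[0] == "cnot":
            _, control, target = gate
            state[target] ^= state[control]  # target becomes a sum of parities
        elif gate[0] == "rz":
            _, k, qubit = gate
            key = state[qubit]               # parity the rotation is conditioned on
            poly[key] = (poly.get(key, 0) + k) % 8
        else:
            raise ValueError(f"unsupported gate {gate[0]!r}")
    return {p: k for p, k in poly.items() if k}, state


# Two T gates (k = 1), each applied while its qubit holds x1 xor x2.
circuit = [("cnot", 0, 1), ("rz", 1, 1), ("cnot", 0, 1),
           ("cnot", 1, 0), ("rz", 1, 0), ("cnot", 1, 0)]
print(phase_polynomial(2, circuit))
```

The two T gates land on the same key and their coefficients add to 2, i.e. a single S rotation on x1 ⊕ x2; gates on distinct parities remain separate terms.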
4.1.1 Branching gates
Most useful quantum circuits contain branching gates, such as the Hadamard gate H,
which complicates the process of merging phases along a path. In particular, consider the







In this case there is no way to “slide” the phases along paths so that the total number of
phase gates is reduced. In particular, commuting the first ω phase past the Hadamard on









However, this adds an extra phase of ω to both paths starting from the |0〉 state. The
extra phase could be cancelled by applying a ω−1 phase on the |0〉 input state (e.g., with
XT †X), but the result is more phase gates than the circuit initially had! From an algebraic




Collecting the phase contribution of each gate we see that the total phase is ω^{x+y}(−1)^{xy} =
ω^{x+y+4xy}, hence no phases are merged.
On the other hand, if phase gates are applied to qubits which are not directly affected
by a branching gate, it is expected that such phase gates can be merged.
Fortunately, this is easy to observe in the phase polynomial formalism. For instance,
consider the circuit below which applies a T gate to the second qubit before logically
swapping it with the first qubit, applying a Hadamard on the second, then swapping back
















Again the total phase is ω^{2x_2+4x_1y}, hence the T gates – having input |x_2⟩ – can be merged,
with a total angle of π/2.
4.1.2 Abstraction
The intuition of the phase-folding algorithm so far is that phase gates which contribute to
the same term in the phase polynomial can be merged. However, non-phase gates – for
instance, the Hadamard gate – may also apply phase rotations which contribute to the
same term as a phase gate, but cannot be merged with it.




As the Hadamard gate applies a phase of ω^{4xy} and moreover
\[
\omega^{4xy} = \omega^{2x + 2y - 2(x \oplus y)},
\]
the total phase of the circuit is thus ω^{−2(x⊕y)}, cancelling the two phases of ω^{−2x} and ω^{−2y}
from the S† gates. However, it is not possible (over the Clifford+RZ gate set) to directly
implement the transformation |x⟩ ↦ (1/√2) Σ_{y∈F_2} ω^{−2(x⊕y)}|y⟩ – the only way to implement
the desired branching is to use a Hadamard gate, which necessarily applies a phase of
ω^{2x+2y−2(x⊕y)} and hence must be corrected. In effect, the phase contribution of the Hadamard
gate is fixed and cannot be merged with any other phase gates over the Clifford+RZ set.
A natural way to avoid such erroneous phase cancellations is to only track terms of the
phase polynomial which can be merged. We can do this by abstracting the phase polynomial
f to some f^♯ comprised of only the phase contributions due to RZ gates. For instance, the
above circuit has the real phase polynomial







but abstract phase polynomial




Uninterpreted gates We can further use abstraction to soundly approximate the effect
of gates U whose particular semantics are unknown – i.e. uninterpreted gates. In particular,
recall that any unitary transformation can be written as a sum-over-paths:
\[
U : |\mathbf{x}\rangle \mapsto \sum_{\mathbf{y} \in \mathbb{F}_2^n} \langle \mathbf{y}|U|\mathbf{x}\rangle\, |\mathbf{y}\rangle.
\]
The above expression indicates that a unitary transformation U “takes” every possible branch
– however, depending on the semantics some branches may have amplitude ⟨y|U|x⟩ = 0,
and hence are infeasible.
If we know certain branches from a gate are infeasible, we can commute phases across








If 〈1|U |0〉 = 0 = 〈0|U |1〉, then we can combine the ω phases as below, since the extra phase

















On the other hand, if all branches have non-zero amplitude, as is the case for
the Hadamard gate, no useful phase commutations are possible.
When we encounter some n-qubit uninterpreted gate U we over-approximate its feasible
branches by assuming that every branch (of qubits on which it acts) may be taken.
Practically speaking, the result is that phase gates cannot be commuted through an
uninterpreted gate, but can in fact be commuted around them, regardless of their semantics.
Path representatives The final element to our phase-folding algorithm is a further
abstraction to reduce the number of variables the phase polynomial is represented over.
Given an n-qubit circuit C with m Hadamard gates, recall that the sum-over-paths action
is specified over n input and m path variables – that is, the phase polynomial and the n
output states are functions of n + m variables.
We can instead view the n output states themselves as variables x′, subject to a set of
(linear) equalities between x, y and x′. For instance, the CNOT gate has input |x⟩ and
output |x′⟩, where x′_1 = x_1 and x′_2 = x_1 ⊕ x_2. We then say that a parity of x, y, and x′ has
a representative over some subset S ⊆ {x, y, x′} if there exists an equivalent parity over
just variables in S.
Example 4.1.1. Consider the circuit
x1 T • x1
x2 T T H y
.
The full phase expression (excluding the Hadamard phase contribution) is ω^{x_1+x_2+(x_1⊕x_2)}. If
we label the outputs x′_1 = x_1 and x′_2 = y, then as a function of x′, ω^{x_1} = ω^{x′_1}, but x_2 and
x_1 ⊕ x_2 don’t have representatives over x′_1 and x′_2 – e.g., x_2 ≠ ax′_1 + bx′_2 for any a, b ∈ F_2.
It can be observed that if a particular parity does not have a representative over the
“primed” output variables, no subsequent phase gates can apply a rotation conditional on
the same parity – in particular, any phase conditional on that linear combination cannot be
merged with subsequent gates. Intuitively, we can then forget about the particular parity
and just record the fact that a phase rotation was applied to some parity. For instance, in
the above example we could (informally) rewrite the phase expression as
\[
\exists x_1, x_2 \in \mathbb{F}_2 .\ \omega^{x'_1 + x_2 + (x_1 \oplus x_2)}
\]
which is equivalent as well to ∃x_1, x_2 ∈ F_2. ω^{x′_1+x_1+x_2}. Effectively, we want to existentially
quantify the input and path variables, leaving only the n primed outputs.
The practical advantage is that using representatives over just the output states reduces
the number of terms in any particular parity to n, as opposed to the naïve n + m. As
in most realistic circuits m ≫ n, this can substantially reduce the space requirements of
phase-folding and leverage faster non-sparse representations such as bitvector arithmetic.
On the other hand, extra effort is spent calculating representatives. The full algorithm,
described in the next section, takes a hybrid approach where representatives over a set of
n variables spanning the output state are used. As a result, new representatives are only
needed when the set of variables changes due to an uninterpreted gate.
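Deciding whether a parity has a representative over the output states is a linear-algebra question over F_2, which can be answered by Gaussian elimination. The sketch below is illustrative and uses our own conventions: every parity over the input and path variables is a Python integer bitmask, `outputs[i]` is the parity held by output x′_{i+1}, and the function returns a bitmask over the outputs (or None). The name `representative` is hypothetical.

```python
def _reduce(vector, combo, pivots):
    """Eliminate the leading bits of vector using the pivot table."""
    while vector:
        top = vector.bit_length() - 1
        if top not in pivots:
            break
        pivot_vector, pivot_combo = pivots[top]
        vector ^= pivot_vector   # clears bit `top`, so the loop terminates
        combo ^= pivot_combo     # track which outputs have been XORed in
    return vector, combo


def representative(outputs, parity):
    """Return c such that the XOR of outputs[i] over the set bits i of c
    equals parity, or None if parity is not in the span of the outputs."""
    pivots = {}  # leading bit -> (reduced vector, combination of outputs)
    for i, vector in enumerate(outputs):
        vector, combo = _reduce(vector, 1 << i, pivots)
        if vector:
            pivots[vector.bit_length() - 1] = (vector, combo)
    remainder, combo = _reduce(parity, 0, pivots)
    return combo if remainder == 0 else None


# Example 4.1.1: variables x1, x2, y are bits 0, 1, 2; outputs x'1 = x1, x'2 = y.
outputs = [0b001, 0b100]
print(representative(outputs, 0b001))  # x1 is represented by x'1
print(representative(outputs, 0b010))  # x2 has no representative
print(representative(outputs, 0b011))  # x1 xor x2 has no representative
```

Parities for which the function returns None correspond to the terms that would be mapped to ⊥, i.e. terms that can no longer be merged with later phase gates.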
4.2 The phase-folding algorithm
We now describe the phase-folding algorithm in full. Our algorithm first performs a phase
analysis, interpreting a circuit C over X, CNOT, RZ, and uninterpreted gates U as a state
transformer over a domain of (abstract) path integrals, analogous to classical compiler
optimizations; the algorithm then uses the phase polynomial to merge phase gates. For
convenience we assume each phase gate has been labelled with a unique identifier ℓ ∈ L
where L ⊆ N, and we write a labelled phase gate as RZ(θ)^ℓ.
We additionally assume that no †’s appear in the circuit. Note that any circuit can
be normalized to have all inverses at the gate level according to the group laws, and then
absorbed into gates with the identities X† = X, CNOT† = CNOT, RZ(θ)† = RZ(−θ),
and U† = U′ for some uninterpreted gate U′. Finally, to simplify the presentation of
uninterpreted gates, without loss of generality we assume all uninterpreted gates act on the
last k qubits.
4.2.1 Phase analysis
Let T = P(L × {1, −1}) be the powerset of phase terms, each with a polarity in {1, −1}
denoting the affine subspace at that location in the circuit – that is, whether the qubit
state is χ_y(x) or 1 ⊕ χ_y(x), respectively, hence the rotation is e^{iθ_ℓ χ_y(x)} or
\[
e^{i\theta_\ell (1 \oplus \chi_{\mathbf{y}}(\mathbf{x}))} = e^{i(\theta_\ell - \theta_\ell \chi_{\mathbf{y}}(\mathbf{x}))}.
\]
We define F^♯_n = P(T × (F_2^n ∪ {⊥})) to be the set of abstract phase polynomials on n
variables, represented as a set of phases each with an optional representative n-bit parity.
We assume that for any f^♯ ∈ F^♯_n and r ∈ F_2^n, there is at most one pair (T, r) ∈ f^♯, and
so we use the functional notation [r → T] to denote the set {(T, r)}. We define a union
operator ⊎ on F^♯_n × F^♯_n by taking the union of phases with the same representative:
\[
[r \to T] \uplus [r' \to T'] =
\begin{cases}
\{(T \cup T', r)\} & \text{if } r = r' \text{ and } r \in \mathbb{F}_2^n \\
\{(T, r), (T', r')\} & \text{otherwise}
\end{cases}
\]
Recall that A(F_2^n, F_2^n) is the set of affine transformations from F_2^n to F_2^n, and that a linear
(or affine) operator B acts on an affine operator A_b as B(A_b) = (BA)_{(Bb)}. For convenience
we write A_i to denote the ith row of a matrix A, e_i ∈ F_2^n to denote the ith elementary
vector and E_{i,j} ∈ L(F_2^n, F_2^n) the elementary matrix adding row i to row j.
We now define the domain of our analysis as A_n = F^♯_n × A(F_2^n, F_2^n) – i.e. path integrals
represented as pairs of phase polynomials and output functions. The analysis proceeds by
executing an n-qubit circuit C with the abstract semantics ⟦C⟧_a : A_n → A_n and a suitable
input state. The definition of ⟦·⟧_a is given below:
\[
\begin{aligned}
⟦X_i⟧_a(f^\sharp, A_{\mathbf{b}}) &= (f^\sharp, A_{\mathbf{b} \oplus e_i}) \\
⟦\mathrm{CNOT}_{i,j}⟧_a(f^\sharp, A_{\mathbf{b}}) &= (f^\sharp, E_{i,j}A_{\mathbf{b}}) \\
⟦R_Z(\theta)^{\ell}_i⟧_a(f^\sharp, A_{\mathbf{b}}) &= (f^\sharp \uplus [A_i^T \to \{(\ell, (-1)^{b_i})\}], A_{\mathbf{b}})
\end{aligned}
\]

















k)T (y,0)→ T ] if r = y and (A
g
kAk)T (y,0) = (y,0)
[⊥ → T ] otherwise
\[
⟦C ; C'⟧_a(f^\sharp, A_{\mathbf{b}}) = ⟦C'⟧_a \circ ⟦C⟧_a(f^\sharp, A_{\mathbf{b}})
\]
Note that A_{1,...,n−k} and b_{1,...,n−k} above denote the first n − k rows of A and b respectively,
and A_k^g is a generalized inverse [CM09] of A_k. In particular, a generalized inverse A^g of a
matrix A ∈ L(F_2^m, F_2^n) is an n by m matrix over F_2 such that
\[
A A^g A = A,
\]
and in particular AA^g y = y whenever there exists x such that Ax = y. We use a generic
generalized inverse rather than the Moore-Penrose pseudoinverse, as the latter typically
does not exist over finite fields. By contrast, for A ∈ L(F_2^m, F_2^n), a generalized inverse A^g














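Example 4.2.1. Consider the 3-qubit state (f^♯, A_0) where
\[
f^\sharp = [101 \to T_1,\ 001 \to T_2], \qquad
A = \begin{pmatrix} 1 & 1 & 0 \\ 1 & 0 & 1 \\ 0 & 0 & 1 \end{pmatrix},
\]
corresponding to a circuit implementing
\[
|\mathbf{x}'\rangle \mapsto \varphi(\mathbf{x}')\, |(x'_1 \oplus x'_2)(x'_1 \oplus x'_3)x'_3\rangle
\]
and applying phase rotations at locations in T_1 and T_2 on the states |x_1 ⊕ x_3⟩ and |x_3⟩,
respectively.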
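We can see the effect of next applying a single-qubit uninterpreted gate U_3 by calculating
a generalized inverse of A_k:
\[
A_k = \begin{pmatrix} 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix},
\qquad
A_k A_k^g = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}.
\]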
The matrix A_k A_k^g effectively projects a vector onto the row space of A_k – that is, onto the
basis
{x′_1 ⊕ x′_2, x′_1 ⊕ x′_3, y},
where y is the path variable corresponding to the output of U . Intuitively, the effect is a








Moreover, we see that the phase term corresponding to T1 has a representation as a

















On the other hand, since x′3 /∈ span{x′1 ⊕ x′2, x′1 ⊕ x′3, y}, T2 does not have a representative



















Algorithm 4.1 gives the phase folding algorithm. The algorithm takes as input a circuit
C and an initial state Ab ∈ A(Fn2 ,Fn2 ) giving the intended input space of the circuit (i.e.,
which qubits are initialized), and returns a circuit with merged phase gates.
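Example 4.2.2. For a circuit over 2 input qubits and 1 ancilla initialized in the |1⟩ state,
\[
A = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{pmatrix},
\qquad
\mathbf{b} = \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix}.
\]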




Algorithm 4.1 Quantum phase-folding
1: function Phase-fold(C, A_b)
2:     (f^♯, A′_{b′}) ← ⟦C⟧_a(∅, A_b)
3:     for (T, r) ∈ f^♯ do
4:         Pick some (ℓ, a) ∈ T
5:         θ_ℓ ← a · Σ_{(ℓ′,a′)∈T} a′ · θ_ℓ′
6:         For all remaining (ℓ′, a′) ∈ T, θ_ℓ′ ← 0
7:     end for
8: end function
The algorithm, in analogy to classical data flow analysis, runs the circuit C with the
abstract semantics ⟦·⟧_a to determine sets of phase gates which may be merged (terms
T ∈ T). For each such set of phase gate locations, we then choose a single phase gate
location ℓ which will apply the total rotation of Σ_{(ℓ′,a′)∈T} a′ · θ_ℓ′, where θ_ℓ′ is the argument
to the phase gate ℓ′. The multiplicative factor of a adjusts (up to global phase) for whether
the rotation is applied to x or 1 ⊕ x, as
\[
e^{i\theta x} = e^{i\theta} e^{-i\theta(1 \oplus x)}.
\]
While the algorithm as described only preserves unitary equivalence up to global phase,
it is an easy adjustment to preserve strict equivalence by tracking the global phase and
applying a correction at the end, noting that e^{iθ}I = RZ(θ)XRZ(θ)X.
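To make the merging step concrete, the following Python sketch implements a simplified variant of phase-folding. It is not the algorithm of this section verbatim: it handles only X, CNOT, RZ and single-qubit H gates, treats each H as introducing a fresh path variable, and keeps parities over all n + m variables instead of using representatives. Angles are in units of π and are not reduced modulo 2π, equivalence is preserved only up to global phase, and all names and the tuple encoding of gates are our own assumptions.

```python
from fractions import Fraction


def phase_fold(n, circuit):
    """Merge R_Z gates that rotate on the same parity.

    circuit is a list of ("x", q), ("cnot", c, t), ("h", q) and
    ("rz", theta, q) tuples, with theta in units of pi.
    """
    parity = [1 << i for i in range(n)]  # linear part of each qubit state
    flip = [0] * n                       # affine part (1 after an odd number of X)
    fresh = n                            # index of the next path variable
    groups = {}                          # parity -> [(gate index, polarity)]
    for idx, gate in enumerate(circuit):
        kind = gate[0]
        if kind == "x":
            flip[gate[1]] ^= 1
        elif kind == "cnot":
            _, c, t = gate
            parity[t] ^= parity[c]
            flip[t] ^= flip[c]
        elif kind == "h":
            parity[gate[1]] = 1 << fresh  # output is a fresh path variable
            flip[gate[1]] = 0
            fresh += 1
        elif kind == "rz":
            q = gate[2]
            polarity = -1 if flip[q] else 1
            groups.setdefault(parity[q], []).append((idx, polarity))
        else:
            raise ValueError(f"unsupported gate {kind!r}")
    angles = {i: g[1] for i, g in enumerate(circuit) if g[0] == "rz"}
    for terms in groups.values():
        keep, a = terms[0]                # location that carries the rotation
        total = a * sum(p * angles[i] for i, p in terms)
        for i, _ in terms:
            angles[i] = 0
        angles[keep] = total
    folded = []
    for i, gate in enumerate(circuit):
        if gate[0] != "rz":
            folded.append(gate)
        elif angles[i] != 0:
            folded.append(("rz", angles[i], gate[2]))
    return folded


T, T_DAG = Fraction(1, 4), Fraction(-1, 4)
print(phase_fold(1, [("rz", T, 0), ("x", 0), ("rz", T_DAG, 0), ("x", 0)]))
```

On the circuit T X T† X the two rotations share the parity x with polarities 1 and −1, so the sketch keeps a single rotation of angle π/2 (an S gate) at the first location and removes the second.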
Theorem 4.2.3. Algorithm 4.1 takes time polynomial in the volume (qubits × gates) of
the input circuit and is strictly non-increasing on all gate counts.
Proof. A consequence of the fact that the size of the state (f^♯, A_b) is only ever at most
quadratic in the volume of the circuit, and moreover no gates are added to the circuit.
A proof of correctness of Algorithm 4.1 is given in Chapter A.
4.2.3 Examples
To make the phase-folding algorithm more concrete, we now give some examples highlighting
various features.
Example 4.2.4 (Phase folding). We first give a simple example showing the merging of
phase gates across uninterpreted gates. Note that a Hadamard gate is treated identically
to an uninterpreted gate.
Recall that the controlled-S gate can be implemented as (T ⊗ T ) ; CNOT ; (I ⊗ T †) ;




[Circuit diagram. Qubit 1: T^{ℓ1} • • T^{ℓ4} (T†)^{ℓ6}. Qubit 2: T^{ℓ2} (T†)^{ℓ3} H T^{ℓ5} • •.]
After the first controlled-S gate, the state of the analysis is
([10 → {(ℓ1, 1)}, 01 → {(ℓ2, 1)}, 11 → {(ℓ3, 1)}], I_0).
After the Hadamard gate, the second qubit is in the state y, hence parities x_2 and x_1 ⊕ x_2
corresponding to 01 and 11 above no longer have representatives. The resulting state is
([10 → {(ℓ1, 1)}, ⊥ → {(ℓ2, 1)}, ⊥ → {(ℓ3, 1)}], I_0).
Applying the remaining gates, corresponding to the second controlled-S, results in the set
of phase terms
[10 → {(ℓ1, 1), (ℓ4, 1)}, ⊥ → {(ℓ2, 1)}, ⊥ → {(ℓ3, 1)}, 01 → {(ℓ5, 1)}, 11 → {(ℓ6, 1)}].







It can be observed that each term in the final set of phase terms corresponds exactly to
one term of the actual phase polynomial, with the 2x_1 factor representing the fact that
phase gates ℓ1 and ℓ4 rotate on the same parity (x_1), applying a single total rotation of ω^2.
Selecting ℓ1 for this total rotation gives a final circuit of
[Circuit diagram. Qubit 1: S^{ℓ1} • • (T†)^{ℓ6}. Qubit 2: T^{ℓ2} (T†)^{ℓ3} H T^{ℓ5} • •.]
Example 4.2.5 (Ancillas). Now we consider the effect of qubit initialization on
phase-folding. In particular, the following circuit begins with the second qubit initialized in the
|0⟩ state and implements the transformation |x⟩|0⟩ ↦ i^x|x⟩|x⟩:
• T `2
0 T `1 T `3






, b = 0,
the phase folding algorithm computes







, b′ = 0.
If for the second phase term, `3 is chosen, the resulting circuit after phase-folding is
•
0 S`3
Example 4.2.6 (Global phase). Consider the circuit
T^{ℓ1} X (T†)^{ℓ2} X
The circuit implements the unitary |x⟩ ↦ ω^{7+2x}|x⟩, corresponding to a global phase of
ω^7 = ω^{−1} and a relative phase of ω^{2x}.
To see the effect of phase-folding on the above circuit, we can first observe that
⟦T^{ℓ1} ; X⟧_a(∅, I_0) = ([1 → {(ℓ1, 1)}], I_1).
Since b_1 = 1, the T† gate applies a rotation of ω^{−(1⊕x)} rather than ω^{−x}, and hence the result
of the phase analysis is
f^♯ = [1 → {(ℓ1, 1), (ℓ2, −1)}].
Computing the total phase applied to the state |x⟩ we get θ_ℓ1 − θ_ℓ2 = π/2. Choosing either
ℓ1 or ℓ2 to apply the rotation we get the circuits on the left and right below, implementing
the transformations |x⟩ ↦ ω^{2x}|x⟩ and |x⟩ ↦ ω^{6+2x}|x⟩ which differ from the original circuit
by a global phase of ω^{−1} and ω, respectively.
S^{ℓ1} X X          X (S†)^{ℓ2} X
In either case, the global phase can be corrected by adding an ωI = (HS)^3 or (ωI)† gate
to the end of the circuit.
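The phases in this example can be confirmed numerically with 2 × 2 matrices. The sketch below is a plain-Python check under our own conventions: `circuit` multiplies gates given in left-to-right (time) order, all helper names are hypothetical, and comparisons use a small floating-point tolerance.

```python
import cmath
import math


def matmul(a, b):
    """Product of two 2x2 complex matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def close(a, b, tol=1e-12):
    """Entrywise comparison of 2x2 matrices up to a tolerance."""
    return all(abs(a[i][j] - b[i][j]) < tol for i in range(2) for j in range(2))


def circuit(*gates):
    """Matrix of a circuit whose gates are listed in time order."""
    m = [[1, 0], [0, 1]]
    for g in gates:
        m = matmul(g, m)  # later gates multiply on the left
    return m


def scale(c, m):
    """Multiply a 2x2 matrix by the scalar c."""
    return [[c * m[i][j] for j in range(2)] for i in range(2)]


omega = cmath.exp(1j * math.pi / 4)
s = 1 / math.sqrt(2)
IDENTITY = [[1, 0], [0, 1]]
H = [[s, s], [s, -s]]
S = [[1, 0], [0, 1j]]
S_DAG = [[1, 0], [0, -1j]]
X = [[0, 1], [1, 0]]
T = [[1, 0], [0, omega]]
T_DAG = [[1, 0], [0, omega.conjugate()]]

original = circuit(T, X, T_DAG, X)
print(close(original, [[omega ** 7, 0], [0, omega ** 9]]))   # omega^{7+2x}
print(close(original, scale(omega ** -1, circuit(S, X, X))))  # folded onto l1
print(close(original, scale(omega, circuit(X, S_DAG, X))))    # folded onto l2
print(close(circuit(H, S, H, S, H, S), scale(omega, IDENTITY)))
```

All four comparisons are expected to print True, matching the global phases ω^{−1} and ω stated above and the identity (HS)^3 = ωI.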
Example 4.2.7 (Optimization barriers). The OpenQASM language [CBSG17] includes a
barrier instruction which semantically has no effect, but blocks optimizations from crossing
the boundary it imposes. For instance, the T and T † gates below on either side of the
vertical barrier line cannot be cancelled, while the T and T † gates not separated by the
barrier can be cancelled:
T `1 (T †)`3
T `2 (T †)`4
Such barriers are useful particularly in circuits for error correction, where the structure of
the circuit is often important to preserve error propagation behaviours.
We can precisely emulate the barrier instruction in the phase-folding algorithm by
simply treating the barrier as an uninterpreted gate:
⟦C⟧_a(∅, I_0) = ([⊥ → {(ℓ1, 1)}, 10 → {(ℓ3, 1)}, 01 → {(ℓ2, 1), (ℓ4, 1)}], I_0).
Hence, even though the barrier gate has no semantic effect, the analysis concludes that
T^{ℓ1} and (T†)^{ℓ3} are not mergeable, while T^{ℓ2} and (T†)^{ℓ4} are still mergeable as desired.
Example 4.2.8 (Parametrized circuits). The phase-folding algorithm naturally applies
to parametrized circuits which frequently appear in QRAM-style quantum programs and
are integral to algorithms such as the variational quantum eigensolver (VQE). We can
model such circuits by extending phase angles to real-valued terms θ ∈ R[v1, . . . , vk] in
some parameters v1, . . . , vk. Then,
θ(v1, . . . , vk) + θ′(v1, . . . , vk) = (θ + θ′)(v1, . . . , vk)
and hence RZ(θ)RZ(θ′) = RZ(θ + θ′) for all values of the parameters.
This opens up phase-folding optimization to any quantum circuit description language
with real-valued variables and expressions. For instance, the following QCL [Ö03] code
defines a parametrized circuit (operator).
const pi = 3.141592653589793238462643383279502884197;




By interpreting the ∑ in line 5 of Algorithm 4.1 as the + expression of QCL, with phase-
folding we can optimize the body of foo to the equivalent program:
const pi = 3.141592653589793238462643383279502884197;
operator foo(real theta , qureg q) {
RotZ(theta + pi/2, q[0]);
}
4.3 CNOT-dihedral resynthesis
We now discuss a modification of the phase-folding algorithm to allow further optimization
by the resynthesis of CNOT-dihedral subcircuits. Recall that an n-qubit CNOT-dihedral
circuit C implements some affine permutation (A, b) ∈ GA(n, F_2) of an input |x⟩, together
with a phase function f : F_2^n → R – i.e.
\[
⟦C⟧ : |\mathbf{x}\rangle \mapsto e^{i f(\mathbf{x})}\, |A\mathbf{x} + \mathbf{b}\rangle.
\]
Moreover, as supp(f̂) has size linear in the size of C, CNOT-dihedral circuits are a natural
candidate for optimal synthesis.
Since the cost and complexity of CNOT-dihedral synthesis depends largely on the number
of terms in the phase polynomial, we naturally wish to apply phase-folding to first reduce
the complexity of the phase polynomials, before re-synthesizing CNOT-dihedral subcircuits.
In general however, there may be multiple ways to divide a circuit into CNOT-dihedral
chunks, which can affect the cost of the individual synthesis problems.
For instance, the T gate below can be associated with either the left or the right
CNOT-dihedral chunk C or C ′, respectively:
C C ′
• T T
T S H X • T
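Noting that
\[
\begin{aligned}
⟦C⟧ &: |x_1x_2\rangle \mapsto e^{i\frac{\pi}{4}(x_2 + 2(x_1 \oplus x_2))}\, |x_1\rangle|x_1 \oplus x_2\rangle \\
⟦C'⟧ &: |x_1x_2\rangle \mapsto e^{i\frac{\pi}{4}(2 + 7x_2 + 7(x_1 \oplus x_2))}\, |1 \oplus x_1 \oplus x_2\rangle|1 \oplus x_2\rangle,
\end{aligned}
\]
associating the T gate with either subcircuit adds a term of (π/4)x_1 to the phase polynomial
in either case. We say that the term (π/4)x_1 can range between C and C′. Labelling the phase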






• S`3 T `4
T `1 T `2 H X • T `5
In this case, associating the term (π/4)x_1 with the subcircuit C allows the total circuit to be
implemented with T-depth 2, while associating it with C′ gives a total T-depth of 3, as
shown on the left and right below, respectively.
C C ′
T • T
T S H X • T
C C ′
• T T
T S H X • T
The algorithm we now describe combines phase-folding with an additional backwards
pass to determine the range of each phase term. A particular synthesis algorithm may then
freely associate terms with particular CNOT-dihedral subcircuits in their range.
4.3.1 Phase range analysis
Informally, the algorithm works by recording the point in a circuit at which a given term no
longer has a representative parity over the current state, and hence is no longer computable
with just CNOT gates. By applying phase-folding both forwards and backwards, an interval
or range for each term is obtained. To simplify the presentation, we describe only the range
analysis – that is, we don’t account for phase gate merging in this analysis. For practical
implementations, phase folding can be easily combined with the initial forward analysis.
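We assume that every gate in a circuit $C$ is labelled with a unique identifier $\ell \in L$, and
that if $U^{\ell}$ occurs before $U'^{\ell'}$ in $C$, then $\ell < \ell'$. We let $\mathcal{F}_n = \mathbb{F}_2^n \times L$ denote the set of phase
polynomials on $n$ qubits and as before describe basis states by some $A_b \in \mathcal{A}(\mathbb{F}_2^n, \mathbb{F}_2^n)$. We
denote by $\iota \in \mathcal{RL}_n = \mathcal{P}(L)^L$ mappings from locations (of phase gates) to subsets of circuit
locations called ranges. We denote by $[\ell, \ell']$ the range $\{\ell'' \in L \mid \ell \le \ell'' \le \ell'\}$ and by $-\infty$ and $+\infty$
the least and greatest elements of $L$, respectively. For $\iota, \iota' \in \mathcal{RL}_n$ we define $\iota \cap \iota'$ as
$$(\iota \cap \iota')(\ell) = \iota(\ell) \cap \iota'(\ell).$$
The domain of our analysis is then $\mathcal{A}_n = \mathcal{F}_n \times \mathcal{A}(\mathbb{F}_2^n, \mathbb{F}_2^n) \times \mathcal{RL}_n$.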
As in the phase-folding analysis, the algorithm proceeds by running a circuit C according
to the abstract semantics JCKa : An → An on an input state. We define J·Ka below:
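$$\llbracket X_i^{\ell} \rrbracket_a(f, A_b, \iota) = (f, A_{b \oplus e_i}, \iota)$$
$$\llbracket \mathrm{CNOT}_{i,j}^{\ell} \rrbracket_a(f, A_b, \iota) = (f, E_{i,j}A_b, \iota)$$
$$\llbracket R_Z(\theta)_i^{\ell} \rrbracket_a(f, A_b, \iota) = (f \cup \{(A_i^T, \ell)\}, A_b, \iota)$$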
JU `n−k+1,...,nKa(f, Ab, ι) = (f ′, (AkA
g












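$\ell' \mapsto [-\infty, \ell]$
$$\llbracket C ; C' \rrbracket_a = \llbracket C' \rrbracket_a \circ \llbracket C \rrbracket_a$$
Additionally we define $\llbracket \cdot \rrbracket_a^{-1}$ identically, except that in the definition of $\iota'$, $\ell' \mapsto [\ell, \infty]$.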
The full algorithm is given in Algorithm 4.2. As with phase folding, a labelled circuit
and an initial state are given. The algorithm then uses the abstract semantics of $C$ to
compute an upper bound on the range of each phase gate, as well as the output state $A'_{b'}$.
Using the output state as the initial state for the (inverse) semantics of $C^\dagger$ then gives a
lower bound for each phase gate.
Example 4.3.1. Consider the following circuit:
R`1Z H







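Algorithm 4.2 Phase range analysis
1: function Phase-range($C$, $A_b$)
2:   Set $\iota(\ell) = [-\infty, \infty]$ for all $\ell \in L$
3:   $(f', A'_{b'}, \iota') \leftarrow \llbracket C \rrbracket_a(\emptyset, A_b, \iota)$
4:   $(f'', A''_{b''}, \iota'') \leftarrow \llbracket C^\dagger \rrbracket_a^{-1}(\emptyset, A'_{b'}, \iota')$
5:   return $\iota''$
6: end function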













Running the forward phase of the analysis we find an upper bound for the range of each
phase gate – in particular, the point after which the parity the phase gate rotates on can













Finally, running the backwards phase of the analysis gives a lower bound, corresponding













In many problems – for instance, T -depth, T -count and CNOT-count optimization – it
is often desirable to associate phases such that the number of intervals (CNOT-dihedral
chunks) which contain phase terms is minimized. Visually this can be achieved by placing
the minimal number of vertical lines bisecting every range. For the circuit above, a minimal
solution is obtained by dividing up the phases between the second and third intervals.
We leave discussion of more advanced strategies to individual CNOT-dihedral synthesis
algorithms.
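The line-placement step described above is an instance of the classic problem of stabbing a set of intervals with as few points as possible. The following is a minimal sketch of the standard greedy solution; it is not taken from Feynman, and the encoding of ranges as closed integer pairs `(lo, hi)` and the name `min_cut_points` are assumptions made purely for illustration.

```python
def min_cut_points(ranges):
    """Greedy interval stabbing: return a smallest list of positions such
    that every closed range (lo, hi) contains at least one chosen position."""
    points = []
    last = None
    # Sorting by right endpoint lets each chosen position cover as many
    # of the remaining ranges as possible.
    for lo, hi in sorted(ranges, key=lambda r: r[1]):
        if last is None or lo > last:
            last = hi
            points.append(hi)
    return points


if __name__ == "__main__":
    # Hypothetical ranges for four phase terms (illustrative data only).
    print(min_cut_points([(1, 3), (2, 5), (4, 7), (6, 8)]))
```

Each returned position plays the role of one vertical line; a synthesis algorithm is of course free to use a different assignment strategy.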
4.4 Experiments
We implemented both Algorithm 4.1 and Algorithm 4.2 in Haskell in the compiler backend
Feynman. To evaluate Algorithm 4.1 we ran it on a suite of benchmarks, described
below; evaluation of Algorithm 4.2 is left to chapters 5 and 6 where specific CNOT-dihedral
synthesis algorithms are given. Experiments were run in Linux on a quad-core 64-bit Intel
Core i7 2.40 GHz processor and 8 GB RAM.
Benchmarks As the set of benchmarks we use to evaluate phase-folding will be used
throughout this thesis to evaluate other algorithms, we describe them here. The benchmark
suite comprises circuits taken from the literature or adapted from publicly available¹
circuits. The full suite is included with Feynman. The benchmarks are primarily
reversible implementations of arithmetic, as such circuits are believed to be the primary
source of complexity in quantum algorithms [IAR13]. The benchmark suite also contains
implementations of strictly quantum algorithms – namely, Grover’s algorithm with oracle
$f(x) = \neg x_1 \wedge \neg x_2 \wedge x_3 \wedge x_4 \wedge \neg x_5$ and the Quantum Fourier Transform on 4 qubits with
higher-order rotation gates approximated over Clifford+T.
¹ http://webhome.cs.uvic.ca/~dmaslov/
All benchmarks are initially written over Clifford+T and multiply-controlled Toffoli
gates. To generate Clifford+T implementations, $k$-controlled Toffoli gates are decomposed
into $2k - 3$ Toffoli gates (i.e. TOF) with $k - 2$ 0-initialized ancillas, as per the standard
decomposition in [NC00]. Toffoli gates are further decomposed into Clifford+T gates using
the T-depth minimal decomposition [AMMR13]
T • • T † •
T • T † T † •
H T • T • H
This decomposition was chosen as it achieves the minimal T -count and T -depth, and hence
is the lowest-cost decomposition in most fault-tolerant frameworks.
Results Table 4.1 reports the results of our experiments. The phase-folding algorithm
was run on each circuit, followed by a simple gate cancellation phase where adjacent gates
that are inverses of one another are cancelled. We include this phase as it is a standard
optimization which can further decrease some gate counts since the removal of phase gates
allows some new trivial reductions. It should be noted that the phase gate counts are
identical with and without the additional gate cancellation pass – that is, all of the phase
gate reductions are a result of phase-folding. Additionally, average reductions are not given
for S and Z gates, as they were typically increased from 0 due to T gate merging.
The results show that T -count is reduced by an average of 42.1% across all benchmarks,
with a maximum reduction of 65.7% for the VBE-Adder_3 benchmark. This shows the
remarkable amount of phase gate cancellation possible in real-world quantum circuits.
Moreover, depth, T-depth, and CNOT-counts were decreased in every benchmark. On the
other hand, H- and X-counts saw a small increase – 0.5% and 1.3%, respectively – which
is due to the global phase correction using the $(HS)^3 = \omega$, $XSXS = i$ and $ZXZX = -1$
decompositions.
The run-times also show that the algorithm scales to some reasonably large circuits –
particularly the GF($2^{256}$)-Mult and HWB12 benchmarks, with circuit volumes $917{,}291 \times 768 =
704{,}479{,}488$ and $396{,}101 \times 20 = 7{,}922{,}020$ and run-times of 14h and 3h, respectively. The
high degree of difficulty of the HWB benchmarks appears to be due to the very high
number of Hadamard gates, which induce costly generalized inverse computations. In either

Phase gate merging Recently, Nam et al. [NRS+18] developed a similar algorithm for
merging phase gates in Clifford+RZ circuits, possibly with parametrized gates. Their work
also builds on [AMM14], extracting the logic of phase gate merging from the circuit re-
synthesis subroutine to allow optimization without increasing any gate costs. In comparison
to their work, our algorithm is applicable to and sound in the presence of uninterpreted
gates, and hence can merge phase gates over any gate set. Their phase merging algorithm
was also informally described, while ours was described in an analogous way to classical
compiler optimizations (via abstract interpretation), making a formal analysis possible and
allowing it to be easily implemented in other compilers.
Our work additionally diverges in the handling of optimization beyond the merging of
phase gates. Specifically, they pair phase gate merging with efficient mapping strategies as
well as local re-writes through a collection of circuit identities. In contrast, our methods
focus on the grouping of phase terms into CNOT-dihedral subcircuits which can then be
re-synthesized. This allows us to study and leverage optimal synthesis problems over such
circuits, which is the topic of the following two chapters.
Abstract interpretation The presentation of the phase-folding algorithm was inspired
by abstract interpretation [CC77], and particularly in the quantum domain by Perdrix’s
entanglement analysis [Per08]. However, in comparison to those works we have a very
simple program model without recursion, which allows us to do away with most of the
heavy machinery of abstract interpretation. On the other hand, the connection to abstract
interpretation motivates the possibility of extending phase-folding to quantum programs
which include measurement and classical control by abstracting the circuit collecting
semantics – in analogy to (trace-based) collecting semantics, the set of all possible sequences





The phase-folding optimization of the previous chapter is known to be strictly heuristic,







and hence the circuit
T • • • • • • • •
T T • • • •
T T T T • •
T T T T T T T T





χy(x)|x〉 is equal to the identity circuit. In
this case however, phase-folding will not reduce any of the T gates as each is applied to a
different parity of the input state. Seen another way, the phase-folding algorithm exploits
equalities between circuits with the same phase polynomial, since for CNOT-dihedral circuits
in particular it computes the exact phase polynomial and implements it with the minimal
number (|supp(f̂)|) of RZ gates. However, as Figure 5.1 illustrates, there exist equalities
between circuits with different phase polynomials which are not exploited by the phase
folding algorithm. Such equalities arise due to the equivalence f ≡ f ′ mod 2π.
In this chapter we consider the problem of optimally synthesizing a CNOT-dihedral
circuit given its sum-over-paths action – that is, a phase polynomial








Figure 5.1: Relationship between circuit, path integral and unitary representations. (Diagram omitted: circuits $C$, $C'$ and $C''$, the sum-over-paths actions $(f, A_b)$ and $(f', A'_{b'})$, and the unitary $U$.) All three
circuits have the same unitary semantics, while only $C$ and $C'$ have the same (canonical)
sum-over-paths action.
and affine transformation $A_b \in \mathrm{GA}(n, \mathbb{F}_2)$. Here we define the cost to be the number of
highest-order phase gates, hence an optimal circuit is one with the fewest phase gates of
highest order. In particular, we consider the CNOT-dihedral group of order $2m$ for any
$m \in \mathbb{N}$, and optimize the number of $Z_m = R_Z(2\pi/m)$ gates. We consider this version of the
problem as a phase gate of any angle can always be decomposed as gates of smaller angle –
for instance, $Z_{m/2} = Z_m Z_m$. This situation is reflected in fault-tolerance constructions,
where smaller angle gates are generally progressively more difficult to implement [DCP15].






where $k_{\lfloor \log m \rfloor} \ldots k_1 k_0$ is the binary expansion of $k$. We hence say that the $Z_m$-cost of $Z_m^k$ is
1 if $k = 1 \bmod 2$ and 0 otherwise – for instance, the T cost of $T^6 = ZS$ is 0.
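As a quick sanity check of this cost convention, the sketch below computes the $Z_m$-cost of a power $Z_m^k$ and numerically confirms the identity $T^6 = ZS$ by comparing the phases applied to $|1\rangle$. The helper names `zm_cost` and `zm_phase` are illustrative assumptions, and gates are modelled only by their diagonal phase $e^{2\pi i k/m}$.

```python
import cmath


def zm_cost(k):
    """Z_m-cost of (Z_m)^k: 1 if k is odd, 0 otherwise."""
    return k % 2


def zm_phase(k, m):
    """Phase applied to |1> by (Z_m)^k, with Z_m = R_Z(2*pi/m)."""
    return cmath.exp(2j * cmath.pi * k / m)


if __name__ == "__main__":
    t6 = zm_phase(6, 8)                   # T^6, with T = Z_8
    zs = zm_phase(1, 2) * zm_phase(1, 4)  # Z * S, with Z = Z_2 and S = Z_4
    print(abs(t6 - zs) < 1e-12, zm_cost(6))
```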
We prove that for any $m = 2^k l$ where $l$ is odd, optimizing the number of $Z_m$ gates
in the $n$-qubit CNOT-dihedral group $\langle \mathrm{CNOT}, X, Z_m \rangle$ is polynomial-time equivalent to
minimum-distance decoding for the length $2^n - 1$ punctured Reed-Muller code of order
$n - k - 1$, $RM(n - k - 1, n)^*$. As a consequence, we give a concrete optimization algorithm
which leverages existing Reed-Muller decoders to perform phase gate optimization. We
also present several important corollaries, including that the optimal synthesis problem
is polynomial-time solvable for $k \le 2$, that the problem reduces to computing symmetric
3-tensors for $k = 3$, and that the number of T gates in a {CNOT, X, T} circuit is $O(n^2)$.
This work appears in [AM16] and is implemented in the T-par software package.
5.1 Coding theory
We first introduce some concepts from coding theory (see, e.g., MacWilliams & Sloane
[MS78]). A length $n$ binary linear code is a linear subspace $C$ of $\mathbb{F}_2^n$ and elements of $C$
are called codewords. A central concept in coding theory is that of decoding – given a
received vector $w = x \oplus e$ where $x \in C$ and $e \in \mathbb{F}_2^n$ is some error vector, find $x$. We define
the (Hamming) weight of a binary vector $x \in \mathbb{F}_2^n$, denoted $|x|$, as the number of non-zero
entries of $x$, and further define the (Hamming) distance $\delta(x, y)$ between two binary vectors
$x, y \in \mathbb{F}_2^n$ to be the weight of their sum:
$$\delta(x, y) = |x \oplus y|.$$
A decoding y ∈ C of a received vector w is said to be a minimum-distance decoding of w
in C if it minimizes δ(w,y), and the minimum-distance decoding problem is likewise to
compute a minimum-distance decoding of a received vector.
Definition 5.1.1 (minimum-distance decoding problem for $C$). Given a vector $w \in \mathbb{F}_2^n$,
find $y \in C$ such that for all $z \in C$, $\delta(w, y) \le \delta(w, z)$.
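To make the definition concrete, the following sketch solves the minimum-distance decoding problem by exhaustive search over an explicitly listed code. It is only meant to illustrate the definition: it is exponential in general, and the function names and the toy repetition code are assumptions chosen for the example.

```python
def hamming_distance(x, y):
    """delta(x, y) = |x XOR y| for equal-length binary tuples."""
    return sum(a != b for a, b in zip(x, y))


def min_distance_decode(code, w):
    """Return a codeword of `code` at minimum Hamming distance from w."""
    return min(code, key=lambda y: hamming_distance(w, y))


if __name__ == "__main__":
    # Length-3 repetition code, used here purely as a toy example.
    repetition = [(0, 0, 0), (1, 1, 1)]
    print(min_distance_decode(repetition, (1, 0, 1)))
```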
The problem of finding a minimum distance decoding is closely related to the more
general closest vector problem over a lattice, and in fact coincides with the closest vector
problem over the lattice C with the Hamming weight as the norm. Minimum distance
decoding is commonly studied as it is equivalent to maximum likelihood decoding when
errors are independent.
The covering radius of a code gives the maximum distance of a minimum-distance
decoding.








The family of binary Reed-Muller codes [Mul54, Ree54] can be defined as the evaluation
vectors of multivariate Boolean polynomials up to a maximum degree, called the code’s
order. In particular, let $\mathbb{F}_2[X_1, X_2, \ldots, X_n]$ be the ring of polynomials in $n$ variables over $\mathbb{F}_2$.
We use the symbols $X_1, X_2, \ldots, X_n$ to denote formal variables so as to differentiate them
from binary vectors and values. Given $f \in \mathbb{F}_2[X_1, X_2, \ldots, X_n]$ we define the evaluation
vector of (the polynomial function) $f$ to be the length $2^n$ binary vector consisting of the







$$\begin{pmatrix} f(0, 0, \ldots, 0) \\ f(1, 0, \ldots, 0) \\ \vdots \\ f(1, 1, \ldots, 1) \end{pmatrix}.$$
We denote the evaluation vector of a polynomial function $f$ by bold upright font, $\mathbf{f}$.





The degree of a polynomial function $f \in \mathbb{F}_2[X_1, X_2, \ldots, X_n]$, denoted $\deg(f)$, is defined as
the maximum total degree of its monomials. Since $X^2 = X$ for all $X \in \mathbb{F}_2$, the evaluation
vectors of polynomial functions are unique over the quotient ring $\mathbb{F}_2[X_1, X_2, \ldots, X_n]/\langle X_1^2 - X_1, \ldots, X_n^2 - X_n \rangle$,
and so we only consider reduced polynomials over this ring – that is, with
exponents in $\mathbb{F}_2^n$.
Identifying the monomial $X^y$ with the Boolean function $f(X_1, X_2, \ldots, X_n) = X^y$, we
denote the evaluation vector of $X^y$ by $\mathbf{X}^y$, which is equal to the component-wise multiple











with exponentiation of a Boolean vector also defined component-wise. Figure 5.2 illustrates
the evaluation vectors of the $2^n$ monomials on $n$ variables, from which it can be observed
that they form a linear basis of $\mathbb{F}_2^{2^n}$, or functions $\mathbb{F}_2^n \to \mathbb{F}_2$.
We can now define the Reed-Muller family of codes and their punctured versions.
                     0    1    2    3    4    ···   2^n − 1
1                    1    1    1    1    1    ···   1
X_1                  0    1    0    1    0    ···   1
X_2                  0    0    1    1    0    ···   1
X_1 X_2              0    0    0    1    0    ···   1
X_3                  0    0    0    0    1    ···   1
...                  ...  ...  ...  ...  ...  ···   ...
X_1 X_2 ··· X_n      0    0    0    0    0    ···   1
Figure 5.2: Evaluation vectors for monomials of n variables.
Definition 5.1.3 (Reed-Muller codes). The Reed-Muller code of order $r$ and length $2^n$,
denoted $RM(r, n)$, is the set of evaluation vectors of polynomials $f \in \mathbb{F}_2[X_1, X_2, \ldots, X_n]$
with degree at most $r$.
The punctured Reed-Muller code of order $r$ and length $2^n - 1$, denoted $RM(r, n)^*$, is
the Reed-Muller code of order $r$ and length $2^n$ with the first coordinate of every codeword
removed. Equivalently, the punctured Reed-Muller code can be defined as the quotient
$$RM(r, n)/\langle (1, 0, \ldots, 0) \rangle.$$
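The definition can be explored directly for small parameters. The sketch below enumerates $RM(r, n)$ as the $\mathbb{F}_2$-span of the evaluation vectors of all monomials of degree at most $r$, and punctures it by dropping the first coordinate. It is exponential in size and intended only as an illustration; the function names and the convention that point $x$ assigns bit $i$ of $x$ to $X_{i+1}$ (matching the column order of Figure 5.2) are assumptions of this example.

```python
from itertools import combinations


def monomial_vector(variables, n):
    """Evaluation vector of the monomial over the given 0-based variable
    indices; entry x is 1 iff all of those variables are set in x."""
    mask = sum(1 << i for i in variables)
    return tuple(1 if (x & mask) == mask else 0 for x in range(2 ** n))


def reed_muller(r, n):
    """All codewords of RM(r, n), as the span of monomials of degree <= r."""
    code = {tuple([0] * 2 ** n)}
    for degree in range(r + 1):
        for variables in combinations(range(n), degree):
            g = monomial_vector(variables, n)
            # Add g to every codeword found so far, doubling the span.
            code |= {tuple(a ^ b for a, b in zip(c, g)) for c in code}
    return code


def punctured(code):
    """Remove the first coordinate of every codeword."""
    return {c[1:] for c in code}


if __name__ == "__main__":
    print(monomial_vector((0,), 3))          # row X_1 of Figure 5.2 for n = 3
    print(len(reed_muller(1, 3)))            # number of codewords of RM(1, 3)
    print(sorted(punctured(reed_muller(0, 2))))
```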
Example 5.1.4. The vector






























We can now formally state our main result.
Theorem 5.1.5. Let $m = 2^k l$ where $l$ is odd. Then the optimal synthesis problem for
$\langle \mathrm{CNOT}, X, Z_m \rangle$ with respect to $Z_m$-count, and the minimum-distance decoding problem for
$RM(n - k - 1, n)^*$ are polynomial-time equivalent – that is, there exist polynomial-time
reductions from each to the other.
The rest of this chapter first frames the $Z_m$-count optimization problem as a decoding
problem, then gives a set of generators and studies some consequences, including
Theorem 5.1.5.
5.2 Decoding phase polynomials
In this section we reduce the problem of synthesizing a CNOT-dihedral circuit with a
minimum number of $Z_m$ gates given its sum-over-paths action to a decoding problem over
$\mathbb{Z}_m^{2^n}$ – specifically, minimizing the number of odd entries over $a + G$ for some $a \in \mathbb{Z}_m^{2^n}$ where
$G$ is a particular subgroup of $\mathbb{Z}_m^{2^n}$.
First recall that the sum-over-paths action of an $n$-qubit CNOT-dihedral circuit is given
by a phase polynomial $f : \mathbb{F}_2^n \to \mathbb{R}$ and affine transformation $A_b \in \mathrm{GA}(n, \mathbb{F}_2)$ such that




Since we are strictly working with CNOT-dihedral circuits of order $2m$ – that is, circuits
over {CNOT, X, $Z_m$} for some $m \in \mathbb{N}$ – we factor $\frac{2\pi i}{m}$ out of the phase polynomial so that





As $e^{\frac{2\pi i}{m} f(x)} = e^{\frac{2\pi i}{m} f'(x)}$ whenever $f'(x) = f(x) \bmod m$, we can reduce the Fourier coefficients
mod $m$, giving $\hat{f}(y) \in \mathbb{Z}_m$. We write these reduced coefficients as a tuple $a \in \mathbb{Z}_m^{2^n}$. We call
this tuple an implementation of the phase function $f$, and denote the phase polynomial
defined by a tuple $a \in \mathbb{Z}_m^{2^n}$ by $f_a$, i.e.,




It was noted in Chapter 3 that the Fourier coefficients of a sum-over-paths action
directly correspond to the phase gates in a circuit. Given a CNOT-dihedral circuit $C$,
the phase function $f_a$ is obtained by summing up the phase contribution of $\frac{2\pi k}{m}\chi_y(x)$ or
$\frac{2\pi k}{m}(1 \oplus \chi_y(x)) = \frac{2\pi k}{m} + \frac{-2\pi k}{m}\chi_y(x)$ of each $(Z_m)^k$ gate, applied to a qubit in the state $|\chi_y(x)\rangle$
or $|1 \oplus \chi_y(x)\rangle$, respectively. In the context of $Z_m$ gates, we see that the number of $Z_m$
gates – the number of $(Z_m)^k$ gates with odd $k$ – is at least the number of odd entries of $a$.
Example 5.2.1. Recall that the doubly-controlled Z gate CCZ can be written as the

















Summing up the contributions from each T or T† gate we see that
$$f(x_1, x_2, x_3) = x_1 + x_2 - (x_1 \oplus x_2) + x_3 - (x_1 \oplus x_3) - (x_2 \oplus x_3) + (x_1 \oplus x_2 \oplus x_3),$$
which we can write as a tuple over $\mathbb{Z}_8$ by reducing $-1 = 7 \bmod 8$, so $a = (0, 1, 1, 7, 1, 7, 7, 1)$.
Notably, $a$ has 7 odd entries, the same as the number of T gates in the circuit.
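The tuple in this example can be checked mechanically. The sketch below evaluates $f_a(x) = \sum_y a_y \chi_y(x) \bmod m$ for $a = (0, 1, 1, 7, 1, 7, 7, 1)$ and confirms that it equals $4x_1x_2x_3 \bmod 8$, the phase of a CCZ gate in units of $2\pi/8$. Indexing $y$ and $x$ as integers whose bit $i$ corresponds to $x_{i+1}$ is an assumption of this example, as are the function names.

```python
def parity(y, x):
    """chi_y(x): the parity of the bits of x selected by y."""
    return bin(x & y).count("1") % 2


def phase_poly(a, x, m):
    """Evaluate f_a(x) = sum_y a[y] * chi_y(x) modulo m."""
    return sum(coeff * parity(y, x) for y, coeff in enumerate(a)) % m


if __name__ == "__main__":
    a = [0, 1, 1, 7, 1, 7, 7, 1]
    values = [phase_poly(a, x, 8) for x in range(8)]
    odd_entries = sum(coeff % 2 for coeff in a)
    print(values, odd_entries)
```

Only the input $x = 7$ (all three bits set) receives the phase $4 \cdot 2\pi/8 = \pi$, as expected for CCZ.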
It can be observed that the converse is also true, in that given a set of coefficients
$a \in \mathbb{Z}_m^{2^n}$ with $t$ odd entries, a circuit implementing $|x\rangle \mapsto e^{\frac{2\pi i}{m} f_a(x)}|x\rangle$ with at most $t + 1$
$Z_m$ gates exists. In particular, for each $y \neq 0$ such that $a_y \neq 0$, the parity $\chi_y(x)$ is first
computed via a sequence of CNOT gates, then a $(Z_m)^{a_y}$ gate is applied to achieve a phase
rotation of $e^{\frac{2\pi i}{m} a_y \chi_y(x)}$, and $\chi_y(x)$ is uncomputed. Note in general that the transformation
$|x\rangle \mapsto |Ax + b\rangle$ is implementable over just CNOT and X gates via Gaussian elimination
(e.g., [PMH08]), hence the $Z_m$ cost of $|x\rangle \mapsto e^{\frac{2\pi i}{m} f_a(x)}|Ax + b\rangle$ is the same as that of $|x\rangle \mapsto e^{\frac{2\pi i}{m} f_a(x)}|x\rangle$.





Up to global phase, the number of $Z_m$ gates required to implement $a \in \mathbb{Z}_m^{2^n}$ with $t$ odd
entries excluding $a_0$ is exactly $t$.
Remark 5.2.2. Applying the global phase correction in this way uses two extra $Z_m$ gates
when $a_0$ is odd. However, in cases where there exists $z \neq 0$ such that $a_z$ is odd, these extra




$$e^{\frac{2\pi i}{m}(-a_z)(1 \oplus \chi_z(x))} = e^{\frac{2\pi i}{m}(-a_z + a_z \chi_z(x))},$$















where the global phase $a_0 - a_z = 0 \bmod 2$ can be implemented without $Z_m$ gates. As a
result, the only situation where global phase requires an extra $Z_m$ gate is when there are
no other odd phases, hence we consider global phase effectively free.
This link between the number of odd coefficients of $a \in \mathbb{Z}_m^{2^n}$ and the number of $Z_m$ gates
in a circuit applying the phase rotation $e^{\frac{2\pi i}{m} f_a(x)}$ is summed up in the proposition below.
Proposition 5.2.3. Given $a \in \mathbb{Z}_m^{2^n}$ and $A_b \in \mathrm{GA}(n, \mathbb{F}_2)$, there exists a circuit $C$ over
{CNOT, X, $Z_m$} with $t$ $Z_m$ gates such that
$$\llbracket C \rrbracket : |x\rangle \mapsto e^{\frac{2\pi i}{m} f_a(x)}|Ax + b\rangle$$
up to global phase if and only if there exists $a' \in \mathbb{Z}_m^{2^n}$ such that $f_a(x) = f_{a'}(x) \bmod m$ for
all $x \in \mathbb{F}_2^n$ and $a'$ has at most $t$ odd entries excluding $a'_0$. Moreover, if $a'$ exists, the circuit
is polynomial-time synthesizable in the number of non-zero entries of $a'$ and $n$.




$$e^{\frac{2\pi i}{m} f_a(x)}|Ax + b\rangle\langle x| = e^{\frac{2\pi i}{m} f_{a'}(x)}|Ax + b\rangle\langle x|$$
if and only if $f_a(x) = f_{a'}(x) \bmod m$ for all $x \in \mathbb{F}_2^n$. Then, as noted above, a circuit $C$
with phase function $f_a(x)$ up to global phase and $t$ $Z_m$ gates gives $a' \in \mathbb{Z}_m^{2^n}$ with at most $t$
odd entries excluding $a'_0$, and vice versa. Moreover, synthesis takes time polynomial in the
volume of the output circuit – itself polynomial in the number of non-zero entries of $a'$
and $n$ – as the synthesis of each phase factor is trivial and the final affine synthesis step is
polynomial in $n$.
For the rest of this chapter, we take equality as meaning equal up to global phase. Moreover,
we take $\mathbb{Z}_m^{2^n}$ as informally meaning the “punctured” group $\mathbb{Z}_m^{2^n}/\langle (1, 0, \ldots, 0) \rangle \simeq
\mathbb{Z}_m^{2^n - 1}$ – i.e. tuples $a \in \mathbb{Z}_m^{2^n}$ up to the value of $a_0$. We take this informal approach to avoid
a heavy notational burden, as found in, e.g., the original publication of this result [AM16]. Note also
that the “number of odd coefficients excluding $a_0$” in this case is taken to mean the number
of odd entries in $\mathbb{Z}_m^{2^n - 1}$.
5.2.1 Coding interpretation
By Proposition 5.2.3, the problem of optimally synthesizing a CNOT-dihedral circuit, given
by a set of Fourier coefficients $a \in \mathbb{Z}_m^{2^n}$, with respect to $Z_m$-count is equivalent to the problem of
finding some $a' \in \mathbb{Z}_m^{2^n}$ such that
$$f_a(x) = f_{a'}(x) \bmod m$$
for all $x \in \mathbb{F}_2^n$ minimizing the number of odd entries – indeed, they are polynomial-time
equivalent as reading the Fourier coefficients from a circuit takes polynomial time. We now
show that this corresponds to a minimization problem over a coset of $\mathbb{Z}_m^{2^n}$, which moreover
is equivalent to minimization of the Hamming weight over the binary residues of this coset.
Given an element $a$ of $\mathbb{Z}_m^{2^n}$, let $[a]$ be the equivalence class of implementations of $f_a$ –
that is, $a' \in [a]$, denoted $a \sim a'$, if and only if
$$f_a(x) = f_{a'}(x) \bmod m$$
for all $x$. Let
$$C_m^n = \{c \in \mathbb{Z}_m^{2^n} \mid f_c(x) = 0 \bmod m \text{ for all } x \in \mathbb{F}_2^n\}$$
be the set of 0-everywhere phase polynomials on $\mathbb{Z}_m^{2^n}$. By the fact that the mapping $a \mapsto f_a$
is a group homomorphism, in that
$$f_a(x) + f_{a'}(x) = f_{a+a'}(x)$$
for all $x$, it is easy to see that $C_m^n \trianglelefteq \mathbb{Z}_m^{2^n}$ is the kernel of this homomorphism, and moreover
that $[a] = a + C_m^n$.
Proposition 5.2.4. For any $a \in \mathbb{Z}_m^{2^n}$,
$$[a] = a + C_m^n.$$
Proof. Suppose $c \in C_m^n$. Then clearly
$$f_{a+c}(x) = f_a(x) + f_c(x) = f_a(x) \bmod m.$$
Likewise, if $a' \in [a]$, then
$$f_{a-a'}(x) = f_a(x) - f_{a'}(x) = 0 \bmod m,$$
hence $a - a' \in C_m^n$ and in particular $a' = a + c$ for some $c \in C_m^n$.
As a consequence of Proposition 5.2.4, we see that the problem of finding an implementation
of $f_a$ minimizing $Z_m$-count is equivalent to finding an element $c \in C_m^n$ minimizing
the number of odd entries in $a + c$. Combined with a presentation of $C_m^n$ by generators, as
given below, we can directly perform this optimization by generating elements of $C_m^n$.
Lemma 5.2.5. Let $m = 2^k l$ where $l$ is odd. Then
$$C_m^n = \langle 2^i l X^y \mid y \in \mathbb{F}_2^n,\ |y| - i \le n - k - 1 \rangle.$$




We defer the proof of Lemma 5.2.5 to Section 5.4.
Example 5.2.6. Consider the case where $n = 4$, $m = 8 = 2^3$, and so
$$C_8^4 = \langle 2^i X^y \mid y \in \mathbb{F}_2^n,\ |y| - i \le n - 4 \rangle.$$
If we consider just the generators which have odd coefficients, we see that there is exactly
one such generator – $X^0 = \mathbf{1}$, the all 1 string. Now we can verify that $f_{\mathbf{1}}(x) = 0 \bmod 8$




χy(x) = 0 mod 8,
which is exactly the example given at the beginning of this chapter.
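The identity in this example is easy to confirm by brute force. The sketch below checks that, for $n = 4$, the sum of all parities $\chi_y(x)$ over $y \in \mathbb{F}_2^4$ vanishes modulo 8 for every input $x$, which is what membership of the all-ones tuple in $C_8^4$ means. The integer encoding of $x$ and $y$ and the function names are assumptions of this illustration.

```python
def parity(y, x):
    """chi_y(x): the parity of the bits of x selected by y."""
    return bin(x & y).count("1") % 2


def all_ones_phase(x, n):
    """f_1(x) = sum over all y in F_2^n of chi_y(x), as an integer."""
    return sum(parity(y, x) for y in range(2 ** n))


if __name__ == "__main__":
    # For x != 0 exactly half of the 16 parities are 1, giving 8 = 0 mod 8.
    print(all(all_ones_phase(x, 4) % 8 == 0 for x in range(16)))
```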
Remark 5.2.7. As a consequence of Lemma 5.2.5, we also provide a characterization of
the set of n-qubit unitaries implementable over {CNOT, X, Zm}. In particular, it can be
observed that





5.2.2 Connection to Reed-Muller codes
With the above set of generators given in Lemma 5.2.5, the optimization can be performed
directly over $C_m^n$. However, the particular metric of $Z_m$-count optimality makes such
optimization unnatural; as the number of odd entries in an element of $\mathbb{Z}_m^{2^n}$ doesn’t form
a norm, there are no natural reductions to the obvious problems of shortest vectors in a
lattice or minimum distance decoding in more general linear codes. We instead reduce the
optimization problem to a decoding problem over a binary code, where minimum-distance
decoding corresponds exactly to $Z_m$-count optimization.
Defining $\mathrm{Res}_2 : \mathbb{Z} \to \mathbb{F}_2$ as the function taking the binary residue of an integer and
extending this component-wise to tuples, we see that the number of odd entries in $a \in \mathbb{Z}_m^{2^n}$
is equal to the weight of the binary residue vector, i.e. $|\mathrm{Res}_2(a)|$. We can further see that
$$|\mathrm{Res}_2(a + c)| = |\mathrm{Res}_2(a) \oplus \mathrm{Res}_2(c)| = \delta(\mathrm{Res}_2(a), \mathrm{Res}_2(c)),$$
that is, the number of odd entries in $a + c$ – and hence the $Z_m$-cost of implementing $f_{a+c}$ –
is the Hamming distance from $\mathrm{Res}_2(a)$ to $\mathrm{Res}_2(c)$. As a result, optimizing the number of
odd entries in $a + c$ over all $c \in C_m^n$ is exactly the problem of minimum distance decoding
$\mathrm{Res}_2(a)$ over $\mathrm{Res}_2(C_m^n)$, the set of binary residue vectors of $C_m^n$. We further note that
$\mathrm{Res}_2(C_m^n)$ is in fact a binary linear code, since
$$\mathrm{Res}_2(c) \oplus \mathrm{Res}_2(c') = \mathrm{Res}_2(c + c') \in \mathrm{Res}_2(C_m^n)$$
for any $c, c' \in C_m^n$, and as a direct consequence of Lemma 5.2.5 this code is exactly the
length $2^n - 1$ punctured Reed-Muller code of order $n - k - 1$.
Theorem 5.2.8. Let $m = 2^k l$ where $l$ is odd. Then
$$\mathrm{Res}_2(C_m^n) = RM(n - k - 1, n)^*$$







$\mid y \in \mathbb{F}_2^n,\ |y| - i \le n - k - 1 \rangle$
$= \langle X^y \mid y \in \mathbb{F}_2^n,\ |y| \le n - k - 1 \rangle$
$= RM(n - k - 1, n)^*$
Note that $\mathrm{Res}_2(C_m^n)$ is the punctured Reed-Muller code as we are really working “up to




Before proving Lemma 5.2.5 in the next section, we discuss some consequences of
Theorem 5.2.8.
5.3.1 Upper bounds on phase gate counts
As a consequence of Proposition 5.2.3 and Theorem 5.2.8, the covering radius of $\mathrm{Res}_2(C_{2^k l}^n) =
RM(n - k - 1, n)^*$ gives a tight upper bound on the number of $Z_m$ gates required to
implement a circuit over {CNOT, X, $Z_m$}. Here we mean tight in the sense that there
exists a unitary operator which requires a minimum of $\rho(RM(n - k - 1, n)^*)$ $Z_m$ gates to
implement over {CNOT, X, $Z_m$}. In particular, if $\rho(RM(n - k - 1, n)^*) = t$, then there








While no analytic formula is known for the covering radius of higher-order Reed-Muller
codes, some asymptotic upper bounds are known, which consequently give upper bounds
on the number of $Z_m$ gates required. In particular, Cohen and Litsyn [CL92]
showed that for large $n$ and orders $r$ where $n - r \ge 3$,
$$\rho(RM(r, n)) \le \frac{n^{n-r-2}}{(n - r - 2)!}.$$
Since the covering radius of $RM(r, n)^*$ is trivially bounded above by $\rho(RM(r, n))$,
it follows that for sufficiently large $n$ and $k \ge 2$, $\rho(RM(n - k - 1, n)^*) \le \frac{n^{k-1}}{(k-1)!}$. In
particular, for the important case of {CNOT, X, T}, $k = 3$ and hence we obtain an $O(n^2)$
bound on the number of T gates needed.
Theorem 5.3.1. Any order 16 CNOT-dihedral operator can be implemented with at most
O(n2) T gates over {CNOT, X, T}.
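The substitution behind this bound can be written out explicitly. The following is only a worked restatement of the step above, using the Cohen and Litsyn bound as quoted and the order $r = n - k - 1$; the hypothesis $n - r \ge 3$ becomes $k \ge 2$.

```latex
\begin{align*}
r = n - k - 1 \;&\Longrightarrow\; n - r - 2 = k - 1, \\
\rho\big(RM(n-k-1, n)^*\big) \;\le\; \rho\big(RM(n-k-1, n)\big) \;&\le\; \frac{n^{k-1}}{(k-1)!}, \\
k = 3 \;&\Longrightarrow\; \rho\big(RM(n-4, n)^*\big) \;\le\; \frac{n^{2}}{2} = O(n^2).
\end{align*}
```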
5.3.2 Optimization algorithm
Theorem 5.2.8 implies that some element $c \in C_{2^k l}^n$ minimizing the number of odd coefficients
in $a + c$ can be found by decoding $\mathrm{Res}_2(a)$ in $RM(n - k - 1, n)^*$ – however, the decoding
itself isn’t enough to find $c$. In particular, decoding the binary residue $\mathrm{Res}_2(a)$ produces
a minimal residue $w \in \mathbb{F}_2^{2^n}$ of a codeword $c$ in $C_{2^k l}^n$. To actually synthesize a $Z_m$-count
minimal circuit, we need to compute $c$ with $\mathrm{Res}_2(c) = w$ and then synthesize the phase
function $f_{a+c}$. Fortunately, there is an easy way to do this, given the following lemma.
Lemma 5.3.2. For all $y \in \mathbb{F}_2^n$ with $|y| \le n - k - 1$ we have $lX^y \in C_{2^k l}^n$.
Proof. Consequence of Lemma 5.2.5. In particular, $2^i l X^y \in C_{2^k l}^n$ for any $|y| - i \le n - k - 1$,
and hence substituting $i = 0$ we have $lX^y \in C_{2^k l}^n$ whenever $|y| \le n - k - 1$.
Using Lemma 5.3.2 and the definition of $RM(r, n)^*$, we can write a decoded word $w$
as a (binary) sum of monomials of degree at most $n - k - 1$, then reinterpret this sum
(scaled by $l$) over $\mathbb{Z}_m$. Specifically, if $w = \bigoplus_{y \in \mathbb{F}_2^n} a_y X^y$ for some $\{a_y\} \subseteq \mathbb{F}_2$, then we choose









Using this fact we develop an algorithm for the optimization of Zm-count based on Reed-
Muller decoding.
Algorithm 5.1 rm-optimize($C$, $m = 2^k l$)
1: Compute sum-over-paths action $(f_a, A_b)$ of $C$
2: $w \leftarrow$ rm-decode($n - k - 1$, $n$, $\mathrm{Res}_2(a)$)
3: Write $w$ as a Boolean polynomial $w = \bigoplus_{y \in \mathbb{F}_2^n} a_y X^y$
4: $c \leftarrow \sum_{y \in \mathbb{F}_2^n} a_y l X^y$
5: return cnot-dihedral-synth($f_{a+c}$, $A_b$)
Algorithm 5.1 summarizes our algorithm for $Z_m$-count optimization in {CNOT, X, $Z_m$}
circuits, where $m = 2^k l$ for odd $l$. The algorithm works by computing the sum-over-paths
action of the circuit $C$ as $(f_a, A_b)$, where $a$ gives the coefficients of the phase polynomial.
The vector of residues modulo 2 is then decoded as $w$ in $RM(n - k - 1, n)^*$ using the
procedure rm-decode($n - k - 1$, $n$, $\mathrm{Res}_2(a)$). A vector $c \in C_{2^k l}^n$ with $\mathrm{Res}_2(c) = w$ is then
computed and added to the original set of coefficients, and a circuit is synthesized with
the new set of coefficients. In particular, the procedure cnot-dihedral-synth takes a






The algorithm is parametric in both the decoder and the synthesis procedure, meaning
any variable order Reed-Muller decoder may be used to implement rm-decode. If a
minimum distance decoder is used, Algorithm 5.1 synthesizes a minimal Zm-count circuit.
Likewise, any synthesis procedure may be used to implement cnot-dihedral-synth –
for instance, the T -depth minimizing T -par algorithm [AMM14], or the CNOT-minimizing
synthesis algorithm we describe in the next chapter.
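Steps 2 to 4 of Algorithm 5.1 can be sketched for small instances as follows. This is not the T-par implementation: the decoder here is a hypothetical exhaustive search over $RM(r, n)$ (exponential, and only usable for tiny parameters), the circuit-level steps 1 and 5 are omitted, and all function names are assumptions. Coordinates are indexed by integers whose bit $i$ corresponds to $x_{i+1}$, and coordinate 0 is ignored when measuring distance since it only contributes a global phase.

```python
from itertools import product


def monomial_vector(y, n):
    """Evaluation vector of X^y: entry x is 1 iff every bit of y is set in x."""
    return [1 if (x & y) == y else 0 for x in range(2 ** n)]


def rm_decode_bruteforce(r, n, word):
    """Stand-in for rm-decode: return the monomial coefficients of a nearest
    codeword of RM(r, n), with the first coordinate excluded from the distance."""
    monomials = [y for y in range(2 ** n) if bin(y).count("1") <= r]
    best_dist, best_coeffs = None, None
    for coeffs in product((0, 1), repeat=len(monomials)):
        codeword = [0] * 2 ** n
        for bit, y in zip(coeffs, monomials):
            if bit:
                codeword = [u ^ v for u, v in zip(codeword, monomial_vector(y, n))]
        dist = sum(u != v for u, v in zip(word[1:], codeword[1:]))
        if best_dist is None or dist < best_dist:
            best_dist, best_coeffs = dist, dict(zip(monomials, coeffs))
    return best_coeffs


def rm_optimize_coefficients(a, n, k, l):
    """Return a + c mod m, where c = sum_y a_y * l * X^y lifts the decoded word."""
    m = (2 ** k) * l
    coeffs = rm_decode_bruteforce(n - k - 1, n, [v % 2 for v in a])
    c = [0] * 2 ** n
    for y, bit in coeffs.items():
        if bit:
            c = [ci + l * v for ci, v in zip(c, monomial_vector(y, n))]
    return [(ai + ci) % m for ai, ci in zip(a, c)]


if __name__ == "__main__":
    # Coefficients of Example 5.3.3 (n = 4, m = 8, so k = 3 and l = 1).
    a = [0, 2, 6, 6, 1, 7, 7, 1, 3, 7, 7, 1, 0, 0, 0, 0]
    optimized = rm_optimize_coefficients(a, 4, 3, 1)
    print(optimized[1:], sum(v % 2 for v in optimized[1:]))
```

Any real use would replace `rm_decode_bruteforce` with a proper Reed-Muller decoder and pass the resulting coefficients to a CNOT-dihedral synthesis routine.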
Example 5.3.3. Consider the circuit below:
x1 • • x1
x2 • Z • x2
x3 • x3
x4 • S x4
=
x1 T • T † • • T • T † • • x1
x2 T • T † T † • Z T • T † T † • x2
x3 T • T • x3
x4 T • T • S x4
By annotating the circuit and computing the phase contributions, we find the phase
polynomial (reduced mod 8) for the above circuit is
f(x) = 2x1 + 6x2 + 6(x1 ⊕ x2) + x3 + 7(x1 ⊕ x3) + 7(x2 ⊕ x3) + (x1 ⊕ x2 ⊕ x3)
+ 3x4 + 7(x1 ⊕ x4) + 7(x2 ⊕ x4) + (x1 ⊕ x2 ⊕ x4).
Writing the Fourier coefficients as a tuple $a \in \mathbb{Z}_8^{2^n}$ we get
a = (0, 2, 6, 6, 1, 7, 7, 1, 3, 7, 7, 1, 0, 0, 0, 0),
which has 8 odd entries – hence can be implemented directly with 8 T gates, giving a
reduction of 6 T gates and corresponding to phase-folding as in Chapter 4.
Next we optimize the implementation of f further by decoding
Res2(a) = (0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0)
in the code $RM(0, 4)^*$. As $RM(0, 4)^*$ is the set of evaluation vectors for degree 0 Boolean
polynomials, there are exactly two vectors to choose from, corresponding to the constant 0
and 1 functions. Since the all 1 vector achieves the minimum distance of 7 (excluding the
first coordinate) from $\mathrm{Res}_2(a)$, we choose $w$ to be the all 1 vector – i.e. $w = \mathbf{1}$. Writing
$c = \mathbf{1}$, by Lemma 5.3.2 we have $c \in C_8^4$ with $\mathrm{Res}_2(c) = w$. Finally we synthesize a circuit
for the tuple $a + c = (0, 3, 7, 7, 2, 0, 0, 2, 4, 0, 0, 2, 1, 1, 1, 1)$, corresponding to the phase
polynomial
fa+c(x) = 3x1 + 7x2 + 7(x1 ⊕ x2) + 2x3 + 2(x1 ⊕ x2 ⊕ x3) + 4x4 + 2(x1 ⊕ x2 ⊕ x4)
+ (x3 ⊕ x4) + (x1 ⊕ x3 ⊕ x4) + (x2 ⊕ x3 ⊕ x4) + (x1 ⊕ x2 ⊕ x3 ⊕ x4).
The resulting circuit, optimized for T -depth, is shown below. Note that many different
circuits with the same T -count are possible, for instance targeting CNOT-minimization
instead of T -depth.
[Circuit diagram: the re-synthesized four-qubit circuit on x1, ..., x4 over CNOT, T, T†, S and Z gates, which contains 7 T and T† gates.]
The circuit decoding reduces the T -count from 14 (or 8, as just phase-folding would
achieve) to 7. Moreover, the number of T gates is equal to the distance from Res2(a) to
the decoded word w.
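The arithmetic of Example 5.3.3 can be checked mechanically. The following Python sketch is an illustration only and not part of the implementation discussed in this thesis. It assumes that the index y of the coefficient tuple is read as a bit mask whose i-th least significant bit selects x_{i+1}, which matches the order of the terms of f above, and that the lift c of w leaves the global-phase coordinate at 0. All function names are our own.

```python
# Coefficients a of Example 5.3.3, indexed by y = 0..15.
# Assumption: bit i (least significant first) of the index selects x_{i+1}.
A = (0, 2, 6, 6, 1, 7, 7, 1, 3, 7, 7, 1, 0, 0, 0, 0)
# Lift of the all-ones word w to Z_8; coordinate 0 is a global phase, left at 0.
C = (0,) + (1,) * 15


def parity(y, x):
    """chi_y(x): parity of the bits of x selected by the mask y."""
    return bin(y & x).count("1") % 2


def phase(coeffs, x, m=8):
    """Evaluate the phase polynomial with the given coefficients at x, mod m."""
    return sum(c * parity(y, x) for y, c in enumerate(coeffs)) % m


def t_count(coeffs):
    """Number of odd coefficients, ignoring the global-phase coordinate."""
    return sum(c % 2 for c in coeffs[1:])


def distance_to_all_ones(coeffs):
    """Hamming distance from Res_2(coeffs) to the all-ones word, first coordinate excluded."""
    return sum(1 for c in coeffs[1:] if c % 2 == 0)


A_PLUS_C = tuple((a + c) % 8 for a, c in zip(A, C))

if __name__ == "__main__":
    print(A_PLUS_C)
    print(t_count(A), t_count(A_PLUS_C), distance_to_all_ones(A))
```

Evaluating `phase` on all 16 inputs confirms that a and a + c define the same function mod 8, so the two tuples differ only by an element of the zero-everywhere code.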
Remark 5.3.4. It is interesting to note that the minimal T -depth of the decoded phase
polynomial fa+c without additional ancillas is 3, while the minimal ancilla-free T -depth of
fa is 2, even though the number of T gates is reduced. Hence reducing the number of Zm
gates may result in an increase in other cost functions of the circuit, even in cases where
the increase can seem counterintuitive. We will return to this idea in the next chapter when
we discuss CNOT-optimal synthesis.
5.3.3 Complexity of phase gate optimization
It was claimed in the beginning of this chapter that the problems of phase gate optimization
and minimum-distance Reed-Muller decoding were polynomial-time equivalent. We now
prove this fact, and discuss some implications of the equivalence. Note that the inputs
to either problem are in general length $2^n$ vectors (the $2^n$ Fourier coefficients for optimal
synthesis and the length $2^n$ binary vector of Reed-Muller decoding), hence our reductions are
polynomial in $2^n$ rather than $n$.
Theorem (5.1.5). Let $m = 2^k l$ where $l$ is odd. Then the optimal synthesis problem for
$\langle \mathrm{CNOT}, X, Z_m \rangle$ with respect to $Z_m$-count, and the minimum-distance decoding problem for
$RM(n-k-1, n)^*$ are polynomial-time equivalent – that is, there exist polynomial-time
reductions from each to the other.
Proof. For the forward reduction, observe that Algorithm 5.1 finds a $Z_m$-count optimal
circuit in time polynomial in $2^n$ using an oracle for minimum-distance decoding in
$RM(n-k-1, n)^*$. In particular, the input $\mathrm{Res}_2(a)$ to the Reed-Muller decoder has size polynomial
in $2^n$ and can be computed in polynomial time. Likewise, given $w \in RM(n-k-1, n)^*$,
$w$ can be written as a degree at most $n-k-1$ polynomial in polynomial time (in $2^n$) via
Gaussian elimination. Since $w$ has degree at most $n-k-1$, it is a sum of polynomially
many terms, and hence $c$ can be computed in polynomial time. Finally, CNOT-dihedral
synthesis can be completed in time polynomial in $2^n$.
For the opposite reduction, given a vector $w \in \mathbb{F}_2^{2^n}$ to be decoded, synthesize a circuit C





The Fourier coefficients $a$ of C can then be computed in time polynomial in $n \cdot |C|$, which
is polynomial in $2^n$ since any $Z_m$-gate minimal CNOT-dihedral circuit can be implemented
in at most $O(n 2^{n-1})$ gates by simply enumerating each parity $\chi_y(x)$. As a consequence of
Theorem 5.2.8, $\mathrm{Res}_2(a) + w \in RM(n-k-1, n)^*$ is a minimum-distance decoding of $w$,
and moreover can be computed in time polynomial in $2^n$ as required.
This equivalence lends evidence to the difficulty of $Z_{2^k l}$-count optimization when $k \geq 3$,
even in the restricted setting we consider here of CNOT-dihedral circuits. In particular, as a
function of $n$, any sub-exponential algorithm for exact optimization of T-count over n-qubit
{CNOT, T} circuits induces a linear-time (in $2^n$) minimum-distance decoding algorithm
for $RM(n-4, n)^*$. By contrast, it appears very unlikely that even a polynomial-time
algorithm for minimum-distance decoding the order $n-4$ punctured Reed-Muller code
exists; no minimum-distance decoding algorithms in time polynomial in $2^n$ are currently
known for arbitrary order length $2^n$ binary Reed-Muller codes. While some particular
orders of Reed-Muller codes have efficient decoders, e.g., order 1, it was shown by Seroussi
and Lempel [SL83] that minimum-distance decoding for $RM(n-4, n)^*$ is equivalent to the
problem of finding a minimal decomposition of a symmetric 3-tensor into symmetric tryads
(rank 1 3-tensors), a problem that is widely believed to be computationally hard [SL83].
On the other hand, Theorem 5.1.5 also shows that for $k < 3$, the problem is efficiently
solvable. In particular, Seroussi and Lempel gave an efficient algorithm for maximum
likelihood decoding of $RM(n-3, n)^*$ [SL83], which covers $k = 2$, and moreover efficient decoders exist for
$k < 2$, corresponding to the Hamming code $RM(n-2, n)^*$ when $k = 1$ and the trivial code when $k = 0$.
Hence we have the following corollary.
Corollary 5.3.5. The optimal synthesis problem for 〈CNOT, X, Zm〉 with respect to Zm-
count is efficiently solvable whenever k < 3 is the highest power of 2 dividing m.
5.4 Generators of $C^n_m$
We now prove Lemma 5.2.5. Our proof proceeds by considering the two cases – where the
phase polynomial is 0 mod $2^k$ and 0 mod $l$ – separately, as per the proposition below. In
particular, we show that only the trivial phase polynomials are zero mod $l$ for any odd $l$,
while the phase polynomials which are 0 mod $2^k$ correspond to polynomials of order at
most $n-k-1$.
Proposition 5.4.1. Let $m = 2^k l$ for some $l$ coprime with 2. Then $C^n_m = C^n_{2^k} \cap C^n_l$.
That the above proposition gives a set of generators of $C^n_m$ given generators of $C^n_{2^k}$ and
$C^n_l$ is a trivial consequence of the particular presentations we derive below.
5.4.1 Rotations of odd order
We begin by proving that for odd order phase rotations, the zero-everywhere code is the
trivial code. In particular, there are no non-trivial phase polynomials such that fa(x) = 0
mod l for all x ∈ Fn2 when l is odd.
Lemma 5.4.2. Let l be an odd integer. Then
Cnl = 〈0〉 .
That is, $C^n_l$ is the trivial subgroup of $\mathbb{Z}_l^{2^n}$.
To prove Lemma 5.4.2, we recall the multilinear representation of a phase polynomial.





The fact, originally informally noted in Chapter 3, that this representation is unique is
a consequence of the following lemma, namely that the evaluation vectors of monomials
form a basis for the module $\mathbb{Z}_l^{2^n}$.
Lemma 5.4.3. The set² $\{X^y \in \mathbb{F}_2^{2^n} \subset \mathbb{Z}_k^{2^n} \mid y \in \mathbb{F}_2^n\}$ is a basis for $\mathbb{Z}_k^{2^n}$ for any $k$.
Proof. Informally, given that the monomial evaluation vectors contain a pivot for every
entry of $\mathbb{F}_2^{2^n}$ (see, e.g., Figure 5.2), any element of $\mathbb{Z}_k^{2^n}$ can be written as a sum of monomial
evaluation vectors via Gaussian elimination.
More formally, the set $\{X^y \mid y \in \mathbb{F}_2^n\}$ has cardinality $2^n$ and moreover is linearly
independent. Since the (module) rank of $\mathbb{Z}_k^{2^n}$ is $2^n$, it is a basis of $\mathbb{Z}_k^{2^n}$.
The fact that uniqueness follows from Lemma 5.4.3 is a trivial consequence of the fact
that any such function can be uniquely represented by its vector of evaluations or truth
table – i.e. f ↦ f is an isomorphism. The uniqueness of the multilinear representation
moreover establishes that the trivial polynomial is the only multilinear representation of
the zero-everywhere function f(x) = 0.
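The pivot argument behind Lemma 5.4.3 can be made concrete for small n. The sketch below is only an illustration under our own conventions: vectors y and z are encoded as integer bit masks, and rows and columns are ordered by integer value, which need not match the ordering used in Figure 5.2. With this ordering the matrix of monomial evaluation vectors has ones on the diagonal and zeros below it, so it is invertible over any ring of integers mod k.

```python
def monomial_vector(y, n):
    """Evaluation vector of X^y: entry z is 1 exactly when y is a subset of z."""
    return [1 if (z & y) == y else 0 for z in range(2 ** n)]


def monomial_matrix(n):
    """Rows are the evaluation vectors X^y for y = 0, ..., 2^n - 1."""
    return [monomial_vector(y, n) for y in range(2 ** n)]


def is_unitriangular(rows):
    """True if the square matrix has ones on the diagonal and zeros below it."""
    size = len(rows)
    diagonal_ok = all(rows[i][i] == 1 for i in range(size))
    lower_ok = all(rows[i][j] == 0 for i in range(size) for j in range(i))
    return diagonal_ok and lower_ok


if __name__ == "__main__":
    for n in range(1, 5):
        print(n, is_unitriangular(monomial_matrix(n)))
```

The triangular shape follows because y being a subset of z forces y ≤ z as integers, so every entry with z < y vanishes.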
The final piece needed to establish Lemma 5.4.2 is that a multilinear polynomial over
Zl for odd l can be rewritten as a Fourier expansion. For this fact, recall that
2xy = x+ y − (x⊕ y)
for any x, y ∈ F2. Since 2 is coprime with l it has a multiplicative inverse 2−1 in Zl, hence
we can rewrite this identity as
xy = 2−1x+ 2−1y − 2−1(x⊕ y) mod l.
2Since we really work “up to global phase”, to be precise the set of monomials excluding the constant




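The equation above can be used to recursively rewrite a given monomial $x^y = x_{i_1} x_{i_2} \cdots x_{i_{|y|}}$
over the parity basis – i.e. as a Fourier expansion:
$$(x_{i_1} x_{i_2}) x_{i_3} \cdots x_{i_{|y|}} = \left(2^{-1} x_{i_1} + 2^{-1} x_{i_2} - 2^{-1}(x_{i_1} \oplus x_{i_2})\right) x_{i_3} \cdots x_{i_{|y|}} \mod l$$
$$= 2^{-1} x_{i_1} x_{i_3} \cdots x_{i_{|y|}} + 2^{-1} x_{i_2} x_{i_3} \cdots x_{i_{|y|}} - 2^{-1}(x_{i_1} \oplus x_{i_2}) x_{i_3} \cdots x_{i_{|y|}} \mod l,$$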
where each term in the second line has degree |y| − 1, hence the recursion is terminating.
It follows that this representation is also unique, as the multilinear representation is unique
and both representations have 2n degrees of freedom.
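The recursion above is short enough to run directly. The following sketch is an illustration under our own conventions rather than code from the thesis: a parity is encoded as a bit mask, a product of parities is a list of masks, and the result is a dictionary from masks to coefficients mod l. It relies on `pow(2, -1, l)` for the modular inverse, which requires Python 3.8 or later.

```python
def parity(mask, x):
    """chi_mask(x): parity of the bits of x selected by mask."""
    return bin(mask & x).count("1") % 2


def product_to_fourier(masks, l):
    """Fourier expansion mod odd l of the product of the parities in masks.

    Returns {mask: coefficient mod l}. A monomial x_1 x_2 x_3 is the product
    of the single-variable parities with masks 1, 2 and 4.
    """
    masks = list(masks)
    if not masks:
        raise ValueError("the constant monomial is not a sum of parities")
    if any(p == 0 for p in masks):
        return {}  # chi_0 is identically zero, so the product vanishes
    if len(masks) == 1:
        return {masks[0]: 1}
    inv2 = pow(2, -1, l)  # exists because l is odd
    p, q, rest = masks[0], masks[1], masks[2:]
    result = {}
    # chi_p * chi_q = 2^-1 chi_p + 2^-1 chi_q - 2^-1 chi_{p xor q}  (mod l)
    for sign, head in ((1, p), (1, q), (-1, p ^ q)):
        for mask, coeff in product_to_fourier([head] + rest, l).items():
            result[mask] = (result.get(mask, 0) + sign * inv2 * coeff) % l
    return {mask: c for mask, c in result.items() if c != 0}


def evaluate(expansion, x, l):
    """Evaluate a Fourier expansion at x, mod l."""
    return sum(c * parity(mask, x) for mask, c in expansion.items()) % l


if __name__ == "__main__":
    print(product_to_fourier([1, 2], 3))
```

Each recursive call shortens the product by one factor, mirroring the termination argument in the text.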
Proposition 5.4.4. Let $l$ be an odd integer. Then for any function $f : \mathbb{F}_2^n \to \mathbb{Z}_l$, there is a
unique $a \in \mathbb{Z}_l^{2^n}$ such that for all $x \in \mathbb{F}_2^n$,
$$f(x) = f_a(x).$$
Proof. As per the above, every such function has a unique multilinear representation, and
hence a unique Fourier expansion, over Zl. Taking the tuple a as the coefficients of the
Fourier expansion of f we have f(x) = fa(x) for all x ∈ Fn2 .
We can now prove Lemma 5.4.2.
Proof of Lemma 5.4.2. Recall that $C^n_l \subseteq \mathbb{Z}_l^{2^n}$ is the set of tuples $c$ such that $f_c(x) = 0 \mod l$
for all $x \in \mathbb{F}_2^n$. Since $f_0(x) = 0 \mod l$ for all $x$, by Proposition 5.4.4 this is the unique
Fourier expansion of the zero-everywhere function (mod $l$), hence
$$C^n_l = \{0\} = \langle 0 \rangle.$$
Remark 5.4.5. From the perspective of phase gate optimization, Lemma 5.4.2 asserts that
for any Zm gate where m is odd, phase-folding achieves the optimal Zm-count for CNOT-
dihedral circuits of order 2m. In particular, phase-folding can be seen to give an optimal
circuit implementing a particular decomposition of the phase polynomial; for odd m, there
are no equivalent phase polynomials except up to equivalences of the phase gate angles
mod m, so the circuit obtained is exactly optimal.
5.4.2 Rotations of order 2k
We now turn our attention to the more interesting case of even-power orders. In particular,
we prove the following lemma:
Lemma 5.4.6. Let $k$ be some integer. Then
$$C^n_{2^k} = \left\langle 2^i X^y \mid y \in \mathbb{F}_2^n,\ |y| - i \leq n-k-1 \right\rangle.$$
As a roadmap, we first show that the tuple $a \in \mathbb{Z}_m^{2^n}$ can itself be written as the
evaluations of a polynomial in $n$ variables. We then derive a formula for the value of
$f_a(x)$ as a function of the polynomial form of $a$, which we then use to show that every
zero-everywhere polynomial must arise as the evaluations of an order $n-k-1$ polynomial,
where the order is (roughly) the maximum difference between the degree of a monomial
and the highest power of 2 dividing its coefficient.
The monomial basis
Our proof relies on a connection between the binary evaluations of polynomials over $\mathbb{Z}_{2^k}$
and the $\mathbb{Z}$-module $\mathbb{Z}_{2^k}^{2^n}$. In particular, consider the set of degree at most $n-1$ monomial
(Boolean) evaluation vectors
$$\{X^y \mid y \in \mathbb{F}_2^n,\ |y| < n\}.$$
This set of vectors, under the natural inclusion of $\mathbb{F}_2$ in $\mathbb{Z}_{2^k}$, generates $\mathbb{Z}_{2^k}^{2^n} / \langle (1, 0, \ldots, 0) \rangle$.
Hence every tuple of Fourier coefficients $a \in \mathbb{Z}_{2^k}^{2^n}$ can be written up to global phase as the
evaluation vector of a degree less than $n$ polynomial function in $\mathbb{Z}_{2^k}[X]$.
Lemma 5.4.7.
$$\mathbb{Z}_{2^k}^{2^n} / \langle (1, 0, \ldots, 0) \rangle = \langle X^y \mid y \in \mathbb{F}_2^n,\ |y| < n \rangle.$$
Proof. Recall that by Lemma 5.4.3, the set of all monomials $\{X^y \mid y \in \mathbb{F}_2^n\}$ is a basis for
$\mathbb{Z}_{2^k}^{2^n}$. It therefore suffices to prove that $X_1 X_2 \cdots X_n$ is in the span of $\{X^y \mid y \in \mathbb{F}_2^n,\ |y| < n\}$,
up to the first coordinate.
It may be observed that over $\mathbb{F}_2$, the set of all monomials up to constants is linearly
dependent up to the first coordinate:
$$\bigoplus_{y \in \mathbb{F}_2^n} X^y = (1, 0, \ldots, 0)$$
Since for every non-constant monomial Xy, exactly half of the valuations to X = X1, . . . , Xn







 = (1, 0, . . . , 0)
and so ∑y∈Fn2 Xy = a for some a ∈ Z2nm such that Res2(a) = (1, 0, . . . , 0). Since a can be
written up to the first coordinate over {Xy | y ∈ Fn2 , |y| > 0} we see∑
y∈Fn2












Now suppose α1 = 1 mod 2. Then












which is a contradiction since the non-constant monomials are linearly independent over F2.
Thus α1 = 0 mod 2 and in particular (1 + α1)−1 ∈ Z2k , hence we can write
















Lemma 5.4.7 tells us that any element a of Z2n2k/ 〈(1, 0, . . . , 0)〉 is the vector of evaluations











The next step in our proof is to give an analytic formula for the value of a phase function
fa applied to a vector x ∈ Fn2 as a function of the degree of the polynomial form of a.
Specifically, we show that fa(x) is equal to a linear combination of the Hamming weights
of certain Boolean polynomials arising from the multiplication of a monomial with a linear
polynomial.
Consider the value fa at x ∈ Fn2 :




We can rewrite the above equation as an inner product on Zm, since the value χy(x) is the
yth component of the evaluation vector χx(X) = x1X1 ⊕ · · · ⊕ xnXn.



































Computing the sum of the evaluation vectors multiplied by some x = x1x2x3 we see that









x1 ⊕ x2 ⊕ x3

.
Now it can be observed that the $i$th entry of $x_1 X_1 \oplus x_2 X_2 \oplus x_3 X_3$ is exactly $\chi_i(x)$ – for
instance,
$$\chi_{101}(x) = x_1 \oplus x_3 = (x_1 X_1 \oplus x_2 X_2 \oplus x_3 X_3)_{101}.$$
Formally, we define $\langle a, b \rangle$ for $a, b \in \mathbb{Z}_m^{2^n}$ as $\sum_{i=1}^{2^n} a_i b_i$. Note that
$$\langle a + b, c \rangle = \langle a, c \rangle + \langle b, c \rangle$$
for any $a, b, c \in \mathbb{Z}_m^{2^n}$ since the inner product is linear in either argument. Using this
observation, we give an explicit formula for $f_a(x)$ up to global phase as a function of the
basis vectors appearing in $a$:




αy|Xy(x1X1 ⊕ · · · ⊕ xnXn)|.








ay(x1X1 ⊕ x2X2 ⊕ x2X3)y








αy|Xy(x1X1 ⊕ · · · ⊕ xnXn)|.
As the Hamming weight of the evaluation vector of a Boolean function is simply the
number of satisfying solutions to that function, the value of |Xy(x1X1 ⊕ · · · ⊕ xnXn)| in
Lemma 5.4.9 above may be restated as the number of solutions to the equation Xy(x1X1⊕
· · ·⊕xnXn) = 1. While generally the number of solutions to a Boolean function is non-trivial,
in this particular case a simple analytic formula of the polynomial’s degree suffices.
Lemma 5.4.10. For any $x, y \in \mathbb{F}_2^n$,
$$|X^y(x_1 X_1 \oplus \cdots \oplus x_n X_n)| = 2^{n - \deg(X^y(x_1 X_1 \oplus \cdots \oplus x_n X_n))}$$
if $X^y(x_1 X_1 \oplus \cdots \oplus x_n X_n) \neq 0$, or 0 otherwise.
Proof. Clearly if $X^y(x_1 X_1 \oplus \cdots \oplus x_n X_n) = 0$, then $|X^y(x_1 X_1 \oplus \cdots \oplus x_n X_n)| = 0$ as required.
Suppose instead that $X^y(x_1 X_1 \oplus \cdots \oplus x_n X_n) \neq 0$. Since $x_1 X_1 \oplus \cdots \oplus x_n X_n$ has degree 1 and $X^y$
has degree $|y|$, their product has degree either $|y|$ or $|y| + 1$. We consider the two
cases separately.
Consider the degree $|y|$ case first. Clearly $x \subseteq y$ – that is, every variable $X_i$ in the linear
combination $x_1 X_1 \oplus \cdots \oplus x_n X_n$ is already in $X^y$, hence the degree remains unchanged.
Then
$$X^y(x_1 X_1 \oplus \cdots \oplus x_n X_n) = |x| X^y = \begin{cases} 0 & \text{if } |x| = 0 \mod 2 \\ X^y & \text{otherwise.} \end{cases}$$
Since $X^y(x_1 X_1 \oplus \cdots \oplus x_n X_n) \neq 0$, it must be the case that $|x| = 1 \mod 2$. Moreover, the
equation $X^y = 1$ has exactly $2^{n - \deg(X^y)}$ solutions, corresponding to $X_i = 1$ if $y_i = 1$.
Now consider the degree $|y| + 1$ case. We know $x \nsubseteq y$. Without loss of generality
we can assume $x \cap y = 0$ – that is, there are no variables that appear in both $X^y$ and
$x_1 X_1 \oplus \cdots \oplus x_n X_n$ – as any common variables can be absorbed into $X^y$.
Recall that $x_1 X_1 \oplus \cdots \oplus x_n X_n = 1$ for exactly half of the values of all $X_i$ such that
$x_i = 1$. Since $X^y = 1$ for the $2^{n - \deg(X^y)}$ valuations on $X_1, \ldots, X_n$ where $X_i = 1$ whenever
$y_i = 1$, and $y_i = 1$ implies $x_i = 0$, exactly half of those solutions – of which there are
$2^{n - \deg(X^y) - 1}$ – satisfy $x_1 X_1 \oplus \cdots \oplus x_n X_n = 1$. Hence
$$|X^y(x_1 X_1 \oplus \cdots \oplus x_n X_n)| = 2^{n - \deg(X^y(x_1 X_1 \oplus \cdots \oplus x_n X_n))}$$
as required.
In general, it is not the case that the number of solutions to $f(x) = 1$ is $2^{n - \deg(f)}$ for
an n-variate Boolean polynomial function $f$. In particular, consider $f(x) = 1 \oplus x_1 x_2 \cdots x_i$.
Since $x_1 x_2 \cdots x_i = 1$ has $2^{n-i}$ solutions, the number of solutions to $f(x) = 1$ is
$$2^n - 2^{n-i} \neq 2^{n - \deg(f)}.$$
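Lemma 5.4.10 and the counterexample that follows it can both be confirmed by exhaustive counting for small n. The sketch below is an illustration only. It encodes x, y and valuations X as bit masks, and `predicted_weight` spells out the degree case by case as in the proof; these names are ours.

```python
def popcount(v):
    """Hamming weight of a non-negative integer bit mask."""
    return bin(v).count("1")


def weight(x, y, n):
    """Number of valuations X with X^y = 1 and x_1 X_1 + ... + x_n X_n = 1 over F_2."""
    return sum(
        1
        for X in range(2 ** n)
        if (X & y) == y and popcount(X & x) % 2 == 1
    )


def predicted_weight(x, y, n):
    """Value given by Lemma 5.4.10, with the degree determined as in its proof."""
    if (x & ~y) == 0:
        # x is a subset of y: the product is |x| X^y, of degree |y| when |x| is odd
        return 2 ** (n - popcount(y)) if popcount(x) % 2 == 1 else 0
    # otherwise the product has degree |y| + 1
    return 2 ** (n - popcount(y) - 1)


def solutions_of_counterexample(n, i):
    """Number of X with 1 + X_1 X_2 ... X_i = 1 over F_2, i.e. with X_1 ... X_i = 0."""
    low = 2 ** i - 1
    return sum(1 for X in range(2 ** n) if (X & low) != low)


if __name__ == "__main__":
    n = 4
    print(all(weight(x, y, n) == predicted_weight(x, y, n)
              for x in range(2 ** n) for y in range(2 ** n)))
    print(solutions_of_counterexample(4, 2))
```

For n = 4 and i = 2 the counterexample has 12 solutions, whereas the degree-based formula would give 4.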
An explicit set of generators
From Lemma 5.4.10 it is immediate that if $a \in \mathbb{Z}_{2^k}^{2^n}$ may be written over the monomial basis
with degree at most $n-k-1$, then $f_a(x) = 0 \mod 2^k$ for any $x$ and so $a \in C^n_{2^k}$. However,
it may be the case that $a$ is a sum of monomials with degree greater than $n-k-1$ and
yet is still in $C^n_{2^k}$.
For instance, consider $n = 4$, $k = 3$ and let $a = 2X^y$ where $|y| = 1$. Now for any $x \in \mathbb{F}_2^4$,
$$\log_2 f_a(x) \geq 1 + n - |y| - 1 = 3,$$
hence $f_a(x) = 0 \mod 2^k$, and yet $a$ has degree $1 > n-k-1$. In particular, with regards
to evaluation on $x \in \mathbb{F}_2^n$, the term $2X^y$ acts as if it were a term of degree $n-4$.
We use this intuition to define the order of a polynomial in $\mathbb{Z}_m[X]$. Specifically, we say
the order of a term $\alpha X^y$, denoted $\mathrm{ord}(\alpha X^y)$, is the degree of $X^y$ minus the largest $k$ such
that $2^k \mid \alpha$. Analytically,
$$\mathrm{ord}(\alpha X^y) = |y| - \lfloor \log_2 \alpha \rfloor.$$
We extend order to polynomials, and moreover evaluation vectors of polynomials, by taking
the order of a polynomial to be the highest order among its terms.
The next lemma, and the main lemma of the section, now establishes that any $a \in \mathbb{Z}_{2^k}^{2^n}$
with multilinear form of order greater than $n-k-1$ is not zero-everywhere mod $2^k$.
Lemma 5.4.11. Let $a \in \mathbb{Z}_{2^k}^{2^n}$ have order $m > n-k-1$. Then there exists $x \in \mathbb{F}_2^n$ such
that
$$f_a(x) \neq 0 \mod 2^k.$$





Since $f_a(x) = 0 \mod 2^k$ for all $x \in \mathbb{F}_2^n$ we must also have $f_a(x) = 0 \mod 2^{n-m-1}$, as
$n-m-1 < n-(n-k-1)-1 = k$. Recall that if $\mathrm{ord}(\alpha_y X^y) = i$, then by Lemmas 5.4.9
and 5.4.10
$$f_{\alpha_y X^y}(x) = 2^{n-i} \text{ or } 2^{n-i-1}.$$
The latter case occurs exactly when x * y, and hence we see that fa(x) = 0 mod 2n−m−1
implies there are evenly many terms αyXy of maximum order m such that x * y. Alterna-










since every other term evaluates either to $2^m$ (i.e. if it has order $m$ but $x \subseteq y$), or to $2^{n-i}$ or
$2^{n-i-1}$ where $i < m$. Moreover, we know at least one such term exists, since by Lemma 5.4.7
$a$ can be written up to global phase with degree at most $n-1$.
Our contradiction arises from the fact that there necessarily exists $x \in \mathbb{F}_2^n$ such that an
odd number of terms $\alpha_y X^y$ with order $m$ satisfy $x \nsubseteq y$. In particular, let
$$S_i = \{y \in \mathbb{F}_2^n \mid \mathrm{ord}(\alpha_y X^y) = m,\ y_i = 0\},$$
that is, $S_i$ is the set of terms of maximum order which do not contain $X_i$. Given some
$x \in \mathbb{F}_2^n$, $\bigcup_{i \mid x_i = 1} S_i$ gives the set of terms of maximum order such that $x \nsubseteq y$, and hence
since $f_a(x) = 0 \mod 2^{n-m-1}$, it follows that
$$\left| \bigcup_{i \mid x_i = 1} S_i \right| = 0 \mod 2.$$





$= m$, minimizing $|y'|$ – that is, $y'$ is a term of
maximum order but with minimum degree. Since $y'$ has minimal weight, for every other $y$
such that $\mathrm{ord}(\alpha_y X^y) = m$, there necessarily exists $i$ such that $y_i = 1$ but $y'_i = 0$. Hence
$$\bigcap_{i \mid y'_i = 0} S_i = \{y'\}.$$
By the principle of inclusion-exclusion, the cardinality of this set can be written as a sum
of cardinalities of unions of Si – in particular,
1 = |∩i|y′i=0Si| = |∪i|y′i=0Si|















2n − |Si1 ∪ · · · ∪ Sik |

However, since $\left| \bigcup_{i \mid x_i = 1} S_i \right| = 0 \mod 2$ for any $x$, we have $1 = 0 \mod 2$, hence we derive
our contradiction.
Lemma 5.4.11 shows that $C^n_{2^k}$ is a subset of the evaluation vectors of order at most
$n-k-1$ polynomials, and hence sums of the generators in Lemma 5.4.6. Together with the
fact that every generator gives a zero-everywhere phase polynomial, we can finally prove
the result.
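Lemma (5.4.6). Let $k$ be some integer. Then $C^n_{2^k} = \langle 2^i X^y \mid y \in \mathbb{F}_2^n,\ |y| - i \leq n-k-1 \rangle$.
Proof. For the forward direction,
$$C^n_{2^k} \subseteq \left\langle 2^i X^y \mid y \in \mathbb{F}_2^n,\ |y| - i \leq n-k-1 \right\rangle,$$
suppose $c \in C^n_{2^k}$. Then $f_c(x) = 0 \mod 2^k$ for all $x \in \mathbb{F}_2^n$, so by Lemma 5.4.11, $c$ has order
at most $n-k-1$ and can hence be written as a sum of the above generators.
Now consider some $c = 2^i X^y$ where $|y| - i \leq n-k-1$. By Lemmas 5.4.9 and 5.4.10,
for any $x \in \mathbb{F}_2^n$ the value $f_c(x)$ is either 0 or one of
$$2^{i+n-|y|}, \quad 2^{i+n-|y|-1}.$$
Since $i + n - |y| \geq i + n - (n + i - k - 1) = k + 1$, in every case we have $f_c(x) = 0 \mod 2^k$,
so $c \in C^n_{2^k}$. Moreover since $C^n_{2^k}$ is a group, we see that
$$\left\langle 2^i X^y \mid y \in \mathbb{F}_2^n,\ |y| - i \leq n-k-1 \right\rangle \subseteq C^n_{2^k}.$$
Hence $C^n_{2^k} = \langle 2^i X^y \mid y \in \mathbb{F}_2^n,\ |y| - i \leq n-k-1 \rangle$.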
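The easy direction of Lemma 5.4.6 can be checked by brute force for small parameters. The sketch below is an illustration under our own encoding: y, x and coordinate indices z are bit masks, the entry of $X^y$ at index z is 1 exactly when y is a subset of z, and the phase function is evaluated as $f_a(x) = \sum_z a_z \chi_z(x)$. It does not test the harder converse established by Lemma 5.4.11.

```python
def popcount(v):
    """Hamming weight of a non-negative integer bit mask."""
    return bin(v).count("1")


def monomial_vector(y, n):
    """Evaluation vector of X^y: entry z is 1 exactly when y is a subset of z."""
    return [1 if (z & y) == y else 0 for z in range(2 ** n)]


def phase(coeffs, x, modulus):
    """f_coeffs(x) = sum_z coeffs[z] * chi_z(x), reduced mod modulus."""
    return sum(c * (popcount(z & x) % 2) for z, c in enumerate(coeffs)) % modulus


def generators(n, k):
    """Yield (i, y, vector) for each 2^i X^y with |y| - i <= n - k - 1 and 0 <= i < k."""
    for i in range(k):
        for y in range(2 ** n):
            if popcount(y) - i <= n - k - 1:
                yield i, y, [(2 ** i) * e % 2 ** k for e in monomial_vector(y, n)]


def is_zero_everywhere(coeffs, n, k):
    """True if the phase function of coeffs vanishes mod 2^k on every input."""
    return all(phase(coeffs, x, 2 ** k) == 0 for x in range(2 ** n))


if __name__ == "__main__":
    n, k = 4, 3
    print(all(is_zero_everywhere(v, n, k) for _, _, v in generators(n, k)))
    # X^y with |y| = 1 has order 1 > n - k - 1 = 0 and is not in the code
    print(is_zero_everywhere(monomial_vector(0b0001, n), n, k))
```

The last line illustrates the role of the order: $X^y$ with $|y| = 1$ fails for n = 4, k = 3, while $2X^y$ from the earlier example passes.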
5.5 Experiments
To evaluate Algorithm 5.1 experimentally, it was implemented in the open source T -par3
circuit optimization software, which implements a version of the phase-folding algorithm of
the previous chapter. Algorithm 5.1 was used to optimize greedily chosen CNOT-dihedral
sub-circuits both before and after phase-folding. Only the results for before phase folding
are shown, as the results after phase folding gave no improvements. The experimental set
up was identical to the previous chapter, with an additional timeout of 30 minutes due to
the high degree of difficulty of decoding.
We implemented and tested the algorithm with two Reed-Muller decoders – an early
majority logic decoder due to Reed [Ree54], and a modern recursive decoder due to
Dumer [Dum04]. The former has complexity in $O(2^{2n})$ for an n-qubit circuit while the
latter has a significantly lower complexity of $O(2^n)$. While both of these algorithms are
exponential in the number of qubits $n$, we nonetheless obtain reasonable performance for
large circuits by storing and operating directly on run-length encoded vectors, as most
vectors we saw had only a few 0-1 or 1-0 alternations. In order to optimize these large




Table 5.1 reports the T -count of circuits optimized with both phase-folding alone, and with
Algorithm 5.1 using either the majority logic or recursive decoder applied to {CNOT, X, T}
subcircuits. Instances where the algorithm failed to report a result within the timeout are
identified with a dash.
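To illustrate why run-length encoding helps when vectors have few 0-1 alternations, the following sketch stores a binary vector as its first bit and a list of run lengths, and computes the Hamming weight from the runs alone. It is a minimal illustration with hypothetical function names and does not reproduce the data structures used in T-par.

```python
def rle_encode(bits):
    """Encode a 0/1 sequence as (first_bit, [run lengths])."""
    bits = list(bits)
    if not bits:
        return 0, []
    runs, current, length = [], bits[0], 0
    for b in bits:
        if b == current:
            length += 1
        else:
            runs.append(length)
            current, length = b, 1
    runs.append(length)
    return bits[0], runs


def rle_decode(first_bit, runs):
    """Expand (first_bit, runs) back into a list of bits."""
    out, bit = [], first_bit
    for length in runs:
        out.extend([bit] * length)
        bit ^= 1
    return out


def rle_weight(first_bit, runs):
    """Hamming weight computed from the runs, without expanding the vector."""
    start = 0 if first_bit == 1 else 1
    return sum(runs[start::2])


if __name__ == "__main__":
    # Res_2(a) from Example 5.3.3: sixteen entries, three runs
    residue = [0] * 4 + [1] * 8 + [0] * 4
    print(rle_encode(residue), rle_weight(*rle_encode(residue)))
```

The cost of such operations scales with the number of runs rather than with the length $2^n$ of the vector.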
On average, Algorithm 5.1 performed slightly better than just phase-folding with both
the majority logic decoder and the recursive decoder. While the recursive decoder produced
the best results in some cases, notably the Galois field multipliers, and failed less often,
for many benchmarks it reported significantly increased T -counts compared to phase-
folding. Majority logic decoding by comparison typically produced less T -reduction, though
it consistently resulted in circuits with equal or lesser T -count than just phase-folding.
Counter-intuitively this appears to result from the recursive decoder actually doing a
better job optimizing T -count – after the recursive decoder performs significant rewrites
on individual CNOT-dihedral subcircuits, less phase gate merging is possible. On the other
hand, phase-folding before re-synthesis gave no improvement in T -counts. This lends strong
evidence to the fact that for real-world circuits, as opposed to random circuits, phase-folding
alone is generally close to optimal.
5.6 Related work
Quantum fault-tolerance The relationship between Reed-Muller codes and T gates
has previously been studied from the perspective of fault-tolerance. Previous works [KLZ96,
ZCC11,AJO16] have shown that there exist Quantum Reed-Muller codes which admit
transversal implementations of Z2k gates for any k, and likewise that no such codes exist
for odd-order phase gates. Likewise, such results have given rise to magic-state distillation
algorithms [BH12, CAB12, LC13] for all levels of the Clifford hierarchy – that is, for
implementing Z2k gates via gate teleportation.
These results are directly related to our results, in that they arise from the same modular
equations of phase polynomials. The key difference is that while those results establish
existence theorems, so as to show that such error-correcting codes exist, we also establish
completeness results, so that our methods can be shown to be exactly optimal.
Subsequent work Since this work was originally made public [AM16], a series of works
[CH17b,CH17a,HC18] has used the basic connection we have described here to perform both
Table 5.1: T-count optimization results. n reports the number of qubits in the circuit.
T-counts are recorded for the original circuit, after optimization just by phase-folding, and
after optimization by Algorithm 5.1 with either the majority logic or recursive decoder.

Benchmark             n     Original   Phase-folding   Majority   Recursive
Grover_5              9     140        52              52         52
Mod 5_4               5     28         16              16         16
VBE-Adder_3           10    70         24              24         24
CSLA-MUX_3            15    70         62              62         58
CSUM-MUX_9            30    196        140             84         76
QCLA-Com_7            24    203        95              94         153
QCLA-Mod_7            26    413        249             238        299
QCLA-Adder_10         36    238        162             –          188
Adder_8               24    399        215             213        249
RC-Adder_6            14    77         63              47         47
Mod-Red_21            11    119        73              73         73
Mod-Mult_55           9     49         37              35         35
Mod-Adder_1024        28    1,995      1,011           1,011      1,011
Mod-Adder_1048576     58    16,660     7,340           –          –
Cycle 17_3            35    4,739      1,945           1,944      1,982
GF(2^4)-Mult          12    112        68              68         68
GF(2^5)-Mult          15    175        111             111        101
GF(2^6)-Mult          18    252        150             150        144
GF(2^7)-Mult          21    343        217             217        208
GF(2^8)-Mult          24    448        264             264        237
GF(2^9)-Mult          27    567        351             –          301
GF(2^10)-Mult         30    700        410             –          410
GF(2^16)-Mult         48    1,792      1,040           –          –
GF(2^32)-Mult         96    7,168      4,128           –          –
GF(2^64)-Mult         192   28,672     16,448          –          –
GF(2^128)-Mult        384   114,688    65,664          –          –
GF(2^256)-Mult        768   458,752    262,400         –          –
Ham_15 (low)          17    161        97              97         97
Ham_15 (med)          17    574        230             230        230
Ham_15 (high)         20    2,457      1,019           1,019      1,019
HWB_6                 7     105        71              75         75
HWB_8                 12    5,887      3,551           3,531      3,531
HWB_10                16    26,579     15,921          –          15,921
HWB_12                20    159,341    85,897          –          –
QFT_4                 5     69         67              67         67
Λ_3(X)                5     28         16              16         16
Λ_3(X) (Barenco)      5     21         15              15         15
Λ_4(X)                7     56         28              28         28
Λ_4(X) (Barenco)      7     35         23              23         23
Λ_5(X)                9     84         40              40         40
Λ_5(X) (Barenco)      9     49         31              31         31
Λ_10(X)               19    224        100             100        100
Λ_10(X) (Barenco)     19    119        71              71         71
state distillation of more complicated – specifically, CNOT-dihedral – states, as well as to
perform T -count optimization. Their methods used the connection between RM(n− 4, n)∗
decoding and minimal decomposition of symmetric 3-tensors to develop new decoding, and
hence optimization algorithms.
In [HC18], Heyfron and Campbell show that on random CNOT-dihedral circuits, their
decoders scale according to the upper bound of O(2n) given in this chapter. To optimize
general Clifford+T circuits, rather than apply their decoders to CNOT-dihedral sub-circuits,
they move all Hadamard gates to the beginning and end of the circuit at the cost of 1
ancilla per Hadamard, giving a single CNOT-dihedral circuit conjugated by Hadamard
gates. By doing so they are able to directly optimize T -count over the entire circuit at
once, giving an additional 20% improvement over phase-folding alone. It remains however
unclear how much of their optimization is due to their decoder, and how much is a result of




While the phase gate count, and more specifically RZ gate count, is particularly important
for fault-tolerant quantum computing, in most physical implementations of quantum
computing the CNOT gate is the most expensive. Moreover, it forms the backbone of most
discrete quantum circuits, as it is typically the only entangling operation and is hence
used judiciously in effectively any practically useful quantum circuit. Even in physical
implementations where gates commonly have tunable parameters many use the CNOT
gate, or a CNOT gate up to single qubit rotations, as the two qubit entangling gate
(e.g. [DLF+16]).
In this chapter, we again consider the problem of optimally synthesizing a CNOT-dihedral
circuit given a sum-over-paths action
$$|x\rangle \mapsto e^{2\pi i f(x)} |Ax + b\rangle$$
where $f : \mathbb{F}_2^n \to \mathbb{R}$ and $A_b \in GA(n, \mathbb{F}_2)$. However, rather than from the perspective of
$R_Z$-count, we target CNOT-count as our cost function – the synthesis method we describe
here can moreover be combined with the $R_Z$-count optimization of Chapter 5. As the
particular angle of rotations does not matter for the CNOT optimizations we consider here,
we return to the more general CNOT-dihedral group of arbitrary order – that is, circuits
over $\{\mathrm{CNOT}, X, R_Z(\theta) \mid \theta \in \mathbb{R}\}$.
We introduce the notion of a parity network to characterize the CNOT-minimal synthesis
problem over CNOT-dihedral circuits. Informally, a parity network for a set $S \subseteq \mathbb{F}_2^n$ is
a CNOT circuit in which the parity $\chi_y(x)$ of the circuit's input state $|x\rangle$ appears in the
annotated circuit for any $y \in S$. The intuition is that a parity network for $\mathrm{supp}(\hat{f})$ suffices
to implement the phase rotation $e^{2\pi i f(x)}$.
Using this characterization, we prove that synthesizing a CNOT-optimal circuit over
{CNOT, X,RZ} is at least as hard as computing a minimal parity network. We then show
that the minimal parity network problem is NP-hard in two restricted cases: when all
CNOT gates are restricted to the same target bit, and when the m primary inputs are
encoded in the state of n > m qubits. The former case provides evidence for the hardness
of computing minimal parity networks, while the latter case is useful when combined with
phase-folding.
We further devise a new heuristic optimization algorithm for CNOT-dihedral circuits by
synthesizing parity networks. The optimization algorithm is inspired by Gray codes [Gra53],
which cycle through the set of n-bit strings using the exact minimal number of bit flips.
Like Gray codes, our algorithm achieves the minimal number of CNOT gates when all $2^n$
parities are needed.
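For readers unfamiliar with Gray codes, the standard reflected binary Gray code is easy to generate and check. The sketch below is only an illustration of the Gray code itself, not of the synthesis algorithm of this chapter. It lists the n-bit strings so that consecutive strings, including the wrap-around, differ in exactly one bit.

```python
def gray_code(n):
    """Reflected binary Gray code on n bits, as a list of integers."""
    return [i ^ (i >> 1) for i in range(2 ** n)]


def flipped_bits(n):
    """Index of the single bit flipped at each step of the cycle, wrap-around included."""
    code = gray_code(n)
    steps = []
    for a, b in zip(code, code[1:] + code[:1]):
        diff = a ^ b
        # exactly one bit differs between consecutive code words
        assert diff != 0 and diff & (diff - 1) == 0
        steps.append(diff.bit_length() - 1)
    return steps


if __name__ == "__main__":
    print(gray_code(3))
    print(flipped_bits(3))
```

Since every step flips one bit and every string is visited once, no enumeration of all $2^n$ strings can use fewer flips.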
This work appears in [AAM18] and was presented at Theory of Quantum Computation,
Communication & Cryptography 2018 (TQC’18). An implementation appears in Feynman.
6.1 Parity networks
A key observation from the last chapter is that only parities which appear in the annotated
circuit may have non-zero Fourier coefficient. Otherwise, the parities may appear in any
order, parities may appear in the circuit but not in the sum-over-paths form, or the same
parity may appear multiple times. Multiple RZ gates may also be applied with the same
incoming parity throughout a circuit, in which case the Fourier coefficient is the sum of all
the rotation angles and can be replaced with a single RZ gate – this effect was previously
used to optimize RZ-counts in the phase-folding algorithm of Chapter 4.
The converse of the above observation is that a circuit in which every parity in supp(f̂)
appears as an annotation can be modified to implement the phase rotation f(x) = f̂(0) +∑


































with sum-over-paths form $(f, A_b)$ where $A = I$, $b = 0$ and
$$f(x) = \frac{1}{8}\left(x_1 + x_2 + 3(x_1 \oplus x_3) + 3(x_2 \oplus x_3) + (x_1 \oplus x_2 \oplus x_3) + 3(x_1 \oplus x_2) + x_3\right).$$
The above circuit can be modified to give a new circuit with (non-equivalent) sum-over-paths
form $(f', I_0)$ for
$$f'(x) = \frac{2}{3}(x_2 \oplus x_3) + \frac{1}{3}(x_1 \oplus x_2 \oplus x_3)$$
with no additional CNOT gates simply by changing the parameters of the fourth and fifth
$R_Z$ gates to $\frac{2}{3}$ and $\frac{1}{3}$












This motivates the definition of a parity network below as a CNOT circuit computing a set
of parities, which can be used to implement phase rotations with Fourier expansions having
support contained in that set.
Definition 6.1.1. A parity network for a set $S \subseteq \mathbb{F}_2^n$ is an n-qubit circuit C over CNOT
gates where, for each $y \in S$, the parity $\chi_y(x)$ appears in the annotated circuit.
As all parity networks apply some overall linear transformation of the input, we say a
parity network is pointed at $A \in GL(n, \mathbb{F}_2)$ if the overall transformation is $A$, i.e.,
$$|x\rangle \mapsto |Ax\rangle.$$
Note that as a parity network consists of just CNOT gates, this basis state transformation is
in fact a linear rather than affine one. For convenience we refer to a parity network with the
trivial transformation as an identity parity network. In the context of synthesizing parity
networks, we use the term pointed parity network to refer to a parity network applying a
specific linear transformation.
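Definition 6.1.1 translates directly into a small checker. The following sketch is an illustration under our own conventions and is not the Feynman implementation. A CNOT is a (control, target) pair on wires 0, ..., n−1, the annotation of a wire is a bit mask y standing for the parity $\chi_y(x)$, and we assume the input labels $x_1, \ldots, x_n$ count as appearing in the annotated circuit.

```python
def annotate(circuit, n):
    """Return (set of parities seen, final wire labels) for a CNOT circuit.

    circuit is a list of (control, target) pairs on wires 0..n-1.
    """
    wires = [1 << i for i in range(n)]
    seen = set(wires)  # assumption: input labels count as annotations
    for control, target in circuit:
        wires[target] ^= wires[control]
        seen.add(wires[target])
    return seen, wires


def is_parity_network(circuit, n, S):
    """True if every parity in S appears in the annotated circuit."""
    seen, _ = annotate(circuit, n)
    return set(S) <= seen


def is_identity_pointed(circuit, n):
    """True if the circuit implements the identity linear transformation."""
    _, wires = annotate(circuit, n)
    return wires == [1 << i for i in range(n)]


# A hypothetical 3-qubit example with 5 CNOT gates.
EXAMPLE = [(0, 1), (1, 2), (0, 1), (1, 2), (0, 2)]

if __name__ == "__main__":
    print(sorted(annotate(EXAMPLE, 3)[0]))
    print(is_parity_network(EXAMPLE, 3, {0b011, 0b111, 0b101}))
    print(is_identity_pointed(EXAMPLE, 3))
```

In this example the parities $x_1 \oplus x_2$, $x_1 \oplus x_2 \oplus x_3$ and $x_1 \oplus x_3$ appear, while $x_2 \oplus x_3$ does not, so the circuit is an identity parity network for the first three but not for a set containing the fourth.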
We can now formalize the above observations with the following proposition, stating that
the problem of finding a minimal size (pointed) parity network is equivalent to finding a
CNOT-minimal circuit having a particular sum-over-paths action – i.e. not up to equivalent
phase functions mod 2π.
Proposition 6.1.2. There exists a CNOT-dihedral circuit with sum-over-paths form (f, Ab)
and t CNOT gates if and only if there exists a parity network for supp(f̂) pointed at A with
t CNOT gates.
Proof. First assume there exists a circuit C with t CNOT gates and sum-over-paths form
(f, Ab). Then the annotated circuit necessarily contains either the label χy(x) or 1⊕ χy(x)
for every y ∈ supp(f̂). As the X gates only affect affine factors, and RZ gates don’t affect
labels, the circuit C ′ obtained by removing all X and RZ gates contains as a label χy(x)
for every y ∈ supp(f̂). Moreover, the overall linear transformation is |x〉 7→ |Ax〉, hence C ′
is a parity network for supp(f̂) pointed at A with t CNOT gates.
Now assume there exists a length t parity network C for supp(f̂) pointed at A. Then
the circuit C ′ obtained by, for each y ∈ supp(f̂) inserting RZ(f̂(y)) into C where χy(x)




As the additional affine factor $|Ax\rangle \mapsto |Ax + b\rangle$ is implementable with just X gates –
i.e. $X^b = \bigotimes_i X^{b_i}$ – and the remaining global phase $e^{2\pi i \hat{f}(0)}$ can be implemented with
$R_Z(\hat{f}(0)) X R_Z(\hat{f}(0)) X$, there exists a CNOT-dihedral circuit with sum-over-paths form
$(f, A_b)$ and $t$ CNOT gates.
6.1.1 From CNOT–minimization to parity network synthesis
Proposition 6.1.2 implies that the problem of finding a minimal pointed parity network is
equivalent to the problem of finding a CNOT-minimal circuit for a particular sum-over-
paths form. However, it is not necessarily the case that a CNOT-minimal circuit having
a particular sum-over-paths form is a CNOT-minimal circuit implementing a particular
unitary matrix. Since for any integer-valued function $g : \mathbb{F}_2^n \to \mathbb{Z}$,
$$e^{2\pi i f(x)} = e^{2\pi i (f(x) + g(x))},$$
it may in general be possible to instead synthesize a different sum-over-paths form giving
the same unitary operator, but with lower CNOT cost. For instance,
$$\tfrac{1}{2}(x_1 \oplus x_2) \qquad \text{and} \qquad \tfrac{1}{2}x_1 + \tfrac{1}{2}x_2$$
differ by an integer-valued function, k(x1, x2) = x1x2, and hence implement the same phase
rotation. The left expression, together with the identity basis state transformation, gives
rise to a minimal circuit containing 2 CNOT gates, while the expression on the right requires
no CNOT gates to implement at the expense of an extra phase gate, shown below:



















As we are concerned with the question of minimizing CNOT gates over circuits with
equal unitary representations, a natural question is how this relates to the question of
minimizing CNOT gates over circuits with equal sum-over-paths representations. The
remainder of this section shows that so long as no rotation gates have angles which are
dyadic fractions – numbers of the form $a/2^b$ where a and b are integers – the problems coincide.
Recall from Chapter 3 that two (CNOT-dihedral) sum-over-paths forms correspond to
equivalent unitaries if and only if their phases are related by an integer-valued function. In
particular,
[f ] = {f ′ : Fn2 → R | f ′ = f + g where g : Fn2 → Z},
and we say f ∼ f ′ if f ′ ∈ [f ]. Recall as well that every pseudo-Boolean function f : Fn2 → R
has a unique Fourier expansion [O’D14]. To study the relationship between Fourier
expansions of equivalent functions, it will be important to know their precise form in the
case of integer-valued functions.
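As a concrete check of this setting, the sketch below computes the expansion of a pseudo-Boolean function over the parities χy and tests whether the coefficients are dyadic. It assumes the convention f(x) = c + Σ f̂(y) χy(x) with χy(x) ∈ {0, 1} and the constant term kept separately; the helper names `chi`, `fourier` and `is_dyadic` are hypothetical.

```python
from fractions import Fraction
from itertools import product


def chi(y, x):
    """Parity chi_y(x) in {0, 1} of the bits of x selected by y."""
    return sum(a & b for a, b in zip(y, x)) % 2


def fourier(f, n):
    """Return (c, coeffs) with f(x) = c + sum over nonzero y of coeffs[y] * chi_y(x)."""
    points = list(product((0, 1), repeat=n))
    coeffs = {}
    for y in points:
        if any(y):
            # Walsh-Hadamard coefficient w of (-1)^{x.y}; since
            # (-1)^{x.y} = 1 - 2*chi_y(x), the coefficient of chi_y is -2*w.
            w = sum(Fraction(f(x)) * (-1) ** chi(y, x) for x in points) / 2 ** n
            coeffs[y] = -2 * w
    return Fraction(f((0,) * n)), coeffs


def is_dyadic(q):
    """True if the Fraction q has a power-of-two denominator."""
    return q.denominator & (q.denominator - 1) == 0


# The integer-valued function k(x1, x2) = x1 * x2 from the example above.
const, coeffs = fourier(lambda x: x[0] * x[1], 2)
print(coeffs)                                      # keys: x2, x1, x1 XOR x2
print(all(is_dyadic(c) for c in coeffs.values()))
```

Exact rational arithmetic is used so that the dyadic test is not affected by floating-point rounding.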
Proposition 6.1.3. For any integer-valued function g : Fn2 → Z, the Fourier coefficients
of g are dyadic fractions.
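Proof. Let g : Fn2 → Z be an integer-valued pseudo-Boolean function. It is known [HR68]
that g can be written uniquely as a multilinear polynomial
$$g(x) = \sum_{y \in \mathbb{F}_2^n} a_y x^y,$$
where $x^y = x_1^{y_1} x_2^{y_2} \cdots x_n^{y_n}$ and $a_y \in \mathbb{Z}$ for all y.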
Using the identity x+ y − (x⊕ y) = 2xy for x, y ∈ F2, we derive an inclusion-exclusion





Note that binary vectors are viewed as subsets of {1, . . . , n} for convenience. Since ay ∈ Z
for all y and dyadic fractions are closed under addition, we observe that the Fourier












It remains to prove Equation (6.1), which we obtain as a simple corollary of the more
convenient lemma below.
Lemma 6.1.4. For any x ∈ Fn2 ,




Proof. Clearly the formula is satisfied for n = 1. Now consider n = k+ 1 for some k. Using
the identity x+ y − (x⊕ y) = 2xy for any x, y ∈ F2 and basic arithmetic we observe that
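$$\begin{aligned}
2^k x_1 x_2 \cdots x_{k+1} &= 2^{k-1} x_1 x_2 \cdots x_{k-1} (2 x_k x_{k+1}) \\
&= 2^{k-1} x_1 x_2 \cdots x_{k-1} (x_k + x_{k+1} - (x_k \oplus x_{k+1})) \\
&= 2^{k-1} x_1 x_2 \cdots x_k + 2^{k-1} x_1 x_2 \cdots x_{k-1} x_{k+1} - 2^{k-1} x_1 x_2 \cdots x_{k-1} (x_k \oplus x_{k+1})
\end{aligned}$$
Next we define the length k vectors $x', x'', x''' \in \mathbb{F}_2^k$ as follows:
$$x'_i = x_i, \qquad
x''_i = \begin{cases} x_{k+1} & \text{if } i = k \\ x_i & \text{otherwise} \end{cases}, \qquad
x'''_i = \begin{cases} x_k \oplus x_{k+1} & \text{if } i = k \\ x_i & \text{otherwise} \end{cases}$$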
By induction we see that





























Remark 6.1.6. The proof of Proposition 6.1.3 also suffices to prove a more general result,
namely that any function from Fn2 to an Abelian group G in which 2 is a regular element
has a unique Fourier expansion over G. This also offers an alternative proof to the fact that
the code Cnm from Chapter 5 is the trivial code whenever m is odd; indeed, 2 is regular in
any such group.
We next use the above proposition to show that any pseudo-Boolean function with
non-dyadic spectrum has a property of minimal support over all equivalent functions. This
is important as a parity network for S is also a parity network for any subset of S.
Proposition 6.1.7. Let f : Fn2 → R be a pseudo-Boolean function having a Fourier
spectrum not containing any non-zero dyadic fractions. Then for any f ′ ∼ f ,
supp(f̂) ⊆ supp(f̂ ′).
Proof. Consider some pseudo-Boolean function f ′ such that f ′ ∼ f . By definition we have
f ′ = f + g for some function g : Fn2 → Z. Expanding f(x) and g(x) with their Fourier
expansions we have




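Now by Proposition 6.1.3, for any y we have $\hat{g}(y) = a/2^b$ for some integers a and b, while by
assumption f̂(y) is not a dyadic fraction for any y ∈ supp(f̂). Hence $\hat{f}(y) + \hat{g}(y) \neq 0$ for
every y ∈ supp(f̂), and thus supp(f̂) ⊆ supp(f̂ ′) as required.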
We can now prove that general CNOT-minimization is at least as hard as the problem of
synthesizing a minimal (pointed) parity network. As a corollary, synthesizing a minimal
parity network solves the CNOT-minimization problem whenever the rotation angles, and
hence the Fourier coefficients, are not dyadic fractions.
Theorem 6.1.8. Given A ∈ GL(n,F2), the problem of finding a minimal parity network for
S ⊆ Fn2 pointed at A reduces (in polynomial time) to the problem of finding a CNOT-minimal
circuit equivalent to an n-qubit circuit C over {CNOT, X,RZ}.







Recall that a circuit C over {CNOT, X,RZ} implementing the sum-over-paths form (f, A0)
can be constructed in polynomial time.
Now let C ′ be a CNOT-minimal circuit equivalent to C. We know the sum-over-
paths form of C ′ must be (f ′, A0) for some f ′ ∼ f . However, by Proposition 6.1.7,
S = supp(f̂) ⊆ supp(f̂ ′), so by definition, the circuit obtained from C ′ by removing all X
and RZ gates is a (necessarily minimal) parity network for S pointed at A.
Remark 6.1.9. As in Chapter 5, in cases when the Fourier coefficients contain dyadic
fractions, it may in general be possible to further minimize CNOT-count by optimizing over
all equivalent phase functions. However, in contrast to the Zm-gate optimization problem,
the Fourier spectrum does not correspond directly to the size of a minimal parity network –




2x2 but a larger minimal parity network
as shown earlier – which appears to make the problem of minimizing the CNOT count over all
equivalent functions more difficult.
6.2 Complexity of parity network minimization
We now turn to the question of the complexity of computing minimal parity networks. We
study two cases in particular where the problem can be shown to be NP-complete – the
fixed-target case, and with encoded inputs (i.e. with ancillae). At the end of the section we
discuss the case of synthesizing a minimal parity network with arbitrary targets and no
ancillae.
6.2.1 Fixed-target minimal parity network
We call the problem of synthesizing a minimal parity network in which every CNOT gate
has the same target the fixed-target minimal parity network problem. Formally, we define
the decision problem MPNPFT below:
Problem: Fixed-target minimal parity network (MPNPFT)
Instance: A set of strings S ⊆ Fn2 , and a positive integer k.
Question: Does there exist an n-qubit circuit C over CNOT gates, all
sharing the same target, of length at most k such that C is a
parity network for S?
Remark 6.2.1. In general not every set of strings S admits an ancilla-free fixed-target
parity network, as the value of the target bit necessarily appears in every parity calculation
of a fixed-target CNOT circuit. It follows that an (ancilla-free) parity network for S is
synthesizable if and only if there exists an index i such that for every y ∈ S, yi = 1. However,
a fixed-target parity network may always be synthesized by adding a single ancillary bit
initialized to the state |0〉. In particular, given a set S ⊆ Fn2 and A ∈ GL(n,F2), we may
construct
S ′ = {(y, 1) | y ∈ S},
where we recall that (y, 1) denotes the length n+ 1 string obtained by concatenating y with
1. It may then be observed that a fixed-target parity network for S ′ is always synthesizable,
and in particular forms a parity network for S when the (n+ 1)th bit is initialized to |0〉.
To show that the fixed-target minimal parity network problem is NP-complete, we
introduce the Hamming salesman problem (HTSP) [EKP85]. Recall that the n-dimensional
hypercube is the graph with vertices x ∈ Fn2 and edges between x,y ∈ Fn2 if x and y differ
in one coordinate (i.e. have Hamming distance 1).
Problem: Hamming salesman (HTSP)
Instance: A set of strings S ⊆ Fn2 , and a positive integer k.
Question: Does there exist a path in the n-dimensional hypercube of
length at most k starting at 0 and going through each
vertex y ∈ S?
An equivalent (from a complexity standpoint) version of the Hamming salesman problem
exists where a cycle rather than path is found. Intuitively, the Hamming salesman problem
is to find a sequence of at most k bit-flips iterating through every string in some set S
starting from the initial string 00 . . . 0. In the case when $S = \mathbb{F}_2^n$ the minimal number of bit
flips is known to be $2^n$, corresponding to one bit flip per string; this is the well known Gray
code, a total ordering on $\mathbb{F}_2^n$ where each subsequent string differs by exactly one bit. We
will come back to this connection later in Section 6.3 when designing a synthesis algorithm.
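To make the Gray code connection concrete, the sketch below generates the standard reflected binary Gray code and lists which bit is flipped at each step. The function names are illustrative. Note that walking through all strings as a path takes one fewer flip than the cycle, which needs a final flip to return to 00 . . . 0.

```python
def gray_code(n):
    """Reflected binary Gray code on n bits, as integers, starting at 0."""
    return [i ^ (i >> 1) for i in range(2 ** n)]


def bit_flips(n):
    """Index of the single bit flipped between successive code words."""
    codes = gray_code(n)
    return [(a ^ b).bit_length() - 1 for a, b in zip(codes, codes[1:])]


codes = gray_code(3)
print([format(c, "03b") for c in codes])
print(bit_flips(3))
```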
Ernvall, Katajainen and Penttonen [EKP85] show that the Hamming salesman problem
is in fact NP-complete, hence we can use a reduction from HTSP to prove NP-completeness
of MPNPFT.
Theorem 6.2.2. MPNPFT is NP-complete.
Proof. Clearly MPNPFT is in NP, as the parity of the input values held by each bit at
each stage of a CNOT circuit is polynomial-time computable [AMM14], and hence a parity
network can be efficiently verified. Since HTSP is NP-complete [EKP85] it then suffices to
show NP-hardness by reducing the Hamming salesman problem to the fixed-target minimal
parity network problem.
Given an instance (S ⊆ Fn2 , k) of HTSP, we construct an instance (S ′ ⊆ Fn+12 , k′) of
MPNPFT with size polynomial in |S| · n as follows:
S ′ = {(x, 1) | x ∈ S}, k′ = k.
Suppose there exists a fixed-target parity network C for S ′ with length at most k.
Without loss of generality we may assume that the fixed target is the (n+ 1)th bit, as if
some i ≠ n+ 1 is the target index, then for all y ∈ S ′, $y_i = 1 = y_{n+1}$ and hence swapping
bits i and n+ 1 yields a parity network for S ′. We can then construct a length k hypercube
path through each vertex y ∈ S with starting point 0 by mapping C to a sequence of bit
flips on each CNOT’s control bit. Indeed, by noting that
$$\mathrm{CNOT}\,|x_i\rangle|x_{n+1} \oplus \chi_y(x)\rangle = |x_i\rangle|x_{n+1} \oplus \chi_{y \oplus e_i}(x)\rangle,$$
where $e_i$ is the ith elementary vector, each CNOT gate in C with control i has the effect of
flipping the ith bit of y. By the definition of a parity network, for every y ∈ S, the parity
xn+1 ⊕ χy(x)
appears as an annotation in the circuit, in particular on the (n+ 1)th bit which had initial
state xn+1⊕χ0(x), hence the sequence of bit flips passes through each vertex y ∈ S starting
from 0.
Likewise, if there exists a length k tour through each y ∈ S, given by a sequence of bit
flips, the circuit defined by mapping each bit flip on i to a CNOT with control i and target
n+ 1 is a length k parity network for S ′ and A′.
As the minimum k for which a parity network exists is at most (n − 1) · |S|, the
optimization version of MPNPFT is also in NP, and hence is NP-complete.
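The mapping used in both directions of this reduction can be sketched in a few lines. The code below is an illustration only: `tour_to_circuit` and `parities_on_target` are hypothetical names, bits are 0-indexed, and the extra target bit is taken to be bit n.

```python
def tour_to_circuit(flips, n):
    """Map a hypercube walk from 0, given as flipped bit indices, to CNOTs on target n."""
    return [(i, n) for i in flips]


def parities_on_target(circuit, n):
    """Vertices y (as bitmasks) such that the target holds x_n XOR chi_y(x) after each gate."""
    y, visited = 0, []
    for control, target in circuit:
        assert target == n, "fixed-target circuit expected"
        y ^= 1 << control          # a CNOT with control i flips bit i of y
        visited.append(y)
    return visited


circuit = tour_to_circuit([0, 1, 0], 2)   # walk 00 -> 01 -> 11 -> 10
print(circuit)
print([format(y, "02b") for y in parities_on_target(circuit, 2)])
```

A walk of length k therefore yields a fixed-target circuit with k CNOT gates, and reading the controls of a fixed-target circuit recovers a walk of the same length.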
Corollary 6.2.3. The problem of finding a minimal fixed-target parity network is NP-
complete.
It may be observed that the proof of Theorem 6.2.2 can be modified to show that
the problem of finding a minimal pointed parity network with fixed CNOT targets is also
NP-complete. In particular, taking A to be the identity transformation gives a reduction
from the cycle version of HTSP.
6.2.2 Minimal parity network with encoded inputs
Up to this point, we have discussed only the optimization of pure CNOT-dihedral circuits.
However, our main interest in such optimization problems is to optimally synthesize CNOT-
dihedral sub-circuits of larger circuits. In this case, the inputs to the sub-circuit may
generally be contained in some subset of Fn2 . Knowing this more precise information about
the input space of the circuit may allow shorter parity networks to be synthesized. For
instance, the boxed CNOT-dihedral sub-circuit below on the left performs a phase rotation
of $\frac{1}{8}(x_2 \oplus x_3)$ – by noting that the ancilla begins the sub-circuit already in the state x2 ⊕ x3
we can remove both CNOT gates, as shown by the equivalent circuit on the right.
[Circuit diagrams omitted: two circuits on wires x1, x2, x3 and an ancilla initialized to 0,
whose final state is x2 ⊕ x3.]
We now consider the problem of synthesizing minimal parity networks when some
of the inputs are linear combinations of others. Formally, given a linear transformation
E ∈ L(Fn2 ,Fm2 ) where m > n, a string w ∈ Fm2 is an encoding of x ∈ Fn2 if Ex = w. We
ignore affine factors in the input space since, as noted before, they do not affect CNOT-
counts. The minimal parity network with encoded inputs problem (MPNPE) is then to find
a parity network for a given set S ⊆ Fn2 as a function of the primary inputs x ∈ Fn2 , but
with the initial state |Ex〉 rather than |x〉.
Problem: Minimal parity network with encoded inputs (MPNPE)
Instance: A set of strings S ⊆ Fn2 , a linear transformation E ∈ Fm×n2 ,
and a positive integer k.
Question: Does there exist an m-qubit circuit C over CNOT gates of
length at most k such that C is a parity network for some
set S ′ ⊆ Fm2 where for any y ∈ S there exists w ∈ S ′ such
that ETw = y?
It can be observed that a parity network for some set S ′ as above is equivalent to a
parity network for S starting from the initial state |Ex〉 for any x ∈ Fn2 . In particular, for
any w ∈ S ′ and x ∈ Fn2 ,
χw(Ex) = wTEx = χETw(x) = χy(x).

















, corresponding to different ways of
computing x1 ⊕ x2 from the input state |Ex〉 = |x1x2(x1 ⊕ x2)〉.
MPNPE is again NP-complete, which we prove by a reduction from the well known
NP-complete Maximum-likelihood Decoding Problem (MLDP).
Problem: Maximum-likelihood decoding (MLDP)
Instance: A linear transformation $H \in \mathbb{F}_2^{n \times m}$, a vector $y \in \mathbb{F}_2^n$, and a
positive integer k.
Question: Does there exist a vector w ∈ Fm2 of weight at most k such
that Hw = y?
In the case when H is the parity check matrix of a code C and y is the syndrome of
some vector z, finding the minimum such w gives the minimum weight vector in the coset of
z + C, corresponding to a minimum distance decoding of z. Berlekamp, McEliece, and van
Tilborg [BMvT06] proved that the Maximum-likelihood decoding problem is NP-complete,
and so we may reduce it to the minimal parity network with encoded inputs problem to
show NP-completeness.
Theorem 6.2.4. MPNPE is NP-complete.
Proof. As noted in the proof of Theorem 6.2.2, MPNPE is clearly in NP since a parity
network can be verified in polynomial time. To establish NP-hardness we give a reduction
from MLDP.
Given an instance (H,y, k) of MLDP we construct an instance (S, k′) of MPNPE as
follows:
S = {y}, E = HT , k′ = k − 1.
Suppose there exists a vector w ∈ Fm2 of weight at most k such that Hw = y. Then we
know S ′ = {w} satisfies the requirement that for any y ∈ S there exists w ∈ S ′ such that
ETw = Hw = y. Moreover, the parity computation χw can be computed with |w| − 1 ≤ k′
CNOT gates, hence there exists a parity network of length at most k′ for S ′.
On the other hand, suppose there exists a length ≤ k′ parity network for some set S ′
where there exists w ∈ S ′ such that ETw = Hw = y. Note that
$$\mathrm{CNOT}\,|\chi_y(x)\rangle|\chi_z(x)\rangle = |\chi_y(x)\rangle|\chi_{y \oplus z}(x)\rangle,$$
and that |y ⊕ z| ≤ |y| + |z|. As each bit starts in some state $x_i = \chi_{e_i}(x)$, we see that the
weight of the parity in any bit at any point in the parity network is at most k′ + 1 ≤ k, and
so |w| ≤ k as required.
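The quantity at the heart of this reduction, a minimum-weight w with E^T w = y, can be illustrated by exhaustive search on a tiny instance. The sketch below is exponential in m and is meant only as an illustration; `min_weight_preimage` is a hypothetical name, and E is passed as a list of m rows of n bits.

```python
from itertools import combinations


def min_weight_preimage(E, y):
    """Smallest-weight w in F_2^m with E^T w = y, by brute force; None if unsolvable."""
    m, n = len(E), len(y)
    for weight in range(m + 1):
        for support in combinations(range(m), weight):
            # E^T w is the XOR of the rows of E selected by the support of w.
            image = [sum(E[r][c] for r in support) % 2 for c in range(n)]
            if image == list(y):
                return [1 if r in support else 0 for r in range(m)]
    return None


# Encoded input |x1, x2, x1 XOR x2>: the parity x1 XOR x2 is already present.
E = [[1, 0], [0, 1], [1, 1]]
w = min_weight_preimage(E, [1, 1])
print(w, "CNOT gates needed:", sum(w) - 1)
```

For this encoder the search selects the third input bit alone, so the parity x1 ⊕ x2 costs no CNOT gates, whereas the alternative w = (1, 1, 0) would cost one.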
Corollary 6.2.5. The problem of finding a minimal parity network with encoded inputs is
NP-hard.
As in the fixed-target case, the problem of finding a minimal pointed parity network
with encoded inputs is also NP-complete, by virtue of the fact that a minimal identity
parity network for a singleton set S = {y} necessarily has the form C ; C ′ where both
C and (C ′)† are parity networks for S. Recall that the inverse circuit (C ′)† has the same
length as C ′.
6.2.3 Discussion
While the case of parity network synthesis with encoded inputs is the most relevant to
the usage (CNOT-dihedral sub-circuit synthesis) in this thesis, Theorem 6.2.4 relies on the
hardness of finding a minimal sum of linearly dependent vectors, or minimum distance
decoding. It would appear that the problem becomes easier when using unencoded inputs,
as each vector and hence parity may be expressed uniquely – i.e. minimally – over the
inputs.
We leave determining the complexity of synthesizing a minimal parity network
without encoded inputs as an open problem. We do however conjecture that the identity
version – i.e. synthesizing a minimal identity parity network – is at least as hard as
synthesizing a minimal fixed-target identity parity network. In particular, it appears that
whenever one bit appears in every parity there exists a minimal identity parity network
that targets only that bit, and hence these two problems coincide.
6.3 A heuristic synthesis algorithm
We now present an efficient, heuristic algorithm for synthesizing parity networks. The
algorithm is inspired by Gray codes, which iterate through all $2^n$ elements of $\mathbb{F}_2^n$ minimally
with one bit flip per string. The situation is distinct for synthesizing parity networks as the
bits which can be flipped depend on the state of all n bits, so our method works by trying
to identify subsets of S which can be efficiently iterated with a Gray code-like construction
on a fixed target. In the limit where S = Fn2 , the algorithm gives a minimal size parity
network for S. Again we focus just on the problem of synthesizing a parity network up to
some arbitrary overall linear transformation.
The algorithm, gray-synth, is presented in pseudo-code in Algorithm 6.1. Recall that
Ei,j denotes the elementary F2-matrix adding row i to row j. Given a set of binary strings
S, the algorithm synthesizes a parity network for S by repeatedly choosing an index i
to expand and then effectively recurring on the co-factors S0 and S1, consisting of the
strings y ∈ S with yi = 0 or 1, respectively. As a subset S is recursively expanded, CNOT
gates are applied so that a designated target bit contains the (partial) parity χy(x) where
yi = 1 if and only if y′i = 1 for all y′ ∈ S – if S is a singleton {y′}, then y = y′, hence
the target bit contains the value χy′(x) as desired. Notably, rather than uncomputing this
sequence of CNOT gates when a subset S is finished being synthesized, the algorithm
maintains the invariant that the remaining parities to be computed are expressed over the
current state of the bits. This allows the algorithm to avoid the “backtracking” inherent in
uncomputing-based methods.
More precisely, the invariant of Algorithm 6.1 described above is expressed in the
following lemma.
Lemma 6.3.1. Let C be a CNOT circuit and S ⊆ Fn2 . For any positive integer i we let
C≤i denote the first i gates of C, ci and ti be the control and target of the ith CNOT gate
in C. If we define
$$A_0 = I, \qquad y_0 = y,$$
$$A_i = E_{c_i,t_i} A_{i-1}, \qquad y_i = E_{t_i,c_i} y_{i-1}$$
for every y ∈ S, then it follows that for any x ∈ Fn2 , UC≤i|x〉 = |Aix〉 and
χyi(Aix) = χy(x).
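The invariant of Lemma 6.3.1 can be spot-checked numerically. The sketch below applies a CNOT circuit to sampled inputs while updating y by the transposed row operation, and confirms that the parity of the current state under the updated vector equals the original parity. The function names and the random sampling are assumptions made for illustration; indices are 0-based.

```python
import random


def dot(y, x):
    """Inner product over F_2, i.e. the parity chi_y(x)."""
    return sum(a & b for a, b in zip(y, x)) % 2


def invariant_holds(circuit, y, n, trials=32, seed=0):
    """Check chi_{y_i}(A_i x) == chi_y(x) after every gate, on sampled inputs x."""
    rng = random.Random(seed)
    inputs = [[rng.randint(0, 1) for _ in range(n)] for _ in range(trials)]
    states = [list(x) for x in inputs]
    y_i = list(y)
    for control, target in circuit:
        for s in states:
            s[target] ^= s[control]      # A_i = E_{c,t} A_{i-1}
        y_i[control] ^= y_i[target]      # y_i = E_{t,c} y_{i-1}
        if any(dot(y_i, s) != dot(y, x) for x, s in zip(inputs, states)):
            return False
    return True


print(invariant_holds([(2, 1), (3, 0), (1, 0)], [1, 1, 0, 1], 4))
```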
Proof. The fact that UC≤i|x〉 = |Aix〉 follows simply from the fact that CNOTi,j|x〉 =
|Ei,jx〉. For the latter fact, clearly χyi(Aix) = χy(x) by definition. Moreover, recall that
$$\chi_y(Ax) = y^T A x = \chi_{A^T y}(x)$$






Algorithm 6.1 Algorithm for synthesizing a parity network
1: function gray-synth(S ⊆ Fn2 )
2: New empty circuit C
3: New empty stack Q
4: Q.push(S, {1, . . . , n}, ε)
5: while Q non-empty do
6: (S, I, i) ← Q.pop
7: if S = ∅ or I = ∅ then return
8: else if i ∈ N then
9: while ∃j ≠ i ∈ {1, . . . , n} s.t. yj = 1 for all y ∈ S do
10: C ← C ; CNOTj,i
11: for all (S ′, I ′, i′)∈ Q∪(S, I, i) do






18: j ← arg maxj∈I maxx∈F2 |{y ∈ S | yj = x}|
19: S0 ← {y ∈ S | yj = 0}
20: S1 ← {y ∈ S | yj = 1}
21: if i ∈ {ε} then
22: Q.push(S1, I \ {j}, j)
23: else
24: Q.push(S1, I \ {j}, i)
25: end if





It is clear to see that each yi for y ∈ S in Lemma 6.3.1 is the value of y after i iterations.
To see then that the output is in fact a parity network for S, it suffices to observe that
whenever S = {yi} and I = ∅, |yi| = 1 and thus by Lemma 6.3.1, some bit is in the state
|χyi(Aix)〉 = |χy(x)〉 after the ith CNOT gate. While the fact that |yi| = 1 is assured
by lines 9-16 in this case, the non-zero elements of the target strings are actually zero-ed
out earlier, as the algorithm expands each coordinate. In particular, the first time the “1”
branch is taken when expanding a set, corresponding to the first 1 seen over the indices
previously examined, the target bit i is set – taking further “1” branches result in the
row j being flipped to 0 with a single CNOT. In this way the algorithm makes use of the
redundancy in Fourier spectrum S.
In practice, a parity network implementing some particular basis state transformation
is typically needed. We take the approach of synthesizing a pointed parity network
by first synthesizing a regular parity network, then implementing the remaining linear
transformation – i.e. $A A_i^{-1}$ where $A_i$ is the linear transformation implemented by the
network. In our implementation we use the Patel-Markov-Hayes algorithm [PMH08] which
gives asymptotically optimal CNOT count. To achieve the necessary affine factor b in the
output, the circuit $X^b = X^{b_1} \otimes \cdots \otimes X^{b_n}$ is appended to the end.
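The pointing step can be sketched with plain Gauss-Jordan elimination over F2. This is deliberately not the Patel-Markov-Hayes algorithm and does not achieve its asymptotic CNOT count; it only illustrates how a remaining linear transformation is turned into CNOT gates. The names `cnots_to_identity`, `point_at` and `apply_gates`, and the encoding of each wire's parity as a bitmask, are assumptions.

```python
def cnots_to_identity(rows, n):
    """CNOTs (control, target) that bring wires holding the given parities back to x_0..x_{n-1}.

    rows[q] is the parity on wire q as a bitmask; the matrix must be invertible.
    """
    rows, gates = list(rows), []
    for q in range(n):
        if not (rows[q] >> q) & 1:
            # Bring a 1 onto the diagonal by adding a lower row that has one.
            r = next(r for r in range(q + 1, n) if (rows[r] >> q) & 1)
            rows[q] ^= rows[r]
            gates.append((r, q))
        for r in range(n):
            if r != q and (rows[r] >> q) & 1:
                rows[r] ^= rows[q]
                gates.append((q, r))
    return gates


def point_at(current, target, n):
    """CNOTs taking wires in state `current` to state `target` (both as lists of bitmasks)."""
    undo = cnots_to_identity(current, n)
    redo = list(reversed(cnots_to_identity(target, n)))   # CNOT is self-inverse
    return undo + redo


def apply_gates(gates, rows):
    rows = list(rows)
    for control, target in gates:
        rows[target] ^= rows[control]
    return rows


# Wires holding x1^x2^x4, x2^x3, x3, x4 (bit q of each mask stands for x_{q+1}).
current = [0b1011, 0b0110, 0b0100, 0b1000]
gates = point_at(current, [0b0001, 0b0010, 0b0100, 0b1000], 4)
print(gates, apply_gates(gates, current))
```

Going through the identity is wasteful when the current and target transformations are close, which is one reason a dedicated linear synthesis routine is preferable in practice.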
While the correctness of Algorithm 6.1 is independent of the choice of index j to
expand in line 18, in practice it has a large impact on the size of the resulting parity
network. We chose j so as to maximize the size of the largest subset, S0 or S1, i.e. j =
arg maxj∈I maxx∈F2 |{y ∈ S | yj = x}|. The intuition behind this choice is that as a subset S
of $\mathbb{F}_2^n$ with m bits fixed approaches $|S| = 2^{n-m}$, the minimal parity network for S approaches
one CNOT per string, corresponding to the Gray code in the limit. We also ran experiments
with other methods of choosing j; we found that j = arg maxj∈I maxx∈F2 |{y ∈ S | yj = x}|
gave the best results on average.
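For readers who prefer executable code, the following Python sketch follows the structure of Algorithm 6.1 as described above. It is an interpretation rather than the thesis implementation: indices are 0-based, the target ε is represented by `None`, empty branches are skipped instead of returning, the zeroing loop is run before the test for an exhausted index set, and ties in the choice of j are broken toward the smallest index. Inputs are assumed to be distinct non-zero bit vectors.

```python
def gray_synth(parities, n):
    """Return CNOTs (control, target) forming a parity network for the given parities."""
    circuit = []
    todo = [list(y) for y in parities]
    stack = [(todo, list(range(n)), None)]            # entries are (S, I, target i)
    while stack:
        S, I, i = stack.pop()
        if not S:
            continue
        if i is not None:
            # Zero out every row j != i that is 1 in all of S with one CNOT each.
            while True:
                j = next((j for j in range(n)
                          if j != i and all(y[j] for y in S)), None)
                if j is None:
                    break
                circuit.append((j, i))
                # Re-express all remaining parities over the new bit values.
                for T, _, _ in stack + [(S, I, i)]:
                    for y in T:
                        y[j] ^= y[i]
        if not I:
            continue

        def largest_cofactor(r):
            ones = sum(y[r] for y in S)
            return max(ones, len(S) - ones)

        j = max(I, key=largest_cofactor)              # choice of index to expand
        S0 = [y for y in S if y[j] == 0]
        S1 = [y for y in S if y[j] == 1]
        rest = [r for r in I if r != j]
        stack.append((S1, rest, j if i is None else i))
        stack.append((S0, rest, i))                   # the 0-cofactor is handled first
    return circuit


def wire_labels(circuit, n):
    """All parities (as bitmasks) that appear on some wire of the CNOT circuit."""
    state = [1 << q for q in range(n)]
    seen = set(state)
    for control, target in circuit:
        state[target] ^= state[control]
        seen.add(state[target])
    return seen


# The six parities of Example 6.3.2, written as (y1, y2, y3, y4).
S = [(0, 1, 1, 0), (1, 0, 0, 0), (1, 0, 0, 1), (1, 1, 1, 0), (1, 1, 0, 1), (1, 1, 0, 0)]
network = gray_synth(S, 4)
print(network)
print({sum(b << q for q, b in enumerate(y)) for y in S} <= wire_labels(network, 4))
```

The helper `wire_labels` is only there to confirm that every requested parity appears somewhere in the synthesized circuit; it plays no role in the synthesis itself.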
6.3.1 Examples
Example 6.3.2. To illustrate Algorithm 6.1, we demonstrate the use of gray-synth to
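synthesize a circuit over {CNOT, X,RZ} implementing the operator $U : |x\rangle \mapsto e^{2\pi i f(x)}|x\rangle$
where
$$f(x) = \tfrac{1}{8}\left[(x_2 \oplus x_3) + x_1 + (x_1 \oplus x_4) + (x_1 \oplus x_2 \oplus x_3) + (x_1 \oplus x_2 \oplus x_4) + (x_1 \oplus x_2)\right].$$
0 1 1 1 1 1
1 0 0 1 1 1
1 0 0 1 0 0
0 0 1 0 1 0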
Starting with the initial set S written as a matrix on the left, we choose a bit maximizing
the number of 0’s or 1’s in S. As j = 1 in this case, we construct the cofactors S0 and S1 on
the values in the first row, and recurse on S0 as indicated by the box in the diagram below.
[0] 1 1 1 1 1
[1] 0 0 1 1 1
[1] 0 0 1 0 0
[0] 0 1 0 1 0
(The boxed cofactor S0, the first column, is shown in brackets.)
The algorithm next selects row 2 and immediately descends into the 1-cofactor, since
S0 = ∅. Again, the algorithm selects row 3, and since both rows 2 and 3 have the value 1, a
CNOT is applied with bit 2 as the target and 3 as the control. The remaining vectors are
updated by multiplying with E2,3 – the modified entries are shown in red.
0 1 1 1 1 1       0 1 1 1 1 1
1 0 0 1 1 1   →   1 0 0 1 1 1
1 0 0 1 0 0       0 0 0 0 1 1
0 0 1 0 1 0       0 0 1 0 1 0
[Circuit so far: a CNOT gate with control x3 and target x2.]
As the final row has the value 0, we’re finished with this column and may continue with
the remaining ones.
0 1 1 1 1 1
1 0 0 1 1 1
0 0 0 0 1 1
0 0 1 0 1 0
[Circuit so far: a CNOT gate with control x3 and target x2.]
Again the algorithm chooses row 2 maximizing the number of entries which are the
same for the remaining columns, and recurses on the 0-cofactor as shown below.
0 1 1 1 1 1
1 0 0 1 1 1
0 0 0 0 1 1
0 0 1 0 1 0
[Circuit so far: a CNOT gate with control x3 and target x2.]
In expanding the last row, we first examine the 0-cofactor and find nothing to do, then
the 1-cofactor, at which point we need to apply a CNOT with target bit 1 and control 4.
0 1 1 1 1 1       0 1 1 1 1 1
1 0 0 1 1 1   →   1 0 0 1 1 1
0 0 0 0 1 1       0 0 0 0 1 1
0 0 1 0 1 0       0 0 0 1 0 1
[Circuit so far: CNOT gates with (control, target) = (x3, x2), (x4, x1).]
Now backtracking and entering the 1-cofactor for the remaining columns, we find we
need to apply a CNOT between bits 2 and 1 to zero out the row.
0 1 1 1 1 1       0 1 1 1 1 1
1 0 0 1 1 1   →   1 0 0 0 0 0
0 0 0 0 1 1       0 0 0 0 1 1
0 0 0 1 0 1       0 0 0 1 0 1
[Circuit so far: CNOT gates with (control, target) = (x3, x2), (x4, x1), (x2, x1).]
Continuing on we recurse on the 0-cofactor of row 3 and apply a CNOT with target bit
1, control 4, before backtracking to the 1-cofactor.
0 1 1 1 1 1       0 1 1 1 1 1
1 0 0 0 0 0   →   1 0 0 0 0 0
0 0 0 0 1 1       0 0 0 0 1 1
0 0 0 1 0 1       0 0 0 0 1 0
[Circuit so far: CNOT gates with (control, target) = (x3, x2), (x4, x1), (x2, x1), (x4, x1).]
For the remaining two columns we first zero out row 3 by applying a CNOT gate between
bits 3 and 1, then finally descend into the cofactors on the last row.
0 1 1 1 1 1       0 1 1 1 1 1
1 0 0 0 0 0   →   1 0 0 0 0 0
0 0 0 0 1 1       0 0 0 0 0 0
0 0 0 0 1 0       0 0 0 0 1 0
[Circuit so far: CNOT gates with (control, target) = (x3, x2), (x4, x1), (x2, x1), (x4, x1), (x3, x1).]
0 1 1 1 1 1       0 1 1 1 1 1
1 0 0 0 0 0   →   1 0 0 0 0 0
0 0 0 0 0 0       0 0 0 0 0 0
0 0 0 0 1 0       0 0 0 0 0 0
[Final parity network: CNOT gates with (control, target) = (x3, x2), (x4, x1), (x2, x1), (x4, x1), (x3, x1), (x4, x1).]
The overall linear transformation applied is
A =
1 1 0 1
0 1 1 0
0 0 1 0
0 0 0 1
so the algorithm completes by appending a circuit computing $A^{-1}$. Inserting T = RZ(1/8)
gates in the relevant positions, we get the following circuit computing $|x\rangle \mapsto e^{2\pi i f(x)}|x\rangle$:
[Circuit diagram omitted: the parity network above followed by CNOT gates computing $A^{-1}$,
with five T gates on x1 and one T gate on x2.]
[Circuit diagram omitted: a three-qubit circuit over CNOT, T and T† gates.]
Figure 6.1: Circuit implementing the doubly-controlled Z gate CCZ synthesized with
Algorithm 6.1. The CNOT-minimal Fourier expansion in this case gives $S = \mathbb{F}_2^3 \setminus \{0\}$, and
A = I.
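[Circuit diagram omitted: seven CNOT gates, all with target x4, with controls x3, x2, x3, x1,
x3, x2, x3 in order. The annotations on x4 after each gate are x3 ⊕ x4, x2 ⊕ x3 ⊕ x4, x2 ⊕ x4,
x1 ⊕ x2 ⊕ x4, x1 ⊕ x2 ⊕ x3 ⊕ x4, x1 ⊕ x3 ⊕ x4 and x1 ⊕ x4.]
Figure 6.2: Annotated parity network for the set $S = \{(y, 1) \mid y \in \mathbb{F}_2^3\}$. Note that the
parity network corresponds exactly to the Gray code on $\mathbb{F}_2^3$.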
Example 6.3.3. Figure 6.1 shows a circuit implementing the doubly-controlled Z gate,
corresponding to an identity parity network for $\mathbb{F}_2^3 \setminus \{000\}$, synthesized with Algorithm 6.1
followed by the Patel-Markov-Hayes algorithm. In this case both the parity network and
the identity parity network are minimal, as verified by brute force search – further, the
circuit synthesized by Algorithm 6.1 reproduces exactly the minimal circuit for $\mathbb{F}_2^3 \setminus \{000\}$
from [WGMAG14]. In general, for any n the identity parity network for $\mathbb{F}_2^n \setminus \{0\}$ synthesized
in this manner has the same structure, using $2^n - 2$ CNOT gates, compared to $2^n$ bit flips
for the Gray code.
Example 6.3.4. If instead of $S = \mathbb{F}_2^n \setminus \{0\}$ we have $S \simeq \mathbb{F}_2^m$ for some m < n, Algorithm 6.1
instead gives a circuit corresponding directly to the Gray code. In particular, Figure 6.2
shows a parity network for $S = \{(y, 1) \mid y \in \mathbb{F}_2^3\} \simeq \mathbb{F}_2^3$ synthesized with Algorithm 6.1.
In this case it can be observed that the controls of the CNOT gates are
exactly the bits flipped in a Gray code for $\mathbb{F}_2^3$. Further, it may be noted that this is a
minimal size parity network for S, and is in fact a fixed-target parity network.
6.3.2 Synthesis with encoded inputs
Given an encoder E ∈ L(Fn2 ,Fm2 ), we can use Algorithm 6.1 to synthesize a parity network
for some set S with encoded inputs as follows. Recall that a parity network for S ⊆ Fn2 with
inputs encoded by E corresponds to a parity network for some set S ′ ⊆ Fm2 such that for
any y ∈ S, there exists w ∈ S ′ where ETw = y. While finding the minimal such w would
require solving the NP-hard Maximum-likelihood decoding problem, we can efficiently
compute some w using a generalized inverse of ET , as in Chapter 4.
[Figure 6.3 plot omitted.]
Figure 6.3: Average CNOT counts of parity networks computed by Algorithm 6.1 and brute
force minimization.
It is worth noting that it may be possible to perform additional optimization by
optimizing the set S ′ with a generalized inverse. In particular, it is known [CM09] that the
set of all solutions to the linear system Ax = y is given by $\{A^g y + (I - A^g A)w \mid w \in \mathbb{F}_2^m\}$,
which may be possible to optimize with classical techniques. We tried brute-force optimizing
the set S ′ for some small instances and found negligible effects on overall CNOT counts,
though we leave it as an open question as to whether scalable sub-optimal methods reduce
parity network sizes in large benchmarks.
6.4 Evaluation
To evaluate the performance of Algorithm 6.1, it was implemented in Feynman as a
CNOT-dihedral synthesis algorithm. The experimental set up was the same as in chapters 4
and 5.
We generated all 4-bit minimal parity networks by brute force search and compared
them with the parity networks generated with Algorithm 6.1. Figure 6.3 graphs the results,
with CNOT cost averaged over sets S of parities with the same size. The results show that
our algorithm synthesizes optimal or near-optimal networks for small and large sets S, and
diverges slightly for sets of parities containing around half of the possible parities. The
divergence peaks at |S| = 8, exactly half of the $2^4$ parities, with Algorithm 6.1 coming
within 15% of the minimal number of CNOT gates on average. On examining the structure
of optimal parity networks for sets on which Algorithm 6.1 performed poorly, it appears
that the optimal parity networks save on CNOT cost by making more judicious use of
shortcuts – leaving qubits in particular states to flip between distant parities in other bits
quickly. In general we found that the optimal results in these cases were not achievable just
by using the gray-synth algorithm with different index expansion orders. An effective
synthesis algorithm may then be to combine Algorithm 6.1 for small and large sets with a
different heuristic for sets S of size close to $2^{n-1}$.
[Figure 6.4 plot omitted; the plotted lower-bound curve is labeled f(x) = x − min(x, 11).]
Figure 6.4: Average CNOT counts of parity networks synthesized with Algorithm 6.1 for
sets of parities on 10 bits.
To further examine the scaling of our algorithm, we generated random sets of parities
on n bits and used Algorithm 6.1 to synthesize parity networks. Figure 6.4 graphs the
results for 10 bits, averaged across 50 randomly generated sets of cardinality 32i for each i,
against the theoretical lower bound of |S| −min(|S|, n+ 1) corresponding to one CNOT
per parity of weight at least 2. The results show a similar curve to the 4 qubit case, with
the number of CNOT gates per parity scaled, as expected.
6.4.1 Benchmarks
To evaluate the performance of Algorithm 6.1 on practical quantum circuits, we used it
to optimize the CNOT-dihedral subcircuits generated by the phase-folding algorithm of
Chapter 4 on the same suite of benchmarks. To assign phase terms to different CNOT-
dihedral subcircuits, we performed experiments with two simple strategies – either applying
each phase term at the earliest possible location, or the latest. We found that the former
“greedy” strategy worked best in almost all cases, and report only those results. We
reproduce T counts for comparison with other methods.
Table 6.1 reports the results of our experiments. On average, Algorithm 6.1 resulted in
a 22% reduction of CNOT gates, with 43% reduction in the best case. In reality there
may be more CNOT reduction on average, as the algorithm performed relatively poorly on
the Galois field multipliers, which comprise over a quarter of the benchmarks. Further, only
4 benchmarks took over a second to complete, lending evidence to the scalability of our
method. One benchmark, CSLA-MUX_3, did observe an increase in CNOT gates of 23% –
this appears to be due to our sub-optimal method of generating a pointed parity network
from a parity network, combined with the fact that very few T gates cancel (that is, the
support of the Fourier expansions synthesized have total size close to the number of T gates
in the original circuit). It may be possible to reduce the overhead in this case, and further
reduce CNOT counts in the other benchmarks, by synthesizing pointed parity networks
directly, rather than synthesizing a parity network followed by a linear permutation.
We compared Algorithm 6.1 against a recent heuristic optimization algorithm by Nam,
Ross, Su, Childs and Maslov [NRS+18]. When available the “light” optimization results
reported in [NRS+18] are given, as their software is not open-source. Algorithm 6.1 typically
results in similar CNOT counts to their circuit optimizer, with Algorithm 6.1 reporting
better CNOT counts on some circuits (e.g., VBE-Adder_3, CSUM-MUX_9) and worse on
others (e.g., CSLA-MUX_3, QCLA-Mod_7). Additionally, it may be noted that in the
case of the Galois field multipliers, the reductions Nam et al. achieve are from using base
circuits with fewer CNOT gates. As their optimizations rely largely on special-purpose
synthesis and local rewrites, the techniques are complementary and so it may be possible
to combine both to further reduce CNOT counts.
Nam et al. also report on the results of a “heavy” optimization algorithm which generally
performs slightly better than Algorithm 6.1, with the exception of the VBE-Adder_3 and
Mod 5_4 benchmarks. This optimization, however, does not scale to the largest of our
benchmark circuits, such as GF(2^64)-Mult, as a result of the use of local rule-based rewrites
to optimize CNOT-dihedral subcircuits. An interesting question is whether first performing
re-synthesis with Algorithm 6.1 would reduce run-times for the “heavy” local rewrites.
Table 6.1: CNOT-count optimization results. Original gives the original circuit statistics,
Nam et al. (L) reports the light optimization results from [NRS+18] (where available), and
gray-synth gives the results using phase-folding with Algorithm 6.1 for CNOT-dihedral
resynthesis. The % reduction in CNOT-count over the base circuit is reported in the last
column.
Benchmark          n   Original      Nam et al. (L)          gray-synth
                       CNOT  T       Time (s)  CNOT  T       Time (s)  CNOT  T  % Red.
Grover_5 9 336 336 – – – 0.001 226 154 32.7
Mod 5_4 5 32 28 < 0.001 28 16 0.001 26 16 18.8
VBE-Adder_3 10 80 70 < 0.001 50 24 0.004 46 24 42.5
CSLA-MUX_3 15 90 70 < 0.001 76 64 0.073 111 62 -23.3
CSUM-MUX_9 30 196 196 < 0.001 168 84 0.095 148 84 24.5
QCLA-Com_7 24 215 203 0.001 132 95 0.097 136 94 36.7
QCLA-Mod_7 26 441 413 0.004 302 237 0.145 360 237 18.4
QCLA-Adder_10 36 267 238 0.002 195 162 0.112 214 162 19.9
Adder_8 24 466 399 0.004 331 215 0.165 359 215 23.0
RC-Adder_6 14 104 77 < 0.001 73 47 0.080 71 47 31.7
Mod-Red_21 11 122 119 < 0.001 81 73 0.091 86 73 29.5
Mod-Mult_55 9 55 49 < 0.001 40 35 0.004 40 35 27.3
Mod-Adder_1024 28 2,005 1,995 – – – 0.739 1,390 1,011 30.7
Mod-Adder_1048576 58 16,680 16,660 – – – 12.272 11,080 7,339 33.6
Cycle 17_3 35 4,532 4,739 – – – 2.618 2,968 1,955 37.4
GF(2^4)-Mult 12 115 112 0.001 99 68 0.041 106 68 7.8
GF(2^5)-Mult 15 179 175 0.001 154 115 0.038 163 111 8.9
GF(2^6)-Mult 18 257 252 0.003 221 150 0.055 235 150 8.6
GF(2^7)-Mult 21 349 343 0.004 300 217 0.450 319 217 8.6
GF(2^8)-Mult 24 469 448 0.006 405 264 0.066 428 264 8.7
GF(2^9)-Mult 27 575 567 0.010 494 351 0.076 526 351 8.5
GF(2^10)-Mult 30 709 700 0.009 609 410 0.081 648 410 8.6
GF(2^16)-Mult 48 1,837 1,792 0.065 1,581 1,040 0.363 1,691 1,040 7.9
GF(2^32)-Mult 96 7,292 7,168 1.834 6,299 4,128 5.571 6,636 4,128 9.0
GF(2^64)-Mult 192 28,861 28,672 58.341 24,765 16,448 114.310 25,934 16,448 10.1
GF(2^128)-Mult 384 115,069 114,688 1,744.746 98,685 65,664 1,745.724 102,490 65,664 10.9
GF(2^256)-Mult 768 459,517 458,752 – – – – – – –
Ham_15 (low) 17 259 161 – – – 0.043 208 97 19.7
Ham_15 (med) 17 616 574 – – – 0.089 357 242 42.0
Ham_15 (high) 20 2,500 2,457 – – – 0.376 1,502 1,021 39.9
HWB_6 7 131 105 – – – 0.029 110 75 16.0
HWB_8 12 7,508 5,425 – – – 1.706 6,861 3,531 8.6
HWB_10 16 36,087 26,579 – – – 54.258 32,175 15,921 10.8
HWB_12 20 204,174 159,341 – – – 2,849.475 175,805 85,897 13.9
QFT_4 5 48 69 – – – 0.005 48 67 0.0
Λ_3(X) 5 21 21 < 0.001 14 15 < 0.001 14 15 33.3
Λ_3(X) (Barenco) 5 28 28 < 0.001 20 16 < 0.001 18 16 35.7
Λ_4(X) 7 35 35 < 0.001 22 23 0.001 22 23 37.1
Λ_4(X) (Barenco) 7 56 56 < 0.001 40 28 < 0.001 36 28 35.7
Λ_5(X) 9 49 49 < 0.001 30 31 0.003 30 31 38.8
Λ_5(X) (Barenco) 9 84 84 < 0.001 60 40 < 0.001 54 40 35.7
Λ_10(X) 19 119 119 < 0.001 70 71 0.071 70 71 41.2




CNOT circuits Previous work regarding CNOT optimization has largely focused on
strictly reversible circuits. Iwama, Kambayashi and Yamashita [IKY02] gave some transfor-
mation rules which they use to normalize and optimize CNOT-based circuits. More specific
to CNOT circuits, Patel, Markov and Hayes [PMH08] gave an algorithm for synthesizing
linear reversible circuits which produces circuits of asymptotically optimal size. Their
method modifies Gaussian elimination by prioritizing rows which are close in Hamming
distance and gives circuits of size at most $O(n^2/\log n)$, coinciding with the known lower
bound of $\Theta(n^2/\log n)$ on the worst-case size of CNOT circuits [SPMH02]. We use their
algorithm to perform linear reversible synthesis in our implementation.
Clifford+RZ circuits In the realm of pure quantum circuits, Shende and Markov [SM09]
studied the CNOT cost of Toffoli gates. They proved that 6 CNOT gates is minimal for
the Toffoli gate, and gave a lower bound of 2n CNOT gates for the n qubit Toffoli gate.
More recently, Welch, Greenbaum, Mostame and Aspuru-Guzik [WGMAG14] studied the
construction of efficient circuits for diagonal unitaries. They used similar insights to the
ones we use here, notably the use of the $2^n$ Walsh functions as a basis for n-qubit diagonal
operators which correspond to the parity functions in the Fourier expansions we use to
describe CNOT-dihedral circuits. While their main objective was to optimize circuits by
constructing approximations of the operator which use fewer Walsh functions, they give
a construction of an optimal circuit computing all $2^n$ Walsh functions. They further give
CNOT identities which they use to optimize circuits when not all Walsh functions are used,
but give no experimental data as to the effectiveness of these optimizations. In contrast,
we present and test an algorithm which directly synthesizes an efficient circuit for a specific







Up to this point, the path integral model of quantum circuits has been used to perform
circuit optimization and synthesis. The shift from qubits and gates to paths and phases, in
particular, exposes gates which are physically distant in the circuit but act on the same
computational paths. Effectively, the shift in viewpoint automatically “mods out” certain
commutation rules, leaving just the computational logic.
A natural question is whether this same effect is helpful for the problem of verification.
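Recall that the circuit HH, which is equal to the identity, has sum-over-paths action
$$HH : |x\rangle \mapsto \frac{1}{2} \sum_{y \in \mathbb{F}_2^2} e^{\pi i (x y_1 + y_1 y_2)} |y_2\rangle,$$
and in particular $\sum_{y \in \mathbb{F}_2^2} e^{\pi i (x y_1 + y_1 y_2)} |y_2\rangle = 2|x\rangle$. We see a similar effect whereby physically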
separated Hadamard gates which nevertheless act on the same paths can be reduced with
the above equation. For instance, consider the circuit below:
[Circuit diagram: two-qubit circuit built from Hadamard and CNOT gates]







which by writing the right-hand side as $|x_2\rangle \otimes \left(\frac{1}{2} \sum_{y \in \mathbb{F}_2^2} e^{\pi i (x_1 y_1 + y_1 y_2)} |y_2\rangle\right)$, we can use the
above equation to rewrite algebraically as $|x_2\rangle|x_1\rangle$. Hence the circuit implements a swap of
two basis states; in an equational circuit theory on the other hand, the most natural proof
would require rewriting the three CNOT gates as a swap gate, then commuting one of the
Hadamard gates through the swap and cancelling. While the “circuit” proof is relatively
simple in this case, in the circuit below implementing a swap between the first and third











which can be rewritten to |x3〉|x2〉|x1〉 via the above path integral equation.
In this chapter, we use this insight to address the functional verification problem for
quantum circuits over the Clifford+$R_k$ gate set, where $R_k = R_Z(2\pi/2^k)$. We develop a formal semantic
model of such circuits as path integrals, on top of which we build a calculus of rewrite rules
which identify and reduce interfering sets of computational paths. We consider this gate set
in particular as it will be more convenient to prove certain properties about the size of path
integrals; the rewrite system we develop works in principle for circuits over Clifford+$R_Z$.
Recall also that most general quantum algorithms, such as the Quantum Fourier Transform
(see, e.g., [NC00]) and those algorithms using it, including Shor’s algorithm, are given over
the Clifford+Rk gate set.
The model we formalize in this chapter doubles as a natural specification language for
pure quantum circuits, hence our use of the term functional verification – verification of the
precise input-output relation of a circuit in relation to a (human-readable) specification. In
particular, with some syntactic sugar path integrals correspond directly to the mathematical
definition of circuits typically given in textbooks such as [NC00,KLM07], and moreover allow
the direct use of classical (functional) programs in specifications of quantum circuits. To
evaluate the suitability of path integrals for specification, we perform case studies verifying
quantum algorithms against specifications written directly as path integrals.
This work appears in [Amy18] and won the best student paper award at Quantum
Physics & Logic 2018 (QPL’18). The verification algorithm is implemented in Feynman
and has been used to verify most of the experimental results in chapters 4 and 6.
7.1 The path integral model
Recall from Chapter 3 that we describe a path integral informally as a collection of paths Π





Moreover, it was noted that for Clifford+RZ circuits, the path integral can be represented
by a collection of path variables, a pseudo-Boolean phase function and an affine basis state
transformation. In this section, we develop a concrete, computable representation with a
few additions to allow the specification of algorithms with ancillas, as well as our rewrite
rules which necessarily take us out of the strict Clifford+RZ path integral representation.
First recall that the phase group for Clifford+$R_k$ circuits is isomorphic to the group of
dyadic fractions $\mathbb{D} = \{ \frac{a}{2^b} \mid a, b \in \mathbb{Z} \}$, and hence the phase function over such circuits can
be represented by a pseudo-Boolean function into $\mathbb{D}$. Moreover, every pseudo-Boolean
function $f : \mathbb{F}_2^n \to \mathbb{D}$ has a unique presentation as a multilinear polynomial over variables
$X = X_1, X_2, \dots, X_n$. We denote by $\mathbb{D}_M[X] = \mathbb{D}[X]/\langle X_1^2 - X_1, \dots, X_n^2 - X_n \rangle$ the space of
multilinear dyadic polynomials in the variables of X.
In this chapter, we define (concrete) path integrals over the Clifford+$R_k$ gate set with
ancillas as a tuple consisting of an input space $S \subseteq \mathbb{F}_2^n$, a phase polynomial in n input and
m path variables $P \in \mathbb{D}_M[x, y]$, and an output function $f : \mathbb{F}_2^{n+m} \to \mathbb{F}_2^n$, represented as
(Boolean) polynomial functions.
Definition 7.1.1 (path integral). An n-qubit path integral $\xi = (S, P, f)$ consists of
• an input signature $S \subseteq \mathbb{F}_2^n$,
• a phase polynomial $P \in \mathbb{D}_M[X, Y]$ over input variables $X = X_1, X_2, \dots, X_n$ and path
variables $Y = Y_1, Y_2, \dots, Y_m$, and
• an output signature $f = (f_1, f_2, \dots, f_n)$ where each $f_i \in \mathbb{F}_2[X, Y]$.
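The associated operator of a path integral is the partial linear map $U_\xi$ where for any $x \in S$,
$$U_\xi : |x\rangle \mapsto \frac{1}{\sqrt{2^m}} \sum_{y \in \mathbb{F}_2^m} e^{2\pi i P(x,y)} |f(x,y)\rangle.$$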
Remark 7.1.2. In contrast to previous chapters, we use P and f to denote the phase
polynomial and output functions, respectively, as the phase polynomial is now strictly
speaking a polynomial, and the output function f is no longer linear, and instead an
arbitrary Boolean function.
A path variable is internal if it does not appear in the output signature. We typically
write a path integral informally by the action of its associated operator; when writing a path
integral by the action of its associated operator, we use Boolean variables and constants to












Example 7.1.3. Path integral specifications of common quantum gates and circuits are
listed below:
$T : |x\rangle \mapsto e^{2\pi i \frac{x}{8}} |x\rangle$







$\mathrm{Toffoli}_n : |x_1 x_2 \cdots x_n\rangle \mapsto |x_1 x_2 \cdots (x_n \oplus \prod_{i=1}^{n-1} x_i)\rangle$









Addition and multiplication of Boolean vectors are interpreted as integer operations at the
bit level. In the QFT above, [x · y] denotes the integer value of x · y. For any classical
function f , we can lift the polynomial representation of f to a quantum operator via the path
integral $|x\rangle|0\rangle \mapsto |x\rangle|f(x)\rangle$. Note that the polynomial representation of a classical function
may grow exponentially large, as in the case of addition. A practical implementation of path
integrals as a specification language should include a classical sub-language as syntactic
sugar for operations on Boolean polynomials.
As a unitary or partial isometry may admit many distinct path integral representations,
we define an equivalence between path integrals with the same associated operator.
Definition 7.1.4 (equivalence). Two path integrals $\xi_1, \xi_2$ are equivalent, denoted $\xi_1 \equiv \xi_2$,
if and only if their associated operators are equal, that is, $U_{\xi_1} = U_{\xi_2}$.
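As an illustration of this definition, the following Python sketch checks equivalence by brute force: it expands the associated operator of a path integral as a matrix and compares entries. It is not the method developed in this chapter and takes exponential time; the names `associated_operator` and `equivalent` are hypothetical, a path integral is passed as a tuple `(n, m, phase, output)` of plain Python functions, and the input signature is assumed to be all of $\mathbb{F}_2^n$.

```python
import cmath
from fractions import Fraction
from itertools import product

def associated_operator(n, m, phase, output):
    """Expand a path integral into a 2^n x 2^n matrix (list of rows).

    phase(x, y) returns the dyadic phase P(x, y); output(x, y) returns the
    tuple of output bits f(x, y). Assumes the input signature is all of F_2^n.
    """
    dim = 2 ** n
    norm = 2 ** (-m / 2)  # the 1/sqrt(2^m) normalization
    U = [[0j] * dim for _ in range(dim)]
    for col, x in enumerate(product((0, 1), repeat=n)):
        for y in product((0, 1), repeat=m):
            row = int("".join(map(str, output(x, y))), 2)
            U[row][col] += norm * cmath.exp(2j * cmath.pi * float(phase(x, y)))
    return U

def equivalent(xi1, xi2, tol=1e-9):
    """Compare two path integrals through their associated operators."""
    U1, U2 = associated_operator(*xi1), associated_operator(*xi2)
    return all(abs(a - b) < tol for r1, r2 in zip(U1, U2) for a, b in zip(r1, r2))

# HH: |x> -> 1/2 sum_{y1,y2} e^{2 pi i (x y1 + y1 y2)/2} |y2>
hh = (1, 2, lambda x, y: Fraction(x[0] * y[0] + y[0] * y[1], 2), lambda x, y: (y[1],))
# Identity: |x> -> |x>, with no path variables
identity = (1, 0, lambda x, y: Fraction(0), lambda x, y: (x[0],))
```

Under these assumptions, `equivalent(hh, identity)` evaluates to true, reflecting that HH and the identity are distinct path integrals with the same associated operator.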
Non-isometric path integrals are possible in this model, as for instance $|x\rangle \mapsto |0\rangle$ is a
syntactically valid path integral. As we are concerned only with the unitary circuit model
and by extension isometric path integrals, we define a notion of well-formedness for path
integrals.
Definition 7.1.5 (well-formed). A path integral is well-formed if its associated operator is
a (partial) isometry.
In practice, well-formedness is only an issue when writing path integrals directly as
specifications, and our verification methods work even when a path integral is not guaranteed
to be well-formed.
7.1.1 Composing path integrals
The abstract path integrals of Chapter 3 admit both vertical and horizontal compositions,
defined by adding their path variables and composing their phase and output functions. We
can define vertical composition for the concrete path integrals we consider here similarly, so
that







More concretely, we need to reify $P(x, y) + P'(x', y')$ and $|f(x, y)\rangle|f'(x', y')\rangle$ as polynomials
in $n + n'$ input and $m + m'$ path variables, which we can do by shifting the input and path
variables by n and m respectively in $P'$ and $f'$:
$$(S, P, f) \otimes (S', P', f') = (S \times S', P + P'\uparrow^{n,m}, (f, f'\uparrow^{n,m})),$$
where for any polynomial P in X and Y, $P\uparrow^{n,m} = P[X_i \leftarrow X_{i+n}][Y_i \leftarrow Y_{i+m}]$ and
$P[X \leftarrow P']$ denotes the substitution of X with $P'$ in P.
More care must be taken for horizontal composition, as in general not every such
composition of partial isometries is well formed. For instance, composing the path integral
$|x\rangle \mapsto |x\rangle$ with $|0\rangle \mapsto |0\rangle$ effectively post-selects on x = 0. For this reason we require that
only compatible signatures are composed; in particular, the path integrals (S, P, f) and
(S ′, P ′, f ′) are compatible if and only if
$$\{f(x, y) \mid x \in S, y \in \mathbb{F}_2^m\} \subseteq S'.$$
Determining compatibility is non-trivial in general, as it is at least as hard as Boolean
satisfiability. In practice, we only ever compose partial isometries with unitaries, and
hence all compositions we consider are trivially compatible. A more complete account of
compositions of morphisms in categories of partial isometries can be found in [HB09].
Given two compatible path integrals, (S, P, f) and (S ′, P ′, f ′) their horizontal compo-
sition (S, P, f) ; (S ′, P ′, f ′) may be defined by substituting the input variables Xi of the
latter with the outputs fi of the former. As the phase and output polynomials are defined
over different rings (D and F2, respectively), when substituting a variable with a Boolean
polynomial in the phase we first need to lift it into a functionally equivalent polynomial






2xy. We define the lifting of
a Boolean polynomial P to a polynomial $\overline{P} \in \mathbb{D}_M[X]$ recursively by
$$\overline{X^\alpha} = X^\alpha,$$
$$\overline{P \oplus X^\alpha} = \overline{P} + X^\alpha - 2\overline{P}X^\alpha,$$
where $X^\alpha = X_1^{\alpha_1} X_2^{\alpha_2} \cdots X_n^{\alpha_n}$ for $\alpha \in \mathbb{F}_2^n$ is a multi-index, the first equation uses the
inclusion of $\mathbb{F}_2$ in $\mathbb{D}$, and the recursion is terminating since each polynomial on the right
hand side has strictly fewer terms. It can be easily verified that the lifting of a Boolean
polynomial preserves its action on elements of $\mathbb{F}_2$.
Lemma 7.1.6. For any n-variable Boolean polynomial P and all $x \in \mathbb{F}_2^n$, $\overline{P}(x) = P(x)$.






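Proof. By induction. Given $\alpha \in \mathbb{F}_2^n$, clearly for all $x \in \mathbb{F}_2^n$ we have $\overline{x^\alpha} = x^\alpha$. Moreover, for
any Boolean P on n variables and $\alpha \in \mathbb{F}_2^n$,
$$\overline{(P \oplus X^\alpha)}(x) = \overline{P}(x) + (X^\alpha)(x) - 2(\overline{P}X^\alpha)(x)$$
$$= P(x) + (X^\alpha)(x) - 2(PX^\alpha)(x)$$
$$= (P \oplus X^\alpha)(x).$$
The last equality follows from the fact that $P(x), (X^\alpha)(x) \in \{0, 1\}$ and $(PX^\alpha)(x) = 1$ if
and only if $P(x) = 1 = (X^\alpha)(x)$.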
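The lifting recursion and Lemma 7.1.6 can be sketched in Python as follows. This is a minimal illustration under assumed data structures, not the representation used in Feynman: a Boolean polynomial is a list of distinct monomials, each a `frozenset` of variable indices, and a multilinear dyadic polynomial is a dictionary from monomials to `Fraction` coefficients. The function names are hypothetical.

```python
from fractions import Fraction
from itertools import product

def lift(boolean_poly):
    """Lift an XOR of distinct monomials to a multilinear dyadic polynomial.

    Applies lift(P xor X^a) = lift(P) + X^a - 2 * lift(P) * X^a one monomial
    at a time, starting from the zero polynomial.
    """
    lifted = {}
    for mono in boolean_poly:
        updated = dict(lifted)
        # + X^a
        updated[mono] = updated.get(mono, Fraction(0)) + 1
        # - 2 * lift(P) * X^a; since X_i^2 = X_i, monomials multiply by union
        for term, coeff in lifted.items():
            key = term | mono
            updated[key] = updated.get(key, Fraction(0)) - 2 * coeff
        lifted = {t: c for t, c in updated.items() if c != 0}
    return lifted

def eval_boolean(poly, x):
    """Evaluate an XOR of monomials at the bit vector x."""
    return sum(all(x[i] for i in mono) for mono in poly) % 2

def eval_lifted(poly, x):
    """Evaluate a multilinear dyadic polynomial at the bit vector x."""
    return sum(c for mono, c in poly.items() if all(x[i] for i in mono))

# x0 xor x1 lifts to x0 + x1 - 2*x0*x1
parity = [frozenset({0}), frozenset({1})]
lifted_parity = lift(parity)
```

Evaluating `eval_lifted(lift(p), x)` and `eval_boolean(p, x)` on every bit vector `x` gives a direct, if exponential, check of Lemma 7.1.6 for a given polynomial `p`.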
We can now define the functional composition of (compatible) path integrals as
$$(S, P, f) \,;\, (S', P', f') = (S, P + P'\uparrow^{0,m}[X_i \leftarrow \overline{f_i}], f'\uparrow^{0,m}[X_i \leftarrow f_i]).$$
By Lemma 7.1.6, $(P'\uparrow^{0,m}[X_i \leftarrow \overline{f_i}])(x, y, y') = P'(f(x, y), y')$, hence







$e^{2\pi i (P(x, y) + P'(f(x, y), y'))} |f'(f(x, y), y')\rangle$.
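Proposition 7.1.7. For any well-formed path integrals $\xi, \xi'$, the compositions $\xi \otimes \xi'$ and
(if $\xi$ and $\xi'$ are compatible) $\xi ; \xi'$ are well formed. Moreover,
$$U_{\xi \otimes \xi'} = U_\xi \otimes U_{\xi'}, \quad \text{and} \quad U_{\xi ; \xi'} = U_{\xi'} U_\xi.$$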
Remark 7.1.8. A useful property of path integrals, for the purposes of verification, is that
they unify structurally equivalent circuits without the use of string diagrams (e.g., [CD11]),
which can be difficult to reason about in automated ways [BGK+16]. By this we mean that
circuits which are equivalent up to symmetric monoidal laws are strictly equal in the path
integral picture. For instance, the bifunctoriality law and the naturality of SWAP, stated











are both strict equalities in the path integral model. While much progress has been made
towards computational methods for diagrammatic reasoning [DL13,BGK+16,CDKW16,
GD17], our framework allows us to use standard algebraic tools (e.g., rewriting) without
explicitly managing structural laws.
Path integrals further unify many semantic equivalences of quantum circuits – for
instance, the merging of phase gates applied to the same paths as in Chapter 4. In
contrast, matrix representations unify all equivalences between unitaries, at the expense of
exponential space. As was the case in chapters 5 and 6 for CNOT-dihedral circuits, path
integrals hence provide an intermediary model, where many equivalences are “modded out”
while still remaining efficiently representable.
7.1.2 The path integral semantics
As path integrals admit both a symmetric tensor product and functional composition, we
can give a compositional path integral semantics for circuits in 〈G〉 by giving path integral
interpretations for each gate in G. As in previous chapters, we use the group laws to
“push” all †’s inwards onto gates, since there is no known method of efficiently inverting
a path integral. Below we give a path integral interpretation to the Clifford+Rk basis
{H,CNOT, Rk}1 for k > 0, and their inverses.
Definition 7.1.9. (Clifford+$R_k$ path integral)
The path integral interpretation of an n-qubit circuit C over $\{H, \mathrm{CNOT}, R_k\}$, denoted $\llbracket C \rrbracket_p$,










$\llbracket \mathrm{CNOT} \rrbracket_p = (\mathbb{F}_2^2, 0, (X_1, X_1 \oplus X_2))$
The path integral definitions for the {H,CNOT, Rk} basis given in Definition 7.1.9 are
somewhat obtuse – a more informal but intuitive definition is given by the sum-over-paths
actions of each gate below.







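$$R_k : |x\rangle \mapsto e^{2\pi i \frac{x}{2^k}} |x\rangle$$
$$R_k^\dagger : |x\rangle \mapsto e^{2\pi i \frac{-x}{2^k}} |x\rangle$$
$$\mathrm{CNOT} : |x_1 x_2\rangle \mapsto |x_1 (x_1 \oplus x_2)\rangle.$$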
As a consequence of the above and Proposition 7.1.7 we have the following proposition:
Proposition 7.1.10. For any circuit C over $\{H, \mathrm{CNOT}, R_k\}$, we have $U_{\llbracket C \rrbracket_p} = \llbracket C \rrbracket$.
7.1.3 Computational efficiency
Each output of the path integral interpretation of a Clifford+$R_k$ circuit is a composition of
linear Boolean functions, and hence is itself linear. Moreover, the phase polynomial of the
interpretation has degree at most k. To show this, recall the definition of the dyadic
order of a polynomial (Chapter 5, Section 5.4):
1We do not include the X gate in this chapter, as the effect on efficiency is negligible.
Definition 7.1.11. The order of a term $\frac{a}{2^b} X^\alpha$ where a is co-prime to 2 and $\alpha \in \mathbb{F}_2^n$ is
$b + |\alpha| - 1$. The order of a polynomial $P \in \mathbb{D}_M[X]$, denoted ord(P), is the maximum order
of all terms in P.
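A small Python sketch may help make Definition 7.1.11 concrete. It assumes, purely for illustration, that a non-zero polynomial is stored as a dictionary from monomials (sets of variable indices) to `Fraction` coefficients; the names `term_order` and `order` are hypothetical.

```python
from fractions import Fraction

def term_order(coeff, monomial):
    """Order b + |alpha| - 1 of the term (a / 2^b) X^alpha with a odd."""
    coeff = Fraction(coeff)
    num, den = coeff.numerator, coeff.denominator
    if num == 0 or den & (den - 1):
        raise ValueError("coefficient must be a non-zero dyadic fraction")
    if den > 1:
        # Fraction is in lowest terms, so the numerator is odd here
        b = den.bit_length() - 1
    else:
        # integer coefficient a * 2^k with a odd: write it as a / 2^(-k)
        b = -((num & -num).bit_length() - 1)
    return b + len(monomial) - 1

def order(poly):
    """Maximum order over all non-zero terms of a non-zero polynomial."""
    return max(term_order(c, m) for m, c in poly.items() if c != 0)

# The T gate contributes the phase term x/8, which has order 3.
t_phase = {frozenset({0}): Fraction(1, 8)}
# The Hadamard gate contributes the phase term xy/2, which has order 2.
h_phase = {frozenset({0, 1}): Fraction(1, 2)}
```

With these inputs, `order(t_phase)` is 3 and `order(h_phase)` is 2, in line with the bound of order at most k for the gates of Clifford+$R_k$ when k is at least 2.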
The definition above is equivalent to the definition in Chapter 5 – however, this version



















An important fact, shown below, is that order is non-increasing with respect to substi-
tution of linear Boolean polynomials.




$$\mathrm{ord}\left(P[X_i \leftarrow \overline{Q}]\right) \le \mathrm{ord}(P)$$









Substituting Q in for Xi we see that for any term a2bX



























Since the output function of a Clifford+Rk path integral is strictly linear, by Lemma 7.1.13
composing Clifford+Rk path integrals does not increase the order of the phase polynomial.
Moreover, each gate over $\{H, \mathrm{CNOT}, R_k\}$ has phase polynomial of order at most k and
denominator at most $2^k$, hence we obtain the following results.
Proposition 7.1.14. The phase polynomial of a (canonical) Clifford+Rk path integral has
degree at most k.
Corollary 7.1.15. The path integral interpretation of an n-qubit Clifford+Rk circuit C
has size polynomial in the volume of C and can be computed in polynomial time.
On representations of the phase While the representation of the phase as a multilinear
polynomial is indeed polynomial in the size of the circuit, at higher levels of the Clifford
hierarchy (i.e. large k) the degree of the polynomial can become prohibitively large. Even
for the standard Clifford+T gate set, the path integral of a circuit requires space cubic in
the volume of the circuit. In practice this makes verification of some larger circuits difficult.
By contrast, representing the phase polynomial by its Fourier expansion was
previously noted to use linear space in the circuit volume. This however complicates
the process of verification as the Fourier expansion is not necessarily unique. A possible
compromise would be to store the Fourier expansion normally, and generate the multilinear
form for small subsets on demand.
7.2 A calculus for path integrals
The verification question we are generally concerned with is that of functional verification:
given a circuit C and specification $\xi$, is $\llbracket C \rrbracket_p \equiv \xi$? From an automated perspective it is
simpler to instead check that the miter [YM10] $\llbracket C^\dagger \rrbracket_p \circ \xi$ is the identity transformation,
but in either case we need a method of efficiently establishing equivalence. To that end,
in this section we present a system of reduction rules for path integrals. A key feature
of the calculus is that the reduction rules strictly decrease the number of path variables,
producing a (not necessarily unique) normal form in polynomial time.
7.2.1 Motivation
Our calculus operates by reducing the number of paths when sets of paths interfere in
recognizable ways which we call interference patterns. As an illustration, recall once more
the identity circuit HH. Computing its canonical path integral we get







To see that the above path integral is equal to the identity, we can first expand the
















Since $e^{\pi i} = -1$, it can be observed that if $x + y_2 = 0 \bmod 2$, the two paths corresponding
to $y_1 = 0$ and $y_1 = 1$ constructively interfere, whereas if $x + y_2 = 1 \bmod 2$ they destructively
interfere. As $\mathbb{F}_2 = x \oplus \mathbb{F}_2 = \{x, 1 \oplus x\}$ for any $x \in \mathbb{Z}$, we can rewrite the sum over $x \oplus \mathbb{F}_2$
















The reasoning above applies to any situation where an internal path variable $y_i$ only
appears with coefficients taken from the Boolean subgroup $\{0, \frac{1}{2}\}$ of $\mathbb{D}/\mathbb{Z}$, as the two
branches of $y_i$ are identical, except that the $y_i = 1$ path picks up a multiplicative factor of $-1$












$$\sum_{y \in \mathbb{F}_2^m,\ Q(x,y) = 0 \bmod 2} e^{2\pi i R(x,y)} |f(x,y)\rangle$$
Note that the polynomial Q is integer-valued and hence can be reduced mod 2 to a
Boolean polynomial; otherwise, the $y_i = 1$ path could pick up values not in $\{1, -1\}$. In
practice, we only perform such reductions when the restricted sum can be reified by solving
$Q(x, y) = 0 \bmod 2$ for some $y_i$, as we did above with $y_2 = x$.
7.2.2 Reduction rules
Figure 7.1 gives the rules of our calculus. We write ξ −→ ξ′ to denote that ξ reduces to
ξ′, and denote by −→∗ the transitive closure of −→. In all rules, Yi is an internal path
variable, and the quotient Q ∈ F2[X, Y ] means Q is an integer valued polynomial reduced






The rules were developed by translating known circuit identities into path integrals, then
minimizing the identities to obtain simple interference patterns which 1) strictly reduce the
number of path variables, and 2) can be efficiently matched. We found that most common
Clifford+T equalities reduce to a small set of rules. In particular, the [HH] rule derived
from the equality HH = I as described above is sufficient for the vast majority of path
integral reductions. The [ω] rule arises from the identity $(SH)^3 = e^{2\pi i/8} I = \omega I$. It can be
observed that [ω] eliminates one path variable ($Y_i$), while [HH] eliminates two path variables ($Y_i$
and $Y_j$).

$P = \frac{1}{4} Y_i + \frac{1}{2} Y_i Q + R$, $Q \in \mathbb{F}_2[X, Y]$
$Y_i$ does not appear in Q, R or f
------------------------------------------------------------
$(S, P, f) \longrightarrow (S, \frac{1}{8} - \frac{1}{4}\overline{Q} + R, f)$
[ω]

$P = \frac{1}{2} Y_i (Y_j + Q) + R$, $Q \in \mathbb{F}_2[X, Y]$
$Y_i$ does not appear in Q, R or f
$Y_j$ does not appear in Q
------------------------------------------------------------
$(S, P, f) \longrightarrow (S, R[Y_j \leftarrow \overline{Q}], f[Y_j \leftarrow Q])$
[HH]

Figure 7.1: Path integral reduction rules
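The interference patterns behind the two rules can be checked numerically. The Python sketch below is an illustration only, with hypothetical function names: it sums the phase factor over the two values of the eliminated variable $Y_i$ for fixed Boolean values of $Y_j$ and Q. For the [HH] pattern the sum is 2 when $Y_j = Q$ and 0 otherwise, and for the [ω] pattern it is $1 + i$ or $1 - i$, that is, $\sqrt{2}\, e^{2\pi i (1/8 - Q/4)}$.

```python
import cmath

def branch_sum(phase_of_y):
    """Sum e^{2 pi i * phase} over both values of a single path variable."""
    return sum(cmath.exp(2j * cmath.pi * phase_of_y(y)) for y in (0, 1))

def hh_pattern(yj, q):
    """Sum over Y_i of e^{2 pi i * (1/2) Y_i (Y_j + Q)} for fixed bits yj, q."""
    return branch_sum(lambda yi: 0.5 * yi * (yj + q))

def omega_pattern(q):
    """Sum over Y_i of e^{2 pi i * (Y_i/4 + Y_i Q/2)} for a fixed bit q."""
    return branch_sum(lambda yi: 0.25 * yi + 0.5 * yi * q)

for yj in (0, 1):
    for q in (0, 1):
        # Only the branch with Y_j = Q survives, with weight 2
        print("HH", yj, q, round(abs(hh_pattern(yj, q)), 6))
for q in (0, 1):
    print("omega", q, omega_pattern(q))
```

The factor of 2 in the [HH] case cancels the normalization of the two eliminated path variables, and the factor of $\sqrt{2}$ in the [ω] case cancels that of the single eliminated variable.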
Proposition 7.2.1 (Correctness). If $\xi \longrightarrow^* \xi'$, then $\xi \equiv \xi'$.
Proof. We verify each rewrite rule by direct calculation on the right-hand side of the





























y∈Fm2 (1 + i) e




y∈Fm2 (1− i) e





























































$e^{2\pi i (R[y_j \leftarrow Q])(x,y)} |(f[y_j \leftarrow Q])(x,y)\rangle$
It is a trivial fact that the calculus is terminating, as every rule reduces the number of
path variables. Moreover, each rewrite rule can be matched against in polynomial time,
hence every path integral reduces to a normal form in polynomial time.
Proposition 7.2.2 (Strong normalization). Every sequence of rewrites terminates with an
irreducible path integral. The length of the sequence is linear in the number of path variables m,
and for an n-qubit path integral the reduction takes time polynomial in n and m.
7.2.3 Examples
To illustrate our rewrite system, we give examples below. We write path integrals by their
associated operators for clarity.
Example 7.2.3. Recall that the standard implementation of the Toffoli gate over Clifford+T
has the sum-over-paths action








We can verify that this is equivalent to the functional specification
$$|x_1 x_2 x_3\rangle \mapsto |x_1 x_2 (x_3 \oplus x_1 x_2)\rangle$$
















$\mapsto |x_1 x_2 (x_3 \oplus x_1 x_2)\rangle$ [HH]
Example 7.2.4. The controlled-T gate can be specified as the path integral
$$\text{controlled-}T : |x_1 x_2\rangle \mapsto e^{2\pi i \frac{x_1 x_2}{8}} |x_1 x_2\rangle.$$
An implementation of the controlled-T gate over Clifford+T is given below:
[Circuit diagram: three-qubit Clifford+T implementation of the controlled-T gate, with the third qubit initialized to |0〉]
Computing the canonical path integral as a partial isometry with the third qubit in the





































Hence the above circuit implements the controlled-T gate with a 0-initialized ancilla, and
moreover returns the ancilla to its original state.
If we had tried to verify the above circuit with an arbitrary state $|x_3\rangle$ for the third
qubit, it would have failed since $|x_1 x_2\rangle|1\rangle \not\mapsto e^{2\pi i \frac{x_1 x_2}{8}} |x_1 x_2\rangle|1\rangle$. This illustrates the necessity
of modelling the semantics of qubit initialization for practical verification tasks.
Example 7.2.5. To show the use of the [ω] rule, we reduce the circuit $(SH)^3$ to the ω
constant.




































Example 7.2.6. The one-bit full adder has the reversible path integral specification
$$|x_1 x_2 x_3 x_4\rangle \mapsto |x_1 (x_1 \oplus x_2)(x_1 \oplus x_2 \oplus x_3)(x_1 x_2 \oplus x_1 x_3 \oplus x_2 x_3 \oplus x_4)\rangle.$$
The implementation below over Clifford+T was obtained by using the Reed-Muller decoding
method of Chapter 5 to reduce the number of T gates from the standard implementation
using two Toffoli gates.
[Circuit diagram: four-qubit Clifford+T implementation of the one-bit full adder]
















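$e^{2\pi i \frac{1}{2} y_1 (y_2 + x_1 x_2 + x_1 x_3 + x_2 x_3 + x_4)} |x_1 (x_1 \oplus x_2)(x_1 \oplus x_2 \oplus x_3) y_2\rangle$
$\mapsto |x_1 (x_1 \oplus x_2)(x_1 \oplus x_2 \oplus x_3)(x_1 x_2 \oplus x_1 x_3 \oplus x_2 x_3 \oplus x_4)\rangle$ [HH, Elim]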
In this case, verification works because of the normalization of the phase polynomial due to
its presentation as a multilinear polynomial, rather than the Fourier expansion.
7.3 Completeness
While our calculus computes a normal form in polynomial time, the normal forms are
not necessarily unique2 and hence our reduction system is incomplete. For instance, the
Clifford+T identity
[Circuit diagram: two-qubit Clifford+T circuit identity]
from [SB16] gives the irreducible path integral $|x_1 x_2\rangle \mapsto \frac{1}{\sqrt{2^8}} \sum_{y \in \mathbb{F}_2^8} e^{2\pi i \frac{1}{8} P(x,y)} |x_1 y_8\rangle$ with
phase polynomial
P (x,y) = 2 + 6x1x2 + x2 + y1 + 4y1(x1 + x2 + y2) + 6y2 + 4y2y3 + 2y2x1 + 3y3 + 4y3(x1 + y4)
+ 4y4y5 + 6y4x1 + y5 + 4y5(x1 + y6) + 6y6 + 4y6y7 + 2y6x1 + 3y7 + 4y7(x1 + y8) + 7y8.
A complete verification procedure could proceed by explicitly expanding the values of
remaining variables in the path integral after all possible reductions have been made, and
then checking equivalence to the identity transformation. In practice we found that this is
generally not necessary, as our calculus, along with some additional observations, is sufficient
to prove or disprove equivalence for the majority of circuits. Moreover, these heuristics
combined with path integral reductions give a complete, polynomial-time procedure for
determining equivalence of Clifford group circuits.
7.3.1 Heuristics
Isometry restrictions Our first heuristic reduces the number of path variables in a
well-formed path integral when checking equivalence. Specifically, we denote by $\xi|_{f(x,y)=x}$
the restriction of ξ to solutions $x \in \mathbb{F}_2^n, y \in \mathbb{F}_2^m$ such that $f(x, y) = x$, which we can write






Effectively, the sum $\frac{1}{\sqrt{2^m}} \sum_{y \in \mathbb{F}_2^m,\ f(x,y)=x} e^{2\pi i P(x,y)}$ gives the amplitude of the basis state $|x\rangle$ in
the output for a given input state |x〉. If the path integral ξ is well-formed (i.e. normalized),
then this sum will be equal to 1 exactly if ξ is the identity transformation. We sum this up
in the lemma below:
2Uniqueness would imply that equivalence checking of reversible Boolean circuits is in P. As this
problem is co-NP-complete, uniqueness of our normal forms would indeed imply P = co-NP.
Lemma 7.3.1. Suppose $U_\xi : |x\rangle \mapsto \frac{1}{\sqrt{2^m}} \sum_{y \in \mathbb{F}_2^m} e^{2\pi i P(x,y)} |f(x,y)\rangle$ is a well-formed path
integral. Then ξ implements the identity transformation if and only if $\xi|_{f(x,y)=x}$ implements
the identity transformation.
Note that Lemma 7.3.1 does not hold if ξ is not well-formed, as $U_\xi$ may not be an
isometry and so it may be that $U_\xi|x\rangle = \alpha|x\rangle + \beta|\psi\rangle$ for some residual state $|\psi\rangle$. To reify
the restricted path integral $\xi|_{f(x,y)=x}$ we find path variable substitutions which give $f_i = X_i$.
In particular, if for some index i we have $f_i = Y_i \oplus Q$ where $Y_i$ does not appear in Q, we can
substitute $X_i + Q$ for $Y_i$ to get $f_i = X_i$ and remove $Y_i$ from the sum. Any restrictions which
can’t be reified are simply ignored. In practice this results in a significant simplification for
some circuits, instantly removing up to n path variables.
Remark 7.3.2. Restricting a path integral to solutions of the form f(x,y) = x may break










which has no path variables but a transition amplitude of $\frac{1}{\sqrt{2}}$ and hence is not the identity.
For this reason when using isometry restrictions a normalization factor is required along
with the path integral to ensure the output amplitude of the |x〉 state is 1.
Disproving equivalence As the reduction rules of Figure 7.1 only suffice to prove
positive results (i.e. equivalence), when no more reductions are possible we attempt to
apply an observation that was found to be effective for proving that a path integral ξ is not










if $Q(x, y) = 1 \bmod 2$. If Q is a non-zero Boolean polynomial in only input variables $X_i$,








$(1 - 1) e^{2\pi i R(x,y)} |f(x,y)\rangle = 0$.
We sum this up in the following lemma.
Lemma 7.3.3. Suppose $\xi = (S, \frac{1}{2} Y_i Q + R, f)$ where $Y_i$ does not appear in Q, R or f and
Q is a non-zero integer-valued polynomial not referencing any path variables. Then $U_\xi \neq I$.
Hence we can use a variant of [HH] where Q references only input variables to prove
negative results – i.e. that a path integral is not the identity.
7.3.2 Clifford completeness
We can now show that together with the previous heuristics, our path integral reductions
are complete for deciding equivalence of Clifford group circuits. Recall that over the Clifford
group, the path integral interpretation of a circuit has phase polynomial of order at most 2.
Our proof of completeness rests on the fact that progress can always be made if the phase
polynomial is at most second-order.
Lemma 7.3.4 (Clifford progress & preservation). Let $\xi = (S, P, f)$ be a path integral where
$\mathrm{ord}(P) \le 2$ and no path variables appear in f. Then either there exists $\xi' = (S', P', f')$
such that $\xi \longrightarrow \xi'$ where $\mathrm{ord}(P') \le 2$ and no path variables appear in $f'$, or $U_\xi \neq I$.
Proof. Since P is at most second-order, we can write P = YiQ+R for some internal path







for some a, b ∈ F2 and linear Boolean polynomial Q′. We have 2 cases to consider,
corresponding to the [HH] and [ω] rules respectively.
Case 1: a = 0, b = 1. If the polynomial Q′ contains a path variable Yj , then Q′ = Yj +Q′′




≤ ord (R) ≤ 2 and ξ′ has
only internal paths since $Y_j \notin f$.
If on the other hand $Q'$ only contains input variables, by Lemma 7.3.3 $U_\xi \neq I$.
Case 2: a = 1. The sum matches the left hand side of [ω], hence $\xi \longrightarrow_{[\omega]} \xi'$. Further, by
Lemma 7.1.13






















Corollary 7.3.5. If C ∈ 〈H, CNOT, S〉, then ⟦C⟧ = I can be decided in time polynomial
in the volume of C.
Proof. Since ⟦C⟧_p is well-formed, by Lemma 7.3.1 it suffices to check whether ⟦C⟧_p|_{f(x,y)=x}
is the identity. Further, as f(x,y) is linear, we can compute via Gaussian elimination a
solution y so that f(x,y) = x for any x – if no such solution exists, ⟦C⟧ ≠ I. Since each
f_i is linear, ord(P[y_i ← f_i]) ≤ ord(P) ≤ 2, hence by Lemma 7.3.4 and Proposition 7.2.2,
either ⟦C⟧_p|_{f(x,y)=x} reduces to the identity in polynomial time or ξ′ is not the identity.
7.4 Case studies
To evaluate the performance of our reduction system for the task of verification, we
performed two sets of experiments – equivalence checking of the optimized benchmark
circuits from chapters 4 and 6, and the verification of circuits directly against functional
specifications given as path integrals. Again, the experiments were performed in Debian
Linux running on a quad-core 64-bit Intel Core i7 2.40 GHz processor and 8 GB RAM.
7.4.1 Translation validation
Translation validation is an important tool for verifying that the transformations a compiler
performs do not change the semantics of an input program. While it is generally desirable
to prove that a compiler operates correctly on all input programs, as we discuss in the next
chapter, in many cases this is infeasible since the best optimizations are typically difficult
to formally verify.
We used our algorithm to verify the benchmark optimization results of chapters 4
and 6. We give the results of the verification of the gray-synth algorithm of Chapter 6
in Table 7.1, as it includes phase-folding and is the harder optimization problem. All but
7 of the benchmark circuits were successfully verified, with the remaining 7 benchmarks
running out of memory with a 6 GB limit. The high memory usage may be mitigated in the
future by switching to a linear-space representation of the phase polynomial. The largest
(completed) benchmark, GF(2^32)-Mult, with 96 bits, 252 path variables and over 25,000
gates, completed in under 10 minutes, with the remainder all taking under a minute.
To test the algorithm’s ability to disprove equivalence, we also performed the verification
of the optimized benchmark circuits after removing a randomly selected gate. Again, all
but 7 benchmarks were proven to be not equivalent, with the negative verification results
taking about the same amount of time as positive results.
Table 7.1: Optimization validation results. n lists the number of qubits, m gives the
number of path variables, and Clifford and T give the number of respective gates. Times for
positive and negative verification measure the time to prove equivalence or non-equivalence,
respectively. Benchmarks with no timing results ran out of memory.

| Benchmark | n | m | Clifford | T | Positive time (s) | Negative time (s) |
|---|---|---|---|---|---|---|
| Grover_5 | 9 | 200 | 1,515 | 490 | 0.973 | 0.988 |
| Mod 5_4 | 5 | 12 | 66 | 44 | 0.005 | 0.028 |
| VBE-Adder_3 | 10 | 20 | 167 | 94 | 0.026 | 0.028 |
| CSLA-MUX_3 | 15 | 40 | 289 | 132 | 0.099 | 0.055 |
| CSUM-MUX_9 | 30 | 56 | 638 | 280 | 0.270 | 0.270 |
| QCLA-Com_7 | 24 | 74 | 1,237 | 297 | 0.530 | 0.543 |
| QCLA-Mod_7 | 26 | 164 | 1,641 | 650 | 9.446 | 10.517 |
| QCLA-Adder_10 | 36 | 100 | 627 | 400 | 0.674 | 0.683 |
| Adder_8 | 24 | 160 | 1,419 | 614 | 1.968 | 2.018 |
| RC-Adder_6 | 14 | 44 | 322 | 124 | 0.080 | 0.090 |
| Mod-Red_21 | 11 | 60 | 392 | 192 | 0.110 | 0.119 |
| Mod-Mult_55 | 9 | 28 | 180 | 84 | 0.028 | 0.009 |
| Mod-Adder_1024 | 28 | 660 | 4,363 | 3,006 | 21.362 | 21.588 |
| Mod-Adder_1048576 | 58 | 4,832 | 40,318 | 23,999 | – | – |
| Cycle 17_3 | 35 | 1,366 | 9,172 | 6,694 | – | – |
| GF(2^4)-Mult | 12 | 28 | 263 | 180 | 0.063 | 0.061 |
| GF(2^5)-Mult | 15 | 36 | 393 | 286 | 0.143 | 0.141 |
| GF(2^6)-Mult | 18 | 44 | 559 | 402 | 0.279 | 0.291 |
| GF(2^7)-Mult | 21 | 52 | 731 | 560 | 0.501 | 0.527 |
| GF(2^8)-Mult | 24 | 60 | 975 | 712 | 0.837 | 0.881 |
| GF(2^9)-Mult | 27 | 68 | 1,179 | 918 | 1.304 | 1.369 |
| GF(2^10)-Mult | 30 | 76 | 1,475 | 1,110 | 1.958 | 0.327 |
| GF(2^16)-Mult | 48 | 124 | 3,694 | 2,832 | 16.028 | 17.539 |
| GF(2^32)-Mult | 96 | 252 | 14,259 | 11,296 | 430.883 | 436.521 |
| GF(2^64)-Mult | 192 | 508 | 55,408 | 45,120 | – | – |
| GF(2^128)-Mult | 384 | 1,020 | 231,318 | 180,352 | – | – |
| GF(2^256)-Mult | – | – | – | – | – | – |
| Hamming_15 (low) | 17 | 76 | 612 | 158 | 0.367 | 0.168 |
| Hamming_15 (med) | 17 | 184 | 1,251 | 762 | 1.390 | 1.430 |
| Hamming_15 (high) | 20 | 716 | 5,332 | 3,462 | 24.360 | 24.303 |
| HWB_6 | 7 | 52 | 369 | 180 | 0.200 | 0.207 |
| HWB_8 | 12 | 2,282 | 17,583 | 8,895 | – | – |
| HWB_10 | 16 | 10,480 | 88,230 | 42,500 | – | – |
| HWB_12 | 20 | 56,824 | 500,974 | 245,238 | – | – |
| QFT_4 | 5 | 84 | 218 | 136 | 0.084 | 0.089 |
| Λ_3(X) | 5 | 12 | 52 | 36 | 0.004 | 0.011 |
| Λ_3(X) (Barenco) | 5 | 12 | 66 | 44 | 0.007 | 0.046 |
| Λ_4(X) | 7 | 20 | 87 | 58 | 0.009 | 0.008 |
| Λ_4(X) (Barenco) | 7 | 20 | 127 | 84 | 0.014 | 0.024 |
| Λ_5(X) | 9 | 18 | 112 | 80 | 0.015 | 0.017 |
| Λ_5(X) (Barenco) | 9 | 28 | 160 | 124 | 0.030 | 0.031 |
| Λ_10(X) | 19 | 68 | 297 | 190 | 0.110 | 0.111 |
| Λ_10(X) (Barenco) | 19 | 68 | 493 | 324 | 0.219 | 0.210 |
Table 7.2: Results of verifying formally specified quantum algorithms.

| Algorithm | n | m | Clifford | T | Positive time (s) | Negative time (s) |
|---|---|---|---|---|---|---|
| Toffoli_50 | 97 | 190 | 855 | 665 | 1.084 | 1.064 |
| Toffoli_100 | 197 | 390 | 1,755 | 1,365 | 5.566 | 5.275 |
| Maslov_50 | 74 | 192 | 481 | 384 | 0.801 | 0.778 |
| Maslov_100 | 149 | 392 | 981 | 784 | 3.987 | 3.983 |
| Adder_8 | 40 | 56 | 334 | 196 | 0.142 | 0.143 |
| Adder_16 | 80 | 120 | 710 | 420 | 25.527 | 92.607 |
| QFT_16 | 16 | 16 | 256 | – | 1.250 | 1.335 |
| QFT_31 | 31 | 31 | 961 | – | 16.929 | 15.295 |
| Hidden Shift_{20,4} | 20 | 60 | 5,254 | 56 | 1.067 | 0.862 |
| Hidden Shift_{40,5} | 40 | 120 | 6,466 | 70 | 3.383 | 2.826 |
| Hidden Shift_{60,10} | 60 | 180 | 12,784 | 140 | 13.217 | 12.351 |
| Symbolic Shift_{20,4} | 40 | 60 | 5,296 | 56 | 1.859 | 1.849 |
| Symbolic Shift_{40,5} | 80 | 120 | 6,638 | 70 | 6.953 | 7.905 |
| Symbolic Shift_{60,10} | 120 | 180 | 12,804 | 140 | 35.583 | 29.614 |
7.4.2 Verifying quantum algorithms
To evaluate path integrals as a tool for functional specification as well as verification,
we implemented and verified several quantum algorithms (both without and with errors)
directly against specifications given as path integrals. Table 7.2 reports the results of our
experiments, and we describe the algorithms and implementations below.
Reversible functions We implemented and verified a number of known algorithms for
reversible functions. In particular, we performed verifications of Clifford+T implementations
of the generalized Toffoli and (out-of-place) addition functions,
Toffoli_n : |x_1 x_2 … x_n〉 ↦ |x_1 x_2 … (x_n ⊕ x_1 x_2 ⋯ x_{n−1})〉,
Adder_n : |x〉|y〉|0〉 ↦ |x〉|y〉|x + y〉.
The translation of the above specifications to path integrals required translating the outputs
into Boolean polynomials. For the Toffoli_n algorithm, this translation was trivial; for Adder_n,
a set of polynomials giving the bits of x + y was generated by performing binary addition
on symbolic vectors. In comparison to writing a reversible circuit or reversible addition
program, this translation – being strictly classical in nature and implementing a familiar
algorithm – was (empirically) easier to code, avoiding the difficulty of space management
and reversible cleanup. Moreover, being a classical program this translation can be tested
or otherwise verified by known methods.
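The translation of the adder specification described above can be sketched as follows. This is a minimal illustration, not the thesis's implementation: it assumes least-significant-bit-first ordering, an n-bit output (addition modulo 2^n), and a hypothetical representation of Boolean polynomials as sets of monomials; the names `adder_spec`, `mul` and `evaluate` are invented for the example.

```python
# A Boolean polynomial over F2 is a frozenset of monomials; a monomial is a
# frozenset of variable names (the empty monomial is the constant 1).
ZERO = frozenset()

def var(name):
    return frozenset([frozenset([name])])

def add(p, q):
    # Addition mod 2: monomials appearing in both polynomials cancel.
    return p ^ q

def mul(p, q):
    out = set()
    for m1 in p:
        for m2 in q:
            out ^= {m1 | m2}  # x*x = x, and repeated monomials cancel mod 2
    return frozenset(out)

def adder_spec(n):
    """Polynomials for the n output bits of x + y (mod 2^n), least significant first."""
    bits, carry = [], ZERO
    for i in range(n):
        x, y = var(f"x{i}"), var(f"y{i}")
        bits.append(add(add(x, y), carry))
        # Next carry is the majority of x, y and carry: xy + carry*(x + y).
        carry = add(mul(x, y), mul(carry, add(x, y)))
    return bits

def evaluate(p, env):
    """Evaluate a polynomial under an assignment of variable names to 0/1."""
    return sum(all(env[v] for v in m) for m in p) % 2
```

In this fully expanded form the carry polynomial roughly doubles in size at each bit position, which is consistent with the observation that the bitwise expansion of x + y becomes the bottleneck at larger sizes.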
For the circuit implementations, two versions of the n-bit Toffoli algorithm were used
– the standard decomposition into 2(n − 3) + 1 Toffoli gates and n − 3 ancillas, and the
Maslov decomposition [Mas16] using relative phase Toffolis and ⌈(n − 3)/2⌉ ancillas. For either
implementation we were able to verify Toffoli gates of up to 100 bits in just seconds.
For the addition circuit, we used a standard out-of-place ripple-carry adder which uses
n − 1 ancilla bits to store intermediate carry values and an additional n bit register to
store the output, before copying out and uncomputing. The resulting circuit uses 5n− 1
bits of space for an n bit adder, and 4(n − 1) Toffoli gates, which are then expanded to
the Clifford+T gate set. In this case, the size of the bitwise expansion of x + y made it
difficult to scale to practical implementation sizes (e.g., 32 bits), though smaller sizes such as 16 bits
were verifiable within a minute. Relational techniques – e.g., representing the outputs of a
path integral as “primed” variables along with a set of equations over the input and output
variables – may help to push verification of such functions to larger sizes.
The quantum Fourier transform To test our verification method against circuits using
higher-order phase gates, we verified an implementation of the quantum Fourier transform.
We use a circuit from [KLM07] together with a final qubit permutation correction and
verified it against the specification








The phase polynomial [x · y] was generated in the obvious way – by setting [x] = x_1 + 2x_2 +
… + 2^{n−1}x_n and multiplying the polynomials. In this case our implementation was able to
verify implementations up to 31 bits in size, after which integer overflow occurs due to our
implementation of dyadic arithmetic. Given that the 31 bit implementation took only 16
seconds to verify, it appears that with better methods for handling dyadic arithmetic much
larger sizes of the QFT are likely verifiable.
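The construction of [x · y] by multiplying binary expansions can be sketched as below. The specification itself is not reproduced here, so the normalization is an assumption: the sketch takes the phase to be [x · y]/2^n reduced modulo 1, as in the standard QFT. All function names are hypothetical, and exact rationals from Python's `fractions` module stand in for dyadic arithmetic.

```python
from fractions import Fraction

def binary_expansion(n):
    """[x] = x_1 + 2 x_2 + ... + 2^(n-1) x_n, as {variable index: coefficient}."""
    return {i: 2 ** (i - 1) for i in range(1, n + 1)}

def product_polynomial(n):
    """[x . y] = [x][y], as {(i, j): coefficient of the monomial x_i * y_j}."""
    bx, by = binary_expansion(n), binary_expansion(n)
    return {(i, j): a * b for i, a in bx.items() for j, b in by.items()}

def qft_phase(n):
    """Assumed phase polynomial [x . y] / 2^n with dyadic coefficients mod 1."""
    poly = {}
    for (i, j), c in product_polynomial(n).items():
        frac = Fraction(c, 2 ** n) % 1
        if frac != 0:  # terms that are integer multiples of a full turn vanish
            poly[(i, j)] = frac
    return poly

def evaluate(poly, x, y):
    """Evaluate on integers x, y, where bit i-1 holds the variable x_i (resp. y_i)."""
    total = sum(c for (i, j), c in poly.items()
                if (x >> (i - 1)) & 1 and (y >> (j - 1)) & 1)
    return total % 1
```

Because Python integers and `Fraction` are arbitrary precision, this sketch does not hit a fixed-width overflow; it says nothing about how the thesis's own dyadic arithmetic is implemented.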
The quantum hidden shift algorithm To test our framework on more general quantum
algorithms, we implemented a version of the quantum hidden shift algorithm [R1̈0] which
has been previously used to test quantum simulation algorithms [BG16]. In particular,
given oracles O_{f′} : |x〉 ↦ f(x + s)|x〉 and O_{f̃} : |x〉 ↦ f̃(x)|x〉 for the shifted and dual bent
functions f′, f̃ : F_2^n → {−1, +1} respectively, the circuit H^{⊗n} O_{f̃} H^{⊗n} O_{f′} H^{⊗n} is known [R1̈0]
















(b) Hidden shift with a symbolic shift.
Figure 7.2: Circuits for the Quantum Hidden Shift algorithm.
Following [BG16], we generated random instances of Maiorana–McFarland bent functions
by setting f′(x,y) = f((x,y) + s) = (−1)^{g(x)+xy} with dual f̃(x,y) = (−1)^{g(y)+xy} for a
random n/2-bit Boolean function g of degree 3. The circuit for f is generated by, for a given
number of alternations A, alternating between selecting 200 random Z and controlled-Z
gates, then a random doubly controlled-Z gate, expanded out to Clifford+T. We
implemented two versions of the algorithm, one where a concrete shift is given by a
randomly generated Boolean vector, and another where the shift is supplied symbolically
via a quantum register. In the former case we verify the circuit for a given shift s against
the specification |0〉 7→ |s〉, and in the latter case we verify the specification |0〉|s〉 7→ |s〉|s〉.
Figure 7.2 shows both circuits.
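For small instances, the concrete-shift specification |0〉 ↦ |s〉 can be checked by brute force. The sketch below is not the path-integral method of this chapter: it assumes f(x,y) = (−1)^{g(x)+x·y} with dual (−1)^{g(y)+x·y}, takes g as an arbitrary truth table rather than a random degree-3 function, and simulates the full state vector with an unnormalised Walsh-Hadamard transform. The names `wht` and `hidden_shift` are hypothetical.

```python
def wht(v):
    """Unnormalised Walsh-Hadamard transform: out[w] = sum_z v[z] * (-1)^(w.z)."""
    v = list(v)
    h = 1
    while h < len(v):
        for i in range(0, len(v), 2 * h):
            for j in range(i, i + h):
                v[j], v[j + h] = v[j] + v[j + h], v[j] - v[j + h]
        h *= 2
    return v

def dot(a, b):
    """Inner product mod 2 of two bit vectors packed into integers."""
    return bin(a & b).count("1") % 2

def hidden_shift(n, g, s):
    """Apply H, O_f', H, O_dual, H to |0...0> and return the exact amplitudes."""
    m = n // 2
    mask = (1 << m) - 1

    def f(z):        # f(x, y) = (-1)^(g(x) + x.y) with x the high half of z
        x, y = z >> m, z & mask
        return (-1) ** (g[x] ^ dot(x, y))

    def dual(z):     # dual(x, y) = (-1)^(g(y) + x.y)
        x, y = z >> m, z & mask
        return (-1) ** (g[y] ^ dot(x, y))

    state = wht([1] + [0] * (2 ** n - 1))
    state = [f(z ^ s) * a for z, a in enumerate(state)]   # shifted oracle
    state = wht(state)
    state = [dual(z) * a for z, a in enumerate(state)]    # dual oracle
    state = wht(state)
    scale = 2 ** (3 * m)  # three Hadamard layers on n = 2m qubits
    return [a // scale for a in state]
```

Working with unnormalised integer amplitudes keeps the check exact; the cost is exponential in n, which is exactly what the symbolic approach avoids.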
Our verification algorithm found a bug in our first implementation, which was a direct
implementation of the circuit given in [BG16]. After reimplementing the circuit based
on [R1̈0], we were able to verify both versions of the hidden shift algorithm for sizes
exceeding those simulated in [BG16] in only a fraction of the time (seconds versus
hours [BG16]). Our calculus further finds the correct output |s〉 or |s〉|s〉 even without
providing the specification, effectively simulating the algorithm rather than verifying it.
Moreover, our implementation is deterministic, whereas theirs is probabilistic and
only samples the output distribution rather than computing it outright. It is interesting to
note that their algorithm also uses a similar technique of effectively evaluating the circuit’s
phase polynomial – however, by including the T gate phases directly in the polynomial
and solving around them, rather than pushing them into state preparations, we save a
massive amount of time for this algorithm. An interesting question for future research is to




Equivalence checking Most of the previous work on functional verification – as opposed
to property checking – of quantum circuits has been from the perspective of strict equivalence
checking. Such works typically consider a circuit or program to be the specification of an
algorithm, and check that another circuit or program has the same semantics.
Some of the earliest work in this vein used SAT solvers to verify reversible circuits
of up to a hundred qubits and thousands of gates [WGMD09]. While this work is only
directly applicable to reversible circuits, Yamashita and Markov [YM10] later combined it
with a method of identifying classical transformations in quantum circuits to first rewrite
quantum circuits as reversible ones. Their method was able to verify circuits with up to 128
qubits but only after local simplifications, which due to the structure of their test circuits
effectively reduces the verification to at most a few gates. However, their work effectively
uses classical verification techniques on classical circuits to achieve good results, and hence
is only applicable for verifying classical oracles before they are expanded down to a quantum
circuit.
Diagrammatic calculi More recent work has used diagrammatic calculi such as the
ZX-calculus [CD11] to check equivalence of quantum circuits by reducing diagrams to the
identity. With the help of semi-automated proof assistants such as Quantomatic [KZ15]
and Globular [BKV18], correctness proofs of the Steane code and a particular colour
code have been developed [DL13,GD17]. However, due to the nature of diagrammatic
reasoning, automating this process has proved challenging [BGK+16], and moreover such
proofs frequently get “stuck” in local minima, where the diagram has to be expanded
before it can be further contracted. In contrast, our method has a very natural notion of
the complexity of an operator – the number of path variables – and all circuits we have
examined admit proofs which strictly reduce this parameter.
Formal proof Moving away from strict equivalence checking, recent work [RPZ17] has
formally proven the correctness of quantum circuits with respect to specifications as
superoperators using the Coq proof assistant. Their result shows classical proof assistants
can reasonably be used to prove correctness of quantum programs, though due to the
nature of the linear-algebraic model, their work was limited to small circuits, or circuits
with very simple recursive structure. Anecdotally [Pay18], specifying the operations in
the superoperator model proved to be difficult and error-prone, providing evidence that




In this chapter, we shift focus from formally verifying that a single circuit is correct to
verifying that all compiled circuits are correct. In contrast to the last chapter where the
correct specification of a circuit’s unitary action was one of the main questions, we now
take a program as the specification of a circuit – a program which itself may have bugs and
should be verified to certify the top-to-bottom correctness of a circuit. As per McCarthy
and Painter [MP67], this can be reframed as formally verifying that the compiler, which
takes a specification (e.g., a program) and compiles a circuit, is itself correct; that is, the
compiler always produces a circuit which correctly implements the program.
We implement and formally prove the correctness of a reversible circuit compiler called
ReVerC in the proof assistant F* [SHK+16]. The compiler itself builds on the Revs
language and compiler [PRS15], which compiled a subset of the classical, irreversible F#
language to reversible circuits, optimizing for space-efficiency by performing eager cleanup.
Their method makes use of a dependency graph to determine which bits may be eligible
to be cleaned eagerly without requiring re-computation, freeing up space without using
additional gates. However, like most real-world compilers Revs has bugs, as illustrated by
the F# program below to the left, compiled to the incorrect circuit on the right.
let f (a : bool array) =
    let b = Array.zeroCreate 2
    b.[0] <- a.[0]
    a.[1] <- a.[1] <> b.[0]
    b
let a = Array 2
let b = f a
a
−→
[Circuit diagram: four wires with inputs x0, x1, 0, 0 and outputs x0, x1, 0, 0; gate targets were lost in extraction.]
Var x, Bool b ∈ {0, 1} = {0, 1}, Nat i, j ∈ N, Loc l ∈ N
Val v ::= unit | l | reg l1 . . . ln | λx.t
Term t ::= let x = t1 in t2 | λx.t | (t1 t2) | t1; t2 | x | t1 ← t2 | b | t1 ⊕ t2 | t1 ∧ t2
| clean t | assert t | reg t1 . . . tn | t.[i] | t.[i..j] | append t1 t2 | rotate i t
Figure 8.1: Syntax of Revs.
The main objective of this line of research was to build a similarly optimizing compiler
which is also verified without too much loss in optimization. As formally verified compilers
are significantly more constrained in terms of difficulty of implementation, and in the
optimizations possible, some loss in optimization is expected (see, e.g., [Ler06] and the
unsound optimizations in the GNU C Compiler [GCC16]). Our compiler nonetheless
achieves similar bit counts while also being certifiably correct, up to the correctness of the
F* proof checker. In addition to formal verification of the compiler, our implementation
provides an assertion checker which can be used to formally verify the source program itself,
allowing end-to-end verification.
This work appears in [ARS17], and is implemented in ReVerC.
8.1 Languages
In this section we give a formal definition of Revs, as well as the intermediate and target
languages of the compiler.
8.1.1 The Source
The abstract syntax of Revs is presented in Figure 8.1. The core of the language is a
simple imperative language over Boolean and array (register) types. The language is further
extended with ML-style functional features, namely first-class functions and let definitions,
and a reversible domain-specific construct clean which asserts that its argument evaluates
to 0 and frees a bit.
1  fun a b ->
2    let carry_ex a b c = (a ∧ (b ⊕ c)) ⊕ (b ∧ c)
3    let result = Array.zeroCreate(n)
4    let mutable carry = false
6    result.[0] ← a.[0] ⊕ b.[0]
7    for i in 1 .. n-1 do
8      carry ← carry_ex a.[i-1] b.[i-1] carry
9      result.[i] ← a.[i] ⊕ b.[i] ⊕ carry
10   result
Figure 8.2: Implementation of an n-bit adder.
In addition to the basic syntax of Figure 8.1 we add the following derived operations:
¬t ≜ 1 ⊕ t,    t1 ∨ t2 ≜ (t1 ∧ t2) ⊕ (t1 ⊕ t2),
if t1 then t2 else t3 ≜ (t1 ∧ t2) ⊕ (¬t1 ∧ t3),
for x in i..j do t ≜ t[x ↦ i]; ⋯ ; t[x ↦ j].
Note that Revs has no dynamic control – i.e. control flow dependent on run-time values.
In particular, every Revs program can be transformed into a straight-line program. While
reversible instruction set architectures which allow control exist (e.g., [Vie95]), as our focus
is on quantum reversible circuits which are purely combinational, we can only compile
programs which can be statically unrolled.
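The derived operations above can be sanity-checked against ordinary Boolean logic, and the loop rule amounts to static unrolling. The following is a small sketch over bits 0 and 1; the helper names (`NOT`, `OR`, `ITE`, `unroll`) are hypothetical and not part of Revs.

```python
def NOT(t):
    return 1 ^ t

def OR(t1, t2):
    return (t1 & t2) ^ (t1 ^ t2)

def ITE(t1, t2, t3):
    # if t1 then t2 else t3
    return (t1 & t2) ^ (NOT(t1) & t3)

def unroll(body, i, j):
    """for x in i..j do t, as the straight-line sequence t[x -> i]; ...; t[x -> j]."""
    return [body(x) for x in range(i, j + 1)]

# Example: a loop body given as a function of the (static) loop index.
copy_loop = unroll(lambda x: f"r.[{x}] <- a.[{x}]", 0, 2)
```

Since the loop bounds are known at compile time, the unrolled list is the whole program; nothing remains that depends on run-time values.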
The ReVerC compiler uses F# as a meta-language to generate Revs code with
particular register sizes and indices, possibly computed by some more complex program.
Writing an F# program that generates Revs code is similar in effect to writing in a
hardware description language [Cla01]. We use F#’s quotations mechanism to achieve this
by writing Revs programs in quotations <@. . . @>. Note again that unlike QRAM-based
languages such as Quipper, our strictly combinational target architecture doesn’t allow
computations in the meta-language to depend on computations within Revs.
Example 8.1.1. Figure 8.2 gives an example of a carry-ripple adder written in Revs.
Naïvely compiling this implementation would result in a new bit being allocated for every
carry bit, as the assignment on line 8 is irreversible (note that carry_ex 1 1 0 = 1 =
carry_ex 1 1 1, hence the value of c cannot be uniquely computed given a, b and the
output). ReVerC reduces this space usage by automatically cleaning the old carry bit,
allowing it to be reused.
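As a classical reference point, the adder of Figure 8.2 can be transcribed directly into Python. This sketch assumes bit 0 is the least significant bit and that the result is taken modulo 2^n; `to_bits` and `from_bits` are hypothetical helpers added for testing.

```python
def carry_ex(a, b, c):
    # Majority of a, b, c, written with XOR and AND only.
    return (a & (b ^ c)) ^ (b & c)

def adder(a, b):
    """Ripple-carry addition on bit lists (least significant bit first)."""
    n = len(a)
    result = [0] * n
    carry = 0
    result[0] = a[0] ^ b[0]
    for i in range(1, n):
        carry = carry_ex(a[i - 1], b[i - 1], carry)  # overwrites the old carry
        result[i] = a[i] ^ b[i] ^ carry
    return result

def to_bits(x, n):
    return [(x >> i) & 1 for i in range(n)]

def from_bits(bits):
    return sum(b << i for i, b in enumerate(bits))
```

The overwrite of `carry` is harmless classically, but it is the step that cannot be done in place reversibly: two different old carry values can lead to the same new one.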
let s0 a =
    let a2 = rot 2 a
    let a13 = rot 13 a
    let a22 = rot 22 a
    let t = Array.zeroCreate 32
    for i in 0 .. 31 do
        t.[i] ← a2.[i] ⊕ a13.[i] ⊕ a22.[i]
    t
let s1 a =
    let a6 = rot 6 a
    let a11 = rot 11 a
    let a25 = rot 25 a
    let mutable t = Array.zeroCreate 32
    for i in 0 .. 31 do
        t.[i] ← a6.[i] ⊕ a11.[i] ⊕ a25.[i]
    t
let ma a b c =
    let t = Array.zeroCreate 32
    for i in 0 .. 31 do
        t.[i] ← (a.[i] ∧ (b.[i] ⊕ c.[i]))
                ⊕ (b.[i] ∧ c.[i])
    t
let ch e f g =
    let t = Array.zeroCreate 32
    for i in 0 .. 31 do
        t.[i] ← e.[i] ∧ f.[i] ∧ g.[i]
    t
fun k w x →
    let hash x =
        let a = x.[0..31] , b = x.[32..63]
            c = x.[64..95] , d = x.[96..127] ,
            e = x.[128..159] , f = x.[160..191] ,
            g = x.[192..223] , h = x.[224..255]
        (% modAdd 32) (ch e f g) h
        (% modAdd 32) (s0 a) h
        (% modAdd 32) w h
        (% modAdd 32) k h
        (% modAdd 32) h d
        (% modAdd 32) (ma a b c) h
        (% modAdd 32) (s1 e) h
    for i in 0 .. n - 1 do
        hash (rot 32*i x)
    x
Figure 8.3: Revs implementation of the SHA-256 algorithm with n rounds, using a
meta-function modAdd.
Example 8.1.2. The 256-bit Secure Hash Algorithm 2 (SHA-256) is a cryptographic hash
algorithm which performs a series of modular additions and functions on 32-bit chunks of a
256-bit hash value, before rotating the register and repeating for a number of rounds. In
the Revs implementation, shown in Figure 8.3, this is achieved by a function hash which
takes a length 256 register, then makes calls to modular adders with different 32-bit slices.
At the end of the round, the entire register is rotated 32 bits with the rotate command.
Semantics We designed the semantics of Revs with two goals in mind:
1. keep the semantics as close to the original implementation as possible, and
2. simplify the task of formal verification.
The result is a somewhat non-standard semantics that is nonetheless intuitive for the
programmer. Moreover, the particular semantics naturally enforces a style of programming
that results in efficient circuits and allows common design patterns to be optimized.
The big-step semantics of Revs is presented in Figure 8.4 as a relation ⇒ ⊆ Config×
Config on configuration-pairs – pairs of terms and Boolean-valued stores. A key feature of
the semantics is that Boolean, or bit values, are always allocated on the store. Specifically,
Boolean constants and expressions are modelled by allocating a new location on the store
to hold their values – as a result all Boolean values, including constants, are mutable.
The allocation of Boolean values on the store serves two main purposes: to give the
programmer fine-grain control over how many bits are allocated, and to provide a simple
and efficient model of registers – i.e. arrays of bits. Specifically, registers are modelled as
static length lists of bits. This allows the programmer to perform array-like operations such
as bit modifications (t1.[i]← t2) as well as list-like operations such as slicing (t.[i..j]) and
concatenation (append t1 t2) without copying out entire registers. We found that these were
the most common access patterns for arrays of bits in low-level bitwise code (e.g. arithmetic
and cryptographic implementations).
The semantics of ⊕ (Boolean XOR) and ∧ (Boolean AND) are also notable in that
they first reduce both arguments to locations, then retrieve their value. This results in
statements whose value may not be immediately apparent – e.g., x ⊕ (x ← y; y), which
under these semantics will always evaluate to 0. The benefit of this definition is that it
allows the compiler to perform important optimizations without a significant burden on
the programmer.
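The behaviour of x ⊕ (x ← y; y) can be reproduced with a tiny evaluator for a fragment of Revs. This is only an illustrative sketch: terms are encoded as hypothetical Python tuples, variables are resolved through an environment rather than by substitution, and only variables, assignment, sequencing and ⊕ are covered.

```python
def fresh(store):
    """Pick a location not yet in the store."""
    return max(store, default=-1) + 1

def evaluate(term, env, store):
    """Evaluate a term; returns a store location, or None for unit."""
    kind = term[0]
    if kind == "var":                       # x: the location bound to x
        return env[term[1]]
    if kind == "assign":                    # t1 <- t2
        l1 = evaluate(term[1], env, store)
        l2 = evaluate(term[2], env, store)
        store[l1] = store[l2]
        return None
    if kind == "seq":                       # t1; t2
        evaluate(term[1], env, store)
        return evaluate(term[2], env, store)
    if kind == "xor":                       # t1 XOR t2
        l1 = evaluate(term[1], env, store)  # reduce both sides to locations first
        l2 = evaluate(term[2], env, store)  # (this may overwrite store[l1])
        l3 = fresh(store)
        store[l3] = store[l1] ^ store[l2]   # values are read only at this point
        return l3
    raise ValueError(f"unknown term: {kind}")

# x XOR (x <- y; y)
EXAMPLE = ("xor", ("var", "x"),
           ("seq", ("assign", ("var", "x"), ("var", "y")), ("var", "y")))
```

By the time the values are read, the location of x already holds the value of y, so the result is 0 regardless of the initial store.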
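Store σ : N ⇀ {0, 1}
Config c ::= 〈t, σ〉

[let]
    〈t1, σ〉 ⇒ 〈v1, σ′〉    〈t2[x ↦ v1], σ′〉 ⇒ 〈v2, σ′′〉
    ------------------------------------------------------------
    〈let x = t1 in t2, σ〉 ⇒ 〈v2, σ′′〉

[refl]
    ------------------------------------------------------------
    〈v, σ〉 ⇒ 〈v, σ〉

[bexp]
    〈t1, σ〉 ⇒ 〈l1, σ′〉    〈t2, σ′〉 ⇒ 〈l2, σ′′〉    l3 ∉ dom(σ′′)
    ------------------------------------------------------------
    〈t1 ⋆ t2, σ〉 ⇒ 〈l3, σ′′[l3 ↦ σ′′(l1) ⋆ σ′′(l2)]〉

[bool]
    b ∈ {0, 1}    l ∉ dom(σ)
    ------------------------------------------------------------
    〈b, σ〉 ⇒ 〈l, σ[l ↦ b]〉

[app]
    〈t1, σ〉 ⇒ 〈λx.t′1, σ′〉    〈t2, σ′〉 ⇒ 〈v2, σ′′〉
    〈t′1[x ↦ v2], σ′′〉 ⇒ 〈v, σ′′′〉
    ------------------------------------------------------------
    〈(t1 t2), σ〉 ⇒ 〈v, σ′′′〉

[seq]
    〈t1, σ〉 ⇒ 〈unit, σ′〉    〈t2, σ′〉 ⇒ 〈v, σ′′〉
    ------------------------------------------------------------
    〈t1; t2, σ〉 ⇒ 〈v, σ′′〉

[assn]
    〈t1, σ〉 ⇒ 〈l1, σ′〉    〈t2, σ′〉 ⇒ 〈l2, σ′′〉
    ------------------------------------------------------------
    〈t1 ← t2, σ〉 ⇒ 〈unit, σ′′[l1 ↦ σ′′(l2)]〉

[append]
    〈t1, σ〉 ⇒ 〈reg l1 … lm, σ′〉    〈t2, σ′〉 ⇒ 〈reg lm+1 … ln, σ′′〉
    ------------------------------------------------------------
    〈append t1 t2, σ〉 ⇒ 〈reg l1 … ln, σ′′〉

[index]
    〈t, σ〉 ⇒ 〈reg l1 … ln, σ′〉    1 ≤ i ≤ n
    ------------------------------------------------------------
    〈t.[i], σ〉 ⇒ 〈li, σ′〉

[slice]
    〈t, σ〉 ⇒ 〈reg l1 … ln, σ′〉    1 ≤ i ≤ j ≤ n
    ------------------------------------------------------------
    〈t.[i..j], σ〉 ⇒ 〈reg li … lj, σ′〉

[rotate]
    〈t, σ〉 ⇒ 〈reg l1 … ln, σ′〉    1 < i < n
    ------------------------------------------------------------
    〈rotate t i, σ〉 ⇒ 〈reg li … li−1, σ′〉

[reg]
    〈t1, σ〉 ⇒ 〈l1, σ1〉    〈t2, σ〉 ⇒ 〈l2, σ2〉    …    〈tn, σ〉 ⇒ 〈ln, σn〉
    ------------------------------------------------------------
    〈reg t1 … tn, σ〉 ⇒ 〈reg l1 … ln, σn〉

[clean]
    〈t, σ〉 ⇒ 〈l, σ′〉    σ′(l) = 0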





[assert]
    〈t, σ〉 ⇒ 〈l, σ′〉    σ′(l) = 1
    ------------------------------------------------------------
    〈assert t, σ〉 ⇒ 〈unit, σ′〉
Figure 8.4: Operational semantics of Revs.
8.1.2 Boolean expressions
Our compiler uses XOR-AND Boolean expressions – single output classical circuits over
XOR and AND gates – as an intermediate language. Compilation from Boolean expressions
into reversible circuits forms the main “code generation” step of our compiler.
A Boolean expression is defined as an expression over Boolean constants, variable indices,
and logical ⊕ and ∧ operators. Explicitly, we define
BExp B ::= 0 | 1 | i ∈ N | B1 ⊕ B2 | B1 ∧ B2.
Note that we use the symbols 0, 1,⊕ and ∧ interchangeably with their interpretation in
{0, 1}. We use vars(B) to refer to the set of free variables in B.
We interpret a Boolean expression as a function from (total) Boolean-valued states
to Booleans. In particular, we define State = N → {0, 1} and denote the semantics of a
Boolean expression by ⟦B⟧ : State → {0, 1}. The formal definition of ⟦B⟧ is given by the
obvious homomorphism into F_2.
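A minimal sketch of the BExp grammar and its interpretation into F_2 is given below. The class and function names are hypothetical (ReVerC itself is written in F* and F#), and a state is modelled as any Python function from indices to bits.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Const:
    value: int          # 0 or 1

@dataclass(frozen=True)
class Var:
    index: int          # variable index i in N

@dataclass(frozen=True)
class Xor:
    left: object
    right: object

@dataclass(frozen=True)
class And:
    left: object
    right: object

def evaluate(b, state):
    """Interpret b under `state`, a function from indices to {0, 1}."""
    if isinstance(b, Const):
        return b.value
    if isinstance(b, Var):
        return state(b.index)
    if isinstance(b, Xor):
        return evaluate(b.left, state) ^ evaluate(b.right, state)
    if isinstance(b, And):
        return evaluate(b.left, state) & evaluate(b.right, state)
    raise TypeError(f"not a Boolean expression: {b!r}")

def free_vars(b):
    """vars(B): the set of variable indices occurring in b."""
    if isinstance(b, Const):
        return set()
    if isinstance(b, Var):
        return {b.index}
    return free_vars(b.left) | free_vars(b.right)
```

For example, `evaluate(Xor(And(Var(0), Var(1)), Const(1)), s)` computes 1 ⊕ (s(0) ∧ s(1)).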
8.1.3 Target architecture
ReVerC compiles to quantum circuits over {X, CNOT, TOF}, which were noted earlier
as being universal for reversible computing and hence are suitable for implementing classical functions
and oracles within quantum computations.
As we are working with strictly reversible circuits in this chapter, we use a slightly
different semantic model. To differentiate the reversible circuit language we target here
from the (unitary) circuit model used previously, we define
Circ C ::= − | NOT i | CNOT i j | Toffoli i j k | C1 ; C2.
Recall that all but the last bit in each gate is a control, and the final bit is denoted as the
target. In analogy to more classical programming concepts, the target of a gate is the only
bit modified or mutated by the gate. We use use(C), mod(C) and control(C) to denote
the set of bit indices that are used in, modified by, or used as a control in the circuit C,
respectively. A circuit is well-formed if no gate contains more than one reference to a bit –
i.e., the bits used in each controlled-NOT or Toffoli gate are distinct.
Similar to Boolean expressions, a circuit is interpreted as a function from states (maps
from indices to Boolean values) to states, given by applying each gate which updates the
previous state in order. The formal definition of the semantics of a reversible circuit C,
given by ⟦C⟧ : State → State, is straightforward:
⟦−⟧s = s
⟦NOT i⟧s = s[i ↦ ¬s(i)]
⟦CNOT i j⟧s = s[j ↦ s(i) ⊕ s(j)]
⟦Toffoli i j k⟧s = s[k ↦ (s(i) ∧ s(j)) ⊕ s(k)]
⟦C1 ; C2⟧s = (⟦C2⟧ ∘ ⟦C1⟧)s
We use s[x ↦ y] to denote the function that maps x to y, and all other inputs z to s(z); by
an abuse of notation we use [x ↦ y] to denote other substitutions as well.
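The circuit semantics and the sets use(C), mod(C) and control(C) translate almost line for line into code. In this sketch a circuit is a hypothetical Python list of gate tuples (the empty list playing the role of −), and a state is a dictionary in which missing bits are assumed to be 0.

```python
def apply_gate(gate, s):
    """Return the updated state s[target -> new value]; the input is not mutated."""
    s = dict(s)
    name, *bits = gate
    if name == "NOT":
        (i,) = bits
        s[i] = 1 ^ s.get(i, 0)
    elif name == "CNOT":
        i, j = bits
        s[j] = s.get(i, 0) ^ s.get(j, 0)
    elif name == "Toffoli":
        i, j, k = bits
        s[k] = (s.get(i, 0) & s.get(j, 0)) ^ s.get(k, 0)
    else:
        raise ValueError(f"unknown gate: {name}")
    return s

def run(circuit, s):
    """Sequential composition: apply the gates in order."""
    for gate in circuit:
        s = apply_gate(gate, s)
    return s

def use(circuit):
    return {b for _, *bits in circuit for b in bits}

def mod(circuit):
    return {gate[-1] for gate in circuit}          # the target is the last bit

def control(circuit):
    return {b for _, *bits in circuit for b in bits[:-1]}

def well_formed(circuit):
    return all(len(set(bits)) == len(bits) for _, *bits in circuit)
```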
8.2 Compilation
In this section we discuss the implementation of ReVerC. The compiler consists of around
4000 lines of code in a common subset of F* and F#, with a front-end to evaluate and
translate F# quotations into Revs expressions.
8.2.1 Boolean expression compilation
The core of ReVerC’s code generation is a compiler from Boolean expressions into reversible
circuits. We use a modification of the method employed in Revs.
As a Boolean expression is already in the form of an irreversible classical circuit, the
main job of the compiler is to allocate ancillas to store sub-expressions whenever necessary.
ReVerC does this by maintaining a (mutable) heap of ancillas ξ ∈ AncHeap called
an ancilla heap, which keeps track of the currently available (zero-valued) ancillary bits.
Cleaned ancillas (ancillas returned to the zero state) may be pushed back onto the heap,
and allocations return previously used ancillas if any are available, hence not using any
extra space.
The function compile-BExp, shown in pseudo-code below, takes a Boolean expression B
and a target bit i and then generates a reversible circuit computing i ⊕ B. Note that ancillas
are only allocated to store sub-expressions of ∧ expressions, since i ⊕ (B1 ⊕ B2) = (i ⊕ B1) ⊕ B2
and so we compile i ⊕ (B1 ⊕ B2) by first computing i′ = i ⊕ B1, followed by i′ ⊕ B2.
function compile-BExp(B, i, ξ)
    if B = 0 then −
    else if B = 1 then NOT i
    else if B = j then CNOT j i
    else if B = B1 ⊕ B2 then compile-BExp(B1, i, ξ) ; compile-BExp(B2, i, ξ)
    else // B = B1 ∧ B2
        a1 ← pop-min(ξ); C ← compile-BExp(B1, a1, ξ);
        a2 ← pop-min(ξ); C′ ← compile-BExp(B2, a2, ξ);




The definition of compile-BExp above leaves many garbage bits that take up space and
need to be cleaned before they can be re-used. To reclaim those bits, we clean temporary
expressions after every call to compile-BExp.
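A runnable sketch of compile-BExp follows. It makes several assumptions beyond the text: Boolean expressions are encoded as Python tuples, the ancilla heap is a small class backed by `heapq` that hands out fresh indices when empty, and the ∧ case is completed by emitting a Toffoli gate from the two ancillas onto the target. No cleanup is performed, so the ancillas are left as garbage.

```python
import heapq

class AncHeap:
    """Min-heap of free (zero-valued) ancillas; fresh indices start at `first`."""
    def __init__(self, first):
        self.free, self.next = [], first

    def pop_min(self):
        if self.free:
            return heapq.heappop(self.free)
        self.next += 1
        return self.next - 1

    def push(self, i):
        heapq.heappush(self.free, i)

def compile_bexp(b, i, heap):
    """Circuit (list of gates) XOR-ing the value of b into bit i.

    b is 0, 1, ("var", j), ("xor", b1, b2) or ("and", b1, b2).
    """
    if b == 0:
        return []
    if b == 1:
        return [("NOT", i)]
    tag = b[0]
    if tag == "var":
        return [("CNOT", b[1], i)]
    if tag == "xor":
        return compile_bexp(b[1], i, heap) + compile_bexp(b[2], i, heap)
    # tag == "and": assumed completion of the pseudo-code above
    a1 = heap.pop_min()
    c1 = compile_bexp(b[1], a1, heap)
    a2 = heap.pop_min()
    c2 = compile_bexp(b[2], a2, heap)
    return c1 + c2 + [("Toffoli", a1, a2, i)]

def run(circuit, state):
    """Simulate NOT/CNOT/Toffoli gates; the last bit of each gate is the target."""
    s = dict(state)
    for _, *bits in circuit:
        *controls, target = bits
        if all(s.get(c, 0) for c in controls):
            s[target] = 1 ^ s.get(target, 0)
    return s
```

Compiling `("xor", ("and", ("var", 0), ("xor", ("var", 1), 1)), ("var", 2))` onto bit 3 with `AncHeap(4)` allocates ancillas 4 and 5 for the two operands of ∧, and none for the outer ⊕.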
To facilitate the cleanup – or uncomputing – of a circuit, we define the restricted inverse
uncompute(C,A) of C with respect to a set of bits A ⊂ N by reversing the gates of C, and
removing any gates with a target in A. For instance:
uncompute(CNOT i j, A) = { −          if j ∈ A
                         { CNOT i j   otherwise
The other cases are defined similarly. Note that since uncompute produces a subsequence
of the original circuit C, no ancillary bits are used.
The restricted inverse allows the temporary values of a reversible computation to be
uncomputed without affecting any of the target bits. In particular, if C = compile-BExp(B, i),
then the circuit C ; uncompute(C, {i}) maps a state s to s[i ↦ ⟦B⟧s ⊕ s(i)],
allowing any newly allocated ancillas to be pushed back onto the heap. Intuitively, since
no bits contained in the set A are modified, the restricted inverse preserves their values;
that the restricted inverse uncomputes the values of the remaining bits is less obvious, but
it can be observed that if the computation doesn’t depend on the value of a bit in A, the
computation will be inverted. We formalize and prove this statement in Section 8.4.
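The restricted inverse is short enough to state as code. The sketch below uses a hypothetical gate-tuple encoding with the target as the last bit, relies on the fact that NOT, CNOT and Toffoli gates are self-inverse, and checks the cleanup property on one hand-written circuit computing bit 3 ⊕ ((x0 ∧ ¬x1) ⊕ x2) with ancillas 4 and 5.

```python
def uncompute(circuit, A):
    """Reverse the circuit, dropping every gate whose target lies in A."""
    return [gate for gate in reversed(circuit) if gate[-1] not in A]

def run(circuit, state):
    """Simulate NOT/CNOT/Toffoli gates; the last bit of each gate is the target."""
    s = dict(state)
    for _, *bits in circuit:
        *controls, target = bits
        if all(s.get(c, 0) for c in controls):
            s[target] = 1 ^ s.get(target, 0)
    return s

# Ancilla 4 holds x0, ancilla 5 holds NOT x1, bit 3 is the target.
C = [("CNOT", 0, 4), ("CNOT", 1, 5), ("NOT", 5),
     ("Toffoli", 4, 5, 3), ("CNOT", 2, 3)]

# Compute, then undo everything except the writes to the target.
CLEANED = C + uncompute(C, {3})
```

Here `uncompute(C, {3})` is `[("NOT", 5), ("CNOT", 1, 5), ("CNOT", 0, 4)]`, a subsequence of the reversed circuit, so it needs no additional bits and returns ancillas 4 and 5 to zero.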
8.2.2 Revs compilation
In studying the Revs compiler, we observed that most of what the compiler was doing was
evaluating the non-Boolean parts of the program – effectively bookkeeping for registers –
only generating circuits for a small kernel of cases. As a result, transformations to different
Boolean representations (e.g., circuits, dependence graphs [PRS15]) and the interpreter
itself reused significant portions of this bookkeeping code. To make use of this redundancy
to simplify both writing and verifying the compiler, we designed ReVerC as a partial
evaluator parameterized by an abstract machine for evaluating Boolean expressions. As a
side effect, we arrive at a unique model for circuit compilation similar to staged computation
(see, e.g., [JGS93]).
ReVerC works by evaluating the program with an abstract machine providing mechanisms
for initializing and assigning locations on the store to Boolean expressions. We call
an instantiation of this abstract machine an interpretation I, which consists of a domain
D equipped with two operators:
assign : D × N × BExp → D
eval : D × N × State ⇀ {0, 1}.
We typically denote an element of an interpretation domain D by σ. A sequence of
assignments in an interpretation builds a Boolean computation or circuit within a specific
model (i.e., classical, reversible, different gate sets) which may be simulated on an initial
state with the eval function – effectively an operational semantics of the model. Practically
speaking, an element of D abstracts the store in Figure 8.4 and allows delayed computation
or additional processing of the Boolean expression stored in a cell, which may be mapped
into reversible circuits immediately or after the entire program has been evaluated. We
give some examples of interpretations below.
Example 8.2.1. The standard interpretation I_standard has domain Store = N ⇀ {0, 1},
together with the operations
assign_standard(σ, l, B) = σ[l ↦ ⟦B⟧σ]
eval_standard(σ, l, s) = σ(l).
Partial evaluation over the standard interpretation coincides exactly with the operational
semantics of Revs.
Example 8.2.2. The reversible circuit interpretation I_circuit has domain D_circuit = (N ⇀
N) × Circ × AncHeap. In particular, given (ρ, C, ξ) ∈ D_circuit, ρ maps heap locations
to bits in C, and ξ is an ancilla heap. Assignment and evaluation are further defined as
follows:
assign_circuit((ρ, C, ξ), l, B) = (ρ[l ↦ i], C ; C′, ξ)
    where i = pop-min(ξ),
          (C′, ξ′) = compile-BExp(B[l′ ∈ vars(B) ↦ ρ(l′)], i, ξ)
eval_circuit((ρ, C, ξ), l, s) = (⟦C⟧s)(ρ(l))
Interpreting a program with Icircuit builds a reversible circuit executing the program,
together with a mapping from heap locations to bits. Since the circuit is required to be
reversible, when a location is overwritten, a new ancilla i is allocated and the expression
B ⊕ i is compiled into a circuit. Evaluation amounts to running the circuit on an initial
state, then retrieving the value at the bit associated with a heap location.
Given an interpretation I with domain D, we define the set of I -configurations as
ConfigI = Term×D – that is, I -configurations are pairs of programs and elements of
D which function as an abstraction of the heap. The relation
⇒I ⊆ ConfigI ×ConfigI
gives the operational semantics of Revs over the interpretation I , and corresponds to
the definition of ⇒ (Figure 8.4) with all heap updates replaced with assign. Note that
dom(σ) refers to the set of locations on which eval is defined. To compile a program term t,
ReVerC evaluates t over a particular interpretation I (for instance, the reversible circuit
interpretation) and an initial heap σ ∈ D according to the semantic relation ⇒I . In this
way, evaluating a program and compiling a program to a circuit look almost identical. This
greatly simplifies the problem of verification, as we will see in the next section.
ReVerC supports three modes of compilation, defined by giving interpretations: a
default mode, an eagerly cleaned mode, and a “crush” mode. The default mode evaluates the
program using the circuit interpretation, and simply returns the circuit and output bit(s),
while the eager cleanup mode operates analogously, using instead the garbage-collected
interpretation defined below in Section 8.2.3. The crush mode interprets a program as a list
of Boolean expressions over free variables, which, while unscalable, allows highly optimized
versions of small circuits to be compiled, a common practice in circuit synthesis. We omit
the details of the Boolean expression interpretation.
8.2.3 Eager cleanup
It was previously noted that the circuit interpretation allocates a new ancilla on every
assignment to a location, due to the requirement of reversibility. Apart from ReVerC’s
additional optimization passes, this is effectively the Bennett method, and hence uses a
large amount of extra space. One way to keep the space usage from continually expanding
as assignments are made is to clean the old bit as soon as possible and then reuse it,
rather than wait until the end of the computation. Here we develop an interpretation that
performs this automatic, eager cleanup by augmenting the circuit interpretation with a
cleanup expression for each bit. Our method is based on the eager cleanup of [PRS15], and
was intended as a more easily verifiable alternative to mutable dependency diagrams.
The eager cleanup interpretation IGC has domain
D = (N⇀ N)×Circ×AncHeap× (N⇀ BExp),
where given (ρ, C, ξ, κ) ∈ D, ρ, C and ξ are as in the circuit interpretation. The partial
function κ maps individual bits to a Boolean expression over the bits of C which can be
used to return the bit to its initial state, called the cleanup expression. Specifically, we
have the following property:
∀i ∈ cod(ρ), s′(i) ⊕ ⟦κ(i)⟧s′ = s(i) where s′ = ⟦C⟧s.
Intuitively, any bit i can then be cleaned by simply computing i ↦ i ⊕ κ(i), which in turn
can be done by calling compile-BExp(κ(i), i).
Two problems remain, however. In general it may be the case that a bit cannot be
cleaned without affecting the value of other bits, as it might result in a loss of information –
in the context of cleanup expressions, this occurs exactly when a bit’s cleanup expression
contains an irreducible self-reference. In particular, if i ∈ vars(B), then compile-BExp(B, i)
does not compile a circuit computing i ⊕ B and hence will not clean the target bit correctly.
In the case when a garbage bit contains a self-reference in its cleanup expression that
cannot be eliminated by Boolean simplification, ReVerC simply ignores the bit and performs
a final round of cleanup at the end.
The second problem arises when a bit’s cleanup expression references another bit that
has itself since been cleaned or otherwise modified. In this case, the modification of the
latter bit has invalidated the correctness property for the former bit. To ensure that the
above relation always holds, whenever a bit is modified – corresponding to an XOR of the
bit, i, with a Boolean expression B – all instances of bit i in every cleanup expression are
replaced with i ⊕ B. Specifically we observe that, if s′(i) = s(i) ⊕ ⟦B⟧s, then
s′(i) ⊕ ⟦B⟧s = s(i) ⊕ ⟦B⟧s ⊕ ⟦B⟧s = s(i).
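The effect of this substitution can be checked exhaustively on a small instance. The sketch below is illustrative only: the tuple encoding of expressions and the helper names are assumptions, and the three-bit scenario is made up for the example. Bit 2 is computed from bits 0 and 1, bit 0 is then modified in place, and the stale and repaired cleanup expressions for bit 2 are compared.

```python
from itertools import product

def eval_bexp(expr, state):
    """Evaluate ("const", b) | ("var", i) | ("xor", e1, e2) | ("and", e1, e2)."""
    tag = expr[0]
    if tag == "const":
        return expr[1]
    if tag == "var":
        return state[expr[1]]
    left, right = eval_bexp(expr[1], state), eval_bexp(expr[2], state)
    return left ^ right if tag == "xor" else left & right

def substitute_bit(expr, bit, replacement):
    """Replace every occurrence of ("var", bit) in expr by replacement."""
    tag = expr[0]
    if tag == "const":
        return expr
    if tag == "var":
        return replacement if expr[1] == bit else expr
    return (tag, substitute_bit(expr[1], bit, replacement),
            substitute_bit(expr[2], bit, replacement))

# Bit 2 was computed as bit2 ^= (bit0 AND bit1), so its cleanup expression is:
kappa_2 = ("and", ("var", 0), ("var", 1))
# Bit 0 is later modified in place: bit0 ^= B, with B = bit1.
B = ("var", 1)
repaired = substitute_bit(kappa_2, 0, ("xor", ("var", 0), B))

stale_ok, repaired_ok = True, True
for s0, s1, s2 in product((0, 1), repeat=3):
    s = {0: s0, 1: s1, 2: s2}
    after = dict(s)
    after[2] ^= after[0] & after[1]       # compute bit 2
    after[0] ^= eval_bexp(B, after)       # later modification of bit 0
    stale_ok &= (after[2] ^ eval_bexp(kappa_2, after)) == s[2]
    repaired_ok &= (after[2] ^ eval_bexp(repaired, after)) == s[2]
print(stale_ok, repaired_ok)  # prints: False True
```

The stale expression fails to restore bit 2 on some inputs, while the expression obtained by replacing bit 0 with bit 0 ⊕ B restores it on all of them.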
The function clean, defined below, performs the cleanup of a bit i if possible, and
validates all cleanup expressions in a given element of D:
function clean((ρ, C, ξ, κ), i)
    if i ∈ vars(κ(i)) then return (ρ, C, ξ, κ)
    else
        C′ ← compile-BExp(κ(i), i, ξ)
        if i is an ancilla then insert(i, ξ)
        κ′ ← κ[i′ ∈ dom(κ) ↦ κ(i′)[i ↦ i ⊕ κ(i)]]
        return (ρ, C ; C′, ξ, κ′)
    end if
end function
Assignment and evaluation are defined in the eager cleanup interpretation as follows.
Both are effectively the same as in the circuit interpretation, except the assignment operator
calls clean on the previous bit mapped to l.
assignGC((ρ, C, ξ, κ), l, B) = clean((ρ[l ↦ i], C ; C′, ξ, κ[i ↦ B′]), i)
    where i = pop-min(ξ),
          B′ = B[l′ ∈ vars(B) ↦ ρ(l′)],
          C′ = compile-BExp(B′, i, ξ)
evalGC((ρ, C, ξ, κ), l, s) = (⟦C⟧s)(ρ(l))
The eager cleanup interpretation coincides with a reversible analogue of garbage collection
for a very specific case when the number of references to a heap location (or in our case, a
bit) is trivially zero.
Other optimizations. During the course of compilation it is frequently the case that more
ancillas are allocated than are actually needed, due to the program structure. For instance,
when compiling the expression i← B, if B can be factored as i⊕B′ the assignment may
be performed reversibly rather than allocating a new bit to store the value of B. Likewise
if i is provably in the 0 or 1 state, the assignment may be performed reversibly without
allocating a new bit. Our implementation identifies some of these common patterns, as
well as general Boolean expression simplifications, to further minimize the space usage of
compiled circuits. All such optimizations in ReVerC have been formally verified.
8.3 Parameter inference
While the definition of ReVerC as a partial evaluator streamlines both development
and verification, there is an inherent disconnect between the treatment of a (top-level)
function expression by the interpreter and by the compiler, in that we want the compiler to
evaluate the function body. Instead of formally defining a two-stage semantics for Revs
(e.g., [JGS93]) we took the approach of applying a program transformation, whereby the
function being compiled is evaluated on special heap locations representing the parameters.
This creates a further problem in that the compiler needs to first determine the size of
each parameter; to solve this problem, ReVerC performs a static analysis which we call
parameter inference, as we frame it as a dependent-type inference.
We note that many other solutions are possible: the original Revs compiler for instance
had programmers allocate inputs manually rather than write functions. This led to
unnatural-looking programs and semantics which were hard to reason about formally. On
the other hand, the problem could be avoided by delaying bit allocation until after the
circuit is compiled. We opted for a type inference approach as it simplified verification.
Type system
The type system for parameter inference includes three base types – the unit type, Booleans,
and fixed-length registers – as well as the function type. As expected, indexing registers
beyond their length causes a type error. To regain some polymorphism and flexibility
the type system allows (structural) subtyping, in particular so that registers with greater
size may be used anywhere a register with lesser size is required. Type inference in such
systems is generally considered a difficult problem – see, e.g., [PZ04], which proves
NP-completeness of the problem for a similar type system involving record concatenation. While
many attempts have been made [Mit91,EST95,TS96,Pot96,OSW99], no such algorithms
are currently in common use. Given the simplicity of our type system (in particular, the
lack of non-subtype polymorphism), we have found that a relatively simple inference algorithm
based on constraint generation and bound maps is effective in practice.
Figure 8.5 summarizes the algorithmic rules of our type system, which specify a set of
constraints that any valid typing must satisfy. Constraints are given over a language of
type and integer expressions. Type expressions include the basic types of ReVerC, as
well as variables representing types and registers over integer expressions. Integer expressions
are linear expressions over integer-valued variables and constants. Equality and order
relations are defined on type expressions as expected, while the integer expression ordering
corresponds to the reverse order on integers (≥) – this definition is used so that the
maximal solution to an integer constraint gives the smallest register size. Constraints may
be equalities or order relations between expressions of the same kind.
A type substitution Θ is a mapping from type variables to closed type expressions and
integer variables to integers. By an abuse of notation we denote by Θ(T ) the type T with
all variables replaced by their values in Θ. Additionally, we say that Θ satisfies the set
of constraints C if every constraint is true after applying the substitutions. The relation
Γ ⊢ t : T ↓ C means t can be given type Θ(T ) for any type substitution Θ satisfying C.
Constraint solving
Algorithm 8.1 Computation of bound sets.
C is a set of constraints
θ is a set of (possibly open at one end) ranges for each variable
while c ∈ C do
    if c is T1 = T2 or I1 = I2 then use unification
    else if c is T ⊑ T or I ⊑ I then computeBounds(C \ {c}, θ)
    else if c is X ⊑ Unit or Unit ⊑ X then
        computeBounds(C \ {c}, θ ∪ {X ↦ [Unit, Unit]})
    else if c is X ⊑ Bool or Bool ⊑ X then
        computeBounds(C \ {c}, θ ∪ {X ↦ [Bool, Bool]})
    else if c is X ⊑ Register I then computeBounds(C \ {c} ∪ {I ⊑ x}, θ ∪ θ′)
        where θ′ = {X ↦ [Register x, Register x], x ↦ [I, ∞]}
    else if c is Register I ⊑ X then computeBounds(C \ {c} ∪ {x ⊑ I}, θ ∪ θ′)
        where θ′ = {X ↦ [Register x, Register x], x ↦ [0, I]}
    else if c is X ⊑ T1 → T2 then computeBounds(C \ {c} ∪ {T1 ⊑ T′1, T′2 ⊑ T2}, θ ∪ θ′)
        where θ′ = {X ↦ [T′1 → T′2, T′1 → T′2]}
    else if c is T1 → T2 ⊑ X then computeBounds(C \ {c} ∪ {T′1 ⊑ T1, T2 ⊑ T′2}, θ ∪ θ′)
        where θ′ = {X ↦ [T′1 → T′2, T′1 → T′2]}
    else if c is X1 ⊑ X2 then computeBounds(C \ {c}, θ ∪ {X1 ↦ [−, X2]})
    else if c is x ⊑ I then computeBounds(C \ {c}, θ ∪ {x ↦ [−, I]})




IExp   I ::= i ∈ N | x | I1 ± I2
TExp   T ::= X | Unit | Bool | Register I | T1 → T2
Const  c ::= I1 = I2 | I1 ⊑ I2 | T1 = T2 | T1 ⊑ T2

[C-let]
    Γ ⊢ t1 : T1 ↓ C1    Γ, x : T1 ⊢ t2 : T2 ↓ C2
    ----------------------------------------------
    Γ ⊢ let x = t1 in t2 : T2 ↓ C1 ∪ C2

[C-var]
    (x : T ) ∈ Γ
    ----------------
    Γ ⊢ x : T ↓ ∅

[C-lambda]
    Γ, x : X ⊢ t : T ↓ C    X fresh
    --------------------------------
    Γ ⊢ λx.t : X → T ↓ C

[C-app]
    Γ ⊢ t1 : T1 ↓ C1    Γ ⊢ t2 : T2 ↓ C2
    C3 = {T1 = X1 → X2, X2 ⊑ T2}    X1, X2 fresh
    ---------------------------------------------
    Γ ⊢ (t1 t2) : X2 ↓ C1 ∪ C2 ∪ C3

[C-seq]
    Γ ⊢ t1 : T1 ↓ C1    Γ ⊢ t2 : T2 ↓ C2
    -----------------------------------------
    Γ ⊢ t1; t2 : T2 ↓ C1 ∪ C2 ∪ {T1 = Unit}

[C-bexp]
    Γ ⊢ t1 : T1 ↓ C1    Γ ⊢ t2 : T2 ↓ C2
    ------------------------------------------------------
    Γ ⊢ t1 ⋆ t2 : Bool ↓ C1 ∪ C2 ∪ {T1 = Bool, T2 = Bool}

[C-true]
    ------------------
    Γ ⊢ 1 : Bool ↓ ∅

[C-false]
    ------------------
    Γ ⊢ 0 : Bool ↓ ∅

[C-assn]
    Γ ⊢ t1 : T1 ↓ C1    Γ ⊢ t2 : T2 ↓ C2
    ------------------------------------------------------
    Γ ⊢ t1 ← t2 : Unit ↓ C1 ∪ C2 ∪ {T1 = Bool, T2 = Bool}

[C-append]
    Γ ⊢ t1 : T1 ↓ C1    Γ ⊢ t2 : T2 ↓ C2
    C3 = {T1 = Register x1, T2 = Register x2, x3 = x1 + x2}    x1, x2, x3 fresh
    ---------------------------------------------------------------------------
    Γ ⊢ append t1 t2 : Register x3 ↓ C1 ∪ C2 ∪ C3

[C-index]
    Γ ⊢ t : T ↓ C1    C2 = {T = Register x, i ⊑ x}    x fresh
    ----------------------------------------------------------
    Γ ⊢ t.[i] : Bool ↓ C1 ∪ C2

[C-slice]
    Γ ⊢ t : T ↓ C1    C2 = {T = Register x, i ⊑ j, j ⊑ x}    x fresh
    -----------------------------------------------------------------
    Γ ⊢ t.[i..j] : Register j − i + 1 ↓ C1 ∪ C2

[C-reg]
    Γ ⊢ t1 : T1 ↓ C1    ...    Γ ⊢ tn : Tn ↓ Cn
    Cn+1 = {Ti = Bool | ∀1 ≤ i ≤ n}

    Γ ⊢ t : T ↓ C1    C2 = {T = Register x, i ⊑ x}    x fresh
    ----------------------------------------------------------
    Γ ⊢ rotate t i : Register x ↓ C1 ∪ C2

[C-clean]
    Γ ⊢ t : T ↓ C
    --------------------------------------
    Γ ⊢ clean t : Unit ↓ C ∪ {T = Bool}

[C-assert]
    Γ ⊢ t : T ↓ C
    ---------------------------------------
    Γ ⊢ assert t : Unit ↓ C ∪ {T = Bool}

Figure 8.5: Constraint typing rules

The constraint solving algorithm finds a type substitution satisfying a set of constraints
C. We use a combination of unification (see, e.g., [DM82]) and sets of upper and lower
bounds for variables, similar to the method used in [TS96]. The function computeBounds,
shown in pseudo-code in Algorithm 8.1, iterates through a set of constraints performing
unification and computing the closure of the constraints wherever possible. Bounds – both
subtype and equality – on variables are translated to a range of type or integer expressions,
possibly open at one end, and the algorithm maintains a set of such bounds for each variable.
To reduce complex linear arithmetic constraints to a variable bound we use a normalization
procedure to write the constraint with a single positive polarity variable on one side.
After all constraints have been translated to bounds we iterate through the variables,
simplifying and checking the set of upper and lower bounds. Any variable with no unassigned
variable in its upper and lower bound sets is assigned the maximum value in the intersection
of all its bounds; this process repeats until no variables are left unassigned. It can be
observed that the algorithm terminates, as the syntax of Revs does not allow circular
systems of constraints to be generated.
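The final assignment phase can be illustrated with a small Python sketch. It makes several simplifying assumptions that are not from the source: only integer (register length) variables are handled, only lower bounds are kept, each bound has the hypothetical form "variable ≥ sum of other variables + constant", and the usual integer order is used, so the "maximum in the reversed order" becomes the smallest length meeting every bound.

```python
def solve_lengths(bounds):
    """bounds[v] is a list of (deps, offset) pairs, each read as
    v >= sum(deps) + offset. Returns the smallest satisfying assignment,
    assuming the dependencies between variables are acyclic."""
    assignment = {}
    pending = set(bounds)
    while pending:
        # A variable is ready once every variable in its bounds is assigned.
        ready = [v for v in pending
                 if all(dep in assignment for deps, _ in bounds[v] for dep in deps)]
        if not ready:
            raise ValueError("circular or undefined dependencies")
        for v in ready:
            candidates = [sum(assignment[d] for d in deps) + offset
                          for deps, offset in bounds[v]]
            assignment[v] = max(candidates, default=0)
            pending.remove(v)
    return assignment

# Hypothetical constraints: x1 must hold at least 4 and at least 8 bits,
# x2 at least 2 bits, and x3 must be able to hold x1 and x2 appended.
bounds = {
    "x1": [((), 4), ((), 8)],
    "x2": [((), 2)],
    "x3": [(("x1", "x2"), 0)],
}
print(sorted(solve_lengths(bounds).items()))
# prints: [('x1', 8), ('x2', 2), ('x3', 10)]
```

The loop makes progress on every pass exactly when the dependency graph is acyclic, which mirrors the termination argument given above.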
8.4 Verification
In this section we describe the formal verification of ReVerC and give the major theorems
proven. All theorems given in this section have been formally specified and proven using
the F* compiler [SHK+16]. We first give theorems about our Boolean expression compiler,
then use these to prove properties about whole program compilation. The total verification
of the ReVerC core’s approximately 2000 lines of code comprises around 2200 lines of F*
code, and took just over 1 person-month. This relatively low-cost verification is a testament
to the increasing ease with which formal verification can be carried out using modern
proof assistants. Additionally, the verification relies on only 11 unproven axioms regarding
properties of lookup tables and sets, such as the fact that a successful lookup is in the
codomain of a lookup table.
We give the mathematical intuition behind the formal F* proofs, as it is much more
enlightening than the primarily engineering-driven proof code.
8.4.1 Boolean expression compilation
Correctness. Below is the main theorem establishing the correctness of the function
compile-BExp with respect to the semantics of reversible circuits and Boolean expressions.
It states that if the variables of B, the bits on the ancilla heap and the target are non-
overlapping, and if the ancilla bits are 0-valued, then the circuit computes the expression
i⊕B.
Theorem 8.4.1. Let B be a Boolean expression, ξ be an ancilla heap, i ∈ N, C ∈ Circ
and s be a map from bits to Boolean values. Suppose vars(B), ξ and {i} are all disjoint
and s(j) = 0 for all j ∈ ξ. Then
(⟦compile-BExp(B, i, ξ)⟧s)(i) = s(i) ⊕ ⟦B⟧s.
Cleanup. As remarked earlier, a crucial part of reversible computing is cleaning ancillas
both to reduce space usage, and in quantum computing to prevent entangled qubits from
influencing the computation. Moreover, the correctness of our cleanup is actually necessary
to prove correctness of the compiler, as the compiler re-uses cleaned ancillas on the heap,
potentially interfering with the precondition of Theorem 8.4.1. We use the following
lemma to establish the correctness of our cleanup method, stating that the uncompute
transformation reverses all changes on bits not in the target set under the condition that
no bits in the target set are used as controls.
Lemma 8.4.2. Let C be a well-formed reversible circuit and A ⊂ N be some set of bits. If
A ∩ control(C) = ∅ then for all states s, with s′ = ⟦C ; uncompute(C, A)⟧s, and any i ∉ A,
s(i) = s′(i).
Lemma 8.4.2 largely relies on the following important lemma stating in effect that the
action of a circuit is determined by the values of the bits used as controls:
Lemma 8.4.3. Let A ⊂ N and s, s′ be states such that for all i ∈ A, s(i) = s′(i). If C is a
reversible circuit where control(C) ⊆ A, then
(⟦C⟧s)(i) = (⟦C⟧s′)(i)
for all i ∈ A.
Lemma 8.4.2, together with the fact that compile-BExp produces a well-formed circuit
under disjointness constraints, gives us our cleanup theorem below that Boolean expression
compilation with cleanup correctly reverses the changes to every bit except the target.
Theorem 8.4.4. Let B be a Boolean expression, ξ be an ancilla heap and i ∈ N such that
vars(B), ξ and {i} are all disjoint. Suppose compile-BExp(B, i, ξ) = C. Then for all
j ≠ i and states s we have
(⟦C ; uncompute(C, {i})⟧s)(j) = s(j).
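The compute/uncompute pattern behind Lemma 8.4.2 and Theorem 8.4.4 can be checked exhaustively on a toy circuit. The sketch below assumes that uncompute(C, A) reverses C while dropping every gate whose target lies in A; this reading is inferred from the description above and may differ in detail from the formal definition. The gate representation and the example circuit are likewise assumptions.

```python
from itertools import product

def run(circuit, state):
    """Simulate multiply-controlled NOT gates, given as (controls, target) pairs."""
    state = dict(state)
    for controls, target in circuit:
        if all(state[c] for c in controls):
            state[target] ^= 1
    return state

def uncompute(circuit, targets):
    """Assumed definition: reverse the circuit, dropping gates acting on `targets`."""
    return [gate for gate in reversed(circuit) if gate[1] not in targets]

# Toy circuit for bit4 ^= (bit0 AND bit1) XOR bit2, using bit 3 as an ancilla.
circuit = [((0, 1), 3), ((3,), 4), ((2,), 4)]
cleaned = circuit + uncompute(circuit, {4})

# No gate is controlled on bit 4, so every other bit should be restored.
restored = all(
    run(cleaned, dict(enumerate(bits)))[j] == bits[j]
    for bits in product((0, 1), repeat=5)
    for j in range(4)
)
print(restored)  # prints: True
```

Bits 0 through 3 are restored for every initial state, including states where the ancilla is not zero; the zero-ancilla precondition is only needed for the target to hold the intended value.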
8.4.2 Revs compilation
It was noted in Section 8.2 that the design of ReVerC as a partial evaluator simplifies
proving correctness. We expand on that point now, and in particular show that if a relation
between elements of two interpretations is preserved by assignment, then the evaluator also
preserves the relation. We state this formally in the theorem below.
Theorem 8.4.5. Let I1,I2 be interpretations and suppose whenever (σ1, σ2) ∈ R for some
relation R ⊆ I1 ×I2,
(assign1(σ1, l, B), assign2(σ2, l, B)) ∈ R
for any l, B. Then for any term t, if 〈t, σ1〉 ⇒I1 〈v1, σ′1〉 and 〈t, σ2〉 ⇒I2 〈v2, σ′2〉, then
v1 = v2 and (σ′1, σ′2) ∈ R.
Theorem 8.4.5 lifts properties about interpretations to properties of evaluation over
those abstract machines – in particular, we only need to establish that assignment is correct
for an interpretation to establish correctness of the corresponding evaluator/compiler. In
practice we found this significantly reduces boilerplate proof code that is otherwise currently
necessary in F* due to a lack of automated induction.
Given two interpretations I ,I ′, we say elements σ and σ′ of I and I ′ are observa-
tionally equivalent with respect to a supplied set of initial values s ∈ State if for all i ∈ N,
evalI (σ, i, s) = evalI ′(σ′, i, s). We say σ ∼s σ′ if σ and σ′ are observationally equivalent
with respect to s. As observational equivalence of two domain elements σ, σ′ implies that
any location in scope has the same valuation in either interpretation, it suffices to show
that any compiled circuit is observationally equivalent to the standard interpretation. The
following lemmas are used along with Theorem 8.4.5 to establish this fact for the default
and eager-cleanup interpretations – a similar lemma is proven in the implementation of
ReVerC for the crush mode.
Lemma 8.4.6. Let σ, σ′ be elements of Istandard and Icircuit, respectively. For all l ∈
N, B ∈ BExp, s ∈ State, if σ ∼s σ′ and s(i) = 0 whenever i ∈ ξ, then
assignstandard(σ, l, B) ∼s assigncircuit(σ′, l, B).
Moreover, the ancilla heap remains 0-filled.
We say that (ρ, C, ξ) ∈ Dcircuit is valid with respect to s ∈ State if and only if s(i) = 0
for all i ∈ ξ. For elements of DGC the validity conditions are more involved, so we introduce
a relation, V ⊆ DGC × State, defining the set of valid domain elements:
((ρ, C, ξ, κ), s) ∈ V ⇐⇒ ∀i ∈ ξ, s(i) = 0 ∧ ∀l, l′ ∈ dom(ρ), ρ(l) ≠ ρ(l′)
∧ ∀i ∈ cod(ρ), ⟦i ⊕ κ(i)⟧(⟦C⟧s) = s(i)
Informally, V specifies that all bits on the heap have initial value 0, that ρ is a one-to-one
mapping, and that for every active bit i, XORing i with κ(i) returns the initial value of i –
that is, i⊕ κ(i) cleans i.
Lemma 8.4.7. Let σ, σ′ be elements of Istandard and IGC , respectively. For all l ∈ N, B ∈
BExp, s ∈ State, if σ ∼s σ′ and (σ′, s) ∈ V, then
assignstandard(σ, l, B) ∼s assignGC(σ′, l, B).
Moreover, (assignGC(σ′, l, B), s) ∈ V.
By setting the relation RGC as
(σ1, σ2) ∈ RGC ⇐⇒ σ2 ∈ V ∧ σ1 ∼s0 σ2
for σ1 ∈ Dstandard, by Theorem 8.4.5 and Lemma 8.4.7 it follows that partial evaluation/compilation
preserves observational equivalence between Istandard and IGC . A similar
result follows for Icircuit.
8.5 Experiments
We ran experiments to compare the bit, gate and Toffoli counts of circuits compiled by
ReVerC to those of circuits compiled by the original Revs compiler. The number of Toffoli
gates is distinguished as such gates are generally much more costly than NOT and
controlled-NOT gates. We compiled circuits for various arithmetic and cryptographic
functions written in Revs using both compilers and report the results in Table 8.1. The
experimental setup is the same as in previous chapters.
The results show that both compilers are more-or-less evenly matched in terms of bit
counts across both modes, despite ReVerC being certifiably correct. ReVerC’s eager
cleanup mode never used more bits than the default mode, as expected, and in half of the
benchmarks reduced the number of bits. Moreover, in the cases of the carryRippleAdder
and MD5 benchmarks, ReVerC’s eager cleanup mode produced circuits with significantly
fewer bits than either of Revs’ modes. On the other hand, Revs saw dramatic decreases
in bit numbers for carryLookahead and SHA-2 compared to ReVerC.

Table 8.1: Bit and gate counts for Revs and ReVerC in default and eager cleanup modes.
Entries with the fewest bits used or Toffolis are bolded.

Benchmark          Revs                     Revs (eager)             ReVerC                   ReVerC (eager)
                   bits   gates   Toffolis  bits   gates   Toffolis  bits   gates   Toffolis  bits   gates   Toffolis
carryRippleAdd32   129    281     62        129    467     124       128    281     62        113    361     90
carryRippleAdd64   257    569     126       257    947     252       256    569     126       225    745     186
mult32             128    6,016   4,032     128    6,016   4,032     128    6,016   4,032     128    6,016   4,032
mult64             256    24,320  16,256    256    24,320  16,256    256    24,320  16,256    256    24,320  16,256
carryLookahead32   160    345     103       109    1,036   344       165    499     120       146    576     146
carryLookahead64   424    1,026   307       271    3,274   1,130     432    1,375   336       376    1,649   428
modAdd32           65     188     62        65     188     62        65     188     62        65     188     62
modAdd64           129    380     126       129    380     126       129    380     126       129    380     126
cucarroAdder32     65     98      32        65     98      32        65     98      32        65     98      32
cucarroAdder64     129    194     64        129    194     64        129    194     64        129    194     64
ma4                17     24      8         17     24      8         17     24      8         17     24      8
SHA-2              449    1,796   594       353    2,276   754       452    1,796   594       449    1,796   594
MD5                7,841  81,664  27,520    7,905  82,624  27,968    4,833  70,912  27,520    4,769  70,912  27,520

While the results show there is clearly room for optimization of gate counts, they appear
consistent with other verified compilers (e.g., [Ler06]) which take some performance hit
when compared to unverified compilers. In particular, unverified compilers may use more
aggressive optimizations due to the increased ease of implementation and the lack of a
requirement to prove their correctness compared to certified compilers. In some cases, the
optimizations are even known to not be correct in all possible cases, as in the case of fast
arithmetic and some loop optimization passes in the GNU C Compiler [GCC16].
8.6 Related work
Reversible circuit compilation. Due to the reversibility requirement of quantum computing,
quantum programming languages and compilers typically have methods for generating
reversible circuits. Quantum programming languages typically allow compilation of
classical, irreversible code in order to minimize the effort of porting existing code into the
quantum domain. In QCL [Ö00], “pseudo-classical” operators – classical functions meant to
be run on a quantum computer – are written in an imperative style and compiled with au-
tomatic ancilla management. As in Revs, such code manipulates registers of bits, splitting
off sub-registers and concatenating them together. The more recent Quipper [GLR+13] au-
tomatically generates reversible circuits from classical code by a process called lifting: using
Haskell metaprogramming, Quipper lifts the classical code to the reversible domain with
automated ancilla management. However, little space optimization is performed [SVM+17].
Other approaches to high-level synthesis of reversible circuits are based on writing code
in reversible programming languages – that is, the programs themselves are written in
a reversible way. Perhaps most notable in this line of research is Janus [YG07] and its
various extensions [WOD10,Tho12,Per14]. These languages typically feature a reversible
update and some bi-directional control operators, such as if statements with exit assertions.
Due to the presence of dynamic control in the source language, such languages typically
target reversible instruction set architectures like Pendulum [Vie95], as opposed to the
combinational circuits that ReVerC and, in general, quantum compilers target.
Verification. As noted in the previous chapter, verification of reversible circuits has
been studied from the viewpoint of checking equivalence against a benchmark circuit
or specification [WGMD09, YM10]. This can double as both program verification and
translation validation, but every compiled circuit needs to be verified separately. Moreover,
a program that is easy to formally verify may be translated into a circuit with hundreds
of bits, which is thus very difficult to verify. Recent work has shown progress towards
verification of more general properties of reversible and quantum circuits – for example, via
model checking [APTZ16] – but to the authors’ knowledge, no verification of a reversible
circuit compiler has yet been carried out. By contrast, many compilers for general purpose
programming languages have been formally verified in recent years – most famously, the
CompCert optimizing C compiler [Ler06], written and verified in Coq. Since then, many
other compilers have been developed and verified in a range of languages and logics
including Coq, HOL, F*, etc., with features such as shared memory [BSDA14], functional







Throughout this thesis we have studied aspects of quantum circuit design for quantum
compilation. We looked at the questions of both optimization of quantum circuits, and their
functional verification. On the optimization side, we developed a general framework for
circuit optimization which first applies phase-folding to merge phase gates and group them
into individual CNOT-dihedral subcircuits – these subcircuits can then be optimally or
sub-optimally synthesized for a variety of cost functions. We studied this optimal synthesis
problem from the perspective of RZ- and CNOT-count minimization, which were shown to
reduce to identifiable combinatorial problems in certain cases. Using these characterizations,
we developed concrete optimization algorithms which scale to large circuits and reduce
resource costs of real-world quantum circuits by 42% (T -count) and 22% (CNOT-count).
On the verification side, we developed an algorithm for the verification of the unitary
operator or partial isometry implemented by a circuit over Clifford hierarchy gates and
ancillary qubits. The algorithm was shown to scale to some large circuits of up to 200
qubits, exceeding previous equivalence checking results as well as the performance of recent
simulators on a Hidden Shift algorithm. Going beyond functional verification, a compiler
for strictly reversible circuits was formally verified. Together, the methods of this thesis –
implemented in the reversible circuit compiler ReVerC and quantum compiler backend
Feynman – form a stack of tools for compiling optimized, verified quantum circuits.
Between these pieces of software, a classical program can be compiled to a certified correct






This thesis represents one microscopic step towards the herculean task of making quantum
computing a reality. Many more steps will need to be taken before a scalable, reliable set of
circuit design tools is developed, let alone a programmable, accurate universal quantum
computer. And so, we close by discussing some directions those steps might take.
Path integrals & Formal models. A major theme of this thesis was the use of path
integrals and phase polynomials as a formal model for the analysis of pure quantum circuits.
A natural direction of future research is to develop a framework for similar analyses applied
to more general quantum circuits – for instance, those involving measurements and classical
control. Indeed, recent research [BK18] has combined the intuition of phase polynomials
with graphical calculi (e.g., [CD11]) to develop new formal models applicable to the full
range of quantum computation, while retaining some of the insight developed here.
Optimization. A natural next step for the optimization algorithms presented in
this thesis – in particular, the phase-folding framework – is to extend them
to quantum programs as opposed to quantum circuits. One way to go about this may be
to develop analogous models for such programs as noted above. On the other hand, it
may be possible to forgo a direct path integral model of more general quantum programs
by using classical techniques of abstraction and collecting semantics to soundly lift these
purely unitary optimizations to such programs.
Many questions still remain regarding both RZ-count and CNOT-count optimizations.
For RZ-counts, the question of an exact upper bound – that is, the exact covering radius of
higher order Reed-Muller codes – for CNOT-dihedral circuits remains. More significantly,
our methods give a very loose upper bound of O(k · n^2) T gates in a Clifford+T circuit
containing k Hadamard gates, which can no doubt be brought down by examining the (more
difficult) problem of T or RZ gate optimization in Clifford+RZ circuits. Indeed, recent
work by Heyfron and Campbell [HC18] has brought this down to O(n^2 + k^2) by pushing
Hadamard gates onto ancillas at the beginning and end of the circuit, and has further given a
new Reed-Muller decoder which empirically achieves this scaling. It remains a question for
future research to determine the relationship between ancillas, phase and Hadamard gates,
and moreover to find methods – if they exist – with O(n^2 + k^2) T gates that do not require
ancillas.
Regarding CNOT minimization, even in the restricted case of CNOT-dihedral circuits,
the exact complexity of CNOT-minimization remains unknown. The first order of business
is to establish the difficulty in general for the parity network synthesis problem, possibly
by showing that certain cases reduce to the fixed-target version of the problem. On the
other hand, it may yet turn out to be possible to find an efficient algorithm for synthesizing
minimal parity networks. While we have presented an effective heuristic algorithm for
synthesizing parity networks, better heuristics may also exist. As an additional point of
consideration, physical chip designs typically have limited connectivity and can only apply
CNOT gates between connected qubits. While arbitrary CNOT gates can be implemented
with a sequence of CNOT gates having length at most proportional to the diameter of
the connectivity graph, it nonetheless remains possible that a better circuit may be found
by synthesizing one directly for a given topology. A natural and important direction for
future research is then to find CNOT optimization algorithms which take into account
such connectivity constraints, a problem which likewise appears to be computationally
intractable [HND17].
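To make the linear overhead concrete, the following Python sketch builds one well-known nearest-neighbour implementation of a CNOT between the two ends of a path in the connectivity graph and checks it on every basis state. The construction and all names are illustrative assumptions, not an algorithm from this thesis; it uses 4(d − 1) CNOT gates for a path with d ≥ 2 edges.

```python
from itertools import product

def cnot_along_path(path):
    """Nearest-neighbour CNOTs implementing CNOT(path[0], path[-1])."""
    d = len(path) - 1
    if d < 1:
        raise ValueError("path needs at least two qubits")

    def hop(k):
        return (path[k], path[k + 1])      # CNOT with control path[k], target path[k+1]

    gates = [hop(k) for k in range(d - 1, 0, -1)]   # backward sweep over interior links
    gates += [hop(k) for k in range(d)]             # forward sweep: adds path[0] to every later qubit
    gates += [hop(k) for k in range(d - 2, 0, -1)]  # repeat both sweeps without the last link,
    gates += [hop(k) for k in range(d - 1)]         # which restores the interior qubits
    return gates

def apply_cnots(gates, bits):
    """Action of a CNOT circuit on a computational basis state."""
    bits = dict(bits)
    for control, target in gates:
        bits[target] ^= bits[control]
    return bits

path = [3, 1, 4, 0]                # hypothetical path of physically adjacent qubits
gates = cnot_along_path(path)
ok = True
for values in product((0, 1), repeat=len(path)):
    before = dict(zip(path, values))
    expected = dict(before)
    expected[path[-1]] ^= before[path[0]]
    ok = ok and apply_cnots(gates, before) == expected
print(len(gates), ok)  # prints: 8 True
```

Since a CNOT circuit permutes computational basis states, agreement on all basis states is enough to establish equality of the two operators. A topology-aware synthesis method could still do better than inserting this gadget for every long-range gate, which is the point made above.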
Verification. The functional verification work presented in this thesis represents a
preliminary step towards a fully-automated system of formal specification and verification for
quantum circuits, and as such there are many issues for future work to address. First and
foremost is to develop a concrete syntax – or specification language – so that libraries of
portable, verified circuits may be developed, greatly reducing the difficulty of implementing
quantum algorithms. Specification languages form the backbone of most modern hardware
design workflows, so developing such languages for quantum circuit design is highly
desirable and, in particular, not yet well studied. A related direction, motivated
by our experience writing path integral rewriting proofs “by hand,” is to implement path
integral-based verification in an interactive proof assistant. This could further extend the
applicability of this framework to inductive proofs, allowing the verification of entire families
of quantum circuits. As in classic interactive provers, the rewrite rules and verification
algorithm we develop here could ostensibly be the starting point for proof automation, just
as the F* language uses SMT solvers to automatically prove verification sub-goals.
On the algorithmic side, many improvements can yet be made to the verification algorithm.
It was noted that the algorithm has high memory usage, due to the cubic (or higher) degree
of the phase polynomial. Representing phase polynomials by their Fourier expansion,
breaking normalization but saving space, may be a useful time-space trade-off for
difficult-to-verify circuits. On the other hand, specifications of algorithms as well as the intermediate
states of the path integrals can grow exponentially large – in these cases, it remains to be
seen whether the use of relational representations, which have proven useful for saving space
in classical verification methods such as model checking, may be an effective solution here.
Other open questions regard finding new reduction rules, particularly finding a complete set
of rules for Clifford+T circuits, and improving methods for completing verification
once no more reductions can be made. An interesting direction for this latter problem may
be to use algebraic decision diagrams [NWM+16], which have proven remarkably useful for
uniquely and compactly representing finite functions in the classical world. Indeed, recent
work has hinted at the use of algebraic decision diagrams in quantum circuit simulation
and verification [ZW18].
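As a small worked example of the Fourier-versus-normalized trade-off discussed above, consider a single parity term. This is a sketch under two assumptions: that phases are multiples of π/4, so that coefficients are taken modulo 8, and that the normalized representation referred to above is the multilinear one.

```latex
\begin{align*}
x \oplus y &= x + y - 2xy, \\
x \oplus y \oplus z &= x + y + z - 2xy - 2xz - 2yz + 4xyz, \\
\bigoplus_{i=1}^{k} x_i &= \sum_{\emptyset \neq S \subseteq \{1,\dots,k\}} (-2)^{|S|-1} \prod_{i \in S} x_i .
\end{align*}
```

A single Fourier term in three variables thus becomes seven monomials. Modulo 8 every monomial with |S| ≥ 4 vanishes, since its coefficient is divisible by 8, so a parity of k variables contributes on the order of k³ monomials to the multilinear form, compared with one term in the Fourier expansion.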
The development of a verified compiler for full quantum circuits – rather than strictly
reversible circuits – is also left for future work. However, in contrast to reversible circuits
where the programming language is naturally any classical programming language, the exact
nature of this problem depends on what we mean by a quantum programming language. If
we take a quantum programming language to simply mean a circuit description language,
the job of the compiler is massively simplified; in most cases, the compiler effectively unrolls
a program to a straight line circuit, possibly decomposing some gates according to their
implementation in a particular gate basis. The natural verification questions for such
a compiler, beyond equivalence checking of gate decompositions, are not yet clear. As
quantum programs are naturally probabilistic, a more interesting question may be whether the
compiler preserves probabilities up to some error rate. A natural direction to look in this
case is the classical work done on reliability (e.g., [CMR13]). More concretely, improvements
can be made to the compilation methods of ReVerC – particularly its space optimizations
– as well as the verification of the interpreter with respect to a more formal relational
implementation of the semantics of Revs.
References
[AADS16] Nabila Abdessaied, Matthew Amy, Rolf Drechsler, and Mathias Soeken.
Complexity of Reversible Circuits and their Quantum Implementations. The-
oretical Computer Science, 618:85–106, 2016. doi:10.1016/j.tcs.2016.01.011.
[AAM18] Matthew Amy, Parsiad Azimzadeh, and Michele Mosca. On the CNOT-
complexity of CNOT-PHASE Circuits. Quantum Science and Technology,
4(1):015002, 2018. doi:10.1088/2058-9565/aad8ca.
[AASD16] Nabila Abdessaied, Matthew Amy, Mathias Soeken, and Rolf Drechsler.
Technology Mapping of Reversible Circuits to Clifford+T Quantum Circuits.
In Proceedings of the 46th International Symposium on Multiple-Valued Logic,
ISMVL ’16, pages 150–155, 2016. doi:10.1109/ISMVL.2016.33.
[ABL+18] Divesh Aggarwal, Gavin Brennen, Troy Lee, Miklos Santha, and Marco
Tomamichel. Quantum Attacks on Bitcoin, and How to Protect Against
Them. Ledger, 3(0), 2018. doi:10.5195/ledger.2018.127.
[AC04] Samson Abramsky and Bob Coecke. A Categorical Semantics of Quantum
Protocols. In Proceedings of the 19th annual IEEE Symposium on Logic in
Computer Science, LICS ’04, pages 415–425, 2004. doi:10.1109/LICS.2004.1.
[ACJR17] Matthew Amy, Jianxin Chen, and Neil J. Ross. A Finite Presentation
of CNOT-Dihedral Operators. In Proceedings of the 14th International
Conference on Quantum Physics and Logic, QPL ’17, pages 84–97, 2017.
doi:10.4204/EPTCS.266.5.
[ADH97] Leonard Adleman, Jonathan DeMarrais, and Ming-Deh A. Huang. Quan-
tum Computability. SIAM Journal on Computing, 26(5):1524–1540, 1997.
doi:10.1137/S0097539795293639.
[AJO16] Jonas T. Anderson and Tomas Jochym-O’Connor. Classification of Transver-
sal Gates in Qubit Stabilizer Codes. Quantum Information & Computation,
16(9-10):771–802, 2016. doi:10.26421/QIC16.9-10.
[AM16] Matthew Amy and Michele Mosca. T-count Optimization and Reed-Muller
Codes. arXiv preprint, 2016, arXiv:1601.07363.
[AMG+16] Matthew Amy, Olivia Di Matteo, Vlad Gheorghiu, Michele Mosca, Alex
Parent, and John Schanck. Estimating the Cost of Generic Quantum Pre-
image Attacks on SHA-2 and SHA-3. In Proceedings of the 24th Confer-
ence on Selected Areas in Cryptography, SAC ’16, pages 317–337, 2016.
doi:10.1007/978-3-319-69453-5_18.
[AMM14] Matthew Amy, Dmitri Maslov, and Michele Mosca. Polynomial-Time T-
depth optimization of Clifford+T circuits via matroid partitioning. IEEE
Transactions on Computer-Aided Design of Integrated Circuits and Systems,
33(10):1476–1489, 2014. doi:10.1109/TCAD.2014.2341953.
[AMMR13] Matthew Amy, Dmitri Maslov, Michele Mosca, and Martin Roetteler. A
Meet-in-the-Middle Algorithm for Fast Synthesis of Depth-Optimal Quantum
Circuits. IEEE Transactions on Computer-Aided Design of Integrated Circuits
and Systems, 32(6):818–830, 2013. doi:10.1109/TCAD.2013.2244643.
[Amy18] Matthew Amy. Towards Large-Scale Functional Verification of Universal
Quantum Circuits. In Proceedings of the 15th International Conference on
Quantum Physics and Logic, QPL ’18, 2018. doi:10.4204/EPTCS.287.1.
[APTZ16] Linda Anticoli, Carla Piazza, Leonardo Taglialegne, and Paolo Zuliani. To-
wards Quantum Programs Verification: From Quipper Circuits to QPMC. In
Proceedings of the 8th International Conference on Reversible Computation,
RC ’16, pages 213–219, 2016. doi:10.1007/978-3-319-40578-0_16.
[ARS17] Matthew Amy, Martin Roetteler, and Krysta M. Svore. Verified Compilation
of Space-Efficient Reversible Circuits. In Proceedings of the 29th International
Conference on Computer Aided Verification, CAV ’17, pages 3–21, 2017.
doi:10.1007/978-3-319-63390-9_1.
[ASD14] Nabila Abdessaied, Mathias Soeken, and Rolf Drechsler. Quantum Circuit
Optimization by Hadamard Gate Reduction. In Proceedings of the 6th
International Conference on Reversible Computation, RC ’14, pages 149–162,
2014. doi:10.1007/978-3-319-08494-7_12.
[BCM08] Pedro Baltazar, Rohit Chadha, and Paulo Mateus. Quantum Com-
putation Tree Logic – Model Checking and Complete Calculus. In-
ternational Journal of Quantum Information, 06(02):219–236, 2008.
doi:10.1142/S0219749908003530.
[Ben73] Charles H. Bennett. Logical Reversibility of Computation. IBM Journal of
Research and Development, 17:525–532, 1973. doi:10.1147/rd.176.0525.
[Ben89] Charles H. Bennett. Time/Space Trade-Offs for Reversible Computation.
SIAM Journal on Computing, 18(4):766–776, 1989. doi:10.1137/0218053.
[BG16] Sergey Bravyi and David Gosset. Improved Classical Simulation of Quantum
Circuits Dominated by Clifford Gates. Physical Review Letters, 116:250501,
2016. doi:10.1103/PhysRevLett.116.250501.
[BGK+16] Filippo Bonchi, Fabio Gadducci, Aleks Kissinger, Paweł Sobociński, and Fabio
Zanasi. Rewriting Modulo Symmetric Monoidal Structure. In Proceedings of
the 31st ACM/IEEE Symposium on Logic in Computer Science, LICS ’16,
pages 710–719, 2016. doi:10.1145/2933575.2935316.
[BH12] Sergey Bravyi and Jeongwan Haah. Magic-State Distillation with Low Over-
head. Physical Review A, 86:052329, 2012. doi:10.1103/PhysRevA.86.052329.
[BK05] Sergey Bravyi and Alexei Kitaev. Universal Quantum Computation with
Ideal Clifford Gates and Noisy Ancillas. Physical Review A, 71:022316, 2005.
doi:10.1103/PhysRevA.71.022316.
[BK18] Miriam Backens and Aleks Kissinger. ZH: A Complete Graphical Calculus
for Quantum Computations Involving Classical Non-linearity. In Proceedings
of the 15th International Conference on Quantum Physics and Logic, QPL
’18, pages 23–42, 2018. doi:10.4204/EPTCS.287.2.
[BKV18] Krzysztof Bar, Aleks Kissinger, and Jamie Vicary. Globular: an Online Proof
Assistant for Higher-dimensional Rewriting. Logical Methods in Computer
Science, Volume 14, Issue 1, 2018. doi:10.23638/LMCS-14(1:8)2018.
[BMvT06] Elwyn Berlekamp, Robert McEliece, and Henk van Tilborg. On the Inherent
Intractability of Certain Coding Problems. IEEE Transactions on Information
Theory, 24(3):384–386, May 1978. doi:10.1109/TIT.1978.1055873.
[BRS15] Alex Bocharov, Martin Roetteler, and Krysta M. Svore. Efficient Synthesis of
Universal Repeat-Until-Success Quantum Circuits. Physical Review Letters,
114:080502, 2015. doi:10.1103/PhysRevLett.114.080502.
[BSDA14] Lennart Beringer, Gordon Stewart, Robert Dockins, and Andrew W. Appel.
Verified Compilation for Shared-Memory C. In Proceedings of the 23rd
European Symposium on Programming Languages and Systems, ESOP ’14,
pages 107–127, 2014. doi:10.1007/978-3-642-54833-8_7.
[BU11] David Buchfuhrer and Christopher Umans. The Complexity of Boolean
Formula Minimization. Journal of Computer and System Sciences, 77(1):142–
153, 2011. doi:10.1016/j.jcss.2010.06.011.
[BV97] Ethan Bernstein and Umesh Vazirani. Quantum Complexity
Theory. SIAM Journal on Computing, 26(5):1411–1473, 1997.
doi:10.1137/S0097539796300921.
[CAB12] Earl T. Campbell, Hussain Anwar, and Dan E. Browne. Magic-State Distilla-
tion in All Prime Dimensions Using Quantum Reed-Muller Codes. Physical
Review X, 2:041021, 2012. doi:10.1103/PhysRevX.2.041021.
[CBSG17] Andrew W. Cross, Lev S. Bishop, John A. Smolin, and Jay M. Gambetta.
Open Quantum Assembly Language. arXiv preprint, 2017, arXiv:1707.03429.
[CC77] Patrick Cousot and Radhia Cousot. Abstract Interpretation: A Unified Lat-
tice Model for Static Analysis of Programs by Construction or Approximation
of Fixpoints. In Proceedings of the 4th annual ACM SIGACT-SIGPLAN Sym-
posium on Principles of Programming Languages, POPL ’77, pages 238–252,
1977. doi:10.1145/512950.512973.
[CD11] Bob Coecke and Ross Duncan. Interacting Quantum Observables: Categorical
Algebra and Diagrammatics. New Journal of Physics, 13(4):043016, 2011.
doi:10.1088/1367-2630/13/4/043016.
[CDKW16] Bob Coecke, Ross Duncan, Aleks Kissinger, and Quanlong Wang. Generalised
Compositional Theories and Diagrammatic Reasoning. In G. Chiribella and
R. W. Spekkens, editors, Quantum Theory: Informational Foundations and
Foils, pages 309–366. Springer Netherlands, Dordrecht, 2016. doi:10.1007/978-
94-017-7303-4_10.
[CFH97] David G. Cory, Amr F. Fahmy, and Timothy F. Havel. Ensemble quantum
computing by NMR spectroscopy. Proceedings of the National Academy of
Sciences, 94(5):1634–1639, 1997. doi:10.1073/pnas.94.5.1634.
[CGK17] Shawn X. Cui, Daniel Gottesman, and Anirudh Krishna. Diagonal
Gates in the Clifford Hierarchy. Physical Review A, 95:012329, 2017.
doi:10.1103/PhysRevA.95.012329.
[CH17a] Earl T. Campbell and Mark Howard. Unified Framework for Magic State
Distillation and Multiqubit Gate Synthesis with Reduced Resource Cost.
Physical Review A, 95:022316, 2017. doi:10.1103/PhysRevA.95.022316.
[CH17b] Earl T. Campbell and Mark Howard. Unifying gate synthesis and
magic state distillation. Physical Review Letters, 118:060501, 2017.
doi:10.1103/PhysRevLett.118.060501.
[Chl10] Adam Chlipala. A Verified Compiler for an Impure Functional Language.
In Proceedings of the 37th annual ACM SIGPLAN-SIGACT Symposium
on Principles of Programming Languages, POPL ’10, pages 93–106, 2010.
doi:10.1145/1706299.1706312.
[CL92] Gérard D. Cohen and Simon N. Litsyn. On the Covering Radius
of Reed-Muller Codes. Discrete Mathematics, 106:147–155, 1992.
doi:10.1016/0012-365X(92)90542-N.
[Cla01] Koen Claessen. Embedded Languages for Describing and Verifying Hardware.
PhD thesis, Chalmers University of Technology and Göteborg University,
2001.
[CM09] Stephen L. Campbell and Carl D. Meyer. Generalized Inverses of Linear
Transformations. Classics in Applied Mathematics. SIAM, 2009.
doi:10.1137/1.9780898719048.
[CMR13] Michael Carbin, Sasa Misailovic, and Martin C. Rinard. Verifying Quanti-
tative Reliability for Programs That Execute on Unreliable Hardware. In
Proceedings of the 2013 ACM SIGPLAN International Conference on Object
Oriented Programming Systems Languages & Applications, OOPSLA ’13,
pages 33–52, 2013. doi:10.1145/2509136.2509546.
[CO16] Earl T. Campbell and Joe O’Gorman. An Efficient Magic State Approach to
Small Angle Rotations. Quantum Science and Technology, 1(1):015007, 2016.
doi:10.1088/2058-9565/1/1/015007.
[Coq17] The Coq Proof Assistant, version 8.7.0, 2017. doi:10.5281/zenodo.1028037.
[CZ95] J. Ignacio Cirac and Peter Zoller. Quantum Computations with
Cold Trapped Ions. Physical Review Letters, 74:4091–4094, 1995.
doi:10.1103/PhysRevLett.74.4091.
[DCP15] Guillaume Duclos-Cianci and David Poulin. Reducing the Quantum-
computing Overhead with Complex Gate Distillation. Physical Review A,
91:042315, 2015. doi:10.1103/PhysRevA.91.042315.
[Deu85] David Deutsch. Quantum theory, the Church-Turing Principle and the
Universal Quantum Computer. Proceedings of the Royal Society of London
A: Mathematical, Physical and Engineering Sciences, 400(1818):97–117, 1985.
doi:10.1098/rspa.1985.0070.
[DHM+05] Christopher M. Dawson, Andrew P. Hines, Duncan Mortimer, Henry L.
Haselgrove, Michael A. Nielsen, and Tobias J. Osborne. Quantum Computing
and Polynomial Equations over the Finite Field Z2. Quantum Information
& Computation, 5(2):102–112, 2005. doi:10.26421/QIC5.2.
[DL13] Ross Duncan and Maxime Lucas. Verifying the Steane Code with Quan-
tomatic. In Proceedings of the 10th International Conference on Quantum
Physics and Logic, QPL ’13, pages 33–49, 2013. doi:10.4204/EPTCS.171.4.
[DLF+16] Shantanu Debnath, Norbert M. Linke, Caroline Figgatt, Kevin Landsman,
Kevin Wright, and Chris Monroe. Demonstration of a Small Programmable
Quantum Computer with Atomic Qubits. Nature, 536(7614):63–66, 2016.
doi:10.1038/nature18648.
[DM82] Luis Damas and Robin Milner. Principal type-schemes for functional pro-
grams. In Proceedings of the 9th annual ACM SIGACT-SIGPLAN Symposium
on Principles of Programming Languages, POPL ’82, pages 207–212, 1982.
doi:10.1145/582153.582176.
[DN06] Christopher M. Dawson and Michael A. Nielsen. The Solovay-Kitaev
Algorithm. Quantum Information & Computation, 6(1):81–95, 2006.
doi:10.26421/QIC6.1.
[Dum04] Ilya Dumer. Recursive Decoding and its Performance for Low-rate Reed-
Muller Codes. IEEE Transactions on Information Theory, 50(5):811–823,
2004. doi:10.1109/TIT.2004.826632.
[EK09] Bryan Eastin and Emanuel Knill. Restrictions on Transversal En-
coded Quantum Gate Sets. Physical Review Letters, 102:110502, 2009.
doi:10.1103/PhysRevLett.102.110502.
[EKP85] Jarmo Ernvall, Jyrki Katajainen, and Martti Penttonen. NP-Completeness
of the Hamming Salesman Problem. BIT Numerical Mathematics, 25(1):289–
292, 1985. doi:10.1007/BF01935007.
[EST95] Jonathan Eifrig, Scott Smith, and Valery Trifonov. Sound polymorphic type
inference for objects. In Proceedings of the Tenth annual Conference on Object-
oriented Programming Systems, Languages, and Applications, OOPSLA ’95,
pages 169–184, 1995. doi:10.1145/217838.217858.
[Fey82] Richard Feynman. Simulating Physics with Computers. International Journal
of Theoretical Physics, 21:467–488, 1982. doi:10.1007/BF02650179.
[FH65] Richard P. Feynman and Albert R. Hibbs. Quantum Mechanics and Path
Integrals. International series in pure and applied physics. McGraw-Hill,
1965.
[FHTZ15] Yuan Feng, Ernst Moritz Hahn, Andrea Turrini, and Lijun Zhang. QPMC:
A Model Checker for Quantum Programs and Protocols. In Proceedings of the
19th International Symposium on Formal Methods, FM ’15, pages 265–272,
2015. doi:10.1007/978-3-319-19249-9_17.
[FMMC12] Austin G. Fowler, Matteo Mariantoni, John M. Martinis, and Andrew N. Cle-
land. Surface Codes: Towards Practical Large-scale Quantum Computation.
Physical Review A, 86:032324, 2012. doi:10.1103/PhysRevA.86.032324.
[FR99] Lance Fortnow and John Rogers. Complexity Limitations on Quantum
Computation. Journal of Computer and System Sciences, 59(2):240–252,
1999. doi:10.1006/jcss.1999.1651.
[FSC+13] Cedric Fournet, Nikhil Swamy, Juan Chen, Pierre-Evariste Dagand, Pierre-
Yves Strub, and Benjamin Livshits. Fully Abstract Compilation to JavaScript.
In Proceedings of the 40th annual ACM SIGPLAN-SIGACT Symposium on
Principles of Programming Languages, POPL ’13, pages 371–384, 2013.
doi:10.1145/2429069.2429114.
[FYY13] Yuan Feng, Nengkun Yu, and Mingsheng Ying. Model Checking Quantum
Markov Chains. Journal of Computer and System Sciences, 79(7):1181–1198,
2013. doi:10.1016/j.jcss.2013.04.002.
[GC97] Neil A. Gershenfeld and Isaac L. Chuang. Bulk Spin-Resonance
Quantum Computation. Science, 275(5298):350–356, 1997.
doi:10.1126/science.275.5298.350.
[GCC16] Using the GNU Compiler Collection. Free Software Foundation, Inc., 2016.
URL https://gcc.gnu.org/onlinedocs/gcc/.
[GD17] Liam Garvie and Ross Duncan. Verifying the Smallest Interesting Colour
Code with Quantomatic. In Proceedings of the 14th International Con-
ference on Quantum Physics and Logic, QPL ’17, pages 147–163, 2017.
doi:10.4204/EPTCS.266.10.
[GLR+13] Alexander S. Green, Peter LeFanu Lumsdaine, Neil J. Ross, Peter Selinger,
and Benoît Valiron. Quipper: A Scalable Quantum Programming Lan-
guage. In Proceedings of the 34th ACM SIGPLAN conference on Program-
ming Language Design and Implementation, PLDI ’13, pages 333–342, 2013.
doi:10.1145/2491956.2462177.
[GLRS16] Markus Grassl, Brandon Langenberg, Martin Roetteler, and Rainer Stein-
wandt. Applying Grover’s Algorithm to AES: Quantum Resource Estimates.
In Proceedings of the 7th International Workshop on Post-Quantum Cryptog-
raphy, PQCrypto ’16, pages 29–43, 2016. doi:10.1007/978-3-319-29360-8_3.
[GNP08] Simon J. Gay, Rajagopal Nagarajan, and Nikolaos Papanikolaou. QMC: A
Model Checker for Quantum Systems. In Proceedings of the 20th International
Conference on Computer Aided Verification, CAV ’08, pages 543–547, 2008.
doi:10.1007/978-3-540-70545-1_51.
[Got98] Daniel Gottesman. The Heisenberg Representation of Quantum Computers.
In International Conference on Group Theoretic Methods in Physics, page
9807006, 1998, arXiv:quant-ph/9807006.
[Gra53] Frank Gray. Pulse Code Communication, March 17 1953. US Patent
2,632,058.
[HB09] Peter Mark Hines and Sam Braunstein. The Structure of Partial Isome-
tries. In S. Gay and I. Mackie, editors, Semantic Techniques in Quan-
tum Computation, pages 361–388. Cambridge University Press, 11 2009.
doi:10.1017/CBO9781139193313.
[HC18] Luke E. Heyfron and Earl T. Campbell. An Efficient Quantum Compiler
that Reduces T Count. Quantum Science and Technology, 4(1):015004, 2018.
doi:10.1088/2058-9565/aad604.
[HFDM12] Clare Horsman, Austin G. Fowler, Simon Devitt, and Rodney Van Meter.
Surface Code Quantum Computing by Lattice Surgery. New Journal of
Physics, 14(12):123011, 2012. doi:10.1088/1367-2630/14/12/123011.
[HND17] Daniel Herr, Franco Nori, and Simon J. Devitt. Optimization of Lat-
tice Surgery is NP-hard. npj Quantum Information, 3(1):35, 2017.
doi:10.1038/s41534-017-0035-1.
[HR68] Peter L. Hammer and Sergiu Rudeanu. Boolean Methods in Operations
Research and Related Areas. Ökonometrie und Unternehmensforschung.
Springer-Verlag, 1968.
[IAR13] IARPA. Quantum Computer Science, 2013. URL https://www.iarpa.gov/
index.php/research-programs/qcs.
[IKY02] Kazuo Iwama, Yahiko Kambayashi, and Shigeru Yamashita. Transformation
Rules for Designing CNOT-based Quantum Circuits. In Proceedings of the
39th annual Design Automation Conference, DAC ’02, pages 419–424, 2002.
doi:10.1145/513918.514026.
[JGS93] Neil D. Jones, Carsten K. Gomard, and Peter Sestoft. Partial Evaluation
and Automatic Program Generation. Prentice-Hall, Inc., Upper Saddle River,
NJ, USA, 1993.
[JPK+15] Ali JavadiAbhari, Shruti Patil, Daniel Kudrow, Jeff Heckey, Alexey Lvov,
Frederic T. Chong, and Margaret Martonosi. ScaffCC: Scalable Compilation
and Analysis of Quantum Programs. Parallel Computing, 45(C):2–17, 2015.
doi:10.1016/j.parco.2014.12.001.
[KIQ+18] Nathan Killoran, Josh Izaac, Nicolás Quesada, Ville Bergholm, Matthew
Amy, and Christian Weedbrook. Strawberry Fields: A Software Platform for
Photonic Quantum Computing. arXiv preprint, 2018, arXiv:1804.03159.
[KLM01] Emanuel Knill, Raymond Laflamme, and Gerard J. Milburn. A Scheme for
Efficient Quantum Computation with Linear Optics. Nature, 409:46–52, 02
2001. doi:10.1038/35051009.
[KLM07] Phillip Kaye, Raymond Laflamme, and Michele Mosca. An Introduction to
Quantum Computing. Oxford University Press, 2007.
[KLZ96] Emanuel Knill, Raymond Laflamme, and Wojciech Zurek. Threshold Ac-
curacy for Quantum Computation. arXiv preprint, 1996, arXiv:quant-
ph/9610011.
[KMM13a] Vadym Kliuchnikov, Dmitri Maslov, and Michele Mosca. Asymptotically
Optimal Approximation of Single Qubit Unitaries by Clifford and T Circuits
Using a Constant Number of Ancillary Qubits. Physical Review Letters,
110:190502, 2013. doi:10.1103/PhysRevLett.110.190502.
[KMM13b] Vadym Kliuchnikov, Dmitri Maslov, and Michele Mosca. Fast and Effi-
cient Exact Synthesis of Single-qubit Unitaries Generated by Clifford and
T Gates. Quantum Information & Computation, 13(7-8):607–630, 2013.
doi:10.26421/QIC13.7-8.
[KZ15] Aleks Kissinger and Vladimir Zamdzhiev. Quantomatic: A Proof Assistant
for Diagrammatic Reasoning. In Proceedings of the International Conference
on Automated Deduction, CADE-25, pages 326–336, 2015. doi:10.1007/978-3-
319-21401-6_22.
[LA04] Chris Lattner and Vikram Adve. LLVM: A Compilation Framework
for Lifelong Program Analysis & Transformation. In Proceedings of
the International Symposium on Code Generation and Optimization:
Feedback-directed and Runtime Optimization, CGO ’04, pages 75–, 2004.
doi:10.1109/CGO.2004.1281665.
[LAS14] Vu Le, Mehrdad Afshari, and Zhendong Su. Compiler Validation via Equiv-
alence Modulo Inputs. In Proceedings of the 35th annual ACM SIGPLAN
Conference on Programming Language Design and Implementation, PLDI
’14, pages 216–226, 2014. doi:10.1145/2594291.2594334.
[LC13] Andrew J. Landahl and Chris Cesare. Complex Instruction Set Computing
Architecture for Performing Accurate Quantum Z Rotations with Less Magic.
arXiv preprint, 2013, arXiv:1302.3240.
[LD98] Daniel Loss and David P. DiVincenzo. Quantum Computation with Quantum
Dots. Physical Review A, 57:120–126, 1998. doi:10.1103/PhysRevA.57.120.
[Ler06] Xavier Leroy. Formal Certification of a Compiler Back-end or: Programming
a Compiler with a Proof Assistant. In Proceedings of the 34th annual ACM
SIGPLAN-SIGACT International Symposium on Principles of Programming
Languages, POPL ’06, pages 42–54, 2006. doi:10.1145/1111037.1111042.
[Llo96] Seth Lloyd. Universal Quantum Simulators. Science, 273(5278):1073–1078,
1996. doi:10.1126/science.273.5278.1073.
[LWZ+17] Shusen Liu, Xin Wang, Li Zhou, Ji Guan, Yinan Li, Yang He, Runyao Duan,
and Mingsheng Ying. Q|SI〉: A Quantum Programming Environment. arXiv
preprint, 2017, arXiv:1710.09500.
[Mas16] Dmitri Maslov. Advantages of using Relative-Phase Toffoli Gates with an
Application to Multiple Control Toffoli Optimization. Physical Review A,
93:022311, 2016. doi:10.1103/PhysRevA.93.022311.
[Mit91] John C. Mitchell. Type Inference with Simple Subtypes. Journal of Functional
Programming, 1:245–285, 1991. doi:10.1017/S0956796800000113.
[Mon17] Ashley Montanaro. Quantum Circuits and Low-degree Polynomials over
F2. Journal of Physics A: Mathematical and Theoretical, 50(8):084002, 2017.
doi:10.1088/1751-8121/aa565f.
[MP67] John McCarthy and James Painter. Correctness of a Compiler for Arithmetic
Expressions. In Proceedings of Symposia in Applied Mathematics,
volume 19, pages 33–41. American Mathematical Society, 1967. URL
http://jmc.stanford.edu/articles/mcpain.html.
[MS78] Florence J. MacWilliams and Neil J.A. Sloane. The Theory of Error Correct-
ing Codes. North-Holland Mathematical Library, Volume 16. North-Holland
Publishing Company, 1978.
[Mul54] David E. Muller. Application of Boolean Algebra to Switching Cir-
cuit Design and to Error Detection. Transactions of the I.R.E. Pro-
fessional Group on Electronic Computers, EC-3(3):6–12, Sept 1954.
doi:10.1109/IREPGELC.1954.6499441.
[MW72] Robin Milner and Richard W. Weyhrauch. Proving Compiler Correctness
in a Mechanised Logic. Machine Intelligence, 7:51–73, 1972. URL http:
//www.cs.umd.edu/~hjs/pubs/compilers/archive/mi72-mil-wey.pdf.
[NC00] Michael A. Nielsen and Isaac L. Chuang. Quantum Computation and Quan-
tum Information. Cambridge Series on Information and the Natural Sciences.
Cambridge University Press, 2000.
[NHK+15] Georg Neis, Chung-Kil Hur, Jan-Oliver Kaiser, Craig McLaughlin, Derek
Dreyer, and Viktor Vafeiadis. Pilsner: A Compositionally Verified Compiler
for a Higher-order Imperative Language. In Proceedings of the 20th ACM
SIGPLAN International Conference on Functional Programming, ICFP ’15,
pages 166–178, 2015. doi:10.1145/2784731.2784764.
[NRS+18] Yunseong Nam, Neil J. Ross, Yuan Su, Andrew M. Childs, and Dmitri Maslov.
Automated Optimization of Large Quantum Circuits with Continuous Pa-
rameters. npj Quantum Information, 4(1):23, 2018. doi:10.1038/s41534-018-
0072-4.
[NWM+16] Philipp Niemann, Robert Wille, David M. Miller, Mitchell A. Thornton, and
Rolf Drechsler. QMDDs: Efficient Quantum Function Representation and Ma-
nipulation. IEEE Transactions on Computer-Aided Design of Integrated Cir-
cuits and Systems, 35(1):86–99, Jan 2016. doi:10.1109/TCAD.2015.2459034.
[Ö00] Bernhard Ömer. Quantum Programming in QCL. Master’s thesis, Technical
University of Vienna, 2000. URL http://tph.tuwien.ac.at/~oemer/qcl.
html.
[Ö03] Bernhard Ömer. Structured Quantum Programming. PhD thesis, Technical
University of Vienna, 2003. URL http://tph.tuwien.ac.at/~oemer/qcl.
html.
[O’D14] Ryan O’Donnell. Analysis of Boolean Functions. Cambridge University Press,
2014. doi:10.1017/CBO9781139814782.
[OSW99] Martin Odersky, Martin Sulzmann, and Martin Wehr. Type Inference with
Constrained Types. Theory and Practice of Object Systems, 5(1):35–55, 1999.
doi:10.1002/(SICI)1096-9942(199901/03)5:1<35::AID-TAPO4>3.0.CO;2-4.
[PA14] James T. Perconti and Amal Ahmed. Verifying an Open Compiler Using
Multi-language Semantics. In Proceedings of the 23rd European Symposium on Programming
Languages and Systems, ESOP ’14, pages 128–148, 2014. doi:10.1007/978-3-642-54833-8_8.
[Pay18] Jennifer Paykin. Private Communication, 2018.
[Per08] Simon Perdrix. Quantum Entanglement Analysis Based on Abstract Inter-
pretation. In Proceedings of the 15th International Symposium on Static
Analysis, SAS ’08, pages 270–282, 2008. doi:10.1007/978-3-540-69166-2_18.
[Per14] Kalyan S. Perumalla. Introduction to Reversible Computing. CRC Press,
2014.
[PMH08] Ketan N. Patel, Igor L. Markov, and John P. Hayes. Optimal Synthesis of
Linear Reversible Circuits. Quantum Information & Computation, 8(3):282–
294, 2008. doi:10.26421/QIC8.3-4.
[Pot96] François Pottier. Simplifying Subtyping Constraints. In Proceedings of the
1996 ACM SIGPLAN International Conference on Functional Programming,
ICFP ’96, pages 122–133, 1996. doi:10.1145/232627.232642.
[Pre18] John Preskill. Quantum Computing in the NISQ Era and Beyond. Quantum,
2:79, August 2018. doi:10.22331/q-2018-08-06-79.
[PRS15] Alex Parent, Martin Roetteler, and Krysta M. Svore. Reversible Circuit
Compilation with Space Constraints. arXiv preprint, 2015, arXiv:1510.00377.
[PS14] Adam Paetznick and Krysta M. Svore. Repeat-until-success: Non-
deterministic Decomposition of Single-qubit Unitaries. Quantum Information
& Computation, 14(15-16):1277–1301, 2014. doi:10.26421/QIC14.15-16.
[PZ04] Jens Palsberg and Tian Zhao. Type Inference for Record Concatenation
and Subtyping. Information and Computation, 189(1):54–86, 2004.
doi:10.1016/j.ic.2003.10.001.
[Rö10] Martin Rötteler. Quantum Algorithms for Highly Non-linear Boolean Functions.
In Proceedings of the 21st annual ACM-SIAM Symposium on Discrete
Algorithms, SODA ’10, pages 448–457, 2010. doi:10.1137/1.9781611973075.37.
[Ree54] Irving Reed. A Class of Multiple-error-correcting Codes and the Decoding
Scheme. Transactions of the I.R.E. Professional Group on Information
Theory, 4(4):38–49, 1954. doi:10.1109/TIT.1954.1057465.
[RPZ17] Robert Rand, Jennifer Paykin, and Steve Zdancewic. QWIRE Practice:
Formal Verification of Quantum Circuits in Coq. In Proceedings of the 14th
International Conference on Quantum Physics and Logic, QPL ’17, pages
119–132, 2017. doi:10.4204/EPTCS.266.8.
[RS16] Neil J. Ross and Peter Selinger. Optimal Ancilla-free Clifford+T Approxima-
tion of Z-rotations. Quantum Information & Computation, 16(11-12):901–953,
2016. doi:10.26421/QIC16.11-12.
[SB16] Peter Selinger and Xiaoning Bian. Relations for 2-qubit Clifford+T Operator
Group, 2016. URL https://www.mathstat.dal.ca/~xbian/talks/slide_
cliffordt2.pdf.
[SCZ16] Robert S. Smith, Michael J. Curtis, and William J. Zeng. A Practical Quan-
tum Instruction Set Architecture. arXiv preprint, 2016, arXiv:1608.03355.
[Sel15] Peter Selinger. Efficient Clifford+T Approximation of Single-qubit Op-
erators. Quantum Information & Computation, 15(1-2):159–180, 2015.
doi:10.26421/QIC15.1-2.
[SGT+18] Krysta Svore, Alan Geller, Matthias Troyer, John Azariah, Christopher
Granade, Bettina Heim, Vadym Kliuchnikov, Mariia Mykhailova, Andres
Paz, and Martin Roetteler. Q#: Enabling Scalable Quantum Computing
and Development with a High-level DSL. In Proceedings of the 3rd ACM
International Workshop on Real World Domain Specific Languages, RWDSL
’18, pages 7:1–7:10, 2018. doi:10.1145/3183895.3183901.
[SHK+16] Nikhil Swamy, Cătălin Hriţcu, Chantal Keller, Aseem Rastogi, Antoine
Delignat-Lavaud, Simon Forest, Karthikeyan Bhargavan, Cédric Fournet,
Pierre-Yves Strub, Markulf Kohlweiss, Jean-Karim Zinzindohoue, and Santiago
Zanella-Béguelin. Dependent Types and Multi-monadic Effects in F*.
In Proceedings of the 43rd annual ACM SIGPLAN-SIGACT Symposium
on Principles of Programming Languages, POPL ’16, pages 256–270, 2016.
doi:10.1145/2837614.2837655.
[Sho94] Peter W. Shor. Algorithms for Quantum Computation: Discrete Log-
arithms and Factoring. In Proceedings of the 34th annual Symposium
on Foundations of Computer Science, SFCS ’94, pages 124–134, 1994.
doi:10.1109/SFCS.1994.365700.
[Sho95] Peter W. Shor. Scheme for Reducing Decoherence in Quantum
Computer Memory. Physical Review A, 52:R2493–R2496, 1995.
doi:10.1103/PhysRevA.52.R2493.
[SHT18] Damian S. Steiger, Thomas Häner, and Matthias Troyer. ProjectQ: An Open
Source Software Framework for Quantum Computing. Quantum, 2:49, 2018.
doi:10.22331/q-2018-01-31-49.
[SL83] Gadiel Seroussi and Abraham Lempel. Maximum Likelihood Decoding of
Certain Reed-Muller Codes. IEEE Transactions on Information Theory,
29(3):448–450, 1983. doi:10.1109/TIT.1983.1056662.
[SM09] Vivek V. Shende and Igor L. Markov. On the CNOT-cost of TOF-
FOLI Gates. Quantum Information & Computation, 9(5):461–486, 2009.
doi:10.26421/QIC9.5-6.
[SPMH02] Vivek V. Shende, Aditya K. Prasad, Igor L. Markov, and John P. Hayes.
Reversible Logic Circuit Synthesis. In Proceedings of the 2002 IEEE/ACM
International Conference on Computer-aided Design, ICCAD ’02, pages
353–360, 2002. doi:10.1145/774572.774625.
[SVM+17] Artur Scherer, Benoît Valiron, Siun-Chuon Mau, Scott Alexander, Eric
van den Berg, and Thomas E. Chapuran. Concrete Resource Analysis of the
Quantum Linear-system Algorithm used to Compute the Electromagnetic
Scattering Cross Section of a 2D Target. Quantum Information Processing,
16(3):60, 2017. doi:10.1007/s11128-016-1495-5.
[Tho12] Michael K. Thomsen. A Functional Language for Describing Reversible Logic.
In Proceedings of the 2012 Forum on Specification and Design Languages, FDL
’12, pages 135–142, 2012. URL https://ieeexplore.ieee.org/document/6336999.
[Tof80] Tommaso Toffoli. Reversible Computing. In Proceedings of the 7th Colloquium
on Automata, Languages and Programming, ICALP ’80, pages 632–644, 1980.
doi:10.1007/3-540-10003-2_104.
[TS96] Valery Trifonov and Scott Smith. Subtyping Constrained Types. In Proceedings
of the 3rd International Symposium on Static Analysis, SAS ’96, pages
349–365, 1996. doi:10.1007/3-540-61739-6_52.
[Vie95] Carlin Vieri. Pendulum: A Reversible Computer Architecture. Master’s
thesis, MIT Artificial Intelligence Laboratory, 1995. URL
https://dspace.mit.edu/bitstream/handle/1721.1/36039/33342527-MIT.pdf.
[WGMAG14] Jonathan Welch, Daniel Greenbaum, Sarah Mostame, and Alan Aspuru-Guzik.
Efficient Quantum Circuits for Diagonal Unitaries without Ancillas.
New Journal of Physics, 16(3):033040, 2014.
doi:10.1088/1367-2630/16/3/033040.
[WGMD09] Robert Wille, Daniel Grosse, David M. Miller, and Rolf Drechsler. Equivalence
Checking of Reversible Circuits. In Proceedings of the 39th International
Symposium on Multiple-Valued Logic, ISMVL ’09, pages 324–330, 2009.
doi:10.1109/ISMVL.2009.19.
[WOD10] Robert Wille, Sebastian Offermann, and Rolf Drechsler. SyReC: A Programming
Language for Synthesis of Reversible Circuits. In Proceedings of the
2010 Forum on Specification and Design Languages, FDL ’10, pages 1–6,
2010. doi:10.1049/ic.2010.0150.
[WZ82] William K. Wootters and Wojciech H. Zurek. A Single Quantum Cannot be
Cloned. Nature, 299:802, 1982. doi:10.1038/299802a0.
[YG07] Tetsuo Yokoyama and Robert Glück. A Reversible Programming Language
and its Invertible Self-interpreter. In Proceedings of the 2007 ACM Symposium
on Partial Evaluation and Semantics-Based Program Manipulation, PEPM
’07, pages 144–153, 2007. doi:10.1145/1244381.1244404.
[YM10] Shigeru Yamashita and Igor L. Markov. Fast Equivalence-checking for
Quantum Circuits. Quantum Information & Computation, 10(9):721–734,
2010. doi:10.26421/QIC10.9-10.
[ZCC11] Bei Zeng, Andrew Cross, and Isaac L. Chuang. Transversality Versus Universality
for Additive Quantum Codes. IEEE Transactions on Information
Theory, 57(9):6272–6284, 2011. doi:10.1109/TIT.2011.2161917.
[ZPW18] Alwin Zulehner, Alexandru Paler, and Robert Wille. An Efficient Methodology
for Mapping Quantum Circuits to the IBM QX Architectures. IEEE
Transactions on Computer-Aided Design of Integrated Circuits and Systems,
2018. doi:10.1109/TCAD.2018.2846658.
[ZW18] Alwin Zulehner and Robert Wille. Advanced Simulation of Quantum Computations.
IEEE Transactions on Computer-Aided Design of Integrated Circuits






In this appendix we formally prove the correctness of the phase-folding algorithm. It will be
convenient to consider $L$-parametrized circuits $C : \mathbb{R}^L \to \langle \mathcal{G} \rangle$. We call $\theta \in \mathbb{R}^L$ a parameter
set and denote the $L$-parametrized circuit $C$ applied to phase angles $\theta$ by $C(\theta)$. We extend
$\llbracket \cdot \rrbracket_a$ to $L$-parametrized circuits in the obvious way.
The intuition behind the phase-folding algorithm is that for any term $T \in \mathcal{T}$ of the
phase polynomial, any two parameter sets $\theta, \theta'$ which agree on the sum of all angles at
locations in $T$, that is, $\sum_{(\ell,a)\in T} a \cdot \theta_\ell = \sum_{(\ell,a)\in T} a \cdot \theta'_\ell$, give circuits $C(\theta)$ and $C(\theta')$ which
are equivalent up to global phase. We say in this case that $\theta$ and $\theta'$ are $f^\sharp$-equivalent.
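Definition A.1.1 ($f^\sharp$-equivalence). Two parameter sets $\theta, \theta' \in \mathbb{R}^L$ are $f^\sharp$-equivalent for
some $f^\sharp \in \mathcal{F}^\sharp_n$, denoted $\theta \sim_{f^\sharp} \theta'$, if for all $T \in \mathrm{range}(f^\sharp)$,
$$\sum_{(\ell,a)\in T} a \cdot \theta_\ell = \sum_{(\ell,a)\in T} a \cdot \theta'_\ell,$$
and for all other $\ell$, $\theta_\ell = \theta'_\ell$. Here $\mathrm{range}(f^\sharp)$ denotes $\{T \mid (r, T) \in f^\sharp\}$.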
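The two conditions of the definition can be checked mechanically. The following is a minimal Python sketch, not part of the original development: it assumes each term $T$ is given as a list of (label, coefficient) pairs and each parameter set as a dictionary from labels to angles. The name `sharp_equivalent`, the labels `l1` to `l5`, and the tolerance `tol` are hypothetical choices made for illustration.

```python
import math


def sharp_equivalent(terms, theta, theta_prime, tol=1e-9):
    """Check f#-equivalence of two parameter sets (illustrative sketch).

    terms       -- iterable of terms T, each a list of (label, a) pairs
    theta       -- dict mapping each label to its angle
    theta_prime -- dict over the same labels
    """
    if theta.keys() != theta_prime.keys():
        return False
    covered = set()
    for term in terms:
        # Condition 1: the weighted angle sums over each term agree.
        total = sum(a * theta[label] for label, a in term)
        total_prime = sum(a * theta_prime[label] for label, a in term)
        if not math.isclose(total, total_prime, abs_tol=tol):
            return False
        covered.update(label for label, _ in term)
    # Condition 2: labels appearing in no term keep exactly the same angle.
    return all(theta[l] == theta_prime[l] for l in theta if l not in covered)


# Hypothetical example: l1, l3 and l4 share a term (l4 with coefficient -1),
# l2 is alone in its term, and l5 belongs to no term.
terms = [[("l1", 1), ("l3", 1), ("l4", -1)], [("l2", 1)]]
theta = {"l1": 0.3, "l2": 0.7, "l3": 0.5, "l4": 0.2, "l5": 1.0}
theta_prime = {"l1": 0.6, "l2": 0.7, "l3": 0.0, "l4": 0.0, "l5": 1.0}
print(sharp_equivalent(terms, theta, theta_prime))
```

In this example the angles at `l1`, `l3` and `l4` are redistributed within their term without changing the weighted sum, so the two parameter sets are reported as equivalent; changing `l2` or `l5` alone breaks the equivalence.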
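With the definition of $f^\sharp$-equivalence, we can now state the main theorem.
Theorem A.1.2. Given a labelled circuit $C$ and input state $A_b$, let $(f^\sharp, A'_{b'}) = \llbracket C \rrbracket_a(\emptyset, A_b)$.
For any parameter sets $\theta, \theta' \in \mathbb{R}^L$ such that $\theta \sim_{f^\sharp} \theta'$, and for any $x \in \mathbb{F}_2^n$,
$$\llbracket C(\theta) \rrbracket |Ax + b\rangle = \llbracket C(\theta') \rrbracket |Ax + b\rangle$$
up to global phase.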
Note that the circuit $C$ before and after phase folding corresponds to $C(\theta)$ and $C(\theta')$
where $\theta \sim_{f^\sharp} \theta'$, hence Theorem A.1.2 implies the correctness of Algorithm 4.1.
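For circuits built only from X, CNOT and labelled $R_Z$ gates, the statement of the theorem can be checked by brute force on small instances. The sketch below is an illustration under stated assumptions rather than the thesis' implementation: it assumes $R_Z(\theta)$ contributes the phase $e^{i\theta}$ exactly when its qubit is in state 1, and the gate encoding, the labels, and the function names `run` and `equal_up_to_global_phase` are hypothetical.

```python
import cmath
from itertools import product


def run(circuit, theta, bits):
    """Apply the circuit to one basis state; return (phase angle, output bits)."""
    bits = list(bits)
    phase = 0.0
    for gate in circuit:
        if gate[0] == "X":
            bits[gate[1]] ^= 1
        elif gate[0] == "CNOT":
            _, control, target = gate
            bits[target] ^= bits[control]
        elif gate[0] == "RZ":
            _, label, qubit = gate
            phase += theta[label] * bits[qubit]  # assumed convention: diag(1, e^{i theta})
    return phase, tuple(bits)


def equal_up_to_global_phase(circuit, theta, theta_prime, n, tol=1e-9):
    """Compare C(theta) and C(theta') on every n-bit basis state.

    The output basis state does not depend on the angles, so only the
    accumulated phases need to be compared.
    """
    ratios = []
    for bits in product((0, 1), repeat=n):
        p, _ = run(circuit, theta, bits)
        p_prime, _ = run(circuit, theta_prime, bits)
        ratios.append(cmath.exp(1j * (p - p_prime)))
    # Equal up to global phase: the phase ratio is the same for every input.
    return all(abs(r - ratios[0]) < tol for r in ratios)


# Hypothetical circuit: l1 and l3 act on the parity x0, l4 acts on 1 + x0
# (coefficient -1 plus a global phase), and l2 acts on x0 + x1.
circuit = [("RZ", "l1", 0), ("CNOT", 0, 1), ("RZ", "l2", 1), ("CNOT", 0, 1),
           ("RZ", "l3", 0), ("X", 0), ("RZ", "l4", 0), ("X", 0)]
theta = {"l1": 0.3, "l2": 0.7, "l3": 0.5, "l4": 0.2}
folded = {"l1": 0.6, "l2": 0.7, "l3": 0.0, "l4": 0.0}   # same sum 0.3 + 0.5 - 0.2
print(equal_up_to_global_phase(circuit, theta, folded, n=2))
```

Here `folded` plays the role of the parameter set after phase folding: the three angles on the same parity are merged into the first location and the remaining locations are set to zero, which corresponds to removing those gates.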
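We prove Theorem A.1.2 by showing that $\llbracket \cdot \rrbracket_a$ is a sound abstraction of an $L$-parametrized
path integral semantics, which itself is equivalent to the standard semantics. Specifically,
we denote by $\mathcal{F}_n = \mathbb{R}^L \to (\mathbb{F}_2^n \to \mathbb{R})$ the $L$-parametrized phase polynomials and by $\Phi_n : \mathbb{F}_2^n \to \mathbb{C}$
the amplitude functions, and take
$$\mathcal{D}_n = \mathcal{F}_n \times \Phi_n \times \mathcal{A}(\mathbb{F}_2^{n+m}, \mathbb{F}_2^n)$$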
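as the domain of $L$-parametrized path integrals. Intuitively, the state given by an element
$(f, \phi, A_b) \in \mathcal{D}_n$ and parameter set $\theta$ is defined as
$$|f(\theta), \phi, A_b\rangle = \sum_{x \in \mathbb{F}_2^{n+m}} e^{i f(\theta)(x)} \phi(x) |Ax + b\rangle.$$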
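The amplitude function $\phi$ is included to simplify the proof by separating the amplitude due
to uninterpreted gates. The concrete semantics $\llbracket C \rrbracket_c : \mathcal{D}_n \to \mathcal{D}_n$ is defined recursively as
$$\begin{aligned}
\llbracket X_i \rrbracket_c(f, \phi, A_b) &= (f, \phi, A_{b \oplus e_i}) \\
\llbracket \mathrm{CNOT}_{i,j} \rrbracket_c(f, \phi, A_b) &= (f, \phi, E_{i,j} A_b) \\
\llbracket R_Z(\cdot)^\ell_i \rrbracket_c(f, \phi, A_b) &= (f'(\theta)(x) = f(\theta)(x) + \theta_\ell (A_i x \oplus b_i), \phi, A_b) \\
\llbracket U_{n-k+1,\dots,n} \rrbracket_c(f, \phi, A_b) &= (f, \phi'(x) = \phi(x) + \langle A_k x + b_k | U | Ax + b \rangle, (A_k)_{b_k}) \\
\llbracket C ; C' \rrbracket_c &= \llbracket C' \rrbracket_c \circ \llbracket C \rrbracket_c
\end{aligned}$$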
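The X, CNOT and $R_Z$ rules above can be sketched as state transformers in Python. This is an illustrative sketch under several assumptions, not the thesis' implementation: the amplitude function $\phi$ and the rule for uninterpreted gates are omitted, $\mathrm{CNOT}_{i,j}$ is taken to have control $i$ and target $j$, each row $A_i$ is stored as an integer bitmask over the input variables, and $f$ is stored as a list of (label, row, offset) triples. All function names are hypothetical.

```python
def identity_state(n):
    """Initial state (f, A, b): no phase terms, A the identity, b = 0.

    Row i of A is the bitmask 1 << i, i.e. qubit i holds the variable x_i.
    """
    return [], [1 << i for i in range(n)], [0] * n


def apply_x(state, i):
    """X_i flips bit i of the affine offset b."""
    f, A, b = state
    b = b.copy()
    b[i] ^= 1
    return f, A, b


def apply_cnot(state, i, j):
    """CNOT with control i, target j adds row i (and offset i) into row j."""
    f, A, b = state
    A, b = A.copy(), b.copy()
    A[j] ^= A[i]
    b[j] ^= b[i]
    return f, A, b


def apply_rz(state, label, i):
    """R_Z at location `label` on qubit i records the term theta_label * (A_i x + b_i)."""
    f, A, b = state
    return f + [(label, A[i], b[i])], A, b


def evaluate(f, theta, x):
    """Evaluate f(theta)(x) for an input x given as a bitmask."""
    total = 0.0
    for label, row, offset in f:
        parity = (bin(row & x).count("1") + offset) % 2  # A_i x XOR b_i over F_2
        total += theta[label] * parity
    return total


state = identity_state(2)
state = apply_rz(state, "l1", 0)    # phase on x0
state = apply_cnot(state, 0, 1)
state = apply_rz(state, "l2", 1)    # phase on x0 + x1
state = apply_x(state, 0)
state = apply_rz(state, "l3", 0)    # phase on 1 + x0
f, A, b = state
print(f, A, b)
```

Keeping each recorded term as a (row, offset) pair mirrors the expression $A_i x \oplus b_i$ in the $R_Z$ rule, so `evaluate` only needs a parity computation per term.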
The lemma below establishes the correctness of the path integral semantics with respect
to the standard matrix semantics.
Lemma A.1.3. For all $L$-labelled circuits $C$ and parameters $\theta$, if $|f(\theta), \phi, A_b\rangle = |\psi\rangle$, then
$$(\llbracket C \rrbracket_c(f, \phi, A_b))(\theta) = \llbracket C(\theta) \rrbracket |\psi\rangle.$$
Proof. The base cases all follow from direct calculation:
$$\llbracket X_i \rrbracket |f(\theta), \phi, A_b\rangle = \sum_{x \in \mathbb{F}_2^{n+m}} e^{i f(\theta)(x)} \phi(x) |Ax + (b \oplus e_i)\rangle$$


















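$e^{i f(\theta)(x)} \left(\phi(x) + \langle A_k(x,y) + b_k | U | Ax + b \rangle\right) |A_k(x,y) + b_k\rangle$
Hence by induction on $C$, $(\llbracket C \rrbracket_c(f, \phi, A_b))(\theta) = \llbracket C(\theta) \rrbracket |\psi\rangle$.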
As is customary [CC77], we define a soundness relation $\tau \subseteq \mathcal{A}_n \times \mathcal{D}_n$ between states of
the phase analysis and $L$-parametrized path integrals.
Definition A.1.4 (soundness relation). For $P^\sharp = (f^\sharp, A^\sharp_{b^\sharp}) \in \mathcal{A}_n$ and $P = (f, \phi, A_b) \in \mathcal{D}_n$,










up to global phase, where $\chi_y(x) = \chi_{y^\sharp}(x^\sharp)$ whenever $r = y^\sharp \in \mathbb{F}_2^n$ and $A^\sharp x^\sharp = Ax$.
The intuition behind the soundness relation above is that $A^\sharp_{b^\sharp}$ overapproximates the
basis states with non-zero amplitude in $|f(\theta), \phi, A_b\rangle$; that is, for every $y = Ax + b$ there exists
$x^\sharp$ such that $A^\sharp x^\sharp + b^\sharp = y$. Moreover, the abstract phase polynomial $f^\sharp$ has the same
(non-global) coefficients as $f$, namely $\sum_{(\ell,a)\in T} a \cdot \theta_\ell$ where $(T, r) \in f^\sharp$, and in particular when
$r \neq \bot$, the parities match.
From this definition it can be easily verified that if $(P^\sharp, P) \in \tau$, then for all $\theta, \theta'$ such
that $\theta \sim_{f^\sharp} \theta'$, we have $f(\theta) = f(\theta')$ up to constant factors. Moreover, this shows that our
soundness relation is in a sense strong enough to prove Theorem A.1.2, in particular with
the following lemma:
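Lemma A.1.5. If $(P^\sharp, P) \in \tau$, then for any $\theta, \theta' \in \mathbb{R}^L$ such that $\theta \sim_{f^\sharp} \theta'$,
$$|f(\theta), \phi, A_b\rangle = |f(\theta'), \phi, A_b\rangle$$
up to global phase.
Proof. Trivial since $f(\theta) = f(\theta')$ up to constant factors.
By Lemma A.1.5 above, the last piece we need is to show that $\llbracket \cdot \rrbracket_a$ preserves soundness.
Lemma A.1.6. Let $C$ be an $L$-parametrized circuit, and let $(P^\sharp, P) \in \tau$. Then
$$(\llbracket C \rrbracket_a P^\sharp, \llbracket C \rrbracket_c P) \in \tau.$$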
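Proof. Our proof follows by induction on the structure of $C$.
Case: $C = X_i$
We have
$$\llbracket C \rrbracket_a(f^\sharp, A^\sharp_{b^\sharp}) = (f^\sharp, A^\sharp_{b^\sharp \oplus e_i}), \qquad \llbracket C \rrbracket_c(f, \phi, A_b) = (f, \phi, A_{b \oplus e_i}).$$
Since $b^\sharp = b$, $b^\sharp \oplus e_i = b \oplus e_i$. Moreover, since $A^\sharp$ and $A$ are unchanged, for all $x \in \mathbb{F}_2^{n+m}$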








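where $\chi_y(x) = \chi_{y^\sharp}(x^\sharp)$ whenever $r = y^\sharp \in \mathbb{F}_2^n$.
Case: $C = \mathrm{CNOT}_{i,j}$
Similar to the previous case.
Case: $C = R_Z(\cdot)^\ell_i$
Let $\llbracket C \rrbracket_a(f^\sharp, A^\sharp_{b^\sharp}) = ((f^\sharp)', A^\sharp_{b^\sharp})$, $\llbracket C \rrbracket_c(f, \phi, A_b) = (f', \phi, A_b)$. As the basis states are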








where $\chi_y(x) = \chi_{y^\sharp}(x^\sharp)$ whenever $r = y^\sharp \in \mathbb{F}_2^n$.
By the definition of $\llbracket \cdot \rrbracket_c$ we have
$$f'(\theta)(x) = \theta_\ell (A_i x \oplus b_i) + f(\theta)(x)$$














up to global phase. Since the basis states are unchanged, we only need to verify that
whenever $A^\sharp x^\sharp = Ax$, then $\chi_{A_i^T}(x) = \chi_{(A^\sharp)_i^T}(x^\sharp)$. In particular, we know
$$\chi_{(A^\sharp)_i^T}(x^\sharp) = A^\sharp_i x^\sharp = A_i x = \chi_{A_i^T}(x).$$
Case: $C = U_{n-k+1,\dots,n}$
First note that $(b^\sharp)' = b^\sharp_k = b_k = b'$, as required.





first note that there exists $y$ such that
$$A^\sharp_k y = A_k x$$



















ky = Akx = Akx.
Finally, observe that $f' = f$ and $(f^\sharp)'$ has the same terms as $f^\sharp$, so














where $\chi_y(x) = \chi_{y^\sharp}(x^\sharp)$ whenever $r = y^\sharp \in \mathbb{F}_2^n$ and $A^\sharp x^\sharp = Ax$. We only need


















The second from last equality follows since $(y^\sharp, 0)$ is in the column space of $(A^\sharp_k)^T$,
i.e. is a linear combination of the first $n - k$ rows of $A^\sharp_k$. The last equality follows
since $\chi_{((A^\sharp)^g A)^T y^\sharp}(x) = \chi_{y^\sharp}((A^\sharp)^g A x)$ and $A^\sharp (A^\sharp)^g A x = Ax$ since $Ax$ is in the column
space of $A^\sharp$ (see, e.g., [CM09]).
Case: $C = C' ; C''$
By the inductive hypothesis.
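Proof (Theorem A.1.2). Given an input state $A_b$, let $(f^\sharp, A^\sharp_{b^\sharp}) = \llbracket C \rrbracket_a(\emptyset, A_b)$ and
$(f', \phi', A'_{b'}) = \llbracket C \rrbracket_c(0, 0, A_b)$. Clearly $((\emptyset, A_b), (0, 0, A_b)) \in \tau$, and hence by Lemma A.1.6,
$((f^\sharp, A^\sharp_{b^\sharp}), (f', \phi', A'_{b'})) \in \tau$.
By Lemmas A.1.3 and A.1.5 we have for any $x \in \mathbb{F}_2^n$,
$$\llbracket C(\theta) \rrbracket |Ax + b\rangle = |f'(\theta), \phi', A'_{b'}\rangle = |f'(\theta'), \phi', A'_{b'}\rangle = \llbracket C(\theta') \rrbracket |Ax + b\rangle$$
up to global phase, as required.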
