Quilc is an open-source, optimizing compiler for gate-based quantum programs written in Quil or QASM, two popular quantum programming languages. The compiler was designed with attention toward NISQ-era quantum computers, specifically recognizing that each quantum gate has a non-negligible and often irrecoverable cost toward a program's successful execution. Quilc's primary goal is to make authoring quantum software a simpler exercise by making architectural details less burdensome to the author. Using Quilc allows one to write programs faster while usually not compromising (and indeed sometimes improving) their execution fidelity on a given hardware architecture. In this paper, we describe many of the principles behind Quilc's design, and demonstrate the compiler with various examples.
The source code of Quilc is licensed under the Apache 2.0 license and can be found at github.com/rigetti/quilc. This document refers to Quilc version 1.12.1.
Introduction
Noisy intermediate-scale quantum (NISQ) computers are an active area of research. New quantum computer architectures are sometimes the result of incremental improvements in the manufacturing process, and at other times are paradigm-shifts in the qubit technologies themselves. While each new architecture is universal in a computational sense, the impermanence of their designs challenges one's ability to write software for them. As has been the case with classical computers, the role of a compiler is to attenuate this challenge. Software for a quantum computer is ideally written in the manner that is simplest and most straightforward to the programmer, without necessarily requiring knowledge of the particulars of the target architecture. It is then the job of the compiler to produce both an efficient and an appropriate expression of this software which accounts for the details of the target architecture.
In this paper we present Quilc, an open-source software application used to compile quantum programs written in Quil [24, 3] into an optimized program that is expressed in the native operations of a target quantum computer architecture. Quilc does not require (and indeed has no means to accept) instruction from the user on a fine-grained compilation strategy. Instead, it consumes a simple description of the architecture for which Quilc must compile the user's program. The architecture description language is general enough to handle most manufactured gate-based computer architectures to date, and anticipates new ones. For these reasons, we say Quilc is automatic and retargetable. Quilc is also more than a desk calculator (a convenience to avoid doing manual, repetitive calculations), as it acts as a repository of knowledge about the compilation of programs, and it is able to synthesize this information to discover non-trivial expressions of a quantum program. We provide examples of this in Section 5. It is also production-grade, and is deployed as an essential component of Rigetti Computing's software stack.
The structure of the paper is as follows. First, in Section 2 we provide an overview of Quilc, including a mathematical formulation of quantum architectures as they pertain to compilation. This formalism is used in Section 3 to describe how Quilc achieves retargetability, a high-level overview of which is presented in Figure 1. In Section 4 we consider two compilation stages (the "addressing" and "compression" stages) in more detail. We follow this in Section 5 with a few non-trivial examples, which make use of many of the features present in Quilc. In Section 6 we investigate the performance of Quilc on a set of benchmarks. Finally, in the appendix we consider some implementation details which may be of interest to compiler authors or potential contributors to Quilc, including examples of our "compilation subroutine" domain-specific language, additional features of Quilc, and a short history of the project's development.
A Quantum Compiler Target
The structure of Quilc is informed by the task at hand: it must conform to the features and constraints of gate-based quantum computational devices (i.e., the shape of its output), and it must understand the features and constraints of the quantum programming language, Quil (i.e., the shape of its input). Gate-based quantum computational devices are made up of quantum resources (typically qubits) and support operations which affect the state of a subsystem, commingle the states of two or more subsystems, or collapse and copy the state of a subsystem for classical interpretation. The Quil language itself provides support for all of these operations in an assembly-like format: the system's resources are addressed by "quantum registers", the individual assembly instructions correspond to the aforementioned operations, and there are mechanisms for further specifying other classical requirements (e.g., memory to store user-defined parameters) and their operation.
Figure 1: The layout stage looks for an optimal initial mapping of logical qubits to physical qubits: in this example, the compiler has opted to relabel qubit 0 as qubit 4, and qubit 4 as qubit 5, so that in the (here unspecified) chip topology the two-qubit interaction can occur without introducing SWAP instructions. The nativization stage converts any non-native gate into a native gate; in this example CNOT is first compiled to RY, Z, and CZ gates, and then finally nativization compiles those RY and Z gates into native RX, RZ and CZ gates. The final optimization stage here rewrites several RX or RZ instructions acting on the same target qubit into fewer instructions.

Figure 2: A typical, small instance of a "QAOA" quantum program. The instructions H, CPHASE, RX, and MEASURE respectively put the qubit registers into superposition states, commingle the registers, attempt to "unmix" them, and measure the residual commingling.

    DEFCIRCUIT RESET q scratch:
        MEASURE q scratch
        JUMP-UNLESS @done scratch
        X q
        LABEL @done

    DEFGATE U(%alpha, %beta):
        cis(2*pi*%alpha), 0, 0, 0
        0, cis(-2*pi*%alpha), 0, 0
        0, 0, cis(-2*pi*%beta), 0
        0, 0, 0, cis(2*pi*%beta)

Figure 3: Top: a Quil snippet demonstrating the implementation of a RESET instruction as applied to a qubit q. The instruction X is the quantum equivalent of a NOT instruction. Bottom: a Quil snippet defining a custom instruction.

To remain portable, the Quil language is also designed to be hardware-agnostic: it makes no particular assumptions on the availability of resources or what particular operations they support. Instead, its execution semantics are formally specified against a mathematical backend in such a way that makes it clear how to abstractly simulate the effects of a Quil program. At the other extreme, physical devices do labor under a host of severe constraints: there is a fixed (and typically small) number of resources, there are only a few very particular instructions which the device can enact on those resources, operations are subject to other requirements (e.g., spatial locality: distant qubits are typically unable to directly interact), and operations may be error-prone. The role of a quantum compiler is to convert an abstract specification of such a quantum program into machine-executable bytecode, interpretable by the classical electronics which manipulate the engineered quantum system. The compilation process cleaves roughly into two parts: the conversion of the program's quantum aspects to a form that comports with the constraints of the engineered quantum system, followed by the conversion of the classical aspects to a form that comports with the structure of the control electronics. The quantum concerns are primarily those announced above, and while we will chiefly concern ourselves with them, we also name some classical concerns for completeness: memory management, timing and synchronization, classical communication, as well as a host of others. Quilc's approach to the satisfaction of the quantum constraints is to order them by severity:
1. Any operations of large arity must be decomposed into an arity supported by the system.
2. Any operations between non-interacting or indirectly interacting regions must be spatially rearranged to accommodate the system's preferred set of interactions.
3. All operations must be written in terms of the set of operations (presumed universal) that the system can perform.
4. The program should be structured so as to avoid suffering performance penalties.
The first three requirements are all equally important, in the sense that the program cannot be executed on a given physical device if any are not satisfied. On a heterogeneous device, however, it is not possible to discern what operations the system can perform without first resolving (2), which in turn requires resolving (1), so that only then can (3) be addressed.
Quilc leaves the resolution of (4) for last, since it can be satisfied "by degrees": for instance, it is generally better for (4) to use fewer instructions, but there are also no clear hard limits to either the maximum count tolerable or the minimum count achievable (see footnote 3).
Footnote 3: The decision to resolve arity and addressing before circuit optimization does incur a loss of information. In particular, certain high-level circuit identities may not be obvious after the program has been lowered to elementary operations. However, in many instances such high-level circuit optimizations are most appropriately handled by library authors, in concert with the low-level optimizations of the compiler. In this respect, programming a quantum computer does not differ so much from programming a classical one.
There is a natural data structure which captures these constraints of the target architecture and which forms the backbone of Quilc. To describe it, fix a set S of quantum resources. A target architecture amounts to specifying the available interacting subsets S′ ⊆ S, each indicating a collection of resources that are permitted to interact: for instance, a pair of qubits on which one can perform a two-qubit gate. Let us make the additional assumption that if S′ is an interacting subset, then every further nonempty subset S′′ ⊆ S′ is also an interacting subset. Under this assumption, such a collection of interacting subsets forms a simplicial complex Σ [25, Section 3.1] with S as its set of vertices (or 0-simplices), interacting pairs as its edges (or 1-simplices), interacting triplets as its 2-simplices, and so on. Additionally, we tag each simplex with a set of instructions, each corresponding to a specific means by which the interacting set can evolve as an ensemble. Finally, each such instruction is tagged with metadata, e.g., a matrix defining the associated unitary transformation, the average fidelity of the device's execution of the instruction compared to the ideal, or the temporal duration of the instruction execution on the device.
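As a minimal sketch of this data structure (not Quilc's own implementation), the tagged simplicial complex can be modeled in Python as a set of frozensets closed under taking nonempty subsets, with a dictionary of instruction tags per simplex. The three-qubit linear device, the helper names, and the fidelity and duration numbers below are all hypothetical and chosen only for illustration.

```python
from itertools import combinations


def downward_closure(maximal_sets):
    """Return all nonempty subsets of the given interacting sets."""
    simplices = set()
    for interacting in maximal_sets:
        members = sorted(interacting)
        for size in range(1, len(members) + 1):
            for subset in combinations(members, size):
                simplices.add(frozenset(subset))
    return simplices


# Hypothetical device: three qubits in a line, 0 - 1 - 2.
sigma = downward_closure([{0, 1}, {1, 2}])

# Tag each simplex with instructions, and each instruction with metadata.
# The fidelity and duration values are placeholders, not measurements.
tags = {}
for simplex in sigma:
    if len(simplex) == 1:
        tags[simplex] = {
            "RX": {"fidelity": 0.999, "duration_ns": 50},
            "RZ": {"fidelity": 1.0, "duration_ns": 0},
        }
    else:
        tags[simplex] = {"CZ": {"fidelity": 0.95, "duration_ns": 200}}

# The largest simplex bounds the arity of any native instruction.
max_arity = max(len(simplex) for simplex in sigma)


def native_instructions(qubits):
    """Instructions available on exactly this set of qubits (empty if none)."""
    return sorted(tags.get(frozenset(qubits), {}))
```

In this toy model, qubits 0 and 2 do not form a simplex, so `native_instructions([0, 2])` is empty: a two-qubit gate between them would first require rerouting through qubit 1.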
We have arranged the description of the target architecture into these tiers because this reflects the kind of information needed as input to the four compilation constraints:
1. The limit on the arity of an instruction corresponds to the dimension of the largest simplex.
2. The simplicial complex structure of interactions describes both the spatial constraints to which a quantum program is subject, as well as the pathways by which information can be rerouted or permuted in order to satisfy these constraints.

Figure 5: A simple target architecture in serialized form. Note that π/2 = 1.57...

Addressing. Quilc resolves constraints (1), (2), and (3) by walking the graph of quantum instructions, ordered by resource dependence, in a breadth-first manner and tracking a mapping from logical quantum resources (i.e., those specified in the user's program) to physical quantum resources (i.e., those actually available on the device).

Compression. Lastly, Quilc gives attention to (4) by finding sequences of instructions that act on overlapping resources through a kind of depth-first walk and applying reduction techniques (e.g., a peephole rewriter) to the paths appearing during the walk. See Figure 6 for an example.
There are two important points to note. First, the addressing and compression stages require intensive computation to fully explore the graph: for example, deciding whether the addressing problem admits a solution without the insertion of SWAP instructions is an instance of the (NP-complete) subgraph isomorphism problem [21]. Rather than explore the graph exhaustively, we employ heuristics, randomization, and approximate algorithms to tame the computational complexity; see Section 4 for more details, as well as Figure 14a and Figure 14b in Appendix A for some empirical analysis of these approximations. The second important point is that although the addressing and compression stages crawl the program in very different ways, the actual manipulation of quantum instructions is similar in each. This promotes the segmentation of these stages into small, interruptible compilation subroutines which adopt a uniform interface. The API which we expect compilation subroutines to provide consists of:
• A literal subroutine, which consumes some fixed number of instructions and some data about the state of the compiler, and which emits a sequence of instructions which may replace its input. The subroutine is allowed to be partially defined: compilation subroutines can signal that they are not applicable in a given situation by employing an interrupt which is handled appropriately by the caller.
• A description of the kinds of instructions that the subroutine can consume.
• A description of the kinds of instructions (and, ideally, their counts) that the routine can emit.
Equipped with this extra data, Quilc can decide whether a particular subroutine falls into one of two privileged classes (or neither):
Nativizers. A compilation routine is relevant for nativization when it consumes a single instruction and when its output belongs to the native interactions of Σ (perhaps after further applications of other nativizers).
Optimizers. A compilation routine is relevant for optimization when it both consumes and emits instruction sequences which belong to the native interactions of Σ and its emitted sequences have better execution properties than its inputs.
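The three-part interface described above can be sketched in Python as follows. This is an illustrative model only, not Quilc's actual API: the `Subroutine` record, the `NotApplicable` exception (standing in for the "interrupt"), the tuple encoding of instructions, and the use of instruction count as a proxy for "better execution properties" are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical instruction encoding: (gate name, angle, qubit).
Instr = Tuple[str, float, int]


class NotApplicable(Exception):
    """Raised by a subroutine to signal that it does not apply here."""


@dataclass
class Subroutine:
    run: Callable[[List[Instr]], List[Instr]]  # the literal subroutine
    consumes: Tuple[str, ...]                  # kinds of instructions consumed
    emits: Dict[str, int]                      # kinds (and counts) emitted


def merge_rz(instrs):
    """Replace two RZ gates on the same qubit with a single RZ."""
    (g1, a1, q1), (g2, a2, q2) = instrs
    if g1 != "RZ" or g2 != "RZ" or q1 != q2:
        raise NotApplicable
    return [("RZ", a1 + a2, q1)]


MERGE_RZ = Subroutine(run=merge_rz, consumes=("RZ", "RZ"), emits={"RZ": 1})


def is_optimizer(sub, native_gates):
    """Native in, native out, and (as a stand-in metric) fewer instructions."""
    native_in = all(g in native_gates for g in sub.consumes)
    native_out = all(g in native_gates for g in sub.emits)
    shorter = sum(sub.emits.values()) < len(sub.consumes)
    return native_in and native_out and shorter


def apply_or_keep(sub, instrs):
    """Caller-side handling of the interrupt: keep the input unchanged."""
    try:
        return sub.run(instrs)
    except NotApplicable:
        return list(instrs)
```

Because the declared `consumes` and `emits` data are separate from the body of `run`, a caller can classify the subroutine against a particular native gate set without executing it.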
The addressing step employs the first class of compilation subroutines in order to convert the input program's non-native instructions to instructions that are native for Σ's interactions. The compression step makes use of both of these special classes: the optimizers are directly relevant to the compression of instruction sequences, but it is also fruitful to destroy some of the structure of a sequence of instructions by considering their holistic effect and to renativize it. This kind of design lends itself to the support of a few features:
Planning. Verifying that a particular long sequence of reductions gives an optimal strategy (from the perspective of constraint (4)) is computationally expensive, frequently relegated to heuristics, and costly to guess incorrectly. It is less expensive to make analogous decisions about individual, simple reductions, and so interruptibility permits the compiler a greater degree of flexibility in quickly planning its next best move.
Internal reusability. Requiring that subroutines leave the overall compiler in a good intermediate state puts dramatic limitations on the API to which they conform. From the perspective of Quilc, this tends to make such subroutines suitable for use at various stages of compilation. It also promotes a separation of concerns between the code responsible for crawling the input program (as described above in, e.g., "addressing" and "compression") and these subroutines, so that the crawlers are written in such a generic way that their specification does not directly depend on knowing the set of subroutines to be applied to the input program.
Ease of authorship. The same limitations on the API mean that the author of quantum-specific subroutines need not concern themselves with the precise implementation of the crawlers or, in Quilc's case, even with how to instruct the crawlers that they should make use of a new subroutine.
External reusability. It is possible to wrap routines provided by external compilation libraries via this API, so that Quilc can make seamless use of their specialized routines without reimplementation or serious diversion. One such example is tweedledum, a quantum circuit optimizing library which provides an efficient routine for compilation of gates that can be represented as permutation matrices [8].
A Closer Look at Addressing and Compression
Much of the "action" of the compiler occurs in the addressing and compression stages. Indeed, these are the most computationally intensive stages of compilation, and the techniques employed are critically responsible for the quality of the output of the compiler (e.g., with respect to gate depth). In this section we give brief summaries of the underlying techniques of these two compilation stages, along with points of contact in the literature.
Addressing
In principle, the addressing stage may be resolved by any sequence of circuit transformations which preserve the semantics of the source program and which result in an output program conforming to the constraints of the hardware. However, the set of such possible transformations is immense. Here we constrain ourselves to two classes of transformations: those which serve to translate gates into native constituents (e.g., the aforementioned "nativization" routines) and those which manage the assignment of logical to physical qubits (known as "qubit allocation" in the literature).
A number of general-purpose techniques for qubit allocation have been proposed; we briefly mention some here. In [21], the authors consider four sorts of program transformations ("virtual CNOTs", reversals, bridges, and swaps), formulating the problem precisely in these terms and presenting heuristics to approximate the optimal sequence of transformations to solve the qubit allocation problem.
In [28], the authors first produce an initial decomposition of the source circuit (consisting of 1Q and 2Q gates) into layers, then construct an initial qubit mapping, and then finally identify swap operations between layers via A* search using a cost function which may incorporate look-ahead to successive layers. The method of [5] uses a similar layered decomposition, but with different heuristics for constructing the initial mapping and for selecting swaps between layers.
Alternative approaches have formulated the addressing problem in a form amenable to solution by off-the-shelf solvers. For example, in [12], the authors formulate qubit allocation as a satisfiability modulo theories (SMT) problem which may be solved by an SMT solver. Along similar lines, [27] proposes and evaluates a formulation which is solvable by temporal planners.
Core to the approach taken by Quilc is a representation of the source program which expresses the constraints, with respect to hardware usage, implicit in the linear source program. Here, the source instructions are taken as the vertices of a directed, acyclic graph, with edges expressing resource conflicts (whether classical or quantum) between logically successive operations. The addressing pass proceeds in a greedy fashion, consuming the source program in topological order while maintaining a certain amount of state, including a partial logical-to-physical qubit mapping, estimated swap costs between pairs of qubits, and a buffer of emitted instructions operating on physical qubits.
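The dependency graph described above can be illustrated with a short Python sketch. It is a simplification under stated assumptions: instructions are encoded as hypothetical `(name, resources)` pairs, an edge is drawn from each instruction to the next later instruction that touches one of its resources, and the topological order is produced with Kahn's algorithm. None of the function names below come from Quilc.

```python
def dependency_edges(program):
    """Edges (i, j): instruction j is the next user of a resource of i."""
    last_user = {}
    edges = set()
    for j, (_name, resources) in enumerate(program):
        for resource in resources:
            if resource in last_user:
                edges.add((last_user[resource], j))
            last_user[resource] = j
    return edges


def topological_order(count, edges):
    """Kahn's algorithm over instruction indices 0 .. count - 1."""
    indegree = [0] * count
    successors = {i: [] for i in range(count)}
    for a, b in edges:
        successors[a].append(b)
        indegree[b] += 1
    ready = [i for i in range(count) if indegree[i] == 0]
    order = []
    while ready:
        i = ready.pop(0)
        order.append(i)
        for j in sorted(successors[i]):
            indegree[j] -= 1
            if indegree[j] == 0:
                ready.append(j)
    return order


# A small linear program: resources here are qubit indices only.
program = [
    ("H", (0,)),
    ("H", (1,)),
    ("CNOT", (0, 1)),
    ("X", (2,)),
    ("CNOT", (1, 2)),
]
edges = dependency_edges(program)
order = topological_order(len(program), edges)
```

Note that the `X` on qubit 2 has no predecessors, so a topological walk is free to consume it before the first `CNOT`, even though it appears later in the linear source.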
In this scheme, gate applications involving a number of qubits exceeding the underlying arity of the device (e.g., 3Q gates for Rigetti's hardware) are first translated to an equivalent series of smaller-arity operations by means of any number of nativization routines, as described in Section 3. We note that the goal here is not to find an "optimal" sequence of native operations, but rather to quickly find a viable realization of the gate in native terms. However, Quilc does attempt to select the translation so as to be in harmony with the particular qubits' native gate sets: for instance, ISWAP-based decompositions are preferred to CNOT-based decompositions when the native gate set includes the ISWAP gate and not CNOT.
As each low-arity operation (e.g., 1Q and 2Q gates on Rigetti's hardware) is processed, the logical-to-physical qubit mapping may be updated, either by assigning a logical qubit to a currently unassigned physical qubit, or via the introduction of SWAP operations in order to satisfy addressing constraints. At each such decision point, the ambition of Quilc is to select the action which minimizes the total cost of the final scheduled program.
To this end, Quilc employs heuristics along two axes. The first are cost heuristics, which, given a logical Quil program and a compilation target, determine a cost indicating the "badness" of the program on the underlying hardware. At present, there are two of these available: a duration-based heuristic, informed by the underlying gate times of the target architecture, and a fidelity-based heuristic, informed by the reported gate fidelities of the target architecture. The second axis consists of search heuristics, which are used to select from available swap operations in order to assign a logical gate to physical qubits in a cost-minimizing fashion. These include both A* search as well as greedy search heuristics.
A challenge with such an approach is that SWAPs inserted cheaply at one point in a program may end up being costly with respect to later operations. To account for this, both the duration-based and the fidelity-based cost heuristics incorporate a look-ahead: here the cost associated with a Quil program depends not just on the next instructions, but also more weakly on those following them (via an exponential discounting factor). Here, the motivation is to dampen the miserliness of the (otherwise greedy) addressing strategy.
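To make the effect of the discounting factor concrete, the following Python sketch scores candidate SWAPs on a hypothetical four-qubit line. It is not Quilc's cost function: where Quilc's heuristics are duration- or fidelity-based, this example uses the number of extra hops between the two qubits of each upcoming gate as a stand-in cost, and the function names and the discount value of 0.5 are assumptions.

```python
def lookahead_cost(upcoming, mapping, distance, discount=0.5):
    """Discounted sum of extra hops needed by upcoming two-qubit gates."""
    total = 0.0
    for depth, (a, b) in enumerate(upcoming):
        extra_hops = distance(mapping[a], mapping[b]) - 1
        total += (discount ** depth) * extra_hops
    return total


def apply_swap(mapping, p, q):
    """Return a new logical-to-physical mapping after swapping p and q."""
    swapped = {}
    for logical, physical in mapping.items():
        if physical == p:
            swapped[logical] = q
        elif physical == q:
            swapped[logical] = p
        else:
            swapped[logical] = physical
    return swapped


def best_swap(upcoming, mapping, links, distance, discount=0.5):
    """Pick the hardware link whose SWAP minimizes the look-ahead cost."""
    def cost_after(link):
        return lookahead_cost(upcoming, apply_swap(mapping, *link),
                              distance, discount)
    return min(links, key=cost_after)


def line_distance(p, q):
    """Hop distance on a line of physical qubits 0 - 1 - 2 - 3."""
    return abs(p - q)


links = [(0, 1), (1, 2), (2, 3)]
identity = {q: q for q in range(4)}
upcoming = [(0, 2), (0, 3)]  # logical qubit pairs, in program order
choice = best_swap(upcoming, identity, links, line_distance)
```

Both the (0, 1) and the (1, 2) SWAP make the next gate executable, so a cost that looked only at the next instruction could not distinguish them. The discounted term for the later gate on logical qubits 0 and 3 favors the (0, 1) SWAP, which also moves logical qubit 0 closer to qubit 3.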
Compression
During the compression stage, Quilc employs two kinds of rewriting strategies:
1. Directly apply a peephole rewriter to an instruction sequence.
2. Convert a (pure quantum) instruction sequence to a composite matrix, rewrite the matrix as native instructions, and apply a peephole rewriter to the resulting sequence.
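The first half of the second strategy (collapsing a sequence into a composite matrix) can be sketched with the standard library alone. The example below is illustrative and does not reproduce Quilc's routines: it multiplies the 2x2 matrices of a six-gate sequence of RZ and RX rotations and checks, up to global phase, whether the result is the identity. The resynthesis step that would turn a general composite matrix back into native instructions is not shown, and the helper names are hypothetical.

```python
import cmath
import math

IDENTITY = [[1, 0], [0, 1]]


def matmul(a, b):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def rz(theta):
    return [[cmath.exp(-1j * theta / 2), 0], [0, cmath.exp(1j * theta / 2)]]


def rx(theta):
    c, s = math.cos(theta / 2), math.sin(theta / 2)
    return [[c, -1j * s], [-1j * s, c]]


def composite(gates):
    """Matrix of a gate sequence listed in execution order."""
    total = IDENTITY
    for gate in gates:
        total = matmul(gate, total)  # later gates multiply on the left
    return total


def equal_up_to_phase(a, b, tol=1e-9):
    """True when a = phase * b for some complex number of modulus one."""
    cells = [(i, j) for i in range(2) for j in range(2)]
    i, j = max(cells, key=lambda ij: abs(b[ij[0]][ij[1]]))
    phase = a[i][j] / b[i][j]
    if abs(abs(phase) - 1) > tol:
        return False
    return all(abs(a[r][c] - phase * b[r][c]) <= tol for r, c in cells)


# RZ(pi/2) RX(pi/2) RZ(pi/2) is a Hadamard up to phase; two copies cancel.
half = math.pi / 2
sequence = [rz(half), rx(half), rz(half)] * 2
removable = equal_up_to_phase(composite(sequence), IDENTITY)
```

A rewriter that only merges adjacent same-axis rotations would reduce this sequence from six gates to five (by fusing the two middle RZ gates), whereas the composite matrix shows that the whole sequence acts trivially.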
Both of these are best served by sequences with two properties:
Lengthiness. Long sequences provide more opportunities for the rewriter to act and give the bounded nativization routines better odds of producing a shorter sequence.
Resource-sparsity. Sequences which act only on a few resources have correspondingly fewer false negatives in the form of instructions which are nonadjacent in the sequence but which could be commuted next to one another.
With this in mind, the compressor has been designed to produce contiguous sequences of instructions with these properties. The compressor first arranges the instructions in a program into a dependency graph by resource usage. It then begins forming subgraphs, tagged by resource utilization, and performs a topological walk of the instruction graph while adhering to the following rules:
• If this instruction's resources do not meet those of any subgraph and it does not contain a forbidden resource, start a new subgraph containing this instruction, and restart consideration of the next instruction in the walk.
• Otherwise, this instruction's resources meet one or more existing subgraphs or are forbidden. Compute the sum of their resource tags with this instruction's resources.
    • If the resource sum contains a forbidden resource collection or if the resource sum is larger than the compressor's limit, then do nothing with this instruction for now. Remove each met subgraph from the overall graph. For each met subgraph:
        - If the resource sum contains a prohibited resource collection, then mark this subgraph's resource tag as forbidden. If the resource sum is larger than the compressor's limit, then mark the sum as forbidden.
        - Linearize the contents of the subgraph into a sequence, and pass that sequence to the peephole rewriter.
        - Re-walk the instructions emitted by the peephole rewriter (i.e., try to form subgraphs out of them).
        - Unmark any forbidden resources added in this step.
      Write this instruction, as well as any instructions in any met subgraphs (which may be newly formed, as in the above loop), out to the end results.
    • Otherwise, merge the subgraphs which meet this instruction, add it to the newly formed subgraph, tag the subgraph with the resource sum, and proceed to the next instruction in the walk.
This walk is similar in effect to the walk considered in Iten, Soetter, and Werner [9]. However, ours is somewhat less thorough, since it separately walks the graph and applies template rewriting, and since it does not perform backward matching. It partially makes up for these deficiencies in its output by its simplicity of implementation.
In the course of the compressor's graph-walking, a given instruction may be considered by the peephole rewriter as many times as there are subresources which contain the instruction and which are contained by the subgraph's tag. By installing a limit to the size of subgraph tags which the compressor will consider, this value becomes bounded. In practice, even a small such limit (e.g., less than four qubits) has good run time properties without appreciable decline in output quality.
Long-Form Examples
In this section, we demonstrate the above considerations via a few practical examples which highlight the influence of target architecture and the ubiquitous role of compilation subroutines in the compilation process. The first two subsections consider compilation of the Toffoli gate, a well-studied example which has been implemented on a variety of architectures and for which optimal decompositions are known [19]. Our aim here is to demonstrate how various compilation routines may be combined to realize CCNOT across several device architectures. In the third subsection, we consider state-aware compilation applied to an example from computational chemistry.
SWAP recombination with different targets
One of the basic tasks of the "addressing" stage of the compiler is to construct a mapping from logical qubits, as expressed in the source program, to physical qubits, as realized in a specific architecture. A guiding principle here is that constraints in physical qubit connectivity may be satisfied through the addition of appropriate SWAP instructions. This introduction of SWAP gates comes at a price: namely, an increase in the total number of logical operations performed. This is further complicated by the demands of nativization, since for many architectures of interest SWAPs must be translated to native operations. Bounds for the complexity of the resulting native instruction sequence have been considered in the literature. For example, it is known that SWAP requires 3 CNOT gates [26]. On the other hand, an arbitrary two-qubit unitary operator is equivalent, up to a global phase factor, to one expressed as a circuit with at most 3 CNOT operations [20].
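The textbook realization of SWAP by three alternating CNOT gates can be checked with a few lines of Python. Since SWAP and CNOT are both permutation matrices with unit entries, it suffices to compare their action on the four computational basis states. This sketch only confirms that three CNOTs are sufficient; the matching lower bound is the result cited above and is not demonstrated here. The function names are illustrative.

```python
def cnot(control, target):
    """Return a function applying CNOT to a tuple of classical bits."""
    def apply(bits):
        bits = list(bits)
        if bits[control]:
            bits[target] ^= 1
        return tuple(bits)
    return apply


def swap(bits):
    """Exchange the two bits."""
    return (bits[1], bits[0])


def run(gates, bits):
    """Apply gates in execution order."""
    for gate in gates:
        bits = gate(bits)
    return bits


basis_states = [(a, b) for a in (0, 1) for b in (0, 1)]
three_cnots = [cnot(0, 1), cnot(1, 0), cnot(0, 1)]
matches_swap = all(run(three_cnots, s) == swap(s) for s in basis_states)
```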
Thus, for many architectures of interest, the demand of nativization presents itself as an opportunity when selecting SWAP targets, since the native gate cost of a single SWAP gate is the same as that of any subcircuit consisting of the SWAP gate and adjacent gates, if these are on the swapped qubits. For example, Figure 7 shows one realization of a CCNOT gate on a fully-connected chip, which is optimal in the sense that it involves a total of 6 CNOT operations [19]. If nearest-neighbor connectivity is imposed, then there is a natural decision of where a SWAP should be inserted. Considering that nativization can convert any two-qubit unitary subcircuit into an equivalent one involving at most 3 CNOT gates, placing a SWAP on the top two qubit lines (cf. Figure 8) is preferred to placing it on the bottom two lines (cf. Figure 9), as this is cheaper by a CNOT gate.

Figure 7: CCNOT on a fully-connected chip.

Figure 8: CCNOT with nearest-neighbor connectivity, placing a SWAP on the top two qubit lines. The entire highlighted subcircuit may be translated to at most 3 CNOT gates.

The Quilc addresser makes use of this information, and more, when selecting SWAP placements. In practice, we observe that this "SWAP recombination" trick is compatible with additional optimizations. For example, when compiling CCNOT 0 1 2 to a chip supporting only nearest-neighbor connectivity, Quilc is able to produce native circuits containing only 7 CNOT gates, which is fewer than the 8 suggested by Figure 8.
Native targets for CCNOT
Figure 9: CCNOT with nearest-neighbor connectivity, placing a SWAP on the bottom two qubit lines. The entire highlighted subcircuit may be translated to at most 3 CNOT gates.

Figure 10:

    (define-compiler CCNOT-to-CNOT
        ((input ("CCNOT" () q0 q1 q2)))
      (inst "H"    ()       q2)
      (inst "CNOT" ()       q1 q2)
      (inst "RZ"   '(-pi/4) q2)
      (inst "CNOT" ()       q0 q2)
      (inst "RZ"   '(pi/4)  q2)
      (inst "CNOT" ()       q1 q2)
      (inst "RZ"   '(-pi/4) q2)
      (inst "CNOT" ()       q0 q2)
      (inst "RZ"   '(pi/4)  q1)
      (inst "RZ"   '(pi/4)  q2)
      (inst "CNOT" ()       q0 q1)
      (inst "H"    ()       q2)
      (inst "RZ"   '(pi/4)  q0)
      (inst "RZ"   '(-pi/4) q1)
      (inst "CNOT" ()       q0 q1))

One of the guiding philosophies of Quilc is that the burden of deciding whether a given compilation subroutine should be preferred to another in some specific context need not be borne by the quantum programmer. Instead, what is specified by the user is a set of hardware constraints. Presented with this information, Quilc performs the task of selecting those compilation subroutines suited to the problem at hand. To demonstrate the flexibility afforded by this approach, we consider a simple experiment in compiling CCNOT, and in particular consider the effect that the choice of native gate set and availability of particular compilation subroutines has on the resulting gate complexity. Recalling our previous example, we note that the circuit expressed in Figure 7 is embodied in Quilc as a compilation subroutine, CCNOT-to-CNOT (cf. Figure 10 and the discussion in Appendix A). Strictly speaking, specific subroutines such as this one are not needed, as Quilc supports fully generic techniques such as the recursive "Quantum Shannon Decomposition" of [18]. Nonetheless, specific compilation subroutines may be preferred to general ones when they offer a reduction in final native gate counts.
In what follows, we consider a three qubit chip with linear connectivity, so that qubit 0 is connected to 1, and qubit 1 is connected to 2. Amongst the possible two-qubit operations, we restrict attention to CZ, ISWAP, and CPHASE, and for each subset of these we consider a target architecture in which those operations are native across connected qubits. With respect to compilation subroutines, Quilc has several enabled by default, and we consider only the effect of including or excluding CCNOT-to-CNOT from this set. In all instances, the compiler is able to translate CNOT gates to the native gate of choice and is able to take advantage of "SWAP recombination" as described earlier.
In Table 1 we show the complexity of CCNOT in terms of native two-qubit gate counts. In no instance does incorporating the special information provided by CCNOT-to-CNOT increase the gate count, and in all but one instance it strictly reduces it. The best results occur for a device supporting CZ along with one of {ISWAP, CPHASE}, which results in a circuit using only 6 two-qubit gates.
State-Aware Compilation
Many compilation routines express simple circuit equivalencies, which depend only on the syntactic structure of some instruction or block of instructions. However, Quilc is capable of providing more information to those routines which might benefit from this context. When state-aware compilation is enabled, Quilc performs a partial simulation of addressed Quil instructions during the compression stage, until such a simulation is obstructed, e.g., up to an entanglement limit for performance reasons (by default, interactions of up to three qubits), or until a run-time data dependency is encountered, due to the unavailability of such data at the time of compilation. The results of this simulation are made available to additional compilation routines. Those routines which make use of this additional information are called state-aware.

Table 1: Two-qubit native gate complexity of CCNOT with linear nearest-neighbor connectivity. Gate counts are shown both with and without the use of the CCNOT-to-CNOT compilation subroutine. Note the uniform improvement in gate count introduced by the addition of the subroutine, whether or not CNOT appears in the target instruction set.

    Native Gates               Gate Count
    CZ    ISWAP    CPHASE      Without    With
                               9          7
                               12         9
                               8          8
                               7          6
                               8          6
                               10         9
                               7          6
As an example, note that in general a compilation subroutine translates a gate application or sequence of gate applications to some "equivalent" sequence. Under most circumstances, the notion of equivalence here is that the corresponding unitary transformations, represented by the instructions, should be equal, perhaps up to some discretization error. However, when the quantum state is known prior to the execution of an instruction or block of instructions, the requirement of unitary equivalence may be relaxed. Indeed, given full information about the initial state, a sufficient notion of equivalence is that the resulting states be equal. The corresponding task, of preparing a target state given an initial state, is known as state preparation. Many instruction sequences which are not unitarily equivalent may be equivalent in this sense, and this increase in flexibility allows for additional synthesis techniques [16], [18, Section 4].
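The gap between the two notions of equivalence can be seen in a very small case. The following is a minimal sketch in plain Python, not Quilc code: the helper `apply` and the matrix layout are assumptions made for illustration. It shows that CNOT and the identity are different unitaries and yet act identically on the state |00>.

```python
def apply(matrix, state):
    """Multiply a square matrix (list of rows) by a state vector."""
    return [sum(m * a for m, a in zip(row, state)) for row in matrix]

IDENTITY = [[1, 0, 0, 0],
            [0, 1, 0, 0],
            [0, 0, 1, 0],
            [0, 0, 0, 1]]

# CNOT with the first qubit as control; basis order |00>, |01>, |10>, |11>.
CNOT = [[1, 0, 0, 0],
        [0, 1, 0, 0],
        [0, 0, 0, 1],
        [0, 0, 1, 0]]

ZERO_STATE = [1, 0, 0, 0]  # |00>

# Unitary equivalence compares the operators themselves.
unitarily_equal = CNOT == IDENTITY

# State equivalence compares only the images of the known initial state.
same_on_zero_state = apply(CNOT, ZERO_STATE) == apply(IDENTITY, ZERO_STATE)

print(unitarily_equal)     # False
print(same_on_zero_state)  # True
```

When the initial state is known to be |00>, a compiler that only needs state equivalence is therefore free to drop the CNOT, which it could never do under unitary equivalence.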
Quilc incorporates a number of methods for state preparation, ranging from special purpose compilation subroutines (for example, in the case of one-qubit, two-qubit, or four-qubit systems), to generic subroutines which may recurse to one of these special cases. State preparation is fully compatible with the optimizations available through other compilation subroutines, although it is only applicable in circumstances in which the initial state may be effectively computed.
We demonstrate this by way of an example. In Figure 11, we have a short program which expresses the unitary coupled cluster ansatz for deuterium, truncated to single and double excitation levels [11, Section VII.1]. This program may be used as part of a hybrid classical/quantum variational algorithm such as described in [14]. In such hybrid algorithms, it is typical that a single parametric program "template" is executed for a variety of numerical parameter values. 5 On near term devices, numerical accuracy of variational methods reflects a trade-off between the sophistication of the ansatz and the resulting depth or complexity of the circuit. When compiled to native hardware without state preparation routines, the resulting program involves 6 two-qubit operations as in Figure 12. When state preparation routines are allowed, knowledge that the program begins in the zero state $|0000\rangle$ is exploited to reduce this number to 3, as in Figure 13.

Figure 11: UCCSD ansatz for $\mathrm{H}_2$ in the STO-3G basis.
In this example, one can see that the practical effect of state-aware compilation is to reduce the initial portion of the circuit, up to the point at which the state is no longer feasible to track. Here we remark that the presence of run-time parameters obstructs the partial-state simulation, and hence the reduction in gate count is primarily due to optimizations in the first half of the circuit. Additional details of state-aware compilation are discussed in Appendix A.

Figure 13: UCCSD ansatz for $\mathrm{H}_2$ in the STO-3G basis, compiled to native gates with state-aware optimizations enabled, requiring only 3 CZ instructions.
Performance
The word "performance" has two meanings when applied to a compiler: its effectiveness at compiling a quantum program, and the resources it consumes in performing that task.
To measure compilation effectiveness, we use the benchmarks from [29], a suite of QASM files, to test Quilc holistically in two ways: with state-aware compilation disabled (i.e., the input and output of the compiler are equivalent unitaries), and with state-aware compilation enabled (i.e., the input and output of the compiler are only guaranteed to act identically on the ground state). Table 2 contains the benchmarks.
To date, Quilc has not been optimized for time or space performance (i.e., how long it takes to compile and how much memory is required), and with an appropriate combination of engineering tricks, those metrics could be improved. Even in the absence of these optimizations, the run-time statistics presented in Figure 14a and Figure 14b demonstrate the scaling laws associated with our approximate solvers deployed during the addressing and compressing stages. Wall clock performance is also included in Table 2.

Table 2: QASM benchmarks from [29] targeting the IBM qx5 architecture, performed in the same environment as Figure 14b. Compare with [29, Table I]. Columns labeled with an asterisk '*' hold data for circuits which were produced with the additional hypothesis that the quantum device begins in the ground state. The percentage in parentheses indicates the rounded percentage reduction in 2Q depth by taking this assumption into account.

Figure 2 for an example of a QAOA-type program.
Contributions & Acknowledgements

A Implementation Details
A.1 Implementation Language
Quilc is written in ANSI Common Lisp, and can be extended with code written in languages with C ABI compatibility. Common Lisp was chosen because it provides a highly-performant substrate for both dynamic, interactive work as well as batch-mode computation, and offers convenient abstractions for implementing embedded domain-specific languages (DSLs), which are useful for many of the tasks of program analysis and manipulation. Quilc is also compatible with Rigetti's open-source quantum computer simulator, the Quantum Virtual Machine [2]. One reason Common Lisp is particularly suitable for implementing DSLs is its syntax. Common Lisp has very regular syntax, most of which follows just a single syntactic construction:

    (operator operand_1 ... operand_n)

That is, operators precede their operands and are surrounded by a pair of parentheses; see Table 3 for a sample of syntax. Parentheses don't indicate precedence, but instead serve double duty: as prefix notation for the language and as syntax for (sometimes nested) lists. Since the syntax can be viewed as both notation and a data structure, Lisp code is ripe for both automatic generation and manipulation. For this reason, we call Lisp a homoiconic language. Interested readers can find lengthy discussions of these ideas at different levels in the literature [1, 13]. In the subsections to follow, excerpts from Quilc's source code will make use of this syntax.
    Algebraic Syntax               Lisp's Syntax
    2(1 − n)                       (* 2 (- 1 n))
    f(x, g(y), x + y)              (f x (g y) (+ x y))
    k → k^3 (i.e., λk.k^3)         (lambda (k) (expt k 3))
    let z = √5 in ζ(z/2)           (let ((z (sqrt 5))) (zeta (/ z 2)))

Table 3: A comparison between usual algebraic syntax as found in mathematics and many programming languages and the syntax of Common Lisp.

The Quilc source code is separated into an application domain and a library domain. The application can be used to consume textual input as a UNIX command-line tool, or it can be used to provide a persistent server front-end. The library domain includes all of the routines for interpreting and manipulating Quil code, as well as some features that benefit from that availability but which do not directly participate in the compilation pipeline (e.g., generation and manipulation of Clifford group elements).
A.2 Compiler Subroutine DSL
Quilc implements a domain-specific language (DSL) for writing compilation subroutines. This makes it easy to specify algebraic relationships which Quilc can use as a part of its automatic process of program decomposition and optimization. The basic method of definition is define-compiler, whose basic syntax is given by the following defun mimic:

This defines a compiler which consumes an instruction argument for each binding, evaluates the body forms in order, and returns the collection of instructions that they aggregate. Each binding is specified by a variable in which the input gate is stored, as well as an optional destructuring pattern to capture its operator name, parameter list, and argument list. This can be further manipulated by specifying options, which might install a further matching predicate on the destructured information. Altogether, these take the following form:

The forms describing the body of the compiler are largely identical to forms elsewhere in Lisp, but there are a few special-use forms available as well. The inst and inst* operators send an instruction to the output queue, which gets emptied for use as the return value at the conclusion of the compiler body. Alternatively, finish-compiler and give-up-compilation can be used to manually signal the end of the compiler body, additionally providing optional manual control over what is used for the return value.

We now demonstrate the construction of compilation routines of increasing levels of complexity. We remark that these examples are meant to be didactic; Quilc houses roughly a hundred distinct compilation routines, ranging from algebraic rewriting rules specialized on gates which are common targets for implementation on current and near term hardware (e.g. RZ, CNOT, CZ) to general purpose recursive routines applicable to arbitrary unitary operations.
Example 1. Linearity of certain gates allows the compiler to collapse sequences of applications of a single gate. The optimizer agglutinate-RZs below rests on the linearity property RZ(θ) · RZ(φ) = RZ(θ + φ).
It matches against two RZ instructions (bound to the variables x and y) acting on a particular common qubit q, and it binds their parameter values to the variables theta and phi respectively. The compiler then emits a single instruction RZ(θ + φ) that replaces the two input instructions.
    (define-compiler agglutinate-RZs
        ((x ("RZ" (theta) q))
         (y ("RZ" (phi)   q)))
      (inst "RZ" `(,(param-+ theta phi)) q))
We note that on Rigetti's current hardware, the RZ gate is the primary target for parametric compilation, and the above routine is capable of acting both on numeric and symbolic values of θ and φ in a manner in which numeric routines (e.g. depending on the underlying gate matrix) would not be.
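The effect of this rule can be imitated outside of Quilc. The following is a minimal sketch in plain Python, assuming a hypothetical instruction format of `(name, params, qubits)` tuples and the standard definition RZ(θ) = diag(e^{-iθ/2}, e^{iθ/2}). Unlike Quilc, it only merges RZ instructions that are adjacent in the list and it handles numeric angles only.

```python
import cmath

def rz(theta):
    """Diagonal of RZ(theta) = diag(exp(-i*theta/2), exp(+i*theta/2))."""
    return [cmath.exp(-0.5j * theta), cmath.exp(0.5j * theta)]

def agglutinate_rzs(program):
    """Merge list-adjacent RZ instructions acting on the same qubit.

    Instructions are hypothetical (name, params, qubits) tuples.
    """
    output = []
    for name, params, qubits in program:
        previous = output[-1] if output else None
        if name == "RZ" and previous and previous[0] == "RZ" and previous[2] == qubits:
            # RZ(theta) followed by RZ(phi) collapses to RZ(theta + phi).
            output[-1] = ("RZ", (previous[1][0] + params[0],), qubits)
        else:
            output.append((name, params, qubits))
    return output

program = [("RZ", (0.25,), (0,)), ("RZ", (0.5,), (0,)), ("RZ", (1.0,), (1,))]
merged = agglutinate_rzs(program)
print(merged)

# The rewrite is sound because diagonal matrices multiply entrywise.
product = [a * b for a, b in zip(rz(0.25), rz(0.5))]
identity_holds = all(abs(p - r) < 1e-12 for p, r in zip(product, rz(0.75)))
print(identity_holds)
```

The two RZ instructions on qubit 0 are merged while the one on qubit 1 is left alone, mirroring the requirement that both matched instructions act on a common qubit q.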
Example 2. Parametric gates for some parameter values will be equivalent to the identity operation (NOP). This is easily seen for those gates that impart a rotation upon a qubit's state about some axis: a rotation of 2π is equivalent to not having rotated at all. 6

During the addressing phase, the non-native Hadamard gate will be translated to native gates by means of euler-zyz-compiler, yielding:
At this point, the "logical" qubits 0 and 1 have been assigned to "physical" qubits 0 and 1, without the need for any SWAP operations. During the compression stage, Quilc scans the program and will first apply eliminate-full-CPHASE to the spurious gate, yielding:
Finally, it applies agglutinate-RZs to the first pair of Z-rotations to produce:
If we were to additionally tell Quilc that RZ(0) is the identity:
    (define-compiler ((instr ("RZ" (0d0) _))) nil)
then Quilc will strike the outer instructions, leaving just:
One can explore Quilc's treatment of an input program with the --verbose option at the command line.
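The eliminations used in this example rest on particular parameter values producing identity matrices. The following is a minimal numerical check in plain Python, assuming the standard definitions RZ(θ) = diag(e^{-iθ/2}, e^{iθ/2}) and CPHASE(θ) = diag(1, 1, 1, e^{iθ}), and assuming, as its name suggests, that eliminate-full-CPHASE targets a rotation by a full turn. The helper names are hypothetical and are not part of Quilc.

```python
import cmath
import math

def rz(theta):
    """Diagonal entries of RZ(theta)."""
    return [cmath.exp(-0.5j * theta), cmath.exp(0.5j * theta)]

def cphase(theta):
    """Diagonal entries of CPHASE(theta)."""
    return [1, 1, 1, cmath.exp(1j * theta)]

def diagonal_equals(diagonal, scalar, tol=1e-12):
    """True iff the diagonal matrix equals scalar times the identity."""
    return all(abs(entry - scalar) < tol for entry in diagonal)

# CPHASE by a full turn is exactly the identity.
cphase_full_is_identity = diagonal_equals(cphase(2 * math.pi), 1)

# RZ(0) is exactly the identity.
rz_zero_is_identity = diagonal_equals(rz(0.0), 1)

# RZ by a full turn is the identity only up to a global phase of -1.
rz_full_is_minus_identity = diagonal_equals(rz(2 * math.pi), -1)

print(cphase_full_is_identity, rz_zero_is_identity, rz_full_is_minus_identity)
```

The last check is a reminder that "equivalent to not having rotated at all" holds for RZ only up to a global phase, which has no observable effect on an uncontrolled gate.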
A.3 Other Features of Quilc
We've described the backbone of Quilc's operation but have not described exhaustively all of Quilc's features. We mention a few additional salient features here.
Details of state-aware compilation
Quilc includes a minimalistic Quil interpreter which is optionally used for various optimizations. If the user indicates that the state of the quantum system is initialized to $|0\rangle$, as is defined by Quil, then Quilc will attempt to partially simulate the program up to a certain entanglement limit. The partially simulated state is then supplied to compilers defined with define-compiler, where it can be optionally used. For instance, we can detect if the state is an eigenvector of an instruction and consequently eliminate it:
    (define-compiler elide-applications-on-eigenvectors
        ((instr :acting-on (psi qubit-indices)
                ;; (collinearp u v) returns true iff v = e^{iθ} u for some θ.
                :where (collinearp psi
                                   (nondestructively-apply-instr-to-wf
                                    instr psi qubit-indices))))
      nil)
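The eigenvector test can be sketched independently of Quilc. The following plain Python is an illustration only: `apply`, `collinearp`, and `can_elide` are hypothetical stand-ins written for this example, and the collinearity check uses equality in the Cauchy-Schwarz inequality together with equal norms.

```python
import math

def apply(matrix, state):
    """Multiply a square matrix (list of rows) by a state vector."""
    return [sum(m * a for m, a in zip(row, state)) for row in matrix]

def collinearp(u, v, tol=1e-12):
    """True iff v = exp(i*theta) * u for some real theta (u nonzero)."""
    inner = sum(complex(a).conjugate() * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(abs(a) ** 2 for a in u))
    norm_v = math.sqrt(sum(abs(b) ** 2 for b in v))
    # Equal norms plus equality in Cauchy-Schwarz force v = c*u with |c| = 1.
    return abs(norm_u - norm_v) < tol and abs(abs(inner) - norm_u * norm_v) < tol

def can_elide(gate, state):
    """A gate may be dropped when the known state is one of its eigenvectors."""
    return collinearp(state, apply(gate, state))

Z = [[1, 0], [0, -1]]
X = [[0, 1], [1, 0]]
ZERO = [1, 0]
PLUS = [1 / math.sqrt(2), 1 / math.sqrt(2)]

print(can_elide(Z, ZERO))  # True:  Z|0> = |0>
print(can_elide(X, PLUS))  # True:  X|+> = |+>
print(can_elide(X, ZERO))  # False: X|0> = |1>
print(can_elide(Z, PLUS))  # False: Z|+> = |->
```

Whether a gate can be elided depends on the state it meets, not on the gate alone, which is why this rule is only available to state-aware compilers.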
More generally, if $U$ represents our original partial program and $|\psi\rangle = U|0\rangle$ represents the partially simulated state, then Quilc can find an alternative $V$ such that $|\psi\rangle = V|0\rangle$ which compiles to fewer instructions than $U$ does. These features are enabled at the command line with --enable-state-prep-reductions.
Permutation matrices and contributed modules
Quilc does not have native support for compiling certain special kinds of unitary matrices, such as permutation matrices or diagonal matrices. This, in turn, makes Quilc particularly poorly suited for compilation of classical reversible logic. However, due to its flexible mechanism for extension and automatic selection of compilers, and due to easy integration of libraries written in C or C++, Quilc makes use of a library called Tweedledum [8] . Tweedledum has specialized routines for synthesizing such circuits and can be employed automatically by Quilc.
Approximate compilation
Quilc is able to profitably make use of gate fidelities to produce programs which have better overall fidelity. The fundamental observation is as follows. If a program $U = U_m \cdots U_1$ is expressed as $m$ native gates, then the program as executed on a quantum computer will be some other $U' = U'_m \cdots U'_1$, where the difference or uncertainty between $U_i$ and $U'_i$ is reflected in the native gates' fidelity. We might deliberately compile $U$ as some non-equivalent sequence of native gates $V = V_n \cdots V_2 V_1$ such that, when run on a quantum computer with initial state $|\psi_0\rangle$ and realized as $V' = V'_n \cdots V'_2 V'_1$, the average fidelity is increased relative to that of $U'$.
The existence of such alternative circuits is discussed in [6, 15] .
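The trade-off can be illustrated with a toy calculation. The following plain Python sketch rests on a strong simplifying assumption that is not taken from Quilc: the overall fidelity is modeled as the ideal overlap of the compiled circuit with the target, multiplied by the fidelities of the native gates used. The function name and all numbers are hypothetical.

```python
import math

def modeled_fidelity(approximation_fidelity, gate_fidelities):
    """Toy model (an assumption): ideal overlap times product of gate fidelities."""
    return approximation_fidelity * math.prod(gate_fidelities)

# Hypothetical numbers, chosen only for illustration.
# Exact compilation of U: three native gates, no approximation error.
exact = modeled_fidelity(1.0, [0.95, 0.95, 0.95])

# Approximate compilation V: two native gates, slightly off target.
approximate = modeled_fidelity(0.99, [0.95, 0.95])

print(exact)
print(approximate)
print(exact < approximate)
```

Under this model, giving up a small amount of exactness pays for itself whenever it removes a gate whose infidelity is larger than the approximation error.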
Parametric compilation
Quilc is capable of dealing with symbolic parameters in a variety of cases beyond the modest use in Section 5. As a small example, Quilc reduces the following snippet

    DECLARE a REAL
    RZ(a) 0
    RZ(0.5*a) 0
    RZ(0.2) 0

to the single instruction RZ(0.2+1.5*a) 0. As a more complicated example, the instruction CPHASE(t/3) 0 1 compiles to

Clifford manipulation
Quilc's source code includes a comprehensive library for manipulating the Clifford group, including subroutines suitable for randomized benchmarking and stabilizer simulation of Clifford gates of arbitrary dimension. Implementation details of the routines included in Quilc for randomized benchmarking can be found in [23].
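Returning to the RZ reduction described under parametric compilation, the symbolic bookkeeping can be sketched in a few lines. The following plain Python is an illustration only: it assumes every angle is an affine expression in the single declared parameter a, and the `Affine` class is hypothetical rather than Quilc's own representation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Affine:
    """The expression constant + coefficient * a, for the run-time parameter a."""
    constant: float = 0.0
    coefficient: float = 0.0

    def __add__(self, other):
        return Affine(self.constant + other.constant,
                      self.coefficient + other.coefficient)

    def __str__(self):
        return f"{self.constant}+{self.coefficient}*a"

# RZ(a) 0, RZ(0.5*a) 0, RZ(0.2) 0, all acting on qubit 0.
angles = [Affine(coefficient=1.0), Affine(coefficient=0.5), Affine(constant=0.2)]

# Consecutive RZ gates on one qubit add their angles, symbolic or not.
total = sum(angles, Affine())
merged = f"RZ({total}) 0"
print(merged)
```

The value of a never needs to be known: the merge only manipulates the expression, so the reduced program remains valid for every parameter value supplied at run time.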
B A Chronology of Quilc
Quilc originated as a "compilation framework" for Quil in the summer of 2016. Much of the early work was on the "front-end" scaffolding common to most classical compilers (e.g. control-flow graphs, resource parallelization, and nano-pass concepts [17]).
Core work on the modern architecture of this paper began in the summer of 2017 with an implementation of the recursive cosine-sine decomposition of [22] . By fall of 2017, the broad structure of the compiler, including the division into separate addressing and compression stages, had been worked out. At this point, duration-based heuristics (with look-ahead) were adopted as the default for SWAP selection, and a number of additional compilation routines (including the optimal 2Q implementation of [20] as well as the recursive quantum Shannon decomposition of [18] ) had been implemented.
In spring of 2018, both parametric compilation and state-aware compilation were introduced. Over the following summer, a number of enhancements to the addressing stage were implemented, including the introduction of additional heuristics (such as A* search for SWAP selection) as well as the adoption of partial logical-to-physical qubit mappings.
On January 30, 2019, Quilc was released as an open source project [4]. A number of contributions have been made since then. Of relevance to this paper, we mention that work on fidelity-based addressing heuristics continued into fall of 2019, and the compiler front-end was extended to support QASM [7] programs in December of that year.
