We describe staq, a full-stack quantum processing toolkit written in standard C++. staq is a quantum compiler toolkit comprising tools that range from quantum optimizers and translators to physical mappers for quantum devices with restricted connectivity. The design of staq is inspired by the UNIX philosophy of "less is more": staq achieves complex functionality by combining (piping) small tools, each of which performs a single task using state-of-the-art methods. We also provide a set of illustrative benchmarks.
Introduction
Quantum computing is a new paradigm of physics that promises significant computational advantages for a plethora of applications, ranging from optimization, materials design, and drug discovery to sensing and measurement and secure communication. The idea of harnessing the power of quantum mechanics to perform computations that are believed to be much harder or even intractable for classical computers dates back to Feynman [1]. However, due to experimental challenges, the first public-access small programmable quantum computing platforms appeared only within the last five years, and the first demonstration of a computational task that can be performed significantly faster by a quantum computer was publicly released in October 2019 [2]. The currently available quantum computing platforms are not (yet) fault tolerant, i.e. they cannot perform arbitrarily long quantum computations with arbitrarily low error, but consist mainly of noisy qubits with restricted connectivity, for which the computation length is restricted by the depth of the logical circuit to be run. Such platforms are informally termed "Noisy Intermediate-Scale Quantum" (NISQ) computers [3], and represent the first step towards the realization of large-scale fault-tolerant quantum platforms. Quantum algorithms are usually described in a high-level language (e.g. plain English or quantum "pseudo-code"), then "translated" into a quantum circuit consisting of a series of quantum gates applied sequentially, followed by a measurement from which the end result of the algorithm is inferred via classical post-processing techniques; see e.g. [6] for more details.

* matt.amy@dal.ca † vlad@softwareq.ca
The translation from a high-level quantum algorithm to a quantum circuit is informally called "quantum compiling", and consists of a series of steps, which depend on the particular quantum architecture that is being used. Those steps are usually thought of as a quantum compiling top-down "stack" (with the most abstract layers higher up in the stack), and may involve e.g. translation of parts of a high level quantum algorithm to logical Boolean circuits (such as the mapping of quantum oracles in quantum searching algorithms to Boolean functions), converting Boolean functions to quantum reversible circuits, optimizing the latter in terms of a particular cost and taking into account connectivity constraints (if any), and finally mapping the resulting quantum circuit to a specific physical architecture, or translating it to some particular kind of quantum machine assembly language.
Any optimization or improvement along the stack benefits the quantum computation speed (QPU cycles/wall time), and may even allow longer computations on NISQ devices that would otherwise be infeasible due to their prohibitively large circuit depth. Therefore, constructing "full-stack" quantum processing toolkits is of paramount importance for both the NISQ regime and far-future large-scale fault-tolerant quantum platforms.
staq represents a joint effort at softwareQ Inc. to construct such a full-stack quantum computing toolkit. Our effort is not the first and likely not the last in the fast-moving field of quantum software. What differentiates staq from other quantum software stacks is its minimalist design, inspired by the UNIX [7] world, in which one can achieve complex functionality by combining (piping) small tools, each performing one single task well. staq is a collection of such tools, ranging from circuit optimizers and translators to physical mappers for NISQ architectures with restricted connectivity. Our design stands in sharp contrast with monolithic software design, in which functionality is attained via a single tool that does "everything". Our modular design reduces software maintenance effort, increases backwards compatibility for future releases, and allows new functionality to be added relatively easily.
In addition, staq uses state-of-the-art methods in quantum compiling while targeting the whole spectrum of the quantum software stack, starting from the abstract higher algorithmic layers and ending at the physical mapping layer. Moreover, staq is highly portable, being written in standard C++ (using the C++17 standard), and fast, as shown by our benchmarks in Section 2.6. staq also offers some unique features, such as the ability to translate logical Boolean circuits specified in an industrial language such as Verilog [8] to reversible circuits. Finally, staq can generate quantum code in a variety of formats that cover most of the currently available quantum platforms [9, 10, 11, 12, 13, 14].
The remainder of this paper describes the staq toolkit, its functionality, and its use cases, and provides a set of benchmarks (Section 2), followed by conclusions and future directions (Section 3).
staq
staq is a new compiler and software toolkit for the openQASM language [9] written in C++. The primary goal is to provide a suite of transformation, optimization and compilation tools that can operate on a single, common language, and output to a number of different simulation and hardware execution platforms. On the technical side, a focus of staq is to support state-of-the-art circuit transformation algorithms, which are typically implemented on small subsets of circuits or in restricted research contexts, and apply them natively to any valid quantum program. This algorithmic focus is distinct from other quantum computing toolchains, which are typically slow to adopt bleeding-edge techniques. In this section we give an overview of the architecture, use, and algorithmic methods of staq.
The openQASM language. The official specification of openQASM can be found in [9], but we provide a brief overview here. Programs in openQASM are structured as sequences of declarations and commands. As an intermediate- to low-level language, openQASM provides a small number of basic programming features: declaration of static-size classical or quantum registers, definition of (unitary) circuits or gates, gate application, measurement and initialization of qubits, and finally classically controlled gates. The listing below gives an example of an openQASM program performing quantum teleportation:
Overview
To support the minimalist philosophy of small, single-function tools, staq was designed from the bottom up to allow the manipulation, transformation, compilation and translation of QASM files according to the following goal: no process should affect the original structure of the program more than absolutely necessary.
In particular, an un-transformed program should output to something that looks identical to the input source code, modulo changes in whitespace. Similarly, one should be able to optimize programs without disturbing the structure of the original program, or to the extent that the developer wishes to enable further optimizations, for instance by first inlining and then performing whole-program optimization.
To achieve this, staq stores and operates directly on QASM syntax trees, rather than an intermediate representation. This approach was inspired by Clang [15], which acts as an effective middle-end for the analysis and transformation of C code. In keeping with the Clang style of program analysis, staq provides [...]

Figure 1 gives an overview of the staq toolchain and typical usage. The main command-line compiler, staq, offers a flexible pass registration system, whereby passes are given via command-line arguments and are executed in the order given. In particular, it is often useful to perform basic gate simplifications both before and after other optimizations, or to inline certain gates (e.g., ccx) first, optimize, then inline fully to primitive gates. These usage patterns are supported by the pass registration system, for instance with

staq -s -r -s circuit.qasm

in the former case.
As all of the transformations are defined directly on QASM ASTs, the order of operations is generally interchangeable, with two major exceptions:
1. desugaring must occur before any other transformation, and
2. the program must be fully inlined before mapping.
For this reason, the compiler automatically applies a desugaring pass after parsing and semantic checks, and an inlining pass preceding a hardware mapping pass. Desugaring mainly involves replacing uniform gates (gates applied to registers) with a sequence of gates applied to individual qubits. For instance, if x and y are qubit registers of length 2, the desugarer will replace cx x,y; with cx x[0],y[0]; cx x[1],y[1]; The semantic checker ensures that all such uniform gates are well-formed according to the specification in [9], as well as other semantic properties such as correct argument types.
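The expansion of uniform gates can be illustrated with a short sketch. The Python below is only an illustration of the rule described above, not staq's C++ implementation; the function name `desugar` and the `reg_sizes` symbol table are hypothetical, and the sketch assumes that register arguments must all have the same length while single-qubit arguments are repeated in every expanded gate.

```python
def desugar(gate, args, reg_sizes):
    """Expand a uniform gate into per-qubit gate applications.

    gate      -- gate name, e.g. "cx"
    args      -- argument strings: a register name ("x") or a single qubit ("x[0]")
    reg_sizes -- hypothetical symbol table mapping register names to their lengths
    """
    registers = [a for a in args if a in reg_sizes]
    if not registers:
        # No whole-register argument: the gate is already in desugared form.
        return [f"{gate} {','.join(args)};"]
    lengths = {reg_sizes[r] for r in registers}
    if len(lengths) != 1:
        raise ValueError("register arguments must have equal length")
    (n,) = lengths
    # Index every register argument by i; single-qubit arguments are repeated as-is.
    return [
        f"{gate} " + ",".join(f"{a}[{i}]" if a in reg_sizes else a for a in args) + ";"
        for i in range(n)
    ]


print(desugar("cx", ["x", "y"], {"x": 2, "y": 2}))
```

With x and y of length 2, this prints the two gates `cx x[0],y[0];` and `cx x[1],y[1];`, matching the example in the text.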
The general inline pass supports overrides, whereby the user can specify which gates should not be inlined. By default, the gates defined in qelib1.inc [9] are not inlined, except before hardware mapping. The remaining synthesis, optimization, and mapping passes are described in more detail in the following sections.
Tool suite. In addition to a single compiler, the staq software package also includes a suite of lightweight command-line tools which can be chained together using Unix-style pipelines to perform a range of compilation tasks. Each tool reads a QASM file from stdin, performs a specific function, and outputs the transformed QASM source on stdout; as an exception, the compiler tools output the source translated into various other languages. This offers a more flexible and customizable compilation pipeline at the expense of extra parsing stages, as well as the option to build only the relevant tools for a particular use case. For a full description of the available tools, the reader is directed to the staq wiki.
openQASM extensions. staq supports a number of extensions to the openQASM language, both implemented and planned. In particular, staq supports the declaration and use of ancillas local to gate declarations, as well as the declaration of quantum oracles from classical Verilog logic files. These extensions are described in more detail in Section 2.2. Future planned versions will support iteration and register arguments to gates à la metaQASM [16].
Circuit synthesis
A unique feature of staq is the ability to splice classical logic directly into quantum programs, and moreover the ability to synthesize a circuit implementing the classical logic during compilation. This is done through a QASM language extension adding oracle gate declarations, with synthesis handled by the EPFL Logic Synthesis Libraries [17] . At this time, staq supports classical logic written in (the combinational subset of) the Verilog hardware description language.
To declare an oracle gate in a staq-QASM file, the keyword oracle is used in place of gate, and the classical logic defining the gate is given in the body as the name of a Verilog file.

Combinational logic can be written in Verilog as a sequence of assignments of logical expressions to either outputs or temporary wires. For a full overview of the Verilog programming language, the reader is directed to [18]. Due to the reversibility of quantum oracles, there must be exactly one oracle input for every input and output of the given Verilog file. Oracle inputs are mapped to Verilog inputs and outputs sequentially, regardless of naming.
The logic synthesis pass of the compiler visits the AST and replaces each oracle declaration with a corresponding gate declaration. When an oracle declaration is encountered, it parses the specified file to generate a majority-inverter graph (MIG), which is then synthesized over the Clifford group and T gates using the EPFL implementation of the LUT-based Hierarchical Reversible Logic Synthesis (LHRS) framework [19].
In general, a classical function may require ancillas to be implemented reversibly, and so synthesis of classical logic may require additional ancillas that have not been given directly as inputs to the gate. staq handles the introduction of ancillas with another QASM language extension, adding support for the declaration of local ancillas within gate bodies. Inside gate declarations, both clean and dirty ancilla registers (registers initialized in the state $|00\cdots 0\rangle$ or in some unknown state, respectively) can be declared similarly to regular QASM registers by using the keyphrases ancilla or dirty ancilla, respectively. All ancillas are assumed to be returned to their initial state at the end of the gate body, and it remains the programmer's responsibility to ensure that this requirement is satisfied. Lightweight verification methods such as path-sum verification [20] can be adopted in the future to ensure that all ancillas are properly cleaned at the end of a gate.
The result of synthesizing the MUX gate above is shown below. Despite the use of temporary wires in the input Verilog file, the LHRS synthesis algorithm is able to find an ancilla-free implementation, hence the resulting local ancilla register has size 0.
gate MUX sel,x,y,out {
  ancilla anc[0];
  h out; cx x,out; tdg out; cx sel,out; t out; cx x,out; tdg out; cx sel,out;
  t out; cx sel,x; tdg x; cx sel,x; t sel; t x; h out;
  cx y,out; t out; cx sel,out; t out; cx y,out; tdg out; cx sel,out; tdg out;
  cx sel,y; tdg y; cx sel,y; t sel; tdg y; h out;
}

Remark 2.1. The synthesized circuit above has T-count 14, since the 3-qubit multiplexer can be implemented with 2 Toffoli gates. This can be further reduced to 8 with the light optimization pass, -O1.
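The T-count stated in Remark 2.1 can be checked directly from the synthesized gate body. The following Python sketch is an illustration only (it is not staq's resource counter): it splits the body into statements and counts those whose gate name is t or tdg. The helper name `t_count` and the constant `MUX_BODY` are hypothetical; the body is transcribed from the listing above with the ancilla declaration omitted.

```python
# Body of the synthesized MUX gate, transcribed from the listing in the text.
MUX_BODY = """
h out; cx x,out; tdg out; cx sel,out; t out; cx x,out; tdg out; cx sel,out;
t out; cx sel,x; tdg x; cx sel,x; t sel; t x; h out;
cx y,out; t out; cx sel,out; t out; cx y,out; tdg out; cx sel,out; tdg out;
cx sel,y; tdg y; cx sel,y; t sel; tdg y; h out;
"""


def t_count(body):
    """Count T and T-dagger gates in a semicolon-separated QASM gate body."""
    count = 0
    for statement in body.split(";"):
        tokens = statement.split()
        # The first token of a gate application is the gate name.
        if tokens and tokens[0] in ("t", "tdg"):
            count += 1
    return count


print(t_count(MUX_BODY))  # 14
```

Each half of the body contributes 7 T or T-dagger gates, which is the T-count of the standard Toffoli decomposition.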
Ancilla management. As local ancilla allocation is non-standard QASM and moreover not supported by most QPUs, staq performs automatic ancilla management during the inlining phase of compilation. In particular, local ancilla declarations are hoisted, as regular qubit registers, to the top of the global scope when a gate is inlined.

Since ancillas are assumed to be returned clean, i.e. returned in their initial state, it is not always necessary to allocate a new ancilla for every gate application. staq handles the re-use of ancillas during compilation by maintaining a pool of previously allocated ancillas and available dirty qubits, re-using qubits from these pools to fulfill ancilla requirements whenever possible. Figure 2 shows an example of ancilla sharing between gate applications.
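The pooling idea can be sketched in a few lines. The Python class below is a simplified illustration under stated assumptions, not staq's implementation: the class name `AncillaPool` and its methods are hypothetical, only clean ancillas are modelled (the dirty-qubit pool is omitted), and qubits are identified by integer indices.

```python
class AncillaPool:
    """Hypothetical pool of clean ancilla qubits, identified by integer index."""

    def __init__(self):
        self.free = []        # clean ancillas currently available for re-use
        self.allocated = 0    # number of distinct ancilla qubits created so far

    def acquire(self, n):
        """Return n ancillas, re-using free ones before allocating new qubits."""
        qubits = []
        for _ in range(n):
            if self.free:
                qubits.append(self.free.pop(0))
            else:
                qubits.append(self.allocated)
                self.allocated += 1
        return qubits

    def release(self, qubits):
        """Return ancillas to the pool once the inlined gate body has finished."""
        self.free.extend(qubits)


pool = AncillaPool()
first = pool.acquire(2)    # first inlined gate needs two ancillas
pool.release(first)        # assumed clean at the end of the gate body
second = pool.acquire(3)   # second gate re-uses both and allocates one more
print(first, second, pool.allocated)
```

In this sketch two gate applications needing 2 and 3 ancillas share qubits, so only 3 ancilla qubits are allocated in total rather than 5.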
Optimization
Circuit optimization is necessary both to produce efficient circuits that use existing technology to the best of its ability and to provide accurate resource estimates that guide the development of quantum algorithms and hardware. In contrast to other quantum computing software packages, staq was designed with circuit [...]

[Figure 2: Ancilla sharing between gate applications]

Gate simplifications. The simplify optimization pass performs basic gate cancellations. In particular, it scans the program dependence graph and removes pairs of adjacent inverse gates whenever found, repeating until a fixpoint is reached. By (implicitly) using the program dependence graph rather than looking for gates adjacent in the AST, trivial commutations of gates acting on different qubits are modded out. As an example, s x; h x; t y; h x; sdg x; reduces to t y; with the simplify optimization pass.
In general, because each pair of eliminated gates may open up other simplifications, the fixpoint computation may run O(l) times, where l is the number of gates in the program. In some cases this extra cost, which makes the optimization quadratic in the length of the program, may be prohibitive. In these cases, the user can opt to run single-pass simplifications instead of repeating until a fixpoint is reached.
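A minimal sketch of this cancellation strategy is given below. It is written in Python for illustration and is not staq's implementation: the names `INVERSE`, `simplify_once` and `simplify` are hypothetical, gates are modelled as (name, qubits) pairs, and the dependence graph is handled implicitly by looking for the next gate that shares a qubit with the current one.

```python
# Hypothetical inverse table for a handful of qelib1.inc gates.
INVERSE = {"h": "h", "x": "x", "y": "y", "z": "z", "cx": "cx",
           "s": "sdg", "sdg": "s", "t": "tdg", "tdg": "t"}


def simplify_once(gates):
    """One pass: cancel each gate with its dependence-graph successor if inverse."""
    removed = set()
    for i, (name, qubits) in enumerate(gates):
        if i in removed:
            continue
        for j in range(i + 1, len(gates)):
            if j in removed:
                continue
            other_name, other_qubits = gates[j]
            if set(qubits) & set(other_qubits):
                # gates[j] is the first later gate touching any qubit of gates[i];
                # they cancel only if it acts on exactly the same qubits.
                if other_qubits == qubits and INVERSE.get(name) == other_name:
                    removed.update((i, j))
                break
    return [g for k, g in enumerate(gates) if k not in removed]


def simplify(gates):
    """Repeat single passes until no further gates are removed (fixpoint)."""
    while True:
        reduced = simplify_once(gates)
        if len(reduced) == len(gates):
            return reduced
        gates = reduced


program = [("s", ("x",)), ("h", ("x",)), ("t", ("y",)), ("h", ("x",)), ("sdg", ("x",))]
print(simplify(program))  # [('t', ('y',))]
```

On the example from the text, the first pass removes the two Hadamard gates and the second pass removes s and sdg, leaving only t y. Calling `simplify_once` alone corresponds to the single-pass option.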
Rotation folding. The main optimization of the staq compiler is an extended implementation of Fang and Chen's T-count optimization algorithm [21]. Their algorithm reframes the problem of merging T gates in Clifford+T circuits in the Pauli sum view of Gosset et al. [22], in contrast to the phase polynomial approach [23, 24]. We give a brief overview of their algorithm here.
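Recall that the T gate can be written as the following sum of Pauli gates:

$$T = \frac{1 + e^{i\pi/4}}{2} I + \frac{1 - e^{i\pi/4}}{2} Z.$$

Following [21] we write $R(P)$, where $P$ is an n-qubit Pauli operator possibly with a phase, for the Pauli sum

$$R(P) := \frac{1 + e^{i\pi/4}}{2} I + \frac{1 - e^{i\pi/4}}{2} P.$$

As Clifford gates permute Pauli operators by conjugation, it can be observed that the following commutation rules hold for any Clifford gate $C$ and Paulis $P, P'$:

$$R(P)\, C = C\, R(C^\dagger P C) \qquad (1)$$
$$[P, P'] = 0 \implies R(P) R(P') = R(P') R(P) \qquad (2)$$

Moreover, since the following equations hold:

$$R(P) R(P) = \frac{1 + i}{2} I + \frac{1 - i}{2} P \qquad (3)$$
$$R(P) R(-P) = e^{i\pi/4} I \qquad (4)$$

T gates can be merged by repeatedly applying commutations 1 and 2 and merging with any rotations satisfying 3 or 4.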
The staq implementation provides a number of extensions to Fang and Chen's algorithm, notably to handle X-, Y-, and Z-axis rotations of any angle natively, and to allow arbitrary gates outside of the Clifford+$\{R_X, R_Y, R_Z\}$ gate set. This allows the optimization to be performed directly over arbitrary QASM programs, and further adds applicability to NISQ-style circuits which use rotations of general angles in all Pauli axes. To extend Fang and Chen's algorithm in these ways we write $R(\theta, P)$ for a rotation of angle $\theta$ around the Pauli $P$, that is

$$R(\theta, P) := \frac{1 + e^{i\theta}}{2} I + \frac{1 - e^{i\theta}}{2} P,$$

and then extend the commutation rules to

$$R(\theta, P)\, C = C\, R(\theta, C^\dagger P C)$$
$$[P, P'] = 0 \implies R(\theta, P) R(\theta', P') = R(\theta', P') R(\theta, P)$$
$$U \text{ and } P \text{ act non-trivially on disjoint qubits} \implies U R(\theta, P) = R(\theta, P) U$$

where $U$ refers to an arbitrary unitary gate. The merging equations are likewise extended to

$$R(\theta, P) R(\theta', P) = R(\theta + \theta', P)$$
$$R(\theta, P) R(\theta', -P) = e^{i\theta'} R(\theta - \theta', P)$$

which are both easy to verify.
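The two merging rules can also be checked numerically. The sketch below is an illustration in plain Python, not part of staq, and rests on an assumption: it takes $R(\theta, P) = \frac{1 + e^{i\theta}}{2} I + \frac{1 - e^{i\theta}}{2} P$, which reduces to the T gate for $\theta = \pi/4$ and $P = Z$, and checks for single-qubit Paulis that $R(\theta, P) R(\theta', P) = R(\theta + \theta', P)$ and $R(\theta, P) R(\theta', -P) = e^{i\theta'} R(\theta - \theta', P)$. All function names are hypothetical.

```python
import cmath

IDENTITY = [[1, 0], [0, 1]]
PAULIS = {
    "X": [[0, 1], [1, 0]],
    "Y": [[0, -1j], [1j, 0]],
    "Z": [[1, 0], [0, -1]],
}


def rotation(theta, pauli, sign=1):
    """R(theta, sign * P) as a 2x2 matrix, under the assumed definition."""
    phase = cmath.exp(1j * theta)
    a, b = (1 + phase) / 2, sign * (1 - phase) / 2
    return [[a * IDENTITY[r][c] + b * pauli[r][c] for c in range(2)] for r in range(2)]


def matmul(m1, m2):
    return [[sum(m1[r][k] * m2[k][c] for k in range(2)) for c in range(2)] for r in range(2)]


def scale(factor, m):
    return [[factor * m[r][c] for c in range(2)] for r in range(2)]


def close(m1, m2, tol=1e-12):
    return all(abs(m1[r][c] - m2[r][c]) < tol for r in range(2) for c in range(2))


def merging_rules_hold(theta, phi):
    """Check both merging rules for X, Y and Z at the given pair of angles."""
    for pauli in PAULIS.values():
        same = matmul(rotation(theta, pauli), rotation(phi, pauli))
        if not close(same, rotation(theta + phi, pauli)):
            return False
        opposite = matmul(rotation(theta, pauli), rotation(phi, pauli, sign=-1))
        expected = scale(cmath.exp(1j * phi), rotation(theta - phi, pauli))
        if not close(opposite, expected):
            return False
    return True


print(merging_rules_hold(0.3, 1.1))
```

The identities follow from writing $R(\theta, P)$ as the projector onto the +1 eigenspace of $P$ plus $e^{i\theta}$ times the projector onto the -1 eigenspace, so the numerical check is only a sanity test of the algebra.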
Internally, staq provides a library for working with circuits in the Pauli sum representation, with classes for generic Clifford operators, Pauli rotations and uninterpreted gates, and methods for testing commutations and rotation merging. The rotation folding optimization is implemented as a Visitor on the AST which builds a representation of each basic block (i.e. gate bodies and the main program body) as a circuit in the Pauli sum representation and applies the above equations to determine which gates can be merged or cancelled.
CNOT-dihedral synthesis. Currently implemented as a hardware mapping pass, the hardware-independent gray-synth optimization is accessible by mapping to a fully connected device with the Steiner tree mapping algorithm, described later in Section 2.4. The gray-synth optimization attempts to reduce the number of cx (i.e. CNOT) gates in the program by performing the CNOT-dihedral resynthesis algorithm of Amy, Azimzadeh, and Mosca [25]. Specifically, the algorithm computes CNOT-dihedral blocks (circuit blocks containing only CNOT, X, and arbitrary phase rotations) and then resynthesizes these blocks using a Gray-code-inspired algorithm to construct an efficient tour of the necessary rotations. For more details on the algorithm, the reader is directed to [25].

The gray-synth algorithm is highly dependent on the number of $R_Z$ gates, and hence is typically most effective when used after a rotation folding pass. In contrast to the implementation in [25], CNOT-dihedral blocks are computed greedily, and so not all foldable $R_Z$ gates are found with the gray-synth algorithm alone. A planned future extension will make this pass directly accessible as a high-level optimization pass, rather than as a mapping pass.
Hardware mapping
The NISQ era of quantum computing [3] carries with it specific hardware challenges, notably that of efficiently mapping or routing a quantum program onto a hardware device with constrained two-qubit interactions and noisy gates. In particular, this involves (1) mapping qubits of the program to physical qubits of the device, and (2) rewriting the circuit so that all two-qubit gates act on coupled qubits, and further satisfy the direction of the coupling in the case of directed topologies.

The staq compiler performs hardware mapping in two stages: first selecting an initial mapping from qubits of the program to physical qubits as in [26], and then adjusting each two-qubit gate to conform to the given device topology. In this section we describe the algorithms for each stage implemented in staq. Currently, only physical CNOT gates are supported by staq.
Devices. Devices in the staq toolchain are instances of the Device class, which at minimum specifies a number n of qubits addressable on the device with addresses 0, . . . , n − 1, and a digraph where each directed edge represents an admissible CNOT gate with target at the edge's endpoint. A device may additionally specify average one- and two-qubit gate fidelities for each qubit and digraph edge, respectively, as floating-point numbers between 0 and 1. The Device class further contains a number of useful utilities for mapping circuits to devices with or without known fidelities; notably, the ability to retrieve the available couplings in order of decreasing fidelity, as well as fidelity-weighted shortest-path computation and approximation of minimal weighted Steiner trees (trees spanning a subset of nodes).

At present devices are hard-coded. Built-in devices include the 8- and 16-qubit Rigetti Agave and Aspen-4 chips, respectively, the 20-qubit IBM Tokyo device, a generic 9-qubit square lattice, and fully connected devices for any number of qubits.
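To make the device utilities concrete, the following Python sketch models a device as a set of directed couplings with fidelities and computes a fidelity-weighted shortest path. It is an illustration only and does not reproduce the interface of staq's C++ Device class: the class layout and method names are hypothetical, fidelities are assumed to lie in (0, 1], the weight of an edge is assumed to be the negative logarithm of its fidelity (so the shortest path maximizes the product of fidelities), and paths ignore edge direction, on the assumption that direction is fixed afterwards with Hadamard gates.

```python
import heapq
import math


class Device:
    """Hypothetical device model: n qubits plus directed couplings with fidelities."""

    def __init__(self, n_qubits, couplings):
        # couplings maps (control, target) -> two-qubit gate fidelity in (0, 1].
        self.n_qubits = n_qubits
        self.couplings = dict(couplings)
        self.neighbors = {q: [] for q in range(n_qubits)}
        for (control, target), fidelity in self.couplings.items():
            weight = -math.log(fidelity)
            # Paths are computed on the undirected view of the coupling graph.
            self.neighbors[control].append((target, weight))
            self.neighbors[target].append((control, weight))

    def couplings_by_fidelity(self):
        """Couplings in order of decreasing fidelity."""
        return sorted(self.couplings, key=self.couplings.get, reverse=True)

    def shortest_path(self, source, target):
        """Dijkstra's algorithm over the fidelity-derived edge weights."""
        dist = {source: 0.0}
        prev = {}
        heap = [(0.0, source)]
        while heap:
            d, u = heapq.heappop(heap)
            if u == target:
                break
            if d > dist.get(u, math.inf):
                continue  # stale heap entry
            for v, w in self.neighbors[u]:
                nd = d + w
                if nd < dist.get(v, math.inf):
                    dist[v] = nd
                    prev[v] = u
                    heapq.heappush(heap, (nd, v))
        if target not in dist:
            raise ValueError("qubits are not connected")
        path = [target]
        while path[-1] != source:
            path.append(prev[path[-1]])
        return path[::-1]


device = Device(4, {(0, 2): 0.9, (1, 2): 0.95, (1, 3): 0.99})
print(device.shortest_path(0, 3))       # [0, 2, 1, 3]
print(device.couplings_by_fidelity())   # [(1, 3), (1, 2), (0, 2)]
```

The example device uses the couplings (0, 2), (1, 2), (1, 3) that appear in the caption of Figure 5; the fidelity values are made up for illustration.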
Initial layout generation
It is known that the efficiency of hardware mapping is highly dependent on the chosen initial placement of qubits [26] . While better gate counts can often be achieved if the mapping algorithm is allowed to modify the initial placement [27, 28] , a good initial layout can reduce CNOT counts by over 50% [29] .
staq currently implements three layout generation algorithms: linear, eager, and best-fit. The linear layout generator functions as a basic layout, whereby physical qubits are assigned in-order as virtual qubits are allocated. By contrast, the eager and best-fit algorithms attempt to generate an initial layout which has a high degree of overlap between the CNOT gates present in the program, and the couplings present in the device.
The eager layout generator assigns highest-fidelity couplings on a first-come, first-served basis. In particular, when a CNOT gate is encountered in the circuit, the highest-fidelity coupling which is compatible with the control and target (that is, one that does not invalidate previous assignments of the control or target to physical qubits) is chosen. This strategy typically results in lower CNOT counts compared to the linear strategy when combined with basic swapping for CNOT mapping.

To generate a better initial layout for CNOT mapping algorithms which are not based on local qubit swapping, an additional best-fit layout generation algorithm is implemented in staq. The best-fit algorithm, in contrast to the linear and eager strategies, first scans the entire program before assigning physical qubits to virtual ones. In particular, it builds a histogram of couplings between virtual qubits, assigning the highest-fidelity couplings to virtual qubits with the most CNOT gates between them. Experimentally, we found that such an initial layout works best when qubits are not permuted intermittently by the CNOT mapping algorithm.
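The best-fit strategy can be sketched as follows. This Python code is a simplified reading of the description above, not staq's implementation: the function name `best_fit_layout` is hypothetical, couplings are assumed to be given already sorted by decreasing fidelity and are tried in both orientations, virtual and physical qubits are assumed to be numbered 0 to n − 1, and virtual qubits that cannot be placed on a coupling receive the remaining physical qubits in order.

```python
from collections import Counter


def best_fit_layout(cnots, couplings, n_qubits):
    """Greedy best-fit layout sketch.

    cnots     -- list of (control, target) pairs over virtual qubits
    couplings -- physical (p, q) pairs, sorted by decreasing fidelity
    n_qubits  -- number of virtual (and physical) qubits
    Returns a dict mapping each virtual qubit to a physical qubit.
    """
    histogram = Counter(cnots)   # how often each virtual pair interacts
    layout = {}                  # virtual -> physical
    used = set()                 # physical qubits already assigned

    def compatible(virtual, physical):
        if virtual in layout:
            return layout[virtual] == physical
        return physical not in used

    # Most frequent virtual pairs get first pick of the best couplings.
    for (ctrl, tgt), _count in histogram.most_common():
        for p, q in couplings:
            placed = False
            for p_ctrl, p_tgt in ((p, q), (q, p)):
                if compatible(ctrl, p_ctrl) and compatible(tgt, p_tgt):
                    layout[ctrl], layout[tgt] = p_ctrl, p_tgt
                    used.update((p_ctrl, p_tgt))
                    placed = True
                    break
            if placed:
                break

    # Any virtual qubit left unplaced takes the next free physical qubit.
    free = [p for p in range(n_qubits) if p not in used]
    for v in range(n_qubits):
        if v not in layout:
            layout[v] = free.pop(0)
    return layout


cnots = [(0, 1), (1, 2), (1, 2), (1, 2)]
print(best_fit_layout(cnots, [(0, 1), (1, 2)], 3))
```

In this example the pair (1, 2), which has the most CNOT gates, is placed on the highest-fidelity coupling (0, 1); virtual qubit 0 cannot then be placed next to virtual qubit 1 and falls back to the remaining physical qubit 2.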
To illustrate the different initial placement approaches, fig. 4 gives an example of each layout generation algorithm applied to a circuit for a simple square lattice shown in fig. 4b .
CNOT mapping
The problem of mapping two-qubit gates to a topologically constrained architecture has received a great deal of attention recently [29, 28, 30, 26, 27]. Most common techniques (e.g., the IBM-QX contest-winning approach of [27]) rely on inserting swap gates, or more generally permutations, so that a given two-qubit gate or set of gates satisfies the device topology. Figure 5 shows an example of this technique.
staq implements a version of permutation-based mapping (swap) where for a CNOT gate between uncoupled qubits, the endpoints are swapped along the shortest (weighted) path in the coupling graph until they are adjacent. In the case of directed edges, Hadamard gates are inserted to flip the control and target of a CNOT. Rather than swap the qubits back to their original position as in fig. 5 , the resulting permutation is propagated through the rest of the circuit.
Steiner tree mapping. An alternative to permutation-based mappings which has recently been gaining popularity is topologically-constrained synthesis [29]. With this technique, a circuit or subcircuit is re-synthesized using circuit synthesis techniques that directly account for the topology of the intended architecture. For circuits consisting solely of CNOT gates, [29] and [30] simultaneously developed methods of synthesizing efficient circuits satisfying a given topology by performing constrained Gaussian elimination, whereby the rows which can be added to one another, corresponding to qubits, are restricted by the underlying architecture. These results show that in the case of CNOT (or linear reversible) circuits, constrained Gaussian elimination typically results in lower CNOT counts than permutation-based techniques [29]. Both results further sketch extensions to the topologically-constrained synthesis of CNOT-dihedral circuits. Along with the swap mapping algorithm, staq includes a mapping algorithm (steiner) based on constrained CNOT and CNOT-dihedral synthesis in the style of [29] and [30]. We give a brief overview of the steiner mapper here.
The standard method of synthesizing an n-qubit CNOT circuit is to perform Gaussian elimination on the n×n binary matrix giving the classical function (see fig. 6) and reverse the row operations, which correspond to CNOT gates.

[Figure 4: Laying out a circuit on a 9-qubit square lattice]

[Figure 5: Mapping a CNOT gate via local swaps to a device with couplings (0, 2), (1, 2), (1, 3)]

[Figure 6: Linear reversible circuits; panel (b) shows the corresponding binary matrix]

When the hardware topology is constrained, however, it may not be possible to "zero out" all the off-diagonal entries of a column by adding the pivot row to them directly. Instead, a path in the coupling graph from the pivot to each row with a leading 1 may be used by first filling in 1's along the path by applying CNOT gates, then flushing by applying CNOT gates along the path in reverse. For example, with the "straight-line" topology, the 1 in entry (2, 0) of the matrix in fig. 6b can be eliminated by first filling in 1's along the shortest path from 0 to 2, then removing them in reverse.
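For reference, the unconstrained baseline (Gaussian elimination followed by reversing the row operations) can be sketched as below. This Python code is an illustration and not staq's implementation: the function names are hypothetical, and it assumes the convention that row i of the matrix gives the parity of inputs held by qubit i, so that a CNOT with control c and target t adds row c to row t. It does not account for any device topology.

```python
def synthesize_cnot(matrix):
    """Return a list of (control, target) CNOTs implementing an invertible
    binary matrix, via Gauss-Jordan elimination over GF(2)."""
    n = len(matrix)
    m = [row[:] for row in matrix]
    ops = []

    def add_row(control, target):
        # Row addition over GF(2) corresponds to CNOT(control, target).
        m[target] = [a ^ b for a, b in zip(m[target], m[control])]
        ops.append((control, target))

    for col in range(n):
        pivot = next((r for r in range(col, n) if m[r][col]), None)
        if pivot is None:
            raise ValueError("matrix is not invertible over GF(2)")
        if pivot != col:
            add_row(pivot, col)          # put a 1 on the diagonal
        for r in range(n):
            if r != col and m[r][col]:
                add_row(col, r)          # clear the rest of the column
    # The row operations reduce the matrix to the identity, so applying them
    # in reverse order to the identity rebuilds the matrix.
    return ops[::-1]


def apply_cnots(n, cnots):
    """Simulate a CNOT circuit on the identity matrix (one row per qubit)."""
    state = [[int(i == j) for j in range(n)] for i in range(n)]
    for control, target in cnots:
        state[target] = [a ^ b for a, b in zip(state[target], state[control])]
    return state


A = [[1, 1, 0],
     [0, 1, 0],
     [1, 0, 1]]
circuit = synthesize_cnot(A)
print(circuit)                       # [(1, 2), (1, 0), (0, 2)]
print(apply_cnots(3, circuit) == A)  # True
```

In the constrained setting described in the text, the `add_row` calls would be restricted to pairs of qubits coupled in the device, which is where the paths and Steiner trees come in.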
To zero out all the leading 1's, excluding the pivot, a tree with root at the pivot qubit and endpoints at all rows with a leading 1 can be used instead. As noted in [29, 30], computing a minimal such tree is the well-known Steiner tree problem, which is NP-hard in general but admits effective polynomial-time approximations, notably via all-pairs shortest-paths and minimal spanning tree algorithms. Zeroing all non-pivot rows for a given column proceeds similarly, by first filling 1's into every node of the tree by adding rows along every edge leading to a 0 (such nodes are called Steiner points), then zeroing all non-root nodes by traversing the tree and adding rows along edges in reverse. An example is given below with a minimal tree spanning $\{q_0, q_2, q_4\}$ with edges shown in bold.

Crossing the diagonal. The "fill-then-flush" method may fail when the computed tree contains nodes above the diagonal, as this may propagate 1's to the left of the current column, as in the following example.

In [29], the above-diagonal dependencies are handled by ordering the rows according to a Hamiltonian path in the graph (if it exists), so that the matrix can be reduced to echelon form without crossing the diagonal. As not all possible topologies admit a Hamiltonian path, and in general computing one is an NP-hard problem, they also give a recursive method which works for arbitrary graphs. In contrast, [30] does not assume the existence of a Hamiltonian path and instead uses an uncompute stage to effectively "undo" all changes to other columns.
The implementation in staq follows the method of [30] , with the exception that only changes to rows with (transitive) dependencies on above-diagonal rows are uncomputed. In practice this reduces the number of CNOT gates required, as not every iteration will cross the diagonal.
Constrained Gray-synth. The Steiner mapping algorithm in staq actually implements a more general form of re-synthesis, targeting CNOT-dihedral circuits by using the Gray-synth CNOT-optimization algorithm [25] extended with constrained Gaussian synthesis. Similar extensions were considered in [29] and [30]; by comparison, staq contains a full-scale implementation which operates on arbitrary circuits by generating synthesis events whenever a non-CNOT-dihedral gate is encountered. In the case when no $R_Z(\theta)$ gates are present, the algorithm coincides with the basic constrained Gaussian synthesis.

Again, the implementation of constrained Gray-synth differs from those sketched in [29, 30], so we give a brief overview of our method here. The Gray-synth algorithm functions by ordering a given set of parities of the inputs (corresponding to the states "being rotated on" by phase gates) so that an efficient tour can be constructed. Moreover, this ordering is elaborated as the circuit is synthesized, by recursively partitioning and synthesizing the remaining parities and updating the remaining parities as the state is modified by the synthesized CNOT gates.
staq performs constrained Gray-synthesis by delaying CNOT gates until a partition of size 1 is reached. Such a partition corresponds to the computation of a parity $x_i \to x_i \oplus f(x_1, \dots, x_n)$ where $f$ is a linear (parity) function over $\{x_j\}_{j \neq i}$. Again, a (Steiner) tree rooted at $x_i$ and spanning $S = \{x_j \mid x_j \text{ appears in } f\}$ can be used to synthesize the above parity. However, in contrast to the Gaussian elimination situation, where the root is added to each leaf, each leaf needs to be added to the root. This is done by first adding each Steiner point (a node in the tree but not in $S$) to its predecessors in breadth-first order, then adding each node to its predecessors in reverse breadth-first order. An example of this process is given below for the parity $x_0 \oplus x_1 \oplus x_2 \oplus x_4 \oplus x_8$ rooted at $x_0$ on a square lattice. Note that the matrix in this case gives the function computed by the series of row additions (i.e. CNOT gates).

Rather than uncompute the changes to the other rows, the Gray-synth algorithm takes the linear transformation into account when recursively partitioning the remaining parities. Once all parity computations have been completed, the algorithm computes the final linear transformation (see [25]) using regular constrained Gaussian synthesis.
Compilation
Along with the default QASM output, staq includes a suite of source-to-source compilers or transpilers, currently supporting output to Quil [12], ProjectQ [13], Q# [14], and Cirq [10]. Effort has been made to translate QASM code to idiomatic code in the target language as much as possible; in particular, qelib1.inc gates and gate declarations are translated to standard library gates and idiomatic gate declarations in the target language whenever possible. Figure 7 gives an example of the Q# output for a QASM program.
staq also includes an option to output just the resource counts of a program. By default the resource counter un-boxes resource counts for all declared gates except for those from the standard library, but the resource counter can be configured with a list of gates to leave boxed.
Performance
To assess the performance of staq, we compare it against the well-known software toolkit and compiler Qiskit [11]. We compare each tool's highest standard optimization setting in terms of total gate counts, and each tool's default hardware mapping setting in terms of CNOT counts. Specifically, we compare the Qiskit transpiler's level 3 optimization against staq's -O2 command line option. The default mapping setting in staq applies the steiner mapping algorithm with the best-fit initial layout. For the optimization experiments, both tools unbox the program to the following subset of qelib1.inc:
{u3, cx, h, rx, ry, rz}.
We use a common benchmark suite [23, 24] to benchmark our compiler, consisting largely of reversible arithmetic and a few quantum algorithms. All experiments were run on a 2.3 GHz Intel Core i7 processor with 8 GB of RAM running Arch Linux.
The results of optimization passes and mapping passes are reported in Tables 1 and 2.

[Figure 7: Translation between QASM and Q#]

staq outperformed Qiskit (on this gate set) for all but one of the benchmark circuits, achieving a 31.7% reduction in gate counts on average compared to Qiskit's 25.9% average reduction. Similarly, staq was also significantly faster on all benchmark circuits except the one with the highest number of qubits. It remains a focus of future work to improve the scalability of staq's optimization algorithms as the number of qubits increases.
For hardware mapping benchmarks, IBM's 20-qubit Tokyo chip was selected as the target architecture, and as such only the benchmark circuits which fit onto the chip are reported in Table 2. In contrast to optimization, the experimental results for hardware mapping were mixed. While the Steiner mapping algorithm with best-fit initial layout was consistently orders of magnitude faster than Qiskit's default mapping algorithm, Qiskit outperformed staq in terms of CNOT counts in the majority of cases. As hardware mapping is very sensitive to initial qubit placement [26], a simple hill-climb algorithm similar to that of [29] was implemented to optimize the initial layout and combined with the Steiner mapping algorithm (last two columns of Table 2). With this qubit layout optimization, staq's default mapping algorithm outperforms Qiskit in the majority of cases, with an average CNOT-count increase of 103.4% compared to Qiskit's 125.7%, while still running faster than Qiskit in almost all cases. Moreover, many of the cases where Qiskit outperformed staq appear to be pathological cases for the underlying Gray-synth algorithm [25]. It remains to be seen whether more sophisticated layout optimization algorithms which avoid local minima (for instance, simulated annealing) can improve the results further.
Conclusions and future directions
In this article we described staq along with its main use cases, and provided a set of benchmarks. staq is a modular quantum compiling toolkit, which is easy to extend, its design being inspired by Clang [15] .
The field of quantum software is evolving rapidly, driven by a variety of factors ranging from progress in quantum hardware to improved compilation techniques. While we cannot foresee what the future holds, there is a set of QASM-based features that we will most likely add to staq, such as: i) more QASM syntax extensions that do not break backwards compatibility, ii) the ability to perform iterations and loops, and iii) registers as arguments to gates instead of qubits. Such extensions will allow the user to design quantum software libraries in a relatively straightforward manner while focusing on efficiently achieving the desired functionality.
