Many quantum algorithms make use of oracles which evaluate classical functions on a superposition of inputs. In order to facilitate implementation, testing, and resource estimation of such algorithms, we present quantum circuits for evaluating functions that are often encountered in the quantum algorithm literature. This includes Gaussians, hyperbolic tangent, sine/cosine, inverse square root, arcsine, and exponentials. We use insights from classical high-performance computing in order to optimize our circuits and implement a quantum software stack module which allows to automatically generate circuits for evaluating piecewise smooth functions in the computational basis. Our circuits enable more detailed cost analyses of various quantum algorithms, allowing to identify concrete applications of future quantum computing devices. Furthermore, our resource estimates may guide future research aiming to reduce the costs or even the need for arithmetic in the computational basis altogether.
Many quantum algorithms make use of oracles which evaluate classical functions on a superposition of inputs. In order to facilitate implementation, testing, and resource estimation of such algorithms, we present quantum circuits for evaluating functions that are often encountered in the quantum algorithm literature. This includes Gaussians, hyperbolic tangent, sine/cosine, inverse square root, arcsine, and exponentials. We use insights from classical high-performance computing in order to optimize our circuits and implement a quantum software stack module which allows to automatically generate circuits for evaluating piecewise smooth functions in the computational basis. Our circuits enable more detailed cost analyses of various quantum algorithms, allowing to identify concrete applications of future quantum computing devices. Furthermore, our resource estimates may guide future research aiming to reduce the costs or even the need for arithmetic in the computational basis altogether.
I. INTRODUCTION
Quantum computers are expected to excel at certain computational tasks with an asymptotic advantage over their classical counterparts. Examples for such tasks include factoring [1] and the simulation of quantum chemical processes [2, 3] . While new quantum algorithms tackling these problems offer favorable asymptotic behavior, exact runtime estimates are often lacking due to the absence of reversible implementations for functions such as the ones considered in this paper. However, the implementation details of these functions greatly influence the constant overheads involved and, thus, also the crossover points at which the choice of quantum/classical algorithm changes.
We address this issue by presenting circuits for arithmetic which can be added to a quantum software stack such as LiQU i | [4], Quipper [5] , ScaffCC [6] , Q# [7] , and ProjectQ [8] to name a few. In particular, we discuss the implementation of general smooth functions via a piecewise polynomial approximation, followed by functions that are used in specific applications. Namely, we analyze the costs of implementing an inverse square root (1/ √ x) using a reversible fixed-point version of the method used in the computer game Quake III Arena [9] and we then combine this with our evaluation scheme for smooth functions in order to arrive at an implementation of arcsin(x).
Having reversible implementations of these functions available enables more detailed cost analyses of various quantum algorithms such as HHL [10] , where the inverse square root can be used to arrive at x → 1/x and * haenert@phys.ethz.ch † martinro@microsoft.com ‡ ksvore@microsoft.com arcsin(x) can be used to get 1/x from the computational basis state into the amplitude. Similar use cases arise in Quantum Metropolis sampling [11] , Gibbs state preparation [12] and in the widely applicable framework of Quantum Rejection Sampling [13] to transform one or more samples of a given quantum state into a quantum state with potentially different amplitudes, while maintaining relative phases. In all these examples the computation of arcsin(x) is useful for the rejection sampling step. Further applications of numerical functions can be anticipated in quantum machine learning, where sigmoid functions may need to be evaluated on a superposition of values employing tanh(x), and 1/ √ x can be used for (re-)normalization of intermediate results [14] . In quantum algorithms for chemistry, further examples for numerical functions arise for on-the-fly computation of the one-and two-body integrals [3] . There, 1/ √ x as well as the evaluation of smooth functions such as Gaussians is needed. Similarly, on-the-fly computation of finite element matrix elements often involves the evaluation of functions such as sin(x) and cos(x) [15] .
Related work. As a result of the large impact that the implementation details of such functions may have on the practicality of a given quantum algorithm, there is a vast amount of literature available which provides circuits for various low-level arithmetic functions such as addition [16] [17] [18] [19] . Furthermore, Refs. [20] [21] [22] discuss implementations of higher-level arithmetic functions such as sin(x), arcsin(x) and √ x which we also consider in the present work, although using different approaches.
In particular, our piecewise polynomial evaluation circuit enables evaluating piecewise smooth functions to high accuracy using polynomials of very low degree. As a result, we require only a small number of additions and multiplications, and few quantum registers to hold intermediate results in order to achieve reversibility. While Ref. [20] employs several evaluations of the sin(x) function in order to hone in on the actual value of its inverse, our implementation of arcsin(x) features costs that are similar to just one invocation of sin(x) for x ∈ [−0.5, 0.5]. Otherwise, if x ∈ [−1, 1], our implementation also requires an evaluation of the square root. For evaluating inverse square roots, we optimize the initial guess which was also used in [21] in order to reduce the number of required Newton iterations by 1 (which corresponds to a reduction by 20-25%). In contrast to the mentioned works, we implement all our high-level arithmetic functions at the level of Toffoli gates in the quantum programming language LIQU i | . As a result, we were able to test our circuits on various test vectors using a Toffoli circuit simulator, ranging up to several hundreds of qubits. Throughout this paper, we adapt ideas from classical high-performance computing in order to reduce the required resources in the quantum setting. While these methods allow to reduce the Toffoli and qubit counts significantly, the resulting circuits are still quite expensive, especially in terms of the number of gates that are required. We hope that this highlights the fact that more research in the implementation of quantum algorithms is necessary in order to further reduce the cost originating from arithmetic in the computational basis.
II. LEARNING FROM CLASSICAL ARITHMETIC LIBRARIES
While there is no need for computations to be reversible when using classical computers, a significant overlap of techniques from reversible computing can be found in vectorized high-performance libraries. In quantum computing, having an if-statement collapses the state vector, resulting in a loss of all potential speedup. Similarly, if-statements in vectorized code require a readout of the vector, followed by a case distinction and a read-in of the handled values, which incurs a tremendous overhead and results in a deterioration of the expected speedup or even an overall slowdown. Analogous considerations have to be taken into account when dealing with, e.g., loops. Therefore, classical high-performance libraries may offer ideas and insights applicable to quantum computing, especially for mathematical functions such as (inverse) trigonometric functions, exponentials, logarithms, etc., of which highly-optimized implementations are available in, e.g., the Cephes math library [23] or games such as Quake III Arena (their fast inverse square root [9] is reviewed in [24] ).
Although some of these implementations rely on a floating-point representation, many ideas carry over to the fixed-point domain, and remain efficient enough even when requiring reversibility. Specifically, we adapt implementations of the arcsine function from [23] and the fast inverse square root from [24] to the quantum domain by providing reversible low-level implementations. Furthermore, we describe a parallel version of the classical Horner scheme [25] , which enables the conditional evaluation of many polynomials in parallel and, therefore, efficient evaluation of piecewise polynomial approximations.
III. EVALUATION OF PIECEWISE POLYNOMIAL APPROXIMATIONS
A basic scheme to evaluate a single polynomial on a quantum computer in the computational basis is the classical Horner scheme, which evaluates
by iteratively performing a multiplication by x, followed by an addition of a i for i ∈ {d, d − 1, ..., 0}. This amounts to performing the following operations:
A reversible implementation of this scheme simply stores all intermediate results. At iteration i, the last iterate y i−1 is multiplied by x into a new register y i , followed by an addition by the (classically-known) constant a i , which may make use, e.g., the addition circuit by Takahashi [16] (if there is an extra register left), or the in-place constant adder by Häner et al. [26] , which does not require an ancilla register but is more costly in terms of gates. Due to the linear dependence of successive iterates, a pebbling strategy can be employed in order to optimize the space/time trade-offs according to some chosen metric [27] .
Oftentimes, the degree d of the minimax approximation over a domain Ω must be chosen to be very high in order to achieve a certain L ∞ (Ω)-error. In such cases, it makes sense to partition Ω, i.e., find Ω i such that
and to then perform a case distinction for each input, evaluating a different polynomial for x ∈ Ω i than for y ∈ Ω j if i = j. A straight-forward generalization of this approach to the realm of quantum computing would loop over all subdomains Ω i and, conditioned on a casedistinction or label register |l , evaluate the corresponding polynomial. Thus, the cost of this inefficient approach grows linearly with the number of subdomains.
In order to improve upon this approach, one can parallelize the polynomial evaluation if the degree d is constant over the entire domain Ω. Note that merely adding the label register |l mentioned above and performing Figure 1 . The LABEL gate initializes the label register |l , which consists of log 2 (M ) qubits, to indicate the subdomain Ω l to which x belongs. Pi computes the predicate indicating whether x ∈ Ωi into the ancilla qubit. Conditioned on this result, the label is then initialized to the value chosen to represent the i-th interval. enables the evaluation of multiple polynomials in parallel. The impact on the circuit size is minor, as will be shown in Appendix B. The depth of the circuit remains virtually unaltered, since the initialization step (1) can be performed while multiplying the previous iterate y i−1 by x, see Fig. 2. An illustration of the circuit computing the label register |l can be found in Fig. 1 . A slight drawback of this parallel evaluation is that it requires one extra ancilla register for the last iteration, since the in-place addition circuit [26] can no longer be used. Resource estimates of a few functions which were implemented using this approach can be found in Table II . The small overhead of using many intervals allows to achieve good approximations already for low-degree polynomials (and thus using few qubit registers).
Using reversible pebble games [28] , it is possible to trade the number of registers needed to store the iterates with the depth of the resulting circuit. The parameters are: the number n of bits per register, the total number m of these n-qubit registers, the number r of Horner iterations, and the depth d of the resulting circuit. The m\r 1 2 3 4 5 6 7 8 16 32 Table I . Optimal pebbling strategies for fixed number m of registers and fixed number r of sequential iterations. This table can be used for both the Horner scheme for polynomial evaluation, where r corresponds to the polynomial degree, and for Newton's method, where r denotes the number of iterations. The number for entry (m, r) denotes how many pebbling steps it takes to compute the entire sequence. The circuit width and depth can be obtained from these numbers.
trade-space we consider involves m, r, and d. In particular, we consider the question of what the optimal circuit depth is for a fixed number m of registers and a fixed number r of iterations. As in [29, 30] we use dynamic programming to construct the optimal strategies as the dependency graph is just a line which is due to the sequential nature of Horner's method (the general pebbling problem is much harder to solve, in fact finding the optimal strategy for general graphs is known to be PSPACE complete [31] ). The optimal number of pebbling steps as a function of m and r can be found in Table I .
IV. SOFTWARE STACK MODULE FOR PIECEWISE SMOOTH FUNCTIONS
In order to enable automatic compilation of an oracle which implements a piecewise smooth function, the Remez algorithm [32] can be used in a subroutine to determine a piecewise polynomial approximation, which can then be implemented using the circuit described in the previous section.
In particular, we aim to implement the oracle with a given precision, accuracy, and number of available quantum registers (or, equivalently, the polynomial degree d if no pebbling is employed) over a user-specified interval Ω = [a, a + L). Our algorithm proceeds as follows: In a first step, run the Remez algorithm which, given a function f (x) over a domain Ω ⊂ R and a polynomial degree d, finds the polynomial P (x) which approximates f (x) with minimal L ∞ (Ω)-error, and check whether the achieved error is low enough. If it is too large, reduce the size of the domain Ω 1 := [a, a + L 2 ) and check again. Repeating this procedure and carrying out binary search on the right interval border will eventually lead to the first subdomain Ω 1 = [a, b 1 ) which is the largest interval such that the corresponding degree d polynomial achieves the desired accuracy. Next, one determines the next subdomain Ω 2 = [b 1 , b 2 ) using the same procedure. This is iterated until b i ≥ b, meaning that all required subdomains and their corresponding polynomials have been determined and f (x) can be implemented using a parallel polynomial evaluation circuit. This algorithm was implemented and then run for various functions, target accuracies, and polynomial degrees in order to determine approximate resource estimates for these parameters, see Table II in the appendix.
V. INVERSE SQUARE ROOT
For quantum chemistry or machine learning applications, also non-smooth functions are required. Most notably, the inverse square root can be used in both examples, namely for the calculation of the Coulomb potential and to determine the reciprocal when employing HHL [10] for quantum machine learning.
In classical computing, inverse square roots appear in computer graphics and the term "fast inverse square root" is often used: It labels the procedure to approximate the inverse square root using bit-operations on the floating-point representation of the input, as it was done in Quake III Arena [9] (see [24] for a review). The code ultimately performs a Newton-Raphson iteration in order to improve upon a pretty accurate initial guess, which it finds using afore-mentioned bit-operations. Loosely speaking, the bit-operations consist of a bit-shift to divide the exponent by two in order to approximate the square root, followed by a subtraction of this result from a magic number, effectively negating the exponent and correcting the mantissa, which was also shifted together with the exponent. The magic number can be chosen using an auto-tuning procedure and varies depending on the objective function being used [24] . This provides an extremely good initial guess for the Newton iteration at very low cost.
In our reversible implementation, we use a similar procedure to compute the inverse square root using fixedpoint arithmetic. While we cannot make use of the floating-point representation, we can still find a low-cost initial guess which allows for a small number of Newton iterations to be sufficient (i.e., 2-4 iterations). This includes determining the position of the first one in the , 5] using m ∈ {2, 3, 4} Newton iterations and corresponding bit sizes n ∈ {25, 35, 55}. The fixed-point position is p = 12, in order to ensure that no overflow occurs for small inputs.
bit-representation of the input, followed by an initialization which involves a case distinction on the magic number to use. Our three magic constants (see Appendix C) were tuned such that the error peaks near powers of two in Fig. 3(a) vanish. The peaks appear due to the fact that the initial guess takes into account the location of the first one but completely ignores the actual magnitude of the input. For example, all inputs in [1, 2) yield the same initial guess. The error plot with tuned constants is depicted in Fig. 3(b) . One can clearly observe that an entire Newton iteration can be saved when aiming for a given L ∞ -error. Newton iterations for calculating the inverse square root. The fixed-point position is chosen to be p = 2 and total bit size n was chosen to be in {35, 50, 55}.
VI. ARCSINE
Following the implementation used in the classical math library Cephes [23] , an arcsine can be implemented as a combination of polynomial evaluation and the square root. Approximating the arcsine using only a polynomial allows for a good approximation in [−0.5, 0.5], but not near ±1 (where it diverges). The Cephes math library remedies this problem by adding a case distinction, employing a "double-angle identity" for |x| ≥ 0.5. This requires computing the square root, which can be achieved by first calculating the inverse square root, followed by
Alternatively, the new square root circuit from Ref. [22] can be used.
We have implemented our circuit for arcsine and we show the resulting error plot in Fig. 4 . The oscillations stem from the minimax polynomial which is used to approximate the arcsine on [−0.5, 0.5]. More implementation details and resource estimates can be found in Appendix D.
Note that certain applications may allow to trade off error in the arcsine with, e.g., probability of success by rescaling the input such that the arcsine needs to be computed only for values in [−0.5, 0.5]. This would allow one to remove the case-distinction and the subsequent calculation of the square root: One could evaluate the arcsine at a cost that is similar to the implementation costs of sin/cos. Estimates for the Toffoli and qubit counts for this case can also be found in the appendix, see Table II .
VII. SUMMARY AND OUTLOOK
We have presented efficient quantum circuits for the evaluation of many mathematical functions, including (inverse) square root, Gaussians, hyperbolic tangent, exponential, sine/cosine, and arcsine. Our circuits can be used to obtain accurate resource estimates for various quantum algorithms and the results may help to identify the first large-scale applications as well as bottlenecks in these algorithms where more research is necessary in order to make the resource requirements practical. When embedded in a quantum compilation framework, our general parallel polynomial evaluation circuit can be used for automatic code generation when compiling oracles that compute piecewise smooth mathematical functions in the computational basis. This tremendously facilitates the implementation of quantum algorithms which employ oracles that compute such functions on a superposition of inputs.
[1] Peter W Shor. Algorithms for quantum computation:
Discrete logarithms and factoring. https://github.com/ id-Software/Quake-III-Arena/blob/master/code/ game/q_math.c.
In fixed-point arithmetic, one represents numbers x using n bits as
, where x i ∈ {0, 1} is the i-th bit of the binary representation of x, and the point position p denotes the number of binary digits to the left of the binary point. We choose both the total number of bits n and the point position p to be constant over the course of a computation. As a consequence, over-and underflow errors are introduced, while keeping the required bit-size from growing with each operation.
Fixed-point addition. We use a fixed-point addition implementation, which keeps the bit-size constant. This amounts to allowing over-and underflow, while keeping the registers from growing with each operation.
Fixed-point multiplication. Multiplication can be performed by repeated-addition-and-shift, which can be seen from
i with x i ∈ {0, 1} denotes the binary expansion of the n-bit number x. Thus, for i ∈ {0, ..., n− 1}, 2 i−(n−p) y is added to the result register (which is initially zero) if x i = 1. This can be implemented using n controlled additions on 1, 2, ..., n bits if one allows for pre-truncation: Instead of computing the 2n-bit result and copying out the first n bits before uncomputing the multiplication again, the additions can be executed on a subset of the qubits, ignoring all bits beyond the scope of the n-bit result. Thus, each addition introduces an error of at most ε A = 1 2 n−p . Since there are (at most) n such additions, the total error is ε = n 2 n−p , a factor n larger than using the costly approach mentioned above.
Negative multipliers are dealt with by substituting the controlled addition by a controlled subtraction when conditioning on the most significant bit [33] because it has negative weight w M SB = −2 n−1 in two's-complement notation. The multiplicand is assumed to be positive throughout, which removes the need for conditional inversions of input and output (for every multiplication), thus tremendously reducing the size of circuits that require many multiplications such as, e.g., polynomial evaluation.
Fixed-point squaring. The square of a number can be calculated using the same approach as for multiplication. Yet, one can save (almost) an entire register by only copying out the bit being conditioned on prior to performing the controlled addition. Then, the bit can be reset using another CNOT gate, followed by copying out the next bit and performing the next controlled addition. The gate counts are identical to performing |x |0 |0 → |x |x |0 → |x |x x 2 → |x x 2 |0 , while allowing to save n − 1 qubits.
Appendix B: Resource estimates for polynomial evaluation
The evaluation of a degree d polynomial requires an initial multiplication a d · x, an addition of a d−1 , followed by d − 1 multiply-accumulate instructions. The total number of Toffoli gates is thus equal to the cost of d multiplyaccumulate instructions. Furthermore, d + 1 registers are required for holding intermediate and final result(s) if no in-place adder is used for the last iteration (and no non-trivial pebbling strategy is applied). Other strategies may be employed in order to reduce the number of ancilla registers, at the cost of a larger gate count, see Table I for examples.
Note that all multiplications can be carried out assuming x > 0, i.e. x can be conditionally inverted prior to the polynomial evaluation (and the pseudo-sign bit is copied out). The sign is then absorbed into the coefficients: Before adding a i into the |y i−1 x -register, it is inverted conditioned on the sign-bit of x being set if the coefficient corresponds to an odd power. This is done because it is cheaper to implement a fixed-point multiplier which can only deal with y i−1 being negative (see Sec. A).
The Toffoli gate count of multiplying two n-bit numbers (using truncated additions as described in Sec. A) is
if one uses the controlled addition circuit by Takahashi et al. [16] , which requires 3n + 3 Toffoli gates to (conditionally) add two n-bit numbers. The subsequent addition can be implemented using the addition circuit by Takahashi et al. [16] , featuring 2n − 1 Toffoli gates. Thus, the total cost of a fused multiply-accumulate instruction is
Therefore, the total Toffoli count for evaluating a degree d polynomial is
Evaluating M polynomials in parallel for piecewise polynomial approximation requires only n+ log 2 M additional qubits (since one n-qubit register is required to perform the addition in the last iteration, which is no longer just a constant) and 2M log 2 M -controlled NOT gates, which can be performed in parallel with the multiplication. This increases the circuit size by
Toffoli gates per multiply-accumulate instruction, since a k-controlled NOT can be achieved using 4(k − 2) Toffoli gates and k −2 dirty ancilla qubits [34] , which are readily available in this construction.
The label register |l can be computed using 1 comparator per subinterval
The comparator stores its output into one extra qubit, flipping it to |1 if x ≤ a i+1 . The label register is then incremented from i − 1 to i, conditioned on this output qubit still being |0 (indicating that x > a i ). Incrementing |l can be achieved using CNOT gates applied to the qubits that correspond to ones in the bit-representation of (i − 1) ⊕ i. Finally, the comparator output qubit is uncomputed again. This procedure is carried out M times for i = 0, ..., M − 1 and requires 1 additional qubit. The number of extra Toffoli gates for this label initialization is
where, as a comparator, we use the CARRY-circuit from [26] , which needs 2n Toffoli gates to compare a classical value to a quantum register, and another 2n to uncompute the output and intermediate changes to the n required dirty ancilla qubits. In total, the parallel polynomial evaluation circuit thus requires
Toffoli gates and (d + 1)n + log 2 M + 1 qubits.
Appendix C: (Inverse) Square root
The inverse square root, i.e.,
can be computed efficiently using Newton's method. The iteration looks as follows:
where a is the input and
if the initial guess is sufficiently close to the true solution.
Reversible implementation
Initial guess and first round. Finding a good initial guess x 0 ≈ 1 √ a for Newton's zero-finding routine is crucial for (fast) convergence. A crude approximation which turns out to be sufficient is the following:
where log 2 a can be determined by finding the first "1" when traversing the bit-representation of a from left to right (MSB to LSB). While the space requirement forx 0 is in O(log 2 n), such a representation would be impractical for the first Newton round. Furthermore, noting that the first iteration onx 0 = 2 k leads tõ
one can directly choose this x 0 as the initial guess. The preparation of x 0 can be achieved using (n − 1) + n + 1 ancilla qubits, which must be available due to the space requirements of the subsequent Newton steps. The one ancilla qubit is used as a flag indicating whether the first "1" from the left has already been encountered. For each iteration i ∈ {n − 1, ..., 1, 0}, one determines whether the bit a i is 1 and stores this result r i in one of the n work qubits, conditioned on the flag being unset. Then, conditioned on r i = 1, the flag is flipped, indicating that the first "1" has been found. If r i = 1, the x 0 -register is initialized to the value in (C1) as follows: Using CNOTs, the x 0 -register can be initialized to the value 1.5 shifted
2 , where p denotes the binary point position of the input, followed by subtracting the (3k − 1)-shifted input a from x 0 , which may require up to n − 1 ancilla qubits.
In order to improve the quality of the first guess for numbers close to 2 k for some k ∈ Z, one can tune the constant 1.5 in (C1), i.e., turn it into a function C(k) of the exponent k. This increases the overall cost of calculating x 0 merely by a few CNOT gates but allows to save an entire Newton iteration even when only distinguishing three cases, namely
The Newton iteration. Computing x n+1 from x n by
can be achieved as follows:
1. Compute the square of x n into a new register.
Multiply x 2
n by the shifted input to obtain ax 2 n /2.
3. Initialize another register to 1.5 and subtract ax 2 n /2.
4. Multiply the result by x n to arrive at x n+1 .
5. Uncompute the three intermediate results.
The circuit of one such Newton iteration is depicted in Fig. 5 . Therefore, for m Newton iterations, this requires m+3 n-qubit registers if no pebbling is done on the Newton iterates, i.e., if all x i are kept in memory until the last Newton iteration has been completed.
Resource estimates
Computing the initial guess for the fast inverse square root requires n controlled additions of two n-bit numbers plus 2n Toffoli gates for checking/setting the flag (and uncomputing it again). Thus, the Toffoli count for the initial guess is T init (n) = nT cadd (n) + 2n = 3n 2 + 5n . . Circuit for the n-th Newton iteration of computing the inverse square root of a, given in a quantum superposition in |a . SQR computes the square of the previous iterate xn into an empty result-register, which is then multiplied by the input a (MUL), followed by subtracting (SUB) this intermediate result from the value 1.5 (initialized using the SET1.5-gate). Finally, the next iterate, i.e., xn+1 = xn(1.5 − Each Newton iteration features squaring, a multiplication, a subtraction, a final multiplication (yielding the next iterate), and then an uncomputation of the three intermediate results. In total, one thus employs 5 multiplications and 2 additions (of which 2 multiplications and 1 addition are run in reverse), which yields the Toffoli count
The number of Toffoli gates for the entire Newton procedure (without uncomputing the iterates) for m iterations thus reads
Since each Newton iteration requires 3 ancilla registers (which are cleaned up after each round) to produce the next iterate, the total number of qubits is n(m+4), where one register holds the initial guess x 0 . Note that this is an upperbound on the required number of both qubits and Toffoli gates. Since Newton converges quadratically, there is no need to perform full additions and multiplications at each iteration. Rather, the number of bits n used for the fixed point representation should be an (increasing) function of the Newton iteration.
The square root can be calculated using
i.e., at a cost of an additional multiplication into a new register. Note that this new register would be required anyway when copying out the result and running the entire computation in reverse, in order to clear registers holding intermediate results. Thus, the total number of logical qubits remains unchanged.
the polynomial approximation on [0, 0.5] to the entire domain [0, 1] employing the double-argument identity above. First, the (pseudo) sign-bit of x is copied out and x is conditionally inverted (modulo two's-complement) to ensure x ≥ 0. Since there are plenty of registers available, this can be achieved by conditionally initializing an extra register to |1 and then using a normal adder to increment x by one, where x denotes the bit-or one'scomplement of x. Since x ∈ [0, 1], one can determine whether x < 0.5 using just one Toffoli gate (and 4 NOT gates). The result of this comparison is stored in an ancilla qubit denoted by |a . z = (1−x)/2 can be computed using an adder (run in reverse) acting on x shifted by one and a new register, after having initialized it to 0.5 using a NOT gate. Then, conditioned on |a (i.e., on a being 0), this result is copied into the polynomial input register |p in and, conditioned on |a , x is squared into |p in . After having applied our polynomial evaluation circuit (which uncomputes intermediate results) to this input, |p in can be uncomputed again, followed by computing the square root of z. Then, the result of the polynomial evaluation must be multiplied by either √ z or x, which can be achieved using 2n controlled swaps and one multiplier. The final transformation of the result consists of an initialization to π/2 followed by a subtraction, both conditioned on |a , and a copy conditioned on |a . Finally, the initial conditional inversion of x can be undone after having (conditionally) inverted the output.
Following this procedure, the Toffoli count for this arcsine implementation on n-bit numbers using m Newton iterations for calculating √ z and a degree-d polynomial to approximate arcsin(x) on [0, 0.5] can be written as T arcsin = 3T inv + (2T poly − T fma ) + 2T csquare + T mul + T cadd + (2T invsqrt + T mul ) + 5n + 2 + T add = 3T add + 2T poly + 3T mul + T cadd + 2T invsqrt + 9n + 2 = d(3n 2 + n(6p + 7) − 6(p − 1)p − 2) + m(n(15n + 30p + 23) − 30p(p − 1) − 4) + 9(n + 1)p + 9 2 n(n + 1) + 6n 2 + 28n − 9p 2 + 2 where T inv (n) denotes the Toffoli count for computing the two's-complement of an n-bit number and T csquare (n, p) = T mul (n, p) + 2n is the number of Toffoli gates required to perform a conditional squaring operation. Furthermore, 2n Toffoli gates are needed to achieve the conditional n-bit swap operation (twice), and another 3n are used for (conditional) copies. 
Appendix E: Results of the Reversible Simulation
All circuits were implemented at the gate level and tested using a reversible simulator extension to LIQU i | . The results are presented in this section.
Piecewise polynomial approximation
A summary of the required resource for implementing tanh(x), exp(−x 2 ), and sin(x) can be found in Tbl. II. For each function, one set of parameters was implemented reversibly at the level of Toffoli gates in order to verify the proposed circuits.
(Inverse) Square root
The convergence of our reversible fast inverse square root implementation with the number of Newton iterations can be found in Fig. 3(b) , where the bit sizes and point positions have been chosen such that the roundoff errors do not interfere significantly with the convergence. For all practical purposes, choosing between 3 and 5 Newton iterations should be sufficient. The effect of tuning the constants in the initial guess (see Eqn. C2) can be seen when comparing Fig. 3(a) to Fig. 3(b) : The initial guess is obtained from the location of the first non-zero in the bit-representation of the input, which results in large rounding-effects for inputs close to an integer power of two. Tuning the initial guess results in almost uniform convergence, which allows to save an entire Newton iteration for a given L ∞ -error.
The square root converges better than the inverse
