Quantum circuits for basic mathematical functions such as the square root are required to implement scientific computing algorithms on quantum computers. Quantum circuits that are based on Clifford+T gates can easily be made fault tolerant, but the T gate is very costly to implement. As a result, reducing T-count has become an important optimization goal. Further, quantum circuits with many qubits are difficult to realize, making designs that save qubits and produce no garbage outputs desirable. In this work, we present a T-count optimized quantum square root circuit with only 2 · n + 1 qubits and no garbage output. To make a fair comparison against existing work, Bennett's garbage removal scheme is used to remove garbage output from existing works. We determined that our proposed design achieves an average T-
INTRODUCTION
Among the emerging computing paradigms, quantum computing appears promising due to its applications in number theory, encryption, search, and scientific computation [4, 7, 12, 13, 19, 24, 29-31]. Quantum circuits for arithmetic operations such as addition, multiplication, square root, and fractional powers are required in the quantum circuit implementations of many quantum algorithms [4, 6, 7, 9, 10]. For example, arithmetic circuits for the square root can be used in the circuit implementation of quantum algorithms such as those for computing roots of polynomials, evaluating quadratic congruence, and the principal ideal problem [12, 13, 29]. Quantum square root circuits also reduce the resources needed in the circuit implementations of higher-level functions computing the natural logarithm [6]. An efficient quantum circuit of the natural logarithm has use in quantum algorithms such as those for Pell's equation and the principal ideal problem [13].
36:2 E. Muñoz-Coreas and H. Thapliyal
The design of quantum circuits for arithmetic operations such as addition and multiplication has received notable attention in the literature. However, the design of quantum circuits for crucial arithmetic functions such as the square root is still at an initial stage.
Reliable quantum circuits must be able to tolerate noise errors [21, 23, 32, 33]. Fault-tolerant quantum gates (such as Clifford+T gates) and quantum error correcting codes can be used to make quantum circuits resistant to noise errors [1, 2, 8, 15, 17, 22]. However, the increased tolerance to noise errors comes with the increased implementation overhead associated with the quantum T gate [1, 8, 22, 33]. Because of the increased cost to realize the T gate, T-count has become an important performance measure for fault-tolerant quantum circuit design [1, 11]. Further, existing quantum computers have few qubits and large-scale quantum computers are difficult to realize [14, 18]. As a result, the total number of qubits required by a quantum circuit is an important performance measure. Quantum circuits have overhead called ancillae and garbage output that add to the total number of qubits of a quantum circuit. Any constant inputs in the quantum circuit are called ancillae. Garbage outputs exist in the quantum circuit to preserve one-to-one mapping. Garbage outputs are neither primary inputs nor useful outputs. Minimizing the overhead from ancillae and garbage output is a means to reduce the overall qubit cost of a quantum circuit.
The design of quantum circuits for the calculation of the square root has only recently begun to be addressed in the literature. A design for the calculation of the square root based on the Newton approximation algorithm is presented in Reference [6]. While an interesting design, the implementation requires 5 · log_2(b) multiplications and 3 · log_2(b) additions (where b is the number of bits of accuracy in the solution and b ≥ 4) [6]. This arithmetic operation cost translates into significant T gate and qubit cost. The design in Reference [26] presents a quantum circuit for calculating the square root based on the non-restoring square root algorithm. This design requires only n/2 additions or subtractions, making it far more efficient than the design in Reference [6] in terms of qubits and T gates. However, the design in Reference [26] does not include the additional ancillae and T gate costs required for removing garbage outputs. Additional recent quantum circuit designs for calculating the square root presented in Reference [3] also require only n/2 controlled subtraction operations. The designs in Reference [3] are based on the non-restoring square root algorithm. Thus, the designs presented in Reference [3] also offer more efficient alternatives to the design in Reference [6] in terms of qubits and T gates. One of the designs presented in Reference [3] has been optimized for gate count, further reducing its T-gate cost. Both designs presented in Reference [3] produce significant garbage output. For both designs in Reference [3], the qubit and T gate cost associated with removing this garbage output was not considered in the circuit cost calculations. Thus, the quantum square root circuits in References [26] and [3] have significant overhead in terms of T-count and qubits. To overcome the limitations of existing designs, we present the design of a quantum square root circuit that is garbageless, requires 2 · n + 1 qubits, and is optimized for T-count.
The quantum square root circuit based on our proposed design is compared against the existing designs of quantum square root circuits and is shown to be better in terms of both T-count and qubits.
This article is organized as follows: Section 2 presents background information on the Clifford+T gates. In Section 3 we present the design of our proposed quantum square root circuit. In Section 4 our proposed design is compared to the existing work.
BACKGROUND

Fault-Tolerant Quantum Circuits
The fault-tolerant Clifford+T gate set is used in fault-tolerant quantum circuit design. Table 1 shows the gates that make up the Clifford+T gate set. The quantum square root circuit proposed in this work is composed of the quantum NOT gate, Feynman (CNOT) gate, inverted control CNOT gate, SWAP gate, and Toffoli gate. Fault-tolerant quantum circuit performance is evaluated in terms of T-count and T-depth, because the implementation cost of the T gate is significantly greater than the implementation cost of the other Clifford+T gates [1, 8, 11, 22, 33]. T-count is the total number of T gates or Hermitian transposes of the T gate in a quantum circuit. As illustrated in Figure 1, the inverted control CNOT gate and the SWAP gate both have a T-count of 0 while the Toffoli gate has a T-count of 7. T-depth is the number of T gate layers in the circuit, where a layer consists of quantum operations that can be performed simultaneously. As illustrated in Figure 1, the inverted control CNOT gate and the SWAP gate both have a T-depth of 0. The Toffoli gate has a T-depth of 3, because the most T gate layers encountered by any qubit in the Toffoli gate is 3.
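As a small illustration of the T-count metric, the following Python sketch tallies the T gates of a gate list using the per-gate T-counts quoted above (7 for the Toffoli gate, 0 for the other gates used in this work). The gate labels and the helper `t_count` are hypothetical names chosen only for this example.

```python
# Per-gate T-counts as quoted in the text: Toffoli = 7, all others = 0.
# The string labels are illustrative and not tied to any library.
T_COUNT = {"NOT": 0, "CNOT": 0, "ICNOT": 0, "SWAP": 0, "TOFFOLI": 7}


def t_count(gates):
    """Return the total T-count of a sequence of gate labels."""
    return sum(T_COUNT[g] for g in gates)


example = ["NOT", "CNOT", "TOFFOLI", "SWAP", "TOFFOLI"]
print(t_count(example))  # 14
```

Only the Toffoli gates contribute, which is why the cost analysis of the proposed circuit concentrates on the adder sub-circuits.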
DESIGN OF THE PROPOSED QUANTUM SQUARE ROOT CIRCUIT
The proposed quantum square root circuit calculates the square root by implementing the non-restoring square root algorithm. The non-restoring square root algorithm is illustrated in Figure 2 (Algorithm 1). Researchers have demonstrated the correctness of the non-restoring square root algorithm through functionally correct circuit implementations such as those in References [26], [3], and [25]. A specific example illustrating how Algorithm 1 calculates the square root of a number a is available in Appendix A.
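For readers who want to trace the iteration classically, the following Python sketch implements a textbook formulation of the non-restoring integer square root. It is not a line-for-line transcription of Algorithm 1 (Figure 2); the function name `nonrestoring_sqrt` and its variable names are hypothetical and used only for illustration.

```python
def nonrestoring_sqrt(a, n):
    """Classical non-restoring square root of an n-bit value a (n even).

    Returns (root, remainder) with a == root * root + remainder.
    """
    root = 0
    rem = 0
    for i in range(n // 2 - 1, -1, -1):
        pair = (a >> (2 * i)) & 0b11      # next two bits of a
        if rem >= 0:
            # Bring down two bits, then subtract (root, 0, 1).
            rem = (rem << 2) | pair
            rem -= (root << 2) | 0b01
        else:
            # Bring down two bits, then add (root, 1, 1): no restoring step.
            rem = (rem << 2) | pair
            rem += (root << 2) | 0b11
        # The sign of the partial remainder selects the next root bit.
        root = (root << 1) | (1 if rem >= 0 else 0)
    if rem < 0:
        # Single final correction so that the remainder is non-negative.
        rem += (root << 1) | 1
    return root, rem


print(nonrestoring_sqrt(26, 6))  # (5, 1)
```

For a = 26 with n = 6, the sketch returns a root of 5 and a remainder of 1, which agrees with the example in Appendix A.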
We now present the design of our proposed quantum square root circuit. The proposed method is garbageless and requires fewer qubits than existing designs. The proposed circuit also has a lower T-count compared to existing designs. Consider the square root of the number a. We represent a as a positive binary value in 2's complement that has an even bit length n. a is stored in quantum register |R⟩. Further, let |F⟩ be a quantum register of size n initialized to 1 and let |z⟩ be a 1-qubit ancilla set to 0. At the end of computation, quantum register locations |F_{n/2+1}⟩ through |F_2⟩ of |F⟩ will have the value Y (or √a). In addition, quantum register |R⟩ that initially stored a will have the remainder from the calculation of √a.
Part 1: Initial Subtraction
This part only occurs once. The quantum circuit for Part 1 takes quantum registers |R⟩, |F⟩, and |z⟩ as inputs. Part 1 has six steps. Figure 4 illustrates the generation of Part 1 with an example of a 6-bit square root circuit.
• Step 1: At location |R_{n−2}⟩, apply a quantum NOT gate.
• Step 2: At locations |R_{n−1}⟩ and |R_{n−2}⟩, apply a CNOT gate such that the location |R_{n−2}⟩ is unchanged while location |R_{n−1}⟩ now has the value |R_{n−2} ⊕ R_{n−1}⟩ (where
Step 1 and Step 2 implement lines 2 through 4 of Algorithm 1.
• Step 3: At locations |R_{n−1}⟩ and |F_1⟩, apply a CNOT gate such that the location |R_{n−1}⟩ is unchanged while location |F_1⟩ now has the value |R_{n−2} ⊕ R_{n−1} ⊕ F_1⟩. If |R_{n−1}⟩ = 1, then this step partially implements line 11 of Algorithm 1, because the value at location |F_1⟩ (|R_{n−2} ⊕ R_{n−1} ⊕ F_1⟩) simplifies to |1⟩. Otherwise, this step helps to implement line 18 of Algorithm 1, because the value at location |F_1⟩ (|R_{n−2} ⊕ R_{n−1} ⊕ F_1⟩) simplifies to |0⟩ when |R_{n−1}⟩ = 0.
• Step 4: At locations |R_{n−1}⟩ and |z⟩, apply an inverted control CNOT gate such that the location |R_{n−1}⟩ is unchanged while location |z⟩ now has the value |R_{n−2} ⊕ R_{n−1} ⊕ z⟩, which simplifies to
). This step prepares register |z⟩ for use in subsequent steps.
• Step 5: At locations |R_{n−1}⟩ and |F_2⟩, apply an inverted control CNOT gate such that the location |R_{n−1}⟩ is unchanged while location |F_2⟩ now has the value |R_{n−2} ⊕ R_{n−1} ⊕ F_2⟩, which simplifies to
. If |R_{n−1}⟩ = 1, then this step completes execution of line 11 of Algorithm 1 and quantum register |F⟩ will have the value: |0⟩ · · · |0⟩ |Y_{n/2−1}⟩ |1⟩ |1⟩. Conversely, if |R_{n−1}⟩ = 0, then this step completes execution of line 18 of Algorithm 1 and quantum register |F⟩ will have the value: |0⟩ · · · |0⟩ |Y_{n/2−1}⟩ |0⟩ |1⟩.
• Step 6: This step has two sub-steps.
- Step 1: At locations |R_{n−1}⟩ through |R_{n−4}⟩ of register |R⟩ and locations |F_3⟩ through |F_0⟩ of register |F⟩, apply the quantum conditional addition or subtraction (ADD/SUB) circuit such that locations |F_3⟩ through |F_0⟩ are unchanged while locations |R_{n−1}⟩ through |R_{n−4}⟩ will hold the results of computation.
- Step 2: At location |z⟩, apply the quantum ADD/SUB circuit such that the operation of the circuit is conditioned on the value at location |z⟩. Location |z⟩ is unchanged. After this step, if |R_{n−1}⟩ = 1, then the quantum register |R⟩ will equal |R⟩ + |F⟩ (line 13 of Algorithm 1). If |R_{n−1}⟩ = 0, then the quantum register |R⟩ will equal |R⟩ − |F⟩ (line 20 of Algorithm 1).
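Steps 1 and 2 of Part 1 can be checked classically on the two most significant bits of the register. The Python sketch below assumes that lines 2 through 4 of Algorithm 1 correspond to the initial subtraction of 01 from those two bits, as in the standard non-restoring algorithm; the helper name `part1_steps_1_2` is hypothetical.

```python
def part1_steps_1_2(r_hi, r_lo):
    """Classical model of Steps 1 and 2 of Part 1.

    r_hi and r_lo stand for the bits at R_{n-1} and R_{n-2}.
    """
    r_lo ^= 1        # Step 1: NOT gate on R_{n-2}
    r_hi ^= r_lo     # Step 2: CNOT with control R_{n-2}, target R_{n-1}
    return r_hi, r_lo


# Assumption: the two gates together subtract 01 from the 2-bit value
# (R_{n-1} R_{n-2}) modulo 4, that is, in 2's complement.
for value in range(4):
    hi, lo = value >> 1, value & 1
    new_hi, new_lo = part1_steps_1_2(hi, lo)
    assert 2 * new_hi + new_lo == (value - 1) % 4
```

The check passes for all four inputs, so under the stated assumption the NOT and CNOT pair realizes the first subtraction without any T gates.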
Part 2: Conditional Addition or Subtraction
This part is repeated a total of n/2 − 2 times. The quantum circuit for each iteration of Part 2 takes quantum registers |R⟩, |F⟩, and |z⟩ as inputs. Part 2 has seven steps. Figure 5 illustrates the generation of Part 2 with an example of a 6-bit square root circuit. We show the steps for iteration i where 2 ≤ i ≤ n/2 − 1.
• Step 1: At locations |z⟩ and |F_1⟩, apply an inverted control CNOT gate such that the location |z⟩ is unchanged while location |F_1⟩ now has the value |z ⊕ F_1⟩. This step restores |F_1⟩ to its initial value such that |F⟩ has the value:
• Step 2: At locations |F_2⟩ and |z⟩, apply a CNOT gate such that the location |F_2⟩ is unchanged while location |z⟩ now has the value |F_2 ⊕ z⟩, which reduces to 0. Steps 1 and 2 prepare |z⟩ and |F⟩ for iteration i of the FOR loop in Algorithm 1.
• Step 3: At locations |R_{n−1}⟩ and |F_1⟩, apply a CNOT gate such that the location |R_{n−1}⟩ is unchanged while location |F_1⟩ now has the value |z ⊕ R_{n−1} ⊕ F_1⟩. If |R_{n−1}⟩ = 1, then this step partially implements line 11 of Algorithm 1, because the value at location |F_1⟩ (|z ⊕ R_{n−1} ⊕ F_1⟩) simplifies to |1⟩. Otherwise, this step helps to implement line 18 of Algorithm 1, because the value at location |F_1⟩ (|z ⊕ R_{n−1} ⊕ F_1⟩) simplifies to |0⟩ when |R_{n−1}⟩ = 0.
• Step 4: At locations |R_{n−1}⟩ and |z⟩, apply an inverted control CNOT gate such that the location |R_{n−1}⟩ is unchanged while location |z⟩ now has the value |R_{n−1} ⊕ F_2 ⊕ z⟩ (where
• Step 5: At locations |R_{n−1}⟩ and |F_{i+1}⟩, apply an inverted control CNOT gate such that the location |R_{n−1}⟩ is unchanged while location |F_{i+1}⟩ now has the value |R_{n−1} ⊕ F_{i+1}⟩ (where
• Step 7: This step has two sub-steps.
- Step 1: At locations |R_{n−1}⟩ through |R_{n−2·i−2}⟩ of register |R⟩ and |F_{2·i+1}⟩ through |F_0⟩ of register |F⟩, apply the quantum conditional addition or subtraction (ADD/SUB) circuit such that locations |F_{2·i+1}⟩ through |F_0⟩ are unchanged while locations |R_{n−1}⟩ through |R_{n−2·i−2}⟩ will hold the results of computation.
- Step 2: At location |z⟩, apply the quantum ADD/SUB circuit such that the operation of the circuit is conditioned on the value at location |z⟩. Location |z⟩ is unchanged. After this step, if |R_{n−1}⟩ = 1, then the quantum register |R⟩ will equal |R⟩ + |F⟩ (line 13 of Algorithm 1). If |R_{n−1}⟩ = 0, then the quantum register |R⟩ will equal |R⟩ − |F⟩ (line 20 of Algorithm 1).
Part 3: Remainder Restoration
This part only occurs once. The quantum circuit for Part 3 takes quantum registers |R⟩, |F⟩, and |z⟩ as inputs. Part 3 has nine steps. Figure 6 illustrates the generation of Part 3 with an example of a 6-bit square root circuit.
• Step 1: At locations |z⟩ and |F_1⟩, apply an inverted control CNOT gate such that location |z⟩ is unchanged while location |F_1⟩ now has the value |z ⊕ F_1⟩. This step restores |F_1⟩ to its initial value such that |F⟩ has the value |0⟩ · · · |0⟩ |Y_{n/2−1}⟩ · · · |Y_1⟩ |0⟩ |1⟩. Thus, this step partially completes line 26 of Algorithm 1 when |R⟩ < 0 or partially completes line 31 otherwise.
• Step 2: At locations |F_2⟩ and |z⟩, apply a CNOT gate such that location |F_2⟩ is unchanged while location |z⟩ now has the value |F_2 ⊕ z⟩, which simplifies to the value 0. Step 1 and Step 2 prepare |z⟩ and |F⟩ for the IF statement in Algorithm 1.
• Step 3: At locations |R_{n−1}⟩ and |z⟩, apply an inverted control CNOT gate such that location |R_{n−1}⟩ is unchanged while location |z⟩ now has the value |R_{n−1} ⊕ F_2 ⊕ z⟩ (where
. This step prepares register |z⟩ for use in subsequent steps.
• Step 4: At locations |R_{n−1}⟩ and |F_{n/2+1}⟩, apply an inverted control CNOT gate such that location |R_{n−1}⟩ is unchanged while location |F_{n/2+1}⟩ now has the value |R_{n−1} ⊕ F_n
. This step prepares |z⟩ for subsequent computations.
• Step 6: This step has the following two sub-steps.
- Step 1: Apply quantum registers |F⟩ and |R⟩ to a quantum CTRL-ADD circuit such that |F⟩ is unchanged while |R⟩ will hold the result of computation.
- Step 2: At location |z⟩, apply a quantum conditional addition (CTRL-ADD) circuit such that the operation of the quantum CTRL-ADD circuit is conditioned on the value at location |z⟩. After this step, if |R_{n−1}⟩ = 1, then the quantum register |R⟩ will equal |R⟩ + |F⟩ (line 28 of Algorithm 1). If |R_{n−1}⟩ = 0, then the value in quantum register |R⟩ is unchanged. After this step, |R⟩ will contain the remainder from calculating Y (or √a).
• Step 7: At location |z⟩, apply a quantum NOT gate. The value of |z⟩ is restored to the value
• Step 9: At locations |F_2⟩ and |z⟩, apply a CNOT gate such that location |F_2⟩ is unchanged while location |z⟩ now has the value |R_{n−1} ⊕ F_2 ⊕ F_4 ⊕ z⟩, which simplifies to |0⟩. This step completes the restoration of |z⟩ to its initial value (0).
Thus, the square root of a is at locations |F_{n/2+1}⟩ through |F_2⟩ of quantum register |F⟩ and the remainder of calculating the square root of a is at quantum register |R⟩. The quantum register |z⟩ along with locations |F_{n−1}⟩ through |F_{n/2+2}⟩ and locations |F_1⟩ through |F_0⟩ of quantum register |F⟩ are restored to their initial values. Thus, the proposed design methodology generates a quantum square root circuit that correctly implements the non-restoring square root algorithm.
COST ANALYSIS

T-count Cost
The proposed design methodology reduces the T-count by incorporating T-gate-efficient implementations of quantum CTRL-ADD circuits and quantum ADD/SUB circuits. Garbageless and T-gate-optimized quantum ADD/SUB and CTRL-ADD circuits in the literature such as the designs in References [16, 20, 27] can be used in our proposed quantum square root circuit. The T-count of the proposed quantum square root circuit is given below for each part of the proposed design.
Part 1: Initial Subtraction.
• Steps 1 through 5 do not require T gates.
• Step 6 requires 42 T gates. We use a quantum ADD/SUB circuit of T-count 14 · n − 14 in this step (where n = 4).
Part 2: Conditional Addition or Subtraction.
• The T-count for the ith iteration of Step 7 is 14 · (2 · (i + 1)) − 14, which simplifies to 28 · i + 14. We use a quantum ADD/SUB circuit of T-count 14 · n − 14 in this step (where n = 2 · (i + 1)).
Part 3: Remainder Restoration.
• The T-count for Step 6 is 21 · n − 14. We use a quantum CTRL-ADD circuit of T-count 21 · n − 14 in this step.
• Steps 7 through 9 do not require T gates.
Calculation of T-count.
To calculate the total T-count, we add the total T-count for each part of the design. The total T-count for Part 1 is 42 (or 14 · n − 14 where n = 4). The total T-count for Part 2 is given as the sum of 28 · i + 14 over i = 2, ..., n/2 − 1, and the total T-count for Part 3 is given as 21 · n − 14. Combining the total T-count for each part of the proposed quantum square root circuit results in the following expression:

$$42 + \sum_{i=2}^{\frac{n}{2}-1} (28 \cdot i + 14) + 21 \cdot n - 14 \qquad (1)$$

The expression for the T-count (expression 1) can be simplified into the following expression:

$$\frac{7}{2} \cdot n^{2} + 21 \cdot n - 28 \qquad (2)$$
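The per-part T-counts stated above can be summed numerically as a sanity check. The Python sketch below adds the contributions of Part 1, Part 2, and Part 3 and compares the total with the closed form 7 · n²/2 + 21 · n − 28, which is obtained here by evaluating the sum algebraically and should be read as a derived assumption rather than a quotation. The function names are hypothetical.

```python
def t_count_by_parts(n):
    """Total T-count from the per-part costs, for even n >= 4."""
    part1 = 14 * 4 - 14                                   # one 4-bit ADD/SUB
    part2 = sum(28 * i + 14 for i in range(2, n // 2))    # i = 2 .. n/2 - 1
    part3 = 21 * n - 14                                   # one n-bit CTRL-ADD
    return part1 + part2 + part3


def t_count_closed_form(n):
    """Closed form derived from the sum above (exact for even n)."""
    return (7 * n * n) // 2 + 21 * n - 28


for n in range(4, 65, 2):
    assert t_count_by_parts(n) == t_count_closed_form(n)

print(t_count_by_parts(4), t_count_by_parts(6))  # 112 224
```

For n = 4 the loop over Part 2 is empty, so the total is just the Part 1 and Part 3 contributions (42 + 70 = 112).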
T-depth Cost
We now calculate the T-depth for our proposed design. Our proposed design is based on T-depth-efficient designs of quantum ADD/SUB circuits and quantum CTRL-ADD circuits. We determined that garbageless and T-gate-optimized quantum ADD/SUB circuits in the literature such as the design in Reference [27] have a T-depth that is constant and independent of the circuit size n. Thus, these ADD/SUB circuits have T-depth of order O(1). We determined as well that the T-depth of CTRL-ADD circuits in the literature such as the design in Reference [20] scales as a function of circuit size n. Thus, these CTRL-ADD circuits have a T-depth of order O(n). The T-depth of the proposed quantum square root circuit is given below for each part of the proposed design.
Part 1: Initial Subtraction.
• Step 6 has a constant T-depth of 10. This T-depth is seen by locations |R_{n−2}⟩ and |R_{n−3}⟩ of quantum register |R⟩. We use a quantum ADD/SUB circuit in this step. The ADD/SUB circuit has a constant T-depth of 10 that is independent of the circuit's size.
Part 2: Conditional Addition or Subtraction.
The steps in this part are repeated n/2 − 2 times. We show the T-depth for the ith iteration of Part 2 where 2 ≤ i ≤ n/2 − 1.
• Steps 1 through 6 of the ith iteration do not require T gates.
• Step 7 has a constant T-depth of 10. This T-depth is seen by locations |R_{n−2}⟩ through |R_{n−2·i−1}⟩ of quantum register |R⟩. We use a quantum ADD/SUB circuit in this step. The ADD/SUB circuit has a constant T-depth of 10 that is independent of the circuit's size.
Part 3: Remainder Restoration.
• Step 6 has a T-depth of 2 · n. This T-depth is seen by quantum register |z⟩. We use a quantum CTRL-ADD circuit of T-depth 2 · n in this step.
Calculation of T-depth.
We now illustrate the steps we use to determine the total T-depth for the proposed quantum square root circuit:
• Step 1: Calculate the T-depth for Part 1. Part 1 has a T-depth of 10. This T-depth is seen by locations |R_{n−2}⟩ and |R_{n−3}⟩ of quantum register |R⟩.
• Step 2: Calculate the T-depth for Part 2. Part 2 has a T-depth of 10 · (n/2 − 2), because Part 2 requires n/2 − 2 quantum ADD/SUB circuits. The total T-depth 10 · (n/2 − 2) simplifies to 5 · n − 20. This T-depth is seen by locations |R_{n−2}⟩ and |R_{n−3}⟩ of quantum register |R⟩.
• Step 3: Calculate the T-depth for Part 3. Part 3 has a T-depth of 2 · n. This T-depth is seen by quantum register |z⟩.
• Step 4: Determine which qubits see the most T gate layers. After comparing all the qubits in our proposed design, we find that quantum register |z⟩ and quantum register locations |R_{n−2}⟩ and |R_{n−3}⟩ of |R⟩ see the most T gate layers.
• Step 5: Determine the total number of T gate layers seen by quantum register |z⟩ in the proposed design. Quantum register |z⟩ will see a total of 2 · n T gate layers, because in Part 1 and Part 2, no T gates operate on quantum register |z⟩.
1 is the design by Sultana et al. [26] . 2 is the design by Bhaskar et al. [6] . 3 is the first design by AnanthaLakshmi et al. [3] . 4 is the second design by AnanthaLakshmi et al. [3] . Table entries are marked NA where a closed-form expression is not available for the T-depth.
• Step 6: Determine the total number of T gate layers seen by quantum register locations |R_{n−2}⟩ and |R_{n−3}⟩ of |R⟩ in the proposed design. Quantum register locations |R_{n−2}⟩ and |R_{n−3}⟩ will see a total of 10 T gate layers from Part 1, 5 · n − 20 T gate layers from Part 2, and 13 T gate layers from Part 3. The total T-depth for locations |R_{n−2}⟩ and |R_{n−3}⟩ is 5 · n + 3. Quantum CTRL-ADD circuits in the literature such as the design in Reference [20] present a constant T-depth to locations |R_{n−2}⟩ and |R_{n−3}⟩ of quantum register |R⟩ when |R⟩ is supplied as an input. We use a quantum CTRL-ADD circuit that presents a constant T-depth of 13 to locations |R_{n−2}⟩ and |R_{n−3}⟩.
• Step 7: Determine which qubits see the most T gate layers. We determined that locations |R_{n−2}⟩ and |R_{n−3}⟩ see more T gate layers than register |z⟩, because 5 · n + 3 > 2 · n. The number of T gate layers on qubits with the most T gate layers will determine the T-depth for the proposed quantum square root circuit.
Thus, our proposed design has a T-depth of 5 · n + 3, and this T-depth is seen by locations |R_{n−2}⟩ and |R_{n−3}⟩ of quantum register |R⟩.
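The T-depth bookkeeping above can be reproduced with a few lines of Python. The sketch below uses only the layer counts quoted in the text (10 per ADD/SUB circuit, 13 seen by the busiest locations of |R⟩ in Part 3, and 2 · n seen by |z⟩); the function names are hypothetical.

```python
def t_depth_on_r(n):
    """T gate layers seen by locations R_{n-2} and R_{n-3}, even n >= 4."""
    part1 = 10                      # one ADD/SUB circuit
    part2 = 10 * (n // 2 - 2)       # n/2 - 2 ADD/SUB circuits
    part3 = 13                      # CTRL-ADD as seen by these locations
    return part1 + part2 + part3


def t_depth_on_z(n):
    """T gate layers seen by the ancilla z (Part 3 only)."""
    return 2 * n


for n in range(4, 65, 2):
    assert t_depth_on_r(n) == 5 * n + 3
    # The circuit T-depth is set by the qubits that see the most layers.
    assert max(t_depth_on_r(n), t_depth_on_z(n)) == 5 * n + 3
```

Because 5 · n + 3 exceeds 2 · n for every positive n, the locations of |R⟩ always determine the overall T-depth in this accounting.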
Cost Comparison
The comparison of the proposed quantum square root circuit with the current state of the art is illustrated in Table 2. To compare our proposed square root circuit against the existing designs by Sultana et al. [26] and AnanthaLakshmi et al. [3], we implemented the designs with Clifford+T gates. We also apply Bennett's garbage removal scheme (see Reference [5]) to remove the garbage output from the designs by Sultana et al. and AnanthaLakshmi et al. The total qubit cost for each design by AnanthaLakshmi et al. is calculated by summing the garbage output produced by the controlled subtraction circuits and the circuit outputs.
To compare our proposed square root circuit against the existing design by Bhaskar et al. [6], we implemented the design with Clifford+T gates. The square root design in Bhaskar et al. requires 5 · log_2(b) multiplications and 3 · log_2(b) additions (where b is the number of bits of accuracy of the solution). We use an implementation that has the lowest possible accuracy and thus let b = 4, because the T gate and qubit costs increase as a function of solution accuracy. Thus, the square root circuit based on the design by Bhaskar et al. requires 10 multiplications and 6 additions. Bhaskar et al. did not specify a quantum adder or multiplier design for use in their square root quantum circuit design. Therefore, to have a fair comparison against our work, we use the quantum adder presented in Reference [28] and the quantum multiplier shown in Reference [20]. The quantum adder has a T-count of 14 · n − 7 and a qubit cost of 2 · n + 1 and produces no garbage output. Further, the quantum multiplier has a T-count of 21 · n² − 14 and a qubit cost of 4 · n + 1 and produces no garbage output. We assume that given two inputs on quantum registers |a⟩ and |b⟩, the quantum multiplier will produce the product of |a⟩ and |b⟩ on 2 · n + 1 ancillae. The inputs |a⟩ and |b⟩ will maintain the same value at the end of computation. Consequently, at the end of computation, the square root circuit based on the design by Bhaskar et al. will have garbage outputs. We apply Bennett's garbage removal scheme to remove the garbage output from the quantum circuit implementation of the design by Bhaskar et al.
Table 2 illustrates the tradeoff between T-depth and number of qubits: minimizing one of these resource cost measures results in an increase in the other. Table 2 shows that the T-depth of our proposed design and the T-depth of the design by Sultana et al. are both of order O(n). However, the design by Sultana et al. is only able to achieve a constant-factor T-depth improvement over the proposed work, at the expense of a qubit cost of order O(n²). Our proposed design achieves a qubit cost of order O(n). Thus, we significantly reduced the number of qubits in our proposed circuit and maintained a T-depth of the same order (O(n)) as the work by Sultana et al.
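The operation counts used for the Newton-based design can be reproduced with the short Python sketch below. It evaluates 5 · log_2(b) multiplications and 3 · log_2(b) additions, and then forms a raw arithmetic T-count from the adder and multiplier T-counts quoted above. Two assumptions are made purely for illustration and do not come from the source: every operation is taken to act on n-bit operands, and the overhead of Bennett's garbage removal scheme is excluded. The function names are hypothetical.

```python
def newton_sqrt_op_counts(b):
    """Return (multiplications, additions) for b bits of accuracy.

    Assumes b is a power of two with b >= 4, so log2(b) is an integer.
    """
    log2_b = b.bit_length() - 1
    if b < 4 or (1 << log2_b) != b:
        raise ValueError("b must be a power of two with b >= 4")
    return 5 * log2_b, 3 * log2_b


def raw_arithmetic_t_count(n, b=4):
    """Adder plus multiplier T-count, before any garbage removal.

    Assumption: all operations use n-bit operands (illustrative only).
    """
    mults, adds = newton_sqrt_op_counts(b)
    multiplier_t = 21 * n * n - 14   # multiplier T-count quoted in the text
    adder_t = 14 * n - 7             # adder T-count quoted in the text
    return mults * multiplier_t + adds * adder_t


print(newton_sqrt_op_counts(4))  # (10, 6)
```

The quadratic multiplier term dominates this raw count, which is consistent with the observation that the arithmetic operation cost of the Newton-based design translates into a significant T gate cost.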
Cost Comparison in Terms of T-count.
Cost Comparison in Terms of Qubits.
Cost Comparison in Terms of T-depth.
CONCLUSION
In this work, we present a new design of a quantum square root circuit. The proposed design has zero overhead in terms of garbage output. The proposed design also requires fewer T gates and qubits than the current state of the art. The proposed quantum square root circuit has been formally verified. The proposed quantum square root circuit could form a crucial component in the quantum hardware implementations of scientific algorithms where qubits and T-count are of primary concern.
APPENDIX A EXAMPLE OF THE NON-RESTORING SQUARE ROOT ALGORITHM
In this section, we present an example of Algorithm 1. We shall illustrate the calculation of the square root of 26. We represent 26 as a 6-bit positive binary number in 2's complement (a = 011010). The square root of 26 is 5 with a remainder of 1. At the end of computation, R will have the remainder (1)
