Abstract
Introduction
Approximate squaring circuits have numerous applications, as mentioned in [12, 13, 14, 15, 20], such as cryptography, computation of Euclidean distance among pixels in a graphics processor, and rectangular-to-polar conversion in signal processing circuits where full-precision results are not required. As indicated in [14, 25], customized squaring modules have important applications in digital signal processing. Specifically, [6] describes a method in which resolution can be increased during a graphics blend operation by incorporating a squaring operation implemented as a multiplier followed by a truncation circuit. The approach described in this paper allows for improvement in such applications.
In [7], a method for frame synchronization in a digital radio is described in which a digital squaring circuit is integral to the process. Hardware transcendental function designs [9] have also employed approximate squaring circuits. These are just a few examples where high-performance approximate squaring circuits are desirable.
Often, designers implement a squaring operation using a multiplier circuit. Approximate multiplication has been investigated using a truncated multiplier [8]. The multiplier may utilize radix-4 or radix-8 Booth recoding to reduce the size of the partial product array [1, 2, 3]. Compared to a standard multiplier, the squaring operation yields symmetry in the partial product array. This property has been exploited to optimize multiplier design at the bit level [18, 19]. A squaring circuit design employing this symmetry was proposed in [21], and numerous studies optimizing binary squaring circuits appear in [12, 13, 14, 15, 20]. These designs primarily use hardwired bit product arrangements to reduce array sizes for efficient accumulation, mostly focusing on low precision. Since squaring is a unary operation, lookup tables have also been incorporated in proposed squaring circuit designs [23, 24]. An extension to a radix-4 squaring circuit employing Booth recoding and "folding" of the partial products was introduced in [16], with further implementation optimization studies discussed in [10, 11, 17].
Booth recoded multipliers yield partial products whose formation requires the complexities of both sign extension and two's complementation. The Booth-folded recoded squarer in [16] reduces two's complementation to a straightforward one's complementation. Avoiding two's complementation particularly simplifies the generation of integer squares modulo the integer word size, as further investigated in [22]. In this paper, we investigate implementations of a new radix-4 operand dual recoding method [5] for the squaring operation. The recoding yields non-negative partial squares, avoiding the need for sign extensions, and furthermore yields a radix-16 reduced array of partial products. This recoding is particularly effective for the design of an approximate squaring circuit, where the partial squares can be generated with shifts and one's complements using a few guard bits.
The paper is organized as follows. First, the distinctions between existing radix-2 squaring methods and our radix-4 squaring method are summarized in enough detail to make the paper self-contained. Then our implementation of the proposed methods is elaborated and synthesis results are given. The results show a substantial advantage for a specialized approximate squaring circuit over a truncated multiplier, with the radix-4 squarer a significant improvement over the radix-2 squarer [4].
Squaring Methodology
As introduced in [19, 21] , the squaring operation for an operand in binary can be realized by a modified partial product array of about one half the size of the full multiplier array without the need for the traditional Booth multiplier recoding. Specifically, for binary squaring of the normalized p-bit operand 
For a state-of-the-art approximate radix-2 squarer, we employ this reduced-depth array design from [4] and truncate the lower-order half of the array, except for a couple of columns of guard bits that tightly bound the approximate square. This optimized approximate radix-2 squarer array is about one quarter the size of a comparable full multiplier array, or equivalently about one half the size of the truncated approximate multiplier.
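The symmetry argument and the truncation with guard columns can be illustrated with a minimal Python sketch. It models only the basic folded array (each off-diagonal pair x_i x_j merged into a single term one column to the left), not the further optimized array of [4], so its term counts are not comparable to those reported for that design. The function names and the integer (rather than normalized fractional) operand format are assumptions made for illustration.

```python
def folded_square_terms(x, p):
    """List the (column, bit) terms of a basic folded squaring array.

    x is treated as a p-bit unsigned integer (an assumption of this sketch).
    """
    bits = [(x >> i) & 1 for i in range(p)]
    terms = []
    for i in range(p):
        # Diagonal term: x_i * x_i = x_i, with weight 2^(2i).
        terms.append((2 * i, bits[i]))
        for j in range(i + 1, p):
            # x_i*x_j and x_j*x_i are equal, so the pair is merged into a
            # single term shifted one column left: weight 2^(i+j+1).
            terms.append((i + j + 1, bits[i] & bits[j]))
    return terms


def sum_terms(terms, min_column=0):
    """Accumulate the array, dropping every column below min_column."""
    return sum(bit << col for col, bit in terms if col >= min_column)


def approx_square(x, p, g):
    """Truncated square: keep the upper p columns plus g guard columns."""
    return sum_terms(folded_square_terms(x, p), min_column=p - g)


p, g = 16, 2
x = 0b1100010110001011          # sample 16-bit operand, read as an integer
exact = x * x
approx = approx_square(x, p, g)
print("folded terms:", len(folded_square_terms(x, p)), "full multiplier:", p * p)
print("error in ulps of the upper p-bit result:", (exact - approx) / 2 ** p)
```

The folded array has p(p+1)/2 terms against p^2 for a full multiplier array, which is the "about one half" reduction noted above. Because truncation only discards non-negative terms, the approximate square in this sketch never exceeds the exact square.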
Recently, an operand "dual recoded" radix-4 squaring method that is particularly suited to approximate squaring was introduced in [5]. We adopt the method from [5] and perform various implementation studies. Before proceeding to the performance comparison of this high-radix squarer, we summarize the methodology and foundation of the new dual recoded radix-4 squarer to make this presentation self-contained.
For the squaring operation, the single operand assumes the roles of both the multiplier and the multiplicand. The high-radix dual recoding recognizes these distinct asymmetric roles of the single operand. The dual recoding concurrently provides a "squarer" digit string in the high radix and a corresponding sequence of successively truncated "squarands" in binary form. The i-th squarer digit multiplies the i-th squarand in the i-th partial square generator, and the array of partial squares is summed to generate the square. The following summary is taken from [5].
For the left-to-right leading digit dual recoding, the i-th squarand is determined only from bits of lesser or equal significance to the bits determining the i-th high-radix squarer digit. The catalyst for characterizing the left-to-right higher radix dual recoding is the sequence of two's complement tails of the operand.
• The partial squares are each scaled down by another power of 16, so an n-term sum provides an approximate square of about 4n bits of accuracy.
• The partial square generators are similar in design to Booth radix-4 partial product generators but simpler in two ways: no sign extensions are needed and, on average, they are about half the size for the same precision.
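The power-of-16 scaling can be checked with a small Python sketch. Note that this is not the dual recoding of [5]: it assumes a plain unsigned radix-4 digit set {0, 1, 2, 3} and ordinary (not two's complement) tails, and uses only the telescoping identity t_i^2 - t_{i+1}^2 = h_i (h_i + 2 t_{i+1}), where t_i is the tail starting at digit i and h_i is the weighted i-th digit. The names `radix4_digits` and `partial_squares` are hypothetical.

```python
from fractions import Fraction


def radix4_digits(x_bits):
    """Split a binary fraction string into radix-4 digits, leading digit first."""
    if len(x_bits) % 2:
        x_bits += "0"  # pad to a whole number of radix-4 digits
    return [int(x_bits[k:k + 2], 2) for k in range(0, len(x_bits), 2)]


def partial_squares(x_bits):
    """Return (x, terms) where the terms are non-negative and sum to x^2.

    Illustrative decomposition only; it is not the recoding from [5].
    """
    digits = radix4_digits(x_bits)
    n = len(digits)
    # tails[i] is the value of digits i, i+1, ..., n-1; tails[n] is zero.
    tails = [Fraction(0)] * (n + 1)
    for i in range(n - 1, -1, -1):
        tails[i] = Fraction(digits[i], 4 ** (i + 1)) + tails[i + 1]
    terms = []
    for i in range(n):
        head = Fraction(digits[i], 4 ** (i + 1))
        # tails[i]^2 - tails[i+1]^2 = head * (head + 2 * tails[i+1])
        terms.append(head * (head + 2 * tails[i + 1]))
    return tails[0], terms


x, terms = partial_squares("1100010110001011")
for i, t in enumerate(terms):
    # Each term is non-negative and smaller than 16^-i.
    print(i, float(t), t < Fraction(1, 16 ** i))
print("sum equals x^2:", sum(terms) == x * x)
```

In this simplified setting the remainder after the first n terms is the square of the n-th tail, which is below 16^-n, consistent with the statement that an n-term sum gives about 4n bits of accuracy.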
For more details on the operand dual recoding, see [5]. It is illustrative to consider a sample squaring operation as shown in the tables. Consider the 16-bit squaring operation with x normalized in the interval [½, 1), in particular x = 0.1100010110001011_2 in Example 1. This example uses the radix-2 optimizations discussed above [4]. As can be seen, the array contains 8 rows and 95 terms, including guard bits. The optimization achieved with the proposed radix-4 method can be seen in the array of Example 2. It utilizes a radix-4 dual recoding and employs g=3 guard bits, yielding a result that has a 1½ ulp lower bound on x^2. In comparison, this array has only 4 rows and 50 terms.
The circuits were implemented in Verilog™ HDL and synthesized with the Synopsys Design Compiler™ using both 130nm and 90nm cell libraries from Texas Instruments. Various squaring circuits were synthesized for operand sizes of n=12, 16, and 24 bits. In these cases, additional guard bits of g=2, 2, and 3, respectively, were included to ensure that the approximation error is bounded by at most 2 ulps. The resulting circuits were also analyzed for maximum path delay and power dissipation. For comparison purposes, an n-bit operand truncated multiplier with an n-bit product and an appropriate number of guard bits was also synthesized into the same cell libraries, as this type of circuit is commonly used to generate an approximate square.
Tables 1 and 2 contain the synthesis results in terms of path delay, power dissipation, and area. We note that these results do not include delay due to routing. However, since the multiplier adder array is more complicated than the reduced squaring adder arrays, the comparison of the truncated multiplier array to the squaring circuits is likely very conservative once actual wire delays are included.
The results show that both the radix-2 and the radix-4 circuits yield a dramatic improvement in performance, power, and area compared to the truncated multiplier circuit. The reductions range from a factor of two to three for delay, three to four for area, and five to six for power. The radix-4 squarer performs better than the radix-2 squarer by about 10 to 20% in all metrics, with the greatest reduction being in area. Figure 2 illustrates the comparison of the radix-2 and radix-4 squarers in more detail, showing greater improvements as the size of the operand increases.
To summarize the savings obtained with the approximate squaring circuit compared with the equivalent-sized radix-2 multiplier, Figure 2 contains comparison charts that illustrate the delay, power, and area results for the 12-, 16-, and 24-bit operand approximate squaring circuits. The proposed circuit shows significant savings, and its efficiency improves with increasing operand size.
Conclusion
A new approach for recoding radix-4 approximate squaring is investigated. The recoding avoids sign extensions and generates a radix-16 reduced array of partial squares, resulting in a low-power, high-performance approximate squaring circuit.
When compared to the truncated multiplier and the approach described in [4], this squaring circuit shows improvements in power, area, and delay when synthesized to standard cell libraries.
In the future, we plan to use the squaring circuit as a basis for designing other arithmetic circuits, such as general multiplication and division circuits.
