Abstract-Power is becoming a precious resource in modern VLSI design, even more so than area. With a large number of applications requiring functional units for squares, cubes, and other higher-order powers, it becomes imperative that such functions be implemented in hardware. Implementing those functions using existing general-purpose multipliers in a design may be economical in terms of area but consumes more power than is necessary. We propose to use dedicated squaring units to perform squares. We study the tradeoffs of using a dedicated squaring unit compared with a general-purpose multiplier designed with the Radix-4 Modified Booth encoding scheme. We compare area and power requirements for different widths. We are able to reduce power consumed per computation by more than 50% with this approach. Moreover, when an application demands a large number of squaring operations compared to multiplies, there is a strong case for using multiple squaring units for performing multiplication.
I. INTRODUCTION
With advances in VLSI technology, more and more functional complexity has been integrated into digital designs to better support target applications. With many applications requiring support for floating point arithmetic, complex arithmetic modules like multipliers and powering units are now being extensively used in designs. With technology scaling, the goal has been to operate designs at the fastest possible frequency to achieve high performance. The problem with complex arithmetic blocks like multipliers and squaring units is that they require longer cycle times for computation. In order to meet the frequency requirements, these designs invariably end up being pipelined, which increases area and thus incurs a power penalty for operating at higher clock speeds. In many applications a higher power penalty cannot be tolerated, and designers have to budget the power associated with individual resources.
In many applications multipliers are often used to perform squares and higher order power operations. Multiplier designs require large area and consume a considerable amount of power per computation. For powering operations where a general-purpose multiplier is not necessary, this results in power being wasted. We propose to use dedicated powering units, each performing a specific function, in place of a multiplier that has been designed for general-purpose computation. The advantage of dedicated units is that they consume less power than general-purpose multipliers. By using dedicated resources one can save a considerable amount of power, which allows designers to remain inside their power budgets. Squaring operations are common in many applications such as floating point divide and square-root [1], cryptography, computation of Euclidean distance among nodes in graphics processing, and rectangular-to-polar conversion [7]. In many DSP applications squaring is the most commonly used powering operation. Hence targeting the most commonly used operation would result in maximum savings in terms of area and power.
Often, designers use general-purpose multipliers to compute the square of a number. Even though using multipliers that are available as part of design packages reduces design time, it results in increased area and power requirements for the design. Dedicated squaring units are area efficient and also require less energy per computation than general-purpose multipliers. Because of this inherent reduction in energy per computation, a dedicated squaring unit performs better than a general-purpose multiplier when only squares are needed.
The remainder of this paper is organized as follows. Section II presents a brief description of existing algorithms used in the multiplication of two binary numbers, followed by the designs of squaring units for unsigned and signed numbers. Section III presents a way to use squaring units to perform multiplication of two binary numbers. Section IV details the implementation and experimental results, followed by a conclusion in Section V.
II. BINARY MULTIPLICATION AND SQUARING

A. Binary Multiplication
All techniques for performing binary multiplication involve three basic steps: generation of partial products, reduction of partial products, and addition of the final two rows of partial products. An M×N-bit multiplication can be viewed as forming N partial products, each of M bits, and adding them together according to their weights. Multiplication is performed either by using a shift-add algorithm or by using parallel multiplication techniques. The shift-add method requires N cycles, one per partial product, to perform an M×N-bit multiplication. There are various techniques for performing parallel multiplication. The choice of technique depends on the latency, throughput, area, and complexity requirements of the design [2]. The key in parallel multiplication techniques is the reduction of the partial product array. Many techniques first reduce the number of partial products, and then Wallace tree or array tree adders are used to reduce the number of logic levels required to perform the summation. The final two rows are added using a fast carry-propagate adder.
The most commonly used parallel multiplication technique is Booth encoding [3], in particular Radix-4 Modified Booth Recoding, which essentially examines the multiplier for strings of 1's or 0's. The basic idea of the Radix-4 Modified Booth Encoding (MBE) scheme [4] is that, instead of shifting and adding for each bit of the multiplier, iterations are performed for every two bits of the multiplier. For every group of two bits, a 3-bit pattern is formed from the group and the neighboring bit, and this pattern is recoded according to Table I. All the recoded partial products are then added together according to their weights to obtain the final product. The Radix-4 MBE technique halves the number of partial products compared with the shift-add technique.
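The recoding step can be sketched in Python as follows. This is a minimal illustration, assuming the standard radix-4 rule (digit = -2·b1 + b0 + previous bit) is what Table I encodes; the function names `booth_radix4_digits` and `booth_multiply` are hypothetical and do not come from the Verilog design.

```python
def booth_radix4_digits(multiplier: int, width: int) -> list[int]:
    """Recode a width-bit two's-complement multiplier into radix-4 digits.

    Each digit lies in {-2, -1, 0, +1, +2}; digit k has weight 4**k.
    Assumption: width is even, as in the 8/16/32-bit cases.
    """
    if width % 2 != 0:
        raise ValueError("width must be even for this sketch")
    bits = multiplier & ((1 << width) - 1)  # two's-complement bit pattern
    digits = []
    prev = 0  # implicit 0 to the right of the least significant bit
    for k in range(0, width, 2):
        b0 = (bits >> k) & 1
        b1 = (bits >> (k + 1)) & 1
        # Standard radix-4 Booth rule applied to the 3-bit pattern (b1, b0, prev)
        digits.append(-2 * b1 + b0 + prev)
        prev = b1
    return digits


def booth_multiply(multiplicand: int, multiplier: int, width: int) -> int:
    """Add the recoded partial products according to their weights."""
    digits = booth_radix4_digits(multiplier, width)
    return sum(d * multiplicand * 4 ** k for k, d in enumerate(digits))
```

For example, `booth_radix4_digits(7, 4)` gives `[-1, 2]`, i.e. 7 = -1 + 2·4, so a 4-bit multiplier produces two partial products instead of four.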
B. Binary Squaring
Squares are a special case of multiplication where both inputs are identical. Since the two inputs are identical, many optimizations can be made in the implementation of a dedicated squaring unit. Such a squaring unit requires less area than a multiplier, as nearly half of the partial products can be combined using the equivalence $A_iA_j + A_jA_i = 2A_iA_j$, which can be represented by adding $A_iA_j$ to the next column to the left. This reduces the depth, which can be defined as the number of partial products to be added together in a column. With a reduction in depth, the design can operate faster, as the number of terms on the critical path is reduced. Fig. 1 shows a 4-bit unsigned squaring unit. We can observe from Fig. 1 that the two $A_1A_0$ terms in column 2 are reduced to a single $A_1A_0$ term in column 3. Other partial products can be reduced similarly. Also, the property $A_0A_0 = A_0$ allows reducing terms in the final partial products. The square of a 4-bit number can be computed by adding the rows at the bottom part of Fig. 1. From Fig. 1 we observe that the depth has also been reduced: an initial depth of four for a multiplier configuration is reduced to three for squaring.
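The folding described above can be sketched in Python. This is an illustrative model rather than the hardware design: the function names are hypothetical, and columns are indexed from 0 (weight 2**c), so the text's "column 3" is index 2 here.

```python
def folded_square_columns(a: int, n: int) -> list[int]:
    """Column sums of the folded partial-product array for an n-bit unsigned a.

    Diagonal terms A_i*A_i collapse to A_i in column 2*i. Each symmetric pair
    A_i*A_j + A_j*A_i (i < j) becomes one A_i*A_j term in column i + j + 1.
    """
    bit = [(a >> i) & 1 for i in range(n)]
    cols = [0] * (2 * n)
    for i in range(n):
        cols[2 * i] += bit[i]                   # A_i A_i = A_i
        for j in range(i + 1, n):
            cols[i + j + 1] += bit[i] & bit[j]  # 2 A_i A_j, moved one column left
    return cols


def square_unsigned(a: int, n: int) -> int:
    """Add the folded columns according to their weights."""
    return sum(v << c for c, v in enumerate(folded_square_columns(a, n)))


def term_counts(n: int) -> tuple[int, int]:
    """(AND terms in an n x n multiplier array, terms left after folding)."""
    return n * n, n * (n + 1) // 2
```

With all input bits set, `max(folded_square_columns(0b1111, 4))` evaluates to 3, which agrees with the depth of three noted for the 4-bit case, and `term_counts(4)` gives 16 terms before folding against 10 after.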
With the two's-complement number scheme in wide use, signed squaring units are also needed. In a signed squaring unit, the partial product terms containing the sign bit must be subtracted while the remaining partial products are added. This complicates the design, since both additions and subtractions must be supported. For multipliers, the Baugh-Wooley multiplication algorithm [5] provides a way to multiply signed numbers represented in two's-complement form: it eliminates the subtraction of terms, so that only addition is required to compute the result. Let $X_v$ be the value of the signed number to be squared. Then the two's complement representation of $X_v$ is given by
The value $S_v$ of the square $S$ is given by
When $x_{n-1} = 0$, terms containing $x_{n-1}$ are zero and can be ignored, but when $x_{n-1} = 1$ the terms containing $x_{n-1}$ must be subtracted. Hence the third term in (3), using (5), can be written as
The first term in (3) is always positive, and the second term collects all the positive partial products. By transforming the third term in (3), which contains the partial product terms that need subtraction, into the addition terms expressed in (6), two's-complement squaring can be performed using additions only. Fig. 2 shows signed squaring for a 4-bit signed number. All the partial product terms containing $A_3$ need to be subtracted. Inverting those partial product terms and adding 1 yields their two's complement, which can then be added to effect the subtraction. The drawback of this method is that it requires the inverse of every bit of the input, but that is easily computed using inverters. As shown in Fig. 2, the depth of the squaring unit using the Baugh-Wooley algorithm is greater than in the unsigned case, because the sign bit must also be taken into consideration when adding column 5. As the width of the squaring unit increases, the additional penalty of the sign bit is amortized over the number of bits, and, as shown later, even the signed squaring unit's average depth is only about 50% of the multiplier's.
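The derivation described in words above can be sketched as follows. This is a reconstruction from the standard two's-complement definition and the Baugh-Wooley idea, not a copy of the paper's displayed equations; the shorthand $U$ is introduced here for brevity, and only the three-term expansion in the second line is meant to correspond to the expression the text refers to as (3).

```latex
\begin{align*}
X_v &= -x_{n-1}2^{n-1} + U, \qquad U = \sum_{i=0}^{n-2} x_i 2^i \\
S_v &= X_v^2 = x_{n-1}2^{2n-2} + U^2 - x_{n-1}2^{n}U \\
-x_{n-1}U &= x_{n-1}\Bigl(\sum_{i=0}^{n-2}\bar{x}_i 2^i + 1 - 2^{n-1}\Bigr) \\
-x_{n-1}2^{n}U &= \sum_{i=0}^{n-2} x_{n-1}\bar{x}_i\,2^{n+i} + x_{n-1}2^{n} - x_{n-1}2^{2n-1} \\
&\equiv \sum_{i=0}^{n-2} x_{n-1}\bar{x}_i\,2^{n+i} + x_{n-1}2^{n} + x_{n-1}2^{2n-1} \pmod{2^{2n}}
\end{align*}
```

The second line uses $x_{n-1}^2 = x_{n-1}$. The last step relies on the result being kept to $2n$ bits, so subtracting $2^{2n-1}$ and adding it are equivalent; every remaining term is then an addition.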
Both the above signed and unsigned squaring units assume we have the partial products available, which are easily computed using AND gates. 
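A bit-level Python sketch of signed squaring with additions only is given below. It assumes the Baugh-Wooley-style replacement of each sign-bit product by the sign bit ANDed with an inverted input bit, plus two correction constants, with the sum kept to 2n bits. The function name is hypothetical, and the sketch models arithmetic only, not the reduction tree.

```python
def square_signed_additions_only(x: int, n: int) -> int:
    """Square an n-bit two's-complement value using AND terms and additions."""
    if not -(1 << (n - 1)) <= x < (1 << (n - 1)):
        raise ValueError("x does not fit in n bits")
    bit = [(x >> i) & 1 for i in range(n)]  # two's-complement bits of x
    s = bit[n - 1]                          # sign bit
    total = s << (2 * n - 2)                # sign bit squared, weight 2^(2n-2)
    # Positive partial products of the lower n-1 bits, in folded form
    for i in range(n - 1):
        total += bit[i] << (2 * i)
        for j in range(i + 1, n - 1):
            total += (bit[i] & bit[j]) << (i + j + 1)
    # Sign-bit products, converted from subtractions into additions
    for i in range(n - 1):
        total += (s & (1 - bit[i])) << (n + i)  # sign bit AND inverted x_i
    total += s << n                             # the "+1" of the two's complement
    total += s << (2 * n - 1)                   # -2^(2n-1) equals +2^(2n-1) mod 2^(2n)
    return total & ((1 << (2 * n)) - 1)         # keep the 2n result bits
```

For x = -3 and n = 4 the added terms are 64 + 25 + 32 + 16 + 128 = 265, and keeping the low 8 bits leaves 9.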
III. QUARTER SQUARE TECHNIQUE
With squaring units requiring less area and power than multipliers, it is interesting to assess the use of squaring units to perform multiplication. There are various methods to obtain the product of two numbers using squares instead of multipliers. One of the most widely used methods in algebra is the quarter square method [6]. In mathematical terms, the quarter square algorithm can be expressed as
$$a \times b = \frac{(a+b)^2 - (a-b)^2}{4}$$
In this method, to obtain the product of two numbers, we first compute their sum and difference. The sum and difference are squared, and the difference of these two squares, divided by 4, gives the result. In binary arithmetic, division by 4 is easily accomplished by shifting right by two bit positions. The quarter square technique is illustrated in Fig. 3.
From Fig. 3 we observe that with two 8-bit unsigned numbers the sum can produce a carry; similarly, with two 8-bit signed numbers the difference can generate an overflow. In order to produce a correct result we need an (8+1)-bit adder for computing the sum and difference, and hence one would need at least (n+1)-bit squaring units to correctly perform an n-bit multiplication.
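The quarter square method and its width requirement can be checked with a short Python sketch. The function names are hypothetical, Python's integer multiply stands in for the squaring units, and `squarer_input_width` assumes signed (two's-complement) squarers.

```python
def quarter_square_multiply(a: int, b: int) -> int:
    """Multiply via two squares: ((a + b)^2 - (a - b)^2) / 4."""
    s = a + b
    d = a - b
    # The difference of squares equals 4*a*b, so the shift is exact.
    return (s * s - d * d) >> 2


def signed_bits_needed(v: int) -> int:
    """Smallest two's-complement width that can represent v."""
    return (v if v >= 0 else ~v).bit_length() + 1


def squarer_input_width(n: int, signed: bool) -> int:
    """Signed squarer input width needed for n-bit operands."""
    if signed:
        lo, hi = -(1 << (n - 1)), (1 << (n - 1)) - 1
    else:
        lo, hi = 0, (1 << n) - 1
    extremes = [hi + hi, lo + lo, hi - lo, lo - hi]  # extreme sums and differences
    return max(signed_bits_needed(v) for v in extremes)
```

Under these assumptions, `squarer_input_width(8, signed=True)` evaluates to 9, the (n+1)-bit case, while `squarer_input_width(8, signed=False)` evaluates to 10, because an unsigned 9-bit sum needs one more bit when fed to a signed squarer.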
IV. EXPERIMENTS AND RESULTS
An 8/16/32-bit multiplier performing signed/unsigned operations based on the Radix-4 Modified Booth Recoding algorithm has been described in Verilog. We also developed signed squaring units based on the Baugh-Wooley algorithm for 8-, 16-, and 32-bit widths in Verilog. As the multipliers support signed operations, we use squaring units designed for signed operations for all the results and comparisons.
We implemented the quarter square algorithm using the squaring unit designs to perform signed/unsigned multiplication. All the designs were synthesized to an Artisan 90nm standard-cell library with similar constraints. In order to support both signed and unsigned number formats, we designed 10-, 18-, and 34-bit squaring units to support 8-, 16-, and 32-bit multiplication, respectively. We first compare the Radix-4 MBE multiplier with the squaring units for performing squares, and then compare it with the multiplier based on the quarter square algorithm.
For the comparison between the Booth multiplier and the squaring units, we compare the two designs based on their depth, i.e., the maximum number of partial products in a column. Table II shows the depth requirements for the Booth multiplier and the squaring units. As seen from Table II, the maximum depth of the squaring unit is greater than that of the Booth multiplier, but its average depth is much smaller, allowing it to be reduced using 4:2 or 3:2 reducers with fewer logic levels. The columns with maximum depth were not observed to be on the critical path in the squaring units.
Fig. 4 plots the area requirements for the various designs under the same constraints. From the results we observe that the squaring units require only about 55% of the area of the Radix-4 MBE multiplier. Designing multipliers with the quarter square technique results in an area penalty of about 20-60% over the Radix-4 MBE multiplier. As the multiplication width increases, the area difference between the two multiplication techniques decreases. This can be attributed to the optimization performed by the tool for the large number of partial product terms with the available design library components. From the area requirements of the quarter square multiplier and the squaring units, we find that the area overhead of the adders in the multiplier design is about 20-30% of the area of the squaring unit.
The power required for each design is shown in Table III. As seen from Table III, the squaring units consume about 50% of the power consumed by the Radix-4 MBE multiplier to perform squaring. However, when a multiplier is built using the quarter square technique, it consumes more power than the Radix-4 MBE multiplier, as the design requires two squaring units and three adders for every multiplication. The adder overhead significantly affects the overall power. Building multipliers using the quarter square technique therefore reduces power only in applications where squaring operations are much more frequent than general-purpose multiplications. In such cases, one can turn off unused circuitry using many of the standard VLSI power saving techniques, which results in substantial power savings while still providing the option of performing multiplications. One option to reduce power requirements in the quarter square design is to reuse components, i.e., to configure one squaring unit and one adder for multiplication.
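The tradeoff in the preceding paragraph can be illustrated with a simple break-even model in Python. It is only a sketch: the function name and the linear energy model are assumptions, and the inputs are relative energies per operation that would have to be taken from Table III.

```python
def breakeven_square_fraction(e_square: float, e_qs_multiply: float,
                              e_mbe: float = 1.0) -> float:
    """Fraction of operations that must be squares for a squarer-based design
    to match an MBE multiplier in average energy per operation.

    Solves f * e_square + (1 - f) * e_qs_multiply = e_mbe for f, where
    e_square is the energy of one square on a dedicated squaring unit,
    e_qs_multiply the energy of one quarter-square multiplication, and
    e_mbe the energy of one operation on the MBE multiplier.
    """
    if not e_square < e_mbe < e_qs_multiply:
        raise ValueError("model assumes e_square < e_mbe < e_qs_multiply")
    return (e_qs_multiply - e_mbe) / (e_qs_multiply - e_square)
```

As a purely illustrative case, taking `e_square = 0.5` (the roughly 50% figure reported above) and a hypothetical `e_qs_multiply = 1.5`, which is not a value from Table III, the function returns 0.5: under that assumption, squares would have to make up at least half of all operations before the squarer-based design wins.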
V. CONCLUSION
This paper presents a case for the use of dedicated squaring units in applications where squares are required in large numbers and would otherwise be computed using general-purpose multipliers. A method of using squaring units to perform multiplication is presented, along with its tradeoffs compared to conventional multipliers. We provide results for the area and power requirements of signed squaring units and quarter square multipliers for 8-, 16-, and 32-bit widths. The low area and power required per computation provide significant advantages when dedicated squaring units are used in a design instead of a general-purpose multiplier.
