Abstract-In this paper, a systematic compensation approach is presented to efficiently design the approximate squaring function with a simple combinational logic circuit. A set of recursive Boolean equations for the general outputs is also derived, so that the logic circuit can be rapidly designed and reused for inputs of various bit-widths. In logic implementation, our design approach has lower circuit cost and lower critical delay. Moreover, in error analysis, the maximum relative error (MRE) and average relative error (ARE) of the squaring approximation are improved by at least 26.95% and 61.59%, respectively, compared with the existing approaches. Finally, a 7-bit approximate squaring function chip is implemented in 0.6-μm CMOS technology to verify the circuit performance. The chip layout occupies 127 × 135 μm² and the total number of transistors is 186.
I. INTRODUCTION

IN MANY applications, such as the Viterbi algorithm and the VQ algorithm [1]-[3], the squaring function is a fundamental arithmetic operation for estimating the squared Euclidean distance. A fast squaring function is therefore needed to meet the requirements of real-time applications. For the exact squaring function, a look-up table or a multiplier [4] was commonly employed in circuit implementation. Shammanna [5] proposed a new design using a cellular logic array. However, its speed is limited and its circuit complexity grows as the number of input bits increases. Recently, square approximation [6] was proposed, and Monte-Carlo simulation showed that the algorithm performance was not degraded by the approximation. Eshraghi et al. [6] first presented an approximate squaring design, but it has a maximum relative error of 25% and an average relative error of 11.3%. Subsequently, Hiasat et al. [7] exploited a high degree of regularity and pattern repetition to develop a combinational logic design. This paper proposes a new compensative design algorithm for the approximate squaring function. The basic idea is to extract some basic terms from the exact squaring equation and then apply a sequence of compensation processes that increases the output accuracy dramatically. Finally, the outputs of the approximate squaring function are simplified into a set of recursive Boolean equations.
II. COMPENSATIVE DESIGN ALGORITHM
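For an approximate squaring function, the input is an $n$-bit binary number $X = x_{n-1} \cdots x_1 x_0$, and the corresponding output is a $2n$-bit binary number $P = p_{2n-1}^{(n)} \cdots p_1^{(n)} p_0^{(n)}$, where $p_i^{(n)}$ represents the $i$th bit of the output when the input is $n$-bit data. For the sake of clarity, a 4-bit example is used to explain the compensation design approach. First, the output expression for the exact squaring function can be written as

$$X^2 = x_3 2^6 + x_2 2^4 + x_1 2^2 + x_0 + x_3 x_2 2^6 + x_3 x_1 2^5 + x_3 x_0 2^4 + x_2 x_1 2^4 + x_2 x_0 2^3 + x_1 x_0 2^2. \tag{1}$$

Publisher Item Identifier S 0018-9200(02)00130-0.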
From the viewpoint of hardware design, (1) contains many additions that require carry propagation. The following compensative algorithm therefore eliminates these additions to improve the circuit performance. To begin, $x_3 2^6$, $x_2 2^4$, $x_1 2^2$, and $x_0$ in (1) are defined as pure terms. The other terms, $x_3 x_2 2^6$, $x_3 x_1 2^5$, $x_3 x_0 2^4$, $x_2 x_1 2^4$, $x_2 x_0 2^3$, and $x_1 x_0 2^2$, are called composite terms since each consists of two literals. Based on these terms, the compensative design algorithm is described as follows.
Step 1) Let the approximate result equal the sum of the pure terms:

$$P = x_3 2^6 + x_2 2^4 + x_1 2^2 + x_0. \tag{2}$$
Step 2) Select the closest composite terms for compensation.
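Observing the terms defined in (1), the three composite terms $x_3 x_2 2^6$, $x_2 x_1 2^4$, and $x_1 x_0 2^2$, called the closest composite terms, are closely related to the pure terms $x_3 2^6$, $x_2 2^4$, and $x_1 2^2$, respectively, because each pair has the same weight and a common literal. Using these closest terms to compensate $P$, (2) can be rewritten, with the corresponding additions simplified, as

$$P = (x_3 \cdot x_2) 2^7 + (x_3 \oplus x_3 \cdot x_2) 2^6 + (x_2 \cdot x_1) 2^5 + (x_2 \oplus x_2 \cdot x_1) 2^4 + (x_1 \cdot x_0) 2^3 + (x_1 \oplus x_1 \cdot x_0) 2^2 + x_0 \tag{3}$$

where $\oplus$ stands for the exclusive-OR operation and $\cdot$ represents the AND operation.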
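0018-9200/02$17.00 © 2002 IEEE
Step 3) Choose the second closest composite terms for compensation.
Similarly, comparing the remaining composite terms with the terms in (3), two terms, $x_3 x_1 2^5$ and $x_2 x_0 2^3$, called the second closest composite terms, are related to the terms $(x_2 \cdot x_1) 2^5$ and $(x_1 \cdot x_0) 2^3$ in (3). Therefore, $x_3 x_1 2^5$ and $x_2 x_0 2^3$ are introduced into (3). That is,

$$P = (x_3 \cdot x_2) 2^7 + (x_3 \oplus x_3 \cdot x_2) 2^6 + (x_2 \cdot x_1 + x_3 \cdot x_1) 2^5 + (x_2 \oplus x_2 \cdot x_1) 2^4 + (x_1 \cdot x_0 + x_2 \cdot x_0) 2^3 + (x_1 \oplus x_1 \cdot x_0) 2^2 + x_0. \tag{4}$$

Consequently, reconsidering the term $(x_2 \cdot x_1 + x_3 \cdot x_1) 2^5$ (or $(x_1 \cdot x_0 + x_2 \cdot x_0) 2^3$) in (4), the addition can be replaced by sum and carry expressions. The sum $x_1 \cdot (x_2 \oplus x_3)$ stays at weight $2^5$, and the carry $x_3 \cdot x_2 \cdot x_1$ is merged into the $2^6$ term as $(x_3 \oplus x_3 \cdot x_2) \vee (x_3 \cdot x_2 \cdot x_1)$, where $\vee$ stands for the OR operation. The two operands of this OR are never 1 at the same time, so no further carry arises. Likewise, the $2^3$ term yields the sum $x_0 \cdot (x_1 \oplus x_2)$ and the carry $x_2 \cdot x_1 \cdot x_0$, which is merged into the $2^4$ term.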
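Using the above results for (4), and writing $\vee$ for OR, we have

$$P = (x_3 \cdot x_2) 2^7 + [(x_3 \oplus x_3 \cdot x_2) \vee (x_3 \cdot x_2 \cdot x_1)] 2^6 + [x_1 \cdot (x_2 \oplus x_3)] 2^5 + [(x_2 \oplus x_2 \cdot x_1) \vee (x_2 \cdot x_1 \cdot x_0)] 2^4 + [x_0 \cdot (x_1 \oplus x_2)] 2^3 + (x_1 \oplus x_1 \cdot x_0) 2^2 + x_0. \tag{5}$$

Step 4) Use the remaining composite term(s) to compensate $P$ through the OR operation. In this 4-bit example, the remaining composite term is $x_3 x_0 2^4$. The output bit $p_4^{(4)}$ is then compensated by this term through the OR operation, and (5) can be rewritten as

$$P = (x_3 \cdot x_2) 2^7 + [(x_3 \oplus x_3 \cdot x_2) \vee (x_3 \cdot x_2 \cdot x_1)] 2^6 + [x_1 \cdot (x_2 \oplus x_3)] 2^5 + [(x_2 \oplus x_2 \cdot x_1) \vee (x_2 \cdot x_1 \cdot x_0) \vee (x_3 \cdot x_0)] 2^4 + [x_0 \cdot (x_1 \oplus x_2)] 2^3 + (x_1 \oplus x_1 \cdot x_0) 2^2 + x_0. \tag{6}$$

Fig. 2. Error of 5-bit squaring approximation. MRE: 13.17%. ARE: 1.96%.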
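The four steps can be traced in software for the 4-bit case. The following Python sketch is an illustration only, not the authors' circuit: the bit equations are reconstructed from the step descriptions, and the function name `approx_square4` and all variable names are hypothetical. It builds the output bit by bit (half-adder sums and carries in Steps 2 and 3, an OR merge in Step 4) and compares the result with the exact square.

```python
def approx_square4(x):
    """Approximate x*x for a 4-bit input by following Steps 1-4 (assumed reconstruction)."""
    x0, x1, x2, x3 = ((x >> i) & 1 for i in range(4))

    # Steps 1-2: each pure term plus its closest composite term is a half adder.
    p7 = x3 & x2            # carry of x3 + x3*x2 (weight 2^6)
    p6 = x3 ^ (x3 & x2)     # sum   of x3 + x3*x2
    p5 = x2 & x1            # carry of x2 + x2*x1 (weight 2^4)
    p4 = x2 ^ (x2 & x1)     # sum   of x2 + x2*x1
    p3 = x1 & x0            # carry of x1 + x1*x0 (weight 2^2)
    p2 = x1 ^ (x1 & x0)     # sum   of x1 + x1*x0
    p0 = x0                 # pure term of weight 2^0

    # Step 3: add the second closest composite terms x3*x1 (2^5) and x2*x0 (2^3).
    c6, p5 = p5 & (x3 & x1), p5 ^ (x3 & x1)
    c4, p3 = p3 & (x2 & x0), p3 ^ (x2 & x0)
    p6 |= c6                # carry merged by OR; operands are never both 1
    p4 |= c4

    # Step 4: the remaining composite term x3*x0 (2^4) is OR-ed into p4.
    # This is the only approximate step: a 1 is lost when p4 is already 1.
    p4 |= x3 & x0

    return (p7 << 7) | (p6 << 6) | (p5 << 5) | (p4 << 4) | (p3 << 3) | (p2 << 2) | p0


if __name__ == "__main__":
    for x in range(16):
        if approx_square4(x) != x * x:
            print(x, x * x, approx_square4(x))
```

In this sketch only the inputs 13 and 15 deviate from the exact square (153 instead of 169, and 209 instead of 225), because those are the inputs for which the Step 4 OR discards a weight-16 contribution.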
TABLE I ERROR RATE COMPARISON
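To reduce the gate delay, each output in (6) can be expressed in sum-of-products (SOP) form for two-level logic implementation:

$$\begin{aligned}
p_7^{(4)} &= x_3 x_2 \\
p_6^{(4)} &= x_3 \bar{x}_2 \vee x_3 x_1 \\
p_5^{(4)} &= x_3 \bar{x}_2 x_1 \vee \bar{x}_3 x_2 x_1 \\
p_4^{(4)} &= x_2 \bar{x}_1 \vee x_2 x_0 \vee x_3 x_0 \\
p_3^{(4)} &= x_2 \bar{x}_1 x_0 \vee \bar{x}_2 x_1 x_0 \\
p_2^{(4)} &= x_1 \bar{x}_0 \\
p_1^{(4)} &= 0 \\
p_0^{(4)} &= x_0
\end{aligned} \tag{7}$$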
According to (7), Fig. 1 lists the error between the exact squaring value and our approximate output. The relative error is defined as $|X^2 - P| / X^2$, and the maximum relative error (MRE) of our design is only 9.48%, with an average relative error (ARE) of 1.04%. Similarly, using the compensative design algorithm, the outputs of the 5-bit squaring function and the corresponding error data are given in (8) and Fig. 2. (8)
where , and .
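The 4-bit error figures can be recomputed by evaluating a two-level form over all 16 inputs. This Python sketch rests on several assumptions: the SOP expressions are a reconstruction from the 4-bit steps rather than a copy of the published equations, the relative error at input 0 is taken as 0, and the ARE is averaged over all 16 inputs. The names `sop_square4` and `error_stats` are hypothetical.

```python
def sop_square4(x):
    """Two-level (AND-OR) approximation of x*x for a 4-bit input (assumed form)."""
    x0, x1, x2, x3 = ((x >> i) & 1 for i in range(4))
    n0, n1, n2, n3 = 1 - x0, 1 - x1, 1 - x2, 1 - x3  # complemented literals

    p7 = x3 & x2
    p6 = (x3 & n2) | (x3 & x1)
    p5 = (x3 & n2 & x1) | (n3 & x2 & x1)
    p4 = (x2 & n1) | (x2 & x0) | (x3 & x0)
    p3 = (x2 & n1 & x0) | (n2 & x1 & x0)
    p2 = x1 & n0
    p0 = x0                                           # p1 is always 0

    return (p7 << 7) | (p6 << 6) | (p5 << 5) | (p4 << 4) | (p3 << 3) | (p2 << 2) | p0


def error_stats(approx, n_bits=4):
    """Return (MRE, ARE) in percent over all inputs; the error at x = 0 is taken as 0."""
    errors = []
    for x in range(1 << n_bits):
        exact = x * x
        errors.append(abs(exact - approx(x)) / exact if exact else 0.0)
    return 100 * max(errors), 100 * sum(errors) / len(errors)


if __name__ == "__main__":
    mre, are = error_stats(sop_square4)
    print(f"MRE = {mre:.2f}%, ARE = {are:.2f}%")
```

Under these assumptions the sketch gives an MRE of about 9.47% (at input 13, where 153 replaces 169) and an ARE of about 1.04%, close to the values quoted in the text.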
III. PERFORMANCE COMPARISON AND VLSI IMPLEMENTATION
For a complete accuracy comparison, Table I lists the maximum relative errors and average relative errors of the approximate results obtained by the different approaches. The errors clearly grow as the input bit-width increases. However, our MRE and ARE are improved by at least 26.95% and 61.59%, respectively, compared with the latest work [7]. Next, to make a fair comparison of hardware complexity, the number of transistors is counted based on the SOP form, with the logic realized in static CMOS circuits. For our approach, the total number of transistors required by an $n$-bit approximate squaring function is given in (10). Table II compares the numbers of transistors used by the different approaches for several values of $n$. Our designs have the lowest transistor count and logic delay. To demonstrate the performance of the proposed circuit, a 7-bit approximate squaring function is designed and implemented. Fig. 3 shows the final layout with 186 transistors. The layout area is 127 × 135 μm² in 0.6-μm SPDM CMOS process technology. The power consumption is measured to be 16 pW/Hz under a 5-V power supply. Based on SPICE simulation, the circuit delay, including the I/O pad circuit, is about 3.1 ns and 2.5 ns for the rising and falling edges, respectively.
