A method to determine the word length required by implementations of Digital Signal Processing (DSP) algorithms is presented. The method uses a C/C++ fixed-point simulation tool to model the impact of finite word length on overall accuracy. It finds a combination of quasi-optimum bit resolutions in arbitrary data flow graphs by computing dissimilarities between fixed-point and floating-point simulation results. The selected algorithm minimizes these dissimilarities and finds a combination of word lengths that meets objectives specified by the user. The method is applicable to a wide range of DSP algorithms. It was tested on two benchmarks, the fifth-order elliptic filter and the Inverse Discrete Cosine Transform (IDCT), and arrived at the known optimum solutions.
INTRODUCTION
Digital signal processing (DSP) algorithms are often expressed with floating-point operations or with a 32-bit word length for convenience, since these data formats match those of existing commercial off-the-shelf processors. Yet in spite of the very significant growth in the performance of available processors, the requirements of many real-time applications, in terms of raw performance or low-power operation, call for specialized hardware. Since the cost and power dissipation of hardware depend strongly on the number of bits and on the use of floating-point operators, the problem of word length determination is important. This is especially true for the design of effective systems on a chip (SOC).
This work is based on three assumptions: 1) increasing the bit resolution improves the accuracy of the output of the DSP algorithm to be implemented, 2) increasing the bit resolution increases processing complexity and power dissipation for a given processing rate, and increases propagation delays, and 3) an exhaustive search to find the optimal solution for the sizing of operators in complex data paths is impractical. Related to the first assumption is the requirement for the DSP algorithm to be stable, that is, for a small change at the input to produce a small change at the output. If the algorithm is unstable, then any fast search for a quasi-optimal solution could get trapped in a local minimum in the search space. In this paper, a quasi-optimal solution is the best known solution; exhaustive simulations would be required to confirm that it is optimal, which is impractical due to computational complexity. For completeness, the fixed-point simulation tool is briefly described in the next section. The proposed method is introduced in Section 3 and applied to two benchmarks in Section 4. Finally, results are analyzed and compared with those obtained by W. Sung et al. in Section 5.
FIXED-POINT SIMULATION TOOL
The fixed-point simulation utility developed by W. Sung et al. [8, 9, 10] converts a floating-point DSP program written in C or C++ into an equivalent fixed-point description. The word length (WL), integer word length (IWL), sign, overflow handling scheme, and quantization mode of individual operands and constants can be assigned using comments in the floating-point program. This information is processed by a floating-point simulation utility that extracts the peak and computes the mean and standard deviation of each individual operand and constant over a specified input data set.
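The range statistics mentioned above can be illustrated with a minimal Python sketch. It is not the utility's own code: the function name `range_stats`, the sample trace, and the choice of the largest absolute value as the "peak" and of the population standard deviation are assumptions made for illustration.

```python
import statistics

def range_stats(samples):
    """Return (peak, mean, std) for the values one operand took during simulation.

    Assumption: "peak" is taken as the largest absolute value, and the
    standard deviation is the population standard deviation.
    """
    peak = max(abs(v) for v in samples)
    mean = statistics.fmean(samples)
    std = statistics.pstdev(samples)
    return peak, mean, std

# Hypothetical trace of one operand over an input data set.
trace = [0.5, -1.5, 1.0, 0.0]
peak, mean, std = range_stats(trace)
```

The peak indicates how many integer bits an operand needs to avoid overflow, while the mean and standard deviation describe how its values are distributed.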
The operations associated with the fixed-point data type, such as '=', '+', '-', '*' and '/', are defined in the class declaration. Fixed-point arithmetic operations, instead of floating-point arithmetic, are then conducted automatically thanks to the operator overloading capability of the C++ language. Except for declaring the operands to belong to a fixed-point class type and adding a fixed-point header file, no other part of the original program is changed during the conversion process.
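The effect of operator overloading can be sketched in a few lines. The example below is a Python stand-in, not the C++ class of the simulation tool: the class name `Fix`, its `frac_bits` parameter, and the round-to-nearest behaviour are hypothetical, and overflow handling is deliberately left out. It only shows that the algorithm body stays unchanged while the operand type decides whether floating-point or quantized arithmetic is performed.

```python
import math

class Fix:
    """Toy fixed-point number: every stored value is rounded to frac_bits fractional bits."""

    def __init__(self, value, frac_bits=4):
        self.frac_bits = frac_bits
        step = 2.0 ** -frac_bits
        # Round to the nearest multiple of the quantization step.
        self.value = math.floor(float(value) / step + 0.5) * step

    def __float__(self):
        return self.value

    # Overloaded operators re-quantize every intermediate result.
    def __add__(self, other):
        return Fix(self.value + float(other), self.frac_bits)

    def __sub__(self, other):
        return Fix(self.value - float(other), self.frac_bits)

    def __mul__(self, other):
        return Fix(self.value * float(other), self.frac_bits)

    def __truediv__(self, other):
        return Fix(self.value / float(other), self.frac_bits)


def weighted_sum(x0, x1, c0, c1):
    """Algorithm body: identical for float and Fix arguments."""
    return c0 * x0 + c1 * x1


float_out = weighted_sum(0.3, 0.7, 0.11, 0.2)
fixed_out = weighted_sum(Fix(0.3), Fix(0.7), Fix(0.11), Fix(0.2))
```

Only the declarations of the operands differ between the two calls, which mirrors the conversion process described above.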
0-7803-6685-9/01/$10.00 ©2001 IEEE
The fixed-point data format (or fixed-point C data class) employed in the fixed-point simulation tool is defined as follows:
(WL, IWL, "t1 t2 t3")
where the characters "t1 t2 t3" define the type of the fixed-point operand. The first character, t1, defines the sign attribute of the operand, which can be either unsigned ('u') or two's complement ('t'). The two's complement sign format requires one additional bit, which is the Most Significant Bit (MSB) in the fractional format. The overflow-processing mode, declared in t2, specifies whether no treatment ('o') or saturation ('s') is applied. Finally, the quantization mode is specified in t3. It can be set either to rounding ('r') or truncation ('t'). A fixed-point format of (10, 2, "tsr") corresponds to a word length of 10 bits (including a possible sign bit), an integer word length of 2 bits (excluding a possible sign bit), two's complement representation, saturation upon overflow, and rounding for word length reduction. A 10-bit binary word 0101000000 in the above format should be interpreted as 010.1000000, which corresponds to a real value of 2.5.
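The format can be made concrete with a short Python sketch. The helper names `decode` and `quantize` are hypothetical and are not part of the simulation tool. Two assumptions are made: the unsigned attribute is written 'u', and the "no treatment" overflow mode is modelled as wrap-around.

```python
import math

def frac_bits(wl, iwl, mode):
    """Fractional bits left after the integer bits and the optional sign bit."""
    return wl - iwl - (1 if mode[0] == "t" else 0)

def decode(bits, wl, iwl, mode):
    """Interpret a bit string under the (WL, IWL, "t1t2t3") format."""
    assert len(bits) == wl
    raw = int(bits, 2)
    if mode[0] == "t" and bits[0] == "1":
        raw -= 1 << wl  # negative value in two's complement
    return raw / (1 << frac_bits(wl, iwl, mode))

def quantize(value, wl, iwl, mode):
    """Map a real value to the nearest representable value of the format."""
    signed = mode[0] == "t"
    scale = 1 << frac_bits(wl, iwl, mode)
    scaled = value * scale
    # t3: rounding ('r') or truncation ('t').
    code = math.floor(scaled + 0.5) if mode[2] == "r" else math.floor(scaled)
    lo = -(1 << (wl - 1)) if signed else 0
    hi = (1 << (wl - 1)) - 1 if signed else (1 << wl) - 1
    if mode[1] == "s":
        code = max(lo, min(hi, code))      # saturation
    else:
        code = (code - lo) % (1 << wl) + lo  # assumed wrap-around
    return code / scale

value = decode("0101000000", 10, 2, "tsr")
saturated = quantize(5.0, 10, 2, "tsr")
```

Here `value` reproduces the 2.5 of the example above, and `saturated` is the largest value the (10, 2, "tsr") format can hold, 511/128.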
Error computation
The results of the floating-point simulation stored in Step 2 and of the fixed-point simulation stored in Step 4 are used to compute various errors (max peak error, mean error, mean square error). This step computes the dissimilarities between the two simulations for each test bench generated in Step 1. The mean and mean square errors are used to select the next operand whose precision should be modified. The max peak error is used to compare the solution to the application specification. These three criteria are useful for quantifying the error, but other criteria could equally be used to guide the search for a quasi-optimal solution.
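The three error criteria can be sketched as follows. The exact definitions are not given in the text, so the formulas below (error taken as fixed-point output minus floating-point output, max peak error as the largest absolute error) are assumptions, as are the function name `error_metrics` and the sample data.

```python
def error_metrics(float_out, fixed_out):
    """Return (max peak error, mean error, mean square error) of two output traces."""
    errors = [fx - fl for fl, fx in zip(float_out, fixed_out)]
    n = len(errors)
    max_peak = max(abs(e) for e in errors)
    mean = sum(errors) / n
    mse = sum(e * e for e in errors) / n
    return max_peak, mean, mse

# Hypothetical outputs of the floating-point and fixed-point simulations.
float_out = [1.0, 2.0, 3.0, 4.0]
fixed_out = [1.0, 2.5, 2.5, 4.0]
max_peak, mean, mse = error_metrics(float_out, fixed_out)
```

In this example the mean error is zero although individual samples are off by 0.5, which shows why the mean square and max peak errors are tracked alongside the mean.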
Minimization algorithm
A minimization algorithm tries to find the global system solution (word length combination) that meets the specifications of the application while keeping hardware cost as low as possible.
In this work, four minimization algorithms are examined.
a) An exhaustive procedure tries all the word length combinations and keeps the one that meets the specifications while minimizing the hardware cost, defined as follows:
$$\mathrm{Cost} = \sum_{i=0}^{N-1} x_i \, WL_i$$
where WL_i is the word length of the i-th operand and x_i is the corresponding hardware cost per bit, defined in the range [0, 1]. Each x_i is fixed by the user. Other definitions of the hardware cost may be more appropriate and will be explored in the future. An exhaustive procedure rapidly becomes prohibitive for real-life problems, but it is useful on small problems, as well as for obtaining the true optimum for a given problem formulation. Knowing the true optimum makes it possible to assess the quality of the solutions produced by faster heuristic methods.
b) A min +1 bit procedure, which proceeds in three execution phases. The first phase finds, for each operand O_i, i = 0, 1, ..., (N-1), the minimum bit resolution that satisfies the specifications when all other operands O_j, j = 0, 1, ..., (N-1), j ≠ i, have an arbitrarily long word length (or floating-point format). The second phase sets the resolution of each operand to the value found in the first phase. The third phase is an iterative competition between the operands to gain one bit. A bit is temporarily assigned to each operand in turn, and the operand for which the error is minimized wins the right to keep the bit. This is repeated until the error specifications are met.
c) A max -1 bit procedure starts with the maximum bit resolution allowed by the fixed-point simulation tool for all operands, which is 32 bits in our case. The operands then compete to lose their bits as follows. One bit is temporarily removed from each operand O_i, i = 0, 1, ..., (N-1), in turn, while all other operands O_j, j = 0, 1, ..., (N-1), j ≠ i, remain unchanged. The operand O_i for which the error is minimized wins the right to lose a bit. The process is repeated for another bit as long as the error specifications are still met.
d) An evolutive procedure starts with all operands having floating-point precision. The bit resolution of a first operand, say O_i, is set to 0 and gradually increased until the system meets the specifications of the application. The value of WL_i is then fixed, and a second operand is analyzed in the same way. The user determines the order in which the operands are processed. This step is repeated until all WL_i are fixed.
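Procedures a), b) and c) can be sketched in Python on a toy problem. Everything here is illustrative: the error model `toy_error` (operand i contributes 2^-WL_i), the bound `SPEC`, the weighted-sum form of `hardware_cost`, and all function names are assumptions, and a real run would replace `toy_error` with a fixed-point simulation followed by the error computation step.

```python
from itertools import product

SPEC = 2.0 ** -4   # hypothetical error bound set by the application
LONG_WL = 32       # stand-in for "arbitrarily long" and for the 32-bit maximum

def toy_error(wls):
    """Hypothetical error model: operand i contributes 2**-WL_i."""
    return sum(2.0 ** -wl for wl in wls)

def hardware_cost(wls, cost_per_bit):
    """Assumed weighted-sum cost: sum of x_i * WL_i."""
    return sum(x * wl for x, wl in zip(cost_per_bit, wls))

def exhaustive(n, cost_per_bit, max_wl=8, error=toy_error, spec=SPEC):
    """a) Try every combination; keep the cheapest one that meets the spec."""
    feasible = [c for c in product(range(1, max_wl + 1), repeat=n)
                if error(c) <= spec]
    return min(feasible, key=lambda c: hardware_cost(c, cost_per_bit),
               default=None)

def min_plus_one(n, error=toy_error, spec=SPEC):
    """b) Per-operand minimum, then a competition to gain one bit at a time."""
    wls = []
    for i in range(n):                      # phase 1
        for wl in range(1, LONG_WL + 1):
            trial = [LONG_WL] * n
            trial[i] = wl
            if error(trial) <= spec:
                break
        wls.append(wl)                      # phase 2
    while error(wls) > spec and max(wls) < LONG_WL:   # phase 3
        trials = [wls[:i] + [wls[i] + 1] + wls[i + 1:] for i in range(n)]
        wls = min(trials, key=error)        # lowest-error operand keeps the bit
    return wls

def max_minus_one(n, error=toy_error, spec=SPEC):
    """c) Start at the maximum and remove one bit at a time while the spec holds."""
    wls = [LONG_WL] * n
    while True:
        trials = [wls[:i] + [wls[i] - 1] + wls[i + 1:]
                  for i in range(n) if wls[i] > 1]
        if not trials:
            return wls
        best = min(trials, key=error)       # lowest-error operand loses the bit
        if error(best) > spec:
            return wls                      # removing any further bit breaks the spec
        wls = best

weights = [1.0, 1.0, 1.0]
best_exhaustive = exhaustive(3, weights)
best_min = min_plus_one(3)
best_max = max_minus_one(3)
```

On this symmetric toy problem both heuristics reach the same cost as the exhaustive search, while the exhaustive search already has to evaluate 8^3 combinations for three operands limited to 8 bits.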
BENCHMARKS
The proposed method was applied to two well-known DSP benchmarks: the fifth-order elliptic filter and the Inverse Discrete Cosine Transform (IDCT). These two benchmarks were already implemented in the simulation system of W. Sung et al. These authors also provided optimum bit resolutions found by a manual procedure. The provided solution demonstrates the potential to reduce implementation cost while satisfying the error specifications.
IDCT
The Discrete Cosine Transform (DCT) is an effective transform coding technique for still images and video compression. It is used in the JPEG and MPEG standards. An Inverse Discrete Cosine Transform (IDCT) is used to convert transform-domain data back into the spatial domain.
A multiplier-adder based architecture that supports computation of the IDCT is illustrated in Figure 2.
Fifth order elliptic filter
The second benchmark is the fifth-order elliptic filter. This low-pass filter uses the coefficients defined in [15] and white noise with an upper bound of 2^15 as the input signal. The signal-to-noise ratio of the output signal is set to 40 dB for evaluating dissimilarities between the obtained and desired frequency responses. The elliptic filter is shown in Figure 3.
RESULTS
The results are compared with those obtained by W. Sung et al. [16]. These results show that the proposed optimization procedures either reach the optimum solution or come very close.
As shown in Figure 3, several adders and multipliers are needed in the elliptic filter architecture. The entries 10-11 and 10-12 in Table 1 mean that, to meet the error specification, some adders end up with 10 bits, while others have 11 and 12. The exhaustive algorithm finds the optimal solution for the IDCT, but its use is prohibitive on the elliptic filter.
The evolutive algorithm is the one that requires the smallest number of iterations. However, the optimum solution is not always reached because the user decides the order in which the operands are processed, and this order greatly influences the results. In some instances, the procedure does not converge to a solution.
The max -1 bit algorithm always finds a solution, albeit not always the optimal one. However, the number of iterations it uses is high compared to the other algorithms.
The min +1 bit algorithm appears to be good at rapidly reaching a quasi-optimum bit resolution combination.
CONCLUSION
A method for the automatic determination of optimum word length in fixed-point implementations of DSP algorithms has been proposed. This method has been applied to two well-known benchmarks, the fifth-order elliptic filter and the IDCT. Based on a comparison of four minimization algorithms from the standpoints of the number of iterations and implementation cost, fast searches appear promising for rapidly finding quasi-optimum word length combinations.
