Abstract-To reduce the gap between the VLSI technology capability and the designer productivity, design reuse based on IP (Intellectual Property) blocks is commonly used. In terms of arithmetic accuracy, the generated architecture can generally only be con-[...]

I. INTRODUCTION

The advance in VLSI technology offers the opportunity to integrate hardware accelerators and heterogeneous processors in a single chip (System on Chip) or to obtain FPGAs with several million gate equivalents. Thus, complex signal processing applications can now be implemented in embedded systems. Time-to-market pressure requires a shorter system development time, and thus high-level design tools are needed. To reduce the gap between the hardware complexity and the designer productivity, design reuse [1] based on IP (Intellectual Property) blocks has to be used.

To reduce cost and power consumption, fixed-point arithmetic is required. For an efficient hardware implementation, the chip size and the power consumption have to be minimized. Thus, the goal of this hardware implementation is to minimize the operator word-length as long as the desired accuracy constraint is respected.

From an arithmetic point of view, the available IP are limited. The IP user can only configure the input and output word-length and sometimes the word-length of some specific operators. The link between the application performance and the data word-length is not immediate. Moreover, the fixed-point design search space cannot be explored easily with this approach. Thus, the IP user must convert the application into fixed-point, but manual fixed-point conversion is a tedious, time-consuming and error-prone task.

In this paper, a new kind of IP is presented through the LMS (Least Mean Square) and Delayed-LMS (DLMS) examples. These IP are configurable according to an accuracy constraint influencing the algorithm quality. The IP user specifies the accuracy constraint and the operator word-lengths are automatically optimized. The optimal operator word-lengths, which minimize the architecture cost and respect the accuracy constraint, must be searched for. [...] be determined from the application performance through the technique presented in [2]. In our approach, analytical methods are used to evaluate the computation accuracy. Thus, [...] underlined with several experiments in Section IV.

II. LMS/DLMS ALGORITHM AND ARCHITECTURE

A. LMS and DLMS algorithms

The aim of adaptive filters is to estimate a sequence of scalars from an observation sequence filtered by a system whose coefficients vary. These coefficients converge towards the optimum coefficients, which minimize the mean square error (MSE) between the filtered observation signal and the desired sequence. This type of filter is used in different fields such as noise cancellation, equalization, linear prediction and channel estimation. The LMS-based algorithms are the most commonly used because their implementation in embedded systems is simpler than that of the RLS algorithm. The LMS adaptive algorithm, presented in Figure 1.a, estimates a sequence of scalars y_n from a sequence of N-length vectors x_n [3]. The linear estimate of y_n is w_n^t x_n, where w_n is an N-length weight vector which converges to the optimal vector w_opt. The vector w_n is updated according to the following equation

    w_{n+1} = w_n + \mu x_{n-D} e_{n-D}    with    e_n = y_n - w_n^t x_n    (1)

where \mu is a positive constant representing the adaptation step. The delay D is zero for the LMS algorithm and nonzero for the Delayed-LMS.

B. Generic LMS architecture

The generic architecture for the LMS/DLMS algorithm is presented in Figure 1.b. The architecture is made up of a filter part and an adaptation part, which computes the new coefficient values. To satisfy the throughput constraint, the filter part and the adaptation part can be parallelized. For the filter part, K multiplications are used in parallel and for the adaptation part [...]
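The update rule of equation (1) can be illustrated with a minimal floating-point sketch in Python. It is not the IP implementation: it assumes real-valued signals, a zero-initialized weight vector, and the textbook Delayed-LMS form in which both the input vector and the error used in the update are D samples old. The function name dlms, the test system w_opt and the step value are hypothetical choices made for the example.

```python
import random

def dlms(x, y, n_taps, mu, delay=0):
    """Delayed-LMS adaptive filter; delay=0 gives the plain LMS update."""
    w = [0.0] * n_taps           # weight vector w_n, assumed to start at zero
    history = []                 # pending (input vector, error) pairs
    errors = []
    for n in range(len(x)):
        # Input vector x_n = [x[n], x[n-1], ..., x[n-N+1]], zero-padded at start.
        xn = [x[n - k] if n - k >= 0 else 0.0 for k in range(n_taps)]
        # Error e_n = y_n - w_n^t x_n computed with the current weights.
        e = y[n] - sum(wi * xi for wi, xi in zip(w, xn))
        errors.append(e)
        history.append((xn, e))
        if len(history) > delay:
            # Use the pair recorded `delay` iterations ago for the update.
            xd, ed = history.pop(0)
            w = [wi + mu * ed * xi for wi, xi in zip(w, xd)]
    return w, errors

# Hypothetical noise-free system identification: recover a 4-tap FIR system.
random.seed(0)
w_opt = [0.5, -0.3, 0.2, 0.1]
x = [random.uniform(-1.0, 1.0) for _ in range(5000)]
y = [sum(w_opt[k] * (x[n - k] if n - k >= 0 else 0.0) for k in range(4))
     for n in range(len(x))]

w_lms, _ = dlms(x, y, n_taps=4, mu=0.05, delay=0)
w_dlms, _ = dlms(x, y, n_taps=4, mu=0.05, delay=2)
print("LMS  weights:", [round(v, 4) for v in w_lms])
print("DLMS weights:", [round(v, 4) for v in w_dlms])
```

With a small adaptation step both variants should approach w_opt in this setting. The delayed variant only changes which error and input vector feed the update, which is what lets the filter part and the adaptation part run in parallel in hardware.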
[...] where \eta_n is the noise associated with the term w'_n^t x'_n and depends on the way the filter is computed. The error in finite precision is given by

    e'_n = [...]    (5)

b) Noise power expression: The study is made at steady-state, once the filter coefficients have converged. The noise is measured at the filter output. The power of the error between the filter output in finite precision and in infinite precision is determined. It is composed of three terms:

    [...] = E[(\alpha_n^t w_n)^2] + E[(\rho_n^t x_n)^2] + E[\eta_n^2]    (8)

At steady-state, the vector w_n can be approximated by the optimum vector w_opt. So the term E[(\alpha_n^t w_n)^2] is equal to |w_opt|^2 (m_\alpha^2 + \sigma_\alpha^2), with |w_opt|^2 = \sum_i w_{opt,i}^2.

The second term is detailed in [6] and is equal to [...]

The last term E[\eta_n^2] depends on the specific implementation chosen for the filter output computation (filtered data).

The IP processing unit is based on a collection of operators extracted from a library. This library contains the arithmetic operators, the registers and the multiplexers for the different possible word-lengths. Each library element is automatically generated and characterized in terms of area and energy consumption from scripts for the Synopsys tools.

The IP architecture area and energy consumption are obtained by summing the area and energy consumption of the different basic elements. The elements correspond to the memory (coefficients w_n and input data x_n), the operators (multiplier, adder, subtracter), and the registers and multiplexers used inside the datapath.

C. Throughput constraint

The system must satisfy a given timing constraint to ensure real-time execution. The LMS architecture presented in Figure 1 and detailed in Section II-B is divided into two parts corresponding to the filter part and the adaptation part. Even if the Delayed-LMS algorithm has a slower convergence speed than the LMS algorithm, as the error is delayed, the filter part and the adaptation part can be computed in parallel, which gives it a higher execution frequency. The constraints become

    T_FIR < T_e    and    T_Adapt < T_e    (13)

The parallelism level is obtained by solving expressions (12) and (13). These expressions require the knowledge of the operator latency, which depends on the operator word-lengths. Thus, firstly, the operator word-lengths are optimized with K equal to 1. The obtained operator word-lengths make it possible to determine the operator latency. Secondly, K is computed from the throughput constraint and then, the [...]

The LMS and DLMS IP blocks have been used for different experiments to underline the necessity to optimize the operator [...]

The LMS and DLMS IP have been tested for different values of the throughput constraint T_e and the accuracy constraint SQNR_min. For each T_e and SQNR_min value, the operator and memory word-lengths are optimized under the accuracy constraint. Then, the architecture is generated. The architecture area, the parallelism level and the energy consumption are measured and the results are presented respectively in Figures 3.a, 3.b and 3.c. The operator library has been generated from the 0.18 \mu m technology from ST Microelectronics. The results are presented for a timing constraint between 60 ns and 170 ns and for an accuracy constraint between 30 dB and 90 dB.

The architecture area increases when the timing constraint decreases. Indeed, to respect this constraint, the parallelism level K must be higher. More operators are needed and thus the processing unit area is increased. The architecture costs (area, energy consumption) increase with the accuracy constraint. High values of the accuracy constraint require operators and data with a greater word-length. This increase in operator word-length raises the energy consumption and the area of the processing and memory units. Moreover, it increases the operator latency.

Compared to a classical approach, for the same computation accuracy, the architecture area and the energy consumption are reduced by 30 % and 23 % respectively. With our approach, the user can optimize the trade-off between the architecture cost, the accuracy and the execution time. Accuracy models have been defined for other specific applications like NLMS and APA [10]. Moreover, an automatic and generic floating-to-fixed-point conversion methodology is under development [9].
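The idea of choosing the smallest word-length that still meets an accuracy constraint SQNR_min can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the method of the paper: it swaps the analytical noise model for a brute-force simulation, uses one uniform number of fractional bits for data, coefficients and output, and applies it to a fixed FIR filter standing in for the converged filter. The names quantize, sqnr_db and min_word_length, and the test coefficients, are hypothetical.

```python
import math
import random

def quantize(v, frac_bits):
    """Round v to a fixed-point grid with `frac_bits` fractional bits."""
    scale = 1 << frac_bits
    return round(v * scale) / scale

def sqnr_db(w, x, frac_bits):
    """SQNR in dB of a fixed-point FIR output against the float reference."""
    n_taps = len(w)
    wq = [quantize(c, frac_bits) for c in w]   # quantized coefficients
    xq = [quantize(s, frac_bits) for s in x]   # quantized input data
    signal = noise = 0.0
    for n in range(n_taps - 1, len(x)):
        ref = sum(w[k] * x[n - k] for k in range(n_taps))
        acc = sum(wq[k] * xq[n - k] for k in range(n_taps))
        fix = quantize(acc, frac_bits)         # output rounded to the same grid
        signal += ref * ref
        noise += (ref - fix) ** 2
    if noise == 0.0:
        return math.inf
    return 10.0 * math.log10(signal / noise)

def min_word_length(w, x, sqnr_min_db, max_bits=32):
    """Smallest number of fractional bits meeting the SQNR constraint."""
    for bits in range(1, max_bits + 1):
        if sqnr_db(w, x, bits) >= sqnr_min_db:
            return bits
    return None  # constraint not reachable within max_bits

# Hypothetical test data: a 4-tap filter driven by uniform noise in [-1, 1].
random.seed(1)
w = [0.5, -0.3, 0.2, 0.1]
x = [random.uniform(-1.0, 1.0) for _ in range(2000)]

b30 = min_word_length(w, x, 30.0)
b60 = min_word_length(w, x, 60.0)
print("fractional bits for 30 dB:", b30)
print("fractional bits for 60 dB:", b60)
```

A tighter accuracy constraint returns a larger word-length, which mirrors the trend reported above: cost grows with SQNR_min. An analytical model such as the one in this paper avoids rerunning the simulation for every candidate word-length.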
