Abstract-We propose a novel discrete wavelet transform (DWT) architecture that is fully scalable, flexible, and modular. The architecture is bit serial and therefore has low hardware complexity and low power requirements. Nevertheless, because of its particular structure, it operates on-the-fly (i.e., it does not require wait cycles between consecutive input samples). Moreover, a very small hardware overhead allows the architecture to also compute the inverse DWT ("double-face" utilization). Hardware complexity and computing performance are analyzed in detail.
I. INTRODUCTION
The discrete wavelet transform (DWT) [1] - [3] is a mathematical technique that decomposes a signal in the time domain by using dilated/contracted and translated versions of a single basis function, named the prototype wavelet. In the last decade, the DWT has proved preferable to other traditional signal processing techniques, since it offers useful features such as inherent scalability, efficient computational complexity and VLSI implementation, low aliasing distortion for signal processing applications, and adaptive time-frequency windows. Hence, the DWT has been studied and applied in a wide range of fields: different branches of image and video processing [1], [4], signal processing techniques [5], speech compression/decompression [6], numerical analysis [7], biomedicine [8], etc. Many of these applications require real-time performance in order to achieve attractive results. Therefore, the implementation of the DWT by means of dedicated VLSI ASIC's has recently attracted the attention of a number of researchers (e.g., [9] - [15]).
In this paper, we propose a bit-serial architecture computing the DWT. The bit-serial processing mode has been widely adopted in DSP ASIC's (e.g., [16] - [25]) since it has many advantages over the parallel approach [26], such as a simpler communication strategy (single wires instead of data buses), a reduced number of pins, lower power requirements, lower hardware complexity, and the possibility of achieving very high throughput by pipelining at the bit level. Moreover, the bit-serial approach often allows regular internal structures which are suitable for VLSI implementation. The removal (total or partial) of "wait-cycles" between two consecutive input samples is a key to increasing the achievable throughput in bit-serial signal processors. In this context, some convolvers have already been designed [20] - [25]. Here, we introduce the first (to the best of our knowledge) architecture which totally avoids the need for wait cycles in the bit-serial computation of the DWT. As a consequence, the proposed architecture can operate at the same rate as the effective data input and with a short latency (on-the-fly computation). In addition, the bit-serial approach is exploited to double the functionality of the device: a negligible hardware overhead (only a few bit-serial multiplexers and control lines) allows the architecture to also compute the inverse DWT (IDWT). As a result, the derived "double-face" architecture can perform either analysis or synthesis.
Manuscript received July 24, 1998; revised September 1999. This paper was recommended by Associate Editor J. Astola.
The author is with the Dipartimento di Elettrotecnica ed Elettronica, Facoltà di Ingegneria, Politecnico di Bari, 70125 Bari, Italy.
Publisher Item Identifier S 1057-7130(00)00581-4.
II. THE ARCHITECTURE

A. Forward DWT
One level of forward DWT [1] - [3] can be seen as a decomposition of an input sequence $x$ (having $N$ samples) into two subbands $a$ and $c$ (both having $N/2$ coefficients), performed by means of two convolutions followed by decimation:
$$a_n = \sum_{i=0}^{M_L-1} h_i\, x_{2n-i}, \qquad c_n = \sum_{i=0}^{M_H-1} g_i\, x_{2n-i}. \tag{1}$$
In the above equation, $h$ and $g$ are, respectively, the $M_L$-tap low-pass and the $M_H$-tap high-pass analysis filters.
Since each level of decomposition generates subbands by means of convolution followed by decimation, our basic idea is to perform the decimation directly, by not computing the odd-indexed output samples at all. We exploit the clock cycles saved in this way to generate the additional bits due to the format extension occurring in the products in (1). In this way, we achieve a full-precision data representation, which may be required by applications needing perfect reconstruction.
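The idea of decimating directly can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the function names are hypothetical, samples outside the input are treated as zero (the boundary handling is not specified here), and the unnormalized Haar-like filters in the usage example are chosen only for easy arithmetic.

```python
def filter_sample(x, f, m):
    """Return y[m] = sum_i f[i] * x[m - i], taking x as zero outside its support (assumption)."""
    return sum(f[i] * x[m - i] for i in range(len(f)) if 0 <= m - i < len(x))


def analysis_level(x, h, g):
    """One decomposition level that computes only the even-indexed filter outputs."""
    half = len(x) // 2
    a = [filter_sample(x, h, 2 * n) for n in range(half)]  # low-pass subband
    c = [filter_sample(x, g, 2 * n) for n in range(half)]  # high-pass subband
    return a, c


def filter_then_decimate(x, f):
    """Reference: compute every output sample, then discard the odd-indexed ones."""
    full = [filter_sample(x, f, m) for m in range(len(x))]
    return full[::2]


x = [1, 2, 3, 4, 5, 6, 7, 8]
h = [1, 1]    # hypothetical low-pass taps
g = [1, -1]   # hypothetical high-pass taps
a, c = analysis_level(x, h, g)
```

Both routes give the same subbands, but the direct one performs half of the multiply-accumulate operations, which is the saving that the architecture converts into spare clock cycles.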
The proposed architecture consists of as many elementary modules (EM's) as there are levels in the decomposition scheme.
1) First Level of Decomposition: The EM computing the first level of decomposition is represented in Fig. 1. It is composed of two semisystolic arrays, $A_1$ and $A_2$, which generate, respectively, $c$ and $a$. $A_1$ and $A_2$ differ only in the number of processing elements (PE's), which is, respectively, $M_H$ and $M_L$ (in the example of Fig. 1
The bit-serial adders consist of full adders whose carry-out is suitably delayed and fed back. As for the multipliers, we can efficiently adopt the serial-parallel multipliers shown in Fig. 2, since only the samples of $x$ are input bit-serially. Conversely, the filter weights are constant during the processing and can be preloaded and stored once and for all. From now on, we refer to the temporal diagram of Fig. 3(a) and consider an 8-bit quantization ($Q = 8$) for both the input samples and the filter coefficients. The extension of the model to higher precision is straightforward.
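The behavior of a bit-serial adder with carry feedback can be sketched in Python as follows. The sketch assumes LSB-first streams and unsigned operands; the helper names are hypothetical and the code models behavior only, not the circuit of Fig. 1.

```python
def to_bits(value, width):
    """LSB-first list of `width` bits of a non-negative integer."""
    return [(value >> k) & 1 for k in range(width)]


def from_bits(bits):
    """Inverse of to_bits."""
    return sum(b << k for k, b in enumerate(bits))


def bit_serial_add(a_bits, b_bits):
    """Add two LSB-first bit streams, producing one sum bit per clock cycle."""
    carry = 0  # carry flip-flop, assumed cleared before each new word
    out = []
    for a, b in zip(a_bits, b_bits):
        out.append(a ^ b ^ carry)                    # sum output of the full adder
        carry = (a & b) | (carry & (a ^ b))          # carry-out, fed back one cycle later
    return out


p = 255 * 255  # largest product of two unsigned 8-bit values, 16 bits wide
sum_16_cycles = from_bits(bit_serial_add(to_bits(p, 16), to_bits(p, 16)))
sum_17_cycles = from_bits(bit_serial_add(to_bits(p, 17), to_bits(p, 17)))
```

With only 16 cycles the final carry is lost, while one extra cycle yields the exact sum. This is the reason the adders need spare cycles after each 16-bit product.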
During clock cycles (cc's) 0-7, $x_0$ is bit-serially input to the EM through the line $L_1$ [upper side of axis 3 in Fig. 3(a)].
From axes 4-7 and 10-13 in Fig. 3(a), it is evident that the decimation has been performed directly, because only the products $g_i x_{2n-i}$ and $h_i x_{2n-i}$ required by (1) have been computed, by $A_1$ and $A_2$, respectively. These products must be added according to (1) in order to compose the subbands $a$ and $c$. Unfortunately, these sums cannot be obtained trivially. Let $M_J$ ($J = 0, 1, 2, 3$) be a generic multiplier. It outputs the 16 bits of a given product (e.g., $h_J x_i$, supposing $M_J$ is in $A_2$) and, immediately after, the 16 bits of another product (e.g., $h_J x_{i+2}$). Therefore, if only one adder circuit $\Sigma_J$ were cascaded with $M_J$, $\Sigma_J$ would have to process the 16 bits of $h_J x_i$ and, immediately after, the bits of $h_J x_{i+2}$. As a consequence, the additional bits due to carry propagation in the first sum would conflict in $\Sigma_J$ with the least significant bits of the second sum. This drawback is visible on the upper and lower sides of axis 8, where the data partially overlap in the additional bits generated by carry propagation. In bit-serial devices, this problem is generally solved by introducing wait-cycles between two consecutive input samples. However, wait-cycles drastically reduce the achievable throughput. The solution that we suggest therefore consists of providing each multiplier $M_J$ with a pair of output lines, each connected to a single adder ($\Sigma_J$ and $\Sigma'_J$). The upper and lower sides of axes 4-7 and 10-13 show how the data produced by $M_J$ are interleaved on the output lines by means of two control signals, $S_0$ (for even $J$) and $S_1$ (for odd $J$), which are shown on axes 1 and 2 of the temporal diagram in Fig. 3(a). As a result, a gap of 16 "void" cc's is created between two consecutive data fed into the same adder: this gap allows the additional bits (at most 16) generated by the sum in (1) to be output correctly. For the sake of clarity, these additional bits are filled in black on axes 8 and 14 in Fig. 3(a). Note that, even though from a theoretical point of view the number of additional bits might be any value not greater than $\lceil \log_2(\sum_i |g_i|) \rceil$ in $A_1$ (not greater than $\lceil \log_2(\sum_i |h_i|) \rceil$ in $A_2$), in practice, due to the particular shape of $g$ and $h$ in the most popular wavelet bases, it is not greater than two.
In Fig. 3(a) (axes 4-7 and 10-13), solid arrows join the data which have to be added according to (1). It is evident that the output lines of each PE must be suitably delayed by $Q$ cc's between two consecutive adders (Fig. 1). Moreover, from the upper and lower sides of axes 4-7 and 10-13, we derive that the connectivity of consecutive PE's has a period of two PE's, in the sense that the lines $O_{2i+1}$ and $O'_{2i+1}$ (i.e., the outputs of the adders $\Sigma_{2i+1}$ and $\Sigma'_{2i+1}$ in $PE_{2i+1}$; $i = 0, 1, \ldots$) must be fed, respectively, into the lines $I'_{2i}$ and $I_{2i}$ (related to the adders $\Sigma'_{2i}$ and $\Sigma_{2i}$ in $PE_{2i}$), and that the lines $O_{2j}$ and $O'_{2j}$ (i.e., the outputs of the adders $\Sigma_{2j}$ and $\Sigma'_{2j}$ in $PE_{2j}$; $j = 1, 2, \ldots$) must be fed, respectively, into the lines $I_{2j-1}$ and $I'_{2j-1}$ (related to the adders $\Sigma_{2j-1}$ and $\Sigma'_{2j-1}$ in $PE_{2j-1}$). This connectivity is shown in Fig. 1.
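The effect of steering consecutive products to two adders can be checked with a small scheduling sketch in Python. It is a simplified model under assumptions: products are taken to leave a multiplier back to back starting at cycle 0, and the names `route_products`, `idle_gaps`, `first`, and `second` are hypothetical labels for the two adders attached to one multiplier.

```python
def route_products(num_products, q=8):
    """Assign consecutive 2Q-bit products of one multiplier alternately to two adders."""
    width = 2 * q  # a Q-bit sample times a Q-bit weight gives a 2Q-bit product
    schedule = []
    for k in range(num_products):
        start = k * width                      # products are emitted back to back
        adder = "first" if k % 2 == 0 else "second"
        schedule.append((adder, start, start + width - 1))
    return schedule


def idle_gaps(schedule, adder):
    """Number of void clock cycles between consecutive products seen by one adder."""
    slots = [(s, e) for name, s, e in schedule if name == adder]
    return [slots[i + 1][0] - slots[i][1] - 1 for i in range(len(slots) - 1)]


schedule = route_products(6, q=8)
gaps_first = idle_gaps(schedule, "first")
gaps_second = idle_gaps(schedule, "second")
```

For $Q = 8$ each adder sees 16 void cycles after every product, which matches the gap described in the text and leaves room for the carry bits without stalling the input stream.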
An efficient technique for easily managing negative terms is proposed in [27] .
2) Further Levels of Decomposition: The effectiveness of a DWT architecture is also evaluated by its scalability, i.e., its capability of computing more levels of decomposition. High modularity can be achieved in a straightforward way, since further levels of decomposition can be performed by means of EM's identical to the one described for the first level.
If the particular application requires the intermediate wavelet coefficients to have the same precision as the input samples (i.e., $Q$), gaps of $Q$ cc's (in the example used, $Q = 8$) must be interspersed with the intermediate wavelet coefficients fed into the second level, because of the decimation. Therefore, the clock signal in this level has to be forced to 0 ("frozen") during the input of these gaps.² Such "frozen" clock cycles are denoted by the dotted segments on axes 9 and 15 in Fig. 3(a) (in the example shown, $Q = 8$). More generally, a decomposition in $L$ levels can be implemented by a tree of $L$ EM's having, at level $l$ ($l = 2, 3, \ldots, L$), the clock "frozen" while gaps of $(2^{l-1} - 1)Q$ cc's are input between two consecutive samples. As an example, Fig. 4(a) shows an architecture implementing a 3-level dyadic tree DWT (all the I/O and routing lines are bit-serial).
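The gap length per level can be tabulated with a one-line Python helper. The function name is hypothetical, and the value returned for level 1 (no gap) is an extension of the stated formula, which the text gives only for levels 2 and above.

```python
def frozen_gap_cycles(level, q):
    """Clock cycles of gap between consecutive samples at a given level: (2^(level-1) - 1) * Q."""
    if level < 1:
        raise ValueError("level must be at least 1")
    return (2 ** (level - 1) - 1) * q


# Gaps for a 3-level tree with 8-bit words.
gaps = {level: frozen_gap_cycles(level, 8) for level in (1, 2, 3)}
```

For $Q = 8$ the second level sees gaps of 8 cc's, in agreement with the example above, and the third level sees gaps of 24 cc's.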
A last remark concerns the control signals $S_0$ and $S_1$: at each level they follow the timing of the input samples and can be generated by $L$ local controllers (one for each level). In these controllers, $S_0$ is produced by dividing by $2Q$ the frequency of the clock used in that specific level, and $S_1$ is obtained by delaying $S_0$ by $Q$ cc's.
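A behavioral Python sketch of such a local controller is given below. The polarity and starting phase of the signals, as well as the value of the delayed signal during the first $Q$ cycles, are assumptions made for illustration; only the period of $2Q$ cycles and the delay of $Q$ cycles come from the text.

```python
def control_signals(num_cycles, q=8):
    """Return (s0, s1): s0 is a square wave of period 2Q cycles, s1 is s0 delayed by Q cycles."""
    s0 = [(t // q) % 2 for t in range(num_cycles)]  # Q cycles low, then Q cycles high (assumed phase)
    s1 = [s0[t - q] if t >= q else 0 for t in range(num_cycles)]  # assumed 0 before the delay line fills
    return s0, s1


s0, s1 = control_signals(8, q=2)
```

Once the delay line has filled, the delayed signal is the complement of the first one, so a single divider and a $Q$-stage delay suffice for each level.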
In [28], two other strategies for computing further levels of decomposition are provided. These strategies aim, respectively, at holding full precision for the intermediate wavelet coefficients and at reducing the hardware complexity. In particular, the hardware complexity can be halved with respect to the modular implementation described above. In fact, the gaps coming from the previous level can be exploited in order to compute both the high- and low-pass subbands by means of only one array of PE's [28].
² Alternatively (but at the cost of modularity), the gaps generated in level $l$ can be treated as "wait-cycles" so that only one adder (instead of a pair) is pipelined to each multiplier in level $l + 1$.
B. IDWT
The IDWT performs synthesis on the oversampled subbands produced by the forward DWT. It can be shown (e.g., [28]) that the even- and odd-numbered samples of the reconstructed sequence $x'$ can be formalized as in (2), where $N_L$ and $N_H$ are, respectively, the generic lengths of the low- and high-pass synthesis filters (i.e., $h'$ and $g'$), and $\lceil y \rceil$ and $\lfloor y \rfloor$ denote rounding to the lowest integer not lower than $y$ and to the highest integer not higher than $y$, respectively. Therefore, the only conceptual difference between the forward and inverse cases is channel "splitting" and "combining" [5].
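The split into even- and odd-numbered output samples follows from a standard identity: filtering a zero-interleaved sequence uses only the even-indexed taps for even outputs and only the odd-indexed taps for odd outputs. The Python sketch below checks this identity for one subband. It is not a transcription of (2), whose exact indexing may differ; the function names and the zero boundary handling are assumptions.

```python
def upsample_and_filter(sub, f):
    """Reference: insert a zero after every coefficient, then convolve with f."""
    up = []
    for v in sub:
        up.extend([v, 0])
    return [
        sum(f[k] * up[m - k] for k in range(len(f)) if 0 <= m - k < len(up))
        for m in range(len(up))
    ]


def polyphase_filter(sub, f):
    """Same result, computed separately for even and odd outputs without the inserted zeros."""
    even_taps, odd_taps = f[0::2], f[1::2]  # ceil(len(f)/2) and floor(len(f)/2) taps

    def branch(taps, n):
        return sum(taps[i] * sub[n - i] for i in range(len(taps)) if 0 <= n - i < len(sub))

    out = []
    for n in range(len(sub)):
        out.append(branch(even_taps, n))  # output sample 2n
        out.append(branch(odd_taps, n))   # output sample 2n + 1
    return out


sub = [1, 2, 3, 4]
f = [1, 2, 3, 4, 5]  # hypothetical 5-tap synthesis filter
direct = upsample_and_filter(sub, f)
split = polyphase_filter(sub, f)
```

The reconstructed sequence is obtained by adding the contributions of the two subbands, each filtered in this way. The tap counts of the two branches are where the ceiling and floor of half the filter length come from.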
In this scenario, the structures $A_1$ and $A_2$ in Fig. 1 can also be used to compute, respectively, the sequences $x'_{odd}$ and $x'_{even}$ according to (2). The inverse functionality of the architecture for $N_L = N_H = 4$ is exhaustively described in [28] and can be easily derived by means of the temporal diagram of Fig. 3(b). Note that, in the case of inverse processing:
1) the I/O data and the filter coefficients to be used are those denoted in square brackets in Fig. 1;
2) the control signals $S_0$ and $S_1$ are delayed by $Q$ cc's with respect to the direct processing [compare axes 1-2 in Fig. 3(a) and (b)].
III. "DOUBLE-FACE" BEHAVIOR
The flexibility of the EM in Fig. 1 suggests designing a single architecture which is suitable for both the forward DWT and the IDWT ("double-face" behavior). The EM's of such a "double-face" architecture need a correct dimensioning and suitable interconnections.
A. Dimensioning of the Arrays
The arrays of a "double-face" EM need a suitable number of PE's in order to perform forward as well as inverse processing. In fact, even though, for the sake of simplicity, in Fig. 1 
B. Interconnections
The networks interconnecting the EM's are quite different in forward and in inverse trees [Fig. 4(a) and (b)]. Therefore, both types of connection have to be integrated in a "double-face" device. The resulting network is configured case by case according to the actual utilization (analysis or synthesis). It is worth noting that this configurable network does not need complex circuitry, since it requires only bit-serial lines.
As an example, Fig. 5 shows a "double-face" architecture that can be used for analysis as well as for synthesis. Each connection between two adjacent EM's needs one suitable delay and five bit-serial 2:1 multiplexers. Two of these multiplexers (i.e., the shadowed ones) select an input line (respectively, "0" in inverse mode and "1" in forward mode) and do not switch their configuration for the whole duration of the processing. The other three multiplexers effectively work only during the inverse processing: they pass the two input lines to the output in a suitably interleaved fashion.
IV. PERFORMANCE EVALUATIONS
In this section, we evaluate the hardware complexity $H(\cdot)$, the latency $L(\cdot)$, and the throughput $T(\cdot)$ of an architecture $A_D$ implementing an $L$-level dyadic tree. Let $A_D$ be realized according to the most modular implementation, i.e., composed of $L$ identical EM's. In order to make the length of the critical paths independent of the number of levels, we insert a D-latch on each bit-serial line between two adjacent EM's.
Since we aim at measures that are independent of the specific integration technology, $H(\cdot)$, $L(\cdot)$, and $T(\cdot)$ are evaluated, respectively, in terms of number of transistors, number of cc's, and inverse of number of cc's. These "figures of merit" can be converted into "abstract values" of silicon area, time, and frequency for a specific implementation when multiplied by the average area needed by a transistor, by the period of one cc, and by the reciprocal of that period, respectively. Note that, due to the silicon area needed by the routing and because of some physical phenomena (e.g., load capacitances, fan-in and fan-out, etc.), the "actual values" for a specific integration technology may differ from the derived "abstract values" and, therefore, can be known only by means of a detailed analysis of the implementation.
A. Hardware Complexity
With respect to the analysis performed in Section II, H(A d ) can be Fig. 6 . Similar characteristics for architectures implementing balanced trees are provided in [28] .
B. Computing Performances
Because of the insertion of D-latches between two adjacent modules, the clock period is independent from L. 
The performance given by (7) and (8) depends on the implementation technology. To give an idea, a standard-cell-based implementation in 0.5-µm Alcatel-Mietec CMOS technology can achieve
As latency $L(\cdot)$, we consider the delay between the input of the LSB of the first input sample and the output of the MSB of the first datum produced at the last level in one period of computation. Therefore, by taking into account the functionality of our architecture [diagrams in Fig. 3(a) and (b)], we obtain
where $K'$ denotes either $M_L$ or $M_H$ if we are considering the latency of $a$ or of $c$, respectively, and $K''$ denotes either $\lceil N_H/2 \rceil + \lceil N_L/2 \rceil$ or $\lfloor N_H/2 \rfloor + \lfloor N_L/2 \rfloor$ if we are considering the latency of $x'_{even}$ or of $x'_{odd}$, respectively. The term $L - 1$ is due to the latches inserted between two consecutive modules. Equation (8) is plotted in Fig. 7 for different values of the parameters. As an example, by using the above-mentioned CMOS technology and considering typical values of $K'$, $K''$, $Q$, $L$, and , (i.e., 7, 7, 8, 4, and 1, respectively), we obtain $L(A_D) = 3.94$ µs (forward processing) and $L(A_D) = 4.41$ µs (inverse processing).
V. CONCLUSION
In this paper, a versatile bit-serial VLSI architecture computing the forward as well as the inverse DWT has been proposed. The most attractive features of the proposed architecture are: 1) it has a very compact and regular structure and, therefore, can be easily implemented either as part of a larger processor or as an individual, efficient unit; 2) it does not depend on the implemented wavelet and is highly flexible with respect to the length of the filters and the width of the words; 3) its hardware complexity and power requirements are reduced with respect to other known architectures operating in parallel, since it processes the data in bit-serial mode; 4) it performs an on-the-fly computation (no wait cycles are required) and, therefore, even though it is bit-serial, it shows a very short latency and achieves a high throughput that is independent of the parameters of the specific computation (number of levels, length of the filters, etc.); 5) the on-the-fly computation makes it unnecessary to store intermediate results between two consecutive levels; and 6) a pair of devices can operate "in tandem" (i.e., the first one performing analysis and the second one performing synthesis).
Characteristics independent of the integration technology have been derived and plotted in terms of complexity, throughput, and latency. As an example, a standard-cell-based implementation in 0.5-µm Alcatel-Mietec CMOS technology can reach a throughput of $17 \times 10^6$ samples/s (8 bits of quantization for the input sequence), and can compute four levels of dyadic decomposition using 7-tap filters (8 bits of quantization), showing a latency of 3.94 µs.
Autoregressive Parameter Estimation from Noisy Data
Wei Xing Zheng
Abstract-A least-squares based method for noisy autoregressive signals has been developed recently, which needs to neither prefilter noisy data nor perform parameter extraction. In this brief, a more computationally efficient procedure for estimating the measurement noise variance is developed, and then an efficient implementation of the method is presented. It is shown that this better way of implementation can considerably reduce the computational requirement of the least-squares based method without any performance degradation. Computer simulations that support the theoretical findings are given.
Index Terms-Autoregressive signals, algorithm implementation, least-squares method, parameter estimation.
I. INTRODUCTION
Autoregressive (AR) models are extensively used in many application areas of signal processing, such as speech analysis, radar, spectral estimation, and noise cancellation [7] , [5] , [12] . Compared with noise-free AR modeling, estimating the AR parameters from noisy data is a much more important problem in practice [6] . Due to the presence of noise, the performance of estimated AR models can be severely deteriorated. For example, the standard least-squares (LS) and related approaches will give rise to biased estimates of the AR parameters when the measured signal is corrupted with noise. One important application area of noisy AR modeling is in speech processing, where the speech signal may be subject to background noise, which cannot be ignored. Several efficient approaches that are typically used in speech processing have been developed for estimating the AR parameters from noisy data (e.g., see [9] , [8] ).
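The bias mentioned above can be made concrete with a first-order example in Python. The sketch is a textbook illustration, not the method of this brief, and it rests on stated assumptions: an AR(1) signal $x_t = a\,x_{t-1} + e_t$ (the sign convention is an assumption), additive white measurement noise uncorrelated with the signal, and the large-sample limit in which the LS estimate equals the ratio of the lag-1 to the lag-0 autocovariance of the noisy data. The function name is hypothetical.

```python
def asymptotic_ls_estimate(a, sigma_e2, sigma_v2):
    """Large-sample LS estimate of the AR(1) parameter from noisy data y = x + v."""
    r0 = sigma_e2 / (1.0 - a * a)  # lag-0 autocovariance of the noise-free AR(1) signal
    r1 = a * r0                    # lag-1 autocovariance, unaffected by white noise
    return r1 / (r0 + sigma_v2)    # the noise variance inflates only the lag-0 term


a_true = 0.8
signal_power = 1.0 / (1.0 - a_true * a_true)
noise_free = asymptotic_ls_estimate(a_true, 1.0, 0.0)
at_0_db_snr = asymptotic_ls_estimate(a_true, 1.0, signal_power)
```

Without noise the estimate equals the true parameter, whereas at a signal-to-noise ratio of 0 dB it shrinks to half of it. The estimate is therefore pulled toward zero no matter how many samples are used.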
The existing methods used to estimate AR signals from noisy measurements can be divided into two main categories: autoregressive moving average (ARMA) model-based estimation and AR model-based estimation. The basic idea of ARMA model-based estimation is to represent the underlying noisy AR model of order $p$ by an ARMA($p$, $p$) model, and then to estimate the AR parameters by use of the introduced ARMA model. In this category there are, for instance, the modified Yule-Walker (MYW) equations method [6], the maximum likelihood (ML) method [14], and the recursive prediction error (RPE) method [10]. The MYW method is perhaps the simplest algorithm from the computational point of view, but it exhibits poor estimation accuracy and relatively low efficiency due to the use of large-lag autocovariance estimates. Moreover, numerical instability may occur when the MYW method is used in on-line parameter estimation. In spite of its guaranteed estimation consistency, the ML method is not only rather computationally demanding but also confined to off-line identification. While the RPE method is more numerically efficient than the ML method and is suited for on-line estimation, its use of the Gauss-Newton algorithm in the minimization causes intensive computations. The AR model-based estimation works with the underlying noisy AR model and employs
Manuscript received March 24, 1999; revised October 1999. This work was supported in part by a research grant from the Australian Research Council and in part by a research grant from the University of Western Sydney, Nepean,
