Abstract - This paper proposes a novel VLSI architecture that computes the DWT (discrete wavelet transform) coefficients using Mallat's algorithm with reduced complexity. We study the commonality embedded in the mirror filters of the algorithm and use a PLA as an Address Generator (PAG) to load the data for the cascaded FIR computation. By embedding the down-sampling process in the control signal design, we reduce the complexity in both storage and computation. The prototype design was implemented and fabricated in the AMI 1.5 micron CMOS process through MOSIS.
I. INTRODUCTION
The Discrete Wavelet Transform (DWT) is a very useful tool in time-frequency analysis because of its excellent localization in both time and frequency [1]. It has been very successful in several areas such as image compression, communication and denoising. The DWT is a good alternative to the FFT (Fast Fourier Transform) in most applications. As with DCT and FFT ASICs, there is strong demand for efficient VLSI implementation architectures for the DWT that can bring it into real-world high-speed ICs.
Mallat's pyramid algorithm is considered the most important algorithm for calculating DWT coefficients, and it plays a role in the wavelet transform similar to that of the FFT in the Fourier transform. Mallat's algorithm is essentially a collection of cascaded FIR filtering operations by a pair of mirror filters, followed by a down-sampling procedure in each scale. The problem is stated briefly as follows: given a signal {x(n)}, we are expected to generate the corresponding set of wavelet coefficients.
In this paper, we first describe Mallat's DWT algorithm. We then analyze its complexity and study the commonality in the algorithm, which reveals embedded redundancy in the mirror FIR filtering and down-sampling process. Based on these observations, we derive a VLSI architecture with reduced computational and storage complexity compared to the original algorithm. A Programmable Logic Array (PLA) Address Generator (PAG) generates the index and control signals that load the data into a MAC (Multiplier Accumulator) unit. The prototype design is implemented in the AMI 1.5 micron CMOS process.
II. DWT & MALLAT'S PYRAMID ALGORITHM
According to wavelet theory, a time series f(t) can be approximated by the projection of its smooth version onto the scaling subspace and the projection of its detailed versions onto the wavelet subspaces, as in [1]. The scaling and wavelet filter coefficients are computed from specific wavelet construction methods, and they have a symmetry property that, for filters of length L, makes the two filters mirror filters of each other: they differ only in the index order and sign of their coefficients. In Fig. 1, the Daub-4 wavelet is used as an example to demonstrate the mirror filters.
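As a minimal sketch of the mirror property, the following Python fragment builds the Daub-4 scaling filter and derives a wavelet filter from it. The relation g(n) = (-1)^n h(L-1-n) used here is one common convention and is an assumption of this sketch, as are the names `h`, `g` and `dot`; the indexing used in the paper's own equation may differ.

```python
import math

# Daubechies-4 scaling (low-pass) filter coefficients, L = 4.
s3 = math.sqrt(3.0)
norm = 4.0 * math.sqrt(2.0)
h = [(1 + s3) / norm, (3 + s3) / norm, (3 - s3) / norm, (1 - s3) / norm]
L = len(h)

# Assumed mirror relation: reverse the index order and alternate the sign.
g = [(-1) ** n * h[L - 1 - n] for n in range(L)]

def dot(a, b):
    """Inner product of two equal-length coefficient lists."""
    return sum(u * v for u, v in zip(a, b))
```

Under this convention the wavelet filter reuses exactly the same L stored values as the scaling filter, which is the commonality exploited later in the paper.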
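The computation consists of two steps applied recursively. Initially, the original signal f(n) of length N is filtered by the pair of mirror filters, and the filter outputs are down-sampled to give the scaling and wavelet coefficients of the first scale.
In the next scale computation, the same two steps are applied to the scaling coefficients produced by the previous scale.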
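The recursion can be sketched in Python as follows. This is a reference model of the unoptimized algorithm (filter everything, then discard every second output), not the proposed architecture. The periodic boundary extension, the correlation-style indexing, the mirror relation used to build `g`, and all function names are assumptions made for this illustration.

```python
import math

def analysis_step(x, h, g):
    """One scale: filter x with h and g, then keep every second output."""
    n, L = len(x), len(h)
    # Periodic extension at the block boundary is an assumption of this sketch.
    low = [sum(h[k] * x[(m + k) % n] for k in range(L)) for m in range(n)]
    high = [sum(g[k] * x[(m + k) % n] for k in range(L)) for m in range(n)]
    return low[::2], high[::2]  # explicit down-sampling by two

def mallat_dwt(x, h, g, scales):
    """Apply analysis_step recursively to the scaling coefficients."""
    c, details = list(x), []
    for _ in range(scales):
        c, d = analysis_step(c, h, g)
        details.append(d)
    return c, details

# Daub-4 filters; g follows the assumed relation g(k) = (-1)^k h(L-1-k).
s3, norm = math.sqrt(3.0), 4.0 * math.sqrt(2.0)
h = [(1 + s3) / norm, (3 + s3) / norm, (3 - s3) / norm, (1 - s3) / norm]
g = [(-1) ** k * h[len(h) - 1 - k] for k in range(len(h))]

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
c, details = mallat_dwt(x, h, g, scales=2)
```

With an orthogonal filter pair and periodic extension, the energy of the input block equals the total energy of the returned coefficients, which gives a simple sanity check for the model.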
III. COMPLEXITY ANALYSIS AND OPTIMIZATION
For a VLSI ASIC design constrained in both area and complexity, the key challenge is the complexity of storage and computation. In general, each scale requires two FIR computations with filters of the same length, which can be expressed in matrix form. Computing the direct multiplication of these large matrices is clearly naive, since the matrix H is highly redundant: it has size (N+L-1)*(N+L-1) but contains only L distinct coefficients. We make the following observations: 1) the two filters are mirror filters and differ only in the index and sign of their coefficients; 2) the down-sampling procedure after the FIR introduces redundancy in the computation, and half of the resources are wasted computing coefficients that we do not want; 3) applying the standard FIR operation requires an extra decimation procedure to pick the desired coefficients, as well as storage for the intermediate coefficients.
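The redundancy introduced by down-sampling can be illustrated with a short Python sketch that counts multiplications. Both functions below are hypothetical helpers written for this illustration; they assume periodic extension of the data block, so the counts depend on that boundary assumption and are not a reproduction of the figures in Table I.

```python
def fir_all_outputs(x, h):
    """Standard FIR: compute every output, to be decimated afterwards."""
    n, L = len(x), len(h)
    outputs, mults = [], 0
    for m in range(n):
        acc = 0
        for k in range(L):
            acc += h[k] * x[(m + k) % n]
            mults += 1
        outputs.append(acc)
    return outputs, mults

def fir_even_outputs(x, h):
    """Embedded decimation: compute only the outputs that are kept."""
    n, L = len(x), len(h)
    outputs, mults = [], 0
    for m in range(0, n, 2):  # step of two replaces the separate decimator
        acc = 0
        for k in range(L):
            acc += h[k] * x[(m + k) % n]
            mults += 1
        outputs.append(acc)
    return outputs, mults

x = [1, 2, 3, 4, 5, 6, 7, 8]   # N = 8
h = [1, 2, 3, 4]               # placeholder integer coefficients, L = 4
full, full_mults = fir_all_outputs(x, h)
kept, kept_mults = fir_even_outputs(x, h)
```

The retained coefficients are identical in both cases, while the second version needs half the multiplications and no storage for discarded intermediate outputs.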
For a detailed study, we expand the expressions of only those scaling/wavelet coefficients of interest, together with the relation between the two filters. Recall that in VLSI design an adder can have a one-bit control input "sub" that switches between addition and subtraction. So if we know the index and sign of each filter coefficient at each computation cycle, we can obtain the correct results for both filters.
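A minimal Python sketch of this idea is given below: a single stored coefficient set serves both filters, and a per-step (index, sub) pair selects the coefficient and the add/subtract mode. The schedules assume the mirror relation g(k) = (-1)^k h(L-1-k), and the names `mac_step`, `coefficient`, `scaling_schedule` and `wavelet_schedule` are hypothetical.

```python
def mac_step(acc, coeff, sample, sub):
    """One multiply-accumulate step; sub=1 turns the addition into a subtraction."""
    product = coeff * sample
    return acc - product if sub else acc + product

def coefficient(x_window, h, schedule):
    """Accumulate L products using (h_index, sub) pairs from the schedule."""
    acc = 0
    for (h_index, sub), sample in zip(schedule, x_window):
        acc = mac_step(acc, h[h_index], sample, sub)
    return acc

h = [1, 2, 3, 4]  # placeholder integer coefficients, stored once
L = len(h)

# Scaling filter: coefficients in natural order, always added.
scaling_schedule = [(k, 0) for k in range(L)]
# Wavelet filter (assumed relation): reversed order, odd steps subtracted.
wavelet_schedule = [(L - 1 - k, k % 2) for k in range(L)]

window = [5, 6, 7, 8]
c_value = coefficient(window, h, scaling_schedule)
d_value = coefficient(window, h, wavelet_schedule)
```

No second coefficient memory and no negated copies of the coefficients are needed; only the index order and the "sub" bit change between the two filters.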
The key idea is to embed the decimation into the FIR computation. Rather than using a shift register to generate the ordered index of the FIR coefficients, as in a standard FIR computation, we design a PLA that generates both the address and the sign for x(n) and h(n) at each step. Because the number of MAC operations is fixed at L for every coefficient, it is very convenient for the PLA to implement a finite state machine (FSM) with very simple states: only four states are required for all coefficients. And since the index is generated by the PLA, it can follow an arbitrary, flexible pattern rather than the fixed shifting order of a standard FIR computation.
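The address generation can be modeled in Python as a lookup table, as sketched below for one scale. The table layout, the periodic addressing of x(n), the mirror relation used for the wavelet entries, and the names `build_address_table` and `run_scale` are assumptions made for illustration; they do not describe the actual PLA contents.

```python
def build_address_table(n, L):
    """Map (is_wavelet, index, step) to (x_addr, h_addr, sub) for one scale."""
    table = {}
    for is_wavelet in (0, 1):
        for i in range(n // 2):
            for step in range(L):
                # Start address advances by two: decimation is embedded here.
                x_addr = (2 * i + step) % n
                if is_wavelet:
                    h_addr, sub = L - 1 - step, step % 2
                else:
                    h_addr, sub = step, 0
                table[(is_wavelet, i, step)] = (x_addr, h_addr, sub)
    return table

def run_scale(x, h, table):
    """Drive a single MAC from the table; return (scaling, wavelet) lists."""
    n, L = len(x), len(h)
    out = {0: [], 1: []}
    for is_wavelet in (0, 1):
        for i in range(n // 2):
            acc = 0
            for step in range(L):
                x_addr, h_addr, sub = table[(is_wavelet, i, step)]
                product = x[x_addr] * h[h_addr]
                acc = acc - product if sub else acc + product
            out[is_wavelet].append(acc)
    return out[0], out[1]

x = [1, 2, 3, 4, 5, 6, 7, 8]  # N = 8
h = [1, 2, 3, 4]              # placeholder integer coefficients, L = 4
table = build_address_table(len(x), len(h))
c, d = run_scale(x, h, table)
```

Because every coefficient takes exactly L table lookups, the control flow is identical for all coefficients and only the table entries differ.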
We achieve a significant improvement in storage and computation complexity with this design. Table I summarizes the improvement. Here N is the length of the data block and L is the length of the filter. For concreteness, we also give the numbers for the example, i.e. N=8, L=4.
IV. VLSI ARCHITECTURE
The top-level architecture of our design is depicted in Fig. 4. The core of this design consists of the following parts: an 8*8 Booth-recoding multiplier; a 12b*12b accumulator; and an internal memory unit to store the data x(n) and the filter coefficients.
During computation of the DWT coefficients, an external MPU first inputs the data {x(n)} and the scaling filter coefficients. Then the DWT chip switches to computation mode through the "X/CSW" MUX control signal. The PLA generates the relevant address for the decoder of the memory unit according to the current scale, the type of coefficient, and the index of the coefficient currently being computed. The main PLA uses a predefined lookup table to generate the address and sign for both {x(n)} and {h(n)}. When all the coefficients of one block of {x(n)} have been computed, the MPU feeds in another block and the procedure repeats.
V. PLA DESIGN
In the proposed architecture, only one MAC is required, although multiple MACs could be used for a pipelined computation as a possible enhancement. The design of the PLA is essential to the algorithm. Since there are several coefficients to compute, the state transitions of the PLA could become very complicated if we did not extract the commonality among the coefficient computations.
Notice that in the computation of the four scaling/wavelet coefficients in (4), each coefficient requires 3 additions and 4 multiplications. More generally, when the filter length is L, each coefficient in the complete Mallat pyramid tree requires L-1 additions and L multiplications. This is very important, since it lets us design the PLA with a constant number L of basic states and use counters to generate the control signals that switch between different coefficients.
A simplified FSM is depicted in Fig. 5. In the STARTDWT state, the original signal and the filter coefficients {h(n)} are loaded into the memory. After the data has become valid, the whole system is set to the idle state (state0). All latches and other registers should be initialized after RESTART is asserted. Then a STARTDWT (SD) signal is sent to the chip. From this point on, the DWT chip is controlled by the internal main PLA and the calculation cycle begins. In one cycle, the FSM goes through four basic states {s0, s1, s2, s3}, one for each multiplication. A state transition is triggered by the completion of one multiplication operation. After each multiplication, an "LDACC" signal for the accumulator is generated. After four states, one coefficient has been computed. The FSM then updates the coefficient index and enters another cycle to compute the next coefficient. After the scaling coefficients are completed, C/DSW is updated to generate the address and sign of the wavelet filter and the corresponding {x(n)}. After one scale is finished, the computation continues with the next scale. After the last coefficient of the last scale is computed, the state returns to idle and waits for the next data block. In Fig. 5, the bubbles are the states, and the lines indicate the state transitions based on the control signal values. The output of the PLA is also shown above the lines in the form "input/output".
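The control sequence can be modeled with a small Python state machine, shown here as a sketch. The class name `DwtControl`, its methods, the ordering of scaling coefficients before wavelet coefficients within each scale, and the halving of the block length per scale are assumptions of this model rather than details taken from Fig. 5.

```python
IDLE = -1  # idle state; states 0..3 stand for s0..s3

class DwtControl:
    """Behavioral model of the four-state multiplication cycle."""

    def __init__(self, n, scales, L=4):
        self.n, self.scales, self.L = n, scales, L
        self.state = IDLE

    def start(self):
        """Model of the STARTDWT (SD) signal: leave idle, begin at scale 0."""
        self.state, self.scale = 0, 0
        self.c_dsw, self.index = 0, 0  # 0: scaling, 1: wavelet coefficients
        self.length = self.n

    def multiply_done(self):
        """Advance on completion of one multiplication; return LDACC."""
        if self.state == IDLE:
            return 0
        if self.state < self.L - 1:
            self.state += 1
            return 1
        # Last state of the cycle: one coefficient is finished.
        self.state = 0
        self.index += 1
        if self.index == self.length // 2:
            self.index = 0
            self.c_dsw ^= 1
            if self.c_dsw == 0:  # both coefficient types done: next scale
                self.scale += 1
                self.length //= 2
                if self.scale == self.scales:
                    self.state = IDLE
        return 1

ctrl = DwtControl(n=8, scales=2)
ctrl.start()
ldacc_count = 0
while ctrl.state != IDLE:
    ldacc_count += ctrl.multiply_done()
```

For N=8 and two scales this model issues one LDACC per multiplication and returns to idle after the last coefficient of the last scale.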
Basically, there are three different periods. The first is the addition period, which equals 4 multiplication cycles; the second is the period for switching between coefficient types, i.e. d_i or c_i, which spans 4 addition periods; the third corresponds to the number of coefficients computed in the current scale. Three counters are therefore designed for these three periods, and they send the corresponding control signals to the main PLA (Fig. 6). The first counter generates the MD, AD and XH switch signals for the main PLA. MD is asserted every four clock cycles to activate the state transition to the next multiplication state. At the same time, the load control for the accumulator starts. After every four MD pulses, AD is asserted, which means that a coefficient has been computed. XH is used to switch between loading the data and the wavelet filter coefficients.
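The timing of the MD and AD pulses can be sketched in Python with two cascaded counters. The function name `control_pulses` and its parameters are hypothetical, and the model covers only the two pulse trains described above, not the XH signal or the per-scale coefficient counter.

```python
def control_pulses(clock_cycles, cycles_per_mult=4, mults_per_coeff=4):
    """Return (cycle, MD, AD) tuples produced by two cascaded counters."""
    pulses = []
    clk_counter = 0   # counts clock cycles within one multiplication
    mult_counter = 0  # counts multiplications within one coefficient
    for cycle in range(1, clock_cycles + 1):
        md = ad = 0
        clk_counter += 1
        if clk_counter == cycles_per_mult:
            clk_counter = 0
            md = 1  # one multiplication finished
            mult_counter += 1
            if mult_counter == mults_per_coeff:
                mult_counter = 0
                ad = 1  # one coefficient finished
        pulses.append((cycle, md, ad))
    return pulses

pulses = control_pulses(32)
md_cycles = [cycle for cycle, md, _ in pulses if md]
ad_cycles = [cycle for cycle, _, ad in pulses if ad]
```

With the default parameters, MD is asserted on every fourth clock cycle and AD on every fourth MD pulse, matching the description above.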
VI. CONCLUSION
The design was implemented and fabricated in the AMI 1.5 micron CMOS process through the MOSIS prototyping service. The mask layout is shown in Fig. 7. The size of the chip core is 2.2 mm * 2.2 mm. Although this is not the most advanced fabrication technology, it is sufficient for prototyping purposes. The simulation results showed that the design computes the coefficients correctly, with reduced complexity compared to the original Mallat's algorithm. A detailed description of the work can be found in [3].
