Technological growth in the semiconductor industry has led to unprecedented demand for faster, area-efficient and low-power VLSI circuits for complex image processing applications. DWT-IDWT is one of the most popular IP blocks used for image transformation. In this work, a high-speed, low-power DWT/IDWT architecture is designed and implemented as an ASIC using 130 nm technology. A 2D DWT architecture based on the lifting scheme uses multipliers and adders, which consume power. This paper addresses power reduction in the multiplier by proposing a modified algorithm for the BZFAD multiplier. The proposed BZFAD multiplier is 65% faster and occupies 44% less area compared with generic multipliers. The DWT architecture based on the modified BZFAD multiplier achieves 35% power saving and operates at a frequency of 200 MHz with a latency of 1536 clock cycles for 512x512 images.
INTRODUCTION
The wavelet transform is a widely used technique in image processing applications. Unlike traditional transforms such as the Fast Fourier Transform (FFT) and Discrete Cosine Transform (DCT), the Discrete Wavelet Transform (DWT) retains both time and frequency information, based on a multiresolution analysis framework. This yields better reconstructed picture quality at the same compression ratio than is possible with other transforms. A real-time codec based on DWT needs to be targeted on a fast device. Field Programmable Gate Array (FPGA) implementation of DWT results in higher processing speed and lower cost when compared to other implementations such as PCs, ARM processors, DSPs etc. The DWT is therefore increasingly used for image coding [1] [2] [3] [4]. This is because the DWT can decompose signals into different sub-bands with both time and frequency information and thereby helps achieve a high compression ratio [5]. It supports features such as progressive image transmission (by quality, by resolution), ease of compressed image manipulation, region of interest coding, etc. The JPEG 2000 standard incorporates the DWT [6].
Recently, several VLSI architectures have been proposed to realize single chip designs for DWT [7] [8] [9] [10] . Traditionally, such algorithms were implemented using programmable DSP chips for low-rate applications or VLSI application specific integrated circuits (ASICs) for higher rates. To perform the convolution, we require a fast multiplier which is crucial in making the operations efficient.
Anirban Das, Anindya Hazra, and Swapna Banerjee [11] have proposed an architecture for the lifting-based running 3-D discrete wavelet transform (DWT), which is a powerful image and video compression algorithm. Chin-Fa Hsieh, Tsung-Han Tsai, Neng-Jye Hsu, and Chih-Hung Lai [12] proposed a novel, efficient VLSI architecture for the implementation of the one-dimensional, lifting-based DWT. Both folded and pipelined schemes are applied in that architecture, where the former supports higher hardware utilization and the latter speeds up the clock rate of the DWT. Jen-Shiun Chiang and Chih-Hsien Hsia [13] have proposed a highly efficient VLSI architecture for the 2-D lifting-based DWT using the lossless 5/3 filter. Their architecture is based on pipelined and folded processing to achieve a hardware utilization ratio of nearly 100% and to reduce the silicon area. Its advantages are higher hardware utilization, lower memory requirement, and regular data flow. The architectures discussed above are suitable for FPGA implementation.
In this work, we propose a modified 3D architecture based on the 9/7 filter that can operate at high frequency and consume low power. Section 2 presents the lifting based DWT. Sections 3 and 4 discuss the arithmetic blocks of the DWT and the BZ-FAD multiplier respectively. The proposed architectural details of the 1D/2D DWT-IDWT are presented in Section 5. Section 6 presents the 3D DWT architecture. The results are presented in Section 7, followed by the Conclusion.
LIFTING BASED DWT SCHEME
The top level architecture for 1D DWT is presented in Fig. 1a and Fig. 1b. Input X is decomposed into multiple sub bands of low frequency and high frequency components to extract the detailed parameters from X using multiple stages of low pass and high pass filters. The sub band filters are symmetric and satisfy the orthogonality property. For an input image, two 1D DWT computations are carried out, in the horizontal and vertical directions, to compute the 2D decomposition. The inverse DWT process combines the decomposed image sub bands to recover the original signal. The reconstruction of the image is possible owing to the symmetric property and inverse property of the low pass and high pass filter coefficients. Input x (n 1 , n 2 ) is decomposed into four sub-components Y LL , Y LH , Y HL and Y HH . This results in a one-level decomposition. The Y LL sub-band component is further processed and decomposed into another four sub-band components, thus forming a two-level decomposition. This process is continued as per the design requirements until the requisite quality is obtained. Every stage of DWT requires LPF and HPF filters with down sampling by 2. Lifting based DWT computation is widely adopted for image decomposition. In this work, we propose a modified architecture based on the BZFAD multiplier [14, 15] to realize the lifting based DWT.
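The repeated low pass / high pass filtering with down sampling by 2 described above can be sketched in a few lines of Python. This is a minimal illustration only: it uses the two-tap Haar average/difference pair as a stand-in for the actual sub band filters, and the function names `analysis_stage`, `decompose` and `reconstruct` are hypothetical.

```python
def analysis_stage(x):
    """One stage: low-pass and high-pass filtering, each followed by
    down sampling by 2. Haar filters are an assumed stand-in here."""
    low = [(x[i] + x[i + 1]) / 2 for i in range(0, len(x) - 1, 2)]
    high = [(x[i] - x[i + 1]) / 2 for i in range(0, len(x) - 1, 2)]
    return low, high


def decompose(x, levels):
    """Re-apply the stage to the low-frequency band 'levels' times."""
    approx, details = list(x), []
    for _ in range(levels):
        approx, high = analysis_stage(approx)
        details.append(high)
    return approx, details


def reconstruct(approx, details):
    """Inverse process: merge the sub bands back, coarsest level first."""
    for high in reversed(details):
        merged = []
        for l, h in zip(approx, high):
            merged += [l + h, l - h]
        approx = merged
    return approx


signal = [4, 6, 10, 12, 8, 6, 5, 5]
approx, details = decompose(signal, levels=2)
restored = reconstruct(approx, details)
```

Only the low-frequency band is decomposed again at each level, which mirrors the further processing of the Y LL sub-band in the 2D case.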
The lifting scheme is one of the techniques used to realize the DWT architecture [12]. It reduces the number of operations to be performed by about half, and the filters can be decomposed into a sequence of simpler steps. The lifting scheme also requires less memory and less computation. The implementation of the algorithm is fast and the inverse transform is simple in this method. The block diagram for the lifting scheme [12] is shown in Fig. 2. The z -1 blocks are for delay; α, β, γ, δ, ζ are the lifting coefficients and the shaded blocks are registers. The 9/7 filter has been used for implementation, which requires four steps for lifting and one step for scaling. The input signal x i is split into two parts: even part x 2i and odd part x 2i+1 . Thereafter, the first step of lifting is performed, given by equations (1) and (2).
The first equation is predict P1 and the second equation is update U1. Then the second lifting step is performed resulting in equations (3) and (4):
The third equation is predict P2 and the fourth equation is update U2. Thereafter the scaling is performed in order to obtain the approximation and detail coefficients of DWT as given in equations (5) and (6).
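A commonly used form of the 9/7 lifting steps, consistent with the predict, update and scaling roles named above, is sketched below for orientation. The intermediate symbols $d^{1}, s^{1}, d^{2}, s^{2}$ and the assignment of $\zeta$ and $1/\zeta$ to the two scaling steps are assumptions and are not taken from the original equations.

```latex
\begin{align}
d_i^{1} &= x_{2i+1} + \alpha \left( x_{2i} + x_{2i+2} \right) \tag{1} \\
s_i^{1} &= x_{2i} + \beta \left( d_{i-1}^{1} + d_i^{1} \right) \tag{2} \\
d_i^{2} &= d_i^{1} + \gamma \left( s_i^{1} + s_{i+1}^{1} \right) \tag{3} \\
s_i^{2} &= s_i^{1} + \delta \left( d_{i-1}^{2} + d_i^{2} \right) \tag{4} \\
s_i &= \zeta \, s_i^{2} \tag{5} \\
d_i &= d_i^{2} / \zeta \tag{6}
\end{align}
```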
Equations (5) and (6) are the scaling steps G 1 and G 2 respectively. The predict step exploits the correlation between the two sets of data and predicts the odd samples from the neighbouring even samples. The resulting samples are used for updating the present phase. The update step constructs a new operator so that some properties of the original input data are maintained in the reduced set. The lifting coefficients have constant values of -1.58613, -0.0529, 0.882911, 0.44350, -1.1496 for α, β, γ, δ, ζ respectively. It may be observed from these equations that the computation of the final coefficients requires 6 steps. Data travels in sequence from stage 1 to stage 6, introducing 6 units of delay. To speed up the computation, a modified lifting scheme is proposed and realized.
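The six-step data flow (split, P1, U1, P2, U2, scaling) can be illustrated with a small floating-point model. This is a sketch under stated assumptions, not the hardware implementation: the function names are hypothetical, edge samples are handled by simple clamping, the magnitude 1.1496 is used for the scaling constant, and the input length is assumed to be even.

```python
# Lifting coefficients as quoted in the text (rounded values).
ALPHA, BETA, GAMMA, DELTA = -1.58613, -0.0529, 0.882911, 0.44350
ZETA = 1.1496  # assumption: magnitude of the scaling constant


def _at(seq, i):
    """Read seq[i] with edge clamping (assumed boundary rule)."""
    return seq[min(max(i, 0), len(seq) - 1)]


def forward_97(x):
    """Forward 1D lifting: returns (approximation, detail) coefficients."""
    even, odd = list(x[0::2]), list(x[1::2])  # split step
    n = len(odd)
    d1 = [odd[i] + ALPHA * (even[i] + _at(even, i + 1)) for i in range(n)]  # P1
    s1 = [even[i] + BETA * (_at(d1, i - 1) + d1[i]) for i in range(n)]      # U1
    d2 = [d1[i] + GAMMA * (s1[i] + _at(s1, i + 1)) for i in range(n)]       # P2
    s2 = [s1[i] + DELTA * (_at(d2, i - 1) + d2[i]) for i in range(n)]       # U2
    return [ZETA * v for v in s2], [v / ZETA for v in d2]                   # G1, G2


def inverse_97(approx, detail):
    """Inverse transform: undo each lifting step in reverse order."""
    n = len(detail)
    s2 = [v / ZETA for v in approx]
    d2 = [v * ZETA for v in detail]
    s1 = [s2[i] - DELTA * (_at(d2, i - 1) + d2[i]) for i in range(n)]
    d1 = [d2[i] - GAMMA * (s1[i] + _at(s1, i + 1)) for i in range(n)]
    even = [s1[i] - BETA * (_at(d1, i - 1) + d1[i]) for i in range(n)]
    odd = [d1[i] - ALPHA * (even[i] + _at(even, i + 1)) for i in range(n)]
    out = []
    for e, o in zip(even, odd):
        out += [e, o]
    return out


samples = [3.0, 7.0, 1.0, 8.0, 2.0, 9.0, 4.0, 6.0]
approx, detail = forward_97(samples)
restored = inverse_97(approx, detail)
```

Because every step changes one set of samples using only the other set, the inverse simply subtracts what the forward step added, which is why the inverse transform is simple in this method.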
ARITHMETIC BUILDING BLOCKS FOR LIFTING SCHEME IMPLEMENTATION
High-speed multiplication has always been a fundamental requirement of high performance systems. The multiplier is one of the processing elements that consumes the most area and power and also contributes the most delay. Therefore, there is a need for high-speed architectures for N-bit multipliers with optimized area, speed and power. Multipliers are built from adders arranged to reduce the partial product logic delay and regularize the layout. To improve regularity and obtain a compact layout, a regularly structured tree with recurring blocks and a rectangular-styled tree obtained by folding have been proposed, at the expense of complicated interconnects [16]. The present work focuses on multiplier design for low power applications such as DWT, rapidly reducing the partial product rows by identifying the critical paths and signal races in the multiplier. The focus of the design has been to optimize the speed, area and power of the multiplier, which forms the major bottleneck in lifting based DWT [17].
Shift and Add Multiplier
In shift and add based multiplier logic, the multiplicand (A) is multiplied by multiplier (B). If the registers A and B storing multiplicand and multiplier respectively are of N bits, the shift and add multiplier logic requires two N bit registers, an N bit adder and an (N+1) bit accumulator. It also requires an N-bit counter to control the number of addition operations. In shift and add logic, the LSB of the multiplier is checked for 1 or 0. If the LSB is 0, then the accumulator is shifted right by one bit position. If the LSB is 1, then the multiplicand is added to the accumulator content and the accumulator is shifted right by one bit. The counter is decremented for every operation; the addition is performed until the counter reaches zero, which is indicated by the Ready signal. The product is available in the accumulator after N clock cycles. Fig. 3 shows the block diagram of the conventional multiplier using shift and add logic, which generates a partial product (PP). B(0) is generally used to select A or 0 as is appropriate.
Figure 3 The Architecture of the Conventional Shift-and-Add Multiplier
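The shift and add procedure can be modelled behaviourally as follows. This is a minimal sketch for unsigned operands, assuming the multiplier register also collects the low product bits as they are shifted out of the accumulator; the function name and the `additions` counter are hypothetical and only illustrate how often the adder is used.

```python
def shift_add_multiply(a, b, n=8):
    """Multiply two unsigned n-bit integers with shift-and-add logic.

    Returns (product, additions), where 'additions' counts the cycles
    in which the multiplicand was added to the accumulator.
    """
    mask = (1 << n) - 1
    if not (0 <= a <= mask and 0 <= b <= mask):
        raise ValueError("operands must fit in n bits")
    acc = 0        # accumulator holding the upper part of the product
    low = b        # multiplier register; low product bits shift in here
    additions = 0
    for _ in range(n):          # the counter allows one step per bit
        if low & 1:             # B(0) selects A (add) or 0 (no add)
            acc += a
            additions += 1
        # Shift the {acc, low} register pair right by one position.
        low = (low >> 1) | ((acc & 1) << (n - 1))
        acc >>= 1
    return (acc << n) | low, additions


product, additions = shift_add_multiply(13, 11, n=4)
```

For example, multiplying 13 by 11 with 4-bit operands takes four cycles, three of which perform an addition because 11 (binary 1011) contains three ones.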
Modified BZ-FAD Multiplier
As discussed for the shift and add logic earlier, if the LSB is 1, then the multiplicand is added to the accumulator. If the accumulator contains more 1s than 0s, the addition propagates through more of the full adder cells within the adder and causes more of them to switch. Power dissipation is due to the switching activity of the input and output lines: whenever an input or output toggles, node capacitance is charged from Vdd or discharged to Vss, which consumes power. In order to reduce the power dissipation, it is necessary to reduce the switching activity on the I/O lines. The BZ-FAD logic based multiplier [15] reduces the switching activity and thus reduces the power dissipation.
In the shift and add operation, the counter keeps track of the number of cycles, thus controlling the multiplication operation. In a binary counter, a single increment can change more than one output bit. For example, if the current counter value is 3 (binary 011) and changes to 4 (binary 100), three bits change. This causes switching activity, which can be reduced by replacing the binary counter with a ring counter. In a ring counter, only a single 1 moves from one position to the next on each step, so the switching per cycle is small and constant, which reduces switching activity and power dissipation.
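The link between switching activity and dissipation is usually summarized by the first-order dynamic power relation for CMOS logic. The symbols used here ($a_{sw}$ for the switching activity factor, $C_L$ for the load capacitance and $f_{clk}$ for the clock frequency) are introduced for illustration and do not appear in the original text.

```latex
P_{dyn} = a_{sw} \, C_L \, V_{dd}^{2} \, f_{clk}
```

Under this relation, lowering either the activity factor or the load capacitance lowers the dynamic power proportionally.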
Another major source of power dissipation in shift and add logic is the shifting of the accumulator. For every bit value "0" of the multiplier, a shift operation is performed; thus all the bits in the accumulator are shifted by one bit position. This causes switching and hence more power is dissipated. In BZ-FAD logic, if the LSB is 0, then the shift operation is bypassed and a zero is introduced at the MSB, so the accumulator content is not shifted. In other words, if the LSB is zero, the accumulator content bypasses the adder and no addition takes place, but a zero is introduced by the control logic, which is equivalent to a right shift operation. The architecture of this multiplier is shown in Fig. 4. In the BZ-FAD, the control activity of the ring counter, latch and bypass logic is realized using NMOS transistors, which introduces delay. The parasitic capacitance of the NMOS transistors also increases the load capacitance and thus increases power dissipation.
In order to reduce power dissipation, the transistor logic is replaced by MUX logic having ideal fan in and fan out capacitances. With MUX based logic, the control signals can be suitably controlled to reduce the switching activity, as they are enabled only when required, based on the inputs derived from the ring counter. However, the design requires more transistors and thus increases the chip area. We have also used the ripple carry adder, which has the lowest average number of transitions per addition among the look ahead, carry skip, carry-select and conditional sum adders, to reduce power dissipation. Various multipliers are modeled in HDL and analyzed for their performance, and the results are tabulated for comparison. The next section discusses the comparison results of these multiplier algorithms [18].
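The bypass idea can be illustrated with a behavioural model. This is a sketch under stated assumptions rather than the circuit of Fig. 4: the multiplier register is not shifted, a one-hot ring counter selects the current multiplier bit, the adder is bypassed when that bit is 0, and the low bit of each partial result is latched directly at its final position. The function and variable names are hypothetical.

```python
def bzfad_multiply(a, b, n=8):
    """Behavioural model of a bypass-zero, feed-A-directly multiplier.

    Returns (product, adder_uses) for unsigned n-bit operands.
    """
    mask = (1 << n) - 1
    if not (0 <= a <= mask and 0 <= b <= mask):
        raise ValueError("operands must fit in n bits")
    upper = 0         # running upper part of the partial product
    product_low = 0   # low product bits, latched one per cycle
    adder_uses = 0
    ring = 1          # one-hot ring counter, selects bit i of B
    for i in range(n):
        if b & ring:              # selected bit is 1: A goes to the adder
            total = upper + a
            adder_uses += 1
        else:                     # selected bit is 0: adder is bypassed
            total = upper
        product_low |= (total & 1) << i   # latch the LSB in place
        upper = total >> 1                # remaining bits are fed back
        ring <<= 1                        # the single 1 moves one position
    return (upper << n) | product_low, adder_uses


product, adder_uses = bzfad_multiply(13, 11, n=4)
```

In this model the adder is exercised only once per 1 bit of the multiplier, and neither the multiplier register nor the low product bits are shifted, which are the activity reductions the text attributes to BZ-FAD.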
COMPARISON RESULTS OF MULTIPLIER
In this section, the power and area of different types of multipliers are compared with those of the modified BZ-FAD multiplier. The results reveal that the modified BZ-FAD multiplier may be considered a low-power, yet area efficient multiplier.
Table 1 Power comparison of the proposed BZ-FAD multiplier with other multipliers

International Journal of Computer Applications (0975 -8887) Volume 51-No.6, August 2012
Figure 5 Power comparisons of multipliers
The power consumption of the multipliers for normally distributed input data are reported in Table 1 . As can be seen in Fig. 5 , the BZ-FAD multiplier consumes 33% lower power compared to the conventional multiplier. The results reveal that the BZ-FAD multiplier is a very low-power, yet highly area efficient multiplier.
In terms of the area, the proposed technique has some area overhead compared to the conventional shift-and-add multiplier as shown in Fig. 6 and Table 2. Comparison of Fig. 3 with Fig. 4 reveals that multiplexers M1, M2 and the ring counter are responsible for the additional area in the proposed architecture. The area overheads of the ring counter and multiplexers M1 and M2 scale up linearly with the input data width. This leads to a small increase in the leakage power which, as the results reveal, is less than the overall power reduction. The leakage power of the 8-bit BZFAD architecture is about 11% more than that of the conventional architecture, but the contribution of the leakage power in these multipliers is less than 3% of the total power for the technology used in this work. It may be noted that since the critical paths for both architectures are the same, neither of the two architectures has a speed advantage over the other [19] [20] [21] [22] [23].
Discrete wavelet transform and Inverse Discrete wavelet transform implementation
DWT has traditionally been implemented by convolution. Such an implementation demands both a large number of computations and a large amount of storage, features that are not desirable for either high-speed or low-power applications.
Recently, a lifting-based scheme that often requires far fewer computations has been proposed for the DWT [24] [25] [26] [27] . The main feature of the lifting based DWT scheme is to break up the high pass and low pass filters into a sequence of upper and lower triangular matrices and convert the filter implementation into banded matrix multiplications. Such a scheme has several advantages, including "in-place" computation of the DWT, integer-to-integer wavelet transform (IWT), symmetric forward and inverse transform, etc. Therefore, it comes as no surprise that lifting has been chosen in the upcoming implementations.
The proposed architecture computes multilevel DWT for both the forward and the inverse transforms: one level at a time, in a row-column fashion. There are two row processors that compute the high pass and low pass filter outputs as shown in Fig. 1 . Four column processors operate on the row processed outputs. The outputs generated by the row and column processors divide the input image into four sub bands of LL, LH, HL and HH. These sub bands are stored in the memory modules for further processing. The memory modules are divided into multiple banks to accommodate high computational bandwidth requirements. The proposed architecture is an extension of the architecture for the forward transform that was presented earlier. A number of architectures have been proposed for calculation of the convolution-based DWT. The architectures are mostly folded and can be broadly classified into serial architectures, where the inputs are supplied to the filters in a serial manner and, parallel architectures, where the inputs are applied to the filters in a parallel manner.
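The row-column flow described above, with row processing followed by column processing into four sub bands, can be sketched as follows. This is an illustration under stated assumptions: a Haar predict/update pair stands in for the 9/7 lifting filter, the function names are hypothetical, and the sub band naming (first letter for the row filter, second for the column filter) is one possible convention.

```python
def dwt1d(x):
    """Stand-in 1D lifting transform (Haar): one predict, one update."""
    even, odd = x[0::2], x[1::2]
    high = [o - e for e, o in zip(even, odd)]       # predict
    low = [e + h / 2 for e, h in zip(even, high)]   # update
    return low, high


def transpose(block):
    return [list(row) for row in zip(*block)]


def column_pass(block):
    """Apply dwt1d down every column; return (low, high) blocks."""
    low_cols, high_cols = [], []
    for col in transpose(block):
        lo, hi = dwt1d(col)
        low_cols.append(lo)
        high_cols.append(hi)
    return transpose(low_cols), transpose(high_cols)


def dwt2d(image):
    """One decomposition level: rows first, then columns."""
    row_low, row_high = [], []
    for row in image:                 # row processors
        lo, hi = dwt1d(row)
        row_low.append(lo)
        row_high.append(hi)
    ll, lh = column_pass(row_low)     # column processors on the low output
    hl, hh = column_pass(row_high)    # column processors on the high output
    return ll, lh, hl, hh


flat_image = [[8] * 4 for _ in range(4)]
ll, lh, hl, hh = dwt2d(flat_image)
```

For a multilevel transform, the same function would be applied again to the LL sub band read back from memory.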
We have developed a design methodology for lifting based DWT that reduces the memory requirements and the communication among processors when the image is broken up into blocks. For a system that consists of the lifting-based DWT transform followed by an embedded zero-tree algorithm, a new interleaving scheme that reduces the number of memory accesses has been proposed. Finally, a lifting-based DWT architecture has been developed as shown in Figure 7, which is capable of performing the filter operation with one lifting step, i.e., one predict and one update step. The outputs are generated in an interleaved fashion.
Figure 7 2-D Lifting-based DWT
The lifting scheme is represented by the following equations of the 1-D DWT:
The 2-D DWT is a multilevel decomposition technique that decomposes into four sub bands such as hh, hl, lh and ll. The mathematical formulae governing the 2-D DWT are as follows:
Similarly, the 2-D IDWT is defined as follows:
The 
3D DWT Architecture
The design and VLSI implementation of a high speed, low power 3D wavelet architecture is targeted at video coding applications. A flexible hardware architecture is designed for performing the 3D Discrete Wavelet Transform. The proposed architecture uses a new and fast lifting scheme which is able to perform progressive computations by minimizing the buffering between the decomposition levels. The 3D wavelet decomposition is computed by applying three separate 1D transforms along the coordinate axes of the video data. The 3D data is usually organized frame by frame. A single frame has rows and columns as in the 2D case, with the x and y directions often denoted as "spatial co-ordinates", whereas for video data a third dimension of "time" is added (z-direction). The input data is a set of multiple frames each consisting of N rows and N columns. Hence the input data can be denoted as NxNxN, where N is an integer. The 3D DWT can be considered as a combination of three 1D DWTs in the x, y and z directions as shown in Fig. 9. A preliminary step in the DWT processor design is to build 1D DWT modules, which are composed of high-pass and low-pass filters that perform a convolution on filter coefficients and input pixels. After one level of 3D discrete wavelet transform, the volume of frame data is decomposed into HHH, HHL, HLH, HLL, LHH, LHL, LLH and LLL signals as shown in Fig. 9.
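The separable structure, one 1D transform along each of x, y and z, can be sketched in Python as below. This is an illustration only, with several assumptions: a Haar predict/update pair stands in for the 9/7 lifting filter, the volume is stored as a dictionary keyed by (z, y, x), the function names are hypothetical, and the letters of each sub band name are ordered as x, y, z.

```python
from itertools import product


def dwt1d(x):
    """Stand-in 1D lifting transform (Haar): one predict, one update."""
    even, odd = x[0::2], x[1::2]
    high = [o - e for e, o in zip(even, odd)]
    low = [e + h / 2 for e, h in zip(even, high)]
    return low, high


def split_axis(vol, shape, axis):
    """Apply dwt1d along one axis; return (low, high, half_shape)."""
    low, high = {}, {}
    half = list(shape)
    half[axis] //= 2
    ranges = [range(n) if a != axis else [0] for a, n in enumerate(shape)]
    for idx in product(*ranges):
        line = []
        for t in range(shape[axis]):      # gather one line of samples
            pos = list(idx)
            pos[axis] = t
            line.append(vol[tuple(pos)])
        lo, hi = dwt1d(line)
        for t, (l, h) in enumerate(zip(lo, hi)):
            pos = list(idx)
            pos[axis] = t
            low[tuple(pos)] = l
            high[tuple(pos)] = h
    return low, high, tuple(half)


def dwt3d(vol, shape):
    """One level of 3D DWT: filter along x, then y, then z (time)."""
    bands = {"": (vol, shape)}
    for axis in (2, 1, 0):                # indices are (z, y, x)
        nxt = {}
        for name, (v, s) in bands.items():
            lo, hi, hs = split_axis(v, s, axis)
            nxt[name + "L"] = (lo, hs)
            nxt[name + "H"] = (hi, hs)
        bands = nxt
    return bands


n = 4
volume = {(z, y, x): 5 for z in range(n) for y in range(n) for x in range(n)}
bands = dwt3d(volume, (n, n, n))
```

Each pass doubles the number of bands and halves one dimension, so an NxNxN input yields eight sub bands of size (N/2)x(N/2)x(N/2). The intermediate dictionaries play the role of the reordering memories between the 1D stages.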
The arithmetic blocks adopted in the design of 1D/2D DWT are extended to the design of 3D DWT as well. One of the major changes in the 3D architecture is the intermediate memory stages that are required for reordering of 1D and 2D output samples for the computation of 3D samples. Fig. 10 shows the 3D architecture with intermediate memories. For 
Simulation and Place & Route Results
The 3D DWT simulation results obtained using ModelSim are presented in Fig. 11. The 8 sub level images are discernible from the figure. The HDL model developed is synthesized using Xilinx ISE targeting Virtex-5 FPGAs. The 3D-DWT architecture has been implemented on a Virtex-5 FPGA with a utilization of 1152 slice registers, and the reported maximum operating frequency is 256 MHz. The designed DWT can be used as an IP core for VLSI implementation.
CONCLUSION
In this work, low power architecture for shift-and-add multipliers has been proposed and implemented. The conventional architecture has been modified by removing the shift operation of the B register (in A × B), direct feeding of A to the adder, and bypassing the adder whenever possible. It also uses a ring counter instead of the conventional binary counter and removes the partial product shifter. The BZ-FAD multiplier is further modified using multiplexers and XOR gates. The modified multiplier is modeled and implemented using 130 nm technology. The modified multiplier is used in constructing lifting based DWT/IDWT architecture. The DWT/IDWT architecture is modeled and synthesized using TSMC libraries. The BZ-FAD multiplier based DWT/IDWT architecture reduces power dissipation by 30% and operates at 200 MHz. The adders in the lifting based DWT/IDWT can be further improved by replacing the adders by low power adders. 3D-DWT architecture has been implemented on Virtex-5 FPGA utilizing 1152 slice registers and reporting a maximum frequency of operation of 256 MHz. The developed DWT can be used as an IP for VLSI implementation. The 1D/2D/3D DWT use large numbers of intermediate memories.
In order to reduce the memory size, systolic array architectures may be adopted. Power dissipation can be reduced using low power design techniques. 3D DWT can also be implemented as an ASIC, optimizing for area, power and speed.
Figure 11 ModelSim simulation results for 3D DWT
