Abstract-This paper presents a systematic high-speed VLSI implementation of the discrete wavelet transform (DWT) based on hardware-efficient parallel FIR filter structures. A high-speed 2-D DWT with computation time as low as N^2/12 can be achieved for an N x N image with a controlled increase in hardware cost. Compared with recently published 2-D DWT architectures with computation times of N^2/3 and 2N^2/3, the proposed designs also save a large number of multipliers and/or storage elements. In the proposed DWT models, adders are identified as offering more scope for optimization than the other components. To improve the efficiency of the DWT process, an efficient adder called the "Enhanced Half Ripple Carry Adder" (EHRCA) has been designed in this work. The proposed EHRCA circuit offers a 10.71% improvement in hardware slice utilization and an 11.78% improvement in total power consumption over the traditional Binary to Excess-1 Converter (BEC) based Square Root Carry Select Adder (SQRT CSLA).
1. Introduction
Two-dimensional (2-D) Discrete Wavelet Transformation (DWT) techniques are widely used for image and video compression [5]. The 2-D DWT has a multiresolution decomposition capability, which is why it plays a role in many engineering fields [10]. However, the accumulation of large volumes of data across the decomposition levels of the transform makes it computationally very intensive. Considerable effort has gone into designing architectures that provide high-speed 2-D DWT computation with reasonable hardware utilization.
These architectures can be classified as separable and non-separable. In a separable architecture, the 2-D filtering operation is carried out as two 1-D filtering operations, one processing the data row-wise and the other processing it column-wise.
The decomposition levels of the input image can be computed by either the Recursive Pyramid Algorithm (RPA) or the lifting operation. A separable architecture uses a 1-D filtering structure to perform the 2-D DWT, and hence incurs additional computational complexity between the two 1-D filtering passes. This increases both the latency and the memory size of the architecture. Non-separable architectures address this limitation by computing the 2-D DWT directly with 2-D filters. However, the DWT process is slow in non-separable architectures. To overcome this problem, pipelining is used in the DWT architecture [10]. In general, the Haar Discrete Wavelet Transform (HDWT) is used to compress the signal or image [6]. To increase the compression ability, precision-aware self-quantizing architectures can be used [3]. Distributed Arithmetic (DA) based multiplication is used to generate the DWT coefficients in [2], where the DA-based multiplier is reported to outperform other multipliers. One-dimensional DWT techniques are implemented in a Very Large Scale Integration (VLSI) system design environment in [9], and a VLSI-based high-speed 2-D DWT is implemented in [1].
In this paper, the 2-D DWT is designed using an Enhanced Half Ripple Carry Adder (EHRCA). The EHRCA is a type of Ripple Carry Adder (RCA) whose hardware complexity and power consumption are lower than those of the traditional RCA circuit. Incorporating the EHRCA into the DWT process therefore improves the DWT in terms of silicon area and power consumption.
2. Discrete Wavelet Transformation (DWT)

Discrete Wavelet Transformation (DWT) is a technique for decomposing and compressing images. The DWT represents an image as a sum of wavelet functions (wavelets) with different locations and scales. It represents the data as a set of low pass and high pass coefficients. The input data is passed through a set of low pass and high pass filters, and the outputs of both filters are down-sampled by 2. The output of the low pass filter is the average coefficient and the output of the high pass filter is the detail coefficient. The schematic diagram of the 1-D DWT is shown in Figure 1. In the 2-D DWT, the input data is passed through low pass and high pass filters in two directions, along rows and along columns. As in the 1-D DWT, the outputs of the low pass and high pass filters are down-sampled by 2 in each direction. Figure 2 shows the block diagram of the 2-D DWT. As shown in Figure 2, the output consists of four sets of coefficients: LL, HL, LH and HH. In this notation, the first letter denotes the transform along the rows and the second letter denotes the transform along the columns; L denotes the low pass signal and H the high pass signal.
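The filtering and down-sampling described above can be illustrated with a minimal Python sketch. It assumes the Haar wavelet with a simple 1/2 normalization (average and half-difference of each sample pair), which is an illustrative choice and not necessarily the filter pair used in the proposed architecture. The function names `haar_1d` and `haar_2d` are hypothetical.

```python
def haar_1d(samples):
    """One level of 1-D Haar analysis on an even-length sequence.

    Returns (average, detail): the low pass and high pass outputs,
    each already down-sampled by 2.
    """
    average = [(samples[i] + samples[i + 1]) / 2 for i in range(0, len(samples), 2)]
    detail = [(samples[i] - samples[i + 1]) / 2 for i in range(0, len(samples), 2)]
    return average, detail


def transpose(block):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(column) for column in zip(*block)]


def haar_2d(image):
    """One level of separable 2-D Haar analysis; returns (LL, HL, LH, HH)."""
    # Row pass: filter every row, giving a low pass half and a high pass half.
    row_low, row_high = [], []
    for row in image:
        average, detail = haar_1d(row)
        row_low.append(average)
        row_high.append(detail)

    def column_pass(block):
        # Filter every column by transposing, reusing haar_1d, and transposing back.
        low_cols, high_cols = [], []
        for column in transpose(block):
            average, detail = haar_1d(column)
            low_cols.append(average)
            high_cols.append(detail)
        return transpose(low_cols), transpose(high_cols)

    ll, lh = column_pass(row_low)    # rows low pass;  columns low / high
    hl, hh = column_pass(row_high)   # rows high pass; columns low / high
    return ll, hl, lh, hh
```

For a 4 x 4 input, each of the four returned sub-bands is 2 x 2, matching the down-sampling by 2 in each direction.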
In this paper, three levels of decomposition are performed to compress the image with the help of the EHRCA. The structure of the DWT levels is shown in Figure 3. Similarly, the input data can be represented at multiple resolutions by decomposing the LL coefficients further at each level. In reconstruction, the compressed data is up-sampled by a factor of 2 and interpolated in order to recover the original input data. VLSI architectures proposed for hardware implementation of the DWT are mainly convolution-based. In the conventional convolution method, a pair of Finite Impulse Response (FIR) filters is applied in parallel to derive the high pass and low pass filter coefficients.
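The reconstruction step, up-sampling by 2 and combining the two bands, can be sketched as follows. This is a minimal illustration that again assumes the Haar wavelet with 1/2 normalization; `haar_analysis` and `haar_synthesis` are hypothetical names and the sketch does not model the hardware data path.

```python
def haar_analysis(samples):
    """Split an even-length sequence into average and detail coefficients."""
    average = [(samples[i] + samples[i + 1]) / 2 for i in range(0, len(samples), 2)]
    detail = [(samples[i] - samples[i + 1]) / 2 for i in range(0, len(samples), 2)]
    return average, detail


def haar_synthesis(average, detail):
    """Rebuild the signal: each coefficient pair yields two output samples."""
    samples = []
    for a, d in zip(average, detail):
        samples.append(a + d)  # even-indexed sample
        samples.append(a - d)  # odd-indexed sample
    return samples
```

Applying `haar_synthesis` to the output of `haar_analysis` returns the original samples, which is the perfect reconstruction property that the up-sampling stage relies on.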
In the first-level decomposition, the input image is of size N x N and the outputs are the three sub-bands LH, HL and HH, each of size N/2 x N/2. In the second-level decomposition, the input is the LL band and the outputs are the three sub-bands LLLH, LLHL and LLHH, each of size N/4 x N/4.
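The halving of the sub-band side length at every level can be tabulated with a short helper. This is a sketch only; the function name `subband_sizes` and the example image size of 512 are assumptions chosen for illustration.

```python
def subband_sizes(n, levels):
    """Return [(level, side)], where side x side is the size of each sub-band."""
    sizes = []
    side = n
    for level in range(1, levels + 1):
        if side % 2:
            raise ValueError("side length must be even at every level")
        side //= 2  # each level down-samples by 2 in both directions
        sizes.append((level, side))
    return sizes
```

For a hypothetical 512 x 512 image and three levels, the sub-bands have side lengths 256, 128 and 64, that is N/2, N/4 and N/8.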
Implementing the DWT in a practical system raises several issues. First, the complexity of the wavelet transform is several times higher than that of the Discrete Cosine Transform (DCT). Second, the DWT needs extra memory for storing intermediate computational results. Moreover, for real-time image compression, the DWT has to process massive amounts of data at high speed. A software implementation of DWT image compression provides flexibility but may not meet the timing constraints of certain applications. Hardware implementation of the DWT also has practical obstacles. The first is the high hardware cost of multipliers. A filter bank implementation of the DWT contains two FIR filters, and the DWT has traditionally been implemented by convolution or FIR filter bank structures. Such implementations require both a large number of arithmetic computations and a large amount of storage, which are undesirable for high-speed or low-power image and video processing applications.
3. Image Compression using DWT
An input image is passed through a series of filters to calculate the DWT coefficients. The procedure starts by passing the image through a half band digital low pass filter with impulse response h[n]. Filtering an image signal corresponds to the mathematical operation of convolving the signal with the impulse response of the filter. A half band low pass filter removes all frequencies above half of the highest frequency in the signal, which can be interpreted as losing half of the information. Resolution is related to the amount of information in the signal, and is therefore affected by filtering operations. The subsampling operation after filtering does not affect the resolution, since removing half of the spectral components from the input signal makes half of the samples redundant anyway. In summary, half band low pass filtering halves the resolution but leaves the scale unchanged. The signal is then subsampled according to Equation (2), since half of the samples are redundant.

The input image signal is decomposed into average information and detail information. With the subsampling included, the detail and average information can be expressed mathematically as follows, where g[n] is the impulse response of the high pass filter:

$$y_{high}[k] = \sum_{n} x[n]\, g[2k - n]$$

$$y_{low}[k] = \sum_{n} x[n]\, h[2k - n]$$

The signal is reconstructed from these coefficients as

$$x[n] = \sum_{k} \left( y_{high}[k]\, g[2k - n] + y_{low}[k]\, h[2k - n] \right)$$
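The analysis relations, in which each output sample k sums x[n] weighted by the filter tap at index 2k - n, can be evaluated directly with a few lines of Python. This is a minimal sketch for finite sequences, assuming filter taps and signal samples are zero outside their stored range. The function name `analysis_step` and the example Haar-like taps are illustrative assumptions.

```python
def analysis_step(x, h, g):
    """Return (y_low, y_high) with y[k] = sum over n of x[n] * taps[2k - n]."""

    def filter_and_decimate(taps):
        # Index 2k - n must fall inside the taps, so k runs over every even
        # position of the full convolution of x with taps.
        n_out = (len(x) + len(taps) - 2) // 2 + 1
        out = []
        for k in range(n_out):
            acc = 0.0
            for n, sample in enumerate(x):
                idx = 2 * k - n
                if 0 <= idx < len(taps):
                    acc += sample * taps[idx]
            out.append(acc)
        return out

    return filter_and_decimate(h), filter_and_decimate(g)
```

With x = [4, 2, 6, 10], h = [0.5, 0.5] and g = [0.5, -0.5], the function keeps every second sample of the two full convolutions, which is the filter-then-subsample behaviour the equations describe.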
4. Conventional Carry Select Adder
The Carry Select Adder (CSLA) is a widely used fast adder for binary addition. In the CSLA architecture, a dual RCA is used, with one RCA for carry input 0 and one for carry input 1, and multiplexers are used in the final stage of the addition. A single RCA block has four Full Adders (FAs), so a dual-RCA block has eight. The CSLA therefore requires a large number of gates. This adder is generally called the Square Root Carry Select Adder (SQRT CSLA), because it uses several dual-RCA groups to compute an N-bit binary addition. All dual-RCA groups execute in parallel, and the final stage uses multiplexers to produce the final sum. To improve performance, the RCA for carry input 1 has been replaced by a Binary to Excess-1 Converter (BEC). The BEC uses fewer gates than the RCA to perform the operation for carry input 1. For instance, a 16-bit BEC-based SQRT CSLA is illustrated in Figure 4. It consists of four RCA-BEC groups that add two 16-bit binary integers, and it reduces silicon area utilization and power consumption compared with the traditional SQRT CSLA circuit. However, the silicon area requirement of the combined RCA-BEC circuit is still high and it consumes considerable power to perform a 16-bit binary addition. To reduce this problem, the EHRCA circuit is designed in this paper. A brief description of the EHRCA is presented in the next section.
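The RCA-BEC-multiplexer organisation can be modelled at bit level in Python. The sketch below assumes four uniform 4-bit groups for a 16-bit addition, following the description of four RCA-BEC sets; the actual group sizes in Figure 4 may differ. It is a behavioural model with hypothetical function names, not the synthesized circuit.

```python
def ripple_carry(a_bits, b_bits, carry_in):
    """Ripple carry addition of two LSB-first bit lists."""
    sum_bits, carry = [], carry_in
    for a, b in zip(a_bits, b_bits):
        sum_bits.append(a ^ b ^ carry)
        carry = (a & b) | (carry & (a ^ b))
    return sum_bits, carry


def binary_to_excess_1(bits):
    """Add 1 to an LSB-first bit list using one XOR and one AND per bit."""
    out, carry = [], 1
    for bit in bits:
        out.append(bit ^ carry)
        carry &= bit
    return out


def csla_bec_add(a, b, width=16, group=4):
    """Add two unsigned integers with a BEC-based carry select structure."""
    a_bits = [(a >> i) & 1 for i in range(width)]
    b_bits = [(b >> i) & 1 for i in range(width)]
    result_bits, carry = [], 0
    for start in range(0, width, group):
        stop = start + group
        # The RCA computes the group result assuming carry-in = 0.
        sum0, carry0 = ripple_carry(a_bits[start:stop], b_bits[start:stop], 0)
        # The BEC derives the carry-in = 1 result from the carry-in = 0 result.
        excess = binary_to_excess_1(sum0 + [carry0])
        sum1, carry1 = excess[:-1], excess[-1]
        # Multiplexer: the actual incoming carry selects one of the two results.
        result_bits += sum1 if carry else sum0
        carry = carry1 if carry else carry0
    value = sum(bit << i for i, bit in enumerate(result_bits))
    return value + (carry << width)
```

In this model the RCA and BEC results of every group depend only on the operand bits, so in hardware they can be computed in parallel, and only the multiplexer selection depends on the carry from the previous group.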
5. Enhanced Half Ripple Carry Adder
The RCA is one of the basic adders for binary addition. However, its main disadvantage is the Carry Propagation Delay (CPD), i.e., every stage must wait for the carry signal from the previous stage. To reduce the CPD problem of the RCA circuit, the Enhanced Half Ripple Carry Adder (EHRCA) is developed in this work. The circuit diagram of the developed 4-bit EHRCA is illustrated in Figure 5. It consists of Half Adders (HAs), OR and AND gates, and multiplexers. As the name suggests, only the final half of the circuit (the multiplexer part) must wait for the carry signal from the previous stage; the remaining circuits can execute in parallel. Hence, this adder is named the Enhanced Half Ripple Carry Adder. The structure of this circuit resembles that of the SQRT CSLA. Instead of the RCA-BEC combination for Cin = 0 and Cin = 1 used in the CSLA, a simplified circuit is designed as shown in Figure 5. The carry input is considered only in the final stage of the EHRCA, whereas the remaining circuit performs its computation in parallel using the available input data. EHRCA circuits for 8-bit and 16-bit operands can be designed in the same way as in Figure 5. Further, the EHRCA is incorporated into the addition process of Equation (6) to increase the performance of the 2-D DWT. Three levels of decomposition are performed in this paper for image compression. The performance of the conventional SQRT CSLA and the developed EHRCA circuits is analyzed in the Results and Discussions section of this paper.
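The gate-level netlist of Figure 5 is not reproduced in the text, so the following Python model is only one plausible reading of the description, not the authors' circuit. It assumes that each bit position uses a half adder and an OR gate to prepare both candidate outputs (for carry-in 0 and carry-in 1) in parallel, and that a chain of multiplexers then selects between them. The name `ehrca_style_add` and this internal structure are assumptions.

```python
def ehrca_style_add(a, b, carry_in=0, width=4):
    """Behavioural sketch: parallel half-adder stage, then a multiplexer chain."""
    # Stage 1 (parallel): depends only on the operand bits, not on any carry.
    candidates = []
    for i in range(width):
        a_i, b_i = (a >> i) & 1, (b >> i) & 1
        half_sum, half_carry = a_i ^ b_i, a_i & b_i  # half adder
        candidates.append((
            half_sum, half_carry,                 # outputs if carry-in is 0
            half_sum ^ 1, half_carry | half_sum,  # outputs if carry-in is 1
        ))
    # Stage 2 (ripple): only the multiplexer selection waits for the carry.
    total, carry = 0, carry_in
    for i, (sum0, carry0, sum1, carry1) in enumerate(candidates):
        sum_bit = sum1 if carry else sum0
        carry = carry1 if carry else carry0
        total |= sum_bit << i
    return total | (carry << width)
```

Under these assumptions the only carry-dependent logic is one multiplexer pair per bit, which is consistent with the statement that only the multiplexer part of the circuit waits for the carry from the previous stage.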
6. Results and Discussions
In this paper, the Enhanced Half Ripple Carry Adder (EHRCA) circuit is designed using the Verilog Hardware Description Language (Verilog HDL). The proposed adder circuit is validated using ModelSim 6.3c and the synthesis results are obtained using the Xilinx 10.1i design tool. The levels of image decomposition using the 2-D DWT are evaluated using MATLAB. The RCA circuit was first realized and its redundant logic operations were identified; the EHRCA circuit was designed on the basis of the identified redundant logic. The EHRCA circuit closely resembles the conventional BEC-based SQRT CSLA. Hence, the performance of the conventional BEC-based SQRT CSLA and the developed EHRCA circuit for 16 bits is compared in Table 1. From Table 1, it is clear that the developed 16-bit EHRCA circuit offers a 10.71% reduction in silicon area and an 11.78% reduction in power consumption compared with the conventional BEC-based SQRT CSLA. Therefore, the developed EHRCA circuit is a better choice than the BEC-based SQRT CSLA for 2-D DWT implementation, and it is incorporated into the 2-D DWT addition process to improve performance. The simulation result for the 2-D DWT is illustrated in Figure 6. The input image is converted into pixels, and these pixels are shown in Figure 6. Three levels of decomposition are performed for image compression with the help of the DWT and the EHRCA. The input image for which the DWT coefficients are determined is shown in Figure 7. The three levels of decomposed images are illustrated in Figure 8.
7. Conclusion
This paper proposes the design of a VLSI architecture for image compression. Compression is performed using a lifting-based DWT architecture, which has the advantages of lower computational complexity and higher efficiency. Three levels of decomposition are performed in this paper. Simulation results for image compression using the 2-D DWT are validated with both ModelSim 6.3c and MATLAB simulation tools. In future, the developed EHRCA-based 2-D DWT will be helpful for image processing applications such as compression, segmentation and fragmentation.
