Abstract: In this paper, a new data access scheme is suggested for computing the lifting 2-D DWT (discrete wavelet transform) using systolic arrays with block processing. A linear systolic array is derived directly from the DG (dependence graph). For the parallel and pipelined implementation of the 1-D DWT, a suitably segmented DG is used to derive 2-D systolic arrays. These two systolic arrays are used as building blocks to derive the lifting 2-D DWT structure. The proposed architecture requires only a small on-chip memory of (4N + 8P) words, where N is the image width, and processes a block of P samples in every cycle. Compared with existing structures, it offers high throughput, low latency and lower computational complexity. Synthesis is performed with Xilinx 8.1i on Spartan 2E hardware (XC2S50E device, FT256 package), and simulation results are obtained using MATLAB 7.10 and ModelSim 6.3f. For an image size of 512 × 512 and a block size of 4, the area is 987500.22 µm², the power consumption is 8.34027 mW and the delay is 16.11 ns.
I. INTRODUCTION
Two-dimensional discrete wavelet transform (2-D DWT) has evolved as an effective and powerful tool in many applications, especially in image processing and compression. This is mainly due to the computational efficiency achieved by factoring wavelet transforms. DWT structures are mainly classified into two types: (i) convolution-based and (ii) lifting-based. The lifting scheme facilitates a high-speed, efficient implementation of the wavelet transform and is attractive for both high-throughput and low-power applications. Compared with convolution [1], [2], lifting requires fewer arithmetic and memory resources. In the parallel data access scheme of Cheng et al. [5], the size of the transposition memory is reduced, while the temporal memory remains independent of the data access scheme and the input block size. Therefore, in 2-D DWT structures based on the parallel data access scheme, the on-chip memory is dominated by the temporal memory. The line-based structure in [4] requires a temporal memory of size 3N to process four samples per cycle, and the parallel scanning lifting scheme [6] involves a temporal memory of the same size. The proposed block-processing systolic array makes efficient use of the temporal memory to reduce the area-time complexity of the 2-D DWT structure. Block-based parallel and pipelined architectures have been used to implement the 2-D DWT [7], [8]. Both structures have the same throughput rate and the same arithmetic resources, but their transposition memory sizes differ and vary with the size of the input data matrix. Mohanty et al. [7] obtained data blocks by folding rows; for a 1-level 2-D DWT, the temporal memory size is 3N and the transposition memory size is 2.5N. Tian [8] derived the data blocks from P-row parallel data access; the transposition memory size is N(P + 2)/2 and the temporal memory size is 3N, where P is the block size.
Structure [8] requires a transposition memory to buffer the intermediate blocks, and the blocks are processed in a different order from that of the input data matrix. Its transposition memory size depends on the block size as well as on the image size. The on-chip memory of [8] therefore depends on the block size, whereas the on-chip memory of [7] is independent of the block size and, for block sizes P ≥ 4, is smaller than that of [8]. This paper suggests a data access scheme that suitably partitions the computation and maps it onto an appropriate hardware architecture, to derive a memory-efficient and area-power-efficient block-based 2-D DWT structure.
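The memory figures quoted above can be compared directly. The following Python sketch only evaluates the word counts stated in this paper (temporal plus transposition memory for [7] and [8], and 4N + 8P for the proposed structure); the function names are hypothetical, and any storage not covered by those formulas is ignored.

```python
def mem_mohanty(n: int) -> float:
    """[7]: temporal memory 3N + transposition memory 2.5N (1-level 2-D DWT)."""
    return 3 * n + 2.5 * n


def mem_tian(n: int, p: int) -> float:
    """[8]: transposition memory N(P + 2)/2 + temporal memory 3N."""
    return n * (p + 2) / 2 + 3 * n


def mem_proposed(n: int, p: int) -> int:
    """Proposed structure: on-chip memory of 4N + 8P words."""
    return 4 * n + 8 * p


if __name__ == "__main__":
    N = 512  # image width used in the synthesis results
    for P in (2, 4, 8, 16):
        print(f"P={P:2d}  [7]={mem_mohanty(N):7.1f}  "
              f"[8]={mem_tian(N, P):7.1f}  proposed={mem_proposed(N, P)}")
```

For N = 512 and P = 4 the formulas give 2816 words for [7], 3072 for [8] and 2080 for the proposed structure. The memory of [8] grows with P while that of [7] does not, and [7] needs 1.5N - 8P more words than the proposed structure, which is consistent with the "nearly 1.5N" saving quoted in Section IV.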
II. EXISTING WORK
The modular and pipelined architecture for lifting-based multilevel 2-D DWT in [7] provides appropriate partitioning, and scheduling is performed at each decomposition level. The different decomposition levels are processed by a cascaded pipeline architecture. That structure uses the pyramid algorithm and the recursive pyramid algorithm, and the entire processing is based on a unit input block size. It uses local registers and RAM instead of buffers for data storage, and processes images of size 512 × 512. The main drawback of this method is its large on-chip memory, which requires more area and power. To overcome this drawback, we adopt lifting-based systolic arrays with block processing.
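In the pyramid algorithm, each decomposition level applies the 2-D DWT to the LL subband produced by the previous level. The Python sketch below performs no filtering; it only tracks the subband dimensions per level and the number of samples each level reads, to show how the work shrinks from level to level. The function names are hypothetical.

```python
def pyramid_levels(m: int, n: int, levels: int) -> list[tuple[int, int, int]]:
    """Return (level, rows, cols) of the four subbands produced at each level.

    Each level splits its input into LL, LH, HL and HH subbands of half the
    height and width; only LL is passed on to the next level.
    """
    shapes = []
    rows, cols = m, n
    for level in range(1, levels + 1):
        if rows % 2 or cols % 2:
            raise ValueError(f"level {level}: input {rows}x{cols} is not even")
        rows, cols = rows // 2, cols // 2
        shapes.append((level, rows, cols))
    return shapes


def total_input_samples(m: int, n: int, levels: int) -> int:
    """Samples read by all levels: the full image, then each successive LL subband."""
    total, rows, cols = 0, m, n
    for _ in range(levels):
        total += rows * cols
        rows, cols = rows // 2, cols // 2
    return total


if __name__ == "__main__":
    print(pyramid_levels(512, 512, 3))
    print(total_input_samples(512, 512, 3))
```

For a 512 × 512 image and three levels, the subbands are 256 × 256, 128 × 128 and 64 × 64, and the levels together read 344064 samples, below the (4/3)·M·N bound of the pyramid algorithm.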
III. PROPOSED WORK
The proposed structure consists of one row processor and one column processor. The row processor is composed of M 1-D systolic arrays.
A. One-Dimensional Systolic Array
The data dependence graph (DG) of the N-point lifting DWT is shown in Fig. 1(a). It consists of N/2 identical sections, and each section has four identical nodes. The lifting computation of the 1-D DWT is illustrated in Fig. 1(d).
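The four nodes per DG section can be read as the four lifting steps (two predict and two update steps) of the 9/7 filter referred to later in this paper; that mapping is our assumption. The following Python sketch performs one level of 1-D 9/7 lifting on an even-length signal with symmetric boundary extension. The coefficients are the commonly quoted CDF 9/7 lifting constants, and the final scaling (low-pass multiplied by K, high-pass divided by K) is only one of several conventions in use; both are assumptions rather than values given in this paper, and all names are hypothetical.

```python
# Commonly quoted CDF 9/7 lifting constants (assumed; not taken from this paper).
ALPHA = -1.586134342
BETA = -0.05298011854
GAMMA = 0.8829110762
DELTA = 0.4435068522
K = 1.149604398  # scaling constant; conventions differ between references


def _predict(s, d, c):
    """d[i] += c * (s[i] + s[i+1]); the right edge mirrors s[n-1]."""
    n = len(d)
    return [d[i] + c * (s[i] + s[min(i + 1, n - 1)]) for i in range(n)]


def _update(s, d, c):
    """s[i] += c * (d[i-1] + d[i]); the left edge mirrors d[0]."""
    n = len(s)
    return [s[i] + c * (d[max(i - 1, 0)] + d[i]) for i in range(n)]


def lift97_forward(x):
    """One level of 1-D 9/7 lifting; returns (low-pass, high-pass) lists."""
    if len(x) % 2:
        raise ValueError("signal length must be even")
    s, d = list(x[0::2]), list(x[1::2])  # even / odd samples
    d = _predict(s, d, ALPHA)  # step 1: first predict
    s = _update(s, d, BETA)    # step 2: first update
    d = _predict(s, d, GAMMA)  # step 3: second predict
    s = _update(s, d, DELTA)   # step 4: second update
    return [v * K for v in s], [v / K for v in d]


def lift97_inverse(low, high):
    """Undo lift97_forward by running the steps in reverse with negated constants."""
    s = [v / K for v in low]
    d = [v * K for v in high]
    s = _update(s, d, -DELTA)
    d = _predict(s, d, -GAMMA)
    s = _update(s, d, -BETA)
    d = _predict(s, d, -ALPHA)
    x = [0.0] * (2 * len(s))
    x[0::2], x[1::2] = s, d  # re-interleave even and odd samples
    return x
```

Each list comprehension applies one lifting step to all N/2 sections at once; a systolic implementation would instead pipeline the four steps so that a new sample pair enters every cycle. A constant input produces a high-pass output that is zero up to the rounding of the constants, and `lift97_inverse` reconstructs the input to floating-point precision.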
B. Two-Dimensional Systolic Array
The DG is partitioned into Q segments of P/2 sections each, where N = PQ. In the 2-D systolic array of block size P/2, the four registers are replaced by four shift registers (SRs) of N/2 words each to form a low-pass/high-pass block. The SRs provide the row delay that the low-pass/high-pass block needs for the intermediate coefficients and partial results. In both the low-pass and high-pass blocks, one SR stores the intermediate coefficients and the other three SRs store the partial results. The transposition memory and temporal memory of the proposed structure have sizes N and 3N, respectively. The low-pass and high-pass outputs of the row processor and the column processor must be scaled according to the lifting scheme for the 9/7 filter of the proposed structure, which is represented in Fig. 4. The high-pass block computes P/4 rows of the other two subband matrices (vHL and vHH). The column processor generates the four subband matrices (vLL, vLH, vHL and vHH), each of size M/2 × N/2, in NQ/2 cycles.
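Two mechanisms from this subsection can be modelled behaviourally in Python: the partitioning of the N/2 DG sections into Q = N/P segments of P/2 consecutive sections, and an SR of N/2 words acting as a row delay, so that a word written in one cycle reappears N/2 cycles later. The sketch assumes the SR advances by exactly one word per cycle; the names `segment_sections` and `ShiftRegister` are hypothetical and are not taken from the paper's design.

```python
from collections import deque


def segment_sections(n: int, p: int) -> list[list[int]]:
    """Split the N/2 DG sections into Q = N/P segments of P/2 consecutive sections."""
    if p < 2 or p % 2 or n % p:
        raise ValueError("P must be an even number that divides N")
    q, width = n // p, p // 2
    return [list(range(k * width, (k + 1) * width)) for k in range(q)]


class ShiftRegister:
    """Fixed-length delay line: shift() returns the word written `length` shifts earlier."""

    def __init__(self, length: int, fill: float = 0.0):
        if length < 1:
            raise ValueError("length must be at least 1")
        self._words = deque([fill] * length, maxlen=length)

    def shift(self, value: float) -> float:
        oldest = self._words[0]
        self._words.append(value)  # maxlen discards the oldest word automatically
        return oldest


if __name__ == "__main__":
    N, P = 16, 4
    print(segment_sections(N, P))  # Q = 4 segments of P/2 = 2 sections
    sr = ShiftRegister(N // 2)     # one row delay of N/2 words
    row0, row1 = [float(v) for v in range(8)], [float(v) for v in range(8, 16)]
    for v in row0:
        sr.shift(v)
    print([sr.shift(v) for v in row1])  # row 0 comes out while row 1 goes in
```

With N = 512 and P = 4, `segment_sections` yields Q = 128 segments of two sections each, and an SR of 256 words delays each coefficient by exactly one row of N/2 outputs.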
IV. RESULTS AND COMPARISON
The proposed structure involves the same arithmetic resources (multipliers and adders) and offers the same throughput as the existing structures.
However, the proposed structure involves nearly 1.5N fewer on-chip memory words than the existing structures and, unlike them, does not require multiplexers (MUXes). Owing to the smaller on-chip memory, the proposed scheme can save a substantial amount of area and power compared with the existing schemes.
The input image and the compressed image after 1-level decomposition are represented below. From Table II, compared with the various existing structures at their respective block sizes, the proposed structure requires less area, power and delay, and fewer I/O pins, at block size 4.
V. CONCLUSION
A new data access scheme for computing the lifting 2-D DWT (discrete wavelet transform) using systolic arrays with block processing has been suggested. A linear systolic array is derived directly from the DG (dependence graph), and for the parallel and pipelined implementation of the 1-D DWT, a suitably segmented DG is used to derive 2-D systolic arrays. These two systolic arrays are used as building blocks to derive the lifting 2-D DWT structure. The proposed structure requires only a small on-chip memory of (4N + 8P) words and processes a block of P samples in every cycle, where N is the image width. It involves the same number of multipliers and adders as the existing structures and 1.5N less on-chip memory. Synthesis results show that, for an image size of 512 × 512 and a block size of 4, the proposed structure, with an area of 987500.22 µm², a power consumption of 8.34027 mW and a delay of 16.11 ns, is better than the best of the existing structures. The proposed structure is regular and modular, and can easily be configured for different image sizes.
