This paper presents a memory-efficient, high-throughput, parallel, lifting-based running three-dimensional discrete wavelet transform (3-D DWT) architecture. The 3-D DWT is constructed by combining two spatial processors and four temporal processors. Each spatial processor (SP) applies the two-dimensional DWT to a frame using the lifting-based 9/7 filter bank, first in the row direction through the row processor (RP) and then in the column direction through the column processor (CP). To reduce the temporal memory and the latency, the temporal processor (TP) is designed with the lifting-based 1-D Haar wavelet filter. The proposed architecture replaces multiplications with pipelined shift-add operations to reduce the critical path delay (CPD). The two spatial processors work simultaneously on two adjacent frames and provide 2-D DWT coefficients as inputs to the temporal processors. The TPs apply the one-dimensional DWT in the temporal direction and provide eight 3-D DWT coefficients per clock cycle (throughput). The higher throughput reduces the computing cycles per frame and enables lower power consumption. Implementation results show that the proposed architecture has advantages in memory, power consumption, latency, and throughput over the existing designs. The RTL of the proposed architecture is described in Verilog and synthesized using a 90-nm CMOS standard cell library; the results show that it consumes 43.42 mW of power and occupies an area of 231.45 K equivalent gates at a frequency of 200 MHz. The proposed architecture has also been synthesized for the Xilinx Zynq 7020 series field programmable gate array (FPGA).
I. INTRODUCTION
Video compression is a major requirement in many recent applications, such as medical imaging, studio applications, and broadcasting. The compression ratio of an encoder depends entirely on the underlying compression algorithm. The goal of compression techniques is to reduce the immense amount of visual information to a manageable size so that it can be efficiently stored, transmitted, and displayed. A 3-D DWT based compression system enables compression in the spatial as well as the temporal direction, which makes it well suited to video compression. Moreover, wavelet based compression provides scalability through the number of decomposition levels. Because video frame sizes keep increasing (HD to UHD), video processing with software coding tools is becoming more complex, and only dedicated hardware can deliver high performance for high-resolution video processing. In this scenario, there is a strong requirement for a VLSI architecture for an efficient 3-D DWT processor that consumes little power, is area and memory efficient, and operates at a frequency high enough for real-time applications.
arXiv:1509.04268v1 [cs.AR] 14 Sep 2015
Over the last two decades, several hardware designs have been reported for the implementation of the 2-D DWT. In general, the circuit complexity is determined by two major components, namely the arithmetic component and the memory component. The arithmetic component includes adders and multipliers, whereas the memory component consists of temporal memory and transpose memory. The complexity of the arithmetic component depends entirely on the DWT filter length. In contrast, the size of the memory component depends on the dimensions of the image. As image resolutions continue to increase (HD to UHD), the image dimensions are very large compared to the filter length of the DWT; as a result, the memory component accounts for the major share of the overall complexity of a DWT architecture.
Convolution based implementations [1] - [3] provide the outputs in less time, but they require a large amount of arithmetic resources, are memory intensive, and occupy a larger area. Lifting based implementations require less memory, have lower arithmetic complexity, and can be implemented in parallel.
However, they have a long critical path, and many recent contributions aim to reduce the critical path in lifting based implementations. The general lifting based structure [4] has a critical path of $4T_m + 8T_a$, where $T_m$ and $T_a$ denote the multiplier and adder delays; introducing a four-stage pipeline cuts it down to $T_m + 2T_a$. In [5], Huang et al. introduced a flipping structure that further reduces the critical path to $T_m + T_a$. Although this reduces the critical path delay of lifting based implementations, the memory efficiency still needs to be improved. Most designs implement the 2-D DWT by first applying the 1-D DWT row-wise and then applying the 1-D DWT column-wise, which requires a large amount of memory to store the intermediate coefficients. To reduce this memory requirement, several DWT architectures based on line based scanning methods have been proposed [7] - [11]. Huang et al. [7] - [8] give brief details of a B-spline based 2-D IDWT implementation, discuss the memory requirements of different scan techniques, and propose an efficient overlapped strip-based scanning to reduce the internal memory size. Several parallel architectures have been proposed for the lifting-based 2-D DWT [8] - [17]. Y. Hu et al. [17] proposed a modified strip based scanning and a parallel architecture for the 2-D DWT; it is the most memory-efficient design among the existing 2-D DWT architectures and requires only 3N + 24P of on-chip memory for an N×N image with P parallel processing units (PUs). Several lifting based 3-D DWT architectures are reported in the literature [18] - [24].
II. THEORETICAL BACKGROUND
The lifting based wavelet transform is designed using a series of matrix decompositions specified by Daubechies and Sweldens in [4]. By applying flipping [5] to the lifting scheme, the multipliers in the longest delay path are eliminated, resulting in a shorter critical path. The original data on which DWT is
Lifting based wavelets are memory efficient and easy to implement in hardware. The lifting scheme consists of three steps to decompose the samples, namely splitting, predicting (eqns. (1) and (3)), and updating (eqns. (2) and (4)).
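The split, predict, and update steps can be illustrated with a small software model. The sketch below is not the paper's hardware and does not reproduce its eqns. (1) to (4); it assumes the standard 9/7 lifting constants from the wavelet literature, an even-length input, symmetric extension at the borders, and one common scaling convention (low-pass multiplied by K, high-pass divided by K). The function names are hypothetical.

```python
# Standard CDF 9/7 lifting constants (assumed; taken from the general
# wavelet literature, not from the equations of this paper).
ALPHA, BETA = -1.586134342, -0.05298011854
GAMMA, DELTA = 0.8829110762, 0.4435068522
K = 1.149604398  # scaling factor (one common convention)


def _predict(s, d, c):
    """Predict step: each odd sample is corrected by its two even neighbours."""
    m = len(s)
    for n in range(m):
        # The right neighbour past the border is mirrored (symmetric extension).
        d[n] += c * (s[n] + s[min(n + 1, m - 1)])


def _update(s, d, c):
    """Update step: each even sample is corrected by its two odd neighbours."""
    for n in range(len(s)):
        # The left neighbour before the border is mirrored.
        s[n] += c * (d[max(n - 1, 0)] + d[n])


def dwt97_forward(x):
    """One level of the 1-D 9/7 DWT by lifting; returns (low, high)."""
    if len(x) < 2 or len(x) % 2:
        raise ValueError("expected an even number of samples")
    s = [float(v) for v in x[0::2]]  # split: even samples
    d = [float(v) for v in x[1::2]]  # split: odd samples
    _predict(s, d, ALPHA)
    _update(s, d, BETA)
    _predict(s, d, GAMMA)
    _update(s, d, DELTA)
    return [K * v for v in s], [v / K for v in d]


def dwt97_inverse(low, high):
    """Undo dwt97_forward by running the lifting steps backwards."""
    s = [v / K for v in low]
    d = [K * v for v in high]
    _update(s, d, -DELTA)
    _predict(s, d, -GAMMA)
    _update(s, d, -BETA)
    _predict(s, d, -ALPHA)
    x = [0.0] * (2 * len(s))
    x[0::2], x[1::2] = s, d
    return x
```

Because every lifting step only adds a function of the other half of the samples, the inverse is obtained by repeating the same steps in reverse order with negated coefficients, which is why the scheme maps well to in-place, low-memory hardware.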
The Haar wavelet transform is orthogonal, simple to construct, and fast to compute. Considering these advantages, the proposed architecture uses the Haar wavelet to perform the 1-D DWT in the temporal direction (between two adjacent frames). Sweldens et al. [25] developed a lifting based Haar wavelet. The equations of the lifting scheme for the Haar wavelet transform are as shown in eqn.
Eqn. (8) is obtained by substituting 1 for the predict step P(z) and 1/2 for the update step S(z) in eqn. (7); it is used to develop the temporal processor, which applies the 1-D DWT in the temporal direction (the third dimension).
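A minimal software sketch of the Haar lifting step with a predict value of 1 and an update value of 1/2 is given below. It assumes the usual sign convention (the high-pass output is the odd sample minus the predicted even sample) and treats each frame as a flat list of co-located coefficients; the function names are hypothetical and the exact form of eqn. (8) is not reproduced here.

```python
def haar_lift_forward(frame_even, frame_odd):
    """Temporal Haar lifting between two adjacent frames.

    Predict (P = 1):   high = odd - even
    Update (S = 1/2):  low  = even + high / 2   (the average of the two frames)
    """
    high = [b - a for a, b in zip(frame_even, frame_odd)]
    low = [a + h / 2 for a, h in zip(frame_even, high)]
    return low, high


def haar_lift_inverse(low, high):
    """Recover the two frames by undoing the update, then the predict step."""
    frame_even = [l - h / 2 for l, h in zip(low, high)]
    frame_odd = [a + h for a, h in zip(frame_even, high)]
    return frame_even, frame_odd
```

Only the two current frames are needed to produce one low-pass and one high-pass output, which is consistent with the small temporal memory and low latency attributed to the Haar-based temporal processor.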
A. Architecture for Spatial Processor
In this section, we propose a new parallel and memory efficient lifting based 2-D DWT architecture, denoted the spatial processor (SP), which consists of row and column processors. The proposed SP is a revised version of the architecture developed by Y. Hu et al. [17]. The proposed architecture uses strip based scanning [17] to enable a trade-off between external memory and internal memory. To reduce the critical path in each stage, the flipping model [5] - [6] is used to develop the processing element (PE). Each PE is built with shift and add techniques in place of a multiplier. The lifting based (9/7) 1-D DWT is performed by the processing unit (PU) in the proposed architecture. As shown in Fig. 2, the proposed PU is designed with five PEs, and each PE except the first (the shift PE) is divided into two pipeline stages to reduce the CPD to $T_a$ (adder delay). Fig. 1 shows that the number of inputs to the spatial processor is equal to 2P + 1, which is also the width of the strip, where P is the number of parallel processing units.
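The idea of replacing a constant multiplier with shifts and adds can be sketched as follows. This is an illustration only: the helper names, the choice of 8 fractional bits, and the use of the standard 9/7 constant −1.586134342 as the example are assumptions, and the resulting shift pattern is not claimed to match the values adopted in the paper's Table I (the flipping structure uses its own set of constants).

```python
def shift_add_plan(coeff, frac_bits=8):
    """Approximate coeff as sign * sum(2**-k for k in shifts).

    The magnitude is rounded to frac_bits fractional bits and each set bit
    becomes one shifted copy of the input in the adder tree.
    """
    sign = -1 if coeff < 0 else 1
    q = round(abs(coeff) * (1 << frac_bits))  # fixed-point magnitude
    shifts = [frac_bits - b
              for b in range(q.bit_length() - 1, -1, -1)
              if (q >> b) & 1]
    return sign, shifts


def shift_add_multiply(x, sign, shifts, frac_bits=8):
    """Multiply the integer x by the planned constant with shifts and adds only."""
    acc = 0
    for k in shifts:
        acc += x << (frac_bits - k)  # x * 2**(frac_bits - k)
    return sign * (acc >> frac_bits)  # drop the fractional bits


# Example: -1.586134342 ~= -(1 + 1/2 + 1/16 + 1/64 + 1/128) = -1.5859375
sign, shifts = shift_add_plan(-1.586134342)
approx = sign * sum(2.0 ** -k for k in shifts)
```

With 8 fractional bits the approximation error in this example is about 0.0002, which illustrates why a handful of adders can stand in for a full multiplier when the coefficients are fixed.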
1) Row Processor (RP):
Let X be an image of size N × N. Extend this image by one column using symmetric extension, so that the image size becomes N × (N + 1). Refer to [17] for the structure of the strip based scanning method. The proposed architecture starts the DWT process row-wise through the row processor (RP) and then performs the column DWT through the column processor (CP). Fig. 3(a) shows the generalized structure of a row processor with P PUs; P = 2 is used in our proposed design. In the first clock cycle, the RP gets the pixels X(0, 0) to X(0, 2P) simultaneously. In the second clock cycle, the RP gets the pixels of the next row, i.e., X(1, 0) to X(1, 2P), and the same procedure continues every clock cycle until it reaches the bottom row, i.e., X(N − 1, 0) to X(N − 1, 2P). It then moves to the next strip, where the RP gets the pixels X(0, 2P) to X(0, 4P), and it continues this procedure for the entire image. Each PU consists of five pipeline stages, and each pipeline stage is processed by one processing element (PE), as depicted in Fig. 2(b , it also provides the partial output X[2n] which is required for the PE beta). The structure of the PEs is given in Fig. 2(b), which shows that multiplication is replaced with the shift and add technique. The original multiplication factors and the values realized by the shift and add circuits are listed in Table I, which shows that the difference between the original and the adopted values is extremely small.
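The scan order described above can be modelled with a short generator. The sketch assumes that rows are indexed from 0 to N − 1, that N is a multiple of 2P, and that adjacent strips share one boundary column (strip width 2P + 1, strip step 2P); the function name is hypothetical, and [17] remains the reference for the actual scanning structure.

```python
def strip_scan(n, p):
    """Yield (row, first_col, last_col) for every clock cycle of the row processor.

    Each clock delivers the 2P + 1 pixels X(row, first_col) .. X(row, last_col)
    of one row of the current strip; strips advance by 2P columns.
    """
    if n % (2 * p):
        raise ValueError("this sketch assumes N is a multiple of 2P")
    for first_col in range(0, n, 2 * p):   # one strip at a time
        for row in range(n):               # top row to bottom row
            yield row, first_col, first_col + 2 * p


# Example: an 8 x 8 image (extended to 8 x 9) with P = 2
schedule = list(strip_scan(8, 2))
```

In this model an N × N image takes N × N / (2P) clock cycles per level, because each clock consumes 2P new columns of one row plus one column shared with the neighbouring strip.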
As shown in Fig. 2(b), the delay of the shift PE is one $T_a$, and all the remaining PEs have a delay of $2T_a$. To reduce the CPD of the PU, the PEs from PE alpha to PE delta are each divided into two pipeline stages with a delay of $T_a$ per stage; as a result, the CPD of the PU is reduced to $T_a$ and the number of pipeline stages increases to nine, as shown in Fig. 2(c). The outputs corresponding to PE alpha, PE beta, and PE gamma of the last PU are saved in the memories Memory alpha, Memory beta, and Memory gamma, respectively, as shown in Fig. 3(a). These stored outputs are used as inputs for the subsequent columns of the same row. For an N × N image, the number of rows is N, so the size of each memory is N × 1 words and the total row memory needed to store these outputs is 3N words. The output of each PU undergoes scaling before it produces the outputs H and L. These outputs are fed to the transposing unit. The transpose unit has P transpose registers (one for each PU). Fig. 4(a) shows the structure of the transpose register, which delivers two H and two L values alternately to the column processor.
2) Column Processor (CP): The structure of the column processor (CP) is shown in Fig. 3(b). To match the throughput of the RP, the CP is also designed with two PUs in our architecture. Each transpose register produces a pair of H and L values in alternating order, which are fed to the inputs of one PU of the CP. The partial results produced are consumed by the next PE after two clock cycles. Shift registers of length two are therefore needed within the CP between pipeline stages to cache the partial results (except between the first and second pipeline stages). At the output of the CP, the four sub-bands are generated in an interleaved pattern, i.e., (HL, HH), (LL, LH), (HL, HH), (LL, LH), and so on. The outputs of the CP are fed to the re-arrange unit. Fig. 4(b) shows the architecture of the re-arrange unit, which provides the outputs in sub-band order, i.e., LL, LH, HL, and HH simultaneously, using P registers and 2P multiplexers.
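The reordering performed on the interleaved CP output can be illustrated with a behavioural model for a single PU. This sketch only captures the data ordering: one held pair plays the role of a register, and it is not a description of the register and multiplexer structure of Fig. 4(b). The function name and the tuple representation are assumptions.

```python
def rearrange(pairs):
    """Group an interleaved stream (HL, HH), (LL, LH), (HL, HH), ... into
    (LL, LH, HL, HH) tuples, one tuple for every two input pairs."""
    held = None
    for cycle, pair in enumerate(pairs):
        if cycle % 2 == 0:
            held = pair          # keep (HL, HH) until (LL, LH) arrives
        else:
            hl, hh = held
            ll, lh = pair
            yield ll, lh, hl, hh


stream = [("HL0", "HH0"), ("LL0", "LH0"), ("HL1", "HH1"), ("LL1", "LH1")]
ordered = list(rearrange(stream))
```

Holding only one pair per PU is enough to present all four sub-band samples together, which is in line with the small register count quoted for the re-arrange unit.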
For multilevel decomposition, the same DWT core can be used in a folded architecture with an external frame buffer for the LL sub-band coefficients.
B. Architecture for Temporal Processor (TP)
Eqn. (8) 
IV. IMPLEMENTATION RESULTS AND PERFORMANCE COMPARISON
The proposed 3-D DWT architecture has been described in Verilog HDL. A uniform word length of 14 bits has been maintained throughout the design. Simulation results have been verified by using Xilinx 
A. Comparison
The performance of the proposed 2-D and 3-D DWT architectures is compared with that of other existing architectures in Tables III and IV. The proposed design has a lower memory requirement, higher throughput, less computation time, and lower latency than [19], [20], [22], and [24]. Although the proposed 3-D DWT architecture has a small disadvantage in area and frequency when compared to [22], it has a clear advantage in all the remaining aspects. Table V compares the synthesis results of the proposed 3-D DWT architecture with those of [24]. The proposed design appears to occupy more cell area, but its figure includes the total on-chip memory, whereas the on-chip memory is not included in [24]. The power consumption of the proposed 3-D architecture is much lower than that of [24].
V. CONCLUSIONS
In this paper, we have proposed a memory efficient and high throughput architecture for the lifting based 3-D DWT. The proposed architecture is implemented on the 7z020clg484 FPGA target of the Zynq family and is also synthesized with Synopsys Design Vision for ASIC implementation. The efficient design of the 2-D spatial processor and the 1-D temporal processor reduces the internal memory, latency, CPD, and control unit complexity, and increases the throughput. Compared with the existing architectures, the proposed scheme shows higher performance at the cost of a slight increase in area. The proposed 3-D DWT architecture is capable of processing 60 UHD (3840×2160) frames per second.
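The frame-rate claim can be sanity-checked with simple arithmetic from the stated throughput of eight coefficients per clock cycle. The estimate below is a rough lower bound under stated assumptions: a single decomposition level, one output coefficient per input pixel, and no allowance for pipeline latency, strip overlap, or memory stalls. The function name is hypothetical.

```python
def min_clock_hz(width, height, fps, coeffs_per_clock):
    """Lowest clock frequency that sustains the given frame rate, assuming
    one coefficient per pixel and coeffs_per_clock coefficients per cycle."""
    samples_per_second = width * height * fps
    return samples_per_second / coeffs_per_clock


cycles_per_frame = 3840 * 2160 / 8           # 1,036,800 cycles per UHD frame
required_hz = min_clock_hz(3840, 2160, 60, 8)  # about 62.2 MHz
```

Under these assumptions, 60 UHD frames per second needs a clock of roughly 62.2 MHz, which is well below the 200 MHz reported for the ASIC synthesis.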
