This paper presents a memory-efficient VLSI architecture for a low-complexity video encoder using three-dimensional 
estimation and deblocking filters of the current video coding system. The CS module exploits the sparse nature of the wavelet coefficients and projects them onto random Bernoulli matrices to select the measurements at the encoder, which enables compression; the approximate message passing algorithm is used for reconstruction at the decoder. The CS module provides a good compression ratio and improves error resilience. As a result, the proposed architecture has lower complexity at the encoder and marginal complexity at the decoder.
Over the last two decades, several hardware designs have been reported for the implementation of the 2-D DWT. Convolution-based implementations [5] [6] [7] produce outputs quickly but require a large amount of arithmetic resources, are memory intensive, and occupy a larger area. Lifting-based implementations require less memory, have lower arithmetic complexity, and can be implemented in parallel.
However, they have a long critical path, and a large number of recent contributions aim to reduce the critical path in lifting-based implementations. A general lifting-based structure [8] has a critical path of $4T_m + 8T_a$ (where $T_m$ and $T_a$ denote the multiplier and adder delays, respectively), which a 4-stage pipeline cuts down to $T_m + 2T_a$. In [9], Huang et al. introduced a flipping structure that further reduced the critical path to $T_m + T_a$. Although this reduces the critical path delay of lifting-based implementations, memory efficiency still needs improvement. The majority of designs implement the 2-D DWT by first applying the 1-D DWT row-wise and then applying the 1-D DWT column-wise, which requires a large amount of memory to store the intermediate coefficients. To reduce this memory requirement, several DWT architectures have been proposed that use line-based scanning methods [10] [11] [12] [13] [14]. Huang et al. [10]-[11] gave brief details of a B-spline-based 2-D IDWT implementation, discussed the memory requirements of different scan techniques, and proposed an efficient overlapped strip-based scanning to reduce the internal memory size. Several parallel architectures have been proposed for the lifting-based 2-D DWT [11] [12] [13] [14] [15] [16] [17] [18] [19] [20]. Y. Hu et al. [20] proposed a modified strip-based scanning and parallel architecture for the 2-D DWT, which is the most memory-efficient design among the existing 2-D DWT architectures; it requires only 3N + 24P of on-chip memory for an N×N image with P parallel processing units (PUs).
Several lifting-based 3-D DWT architectures have been reported in the literature [21] [22] [23] [24] [25] [26] that reduce the critical path of the 1-D DWT architecture and decrease the memory requirement of the 3-D architecture. Among the existing 3-D DWT designs, Darji et al. [26] produced the best results, reducing the memory requirement and providing a throughput of 4 results/cycle. However, it still requires $4N^2 + 10N$ of on-chip memory.
Based on the ideas of compressed sensing (CS) [27] [28] [29], several new video codecs [30] [31] [32] [33] [34] [35] have been proposed in the last few years. Wakin et al. [30] introduced compressive imaging and video encoding through a single-pixel camera. From their research results, Wakin et al. established that the 3-D wavelet transform is a better choice for video than the 2-D (two-dimensional) wavelet transform. Y. Hou and F. Liu [31] proposed a low-complexity system in which sparsity is extracted from the residuals of successive non-key frames and CS is applied to those frames. Key frames are fully sampled, resulting in an increased bit-rate. Moreover, performing motion estimation and compensation while predicting the non-key frames increases the encoder complexity. S. Xiang and Lin Cai [32] proposed CS-based scalable video coding in which the base layer is composed of a small set of DCT coefficients, while the enhancement layer is composed of compressed sensed measurements. It uses the DCT for I frames and the undecimated DWT (UDWT) for CS measurements, which increases the complexity at the decoder to a great extent. Jiang et al. [33] proposed CS-based scalable video coding using the total variation of the coefficients of the temporal DCT.
Scalability is enabled by multi-resolution measurements while the video signal is reconstructed by total variation minimization by augmented Lagrangian and alternating direction algorithms (TVAL3) [34] at the decoder. However, it increases the decoder complexity, making hardware implementation quite difficult.
J. Ma et al. [35] introduced fast and simple on-line encoding and decoding using a forward and backward splitting algorithm. Though the encoder complexity is low, scalability is not achieved and the decoder complexity is very high. Most of the recently proposed video codecs [30] [31] [32] [33] [34] [35] assume that uniform sparsity is available for all the video frames, and a fixed number of measurements is transmitted to the decoder for every frame. Depending on the content of the video frame, sparsity may
Finally, concluding remarks are given in Section VI.
A. Discrete Wavelet Transform
The lifting-based wavelet transform is designed using a series of matrix decompositions specified by Daubechies and Sweldens in [8]. By applying flipping [9] to the lifting scheme, the multipliers in the 
where a = 1/α, b = 1/(αβ), c = 1/(βγ), d = 1/(γδ), K0 = αβγ/ζ, and K1 = αβγδζ [8]. The lifting step coefficients α, β, γ, δ and the scaling coefficient ζ are constants with the values α = −1.586134342, β = −0.052980118, γ = 0.8829110762, δ = 0.4435068522, and ζ = 1.149604398.
Lifting-based wavelets are memory efficient and easy to implement in hardware. The lifting scheme decomposes the samples in three steps, namely, splitting, predicting (eqns. (1) and (3)), and updating (eqns. (2) and (4)).
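As a quick numerical check of the constants above, the following minimal sketch evaluates the flipped-structure coefficients from the listed lifting constants. It assumes that 1/αβ is read as 1/(αβ), and likewise for c and d; the function name `flipped_coefficients` is illustrative only.

```python
# Lifting constants for the 9/7 filter, copied from the text above.
ALPHA = -1.586134342
BETA = -0.052980118
GAMMA = 0.8829110762
DELTA = 0.4435068522
ZETA = 1.149604398


def flipped_coefficients(alpha=ALPHA, beta=BETA, gamma=GAMMA,
                         delta=DELTA, zeta=ZETA):
    """Evaluate a, b, c, d, K0 and K1 as defined in the text.

    Assumption: 1/αβ means 1/(α·β), and similarly for c and d.
    """
    return {
        "a": 1.0 / alpha,
        "b": 1.0 / (alpha * beta),
        "c": 1.0 / (beta * gamma),
        "d": 1.0 / (gamma * delta),
        "K0": alpha * beta * gamma / zeta,
        "K1": alpha * beta * gamma * delta * zeta,
    }


if __name__ == "__main__":
    for name, value in flipped_coefficients().items():
        print(f"{name} = {value:.6f}")
```

These are the constants that the shift and add circuits in the processing elements have to approximate.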
The Haar wavelet transform is orthogonal, simple to construct, and fast to compute. Given these advantages, the proposed architecture uses the Haar wavelet to perform the 1-D DWT in the temporal direction (between two adjacent frames). Sweldens et al. [45] developed a lifting-based Haar wavelet. The equations of the lifting scheme for the Haar wavelet transform are as shown in eqn.
Eqn. (8) is obtained by substituting the predict value P(z) = 1 and the update value S(z) = 1/2 in eqn. (7); it is used to develop the temporal processor that applies the 1-D DWT in the temporal direction (3rd dimension).
where L and H are the low- and high-frequency coefficients, respectively.
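The temporal Haar lifting step can be sketched in a few lines. The version below is a minimal illustration of the standard Haar lifting form with predict value 1 and update value 1/2, that is, H = B − A and L = A + H/2. It assumes that the first frame A plays the role of the even sample and the second frame B the odd sample, and that frames are nested lists; neither detail is stated in the text.

```python
def haar_lift_forward(frame_a, frame_b):
    """One temporal Haar lifting level over two frames of equal size.

    Predict (P = 1):   H = B - A
    Update  (S = 1/2): L = A + H / 2
    """
    high = [[b - a for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(frame_a, frame_b)]
    low = [[a + h / 2 for a, h in zip(row_a, row_h)]
           for row_a, row_h in zip(frame_a, high)]
    return low, high


def haar_lift_inverse(low, high):
    """Undo the lifting steps in reverse order to recover both frames."""
    frame_a = [[l - h / 2 for l, h in zip(row_l, row_h)]
               for row_l, row_h in zip(low, high)]
    frame_b = [[h + a for h, a in zip(row_h, row_a)]
               for row_h, row_a in zip(high, frame_a)]
    return frame_a, frame_b


if __name__ == "__main__":
    a = [[10, 20], [30, 40]]
    b = [[12, 18], [30, 44]]
    low, high = haar_lift_forward(a, b)
    print("L-frame:", low)
    print("H-frame:", high)
```

In this form the L-frame is the average of the two frames and the H-frame is their difference, so a temporal processor needs only a subtractor, an adder and a one-bit shift.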
The process shown in Fig. 2 represents one level of decomposition in the spatial and temporal directions.
Among all the sub-bands, only the LLL sub-band (LL band of the L-frames) is fully sampled and transmitted without applying any CS technique, because it represents the image at low resolution (the base layer in the SVC domain) and is not sparse. All the other sub-bands (3-D wavelet coefficients) exhibit approximate sparsity (values near zero), and hard thresholding is applied to them (a coefficient is set to zero if its magnitude is less than the threshold). After this step, conventional encoders use EZW coding to encode these wavelet coefficients, which is complex to implement in hardware. In the proposed framework, EZW coding is replaced by CS, which exploits the sparsity-preserving nature of the random Bernoulli matrix by projecting the wavelet coefficients onto it. The DWT version of each frame consists of four sub-bands. The LL sub-bands of the L-frames have large wavelet coefficients. The remaining three sub-bands of the L-frames and the four sub-bands of the H-frames exhibit sparsity, and compressed sensing is applied to them.
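The two encoder-side operations described above, hard thresholding followed by projection onto a random Bernoulli matrix, can be illustrated as follows. This is a minimal sketch, not the authors' implementation: it assumes the Bernoulli entries are drawn from {−1, +1}, and the function names, matrix size and seed are hypothetical.

```python
import random


def hard_threshold(coeffs, threshold):
    """Zero every coefficient whose magnitude is below the threshold."""
    return [c if abs(c) >= threshold else 0 for c in coeffs]


def bernoulli_matrix(rows, cols, seed=0):
    """Random matrix with entries from {-1, +1} (assumed convention)."""
    rng = random.Random(seed)
    return [[rng.choice((-1, 1)) for _ in range(cols)] for _ in range(rows)]


def project(phi, x):
    """Compute y = phi * x using only additions and subtractions."""
    y = []
    for row in phi:
        acc = 0
        for sign, value in zip(row, x):
            acc = acc + value if sign > 0 else acc - value
        y.append(acc)
    return y


if __name__ == "__main__":
    coeffs = [0.2, -3.0, 0.05, 1.5, 0.0, -0.4, 2.25, 0.1]
    sparse = hard_threshold(coeffs, threshold=1.0)
    phi = bernoulli_matrix(rows=4, cols=len(sparse))
    print("thresholded:", sparse)
    print("measurements:", project(phi, sparse))
```

With ±1 entries, each measurement is a signed sum of coefficients, so under this assumption the projection needs no multipliers.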
B. Compressed Sensing
Compressed sensing is an innovative scheme that enables sampling below the Nyquist rate with little or no drop in reconstruction quality. The basic principle behind compressed sensing is to exploit the sparsity of the signal in some domain. In the proposed work, CS is applied in the wavelet domain.
Let x be a set of N real, discrete-time samples, and let s be the representation of x in the Ψ (transform) domain, that is, $x = \Psi s$.
The problem of signal recovery from CS measurements has been well studied in recent years, and a host of algorithms have been proposed, such as Orthogonal Matching Pursuit (OMP) [38] [39] [40], Iterative Hard-Thresholding (IHT) [41], and Iterative Soft-Thresholding (IST) [42]. Although the recently introduced Approximate Message Passing (AMP) algorithm [43] has a structure similar to IHT and IST, it exhibits faster convergence. The literature [43], [44] shows that AMP performs excellently for many deterministic and highly structured matrices.
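To make the structure of AMP concrete, the sketch below implements the basic soft-thresholding AMP iteration in plain Python. It is an illustration under stated assumptions rather than the decoder used in this work: the measurement matrix has i.i.d. ±1/√M entries, the threshold is set to `tau` times the estimated residual level (a common policy, assumed here), and all names and parameter values are hypothetical. The extra term added to the residual (the Onsager correction) is what distinguishes AMP from plain iterative soft-thresholding.

```python
import math
import random


def scaled_bernoulli(m, n, rng):
    """m x n matrix with i.i.d. entries +/- 1/sqrt(m) (assumed scaling)."""
    scale = 1.0 / math.sqrt(m)
    return [[rng.choice((-scale, scale)) for _ in range(n)] for _ in range(m)]


def soft_threshold(value, threshold):
    """Shrink value towards zero by threshold."""
    return math.copysign(max(abs(value) - threshold, 0.0), value)


def amp_recover(phi, y, iterations=30, tau=2.0):
    """Estimate a sparse x from y = phi * x with soft-thresholding AMP."""
    m, n = len(phi), len(phi[0])
    x = [0.0] * n
    z = list(y)  # residual; equals y while x is all zeros
    for _ in range(iterations):
        # Estimate the effective noise level from the current residual.
        sigma = math.sqrt(sum(r * r for r in z) / m)
        # Pseudo-data: x + phi^T z.
        pseudo = [x[j] + sum(phi[i][j] * z[i] for i in range(m))
                  for j in range(n)]
        x = [soft_threshold(p, tau * sigma) for p in pseudo]
        # Onsager term: (number of non-zero estimates / m) * previous residual.
        onsager = sum(1 for v in x if v != 0.0) / m
        z = [y[i] - sum(p * v for p, v in zip(phi[i], x)) + onsager * z[i]
             for i in range(m)]
    return x


if __name__ == "__main__":
    rng = random.Random(1)
    n, m, k = 256, 128, 10
    phi = scaled_bernoulli(m, n, rng)
    x_true = [0.0] * n
    for j in rng.sample(range(n), k):
        x_true[j] = rng.choice((-1.0, 1.0))
    y = [sum(p * v for p, v in zip(row, x_true)) for row in phi]
    x_hat = amp_recover(phi, y)
    err = math.sqrt(sum((a - b) ** 2 for a, b in zip(x_hat, x_true)))
    print("relative error:", err / math.sqrt(k))
```

Each iteration consists of two matrix-vector products and an element-wise shrinkage.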
III. PROPOSED ARCHITECTURE FOR 3-D DWT
The proposed architecture for the 3-D DWT, comprising two parallel spatial processors (2-D DWT) and four temporal processors (1-D DWT), is depicted in Fig. 1(b). After the 2-D DWT is applied to two consecutive frames, each spatial processor (SP) produces four sub-bands, viz. LL, HL, LH and HH, which are fed to the inputs of the four temporal processors (TPs) to perform the temporal transform. The output of these TPs is a low-frequency frame (L-frame) and a high-frequency frame (H-frame). Architectural details of the spatial and temporal processors are discussed in the following sections.
A. Architecture for Spatial Processor
In this section, we propose a new parallel and memory-efficient lifting-based 2-D DWT architecture, denoted the spatial processor (SP), which consists of row and column processors. The proposed SP is a revised version of the architecture developed by Y. Hu et al. [20]. It uses strip-based scanning [20] to enable a trade-off between external memory and internal memory. To reduce the critical path in each stage, the flipping model [9]-[37] is used to develop the processing element (PE). Each PE is built with shift and add techniques in place of a multiplier. The lifting-based (9/7) 1-D DWT is performed by the processing unit (PU) of the proposed architecture. To reduce the critical path delay (CPD), the PU is designed with five pipeline stages, and multipliers are replaced with shift and add techniques. This modified PU reduces the CPD to $2T_a$ (two adder delays). Fig. 3(a) shows the data flow graph (DFG) of the proposed PU, and Fig. 3(b) depicts its internal architecture. The number of inputs to the spatial processor is 2P+1, which is also equal to the width of each strip.
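Replacing a constant multiplier with shifts and adds can be illustrated with a short fixed-point sketch. The word length of 8 fractional bits, the plain binary decomposition and the function names are assumptions made for illustration; the shift and add circuits and adopted values actually used in the PEs are those of Fig. 3(b) and Table I, which are not reproduced here.

```python
def shift_positions(constant, frac_bits=8):
    """Bit positions set in round(|constant| * 2**frac_bits)."""
    scaled = round(abs(constant) * (1 << frac_bits))
    return [bit for bit in range(scaled.bit_length()) if (scaled >> bit) & 1]


def shift_add_multiply(x, constant, frac_bits=8):
    """Multiply integer x by a constant using only shifts and adds.

    The result is in fixed point, scaled by 2**frac_bits.
    """
    acc = 0
    for bit in shift_positions(constant, frac_bits):
        acc += x << bit
    return -acc if constant < 0 else acc


def approximated_value(constant, frac_bits=8):
    """The constant actually realised by the shift and add network."""
    return shift_add_multiply(1, constant, frac_bits) / (1 << frac_bits)


if __name__ == "__main__":
    a = 1.0 / -1.586134342  # flipped coefficient a = 1/alpha
    print("target  :", a)
    print("realised:", approximated_value(a))
    print("shifts  :", shift_positions(a))
```

Increasing `frac_bits` brings the realised constant closer to the target at the cost of more adders.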
1) Row Processor (RP):
Let X be an image of size N×N. The image is extended by one column using symmetric extension, so that its size becomes N×(N+1). Refer to [20] for the structure of the strip-based scanning method. The proposed architecture first performs the row-wise DWT in the row processor (RP) and then the column-wise DWT in the column processor (CP). Fig. 4(a) shows the generalized structure of a row processor with P PUs; P = 2 has been considered for the proposed design. In the first clock cycle, the RP gets the pixels X(0,0) to X(0,2P) simultaneously. In the second clock cycle, the RP gets the pixels X(1,0) to X(1,2P), and the same procedure continues every clock cycle until it reaches the bottom row, i.e., X(N,0) to X(N,2P). The RP then moves to the next strip, getting the pixels X(0,2P) to X(0,4P), and continues this procedure for the entire image. Each PU consists of five pipeline stages, and each pipeline stage is processed by one processing element (PE), as depicted in Fig. 3(b). Fig. 3(b) also shows that multiplication is replaced with the shift and add technique. The original multiplication factors and the values obtained through the shift and add circuit are listed in Table I, which shows that the difference between the original and adopted values is extremely small. The maximum CPD of these PEs is $2T_a$. The outputs (including H2[n + P − 1]) corresponding to PE alpha, PE beta and PE gamma of the last PU are saved in the memories Memory alpha, Memory beta and Memory gamma, respectively. These stored outputs are used as inputs for the subsequent columns of the same row. For an N×N image, the number of rows is N, so the size of each memory is N×1 words and the total row memory needed to store these outputs is 3N. The output of each PU undergoes scaling before the outputs H and L are produced. These outputs are fed to the transposing unit.
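The scan order described for the row processor can be sketched as a small generator. This is an illustration only: it assumes rows are indexed 0 to N−1, that N is a multiple of 2P, and that consecutive strips overlap by one column, as the pixel ranges X(0,0) to X(0,2P) and X(0,2P) to X(0,4P) suggest. The function names are hypothetical; see [20] for the actual scanning structure.

```python
def strip_scan(n_rows, n_cols, p):
    """Yield (row, first_col, last_col) for each clock cycle of the RP.

    n_cols is the extended width (N + 1). Each strip is 2P + 1 columns
    wide and starts 2P columns after the previous one.
    """
    width = 2 * p + 1
    start = 0
    while start + width <= n_cols:
        for row in range(n_rows):
            yield row, start, start + 2 * p
        start += 2 * p


def row_memory_words(n):
    """Total row memory for the three intermediate memories (3N words)."""
    return 3 * n


if __name__ == "__main__":
    n, p = 8, 2
    for cycle, (row, first, last) in enumerate(strip_scan(n, n + 1, p)):
        print(f"cycle {cycle:2d}: X({row},{first}) .. X({row},{last})")
    print("row memory words:", row_memory_words(n))
```

Under these assumptions, an N×(N+1) extended image is covered in N/(2P) strips of N cycles each.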
The transpose unit has P transpose registers (one for each PU). Fig. 5(a) shows the structure of the transpose register, which delivers two H and two L data alternately to the column processor.
2) Column Processor (CP):
The structure of the column processor (CP) is shown in Fig. 4. Its outputs are fed to the re-arrange unit. Fig. 5(b) shows the architecture of the re-arrange unit, which provides the outputs in sub-band order, i.e., LL, LH, HL and HH, simultaneously, using P registers and 2P multiplexers. For multilevel decomposition, the same DWT core can be used in a folded architecture with an external frame buffer for the LL sub-band coefficients. Eqn. (8) 
IV. ARCHITECTURE FOR COMPRESSED SENSING MODULE
The proposed 3-D DWT module works simultaneously on two video frames of size N×N and provides eight 3-D DWT sub-bands as its output. As shown in Fig. 1(b), CS is applied to all sub-bands of the 3-D DWT output except the LLL band (LL band of the L-frame), and each sub-band is connected to one CS module.
For one level of decomposition, each sub-band has half the width and half the height of the original frame (N/2×N/2).
The main function of the CS module is to calculate the measurements y from Φ and x using the CS equation y = Φx, where x is the input vector (for which the CS measurements need to be calculated). The size of x is P×N/2 (N/2 is the height of a single column in a sub-band), because the proposed 3-D DWT works simultaneously on P columns owing to the P PUs in the spatial processor. The proposed architecture has been designed with P = 2, so in each clock cycle, alternative column coefficients are provided by the 3-D DWT 
V. RESULTS AND PERFORMANCE COMPARISON

A. Simulation Results
The proposed encoder has been simulated in Matlab, and its functionality has been verified on the cyclone (downloaded from the NASA website) and clock video sequences of 512×512 resolution. The compression ratio is the ratio of the total number of bits in the input frame to the number of bits after entropy coding. Table II shows that the performance of the proposed framework is competitive with the existing IBMCTF [36] and H.264 [1]. The performance of the proposed encoder and decoder in terms of compression ratio and PSNR for the clock, cyclone and Viplane video sequences is listed for levels 1 to 3 in Table III.
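The two metrics used in the tables can be computed as in the following minimal sketch. The compression ratio follows the definition given above; the PSNR uses the standard definition for 8-bit samples (peak value 255), which is assumed here since the text does not spell it out. The numbers in the usage example are made up for illustration and are not results from this work.

```python
import math


def compression_ratio(input_bits, coded_bits):
    """Bits in the input frame divided by bits after entropy coding."""
    return input_bits / coded_bits


def psnr(original, reconstructed, peak=255):
    """Peak signal-to-noise ratio in dB between two equally sized frames."""
    diffs = [(o - r) ** 2
             for row_o, row_r in zip(original, reconstructed)
             for o, r in zip(row_o, row_r)]
    mse = sum(diffs) / len(diffs)
    if mse == 0:
        return math.inf  # identical frames
    return 10 * math.log10(peak ** 2 / mse)


if __name__ == "__main__":
    # Hypothetical figures: a 512x512 8-bit frame coded into 262144 bits.
    print("CR  :", compression_ratio(512 * 512 * 8, 262144))
    print("PSNR:", psnr([[100, 100]], [[101, 99]]))
```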
B. Synthesis Results
The proposed architecture for the CS-based low-complexity video encoder has been described in Verilog HDL. Simulation results have been verified using the Xilinx ISE simulator. We have simulated a Matlab model equivalent to the proposed CS-based low-complexity video encoder architecture and verified the 3-D DWT coefficients and CS measurements. The RTL simulation results exactly match the Matlab simulation results. The Verilog RTL code is synthesised using the Xilinx ISE 14.2 tool and mapped to a Xilinx programmable device (FPGA) 7z020clg484 (Zynq board) with a speed grade of -3. Table IV shows the device utilisation summary of the proposed architecture, which operates at a maximum frequency of 265 MHz. The proposed architecture has also been synthesized using Synopsys Design Compiler with a 90-nm CMOS standard cell library. Synthesis results of the proposed encoder are also provided. Table VI shows the comparison of the proposed 3-D DWT architecture with existing 3-D DWT architectures.
It is found that the proposed design has a lower memory requirement, higher throughput, shorter computation time and minimal latency compared to [22], [23], [24], and [26]. Though the proposed 3-D DWT architecture has a small disadvantage in area and frequency compared to [24], it has a great advantage in all remaining aspects. Table VII compares the synthesis results of the proposed 3-D DWT architecture and [26]. The proposed design appears to occupy more cell area, but its figure includes the total on-chip memory, whereas the on-chip memory is not included in [26]. The power consumption of the proposed 3-D architecture is much lower than that of [26].
VI. CONCLUSIONS
In this paper, we have proposed a memory-efficient and high-throughput architecture for a CS-based low-complexity encoder. The proposed architecture is implemented on a 7z020clg484 FPGA target of the Zynq family and is also synthesized with Synopsys Design Vision for ASIC implementation. The efficient design of the 2-D spatial processor and 1-D temporal processor reduces the internal memory, latency, CPD and control-unit complexity, and increases the throughput. Compared with existing architectures, the proposed scheme shows higher performance at the cost of a slight increase in area. The proposed encoder architecture is capable of processing 60 UHD (3840×2160) frames per second. The proposed architecture is also suitable for scalable video coding. In addition, the complexity of the encoder is reduced to a great extent.
The proposed encoder is considered to be suitable for applications including satellite communication, wireless transmission and data compression by high speed cameras.
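As a back-of-the-envelope check of the UHD frame-rate figure, the sketch below computes how many pixels must be consumed per clock cycle to sustain a given frame rate. It assumes the 265 MHz maximum frequency reported for the FPGA implementation; the text does not state which clock the 60 frames per second figure refers to, so the result is indicative only.

```python
def required_pixels_per_cycle(width, height, fps, clock_hz):
    """Average number of pixels that must be processed per clock cycle."""
    return width * height * fps / clock_hz


if __name__ == "__main__":
    rate = required_pixels_per_cycle(3840, 2160, fps=60, clock_hz=265e6)
    print(f"pixels per cycle needed: {rate:.3f}")
```

Under this assumption, sustaining 60 UHD frames per second requires an average intake of just under two pixels per clock cycle.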
