The two-dimensional discrete wavelet transform (DWT) for image processing is conventionally implemented with line-based architectures, which are simple and have low complexity. However, they suffer from two main shortcomings: the memory required for storing intermediate data and the long latency of computing wavelet coefficients. This work presents a new block-based architecture for computing lifting-based 2-D DWT coefficients. This architecture requires a significantly smaller buffer. Additionally, the latency is reduced from N² to 3N compared with line-based architectures. The proposed architecture supports the JPEG2000 default filters and has been realized on an ARM-based ALTERA EPXA10 Development Board at a frequency of 44.33 MHz.
Introduction
Over the past decade, the discrete wavelet transform (DWT) has been widely applied in image processing. The DWT is used in the decorrelation step of still-picture compression systems. Several research results indicate that wavelets outperform the discrete cosine transform (DCT) in terms of image quality at high compression ratios, by avoiding the block distortion problem suffered by DCT-based solutions. The DWT has traditionally been implemented by convolution, which requires both a large number of computations and a large amount of storage. In 1994, the lifting scheme, a new method known to be superior to the conventional convolution-based DWT, was proposed in [1], [2]. In addition to significantly reducing memory and computational complexity, the lifting scheme provides in-place computation of the wavelet coefficients by overwriting the memory locations that contain the input sample values. Furthermore, it requires less hardware and computes faster. Therefore, the specification of the DWT kernels in JPEG2000 is provided only in terms of the lifting coefficients and not the convolutional filters.
Memory is an important constraint in many image compression applications. Existing DCT-based compression algorithms, including those defined under the JPEG standard, use memory very efficiently because, if required, they can operate on individual image blocks, so that the minimum amount of memory required is very low. Although wavelet-based coders outperform DCT-based coders in terms of compression efficiency, their implementations have not yet matured. Memory efficiency is in fact one of the most important issues to be addressed before wavelet-based techniques can be widely deployed, and it is currently an area of extensive research activity related to the JPEG2000 standard.
In the JPEG2000 verification model [9], the following wavelet filters are proposed: (5, 3) (5-tap lowpass filter, 3-tap highpass filter), (9, 7), C(13, 7), S(13, 7), (2, 6), (2, 10), and (6, 10). To be compliant with JPEG2000, the codec has to implement the (5, 3) filter in lossless mode and the (9, 7) filter in lossy mode. Some proposed architectures [3]-[7] do not implement all of the filters, and their data paths are line-based, resulting in a large buffer size and the late production of wavelet coefficients. In [3], [4], the DWT is processed by two main modules: a row module and a column module. Another structure was presented in [5] to implement all stages of the transform using a recursive architecture. A direct implementation of the lifting scheme was described in [6], and the architecture in [7] improves upon this direct implementation with a folded structure. All of these methods use line-based data flow to process the DWT and suffer from large intermediate data storage. This paper proposes a new block-based architecture that implements the lifting-scheme DWT and significantly reduces the amount of memory required. This memory efficiency is also advantageous in terms of computation speed: in the proposed system, the enforced "locality" of the filtering operations makes it more likely that strips of the image are loaded into the on-chip memory only once.
The rest of this paper is organized as follows. Section 2 briefly reviews the lifting scheme. Section 3 presents the precision analysis and the data flow. Section 4 explains the proposed architectures. Section 5 presents the FPGA implementation results and comparisons with other work. Finally, Section 6 draws conclusions.
Lifting Scheme
The basic concept that underlies the lifting scheme is the factorization of the polyphase matrix of a wavelet filter into a sequence of alternating upper and lower triangular matrices and a diagonal matrix. Let h(z) and g(z) be the low-pass and high-pass analysis filters. The corresponding polyphase matrix is defined as,
where h_e(z) contains the even coefficients of h(z), h_o(z) contains the odd coefficients of h(z), g_e(z) contains the even coefficients of g(z), and g_o(z) contains the odd coefficients of g(z). Then, P(z) can be factored into lifting steps as,
As shown in Fig. 1, the factorization of P(z) involves three steps:
(1) Prediction step, in which the even samples are multiplied by the time domain equivalent of p_i(z), then added to the odd samples;
(2) Update step, in which the updated odd samples are multiplied by the time domain equivalent of u_i(z), then added to the even samples;
(3) Scaling step, in which the even samples are multiplied by 1/K and the odd samples by K.
The inverse DWT is performed by traversing in the reverse direction, changing the factor K to 1/K and the factor 1/K to K, and reversing the signs of the coefficients in p_i(z) and u_i(z).
The original 1-D signal is split into odd- and even-indexed subsequences, and these values are then modified using alternating prediction and update steps. The computational steps are summarized as
where {s_i^n} and {d_i^n} are, respectively, the even and odd sequences, p_n(k) and u_n(k) are, respectively, the prediction and update weights at the nth iteration, and M is the number of lifting iterations. For the (5, 3), C(13, 7), S(13, 7), (2, 6), and (2, 10) filter banks, M = 1, while for the (9, 7) and (6, 10) filter banks, M = 2. Equation (3) describes the prediction step, which consists of predicting each odd sample and subtracting the prediction from the odd sample to form the prediction error {d_i^n}. Equation (4) describes the update step, which consists of updating the even samples by adding to them a linear combination of the already modified odd samples, {d_i^n}, to form the updated sequence {s_i^n}. The output of the final prediction step gives the high-pass coefficients up to a scaling factor K, while the output of the final update step gives the low-pass coefficients up to a scaling factor 1/K. For the (9, 7) filter bank, K = 1.230174104914001. The lifting steps of the (5, 3) filter bank and the (9, 7) filter bank [8] are depicted in Fig. 2.
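The prediction and update steps described above can be sketched for the (5, 3) filter bank, which needs a single iteration (M = 1). The weights 1/2 and 1/4 are the usual (5, 3) lifting weights; the symmetric boundary extension, the even-length input, and the function names are assumptions made for illustration, and the scaling by K is left out.

```python
def lift53_forward(x):
    """One level of (5, 3) lifting on an even-length sequence.

    Returns (low, high): the updated even samples and the prediction errors.
    """
    s = list(x[0::2])  # even-indexed samples
    d = list(x[1::2])  # odd-indexed samples
    # Prediction: subtract the average of the two neighbouring even samples.
    # At the right edge the last even sample is reused (assumed symmetric extension).
    for i in range(len(d)):
        right = s[i + 1] if i + 1 < len(s) else s[i]
        d[i] -= 0.5 * (s[i] + right)
    # Update: add a quarter of the two neighbouring prediction errors.
    # At the left edge the first prediction error is reused.
    for i in range(len(s)):
        left = d[i - 1] if i > 0 else d[i]
        s[i] += 0.25 * (left + d[i])
    return s, d


def lift53_inverse(s, d):
    """Undo lift53_forward by running the steps in reverse with flipped signs."""
    s, d = list(s), list(d)
    for i in range(len(s)):
        left = d[i - 1] if i > 0 else d[i]
        s[i] -= 0.25 * (left + d[i])
    for i in range(len(d)):
        right = s[i + 1] if i + 1 < len(s) else s[i]
        d[i] += 0.5 * (s[i] + right)
    x = [0.0] * (len(s) + len(d))
    x[0::2] = s  # even positions
    x[1::2] = d  # odd positions
    return x


low, high = lift53_forward([1, 2, 3, 4, 5, 6, 7, 8])
restored = lift53_inverse(low, high)
```

Because the inverse only reverses the order of the steps and flips the signs of the weights, the input is recovered regardless of how the boundaries are extended, as long as both directions use the same extension.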
The number of computations required to calculate a high-pass, low-pass pair of wavelet coefficients using convolution and using the lifting scheme is given in Table 1. Compared with convolution, the lifting scheme significantly reduces the number of multiplications for odd-tap filters. For even-tap filters, the convolution scheme has fewer or an equal number of multiplications. The number of additions for the lifting scheme is lower for both odd-tap and even-tap filters. This reduction in computational complexity makes lifting schemes attractive for both high-throughput and low-power applications.
Precision Analysis
The drawback of using a fixed-point data format in application-specific integrated circuit (ASIC) implementations is that precision can be reduced. To overcome this drawback, additional bits must be added to preserve precision, and their number is chosen by image quality analysis.
The filter coefficients of the seven JPEG2000 filters considered herein range from 0.003906 to 2 [4]. To convert the filter coefficients to integers, they are multiplied by 256. The resulting values range from 1 to 512, so that 10 bits can be used to represent the coefficients in 2's complement form. At the end of the multiplication, the product is shifted right by eight bits to yield the required result. Rounding is applied to the individual product terms instead of to the result of the filter operation.
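The coefficient scaling described above can be sketched as follows. The function names are hypothetical, and rounding by adding half an LSB before the shift is an assumption about how the rounding is realized; the value -1.586134342 is the first lifting coefficient of the (9, 7) filter, used here only as a sample input.

```python
COEFF_FRAC_BITS = 8  # coefficients are multiplied by 2**8 = 256


def quantize_coeff(a):
    """Convert a real filter coefficient to its scaled integer form."""
    return int(round(a * (1 << COEFF_FRAC_BITS)))


def fixed_mul(coeff_int, x):
    """Multiply an integer sample by a scaled coefficient.

    The product is rounded (half an LSB is added, an assumed rounding rule)
    and then shifted right by eight bits to undo the coefficient scaling.
    """
    product = coeff_int * x
    return (product + (1 << (COEFF_FRAC_BITS - 1))) >> COEFF_FRAC_BITS


smallest = quantize_coeff(0.003906)   # lower end of the coefficient range
largest = quantize_coeff(2)           # upper end of the coefficient range
alpha = quantize_coeff(-1.586134342)  # sample (9, 7) lifting coefficient
term = fixed_mul(alpha, 100)          # approximates -1.586134342 * 100
```

Rounding each product term separately, rather than the final sum, matches the per-term rounding stated in the text.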
Now we consider the format of the signal values for hardware implementation. The signal values must be shifted left to increase the precision. The extent of the shift is determined by image quality analysis. Consider the general structure of lifting schemes, as indicated in Fig. 3. Given the equation
where a and b are the coefficients, x_k, 1 ≤ k ≤ 5, are the signal inputs, and y is the transformed value. Assuming A = Round(256 × a) and B = Round(256 × b), Eq. (5) can be expressed as follows,
If the input values x_k are shifted by S extension bits, then
The order of the computation is changed to improve its precision
where the subscript round denotes the rounding function. Rounding is applied after each term has been calculated. The SNR values for different numbers of extension bits, for the Baboon, Lenna, Elaine, and Boat images, after three levels of forward and inverse transforms are given in Table 2. For this set of images, the number of extension bits S was varied to find the value at which the SNR saturates, that is, the value beyond which additional bits yield only a slight SNR improvement. According to Figs. 4 and 5, the SNR improves only slightly when S > 5; therefore, the proposed architecture uses five extension bits for processing the DWT. Once the number of extension bits is chosen, the width of the data path must be determined. This can be done by observing the maximum and minimum values of the forward and inverse transforms at the end of each level. Table 3 presents the maximum and minimum values for the Baboon, Lenna, Elaine, and Boat images with five extension bits. This table indicates that 16 bits are required to represent the transformed values in 2's complement representation.
The multiplier multiplies a 16-bit number by a 10-bit number to form a 26-bit product, then rounds off the eight LSBs (which account for the increased precision of the filter coefficients) and drops the two MSBs to form a 16-bit output. (Sixteen bits are required to represent the outputs, and therefore the two MSBs are sign extension bits.)
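The effect of the extension bits and of the 16-bit data path can be illustrated with a small sketch. The helper names, the round-half-up rule, and the sample pixel and coefficient are assumptions chosen for illustration; only the bit widths and S = 5 come from the text.

```python
EXT_BITS = 5      # S: number of extension bits selected by the SNR analysis
COEFF_BITS = 8    # fractional bits of the scaled coefficients (factor 256)
DATA_BITS = 16    # data-path width in 2's complement


def extend(pixel):
    """Shift a sample left by S bits to gain fractional precision."""
    return pixel << EXT_BITS


def mac_term(x, coeff_int):
    """One product term: 16-bit sample times scaled coefficient, rounded to 16 bits."""
    product = x * coeff_int  # a 16-bit by 10-bit product needs up to 26 bits
    rounded = (product + (1 << (COEFF_BITS - 1))) >> COEFF_BITS  # drop 8 LSBs
    lo, hi = -(1 << (DATA_BITS - 1)), (1 << (DATA_BITS - 1)) - 1
    if not lo <= rounded <= hi:
        # The two MSBs are assumed to be sign extension; anything else overflows.
        raise OverflowError("result does not fit the 16-bit data path")
    return rounded


def restore(value):
    """Remove the extension bits again, rounding to the nearest integer."""
    return (value + (1 << (EXT_BITS - 1))) >> EXT_BITS


coeff = 64                                # 0.25 scaled by 256
plain = mac_term(101, coeff)              # no extension: 25.25 collapses to 25
extended = mac_term(extend(101), coeff)   # with extension: 808 = 25.25 * 32
```

With the extension bits, the intermediate value 808 still carries the fractional part (808 / 32 = 25.25), so rounding errors accumulate more slowly across lifting steps and decomposition levels; the extra bits are removed only at the end.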
Proposed VLSI Architectures

Proposed Data Flow Diagram
For each level of the DWT computed with the line-based method, the filtering along columns is performed after the filtering along rows has been completed, as shown in Fig. 6. For instance, in image processing, this requires N² words of intermediate data storage, which may be too much to fit on a single chip even for moderately sized images. While the line-based method can be efficient for 1-D applications, 2-D line-based architectures suffer from the bottleneck that the required memory equals the input data size. Besides this disadvantage, the line-based approach does not lend itself to parallel processing.
The data flow proposed in this paper does not follow the line-based method; instead, a new block-based scheme is presented. When the input image is divided into several blocks, the coefficients of each subband (i.e., LL, LH, HL, HH) can be obtained concurrently within a block. This method can be thought of as a window sliding over the image. The overlapping design slides the window smoothly across the image. The idea behind the overlapping block architecture is to take only as many inputs as are required to compute a set of outputs. For example, a 1-D version requires only as many inputs as the filter length (L) and produces two outputs: a low-pass and a high-pass coefficient. The 2-D case takes L² inputs and produces four outputs. In general, an n-dimensional transform needs L^n inputs to produce 2n outputs. Figure 7 presents an example of the data flow, using a (5, 3) filter bank. The size of the input image is assumed to be 5 × 5 pixels, and a block of 3 × 3 pixels is used. Three intermediate data values are produced in Fig. 7(a).
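The 1-D case of the sliding window can be sketched for the (5, 3) filter bank with L = 5: each window of five consecutive samples yields one low-pass and one high-pass output, and the window advances by two samples. This is only an illustration of the windowing idea, not the proposed hardware data flow; the function names are hypothetical, boundary handling is omitted, and scaling is left out.

```python
def window53(w):
    """Low-pass/high-pass pair from five consecutive samples centred on an even index."""
    if len(w) != 5:
        raise ValueError("the (5, 3) window needs exactly five samples")
    d_prev = w[1] - 0.5 * (w[0] + w[2])    # prediction error left of the centre
    d_curr = w[3] - 0.5 * (w[2] + w[4])    # prediction error right of the centre
    low = w[2] + 0.25 * (d_prev + d_curr)  # updated centre sample
    return low, d_curr


def sliding_dwt53(x):
    """Slide the window across x in steps of two; interior positions only."""
    return [window53(x[c - 2:c + 3]) for c in range(2, len(x) - 2, 2)]


pairs = sliding_dwt53([0, 0, 4, 0, 0, 0, 8, 0, 0])
```

Neighbouring windows overlap by three samples, so each prediction error on the left of a window was already computed as the right-hand error of the previous window. Keeping that overlapped value in a register avoids recomputing it or re-reading memory.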
Proposed Architectures
The proposed block-based architecture for the 2-D DWT is depicted in Fig. 8. The outputs in each level are LL, LH, HL, and HH. The LL data are used for the next level of decomposition. This system has three primary stages. In the first stage, the input data are read and the block controller forms a "block" according to a double-buffer scheme. After a "block" of input data is ready for processing, it is sent to the pipeline register for the next stage.
The second stage is the PE Y controller, which processes the intermediate transform data within a block and stores the data in BUFFER Y. The last stage, the PE Z controller, processes the final transform coefficients. BUFFER Z is used only in 4M filters, because two passes of the one-dimensional transform are calculated in a round. The registers in Fig. 8 are used for storing the second-pass input. The details are discussed in the following subsections.
Block Controller Modules
The block controller modules read the image input data. BUFFER X is used to store the input data and to segment the image data into sub-blocks. BUFFER X contains two banks (MEM1 and MEM2) to implement the double-buffer scheme. The first step is to read data from the External Memory into MEM1 (see Fig. 8). In the second step, once MEM1 is full, MEM2 is filled with image data while MEM1 is simultaneously read to form a "block" for processing. MEM2 waits until the processing of MEM1 is completed. The third step is similar to the second but with MEM1 and MEM2 exchanged. The first step is executed only once, after which the second and third steps are performed alternately until the entire image has been processed. A rough finite state machine of the block controller is shown in Fig. 9.
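The alternation of the two banks can be sketched as a simple schedule. This is a behavioural illustration only; the function name and the representation of image strips as list items are assumptions, and the timing is idealized so that loading and processing one strip take the same time.

```python
def double_buffer_schedule(strips):
    """Return one (load, process) pair per time slot.

    Each entry is (bank, strip) or None when that unit is idle.
    Slot 0 corresponds to step 1 (fill MEM1 only); later slots alternate
    between step 2 (load MEM2, process MEM1) and step 3 (banks exchanged).
    """
    banks = ("MEM1", "MEM2")
    schedule = []
    for t in range(len(strips) + 1):
        load = (banks[t % 2], strips[t]) if t < len(strips) else None
        process = (banks[(t - 1) % 2], strips[t - 1]) if t > 0 else None
        schedule.append((load, process))
    return schedule


for load, process in double_buffer_schedule(["strip0", "strip1", "strip2"]):
    print("load:", load, "| process:", process)
```

After the initial fill, every slot overlaps a load into one bank with processing of the other, which is what hides the external-memory access time behind the computation.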
Processing Elements (PE) Modules
Two PE modules are used in our design. PE Y reads a block of data from BUFFER X, calculates the intermediate data Y, and writes the data into BUFFER Y, while PE Z reads a block of data from BUFFER Y, calculates the transform data Z, and writes the data into BUFFER Z. The basic computation unit, the MAC, is shown in Fig. 10. Figures 11 and 12 show the structures of the 2M and 4M filter banks, respectively. REG1 and REG2 are used for storing the overlapped data of the block in the 2M filter banks. When the 4M filter banks are being processed, all four registers are used to reduce the number of memory accesses. Avoiding repeated memory accesses in this way reduces the power consumption.
In our algorithm, a block has two frames. In each frame, the processing element calculates a high-pass and low-pass pair of coefficients. PE Y and PE Z can perform their transforms simultaneously once PE Z has enough input data to do so. Thus, the computational time can be significantly reduced.
Memory Modules
Because the double-buffer and overlapping structures are adopted, the size of MEM1 and MEM2 in the proposed block-based architecture is N × 2, where N is the width of the input image. All of the data in MEM1 (or MEM2) are processed in PE Y and stored in BUFFER Y. At this point, PE Z starts to process the other dimension, since it has sufficient data to do so. The memory size is much lower than that of line-based architectures, whose memory requirement is N × N/2 [3], [4].
BUFFER Y and BUFFER Z have size N × 4. Referring to Fig. 13, while a row of the intermediate data is being processed, the three other rows can be accessed for simultaneous processing of the other dimension. These four rows can be rewritten circularly.
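The circular rewriting of the four rows can be pictured with a small ring buffer. This is a software analogy under stated assumptions, not the hardware design: the class and method names are hypothetical, and the row contents are plain Python lists.

```python
class RowRing:
    """Four-row buffer in which the newest row overwrites the oldest one."""

    def __init__(self, width, rows=4):
        self.rows = [[0] * width for _ in range(rows)]
        self.count = 0  # total number of rows written so far

    def write(self, row):
        """Store a row in the next slot, wrapping around after the last slot."""
        self.rows[self.count % len(self.rows)] = list(row)
        self.count += 1

    def latest(self, k):
        """Return the k most recently written rows, oldest first."""
        k = min(k, self.count, len(self.rows))
        start = self.count - k
        return [self.rows[i % len(self.rows)] for i in range(start, self.count)]


ring = RowRing(width=2)
for i in range(6):
    ring.write([i, i])       # rows 4 and 5 overwrite the slots of rows 0 and 1
recent = ring.latest(3)      # rows 3, 4 and 5
```

Only the write index moves; no row is ever copied, so the storage stays at four rows of N words however many image rows pass through.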
FPGA Implementation
To realize the proposed architecture, the ALTERA EPXA10 Development Board (ALTERA™ EXCALIBUR™ EPXA10F1020C2) was utilized. Figure 14 shows the system architecture of the embedded stripe and the interfaces to the PLD portion of the device [11]. This architecture promotes maximum integration with minimal system cost and allows the embedded stripe and the PLD to be independently optimized for maximum performance. Two AMBA-compliant AHBs ensure that the embedded processor activity is unaffected by peripheral and memory operation. Three bidirectional AHB-to-AHB bridges enable embedded peripherals and PLD-implemented peripherals to exchange data with the embedded processor or with other peripherals. With these interfaces, the performance of the ARM922T is uncompromised and is equivalent to an ASIC implementation on a 0.18-µm CMOS process. The implementation results are summarized in Table 5. The critical path of the system is about 22.557 ns, which means that the maximum operating frequency is roughly 44.33 MHz. As shown in Fig. 8, the critical path is the path between two pipeline registers (through a multiplexer and a PE controller). Figure 15 depicts the schematic view of the whole system. A photo of the prototype system is given in Fig. 16.
Table 5 The implementation results of the FPGA prototype.
Fig. 15 Schematic view of the whole system.
The buffer size, hardware utilization, and computational time of the proposed architecture are now compared with those of other architectures. In the proposed architecture, the buffer memory is significantly reduced, as shown in Table 6. Table 6 also shows that, while block-based architectures may use more computing time, the work can be divided among many processors. In the proposed architecture, the first wavelet transform coefficient is generated as early as possible. The total computational time can also be reduced in comparison with that of other architectures, which facilitates quantization during image compression in JPEG2000 and represents another advantage of the proposed block-based structure.
Conclusions
Line-based DWT architectures are efficient for 1-D applications. In 2-D (or higher-dimensional) transforms, they suffer from two main problems: memory requirements and latency. For example, image processing requires N² words for storing intermediate data, which may not fit on a single chip even for moderately sized images. Also, the latency depends on the input size. At least O(N) clock cycles are required to generate the first output. These problems are inherent in line-based architectures.
This paper offers a new data processing path and presents a new VLSI architecture that implements the 2-D lifting-scheme DWT with a small memory. The DWT coefficients are computed using a block-based data path. This architecture reduces the latency to 3N, and the total required memory is also reduced. Finally, the proposed design has been successfully verified on an ARM-based ALTERA EPXA10 Development Board.
