Abstract-Wavelet coding performs better than the discrete cosine transform in visual processing. Moreover, it is scalable, which is important for modern video standards. The transpose memory requirement and operation speed are the two major concerns in 2-D lifting-based discrete wavelet transform (LDWT) implementation. This letter presents a novel algorithm, called the 2-D symmetric mask-based discrete wavelet transform (SMDWT), to address these critical issues of the 2-D LDWT and to obtain the benefits of low latency, reduced complexity, and low transpose memory. The SMDWT also has the advantages of regular signal coding, a short critical path, and independent subband coding processing. Furthermore, the 2-D LDWT performance can be easily improved by exploiting an appropriate parallel method inherent to the SMDWT. The proposed method has significantly better latency and complexity in 2-D DWT than the normal 2-D 5/3 integer LDWT, without degradation in image quality. The algorithm can be applied to real-time image/video applications.
I. INTRODUCTION

IN THE PAST few years, the discrete wavelet transform (DWT) [1] has been adopted in a wide range of applications, including speech analysis, numerical analysis, signal analysis, image coding, video compression, pattern recognition, computer vision, and biometrics. The DWT can be viewed as a multiresolution decomposition of a signal, meaning that it decomposes a signal into several components in different wavelet frequency bands. By factoring the classical wavelet filter into lifting steps, the computational complexity of the corresponding DWT can be reduced by up to 50%. The lifting steps can be easily implemented, and they differ from the direct finite impulse response (FIR) implementations of Mallat's algorithm [11]. Several lifting-based discrete wavelet transform (LDWT) hardware architectures have recently been proposed. The 2-D DWT architecture described by Chiang et al. [1] is based on an interlaced read scan algorithm with pipeline processing to achieve low transpose memory size and high-speed operation. One architecture performs the LDWT with the 5/3 filter, based on the interleaving technique presented in [2]. Chen et al. [3] used a 1-D folded architecture to improve the hardware utilization for 2-D 5/3 and 9/7 filters. Andra et al. [4] proposed simple processing units that compute several stages of the DWT at a time. Tan et al. [5] presented a novel shift-accumulator arithmetic logic unit architecture for the 2-D lifting-based JPEG2000 5/3 DWT. The architecture has an efficient memory organization, which uses a smaller amount of embedded memory for processing and buffering. Varshney et al. [6] presented energy-efficient single-processor and fully pipelined architectures for the 2-D 5/3 lifting-based JPEG2000. The single processor performs both the row-wise and columnwise processing simultaneously, thus achieving the full 2-D transform with 100% hardware utilization. Chen et al. [7] proposed a flexible and folded architecture for 3-level 1-D LDWT to increase hardware utilization. The recursive architecture is a general scheme to implement any wavelet filter that is decomposable into lifting steps [8] in a small-size and low-power design. Despite these efficiency improvements in the existing architectures, further improvements in the algorithm and architecture are required. Liao et al. [9] proposed two similar 2-D lifting-based 9/7 DWT generic architectures by employing parallel and pipeline techniques with recursive pyramid algorithms. Those architectures achieve multilevel decomposition using an interleaving scheme that reduces the size of memory and the number of memory accesses, but they have a low throughput rate and inefficient hardware utilization. Some VLSI architectures of the 2-D LDWT reduce the transpose memory requirements and the communication between the processors, such as the architectures presented in [1]-[10]. However, these architectures still need large transpose memory and long latency time.
A low transpose memory requirement and latency reduction are the major concerns in 2-D DWT implementation. This letter presents a new approach, namely the 2-D symmetric mask-based discrete wavelet transform (SMDWT), to improve the 2-D LDWT, and further applies it to real-time 2-D DWT visual applications.
The rest of this letter is organized as follows. In Section II, the LDWT is briefly introduced. The proposed SMDWT approach and the efficient hardware architecture are presented in Section III. Section IV demonstrates the performance comparisons with other existing 2-D DWT architectures. The conclusions are given in Section V.
II. LIFTING-BASED DWT
The lifting-based scheme proposed by Daubechies and Sweldens requires fewer computations than the traditional convolution-based approach. The lifting-based scheme is an efficient implementation for DWT. It can easily use integer operations and avoid the problems caused by the finite precision or rounding [1] .
A lifting-based scheme has the following four stages.
1) Split phase: The original signal is divided into two disjoint subsets. Specifically, Xe denotes the set of even samples and Xo denotes the set of odd samples. This phase is called the lazy wavelet transform because it does not decorrelate the data, but only subsamples the signal into even and odd samples.
2) Predict phase: The predicting operator P is applied to the subset Xo to obtain the wavelet coefficients d[n].
3) Update phase: The update operator U combines Xe with d[n] to obtain the scaling coefficients s[n].
4) Scaling: In the final step, the normalization factor is applied to s[n] and d[n] to obtain the wavelet coefficients. Equations (3) and (4) describe the implementation of the 5/3 integer lifting analysis DWT and are used to calculate the odd coefficients (high-pass coefficients) and even coefficients (low-pass coefficients), respectively.
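The predict and update steps of the 5/3 integer lifting transform can be sketched in a few lines. The sketch below assumes the reversible 5/3 form used in JPEG 2000 (floor rounding and whole-sample symmetric boundary extension), which may differ in notation from (3) and (4); the function names are illustrative and not taken from this letter.

```python
def _mirror(k, n):
    """Whole-sample symmetric extension of index k into the range [0, n)."""
    if k < 0:
        return -k
    if k >= n:
        return 2 * (n - 1) - k
    return k


def lifting53_forward(x):
    """One level of the reversible 5/3 lifting transform on an even-length list of ints.

    Returns (s, d): low-pass (even) and high-pass (odd) coefficients.
    """
    n = len(x)
    assert n >= 2 and n % 2 == 0, "sketch handles even-length signals only"
    half = n // 2
    # Predict: each odd sample minus the floored average of its even neighbours.
    d = [x[2 * i + 1] - ((x[2 * i] + x[_mirror(2 * i + 2, n)]) >> 1)
         for i in range(half)]
    # Update: each even sample plus a rounded quarter of the two adjacent details.
    # At the left edge the symmetric extension makes d[-1] equal to d[0].
    s = [x[2 * i] + ((d[max(i - 1, 0)] + d[i] + 2) >> 2)
         for i in range(half)]
    return s, d


def lifting53_inverse(s, d):
    """Undo lifting53_forward by running the two steps in reverse order."""
    half = len(s)
    n = 2 * half
    x = [0] * n
    for i in range(half):
        x[2 * i] = s[i] - ((d[max(i - 1, 0)] + d[i] + 2) >> 2)
    for i in range(half):
        x[2 * i + 1] = d[i] + ((x[2 * i] + x[_mirror(2 * i + 2, n)]) >> 1)
    return x
```

Because every step only adds an integer function of the other subset, the inverse subtracts the same quantity, so the integer transform is exactly reversible under these assumptions.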
Although the lifting-based scheme has less complexity, its long and irregular data paths constitute a major limitation for efficient hardware implementation. Additionally, the increasing number of pipelined registers increases the transpose memory size of the 2-D DWT architecture [10] .
III. PROPOSED 2-D SYMMETRIC MASK-BASED DWT
The transpose memory requirement and operation speed are the two major concerns in 2-D LDWT implementation. The row- and columnwise signal flow operation is generally adopted for an N × N 2-D DWT. However, the memory requirement of this scheme ranges from 2.5N to N² [2]-[20]. To solve the transpose memory access problem, this letter proposes a low-latency and low-memory architecture for multilevel 2-D LDWT. The previous row- and columnwise signal flow is replaced with mask-based processing, i.e., SMDWT, to reduce the transpose memory requirement for the 2-D DWT. The SMDWT has many advanced features, such as a short critical path, less latency time, regular signal coding, and independent subband processing. The following sections introduce the 2-D SMDWT, where the derivation of the mask coefficients is based on the 2-D integer LDWT.
A. 2-D SMDWT Structure
In this section, the proposed SMDWT is discussed in three aspects: lifting structure, transpose memory, as well as latency and critical path. The proposed SMDWT algorithm has the advantages of high computational speed, less complexity, reduced latency, and regular data flow.
For speed and simplicity, four masks, i.e., 3 × 3, 5 × 3, 3 × 5, and 5 × 5, are used to perform the spatial filtering tasks. Moreover, the four-subband processing can be further optimized to speed up the computation and reduce the transpose memory of the DWT coefficients. The four matrix processors consist of four mask filters, and each filter is derived from the 2-D DWT with 5/3 integer lifting-based coefficients. In LDWT implementation, a 1-D DWT needs massive computations, so the computation unit dominates the hardware cost [4]. A 2-D DWT is composed of two 1-D DWTs and a block of transpose memory, which is of the same size as the processed image. The transpose memory is the main overhead of the computational unit in the 2-D DWT. Without loss of generality, the 2-D 5/3 LDWT is adopted for comparison. Assuming that the image is of size N × N, a large amount of transpose memory (on the order of N²) is needed during the transformation to store the temporary data after the first-stage 1-D DWT decomposition. The second-stage 1-D DWT is then applied to the stored data to obtain the four-subband (HH, HL, LH, and LL) results of the 2-D DWT. Because a memory requirement of size N² is huge and the processing time is long, this letter proposes a new approach, called 2-D SMDWT, to reduce the transpose memory, computing latency, and critical path. Fig. 1(a) shows the concept of the proposed SMDWT architecture, which consists of an input arrangement unit, a processing element, a memory unit, and a control unit, as shown in Fig. 1(b). The outputs are the four-subband coefficients of the 2-D DWT: HH, HL, LH, and LL. Significant transpose memory can be saved using the proposed approach. This architecture is described in detail in the following sections, and is illustrated in Figs. 1 and 2(b). This letter focuses on the 2-D 5/3 LDWT complexity reduction.
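One way to see where the four mask sizes and their coefficients come from is to take outer products of the 1-D 5/3 analysis filters. The sketch below assumes the standard 5/3 taps with rounding ignored; the helper names are hypothetical. Because the assignment of the labels HL and LH to the two mixed masks depends on the scan convention, the masks are labelled here by filter pair instead.

```python
from fractions import Fraction as F

# 1-D 5/3 analysis taps without rounding (assumed: the standard 5/3 filter pair).
HIGH = [F(-1, 2), F(1), F(-1, 2)]
LOW = [F(-1, 8), F(1, 4), F(3, 4), F(1, 4), F(-1, 8)]


def outer(col_taps, row_taps):
    """Outer product: mask[r][c] = col_taps[r] * row_taps[c]."""
    return [[a * b for b in row_taps] for a in col_taps]


MASKS = {
    "high-high": outer(HIGH, HIGH),  # 3 x 3
    "high-low": outer(HIGH, LOW),    # 3 x 5
    "low-high": outer(LOW, HIGH),    # 5 x 3
    "low-low": outer(LOW, LOW),      # 5 x 5
}


def distinct_values(mask):
    """Sorted list of the distinct coefficients that appear in a mask."""
    return sorted({v for row in mask for v in row})


if __name__ == "__main__":
    for name, mask in MASKS.items():
        size = f"{len(mask)} x {len(mask[0])}"
        print(name, size, [str(v) for v in distinct_values(mask)])
```

Under these assumptions the distinct values of each mask coincide with the coefficient sets quoted for the HH, HL/LH, and LL bands later in this section (for example −1/2, 1/4, and 1 for the 3 × 3 mask), which also makes the symmetry of each mask easy to inspect.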
B. Simplified 2-D SMDWT Using Symmetric Features
1) HH Band Mask Coefficients Reduction for 2-D SMDWT:
According to the 2-D 5/3 LDWT, the HH band coefficients of the SMDWT can be derived as in (5). The mask shown in Fig. 2(a) is obtained from (5), where α = −1/2, β = 1/4, and γ = 1. Fig. 2(b) shows the hardware architecture. The transpose memory requirement is a very important issue in multimedia IC design. Therefore, to make the SMDWT architecture suitable for VLSI implementation, the processing element must be as simple and modular as possible. However, the product of cost and computation time is always the most important consideration, given that standardization provides economies of scale for VLSI solutions. Therefore, speed is sometimes sacrificed to obtain less costly hardware, while still satisfying the performance requirement. In other words, the SMDWT architecture can be decomposed so as to adjust the product of cost and computation time. The tradeoffs between hardware cost and computation time must be carefully considered to find the optimal design for VLSI implementation. A simple SMDWT method for cost and computation time savings is introduced below. The proposed HH-band architecture consists of a shifter (α, β, and γ) and one adder tree with propagation registers, as shown in Fig. 2(b). The architecture design can be divided as follows.
1) Input Arrangement Unit: Three pixels in a column are input into a processing element by the address generator circuits in each cycle. Simultaneously, the input arrangement unit uses a multiplexer (MUX) to assign the three pixels fetched in each cycle to group 1, group 2, or group 3 for their respective operations.
2) Coefficient Shifter Unit: The coefficient shifter values are α = −1/2, β = 1/4, and γ = 1. Shifters replace multipliers to achieve a high-efficiency architecture by reducing computational time, critical path, area cost, and power consumption [5].
3) Adder Tree Unit: An adder tree architecture is adopted to avoid the long signal path length, signal skewing, and hazards caused by signal dependency. Each adder tree level can be viewed as a parallel pipeline stage. This architecture is suitable for hardware realization.
4) Propagation Register Unit: Current pixels are stored for the subband coefficient computations needed in each group, and the horizontal or vertical scan-oriented computations are then stored in propagation registers for data reuse. This approach reduces subsequent access time and computations. A pipelined design improves the system throughput. Based on this structure, the overlapping part of the coefficients can be reused, as shown in Fig. 2(b).
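The shifter and adder-tree units above can be mimicked in software for the HH band. The following is a minimal sketch, assuming the 3 × 3 mask places β = 1/4 on the corners, α = −1/2 on the edges, and γ = 1 on the centre (the layout implied by the 5/3 filter). The rounding used in the actual hardware is not stated here, so the sketch floors once at the end; the function names are illustrative.

```python
def hh_shift_add(win):
    """HH coefficient from a 3x3 window of integer pixels using only shifts and adds.

    win[1][1] is the centre pixel (odd row, odd column).
    Assumed weights: corners 1/4, edges -1/2, centre 1.
    """
    # Two small adder trees: pair the inputs first, then add the pairs.
    corners = (win[0][0] + win[0][2]) + (win[2][0] + win[2][2])
    edges = (win[0][1] + win[2][1]) + (win[1][0] + win[1][2])
    # Scale by 4 so every weight is an integer: 4*centre - 2*edges + corners.
    scaled = (win[1][1] << 2) - (edges << 1) + corners
    return scaled >> 2  # arithmetic right shift = floor division by 4


def hh_reference(win):
    """The same coefficient with explicit multiplications, for comparison."""
    weights = [[0.25, -0.5, 0.25],
               [-0.5, 1.0, -0.5],
               [0.25, -0.5, 0.25]]
    return sum(weights[r][c] * win[r][c] for r in range(3) for c in range(3))
```

Grouping the pixels by weight before shifting means that only three shift operations are needed per window, which is the software analogue of replacing the multipliers with shifters.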
The complexity of the mask-based method is further reduced by employing the symmetric feature of the mask. First, the initial horizontal scan is expressed by
The next coefficient can be calculated by
where the variable XM_H denotes the repeated part after the horizontal third coefficient, X denotes the group of pixels x, M denotes the mask, and H denotes the horizontal orientation.
The general form can be derived as
Since γ = 1, the general form can be expressed as
where
The vertical scan can be done in the same way, where HH(0, 0) is the same as that in (6). The next coefficient can be calculated by
where the variable XM_V denotes the repeated part after the vertical third coefficient, and V denotes the vertical orientation.
Finally, the diagonal oriented scan can be derived as
where the variable XM_D denotes the repeated part after the vertical fifth coefficient, and D denotes the diagonal orientation.
The general form can be expressed as
The repeated part needs to be calculated only once throughout the whole image. Hence, it greatly reduces the complexity of the SMDWT.
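The reuse of a repeated part during the horizontal scan can be illustrated in software. The sketch below is an illustration under stated assumptions rather than a transcription of the equations in this section: it uses the 3 × 3 HH mask (corners 1/4, edges −1/2, centre 1), treats the weighted sum of the even column shared by two horizontally adjacent windows as the repeated part, and computes that sum only once. All names are hypothetical.

```python
from fractions import Fraction as F


def hh_direct(x, i, j):
    """HH(i, j) from the full 3x3 mask centred on x[2i+1][2j+1] (assumed weights)."""
    w = [[F(1, 4), F(-1, 2), F(1, 4)],
         [F(-1, 2), F(1), F(-1, 2)],
         [F(1, 4), F(-1, 2), F(1, 4)]]
    return sum(w[r][c] * x[2 * i + r][2 * j + c]
               for r in range(3) for c in range(3))


def hh_row_with_reuse(x, i):
    """All interior HH(i, j) along one row of windows, reusing the shared column."""
    width = len(x[0])

    def column_term(c):
        # Vertical high-pass of column c over rows 2i .. 2i+2.
        return -F(x[2 * i][c], 2) + x[2 * i + 1][c] - F(x[2 * i + 2][c], 2)

    out = []
    left = column_term(0)              # computed once, then carried along
    for j in range((width - 1) // 2):
        centre = column_term(2 * j + 1)
        right = column_term(2 * j + 2)
        out.append(-left / 2 + centre - right / 2)
        left = right                   # repeated part: reused by the next window
    return out
```

In this sketch each new window costs two fresh column terms instead of three, because the right-hand column of one window becomes the left-hand column of the next.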
2) HL, LH, and LL Band Mask Coefficients Reduction for 2-D SMDWT:
According to the 2-D 5/3 LDWT, the HL-band coefficients of the mask-based DWT can be expressed as follows:
… x(2i + 4u, 2j)
The mask shown in Fig. 3(a) can be obtained via (16), where α = −1/8, β = 1/16, γ = 1/4, δ = −3/8, and ε = 3/4. The hardware design architecture is depicted in Fig. 4(a). The complexity of the SMDWT is further reduced by employing the symmetric feature of the mask. The variable XM_{H+n} denotes the repeated part after the second horizontal coefficient. The general form can be expressed as
The vertical scan can be done in the same way. The variable XM_V denotes the repeated part after the vertical fifth coefficient. The general form can be expressed as
Finally, the general form of the diagonal-oriented scan, XM_{D+n}, can be expressed as
The general form of the remaining part can be expressed as
The HL-band can be derived in the same way. According to the 2-D 5/3 LDWT, the LH-band coefficients of the SMDWT can be derived as follows:
The mask shown in Fig. 3(b) can be obtained via (21), where α = −1/8, β = 1/16, γ = 1/4, δ = −3/8, and ε = 3/4. The hardware design architecture is depicted in Fig. 4(b). The complexity of the SMDWT is further reduced by employing the symmetric feature of the mask. First, the initial horizontal scan is calculated by a method similar to that of the HL SMDWT, where the variable XM_H denotes the repeated part after the horizontal fifth coefficient. The general form can be expressed as
Next, the initial vertical scan is calculated by a method similar to that of the HL mask-based DWT. The general form of the first vertical step can be expressed as
Finally, the general form of the diagonal-oriented scan, XM_{D+n}, can be expressed as
According to the 2-D 5/3 LDWT, the LL-band coefficients of the SMDWT can be expressed as follows:
The mask shown in Fig. 3(c) can be obtained via (26), where α = −1/32, β = 1/64, γ = 1/16, δ = −3/32, ε = 3/16, and ζ = 9/16. The hardware design architecture is depicted in Fig. 4(c). The complexity of the SMDWT is further reduced by employing the symmetric feature of the mask. The general form can be expressed as
The vertical scan can be done in the same way. The variable XM_{V+n} denotes the repeated part after the vertical fifth coefficient. The general form can be expressed as
The general form of the remaining part can be expressed as
TABLE I (surviving entries; adders / multipliers)
Term                        Original    Simplified
(label lost)                14 / 15     10 / 0
XM_{V+1} of LH(1, j)        14 / 15     12 / 0
XM_{V+n} of LH(i + 2, j)    14 / 15     9 / 0
(label lost)                24 / 25     20 / 0
XM_{H+n} of LL(i, j + 2)    24 / 25     15 / 0
(label lost)                24 / 25     20 / 0
XM_{V+n} of LL(i + 2, j)    24 / 25     (lost)
C. Summary of the Complexity Reduction
The four-matrix frameworks lead to four different architectures. The discussion above shows that the complexity of the proposed SMDWT can be significantly reduced by exploiting the symmetric feature of the masks. Table I shows the overall complexity reductions from the original SMDWT to the simplified SMDWT.
IV. EXPERIMENTAL RESULTS AND PERFORMANCE COMPARISONS
The proposed 2-D SMDWT algorithm is generally used to perform the 2-D DWT for still images. The wavelet transform provides a multiscale representation of visual data in the spatial frequency domain. A major advantage of the DWT is its scalability. The proposed algorithm is based on the four-subband matrices, which are processed to achieve the same performance as the 2-D 5/3 LDWT algorithm. The SMDWT is implemented in the JPEG2000 reference software VM 9.0 and is compared with the original JPEG2000. The test image used in this experiment was Lena of size 512 × 512. Experimental results show that the proposed algorithm not only significantly improves on the latency of the lifting-based scheme but also has the same visual quality as the normal 2-D 5/3 LDWT.
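The claim that the mask-based processing matches the row-column transform can be checked numerically for the linear part of the filters. The sketch below assumes the standard 5/3 low-pass taps with rounding ignored and interior positions only (no boundary extension), and compares the LL band computed in two passes through an intermediate buffer with the LL band computed in a single pass with the 5 × 5 mask. It does not reproduce the integer rounding of the JPEG2000 implementation, and the names are illustrative.

```python
from fractions import Fraction as F

# 5/3 low-pass analysis taps, centred on an even sample (rounding ignored).
LOW = [F(-1, 8), F(1, 4), F(3, 4), F(1, 4), F(-1, 8)]


def filt(seq, taps, centre):
    """Apply taps to seq with the middle tap aligned to index `centre`."""
    half = len(taps) // 2
    return sum(t * seq[centre - half + k] for k, t in enumerate(taps))


def ll_row_column(x, i, j):
    """LL(i, j) the row-column way: filter each needed row, then one column."""
    # The intermediate list plays the role of the transpose buffer.
    rows = [filt(x[r], LOW, 2 * j) for r in range(2 * i - 2, 2 * i + 3)]
    return filt(rows, LOW, 2)


def ll_mask(x, i, j):
    """LL(i, j) in one pass with the 5x5 mask (outer product of LOW with itself)."""
    return sum(LOW[r] * LOW[c] * x[2 * i - 2 + r][2 * j - 2 + c]
               for r in range(5) for c in range(5))
```

Exact rational arithmetic is used so that equality can be tested directly; with floating-point values the two results would agree only up to rounding error.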
The architecture of the 2-D SMDWT has many advantages compared to the 2-D LDWT. For example, the critical path of the 2-D LDWT is potentially longer than that of the SMDWT. Moreover, the 2-D LDWT is frame-based, and its implementation bottleneck is the large transpose memory. This letter uses the symmetric feature of the masks in the SMDWT to improve the design. Experimental results, as shown in Table II, show that the proposed algorithm is superior to most of the previous works. The proposed algorithm offers efficient solutions for reducing the critical path (defined as the longest time-weighted sequence of events from the start of the program to its termination), as shown in Table III. Therefore, the proposed architecture is suitable for multilevel DWT computations.
TABLE II (surviving entries; column headings and remaining rows are not recoverable)
(label lost)    3.5N    2N + 5    (N²/2) + N + 5
LDWT [5]        3N      N/A       (N²/2) + N + 5
LDWT [6]        3N      13        N/A
LDWT [7]        3N      N/A       (N²/2) + N + 5
LDWT [10]       N
V. CONCLUSION
This letter proposes a novel 2-D SMDWT fast algorithm, which is superior to the 5/3 LDWT. The algorithm solves the latency problem in the previous schemes caused by the multiple-layer transpose decomposition operation. Moreover, it meets real-time requirements and can be further applied to computer vision and visual compression.
The proposed 2-D SMDWT algorithm has the advantages of high computational speed, less complexity, reduced latency, low transpose memory, and regular data flow, and is suitable for VLSI implementation. Possible future works are described below.
1) The Dual-Mode 2-D SMDWT on JPEG2000: The dual-mode 2-D SMDWT can be developed to support 5/3 (lossless) lifting or 9/7 (lossy) lifting using a similar hardware architecture, since the 5/3 and 9/7 filters are very similar and both have low complexity.
2) An independent four-subband mask can be used in other visual coding fields.
