Abstract. In this paper, a VLSI architecture for the lifting-based shape-adaptive discrete wavelet transform (SA-DWT) with odd-symmetric filters is proposed. The proposed architecture comprises a stage-based boundary extension strategy and shape-adaptive boundary handling units. The former reduces the complexity of the multiplexers that are introduced to solve the shape-adaptive boundary extension. Each of the latter consists of two multiplexers and solves the shape-adaptive boundary extension locally without any additional register. Two case studies are presented: the JPEG 2000 default (9, 7) filter and the MPEG-4 default (9, 3) filter. Comparisons with previous architectures demonstrate the efficiency of the proposed architectures.
Introduction
Visual object coding has become an important technique because it can provide great flexibility to manipulate visual objects in multimedia applications and could improve visual quality in very low bit-rate applications. There have been many research efforts on developing new algorithms for coding arbitrarily shaped visual objects. SA-DWT [1] has been proven to outperform other coding methods in both PSNR (Peak Signal-to-Noise Ratio) and subjective quality. The visual coding standard, MPEG-4, has adopted SA-DWT as its core transform for coding arbitrarily shaped still texture.
* This work was supported in part by MOE Program for Promoting Academic Excellence of Universities under the grant number 89E-FA06-2-4-8, in part by National Science Council, Republic of China, under the grant number 91-2215-E-002-035, and in part by the MediaTek Fellowship.
The implementation of SA-DWT relies on the ability to handle the boundary extension for any kind of input signal segment. There are usually three candidates for the boundary extension scheme: zero-padding, periodic extension, and symmetric extension. Zero-padding is the simplest one, but it induces additional DWT coefficients at the boundary for perfect reconstruction. Periodic extension can preserve the number of DWT coefficients, but the output of the highpass filter will have large magnitude at the boundary because of the discontinuity. Symmetric extension can preserve the number of DWT coefficients for perfect reconstruction and maintain the continuity at the boundary. However, it is only effective for symmetric filters. In this paper, only symmetric extension is considered because of its good performance [2].
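The three extension schemes can be contrasted with a minimal Python sketch. The function name `extend` and its parameters are illustrative assumptions, not part of the original work, and the symmetric branch assumes a whole-sample mirror (the boundary sample is not repeated) on a segment longer than the padding.

```python
def extend(segment, pad, mode):
    """Return `segment` with `pad` extra samples on each side.

    Hypothetical helper for illustration only. `mode` is one of
    "zero", "periodic", or "symmetric".
    """
    n = len(segment)
    if mode == "zero":
        left = [0] * pad
        right = [0] * pad
    elif mode == "periodic":
        # Wrap around: the sample before index 0 is the last sample.
        left = [segment[i % n] for i in range(-pad, 0)]
        right = [segment[i % n] for i in range(n, n + pad)]
    elif mode == "symmetric":
        # Whole-sample mirror; short segments need the recursive
        # treatment and are rejected here.
        if pad >= n:
            raise ValueError("segment too short for a single reflection")
        left = [segment[i] for i in range(pad, 0, -1)]
        right = [segment[n - 1 - i] for i in range(1, pad + 1)]
    else:
        raise ValueError("unknown mode: " + mode)
    return left + list(segment) + right


print(extend([1, 2, 3, 4], 2, "zero"))       # [0, 0, 1, 2, 3, 4, 0, 0]
print(extend([1, 2, 3, 4], 2, "periodic"))   # [3, 4, 1, 2, 3, 4, 1, 2]
print(extend([1, 2, 3, 4], 2, "symmetric"))  # [3, 2, 1, 2, 3, 4, 3, 2]
```

In the periodic case the jump from 4 back to 1 at the boundary is the discontinuity that inflates the highpass output, whereas the symmetric case stays continuous.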
In the literature, DWT architectures are mainly implemented as convolution-based or lifting-based architectures. For the convolution-based architectures [3] [4] [5], the boundary extension issue, which could change the overall architecture, is seldom discussed. The symmetric extension has been considered in [6, 7], and the periodic extension has been discussed in [8]. However, no convolution-based architectures for SA-DWT have been proposed. On the other hand, many lifting-based architectures have also been proposed, but only some of them mention the boundary extension issues. The symmetric boundary extension is considered in [9], and zero-padding is adopted in [10]. We have proposed some lifting-based architectures for SA-DWT [11], but they require more than the minimal number of registers of the conventional DWT architectures. To the best of the authors' knowledge, no other SA-DWT architectures have been proposed in the literature.
The boundary extension issues are very important not only to the SA-DWT but also to the original DWT. When the one-dimensional (1-D) DWT modules are extended to the two-dimensional (2-D) line-based architectures [12] , the size of the required line buffers of the latter will be proportional to the minimal number of registers in the former. Thus, if additional registers are introduced when designing the boundary extension handling circuits, the area of the corresponding 2-D line-based architecture will be increased.
In this paper, an efficient architecture design method for the lifting-based SA-DWT implementation with odd-symmetric filters is proposed. Based on existing lifting-based DWT architectures, SA-DWT can be achieved by introducing some multiplexers. A stage-based boundary extension strategy is proposed to reduce the complexity of the multiplexers. Moreover, we propose shape-adaptive boundary handling units to implement these multiplexers without any additional register. The proposed architectures can also handle the boundary extension of the original DWT, because its extension cases are only a subset of those of SA-DWT.
The organization of this paper is as follows. The SA-DWT algorithm is reviewed in Section 2. Section 3 introduces 1-D and 2-D DWT architectures and the corresponding boundary extension issues. The VLSI architecture design method of SA-DWT is proposed in Section 4. In order to demonstrate the efficiency, two case studies of the JPEG 2000 default (9, 7) filter and the MPEG-4 default (9, 3) filter are given in Section 5, which includes comparisons with other architectures and a prototyping chip implementation. Lastly, conclusions are given in Section 6.
Shape-Adaptive Discrete Wavelet Transform
Visual objects consist of shape and texture information. As shown in Fig. 1, the former indicates which positions belong to the object, and the latter describes the texture content. The basic concept of the 1-D SA-DWT algorithm is to treat an arbitrarily shaped visual object as several contiguous signal segments based on the shape information and to perform DWT on the texture information of each signal segment independently. This concept can be directly extended to the 2-D SA-DWT for separable 2-D DWT filter banks. Without loss of generality, the row-wise DWT is assumed to be performed before the column-wise DWT for the separable 2-D DWT in the following. For example, the three white lines in Fig. 1(b) represent three contiguous signal segments of a visual object for the row-wise SA-DWT.
There are two main components in SA-DWT [1]. One is the subsampling strategy, and the other is the boundary extension of signal segments with arbitrary lengths. The computation of DWT is equivalent to subsampling the output signals of a pair of lowpass and highpass filters. For odd-symmetric DWT filters, even subsampling for the lowpass filter and odd subsampling for the highpass filter are usually adopted. There are two subsampling strategies in SA-DWT: local and global subsampling. The local subsampling strategy selects subsampling positions relative to the beginning of each signal segment. The number of lowpass signals is then always larger than or equal to that of highpass signals, which is preferred by entropy coding methods. The global subsampling strategy, on the other hand, selects positions relative to the rectangular bounding box of the shape information. Although the highpass signals may then outnumber the lowpass signals, the phase of the row-wise SA-DWT coefficients is always aligned for the column-wise SA-DWT. Figure 2 shows an example of these two subsampling strategies, where white rectangles stand for the shape information. In the global subsampling case, the highpass signals outnumber the lowpass signals in the second row. In the local subsampling case, however, the phase is skewed. According to [1], global subsampling achieves better coding gain than local subsampling. In the remainder of this paper, only global subsampling is discussed.
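The difference between the two subsampling strategies can be illustrated by counting how many lowpass (even-position) and highpass (odd-position) outputs a segment produces. The following is a minimal sketch; the function `split_counts` and its arguments are hypothetical names, and the length-one special case discussed later is not modeled.

```python
def split_counts(start, length, strategy):
    """Count (lowpass, highpass) outputs of one signal segment.

    `start` is the segment's first position relative to the bounding
    box of the shape. "local" counts parity from the segment start,
    "global" counts parity from the bounding box. Illustrative only.
    """
    if strategy == "local":
        offset = 0
    elif strategy == "global":
        offset = start
    else:
        raise ValueError("unknown strategy: " + strategy)
    positions = range(offset, offset + length)
    low = sum(1 for p in positions if p % 2 == 0)   # even subsampling
    high = length - low                             # odd subsampling
    return low, high


# A three-sample segment that starts at an odd global position.
print(split_counts(1, 3, "local"))   # (2, 1)
print(split_counts(1, 3, "global"))  # (1, 2)
```

With local subsampling the lowpass count never falls below the highpass count, while global subsampling can yield more highpass outputs but keeps every row on the same phase.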
There are two symmetric boundary extension types for odd-symmetric DWT filters, as shown in Fig. 3, where the solid and dotted rectangles represent the texture signals and the extended signals, respectively. Only Type B is involved in the forward DWT. When signal segments are very short, the boundary extension needs to be performed recursively: the leading boundary extension is performed first, followed by the trailing boundary extension. Some extension examples are given in Fig. 4. The boundary extension of very short signal segments is more complex than that of long signal segments because the leading and trailing boundary extensions depend on each other.
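For software reference, repeated reflection of a very short segment can be expressed as an index mapping. The sketch below assumes whole-sample symmetric extension (the boundary sample is not repeated), which is the usual choice for odd-length symmetric filters; whether this corresponds exactly to the extension type and ordering of Figs. 3 and 4 is an assumption, and the helper names are hypothetical.

```python
def reflect_index(i, n):
    """Map any integer index i onto a segment of length n by
    repeatedly mirroring at both ends (whole-sample symmetry)."""
    if n == 1:
        return 0                 # a single point extends to itself
    period = 2 * (n - 1)         # the mirrored signal repeats with this period
    i %= period                  # Python's % is non-negative for period > 0
    return i if i < n else period - i


def sample(segment, i):
    """Read the recursively extended segment at index i."""
    return segment[reflect_index(i, len(segment))]


# A three-sample segment extended far beyond both boundaries.
print([sample([1, 2, 3], i) for i in range(-4, 7)])
# [1, 2, 3, 2, 1, 2, 3, 2, 1, 2, 3]
```

The modulo step is what makes the leading and trailing extensions interact: for a short segment, a position several samples outside one boundary is resolved by bouncing off the other boundary as well.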
However, one special case should be emphasized. If the signal segment contains only one point, one lowpass signal should be placed in the corresponding position of the global frame, as described in the first row of 
DWT Architectures
In this section, we will introduce different DWT architectures and discuss the corresponding implementation issues for boundary extension. The 1-D DWT architectures focus on the computing unit designs, including multipliers, adders, and registers. The 2-D DWT architectures are dominated by memory issues, such as frame memory access bandwidth and internal buffer size.
1-D DWT Architectures
3.1.1. Convolution-Based. The arithmetic computation of DWT can be expressed as filter convolution and subsampling as follows:
where h(n) and g(n) are the lowpass and highpass filters, and $P_L$ and $P_H$ are the filter tap lengths of h(n) and g(n), respectively. The convolution-based architectures can be implemented with parallel or serial filters [5]. The parallel filters can handle the boundary extension with a router that is placed before the input of the filters [6, 7], as shown in Fig. 6. The router is basically a multiplexer and handles the boundary extension case by case according to the given shape information. However, it becomes more complex when solving the boundary extension of signals with arbitrary lengths, and the complexity grows as the number of filter taps increases. For example, Fig. 7 shows all possible cases of the shape information for SA-DWT with five-tap filters ($P_L = P_H = 5$), where only the cases in which the central point of the filters is inside the segment need to be considered. Furthermore, the number of possible cases for P-tap filters ($P_L = P_H = P$, P is odd) will be:

$$\left(\frac{P+1}{2}\right)^2 \qquad (2)$$

Figure 6. Parallel filter architecture and the corresponding boundary extension router [6, 7].

Thus, the router will be a P × P (P inputs and P outputs) multiplexer with $\left(\frac{P+1}{2}\right)^2$ possible cases. If the central points of the lowpass and highpass filters are in different positions, the possible cases will be the union of the cases of these two filters, whose number is slightly larger than $\max\left\{\left(\frac{P_L+1}{2}\right)^2, \left(\frac{P_H+1}{2}\right)^2\right\}$. Unlike parallel filters, serial filters cannot handle the boundary extension without additional registers because the input signals arrive serially. If the boundary extension router in Fig. 6 is used for serial filters, $\max\{P_L - 2, P_H - 2\}$ additional registers would be required.
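The case count for a P-tap router can be checked by brute force. The sketch below enumerates every P-bit shape mask whose central bit lies inside a segment and records how far the segment containing the center extends to each side; the function names are illustrative assumptions.

```python
from itertools import product


def center_run(mask):
    """Return (first, last) index of the run of ones containing the center."""
    center = len(mask) // 2
    lo = center
    while lo > 0 and mask[lo - 1]:
        lo -= 1
    hi = center
    while hi < len(mask) - 1 and mask[hi + 1]:
        hi += 1
    return lo, hi


def count_cases(taps):
    """Number of distinct boundary cases a `taps`-input router must handle."""
    runs = set()
    for mask in product((0, 1), repeat=taps):
        if mask[taps // 2]:          # central point must be inside the segment
            runs.add(center_run(mask))
    return len(runs)


for taps in (3, 5, 7, 9):
    print(taps, count_cases(taps), ((taps + 1) // 2) ** 2)
```

For the five-tap and nine-tap filters this gives 9 and 25 cases, in agreement with the closed form above.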
Lifting-Based.
There are many advantages of the lifting scheme [13] , such as fewer arithmetic operations, in-place implementation, and easy management of boundary extension. According to [14] , any DWT filter bank of perfect reconstruction can be decomposed into a finite sequence of lifting steps. This decomposition corresponds to a factorization for the polyphase matrix of the target wavelet filter into a sequence of alternating upper and lower triangular matrices and a constant diagonal matrix, which is shown as the following equations: where h(z) and g(z) are the lowpass and highpass analysis filters, respectively, the Eq. (3) is the polyphase decomposition, and P(z) is the polyphase matrix. Furthermore, every triangular matrix can be decomposed into such basic matrices,
(k is an odd integer), for odd-symmetric filter banks [15] . The corresponding basic unit is shown in Fig. 8 where the computation node performs summation, the register node stores the data in the previous clock cycles, and the input node receives the coming data in the current clock cycle. For example, the JPEG 2000 default (9, 7) filter can be further decomposed into four lifting stages as follows:
where the coefficients are given as a = −1.586134342, b = −0.052980118, c = 0.882911076, d = 0.443506852, and K = 1.149604398. The lifting-based architecture of the above factorization is shown in Fig. 9. It requires six multipliers (a, b, c, d, K, and 1/K), eight adders (each computation node has two adders), and six registers (one grey node stands for one register). In contrast, the parallel filter implementation of the (9, 7) filter bank would require nine multipliers (five for the lowpass filter and four for the highpass filter), fourteen adders, and seven registers. For software implementation, the boundary extension of the lifting-based (9, 7) filter can be solved as described in [16]. As for hardware implementation, we have proposed VLSI architectures for SA-DWT with the (9, 7) filter and the MPEG-4 default (9, 3) filter in [11]. For each basic unit like Fig. 8, one multiplexer is used to locally handle the boundary extension with control signals from the shape information. For the (9, 7) filter, the shape information can be used as control signals directly. However, one state machine is needed to generate control signals from the shape information for the (9, 3) filter because the delay number k of one basic unit is three instead of one. In addition, two additional registers are used to address the special case in which the signal segment length is one. Apart from the output registers, six and seven registers are required for the unpipelined lifting-based SA-DWT architectures with the (9, 7) and (9, 3) filters, respectively.
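As a software reference for the four lifting stages, the sketch below applies the coefficients a, b, c, d, and K listed above in the conventional predict/update order with whole-sample symmetric extension. The step order, the assignment of K to the lowpass band and 1/K to the highpass band, and all function names are assumptions made for illustration; they are not taken from Fig. 9.

```python
A, B, C, D = -1.586134342, -0.052980118, 0.882911076, 0.443506852
K = 1.149604398


def _mirror(i, n):
    """Whole-sample symmetric index mapping for a segment of length n >= 2."""
    period = 2 * (n - 1)
    i %= period
    return i if i < n else period - i


def _lift(x, coef, parity):
    """In place: x[i] += coef * (x[i-1] + x[i+1]) for all i of given parity.
    Mirroring preserves parity, so neighbours are never modified in a pass."""
    n = len(x)
    for i in range(parity, n, 2):
        x[i] += coef * (x[_mirror(i - 1, n)] + x[_mirror(i + 1, n)])


def forward_97(signal):
    if len(signal) < 2:
        raise ValueError("length-one segments are a separate special case")
    x = [float(v) for v in signal]
    _lift(x, A, 1)   # stage a: odd samples
    _lift(x, B, 0)   # stage b: even samples
    _lift(x, C, 1)   # stage c: odd samples
    _lift(x, D, 0)   # stage d: even samples
    low = [v * K for v in x[0::2]]    # assumed: K scales the lowpass band
    high = [v / K for v in x[1::2]]   # assumed: 1/K scales the highpass band
    return low, high


def inverse_97(low, high):
    x = [0.0] * (len(low) + len(high))
    x[0::2] = [v / K for v in low]
    x[1::2] = [v * K for v in high]
    for coef, parity in ((-D, 0), (-C, 1), (-B, 0), (-A, 1)):
        _lift(x, coef, parity)        # undo the stages in reverse order
    return x


low, high = forward_97([3, 1, 4, 1, 5, 9, 2])
print(inverse_97(low, high))  # recovers the input up to rounding error
```

Because every stage only adds a function of the opposite-parity samples, the inverse simply subtracts the same quantity in reverse order, which is why the forward and inverse structures share the same stages.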
Flipping Structure.
The critical path of lifting-based architectures is potentially longer than that of convolution-based ones because of the timing accumulation effect among basic units. Although pipelining can reduce the critical path, it also increases the number of registers. Instead of pipelining, we have proposed the flipping structure to shorten the critical path of lifting-based architectures by flipping the multiplier coefficients [17]. For example, the critical path of Fig. 9 is $4T_m + 8T_a$ ($T_m + 2T_a$ for each lifting stage), where $T_a$ is the time needed for an addition and $T_m$ is the time needed for a multiplication. In contrast, the parallel filter implementation can have a much shorter critical path, $T_m + 4T_a$, if adder trees are used to connect the adders of the lowpass and highpass filters. If Fig. 9 is cut into four pipelining stages so as to reduce the critical path to $T_m + 2T_a$ (one lifting stage), six more registers are required. The flipping structure can shorten the critical path to $T_m + 5T_a$ (the multiplier 1/a and the five following adders) without any hardware overhead, as shown in Fig. 10 where b = 16b, c = 2c, and d = 2d.
The basic unit of flipping structures for odd-symmetric DWT filters is shown in Fig. 11. The boundary extension handling strategy of flipping structures is the same as that of the corresponding lifting-based architectures since the basic units are similar.
2-D DWT Architectures
In contrast to systolic or semisystolic routing architectures [5] , RAM-based 2-D DWT architectures have many advantages, including real-life feasibility, high regularity/density of storage, and simple control circuits. According to the evaluation in [18] , the memory issue is the most critical part for 2-D DWT implementation, instead of the number of multipliers that dominates the 1-D DWT architectures.
With respect to the external frame memory access bandwidth and the internal buffer size, the 2-D DWT architectures can be categorized as direct, 1-level 2-D [19], and multi-level 2-D [4, 5] methods. The latter two architectures are called line-based implementations [12]. The direct method performs the row-wise DWT first and stores the intermediate coefficients in the external frame memory. After the row-wise DWT is finished, the column-wise DWT is performed on the intermediate coefficients with the same 1-D DWT module. This direct implementation is very simple but requires huge external frame memory bandwidth. Because the size of this frame memory is $O(N^2)$ (N is the image width/height), it is usually assumed to be off-chip.
The 1-level 2-D architecture performs the row-wise and column-wise DWT of the same level simultaneously. It uses internal line buffers to store temporary coefficients so as to reduce the external frame memory access. The multi-level 2-D architecture performs all levels of the 2-D DWT at the same time and minimizes the required external memory bandwidth. However, this kind of implementation usually results in poor hardware utilization. For example, using the Recursive Pyramid Algorithm [20] to schedule the decomposition tasks limits the hardware utilization to at most 66.7%. These three architectures are summarized in Table 1, where the decomposition level, J, is assumed to be infinite, L is a constant related to the adopted 1-D DWT architecture, and the external memory access includes memory writes and reads. The trade-off between external memory access and internal buffer size is that the former consumes the most power and much frame memory bandwidth [18], while the latter increases the hardware cost. L is highly related to the number of registers in the adopted 1-D DWT architecture because the internal buffer consists of the temporal buffer, which is proportional to the minimal register number, and the data buffer, which depends only on the input scan method. For the parallel-parallel architecture of the (9, 7) filter bank, L is 8.5 [5]. For the lifting-based architecture, however, L can be as small as 5.5 [21].
Proposed VLSI Architecture Design Methods for Odd-symmetric SA-DWT
VLSI architecture design methods for lifting-based SA-DWT are proposed in this section, which are mainly improved from [11] . The design methods require no additional registers and are comprised of two parts, the stage-based boundary extension strategy and the shape-adaptive boundary handling unit. The former is a strategy to reduce the complexity of multiplexers, and the latter is used to handle the boundary extension for every basic unit that could be lifting-based or flipping-based.
The following proposed methods are also applicable to the lifting-based Inverse SA-DWT as proposed in [11]. The lifting scheme of the Inverse DWT can be derived by inverting the signal flow of the DWT. Thus, the lifting stages of the Forward and Inverse DWT are the same, except that one addition is replaced with one subtraction. As a result, the shape-adaptive boundary handling multiplexers of the Forward and Inverse DWT are the same, except for the passing path of the special case in which the signal length is one.
Stage-Based Boundary Extension Strategy
Instead of handling the boundary extension at the filter input side as in Fig. 6, the boundary extension of lifting-based or flipping-based architectures is proposed to be performed locally in every lifting stage. For every basic unit of the lifting stages, the related shape information covers 2k + 1 nodes, where k is defined as in Figs. 8 and 11. Thus, there are $(k + 1)^2$ possible cases of shape information for each basic unit according to Eq. (2). The shape information of each basic unit is exactly the shape information in the corresponding position of the filter input. Furthermore, the input and output sizes of the multiplexers can be smaller than the filter length.
Shape-Adaptive Boundary Handling Unit
The proposed shape-adaptive boundary handling unit for the basic unit of Figs. 8 and 11 is shown in Fig. 12. Instead of the single multiplexer used in [11], two multiplexers, M1 and M2, are used with the (2k + 1)-bit shape information from the corresponding filter input. The basic unit can be lifting-based or flipping-based. The input I2 is from the central branch of the original basic unit, and the input bus I1 is from the register chain. If the signal segment length is one, the input signal is passed through the path I1 → In3 → Out or I2 → In2 → Out. Thus, no additional registers are required for this special case. Except in the special case, the output of multiplexer M2 is always the output of the basic unit (In1). The multiplexer M1 selects the corresponding inputs of the basic unit (A and B) from k + 1 possible candidates of I1 according to the (2k + 1)-bit shape information. However, the detailed configuration of M1 and M2 depends on which DWT filters are adopted. M1 needs to be designed to perform short-length boundary extension locally and independently of other basic units with the extension strategies described in Fig. 4. In summary, the proposed design method introduces two multiplexers, M1 and M2, for each basic unit to solve the SA-DWT without any additional register. M1 is a (k + 1) × 2 multiplexer with at most $(k + 1)^2$ possible cases, and M2 is a 3 × 1 multiplexer with few possible cases.
Case Study
Two case studies are presented in this section, including the JPEG 2000 default lossy (9, 7) filter and the MPEG-4 default (9, 3) filter.
JPEG 2000 Default Lossy (9, 7) Filter
The flipping structure in Fig. 10 requires nearly the same hardware cost as the lifting-based architecture in Fig. 9. Thus, the flipping structure is adopted for the SA-DWT implementation in the following. Except for the normalization step, Fig. 10 is composed of four basic flipping units, and each unit consists of one multiplier and two adders. The proposed shape-adaptive boundary handling (SABH) unit can be designed as in Fig. 13, where n and f are the shift bit number and the multiplier coefficient, respectively. The two multiplexers, M1 and M2, handle the boundary extension by examining the shape information S = {S1, S2, S3} that corresponds to the input signals {I1, I2, I3}.
The configuration of M1 and M2 is shown in Fig. 14. If the input signals are all inside the signal segment, M1 sets the nodes A and B to I1 and I3, respectively. Otherwise, when the input signals are at the segment boundary, M1 sets both A and B to I1 or to I3, depending on whether the boundary is leading or trailing. Under these conditions, M2 outputs the computation result, In1. For the special case, M1 and M2 should be designed to pass the signal of length one, which may be from an even or odd position as described in Fig. 15. Since this special case is not included in the above conditions, the multiplexers can be designed to pass the signal to the lowpass output node in the exact cycle. Thus, M2 passes In2 or In3 for the special case.
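The multiplexer behaviour described above for k = 1 can be summarized in a small behavioural model. This is a simplified sketch, not the configuration of Fig. 14: it models a lifting-style update `I2 + coef * (A + B)` rather than the flipping unit, it collapses the In2/In3 pass-through paths into a single "pass" outcome, and the "idle" outcome for a center outside the segment is an added assumption. All names are hypothetical.

```python
def sabh_select(shape, inputs):
    """Model of M1/M2 for one basic unit with k = 1.

    shape  = (S1, S2, S3): 1 if the position belongs to the segment.
    inputs = (I1, I2, I3): samples at those positions.
    Returns ("compute", A, B), ("pass", I2), or ("idle",).
    """
    s1, s2, s3 = shape
    i1, i2, i3 = inputs
    if not s2:
        return ("idle",)              # assumed: nothing to compute here
    if s1 and s3:
        return ("compute", i1, i3)    # fully inside the segment
    if s1:
        return ("compute", i1, i1)    # I3 missing: mirror I1 onto B
    if s3:
        return ("compute", i3, i3)    # I1 missing: mirror I3 onto A
    return ("pass", i2)               # segment of length one


def lifting_output(shape, inputs, coef):
    """Output of the modelled unit for a lifting-style update."""
    choice = sabh_select(shape, inputs)
    if choice[0] == "compute":
        return inputs[1] + coef * (choice[1] + choice[2])
    if choice[0] == "pass":
        return inputs[1]
    return None


print(sabh_select((1, 1, 1), (10, 20, 30)))  # ('compute', 10, 30)
print(sabh_select((0, 1, 1), (10, 20, 30)))  # ('compute', 30, 30)
print(sabh_select((0, 1, 0), (10, 20, 30)))  # ('pass', 20)
```

The model makes the key property visible: every decision uses only the three local shape bits, so no extra registers are needed to remember where a segment started.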
By applying the above SABH units to Fig. 10, the proposed shape-adaptive flipping structure can be derived as in Fig. 16, where the critical path is only increased by $3T_a$ if the time taken by the multiplexers is ignored. The additional multiplexer, ML, is used to select the correct lowpass signals and to output the shape information of the lowpass and highpass signals by examining the shape information {m2, m3, m4, m5}. When Fig. 16 is extended to the 2-D line-based architecture, only the four data registers and four shape registers on the left side of the dotted pipelining line need to be replaced by the internal temporal buffer, because the right side can be implemented independently.
Comparison.
The comparison results of three 1-level 2-D architectures, including parallel-parallel [6] , previous lifting-based [11] , as well as the proposed flipping-based and lifting-based architectures, are shown in Table 2 , where the time taken for multiplexers is ignored to calculate the critical path.
The parallel-parallel architecture requires more multipliers, adders, and internal buffer, but has a shorter critical path. However, the router design in [6] can only handle the boundary extension of long signal segments and needs to be modified to handle very short signal segments. Although the previous lifting-based architecture can handle the shape-adaptive boundary extension, its critical path is too long, and its internal buffer size is larger than that of the parallel-parallel one. By applying the proposed SABH units to the flipping structure together with a proper design of the data buffer [21], the critical path is shortened, and the internal buffer size is only about 65% of that of the parallel-parallel one. Moreover, the proposed SABH units can be used for the lifting-based architecture as well. The performance is the same as that of the flipping structure, except for the longer critical path, $4T_m + 8T_a$. The number of multipliers drops from 16 in the previous lifting-based architecture to 14 in the proposed one because two additional paths are introduced in the previous architecture. As for the internal buffer size, two lines of internal buffer are saved because the proposed 1-D SA-DWT architecture needs no additional registers. The remaining saving of 1.5 lines comes from the proper design of the data buffer.
Chip Implementation.
Figure 17. Photograph of the prototyping chip for 1-level 2-D line-based SA-DWT with the (9, 7) filter.
A prototyping chip for the 1-level 2-D line-based SA-DWT with the (9, 7) filter, using the proposed shape-adaptive flipping structure, was implemented and fabricated with the TSMC 0.25-µm CMOS 1P5M process. The chip photograph and features are shown in Fig. 17 and Table 3, respectively. If this chip works at 50 MHz, the processing capability will be 100 M pixels per second. This processing rate can provide real-time computation. In this prototyping chip, the data wordlength and the frame size are assumed to be 16 bits and 128 × 128, respectively. Thus, the internal buffer size is 5.5 × 128 × (16 + 1) = 11968 bits. Under these conditions, the logic part and the internal buffer occupy nearly the same area. Therefore, the internal buffer will dominate the area cost if the frame width is larger than 128 or the data wordlength is longer than 16 bits.
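The buffer-size arithmetic above generalizes directly to other frame widths and wordlengths. The helper below is a hypothetical illustration; the one extra bit per sample is the shape bit implied by the (16 + 1) term, and the line counts 5.5 and 8.5 are the values of L quoted earlier for the lifting-based and parallel-parallel architectures.

```python
def internal_buffer_bits(lines, frame_width, data_bits, shape_bits=1):
    """Internal line-buffer size in bits for a 1-level 2-D architecture."""
    return int(lines * frame_width * (data_bits + shape_bits))


# Prototype configuration: L = 5.5 lines, 128-pixel rows, 16-bit data.
print(internal_buffer_bits(5.5, 128, 16))            # 11968

# Same frame with the parallel-parallel line count L = 8.5.
print(internal_buffer_bits(8.5, 128, 16))            # 18496

# Ratio of the two line counts, about 65%.
print(round(5.5 / 8.5, 2))                           # 0.65
```

Since the size scales linearly with both frame width and wordlength, doubling the frame width to 256 doubles the buffer to 23936 bits, which is why the buffer dominates the area for larger frames.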
MPEG-4 Default (9, 3) Filter
The lowpass and highpass coefficients of the (9, 3) filter bank are given as follows:
[3, −6, −16, 38, 90, 38, −16, −6, 3]
The lifting scheme of the polyphase matrix can be factored as:
where p = − , and r = − . The corresponding signal flow graph is shown in Fig. 18. Only the normalization coefficients, $\sqrt{2}$ and $1/\sqrt{2}$, need to be implemented with floating-point multipliers. The integer multiplications can be implemented by shifters and adders for both lifting-based and convolution-based architectures. For example, the coefficients p, q, and r can be implemented by one shifter, two adders and one shifter, and one adder and one shifter, respectively.
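The values of p, q, and r are not legible in the text above, so the following check rests on assumptions. It assumes a three-stage lifting of the form d[n] = x[2n+1] + p(x[2n] + x[2n+2]) followed by s[n] = x[2n] + q(d[n−1] + d[n]) + r(d[n−2] + d[n+1]), with the hypothetical values p = −1/2, q = 19/64, and r = −3/64. These values were chosen because they reproduce the listed lowpass taps divided by 128 and are consistent with the stated shifter and adder counts and with a delay number of three in the third stage; they should not be read as the paper's stated coefficients.

```python
from fractions import Fraction


def lowpass_taps(p, q, r):
    """Equivalent 9-tap lowpass filter of the assumed three-stage lifting,
    returned for input positions -4..4 around the output sample."""
    taps = {i: Fraction(0) for i in range(-4, 5)}

    def add_highpass(n, weight):
        # d[n] = x[2n+1] + p * (x[2n] + x[2n+2]), scaled by `weight`
        taps[2 * n + 1] += weight
        taps[2 * n] += weight * p
        taps[2 * n + 2] += weight * p

    taps[0] += 1                  # s[0] starts from x[0]
    add_highpass(-1, q)           # stage q uses d[-1] and d[0]
    add_highpass(0, q)
    add_highpass(-2, r)           # stage r uses d[-2] and d[1]
    add_highpass(1, r)
    return [taps[i] for i in range(-4, 5)]


# Assumed coefficient values (not recoverable from the text).
p, q, r = Fraction(-1, 2), Fraction(19, 64), Fraction(-3, 64)
print([int(t * 128) for t in lowpass_taps(p, q, r)])
# [3, -6, -16, 38, 90, 38, -16, -6, 3]
```

Under these assumptions, 19 = 16 + 2 + 1 needs two adders and 3 = 2 + 1 needs one, each followed by a 6-bit shift, while p is a single 1-bit shift.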
The SABH units of the first two lifting stages, with coefficients p and q, are nearly the same as Fig. 13, except that the basic units are implemented with shifters and adders. Since the delay number k of the third stage, with coefficient r, is three, the corresponding SABH unit should be designed as in Fig. 19, where the input bus I1 comprises four signals from the register chain, and the input control signal S contains the shape information of the corresponding seven positions. Thus, the M1 of SABH r is required to handle $(3 + 1)^2 = 16$ possible cases. Furthermore, M1 and M2 should be designed to pass the signal of length one, which may be from an even or odd position as described in Fig. 20.
The proposed shape-adaptive lifting-based architecture for the (9, 3) filter is shown in Fig. 21, where five data registers, six shape registers, and six multiplexers are used. Two multipliers are adopted for the two normalization coefficients $\sqrt{2}$ and $1/\sqrt{2}$. Nine adders are used, of which one is for r, two are for q, and six are for the adders in the three lifting stages. Besides the above-mentioned functions, the M2's of SABH p and SABH r are used to determine the shape information of the highpass and lowpass output signals, respectively. The critical path is $T_m + 5T_a$ without pipelining if the delay of the multiplexers is ignored; it includes the lifting stages p and r as well as the normalization coefficient $\sqrt{2}$.
Figure 21. Proposed shape-adaptive lifting-based architecture for the (9, 3) filter.
Comparison.
The SA-DWT of the (9, 3) filter can be implemented with a parallel filter architecture and a router that is a 9 × 9 multiplexer with 25 possible cases. In this approach, two multipliers (both the lowpass and highpass filters have $\sqrt{2}$), sixteen adders (including the additions and the integer multipliers), seven data registers, and seven shape registers are required. The critical path is also $T_m + 5T_a$, which is achieved by using adder trees.
The previous shape-adaptive lifting-based architecture of [11] handles the boundary extension by use of a state machine and some multiplexers. The drawback is that the design complexity of the state machine is high, and it requires two additional paths and registers for solving the special case.
The above architectures, including the proposed one, have been verified with Verilog-XL and synthesized with standard cells from Avant! 0.35-µm cell library. For comparison, the timing constraints for circuit synthesis are set as tight as possible. The comparison results are summarized in Table 4 .
According to Table 4 , the parallel filter architecture requires the most hardware resource for adders, multiplexers, and registers. Compared with the previous lifting-based architecture, the proposed one saves two data registers and six shape registers. Although the gate counts of the multiplexers of these two architectures are similar, the proposed one is designed in a more systematic way and can be easily extended to other filter banks.
Conclusion
Implementing SA-DWT requires modifying the original DWT architectures so that they can handle the boundary extension of very short signal segments. For the parallel filter architecture, a complex router can be used to solve SA-DWT. However, such a router needs additional registers when applied to lifting-based architectures. In this paper, a stage-based boundary extension strategy is used to reduce the complexity of the multiplexers, and the shape-adaptive boundary handling unit for the basic lifting-based and flipping-based units is proposed to solve SA-DWT without additional registers. Two case studies, the JPEG 2000 default (9, 7) filter and the MPEG-4 default (9, 3) filter, demonstrate the efficiency of the proposed architectures.
Acknowledgment
The multi-project chip support from the National Science Council of Taiwan/Chip Implementation Center is acknowledged. 
