Abstract: In this paper, we develop a scalable VLSI architecture employing a two-channel quadrature mirror filter (QMF) lattice for the one-dimensional (1-D) discrete wavelet transform (DWT). We begin with the development of systematic scheduling, which determines the filtering instants of each resolution level, on the basis of a binary tree. Then the input-output relation between lattices of the QMF bank is derived, and a new structure for the data format converter (DFC), which controls the data transfer between resolution levels, is proposed. In addition, an implementation of a delay control unit (DCU) that controls the delay between lattices of the QMF is proposed. The structures for the DFC and DCU are regular, scalable, and require a minimum number of registers, and thereby lead to an efficient and scalable architecture for the DWT. A scalable architecture for the inverse DWT is also developed in a similar manner. Finally, pipelining of the proposed architecture is considered.
I. INTRODUCTION

DUE TO ITS inherent time-scale locality characteristics, the discrete wavelet transform (DWT) has received considerable attention in digital signal processing applications such as speech and image processing [1]-[5]. The DWT is usually implemented based on the binary tree structured quadrature mirror filter (QMF) bank illustrated in Fig. 1. At each level of the tree for the forward transform, the outputs of the two-channel analysis QMF bank are evaluated and decimated by a factor of two. Note that the same filter bank is used at each resolution level. For the inverse transform, the synthesis QMF bank follows the twofold expanders. Here, the analysis and synthesis filters have the perfect reconstruction (PR) property [6]-[8]. An attractive feature of this tree structure, which is useful for VLSI implementation, is the following: in the tree, the number of filter outputs computed during each sample period is upper bounded by two, irrespective of the number of levels. This property naturally leads to VLSI architectures that employ only one pair of filters and use them iteratively for all levels. In fact, all VLSI implementations of the DWT introduced so far have such an architecture, which consists of a data format converter (DFC), called the routing network in [12], and a two-channel filter bank [9]-[15]. The DFC controls data transfer between levels: it stores filtered outputs at a certain level and provides them for filtering at the next level. The two-channel filter bank is implemented either in direct form or in lattice form. Among the architectures with direct form FIR filters [9]-[13], the one in [11] needs less hardware than those in [12] and [13]. The former, however, is not regular and is therefore difficult to scale; it must be redesigned when either the filter length or the number of resolution levels changes. On the other hand, the ones in [12] and [13], which are based on systolic arrays, have regular structures and are scalable. The two-channel filter bank in lattice form, which is often referred to as the two-channel QMF lattice, is considerably simpler to implement than the filter bank in direct form: the hardware complexity of the former is about half of that of the latter [8], [16], [17]. Efficient DWT architectures employing the QMF lattice are proposed in [14] and [15]. These, however, are not scalable, and the hardware complexity of the DFC increases exponentially as the number of resolution levels increases.
In this paper, we develop a scalable VLSI architecture employing a two-channel QMF lattice for the one-dimensional (1-D) DWT. We begin with the development of systematic scheduling, which determines the filtering instants of each resolution level, on the basis of a binary tree. In contrast to the DWT scheduling algorithms in [9]-[15], the proposed algorithm provides a closed-form expression for scheduling. Using this expression, a scalable DFC is developed. In addition, a scalable DCU that controls the delay between lattices of the QMF bank is proposed. It will be shown that the DFC and DCU require a minimum number of registers. The proposed DWT architecture consisting of the DFC, DCU, and the QMF lattice is scalable and requires less hardware than those in [9]-[15].
Publisher Item Identifier S 1057-7130(98)04670-9.
The rest of the paper is organized as follows. In Section II, the proposed architecture for the DWT is developed. In Section III, we present an architecture for the inverse DWT. Finally, pipelining of the proposed architecture is discussed in Section IV.
II. ANALYSIS STAGE WAVELET ARCHITECTURE
Consider again Fig. 1(a) . Due to the decimations, the data rate decreases by a factor of two as we move from one resolution level to the next. Therefore, the operating frequency of the filters and in level is given by where is the input sampling frequency. Fig. 2 illustrates the lattice filter implementing and . The high-pass and low-pass filtered outputs are produced by the upper and lower nodes of the lattice, respectively. For th level filtering, the delay and indicate and time unit delays, respectively, where . Note that for filter length , this structure requires only lattices, each of which consists of two multipliers and two adders. Therefore, its hardware complexity is about half of that of a pair of direct form FIR filters.
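The "about half" comparison can be illustrated by a simple multiplier count. The sketch below is not taken from the paper: it assumes an even filter length, assumes that the lattice uses one section per two filter taps (the usual figure for a two-channel paraunitary lattice, since the section count is not legible in the text above), and ignores any overall scaling multiplier. The function names are hypothetical.

```python
def direct_form_multipliers(filter_length: int) -> int:
    """Multipliers for a pair of direct form FIR filters (one per tap, two filters)."""
    return 2 * filter_length


def lattice_multipliers(filter_length: int) -> int:
    """Multipliers for the QMF lattice, assuming filter_length / 2 sections.

    Each section has two multipliers and two adders, as stated in the text.
    The section count is an assumption of this sketch.
    """
    if filter_length <= 0 or filter_length % 2 != 0:
        raise ValueError("this sketch assumes a positive, even filter length")
    sections = filter_length // 2
    return 2 * sections


if __name__ == "__main__":
    for length in (4, 8, 16):
        print(length, direct_form_multipliers(length), lattice_multipliers(length))
```

Under these assumptions the lattice needs exactly half as many multipliers as the direct form pair for every even filter length.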
Our objective is to implement all resolution levels of the DWT by employing only one lattice filter that operates at the input sampling frequency. To achieve this, we compute only the filter outputs that are not discarded by the decimations. At level 1, outputs are evaluated at every other sample time, say, at the even time instants 0, 2, 4, and so on. When these are obtained by the lattice filter operating at the input sampling frequency, the filter becomes idle at the odd time instants 1, 3, 5, and so on. Now we can compute outputs of the other resolution levels during these idle periods. This can be seen from the data dependence graphs (DDG's) shown in Fig. 3. In this figure, the processing element (PE) is the lattice consisting of two multipliers and two adders;
is the input to the 1st level; th filtered outputs at level are denoted by and , respectively. The delay indicates time unit delay. Since the delay between adjacent lattices at level is , the delays between lattices in levels 2 and 3 are represented as and , respectively. In the DDG for level , the lower level output of the th PE at time is inputted to the th PE at time . This holds because the delay between lower levels of the adjacent lattices is . As described above, the filtering for level 1 is performed at . To exploit the idling time slots , for level 2 we start filtering at and compute the outputs at . Similarly, the outputs for level 3 are computed at . Now it should be noted that the filter execution times , and never overlap with each other, and that the th and th low-pass outputs from level and , which are required for computing the th output at level , are obtained before initializing the computation of . Based on these observations, we can combine the three DDG's in Fig. 3 (a)-(c). The result is shown in Fig. 4 . This DDG indicates that the outputs of all three resolution levels of the DWT can be evaluated by employing only one lattice filter which operates with frequency . In what follows, we shall extend this result to the DWT with arbitrary number of resolution levels.
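A quick way to see why a single lattice suffices is to add up the fraction of time slots each level occupies. The sketch below assumes that level j needs one slot out of every 2^j, which follows from decimation by two at every level; the function name is hypothetical.

```python
from fractions import Fraction


def slot_utilization(num_levels: int) -> Fraction:
    """Fraction of time slots used when level j occupies one slot in every 2**j."""
    return sum((Fraction(1, 2 ** level) for level in range(1, num_levels + 1)),
               Fraction(0))


if __name__ == "__main__":
    for levels in range(1, 6):
        print(levels, slot_utilization(levels))
```

The total equals 1 - 2^(-J) for J levels, so it stays below one slot per sample period however many levels are added.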
Denote the filtering instants for the $j$th level by $t_j(n)$. Then $t_1(n) = 2n$, $t_2(n) = 4n + 1$, and $t_3(n) = 8n + 3$. Note that the $n$th output of the $j$th resolution level is computed at time $t_j(n)$. Since the outputs of the $j$th level are produced every $2^j$ time units, $t_j(n)$ is expressed as $t_j(n) = 2^j n + c_j$, where $c_j$ is the instant at which the output of the $j$th level is calculated for the first time. The constant $c_j$ can be obtained with the help of a binary tree, shown in Fig. 5, which successively decomposes the set of nonnegative integers by half. At the top of the tree, we have the set $\{0, 1, 2, \ldots\}$. In the second level of the tree, this set is decomposed into even and odd number sets, $\{0, 2, 4, \ldots\}$ and $\{1, 3, 5, \ldots\}$. The even set provides the filtering instants for the first resolution level, i.e., $c_1 = 0$. The odd set is again decomposed into two sets, $\{1, 5, 9, \ldots\}$ and $\{3, 7, 11, \ldots\}$. From the set $\{1, 5, 9, \ldots\}$, we get $c_2 = 1$, and the set $\{3, 7, 11, \ldots\}$ is decomposed further. Continuing in this manner, we can obtain the following general expression for $t_j(n)$, $1 \le j \le J$:
$$t_j(n) = 2^j n + 2^{j-1} - 1 \tag{1}$$
where $J$ is the number of resolution levels. Due to the successive decompositions, the filtering instants given by (1) never overlap with each other. Furthermore, this scheduling leads to the following observation.
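The binary tree decomposition can be sketched in a few lines. The code below repeatedly splits the remaining time slots into alternating halves, assigns the first half to the current level, and compares the result with the closed form 2^j n + 2^(j-1) - 1. That closed form is reconstructed here from the even/odd decomposition described above and should be read as an assumption of this sketch; the function names are hypothetical.

```python
def binary_tree_schedule(num_levels: int, horizon: int) -> dict:
    """Assign time slots 0..horizon-1 to levels by repeated even/odd splitting.

    At each step the slots at even positions of the remaining list go to the
    current level, and the slots at odd positions are decomposed further.
    """
    remaining = list(range(horizon))
    schedule = {}
    for level in range(1, num_levels + 1):
        schedule[level] = remaining[0::2]
        remaining = remaining[1::2]
    return schedule


def closed_form_instant(level: int, n: int) -> int:
    """Assumed closed form for the n-th filtering instant of a level."""
    return (2 ** level) * n + 2 ** (level - 1) - 1


if __name__ == "__main__":
    for level, instants in binary_tree_schedule(3, 32).items():
        print(level, instants)
```

Because every slot is handed to at most one level, the instants of different levels cannot collide, and the slots left over after the last level stay idle.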
Observation 1: If we schedule filtering operations based on (1), then the th and th low-pass outputs of the th level, and , are obtained before initializing the computation of . Specifically, and , respectively, are computed and time units before the evaluation of . Proof: The first part is proved by showing that . Here the first inequality is obvious. Now . From these relations, the second part directly follows.
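The timing claim of Observation 1 can be checked numerically. The sketch below assumes the closed form 2^j n + 2^(j-1) - 1 for the filtering instants and assumes that the two low-pass outputs needed from the previous level are those with indices 2n - 1 and 2n; the indices are not legible in the text above, so both points are assumptions. The function names are hypothetical.

```python
def filtering_instant(level: int, n: int) -> int:
    """Assumed closed form for the n-th filtering instant of a level."""
    return (2 ** level) * n + 2 ** (level - 1) - 1


def input_lead_times(level: int, n: int) -> tuple:
    """Time units by which previous-level outputs 2n-1 and 2n precede output n.

    Only defined for level >= 2, where the inputs come from another level.
    """
    if level < 2:
        raise ValueError("lead times are defined here for level >= 2 only")
    target = filtering_instant(level, n)
    older = target - filtering_instant(level - 1, 2 * n - 1)
    newer = target - filtering_instant(level - 1, 2 * n)
    return older, newer


if __name__ == "__main__":
    for level in (2, 3, 4):
        print(level, input_lead_times(level, 1))
```

Under these assumptions the lead times are 3 * 2^(j-2) and 2^(j-2) time units, independent of n, so both inputs are always available before the computation starts.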
From this observation we can generalize the DDG in Fig. 4 for an arbitrary number of resolution levels.
For implementing the DWT, we need to establish the input-output relations between the PE's. Denote the th upper and lower inputs associated with the th PE, say , by and , respectively, where . The corresponding outputs are denoted by and (Fig. 6). It should be pointed out that the upper output of , is equal to for . The relations between these inputs and outputs are derived directly from the previous discussions. For the first PE for for (2a) and for for (2b)
For the th PE
for (3b)
These relations lead to the architecture shown in Fig. 7 . The Data Format Converter (DFC) controls the input and the feedback sequences depending on (2). The Delay Control Unit (DCU) controls the delay , depending on the relation between the time index and the resolution level , given by (3b). Next we design circuits for implementing the DCU and DFC.
A. Design of Delay Control Unit (DCU)
Consider the design of a DCU for the DWT with three resolution levels . For this case, (3b) is rewritten as for for for (4) This equation directly leads to the structure depicted in Fig. 8 . This DCU, which was originally proposed in [15] , looks simple but requires about word-level registers for resolution levels. Note that the required number of registers increases exponentially as the number of levels increases. It is possible to reduce the number of registers by using the method in [18] . In what follows, we briefly review this method and then develop an alternative approach.
The method in [18] begins with the formation of the lifetime chart that shows the lifetime of each input value. For example, suppose are the values of at . Their lifetimes are determined according to (4) , and marked on a lifetime chart as illustrated in Fig. 9(a) . Then, the minimum number of required registers, say , is obtained by counting the number of live input values at each time instant, and selecting the maximum among the numbers. For the example in Fig. 9(a) ,
. The registers, which are denoted by , are connected in cascade. At each time, a value in , is shifted to if the value is alive; otherwise it is sent to the next PE through a multiplexer. If the value in is alive, it is stored in an empty register. Data flow among registers is summarized in the register allocation table [ Fig. 9(b) ], and an efficient DCU structure [ Fig. 9(c) ] and proper switching time are derived by examining the table. The resulting DCU requires minimal number of registers, but is not scalable; if varies, it should be redesigned.
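The register count obtained from a lifetime chart is simply the largest number of values that are alive at the same time. The sketch below computes it for a list of lifetimes given as (birth, death) pairs, where a value is alive for birth <= t < death. The example lifetimes are hypothetical and are not the ones in Fig. 9(a).

```python
def min_registers(lifetimes: list) -> int:
    """Maximum number of simultaneously live values over all time instants.

    Each lifetime is a (birth, death) pair; the value occupies a register
    for every integer time t with birth <= t < death.
    """
    if not lifetimes:
        return 0
    first = min(birth for birth, _ in lifetimes)
    last = max(death for _, death in lifetimes)
    return max(
        (sum(1 for birth, death in lifetimes if birth <= t < death)
         for t in range(first, last)),
        default=0,
    )


if __name__ == "__main__":
    # Hypothetical lifetimes, chosen only to illustrate the counting step.
    example = [(0, 1), (1, 3), (2, 3), (3, 7), (4, 5)]
    print(min_registers(example))
```

This is only the counting step of the method in [18]; deriving the cascade structure and the switching times from the register allocation table is a separate step not shown here.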
The alternative DCU structure that we propose is based on a parallel connection of registers. Suppose that the registers are connected in parallel, and that their number equals the minimal number of registers obtained via the lifetime chart. In this scheme, once an input value is loaded into a register, it remains there during its lifetime. A clock signal is applied to a register only when a new input is loaded into it and the old one is discarded. The register allocation table for the proposed scheme is illustrated in Fig. 10(a) [the lifetime chart in Fig. 9(a) is assumed]. It is seen that all input values can be stored during their lifetimes without any collision. The resulting DCU structure is shown in Fig. 10(b). The proposed architecture is scalable, as shown in the observation below.
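The parallel scheme can be illustrated with a generic first-fit allocation: each value is written into the first register whose previous occupant has expired and stays there until its own lifetime ends. This is a standard interval-allocation sketch, not the closed-form clocking rule of Observation 2; the lifetimes and names below are hypothetical.

```python
def allocate_parallel_registers(lifetimes: list) -> tuple:
    """First-fit allocation of values to parallel registers.

    Each lifetime is a (birth, death) pair, alive for birth <= t < death.
    Returns (assignment, register_count), where assignment maps the index of
    each value to the register that holds it for its whole lifetime.
    """
    free_at = []      # free_at[r] is the time at which register r becomes free
    assignment = {}
    ordered = sorted(enumerate(lifetimes), key=lambda item: item[1][0])
    for index, (birth, death) in ordered:
        for register, free_time in enumerate(free_at):
            if free_time <= birth:
                free_at[register] = death   # load clock applied to this register
                assignment[index] = register
                break
        else:
            free_at.append(death)           # no free register: use a new one
            assignment[index] = len(free_at) - 1
    return assignment, len(free_at)


if __name__ == "__main__":
    example = [(0, 1), (1, 3), (2, 3), (3, 7), (4, 5)]
    print(allocate_parallel_registers(example))
```

Processing values in order of birth means a new register is opened only when all existing ones hold live values, so the register count equals the maximum number of simultaneously live values.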
Observation 2: For a given number of resolution levels , DCU satisfying (3b) can be implemented as in Fig. 11 . The instant, say , at which loading clock signal is applied to the register , is expressed as
Proof:
We assume that is stored in , where and . This observation can be proved by showing that the lifetime of does not overlap with that of for all . is generated at and should be stored in from to . On the other hand, is generated at and should be stored in from to . Note that . This proves the nonoverlapping property, and should be latched at the beginning of . This indicates that should be . 
B. Design of DFC
Now we consider the design of a DFC for the DWT with three resolution levels . In this case, (2) can be rewritten as for for for (5a) Direct design of DFC with these equations leads to the structure in Fig. 12 . This DFC, which was also employed in [15] , requires word-level registers. The number of registers can be reduced by applying the method for DCU design. Specifically, from the lifetime chart and the register allocation table in Fig. 13(a) and (b) , respectively, we can obtain the DFC structure in Fig. 13(c) . This structure, which is based on parallel connection of registers, employs a minimal number of registers. The scalability of the structure is described below.
Observation 3: For a given number of resolution levels , DFC satisfying (2) can be implemented as in Fig. 14. The instant, say , at which clock signal for data loading is can be stored in without collision. This proves the first part of this observation. A value allocated to a register should be latched one time unit later than its generation time. This indicates that the should be applied both at and for , and at for .
C. The Proposed Architecture for DWT
According to Observations 2 and 3, we can design the DCU and DFC for an arbitrary number of resolution levels even without knowing the details of their design methods. Fig. 15 shows a general architecture for the lattice structure based forward DWT. In this figure, all DCU's between lattices have the structure shown in Fig. 11, and the DFC has the structure in Fig. 14. This architecture is valid for any number of resolution levels and any filter length. If the number of resolution levels varies, only the DFC and DCU's are modified according to Figs. 11 and 14, and lattice blocks are added or removed depending on the filter length. This architecture requires less hardware than the existing architectures, since the DFC and DCU's employ a minimal number of registers and the number of multipliers required by the lattice blocks is about half of that required by the direct form FIR filters. Table I compares the hardware complexity of the architectures for the DWT when and pipelining is not considered. As expected, the proposed architecture is considerably simpler to implement than the others.
III. SYNTHESIS STAGE WAVELET ARCHITECTURE
In this section, we develop a lattice structure-based architecture for the inverse DWT. Consider again Fig. 1(b). Each level consists of the same synthesis two-channel QMF bank operating with a frequency corresponding to level , where is the output data rate at level 1. At each level and are input to the low-pass and high-pass filters and , respectively, after being expanded by a factor of 2. Due to the expansion, is computed only with even coefficients of and , while is computed only with the corresponding odd coefficients. Fig. 16 illustrates the lattice filter implementing and . At even time instants of the output, this filter generates the outputs and simultaneously from the upper and lower levels of the lattice, respectively, and produces zeros at odd time instants of the output. By reusing these odd time instants for higher level computation, all resolution levels can be computed based on a single synthesis QMF lattice. Note that the feeding relations between adjacent levels are reversed compared to those of the analysis stage: a higher level must be computed earlier than a lower one. Fig. 17 shows the DDG for the inverse DWT with three resolution levels. Here, the time slots and are dedicated to the computation of levels 3, 2, and 1, respectively. Next we consider the scheduling problem for an arbitrary .
Fig. 18. A flipped binary tree for scheduling of the inverse DWT.
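The even/odd coefficient split caused by the twofold expansion can be verified directly. The sketch below compares filtering an expanded sequence with a generic FIR filter against interleaving the outputs of two shorter filters built from the even-indexed and odd-indexed coefficients. It uses an arbitrary causal FIR filter rather than the lattice of Fig. 16, and the function names and test data are hypothetical.

```python
def upsample_by_two(samples: list) -> list:
    """Insert a zero after every sample (twofold expander)."""
    expanded = []
    for sample in samples:
        expanded.extend([sample, 0])
    return expanded


def fir_filter(coeffs: list, signal: list) -> list:
    """Causal FIR filter: y[n] = sum over k of coeffs[k] * signal[n - k]."""
    return [
        sum(c * signal[n - k] for k, c in enumerate(coeffs) if n - k >= 0)
        for n in range(len(signal))
    ]


def polyphase_interpolate(coeffs: list, samples: list) -> list:
    """Same result as expand-then-filter, using even and odd coefficients."""
    even_branch = fir_filter(coeffs[0::2], samples)   # gives even-indexed outputs
    odd_branch = fir_filter(coeffs[1::2], samples)    # gives odd-indexed outputs
    interleaved = []
    for even_out, odd_out in zip(even_branch, odd_branch):
        interleaved.extend([even_out, odd_out])
    return interleaved


if __name__ == "__main__":
    h = [1, 2, 3, 4]
    x = [1, -2, 3, 5]
    print(fir_filter(h, upsample_by_two(x)))
    print(polyphase_interpolate(h, x))
```

Both routes produce the same output sequence, but the second never multiplies by the inserted zeros, which is the saving that the synthesis schedule exploits.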
Denote the filtering instants for the th level of the inverse DWT by . Note that th outputs of the th resolution level, and , are computed at time . Since the outputs of level should be produced every time unit, should be expressed as where is the instant at which the output of level is calculated for the first time. The constant can be obtained with the help of a flipped binary tree, shown in Fig. 18 , which successively combines two sets of nonoverlapped time slots. From this figure, we get the following expression for :
It can be seen that the filtering instants given by (6) never overlap with each other. Furthermore, this scheduling leads to the following observation. . From this relation, the second part directly follows.
Using the notation in Fig. 6 , input-output relations between lattices for the synthesis QMF bank can be derived as follows: for the first PE where is the number of resolution levels. For the th PE for (8a)
These relations lead to the architecture shown in Fig. 19(a). The structures for the DCU and DFC shown in Figs. 19(b) and (c) are obtained by using the methods described in the previous section. The instants at which the clock signal for data loading is applied to registers of the DCU and DFC are expressed as follows for the DCU: (9) and for the DFC when when (10) where , and implies that the clock is applied both at and . As in the case of the forward DWT, the proposed architecture is scalable and requires less hardware than the existing architectures.
IV. PIPELINING
Let be the number of pipelining stages of each , and be the number of lattices. The pipelining introduces time unit latency to the QMF lattice. The computation for the th output of the th level, , should be started at least time units after starting the computation for . Details of scheduling can be seen from the DDG in Fig. 20, which illustrates the data dependency for , and . Filtering of the first level is initiated at . Note that the computation of can be started after is available. Since is available at , we can compute right after this time. However, this time instant is occupied for computing , and thus is computed at . In a similar manner, we can see that can be computed at . Here, the time slots , and never overlap with each other. Now we extend this result to the DWT with an arbitrary . Suppose that the computation of is started at and ended at . Then . In the binary tree scheduling illustrated in Fig. 21, the nonnegative integer set is decomposed into and . Here is the smallest integer satisfying . The even set provides the starting instants for first level computation, i.e.,
. The remaining set is decomposed into and , where is the smallest integer satisfying . Continuing in this manner, we can get the following expression for :
where is the smallest nonnegative integer satisfying (11b)
It can be seen that the filtering instants given by (11a) never overlap with each other. The observation below shows the validity of this scheduling. and for the DFC for for (15) where and implies that the clock is applied both at and .
Finally, we derive the latency time caused by the pipelining stages. The latency time of the pipelined DWT architecture is written as . Since , the latency time is given by (16) The latency time is a function of , and . Table II  tabulates the latency time when and . The latency time in (16) is useful for examining the tradeoff between the latency and throughput of the pipelined DWT architecture. Pipelining of the architecture for the inverse DWT can be done in a similar manner, and will not be considered here due to space limitation.
V. SUMMARY AND CONCLUSION
In this paper, we developed a scalable VLSI architecture employing a two-channel QMF lattice for the 1-D DWT. Based on the development of a systematic scheduling, the input-output relation between lattices of the QMF bank has been derived, and new structures for the DFC and the DCU have been proposed. The proposed structures are regular, scalable, and require a minimum number of registers, and thereby lead to an efficient and scalable architecture for the DWT. An architecture for the inverse DWT has also been developed in a similar manner, and finally, pipelining of the proposed architecture has been considered. Future work in this direction will concentrate on the design of two-dimensional (2-D) DWT and -ary tree structured filter banks.
