Recently, the conventional lifting scheme that is widely used for the construction of Wavelets and 2-channel filter banks has been extended to M-channel filter banks (M > 2). This extension brings up the beneficial properties of the lifting scheme to a broader range of applications, like discrete cosine transforms. There exist many hand-crafted lifting-based VLSI architectures, which mostly concentrate on a single and specific target application having a fixed data throughput and resource consumption. However, the reusability of such architectures is limited due to the lack of scalability and flexibility. To overcome this issue, we present a novel design methodology for automatic synthesis of VLSI architectures that are suitable for arbitrary lifting-based M-channel filter banks and transforms. The methodology offers wide-ranging design space exploration with varying resource consumption and data throughput tradeoffs, as it is desired in modern System-on-Chip design.
INTRODUCTION
The lifting scheme provides excellent features, namely perfect reconstruction, reduction of computational complexity and in-place computation [1] . It is widely used for the realization of the discrete wavelet transform (DWT). The classical lifting scheme deals with the factorization of a 2-channel filter bank into a product of so-called lifting steps. Recently, a generalization of the 2-channel lifting scheme to the M-channel lifting scheme (M > 2) has been proposed [2] . This extension brings up the features of the lifting scheme to applications that rely on an M-channel filter bank, like the discrete cosine transform (DCT), which is widely used in image and video coding applications.
Another example for a lifting-based M-channel transform is the binDCT [3] . The concept of the binDCT is the design of multiplierless approximations of DCTs with the lifting scheme. The result is an approximated DCT that provides the desired features of the lifting scheme and the DCT as well as low-cost implementations, since there are only multiplications with dyadic coefficients necessary.
Modern System-on-Chip (SoC) designs demand for parameterizable, reusable Intellectual Property (IP) cores that can be easily integrated into system specifications. However, hand-crafted architectures mostly concentrate on a single and specific target application having a fixed data throughput and resource consumption. For example, the works of [4] and [5] focus on lifting-based architectures for DWT filter banks. In [6] two architectures for calculating DCT approximations with binDCT were presented.
In this contribution, we present a novel design methodology to automatically synthesize VLSI architectures with different area and throughput tradeoffs using a single specification of a lifting-based filter bank. Hence, our methodology is applicable during design space exploration to choose the most efficient architecture for a specific application. The proposed design methodology is implemented as a high-level compilation tool that generates synthesizable VHDL code at the register transfer level (RTL).
The rest of this paper is organized as follows. In section 2 a brief introduction into the M-channel lifting factorization is given. Section 3 explains our design methodology and in section 4 we discuss some design issues of the generated hardware architectures. Section 5 presents some experimental results and in section 6 we finally conclude the paper.
M-CHANNEL LIFTING FACTORIZATION
Starting point for the M-channel lifting scheme is the polyphase representation of the underlying filter bank. This representation is classified by an M × M polyphase matrix as shown in the following equation:
The signal flow graph (SFG) for the direct implementation of the polyphase filter bank is shown in Fig. 1 . The signals Xi(z) and Yi(z) are the polyphase components of the input and output signals.
The filters Hi,j(z) are FIR filters with an arbitrary number of coefficients. The idea of the M-channel lifting scheme is to factorize the polyphase matrix H(z) into a product of special triangular matrices, which are referred to as M-channel lifting steps. An M-channel lifting step from channel j to i is defined by the following matrix operator:
where I is the M × M identity matrix, ei denotes the i-th M × 1 unit vector with i, j ∈ {0, 1, · · · M −1} and i = j. λ(z) denotes the corresponding FIR lifting polynomial. The SFG for the lifting step Γi,j[λ(z)] is shown in Fig. 2(a) . The complete SFG for a fully lifting factorized polyphase matrix can be obtained easily by cascading the SFGs of each lifting step with appropriate values for the parameters i, j and λ(z). In [2] it is shown that a polyphase matrix H(z) with det(H(z)) = z −k , k ∈ Z can be factorized into a product of lifting steps. To obtain additional flexibility during the process of factoring and designing filter banks an extension of the strict lifting factorization is proposed in [2] . This extension allows the decomposition of a matrix H(z) into a product of lifting steps and so-called diagonal steps. A diagonal step is defined by the matrix operator:
where k ∈ Z, i ∈ {0, 1, · · · M − 1} and c is a single dyadic coefficient of the form ±2
−n with n ∈ Z. The corresponding SFG can be obtained from Fig. 2 (b). Equation (4) denotes the decomposition of a matrix H(z) into a product of NL lifting steps and diagonal steps, respectively.
The factorization of (4) provides the basis for the hardware synthesis. It will be referred to as lifting specification in the following sections.
DESIGN METHODOLOGY
In Fig. 3 we present a block diagram of the design flow that is used for the automatic architectural synthesis. The input of this flow is (4) as well as several synthesis constraints. The result is a RTL netlist of a so-called Lifting Processor (LP) architecture that calculates the operations of the filter bank, respecting to the synthesis constraints. The synthesis constraints basically enable the customization of the following aspects: desired data throughput, multiplier-based or multiplierless implementation and arithmetic precision of coefficients and data path.
The concept of our approach is to map a given lifting specification to a set of elementary operations, which allow to apply easily both parallel and pipeline techniques. The elementary operations are so-called LP instructions that are realized on an adequate LP architecture. In order to capture all the common properties of different lifting specifications we define that each LP instruction processes a lifting step or a diagonal step with exactly one polynomial coefficient. Hence, the first step in the design flow is to obtain an architecture specific lifting factorization. This factorization rewrites (4) and decomposes any lifting step Γi,j[λ(z)] with more than one polynomial coefficient into a product of lifting steps with only one coefficient using
The result of this decomposition procedure again is a lifting specification according to (4) but the number NL increased by the number of nonzero coefficients c k minus one. Depending on the precision of the polynomial coefficients, multiplierless architectures may lead to more resource efficient realizations compared to architectures that use parallel multiplier logic. Multiplierless architectures are restricted to multiplications with dyadic coefficients, which can be implemented efficiently by a single right or left shift of the multiplicands. Consequently, for the synthesis of multiplierless architectures we also have to restrict that each LP instruction processes a lifting step or a diagonal step with exactly one dyadic polynomial coefficient. Therefore, after applying (5) to the lifting specification (4) we further have to decompose each lifting step with a non-dyadic coefficient into a product of appropriate lifting steps with dyadic coefficients. Equation (6) points out this decomposition. Again, the result of this decomposition procedure is a lifting specification according to (4) , whereas the number NL increased by NC −1. The next step in the design flow considers the generation of an LP instruction list by iterating the decomposed lifting specification (4) in increasing order. Each element Lm(z) of (4) is translated into an LP instruction of a specific type. Table 1 
summarizes the different types of LP instructions. If Lm(z) is a lifting step it is translated into an LP instruction of type mac. Otherwise, if Lm(z)
is a diagonal step an LP instructions of type mul is generated. These two LP instruction types perform pure arithmetic operations using the operands A, B and C. The operands A and B are variables that hold information about the referred channels of the associated lifting step or diagonal step. The operand C is the only polynomial coefficient of Lm(z). Depending on the multiplier implementation constraint, C is either a rational or a dyadic coefficient.
During the process of LP instruction mapping the actual structure of the resulting LP architecture is determined and all the required arithmetic resources are allocated. Fig. 4 presents the structure of a synthesized LP architecture. The architecture embeds a one dimensional array of so-called Lifting Processing Elements (LPEs), where each LPE is receiving partial results from its predecessor. This array can be considered as a coarse-grained pipeline, whereas each LPE represents a pipeline stage. An LPE consists of a register bank and one or more Processing Units (PUs). Each PU has the capability to process an LP instruction within a single clock cycle. This kind of structure enables to apply both parallel and pipelining techniques for controlling the data throughput of the architecture. Parallel processing is achieved by mapping concurrent LP instructions to different PUs within a LPE and pipelining is applied by mapping LP instructions to different pipeline stages (different LPEs). With consideration of the data dependencies between the LP instructions and the synthesis constraints for data throughput, the LP instructions are scheduled and bound to the allocated PUs. We have applied a list scheduling algorithm to schedule the LP instructions of each LPE.
After scheduling and binding of the LP instructions the variables A and B of LP instructions are mapped to the data registers of an LPE's register bank. The data registers are organized as shift registers to realize the delay elements z −k resulting from decomposed lifting steps or diagonal steps, respectively. To determine the clock cycle of a shift register's shift action, some of the LP instructions are . . . modified to the types macsft and mulsft. These types are extended versions of mac and mul that initiate a shift action in parallel to the arithmetic operation. A detailed explanation is given in [7] . The final step in the design flow of Fig. 3 affects the VHDL RTL code generation for the target LP architecture.
PROCESSING UNIT DESIGN
Since the PU is the most important design module we want to inspect some design issues of this module. Fig. 5 outlines the schematic diagram of a PU. To process all its mapped LP instructions, each PU embeds a control unit that controls the data path elements, like multiplexers, registers or arithmetic units, as well as write enable signals that are connected to the register bank of the associated LPE. The control unit is realized as a sequencer, whereas the information of each LP instruction is stored as an instruction word in the LP instruction ROM.
The data path of a PU contains two multiplexers for selecting the data associated with the variables A and B of the corresponding LP instructions. A PU either implements a multiplier resource or an appropriate operand shifting circuit for the realization of the product A · C, depending on the synthesis constraint for multiplier implementation. Fig. 5 outlines the two different implementation styles by the dashed boxes. In the case of a multiplier implementation a coefficient C is embedded into the instruction word of the associated LP instruction. If a multiplierless implementation is requested, the multiplications with the dyadic coefficients are realized by hardwired arithmetic shifting of the operands A. Consequently, each instruction word of an LP instruction embeds a multiplexer address to select the desired product A · C.
EXPERIMENTAL RESULTS
We implemented the design flow of figure 3 as a high-level compilation tool that generates VHDL RTL code. This code can be processed by any RTL-based design flow. In order to prove the efficiency of the proposed design methodology, we present results about design space explorations of lifting-based DWT and DCT architectures, which were generated by our tool. Equation (7) points out a lifting factorization of the popular 9/7 filter for the DWT. The corresponding floating point coefficients can be obtained from [5] . According to [5] , we also ignore the scaling coefficients in the diagonal matrix diag(ζ, 1/ζ), resulting in a lifting specification according to (4) . Using this lifting specification we generated four different multiplier-based architectures by varying the synthesis constraints for data throughput. The architectures are summarized in Table 2. They are ranging from slow low-cost architectures, with a single multiplier and adder resource, to high-speed architectures. The resource consumption is measured in terms of required multiplier, adder and register resources. The critical paths of our architectures are mainly dominated by the delay of a single multiplier and a single adder, which equals the pipelined variant of [5] . As the comparison with [5] shows, in the case of the DWT our architectures require more resources for the high-speed architectures. However, we are able to generate resource efficient low-cost architectures, which are sufficient for many applications that do not require highest speed.
As already mentioned in section 1, the concept of the binDCT is to approximate multiplierless DCTs with the lifting scheme. In Fig. 6 we present a SFG of a lifting factorized 8-point binDCT. This SFG is a slightly modified variant of the fully lifting factorized SFG from [3] that fulfills the definition according to (4) . For our experiments, we defined a lifting specification using the following approximated coefficients for the lifting steps of Fig. 6 : p1 = 3/8 p2 = 5/8 p3 = 1/8 p4 = 3/8 u1 = 1/4 u2 = 1/2 u3 = 1/8 u4 = 5/8 p5 = 3/8. According to the proposals of [3] , we ignore the scaling coefficients in the dashed boxes. To perform a design space exploration, we generated five different multiplierless architectures by simply varying the synthesis constraints for data throughput. C3 with the slow architecture of [6] shows that both architectures have the same throughput and our architecture even requires less resources. The fast architecture of [6] and our architecture C1 both have same throughput and similar resource consumptions. The critical paths of all the inspected architectures of Table 3 are mainly dominated by a single adder. Our proposed coefficients for DCT approximation reflect slightly better performance measures (smaller mean square error and higher coding gain) compared to [6] .
CONCLUSION
We presented a novel design methodology for automatic architectural synthesis of arbitrary lifting-based M-channel filter banks and transforms. The methodology is suitable for design space exploration to find adequate architectures for different applications. As our experiments show, the methodology produces satisfactory results. Our generated DCT architectures are comparable to handcrafted designs.
