Abstract-This work presents a VLSI design rule, the embedded instruction code (EIC), for the discrete wavelet transform (DWT). Our approach derives from the essential computations of the DWT, for which we establish a set of instructions: the multiplication instruction, MUL, and the addition instruction, ADD. In addition, we propose a parallel arithmetic logic unit (PALU) with two multipliers and four adders, called 2M4A. With these components, the DWT computation paths can be calculated more efficiently with limited PALU resources. Furthermore, since the EIC is executed on the PALU, the number of inner registers needed depends on the wavelet filters' length. The boundary problem of the DWT is resolved by symmetric extension. Moreover, the two-dimensional inverse DWT (2-D IDWT) can be computed using the same PALU as the 2-D DWT; only the instruction codes and coefficients need to change. Our chip supports up to six levels of decomposition and versatile image specifications, e.g., VGA, MPEG-1, MPEG-2, and 1024 × 1024 image sizes.
I. INTRODUCTION
Lately, the hardware development of the discrete wavelet transform (DWT) has advanced rapidly [1]-[4]. Its main applications include signal, image, and video processing, with a particular emphasis on data compression. Recently, the still-image standard JPEG 2000 [5] has employed the DWT for transform coding. In this paper, we design integrated hardware containing the functions of a one-dimensional DWT (1-D DWT) and inverse DWT (1-D IDWT) [6]. From this, we ultimately synthesize an architecture with the functions of a two-dimensional DWT (2-D DWT) and IDWT (2-D IDWT).
Knowles's DWT [7] adopted a number of multiplexers, which makes it unsuitable for hardware implementation. Hence, Vishwanath et al. proposed an innovative design in [2] and [4], called the recursive pyramid algorithm (RPA), whose implementations are formed as a systolic array. It removes the drawback of Mallat's pyramid algorithm (PA) [8], whose implementations need feedback storage. Lately, the use of systolic arrays for the DWT has been thoroughly studied [3], [9]-[13]. To achieve better computation efficiency, Shunjun et al. [10] and Marino et al. [12] used parallel and pipeline techniques, respectively. Moreover, the border problem [3] must be handled carefully, since perfect reconstruction (PR) [1], [6], [19] is critical in the DWT. Ferretti et al. [3] proposed a modified RPA to solve this problem. Realizations of the 2-D DWT follow two different approaches: separable and nonseparable. In [20]-[22], the architectures were built on the RPA and the separable approach. To reduce the usage of multipliers and storage, Lee et al. [21] proposed a revised RPA, called the circular parallel architecture. According to their experimental results, it performed well for various wavelet filter lengths and image sizes. On the other hand, to ensure flexibility in the selection of the wavelet filters' length and the decomposition level, Chen et al. [22] built 256 processing elements. In addition, Ferretti et al. [23] analyzed the data dependencies and designed an architecture with computer cells to achieve the integer lifting DWT. Hence, their work requires fewer resources than the RPA. Besides, Lafruit et al. [24] reorganized the arithmetic data and thereby minimized the size of, and the number of accesses to, memory.
In the RPA, the number of processing elements contained in the systolic array depends on the chosen wavelet filters. That is, if the filters' length increases, the number of processing elements rises as well. Therefore, to restrict the number of processing elements, we present a flexible design rule, named the embedded instruction code (EIC), and propose a VLSI architecture organized around a parallel arithmetic logic unit (PALU). The number of multipliers and adders in the PALU remains the same regardless of the wavelet filters' length.
The primary concept of the EIC is to use simple built-in instructions to command the PALU for the 1-D DWT/IDWT processing. Once the 1-D architecture is built with the EIC, we put the 2-D DWT into practice with the separable approach. At the outset, because the essential computations of the DWT/IDWT are multiplication and addition, we can build the multiplication instruction, MUL, the addition instruction, ADD, and the general-purpose registers (GPRs) to complete the work. The instructions, which can be drawn up regularly, are embedded in a built-in RAM or ROM and executed via a look-up table. With these requirements, the instructions command the PALU to perform specific jobs, including the fetching of the filter's coefficients, multiplication, addition, and the appointment of specific registers as accumulators. Furthermore, the highlight of our design is that the architecture needs only two multipliers and four adders (2M4A). We use 2M4A to construct the PALU organization with a parallel technique. Therefore, we accelerate the 1-D DWT/IDWT computations with limited hardware resources.
The similarity between the computations of the DWT and IDWT has been discussed for the lifting schemes in [14] and [15]. We also prove that the similarity still exists when the computational equations follow the original definition of the DWT algorithm. In the computation paths of the DWT, an input signal is sent to two filters separately and downsampled by two. We rewrite the IDWT equation in a form analogous to the DWT formulae. Hence, both DWT and IDWT can be realized in the same architecture, and we just change the filters' coefficients. This formulation is suitable for orthogonal and biorthogonal filters. The chosen wavelet filters in our architecture should be Type I linear-phase FIR filters. The EIC has the following five advantages:
1) the number of multiplications in the DWT and IDWT computations is reduced;
2) the instructions can be drawn up regularly and executed via a look-up table;
3) we adopt a parallel organization to speed up the execution;
4) both DWT and IDWT computations can be built in the same VLSI architecture;
5) our work needs fewer multipliers and adders, regardless of the wavelet filters' length.
The rest of the paper is organized as follows. Section II discusses the basic formulae of the DWT and IDWT, followed by Section III, which focuses on the specific instructions that the EIC uses. In Section IV, the similarity between the 1-D DWT and IDWT computations is derived mathematically. The experimental results and comparisons are listed in Section V. The conclusion of the paper is given in Section VI.
II. FORMULAE AND NOTATIONS OF THE DWT
Both DWT and IDWT can be implemented with filter banks [1], [6]. The 1-D DWT computations are shown in (1), and the 1-D IDWT is described in (2). In (1), is the original input signal with a finite length and its low- and high-pass signals are and , respectively. and are the low- and high-pass filters of the 1-D DWT, respectively. In (2), is the signal reconstructed from and . In addition, and are the low- and high-pass filters of the 1-D IDWT, correspondingly. If is required to be equal to , the filters of the 1-D DWT and 1-D IDWT should satisfy the PR condition [1], [6], [8], and [19]. When a linear-phase signal is convolved with a linear-phase filter, the output is also linear phase [18]. Therefore, we extend the input sequences symmetrically at the boundary, to make sure that the output signal is symmetric at the boundary under linear-phase filters. In this paper, we adopt the Spline97 wavelet filters, whose low- and high-pass filters have lengths of nine and seven, respectively [6]. More precisely, the chosen wavelet filters in our architecture should be whole-point symmetric, i.e., Type I linear-phase finite impulse response (FIR) filters. Spline97 has been modified according to the symmetry property in convolution [18] and [19]. The four filters have the linear phase property
The filter is not the same as the original definition, since the reconstruction equation in [6] does not adopt the convolution form (2). If the reconstruction equation is rewritten in convolution form, then (4) holds. Moreover, based on [18] and [19], the four filters are whole-point symmetric. If the input sequence is extended with whole-point symmetry at the beginning and the end (5) then will become whole-point symmetric at the beginning and half-point symmetric at the end. Conversely, is half-point symmetric at the beginning and whole-point symmetric at the end (6)
Since the arithmetic operations of the 1-D DWT are addition and multiplication, simple instructions suffice to complete the operations. The multiplication is executed by a MUL instruction; the addition is done by an ADD instruction. Therefore, we propose that the instruction codes be saved to ROM or RAM in advance for later execution. We also propose to have GPRs and to save the filters' coefficients in ROM or RAM. The MUL instruction is manipulated as follows.
A. MUL Coefficient
Although a multiplication needs both an input and a coefficient, we keep the input implicit in MUL. The reason is that the coefficients can be determined from the order of the inputs. Once we obtain the sequential input, we only need to arrange the order of the coefficients and fetch them one by one. In addition, the product is saved in a fixed product register. The ADD instruction is manipulated as follows.
B. ADD Specified-Register
The ADD instruction only needs to specify a register to serve as the accumulator: the product is added to the value of the accumulator, and the result is saved back to the accumulator.
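The semantics of the two instructions can be illustrated with a small software model. This is a minimal sketch under our own assumptions, not the hardware itself: the class name `ToyPALU`, the `LOAD` pseudo-operation that latches the current input, and the coefficient and sample values are all hypothetical and serve only to show how one product in the product register can feed more than one accumulator.

```python
class ToyPALU:
    """Toy software model of the MUL/ADD semantics (not the actual hardware)."""

    def __init__(self, coefficients, num_registers):
        self.coefficients = list(coefficients)  # stands in for the coefficient ROM/RAM
        self.gpr = [0] * num_registers          # general-purpose registers (accumulators)
        self.preg = 0                           # fixed product register
        self.sample = 0                         # current input, the implicit operand of MUL

    def load(self, sample):
        """Hypothetical helper: latch the next sequential input sample."""
        self.sample = sample

    def mul(self, coeff_index):
        """MUL coefficient: PREG <- coefficient * current input."""
        self.preg = self.coefficients[coeff_index] * self.sample

    def add(self, reg_index):
        """ADD register: register <- register + PREG."""
        self.gpr[reg_index] += self.preg


def run(palu, program):
    """Execute (opcode, operand) pairs, as if fetched from a look-up table."""
    dispatch = {"LOAD": palu.load, "MUL": palu.mul, "ADD": palu.add}
    for opcode, operand in program:
        dispatch[opcode](operand)
    return palu.gpr


# Illustrative values only; these are not Spline97 coefficients.
palu = ToyPALU(coefficients=[2, 3, 5], num_registers=2)
program = [
    ("LOAD", 1), ("MUL", 0), ("ADD", 0),
    ("LOAD", 4), ("MUL", 1), ("ADD", 0), ("ADD", 1),  # one product feeds two registers
    ("LOAD", 6), ("MUL", 2), ("ADD", 0),
]
registers = run(palu, program)
```

Because MUL names only the coefficient, the program is fully determined by the order of coefficients and the choice of accumulator, which is what allows it to be stored as a fixed table.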
III. 1-D DWT COMPUTATION
Before we discuss the 1-D DWT computation, we need to set up the nine GPRs . The registers , , , and are related to . And the registers , , and are related to . Furthermore, the length of the filter decides the number of registers, and a whole-point symmetric filter has an odd length. If the length is , then the number of registers is (7) We explain the equation with the following example. Fig. 1 indicates the computation paths of , with the assumption . In the figure, each black dot represents the product of the corresponding indicated above and the associated multiplicand indicated in the column on the far left. This calculation can be done within a single MUL instruction. To obtain each , we need to sum up the products represented by the black dots on the dashed lines. For example, we can calculate via (1) and obtain The can be attained by summing up each product generated from the particular and . The black dots of , , , and are within one straight dashed line and the dots of , , and are within another dashed line. The summation paths of are merged in two dashed lines. Furthermore, the instructions shown in Table I will complete the calculations, and each product will be saved to the product register (PREG).
A. Computation of
In the summation path, we need to specify the particular register illustrated at the bottom of Fig. 1 in order to add up the PREG. For example, Fig. 1 shows the summation path of , starting with . By tracing the path, the value of , is shown in Table I. When the eighth input is filled, the value of would be . We then output the value of and zero it for the remaining computation. From Fig. 1 
B. Embedded Instruction Code
The EIC is proposed to form the instruction codes, and the instructions are the control commands of the PALU organization. The EIC systematically forms the codes in two steps: 1) observe the values of each register, which depend on the current input, and 2) check the reuse of PREG.
1) The Values of Each Register:
At first, the relationship between and should be established according to each input. From (1) and Fig. 1, we can list the values of each register, which vary with the input, in Table II.
TABLE II VALUES OF THE REGISTERS
TABLE III NUMBER OF MULTIPLICATIONS
The symbol " " means that the right operand is transmitted to the left operand. Additionally, "N/A" means that the register will no longer be used. Moreover, through (5) we can assign the value of the boundary inputs:
and . Since the value of PREG needs to be multiplied by two within the boundary , we add a least-significant bit shift (LS) preceding the adder in the architecture, to be activated by a multiplexer. In Fig. 1, the products, represented by black dots, are shared by two registers wherever two dashed lines cross one another. The exceptions are the products in the second row, whose multiplicand is . Since the product sharing is based on the symmetry property, we build one multiplier and two adders (1M2A) in the PALU for the computation. Hence, although there are nine multiplications for each , our method cuts down the number of multiplications. Table III lists the multiplication reduction.
3) Instruction Codes for the Computation of : From Table II, we can describe the instruction codes in Table IV.
• The "Order" indicates the execution order, and "(LS)" marks that the LS is activated. In some execution orders, two instructions should be executed concurrently. For example, in order 9, we have "ADD " and "ADD ". We constrain the ADD and MUL instructions to take the same execution time for the PALU organization design.
• There are three classifications in the instruction codes: "Boundary in the beginning", "Loop" and "Boundary in the end". The codes in "Loop" follow a fixed sequence, shown in Table V. The instruction codes in "Loop" for the th input are the same as those belonging to the ( )th, where . Hence, we just record the instructions within and execute them periodically.
• We use a simple version of the execution order, shown in Fig. 2, to interpret the period in "Loop". Each coefficient in Fig. 2 stands for the instructions, e.g., in the circle represents the instructions when : MUL and ADD . In addition, the mark C indicates that the register should be cleared, e.g., in Table II. Furthermore, the coefficients arranged above are the same as those above . The coefficients allocated above are the same as those above . The reason is that each register is accumulated nine times and finally zeroed. After that, the register can be reused and the instructions are executed periodically. The number of additions is decided by the filter's length. Thus, the period in "Loop" is if the filter's length is .
• The codes for "Boundary in the end" depend on the value of and are listed in Table VI. Here, is the remainder of and .
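The product sharing described above can be checked numerically. The following is a minimal sketch, assuming the low-pass branch takes the common form y[n] = Σ_k h[|k|] x[2n + k] with a whole-point symmetric nine-tap filter; the function names and the integer taps are hypothetical (they are not the Spline97 coefficients). Unlike the architecture, which doubles PREG at the boundary through the LS, this sketch simply visits the symmetrically extended input positions.

```python
def reflect(i, n):
    """Whole-point symmetric extension: map any index onto 0..n-1."""
    period = 2 * (n - 1)
    i = abs(i) % period
    return i if i < n else period - i


def lowpass_direct(x, h):
    """Output-driven form: one multiplication per tap and per output."""
    K, N = len(h) - 1, len(x)
    outputs, mults = [], 0
    for n in range((N + 1) // 2):
        acc = 0
        for k in range(-K, K + 1):
            acc += h[abs(k)] * x[reflect(2 * n + k, N)]
            mults += 1
        outputs.append(acc)
    return outputs, mults


def lowpass_shared(x, h):
    """Input-driven form: each product h[j]*x[m] is computed once (one MUL)
    and added to every accumulator that needs it (up to two ADDs)."""
    K, N = len(h) - 1, len(x)
    n_out = (N + 1) // 2
    acc = [0] * n_out
    mults = 0
    for m in range(-K, 2 * (n_out - 1) + K + 1):
        sample = x[reflect(m, N)]
        for j in range(K + 1):
            # Outputs n with 2n + k = m for k = +j or k = -j.
            targets = {(m - k) // 2 for k in {j, -j} if (m - k) % 2 == 0}
            targets = [n for n in targets if 0 <= n < n_out]
            if not targets:
                continue
            product = h[j] * sample
            mults += 1
            for n in targets:
                acc[n] += product
    return acc, mults


x = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9, 3]
h = [12, 5, -2, -1, 1]  # h[0] is the center tap; illustrative integers only
y_direct, mults_direct = lowpass_direct(x, h)
y_shared, mults_shared = lowpass_shared(x, h)
```

Both functions return the same outputs, while the shared version issues fewer multiplications because a product whose coefficient is not the center tap serves two outputs.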
4) PALU Organization of 1M2A:
From Tables IV-VI, our architecture includes the PALU organization, which is formed with one multiplier and two adders (1M2A), as shown in Fig. 3. The organization comprises one 16-bit multiplier and two 32-bit adders. To parallelize the organization, we use the latch PREG, which holds the product obtained from the MUL executed in the previous order. Thus, the ADD can use the previous product while the MUL generates the new product concurrently.
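The overlap of MUL and ADD through the PREG latch can be sketched in software as a two-stage pipeline. This is only an illustration under our own assumptions: the function name `mac_pipelined` and the operand values are hypothetical, a single accumulator is modeled, and one loop iteration stands for one execution order.

```python
def mac_pipelined(coefficients, samples):
    """Accumulate sum(c * s) with MUL and ADD overlapped through a product latch."""
    preg = None          # latch: product of the MUL issued in the previous order
    accumulator = 0
    orders = 0
    pairs = list(zip(coefficients, samples))
    for step in range(len(pairs) + 1):       # one extra order drains the latch
        new_product = None
        if step < len(pairs):
            c, s = pairs[step]
            new_product = c * s              # MUL stage works on the current operands
        if preg is not None:
            accumulator += preg              # ADD stage uses the previously latched product
        preg = new_product                   # the latch updates at the end of the order
        orders += 1
    return accumulator, orders


total, orders = mac_pipelined([2, 3, 5], [1, 4, 6])
```

For three multiply-accumulate pairs the sketch needs four orders rather than six, since every ADD except the last runs alongside the next MUL.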
C. Computation of
Likewise, we can apply the EIC and the PALU organization to carry out the computation. Fig. 4 shows the computation paths of . The boundary can be revised by using (3), and the reuse of PREG should be considered. We show part of the instruction codes for in Tables VII-IX. Note that in Table VIII is 5 plus the remainder of , where . Finally, based on the previous explanation of the PALU, we allocate another 1M2A for the computation. Thus, the eventual organization is called the 2M4A PALU.
IV. 1-D IDWT COMPUTATION
In the 1-D IDWT, we transform the original 1-D IDWT equation (2) into new equations that are similar to the 1-D DWT equations (1), according to the theory of multirate systems [1]. At first, we individually divide , , and into two parts: the even part and the odd part. Furthermore, from the 1-D IDWT equation (2), we get (10) Then, we introduce a new signal , which is alternately composed of and as follows:
From (6), we see that is whole-point symmetric at the beginning and the end. In the following, if is passed through the two specific filters defined in (12), we get (13) and (14). Eventually, we downsample the signals (13) and (14) by two, respectively, and get (15). It is clear that the computation path of the 1-D IDWT is similar to that of the 1-D DWT. Additionally, because of the symmetry property in (4), (16) holds. Besides, from (16), we can obtain (17), which also has the symmetry property. Using (17), we can examine the symmetry property of and in (18) 
Thus, due to linear phase [17] and Spline97, and are linear-phase and whole-point symmetric filters (19) Based on these discussions, the computation paths of and are demonstrated in Figs. 5 and 6, respectively. Thus, the EIC can also be used in the 1-D IDWT, and we just need to change the filters' coefficients and instruction codes. The VLSI architecture and PALU organization for the 1-D DWT still apply to the 1-D IDWT.
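The idea that the IDWT can be computed along DWT-like paths can be verified with a short numerical sketch. It assumes the standard synthesis form x̂[n] = Σ_k (s[k] g0[n − 2k] + d[k] g1[n − 2k]), interleaves the low- and high-pass coefficients into one sequence, filters that sequence with two derived filters, and keeps every second output. The derived filters `p` and `q` follow from this assumed form and may be indexed differently from the filters in (12); all names and tap values are hypothetical, and zero padding is used at the boundary for brevity.

```python
def synth_direct(s, d, g0, g1, n_out):
    """Standard synthesis: upsample by two, filter, and sum the two branches."""
    out = []
    for n in range(n_out):
        acc = 0
        for k in range(len(s)):
            acc += s[k] * g0.get(n - 2 * k, 0) + d[k] * g1.get(n - 2 * k, 0)
        out.append(acc)
    return out


def synth_interleaved(s, d, g0, g1, n_out):
    """DWT-like form: filter the interleaved sequence, then downsample by two."""
    z = [v for pair in zip(s, d) for v in pair]           # z[2k] = s[k], z[2k+1] = d[k]
    lo = min(min(g0), min(g1)) - 2
    hi = max(max(g0), max(g1)) + 2
    # p yields the even output samples, q yields the odd output samples.
    p = {l: g0.get(l, 0) if l % 2 == 0 else g1.get(l + 1, 0) for l in range(lo, hi + 1)}
    q = {l: g0.get(l + 1, 0) if l % 2 == 0 else g1.get(l + 2, 0) for l in range(lo, hi + 1)}
    out = [0] * n_out
    for m in range((n_out + 1) // 2):
        out[2 * m] = sum(z[j] * p.get(2 * m - j, 0) for j in range(len(z)))
        if 2 * m + 1 < n_out:
            out[2 * m + 1] = sum(z[j] * q.get(2 * m - j, 0) for j in range(len(z)))
    return out


# Whole-point symmetric filters indexed around zero; illustrative integers only.
g0 = {k: [10, 4, -1, -2][abs(k)] for k in range(-3, 4)}
g1 = {k: [12, -5, -2, 1, 1][abs(k)] for k in range(-4, 5)}
s = [3, -1, 4, 1, 5]
d = [2, 7, -1, 8, -2]
direct = synth_direct(s, d, g0, g1, 10)
interleaved = synth_interleaved(s, d, g0, g1, 10)
```

The two results agree sample by sample, so the same filter-and-downsample data path can serve both transforms once the coefficients are exchanged.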
V. EXPERIMENTAL RESULTS AND COMPARISONS
We accomplish the 2-D DWT by applying the 1-D DWT along columns and rows (separable approach), and the 1-D DWT architecture adopts the 2M4A PALU organization. Table X lists the properties and specifications, and Fig. 7 shows the layout. Recent works [20]-[22] are summarized for comparison in Table XI. The architecture in [20] supports a 512 × 512 image size, which is smaller than ours. In addition, the authors do not report the precision bits. On the other hand, although the precision bits in [21] are the same as ours, our design supports larger image sizes, with specifications such as VGA, MPEG-1, and MPEG-2. Moreover, in [22], although the wavelet filters are programmable, the supported image size is 256 × 256 and the precision is 24 bits. Thus, our chip has advantages in precision and image size.
The merits of our work are summarized as follows. Our chip integrates the DWT and IDWT into a single chip without extra circuits and uses only 84 pins. The goals of lower hardware cost and architecture sharing are therefore achieved. The decomposition level is up to six, and the supported image sizes cover the specifications of contemporary applications.
VI. CONCLUSION
In this paper, we propose a VLSI architecture and a design rule, named EIC, to perform the 2-D DWT/IDWT. We establish a set of instructions to accomplish the DWT computation. In addition, we present a PALU organization for the computation unit, consisting of two multipliers and four adders. Since the similarity between the DWT and IDWT is proven, we can build both in the same architecture. Moreover, we employ symmetric extension to satisfy the PR requirement. Using the TSMC 0.35-μm 1P4M CMOS technology, our chip has a total area of 3123, a pin count of 84, and a processing speed of 7.78 Mpixels/s. Finally, our work supports up to six decomposition levels and versatile image specifications.
