Abstract-The
I. INTRODUCTION
The discrete wavelet transform (DWT) [1] - [6] has been developed recently as a feature extraction tool for signal and image processing and has been shown to be efficient in comparison to traditional signal processing techniques in several industrial and commercial applications.
The massive computation required by the DWT can be met only with suitable computing resources. If the application is well defined and real-time operation is important, dedicated VLSI ASIC solutions should be considered (see, e.g., [7] - [18] ). In particular, two efficient SIMD architectures are proposed in [10] to implement the 1-D and the 2-D DWT's, respectively. A pipeline-based realization of the 2-D DWT is described in [18] . Other efficient architectures are presented in [4] , [15] , and [17] . Whenever either the application or the desired DWT processing is subject to change, VLSI implementation is not Manuscript received September 29, 1997 ; revised April 29, 1999 . The associate editor coordinating the review of this paper and approving it for publication was Dr. Elias S. Manolakos.
F. Marino is with the Dipartimento di Ingegneria Elettrotecnica ed Elettronica, Politecnico di Bari, Bari, Italy.
V. Piuri is with the Department of Electronics and Information, Politecnico di Milano, Milano, Italy.
E. E. Swartzlander, Jr. is with the Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX 78712 USA.
Publisher Item Identifier S 1053-587X(99) 08298-7. suitable since it cannot easily accommodate varying specifications. In addition, if only a few processors have to be constructed, the cost of a dedicated VLSI realization is often too high. In these cases, configurable VLSI systems (e.g., FPGA-and FPL-based structures) can be adopted. If high-speed operation is not mandatory, a feasible solution is the use of traditional general-purpose parallel computers. If high-speed operation is required, it is necessary to consider a highly parallel system with many processing elements to achieve the required speed without losing the flexibility of a general-purpose implementation. During the application, design, and experimentation phases, flexibility and modifiability of the implementation are attractive since they allow for tuning the solution, possibly before starting an ASIC development for volume production. In some applications (e.g., experimental computer graphics, medical imaging, multimedia production), even if no ASIC device is envisioned, the freedom of modifying the DWT processing parameters is important to achieve high-quality results. The use of general-purpose systems must be pursued also when the application is not required to have a flexible and adaptable implementation, but the production volume is so low that development and fabrication of a dedicated system is too expensive. In the literature, only few researchers have dealt with the DWT implementation on general-purpose machines. Parallelization of the 2-D DWT is proposed in [19] by using the snake sweeping algorithm [20] . Both DWT data dependence and localization analysis have been studied in [4] to design a distributed parallel memory/control architecture.
Maximum computational efficiency is an important issue in the use of multiprocessors for DWT processing. Unfortunately, the research mentioned above does not address how to avoid the interprocessor communications required to transpose the intermediate results in the 2-D DWT. Solutions to this problem have been presented in literature only for classical transforms, e.g., the DFT. An algorithm is presented in [21] to compute a N k -point k-D DFT (where N is a prime number) by evaluating (N k 0 1)=(N 0 1) independent one-dimensional (1-D) DFT's. In [22] , the case of the 2-D DFT for N = p 2 (with p a prime number) and N = 2 n is considered. In [23] , the parallel implementation of the algorithm described in [24] is discussed for the AT&T BT100 binary-tree computer. This algorithm computes the
DFT's and the discrete radon transform applied to the input matrix created by using linear congruences-based criteria [24] ; to solve these congruences, simple formulas are provided only for some specific values of N [24] . In [25] , the reduced transform algorithm is used to balance communication and computation in a parallel machine, but the sizes of the input array must be prime numbers. These algorithms impose restrictions on the size of the input array, thus making them nongeneral.
In [26] , the matrix-vector multiplication approach has been shown highly suitable and effective for DFT when the input data are sequentially available since the matrix-vector multiplication does not need the whole input data set to begin its operation. This approach also provides also high modularity and scalability to satisfy a wide range of applications. We have explored the use of the matrixvector multiplication approach to design a novel algorithm without limitations on the size of the input array.
This correspondence presents an approach to compute the 2-D DWT by using matrix-vector multiplications. Even though the matrixvector multiplication approach is a well-known technique, our idea is to use this technique to avoid data transposition. Since we are 1053-587X/99$10.00 © 1999 IEEE interested in maximizing the parallelism of the computation itself, we have selected the intrinsically parallel filter bank algorithm [1] to perform the DWT in standard form. Mallat's algorithm [3] is usually preferred in hardware implementations since the same circuit can be used repeatedly with the same coefficients to generate every transformation output, even though it has a higher latency. However, these two algorithms are equivalent from the point of view of the DWT results sinceMallat's algorithm can be viewed as the unrolled version of the filter bank. A suitable choice of the filter bank coefficients achieves a result quality similar to that provided by Mallat's algorithm. Coefficients of one of these two approaches can be easily transformed one into those of the other. In our software implementation, the use either of the same or different coefficients to generate each DWT output has no effect on the computational complexity. Therefore, we take advantage of the intrinsic parallelism of the filter bank to reduce the latency. Consequently, the application domain of the proposed approach covers the areas typically tackled by using Mallat's algorithm.
This corespondence is structured as follows. The parallelization approach based on matrix-vector multiplication is defined in Section II. Section III provides a theoretical analysis of the computational complexity and the performance evaluation for an experimental implementation on the AT&T DSP3 parallel computer. (1)
II. THE MATRIX APPROACH TO
The lowpass filter H = bh(0);h(1);1 11;h(l);11 1;h(2 M 0 1)c generates the residual filtered sequence y y
The 1-D DWT of the sequence x is obtained by collecting the The first filter bank is defined as the row 1-D DWT on the N 2N input matrix X X X = fx(r; s); 0 r; s < N g generating the N 
The A parallel 2-D N 2N DWT implementation on P processors 1 can be obtained [27] , [28] by partitioning the input and the intermediate matrices. This approach (Algorithm A) is described in Fig. 1 . In step 1, the rows of the input matrix X X X are downloaded from the host computer to the processors N=P rows per processor. In step 2, each processor performs the 1-D DWT on the rows. In step 3, the partial results are uploaded by rows to the host to create the intermediate matrix To avoid these drawbacks, we propose a matrix-based approach to parallelize the 2-D DWT. The basic idea is to describe each filter as a matrix and the filtering as matrix-vector multiplication. This is a well-known approach to implement convolution; the innovative idea is to use it in order to avoid data transposition and interprocessor communications, maintaining a high level of parallelism.
Let us consider the M matrices otherwise. 
The N 2 N intermediate matrix Z Z Z can thus be obtained by juxtaposition of the row vectors z z z T r resulting from (7) for each input row x x xr, according to (3):
The 2-D DWT of the matrix X X X can be written by using the following matrix expression:
By transposing both of the members of (9) 
where w w wq is the qth column of W W W , and parentheses suggest the most efficient computing sequence.
Equation (11) has been obtained from (10) by partitioning Y Y Y T by columns. This suggests an efficient parallel algorithm for computing the 2-D DWT (Algorithm B), which is described in Fig. 2 .
Downloading of the whole input matrix X X X is performed in steps 1 and 2. In the steps 2 and 3, each processor computes (11) for N=P different vectors w w w q : Uploading of N=P columns y y y T q to the host computer is performed in step 4. Downloading and computing can be overlapped (see step 2) since each processor can start its computation as soon as its data are received from the host. No time skewing is thus necessary. Uploading can be executed simultaneously by each processor [step 4 in Fig. 2(b) ] if connections among processors and host allow contemporaneous data transfer. Otherwise, step 4 has to be skewed, as shown in Fig. 2(a) 
III. PERFORMANCE EVALUATION AND EXPERIMENTAL RESULTS
The performance of Algorithm B with respect to Algorithm A is evaluated both from a theoretical point of view and with an experimental implementation on the AT&T DSP3 parallel processor [29] .
A. Theoretical Speedup
The speed-up S of Algorithm B with respect to algorithm A is defined as S Time to compute a 2-D DWT by using Algorithm A Time to compute a 2-D DWT by using Algorithm B : (12) The time intervals in the diagrams shown in Figs. 1 and 2 are as follows.
• 
• [H-0] is the latency required to download the whole input matrix into the multiprocessor system 
• [J-I] is the latency required by each processor to upload the N=P vectors y y y T q either in a skewed or in a parallel way
where is either 1 or 1=P in case of skewed or parallel upload, respectively. • [I-G] is the latency needed by a single processor to compute (11) for each of the N=P different vectors y y y q assigned to the processor itself
where (N) is the time needed by a processor to compute one N-point vector y y y q : The speed-up S is thus given by
Since matrix W W W is sparse, each processor needs to store and use only the elements that are a priori known to be nonzero. This can be implemented without additional control circuitry since each processor is a general-purpose processor executing a software program; such a program can easily be written to skip the unnecessary multiplications. Consequently, the computation for Algorithm B is perfectly balanced among all processors, we can derive from (11) 
Since, generally, q 0 q, the speedup is always greater than 1 for any and :
In evaluating the speedup [in particular, both (N ) and (N)],
we did not make use of any optimization (e.g., coding strategy, compiler optimizations, fetching strategies) since they are machine or environment specific. The detailed analysis of possible influences both of optimizations and architectural factors goes beyond the scope of this correspondence, which aims to describe a machineindependent algorithm. It is worth noting that these factors have similar influence both in Algorithms A and B since the mathematical operations performed by them are similar.
B. Experimental Results
Algorithms A and B have been implemented and tested on the AT&T DSP-3 multiprocessor system, even though the generality of the proposed algorithm allows for implementation over any multiprocessor system. This architecture [29] has from 16-128 processors, each of them containing a DSP-32C unit (50 MHz, 25 MFLOPS, with a multiplier and an adder working in parallel) and 512K words (32-bit) of local memory.
Our test aims to demonstrate the relative performance improvement that can be obtained by Algorithm B with respect to Algorithm A. In our experiments, processors have been configured in a pipeline to test the worst connection scheme. Data downloading from the host to the last processor of the pipeline and, similarly, results uploading onto the host involve, therefore, all processors in the pipeline. With respect to the implementation presented here, our algorithm can achieve higher absolute performances if it is implemented on more powerful processors or on machines having more efficient connection schemes among the host and the processors. To clearly establish the advantages of avoiding transposition in Algorithm B, we have not exploited its ability to begin processing data as soon as they become available (i.e., = 1 in our implementation); this feature could provide an even higher speedup. Since the machine available in our laboratory has only 16 processors, data for P = 16 were obtained by measuring the performance of C++ programs running on the DSP-3, whereas data for P > 16 were derived by simulating the larger machines on our system. 3 The dip in Fig. 3 is due to the behavior of the average communication bandwidth, (P ): In our pipelined processor connection scheme, the average communication bandwidth decreases with the number P of processors. For a critical value of P (depending on the number of operations performed by each processor), this negative effect overcomes the speedup achieved by the algorithm in each processor. blocks that should be produced by the virtual processors is evaluated without any concern to the actual propagated values. This is achieved by propagating the last generated real block for P 0 16 additional times. 
