Abstract-In this paper, two approaches to rationalizing the execution of the FDWT and IDWT basic operations with a reduced number of multiplications are considered. A direct implementation of these operations requires 2L multiplications for each of the FDWT and IDWT basic operations, plus 2(L-1) additions for the FDWT basic operation and L additions for the IDWT basic operation. The first approach yields computation procedures that take only 1.5L multiplications plus 3.5L+1 additions for the FDWT basic operation and L+1 multiplications plus 3.5L additions for the IDWT basic operation. The second approach yields computation procedures that require 1.5L multiplications, plus 2L-1 additions for the FDWT basic operation and L+1 additions for the IDWT basic operation.
I. INTRODUCTION
Recently, the discrete wavelet transform (DWT) has been used in numerous computer graphics, signal processing, and image processing applications [1]-[11].
In 1989, Stéphane Mallat proposed a fast wavelet decomposition and reconstruction algorithm [1]. The basic idea of the fast algorithm, for both the 1D DWT and the 2D DWT, is to decompose the original signal (or image) into two components using a pair of filters (high-pass and low-pass) and then to decompose the low-pass component further in the same hierarchical manner. This decomposition process is called analysis. The inverse process is called reconstruction (or synthesis) [3, 5, 7].
The cores of the multilevel DWT decomposition and reconstruction procedures are the forward DWT (FDWT) and inverse DWT (IDWT) "basic operations": the multiplication of a data vector by the FDWT or IDWT base matrix, which holds the filter coefficients [12].
In matrix-vector form, the FDWT basic operation is defined by (1), while the inverse operation, used to recreate the signal from the DWT coefficients (the IDWT basic operation), is defined by (2). Here $N$ is the number of samples of the original signal, $L$ is the size of the sliding window that defines the part of the signal processed by the given basic operation, $k$ is the decomposition step number indicating the granularity of the data resolution, and $K$ is the total number of decomposition steps. $\mathbf{F}_{2 \times L}$ is the FDWT base matrix of size $2 \times L$, whose elements are the coefficients of the high-pass and low-pass filters, respectively. Formulas (1) and (2) define the FDWT and IDWT basic operations, respectively [12].
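As an illustration of the two basic operations in matrix-vector form, the following minimal Python sketch multiplies a window of L samples by a 2 x L base matrix (forward) and a pair of coefficients by its transpose (inverse). The function names, the choice of the orthonormal Daubechies D4 filter, the row order (low-pass first), and the relation g[k] = (-1)^k h[L-1-k] are assumptions made for the example, not notation taken from the paper.

```python
import math


def d4_base_matrix():
    """Return a 2 x L base matrix (L = 4) built from the Daubechies D4 filter.

    Assumption: row 0 holds the low-pass coefficients h and row 1 the
    high-pass coefficients g, with g[k] = (-1)**k * h[L-1-k].
    """
    s3 = math.sqrt(3.0)
    norm = 4.0 * math.sqrt(2.0)
    h = [(1 + s3) / norm, (3 + s3) / norm, (3 - s3) / norm, (1 - s3) / norm]
    L = len(h)
    g = [(-1) ** k * h[L - 1 - k] for k in range(L)]
    return [h, g]


def fdwt_basic_direct(F, x):
    """Forward basic operation y = F x.

    A hardware realization needs 2L multiplications and 2(L-1) additions.
    """
    return [sum(f * v for f, v in zip(row, x)) for row in F]


def idwt_basic_direct(F, y):
    """Inverse basic operation x = F^T y.

    A hardware realization needs 2L multiplications and L additions.
    """
    L = len(F[0])
    return [F[0][i] * y[0] + F[1][i] * y[1] for i in range(L)]
```

For an orthonormal filter pair the rows of the base matrix are orthonormal, so applying the forward operation to the output of the inverse operation returns the original pair of coefficients.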
From expressions (1) and (2) it follows that a direct implementation of each basic operation requires $2L$ multiplications, plus $2(L-1)$ additions for the FDWT and $L$ additions for the IDWT. Minimizing the number of multiplications is especially important in the design of specialized VLSI mathematical or DSP processors, because reducing the number of multipliers also reduces the power dissipation and lowers the implementation cost of the entire system. Moreover, a hardware multiplier is a more complicated unit than an adder and occupies much more chip area. Even if the chip already contains embedded multipliers, their number is always limited. This means that if the implemented algorithm has a large number of multiplications, the projected processor may not fit into the chip, and the problem of minimizing the number of multiplications remains relevant.
In this paper, two approaches to the synthesis of rationalized algorithms for the implementation of the FDWT/IDWT basic operations are considered. In the first approach, the number of multipliers in the implementation of the FDWT/IDWT basic operations is reduced by applying Winograd's inner product formula. The second approach uses the efficient matrix factorization schemes described in [13]. The algorithms are constructed with extensive use of the principles of parallelization and vectorization of data processing. It is therefore assumed that the synthesized procedures will be implemented on hardware platforms with parallel processing.
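The problem is to calculate the matrix-vector product $\mathbf{Y}_{M \times 1} = \mathbf{A}_{M \times N}\mathbf{X}_{N \times 1}$, where $N$ is even.
According to Winograd's formula for the inner product calculation, each element $y_m$ of the vector $\mathbf{Y}_{M \times 1}$ can be calculated as follows [14]:
$$y_m = \sum_{k=1}^{N/2} (a_{m,2k-1} + x_{2k})(a_{m,2k} + x_{2k-1}) - c_m - \xi,$$
where $c_m = \sum_{k=1}^{N/2} a_{m,2k-1}\,a_{m,2k}$ and $\xi = \sum_{k=1}^{N/2} x_{2k-1}\,x_{2k}$. If the elements of the matrix are constants, the values $c_m$ can be calculated in advance. The calculation of $\xi$ requires $N/2$ multiplications. By exploiting rationalization solutions based on Winograd's inner product formula, the number of multipliers needed for a fully parallel implementation can be decreased.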
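The effect of Winograd's inner product identity on the multiplication count can be checked with a short Python sketch. The function name `winograd_matvec`, the zero-based indexing, and the convention of not counting the row constants (assumed to be precomputed because the matrix is constant) are assumptions made for this illustration.

```python
def winograd_matvec(A, x):
    """Compute y = A x for an M x N matrix A (N even) via Winograd's identity.

    Returns (y, mults), where mults counts data-dependent multiplications.
    The row constants c[m] depend only on A and are treated as precomputed.
    """
    n = len(x)
    assert n % 2 == 0, "this sketch assumes an even inner-product length"
    half = n // 2

    # Constants of the matrix rows: computed offline, hence not counted.
    c = [sum(row[2 * k] * row[2 * k + 1] for k in range(half)) for row in A]

    # xi depends only on the data vector and is shared by all rows.
    xi = sum(x[2 * k] * x[2 * k + 1] for k in range(half))
    mults = half

    y = []
    for row, c_m in zip(A, c):
        acc = 0.0
        for k in range(half):
            # One multiplication yields two terms of the inner product.
            acc += (row[2 * k] + x[2 * k + 1]) * (row[2 * k + 1] + x[2 * k])
            mults += 1
        y.append(acc - c_m - xi)
    return y, mults
```

For a 2 x L matrix the count is L + L/2 = 1.5L, which agrees with the figure quoted in the abstract for the FDWT basic operation under the first approach.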
B. Rationalized algorithm for FDWT basic operation implementation using Winograd's inner product formula
First, using the elements of the matrix $\mathbf{F}_{2 \times L}$, a column vector is formed. Its definition involves a column vector containing the impulse response coefficients of the low-pass filter, a column vector containing the impulse response coefficients of the high-pass filter, and $\mathbf{0}_{L \times 1}$, a column vector or matrix of the size given by its subscript and consisting of all zeros.
Next, we introduce some auxiliary matrices:
Here and throughout the paper, the all-ones matrix is a matrix of the size given by its subscript and consisting of all 1s, and $\otimes$ is the Kronecker product sign [15]. The auxiliary matrices include a permutation matrix whose 1 elements lie on the counter-diagonal, with all other elements equal to zero, and a partial products summation matrix.
Taking into account the introduced vector-matrix constructions, the FDWT basic operation computational procedure with a reduced number of multiplications (or multipliers in case of hardware implementation) can be represented as follows:
The operator named "vectorized Hadamard product" [12, 13] has been introduced for convenience in describing the simultaneous multiplication of selected vector elements. It transforms a data column vector with the help of a binary mask matrix, and the elements $y_m$ are determined by the rule given in [12, 13].
The computational procedure for the example takes the following form:
C. Rationalized algorithm for IDWT basic operation implementation using Winograd's inner product formula
For the synthesis of a rationalized calculation procedure for the IDWT basic operation, the following matrix constructions are introduced, where $\oplus$ is the direct sum sign [15] and one of the constructions is the first permutation matrix.
Using the introduced vector-matrix constructions, the IDWT basic operation computational procedure with a reduced number of multiplications (or multipliers in the case of a hardware implementation) can be written in the following form:
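For the inverse basic operation every output sample is an inner product of length two, so Winograd's identity takes a particularly simple form. The sketch below is a scalar illustration under assumed names (`idwt_basic_winograd`, a 2 x L list of lists `F`); it is not the vectorized procedure of the paper, only a check of the multiplication count.

```python
def idwt_basic_winograd(F, y):
    """Sketch of x = F^T y using Winograd's identity on length-2 inner products.

    F is a 2 x L list of lists and y = [y0, y1]. Returns (x, mults), where
    mults counts the data-dependent multiplications only.
    """
    L = len(F[0])
    y0, y1 = y
    xi = y0 * y1          # one multiplication shared by all L outputs
    mults = 1
    x = []
    for i in range(L):
        a, b = F[0][i], F[1][i]
        c_i = a * b       # filter constant: precomputed offline, not counted
        # (a + y1)(b + y0) = a*b + a*y0 + b*y1 + y0*y1
        x.append((a + y1) * (b + y0) - c_i - xi)
        mults += 1
    return x, mults
```

The count is L + 1, in agreement with the number of multiplications quoted in the abstract for the IDWT basic operation under the first approach.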
III. RATIONALIZED ALGORITHM FOR FDWT/IDWT BASIC OPERATIONS EXECUTION USING GAUSS' TRICK FOR 2×2 MATRIX FACTORIZATION

A. Short background
Let us rearrange the columns of the matrix $\mathbf{F}_{2 \times L}$ with the help of a permutation matrix so that the new matrix consists of $2 \times 2$ sub-blocks. The multiplication of this matrix by a vector then splits into $L/2$ independent vector-matrix products with $2 \times 2$ matrices, whose results are subsequently added. Note that all sub-blocks of the new matrix possess specific block structures. This specificity, as we show below, allows the number of multiplications in the partial vector-matrix products to be reduced. This rationalization uses the decomposition described in [13]. With it, each such vector-matrix multiplication requires only three multiplications and five additions. Exploiting these structural peculiarities offers a more cost-effective way to obtain the set of partial vector-matrix products. The application of this trick is considered in detail below.
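The three-multiplication idea can be illustrated on a single 2 x 2 block. The block structure used here, [[a, b], [b, -a]], is an assumption: it is what results when column k (k even) of a quadrature mirror filter pair is paired with column L-1-k, and the paper's own block layout may differ. The function name is hypothetical.

```python
def gauss_block_product(a, b, x1, x2):
    """Multiply the block [[a, b], [b, -a]] by [x1, x2] with 3 multiplications.

    Direct evaluation needs 4 multiplications:
        y1 = a*x1 + b*x2
        y2 = b*x1 - a*x2
    """
    m1 = b * (x1 + x2)   # addition 1
    m2 = (a - b) * x1    # addition 2 (precomputable when a, b are constants)
    m3 = (a + b) * x2    # addition 3 (precomputable when a, b are constants)
    y1 = m1 + m2         # addition 4
    y2 = m1 - m3         # addition 5
    return y1, y2
```

When the filter coefficients are constants, the combinations a - b and a + b can be stored in place of a and b, so only three additions remain at run time.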
B. Rationalized algorithm for FDWT basic operation implementation using Gauss's matrix factorization trick
First, we extract elements from the matrix $\mathbf{F}_{2 \times L}$ to create a new diagonal matrix built from submatrices. Next, we introduce three types of summation matrices, among them the matrix of post-additions. Taking into account the introduced vector-matrix constructions, the FDWT basic operation computational procedure with a reduced number of multiplications can be represented as follows:
We again consider an example for $L = 8$.
C. Rationalized algorithm for IDWT basic operation implementation using Gauss's matrix factorization trick
The IDWT basic operation computational procedure takes the following form:
The data flow diagrams for the execution of the FDWT and IDWT basic operations in accordance with procedures (8) and (10) are shown in Figures 3 and 4, respectively. The number of multiplications in the algorithm for the implementation of the FDWT basic operation, represented by procedure (7), is $1.5L$. The same algorithm requires only $2L-1$ additions.
The number of multiplications in the algorithm for the implementation of the IDWT basic operation, represented by procedure (9), is $1.5L$. In turn, this algorithm requires $L+1$ additions.
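The operation counts for the second approach can be reproduced with a scalar Python sketch. It assumes an even filter length L, a high-pass filter given by g[k] = (-1)^k h[L-1-k], and a pairing of column k (k even) with column L-1-k. These choices and all function names are assumptions for illustration, and the code is not the vectorized procedure of the paper.

```python
def make_blocks(h):
    """Pair column k (even) with column L-1-k and precompute block constants.

    With g[k] = (-1)**k * h[L-1-k], the paired columns of F = [h; g]
    form the 2 x 2 block [[a, b], [b, -a]], where a = h[k], b = h[L-1-k].
    """
    L = len(h)
    assert L % 2 == 0, "this sketch assumes an even filter length"
    return [(k, L - 1 - k, h[L - 1 - k], h[k] - h[L - 1 - k], h[k] + h[L - 1 - k])
            for k in range(0, L, 2)]


def fdwt_basic_gauss(h, x):
    """Return (y, mults, adds) for the forward basic operation y = F x."""
    s1 = s2 = s3 = None
    mults = adds = 0
    for i, j, b, a_minus_b, a_plus_b in make_blocks(h):
        m1 = b * (x[i] + x[j])       # one pre-addition per block
        m2 = a_minus_b * x[i]
        m3 = a_plus_b * x[j]
        mults += 3
        adds += 1
        if s1 is None:
            s1, s2, s3 = m1, m2, m3
        else:                        # accumulate the three partial sums
            s1, s2, s3 = s1 + m1, s2 + m2, s3 + m3
            adds += 3
    adds += 2                        # two post-additions
    return [s1 + s2, s1 - s3], mults, adds


def idwt_basic_gauss(h, y):
    """Return (x, mults, adds) for the inverse basic operation x = F^T y."""
    x = [0] * len(h)
    t = y[0] + y[1]                  # one pre-addition shared by all blocks
    mults, adds = 0, 1
    for i, j, b, a_minus_b, a_plus_b in make_blocks(h):
        m1 = b * t
        m2 = a_minus_b * y[0]
        m3 = a_plus_b * y[1]
        mults += 3
        x[i] = m1 + m2
        x[j] = m1 - m3
        adds += 2
    return x, mults, adds
```

Under these assumptions the sketch uses 1.5L multiplications in both directions, 2L-1 additions for the forward operation, and L+1 additions for the inverse operation, which matches the figures given in the abstract for the second approach.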
IV. CONCLUSION
The solutions proposed in this article make it possible to reduce the total number of multiplications in the implementation of the FDWT/IDWT basic operations compared with the naive methods of computing. Indeed, in the general case a fully parallel implementation of the FDWT/IDWT basic procedures requires $2L$ multiplications. The number of multiplications for each of the algorithms proposed here is 25% less than that of the direct execution of computations.
It is noteworthy that, because all elements of the matrix $\mathbf{F}_{2 \times L}$ are constants, we can (but need not) use one-input units (encoders) instead of traditional multipliers. In that case, it is apparently advisable to use the second approach for the implementation of the FDWT/IDWT basic operations. This solution greatly simplifies the implementation, reduces the power dissipation, and lowers the price of the device. On the other hand, when we are dealing with FPGA chips that already contain a number of embedded multipliers, constructing additional encoders in place of multipliers is irrational, and it would be unreasonable to forgo the embedded multipliers. In that case, algorithms derived from either approach can be used with approximately equal effect. The algorithms proposed in this article reduce the number of multiplications at the cost of more additions or more complex memory access. Such solutions make no sense on some modern high-speed architectures, where pipelined fixed-point or floating-point addition and multiplication each take just one clock cycle. Therefore, the solutions presented here are intended solely for the hardware implementation of the FDWT/IDWT basic operations.
