This article presents strategies for the synthesis of fast algorithms for computing matrix-vector products. It considers a specific example of the synthesis of a fast algorithm for multiplying a matrix by a vector. The example makes it possible to follow all the stages of constructing an algorithm that is rationalized with respect to minimizing the number of multiplications.
Introduction
Matrix-vector multiplication is one of the most important and basic operations in science and engineering. The need to compute a matrix-vector product arises in many numerical tasks connected with digital signal processing (DSP), image detection, and audio and video compression (Blahut, 1985). Operations typical of digital signal processing, such as linear and circular convolution, correlation, discrete orthogonal transforms in various bases, and multiresolution decomposition and reconstruction, also reduce to matrix-vector multiplication. Analysis of different basic matrix structures has shown that some interrelations between their elements allow the number of arithmetic operations to be reduced. This fact gave rise to works on the synthesis of algorithms for computing matrix-vector products that are optimized with respect to the number of multiplications and additions. The fast Fourier transform is an example of an algorithm that exploits favorable relations between the elements of a matrix. To reduce the number of arithmetic operations, it exploits the symmetry and periodicity of the discrete exponential functions that are the elements of the discrete Fourier transform base matrix (Blahut, 1985). The other "fast" algorithms for the calculation of the discrete Fourier transform and
Defining the Set of Matrix Pattern Structures that Allow the Reduction of the Number of Arithmetic Operations while Computing Matrix-vector Product
The aim of rationalizing the computation of matrix-vector products is to minimize the total number of multiplications and additions (Blahut, 1985). It should be emphasized that it is not always possible to reduce both the number of additions and the number of multiplications. Sometimes the number of multiplications can be reduced only at the cost of increasing the number of additions. Nevertheless, the total number of operations often remains lower than the original, and in such cases the computational "gain" is obvious. If, on the other hand, the growth in additions makes the total number of arithmetic operations larger than the original, the gain of such a rationalization needs to be justified.
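The trade-off between multiplications and additions can be seen on the smallest possible case. The sketch below assumes a 2×2 matrix with equal diagonal entries and equal off-diagonal entries; this is a commonly used structure chosen for illustration, not necessarily one of the tabulated patterns, and the function names are hypothetical.

```python
def direct_2x2(a, b, x1, x2):
    """Direct product [[a, b], [b, a]] @ [x1, x2]: 4 multiplications, 2 additions."""
    return a * x1 + b * x2, b * x1 + a * x2


def fast_2x2(a, b, x1, x2):
    """Same product with 2 multiplications and 4 additions on the data.

    The terms (a + b) / 2 and (a - b) / 2 depend only on the matrix, so for a
    constant matrix they can be computed once in advance.
    """
    p = (a + b) / 2          # matrix-only term, precomputable
    q = (a - b) / 2          # matrix-only term, precomputable
    u = p * (x1 + x2)        # multiplication 1, data addition 1
    v = q * (x1 - x2)        # multiplication 2, data addition 2
    return u + v, u - v      # data additions 3 and 4
```

Here the number of multiplications drops from 4 to 2 while the number of additions on the data grows from 2 to 4, so the total stays at 6; the gain lies entirely in replacing multiplications by additions.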
Thus, the strategies for reducing the number of arithmetic operations in calculating matrix-vector products will be presented. Table 1 presents an incomplete set of typical 2×2 matrix structures (called "patterns" or "pattern matrices" below) whose block arrangement allows the number of arithmetic operations in calculating matrix-vector products to be reduced. The possibility of reducing the operations for some of these structures has been presented, among others, in (Țariov and Adamski, 2005) for 2×2 matrices and in (Țariov and Adamski, 2006) for N×N matrices. The extended set of computational procedures for 2×2 matrices that accounts for such a reduction is presented in Table 2. Table 3 presents a set of typical N×N matrix structures whose block arrangement allows the number of arithmetic operations in calculating matrix-vector products to be reduced.
Hadamard matrix, $I_N$ is the $N \times N$ identity matrix, and the signs "$\otimes$" and "$\oplus$" denote the tensor product and the direct sum of two matrices, respectively (Regaliat, P. A. and Mitra, S. K., 1989),
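To illustrate how this notation expresses a reduction, consider a block matrix with two equal diagonal blocks and two equal off-diagonal blocks. The factorization below is a standard identity given as a sketch; the symbols $A$ and $B$ (arbitrary $N \times N$ blocks) and the explicit form of $H_2$ are assumptions introduced here, and the structure is not necessarily one of the tabulated patterns.

```latex
H_2 = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}, \qquad
\begin{bmatrix} A & B \\ B & A \end{bmatrix}
= \frac{1}{2}\,(H_2 \otimes I_N)\,\bigl[(A + B) \oplus (A - B)\bigr]\,(H_2 \otimes I_N)
```

The right-hand side requires two $N \times N$ block products instead of four, at the cost of additional vector additions before and after the block-diagonal factor.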
The Strategies of the Construction of "Fast" Matrix-vector Multiplication Algorithms
The first strategy for rationalizing the computation of matrix-vector products relies on finding fragments of the matrix whose arrangement agrees with one of the pattern structures. Fortunately, the matrices that occur in digital signal processing have such specific block structures or may be brought to such a form by rearranging their rows or columns.
First, we mentally divide the original matrix into four submatrices. Then the structure of the blocks formed through that division is matched against the block structure of one of the patterns (the pattern matrices from Table 3). In this way, one of the computing procedures (1) to (16) is chosen. If the block structure found matches one of the patterns 2-16, we have managed to reduce the number of arithmetic operations. If, however, the block structure of the matrix corresponds to pattern 1, the multiplication by such a matrix cannot be rationalized.
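The division into four blocks and the matching step can be sketched as follows. This is a minimal illustration, assuming a square matrix of even order stored as a list of rows; the three structures tested and their labels are hypothetical examples and do not follow the numbering of the pattern tables.

```python
def split_blocks(m):
    """Split a square matrix of even order (list of rows) into four blocks."""
    h = len(m) // 2
    a11 = [row[:h] for row in m[:h]]
    a12 = [row[h:] for row in m[:h]]
    a21 = [row[:h] for row in m[h:]]
    a22 = [row[h:] for row in m[h:]]
    return a11, a12, a21, a22


def negate(block):
    """Return the block with every entry negated."""
    return [[-v for v in row] for row in block]


def match_pattern(m):
    """Return a (hypothetical) label for the block structure of m."""
    a11, a12, a21, a22 = split_blocks(m)
    if a11 == a22 and a12 == a21:
        return "[[A, B], [B, A]]"
    if a11 == a22 and a12 == negate(a21):
        return "[[A, B], [-B, A]]"
    if a11 == negate(a22) and a12 == a21:
        return "[[A, B], [B, -A]]"
    return "no structure (four independent blocks)"
```

A full implementation would test every pattern from the table in turn; the last branch plays the role of the unstructured case for which no saving is available.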
In such a case, the rows and/or columns of the original matrix may be permuted before we retry to match a pattern. Success in finding a suitable block structure depends largely on a fortunate choice of the permutation of rows and/or columns. It should be remembered, however, that permuting the rows of a matrix makes it necessary to shuffle the elements of the result (output) vector, while permuting the columns requires a prior permutation of the elements of the input vector.
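The bookkeeping implied by these permutations can be sketched as follows. The function names and the convention for describing a permutation (a list giving, for each new position, the original index) are assumptions made for this illustration.

```python
def matvec(m, x):
    """Direct matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * b for a, b in zip(row, x)) for row in m]


def permuted_matvec(m, x, row_perm, col_perm):
    """Compute m @ x through a row/column-permuted copy of m.

    row_perm[i] is the original row placed at position i of the permuted
    matrix; col_perm[j] is the original column placed at position j.
    """
    pm = [[m[r][c] for c in col_perm] for r in row_perm]
    px = [x[c] for c in col_perm]          # input permuted like the columns
    py = matvec(pm, px)                    # product with the permuted matrix
    y = [0] * len(py)
    for i, r in enumerate(row_perm):       # undo the row permutation
        y[r] = py[i]
    return y
```

In a fast algorithm the call to `matvec` on the permuted matrix would be replaced by the rationalized procedure selected for the block structure that the permutation exposes.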
It should also be stressed that if a rationalized procedure for the matrix-vector multiplication cannot be found at the first step, one cannot conclude that it will also be impossible at the following steps. Hence, regardless of the result of the first step, the second step should be taken. The next step of rationalization is the analogous division of each of the three (or, in the worst case, four) submatrices that emerged from the computing procedure synthesized in the previous step, together with an attempt to find, for each of those submatrices, a block structure that matches a pattern. Then an adequate procedure for each of these submatrices should be chosen and, on this basis, a common overall procedure for this step of rationalization should be created. If a matching pattern (from numbers 1 to 10) is found for the block structure of at least one of the submatrices, we obtain an even more rationalized overall computing procedure. Following this procedure, it is possible to arrive at submatrices of order 2×2 and to reduce the number of necessary arithmetic operations to the minimum. However, such a complete decomposition cannot always be achieved, especially not for all the submatrices generated by the block division of the original matrix and not at each step of decomposition. On the one hand, it depends on the structural properties of the matrix; on the other, on the designer's inventiveness: a wrong choice of the rows or columns of a matrix (or submatrix) to shuffle at any step may lead to a non-optimal decomposition of the original matrix.
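The step-by-step decomposition can be sketched as a recursion. The example below is a simplified illustration that assumes a single structure, two equal diagonal blocks and two equal off-diagonal blocks, is looked for at every level; the function name and the multiplication counter are hypothetical. Blocks without this structure fall back to the direct product.

```python
def fast_matvec(m, x, counter):
    """Multiply m @ x, recursing while m has the block form [[A, B], [B, A]].

    counter is a one-element list that accumulates scalar multiplications
    between matrix-derived values and data. The halvings below act on matrix
    entries only and are not counted, since they can be precomputed for a
    constant matrix.
    """
    n = len(m)
    if n == 1:
        counter[0] += 1
        return [m[0][0] * x[0]]
    h = n // 2
    a = [row[:h] for row in m[:h]]
    b = [row[h:] for row in m[:h]]
    if n % 2 or a != [row[h:] for row in m[h:]] or b != [row[:h] for row in m[h:]]:
        counter[0] += n * n                # no structure: direct product
        return [sum(u * v for u, v in zip(row, x)) for row in m]
    p = [[(u + v) / 2 for u, v in zip(ra, rb)] for ra, rb in zip(a, b)]
    q = [[(u - v) / 2 for u, v in zip(ra, rb)] for ra, rb in zip(a, b)]
    s = fast_matvec(p, [u + v for u, v in zip(x[:h], x[h:])], counter)
    d = fast_matvec(q, [u - v for u, v in zip(x[:h], x[h:])], counter)
    return [u + v for u, v in zip(s, d)] + [u - v for u, v in zip(s, d)]
```

For the 4×4 matrix `[[1, 2, 3, 4], [2, 1, 4, 3], [3, 4, 1, 2], [4, 3, 2, 1]]`, whose blocks have the same structure at both levels, the recursion reaches 1×1 blocks and the counter ends at 4 instead of 16.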
It should also be emphasized that an iterative division of the original matrix into four submatrices may not always lead to a reduced number of operations, even after several permutations of its rows and/or columns. In such a case it is worth searching for other alternatives. For example, the original matrix can be represented as a sum or a difference of two matrices, and then the ways of rationalization described above can be applied. Naturally, such a transformation is sensible when the total number of operations after the decomposition is lower than the original.
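A small example of the sum decomposition, given as a sketch under the assumption of a symmetric 2×2 matrix with unequal diagonal entries (a case chosen for illustration; the function name is hypothetical): the matrix is written as a structured part plus a sparse correction.

```python
def symmetric_2x2_matvec(a, b, c, x1, x2):
    """Product [[a, b], [b, c]] @ [x1, x2] with 3 multiplications.

    The matrix is written as [[a, b], [b, a]] + [[0, 0], [0, c - a]]:
    the first term needs 2 multiplications, the second one more.
    """
    # Terms that depend on the matrix only (precomputable if it is constant).
    p, q, r = (a + b) / 2, (a - b) / 2, c - a
    u = p * (x1 + x2)        # multiplication 1
    v = q * (x1 - x2)        # multiplication 2
    w = r * x2               # multiplication 3 (sparse correction term)
    return u + v, u - v + w
```

With scalar entries the total operation count does not drop (3 multiplications and 5 additions on the data instead of 4 and 2), so the transformation pays off only where multiplications are costlier than additions. When `a`, `b` and `c` stand for blocks, each saved block product removes many scalar operations, and for sufficiently large blocks the total count also falls.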
Another approach is to look for individual fragments (subblocks) of the matrix that have a pattern structure. In this case the reduction of arithmetic operations is achieved only for the selected fragments.
The technique of designing fast algorithms for the matrix-vector multiplication based on the presented considerations is shown schematically in Figure 33.

Fig. 33. The strategies of computing rationalization in the synthesis of fast algorithms for the matrix-vector multiplication
An Example of a Synthesis of a "Fast" Algorithm for the Matrix-vector Multiplication
Let us consider a synthesis of a fast algorithm for calculating a matrix-vector product for a matrix of arbitrary values. Suppose that the matrix has the following form:
[Flowchart stages: finding matrix fragments (blocks) that match the structure of the matrix patterns; choosing a pattern procedure and partial computing procedures for the selected blocks of the matrix; synthesizing the final matrix-vector computing procedure on the basis of the partial procedures; assessing the effectiveness of the matrix-vector multiplication algorithm according to selected criteria.]

It should be observed that the submatrices (35) that emerged as a result of procedure (34) also have the same block structure as the matrix $A_8$. Thus, we can once again apply the same method for reducing the number of operations to each of the created submatrices.

[Figure caption fragment: "... $X$ after the first step of rationalization".]
Strategies of synthesizing fast algorithms for the matrix-vector multiplication
Through the application of an adequate decomposition of the matrix components, as demonstrated in procedure (34), and of the model presented in Figure 34, we synthesize a new, refined computing procedure for the second step of the rationalization of the algorithm. The procedure for calculating the matrix-vector product then takes the following form:

[Equation not recoverable from the source; only the fragments "diag", "D" and "C" survive.]
The data flow diagram that reflects the structure of the computing process at this stage of rationalization is presented in Figure 35. Now one may notice that the matrices (36) have a block structure compatible with structure 4 from Table 3. In this case the computation of the products of the appropriate subvectors and the marked matrices can also be rationalized through the use of procedure (4). Figure 36 presents a data flow diagram of the final version of the rationalized process for calculating the product of the matrix $A_8$ and the vector $X_{8 \times 1}$. Evidently, for a constant matrix the synthesized algorithm for the example considered requires only 12 multiplications instead of 64, and 44 additions instead of 56.

[Figure caption fragment: "... $X$ after the third stage of rationalization".]

As can be observed, in the example analyzed the synthesized algorithm for multiplying the matrix by the vector requires, in the general case, only 12 multiplications instead of 64, but 112 additions instead of 56. The total number of arithmetic operations for the rationalized algorithm amounts to 124, whereas for the original (non-rationalized) algorithm it amounts to 120. However, when the vector is multiplied by a constant matrix, the sums that involve only matrix elements can be determined in advance, and the number of additions is lowered to 44. In this case the total number of arithmetic operations amounts to 56, so it is reduced by roughly half.
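The totals quoted above can be checked with a few lines of arithmetic. The counts for the rationalized algorithm (12, 112 and 44) are taken from the text; only the direct counts for an 8×8 matrix are derived here, and the variable names are chosen for this illustration.

```python
n = 8
direct_mul = n * n                 # one multiplication per matrix entry: 64
direct_add = n * (n - 1)           # n - 1 additions per output element: 56
direct_total = direct_mul + direct_add

fast_mul = 12                      # from the text
fast_add_general = 112             # from the text, arbitrary matrix
fast_add_constant = 44             # from the text, constant matrix

total_general = fast_mul + fast_add_general
total_constant = fast_mul + fast_add_constant
ratio_constant = total_constant / direct_total

print(direct_total, total_general, total_constant, round(ratio_constant, 2))
```

The ratio for the constant-matrix case comes out slightly below one half, which matches the statement that the operation count is roughly halved.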
Analysis of the system of relations used to determine the elements $\{s_i\}$ reveals repetitions of certain calculations and relations that could be taken into account in a further rationalization of the calculation process. Skipping the intermediate transformations, the procedure for calculating the values $\{s_i\}$ can be presented in the following way:

[Equation and the definitions that follow "where" are not recoverable from the source.]
Summary
The strategies for finding rational solutions during the synthesis of fast algorithms for multiplying a vector by a matrix, presented in this article, constitute a useful instrument for designing effective digital signal processing algorithms. The approach is relatively simple and effective: it reduces the number of arithmetic operations in the calculation of a matrix-vector product by relying on the specific structure of the matrices. It can be used for the synthesis of optimized algorithms for tasks that are not covered by existing solutions. It should also be emphasized that the approach makes it possible to reproduce known results, or to synthesize completely new fast algorithms (of different algorithmic structure), for typical digital signal processing tasks that are already solved and presented in the scientific literature. The articles (Ţariov, 2012), (Majorkowska-Mech and Cariow, 2012), (Cariow and Gliszczyński, 2012) and (Cariow and Majorkowska-Mech, 2014) show the application of the described methodology to the synthesis of some fast data processing algorithms.
