Index Terms: algorithm design and analysis, signal processing algorithms, digital signal processing chips, high performance computing.
I. INTRODUCTION
Most computational algorithms used in digital signal, image and video processing, computer graphics and vision, and high-performance supercomputing applications have matrix-vector multiplication as their kernel operation [1, 2]. For this reason, numerous publications are devoted to the rationalization of these operations [3]-[18]. In some cases, the elements of the multiplied matrices and vectors are complex numbers [5]-[9]. In the general case, a fully parallel hardware implementation of a rectangular complex-valued matrix-vector multiplication requires MN multipliers of complex numbers. When the matrix elements are constants, encoders can be used instead of multipliers. This solution greatly simplifies the implementation, reduces power dissipation, and lowers the price of the device. On the other hand, when dealing with FPGA chips that contain several tens or even hundreds of embedded multipliers, building and using additional encoders instead of multipliers is irrational. Examples include the Xilinx Spartan-3 family of FPGAs, which includes between 4 and 104 18×18 on-chip multipliers, and the Altera Cyclone-III family of FPGAs, which includes between 23 and 396 18×18 on-chip multipliers. Altera's Stratix-V GS family of FPGAs has between 600 and 1963 variable-precision on-chip blocks optimized for 27×27-bit multiplication. In this case, it would be unreasonable to forgo the embedded multipliers. Nevertheless, the number of on-chip multipliers is always limited, and it may not be sufficient to implement a high-speed fully parallel matrix-vector multiplier. Therefore, finding ways to reduce the number of multipliers in the implementation of a matrix-vector multiplier is an extremely urgent task.
Some interesting solutions related to the rationalization of complex-valued matrix-matrix and matrix-vector multiplications have already been obtained [10]-[13]. There are also original and effective algorithms for constant matrix-vector multiplication. However, a rationalized algorithm for complex-valued constant matrix-vector multiplication has not yet been published. For this reason, in this paper we propose such an algorithm.
II. PRELIMINARY REMARKS
The complex-valued vector-matrix product may be defined as:
where complex multiplications. Therefore, we can observe that the computation of (3) for all m requires only
However, the number of real additions in this case is significantly increased.
It is also well known that a complex multiplication can be carried out using only three real multiplications and five real additions, because [13]:
$$(a + jb)(c + jd) = ac - bd + j[(a + b)(c + d) - ac - bd]. \tag{4}$$
Expression (4) is well known as Gauss's trick for the multiplication of complex numbers [17]. Taking this trick into account, expression (3) can be calculated using only
multiplications of real numbers, at the expense of a further increase in the number of real additions.
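Gauss's trick mentioned above can be illustrated with a minimal Python sketch. The function name `gauss_complex_mul` and the operand names `a`, `b`, `c`, `d` (real and imaginary parts of the two factors) are chosen here for illustration only and do not come from the paper.

```python
def gauss_complex_mul(a, b, c, d):
    """Return (re, im) of (a + jb)(c + jd) using three real multiplications."""
    p1 = a * c              # real multiplication 1
    p2 = b * d              # real multiplication 2
    p3 = (a + b) * (c + d)  # real multiplication 3, preceded by two additions
    re = p1 - p2            # one addition (subtraction)
    im = p3 - p1 - p2       # two additions (subtractions)
    return re, im


# Compare against Python's built-in complex arithmetic.
re, im = gauss_complex_mul(1, 2, 3, 4)
assert complex(re, im) == (1 + 2j) * (3 + 4j)
```

The sketch uses three multiplications and five additions in total, whereas the direct evaluation needs four multiplications and two additions.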
IV. THE ALGORITHM
First, we present the vector
in the following form:
and the vector
in the following form:
Next, we split the vector
containing only the even-numbered and only the odd-numbered elements, respectively:
Then, from the elements of the matrix, we form two super-vectors of data:
Now we introduce the vectors
Next, we introduce some auxiliary matrices:
where
where every element is equal to one), $\mathbf{I}_N$ is the $N \times N$ identity matrix, and the sign "⊗" denotes the tensor product of two matrices [18].
Using the above matrices, the rationalized computational procedure for calculating the constant matrix-vector product can be written as follows:
where the sign "⊕" denotes the direct sum of matrices, which are numbered in order of increasing superscript value [18].
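The idea behind the procedure, combining Winograd's inner product formula with Gauss's trick, can be sketched in plain Python. This is not the matrix factorization used in the paper. It is a scalar sketch under stated assumptions: N is even, indexing is zero-based, and the terms that depend only on the constant matrix are treated as precomputed and are therefore not counted. The names `gauss_mul` and `winograd_gauss_matvec` are hypothetical.

```python
def gauss_mul(u, v, counter):
    """Complex product u*v using three real multiplications (Gauss's trick)."""
    a, b, c, d = u.real, u.imag, v.real, v.imag
    p1, p2, p3 = a * c, b * d, (a + b) * (c + d)
    counter[0] += 3  # tally of real multiplications
    return complex(p1 - p2, p3 - p1 - p2)


def winograd_gauss_matvec(A, x):
    """Return (y, real_mults) for y = A x, where len(x) is even."""
    M, N = len(A), len(x)
    assert N % 2 == 0, "this sketch assumes an even number of columns"
    half = N // 2
    counter = [0]
    # Terms depending only on the constant matrix: assumed precomputed, not counted.
    c = [sum(row[2 * k] * row[2 * k + 1] for k in range(half)) for row in A]
    # Term depending only on the input vector: computed once, shared by all rows.
    xi = sum(gauss_mul(x[2 * k], x[2 * k + 1], counter) for k in range(half))
    y = []
    for m in range(M):
        acc = sum(
            gauss_mul(A[m][2 * k] + x[2 * k + 1], A[m][2 * k + 1] + x[2 * k], counter)
            for k in range(half)
        )
        y.append(acc - c[m] - xi)
    return y, counter[0]


A = [
    [1 + 2j, 3 - 1j, 0 + 2j, 4 + 0j],
    [2 - 1j, 1 + 1j, -1 + 3j, 5 - 2j],
    [0 + 1j, 2 + 2j, 3 + 0j, 1 - 1j],
]
x = [1 + 1j, 2 - 3j, -1 + 2j, 0 + 4j]

y, mults = winograd_gauss_matvec(A, x)
direct = [sum(a * b for a, b in zip(row, x)) for row in A]
assert y == direct                                  # same result as the schoolbook product
assert mults == 3 * (len(x) // 2) * (len(A) + 1)    # real multiplications in this sketch
```

In this sketch each output element needs N/2 complex products, plus N/2 products shared by all rows, and every complex product costs three real multiplications.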
If the elements of 
If the elements of
The data flow diagram for the proposed algorithm is shown in Figure 1. In turn, Figure 2 shows a data flow diagram for computing the elements of the matrix $\mathbf{D}_{3MN/2}$ in accordance with procedure (6). In this paper, the data flow diagrams are oriented from left to right. Note [13] [14] [15]
V. DISCUSSION OF HARDWARE COMPLEXITY
We now calculate how many multipliers and adders are required and compare this with the number required for a fully parallel naïve implementation of the complex-valued matrix-vector product in Eq. (1). The number of conventional two-input multipliers required by the proposed algorithm is
. Thus, using the proposed algorithm, the number of multipliers needed to implement the complex-valued constant matrix-vector product is drastically reduced. Additionally, our algorithm requires
one-input adders with constant numbers (ordinary encoders),
conventional two-input adders, and
Instead of encoders, we can apply ordinary two-input adders. Then the implementation of the algorithm will require
two-input signed adders and
In turn, the number of conventional two-input multipliers required for a fully parallel implementation of the "schoolbook" method for complex-valued matrix-vector multiplication is $4MN$. This implementation also requires $2M$ $N$-input adders and $2MN$ two-input adders. Thus, our proposed algorithm saves 50 percent or more of the two-input embedded multipliers, but it significantly increases the number of adders compared with the direct fully parallel implementation. For applications where the "cost" of a multiplication is greater than that of an addition, the new algorithm is always more computationally efficient than direct evaluation of the matrix-vector product. This allows us to conclude that the suggested solution may be useful in a number of cases and may find practical application in minimizing the hardware implementation cost of a complex-valued constant matrix-vector multiplier.
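The multiplier comparison can be explored numerically with a short Python sketch. The schoolbook count $4MN$ is the one stated above. The count used for the rationalized scheme, $3N(M+1)/2$, is an assumption: it is what follows from combining Winograd's inner product formula with Gauss's trick when all terms that depend only on the constant matrix are precomputed and N is even, and it is not quoted from the paper's formulas. All function names are hypothetical.

```python
def schoolbook_multipliers(M, N):
    """Real multipliers for the fully parallel schoolbook method (4MN)."""
    return 4 * M * N


def sketch_multipliers(M, N):
    """Assumed count for a Winograd + Gauss scheme with even N: 3 * (N/2) * (M + 1)."""
    if N % 2:
        raise ValueError("this sketch assumes an even N")
    return 3 * (N // 2) * (M + 1)


def saving_percent(M, N):
    """Percentage of real multipliers saved relative to the schoolbook method."""
    direct = schoolbook_multipliers(M, N)
    return 100.0 * (direct - sketch_multipliers(M, N)) / direct


for M, N in [(3, 4), (8, 8), (16, 16)]:
    print(M, N, schoolbook_multipliers(M, N), sketch_multipliers(M, N),
          round(saving_percent(M, N), 1))
```

Under this assumed count the saving equals $1 - 3(M+1)/(8M)$, which reaches 50 percent at $M = 3$ and approaches 62.5 percent as $M$ grows.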
VI. CONCLUDING REMARKS
The article presents a new hardware-oriented algorithm for computing the complex-valued constant matrix-vector product. To reduce the hardware complexity (the number of two-operand multipliers), we exploit Winograd's inner product formula and Gauss's trick for complex number multiplication. This allows effective parallelization of the computations on the one hand and reduces the hardware implementation cost of a complex-valued constant matrix-vector multiplier on the other.
Even if the FPGA chip contains embedded multipliers, their number is always limited. This means that if the implemented algorithm contains a large number of multiplications, the developed processor may not fit into the chip. Implementing the algorithm proposed in this paper on FPGA chips that have built-in binary multipliers therefore saves a number of these blocks, or allows the whole complex-valued constant matrix-vector multiplying unit to be realized with a smaller number of simpler and cheaper FPGA chips. It makes it possible to design data processing units using chips that contain the minimum required number of embedded multipliers and thereby consume and dissipate the least power. How to implement a fully parallel complex-valued constant matrix-vector multiplier on a concrete FPGA platform is beyond the scope of this article and is a subject for follow-up articles.
