A Hardware-oriented Algorithm for Complex-valued Constant Matrix-vector
Multiplication
Aleksandr Cariow, Galina Cariowa
West Pomeranian University of Technology, Szczecin, 720229, Poland
atariov@wi.zut.edu.pl

Abstract— In this communication we present a hardware-oriented
algorithm for computing a constant matrix-vector product when all
elements of the vector and the matrix are complex numbers. The main
idea behind our algorithm is to combine the advantages of Winograd's
inner product formula with Gauss's trick for complex number
multiplication. Compared with the naïve method, the proposed algorithm
drastically reduces the number of multipliers required for an FPGA
implementation of complex-valued constant matrix-vector
multiplication. Whereas a fully parallel hardware implementation of
the naïve (schoolbook) method for complex-valued matrix-vector
multiplication requires 4MN multipliers, 2M N-input adders and 2MN
two-input adders, the proposed algorithm requires only 3N(M+1)/2
multipliers, 3M(N+2)+1.5N+2 two-input adders and 3(M+1) (N/2)-input
adders.
 
Index Terms—algorithm design and analysis, signal 
processing algorithms, digital signal processing chips, high 
performance computing. 
I. INTRODUCTION 
Most of the computation algorithms used in digital signal, image and
video processing, computer graphics and vision, and high-performance
supercomputing applications have matrix-vector multiplication as the
kernel operation [1, 2]. For this reason, numerous publications are
devoted to the rationalization of these operations [3-18]. In some
cases, the elements of the multiplied matrices and vectors are complex
numbers [5-9]. In the general case, a fully parallel hardware
implementation of a rectangular complex-valued matrix-vector
multiplication requires MN multipliers of complex numbers. When the
matrix elements are constants, encoders can be used instead of
multipliers. This solution greatly simplifies the implementation,
reduces power dissipation and lowers the price of the device. On the
other hand, when we are dealing with FPGA chips that contain several
tens or even hundreds of embedded multipliers, building additional
encoders instead of using these multipliers is irrational. Examples
are the Xilinx Spartan-3 family of FPGAs, which includes between 4 and
104 18×18 on-chip multipliers, and the Altera Cyclone-III family of
FPGAs, which includes between 23 and 396 18×18 on-chip multipliers.
Another family, Altera's Stratix-V GS, has between 600 and 1963
variable-precision on-chip blocks optimized for 27×27-bit
multiplication. In this case, it would be unreasonable not to use the
embedded multipliers. Nevertheless, the number of on-chip multipliers
is always limited, and it may sometimes be insufficient to implement a
high-speed fully parallel matrix-vector multiplier. Therefore, finding
ways to reduce the number of multipliers in the implementation of a
matrix-vector multiplier is an extremely urgent task. Some interesting
solutions related to the rationalization of complex-valued
matrix-matrix and matrix-vector multiplications have already been
obtained [10-13]. There are also original and effective algorithms for
constant matrix-vector multiplication. However, a rationalized
algorithm for complex-valued constant matrix-vector multiplication has
not yet been published. For this reason, we propose such an algorithm
in this paper.
II. PRELIMINARY REMARKS 
The complex-valued matrix-vector product may be defined as:
$$Y_{M\times 1} = A_{M\times N} X_{N\times 1} \tag{1}$$
where $X_{N\times 1} = [x_0, x_1, \ldots, x_{N-1}]^T$ is the
$N$-dimensional complex-valued input vector,
$Y_{M\times 1} = [y_0, y_1, \ldots, y_{M-1}]^T$ is the $M$-dimensional
complex-valued output vector, and
$$A_{M\times N} = \begin{bmatrix}
a_{0,0} & a_{0,1} & \cdots & a_{0,N-1} \\
a_{1,0} & a_{1,1} & \cdots & a_{1,N-1} \\
\vdots & \vdots & \ddots & \vdots \\
a_{M-1,0} & a_{M-1,1} & \cdots & a_{M-1,N-1}
\end{bmatrix},$$
where $n = 0, 1, \ldots, N-1$, $m = 0, 1, \ldots, M-1$, and
$x_n = x_n^{(r)} + j x_n^{(i)}$,
$a_{m,n} = a_{m,n}^{(r)} + j a_{m,n}^{(i)}$,
$y_m = y_m^{(r)} + j y_m^{(i)}$.
In these expressions $x_n^{(r)}$, $x_n^{(i)}$, $y_m^{(r)}$, $y_m^{(i)}$
are real variables, $a_{m,n}^{(r)}$, $a_{m,n}^{(i)}$ are real
constants, and $j$ is the imaginary unit satisfying $j^2 = -1$. The
superscript $r$ denotes the real part of a complex number, and the
superscript $i$ denotes the imaginary part. The task is to calculate
the product defined by expression (1) with minimal multiplicative
complexity.
III. BRIEF BACKGROUND 
It is well known that complex multiplication requires four real
multiplications and two real additions, because:
$$(a + jb)(c + jd) = ac - bd + j(ad + bc) \tag{2}$$
So, the direct computation of (1) requires $MN$ complex
multiplications ($4MN$ real multiplications) and $2M(2N-1)$ real
additions.
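The operation counts above can be checked with a minimal sketch of the direct method. The function names `complex_mul_4` and `naive_matvec` are illustrative and not from the paper; NumPy is assumed only as a convenient container and for the reference result.

```python
import numpy as np

def complex_mul_4(a, b, c, d):
    """(a + jb)(c + jd) as in eq. (2): 4 real multiplications, 2 real additions."""
    return a * c - b * d, a * d + b * c

def naive_matvec(A, x):
    """Direct evaluation of Y = A X on real and imaginary parts.

    Returns (y, mults, adds), where mults and adds count real operations.
    """
    M, N = A.shape
    y = np.zeros(M, dtype=complex)
    mults = adds = 0
    for m in range(M):
        acc_r = acc_i = 0.0
        for n in range(N):
            p_r, p_i = complex_mul_4(A[m, n].real, A[m, n].imag,
                                     x[n].real, x[n].imag)
            mults += 4
            adds += 2            # one subtraction and one addition inside eq. (2)
            if n == 0:
                acc_r, acc_i = p_r, p_i
            else:
                acc_r += p_r
                acc_i += p_i
                adds += 2        # accumulate real and imaginary parts
        y[m] = complex(acc_r, acc_i)
    return y, mults, adds
```

For $M = 3$ and $N = 4$ this counts 48 real multiplications and 42 real additions, matching $4MN$ and $2M(2N-1)$.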
According to Winograd's formula for inner product calculation, each
element of the vector $Y_{M\times 1}$ can be calculated as follows [16]:
$$y_m = \sum_{k=0}^{N/2-1} \left[(a_{m,2k} + x_{2k+1})(a_{m,2k+1} + x_{2k})\right] - c_m - \xi_N \tag{3}$$
where
$$c_m = \sum_{k=0}^{N/2-1} a_{m,2k} \cdot a_{m,2k+1} \quad\text{and}\quad \xi_N = \sum_{k=0}^{N/2-1} x_{2k} \cdot x_{2k+1}$$
if $N$ is even. (The case of odd $N$ is not considered here, as it can
easily be reduced to the case of even $N$.) It is clear that if we are
dealing with complex-valued data, then
$c_m = c_m^{(r)} + j c_m^{(i)}$ and
$\xi_N = \xi_N^{(r)} + j \xi_N^{(i)}$, where $\xi_N^{(r)}$ and
$\xi_N^{(i)}$ are the real and imaginary parts of the computed variable
$\xi_N$, and $c_m^{(r)}$ and $c_m^{(i)}$ are the real and imaginary
parts of the precomputed constants $c_m$. It should be emphasized that
because the $a_{m,n}$ are constants, the $c_m$ can be precomputed and
stored in a lookup table. Thus, the calculation of $c_m$ does not
require any arithmetic operations during the execution of the
algorithm. The calculation of $\xi_N$ requires $N/2$ complex
multiplications and is performed once for all $m$. Therefore, the
computation of (3) for all $m$ requires only $N(M+1)/2$ complex
multiplications ($2N(M+1)$ real multiplications). However, the number
of real additions in this case is significantly increased.
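As a minimal sketch of formula (3), assuming even $N$ and NumPy complex arrays, the following code precomputes $c_m$ from the constant matrix and evaluates all $y_m$ with $N(M+1)/2$ complex multiplications. The function names are illustrative only.

```python
import numpy as np

def winograd_precompute(A):
    """c_m = sum_k a[m, 2k] * a[m, 2k+1]; depends only on the constant matrix."""
    return (A[:, 0::2] * A[:, 1::2]).sum(axis=1)

def winograd_matvec(A, x, c=None):
    """Evaluate Y = A X with Winograd's inner product formula (N must be even)."""
    M, N = A.shape
    if N % 2 != 0:
        raise ValueError("N must be even")
    if c is None:
        c = winograd_precompute(A)
    # xi_N: N/2 complex multiplications, shared by all M rows
    xi = np.sum(x[0::2] * x[1::2])
    # (a[m,2k] + x[2k+1]) * (a[m,2k+1] + x[2k]): M*N/2 complex multiplications
    prods = (A[:, 0::2] + x[1::2]) * (A[:, 1::2] + x[0::2])
    return prods.sum(axis=1) - c - xi
```

In a constant-matrix setting, `winograd_precompute` would be called once and its result reused for every input vector.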
It is also well known that complex multiplication can be carried out
using only three real multiplications and five real additions,
because [13]:
$$(a + jb)(c + jd) = ac - bd + j[(a + b)(c + d) - ac - bd] \tag{4}$$
Expression (4) is known as Gauss's trick for the multiplication of
complex numbers [17]. Taking this trick into account, expression (3)
can be calculated using only $3N(M+1)/2$ multiplications of real
numbers, at the expense of a further increase in the number of real
additions.
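Expression (4) can be illustrated with a short sketch; the function name `gauss_mul` is an illustrative choice.

```python
def gauss_mul(a, b, c, d):
    """(a + jb)(c + jd) with 3 real multiplications and 5 real additions (eq. (4))."""
    ac = a * c                    # multiplication 1
    bd = b * d                    # multiplication 2
    s = (a + b) * (c + d)         # multiplication 3 (plus 2 additions)
    return ac - bd, s - ac - bd   # 3 more additions/subtractions

# (1 + 2j)(3 + 4j) = -5 + 10j
print(gauss_mul(1, 2, 3, 4))  # (-5, 10)
```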
IV. THE ALGORITHM 
First, we present the vector
$X_{N\times 1} = [x_0, x_1, \ldots, x_{N-1}]^T$ in the following form:
$$X_{2N\times 1} = [x_0^{(r)}, x_0^{(i)}, x_1^{(r)}, x_1^{(i)}, \ldots, x_{N-1}^{(r)}, x_{N-1}^{(i)}]^T,$$
and the vector $Y_{M\times 1} = [y_0, y_1, \ldots, y_{M-1}]^T$ in the
following form:
$$Y_{2M\times 1} = [y_0^{(r)}, y_0^{(i)}, y_1^{(r)}, y_1^{(i)}, \ldots, y_{M-1}^{(r)}, y_{M-1}^{(i)}]^T.$$
Next, we split the vector $X_{2N\times 1}$ into two vectors
$X^{(1)}_{N\times 1}$ and $X^{(2)}_{N\times 1}$ containing the
components of only the even-indexed and only the odd-indexed elements
of $X_{N\times 1}$, respectively:
$$X^{(1)}_{N\times 1} = [x_0^{(r)}, x_0^{(i)}, x_2^{(r)}, x_2^{(i)}, \ldots, x_{N-2}^{(r)}, x_{N-2}^{(i)}]^T,$$
$$X^{(2)}_{N\times 1} = [x_1^{(r)}, x_1^{(i)}, x_3^{(r)}, x_3^{(i)}, \ldots, x_{N-1}^{(r)}, x_{N-1}^{(i)}]^T.$$
Then from the elements of the matrix $A_{M\times N}$ we form two
super-vectors of data:
$$A^{(1)}_{MN\times 1} = [\hat{A}^{(0)}_{2M\times 1}, \hat{A}^{(1)}_{2M\times 1}, \ldots, \hat{A}^{(N/2-1)}_{2M\times 1}]^T,$$
$$A^{(2)}_{MN\times 1} = [\breve{A}^{(0)}_{2M\times 1}, \breve{A}^{(1)}_{2M\times 1}, \ldots, \breve{A}^{(N/2-1)}_{2M\times 1}]^T,$$
where
$$\hat{A}^{(k)}_{2M\times 1} = [a_{0,2k+1}^{(r)}, a_{0,2k+1}^{(i)}, a_{1,2k+1}^{(r)}, a_{1,2k+1}^{(i)}, \ldots, a_{M-1,2k+1}^{(r)}, a_{M-1,2k+1}^{(i)}]^T,$$
$$\breve{A}^{(k)}_{2M\times 1} = [a_{0,2k}^{(r)}, a_{0,2k}^{(i)}, a_{1,2k}^{(r)}, a_{1,2k}^{(i)}, \ldots, a_{M-1,2k}^{(r)}, a_{M-1,2k}^{(i)}]^T.$$
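Now we introduce the vectors
$$C_{2M\times 1} = [c_0^{(r)}, c_0^{(i)}, c_1^{(r)}, c_1^{(i)}, \ldots, c_{M-1}^{(r)}, c_{M-1}^{(i)}]^T,$$
$$\Xi_{2M\times 1} = [\xi_N^{(r)}, \xi_N^{(i)}, \xi_N^{(r)}, \xi_N^{(i)}, \ldots, \xi_N^{(r)}, \xi_N^{(i)}]^T.$$
Next, we introduce some auxiliary matrices:
$$P_{MN\times N} = I_{N/2} \otimes (\mathbf{1}_{M\times 1} \otimes I_2), \quad \breve{T}_{\frac{3MN}{2}\times MN} = I_{\frac{MN}{2}} \otimes T_{3\times 2},$$
$$\Sigma_{3M\times\frac{3MN}{2}} = \mathbf{1}_{1\times\frac{N}{2}} \otimes (I_M \otimes I_3), \quad \hat{T}_{2M\times 3M} = I_M \otimes T_{2\times 3},$$
$$T_{3\times 2} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 1 & -1 \end{bmatrix}, \quad T_{2\times 3} = \begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \end{bmatrix},$$
where $\mathbf{1}_{M\times N}$ is an $M\times N$ matrix of ones (a
matrix where every element is equal to one), $I_N$ is the $N\times N$
identity matrix, and the sign $\otimes$ denotes the tensor product of
two matrices [18].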
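Using the above matrices, the rationalized computational procedure for
calculating the constant matrix-vector product can be written as
follows:
$$Y_{2M\times 1} = \hat{T}_{2M\times 3M}\{\Sigma_{3M\times\frac{3MN}{2}}[D_{\frac{3MN}{2}} \breve{T}_{\frac{3MN}{2}\times MN}(A^{(1)}_{MN\times 1} + P_{MN\times N} X^{(1)}_{N\times 1})]\} - C_{2M\times 1} - \Xi_{2M\times 1} \tag{5}$$
where, in agreement with (3), the vectors $C_{2M\times 1}$ and
$\Xi_{2M\times 1}$ are subtracted, and
$$D_{\frac{3MN}{2}} = \bigoplus_{l=0}^{\frac{MN}{2}-1} D_3^{(l)}, \quad D_3^{(l)} = \mathrm{diag}(s_0^{(l)}, s_1^{(l)}, s_2^{(l)}),$$
where the sign $\oplus$ denotes the direct sum of matrices, which are
numbered in order of increasing superscript value [18].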
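If the elements of $D_{\frac{3MN}{2}}$ are placed vertically without
disturbing their order and written in the form of the vector
$S_{\frac{3MN}{2}\times 1} = D_{\frac{3MN}{2}} \mathbf{1}_{\frac{3MN}{2}\times 1}$,
then they can be calculated using the following matrix-vector
procedure:
$$S_{\frac{3MN}{2}\times 1} = \tilde{T}_{\frac{3MN}{2}\times MN}(A^{(2)}_{MN\times 1} + P_{MN\times N} X^{(2)}_{N\times 1}) \tag{6}$$
$$\tilde{T}_{\frac{3MN}{2}\times MN} = I_{\frac{MN}{2}} \otimes \tilde{T}_{3\times 2}, \quad \tilde{T}_{3\times 2} = \begin{bmatrix} 1 & -1 \\ 1 & 1 \\ 0 & 1 \end{bmatrix}.$$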
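The role of the three small matrices can be illustrated with a short sketch. The entries used below are an assumption: they are one assignment of ones, zeros and minus signs for $T_{3\times 2}$, $\tilde{T}_{3\times 2}$ and $T_{2\times 3}$ under which the pre-additions, three real multiplications and post-additions reproduce a complex product. The resulting three-multiplication scheme is a variant of, not identical to, expression (4).

```python
import numpy as np

# Assumed entries (see the note above); zeros fill the blank positions.
T32 = np.array([[1, 0], [0, 1], [1, -1]])        # forms a, b, a - b
T32_TILDE = np.array([[1, -1], [1, 1], [0, 1]])  # forms c - d, c + d, d
T23 = np.array([[1, 0, 1], [0, 1, 1]])           # post-additions

def mul3(u, v):
    """Product of u = [a, b] ~ a + jb and v = [c, d] ~ c + jd with 3 real mults."""
    s = T32_TILDE @ v            # plays the role of the diagonal entries s_0, s_1, s_2
    return T23 @ (s * (T32 @ u))

# (1 + 2j)(3 + 4j) = -5 + 10j
print(mul3(np.array([1, 2]), np.array([3, 4])))  # [-5 10]
```

Written out, the three products are $a(c-d)$, $b(c+d)$ and $d(a-b)$; the real part is the sum of the first and third, and the imaginary part is the sum of the second and third.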
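As already noted, the elements of the vector $C_{2M\times 1}$ can be
calculated in advance. However, the elements of the vector
$\Xi_{2M\times 1}$ must be calculated during the execution of the
algorithm. The procedure for computing the elements of this vector can
be represented in the following form:
$$\Xi_{2M\times 1} = P_{2M\times 2} T_{2\times 3} \Sigma_{3\times\frac{3N}{2}} \Psi_{\frac{3N}{2}} \breve{T}_{\frac{3N}{2}\times N} X^{(1)}_{N\times 1} \tag{7}$$
where
$$\breve{T}_{\frac{3N}{2}\times N} = I_{\frac{N}{2}} \otimes T_{3\times 2}, \quad P_{2M\times 2} = \mathbf{1}_{M\times 1} \otimes I_2, \quad \Sigma_{3\times\frac{3N}{2}} = \mathbf{1}_{1\times\frac{N}{2}} \otimes I_3,$$
and
$$\Psi_{\frac{3N}{2}} = \bigoplus_{k=0}^{\frac{N}{2}-1} \Psi_3^{(k)}, \quad \Psi_3^{(k)} = \mathrm{diag}(\varepsilon_0^{(2k+1)}, \varepsilon_1^{(2k+1)}, \varepsilon_2^{(2k+1)}).$$
If the elements of $\Psi_{\frac{3N}{2}}$ are placed vertically without
disturbing their order and written in the form of the vector
$E_{\frac{3N}{2}\times 1} = \Psi_{\frac{3N}{2}} \mathbf{1}_{\frac{3N}{2}\times 1}$,
then they can be calculated using the following matrix-vector
procedure:
$$E_{\frac{3N}{2}\times 1} = \tilde{T}_{\frac{3N}{2}\times N} X^{(2)}_{N\times 1}, \quad \tilde{T}_{\frac{3N}{2}\times N} = I_{\frac{N}{2}} \otimes \tilde{T}_{3\times 2}.$$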
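Consider, for example, the case of $N = 4$ and $M = 3$. Then procedure
(5) takes the following form:
$$Y_{6\times 1} = \hat{T}_{6\times 9}\{\Sigma_{9\times 18}[D_{18} \breve{T}_{18\times 12}(A^{(1)}_{12\times 1} + P_{12\times 4} X^{(1)}_{4\times 1})]\} - C_{6\times 1} - \Xi_{6\times 1},$$
where
$$Y_{6\times 1} = [y_0^{(r)}, y_0^{(i)}, y_1^{(r)}, y_1^{(i)}, y_2^{(r)}, y_2^{(i)}]^T,$$
$$X^{(1)}_{4\times 1} = [x_0^{(r)}, x_0^{(i)}, x_2^{(r)}, x_2^{(i)}]^T, \quad X^{(2)}_{4\times 1} = [x_1^{(r)}, x_1^{(i)}, x_3^{(r)}, x_3^{(i)}]^T,$$
$$D_{18} = \bigoplus_{l=0}^{5} D_3^{(l)}, \quad D_3^{(l)} = \mathrm{diag}(s_0^{(l)}, s_1^{(l)}, s_2^{(l)}),$$
$$S_{18\times 1} = D_{18} \mathbf{1}_{18\times 1} = \tilde{T}_{18\times 12}(A^{(2)}_{12\times 1} + P_{12\times 4} X^{(2)}_{4\times 1}),$$
$$A^{(1)}_{12\times 1} = [\hat{A}^{(0)}_{6\times 1}, \hat{A}^{(1)}_{6\times 1}]^T, \quad A^{(2)}_{12\times 1} = [\breve{A}^{(0)}_{6\times 1}, \breve{A}^{(1)}_{6\times 1}]^T,$$
$$\hat{A}^{(0)}_{6\times 1} = [a_{0,1}^{(r)}, a_{0,1}^{(i)}, a_{1,1}^{(r)}, a_{1,1}^{(i)}, a_{2,1}^{(r)}, a_{2,1}^{(i)}]^T,$$
$$\hat{A}^{(1)}_{6\times 1} = [a_{0,3}^{(r)}, a_{0,3}^{(i)}, a_{1,3}^{(r)}, a_{1,3}^{(i)}, a_{2,3}^{(r)}, a_{2,3}^{(i)}]^T,$$
$$\breve{A}^{(0)}_{6\times 1} = [a_{0,0}^{(r)}, a_{0,0}^{(i)}, a_{1,0}^{(r)}, a_{1,0}^{(i)}, a_{2,0}^{(r)}, a_{2,0}^{(i)}]^T,$$
$$\breve{A}^{(1)}_{6\times 1} = [a_{0,2}^{(r)}, a_{0,2}^{(i)}, a_{1,2}^{(r)}, a_{1,2}^{(i)}, a_{2,2}^{(r)}, a_{2,2}^{(i)}]^T,$$
$$C_{6\times 1} = [c_0^{(r)}, c_0^{(i)}, c_1^{(r)}, c_1^{(i)}, c_2^{(r)}, c_2^{(i)}]^T,$$
$$\Xi_{6\times 1} = [\xi_4^{(r)}, \xi_4^{(i)}, \xi_4^{(r)}, \xi_4^{(i)}, \xi_4^{(r)}, \xi_4^{(i)}]^T,$$
$$P_{12\times 4} = I_2 \otimes (\mathbf{1}_{3\times 1} \otimes I_2), \quad \Sigma_{9\times 18} = \mathbf{1}_{1\times 2} \otimes I_9,$$
$$\breve{T}_{18\times 12} = I_6 \otimes T_{3\times 2}, \quad \tilde{T}_{18\times 12} = I_6 \otimes \tilde{T}_{3\times 2}, \quad \hat{T}_{6\times 9} = I_3 \otimes T_{2\times 3}.$$
Procedure (7) and the vector of the elements of $\Psi_6$ take the form
$$\Xi_{6\times 1} = P_{6\times 2} T_{2\times 3} \Sigma_{3\times 6} \Psi_6 \breve{T}_{6\times 4} X^{(1)}_{4\times 1}, \quad E_{6\times 1} = \Psi_6 \mathbf{1}_{6\times 1} = \tilde{T}_{6\times 4} X^{(2)}_{4\times 1},$$
$$\breve{T}_{6\times 4} = I_2 \otimes T_{3\times 2}, \quad \tilde{T}_{6\times 4} = I_2 \otimes \tilde{T}_{3\times 2}, \quad P_{6\times 2} = \mathbf{1}_{3\times 1} \otimes I_2, \quad \Sigma_{3\times 6} = \mathbf{1}_{1\times 2} \otimes I_3,$$
and
$$\Psi_6 = \bigoplus_{k=0}^{1} \Psi_3^{(k)}, \quad \Psi_3^{(k)} = \mathrm{diag}(\varepsilon_0^{(2k+1)}, \varepsilon_1^{(2k+1)}, \varepsilon_2^{(2k+1)}).$$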
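The matrix-vector procedures (5), (6) and (7) can be prototyped numerically as a sanity check. The sketch below is a software model, not a hardware description, and it rests on stated assumptions: the small matrices use one assignment of entries under which the three-multiplication scheme is exact, the vectors $C$ and $\Xi$ are subtracted as in (3), and all function and variable names are illustrative.

```python
import numpy as np

# Assumed entries of the small matrices (zeros fill the blank positions).
T32 = np.array([[1., 0.], [0., 1.], [1., -1.]])
T32_TILDE = np.array([[1., -1.], [1., 1.], [0., 1.]])
T23 = np.array([[1., 0., 1.], [0., 1., 1.]])

def interleave(z):
    """Complex vector [z0, z1, ...] -> real vector [z0.re, z0.im, z1.re, z1.im, ...]."""
    return np.column_stack([z.real, z.imag]).ravel()

def fast_matvec(A, x):
    """Y = A X via procedures (5)-(7); A is M x N complex with even N."""
    M, N = A.shape
    h = N // 2
    eye, ones = np.eye, np.ones
    # Constant data, ordered k-major and m-minor as in the super-vectors
    A1 = interleave(A[:, 1::2].T.ravel())          # elements a[m, 2k+1]
    A2 = interleave(A[:, 0::2].T.ravel())          # elements a[m, 2k]
    C = interleave((A[:, 0::2] * A[:, 1::2]).sum(axis=1))  # precomputed c_m
    # Input data
    X1 = interleave(x[0::2])                       # even-indexed elements
    X2 = interleave(x[1::2])                       # odd-indexed elements
    # Structural matrices
    P = np.kron(eye(h), np.kron(ones((M, 1)), eye(2)))     # MN x N
    T_breve = np.kron(eye(M * h), T32)                      # 3MN/2 x MN
    T_tilde = np.kron(eye(M * h), T32_TILDE)                # 3MN/2 x MN
    Sigma = np.kron(ones((1, h)), eye(3 * M))               # 3M x 3MN/2
    T_hat = np.kron(eye(M), T23)                            # 2M x 3M
    # Procedure (6): diagonal entries of D, stored as a vector
    S = T_tilde @ (A2 + P @ X2)
    # Procedure (7): xi_N, replicated M times
    E = np.kron(eye(h), T32_TILDE) @ X2
    xi_pair = T23 @ np.kron(ones((1, h)), eye(3)) @ (E * (np.kron(eye(h), T32) @ X1))
    Xi = np.kron(ones((M, 1)), eye(2)) @ xi_pair
    # Procedure (5): 3MN/2 real multiplications in the elementwise product
    Y = T_hat @ (Sigma @ (S * (T_breve @ (A1 + P @ X1)))) - C - Xi
    return Y[0::2] + 1j * Y[1::2]
```

The only data-dependent multiplications are the two elementwise products, of lengths $3MN/2$ and $3N/2$, which together give the $3N(M+1)/2$ real multiplications stated in the text. For the example with $N = 4$ and $M = 3$, the result can be compared with `A @ x`.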
The data flow diagram for the realization of the proposed algorithm is
illustrated in Figure 1. In turn, Figure 2 shows a data flow diagram
for computing the elements of the matrix $D_{3MN/2}$ in accordance
with procedure (6). In this paper, the data flow diagrams are oriented
from left to right. Note [13-15] that the circles in these figures
denote multiplication by the real number (variable) inscribed inside
the circle. Rectangles denote real additions with the values inscribed
inside the rectangle. Straight lines in the figures denote data
transfer. At points where lines converge, the data are summed. (Dashed
lines indicate subtraction.) We deliberately use plain lines without
arrows so as not to clutter the picture. Figure 3a shows a data flow
diagram for computing the elements of the vector $\Xi_{2M\times 1}$ in
accordance with procedure (7). In turn, Figure 3b shows a data flow
diagram for computing the elements of the diagonal matrix
$\Psi_{3N/2}$.
 
 
Figure 1. Data flow diagram for rationalized complex-valued constant 
matrix-vector multiplication algorithm for N=4, M=3. 
V. DISCUSSION OF HARDWARE COMPLEXITY 
We now calculate how many multipliers and adders are required and
compare this with the number required for a fully parallel naïve
implementation of the complex-valued matrix-vector product in Eq. (1).
The number of conventional two-input multipliers required by the
proposed algorithm is $3N(M+1)/2$. Thus the proposed algorithm
drastically reduces the number of multipliers needed to implement the
complex-valued constant matrix-vector product. Additionally, our
algorithm requires $2M(N+1)$ one-input adders with constant numbers
(ordinary encoders), $M(N+4)+1.5N+2$ conventional two-input adders,
and $3(M+1)$ $(N/2)$-input adders. Instead of encoders we can use
ordinary two-input adders. Then the implementation of the algorithm
requires $3N(M+1)/2$ multipliers, $3M(N+2)+1.5N+2$ two-input signed
adders and $3(M+1)$ $(N/2)$-input adders.
In turn, the number of conventional two-input multipliers required by
a fully parallel implementation of the "schoolbook" method for
complex-valued matrix-vector multiplication is $4MN$. This
implementation also requires $2M$ $N$-input adders and $2MN$ two-input
adders. Thus, the proposed algorithm saves 50 percent or more of the
two-input embedded multipliers when $M \ge 3$ (the ratio of the two
multiplier counts is $3(M+1)/(8M)$, which tends to $3/8$ as $M$
grows), but it significantly increases the number of adders compared
with the direct method of fully parallel implementation. For
applications where the "cost" of a multiplication is greater than that
of an addition, the new algorithm is always more computationally
efficient than direct evaluation of the matrix-vector product. This
allows us to conclude that the suggested solution may be useful in a
number of cases and has practical application, since it minimizes the
hardware implementation cost of a complex-valued constant
matrix-vector multiplier.
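The resource counts quoted in this section can be tabulated with a small helper. This is a sketch that simply evaluates the formulas from the text for even $N$; the function names and dictionary keys are illustrative.

```python
def naive_cost(M, N):
    """Resources of the fully parallel schoolbook method."""
    return {
        "multipliers": 4 * M * N,
        "n_input_adders": 2 * M,
        "two_input_adders": 2 * M * N,
    }

def proposed_cost(M, N):
    """Resources of the proposed algorithm (encoders replaced by two-input adders)."""
    if N % 2 != 0:
        raise ValueError("N must be even")
    return {
        "multipliers": 3 * N * (M + 1) // 2,
        "two_input_adders": 3 * M * (N + 2) + 3 * N // 2 + 2,
        "half_n_input_adders": 3 * (M + 1),
    }

# Example for N = 4, M = 3: 48 multipliers versus 24
print(naive_cost(3, 4)["multipliers"], proposed_cost(3, 4)["multipliers"])  # 48 24
```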
 
 
 
Figure 2. The data flow diagram for calculating elements of the diagonal
matrix $D_{3MN/2}$ for N=4, M=3.


Figure 3. The data flow diagrams for calculating elements of the vector
$\Xi_{6\times 1}$ (a), and for calculating elements of the diagonal
matrix $\Psi_6$ (b).
VI. CONCLUDING REMARKS  
This article presents a new hardware-oriented algorithm for computing
the complex-valued constant matrix-vector product. To reduce the
hardware complexity (the number of two-operand multipliers), we
exploit Winograd's inner product formula and Gauss's trick for complex
number multiplication. This allows effective parallelization of the
computations on the one hand and reduces the hardware implementation
cost of the complex-valued constant matrix-vector multiplier on the
other. Even if an FPGA chip contains embedded multipliers, their
number is always limited. This means that if the implemented algorithm
contains a large number of multiplications, the developed processor
may not always fit into the chip. Implementing the proposed algorithm
on FPGA chips with built-in binary multipliers therefore saves a
number of these blocks, or allows the whole complex-valued constant
matrix-vector multiplying unit to be realized with a smaller number of
simpler and cheaper FPGA chips. This makes it possible to design data
processing units using chips that contain the minimum required number
of embedded multipliers and thereby consume and dissipate the least
power. How to implement a fully parallel complex-valued constant
matrix-vector multiplier on a concrete FPGA platform is beyond the
scope of this article, but it is a subject for follow-up articles.
REFERENCES 
[1] R. E. Blahut, “Fast algorithms for digital signal processing”, Addison-
Wesley Publishing company, Inc. 1985. 
[2] W. K. Pratt. Digital Image Processing (Second Edition), John Wiley 
& Sons, New York, 1991. 
[3] N. Fujimoto, “Dense matrix-vector multiplication on the CUDA 
architecture,” Parallel Processing Letters, vol. 18, no. 4, pp. 511-530, 
2008. 
[4] S. M. Qasim, A. A. Telba and A. Y. Al Mazroo, “FPGA Design and 
Implementation of Matrix Multiplier Architectures for Image and 
Signal Processing”, IJCSNS International Journal of Computer 
Science and Network Security, vol.10, no.2, pp. 168-176, February 
2010. 
[5] A. T. Fam, “Efficient complex matrix multiplication”, IEEE 
Transactions on Computers, vol. 37, no. 7, pp. 877-879, 1988. 
[6] F. T. Connolly, A. E. Yagle, “Fast algorithms for complex matrix 
multiplication using surrogates”, IEEE Transactions on Acoustics, 
Speech and Signal Processing, vol. 37, no. 6, pp. 938 – 939, 1989. 
[7] E. Ollila, V. Koivunen, H. V. Poor, “Complex-valued signal 
processing — essential models, tools and statistics”, Information 
Theory and Applications Workshop, 6-11 Feb. 2011, pp. 1 – 10. 
[8] Li, Guoqiang, Liu, Liren, “Complex-valued matrix-vector 
multiplication using twos complement representation”. Optics 
Communications, v. 105, no. 3-4, pp. 161-166.  
[9] B. Barazesh, J. Michalina, and A. Picco “A VLSI signal processor 
with complex arithmetic capability”. IEEE Transactions on Circuits 
and Systems, 35(5), pp.495–505, May 1988. 
[10] O. Gustafsson, H. Ohlsson, and L. Wanhammar, “Low-complexity 
constant coefficient matrix multiplication using a minimum spanning 
tree approach”, Proceedings of the 6th Nordic Signal Processing 
Symposium (NORSIG 2004), June 9 - 11, Espoo, Finland, pp. 141-
144, 2004. 
[11] N. Boullis, A. Tisserand, “Some optimizations of hardware 
multiplication by constant matrices”, IEEE Transactions on 
Computers, 2005, vol. 54, no 10, pp. 1271 – 1282. 
[12] A. Kinane, V. Muresan, “Towards an optimized VLSI design 
algorithm for the constant matrix multiplication problem”. In: Proc. 
IEEE International Symposium on Circuits and Systems (ISCAS-
2006), pp. 5111 – 5114, 2006. 
[13] A. Cariow, G. Cariowa, “An algorithm for complex-valued vector-
matrix multiplication”. Electrical Review, R 88, no 10b, pp. 213-216, 
2012. 
[14] A. Cariow, G. Cariowa, “A rationalized algorithm for complex-valued 
inner product calculation”, Measurement Automation and Monitoring, 
no 7, pp. 674-676, 2012. 
[15] A. Ţariov. “Algorithmic aspects of computing rationalization in 
digital signal processing”, West Pomeranian University Press, 2011, 
(in Polish).  
[16] S. Winograd, “A new algorithm for inner Product”, IEEE 
Transactions on Computers, vol. C-17, no 7, pp. 693 – 694, 1968. 
[17] D. E. Knuth, “The Art of Computer Programming”, vol. 2,
Seminumerical Algorithms, Addison-Wesley, Reading, MA, USA, Second
Ed., 1981.
[18] P. A. Regalia and S. K. Mitra, “Kronecker Products, Unitary Matrices
and Signal Processing Applications”, SIAM Review, vol. 31, no. 4,
pp. 586-613, 1989.
