Matrix-vector multiplication on a fixed-size linear systolic array  by Milovanović, E.I et al.




PERGAMON Computers and Mathematics with Applications 40 (2000) 1189--1203 
www.elsevier.nl/locate/camwa 
Matr ix-Vector  Mult ip l icat ion 
on a F ixed-Size Linear Systol ic Array 
E. I. MILOVANOVIC, M. K. STOJCEV, N. M. NOVAKOVIC 
I. Z. MILOVANOVIC AND T. I. TOKIC 
Faculty of Electronics Engineering 
Beogradska 14, P.O. Box 73, 18000 Ni~, Serbia 
(Received October 1999; revised and accepted April 2000) 
Abst ract - -Th is  paper considers the multiplication ofmatrix A = (aik)n×n by vector b = (bk)n×l 
on the bidirectional linear systolic array (BLSA) comprised of p _~ In/2] processing elements. To 
accomplish this matrix, A is partitioned into quasi-diagonal blocks. Each block contains p quasi- 
diagonals. To avoid zero element insertion between successive iterations during the computation of 
the resulting vector if, we perform index transformation in the block matrices and vector 5". The index 
transformation can be described as perfect shuffle followed by the shifting. Besides, we propose an 
efficient hardware interface, called memory interface subsystem (MIS), located between the host and 
BLSA, which optimize memory access by elimination of extraneous main-memory operations. Then 
we evaluate the speedup and efficiency of the proposed matrix-vector multiplication algorithm. To 
estimate benefits obtained by introducing MIS, we compare host occupation with data transfer during 
matrix-vector multiplication on the BLSA without MIS and when it is involved. (~) 2000 Elsevier 
Science Ltd. All rights reserved. 
Keywords - -Mat r ix -vector  multiplication, Linear systolic arrays, Hardware synthesis. 
1. INTRODUCTION 
Complex embedded systems uch as those found in number of control, avionics, industr ia l ,  med- 
ical, automot ive,  and communicat ion equipment,  typical ly  consist out of heterogeneous mix of 
hardware blocks: processor cores, general purpose macro blocks, and appl icat ion specific archi- 
tectures [1-3]. Since embedded systems are usual ly implemented by processors and appl icat ion 
specific hardware,  the most common architecture in these systems can be character ized as one of 
coprocessing, i.e., a processor working in conj unction with dedicated hardware to deliver a specific 
appl icat ion.  These accelerator blocks are required to execute algor ithms of high complexi ty  such 
as convolut ion [4], correlat ion [5], FFT  [6], DCT/ IDCT [7], 1D and 2D fi ltering [8], matr ix-vector  
and matr ix -matr ix  mult ip l icat ion [9-11], etc. 
In this paper,  we will concentrate on: 
(i) the generat ion of an appl icat ion specific coprocessor architecture of type SA/PA  dedi-  
cated to compute several different signal processing and scientific a lgor i thms which uti l ise 
matr ix -vector  operations; 
0898-1221/00/$ - see front matter (~) 2000 Elsevier Science Ltd. All rights reserved. Typeset by A~S-TEX 
PII: S0898-1221 (00)00231-5 
1190 E.I. MILOVANOVI(; et al. 
(ii) host/coprocessor interface which define another architectural variable that strongly affects 
data transfer ate between the host and accelerator; and 
(iii) study on the performance of the final design. 
The proposed approach is limited to a specific class of coprocessor architectures, linear sys- 
tolic/processor array as one of the most popular and important regular array which includes 
several attractive features uch as bounded I/O, fault tolerance, minimal communication pat- 
tern, and modular extendability. The research reported in this paper focuses on the design where 
hardware-software partition is assumed to be fixed. Beginning with an initial hardware-software 
partition, we seek a design solution in order to maximize the speedup for a given application. 
Accordingly, in the rest of the paper we first propose an efficient algorithm for matrix-vector 
multiplication which improves the hardware utilization of the SA/PA, then we present a specific 
hardware interface, located between the host and SA/PA, intended for address transformation 
which optimizes memory access by elimination of extraneous main-memory operations. 
2. AN OVERVIEW OF THE RELATED WORK 
SAs and PAs consisting of large numbers of identical processing elements (PEs) have been 
popular accelerator candidates in VLSI/WSI technology. Parallel and pipelined processing ca- 
pabilities of SAs/PAs can provide very high computational throughput in real-time applications. 
Different interconnection topologies have different properties and different algorithms run best on 
different accelerator architectures. For instance, hexagonally connected meshes are used for LU 
decomposition, binary trees for sorting, torus for transitive closure, double trees for searching, 
linear arrays for convolution, correlation, FFT and matrix-vector multiplication, rectangular or 
hexagonal arrays for matrix multiplication, and so on [5]. This paper examines the problem of 
determining an efficient linear SA/PA as an accelerator architecture for matrix-vector multipli- 
cation. 
Historically, a group of researchers, headed by Kung, has introduced the systolic concept for 
parallel architectures [12,13] in the period from 1978 to 1982. The most interesting architectural 
concept hat inspired a lot of researchers for better design solutions due to its simple design, 
was the bidirectional linear systolic array (BLSA) since it is amenable for solving a variety of 
problems in engineering, scientific, and signal processing applications [14-16]. 
The array introduced in [13] used for matrix-vector multiplication was neither time nor space 
optimal [17]. Therefore, the problem how to optimize space and/or time parameters of this array 
has stimulated a considerable r search interest. Optimization of space and/or time parameters of
this array can be accomplished either through higher complexity of processing elements (PE) (see, 
for example, [16,18-20]), or by modifying the design procedure and keeping the PE's complexity 
intact (see, for example, [21 24]). 
When the size of problem is larger than the number or PEs available in the array, algorithm 
partitioning and time multiplexing of hardware resources must take place [25]. For several rea- 
sons, the partitioning of algorithms for SAs is not a simple problem. First, a poor allocation 
of computations to processors may lower the speedup factor. Also related to this problem is 
the amount of external storage, and the communication links, introduced by the partitioning. 
Among the papers considering this problem are [25,26]. The approach used in [25] to algorithm 
partitioning is to divide the index space into bands and to map those bands into the processor 
space. The authors have proved that partitioning and mapping of the iterative algorithms into 
array processors can be done by the same transformation function as for the nonpartitioned case 
if only an extra constraint is satisfied. However, the approach used in [25] does not minimize the 
processing time. 
Navarro et al. [26] proposed a method to transform dense matrices of any dimension into band 
matrices of the desired bandwidth, so that easy and adequate matching to the dimension of BLSA 
is achieved. The proposed transformation allows good utilization of the PEs in the BLSA. 
Linear Systolic Array 1191 
Most of the previous approaches to matrix-vector multiplication on the SAs view the design 
process independent to hardware implementation [13,16,18-20,23-26]. Also, it is usually assumed 
that communication operations are conducted using memory-mapped I /O by the host. However, 
memory-mapped I /O is often an inefficient mechanism for data transfer. More efficient methods, 
including dedicated evice drivers, and interconnection network consisting of pipeline registers, 
take care of moving data from main memory to accelerator architecture, and vice versa, without 
creating a communication bottleneck, but their relation to partitioning has not been articulated 
yet [5]. One of the main goals in hardware synthesis of the accelerator for matrix-vector multi- 
plication is the optimization of storage requirements. The intensive memory referencing of these 
behaviors necessitates the use of secondary storage (e.g., a memory system) since the primary 
store (e.g., PEs register storage) if sufficiently large enough, would be impractical. This mem- 
ory is explicitly addressed in a synthesized system by memory operations containing indexing 
functions. However, due to the bottleneck that a memory system represents, memory accessing 
operations must be effectively scheduled so as to optimize memory access. In this t)aper, we 
present a specific hardware for address generation which optimize memory access by elimination 
of extraneous main-memory operations. We consider the multiplication of matrix A = (aik),,×,. 
by vector b = (bk)nxl on the BLSA comprised of p _< [n/2] processing elements. Since p <__ n, 
matrix A is partitioned into quasi-diagonal b ocks. Each block matrix contains p quasidiagonals. 
The computation begins with multiplying the first block matrix with vector b which gives the first 
iteration of 5", i.e., ~.(1). In order to enable the second iteration to begin immediately, index trans- 
formation in the next block matrix and g(1) is performed. The transformation is a function of n 
and p. It can be described as a perfect shuffle followed by shifting. This tran. sformation enables 
that there is no null element insertion between the iterations, which decreases the processing 
time approximately two times compared to that in [26]. The memory system of the accelerator 
architecture is realized as dual-port RAMs. 
3. GLOBAL ARCHITECTURAL 
MODEL OF  THE ACCELERATOR 
Before we proceed with mathematical model and a discussion of the design methodolog?z, we 
summarize the assumed system architecture. Figure 1 shows the overall structure of the system. 
It is comprised of a host, which includes CPU, main memory, and I /O subsystem, and a hardware 
accelerator intended for matrix-vector multiplication. Two parts can be distinguished within the 
accelerator, the BLSA with p < [n/2] PEs which performs the computation, and a memory- 
interface subsystem (MIS) used as an efficient interconnection consisting of dual-port RAMs that 
take care of moving data to/from the BLSA without creating a communication bottleneck. The 
MIS is comprised of p dual-port RAMs, denoted as DPR-Ai (i = 0, 1, . . .  ,p - 1), one denoted as 
DPR_B, and one denoted as DPR_C. Dual-port RAMs are used for storing data elements prior 
to being fed into the BLSA. Namely, matrix A is stored into DPR_A~, i = 0 . . . .  ,p - 1, vector b 
into DPR_B, and all locations of DPR_C are set to zero. 
4. MODEL FOR MATRIX-VECTOR 
MULT IPL ICAT ION ON F IXED-S IZE  ARRAY 
We consider the computation of ~" -- Ab, where A is an n x n matrix, and b and ~" are two 
vectors of size n implemented on a fixed size BLSA. A crucial design goal is to attain a minimal 
execution time with a given number of PEs in the BLSA. 
In order to avoid data conflicts when minimizing the execution time, it is desirable that n is 
odd (see, for example, [24]). If not so, then zero entries should be added to the matrix A and 
1192 E.I .  MILOVANOVI(~ et al. 





Figure 1. Global structure of the accelerator architecture. 
vector b. To simplify the presentation, we introduce the following notation: 
S n, if n is odd, 
m / n+l ,  i fn i seven .  
Thus, in fact we consider the multiplication of matr ix  A of order m x m with vector b of order m x 1. 
Each element ci of the resulting vector ~" = Ab consists of m inner products. Since the BLSA 
has p _< Ira/2] PEs, it is obvious that  each element ci must recirculate through the array several 
t imes, i.e., for 
times. 
In the text that  follows, we introduce some useful notations that  will help us to describe data 
distr ibution during the computat ion of matr ix-vector product on fixed size BLSA. 
Let ~ = [x0.. .  Xm-1] r be an arbitrary vector of size m x 1, and k = [(m - 1)/2]. Denote with 
~,  = IX 0 Xk+l Xl Xk+2"'"  ]T (4.2) 
a vector obtained by the perfect shuffle of • = [x0.. .  Xk] T and 2" = [xk+l. . .  Xm-1] T. 
Denote with I (t) a shifting matr ix of order m x m, defined as 
I110 i1 0 0 1 .-.  i(t) = i(1) . / ( t - l ) ,  i(1) = - , t = 2 ,3 , . . .  0 0 . . .  
0 0 . . .  
I f  ~7 ---- Ix0 . . .  Xrn-1] T is an arbitrary vector, then denote with (x)t a vector obtained according 
to 
= (4 .3 )  
Now, we describe how the computat ion is performed on the p processor BLSA. The matr ix  A 
is part i t ioned into m quasidiagonals, ~ ,  of type 
di = [ao,i,al,(i+l)modm,... ,ak,(i+k)modm,... ,am-l,(m+i-1)modm] , k ---- 0 , . . .  ,m - 1, (4.4) 
Linear Systolic Array 1193 
for i = 0 , . . . ,  m - 1. Thus, for example d0 = [a0o, a l l , . . . ,  am-l,m-1] T, dl ---- [aol, a12, . . . ,  
am-2,m-1, am-l,0]T, ... , d~ra-1 - -[ao,m-l ,  a lo , . . .  ,am-l,m-2] T. 
Denote with A~ a block matrix of order m x m that contains p quasidiagonals, uch that A0 
contains the first p quasidiagonals ~,  i = 0, . . .  ,p - 1, A1 quasidiagonals ~,  i = p , . . .  2p - 1, etc. 
Block matrix A,_~, V = [m/p] contains quasidiagonals d~,_l)p+~, for i = 0, . . .  ,p - 1. If m/p is 
not an integer, then dm . . . . .  d~p-1 = 0. Now, we can write A as A = A0 + ""  + A,_  1, and 
accordingly, the matrix vector product 6 = Ab as 
"),-1 
2= Z A,6. 
i=0 
(4.5) 
We take equation (4.5) as a basis for the realization of matrix-vector multiplication on the fixed- 
size BLSA. The computation begins with A0b which gives the first iteration of 2, i.e., ff (1). A 
distribution of input data items for this iteration can be described by the formulas given in [24]. 
In order to enable the next iteration to begin immediately (i.e., to perform time optimization), 
index transformation i Ai and 2(0, i = 1 , . . . ,7 -  1 is performed. The transformation is a 
function of m and p. It can be described as perfect shuffle defined by (4.2) followed by shifting 
defined by (4.3). The distribution of matrix A elements at the beginning of each iteration can be 
described as follows: 
PEo (-- (do) 
PE., (-- 0 
PEp.~--- 0 
(4; (.; :;; +1 " " " " (~- l )p+l) i f -Dr 
0 " ' "  0 . (4 .1"  ) . (d ;p . , ' ) ] . . : ' "  (d'.fp_l)(?_l) r 
p~-I !-st iter. 2-rid iter. T- iter. 
(4.6) 
The shifting factor r is determined according to 
r = m - 2p. (4.7) 
Elements of the resulting vector 6 enter the array by PEo in the following distribution 
PE0~ 2~ I 2" I I ( )(,-1)r 
I1. iter. I 2. iter. I ' "1  7 iter. 
(4.8) 
where 5"* denotes transformation (4.2) and (C~r transformation (4.3). 
Elements of vector b always enter the array in b* distribution. Figure 2 shows the multiplication 
process on the BLSA with p = 2 PEs and m = 5. As can be seen, there are no zero elements 
between the iterations, which leads to minimization of execution time. 
5. HARDWARE STRUCTURE OF THE SYSTEM 
As we have already mentioned, the system for matrix-vector multiplication, given in Figure 
1 consists of host computer (CPU and main memory), BLSA, and MIS. The CPU and main 
memory are standard parts of each general purpose computer system, therefore we will not 
discuss them here. The BLSA consists of p < [m/2] identical PEs. Internal structure of a PE is 
shown in Figure 3. It is a simple multiply-add cell which consists of two latches, L_B and L_C, 
a multiplier, and an adder. The PE structure is also standard, so its hardware synthesis will be 
omitted. 
1194 E . I .  MILOVANOVI(~ et al. 
0 Oot A: ~_ , ao, 
0 (d,L a:, _~ 
............ 0 a4~ (d')2 
no3 .................... a ~o 
a a2o ) a02- ....... 
"'l a42 ~ --> a~, 1 
a,4 J (d3"), a,, ~ 
............. a3, a .  / (4~), 
a23 ................ a3o -' 
al  2 ~ a4 4 
a3, at," at, __> 
6" t;" l;" ............. a0, .... a,, do" 
0 ............... aoo 
b2 b4 b~ b3 bob2 b, b~ b3 bob2 b4 b~ b3 ~ Co G G c4 G G c~ c 4 c2 Co c~ c, ¢2 Co c3 
PEI PEo " ":- " ,~"--Txq-'~, ~ ..... c (e ) ,  
Figure 2. Data-flow schedule of input data  items during matr ix-vector mult ipl icat ion 





Figure 3. Internal structure of the PE. 
Special attention is devoted to the hardware synthesis of memory-interface-subsystem, MIS. 
Performance ofthe system pictured in Figure 1 is often determined not only by the characteristics 
of its data processing component, i.e., the BLSA, but also by how well it transfers required ata 
to the desired destination. Thus, MIS design becomes increasingly important. If this interface 
is not efficient, a significant decrease in system performance can result. Therefore, low latency 
and high bandwidth are two major functional design requirements that MIS has to fulfill. This 
section focuses on MIS hardware design. Design aspects by which we provide hardware assistance 
to latency reduction and increased ata transfer bandwidth are discussed. 
The structure of MIS is organized into three relatively independent parts: MIS_A, an interface 
for manipulation with matrix A, and MIS-B and MIS_C, interfaces for handling with vectors 
and ~, respectively. Concerning the design efforts, the most complex part is the one intended for 
manipulation with vector ~', i.e., MIS_C. 
5.1. The Structure of MIS_C 
The global structure of MIS_C is shown in Figure 4. It provides permutations of 5" elements 
according to (4.2) and (4.3). The main blocks of MIS_C are duM-port RAM, DPR_C, and address 
generators AG1 and AG2. 
C_in 
AG2 
Linear Systolic Array 
I 








Figure 4. Global structure of MIS_C. 
i j F 
\ ~_ j t '~  "~j +1 
1195 
mod m 
div2 ~ f mod2 
Adr 
Figure 5. Data flow diagram for address computation. 
Address generation for elements of vector 6 in each iteration is performed according to the 
following formula: 
Adr=((i+j*r) modm)mod2*  ( -~+l )+( ( i+ j , r )modm)d iv2 ,  (5.1) 
where j = 0 , . . . , '7  - 1 defines the iteration, and i = 0 , . . . ,  m - 1 defines time step within each 
iteration. Having in mind the features of the mod operator, we can rewrite (5.1) in the following 
way: 
Adr=( ( imodm+( j* r )  modm)  modm)  mod2.  - - + 1  (5.2) 
+ ((i mod m + ( j .  r) mod m) mod m) div 2. 
Data flow diagram for address generation according to (5.2) is shown in Figure 5. To implement 
the computation represented by data flow diagram in Figure 5, we need a standard multiplier, a 
modulo m multiplier, a barrel shifter, and one full adder. 
During hardware synthesis of AG1 and AG2, we take into account he following facts. 
- The first multiplication operation in (5.2), i.e., (r * j )  mod m, j = 0 , . . . ,~ / -  1, can b.e 
implemented as modulo m add-with-accumulation operation. 
- The div2 operation, i.e., the division by constant which is power of 2, on unsigned numbers 
such as addresses, is simply a logical shift of the bits to the right. Since in (5.2) the divisor 
1196 E.I. MILOVANOVIC et al. 
is a constant 2, the shift operation is a hard-wired shift. In short, there is no circuit 
involved, simply a rearrangement of the wires in the bus is required. 
The rood2 operation--the unsigned implementation is very simple. In fact, like unsigned 
division, no circuit is necessary at all. The modulus of a number is found by simply 
discarding the MS bits and keeping the LS bit which represents the range of the modulo 
arithmetic. This is effectively a masking operation in which the MS bits are masked off, 
leaving just the LS bit intact. 
- Having in mind that the result of mod2 operation is 0 or 1, the second multiplication 
in (5.2) can be implemented by a multiplexor. 
- In all operations that involve modulo arithmetics, both operands are less then modulus m. 
A block diagram of MIS_C is presented in Figure 6. The hardware structures of AG1 and AG2 
are identical, with one exception: the timing of AG2 is delayed for p time units in respect o 
AG1. Data are read at the right side of the DPR_C according to the addresses generated by AG1 
and fed into the BLSA through PE0. 
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  
~ Data ~ 
path 
; 
~l, sol . . . .  C .? . . . ! f . !un! t  . . . . . . . . . . . . . . . . .  
: . . . . . .  ' ~ A~r r- - -"- - -m BS IQ . . . . .  Q'IQ°P - - - - - -  
', AG2 ,At r. " SELECT] /~ : '~  [ Data 
, ~ +1 
_ : w  . . . . .  
R/~/Adrj ." LOGIC I D~al~ I I 
AGI 
........................................ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  
Figure 6. A detailed structure of MIS_C. 
Both read and write operations at the left side of DPR_C are allowed. The host performs write 
operation during the initialization of DPR_C, and read during result transfer. During matrix- 
vector multiplication by the BLSA, the output from PE(B-1) is written into DPR_C. The direction 
of data transfer is defined by a pair of control signals se l l  and sel_2 according to Table 1. 
Table 1. Control signals for AG2. 
sel_2 sel_l Transfer 
0 0 Host to DPR_C 
O 1 BLSA to DPR_C 
1 0 DPR_C to host 
1 1 set all locations in DPR_C to zero 
The AG1 logic consists of data path and control unit. Data path generates addresses according 
to the data flow diagram from Figure 5. Its constituents are: two modulo m adders, ADD1 and 
ADD2, one full adder, ADD3, a latch, L1, a barrel shifter, BS, and a multiplexor, MUX1. Blocks 
ADD1 and L1 perform modulo m add-with-accumulation operation. Initially, L1 is set to zero. 
Linear Systolic Array 1197 
At the beginning of each iteration the content of L1 is incremented for the shifting factor r. The 
ADD2 performs the modulo m addition. The BS performs two operations simultaneously: rood2, 
by keeping the right most bit position of the ADD2 output, and div2 by shifting the ADD2 
output one position to the right and appending zero to the left-most bit position. The result of 
rood2 operation is used as a select signal for MUX1. The MUX1 implements multiplication of 
term [(m - 1)/2] + 1 by 0 or 1. Finally, ADD3 generates the address for accessing data element 
in DPR_C. A detail concerning modulo m addition is depicted in Figure 7. 




- \  MUX 
~(a+b) mod m 
Figure 7, A structure of modulo m adder. 
The control unit of AG1 is composed of two counters, CCl- -modulo m counter, and CC2--  
modulo 7 counter, and decoding logic which generates all internal control signals. The control 
signal En defines the period during which control unit is active. Namely, En defines timing, i.e., 
the beginning and the end of active period. 
5.2. The  St ructure  of  M IS_A  
As can be seen from Figure 1, each PEi is associated with one dual port RAM, referred to as 
DPR_A~, i = 0, . . .  ,p -  1 for manipulation with matrix A. Each DPR_A~ has a capacity of m * 7 
words. Namely, each diagonal of matrix A consists of m elements, and 7 diagonals are stored 
in each DPR_A. We assume that row-major ordering scheme for storing matrix A in the main 
memory is used. A global structure of the memory subsystem for address manipulation with 
elements of matrix A is presented in Figure 8. It consists of two address generators, AG_A and 













A sel 0"~- -~ 





1198 E. I .  MILOVANOVIC et al. 
Address generator AG_A reads data from main memory and stores them into the DPR_Ai, 
i = 0 , . . . ,  p - 1. The AG_A consists of two parts: 
- AGA_H which reads a diagonal element fi'om the main memory and stores it in a temporary 
buffer, Temp_buff; 
- AGA_DPR which accepts element from Temp_buff and writes it into the corresponding 
DPR_Ai, i = 0, . . .  ,p - 1. Once a diagonal is loaded into a dual port RAM it acts as if it 
had logically adjacent elements. 
Block AGA_H generates addresses for accessing diagonal elements of matrix A stored in the 
main memory. Elements of the ith diagonal, i = 0 , . . . ,  m - 1, are read from the addresses 
Adr_diag = m*  k + (i + k) mod m, k = 0 , . . . ,m-  1. (5.3) 
To implement the computation given in (5.3) we need: a multiplier, a modulo m adder, and a full 
adder. The structure of AGA_H is shown in Figure 9. The first term in (5.3) is again implemented 
as add-with-accumulation operation using one full adder, SUM1, and one latch, L_A1. The term 
(i + k) mod m is obtained at the output of modulo m adder, denoted as SUM2, and finally, the 
address Adr_ diag for accessing elements of matrix A located in main memory is generated at the 
output of full adder SUM3. Table 2 gives details concerning address generation when m = 4. 
CLK 
k j 
.... ~ p 
 suM 7 
---["'L2=(i+k) 
L_A1 ~-  CLK 
L l=m*k  
mod m 
Adr_diag=m*k+(i+k) rood m 
Figure 9. A structure of AGA_H. 
A detailed structure of AGA_DPR and the corresponding select logic, SL, from Figure 8, are 
sketched in Figure 10. The AGA_DPR consists of three counters, K1, K2, and K3. Chip select 
signals for DPR-Ai, i = 0, . . .  ,p - 1, are generated at the output of the SL. Addresses for dual 
port RAMs are obtained by merging the outputs of counters K1 and K3. 
Data are fed into the BLSA from the DPR_A~, i = 0 , . . . ,p - ,  according to the addresses 
generated by the address generator AG_SA. The AG_SA performs perfect shuffle and shifting 
permutations over diagonal elements of matrix A during address generation. Its structure is 
similar to the structure of AG1 shown in Figure 6. The block Z -1 represented in Figure 8 
involves propagation delay of one clock period for the generated address. The multiplexor, MUX_i, 
controlled by A_sel_i, i = 0, . . .  ,p -  1, provides a mechanism to feed-in zero elements into the 
BLSA at the beginning and the end of the computation. 
Linear Systolic Array 1199 
Table 2. The AG_H address generation when m = 4. 
CA1 (k) 
oo (o) 
o 1 (1) 
1 0 (2) 
1 1 (3) 
oo (o) 
01 ( i)  
10 (2) 























(0) 00 (0) 
(0) 01 (1) 
(0) 10 (2) 
(0) 11 (3) 
(1) 01 (1) 
(1) 10 (2) 
(1) 1 1 (3) 
(1) O0 (0) 
(2) 10 (2) 
(2) i 1 (3) 
(2) oo (o) 
(2) Ol (1) 
(3) 1 1 (3) 
(3) O0 (0) 
(3) 01 ( i)  
























a l l  
a22 
a33 






































Figure 10. A structure of AGA_DPR and SL. 
5.3.  The  S t ruc ture  o f  MIS_B 
A global structure of the memory  interface subsystem for manipulat io n with elements of vector 
is given in Figure 11. As can be seen, the constituents of MIS_B are the following. 
- Address generator AG_B--generates the addresses for reading elements of vector b from 
the main memory and stores them into the dual port RAM, DPR_B. It performs the 
1200 E. I .  MILOVANOVIC et al. 
perfect shuffle permutation, so that data are stored in a perfect shuffle ordering in the 
DPR_B. The hardware structure of AG_B is identical with one of AG1 shown in Figure 6, 
except hat the value of shifting factor r is always set to zero. 
- Dual port RAM, DPR_B--holds elements of vector b. 
- Address generator, AGB_SA--generates addresses for reading data from DPR_B. In 
essence, it acts as modulo m counter. 
MIS B 
AG B DPR B AGB_ data BLSA 
I data ~ 
1 " 
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  
Figure 11. A global structure of MIS_B. 
6. DETERMIN ING THE CYCLE-T IME 
To estimate how the MIS affects the clock ratio of the overall system, comprised of BLSA 
and MIS, we analyze the duration of the PEs activity during one computational step (TpE), and 
the duration of address generation and data fetch performed by the MIS (TMIs). These two 
activities are overlapped in time in the proposed esign. Therefore the cycle-time of the system 
is determined as 
max {TpE, TMm}. 
Let us note that all parts of the MIS (i.e., MIS_A, MIS_B, and MIS_C) work in parallel. Having in 
mind that the data path of address generator AG1 (presented in Figure 6), is the most complex 
part of the MIS, we determine the cycle-time of the system as 
max {TpE, TMIS_C }. 
Analyzing the structures of PE (Figure 3) and AG1 (Figure 6), we conclude that 
TpE = 2 * TLATC H + TAD D -~ TMUL, 
TMIS_ C ---- 3 * TAD D ~- 2 * TLATC H -~- TMU X -~ TFETCH, 
where TADD, ~MUL, TLATCH, TMUX, and TFETC n stand for the propagation delay of adder, 
multiplier, latch, multiplexor, and data fetch, respectively. 
If we adopt the following normalized times [27,28]: 
TAD D = T, TMU L ---- 4T, TLATC s = 0.2T, TMux ---- 0.1T, TFETC n ---- T, 
we obtain 
TpE = 5.4T and TMm_c = 4.5T, 
which means that the cycle-time of the system is determined by the PE latency, i.e., the MIS 
does not introduce the additional delay in the system. 
7. PERFORMANCE ANALYS IS  AND DISCUSSION 
In the text that follows we will evaluate the following performance measures. 
- the speedup and efficiency of the matrix-vector multiplication algorithm implemented on 
the fixed-size BLSA with p < [m/2] PEs, according to the described procedure. These 
performance measures reflect he benefits of the proposed algorithm. 
- Q factor of the proposed memory interface subsystem, MIS. This performance measure 
enables us to estimate the possibilities of the MIS to cope with massive-high-speed data 
transfer to/from the BLSA. 
Linear Systolic Array 1201 
7.1. Speedup and Efficiency 
The total execution time of the described matrix-vector multiplication algorithm on the fixed- 
size BLSA is given by 
Ttot = 7m + 2p - 2, (7.1) 
where the initialization time takes Tin = p - 1 time units, and the active execution time Te×e = 
7m + P - 1. A duration of an add-multiply operation is taken as a time unit. If we assume that 
duration of matrix-vector multiplication on a uniprocessor system is T1 = m 2, then the speedup 
and efficiency of the BLSA are 
T 1 m 2 Sp m 2 
Sp= Ttot 7m+2p-2  and Ep , (7.2) p p(Tm + 2p - 2) 
respectively. 
According to (7.2) it can be concluded that for some fixed p the speedup and efficiency are 
increasing as the problem size increases. Thus, when m --+ cx~ (m >> p) we have that Sp ---+ p and 
Ep --* 1. Table 3 gives the efficiency in percentage for various values of m and p. 
Let us note that the total execution time of matrix-vector multiplication realized on p-processor 
BLSA according to the procedure given in [26] is 
Ttot ~--- 2"/7n + 2p - 3. (7.3) 
Accordingly, we obtain two times shorter total execution time for solving matrix-vector mul- 
tiplication problem on the p-processor BLSA by implementing the procedure described in this 
paper, with respect o the one given in [26]. This is a direct consequence of involving transfor- 
mations (4.2) and (4.3) which eliminate zero element insertion between the successive iterations. 
Table 3. The efficiency in percentage. 
m p----2 p - -3  p=4 p=lO p=15 p=20 
10 73.5 69 64 - - - 
50 94 95.7 93.7 77 72 65.4 
100 97 97 95 88.6 90.7 77.6 
7.2. Q Factor  
During operation of the system depicted in Figure 1, three major activities can be identified. 
1. MIS initialization, which includes the following activities: 
(a) setting MIS into initial state, with duration of 1 time unit; 
(b) loading the elements of matrix A into DPR_Ai, (i = 0, . . .  ,p -  1), with duration of m 2 
time units; 
(c) loading vector b into DPR_B, with duration of m time units; 
(d) setting all locations in DPR_C to zero, with duration of m time units. The hard- 
ware structure of MIS provides an overlapping between subactivities (c) and (d). 
Accordingly, total duration of MIS initialization takes m 2 + m + 1 time units. 
2. BLSA operation with duration of Trot = m * 7 + 2p - 2 time units. 
3. Transfer of the results to the host, with duration of m time units. 
In numerous cientific and technical applications, such as circuit simulation, image process- 
ing, filtering, iterative methods for solving system of linear equations, to mention just a few, 
the matrix A describes behaviour of the system and it is time invariant, vector b corresponds 
to the excitation (i.e., input of the system), while ~ represents system response (i.e., output of 
the system) for a given excitation. In other words matrix A is multiplied by the different vec- 
tor b. Therefore, when we consider system performance we will neglect he side effect concerning 
matrix A loading into dual port RAMs, since it is visible only once. 
1202 E.l. MILOVANOVI(~ et al. 
The BLSA and MIS are designed as integrated circuits (IC) of ASIC (Application Specific 
Integrated Circuit) type intended to accelerate the computation. The crucial tasks that they 
have to provide are 
- efficient support to the algorithm implementation (i.e., efficient mapping of the algorithm 
into SA architecture), and 
- low overhead ue to handling with data transfer to/ from the BLSA. 
In order to estimate the efficiency of the MIS, we define a performance measure, referred to 
as Q factor, of the overall system as 
T_H 
Q-  T_H"  
where T_H is time for which the host is occupied with data transfer to/ f rom the BLSA without 
MIS, while T_H'  is the corresponding time when MIS is involved. We assume that the time needed 
to generate address of one data item is the same both for the host and the MIS. Accordingly, we 
have 
T_H = T_b+T_c  = m + 2(7 -  1 ) .  m, 
where T_b = m is time needed to transfer vector b from the main memory to DPR_B, and 
T_c = 2(7 - 1) * m is time for manipulation with elements of vector gdur ing  all 7 iterations. 
When MIS is involved the host is committed only during results transfer from DPR_C. Ac- 
cordingly, we have 
T~/ '  = m. 
Thus, we have that  the Q factor of the system is 
m + 2('y - 1) * m 
Q= =2 7 - i, 
which means that  MIS diminishes the time of host occupation with data transfer (2 7 - i) times. 
In the boundary case, when p = [m/2], i.e., 7 = 2, we have Q = 3. 
As a measure Q points out to the efficiency of the memory subsystem concerning the manipu- 
lation with data transfer to/ f rom the BLSA. 
8. CONCLUSION 
We have presented an efficient matrix-vector multiplication algorithm which minimize execution 
time when implemented on the fixed size BLSA by eliminating zero element insertions between 
the successive iterations. To match the dimension of matrix A to the BLSA size, we perform 
partit ioning of the matrix A into quasidiagonal blocks, and to eliminate zero element insertions 
between successive iterations, we perform reordering of the elements of the resulting vector 5", 
matrix A, and vector b'. Data schedule reordering enables us to reduce execution time two times 
with respect to [26]. Reordering and fast data transfer to/ f rom the BLSA are handled by the 
MIS. Hardware synthesis of the MIS was discussed in detail in this paper. Our analysis shows 
that  MIS reduces host occupation with data transfer at least three times. 
REFERENCES 
1. G. DeMicheli and R.K. Gupta, Hardware/software codesign, Proc. of the IEEE 85 (3), 349-365, (1997). 
2. S. Edwards, L. Lavagno, E.A. Lee and A. Sangiovani-Vincentelly, Design of embedded systems: Formal 
models, validation and synthesis, Proc. of the IEEE 85 (3), 366-390, (1997). 
3. P.G. Paulin, C. Liem, M. Cornero, F. Nacabal and G. Goossens, Embedded software in real-time signal 
processing systems: Application and architecture trends, Proc. of the IEEE 85 (3), 419-435, (1997). 
4. I. Milentijevid, M. Stoj~ev and D. Maksimovid, Configurable digit-serial convolver of type F, Microelectronics 
Journal 27 (6), 559-566, (1996). 
5. S.Y. Kung, VLSI Array Processors, Prentice-Hall, New York, (1988). 
6. I. Gertner and M. Shasmash, VLSI architectures for multidimensional Fourier transform processing, IEEE 
Trans. Comput. C-36 (11), 1256-1274, (1987). 
Linear Systolic Array 1203 
7. K.R. Rao and P. Yip, Discrete Cosine Transforms: Algorithms, Advantages, Applications, Academic Press, 
Boston, (1990). 
8. S. MiUa, Digital Signal Processing: A Computer Based Approach, McGraw-Hill, New York, (1998). 
9. I.Z. Milovanovid, E.I. Milovanovid, I.Z. Milentijevi~ and M.K. Stoj~ev, Designing of processor-time optimal 
systolic arrays for band matrix-vector multiplication, Computers Math. Applic. 32 (2), 21-31, (1996). 
10. I.Z. MilentijevK, I.Z. Milovanovid, E.I. Milovanovi5 and M.K. Stoj~ev, The design of optimal planar systolic 
arrays for matrix multiplication, Computers Math. Applic. 33 (6), 17-35, (1997). 
11. R.B. Urguhart and D. Wood, Systolic matrix and vector multiplication methods for signal processing, IEE 
Proc.m Pt. F. 131 (6), 623-631, (1984). 
12. H.T. Kung, Why systolic architectures, Computer 15 (1), 37-46, (1982). 
13. H.T. Kung and C.E. Leiserson, Systolic Arrays for VLSI, In Introduction to VLSi Systems, (Edited by 
C.A. Mead and L.A. Conway), pp. 271-292, Addison-Wesley, Reading, MA, (1980). 
14. D.J. Evans and K. Margaritis, Systolic designs for eigenvalue-eigenvector omputations using matrix powers, 
Parallel Comput. 14, 77-87, (1990). 
15. Y.-C. Lin, Array size anomaly of problem-size independent systolic arrays for matrix-vector multiplication, 
ParaUel Comput. 17, 515-522, (1991). 
16. K.G. Margaritis and D.J. Evans, Folding techniques for systolic iterations, Parallel Algorithms Applic. 7, 
87- 105, (1995). 
17. R. Suros and E. Montagne, Optimizing systolic networks by fitting diagonals, Parallel Comput. 4, 167-174, 
(1987). 
18. D.J. Evans and M. Gu~ev, Implementation f folding transformations on linear VLSI processor arrays, Parallel 
Comput. 18, 525 542, (1992). 
19. M. Gu~ev and D.J. Evans, VLSI processor array IPS cells, Parallel Comput. 18, 997-1007, (1992). 
20. I.Z. MilentijevK, I.Z. Milovanovid, E.I. Milovanovid, M.B. To~id and M.K. Stojeev, Two-level pipelined systolic 
arrays for matrix-vector multiplication, Journal of Systems Architecture 44, 383-387, (1998). 
21. H. Barada and A. EL-Amawy, A methodology for algorithm regularization and mapping into time-optimal 
VLSI arrays, Parallel Comput. 19, 33-61, (1993). 
22. M. Gu~ev and D.J. Evans, The fastest matrix vector multiplication, Parallel Algorithm Appl. 1, 1-11, (1993). 
23. I.Z. Milovanovid, E.I. MilovanovK, I.Z. Milentijevid and M.K. Stojeev, Designing of processor-time optimal 
systolic arrays for band matrix-vector multiplication, Computers Math. Applic. 32 (2), 21-31, (1996). 
24. E.I. Milovanovid, M.B. To~id, I.Z. Milovanovid and I.Z. Milentijevid, Designing processor-time optimal linear 
systolic arrays foi~ matrix-vector multiplication, J. Electrotechn. Math. 1, 7-19, (1998). 
25. D.I. Moldovan and J.A.B. Fortes, Partitioning and mapping algorithms into fixed size systolic arrays, IEEE 
Trans. Comput. C-35 (1), 1-12, (1986). 
26. J.J. Navarro, J.M. LLaberia and M. Valero, Partitioning: An essential step in mapping algorithms into 
systolic array processors, Computer 20, 78-88, (1987). 
27. S. Naffziger, A bub-nanosecond 0.5#m 64 b adder design, In High-Per/ormance System Design: Circuits and 
Logic, (Edited by V. Oklobdzija), pp. 443-444, IEEE Press, New York, (1999). 
28. G. Goto, A. Inoue, R. Ohe, S. Kashiwakura, S. Mitarai, T. Tsuru and T. Izawa, A 4.1-ns compact 54 × 54 - b 
multiplier utilizing system-select booth encoders, In High-performance System Design: Circuits and Logic, 
(Edited by V. Oklobdzija), pp. 489-494, IEEE Press, New York, (1999). 
