on the bidirectional linear systolic array (BLSA) comprised of p _~ In/2] processing elements. To accomplish this matrix, A is partitioned into quasi-diagonal blocks. Each block contains p quasidiagonals. To avoid zero element insertion between successive iterations during the computation of the resulting vector if, we perform index transformation in the block matrices and vector 5". The index transformation can be described as perfect shuffle followed by the shifting. Besides, we propose an efficient hardware interface, called memory interface subsystem (MIS), located between the host and BLSA, which optimize memory access by elimination of extraneous main-memory operations. Then we evaluate the speedup and efficiency of the proposed matrix-vector multiplication algorithm. To estimate benefits obtained by introducing MIS, we compare host occupation with data transfer during matrix-vector multiplication on the BLSA without MIS and when it is involved. (~)
INTRODUCTION
Complex embedded systems such as those found in number of control, avionics, industrial, medical, automotive, and communication equipment, typically consist out of heterogeneous mix of hardware blocks: processor cores, general purpose macro blocks, and application specific architectures [1] [2] [3] . Since embedded systems are usually implemented by processors and application specific hardware, the most common architecture in these systems can be characterized as one of coprocessing, i.e., a processor working in conj unction with dedicated hardware to deliver a specific application. These accelerator blocks are required to execute algorithms of high complexity such as convolution [4] , correlation [5] , FFT [6] , DCT/IDCT [7] , 1D and 2D filtering [8] , matrix-vector and matrix-matrix multiplication [9] [10] [11] , etc.
In this paper, we will concentrate on:
(i) the generation of an application specific coprocessor architecture of type SA/PA dedicated to compute several different signal processing and scientific algorithms which utilise matrix-vector operations;
0898-1221/00/$ -see front matter (~) 2000 Elsevier Science Ltd. All rights reserved. Typeset by A~S-TEX PII: S0898-1221 (00) (ii) host/coprocessor interface which define another architectural variable that strongly affects data transfer rate between the host and accelerator; and (iii) study on the performance of the final design.
The proposed approach is limited to a specific class of coprocessor architectures, linear systolic/processor array as one of the most popular and important regular array which includes several attractive features such as bounded I/O, fault tolerance, minimal communication pattern, and modular extendability. The research reported in this paper focuses on the design where hardware-software partition is assumed to be fixed. Beginning with an initial hardware-software partition, we seek a design solution in order to maximize the speedup for a given application. Accordingly, in the rest of the paper we first propose an efficient algorithm for matrix-vector multiplication which improves the hardware utilization of the SA/PA, then we present a specific hardware interface, located between the host and SA/PA, intended for address transformation which optimizes memory access by elimination of extraneous main-memory operations.
AN OVERVIEW OF THE RELATED WORK
SAs and PAs consisting of large numbers of identical processing elements (PEs) have been popular accelerator candidates in VLSI/WSI technology. Parallel and pipelined processing capabilities of SAs/PAs can provide very high computational throughput in real-time applications. Different interconnection topologies have different properties and different algorithms run best on different accelerator architectures. For instance, hexagonally connected meshes are used for LU decomposition, binary trees for sorting, torus for transitive closure, double trees for searching, linear arrays for convolution, correlation, FFT and matrix-vector multiplication, rectangular or hexagonal arrays for matrix multiplication, and so on [5] . This paper examines the problem of determining an efficient linear SA/PA as an accelerator architecture for matrix-vector multiplication.
Historically, a group of researchers, headed by Kung, has introduced the systolic concept for parallel architectures [12, 13] in the period from 1978 to 1982. The most interesting architectural concept that inspired a lot of researchers for better design solutions due to its simple design, was the bidirectional linear systolic array (BLSA) since it is amenable for solving a variety of problems in engineering, scientific, and signal processing applications [14] [15] [16] .
The array introduced in [13] used for matrix-vector multiplication was neither time nor space optimal [17] . Therefore, the problem how to optimize space and/or time parameters of this array has stimulated a considerable research interest. Optimization of space and/or time parameters of this array can be accomplished either through higher complexity of processing elements (PE) (see, for example, [16, [18] [19] [20] ), or by modifying the design procedure and keeping the PE's complexity intact (see, for example, [21 24] ).
When the size of problem is larger than the number or PEs available in the array, algorithm partitioning and time multiplexing of hardware resources must take place [25] . For several reasons, the partitioning of algorithms for SAs is not a simple problem. First, a poor allocation of computations to processors may lower the speedup factor. Also related to this problem is the amount of external storage, and the communication links, introduced by the partitioning. Among the papers considering this problem are [25, 26] . The approach used in [25] to algorithm partitioning is to divide the index space into bands and to map those bands into the processor space. The authors have proved that partitioning and mapping of the iterative algorithms into array processors can be done by the same transformation function as for the nonpartitioned case if only an extra constraint is satisfied. However, the approach used in [25] does not minimize the processing time.
Navarro et al. [26] proposed a method to transform dense matrices of any dimension into band matrices of the desired bandwidth, so that easy and adequate matching to the dimension of BLSA is achieved. The proposed transformation allows good utilization of the PEs in the BLSA.
Most of the previous approaches to matrix-vector multiplication on the SAs view the design process independent to hardware implementation [13, 16, [18] [19] [20] [23] [24] [25] [26] . Also, it is usually assumed that communication operations are conducted using memory-mapped I/O by the host. However, memory-mapped I/O is often an inefficient mechanism for data transfer. More efficient methods, including dedicated device drivers, and interconnection network consisting of pipeline registers, take care of moving data from main memory to accelerator architecture, and vice versa, without creating a communication bottleneck, but their relation to partitioning has not been articulated yet [5] . One of the main goals in hardware synthesis of the accelerator for matrix-vector multiplication is the optimization of storage requirements. The intensive memory referencing of these behaviors necessitates the use of secondary storage (e.g., a memory system) since the primary store (e.g., PEs register storage) if sufficiently large enough, would be impractical. This memory is explicitly addressed in a synthesized system by memory operations containing indexing functions. However, due to the bottleneck that a memory system represents, memory accessing operations must be effectively scheduled so as to optimize memory access. In this t)aper, we present a specific hardware for address generation which optimize memory access by elimination of extraneous main-memory operations. We consider the multiplication of matrix A = (aik),,×,.
by vector b = (bk)nxl on the BLSA comprised of p _< [n/2] processing elements. Since p <__ n, matrix A is partitioned into quasi-diagonal blocks. Each block matrix contains p quasidiagonals. The computation begins with multiplying the first block matrix with vector b which gives the first iteration of 5", i.e., ~. (1) . In order to enable the second iteration to begin immediately, index transformation in the next block matrix and g (1) is performed. The transformation is a function of n and p. It can be described as a perfect shuffle followed by shifting. This tran. sformation enables that there is no null element insertion between the iterations, which decreases the processing time approximately two times compared to that in [26] . The memory system of the accelerator architecture is realized as dual-port RAMs.
GLOBAL ARCHITECTURAL MODEL OF THE ACCELERATOR
Before we proceed with mathematical model and a discussion of the design methodolog?z, we summarize the assumed system architecture. Figure 1 shows the overall structure of the system. It is comprised of a host, which includes CPU, main memory, and I/O subsystem, and a hardware accelerator intended for matrix-vector multiplication. Two parts can be distinguished within the accelerator, the BLSA with p < [n/2] PEs which performs the computation, and a memoryinterface subsystem (MIS) used as an efficient interconnection consisting of dual-port RAMs that take care of moving data to/from the BLSA without creating a communication bottleneck. The MIS is comprised of p dual-port RAMs, denoted as DPR-Ai (i = 0, 1,... ,p -1), one denoted as DPR_B, and one denoted as DPR_C. Dual-port RAMs are used for storing data elements prior to being fed into the BLSA. Namely, matrix A is stored into DPR_A~, i = 0 .... ,p -1, vector b into DPR_B, and all locations of DPR_C are set to zero.
MODEL FOR MATRIX-VECTOR MULTIPLICATION ON FIXED-SIZE ARRAY
We consider the computation of ~" --Ab, where A is an n x n matrix, and b and ~" are two vectors of size n implemented on a fixed size BLSA. A crucial design goal is to attain a minimal execution time with a given number of PEs in the BLSA.
In order to avoid data conflicts when minimizing the execution time, it is desirable that n is odd (see, for example, [24] ). If not so, then zero entries should be added to the matrix A and In the text that follows, we introduce some useful notations that will help us to describe data distribution during the computation of matrix-vector product on fixed size BLSA. We take equation (4.5) as a basis for the realization of matrix-vector multiplication on the fixedsize BLSA. The computation begins with A0b which gives the first iteration of 2, i.e., ff (1) . A distribution of input data items for this iteration can be described by the formulas given in [24] . In order to enable the next iteration to begin immediately (i.e., to perform time optimization), index transformation in Ai and 2(0, i = 1,...,7-1 is performed. The transformation is a function of m and p. It can be described as perfect shuffle defined by (4.2) followed by shifting defined by (4.3). The distribution of matrix A elements at the beginning of each iteration can be described as follows:
PEp.~---0 (4; (.; :;; Figure 2 shows the multiplication process on the BLSA with p = 2 PEs and m = 5. As can be seen, there are no zero elements between the iterations, which leads to minimization of execution time.
HARDWARE STRUCTURE OF THE SYSTEM
As we have already mentioned, the system for matrix-vector multiplication, given in Figure  1 consists of host computer (CPU and main memory), BLSA, and MIS. The CPU and main memory are standard parts of each general purpose computer system, therefore we will not discuss them here. The BLSA consists of p < [m/2] identical PEs. Internal structure of a PE is shown in Figure 3 . It is a simple multiply-add cell which consists of two latches, L_B and L_C, a multiplier, and an adder. The PE structure is also standard, so its hardware synthesis will be omitted. Special attention is devoted to the hardware synthesis of memory-interface-subsystem, MIS. Performance of the system pictured in Figure 1 is often determined not only by the characteristics of its data processing component, i.e., the BLSA, but also by how well it transfers required data to the desired destination. Thus, MIS design becomes increasingly important. If this interface is not efficient, a significant decrease in system performance can result. Therefore, low latency and high bandwidth are two major functional design requirements that MIS has to fulfill. This section focuses on MIS hardware design. Design aspects by which we provide hardware assistance to latency reduction and increased data transfer bandwidth are discussed.
The structure of MIS is organized into three relatively independent parts: MIS_A, an interface for manipulation with matrix A, and MIS-B and MIS_C, interfaces for handling with vectors and ~, respectively. Concerning the design efforts, the most complex part is the one intended for manipulation with vector ~', i.e., MIS_C. Data flow diagram for address generation according to (5.2) is shown in Figure 5 . To implement the computation represented by data flow diagram in Figure 5 , we need a standard multiplier, a modulo m multiplier, a barrel shifter, and one full adder. During hardware synthesis of AG1 and AG2, we take into account the following facts.
-The first multiplication operation in (5.2), i.e., (r * j) mod m, j = 0,...,~/-1, can b.e implemented as modulo m add-with-accumulation operation. -The div2 operation, i.e., the division by constant which is power of 2, on unsigned numbers such as addresses, is simply a logical shift of the bits to the right. Since in (5.2) the divisor is a constant 2, the shift operation is a hard-wired shift. In short, there is no circuit involved, simply a rearrangement of the wires in the bus is required. The rood2 operation--the unsigned implementation is very simple. In fact, like unsigned division, no circuit is necessary at all. The modulus of a number is found by simply discarding the MS bits and keeping the LS bit which represents the range of the modulo arithmetic. This is effectively a masking operation in which the MS bits are masked off, leaving just the LS bit intact. -Having in mind that the result of mod2 operation is 0 or 1, the second multiplication in (5.2) can be implemented by a multiplexor. -In all operations that involve modulo arithmetics, both operands are less then modulus m.
A block diagram of MIS_C is presented in Figure 6 . The hardware structures of AG1 and AG2 are identical, with one exception: the timing of AG2 is delayed for p time units in respect to AG1. Data are read at the right side of the DPR_C according to the addresses generated by AG1 and fed into the BLSA through PE0. Both read and write operations at the left side of DPR_C are allowed. The host performs write operation during the initialization of DPR_C, and read during result transfer. During matrixvector multiplication by the BLSA, the output from PE(B-1) is written into DPR_C. The direction of data transfer is defined by a pair of control signals sell and sel_2 according to The control unit of AG1 is composed of two counters, CCl--modulo m counter, and CC2--modulo 7 counter, and decoding logic which generates all internal control signals. The control signal En defines the period during which control unit is active. Namely, En defines timing, i.e., the beginning and the end of active period.
The Structure of MIS_A
As can be seen from Figure 1 , each PEi is associated with one dual port RAM, referred to as DPR_A~, i = 0,... ,p-1 for manipulation with matrix A. Each DPR_A~ has a capacity of m * 7 words. Namely, each diagonal of matrix A consists of m elements, and 7 diagonals are stored in each DPR_A. We assume that row-major ordering scheme for storing matrix A in the main memory is used. A global structure of the memory subsystem for address manipulation with elements of matrix A is presented in Figure 8 . It consists of two address generators, AG_A and AG_SA, and p dual port RAMs, DPR-Ao to DPR_A(p_I). To implement the computation given in (5.3) we need: a multiplier, a modulo m adder, and a full adder. The structure of AGA_H is shown in Figure 9 . The first term in (5.3) is again implemented as add-with-accumulation operation using one full adder, SUM1, and one latch, L_A1. The term (i + k) mod m is obtained at the output of modulo m adder, denoted as SUM2, and finally, the address Adr_ diag for accessing elements of matrix A located in main memory is generated at the output of full adder SUM3. Table 2 A detailed structure of AGA_DPR and the corresponding select logic, SL, from Figure 8 , are sketched in Figure 10 . The AGA_DPR consists of three counters, K1, K2, and K3. Chip select signals for DPR-Ai, i = 0,... ,p -1, are generated at the output of the SL. Addresses for dual port RAMs are obtained by merging the outputs of counters K1 and K3.
Data are fed into the BLSA from the DPR_A~, i = 0,...,p-, according to the addresses generated by the address generator AG_SA. The AG_SA performs perfect shuffle and shifting permutations over diagonal elements of matrix A during address generation. Its structure is similar to the structure of AG1 shown in Figure 6 . The block Z -1 represented in Figure 8 involves propagation delay of one clock period for the generated address. The multiplexor, MUX_i, controlled by A_sel_i, i = 0,... ,p-1, provides a mechanism to feed-in zero elements into the BLSA at the beginning and the end of the computation. 
Ol ( 
The Structure of MIS_B
A global structure of the memory interface subsystem for manipulatio n with elements of vector is given in Figure 11 . As can be seen, the constituents of MIS_B are the following.
-Address generator AG_B--generates the addresses for reading elements of vector b from the main memory and stores them into the dual port RAM, DPR_B. It performs the perfect shuffle permutation, so that data are stored in a perfect shuffle ordering in the DPR_B. The hardware structure of AG_B is identical with one of AG1 shown in Figure 6 , except that the value of shifting factor r is always set to zero. 
DETERMINING THE CYCLE-TIME
To estimate how the MIS affects the clock ratio of the overall system, comprised of BLSA and MIS, we analyze the duration of the PEs activity during one computational step (TpE), and the duration of address generation and data fetch performed by the MIS (TMIs). These two activities are overlapped in time in the proposed design. Therefore the cycle-time of the system is determined as max {TpE, TMm}.
Let us note that all parts of the MIS (i.e., MIS_A, MIS_B, and MIS_C) work in parallel. Having in mind that the data path of address generator AG1 (presented in Figure 6 ), is the most complex part of the MIS, we determine the cycle-time of the system as max {TpE, TMIS_C }.
Analyzing the structures of PE ( Figure 3 ) and AG1 ( Figure 6 ), we conclude that where TADD, ~MUL, TLATCH, TMUX, and TFETC n stand for the propagation delay of adder, multiplier, latch, multiplexor, and data fetch, respectively. If we adopt the following normalized times [27, 28] :
we obtain TpE = 5.4T and TMm_c = 4.5T, which means that the cycle-time of the system is determined by the PE latency, i.e., the MIS does not introduce the additional delay in the system.
PERFORMANCE ANALYSIS AND DISCUSSION
In the text that follows we will evaluate the following performance measures.
the speedup and efficiency of the matrix-vector multiplication algorithm implemented on the fixed-size BLSA with p < [m/2] PEs, according to the described procedure. These performance measures reflect the benefits of the proposed algorithm.
-Q factor of the proposed memory interface subsystem, MIS. This performance measure enables us to estimate the possibilities of the MIS to cope with massive-high-speed data transfer to/from the BLSA.
Speedup and Efficiency
The total execution time of the described matrix-vector multiplication algorithm on the fixedsize BLSA is given by Ttot = 7m + 2p -2, (7 respectively. According to (7. 2) it can be concluded that for some fixed p the speedup and efficiency are increasing as the problem size increases. Thus, when m --+ cx~ (m >> p) we have that Sp ---+ p and Ep --* 1. Table 3 gives the efficiency in percentage for various values of m and p.
Let us note that the total execution time of matrix-vector multiplication realized on p-processor BLSA according to the procedure given in [26] is Ttot ~---2"/7n + 2p -3.
(7.3) Accordingly, we obtain two times shorter total execution time for solving matrix-vector multiplication problem on the p-processor BLSA by implementing the procedure described in this paper, with respect to the one given in [26] . This is a direct consequence of involving transformations (4.2) and (4.3) which eliminate zero element insertion between the successive iterations. 
Q Factor
During operation of the system depicted in Figure 1 , three major activities can be identified. In numerous scientific and technical applications, such as circuit simulation, image processing, filtering, iterative methods for solving system of linear equations, to mention just a few, the matrix A describes behaviour of the system and it is time invariant, vector b corresponds to the excitation (i.e., input of the system), while ~ represents system response (i.e., output of the system) for a given excitation. In other words matrix A is multiplied by the different vector b. Therefore, when we consider system performance we will neglect the side effect concerning matrix A loading into dual port RAMs, since it is visible only once.
E.l. MILOVANOVI(~ et al.
The BLSA and MIS are designed as integrated circuits (IC) of ASIC (Application Specific Integrated Circuit) type intended to accelerate the computation. The crucial tasks that they have to provide are -efficient support to the algorithm implementation (i.e., efficient mapping of the algorithm into SA architecture), and -low overhead due to handling with data transfer to/from the BLSA.
In order to estimate the efficiency of the MIS, we define a performance measure, referred to as Q factor, of the overall system as
where T_H is time for which the host is occupied with data transfer to/from the BLSA without MIS, while T_H' is the corresponding time when MIS is involved. We assume that the time needed to generate address of one data item is the same both for the host and the MIS. Accordingly, we have T_H = T_b+T_c = m + 2(7-1). m, where T_b = m is time needed to transfer vector b from the main memory to DPR_B, and T_c = 2(7 -1) * m is time for manipulation with elements of vector gduring all 7 iterations. When MIS is involved the host is committed only during results transfer from DPR_C. Accordingly, we have T~/' = m.
Thus, we have that the Q factor of the system is m + 2('y -1) * m Q= =2 7 -i, which means that MIS diminishes the time of host occupation with data transfer (2 7 -i) times.
In the boundary case, when p = [m/2], i.e., 7 = 2, we have Q = 3. As a measure Q points out to the efficiency of the memory subsystem concerning the manipulation with data transfer to/from the BLSA.
CONCLUSION
We have presented an efficient matrix-vector multiplication algorithm which minimize execution time when implemented on the fixed size BLSA by eliminating zero element insertions between the successive iterations. To match the dimension of matrix A to the BLSA size, we perform partitioning of the matrix A into quasidiagonal blocks, and to eliminate zero element insertions between successive iterations, we perform reordering of the elements of the resulting vector 5", matrix A, and vector b'. Data schedule reordering enables us to reduce execution time two times with respect to [26] . Reordering and fast data transfer to/from the BLSA are handled by the MIS. Hardware synthesis of the MIS was discussed in detail in this paper. Our analysis shows that MIS reduces host occupation with data transfer at least three times.
