Abstract-An efficient systolic array with a robust concurrent error diagnosis (CED) capability for the multiplication of band matrices is presented. This array is most appropriate for medium-banded matrices. While the corresponding Kung-Leiserson system is shown to be best suited for wide-banded matrices, the associated Huang-Abraham design is demonstrated as optimal for narrow-banded matrices. As it is impractical to maintain all three designs for the range of band matrices, it would be most economical to install the new design in place of the others.
I. INTRODUCTION
Compute-bound computational tasks are computations in which the total number of operations is larger than the total number of input and output (I/O) elements [1]. The speed at which a computer can perform such a task depends largely on the volume and speed of its data-transfer capability. This is the classic Von Neumann bottleneck. Advances in this area over the past decade have been in the development of better and faster fundamental devices, as well as the development of sophisticated application-specific architectures. Of the many innovative concepts on the architectural front, the systolic array has probably received more development than the others [1]-[21].
A systolic array is basically geared for the handling of a subclass of the compute-bound tasks, known as the recursive computation tasks. Since its introduction in the late 1970's, the application of systolic arrays to these tasks has been quite extensive. As a result, the design of systolic arrays has undergone appreciable refinements. The latest trends in systolic design seem to be in the minimization of the throughput latency and the maximization of the per-cycle processing element (PE) utilization rate [5]-[7], as well as the implementation of fault-tolerant capabilities [3], [13]-[20].
The throughput latency of a systolic array refers to the duration between the time when the first pieces of data are pumped into the array and the moment when the last results are retrieved from it. The throughput latency is usually measured in terms of the total number of systolic cycles. The per-cycle PE utilization rate is the ratio of the average number of active PE's to the total number of PE's in the array, during a cycle of operation. Depending on the design philosophies, some of the systolic algorithms may have comparatively high throughput latency and low per-cycle PE utilization rate. In order to attain better results for these two figures of merit, recent studies [5]-[7] have traded off the lower I/O burden of the previous designs [1]-[4]. These optimal designs, however, are of importance only as long as the resulting increased I/O bandwidths do not significantly slow down the rate of data transfer of the system as a whole.
Fault tolerance in systolic arrays, much like that in other computational structures, has only received attention recently. The practice of fault tolerance in a real-time task-intensive architecture, such as a systolic array, is understandably difficult. Recently, several researchers have focused on error diagnosis, correction, and reconfiguration of faulty structures [13]-[20]. Much fundamental work still needs to be done before systolic arrays can truly be highly reliable systems.
In the present paper, the effort to optimize the throughput latency and the per-cycle PE utilization rate will be reexamined with the need to maintain a small I/O bandwidth in mind. The incorporation of a concurrent error diagnosis (CED) capability, which is the first step towards fault-tolerant systolic arrays, will also be pursued.
In the next section, the characteristics of a systolic array and the important issues in fault-tolerant systolic computing are presented. The recent efforts to optimize the performance of a band matrix multiplication systolic array (BMMSA) will be discussed in Section III. The discussion will concentrate on the fundamental differences between the Kung-Leiserson and Huang-Abraham schemes of systolic design. The emphasis on these two schemes is to exemplify the extremes in design philosophies. Against such background, the motivations for additional figures of merit will be pointed out. A new design of BMMSA will also be introduced.
The results of a recent work [19] regarding the practice of CED in a systolic array will be extended in Section IV. The efficient scheme, based on the time redundancy technique of RESO (recomputation with shifted operands) [22]-[24], is to be applied to the design of CED-capable BMMSA's. Different designs, based on the Kung-Leiserson BMMSA, the proposed BMMSA, and the Huang-Abraham BMMSA, will be presented.
The conclusions and future research directions are given in Section V.
II. BACKGROUND
A systolic array exemplifies the essence of VLSI practice. It embraces regularity, granularity, expansibility, and hierarchical integration. It is a special form of computational structure in which processing elements (PE's) carrying out simple functions are regularly interconnected to facilitate data and control flows of the specific algorithm it implements. Not all algorithms can be implemented in systolic arrays efficiently. However, a large number of algorithms do favor systolic application. These algorithms are basically oriented towards recursive computations.
A recursive computational task involves arrays of data that are either spatially or temporally related, or both. The class of recursive problems is notoriously difficult to handle computationally. The complexities can be attributed to the large amount of data involved and the huge number of calculations themselves. These translate into the burdens of large memory commitment, long computation time, long I/O time, and large I/O bandwidth. Of these, the bottleneck is at the I/O.
In fact, the merit of a systolic array lies in its algorithm-oriented structure, which closely mimics the inherent recursive nature of the specific computational task it implements. To this end, a systolic array has the following properties [19]: (1) it has a hierarchical architecture; (2) its PE's are usually identical and interconnected in a highly regular fashion; (3) it relaxes the potential global I/O bottleneck; (4) it entertains multidimensional, multiple data streams of different speeds; (5) it orchestrates lockstep concurrent pipelines to realize parallel processing; (6) it only requires relatively simple control signals; and (7) it is liable to have PE's idling in each time cycle.
Since most recursive tasks involve the multiplication and addition of elements of matrices, a typical architecture ( Fig. 1) for a PE consists of a multiplier-accumulator (MAC) at its heart. For 2's complement fixed-point computations, a typical MAC is usually further comprised of two major parts: a Baugh-Wooley multiplier (BWM) and a ripple-carry adder (RCA).
It is possible to install a pipeline inside the PE by partitioning the MAC accordingly. In the multiplication of two n-bit numbers, the result could be 2n bits or n bits long, depending on the precision desired. If the output were to be 2n bits long, then the architecture of the MAC would not be conducive to efficient pipelining [19]. However, if the output is to be only n bits long, then the MAC can be efficiently partitioned.
The last row of a BWM is basically equivalent to an RCA in structure. In the case of a MAC with an n-bit output, the associated accumulator (an RCA) has about the same number of bits as the last row of the BWM. Thus, if the last row of the BWM is moved from it to the accumulator, then the resulting two substructures (the BWM without its last row, and the accumulator with the acquired last row from the BWM) will have essentially the same operation delay time, namely half that of the original MAC. Let the top partition of the MAC be referred to as the incomplete multiplier array (IMA) and the bottom partition be referred to as the coupled accumulator array (CAA). Then, the operation of the MAC can be pipelined by inserting a latch between the IMA and the CAA. The resulting throughput can then be doubled by doubling the clock rate. In other words, the corresponding throughput latency of the host systolic array can be halved.
The other accessories in the PE are the latches for the loading of the associated operands, as well as the data lines and the control lines. Except for the I/O port PE's, the interconnects will be for local communications and are, therefore, inherently short.
A. Huang-Abraham Ratio
Because of the complexity of systolic designs, it is not easy to decide whether one systolic array is better than another. Extensive analyses and comparisons must be carried out before fair appraisals can be made. Such studies must rely heavily on the different attributes and properties presented earlier. In an attempt to simplify evaluations, Huang and Abraham [17] introduced the ratio $R = PBT^2/(CI)$. In this Huang-Abraham measure R, P is the number of PE's, B is the input bandwidth, T is the turnaround time or throughput latency in terms of the number of computation cycles, C is the gross volume of computations, and I is the gross input transfer volume of the system. The value of R is always greater than or equal to 1 (R = 1 is then the optimal value). With this ratio, Huang and Abraham justified their philosophy on systolic designs.
B. Fault-Tolerant Systolic Computing
The practice of fault-tolerant design in systolic structures is still at the dawning stage. If any reasonable error corrections or recoveries are to be achieved, the very first step must be in attaining a robust CED scheme. Such a scheme must have the following capabilities:
1) It provides error diagnosis down to an individual faulty PE.
2) It must target temporary errors (intermittent or transient), which include permanent errors resulting from wear-out as special cases.
3) The diagnosis latency must be as short as possible.
4) There should be no inherent restrictions on the number of faulty PE's that are diagnosable.
5) The overheads in terms of hardware, time, control signals, and interconnections must be as low as possible.
6) It should enhance, rather than impair, the elegance and functionality of its host structure.
In the earlier work on the designs of CED-capable systolic arrays for band matrix-vector multiplication (Figs. 2 and 3) [19], the following conclusions were extracted:
2) The time redundancy scheme of RESO provides unique opportunities for highly efficient and robust CED applications, by virtue of the last three points above.
3) The marriage of RESO into the 1-D Kung-Leiserson systolic array studied resulted in a 100-percent per-cycle PE utilization rate. There were practically no undesirable tradeoffs in terms of I/O bandwidth.
The important attributes of time redundancy are the concepts of disjoint error sets and mappable correct output. More specifically, consider the time redundancy technique shown in Fig. 4 [21]. Let x be the input to the computation unit f, and let f(x) and f'(x) be the outputs with and without encoding-decoding operations, respectively. Two fundamental requirements must be satisfied in these operations. First, the coding function c must not interfere with the original function f. In other words, for a selected coding function c, there must exist a decoding function $c^{-1}$ such that $c^{-1}(f(c(x))) = f(x)$ in the absence of faults. This is the concept of mappable correct output. Secondly, for the purpose of fault detection, the coding operation c must transform the input operand(s) x in such a way that, when subjected to the same faulty conditions, the output of the repeated step, though still erroneous, will be different from that of the first step. This is the concept of disjoint error sets. This concept is formalized into the theorem that follows.
Theorem 1: ... any possible output of the repeated step must not be an element in $E_0$, and any possible output of the first step must not be an element in $E_1$; i.e., $E_0 \cap E_1 = \emptyset$.
Q.E.D.
In RESO, left shift and right shift are chosen as the coding and decoding functions. Let $k_1, k_2, \ldots, k_m$ be the numbers of bits to be left shifted for each of the m input operands of the host ILA, and let r be the number of bits to be right shifted in the result. The values of the $k_i$'s and r are dependent on the function as well as the architecture of the host ILA. Thus, the design parameters of a RESO scheme for a specific ILA structure are specified by RESO($\{k_1, k_2, \ldots, k_m\}$, r), instead of RESO-k.
An easy way to determine the appropriate number of bits to shift in RESO is to analyze the potential error sets $E_0$ and $E_1$ of the unshifted and shifted results, respectively. From the disjoint error sets requirement in Theorem 1 and the characteristics of most ILA's, the potential error set, $E_0$, of the first unshifted step can be formulated (Fig. 5) in terms of i, the minimum of the bit-slice indices of the faulty modules, and U, the maximum error factor.
In the recomputation step, the operands are left shifted such that the result is left shifted by r bits with respect to the original, unshifted result. Thus, the same faulty jth bit-slice now generates an output bit that is equivalent to the (j - r)th bit of the original, unshifted result. In other words, the weight of the jth bit-slice in this step must be reduced by a factor of $2^r$ (or, right shifted by r bits before comparison). Since the function performed by the ILA is the same in both steps, the corresponding potential error set of the recomputation step, $E_1$, has the same attributes as $E_0$, except that all the weights of the entries are reduced by a factor of $2^r$.
Note that in the original RESO scheme, the error sets $E_0$ and $E_1$ contain the case of q = 0 [22]. This definition is valid from a fault-modeling point of view. However, from a functional error point of view, an output that contains zero error is by definition not an error, even though there may be cells at fault. Therefore, in our definition of error set, the q = 0 case is not included. This definition allows the development of the concept of "error set disjointedness," i.e., once the error sets are made sure to be disjoint, then errors caused by any fault (permanent and intermittent alike) can be detected under the enhanced RESO scheme.
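As a minimal sketch of the two requirements above, the following Python fragment applies recomputation with shifted operands to a simulated ripple-carry adder. Everything in it is an illustrative assumption rather than the design of [19]: the function names `faulty_add` and `reso_add`, the fault model (the sum output of one bit-slice stuck at a fixed value, with the carry chain unaffected), and the choice of shifting both operands by k bits and the result back by r = k bits.

```python
def faulty_add(a, b, width, stuck_slice=None, stuck_value=0):
    """Ripple-carry addition of two non-negative integers, one bit-slice at a time.

    Assumed fault model (illustrative only): the sum output of bit-slice
    `stuck_slice` is stuck at `stuck_value`; the carry chain is unaffected.
    """
    carry, total = 0, 0
    for j in range(width):
        a_bit, b_bit = (a >> j) & 1, (b >> j) & 1
        sum_bit = a_bit ^ b_bit ^ carry
        carry = (a_bit & b_bit) | (carry & (a_bit ^ b_bit))
        if j == stuck_slice:
            sum_bit = stuck_value  # injected fault
        total |= sum_bit << j
    return total


def reso_add(x, y, n, k, stuck_slice=None, stuck_value=0):
    """Add two n-bit operands twice (unshifted, then shifted) and compare.

    Returns (first_result, error_flag).
    """
    width = n + 1 + k  # room for the carry-out and the k-bit left shift
    first = faulty_add(x, y, width, stuck_slice, stuck_value)
    shifted = faulty_add(x << k, y << k, width, stuck_slice, stuck_value)
    second = shifted >> k  # decoding step: right shift by r = k bits
    return first, first != second


if __name__ == "__main__":
    print(reso_add(5, 3, n=4, k=1))                               # fault-free
    print(reso_add(5, 3, n=4, k=1, stuck_slice=3, stuck_value=0))  # faulty slice 3
```

In the fault-free case the decoded second result equals the first (mappable correct output). Under this fault model the same faulty slice corrupts bit j of the first result but bit j - k of the decoded second result, so a wrong first result never agrees with the second one.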
Based on these concepts, extensive analyses and enhancements to the underlying theory of RESO have been accomplished [19]. The important attributes of the proposed CED design strategy are summarized as follows:
1) Let r be the number of bits the recomputed result is to be right shifted. It is determined from the maximum error factor U, as $r > \lfloor \log_2 (U) \rfloor$, where U is determined from the six most common regular fault geometries.
2) Conversely, given a chosen r value, the error coverage in terms of the number of faulty bit-slices can be easily determined.
3) For a chosen r value, the corresponding RESO implementation is robust against the target fault configurations. The robustness holds if either any of these fault configurations is confined in a single region, or the left-most bit-slices of any two separated fault regions are at a distance of at least 2r - 1 bit-slices.
4) The overhead ratios, in terms of hardware and time, are both O(r/n).
5) The accompanying accessories and control signals are simple.
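The shift rule in point 1) can be checked numerically under an explicitly assumed form of the error sets. The sketch below assumes $E_0 = \{q \cdot 2^i : 0 < |q| \le U\}$ and $E_1 = \{q \cdot 2^{i-r} : 0 < |q| \le U\}$ with integer q; this form is an assumption made for illustration (the exact expressions are not reproduced here), and the helper names `error_set`, `min_shift`, and `disjoint` are hypothetical.

```python
from fractions import Fraction


def error_set(i, U, shift=0):
    """Assumed error set {q * 2**(i - shift) : 0 < |q| <= U}, q an integer."""
    weight = Fraction(2) ** (i - shift)  # exact, also for negative exponents
    return {q * weight for q in range(-U, U + 1) if q != 0}


def min_shift(U):
    """Smallest integer r with r > floor(log2(U)), for U >= 1."""
    return U.bit_length()  # equals floor(log2(U)) + 1


def disjoint(i, U, r):
    """True if the unshifted and shifted error sets share no element."""
    return error_set(i, U).isdisjoint(error_set(i, U, r))


if __name__ == "__main__":
    for U in (1, 3, 4, 7, 8):
        r = min_shift(U)
        print(U, r, disjoint(2, U, r), disjoint(2, U, r - 1))
```

Under this assumed form, the two sets overlap exactly when some $q_2 = q_1 2^r$ with both factors bounded by U, which is impossible once $2^r > U$; any smaller shift leaves at least one common element.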
III. THE DESIGN OF BAND MATRIX MULTIPLICATION SYSTOLIC ARRAYS
The multiplication of two matrices is an operation basic to many scientific and engineering calculations, such as signal processing, pattern recognition, and finite-element analysis. The total number of multiplications and additions involving elements of two matrices can be quite appreciable. However, the complexity of the problem can be greatly reduced if the two matrices are both banded.
Consider the matrix multiplication of two n-by-n matrices $A = (a_{ij})$ and $B = (b_{ij})$. The product $C = (c_{ij})$ can be computed by the following recurrences:
$$c_{ij}^{(1)} = 0, \qquad c_{ij}^{(k+1)} = c_{ij}^{(k)} + a_{ik} b_{kj} \quad (k = 1, 2, \ldots, n), \qquad c_{ij} = c_{ij}^{(n+1)}.$$
A band matrix has all zero elements except along a band of w diagonals, where w = p + q - 1 is the bandwidth of the band matrix. Let A and B be n-by-n band matrices with bandwidths $w_1 = p_1 + q_1 - 1$ and $w_2 = p_2 + q_2 - 1$. Then C is an n-by-n matrix with a bandwidth $w_3 = w_1 + w_2 - 1$, as shown in Fig. 6, where $w_3 = p_3 + q_3 - 1$, $p_3 = p_1 + p_2 - 1$, and $q_3 = q_1 + q_2 - 1$. These recurrences can be evaluated by pipelining the elements of A, B, and C through a systolic array which consists of $w_1 w_2$ hex-connected PE's. Thus, the total number of multiplications and additions of elements from A and B is much less than for the case when both matrices are dense. Consequently, the systolic architecture for a band matrix multiplication is smaller. The general topology for the systolic array is arrived at by considering the governing recursive formula alone.
A. The Kung-Leiserson BMMSA
After Kung and Leiserson [2], the attributes of the corresponding BMMSA design can be summarized as follows:
1) The number of PE's involved is $P = w_1 w_2$.
2) Each set of pipelines (for A, B, and C) is subdivided into three groups. Only one group (out of each set) will pump data into the array in any cycle. In other words, each pipeline pumps data into the array once every three cycles.
3) Each PE is active (or useful) once every three cycles. Thus, once the operation is in full steam, the per-cycle PE utilization rate is about 33 percent.
4) The throughput latency is $T = 3n + \min(w_1, w_2)$.
5) The maximum input bandwidth is roughly $B = 2(w_1 + w_2)/3$ per cycle.
6) Its Huang-Abraham ratio is R > 3.
The low per-cycle PE utilization rate and the high throughput latency of the Kung-Leiserson array have been concerns to other designers. To achieve better values for these two measures, Huang and Abraham [5] traded off the relaxed I/O bandwidth (Fig. 8).
B. The Huang-Abraham BMMSA
The attributes of this BMMSA design can be summarized as follows:
1) The number of PE's involved is still $P = w_1 w_2$.
2) There is no grouping of pipelines within each of the three sets (corresponding to the three matrices). Each pipeline pumps data into the array every cycle.
3) Each PE is active once the flows of data are in full steam. Thus, the per-cycle PE utilization rate is about 100 percent.
4) The throughput latency is $T = n + \min(w_1, w_2)$.
5) The maximum input bandwidth is roughly $B = 2(w_1 + w_2)$ per cycle.
6) Its Huang-Abraham ratio is R > 1.
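The band-matrix recurrences and the bandwidth relation $w_3 = w_1 + w_2 - 1$ from the start of this section can be illustrated with a short software sketch. It is not a model of either systolic array; it only evaluates the recurrence sequentially. The helper names are hypothetical, and the convention that p counts the main diagonal plus the superdiagonals while q counts the main diagonal plus the subdiagonals is an assumption.

```python
def band_matrix(n, p, q):
    """n-by-n band matrix with positive entries inside the band.

    Assumed convention: entry (i, j) may be nonzero only when
    -(q - 1) <= j - i <= p - 1, so the bandwidth is w = p + q - 1.
    """
    return [[(i + 2 * j + 1) if -(q - 1) <= j - i <= p - 1 else 0
             for j in range(n)] for i in range(n)]


def multiply_by_recurrence(A, B):
    """Evaluate c_ij^(k+1) = c_ij^(k) + a_ik * b_kj, skipping zero products.

    Returns the product and the number of multiply-accumulate steps used.
    """
    n = len(A)
    C = [[0] * n for _ in range(n)]  # c_ij^(1) = 0
    macs = 0
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if A[i][k] and B[k][j]:
                    C[i][j] += A[i][k] * B[k][j]
                    macs += 1
    return C, macs


def measured_band(M):
    """Return (p, q) of the smallest band containing all nonzeros of M."""
    n = len(M)
    offsets = [j - i for i in range(n) for j in range(n) if M[i][j]]
    return max(offsets) + 1, -min(offsets) + 1


if __name__ == "__main__":
    n = 8
    A = band_matrix(n, 2, 2)  # w1 = 3
    B = band_matrix(n, 3, 2)  # w2 = 4
    C, macs = multiply_by_recurrence(A, B)
    p3, q3 = measured_band(C)
    print(p3, q3, p3 + q3 - 1, macs)
```

For these inputs the measured band of C has $p_3 = 4$ and $q_3 = 3$, and the multiply-accumulate count stays below $w_1 w_2 n$, far below the $n^3$ steps of the dense case.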
According to the characteristics of both BMMSA designs, while the Kung-Leiserson BMMSA is best suited for wide-banded matrices, the Huang-Abraham BMMSA is optimal for narrow-banded matrices. However, as far as medium-bandwidth matrices are concerned, there is no clear decision as to whether a Kung-Leiserson array or a Huang-Abraham array is better.
This dilemma can be bypassed if a design which is a compromise between these two is available. The compromise must be in the areas of I/O bandwidth, throughput latency, and per-cycle PE utilization rate.
C. The Design of BMMSA for Medium-Banded Matrices
The proposed BMMSA design, as shown in Fig. 9, has the following stipulations:
1) The organization of the data associated with the three matrices is to be similar to that in the Huang-Abraham scheme.
2) Each of the three sets of pipelines is subdivided into two groups. In any one cycle, for each set of pipelines, only pipelines belonging to one group will have their I/O ports activated. Thus, each pipeline will make data transfer(s) with the outside world once every two cycles.
3) Each PE will be activated once every two cycles. During an active cycle, a PE will receive data from its A and B pipelines; otherwise, it retains its existing A and B values. The C pipeline, however, will be active every cycle.
4) Active and idle PE's will be placed in an alternating pattern inside the array. The specifics of the placement are to conform with the following algorithm, which divides the PE's according to the control signals ($CL_1$ and $CL_2$):
a) Start with the adjacent edges of the systolic array, where the input ports of the C pipelines are located.
b) The edge with $w_1$ PE's will serve as the input ports of the A pipelines, while the edge with $w_2$ PE's will be dedicated as the input ports of the B pipelines.
c) The intermediate $\lceil w_1/2 \rceil$ and $\lceil w_2/2 \rceil$ PE's, on the two sides of the intersection of the two edges, will be marked off as group #1 (receiving $CL_1$). The remaining $\lfloor w_1/2 \rfloor$ and $\lfloor w_2/2 \rfloor$ PE's of the two edges will be designated as group #2 (receiving $CL_2$).
d) With the first PE on each C pipeline being assigned to either group, the assignment of the remaining PE's on the same C pipeline is to be of alternate groupings. In other words, if the first PE along a C pipeline belongs to group #1, then the second PE must belong to group #2, the third one to group #1, the fourth one to group #2, and so on.
Note that with this assignment algorithm, the memberships in both groups would be of about the same number. Furthermore, the partitioning of the PE's will form alternating zigzag patterns.
5) The architectures of a group #1 PE and a group #2 PE are basically the same, the only difference being in the control signals ($CL_1$ and $CL_2$, where $CL_1$ leads $CL_2$ by one systolic cycle, and both are two systolic cycles long).
In essence, this new design has the following attributes:
1) The number of PE's involved is also $P = w_1 w_2$.
2) The per-cycle PE utilization rate is about 50 percent.
3) The throughput latency is $T = 2n + \min(w_1, w_2)$.
4) The maximum per-cycle input bandwidth is roughly $B = w_1 + w_2$.
5) The corresponding Huang-Abraham ratio is R > 2.
In these ratios, C is the total computations = $w_1 w_2 n$, and I is the total input transfers = $2(w_1 + w_2)n$.
As was the motivation behind this new design, it should be better than either the Kung-Leiserson or the Huang-Abraham arrays when the application is for medium-banded matrices. As a matter of fact, since it is impractical to maintain three different arrays for the ranges of applications, it may be most economical to have this new array for all band matrix multiplications. A comparison of all three designs is summarized in Table II.
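The Huang-Abraham ratios quoted for the three designs can be reproduced with a few lines of arithmetic. The sketch below evaluates $R = PBT^2/(CI)$ with $P = w_1 w_2$, $C = w_1 w_2 n$, and $I = 2(w_1 + w_2)n$. The per-cycle bandwidths used for the Huang-Abraham array, $2(w_1 + w_2)$, and for the proposed array, $w_1 + w_2$, are assumptions inferred from the pumping rates of the pipelines (every cycle and every other cycle, respectively); the function names are hypothetical.

```python
def huang_abraham_ratio(P, B, T, C, I):
    """R = P * B * T**2 / (C * I)."""
    return P * B * T ** 2 / (C * I)


def compare_designs(n, w1, w2):
    """Return the ratio R for each of the three BMMSA designs."""
    P = w1 * w2               # number of PE's, common to all three designs
    C = w1 * w2 * n           # gross volume of computations
    I = 2 * (w1 + w2) * n     # gross input transfer volume
    m = min(w1, w2)
    return {
        # bandwidth B, latency T as listed for each design
        "Kung-Leiserson": huang_abraham_ratio(P, 2 * (w1 + w2) / 3, 3 * n + m, C, I),
        "proposed": huang_abraham_ratio(P, w1 + w2, 2 * n + m, C, I),
        "Huang-Abraham": huang_abraham_ratio(P, 2 * (w1 + w2), n + m, C, I),
    }


if __name__ == "__main__":
    for name, R in compare_designs(n=100, w1=5, w2=7).items():
        print(f"{name}: R = {R:.4f}")
```

With these expressions the ratios simplify to $(3n + m)^2/(3n^2)$, $(2n + m)^2/(2n^2)$, and $(n + m)^2/n^2$ with $m = \min(w_1, w_2)$, which approach 3, 2, and 1 from above as n grows relative to the bandwidths.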
An additional advantage of the new design, which may not be obvious at first glance, is its adaptability to efficient CED implementations, as compared to the other two structures. This will be pursued in the next section.
IV. THE DESIGN OF BMMSA WITH ROBUST CED CAPABILITY
It has been demonstrated that, among the different architectural strata within a systolic array, it is best to implement a CED mechanism at the PE/FA (full adder) level with the time redundancy technique of RESO [19]. RESO is best adapted to an iterative logic array (ILA), which is the underlying structure of the heart of a PE.
Developed in an earlier study of fault-tolerant systolic arrays of one dimension, the RESO-based time redundancy CED scheme [19] has definite advantages over other space redundancy CED schemes [13]-[17]. The proposed RESO-based CED scheme is applicable to pipelined or nonpipelined PE's (Fig. 3) [19].
With the success of the previous work on the 1-D array for the multiplication of a band matrix and a vector, it would appear that the extension of the same practice to a BMMSA would be straightforward. However, this is only true to a certain extent. With a 2-D systolic array, the subtleties lie in the added complexities involving two-dimensional arrangements of the sets of pipelines, as well as the multiplicities in design philosophies. Nonetheless, these subtleties can be turned into assets, rather than burdens, inasmuch as attention to details is exercised.
The RESO-based CED implementation in the three BMMSA designs discussed in the last section is presented.
A. RESO-Based CED Implementation in a Kung-Leiserson BMMSA
Since the Kung-Leiserson array has an inherent per-cycle PE utilization of about 33 percent, there is ample allowance for the RESO-based CED implementation. After a PE is active for one cycle, it can be drafted to recompute the same operands (shifted) again, in one of the two following idle cycles. Specifically, the adaptation of the Kung-Leiserson array to the RESO-based CED scheme calls, as shown in Fig. 10, for the following orchestration:
1) The arrangement and progression of the A, B, and C pipelines are to be the same as before.
2) Each PE must be RESO-lized (Fig. 11).
3) The latches $R_A$, $R_B$, and $R_C$ of each PE will receive new data only once every three cycles.
4) Following a regular active cycle, a PE will be activated again to carry out the recomputation step, instead of being left idle.
5) Error-contingency algorithms will be executed upon the issuance of an error signal by the comparator inside the PE. An error is detected when the results of the original and recomputed steps conflict.
Notice that the fourth point above means that the PE's must be partitioned into three groups, according to the three control clocks ($CL_1$, $CL_2$, and $CL_3$). $CL_1$ leads $CL_2$ by one systolic cycle, and $CL_3$ by two systolic cycles. Each clock signal is three systolic cycles long.
B. RESO-Based CED Implementation in the Proposed BMMSA
Since the BMMSA proposed in this paper has PE's that inherently retain their data for two cycles, and due to the fact that each of its PE's is idle every other cycle, the modifications to implement the RESO-based CED, as shown in Fig. 12, should be simpler than those of the Kung-Leiserson array.
1) The arrangement and progression of the A, B, and C pipelines are to be the same as before.
2) Each PE must be RESO-lized (Fig. 11).
3) As before, the latches $R_A$, $R_B$, and $R_C$ of each PE will receive new data only once every two cycles.
4) Following a regular active cycle, a PE will be activated again to carry out the recomputation step, instead of being left idle.
5) Error-contingency algorithms will be executed upon the issuance of an error signal by the comparator inside the PE. An error is detected when the results of the original and recomputed steps conflict.
The amount of modifications is minimal, since the PE's already belong to two clock-driven groups.
C. RESO-Based CED Implementation in the Huang-Abraham BMMSA
The implementation of the RESO-based CED scheme in the Huang-Abraham array requires extensive modifications. This is due to the fact that the corresponding PE's are active every cycle, and the data inside a PE are only retained for a single cycle. As a consequence, the following changes are required (Fig. 13):
1) The arrangement of the A, B, and C pipelines is to be similar to the original design. However, the progression of these pipelines must be drastically altered.
2) The input data must be pumped into the pipelines during every other cycle. The idea is to allow for a 50-percent per-cycle PE idleness, which is a must for RESO.
3) In contrast to the case before, the latches $R_A$, $R_B$, and $R_C$ of each PE will receive new data only once every two cycles. They are to retain their values in the alternate cycles.
4) Each PE must be RESO-lized (Fig. 11).
5) Following a regular active cycle, a PE will be activated again to carry out the recomputation step, instead of being left idle.
6) Error-contingency algorithms will be executed upon the issuance of an error signal by the comparator inside the PE. An error is detected when the results of the original and recomputed steps conflict.
It should be pointed out that the PE's must be partitioned into two clock-driven groups. The control clocks are $CL_1$ and $CL_2$, where $CL_1$ is to lead $CL_2$ by one cycle. Both signals are to be two cycles long.
It is quite obvious that most attributes of this CED implementation are similar to the implementation of the proposed BMMSA design, except the I/O bandwidth. The maximum per-cycle I/O bandwidth for this implementation is more than twice that of the implementation in the proposed BMMSA. With everything else being the same, the Huang-Abraham ratio for the CED investment in this system would be larger than that for the corresponding investment in the new BMMSA. Thus, it appears that when CED realization is concerned, the RESO-invested new BMMSA design would be best suited for matrices of narrow to medium bandwidths.
On the other hand, Huang and Abraham have derived a concurrent error diagnosis and correction scheme for their BMMSA [17]. Based on the concept of check-sum matrices, that scheme is not without its drawbacks. In particular, the Huang-Abraham check-sum band matrix multiplication array can only diagnose and correct errors from a single faulty PE; the diagnosis and correction latency is very long; it is vulnerable to false alarms brought on by roundoffs; and its structure is no longer highly regular and granular.
D. Comparison of the CED-Invested BMMSA's
The investment of CED capability in each of the three systolic arrays brought about quite interesting results. As far as the modifications required for the implementation are concerned, the new BMMSA design presented the least problem, while the Huang-Abraham system had the most difficulties.
The proposed BMMSA design achieved the greatest enhancement in its functionality. Its per-cycle PE utilization rate becomes about 100 percent. The Kung-Leiserson array also received significant improvements in its performance. Its per-cycle PE utilization rate becomes roughly 67 percent. The effectiveness of the Huang-Abraham system, however, experienced degradations. Its throughput latency is doubled, and its Huang-Abraham ratio is quadrupled. As pointed out in the previous section, the Huang-Abraham array loses its edge in multiplications of narrow-banded matrices when the CED mechanism is installed. The proposed BMMSA design then becomes optimal in multiplications of matrices of narrow to medium bands. The Kung-Leiserson system, however, should remain appropriate for wide-banded matrix multiplication. Table III outlines the different attributes of these CED-implemented BMMSA's.
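The utilization figures above follow from a simple per-PE schedule, sketched here under stated assumptions: each PE receives new operands once per period (three cycles for the Kung-Leiserson array, two for the proposed array, and two for the Huang-Abraham array once its inputs are slowed for CED), and the recomputation step is placed in the cycle immediately after the regular active cycle. The function names and the single-letter schedule notation are hypothetical.

```python
from fractions import Fraction


def pe_activity(period, phase, cycles, with_ced):
    """Schedule of one PE: 'C' = compute, 'R' = recompute (shifted), '.' = idle."""
    marks = []
    for t in range(cycles):
        slot = (t - phase) % period
        if slot == 0:
            marks.append("C")
        elif slot == 1 and with_ced:
            marks.append("R")
        else:
            marks.append(".")
    return "".join(marks)


def utilization(period, with_ced):
    """Fraction of busy PE-cycles, averaged over one PE from each clock group."""
    cycles = 6 * period
    busy = sum(
        mark != "."
        for phase in range(period)  # one clock group per phase
        for mark in pe_activity(period, phase, cycles, with_ced)
    )
    return Fraction(busy, period * cycles)


if __name__ == "__main__":
    print(pe_activity(3, 0, 6, with_ced=True))
    print(pe_activity(3, 1, 6, with_ced=True))
    for name, plain, ced in (("Kung-Leiserson", 3, 3),
                             ("proposed", 2, 2),
                             ("Huang-Abraham", 1, 2)):
        print(name, utilization(plain, False), utilization(ced, True))
```

In this model the Kung-Leiserson array moves from 1/3 to 2/3 utilization and the proposed array from 1/2 to 1, while the Huang-Abraham array stays at full utilization only because its input period is stretched from one cycle to two, which is where its doubled latency comes from.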
V. CONCLUSIONS
In order to achieve I/O bottleneck minimization, a Kung-Leiserson systolic system trades off some computational speed, which results in inherent PE idleness in any one time step. In the band matrix-vector multiplication, the Kung-Leiserson structure has a per-cycle PE idleness of about 50 percent. In the band matrix multiplication, about 67 percent of the PE's in the corresponding array are idle in each cycle. An efficient, low-overhead CED scheme based on RESO was successfully implemented in the Kung-Leiserson band matrix-vector multiplication array [19]. In that design, the per-cycle PE idleness is basically nonexistent. The incorporation of a similar CED scheme into the Kung-Leiserson band matrix multiplication array has also been achieved in the present work, and the resulting per-cycle PE idleness becomes about 33 percent.
The Huang-Abraham systolic system emphasized the throughput rather than the I/O bottleneck. In the band matrix multiplication array of such system, there is practically no per-cycle PE idleness once the computation is in full steam. On the other hand, the Huang-Abraham BMMSA leaves no room for the much-needed functionality of fault tolerance. In other words, appreciable overhead must be introduced to ensure the integrity of its fast output.
A new systolic array, earmarked for multiplications of medium-banded matrices, was presented. Its 50-percent per-cycle PE utilization rate is better than that of the Kung-Leiserson array. It also has the advantage that its I/O bandwidth is only half that of the Huang-Abraham design. However, its control signals may be more involved than those of the Huang-Abraham array. The CED implementation in this new design resulted in optimizing its functionality. Its per-cycle PE idleness is practically eliminated. In fact, it becomes optimal not just for medium-banded but also for narrow-banded matrix multiplication with CED capability. Furthermore, as long as maintaining three arrays for the ranges of band matrices is impractical, this new design may be the one to be preferred. Although the proposed BMMSA design concentrates on hex-connected systolic arrays, the generality of the assumptions means that the extension of this design strategy to other types of systolic structures should be straightforward. Also, the design concept for band matrix multiplication can be implemented in other arithmetic applications.
