Area-Time Optimal VLSI Networks for Matrix Multiplication and Inversion of Triangular Matrices by Preparata, F.P. & Vuillemin, J.E.
ACT-23 APR IL,1980
S . 3 COORDINATED SCIENCE LABORATORY
APPLIED COMPUTATION THEORY GROUP
AREA-TIME OPTIMAL VLSI 
NETWORKS FOR MATRIX 
MULTIPLICATION AND INVERSION 
OF TRIANGULAR MATRICES
F. R PREPARATA 
J.E. VUILLEMIN
APPROVED FOR PUBLIC RELEASE. DISTRIBUTION UNLIMITED.
REPORT R-879 UILU-ENG 80-2211
UNIVERSITY OF ILLINOIS -  URBANA, ILLINOIS
UNCLASSIFIED
S E C U R I T Y  C L A S S I F I C A T I O N  O F  T H I S  P A G E  (When D ete  Entered)
REPORT DOCUMENTATION PAGE READ INSTRUCTIONS BEFORE COMPLETING FORM
i. r e p o r t  n u m b e r 2. G O V T  A C C E S S I O N  NO. 3. R E C I P I E N T ’S C A T A L O G  N U M B E R
4. T I T L E  (and  Subtitle)
AREA-TIME OPTIMAL VLSI NETWORKS FOR MATRIX 
MULTIPLICATION AND INVERSION OF TRIANGULAR MATRICES
5. T Y P E  O F  R E P O R T  4 P E R I O O  C O V E R E O
Technical Report
6. P E R F O R M I N G  O R G .  R E P O R T  N U M B E R
R-879 (ACT-23);UILU-ENG80-2211
7. A U T H O R ( s )
F. P. Preparata 
J. E. Vuillemin
3. C O N T R A C T  O R  G R A N T  N U M B E R f a J
MCS-78-13642; N00014-79- 
C-0424.
9. P E R F O R M I N G  O R G A N I Z A T I O N  N A M E  A N D  A O D R E S S
Coordinated Science Laboratory 
University of I l l in o is  at Urbana-Champaign 
Urbana, I llin o is  61801
10. P R O G R A M  E L E M E N T ,  P R O J E C T .  T A S K  
A R E A  4 W O R K  U N I T  N U M B E R S
11. C O N T R O L L I N G  O F F I C E  N A M E  A N D  A O O R E S S
National Science Foundation
Joint Services Electronics Program Contract
12. R E P O R T  D A T E
April, 1980
13. N U M B E R  O F  P A G E S
23
14. M O N I T O R I N G  A G E N C Y  N A M E  4 A O O R E S S ( i i  d iilerent from C ontro lling  O ffice ) IS. S E C U R I T Y  C L A S S ,  (of. this report)
UNCLASSIFIED
15«. D E C L A S S I  FI  C A T I O N / D O W N G R A D I N G  
S C H E D U L E
16. D I S T R I B U T I O N  S T A T E M E N T  (of thia Report)
Approved for public release; distribution unlimited
17. D I S T R I B U T I O N  S T A T E M E N T  (of the abstract entered in B lo c k  20, it different from Report)
18. s u p p l e m e n t a r y  n o t e s
19. K E Y  W O R D S  (C o nt in ue  on reverse  s id e  if n e ce ssa ry  and identify  by b lock  number)
VLSI, matrix multiplication, triangular matrices, matrix inversion, 
optimal networks, area-time complexity, pipeline computation.
20. A B S T R A C T  (C ontinue  on reverse  side if n e ce ssa ry  and identify  by b lock  number)
This report consists of two papers describing networks for para lle l matrix 
operations. In the f ir s t  paper, "Area-time optimal VLSI networks for multiplying 
matrices," we describe a class of VLSI networks having chip area A for multiply­
ing two nxn  matrices in time T, with an area X time^ product A»T^ = 0(n^). These 
networks achieve Savage's [1] lower bound to this complexity measure for any T 
such that log n <  T ^  n. The second paper, "Optimal integrated-circuit implemen­
tation of triangular matrix inversion," describes a class of VLSI implementations 
of algorithms for inverting an n X n triangular matrix. These networks have
DD F O R M  1 J A N  73 1473 UNCLASSIFIED
S E C U R I T Y  C L A S S I F I C A T I O N  O F  T H I S  P A G E  (When Data Entered)
UNCLASSIFIED______________________ __
S E C U R I T Y  C L A S S I F I C A T I O N  O F  T H I S  P A G S H V h t i  D a t a  Entered)
2 2 4area A and time T, with an area X time product AT = 0(n ) for a l l  values of T 
2
such that O(log n) <  T <  0 (n ). Since there is a simple reduction of matrix 
multiplication to inversion of a triangular matrix, due to Savage's result, 
the presented networks are optimal in the VLSI model.
S E C U R IT Y  C L A S S IF IC A T IO N  O F  T H IS  PAGE(T*7»*n Data E ntered )
AREA-TIME OPTIMAL VLSI NETWORKS FOR MATRIX MULTIPLICATION 
AND INVERSION OF TRIANGULAR MATRICES
by
F. P. Preparata and Jean Vuillemin
This work was supported in part by the National Science Foundation 
under Grant MCS 78-13642 and Joint Services Electronics Program (U.S. 
Army, U.S. Navy, and U.S. Air Force) under Contract N00014-79-C-0424.
Reproduction in whole or in part is permitted for any purpose of 
the United State Government.
This report is issued simultaneously by the Coordinated Science 
Laboratory and by the Institut National de Recherche d'informatique et 
d’Automatique, 78150 Rocquencourt, France.
Approved for public release. Distribution unlimited.
AREA-TIME OPTIMAL VLSI NETWORKS FOR MATRIX MULTIPLICATION 
AND INVERSION OF TRIANGULAR MATRICESt
F. P. Preparata and Jean Vuillemin
Abstract
This report consists of two papers describing networks for para lle l 
matrix operations. In the f ir s t  paper, "Area-time optimal VLSI networks 
for multiplying matrices," we describe a class of VLSI networks having 
chip area A for multiplying two n X n matrices in time T, with an 
2 2 4area X time product A*T = 0(n ) .  These networks achieve Savage’ s 
lower bound to this complexity measure for any T such that log n <  T <  n. 
The second paper, "Optimal integrated-circuit implementation of triangular 
matrix inversion," describes a class of VLSI implementations of algorithms 
for inverting an n X n triangular matrix. These networks have area A and
2 2 4time T, with an area x time product AT = 0(n ) for a l l  values of T such 
2
that O(log n) 5= T <  0 (n ). Since there is a simple reduction of matrix 
multiplication to inversion of a triangular matrix, due to Savage’ s result 
the presented networks are optimal in the VIEI model.
Keywords : VLSI, matrix multiplication, triangular matrices, matrix inversion, 
optimal networks, area-time complexity, pipeline computation.
f
This work was partia lly  supported by National Science Foundation Grant 
MCS-78-13642, and by the Joint Services Electronics Program Contract 
N00014-79-C-0424, and by ERA 452 "Al Khowarizmi", Centre National de 
la Recherche Scientifique, France.
*
Coordinated Science Laboratory, University of I l l in o is ,  Urbana, 
I llin o is  61801.
**
Laboratoire de Recherche en Informatique, Bat. 490, Université de 
Paris-Sud, 91405 Orsay.
tAREA-TIME OPTIMAL VLSI NETWORKS FOR MULTIPLYING MATRICES
* . . **Franco P. Preparata and Jean E. Vuillemin
Abstract: We describe a class of VLSI networks having chip area A
for multiplying two nXn matrices in time T, with an 
2 2 4area X time product A • T = 0(n ) .  These networks 
achieve Savage's [1] lower bound to this complexity 
measure for any T such that log n £ T ^ n.
Keywords: VLSI, matrix multiplication, optimal networks, area-time 
complexity, pipeline computation.
This work was partia lly  supported by National Science Foundation Grant 
MCS-78-13642, and by the Joint Services Electronics Program Contract 
N00014-79-C-0424, and by ERA 452 "Al Khowarizmi", Centre National de 
la Recherche Scientifique, France.
Coordinated Science Laboratory, University of I l l in o is ,  Urbana, 
I llin o is  61801.
irk a . . /
Laboratoire de Recherche en Informatique, Bat. 490, Université de 
Paris-Sud, 91405 Orsay.
1. Introduction
Savage [1] has shown that, under reasonable assumptions reflecting
current VLSI technology [2 ,3 ,4 ], the designer of any circu it for multiplying
two n X n matrices is confronted with a tradeoff between chip area A and
2 4computation time T expressed by A*T = 0 (n ) .  Designs of Kung-Leiserson
2
[5] meet that bound for T = 0(n) and minimal area A = 0(n ) .
The purpose of this note is to illu stra te  a general design scheme of
VLSI networks for multiplying two n X n matrices. According to this scheme,
2
a network with optimal value of the statistics AT can be designed for any 
value of T ^  in the range [logn,n] . ^  The scheme makes use of a recursively 
defined matrix multiplier - to be described next - and implements pipelining 
in an effic ient way.
2. A Straightforward Matrix Multiplier
Let U = ^^  ^\ and V = ^^ ®  ^ be two s X s matrices. We present a
network for computing the matrix product U X V in a recursive fashion,
assuming that we know how to construct circuits for multiplying two 
s s
2 X y matrices, such as a, b, c, d, e, f ,  g, h.
The layout for such a rectangular network in figure 1, where one finds
2
8 recursively defined m ultipliers; lines drawn represent bundles of s /4 
wires, corresponding to the para lle l transmission of fu l l  j  X j  sub­
matrices; the network comprises 4 matrix adders, which are placed in the
2 2 s s
' X 2 j— area occupied by the intersection of two bundles of wires as 
shown in figure 2.
(1)
(2)
Computation time T is expressed in units, which also reflect the 
current state of technology.
A ll logarithms in this paper are to the base of 2.
2Figure 1. Layout of the recursive matrix multiplier
$
Figure 2. A closer look at the adders- of figure 1 (p = s/2).
3For specificity , we assume the matrix elements to be drawn from a 
fin ite  ring, so that an elementary fin ite  chip can be used for multiplying 
and adding elements in that ring in constant time and area. Our recursive 
definition stops at s = 1, where we use the elementary circuit for ring 
multiplication.
The width w (s) of our rectangular network satisfies the recurrence
w (s) = 6 X + 3w(|) , w (l )  = w1 ’
where X is the wire width as in [2] or [4 ] ,and w^  the width of the e le ­
mentary m ultiplier. (We assume elementary adders to have width and 
height X.) The solution to this recurrence is (where we have set s = 2P) :
« / log 3 \
w(2^) = 6(4^-3^)X + 3^ *w^ , thus w (s) = 0(s ) *X + 0 \ s / *w^  .
Sim ilarly, the height h (s ) of our circuit satisfies h (s ) = 11 —  *X + 3h(— ) ,  
h (l )  = h^,where h^ is the height of the elementary m ultiplier. The exact
ry
solution is  given by h(2^) = 11 (4^-3^)X +' 3^hp thus h (s) = 0(s ) and
4
the total chip area A = h X w = 0(s ) .
Notice that the network consists of 2 ("log si+1 levels, so subdivided: 
the f ir s t  ^log si levels are buffer-drivers, whose only purpose is to 
construct two copies of their input; next, there is a single level of 
elementary multipliers, followed by i~log si levels of adders. Therefore 
the computation time of the network just described is
T (s) = 'log  si • t +t +i log si t , where t , t , and t are respectively the 
times for copying, multiplying, and adding, for a total of T (s) = O(log s ) . 
Furthermore, i t  is important to point out that at any time before the end 
of the computation a l l  intermediate results are on one and the same level
of the matrix m ultiplier, which is therefore ideally  suited for pipelined 
operation. This straightforward scheme, when used for multiplying nXn
4matrices, has area A = 0(n ) and time T = O(logn), and thus does not yet 
2 4meet the AT = Q(n ) bound.
3. Pipelining Strategies
Consider now two n X n matrices A and B, and decompose each of them 
2  n n
into r blocks of size — X -  . Specifically  for A we obtain
~ An A12 . . .  Al r -
A = A21
C
M
C
M
<
. . .  A« 2r
a n w n , A . . an — X — matrix i j  r r
- Ar l
C
M . . .  A _rr
and sim ilarly for B.
I f  we define the n X n matrices
C.
J (j  = 1,2, •,r) ,
obviously C = A X B - C ^  + C2 + . . .  + C^ _. Thus the product of A and B 
can be obtained by "accumulating" the matrices C ^  .. . ,C . We now show 
how the calculation of these matrices and their accumulation can be 
pipelined.
The scheme is a c lassica l one-way pipelining application (see [6 ], ch. 9). 
Consider the r X r square array of modules shown in figure 3. For the time 
being we assume that each module has a register c and receives two operands 
a and b (a on its "west" input and b on its "north" input); it  performs 
the operation c «- a X b, and passes on a to the "east" and b to the 
"south". Here a, b, and c are (n/r) X (n/r) matrices.
5Ir
'12
’l l
21
r i -*(Ì)-*(Ì)— * o
Figure 3. General layout of the network.
TTo compute C^, we feed [A ^  k ^  ••• A^] and [B.^ B ^  ••• B.^] with the 
timing as shown in figure 3 at uniform speed. At each step the front 
of the moving data lies on a secondary diagonal of the array; the 
modules on this diagonal compute a product and store i t ,  and i t  is  
obvious that after (2 r - l )  such steps the matrix is  stored in the 
module registers. Since at each step only one diagonal of the array is  
active, the pipelining can be naturally implemented. Specifically , A is 
pipelined from le ft  to right and B from top to bottom; each module is  
now to be viewed as an inner product processor, which executes c c +ab, 
thereby accumulating the matrix C = A X B (see figure 4) .
6rr
r2 2r
r l 'lr
22
rr
A2r . . .  A,
A12 A11
Figure 4. Pipelining scheme.
The structure of the inner product module is shown in figure 5. It
is easily realized that the module, which incorporates an sX s multiplier 
of the type described in Section 2, with s=n /r, has width and height both
7The computation of the product matrix C is  effected by a seria l
pipelining both through the array and the m ultipliers; since the former
has r levels and the latter has O (log (n/r)) levels, the matrix C is
available after time T = 0 (r+ log (n/r)) ,  and can be shifted out in either
row-or column-major order (sh ifting of the output actually may begin
even before the last block of C has been computed).
To evaluate the performance of the scheme, we note that the array
consists of a compact mesh of r X r modules, each of which has area 
4 40 (n /r ) .  Thus the network area is
*-•($)
As for the time T, we have just shown that
T = O(r+log 2 ).
Combining these two results we obtain
at2 = ° f ir ' ) x r2 x ° (U  + 7 loS f ) 2)
X x\ 2For a l l  values of T in the range [logn,n], the term 0 ((1  + — log —) ) 
is bounded by a constant, whence
2 4AT = 0 (n )
thereby substantiating our earlier claim.
Remark. There is  another pipelining strategy - apparently 
unrelated to the preceding one - which also allows us to meet Savage’ s 
bound, but for a smaller interval of computation times T. This strategy
is worth reporting.
8Again, regard an n X n matrix as an n/rXn/r matrix whose elements are 
rX r  blocks. Using a single n/rXn/r matrix multiplier of the type d is­
played in figure 1, we adopt the following pipelining scheme:
2
1) the r elements of each block for both matrices A and B are fed
2
seria lly  on one input line (there are (n/r) such lines);
2) the elementary multiplier (see Section 2) must now have the
capability of performing an rX r matrix multiplication. Such
elementary multiplier could be, for example, a hexagonal mesh of
2
the Kung-Leiserson [5] type, with area 0(r ) and time 0 (r ).  The
operands arrive se ria lly , must be stored, then processed, and
the result fina lly  must be released seria lly ; obviously the
2
arrival and release pipelining times 0(r ) predominate over the 
multiplication time 0 (r ).
3) The result is n/rxn/r matrix, whose elements are rX r  blocks 
which appear seria lly  on each output line.
o
We evaluate the performance of the scheme. As long as r ^ O(logn)
2
we have for the computation time that T = 0(r ) .  As to the area, we note
•k
that the width w of the network is given by
where welem> the width of the elementary multiplier is now 0 (r ).  It follows 
that, as long as r <  0(n^2’ lc>8 (3“lo§ 3) ) we have w* = o(n2/r2) ;  a similar
results holds for the height h" of the network, whence A = 0(n^/r^) .  We 
conclude that
2 4AT = 0(n )
0.58)( i . e . , it is optimal) for a ll  values of T such that 0 (logn) ^  T^O (n
9References
1. J. E. Savage, "Area-time tradeoffs for matrix multiplication and 
transitive closure in the VLSI model," Proc. of the 17th Annual 
Allerton Conference on Communications, Control and Computing,
Oct. 1979.
2. C. Mead, L. Conway, Introduction to VLSI Systems. Addison-Wesley, 
Reading, Mass. 1980.
3. R. P. Brent, H. T. Kung, "The area-time complexity of binary 
m ultiplication," Research Report CMU-CS-79-136, Carnegie-Melion 
University, Pittsburgh, Penn., Sept. 1979.
4. C. D. Thompson, "A complexity theory for VLSI," Ph.D. Thesis, Dept, 
of Computer Science, Carnegie-Mellon University, Pittsburgh, Penn., 
Sept. 1979.
5. H. T. Kung, C. E. Leiserson, "Algorithms of VLSI processor arrays," 
Symposium on Sparse Matrix Computations, Knoxville, Tenn., Nov. 1978.
6. T. C. Chen, Overlap and Pipeline Processing, pp. 375-431 in 
Introduction to Computer Architecture, (H. S. Stone, ed itor),
Science Research Associates, 1975.
OPTIMAL INTEGRATED-CIRCUIT IMPLEMENTATION
tOF TRIANGULAR MATRIX INVERSION
Franco P. Preparata and Jean Vuillemin
Abstract
We describe a class of integrated-circuit implementations of
algorithms for inverting an nx n triangular matrix. These networks
2 2 4have area A and time T, with an area X time product AT = 0(n ) for
2
a ll  values of T such that 0(log n) £ T £ 0(n ). Since there is a
simple reduction of matrix multiplication to inversion of a triangular
2 4matrix, and Savage [6  ] has given an AT = 0 (n ) lower-bound for n X n 
matrix multiplication, the presented networks are optimal in the 
VLSI model.
Keywords: VLSI, matrix inversion, triangular matrices, area-time
complexity, pipeline computation, optimal networks
This work was partia lly  supported by National Science Foundation Grant 
MCS-78-13642, and by the Joint Services Electronics Program Contract 
N00014-79-C-0424, and by ERA 452 "Al Khowarizmi", Centre National de 
la Recherche Scientifique, France.
Coordinated Science Laboratory, University of I llin o is , Urbana, 
I llin o is  61801.
** A /
Laboratoire de Recherche en Informatique, Bat. 490, Université de 
Paris-Sud, 91405 Orsay.
11
1. Introduction
Increasing attention has been paid recently to the design of networks 
for the direct implementation of several interesting algorithms using the 
integrated-circuit technology (VLSI); particularly, combinatorial and 
numerical problems have been the target of these investigations [1 -4 ].
Among numerical problems, several workers have directed their attention 
to matrix computations [1 ,2 ,5 ], and, as regards the design of networks, 
have found that the mesh interconnection of computing modules is partic­
ularly attuned to this class of problems, leading to optimal rea liza ­
tions [5,6] in the VLSI model [7 ,8 ].
In this paper we consider the problem of designing VLSI net­
works for inverting a nonsingular triangular matrix. The design 
complies with specifications of the VLSI model of computation 
recently proposed by Mead, Conway, and Thompson [7 ,8 ]. In this model, 
the network is a computation graph consisting of nodes (processing 
modules) and wires. Wires have unit width and are partitionable into 
two orthogonal sheaves. A data item takes a unit of time to propagate
along a wire from node to node (processing time is thus absorbed into
2
propagation time). The usual complexity metric is the area X time 
2
product (AT ),  which embodies a trade-off between production cost (chip
area A) and incremental cost (time T ).
Within this model, Savage [ 6 ] has recently proved the following
interesting result: any VLSI design for the multiplication of two n X n
matrices, with chip area A and computation time T, must satisfy  the 
2 4lower bound AT ^ C n , for some constant C. In [5] the authors demonstrate
12
the existence of VLSI networks for multiplying n X n matrices with 
2 4
AT = 0(n ) for any computation time T within the bounds logn ^  T <  n.
2
Note that an AT ^ Cn bound also holds for the problem of inverting a 
nonsingular n X n triangular matrix, since matrix multiplication is  
reducible to i t ;  the straightforward reduction is based on the fact that 
the inverse of the 3n X 3n triangular matrix
"i A o' "i -A ab"
0 I B is 0 I -B
_0 0 I_ .0 0 I
i . e . ,  it  contains an n X n block equal to the product AB.
This paper is organized as follows; In Section 2 we review the 
general scheme for inverting an n X n triangular matrix, and evaluate 
two network implementations, corresponding respectively to block­
partitioning the matrix and choosing extreme values for the block size
in the allowable range. These two inverters are referred to as
2"recursive and "systo lic" respectively; with respect to the AT
statistics only the latter is  optimal for T = 0 (n ). In Section 3 we
show that the recursive and systolic inverters can be combined to build
2 4networks, called "mixed" inverters, which realize the AT = 0 (n ) lower
2
bound for a l l  values of T such that O(log n) <  T ^  0 (n ).
13
2. The general scheme for inverting a nonsingular triangular matrix.
Let A be a nonsingular nXn triangular matrix^'* to be thought of 
as an n/sXn/s matrix whose elements are sX s blocks of the original
entries (s is a parameter in the range [ l ,n / 2 ]) ;  let be the ( i , j )  
block of A (i , j * 1,2,. •. ,n/s) and let A^\^ be the corresponding block
of A-1 It is well known — and also straightforward to verify — that
-
iJ
[A( " 1} A( ""1)1 i i  »'Ai , i+ 1 ’ * i , j - l J
A. .
3-J
Ai+1,j
A. . . J-liJ
a ^
JJ (1)
This general formula w ill  now be specialized to two interesting cases
2.1 Recursive inversion
The standard scheme for the para lle l inversion of a triangular 
matrix [9,10] corresponds to specializing the general scheme to s=n/2. 
In this case the inverse of
An A12
0
A22
is
-1
L- - -A A, A11 11 12 22
0 A"122
(2 )
This immediately suggests a recursively defined network, containing two 
inverters of n/2X n/2 triangular matrices (to be used to compute A ^
The entries of a ll  matrices considered in this paper are assumed to 
be drawn from a fin ite  ring, so that an elementary fin ite  chip can 
be used for multiplying and adding entries in constant area and time.
14
and A^ 2  i-n p a ra lle l) and a network for the para lle l multiplication of
two n/2Xn/2 matrices (to be used to compute (A-q ^ 2 ^ 2 2  in the order
shown by the parenthesization). In figure 1 we show a possible layout
2
for such a network. Each line shown carries n /4 operands in paralle l
22
42
11
Figure 1. Layout ‘of the recursive matrix inverter;
Shaded boxes are data buffers.
2
and the shaded surfaces are buffers of capacity (n /4); the core of the circuit  
are two multipliers of two (n/2) x (n/2) matrices, of a type described in [5 ],
and called recursive m ultipliers. Each of these multipliers has height and
2 2width bounded respectively by (6/4)n and (ll/4 )n  and computes a
matrix product in 21ogn-l time units. Due to the recursive definition
of the inverter, a simple argument shows that its height and width are
2 2respectively bounded by (15/4)n and (15/3)n ; also, the computation time
2
is 0 ((logn ) ) .  Notice therefore that for the matrix inverter being described— 
called recursive inverter — we have the following properties (referred to
are n X n m atrix):
15
Type A T
C
M
H<
Recursive inverter 40(n ) O(log2n) 4 4 0(n log n) (3)
2 4Note that AT is short of the optimal value ft(n ) .
2•1 Systolic inversion
The next scheme to be described corresponds to the choice s=  1 in 
the general method. The resulting network is a mesh of processors, each 
of which feeds data in and out, each time performing some computation, 
keeping a regular flow in the network. Such networks have been called  
systolic by Kung and Leiserson [1 ].
With our choice of s9 block A ^  in ( i )  becomes entry a ^  (and sim ilarly
A^k^  becomes a ^ ^ ) .  The form of (1) suggests a computation method on
an nXn square mesh (figure 2). Only the upper-triangular positions
in this mesh need contain processing modules ( i . e . ,  denoting by
M .. the module in position ( i , j ) ,  M. . is deployed only for j ^ i ) .
J J
Modules are of two types with different computational capabilities: D-modules 
and M-modules, placed respectively in diagonal and off-diagonal positions. 
Entry a ^ ^  of A  ^ w ill be computed in place in M^^. For i = j
n ) M-modules
'm 'NNnn
Figure 2. General structure of the systolic matrix inverter 
(triangular mesh)
16
(diagonal entry), the corresponding D-module must invert entry a . . ,
( - 1)
i . e . ,  a_^ = l/a ._; for j > i ,  the process is more complex and is to
be analyzed.
Each operation — inversion of an entry or multiplication of two 
entries — is conventionally assumed to take one "step". Assume induc­
tively  that for fixed t and 2k-h < t , the computation of a ^ ^  be com­
pleted in at the end of the (2k-h)-th step; the basis for this 
induction is easily established for h = k, in which case we just activate 
at the h-th step. We now extend the induction by showing that a. . 
with 2 j - i  = t, can be computed at the end of the t-th step. Indeed, 
af ^  (p < j )  is computed, by the inductive hypothesis, at the end of
the (2p -i)-th  step; after this computation is completed, af ^  is
IP
shifted to the right along the i-th  row of the mesh (figure 2) (one
position per step), so that a [ ^  resides in M .. at the end of the
IP iJ
[ (2p -i) + ( j - p ) ] -th step. Similarly entry a^  ^(p < j) is shifted upwards 
along the j-th  column of the mesh of figure 1 (one position per time 
step) starting at step ( j -h i), so that a  ^ resides in at step 
(p - i+ j ) ,  simultaneously with af ^ . Thus, the product a • a . canip r ip pj
be computed in the step (p - i+ j ) ,  and accummulated in . This shows 
that the inner product [ a ^ 1^ ,. . . ,  af "P ., ] • [a. a. ,] resides in
i iJ
at the end of step ( j - l ) - i + j  = 2j - i -1 .  Now, recall that,by the 
inductive hypothesis, a ^ ^  is computed in at step j ;  therefore 
it  can be shifted upwards in the mesh and after ( j - i )  steps ( i . e . ,  at 
step t = 2 j - i )  it  arrives in M_, where the computation of a^7^ is 
completed at the end of the t-th  step, as was originally  claimed.
For c larity , in figure 3(a) we illu strate  the timing of the computa­
tions: Each module is labelled with an integer which denotes the step
17
at which computation in that module is completed. Also, in figure 3 (b and 
c) we present snapshots of the data participating in the horizontal and 
vertical flow, respectively, at step 7. Clearly the calculation of A  ^
is completed in 2n-l steps.
As regards implementation details,
1 3 5 7 . . .
2 4 6  . . .
3 5 7 . . .
4 6 . . .
5 7 . . .
6 . . .
7 . . .
X X X
* X X X ■ * X
/
X
\
X
• • X X
L
• • X X
— o
• X X • X X
• X • X
X X
(a) (b) (c)
Figure 3 (a ): timing of completion of computation up to step 7.
( b )  : data (x) participating in horizontal flow at step 7.
( c )  : data (x) participating in vertical flow at step 7.
in it ia lly  module M. . is loaded with a .. .  While M. . ( i = l , . . . , n )  must simplylj ij li v '
compute 1/a.., an M-module M .. ( i^ j )  can be designed with one operand register i i  i  j
R and two buffers H and V; there are two operand input lines W and S, and 
two operand output lines, E and N (see figure 4 ). Buffers H and V 
constantly feed output lines E and N, respectively and the module must 
be capable of executing the following instructions:
1. R<- R + W*S, H ♦- W, V *“ S* for the general step.
2. R ♦“ -R*S, H«“ -R- S,  V -  S, for the fin a l step.
18
It  is readily realized that this is a ll  is needed to implement the
described algorithm.
N
A
s
Figure 4. Structure of an M-module ; H and V are buffers.
According to our original assumption that both the area of the 
processing modules and the time needed to execute any of the prescribed 
operations be bounded by a constant, we have the following:
Type A T AT2
lst-order systolic 
inverter
0(n2) 0(n) 0(n4)
2
i . e . ,  the network is optimal for the AT measure. The optimal be­
havior, however, is achieved only for T = 0 (n). An interesting question 
is whether it can be extended to a wider range of processing times.
This question is addressed in the next section.
19
3. Mixed networks.
We now describe how to combine the recursive and systolic inverters
2
described in the preceding section in order to improve the AT measure for 
a wide range of the time parameter T.
The resulting networks - to be called mixed - have the following 
general structure. A mixed network is  a systolic scheme, as the one 
described in 2.2, where the "operands", rather than being elementary entries, 
are blocks of s X s such entries. In the corresponding n/s X n/s triangular 
mesh (see figure 2), the modules must now be designed to process s X s 
blocks. The layout of mixed networks is  chosen as in figure 3, where the 
modules themselves have been conveniently assumed to have a rectangular 
shape on the chip (e lse , we consider the smallest rectangle with sides 
para lle l to the coordinate axes which contains the module). From figure 3, 
i t  is  clear that while the dimensions (width and height) of the M-module 
determine one dimension of the network -  say, its  width - ,  the other 
dimension - say, its  height - is  determined by the larger of the 
corresponding values for the D- and M-modules.
Figure 3. - General layout of mixed networks. Each line 
shown carries in para lle l s^ operands.
20
The f ir s t  kind of mixed networks to be considered is one for which 
the following selections are made (Type-1 mixed inverter) :
(1) D-modules are recursive inverters, as described in Section 2.1;
2 2they have width 5s , height (15/4)s , and computation time 
0 (log2s) .
(2) M-modules are recursive matrix multipliers of the Preparata-
Vuillemin type [5 ], as already used to build the recursive
inverter. They can be placed on the chip so that their width 
2 2and height are 2s and (15/4)s (see [5 ] for d e ta ils ); their 
computation time is O (logs).
Since the heights of the D- and M-modules are of the same order (in  this
n 2case, they are identical) the height of the network is  0 (— X s ) = 0 (ns);
2
since also the widths of the two modules are 0(s ) ,  the same holds for
2 2the width of the network. Thus, A = 0 (n s ) ,  and the smallest containing
rectangle is  nearly a square with both sides 0 (n s ). As regards computation
-1 2 time, the blocks A ^  ( i  = l , . . . ,A / s )  are a l l  computed in time 0 (log s ),
and after this, the mesh computation begins. We have observed in Section
2.2 that the systolic-network completes its  computation in 0(n/s) steps,
2
whence the total computation time is  T = 0 (log s + — logs). I f  we bound 
the parameter s by s ^  n/logn we obtain T = 0 (^ logs), and the performance 
of Type-1 networks is summarized as follows:
Type A T AT2
Type-1 mixed 
inverter
const.^ s £ n/logn
2 20 ( ns  ) 0(~logs) 4 2 0(n log s) (5)
21
The second kind of mixed networks (Type-2 mixed inverter) is 
constructed as follows:
(1) D-modules are type-1 mixed inverters (for s X s matrices). 
According to the preceding discussion, for any value of a 
parameter r <  s/logs, these modules have height and width
(2) M-modules are pipelined matrix m ultip lier, as introduced by
Preparata and Vuillemin in [5 ]. I t  is shown in [5 ] that one such
With respect to computation time, we obtain the same conclusions as for 
type-1 mixed inverters, i . e . ,
¿m Sboth 0 (s r ) and computation time O(log r + — lo g r ) . Choosing
2 2r = s/logs we obtain height 0(s /logs), width 0(s /logs), and 
2
time 0 (log s) .
and computation time O (logs).
2
Again, the dimensions of both D-modules and M-modules are 0(s /logs),
whence :
Therefore we obtain
22
Obviously, i f  s <  n/logn we have slogs < n, whence AT = 0(n ) ,  and the 
performance of the Type-2 mixed inverter is  so summarized:
Type A T AT2
Type-2 mixed 
inverter
const. ^  s < n/logn
2 2
° : s 2  log s
0 (~ logs) 0(n4)
Since as s varies from a small constant value to n/logn the computation
2
time T varies from 0(n) to O(log n ), we say that the above network meets
the AT2 = O(n^) optimal bound for a l l  T such that O(log n) <  T < 0 (n ).
Incidentally, even in totally  unrestricted models of computation - as
2
the shared-memory-machine [see, for example [10 ]] -  0 (log n) is the 
smallest known running time for inverting a triangular matrix.
23
REFERENCES
1. H. T. Kung and C. E. Leiserson, "Algorithms for VLSI processor arrays," 
Symposium on Sparse Matrix Computations, Knoxville, Tenn., Nov. 1978.
2. L. J. Guibas, H. T. Kung, and C. D. Thompson, "Direct VLSI implementa­
tion of combinatorial algorithms," Proc. Conference on VLSI Architecture, 
Design, Fabrication, C a lif. Inst, of Techn., January 1979.
3. R. P. Brent and H. T. Kung, "The area-time complexity of binary multi­
p lication ," Research Report, Department of Computer Science, Carnegie- 
Mellon University, Pittsburgh, Penn., July 1979.
4. C. D. Thompson, "Area-time complexity for VLSI," Proc. of the 11th 
Annual ACM Symposium on the Theory of Computing (SIGACT), pp. 81-88,
May 1979.
5. F. P. Preparata and J. Vuillemin, "Area-time optimal VLSI networks 
for multiplying matrices", 14th Princeton Conference on Information 
Sciences and Systems, March 1980.
6. J. E. Savage, "Area-time tradeoffs for matrix multiplication and 
transitive closure in the VLSI model," Proc. of the 17th Annual 
Allerton Conference on Communications, Control, and Computing,
October 1979.
7. C. Mead and L. Conway, Introduction to VLSI Systems, Addison-Wesley, 
Reading, Mass. 1979.
8. C. D. Thompson, "A complexity theory for VLSI", Ph.D. Thesis, Depart­
ment of Computer Science, Carnegie-Me11on University, Pittsburgh,
Penn., September 1979.
9. D. Heller, "A Survey of para lle l algorithms in numerical linear 
algebra," Dept, of Comp. Sei. Carnegie-Melion University, Pittsburgh,
Pa., Feb. 1976.
10. L. Csanky, "Fast para lle l matrix inversion algorithms, SIAM J. Comptng.
5, 1976, 618-623.
