Area—time tradeoffs for matrix multiplication and related problems in VLSI models  by Savage, John E.
JOURNAL OF COMPUTER AND SYSTEM SCIENCES 22, 230-242 (1981) 
Area-Time Tradeoffs for Matrix Multiplication 
and Related Problems in VLSI Models* 
JOHN E. SAVAGE 
Department of Computer Science, Brown University, 
Providence, Rhode Island 
Received November 18, 1979; revised December 12, 1980 
Two models for very-large scale integrated (VLSI) semiconductor circuits are considered 
that have been developed by Thompson and by Brent and Kung. The models permit the study 
of tradeoffs between chip area and computation time. We show that these tradeoffs can be 
derived from a single common complexity measure of a problem. We derive bounds on this 
measure for matrix multiplication under weak assumptions about the operations of addition 
and multiplication. The assumptions are such that the bounds can be applied directly to
transitive closure and matrix inversion.
1. INTRODUCTION
The development of new manufacturing techniques for the fabrication of small,
inexpensive and dense semiconductor chips has created a revolution in the computer
industry. Through the use of very large scale integration (VLSI) of semiconductor
circuits, the size and cost of processing elements and memory have been reduced
dramatically, thus increasing the availability and use of computers.
VLSI is achieved through new uses of semiconductor circuit design and high 
resolution photographic techniques. In this it has been convenient to place wires on 
rectangular grids and to limit the number of parallel layers of semiconductor material 
containing wires and circuit elements, two important constraints. As wires and
transistors are made as small as possible, within the limits of photographic resolution, the
area occupied by a circuit becomes important. In turn, the area of chips is limited to
keep the chip yield high and to satisfy a practical limit on the number of wires which
can be attached to them. Also important, of course, is the time it takes for a chip of a
given area and configuration to complete its task. Mead and Conway [1] present an
introduction to VLSI.
Recent research by Thompson [2,3] indicates that tradeoffs exist for many
problems between chip area A and computation time T. He demonstrates that
$AT^2 = \Omega(n^2)$ for the discrete Fourier transform on n inputs [2] and has similar results
for other problems [3]. These results are derived by postulating a reasonable model
* This research was supported in part by NSF Grants MCS 76-20023 and ENG 75-17614.
for VLSI in terms of which these lower bounds can be derived. Brent and Kung [4]
have presented a somewhat different model and have shown that $AT^2 = \Omega(n^2)$ for the
multiplication of two integers in the standard binary representation. A second
inequality of the form $A = \Omega(n)$ allows them to derive the relation $AT^{2\alpha} = \Omega(n^{1+\alpha})$,
$0 \le \alpha \le 1$, which they show is tight by construction of algorithms.
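The interpolation step behind the Brent-Kung family of bounds can be spelled out in one line. The sketch below assumes the two inequalities read $AT^2 = \Omega(n^2)$ and $A = \Omega(n)$, and writes the exponent parameter as $\alpha$ with $0 \le \alpha \le 1$; the symbol name is a choice made here for illustration.

```latex
% Split A into the powers A^{\alpha} and A^{1-\alpha} and bound each factor.
\[
  A T^{2\alpha}
  \;=\; \bigl(A T^{2}\bigr)^{\alpha} \, A^{1-\alpha}
  \;=\; \Omega\!\bigl(n^{2\alpha}\bigr) \cdot \Omega\!\bigl(n^{1-\alpha}\bigr)
  \;=\; \Omega\!\bigl(n^{1+\alpha}\bigr),
  \qquad 0 \le \alpha \le 1 .
\]
```

Setting $\alpha = 1$ recovers the $AT^2$ bound and $\alpha = 0$ recovers the area bound alone.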
In this paper we consider the problem of matrix multiplication in both VLSI
models and derive lower bounds on the measure $AT^2$. To do this, we outline the
assumptions and analysis associated with the two models and show that lower
bounds for both models can be derived by evaluating a single measure of the
complexity of the problem in question, which in this case is matrix multiplication. We
give separate treatments of the multiplication of square p x p matrices, for which we
show that $AT^2 = \Omega(p^4)$, and the general case of multiplication of m x n and n x p
matrices for which we show that $AT^2 = \Omega((mp)^2)$ when $(a - n)(b - n) < n^2/2$ where
a = max(n, p) and b = max(n, m). We also show that the bound for square matrices
can be achieved by appealing to algorithms developed by Kung and Leiserson [5].
Recently, Preparata and Vuillemin [6] have developed chip layouts for square matrix
multiplication which are within a constant factor of the lower bounds for
$\Omega(\log n) \le T \le O(n)$.
The proof technique employed to derive the lower bounds uses some very limited
assumptions about the set X from which matrix elements are drawn and the
operations of addition and multiplication over this set. These assumptions are
weaker than those for a semiring [7], from which it follows directly that the lower
bounds apply to the transitive closure problem and matrix inversion as well.
2. VLSI MODELS 
Thompson [2,3] and Brent and Kung [4] have presented models for VLSI circuits 
which have much in common. In this section we describe these models and 
summarize their accompanying analysis in terms that permit their application to 
arbitrary multi-output problems. A problem is characterized by a function
$f: X^n \to X^m$, X a finite set, where $f(a_1, a_2, \ldots, a_n) = (b_1, b_2, \ldots, b_m)$ and $a_i$ and $b_j$ are
input and output variables, respectively.
Both models assume that VLSI chips have nodes and wires with wires of width $\lambda$
(determined by the resolution of photographic techniques) and spacing $\lambda$. The nodes
serve as input ports, output ports, processing elements or a combination of several of
these. Each node has an area of at least $\rho \ge \lambda^2$. The chip is laid out on several
different planes, each of which contains wires and nodes and between which there are
connections. The number of planes, $\gamma \ge 2$, is small. It is assumed that each input
variable is entered once and that the computation is performed in a data-independent
manner.
The two models also assume that wires carry results from the alphabet X and that
it takes one unit of time $\tau$ to transmit one result over one wire. In one of his models,
Thompson [2] allows X to be non-binary. Brent and Kung [4] assume that $|X| = 2$.
The analysis of both models can be done assuming that X is non-binary by
increasing the wire width and spacing and by assuming a chip design in which wires
are not split by selecting components of words in X.
The assumptions that are special to the Thompson (T) model [2] are the following:
T1. Wires are placed on a rectangular grid.
T2. Nodes with total degree d are square with sides of length $\lambda d$.
T3. Each output variable appears in a distinct output port (or processing
element).
The assumption special to the Brent and Kung (BK) model [4] is the following:
BK1. All computations are performed in a convex (multi-) planar region R of the
chip of area A.
Because assumption T3 is not made by Brent and Kung, they explicitly assume 
that more than one output variable may pass through an output port. Thompson 
could have made such an assumption without substantially weakening his results. 
Assumption T2 has been changed recently by Thompson [3] by requiring that the 
degree d of a node be limited to a small constant, say d = 4, which is reasonable, in 
which case, assuming a square node is a very weak condition. This restriction has 
little effect on the lower bounds. Thus, we conclude that the principal difference in the 
models is reflected in assumptions T1 and BK1 concerning the geometry of the chip.
We now turn to the analysis that has been done in these two models. For 
pedagogical reasons it is desirable to begin with the BK model, although it was 
developed after the T model. Using well-known relations between the area, perimeter 
and diameter of a convex region, Brent and Kung have shown that if C is the length
of any chord perpendicular to a diameter of the convex region R of area A then
$$A \ge \frac{C^2}{\pi}. \qquad (1)$$
In turn, if w is the number of wires crossing this line
$$w \le \frac{\gamma C}{2\lambda} \qquad (2)$$
since each wire and each spacing has width $\lambda$ and at most $\gamma$ wires overlap. Brent and Kung
let M be the maximum number of the m output variables that are generated through
any one port. They then show that by sliding a line perpendicular to a diameter, the
output nodes can be divided into two disjoint sets, which correspond to a partition
$S_1, S_2$ of the m output variables so that $|S_1|$ satisfies
$$\frac{m - M}{2} \le |S_1| \le \frac{m + M}{2}.$$
Such a line partitions the input nodes and thus the inputs of the function $f: X^n \to X^m$.
The amount of information which must flow across this line determines the area and
time required by the chip and motivates the following new definition.
DEFINITION. Given $f: X^n \to X^m$, $f(a_1, a_2, \ldots, a_n) = (b_1, b_2, \ldots, b_m)$, let $U_1, U_2$ be a
partition of $\{1, 2, \ldots, n\}$ and $V_1, V_2$ be a partition of $\{1, 2, \ldots, m\}$; if $|U_1| = k$ and
$V_2 = \{j_1, j_2, \ldots, j_l\}$, let $f^{1,2}_{c_2}$ be the subfunction of f with outputs $(b_{j_1}, \ldots, b_{j_l})$ obtained
by fixing variables in $U_2$ to $c_2 \in X^{n-k}$. The function $f^{2,1}_{c_1}$ is similarly defined. Let
$|f^{1,2}_{c_2}|$ and $|f^{2,1}_{c_1}|$ be the cardinalities of the ranges of these two subfunctions. Let
$$\phi(U_1, V_2) = \left\lceil \log_{|X|} \max_{c_2} |f^{1,2}_{c_2}| \right\rceil$$
and let $\phi(U_2, V_1)$ be similarly defined. These are measures of the maximum amount
of information that must flow from $U_1$ to $V_2$ and from $U_2$ to $V_1$, respectively. The
minimal cross-flow $I(M)$ is defined by
It follows from the above definition that at least $I(M)$ elements from X must be
communicated across any chord of the convex region R that balances output port
sizes. If the chord has w wires, the time T required must satisfy
$$T \ge \frac{I(M)\,\tau}{w} \qquad (3)$$
since the transmission of each element requires $\tau$ seconds and at most w
transmissions occur simultaneously. Combining Eqs. (1), (2) and (3), we have
$$AT^2 \ge (\lambda\tau)^2 \frac{4}{\pi\gamma^2} I^2(M). \qquad (4)$$
This inequality is weak when M is large, since $I(M)$ is then small, so we observe that
at least $M\tau$ seconds are required to generate the M outputs produced through some
port whose area $\rho \ge \lambda^2$. Thus,
$$AT^2 \ge (\lambda\tau)^2 M^2. \qquad (5)$$
Combining Eqs. (4) and (5) we have the desired generalization of the Brent-Kung
theorem.
THEOREM 1. In the Brent-Kung VLSI model [4] the area A and time T used by
a VLSI chip to compute a function $f: X^n \to X^m$ satisfy the following inequality:
$$AT^2 \ge (\lambda\tau)^2 \min_{1 \le M \le m} \max\left[\frac{4}{\pi\gamma^2} I^2(M),\; M^2\right].$$
Thompson [2] assumes that the output variables appear in distinct ports or that
M = 1. He defines $\omega$, the minimal bisection width of a chip, as the minimal number of
wires that are cut by any line on the chip that divides the output ports into two
disjoint sets $V_1^*$, $V_2^*$, whose sizes are as nearly equal as possible. It is clear that
He then demonstrates that the area A of the chip satisfies
and shows that $T \ge I(1)\tau/\omega$ since only $\omega$ wires are available to carry the $I(1)$ units
of information.
THEOREM 2. In the Thompson VLSI model [2], the area A and time T used by a
VLSI chip to compute a function $f: X^n \to X^m$ satisfy the following inequality:
$$AT^2 \ge (\lambda\tau)^2 I^2(1).$$
If Thompson’s assumption T3 were relaxed, Theorem 2 would be identical with 
Theorem 1 except for constant multipliers. We show in the next section that both 
bounds are identical up to constant multipliers for matrix multiplication. 
3. MULTIPLICATION OF SQUARE MATRICES 
Consider the multiplication of two p x p matrices, A, B over the set X where
C = AB is a function $f: X^{2p^2} \to X^{p^2}$. Let the operations of addition and multiplication
and the set X satisfy the following weak conditions:
M1. The set X is closed under the two operations.
M2. Additive and multiplicative identities 0 and 1 exist and for $x \in X$,
$0 \cdot x = 0$.
Since the object is to derive a lower bound to the minimal cross-flow, I(M), for 
matrix multiplication, we will show that a good lower bound can be achieved by 
setting elements in A or B (but not both) to form a permutation matrix. These 
conditions on addition and multiplication will be sufficient. 
Denote with two p x p binary matrices $D_1$ and $D_2$ a partition of the elements of a
matrix D into two disjoint sets, where each non-zero entry in $D_i$ identifies an element
in the ith set. If $D_i$ and $E_i$ are two such binary matrices, let $D_i \cap E_i$ and $D_i \cup E_i$
denote their intersection and union, which are the matrices that have a non-zero entry
exactly where both do, and where either has an entry, respectively. Also, let $|D_i|$ be the
number of non-zero entries in $D_i$. It follows immediately that $|D_1 \cap D_2| = 0$. We are now prepared
to tackle matrix multiplication.
Let $C_1$ and $C_2$ denote a partition of C such that
$$\frac{p^2 - M}{2} \le |C_1| \le \frac{p^2 + M}{2}$$
and let $A_1, A_2$ and $B_1, B_2$ denote partitions of A and B. We shall show that many of
the input variables in $A_1$ or $B_1$ can be mapped onto (or equated with) output
variables in $C_2$ or that the same can be done for $A_2$, $B_2$ and $C_1$. An essential step in
this demonstration will be to form matrices $A_1(i)$ and $A_2(i)$, which are cyclic column
shifts of $A_1$ and $A_2$ for some $1 \le i \le p$ obtained by setting B to a cyclic permutation
matrix, and to form matrices $B_1(j)$ and $B_2(j)$, $1 \le j \le p$, which are cyclic row shifts
of $B_1$ and $B_2$ obtained by setting A to a cyclic permutation matrix.
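The effect of fixing one operand to a cyclic permutation matrix can be checked directly. The following is a minimal sketch, not the author's construction verbatim: the helper names `matmul` and `cyclic_permutation` are hypothetical, shifts are indexed from 0, and the Boolean operations (or, and) stand in for any addition and multiplication with the identities 0 and 1 behaving as usual.

```python
def matmul(A, B, add, mul, zero):
    """Multiply two matrices using only the supplied addition and multiplication."""
    rows, inner, cols = len(A), len(B), len(B[0])
    C = [[zero] * cols for _ in range(rows)]
    for u in range(rows):
        for v in range(cols):
            acc = zero
            for k in range(inner):
                acc = add(acc, mul(A[u][k], B[k][v]))
            C[u][v] = acc
    return C


def cyclic_permutation(p, i, zero, one):
    """p x p matrix whose row k has its single 1 in column (k + i) mod p."""
    return [[one if (k + i) % p == c else zero for c in range(p)]
            for k in range(p)]


# Boolean "or" and "and" play the roles of addition and multiplication.
bool_add = lambda x, y: x or y
bool_mul = lambda x, y: x and y

A = [[True, False, False],
     [False, True, True],
     [False, False, False]]
P = cyclic_permutation(3, 1, False, True)

# With B fixed to P, every output entry of C = AB equals one input entry of A:
# column c of A reappears as column (c + 1) mod 3 of C.
C = matmul(A, P, bool_add, bool_mul, False)
shifted = [[row[(v - 1) % 3] for v in range(3)] for row in A]
print(C == shifted)
```

Fixing A to a cyclic permutation matrix instead produces the analogous cyclic row shift of B.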
LEMMA 1. There exist integers $1 \le i, j \le p$ such that
$$|A_r(i) \cap B_r(j)| \le \frac{|A_r|\,|B_r|}{p^2}$$
for r = 1, 2.
Proof. Consider the following sum:
$$S = \sum_{i=1}^{p} \sum_{j=1}^{p} |A_r(i) \cap B_r(j)|.$$
Since $A_r(i)$ and $B_r(j)$ are formed by cyclic shifts of the columns of $A_r$ and rows of
$B_r$, respectively, each of the $r_u$ 1's in the uth row of $A_r$ encounters each of the $s_v$ 1's in
the vth row of $B_r$ for each $1 \le v \le p$. It follows that
$$S = |A_r|\,|B_r|$$
from which the conclusion of the lemma follows directly, since the smallest of the $p^2$
terms of the sum is at most their average. ∎
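The counting argument of Lemma 1 is easy to confirm numerically. The sketch below is an illustration under stated assumptions: the function names are hypothetical, shifts run over 0, ..., p-1 rather than 1, ..., p (the same set of cyclic shifts), and binary matrices are stored as lists of 0/1 rows.

```python
import random


def shift_columns(D, i):
    """Cyclic column shift: entry (u, c) moves to (u, (c + i) mod p)."""
    p = len(D[0])
    return [[row[(c - i) % p] for c in range(p)] for row in D]


def shift_rows(D, j):
    """Cyclic row shift: entry (v, d) moves to ((v + j) mod p, d)."""
    p = len(D)
    return [D[(r - j) % p] for r in range(p)]


def weight(D):
    """Number of non-zero entries of a binary matrix."""
    return sum(map(sum, D))


def overlap(D, E):
    """Number of positions where both binary matrices have a 1."""
    return sum(d & e for row_d, row_e in zip(D, E) for d, e in zip(row_d, row_e))


def shift_overlap_sum(A_r, B_r):
    """The sum S over all pairs of shifts (i, j)."""
    p = len(A_r)
    return sum(overlap(shift_columns(A_r, i), shift_rows(B_r, j))
               for i in range(p) for j in range(p))


rng = random.Random(0)
p = 5
A_r = [[rng.randint(0, 1) for _ in range(p)] for _ in range(p)]
B_r = [[rng.randint(0, 1) for _ in range(p)] for _ in range(p)]

S = shift_overlap_sum(A_r, B_r)
smallest = min(overlap(shift_columns(A_r, i), shift_rows(B_r, j))
               for i in range(p) for j in range(p))
# Every pair of 1's meets for exactly one (i, j), so S = |A_r| |B_r|,
# and the smallest overlap cannot exceed the average S / p^2.
print(S == weight(A_r) * weight(B_r), smallest * p * p <= S)
```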
PROPOSITION 1. The minimal cross-flow $I(M)$ for the multiplication of two p x p
matrices over the finite set X is bounded by
$$\frac{p^2}{8} - \frac{M}{4} \le I(M) \le \frac{p^2 + M}{2}.$$
Proof. The upper bound is a direct consequence in the definition of $I(M)$ of the
fact that
The lower bound proceeds from a series of observations.
Observe that
$$|D \cap E| \ge |D| + |E| - p^2$$
for p x p matrices D and E.
Observe that for $r' \ne r$, $r, r' \in \{1,2\}$,
$$\begin{aligned}
S_r &= |C_{r'} \cap A_r(i)| + |C_{r'} \cap B_r(j)| \\
    &\ge |C_{r'} \cap (A_r(i) \cup B_r(j))| \\
    &\ge |A_r(i) \cup B_r(j)| + |C_{r'}| - p^2 \\
    &\ge |A_r(i) \cup B_r(j)| - (p^2 + M)/2
\end{aligned}$$
so that if the union is large, then either $|C_{r'} \cap A_r(i)|$ or $|C_{r'} \cap B_r(j)|$ will be large
and at least half the value of the lower bound. However,
$|A_r(i) \cup B_r(j)| = |A_r(i)| + |B_r(j)| - |A_r(i) \cap B_r(j)|$. As a consequence we invoke
Lemma 1 to show that there exist integers i, j such that
$$S_r \ge |A_r(i)| + |B_r(j)| - \frac{|A_r|\,|B_r|}{p^2} - \frac{p^2 + M}{2}$$
for r = 1 and r = 2. Since $|A_r(i)|$ and $|B_r(j)|$ are independent of i and j, let $x = |A_1(i)|$,
$y = |B_1(j)|$ and note that we can choose r = 1 or r = 2, depending on which gives the
stronger result. Thus, we have that there exist r, i and j such that
$$S_r \ge L - (p^2 + M)/2,$$
where
$$\begin{aligned}
L &= \min_{0 \le x, y \le p^2} \max\left( x + y - \frac{xy}{p^2},\; (p^2 - x) + (p^2 - y) - \frac{(p^2 - x)(p^2 - y)}{p^2} \right) \\
  &= \min_{0 \le x, y \le p^2} \max\left( x + y - \frac{xy}{p^2},\; p^2 - \frac{xy}{p^2} \right).
\end{aligned}$$
The first is an increasing function of x and y while the second is a decreasing
function. Thus, the minimum is achieved when they are equal or when
$x + y = p^2$.
Since $xy = x(p^2 - x)$ is a concave function with a maximum at $x = p^2/2$, it follows
that $L \ge 3p^2/4$. Thus
$$I(M) \ge \tfrac{1}{2}\left(L - (p^2 + M)/2\right) \ge p^2/8 - M/4. \qquad ∎$$
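The min-max step in this proof can be sanity-checked by brute force for small p. The sketch below assumes the two competing expressions are $x + y - xy/p^2$ and $p^2 - xy/p^2$, restricts x and y to integers (they count matrix entries), and uses exact rational arithmetic; the function names are hypothetical.

```python
from fractions import Fraction


def L_value(p):
    """Minimum over integer 0 <= x, y <= p^2 of the larger of the two bounds."""
    n = p * p
    best = None
    for x in range(n + 1):
        for y in range(n + 1):
            first = Fraction(n * (x + y) - x * y, n)   # x + y - xy / p^2
            second = Fraction(n * n - x * y, n)        # p^2 - xy / p^2
            value = max(first, second)
            if best is None or value < best:
                best = value
    return best


def cross_flow_bound(p, M):
    """Half of L - (p^2 + M)/2, the quantity that bounds I(M) from below."""
    return (L_value(p) - Fraction(p * p + M, 2)) / 2


for p in (2, 3, 4):
    # L never drops below 3 p^2 / 4, so the bound is at least p^2/8 - M/4.
    print(p, L_value(p), cross_flow_bound(p, 1))
```

For even $p^2$ the minimum is attained at $x = y = p^2/2$, where both expressions equal $3p^2/4$.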
This bound on the minimal cross-flow is now applied to derive a lower bound for
the VLSI complexity measure $AT^2$.
THEOREM 3. Let C = AB denote multiplication of p x p matrices over a set X
where X and the operations of addition and multiplication satisfy conditions M1 and
M2. Then the area A and time T in the Brent-Kung (BK) and Thompson (T) VLSI
models to compute C must satisfy the following inequalities:
$$AT^2 \ge (\lambda\tau)^2 \left(\frac{p^2}{8} - \frac{1}{4}\right)^2 = \Omega(p^4), \qquad \text{(T)}$$
$$AT^2 \ge (\lambda\tau)^2 \left(\frac{p^2}{2(1 + 2\gamma\sqrt{\pi})}\right)^2 = \Omega(p^4). \qquad \text{(BK)}$$
Proof. In the BK model the lower bound requires the computation of
$\min_M \max[(4/\pi\gamma^2)\, I^2(M),\; M^2]$, which due to the monotonicity of $I(M)$ is achieved when
$M = 2I(M)/(\gamma\sqrt{\pi})$. ∎
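The balance point used in the BK bound can be checked numerically. This is a sketch under explicit assumptions: the constant in front of $I^2(M)$ is taken to be $4/(\pi\gamma^2)$, $I(M)$ is replaced by its lower bound $p^2/8 - M/4$, and the function names are hypothetical.

```python
import math


def cross_flow_lower_bound(p, M):
    """Lower bound p^2/8 - M/4 on the minimal cross-flow for p x p matrices."""
    return p * p / 8 - M / 4


def balance_point(p, gamma):
    """Value of M at which M equals (2 / (gamma sqrt(pi))) (p^2/8 - M/4)."""
    return p * p / (2 * (1 + 2 * gamma * math.sqrt(math.pi)))


def bk_bound(p, gamma, M):
    """max of the two competing terms, without the common factor (lambda tau)^2."""
    k = 2 / (gamma * math.sqrt(math.pi))
    return max((k * cross_flow_lower_bound(p, M)) ** 2, M ** 2)


p, gamma = 8, 2
M_star = balance_point(p, gamma)
k = 2 / (gamma * math.sqrt(math.pi))
# At M_star the two terms coincide, so the smaller-M and larger-M regimes meet there.
print(M_star, abs(k * cross_flow_lower_bound(p, M_star) - M_star) < 1e-9)
```

Moving M away from the balance point in either direction makes one of the two terms larger, so the value there is the smallest the maximum can be.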
Kung and Leiserson [5] have designed a hex-connected cellular sequential machine
which multiplies two p x p matrices of bandwidths $w_1$ and $w_2$ in $3p + \min(w_1, w_2)$
units of time. The machine consists of a number of small identical cells and has an
area proportional to $w_1 w_2$. Since the matrices that we consider are full,
$w_1 = w_2 = 2p - 1$, and $A = O(p^2)$, $T = O(p)$ so $AT^2 = O(p^4)$, which meets our lower
bounds up to a multiplicative factor. Such a machine for the multiplication of m x n
and n x p matrices is shown in Fig. 1. Recently, Preparata and Vuillemin [6] have
designed a family of pipelined chips for square matrix multiplication which achieve
the lower bounds to within a multiplicative factor for $\Omega(\log n) \le T \le O(n)$.
FIG. 1. Hex network for multiplication of m x n and n x p matrices. 
4. MULTIPLICATION OF RECTANGULAR MATRICES 
Consider the multiplication of an m x n matrix A by an n x p matrix B under
assumptions M1 and M2. We derive a lower bound to the minimal cross-flow $I(M)$
from which bounds on $AT^2$ follow.
Let the elements in C be partitioned into two sets represented by the binary
matrices $C_1$, $C_2$, where
$$\frac{mp - M}{2} \le |C_1| \le \frac{mp + M}{2}.$$
Also, let the elements of A and B be partitioned into sets represented by $A_1, A_2$ and
$B_1, B_2$. Again, the task is to show the existence of matrices $A_r^*(i)$ or $B_r^*(j)$ such that
the intersection with $C_{r'}$, $r' \ne r$, $r, r' \in \{1,2\}$, contains many variables from $A_r$ or $B_r$.
However, $A_1, A_2$ are m x n matrices and $B_1, B_2$ are n x p matrices and since in
general m, n and p are not equal, we cannot map $A_r$, $B_r$ or shifts of them directly
onto $C_{r'}$ by choice of entries in A or B. Another approach is required.
We create m x p matrices $A_r^*(i)$ and $B_r^*(j)$ from the m x n and n x p matrices $A_r$
and $B_r$ and these new matrices will be such that they can be mapped onto $C_{r'}$. If
$n \ge p$, $A_r^*(i)$ is defined to be the first p columns of $A_r(i)$, $1 \le i \le n$. Otherwise, if
n < p, $A_r^*(i)$ is the ith cyclic shift of the columns of the matrix $[A_r \; 0^{p-n}]$ for
$1 \le i \le p$, where $0^{p-n}$ is the m x (p - n) zero matrix. Similarly, if $n \ge m$, $B_r^*(j)$ is
defined to be the first m rows of $B_r(j)$ for $1 \le j \le n$, and if n < m, $B_r^*(j)$ is defined as
the jth cyclic shift of the rows of the matrix $[B_r^T \; 0^{m-n}]^T$ for $1 \le j \le m$, where T
denotes the transpose of a matrix. The following lemma establishes the essential
properties of these matrices.
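LEMMA 2. Let a = max(n, p) and b = max(m, n). Then, for $r \in \{1,2\}$
$$\sum_{i=1}^{a} |A_r^*(i)| = p\,|A_r|, \qquad \sum_{j=1}^{b} |B_r^*(j)| = m\,|B_r|, \qquad \sum_{i=1}^{a} \sum_{j=1}^{b} |A_r^*(i) \cap B_r^*(j)| = |A_r|\,|B_r|.$$
Proof. When n > p or n > m the first two relations follow by noting that the n
cyclic permutations of $A_r(i)$ carry every column of $A_r$ through each column of $A_r^*(i)$
and similarly for permutations of rows of $B_r^*(j)$. Also if $n \le p$ or $n \le m$,
$|A_r^*(i)| = |A_r|$ and $|B_r^*(j)| = |B_r|$. The third relation is derived in a manner similar to
that of Lemma 1. ∎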
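The construction of the m x p matrices and the three sums can be exercised on small random instances. The sketch below is an illustration, not the paper's notation: `a_star` and `b_star` are hypothetical names, shifts are indexed from 0, and both cases of each definition are folded into one rule (pad with zeros up to width a = max(n, p) or height b = max(m, n), shift cyclically, keep the first p columns or first m rows). The sums it checks are assumed to be $p|A_r|$, $m|B_r|$ and $|A_r||B_r|$.

```python
import random


def weight(D):
    """Number of non-zero entries of a binary matrix."""
    return sum(map(sum, D))


def overlap(D, E):
    """Number of positions where both binary matrices have a 1."""
    return sum(d & e for row_d, row_e in zip(D, E) for d, e in zip(row_d, row_e))


def a_star(A_r, p, i):
    """m x p matrix built from the m x n matrix A_r by a cyclic column shift."""
    n = len(A_r[0])
    a = max(n, p)
    padded = [row + [0] * (a - n) for row in A_r]      # zero columns if n < p
    return [[row[(c - i) % a] for c in range(p)] for row in padded]


def b_star(B_r, m, j):
    """m x p matrix built from the n x p matrix B_r by a cyclic row shift."""
    n, p = len(B_r), len(B_r[0])
    b = max(m, n)
    padded = B_r + [[0] * p for _ in range(b - n)]     # zero rows if n < m
    return [padded[(r - j) % b] for r in range(m)]


def lemma2_sums(A_r, B_r):
    """The three sums over all shifts, returned as a tuple."""
    m, n, p = len(A_r), len(A_r[0]), len(B_r[0])
    a, b = max(n, p), max(m, n)
    sum_a = sum(weight(a_star(A_r, p, i)) for i in range(a))
    sum_b = sum(weight(b_star(B_r, m, j)) for j in range(b))
    sum_ab = sum(overlap(a_star(A_r, p, i), b_star(B_r, m, j))
                 for i in range(a) for j in range(b))
    return sum_a, sum_b, sum_ab


def random_binary(rows, cols, rng):
    return [[rng.randint(0, 1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(1)
for m, n, p in [(2, 3, 4), (4, 3, 2), (3, 5, 2), (2, 2, 5), (3, 3, 3)]:
    A_r, B_r = random_binary(m, n, rng), random_binary(n, p, rng)
    expected = (p * weight(A_r), m * weight(B_r), weight(A_r) * weight(B_r))
    print((m, n, p), lemma2_sums(A_r, B_r) == expected)
```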
We are now ready to derive bounds on I(M) for multiplication of rectangular 
matrices. 
PROPOSITION 2. The minimal cross-flow $I(M)$ for the multiplication of m x n and
n x p matrices over X is bounded by
$$\frac{mp}{4}\left(1 - \frac{(2a - n)(2b - n)}{2ab}\right) - \frac{M}{4} \le I(M) \le \frac{mp + M}{2}$$
and the coefficient of mp is positive if $(a - n)(b - n) < n^2/2$ where a and b are given
above.
Proof. The proof parallels that of Proposition 1 and the upper bound argument is
identical.
Following the proof of Proposition 1, we wish to show that for some i, j and r, $S_r$
is large where $S_r \ge E_r - (mp + M)/2$ since $|C_{r'}| \ge (mp - M)/2$ and the matrices in
question are m x p. Here
$$E_r = |A_r^*(i)| + |B_r^*(j)| - |A_r^*(i) \cap B_r^*(j)|.$$
We then use $S_r/2$ as a lower bound to the minimal cross-flow.
The following relation is a consequence of Lemma 2.
$$\sum_{i=1}^{a} \sum_{j=1}^{b} E_r = R_r = pb\,|A_r| + am\,|B_r| - |A_r|\,|B_r|.$$
Thus, for each r there exist i and j such that $E_r \ge L_r = R_r/ab$. Also, it is easily
demonstrated that
$$|A_1| + |A_2| = mn, \qquad |B_1| + |B_2| = np$$
using the identity max(u, v) min(u, v) = uv. If we set $x = |A_1|$, $y = |B_1|$, and evaluate
$L_1$ and $L_2$ using $|A_2| = mn - x$, $|B_2| = np - y$, we have that there exist r, i and j
such that $E_r \ge Q/ab$, where
$$Q = \min_{0 \le x \le mn,\; 0 \le y \le np} \max\bigl(bpx + amy - xy,\; mnp(a + b - n) - (b - n)px - (a - n)my - xy\bigr).$$
The first of the two functions is increasing in x and y while the second is decreasing.
The minimum is achieved when they are equal, that is, when
$$(2b - n)px + (2a - n)my = mnp(a + b - n)$$
and this condition can be satisfied by $0 \le x \le mn$, $0 \le y \le np$.
Using this condition and substituting u = am - x we have
$$Q = abmp - u(\alpha - \beta u)$$
for
$$\alpha = (2b - n)p, \qquad \beta = \alpha/((2a - n)m).$$
Since $u(\alpha - \beta u)$ is a concave function with a maximum at $u = \alpha/(2\beta)$, we have
$Q/ab \ge mp - \alpha^2/(4\beta ab)$ from which we have that there exist r, i and j such that
$$\frac{S_r}{2} \ge \frac{mp}{4}\left(1 - \frac{(2a - n)(2b - n)}{2ab}\right) - \frac{M}{4}.$$
The coefficient of mp is positive if $n^2/2 > (a - n)(b - n)$. ∎
THEOREM 4. The multiplication of m x n and n x p matrices over X under
conditions M1 and M2 requires area A and time T in the Brent-Kung (BK) and
Thompson (T) models that satisfy the following inequalities:
$$AT^2 \ge (\lambda\tau)^2 (mpK - 1/4)^2 = \Omega(m^2 p^2), \qquad \text{(T)}$$
$$AT^2 \ge (\lambda\tau)^2 \left(\frac{4mpK}{1 + 2\gamma\sqrt{\pi}}\right)^2 = \Omega(m^2 p^2), \qquad \text{(BK)}$$
where
$$K = \frac{1}{4}\left(1 - \frac{(2a - n)(2b - n)}{2ab}\right)$$
is positive for $(a - n)(b - n) < n^2/2$.
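The coefficient of mp and its positivity condition can be tabulated exactly. The sketch below assumes the coefficient has the form $K = \frac{1}{4}\bigl(1 - (2a - n)(2b - n)/(2ab)\bigr)$ with a = max(n, p) and b = max(m, n); the function name `coefficient_K` is hypothetical.

```python
from fractions import Fraction


def coefficient_K(m, n, p):
    """Coefficient of mp in the cross-flow lower bound for m x n by n x p products."""
    a, b = max(n, p), max(m, n)
    return Fraction(1, 4) * (1 - Fraction((2 * a - n) * (2 * b - n), 2 * a * b))


def condition_holds(m, n, p):
    """The stated positivity condition (a - n)(b - n) < n^2 / 2, in integer form."""
    a, b = max(n, p), max(m, n)
    return 2 * (a - n) * (b - n) < n * n


# Square matrices give K = 1/8, matching the p^2/8 term of the square case.
print(coefficient_K(6, 6, 6))

# K is positive exactly when the condition holds, e.g. whenever m, p <= n.
for m, n, p in [(3, 6, 3), (9, 6, 9), (20, 6, 20)]:
    print((m, n, p), coefficient_K(m, n, p), condition_holds(m, n, p))
```

When m and p are both at most n the coefficient stays at 1/8; it shrinks as m and p grow past n and eventually becomes non-positive.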
The lower bounds given by this theorem are more difficult to achieve than are the
bounds for square matrices. Consider again the Kung and Leiserson [5] hex-connected
network for the multiplication of m x n and n x p matrices, as shown in
Fig. 1. The area occupied by the network, exclusive of that to hold input and output
variables, is A = (m + n - 1)(p + n - 1)h, where h is the area of the hexagonal cell.
The time to compute all results by bringing them to a boundary cell, assuming that
all input variables are preceded by 0's, is T = (m + n - 1) + (p + n - 1). Thus,
$$AT^2 = cd(c + d)^2 h,$$
where
$$c = m + n - 1, \qquad d = p + n - 1.$$
In the theorem, it is easily demonstrated that $(4mpK)^2 \ge n^4/64$ if $n/2 \le m, p \le 3n/2$.
Thus, in this case the upper bound is optimal to within a multiplicative factor.
5. REMARKS AND CONCLUSIONS 
Two models for VLSI circuits have been considered, one which assumes that wires 
are laid out on a rectangular grid, and another which assumes that the circuit 
occupies several superimposed planes, the common outer boundary of which defines a 
convex planar region. For each model we have shown that multiplication of square 
p x p matrices requires an area A and computation time T which satisfies an 
inequality of the form $AT^2 = \Omega(p^4)$.
These results are derived under weak conditions on the two operations of addition
and multiplication and the set X over which they are defined. The conditions are such
that they apply to semirings for which it has been shown [7] that the transitive
closure of a 3p x 3p matrix reduces to matrix multiplication of p x p matrices from
the following identity:
$$\begin{bmatrix} 0 & A & 0 \\ 0 & 0 & B \\ 0 & 0 & 0 \end{bmatrix}^{*} = \begin{bmatrix} I & A & AB \\ 0 & I & B \\ 0 & 0 & I \end{bmatrix}.$$
Also, the conditions are such that the inverse of a 3p x 3p matrix over a field reduces
to matrix multiplication of p x p matrices from the following second identity [8]:
$$\begin{bmatrix} I & A & 0 \\ 0 & I & B \\ 0 & 0 & I \end{bmatrix}^{-1} = \begin{bmatrix} I & -A & AB \\ 0 & I & -B \\ 0 & 0 & I \end{bmatrix}.$$
We conclude that any VLSI circuit for either transitive closure or matrix inversion
must also satisfy the inequality $AT^2 = \Omega(p^4)$. A VLSI circuit for transitive closure
has been given by Guibas et al. [9] which achieves the lower bound with T = O(p).
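Both reductions can be checked on a small instance. The sketch below assumes the two block identities are the standard ones: the closure of the 3p x 3p matrix with blocks A and B on the superdiagonal has AB in its top-right block, and the inverse of the unit upper-triangular matrix with the same blocks has AB in its top-right block. All helper names are hypothetical; the closure is computed over the Boolean semiring with Warshall's algorithm and the inverse is checked over the integers.

```python
def identity(p):
    return [[1 if u == v else 0 for v in range(p)] for u in range(p)]


def zeros(p):
    return [[0] * p for _ in range(p)]


def assemble(grid):
    """Build a matrix from a square grid of equally sized blocks."""
    return [sum((block[r] for block in block_row), [])
            for block_row in grid for r in range(len(block_row[0]))]


def matmul(A, B):
    return [[sum(A[u][k] * B[k][v] for k in range(len(B)))
             for v in range(len(B[0]))] for u in range(len(A))]


def closure(X):
    """Reflexive-transitive closure of a 0/1 matrix (Warshall's algorithm)."""
    n = len(X)
    R = [[1 if u == v or X[u][v] else 0 for v in range(n)] for u in range(n)]
    for k in range(n):
        for u in range(n):
            if R[u][k]:
                for v in range(n):
                    if R[k][v]:
                        R[u][v] = 1
    return R


def top_right(X, p):
    """Top-right p x p block of a 3p x 3p matrix."""
    return [row[2 * p:] for row in X[:p]]


p = 2
A = [[1, 0], [1, 1]]
B = [[0, 1], [1, 0]]
I, Z = identity(p), zeros(p)
AB = matmul(A, B)

# Transitive closure: the Boolean product AB appears in the top-right block.
X = assemble([[Z, A, Z], [Z, Z, B], [Z, Z, Z]])
bool_AB = [[1 if s else 0 for s in row] for row in AB]
print(top_right(closure(X), p) == bool_AB)

# Inversion: the candidate inverse contains AB, and multiplying back gives I.
neg = lambda D: [[-x for x in row] for row in D]
Y = assemble([[I, A, Z], [Z, I, B], [Z, Z, I]])
Y_inv = assemble([[I, neg(A), AB], [Z, I, neg(B)], [Z, Z, I]])
print(matmul(Y, Y_inv) == identity(3 * p))
```

In both cases a circuit for the 3p x 3p problem yields the p x p product by reading off one block, which is why the lower bound carries over.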
The lower bound for multiplication of m x n by n x p matrices has been derived.
The bound is good for T = O(m + n + p) if $n/2 \le m, p \le 3n/2$. However, it is not
known to be tight when m, p < n/2. Thus, open problems consist of improving either
the upper or lower bounds for this case and obtaining bounds when m, p > n and
$(2m - n)(2p - n) > n^2/2$.
Other questions of interest are: can similar results be derived for banded matrices
or those that are triangular? Can such results be found for sparse matrices or matrices
that have structure such as circulants or symmetric matrices?
ACKNOWLEDGMENT 
The author is thankful to Gerard Baudet for a suggestion which led to an improvement in Theorems 3 
and 4. 
REFERENCES 
1. C. A. MEAD AND L. A. CONWAY, “Introduction to VLSI Systems,” Addison-Wesley, Menlo Park, 
Calif., 1979. 
2. C. D. THOMPSON, “Area-Time Complexity for VLSI”, Proc. 11th Annual ACM Symposium on 
Theory of Computing, pp. 81-88, Assoc. Comput. Mach., New York, 1979. 
3. C. D. THOMPSON, “A Complexity Theory for VLSI,” Ph.D. thesis, Department of Computer Science, 
Carnegie-Mellon University, 1980. 
4. R. P. BRENT AND H. T. KUNG, “The Area-Time Complexity of Binary Multiplication,” Technical 
Report CMU-CS-79-05, Department of Computer Science, Carnegie-Mellon University, 1979.
5. H. T. KUNG AND C. E. LEISERSON, “Systolic Arrays for VLSI,” in [1].
6. F. PREPARATA AND J. VUILLEMIN, Area-time optimal VLSI networks for parallel matrix 
multiplication, Inform. Process. Lett. 11 (1980), 77-80. 
7. A. AHO, J. HOPCROFT, AND J. ULLMAN, “The Design and Analysis of Computer Algorithms,” p. 203, 
Addison-Wesley, Menlo Park, Calif., 1974. 
8. See [7, p. 242].
9. L. J. GUIBAS, H. T. KUNG, AND C. D. THOMPSON, “Direct VLSI Implementation of Combinatorial 
Algorithms”, Proc. Conf. Very Large Scale Integration: Architecture, Design, Fabrication, California 
Institute of Technology, January 1979. 
