The performance of multilective VLSI algorithms  by Savage, John E.
JOURNAL OF COMPUTER AND SYSTEM SCIENCES 29, 243-273 (1984)
The Performance of Multilective VLSI Algorithms* 
JOHN E. SAVAGE 
Department of Computer Science, Brown University, Providence, Rhode Island 02912
Received June 9, 1982; revised April 15, 1983 
Chip area and computation time are the resource parameters of greatest importance in 
VLSI algorithms. Chip area is used for computation, storage, and for communication, the role 
of the latter being special to VLSI. Several models for VLSI algorithms have been postulated. 
The one treated in this paper differs from most others in that it explicitly permits inputs to be 
read multiple times, that is, algorithms used are multilective. The way in which lower bounds 
derived for VLSI depend on the number of times and places at which inputs are read is
explored. The contribution of this paper is to present methods for treating such multilective 
algorithms and to apply these methods to a large variety of functions and predicates. Figuring 
prominently in this work is the planar circuit size of a problem. This measure consolidates in 
one place several algorithmic and geometric issues that are usually dealt with separately in 
most analyses of VLSI algorithms. It is used to derive lower bounds to various area-time 
products. Also presented here is a method for deriving lower bounds on the area required by 
multilective algorithms. © 1984 Academic Press, Inc.
1. INTRODUCTION 
VLSI algorithms are special in that their realization requires the allocation of area 
on a chip for logical operations, storage, and for communication. As Thompson’s
paper [1] so clearly indicates, chip area A and computation time T must satisfy a
tradeoff relation; one cannot be reduced independently of the other. 
A variety of models for VLSI chips have been studied. All assume that chips are 
realized from discrete logic elements, memory cells, and wires, each of which has
dimensions of a size not smaller than some minimal size known as the feature width.
Some authors make additional assumptions about the geometry of the chip, such as that
wires are rectilinear [1], the chip occupies a convex region of the plane [2], or the
region occupied is a compact region of the plane [3], although these assumptions are 
not needed if wires consist of straight-line segments [4]. The chip model used here is 
the latter. 
Models of VLSI algorithms make assumptions on the reading of input and the 
writing of output. With the exception of work by Kedem and Zorat [5; 6] and Yao
[7], authors have assumed that algorithms read inputs once or at one place on a chip.
Under these conditions bounds have been derived of the form $AT^2 = \Omega(n^2)$, for n the
*This work was supported in part by the National Science Foundation under Grants ECS-80-24637 
and ECS-83-06812 and by the Semiconductor Research Corporation under Contract 83-01-032. 
Copyright © 1984 by Academic Press, Inc.
All rights of reproduction in any form reserved. 
number of input variables of a problem, for a variety of problems. Such bounds have 
been derived for the DFT and sorting [8], for binary integer multiplication [2; 9], for
matrix multiplication, matrix inverse, and the transitive closure of a Boolean matrix
[10], for binary integer powers and reciprocals [4], and for transitive functions [11].
All of these problems are described by multiple output functions. Similar results have
been obtained for predicates consisting of graph isomorphism [7], the recognition of
a context-free language, pattern matching, binary integer factorization testing [12],
and every predicate associated with a multiple output function for which such results 
have been derived [4], such as testing for integer and matrix factorization, and 
whether one matrix is the transitive closure of another. 
Chip algorithms that read inputs at multiple times and places¹ have been
considered by Yao for the function x + y*z with integers over a finite field F. He
shows that if a compact binary representation of F is given, then the inequality
$AT^2 = \Omega(n^2)$ must be satisfied for $n = \log_2 |F|$. If no limit on the redundancy of the
representation is imposed, then $AT^2 = \Omega(n^{3/2})$. Such algorithms have also been considered
by Kedem and Zorat [5; 6] for the cyclic shift function on n inputs. They show that
$AT^2 = \Omega((n/\mu)^2)$ when inputs are read $\mu n$ times. Kedem and Zorat, and Valiant [13]
have constructed good multilective algorithms for the cyclic shift function, and in 
addition, recently Kedem [14] has presented multilective algorithms for cyclic shift
that are optimal over a wide range of area and time. 
In this paper, we demonstrate that the multilective planar circuit size of functions 
can be used to derive lower bounds to several measures of area and time including 
$A^2T$ and $AT^2$ for multilective VLSI algorithms. We derive lower bounds to this
measure for a variety of problems, thus extending the results of Kedem and Zorat
[5; 6] for transitive functions. With multilective planar circuit complexity we
consolidate in one measure the several geometric and algorithmic issues that are 
usually dealt with separately in other models. 
We show that the multilective planar circuit size of a function can be bounded in 
terms of its Grigoryev independence properties [15], properties that have played an
important role in the study of space-time tradeoffs [16]. These properties are known
for a large variety of problems. We also directly derive lower bounds to the
multilective planar complexity of matrix multiplication and related problems, such as
the Kung-Leiserson LU decomposition of a matrix [17], problems for which the
independence properties give weaker results. Our methodology accommodates 
predicates, which we illustrate by deriving a lower bound to the multilective planar 
circuit size of a family of predicates. 
Bounds on chip area have been obtained by several authors under the assumption 
that inputs are read once in a data-independent manner. Bounds linear in the number 
of inputs have been derived for a number of multiple output functions including 
binary integer multiplication [2], for shifting and related functions [18], and for 
matrix multiplication [19]. A method of bounding chip area in terms of the one-way
flow between two processes has been given by Yao [7] and used by Savage [4] to
exhibit a Boolean predicate on n inputs that requires area $\Omega(n/\log_2 n)$.
¹Such algorithms are called multilective. A formal definition is given later.
In this paper we also examine chip area for multilective algorithms and obtain the 
first lower bounds in this case. We show that such bounds can be determined from a 
measure of the information flow which these functions exhibit, a measure that is used 
to bound planar circuit size. The area bounds and area-time bounds are combined to 
provide new area-time bounds for multilective algorithms. Finally, we illustrate the 
area-time bounds for multilective matrix multiplication by giving a family of area- 
optimal multilective algorithms. 
In Section 2 our VLSI model is presented, multilective planar circuit size is 
defined, and computational inequalities are presented in terms of this measure. The
inequalities state lower bounds to the measures $AT^2$ and $A^2T$ in terms of multilective
planar circuit complexity. In Subsection 3.1 we define the Grigoryev independence 
condition as well as several other conditions on functions that are needed. In 
Subsection 3.2 we define the new informational complexity measure, $w(u, v)$-flow,
that is used to derive lower bounds to the multilective planar circuit size of functions.
A function f over the set X has a $w(u, v)$-flow if for all subsets of the input and output
variables of f that consist of a fraction of at least u and v of the total, respectively,
there is some assignment to variables not in the selected input set such that the range
of the function to the selected output variables has at least $|X|^{w(u,v)}$ points. Theorem
5 establishes the following fundamental relationship between multilective planar
circuit size and the $w(u, v)$-flow of a function:
$$C_p^{(\mu)}(f) \geq w(u, v)^2/K.$$
Here K is a constant and u and v satisfy $u = (1 - \mu/P)$, $v = 1/P$ for any positive
integer P. In Subsection 3.3 we introduce a measure of the separation between inputs 
of predicates and use this measure to derive bounds on the multilective planar circuit 
size of predicates. 
In Section 4 bounds are derived on the planar circuit size of a variety of functions 
including matrix multiplication, the DFT, and the Kung-Leiserson LU algorithm. We 
also derive bounds on the area required for multilective algorithms. In Section 5 the 
results of the previous sections are combined to produce a number of computational 
inequalities involving area A and time T. Section 6 illustrates the lower bounds by 
presenting a family of area-optimal algorithms for matrix multiplication, and Section 
7 closes with comments. 
The paper has two appendices. The first, Appendix A, contains the proof of 
Theorem 4 stated in Subsection 2.3. The second, Appendix B, presents the planar
separator theorem [20] and an extension that is needed in Subsections 3.2 and 3.3 to 
establish the connection between planar circuit complexity and the measures of infor- 
mation flow and function separation that are defined in these sections. 
2. MEASURES OF CHIP AND CIRCUIT COSTS 
In this section the computational model used to represent VLSI chips and 
algorithms is described. Also, planar circuit complexity measures on functions are 
defined and then used in computational inequalities that relate them to chip area and 
computation time. 
2.1. The VLSI Model 
The derivation of limits on the performance of VLSI algorithms requires a clear 
understanding of the nature of the computational model used to represent VLSI chips. 
Accordingly, it is assumed that each chip satisfies the following assumptions: 
(Al) The chip realizes a clocked sequential machine whose states are tuples 
over a finite set X and whose transition function is realized by a circuit over a basis 
$\Omega = \{h: X^2 \to X\}$ (of logic elements), where interconnections of elements are made via
straight-wire segments.* The state of a chip is assumed to be recorded in some 
number b of memory cells taking values over X. A cycle of a computation is a single 
application of the transition function to the current state and input to produce a new 
state and output. Time T is measured in terms of the number of cycles executed. A 
chip has some number p of input pads and output pads, locations at which inputs are 
read and outputs produced. In principle, such pads can be used for both purposes. 
(A2) The chip has $\nu$ layers or planes on which wires can reside. Neither logic
elements, memory cells, nor pads are allowed to overlap wires or one another. 
Connections between wires on two different planes can be made. At most four wire 
segments meet on one plane of the chip. 
(A3) Wires have a feature width and length of at least $\lambda$, discrete logic elements,
memory cells and I/O pads occupy an area of at least $\lambda^2$, and the separation between
any two wires or logic elements on a given plane is at least $\lambda$. The area A of a chip is
the amount of surface area occupied by wires, exclusive of holes.
In general chips are viewed as sequential machines over the set $X = \{0, 1\}$ and the
memory cells and basis elements are Boolean. However, if the machines are word
oriented, then this model can be directly extended to this case, although the
feature width must then be adjusted to account for this. The restriction to logic elements of
fan-in 2 is not important, although it is essential that the fan-in be bounded, unless 
the basis is restricted to functions with a simple structure. In practice fan-in is limited 
by the electrical properties of gates. 
The restriction on the number of wire segments that meet on one plane reflects the 
limitations of geometry, and the number four is chosen for concreteness and because 
this is the maximum number that intersect when wires are rectilinear. The results are 
not sensitive to this number. It must only be bounded. Chip algorithms are assumed 
to satisfy the following scheduling condition on inputs and outputs:
(A4) Input variables are read and output variables produced at times and places
that are data-independent.
*Of course, wires carry signals over the set X.
Input variables can be read one or more times at one or more places on a chip. 
The following definition names each of these types of algorithms. 
DEFINITION 1. A chip algorithm is semelective if each input variable is read 
exactly once. It is unilocal if it reads all instances of any given input variable at a 
given location on the chip. It is multilective if it reads some of its n input variables 
more than once, and it is $(\beta, \mu)$-multilective if it reads input variables $\beta\mu n$ times, but
only $\mu n$ times when multiple reads of a variable at one location are treated as a single
read. 
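As an illustration of Definition 1, the following sketch classifies a read schedule. The encoding of a schedule as (variable, time, place) triples and the name `classify_schedule` are assumptions made here for illustration; they are not notation from the paper.

```python
from collections import Counter


def classify_schedule(reads, n):
    """Classify a data-independent read schedule in the sense of Definition 1.

    `reads` is a list of (variable, time, place) triples (a hypothetical
    encoding) and `n` is the number of input variables.
    """
    reads_per_var = Counter(var for var, _time, _place in reads)
    places = {}
    for var, _time, place in reads:
        places.setdefault(var, set()).add(place)

    total_reads = len(reads)                               # plays the role of beta*mu*n
    distinct_reads = sum(len(p) for p in places.values())  # plays the role of mu*n

    return {
        "semelective": len(reads_per_var) == n
        and all(count == 1 for count in reads_per_var.values()),
        "unilocal": all(len(p) == 1 for p in places.values()),
        "multilective": any(count > 1 for count in reads_per_var.values()),
        "mu": distinct_reads / n,
        "beta": total_reads / distinct_reads,
    }


# x0 is read three times at two pads, x1 once.
example = classify_schedule(
    [("x0", 0, "padA"), ("x0", 1, "padA"), ("x0", 2, "padB"), ("x1", 0, "padA")],
    n=2,
)
```

In this example the two reads of x0 at padA count once toward $\mu n$, so $\mu = 3/2$ and $\beta = 4/3$.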
2.2. Planar Circuit Size 
Circuits and circuit complexity play an important role in the subsequent 
development. (See [21] for a treatment in depth of Boolean circuit complexity.) In 
particular, planar embeddings of circuits with crossovers of edges play a central role 
here. For concreteness we assume that each circuit element has two inputs, but results 
cited below are only weakly dependent on this assumption. 
DEFINITION 2. The circuit complexity of a function $f: X^n \to X^m$, denoted $C(f)$, is
the size (number of logic elements) of the smallest logic circuit for f relative to a
basis $\Omega$, which is taken for simplicity to be the set of all functions $h: X^2 \to X$. A
planar embedding (with crossovers) of a circuit is one in which edges may cross, but
only in pairs, and otherwise elements do not overlap nor do edges and logic elements.
The size of a planar embedding is the number of logic elements and crossings in the
embedding. The planar circuit complexity $C_p(f)$ over the basis $\Omega$ of a function f is
the size of the smallest planar embedding of a circuit for f. If a planar embedding of a
circuit has one node associated with each input variable it is said to be semelective
and the complexity measure is denoted as $C_p^{(1)}(f)$. If a planar embedding of a circuit
has $\mu n$ nodes associated with inputs of a function of n input variables, $\mu \geq 1$, then it is
said to be multilective of order $\mu$ and the planar circuit complexity with this
restriction is denoted $C_p^{(\mu)}(f)$.
Planar circuit complexity is defined by associating a planar graph with a circuit by 
treating crossing edges as vertices in a new graph. This association does not change 
the circuit functionality in any way but does allow us to treat only planar graphs. 
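A minimal sketch of the size measure, assuming a circuit is drawn with straight wires between given node positions (the names `embedding_size` and `properly_cross` are hypothetical). It counts logic elements plus pairwise wire crossings for one particular drawing, which gives only an upper bound on the planar circuit complexity, not the minimum over all embeddings.

```python
from itertools import combinations


def _orientation(p, q, r):
    """Sign (+1, 0, -1) of the cross product (q - p) x (r - p)."""
    cross = (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])
    return (cross > 0) - (cross < 0)


def properly_cross(seg1, seg2):
    """True if two segments cross at a point interior to both."""
    a, b = seg1
    c, d = seg2
    return (_orientation(a, b, c) * _orientation(a, b, d) < 0
            and _orientation(c, d, a) * _orientation(c, d, b) < 0)


def embedding_size(positions, gates, wires):
    """Number of logic elements plus number of wire crossings in the drawing."""
    segments = [(positions[src], positions[dst]) for src, dst in wires]
    crossings = sum(properly_cross(s, t) for s, t in combinations(segments, 2))
    return len(gates) + crossings


# Two inputs feeding two 2-input gates; the diagonal wires cross once.
positions = {"x": (0, 0), "y": (0, 2), "g1": (2, 2), "g2": (2, 0)}
wires = [("x", "g1"), ("y", "g2"), ("x", "g2"), ("y", "g1")]
size = embedding_size(positions, gates={"g1", "g2"}, wires=wires)
```

Wires that merely share an endpoint are not counted as crossing, so the example has size 3: two gates and one crossing.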
The definition of planar circuit complexity for a function f is also the sum of the 
crossing number (defined by Leighton [22]) and the number of logic elements in a 
circuit for f for which the sum is minimal. This definition was prompted by the obser- 
vation of Aggarwal [23] that constructing a planar circuit from a nonplanar circuit 
by replacing crossing edges by three exclusive ORs, as suggested in [20] and as used
in [4], may lead to cycles in a planar graph. This new definition eliminates the problem.
The usual definition of circuit complexity assumes that the functions and the bases 
are Boolean. Again, for word-oriented chips, it may be convenient to consider non-Boolean
functions. That is the reason for considering functions and circuit elements
over the set X, which may be nonbinary.
The following two results are restatements of results presented in [4] and derived 
in [24], but suitably modified to take the new definition of planar circuit size into 
account. 
THEOREM 1. For all functions $f: \{0, 1\}^n \to \{0, 1\}^m$ and for $X = \{0, 1\}$,
[displayed inequality not recoverable from the source text]
for $\mu \geq 1$.
THEOREM 2. For any $0 < \delta < 1$ a fraction of at least $1 - 2^{-\delta m 2^n}$ of the functions
$f: \{0, 1\}^n \to \{0, 1\}^m$ have
$$C(f) \geq m(2^n/n)(1 - \delta - o(n))$$
if $\log m = O(n)$, and
$$C_p^{(1)}(f) \leq 4m2^n$$
for all such functions and all values of m and n. These results apply when $X = \{0, 1\}$.
The first result demonstrates that planar circuit complexity is no larger than 
quadratic in the standard circuit complexity. The latter result demonstrates that for 
most Boolean functions, standard and planar circuit complexity measures are both 
exponential in n. 
2.3. Computational Inequalities 
The following computational inequalities relate the various circuit complexity 
measures to chip area A, the number of cycles of execution T, and the chip 
parameters $\nu$, the number of planes in the chip, and $\lambda$, the minimum feature width.
Theorem 3 is a restatement of a result in [25]. 
THEOREM 3. Let $f: X^n \to X^m$ be computed in T cycles on a VLSI chip of area A
with an algorithm that meets conditions (A1) through (A4). Then the inequality
$$C(f) \leq \nu (A/\lambda^2)\, T$$
must be satisfied.
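Reading the inequality of Theorem 3 as $C(f) \leq \nu (A/\lambda^2) T$, a lower bound on the number of cycles follows by rearranging. The sketch below does only that arithmetic; the function name `min_cycles` and the sample numbers are illustrative assumptions, not values from the paper.

```python
import math


def min_cycles(circuit_size, area, nu, lam):
    """Smallest integer T satisfying circuit_size <= nu * (area / lam**2) * T."""
    return math.ceil(circuit_size * lam**2 / (nu * area))


# A function of circuit size 10**6 on a two-layer chip of area 10**4 (lambda = 1).
cycles = min_cycles(1_000_000, area=10_000, nu=2, lam=1)
```

Doubling the area or the number of layers halves this bound, which is the sense in which Theorem 3 constrains the product AT.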
The next result is an extension of Theorem 4 of [4]. It gives two computational 
inequalities that must be satisfied. The first is bounded by a constant multiple of the
standard measure $AT^2$ while the second is bounded by a constant multiple of the new
complementary measure $A^2T$. Both lower bounds are stated in terms of multilective
planar circuit size, although their multilective orders are different. The proof of this 
theorem is given in Appendix A. 
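THEOREM 4. Let $f: X^n \to X^m$ be computed in T cycles on a VLSI chip of area
$A \geq \lambda^2$ and $\nu \geq 2$ planes with a $(\beta, \mu)$-multilective algorithm. Then, each of the
following inequalities must be satisfied:
$$C_p^{(\mu)}(f) \leq [\text{constant illegible}]\,(\nu/\lambda)^2 A T^2$$
[second displayed inequality, bounding $C_p^{(\beta\mu)}(f)$ by a multiple of $A^2T$, not recoverable from the source text]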
These two theorems present three tradeoff inequalities involving area, time, and 
two parameters of chip layouts, the feature width and the number of layers of wires 
permitted by the VLSI technology. The three inequalities are not strictly comparable. 
The first, that of Theorem 3, is stated in terms of the product AT, while the latter two 
are stated in terms of the products $AT^2$ and $A^2T$, respectively. The measure AT is
smaller than either of the other two, but the lower bound on it, namely, circuit size
$C(f)$, is potentially smaller than the other two lower bounds, which are planar circuit
size measures. Also, the lower bound to $A^2T$, namely, $C_p^{(\beta\mu)}(f)$, is potentially smaller
than the lower bound to $AT^2$, namely, $C_p^{(\mu)}(f)$, since the former permits more copies
of the input nodes than does the latter for $(\beta, \mu)$-multilective algorithms.
These three inequalities give significance to circuit complexity and planar circuit 
complexity for VLSI algorithms. The next section of this paper is devoted to stating 
and applying general conditions under which lower bounds to the planar circuit 
complexity of important problems can be derived. 
Valiant [13] has observed an interesting connection between the measure $A^2T$ that
appears in Theorem 4 and lower bounds to space and time for uniprocessor
machines. He notes that the analysis used by Grigoryev [15], which is for
multilective algorithms, can be extended to the VLSI model to obtain a lower bound
to $A^2T$ and that all previous bounds obtained with the Grigoryev method apply here.
Thus, quadratic lower bounds that have been previously derived for space-time trade-offs
apply to the measure $A^2T$ for many multiple-output functions when algorithms
are multilective. They include such functions as polynomial multiplication over
GF(2) [15], the discrete Fourier transform [26], binary integer multiplication [27],
and matrix inversion [28]. (Grigoryev [15] has also shown an $\Omega(s^3)$ lower bound for
$s \times s$ matrix multiplication.) This extension of Grigoryev’s result is obtained by a
straightforward application of his method of proof to multi-output algorithms. 
In Section 5 a lower bound to the area required by multilective algorithms is 
derived and this bound and a result by Kedem yield the Grigoryev type bound for 
functions other than matrix multiplication. A stronger bound is derived for the latter 
problem. 
3. MEASURES OF INFORMATIONAL COMPLEXITY 
This section deals with three topics: the Grigoryev independence condition and
two new information measures that are used to derive lower bounds on the planar
circuit complexity of functions and predicates. The new measure for functions is
called $w(u, v)$-flow while that for predicates is called $w(u, v)$-separation. The former
measures information that must flow from inputs to outputs of the function, while the 
latter measures the degree to which the values of some input variables specify the 
values of other input variables. 
These measures are defined and used to derive lower bounds on the planar circuit 
complexity of functions and predicates. They are applied to individual problems in 
Section 4. 
3.1. The Grigoryev Independence Condition 
The class of functions defined below is an extension of the class identified by 
Grigoryev [15], which has been studied in the context of space-time trade-offs.
DEFINITION 3. A function $f: X^n \to X^m$ with input and output variables
$I = \{x_1, \ldots, x_n\}$, $J = \{y_1, \ldots, y_m\}$, respectively, is $(\alpha, l, d)$-independent for $\alpha \geq 1$ if the
following conditions hold:
(1) There exist sets $I_0 \subseteq I$, $|I_0| = c$, $J_0 \subseteq J$, $|J_0| = d$, such that
(2) for all $k \leq l$,
(3) for all sets of $k \leq c$ indices $i_1, i_2, \ldots, i_k$,
(4) for all sets of $l - k \leq d$ indices $j_1, j_2, \ldots, j_{l-k}$,
such that $x_{i_1}, \ldots, x_{i_k} \in I_0$, $y_{j_1}, \ldots, y_{j_{l-k}} \in J_0$ there is an assignment
$\sigma: \{x_{i_1}, \ldots, x_{i_k}\} \cup (I - I_0) \to X$ such that the function $(y_{j_1}, \ldots, y_{j_{l-k}})$ in the remaining
input variables contains at least $|X|^{(l-k)/\alpha}$ points in the image of its domain.
Note that any function that is $(\alpha, l, d)$-independent is also $(\alpha, l', d)$-independent for
$l' \leq l$. Note also that $l \leq \min(c, d)$. To apply this definition to a variety of problems it
is convenient to make a few additional definitions. 
DEFINITION 4. A function f is a subfunction of a function h if it is obtained by an 
assignment to some of the input variables and/or by the suppression of some output 
variables. 
DEFINITION 5. The Boolean shifting function $f_S^{(n)}: \{0, 1\}^{n+k} \to \{0, 1\}^{2n-1}$, where
$f_S^{(n)}(x_0, \ldots, x_{n-1}, s_1, \ldots, s_k) = (y_0, \ldots, y_{2n-2})$ with control variables $s_1, \ldots, s_k$, realizes the
mappings
$$y_j = \begin{cases} x_{j-t}, & t \leq j \leq t + n - 1, \\ 0, & \text{otherwise,} \end{cases}$$
for each $0 \leq t \leq n - 1$ by some assignment to $(s_1, \ldots, s_k)$. The Boolean cyclic shift
function over $X = \{0, 1\}$, $f_C^{(n)}(x_0, \ldots, x_{n-1}, s_1, \ldots, s_k) = (y_0, \ldots, y_{n-1})$, realizes each cyclic
shift of $x_0, \ldots, x_{n-1}$ for some assignment to $s_1, \ldots, s_k$.
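The two functions of Definition 5 can be sketched as follows. For simplicity the shift amount t is passed as an integer rather than decoded from the control variables $s_1, \ldots, s_k$, and the direction of the cyclic shift is an assumption; the definition only requires that every shift be realized by some control assignment.

```python
def shift(x, t):
    """Shifting function: place x_0..x_{n-1} at output positions t..t+n-1.

    The output has 2n - 1 positions; positions outside the window are 0.
    """
    n = len(x)
    if not 0 <= t <= n - 1:
        raise ValueError("shift amount must satisfy 0 <= t <= n - 1")
    y = [0] * (2 * n - 1)
    for j in range(t, t + n):
        y[j] = x[j - t]
    return y


def cyclic_shift(x, t):
    """Cyclic shift: output position j receives input position (j - t) mod n."""
    n = len(x)
    return [x[(j - t) % n] for j in range(n)]


shifted = shift([1, 0, 1], 1)
rotated = cyclic_shift([1, 2, 3], 1)
```

With t = 0 both functions copy their inputs to the first n outputs unchanged.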
DEFINITION 6. The function $h(x_1, \ldots, x_n, s_1, \ldots, s_k) = (y_1, \ldots, y_n)$, $x_i, y_j, s_i \in X$, is
transitive [11] of order n over X if for each permutation g in the permutation group
G over $\{1, 2, \ldots, n\}$,
(1) there is an assignment to $s_1, \ldots, s_k$ such that $y_i = x_{g(i)}$, $1 \leq i \leq n$, and
(2) for each $1 \leq i, j \leq n$ there exists $g \in G$ such that $g(i) = j$.
A function is also transitive of order n if it has a subfunction of this type.
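Condition (2) of Definition 6 says that the group G acts transitively on the index set. A minimal sketch of a check, assuming G is given by a list of generators written as 0-indexed permutation lists (the name `is_transitive` is hypothetical): for a finite group, the orbit of one index under repeated application of the generators is the orbit under the whole group.

```python
def is_transitive(generators, n):
    """True if the group generated by `generators` moves index 0 to every index.

    Each generator is a list g of length n with g[i] the image of i.
    """
    seen = {0}
    frontier = [0]
    while frontier:
        i = frontier.pop()
        for g in generators:
            j = g[i]
            if j not in seen:
                seen.add(j)
                frontier.append(j)
    return len(seen) == n


cyclic_group = is_transitive([[1, 2, 3, 0]], 4)   # generated by a single 4-cycle
two_swaps = is_transitive([[1, 0, 3, 2]], 4)      # swaps (0 1) and (2 3) only
```

The cyclic group, which underlies the cyclic shift function, passes the check; the group generated by two disjoint swaps does not.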
Independence properties have been determined for a variety of functions. In [29]³
it is shown that the shifting function $f_S^{(n)}$ is $(\alpha, n(1 - 1/\alpha), (n + 1)/2)$-independent for
$1 \leq \alpha \leq 2$. Thus, the same is true of the binary integer multiplication function $f_M^{(n)}$,
which improves on the bound of [27] for this problem when $\alpha = 2$. In [4; 24] it is
shown that the functions $f_P^{(n,a)}$ and $f_R^{(n,b)}$, given below,
$$f_P^{(n,a)} = [x^a], \qquad f_R^{(n,b)} = [(2^n/y)^b]$$
for $1 \leq x, y \leq 2^n - 1$, $a > 1$, and $b = q/2^k > 0$, k and q integers independent of n,
contain the shifting function $f_S^{(q)}$ for $q = o(n)$, from which we have their independence
properties. A large number of different functions fall into this class such as
square roots and their reciprocals. In the paper in which Grigoryev defined the
independence condition [15] he demonstrated that the matrix-matrix multiplication
function for $s \times s$ matrices over a ring X is $(1, s, s^2)$-independent.
Also in [29] it is shown that every transitive function of order n over X is
$(\alpha, n(1 - 1/\alpha), n)$-independent for $\alpha \geq 1$. From this follow the independence
properties of the sorting function when inputs are binary strings lexicographically
ordered. Vuillemin has shown [11] that the product PAQ of three $s \times s$ matrices over
a finite field F, where P and Q are permutation matrices, is transitive of order $s^2$.
Since it has been shown that the inverse of an $s \times s$ matrix over the finite field F
contains a subfunction that is transitive of order $s^2/4$ [4; 24], the independence
properties of matrix inversion are known. Those of banded matrix inversion are given 
in [29]. 
3.2. An Information Flow Measure for Functions 
In this section we introduce an information flow measure that generalizes the 
measures used by previous authors to derive lower bounds to $AT^2$ for VLSI
algorithms. This measure captures the degree to which algorithms are multilective, 
that is, it takes into account the number of times and places that an algorithm reads 
variables. This measure, w(u, v)-flow, is defined below. 
DEFINITION 7. Let A and B be subsets of the sets of input and output variables,
respectively, of a function $f: X^n \to X^m$. It is said to have a $w(u, v)$-flow if for all
subsets $U \subseteq A$ and $V \subseteq B$ with $|U| \geq u|A|$, $|V| \geq v|B|$, $0 \leq u, v \leq 1$, there is some
assignment to inputs not in U such that the resulting subfunction h of f from U to V
has at least $|X|^{w(u,v)}$ points in the image of its domain. A function f also has a
$w(u, v)$-flow if it contains a subfunction with this property.
³In this reference the independence condition is stated a bit differently. The difference amounts to
multiplying $\alpha$ by $\log_2 |X|$.
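For very small functions the flow of Definition 7 can be computed by exhaustive search. The sketch below is a brute-force illustration under stated assumptions: A and B are taken to be all inputs and all outputs, subfunctions are not explored, and the name `flow` and the toy identity function are not from the paper.

```python
from itertools import combinations, product
from math import ceil, log


def flow(f, n, m, u, v, alphabet=(0, 1)):
    """Largest w such that f has a w(u, v)-flow with A, B all inputs and outputs.

    f maps a tuple of n symbols to a sequence of m symbols.  The search is
    exponential in n and is meant only for toy examples.
    """
    best = None
    for size_u in range(ceil(u * n), n + 1):
        for U in combinations(range(n), size_u):
            fixed_positions = [i for i in range(n) if i not in U]
            for size_v in range(ceil(v * m), m + 1):
                for V in combinations(range(m), size_v):
                    largest = 0  # best image size over assignments outside U
                    for fixed in product(alphabet, repeat=len(fixed_positions)):
                        image = set()
                        for free in product(alphabet, repeat=len(U)):
                            x = [None] * n
                            for i, value in zip(fixed_positions, fixed):
                                x[i] = value
                            for i, value in zip(U, free):
                                x[i] = value
                            y = f(tuple(x))
                            image.add(tuple(y[j] for j in V))
                        largest = max(largest, len(image))
                    w = log(largest) / log(len(alphabet))
                    best = w if best is None else min(best, w)
    return best


def identity(x):
    return x


w_large = flow(identity, 4, 4, 0.75, 0.75)
w_small = flow(identity, 4, 4, 0.5, 0.5)
```

For the identity on four bits, any three inputs and three outputs share at least two positions, so the flow is 2 at u = v = 3/4, whereas at u = v = 1/2 the chosen inputs and outputs can be disjoint and the flow drops to 0.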
This measure is a generalization of that introduced by Thompson [1] and Brent
and Kung [2] for multiple-output functions and one stated in [4] for semelective chip 
algorithms. The following theorem illustrates the importance of this definition. It 
plays a central role in the derivation of area-time bounds for functions. 
THEOREM 5. If $f: X^n \to X^m$ has a $w(u, v)$-flow for $u = (1 - \mu/P)$ and $v = 1/4P$,
then
$$C_p^{(\mu)}(f) \geq w(u, v)^2/K_1^2.$$
Proof. Without loss of generality, let f have a $w(u, v)$-flow. Consider a planar
circuit for f with N vertices. Apply Theorem B.2, an extension of the planar separator
theorem, to this circuit so as to partition its output vertices into P disjoint sets
$\{A_i \mid 1 \leq i \leq P\}$, where each set contains between $m/4P$ and $4m/P$ of the output
vertices. Either $A_i$ or $A_i^c$, the complement of $A_i$, contains vertices associated with n/2
of the input variables. Since the total number of input vertices in the P sets is $\mu n$, it
follows that some set $A_r$ contains at most $\mu n/P$ input vertices and at most this many
distinct input variables. Then $A_r^c$, the complement of $A_r$, contains at least $n - \mu n/P$
input variables that are not contained in $A_r$ itself.
Let $C_r$ be the separator associated with $A_r$ and assume that $|C_r| < w(u, v)$ for
$u = (1 - \mu/P)$ and $v = 1/4P$. Consider the subfunction of f from the input variables in
$A_r^c$ that are not found in $A_r$ to the output variables in $A_r$. It has at least $|X|^{w(u,v)}$
points in the image of its domain. However, all paths from $A_r^c$ to $A_r$ must pass
through $C_r$, which does not have a sufficient number of outputs to support a range of
this size. Therefore, we conclude that $K_1\sqrt{N} \geq |C_r| \geq w(u, v)$, from which we have
the desired conclusion. ∎
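Theorem 5 leaves the integer P free, so in applying it one would choose P to make $w(1 - \mu/P, 1/4P)$ as large as possible. The sketch below searches over P for a flow of the product form $w(u, v) = uvn$ (the form stated for the DFT in Lemma 2); the helper name `best_partition`, the search limit, and the sample values of n and $\mu$ are illustrative assumptions.

```python
def best_partition(w, mu, p_max):
    """Return (P, w(1 - mu/P, 1/(4P))) maximizing the flow over mu < P <= p_max."""
    candidates = range(int(mu) + 1, p_max + 1)
    return max(
        ((P, w(1 - mu / P, 1 / (4 * P))) for P in candidates),
        key=lambda pair: pair[1],
    )


n, mu = 1024, 2
P_star, w_star = best_partition(lambda u, v: u * v * n, mu, p_max=64)
size_bound = w_star ** 2   # lower bound on planar circuit size, up to the theorem's constant
```

For this form of w the maximum is attained at $P = 2\mu$, giving $w = n/16\mu$, so the resulting bound on planar circuit size falls off as $1/\mu^2$.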
Lower bounds to the semelective planar circuit complexity of functions and 
predicates are given in [4]. 
LEMMA 1. Let $f: X^n \to X^m$ have sets A and B of input and output variables and
let it contain a subfunction that is $(\alpha, l, d)$-independent. Then it has a $w(u, v)$-flow
given by
$$w(u, v) = \max(\min(vd, (l - (1 - u)n)), 0)/\alpha.$$
Proof. Let h be the subfunction of f which satisfies the independence condition.
Let U and V be subsets of the inputs and outputs A and B of h, respectively, where B
is assumed to be the subset of the d output variables for which $(\alpha, l, d)$-independence
is defined, and where $|U| \geq un$, $|V| \geq vd$, for some $0 \leq u, v \leq 1$. Since the function is
$(\alpha, l, d)$-independent, if $(1 - u)n + vd \leq l$, then there is a subfunction obtained by
some assignment to variables in $U^c$ that has at least $|X|^{vd/\alpha}$ points in the image of its
domain. If $(1 - u)n + vd > l$, consider any $q = \max(l - (1 - u)n, 0)$ output
variables in V. Then since $(1 - u)n + q = l$, there is a subfunction obtained by fixing
the variables in $U^c$ that has $|X|^{q/\alpha}$ points in the image of its domain. Therefore, the
function has a $w(u, v)$-flow for $w(u, v) = \max(\min(vd, (l - (1 - u)n)), 0)/\alpha$. ∎
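The formula of Lemma 1 is easy to evaluate. The sketch below does so for a transitive function of order n, using the independence parameters $(\alpha, n(1 - 1/\alpha), n)$ quoted in Subsection 3.1; the function name `lemma1_flow` and the sample values are illustrative.

```python
def lemma1_flow(u, v, n, alpha, l, d):
    """w(u, v) = max(min(v*d, l - (1 - u)*n), 0) / alpha, as in Lemma 1."""
    return max(min(v * d, l - (1 - u) * n), 0) / alpha


n, alpha = 64, 2
l, d = n * (1 - 1 / alpha), n          # transitive function of order n

w_balanced = lemma1_flow(0.75, 0.25, n, alpha, l, d)
w_too_few_inputs = lemma1_flow(0.25, 0.25, n, alpha, l, d)
```

When too few inputs are left free, so that $(1 - u)n \geq l$, the formula gives a flow of zero, which is why Theorem 5 needs u close to 1.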
The discrete Fourier transform (DFT) has been studied by Thompson [1], who has
shown that $AT^2 = \Omega(n^2)$ for semelective algorithms. The following lemma states a
bound on its $w(u, v)$-flow directly, since its Grigoryev independence properties seem to be
inadequate for use with the above lemma.
LEMMA 2. Let $f_{\mathrm{DFT}}^{(n)}: X^n \to X^n$ denote the DFT over a finite ring X. Then it has a
$w(u, v)$-flow which satisfies
$$w(u, v) = uvn.$$
Proof. Consider the Vandermonde matrix F that defines the DFT. Pick any vn
rows corresponding to output variables. Pick any un columns corresponding to input
variables. Let the columns be numbered from 0 to n - 1. We claim that there exists
some set of vn consecutive columns of F (i.e., columns numbered $j, j + 1, \ldots,
(j + vn - 1) \bmod n$) such that this set contains at least uvn of the chosen inputs. This
claim is established by a simple counting argument in which we consider all distinct
consecutive blocks of vn input variables and show that the average overlap with the
un prechosen inputs has this value.
As has been noted many times before, the submatrix of F defined by vn rows and
vn consecutive columns is nonsingular. Thus, the uvn nonconstant input variables in
the set of vn consecutive inputs are mapped 1-1 onto the output variables. Thus, the
image of the domain contains at least $|X|^{uvn}$ points. ∎
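The counting argument in the proof of Lemma 2 can be checked numerically. Each chosen column lies in exactly vn of the n cyclic windows of length vn, so the overlaps sum to vn times un and their average is uvn. The sketch below computes the overlaps for one example; the name `window_overlaps` and the sample set are illustrative assumptions.

```python
def window_overlaps(chosen, n, length):
    """Overlap of each cyclic window j, j+1, ..., j+length-1 (mod n) with `chosen`."""
    chosen = set(chosen)
    return [
        sum((j + i) % n in chosen for i in range(length))
        for j in range(n)
    ]


n = 12
chosen = {0, 1, 2, 6, 7, 11}     # un = 6 columns, so u = 1/2
length = 4                       # vn = 4 consecutive columns, so v = 1/3
overlaps = window_overlaps(chosen, n, length)
average = sum(overlaps) / n      # equals u * v * n
```

Since some window must reach the average, at least one block of vn consecutive columns contains uvn or more of the chosen inputs, as the proof claims.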
Lemma 1 is useful for bounding the $w(u, v)$-flow of many functions. However, it is
weak for matrix multiplication. Below we derive a separate lower bound for this 
problem. 
LEMMA 3. Let the matrix product $C = D \times E$ of $s \times s$ matrices be defined over a
ring with a set X of elements. Then it has a $w(u, v)$-flow for
$$w(u, v) = \max(v - (1 - u)^2, 0)\, s^2/2$$
for $0 \leq u, v \leq 1$.
Proof. Let U be a subset of the input variables with $|U| \geq u(2s^2)$, and let V be a
subset of the output variables with $|V| \geq v(s^2)$. Let $\bar{C}$, $\bar{D}$, and $\bar{E}$ be matrices over the
set {0, 1} whose entries are 1 if and only if the corresponding elements of D and E
are contained in the set U and the corresponding elements of C are contained in V. If
E is the kth cyclic permutation matrix, then C is equal to D(k), the matrix obtained
from D by postmultiplying it by E. Let $\bar{D}(k)$ be the corresponding 0, 1 matrix.
Similarly, $\bar{E}(j)$ identifies the entries of E that are in U when D is the jth cyclic
permutation matrix.
Let A and B be two square matrices over {0, 1}, and let |A| denote the number of
entries of A that are 1. Denote with $A \cap B$ the matrix
whose entries are 1 exactly where the entries of both matrices are 1 and with $A \cup B$
the matrix whose entries are 1 where either is 1. Then we have the following identity.
$$|A \cup B| + |A \cap B| = |A| + |B|$$
It follows that if these matrices are $s \times s$, then $|A \cap B| \geq |A| + |B| - s^2$.
From the above discussion, it is clear that $|\bar{C} \cap \bar{D}(k)|$ is the number of outputs in
V that can be equated with inputs in U that are elements of D, when E is the kth
cyclic permutation matrix. Thus, a flow of at least $|\bar{C} \cap \bar{D}(k)|$ is possible. A similar
interpretation can be given to $|\bar{C} \cap \bar{E}(j)|$. Thus, if the expression
$Q = |\bar{C} \cap \bar{D}(k)| + |\bar{C} \cap \bar{E}(j)|$ is large then one of the two terms is also large and at
least Q/2. Thus, a flow of at least Q/2 exists. We now derive a lower bound to Q.
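From the preceding discussion it follows that 
$$Q \ge |\bar{C} \cap (\bar{D}(k) \cup \bar{E}(j))|.$$
Furthermore, we have that 
$$Q \ge |\bar{C}| + |\bar{D}(k)| + |\bar{E}(j)| - |\bar{D}(k) \cap \bar{E}(j)| - s^2.$$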
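Now in Lemma 1 of [10] it is shown that there exist integers j and k such that 
$|\bar{D}(k) \cap \bar{E}(j)| \le |\bar{D}||\bar{E}|/s^2$, where we note that $|\bar{D}(k)| = |\bar{D}|$ and $|\bar{E}(j)| = |\bar{E}|$. 
Therefore, we have that 
$$Q \ge |\bar{C}| + |\bar{D}| + |\bar{E}| - (|\bar{D}||\bar{E}|/s^2) - s^2.$$
In turn, if we let $z = |\bar{D}| + |\bar{E}|$ and minimize this bound on $|\bar{D}|$, then we have that 
$$Q \ge |\bar{C}| + z - z^2/(4s^2) - s^2,$$
where $z \ge u(2s^2)$. Since the bound is a monotone increasing function of z for $z \le 2s^2$, 
it is minimized by $z = u(2s^2)$. The result follows directly. ∎ 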
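The counting step cited from [10] can be checked on small examples. The sketch below is illustrative only: it assumes that the indicator matrix for D(k) is obtained by a cyclic column shift by k and the one for E(j) by a cyclic row shift by j, and the helper names (`matmul_flow`, `overlap_counts`, and so on) are hypothetical. Summed over all s^2 pairs of shifts, the overlaps total exactly |D||E| marked-cell pairs, so some pair (j, k) has overlap at most |D||E|/s^2.

```python
from itertools import product

def matmul_flow(u: float, v: float, s: int) -> float:
    """Bound of Lemma 3: w(u, v) = max(v - (1 - u)^2, 0) * s^2 / 2."""
    return max(v - (1.0 - u) ** 2, 0.0) * s * s / 2.0

def shift_cols(cells, k, s):
    """Cyclically shift the column index of every marked cell by k (assumed convention)."""
    return {(r, (c + k) % s) for (r, c) in cells}

def shift_rows(cells, j, s):
    """Cyclically shift the row index of every marked cell by j (assumed convention)."""
    return {((r + j) % s, c) for (r, c) in cells}

def overlap_counts(d_cells, e_cells, s):
    """Size of the overlap of the shifted indicator sets for every pair (j, k)."""
    return {
        (j, k): len(shift_cols(d_cells, k, s) & shift_rows(e_cells, j, s))
        for j, k in product(range(s), repeat=2)
    }

# Hypothetical marked positions (entries of D and E that lie in U).
s = 4
d_cells = {(0, 0), (0, 1), (1, 2), (3, 3), (2, 0)}
e_cells = {(0, 0), (1, 1), (2, 2), (3, 0)}

counts = overlap_counts(d_cells, e_cells, s)
total = sum(counts.values())   # equals |D| * |E|
best = min(counts.values())    # some shift pair is at or below the average
```

Because each marked cell of D meets each marked cell of E under exactly one pair of shifts, `total` equals `len(d_cells) * len(e_cells)`, and `best` is therefore at most that product divided by s^2.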
3.3. An Informational Measure for Predicates 
In this section we introduce a measure of the separation between input variables of 
predicates, w(u, v)-separation. This measure captures the degree to which we can 
distinguish between the values of input variables in two disjoint sets of a predicate. 
DEFINITION 8. A function $f: X^n \to X$ is w(u, v)-separated, where 0 < u, v < 1, if 
for any partition of its variables A into three sets U, V, and Z such that $|U| \ge u|A|$, 
$|V| \ge v|A|$, there is some assignment z to the variables in Z such that for the 
subfunction in the remaining variables there exist at least $|X|^{w(u,v)}$ triples 
$\{(u_i, v_i, z)\}$, where $u_i$ and $v_i$ are over U and V, and such that $f(u_i, v_j, z) = 1$ if i = j, 
and 0, otherwise. (We assume that input variables may be permuted to put them into 
this order.) A function $f: X^n \to X^m$ is also w(u, v)-separated if it contains a 
subfunction which is w(u, v)-separated. 
This condition is a generalization of one given in [4; 24] on functions that are 
computed by semelective planar circuits, which is in turn similar to a condition given 
by Yao [30] to determine the amount of information that must flow between two 
processors to compute a function, and to a condition given by Lipton and Sedgewick 
[12] for the computation of functions by unilocal chip algorithms. 
THEOREM 6. If $f: X^n \to X$ is w(u, v)-separated for $u = v = (1 - 16\mu/(P - 1))/16\mu P$, 
then 
q'(f)> w(u, V)2/4Ki. 
Proof. Let f be itself w(u, v)-separated. Consider a planar circuit for f with a set 
V of N vertices in which the number of vertices associated with inputs is μn. Some 
input variables may be repeated many times, that is, have a high multiplicity. 
However, there are $h \ge n/2$ that have a multiplicity of no more than 2μ. Let H be the 
set of such input variables. Apply Theorem B.2 with a cost function $c: V \to \{0, 1\}$ that 
assigns unit weight to these vertices. Then, the vertices of the circuit are partitioned 
into P disjoint sets $\{A_i \mid 1 \le i \le P\}$, where each set has a cost between c(V)/4P and 
4c(V)/P, where $c(V) \ge h \ge n/2$. We now show that there exist two sets $A_r$ and $A_s$ 
such that each has at least $(n/16\mu P)(1 - 16\mu/(P - 1))$ distinct input variables in H 
that do not appear in the other. 
Let $e_i(t)$ have value 1 if a vertex with input label $x_t$ in H is contained in $A_i$, and 0, 
otherwise. Let $\sigma_i(t)$ be the number of such vertices. Also let $\gamma_{ij}$ be the number of 
vertices in $A_i \cap H$ that are associated with variables that appear in $A_j \cap H$, $i \ne j$. This 
function is not symmetric in i and j, in general. We show that there exist indices r 
and s such that both $\gamma_{rs}$ and $\gamma_{sr}$ are not too big. The average values of $\gamma_{ij}$ and $\gamma_{ji}$ over 
the set $\{A_i \mid 1 \le i \le P\}$ are the same and bounds on them are given below. 
$$E(\gamma_{ij}) = \Big(\sum_t \sum_{i \ne j} \sigma_i(t)\, e_j(t)\Big) \Big/ P(P - 1)$$
$$= \Big[\sum_t \Big(\Big(\sum_i e_i(t)\Big)\Big(\sum_i \sigma_i(t)\Big) - \sum_i \sigma_i(t)\Big)\Big] \Big/ P(P - 1) \le 2\mu\, c(V)/P(P - 1),$$
since $\sum_i e_i(t) \le 2\mu$ and $\sum_t \sum_i \sigma_i(t) = c(V)$. 
Now if two nonnegative random variables y and z on a common sample space 
have averages $\bar{y}$ and $\bar{z}$ with respect to a uniform distribution, then there is some point 
in the sample space for which both $y \le 2\bar{y}$ and $z \le 2\bar{z}$. (The number of points for 
which $y > 2\bar{y}$ and for which $z > 2\bar{z}$ are each less than half the total, so at least one 
point exists for which both of these previous inequalities are satisfied.) It follows that 
both $\gamma_{rs}$ and $\gamma_{sr}$ are no more than $4\mu c(V)/P(P - 1)$ for some values of r and s, $r \ne s$. 
Therefore, both $A_r$ and $A_s$ must have at least $(c(V)/P)(1/4 - 4\mu/(P - 1))$ instances of 
input vertices in H that do not appear in the other. Since each input variable in H has 
a multiplicity of at most 2μ and $c(V) \ge n/2$, each of these sets contains occurrences 
of at least $(n/16\mu P)(1 - 16\mu/(P - 1))$ distinct input variables in H that are not 
contained in the other. Let $B_r$ and $B_s$ be these two disjoint subsets of H. 
In Definition 8 identify U with the input variables in the set $B_r$, V with the input 
variables in $B_s$, and Z with the remaining input variables. The separator set $C_r$ 
breaks all paths between $A_r$ and vertices in the rest of the circuit. $C_r$ may overlap U, 
V, and Z. It has at most $2|C_r|$ entering edges. Suppose that $2|C_r| < w(u, v)$. Fix the 
values of variables in Z. Then, given $|X|^{w(u,v)}$ distinct values $\{(u_i, v_i, z)\}$ for input 
variables in U and V, by the pigeon-hole principle there must be two triples $(u_1, v_1, z)$ 
and $(u_2, v_2, z)$ for which the inputs to $C_r$ have the same values. If the vertex 
associated with the function f is located in $C_r$, then $(u_1, v_1, z)$ and $(u_1, v_2, z)$ generate 
the same values for all inputs to $C_r$ and therefore for f, which is an immediate 
contradiction. If f is in $A_r$ the same argument applies. A similar argument applies to 
A,. Thus, we have that 2K,~>2~C,~>w(u,v), for u=v=(l-16p/(P--l))/ 
16@, from which the desired lower bound follows. fl 
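The averaging step used in the proof above (two nonnegative quantities on a common finite sample space cannot both exceed twice their means everywhere) can be sketched as follows. The function name and the sample data are hypothetical; the sketch only illustrates the existence claim under a uniform distribution.

```python
def point_below_twice_mean(y, z):
    """Return an index i with y[i] <= 2*mean(y) and z[i] <= 2*mean(z).

    For nonnegative y and z, fewer than half of the points can exceed
    twice the mean of y, and likewise for z, so such an index exists.
    """
    n = len(y)
    if n == 0 or n != len(z):
        raise ValueError("y and z must be non-empty and of equal length")
    mean_y = sum(y) / n
    mean_z = sum(z) / n
    for i in range(n):
        if y[i] <= 2 * mean_y and z[i] <= 2 * mean_z:
            return i
    raise AssertionError("unreachable for nonnegative inputs")

# Hypothetical overlap counts for four candidate pairs (r, s).
y_values = [0, 10, 1, 1]
z_values = [9, 0, 1, 0]
index = point_below_twice_mean(y_values, z_values)
```

In the proof, y and z play the roles of the two overlap counts for an ordered pair of sets and its reverse, and the sample space is the set of pairs of distinct indices.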
In this section we derive a lower bound on the w(u, v)-separatedness of a predicate 
to illustrate that lower bounds for predicates with multilective algorithms can be 
derived. We also present a general method for deriving lower bounds on flows that 
uses the (a, 1, d)-independence property of a function, and derive a lower bound for 
matrix multiplication which is stronger than that obtained from its independence 
properties. 
LEMMA 4. Let $x_i$ denote a b-tuple of Boolean variables, and write $x_i \sim x_j$ for $i \ne j$ 
if there is a cyclic shift of the first which is equal to the second. Let the Boolean 
predicate $f^{(p,b)}: \{0, 1\}^n \to \{0, 1\}$ on n = pb variables be defined as follows: 
$$f^{(p,b)}(x_1, \ldots, x_p) = 1 \text{ if } x_i \sim x_j \text{ for some } i \ne j,$$
$$= 0 \text{ otherwise.}$$
This predicate is w(u, v)-separated for 
$$w(u, v) \ge (n/p)(u/(2 - u))^2 - \log_2(2n/p)$$
for $p = \lceil 2/u \rceil$ when $n \ge p \log_2(4p \log_2 4p)$. 
Proof. Let U and V be arbitrary disjoint subsets of the input variables A of this 
function in terms of which w(u, v)-separatedness is defined, where $|U| \ge u|A|$, 
$|V| \ge v|A|$, and $|A| = pb$. By simple averaging arguments there must be some b-tuple $x_i$ 
such that the number of components in common with U is at least ub. There must be 
a second b-tuple for which the overlap is at least $b(up - 1)/(p - 1)$. The same must 
hold true for V with u replaced by v. The b-tuple that has the largest overlap with U 
may be the same that has the largest overlap with V. However, since U and V are 
disjoint, there are two distinct b-tuples, say $x_1$ and $x_2$, such that one has an overlap of 
at least $b(up - 1)/(p - 1)$ with U and the other has at least the same overlap with V. 
Let $y_1$ and $y_2$ be binary b-tuples whose nonzero components indicate the positions in 
which $x_1$ and $x_2$ overlap with U and V, respectively. We claim that there is some 
cyclic permutation $\sigma_1$ of $x_1$ such that at least $b((up - 1)/(p - 1))^2$ nonzero 
components of $y_1$ can be identified with nonzero components of $y_2$. (We assume that 
up > 1.) This can be shown by a simple averaging argument. Let $\pi_1$ and $\pi_2$ be binary 
b-tuples whose nonzero components specify some $b((up - 1)/(p - 1))^2$ nonzero 
components of $y_1$ and $y_2$ that can be mapped onto one another by the cyclic shift $\sigma_1$. 
Now to the b-tuples of variables other than the first two, assign binary tuples that 
are inequivalent under cyclic permutation. Let Q be the set of such tuples. Let R be a 
set of cyclically inequivalent b-tuples that have common values in positions identified 
by the zero components of $\pi_1$. The remaining components correspond to variables of 
$x_1$ identified by the nonzero components of $\pi_1$. We shall show that R has many 
elements. To $x_1$ we assign tuples in R and to $x_2$ we assign the tuples obtained by 
applying the cyclic permutation $\sigma_1$ to R. Then, the predicate $f^{(p,b)}$ has value 1 if and 
only if the components of $x_1$ identified by $\pi_1$ are identical with the components of $x_2$ 
identified by $\pi_2$. The predicate is w(u, v)-separated if R has $2^{w(u,v)}$ elements. 
There are at least $2^b/b$ cyclically inequivalent b-tuples. The set Q must contain p − 2 
of these, so the remaining $(2^b/b - (p - 2))$ such tuples are candidates for R. 
We seek b-tuples that are inequivalent under cyclic shift and whose 
$\chi \le b(1 - ((up - 1)/(p - 1))^2)$ components identified by the zero components of $\pi_1$ 
are fixed. Each assignment to these components defines an equivalence class of 
b-tuples. Since there are at most $2^\chi$ such equivalence classes, some class must contain at 
least $(2^b/b - (p - 2))\, 2^{-\chi}$ elements. This gives the following lower bound to w(u, v): 
$$w(u, v) \ge \log_2\big[(2^b/b - (p - 2))\, 2^{-b(1 - ((up - 1)/(p - 1))^2)}\big].$$
Since we must have up > 1, let $p = \lceil 2/u \rceil$. The second term is reduced by replacing p 
by 2/u. If $2^b/b \ge 2p$, then 
$$w(u, v) \ge (n/p)(u/(2 - u))^2 - \log_2 2(n/p)$$
for $p = \lceil 2/u \rceil$ since b = n/p. But $2^b/b \ge 2p$ for $p \ge 2$ when $b \ge \log_2(4p \log_2 4p)$. ∎ 
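The predicate of Lemma 4 and the count of cyclically inequivalent b-tuples used in its proof can be illustrated with a short sketch. It assumes that two identical tuples count as cyclically equivalent (a shift by zero), which the text does not state explicitly, and the function names are hypothetical.

```python
from itertools import product

def canonical_rotation(t):
    """Lexicographically smallest cyclic shift of t, used as a class representative."""
    return min(t[i:] + t[:i] for i in range(len(t)))

def predicate(x):
    """Value 1 if two b-tuples at distinct positions of x are cyclic shifts of each other."""
    reps = [canonical_rotation(t) for t in x]
    return 1 if len(set(reps)) < len(reps) else 0

def num_cyclic_classes(b):
    """Number of binary b-tuples that are pairwise inequivalent under cyclic shift."""
    return len({canonical_rotation(t) for t in product((0, 1), repeat=b)})

# (0,0,1) and (1,0,0) are cyclic shifts of one another, so the predicate is 1.
hit = predicate(((0, 0, 1), (1, 0, 0), (1, 1, 0)))
# Three pairwise inequivalent tuples give 0.
miss = predicate(((0, 0, 1), (0, 1, 1), (1, 1, 1)))
```

Since each equivalence class contains at most b tuples, `num_cyclic_classes(b)` is at least 2^b/b, which is the count the proof relies on when choosing the sets Q and R.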
THEOREM 7. The predicate $f^{(p,b)}: \{0, 1\}^n \to \{0, 1\}$ defined above has a multilective 
planar circuit size that is bounded by 
$$C_p^{(\mu)}(f^{(p,b)}) = \Omega(\max(n^2/\mu^{12}, \mu n)) = \Omega(n^{14/13})$$
when n is large. 
Proof. Apply Theorem 6 and Lemma B. 1 with P - 1 = 32~ and u = Jo and 
the observation that any circuit that is multilective of order μ has at least μn/2 
internal nodes. ∎ 
This lower bound depends strongly on the multiplicative factor μ, in fact, much 
more strongly than the bounds derived below for multi-output functions. Since this is 
the only known lower bound to the multilective planar complexity of a predicate, it is 
desirable to know if the dependence of the bound on μ can be improved. 
4. LOWER BOUNDS ON CIRCUIT SIZE AND AREA 
Lemmas 1 and 3 can be applied to a great variety of problems. The first applies to 
functions with the (α, l, d)-independence property, many of which are mentioned in 
Subsection 3.1. The second applies to matrix multiplication or functions which 
contain it as a subfunction. This includes the transitive closure of a 3n × 3n matrix 
over a closed semi-ring and the inverse of a matrix, as shown in [31]. However, a 
stronger result for matrix inversion can be obtained from the following identity in 
which it is shown that the inverse of a (2r) × (2r) matrix, two of whose submatrices 
P and Q are permutation matrices, computes the product of three matrices $P^{-1}BQ^{-1}$ 
which, as mentioned in Subsection 3.1, is a transitive function of order $r^2$. Thus, 
matrix inversion has the independence properties of this transitive function: 
The bound on flows for matrix multiplication can be usefully applied to certain 
LUP decompositions. If a matrix A has such a decomposition with P the identity 
matrix, such as is true for A symmetric positive definite, and if the matrices L and U 
are of the form produced by the recurrences of Kung and Leiserson [32, p. 281], then 
the matrix shown below has the indicated LU decomposition. 
$$\begin{bmatrix} I & 0 & B \\ A & I & 0 \\ 0 & 0 & I \end{bmatrix} = \begin{bmatrix} I & 0 & 0 \\ A & I & 0 \\ 0 & 0 & I \end{bmatrix} \times \begin{bmatrix} I & 0 & B \\ 0 & I & -AB \\ 0 & 0 & I \end{bmatrix}.$$
Consequently, this problem for 3n × 3n matrices has a w(u, v)-flow which is at least 
that of the product of two n × n matrices. We summarize these results below. 
THEOREM 8. Let f: X” + Xm be any of the following functions: the shifting 
function fin), the binary multiplication function f z’, the DFT f $‘&, the binary 
functions f r*“) and f gq”), which are the normalized powers and reciprocal powers of 
binary integers, where a > 1, and b = q/2k > 0, k and q integers independent of n, the 
(nonbinary) transitive functions h$” of order n, including s x s matrix inversion for 
$n = (s/2)^2$. Then, 
$$C_p^{(\mu)}(f) = \Omega(\max(n^2/\mu^2, \mu n)) = \Omega(n^{4/3}).$$
Let $f_M: X^{2s^2} \to X^{s^2}$ denote s × s matrix multiplication over the ring X, let 
$f_{TC}: X^{s^2} \to X^{s^2}$ denote the transitive closure of an s × s matrix over a semi-ring X, and 
let f:Aj denote the Kung-Leiserson LU decomposition of s x s matrices for which one 
exists. Then for any of these functions, 
$$C_p^{(\mu)}(f) = \Omega(\max(s^4/\mu^4, \mu n)) = \Omega(n^{6/5}).$$
Proof: The first set of problems, except for the DFT, are (ar, 1, d)-independent for 
1 = n(1 - a), d > (n + 1)/2. Choose a = 2, P = 2,41+ 1 and apply Theorem 5 and 
Lemma 1. The DFT bound follows from straightforward substitution with P = 2μ. 
For the second set of problems, apply Lemma 3 to Theorem 5 with $P = \Theta(\mu^2)$. In 
both cases note that any multilective circuit of order μ has at least μn/2 internal 
vertices. ∎ 
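The final step in each bound of this form is a balancing argument: a term that falls as $n^2/\mu^k$ is paired with the term μn that grows with μ, and the maximum of the two is smallest where they are equal. The sketch below works this out for a general exponent k; the function names are hypothetical, and k = 2, 4, and 12 are the cases that give the exponents 4/3, 6/5, and 14/13 (with n proportional to $s^2$ in the matrix case).

```python
from fractions import Fraction

def balanced_exponent(k: int) -> Fraction:
    """Exponent e such that max(n^2 / mu^k, mu * n) >= n^e for every mu >= 1.

    The two terms are equal at mu = n^(1/(k+1)), where both equal
    n^((k+2)/(k+1)); for any other mu one of the terms is larger.
    """
    return Fraction(k + 2, k + 1)

def bound(n: float, mu: float, k: int) -> float:
    """The quantity max(n^2 / mu^k, mu * n) for given n, mu, and k."""
    return max(n ** 2 / mu ** k, mu * n)

# Smallest value of the bound over integer mu for n = 10^6 and k = 2;
# the two terms meet at mu = 100.
n = 10 ** 6
smallest = min(bound(n, mu, 2) for mu in range(1, 1001))
```

The minimum over μ stays at or above $n^{4/3}$ for k = 2, which is why the lower bound holds no matter which multiplicity the algorithm chooses.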
We now show that there is a simple relationship between the area required by 
(β, μ)-multilective VLSI algorithms and the w(u, v)-flow of functions computed by 
these algorithms. The basic idea used here is that there exists some set of consecutive 
computation cycles during which the number of outputs generated is large but the 
number of inputs read is small and insufficient to determine the values of all of the 
output variables in this set of cycles. Thus, a large amount of information must be 
retained in the state of the chip, which means that a large area is required. 
THEOREM 9. Any (β, μ)-multilective chip algorithm that computes $f: X^n \to X^m$ 
using operations and memory cells over X and has a w(u, v)-flow requires area A 
satisfying 
$$A \ge \lambda^2 \max_{\tau, q}\, \big(\min(w(u, v),\ \tau |B|/q)\big),$$
where the maximum is taken over $0 < \tau < 1$, $1 \le q \le |B|$, and where $u = (1 - \beta\mu/q)$ 
and $v = (1 - \tau)/q$. 
Proof. Let A and B be the subsets of the input and output variables for which a 
w(u, v)-flow is defined for f. Let c = |A| and d = |B|. Let the number of pads p on the 
chip satisfy $p \le \tau d/q$, where $0 < \tau < 1$, because if not, then the area lower bound 
applies immediately. Now let $h_r$ be the number of outputs in B produced in the rth 
cycle of the computation by the chip and let 
$$\psi(t) = \sum_{r \le t} h_r$$
be the number of outputs generated in the first t cycles. Partition the cycles of the 
chip algorithm into intervals $I_0, I_1, \ldots, I_\rho$, where 
$$I_0 = \{t \le t_0 \mid \psi(t) = 0,\ h_{t_0 + 1} > 0\}$$
and 
$$I_r = \{t_{r-1} < t \le t_r \mid d/q - p + 1 \le \psi(t_r) - \psi(t_{r-1}) \le d/q\}.$$
Since $h_s \le p$ for all values of s, the set $I_r$ is properly defined. The number of intervals 
is ρ + 1, where ρ satisfies $q \le \rho$. Let $Y_r$ be the output variables in the set B produced 
during the cycles in the interval $I_r$. Then, $|Y_0| = 0$, $Y_r \cap Y_s = \emptyset$ for $r \ne s$, 
$\bigcup_{r=1}^{\rho} Y_r = B$ and $|Y_r| = \psi(t_r) - \psi(t_{r-1})$. 
Since the total number of times that input variables are read is βμc, there must be 
some interval $I_k$, $1 \le k \le \rho$, in which this number and the number of different input 
variables read are bounded above by $\beta\mu c/\rho \le \beta\mu c/q$. Let U be the set of input 
variables that are not read in $I_k$ and V the set of outputs in $Y_k$. Then $|U| \ge u|A|$ for 
$u = (1 - \beta\mu/q)$ and $|V| \ge v|B|$ for $v = (1 - \tau)/q$. Since the function f has a w(u, v)-
flow, there is an assignment to inputs in $I_k$ such that the subfunction from U to V has 
$|X|^{w(u,v)}$ points in the image of its domain. Because the values of the inputs in U 
influence the outputs in V only through the memory of the chip, and since the b 
memory cells can assume at most $|X|^b$ values, it follows that $b \ge w(u, v)$. The area 
required then is 
$$A \ge \lambda^2 \min(b, p) \ge \lambda^2 \min(w(u, v),\ \tau d/q),$$
where $0 < \tau < 1$, $1 \le q \le |B|$, $u = (1 - \beta\mu/q)$, and $v = (1 - \tau)/q$. The parameters τ and 
q can be chosen to maximize the lower bound. ∎ 
COROLLARY 9.1. If $f: X^n \to X^m$ is (α, l, d)-independent then the area required to 
compute it with a (β, μ)-multilective algorithm is at least 
$$A \ge \lambda^2 d/(q(1 + \alpha)),$$
where $q = \lceil(\alpha d/(1 + \alpha) + \beta\mu n)/l\rceil$. 
Proof. From Lemma 1 the w(u, v)-flow for an (α, l, d)-independent function is at 
least $\max(\min(vd,\ l - (1 - u)n),\ 0)/\alpha$. In Theorem 9 if we choose $\tau = 1/(1 + \alpha)$ and 
q as given above, we have the indicated lower bound. ∎ 
The above corollary provides good lower bounds on area for many of the problems 
described in Subsection 3.1. These are problems for which the number of input and 
output variables are comparable and l and d are linear in the number of output 
variables. However, for s × s matrix multiplication the area bound is not tight since 
l = s and $d = s^2$ imply a lower bound of $A = \Omega(s)$ when α is fixed and βμ = 1. This 
lower bound on area is about the square root of the number of input variables when 
algorithms are semelective, but Heintz [19] has shown that a linear size area is 
required for semelective algorithms. 
We now use Lemma 3 and Theorem 9 to lower bound the area required by 
multilective matrix multiplication algorithms. The bound can be applied to matrix 
inversion, to the transitive closure of a Boolean matrix, and to computing the LU 
decomposition of a matrix with the Kung and Leiserson algorithm. 
COROLLARY 9.2. Let the matrix product C = D × E of s × s matrices be defined 
over a ring with a set X of elements. Then any (β, μ)-multilective chip algorithm with 
operations and memory cells over X requires area A of at least 
$$A \ge \lambda^2 s^2/(8q)$$
for $q = \lceil(2\beta\mu)^2\rceil$. 
Proof. From Lemma 3 the w(u, v)-flow of matrix multiplication is 
$\max(v - (1 - u)^2, 0)\, s^2/2$. The lower bound follows from Theorem 9 by choice of 
τ = 1/2, and q as given above. ∎ 
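A small numerical sketch can make the choice of parameters in this corollary concrete. It evaluates the expression of Theorem 9 with the matrix multiplication flow of Lemma 3, in units of $\lambda^2$, under the assumption (read from the garbled source, not confirmed by it) that the intended choice is τ = 1/2 and $q = \lceil(2\beta\mu)^2\rceil$. All function names are hypothetical.

```python
import math

def matmul_flow(u: float, v: float, s: int) -> float:
    """Flow of Lemma 3: max(v - (1 - u)^2, 0) * s^2 / 2."""
    return max(v - (1.0 - u) ** 2, 0.0) * s * s / 2.0

def theorem9_term(s: int, beta_mu: float, tau: float, q: int) -> float:
    """min(w(u, v), tau*|B|/q) with u = 1 - beta*mu/q, v = (1 - tau)/q, |B| = s^2."""
    u = 1.0 - beta_mu / q
    v = (1.0 - tau) / q
    return min(matmul_flow(u, v, s), tau * s * s / q)

def corollary_choice(s: int, beta_mu: float):
    """Value of the term at tau = 1/2 and q = ceil((2*beta*mu)^2) (assumed choice)."""
    q = math.ceil((2 * beta_mu) ** 2)
    return q, theorem9_term(s, beta_mu, 0.5, q)

def grid_best(s: int, beta_mu: float, steps: int = 50) -> float:
    """Coarse grid search over tau in (0, 1) and integer q in [1, s^2]."""
    best = 0.0
    for q in range(1, s * s + 1):
        for i in range(1, steps):
            best = max(best, theorem9_term(s, beta_mu, i / steps, q))
    return best

q64, area64 = corollary_choice(64, 2)   # area64 is in units of lambda^2
```

With s = 64 and βμ = 2 the assumed choice gives q = 16 and a value of $s^2/(8q) = 32$; the grid search can only do as well or better, since it includes τ = 1/2 and the same q.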
The area required for predicates is difficult to bound when chip algorithms are 
multilective. In [4; 24] it is shown that the Boolean function $f_{MP}: \{0, 1\}^n \to \{0, 1\}$ 
defined by Meyer and Paterson [21, p. 43] requires area $\Omega(n/\log n)$ if realized by a 
semelective algorithm. This lower bound can be extended directly to (β, μ)-
multilective algorithms to produce a lower bound of $\Omega((n/\log n) - (\mu - 1)n)$. Clearly 
this bound is of value only for small values of μ. In fact if μ = 2, the area required 
can be as small as O(log n). The semelective lower bound on area is derived by 
determining the one-way flow of information [7] to the chip. As this result demonstrates, 
if the inputs can be read several times, the one-way flow of information is not 
sufficient to derive a strong lower bound on the area. 
5. AREA-TIME BOUNDS 
The area-time inequalities given in Subsection 2.2 are now combined with the 
lower bounds on planar circuit complexity and other results to derive computational 
inequalities for several families of problems. 
We use the following observation by Kedem to derive additional lower bounds on 
the measure $A^pT^q$ for various values of p and q. 
THEOREM 10. Any (β, μ)-multilective chip algorithm with operations and memory 
cells over a set X which computes a function $f: X^n \to X^m$ requires area and time 
satisfying 
$$AT \ge \lambda^2 \max(\beta\mu n, m).$$
This follows because the βμn readings of input variables and the production of m 
outputs require at least $T \ge \max(\beta\mu n, m)/p$ cycles, where p is the number of pads on 
the chip, and require area of at least $A \ge \lambda^2 p$, since each pad occupies area of at least 
$\lambda^2$. 
Using Corollary 9.1 and the above theorem we have the following theorem. This is 
the type of result that Valiant observed can be derived directly from an application of 
the Grigoryev method for bounding space-time tradeoffs. 
THEOREM 11. Let $f: X^n \to X^m$ be (α, l, d)-independent for α. Then any (β, μ)-
multilective chip algorithm that computes f using operations and memory cells over X 
requires area A and time T that satisfy the following inequality: 
$$A^2T \ge \frac{\lambda^4\, \beta\mu n d}{q(1 + \alpha)},$$
where $q = \lceil(\beta\mu n + \alpha d/(1 + \alpha))/l\rceil$ when the number of output variables $m \le \beta\mu n$. 
COROLLARY 11.1. Let f: X” + Xm be any of the following functions: the shifting 
function f I”‘, the binary multiplication function j$“, the DFT f &, the binary 
functions f f,“’ and f rvb), which are the normalized powers and reciprocal powers of 
binary integers, where a > 1, and b = q/2k > 0, k and q integers independent of n, the 
(nonbinary) transitive functions h I;“’ of order n, including s y s matrix inversion for 
$n = (s/2)^2$. Then, any VLSI chip algorithm for these problems requires area A and a 
number of cycles T that satisfy 
$$A^2T = \Omega(n^2).$$
Combining Theorem 10 with Corollary 9.2 we have the following inequality for 
matrix multiplication. This inequality is derived by taking the square root of the area 
bound and multiplying it by the inequality of Theorem 10. 
THEOREM 12. Let the matrix product C = D × E of s × s matrices be defined 
over a ring with a set X of elements. Then any (β, μ)-multilective chip algorithm with 
operations and memory cells over X requires area A and time T that satisfy the 
following inequality: 
$$A^{3/2}T \ge (\lambda^3/\sqrt{8})\, 2\beta\mu s^3/\sqrt{q} = \Omega(s^3)$$
for $q = \lceil(2\beta\mu)^2\rceil$. 
This bound is stronger than the lower bound of $A^2T = \Omega(s^3)$ implied by Theorem 
11. As mentioned above, the bound can be applied to matrix inversion, to the 
transitive closure of a Boolean matrix, and to computing the LU decomposition of a 
matrix with the Kung and Leiserson algorithm. 
Computational inequalities can also be derived by combining Theorem 11 with the 
bounds of Theorems 4 and 8. Replacing p by 1 Theorem 11 further weakens the 
inequality. Combining Theorem 4 and the square of this inequality with the lower 
bound of Theorem 8 that depends on p for the first set of functions and the fourth 
power of this inequality with the matrix related functions, we have 
THEOREM 13. Let f: X” -+X” be any of the following functions: the shifting 
function f p’, the binary multiplication function f E’, the DFT f g&., the binary 
functions f p3a’ and f f7bj, which are the normalized powers and reciprocal powers of 
binary integers, where a > 1 and b = q/2k > 0, k and q integers independent of n, the 
(nonbinary) transitive functions h$ of order n, including s x s matrix inversion for 
$n = (s/2)^2$, then, any VLSI chip algorithm for these problems requires area A and a 
number of cycles T that satisfy 
$$A^3T^4 = \Omega(n^4).$$
Let $f_M: X^{2s^2} \to X^{s^2}$ denote s × s matrix multiplication over the ring X, let 
$f_{TC}: X^{s^2} \to X^{s^2}$ denote the transitive closure of an s × s matrix over a semiring X, and 
letf I”:’ denote the Kung-Leiserson LU decomposition of s x s matrices for which one 
exists, then for any of these functions, any VLSI chip algorithm for these problems 
requires area A and a number of cycles T that satisfy 
$$A^5T^6 = \Omega(n^6) \text{ for } n = s^2.$$
The inequality stated above for transitive functions has been previously derived by 
Kedem and Zorat [6] for the Brent-Kung model. We now illustrate with a fairly 
weak result the role that w(u, v)-separation can play in the derivation of area-time 
bounds. Combining Theorems 4 and 7 we have the following set of area-time bounds 
for the computation of a predicate with a VLSI algorithm. 
THEOREM 14. Let $f^{(p,b)}: \{0, 1\}^n \to \{0, 1\}$ be the function defined in Lemma 4. Let 
it be computed in T cycles on a VLSI chip of area $A \ge \lambda^2$ and $\nu \ge 2$ planes with a 
(β, μ)-multilective algorithm. Then, the following inequalities must be satisfied. 
$$AT^2 = \Omega(\max(n^2/\mu^{12}, \mu n)) = \Omega(n^{14/13})$$
$$A^2T = \Omega(\max(n^2/\mu^{12}, \mu n)) = \Omega(n^{14/13}).$$
6. MULTILECTIVE ALGORITHMS 
Kedem and Zorat [6] and Valiant [13] have shown that the cyclic shift function 
on n Boolean variables can be realized by a multilective algorithm with 
$A^3T^4 = O(n^{13/3})$. Kedem [14] has recently improved upon this upper bound. He has 
given multilective algorithms for cyclic shift that he shows are optimal to within 
constant factors over the full range of area and time in the Brent-Kung model, when 
certain restrictions are put on the number of outputs produced at one place on the 
chip. 
In this section we illustrate the inequalities that have been derived above with a 
family of multilective algorithms for matrix multiplication. They have the property 
that they are optimal in use of chip area. 
Consider a rectangular array of r x r cells, each of which has two inputs A and B 
and a state C which is the sum of the previous state C* and the product AB. Such an 
array of cells can be used to multiply two r x r matrices, as shown by Preparata and 
Vuillemin [33]. Each cell is initialized with state C = 0, and the inputs to the array 
are staggered in the obvious way. 
A family of multilective algorithms for s × s matrix multiplication can be 
constructed by using such an array of cells. Consider the matrix product D = E × F 
over the ring X. This product can be formed by using the array to multiply groups of 
r contiguous rows of E by groups of r contiguous columns of F. If r divides s, then 
each of the s/r blocks of r contiguous rows of E can be multiplied by each of the s/r 
blocks of r contiguous columns of F with this array. The multiplication of any two 
blocks is done in s + r − 1 cycles so that the two matrices are multiplied in a total of 
$T = (s/r)^2(s + r - 1)$ cycles. (We ignore the time to output the results since this can 
be done in parallel with the computation of other results.) This algorithm uses area 
proportional to $r^2$, and each input variable is read s/r times. Since each input variable 
is read at the same place on the chip, it follows that μ = 1, and β = s/r, and that the 
area is optimal to within a constant factor. This family of multilective algorithms 
satisfies the following conditions. 
THEOREM 15. There exists a family of algorithms for s × s matrix multiplication 
over a ring X that use chip area that is optimal to within a constant factor, and for 
which the following inequalities are satisfied: 
$$A^{3/2}T = O(s^3 r)$$
$$AT^2 = O(s^6/r^2)$$
for r any integer $1 \le r \le s$. The first inequality is tight to within a constant factor 
when r is a constant and the same is true of the second inequality when r is 
proportional to s. 
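The blocked schedule described in this section can be sketched in software. The sketch below is not a cycle-accurate model of the staggered array: it only reproduces the order in which an r × r block of accumulators is reused, counts how often each input entry is read, and reports the cycle count $(s/r)^2(s + r - 1)$ taken from the text. The function name and the test matrices are hypothetical.

```python
def blocked_matmul(E, F, r):
    """Multiply s x s matrices with an r x r block of accumulator cells.

    Each (row block of E, column block of F) pair is processed in turn.
    Returns the product, per-entry read counts for E and F, and the
    cycle count given in the text (not measured by this sketch).
    """
    s = len(E)
    if s % r != 0:
        raise ValueError("r must divide s")
    D = [[0] * s for _ in range(s)]
    reads_E = [[0] * s for _ in range(s)]
    reads_F = [[0] * s for _ in range(s)]
    for bi in range(0, s, r):
        for bj in range(0, s, r):
            acc = [[0] * r for _ in range(r)]  # cell states C, initialized to 0
            for k in range(s):
                for i in range(r):
                    reads_E[bi + i][k] += 1    # one read of this entry per block pair
                for j in range(r):
                    reads_F[k][bj + j] += 1
                for i in range(r):
                    for j in range(r):
                        acc[i][j] += E[bi + i][k] * F[k][bj + j]
            for i in range(r):
                for j in range(r):
                    D[bi + i][bj + j] = acc[i][j]
    cycles = (s // r) ** 2 * (s + r - 1)
    return D, reads_E, reads_F, cycles

E_example = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
F_example = [[1, 0, 0, 1], [0, 1, 0, 0], [0, 0, 1, 0], [1, 0, 0, 1]]
D_example, rE, rF, T_cycles = blocked_matmul(E_example, F_example, 2)
```

For s = 4 and r = 2 every entry of E and F is read s/r = 2 times, matching β = s/r, and the reported cycle count is 4 · 5 = 20.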
7. COMMENTS 
A framework based on planar circuit complexity has been presented in which 
lower bounds can be derived on the simultaneous use of chip area and time for 
multilective chip algorithms. Lower bounds on chip area have also been derived for 
multilective algorithms. We have demonstrated that the Grigoryev independence 
properties of functions can be used for both types of bounds. For functions for which 
these properties are weak, bounds are derived directly. A large variety of multi-output 
functions are handled in these two ways. The framework also handles predicates, 
which is illustrated by example. Lower bounds on area and area-time tradeoffs are 
illustrated for matrix multiplication by a family of area-optimal multilective 
algorithms. 
A number of issues are raised by the foregoing presentation. The first concerns 
multilective algorithms. Multilective VLSI algorithms assume that the scheduling of 
inputs for presentation to the chip is done outside the chip. This implies that some 
external device contributes, if only indirectly, to the computation on the chip. 
Whether or not such a contribution is significant depends on the number and types of 
permutations and repetitions that are needed to schedule inputs. Thus, a measure of 
the impact of the multilective assumption is the computational effort required to 
schedule inputs. It may be that technological changes will dramatically affect the cost 
of multilective algorithms. For example, if the scheduling needed for a problem 
involves the same cyclic shifts of subsets of the inputs, then this might be done most 
naturally by a set of cyclic delay lines to which inputs are supplied once but 
recirculated as needed by the chip. A device to implement such shifts could be realized 
inexpensively in any of a number of different technologies. 
It remains to show that multilective algorithms can be put to good use if we ignore 
the cost of preparing inputs for chips. The two problems that have been studied to 
date over a wide range of area and time are matrix multiplication, as seen above, and 
cyclic shift [14]. There are a number of other problems of practical importance that 
could be examined from this point of view. They include most of the problems 
discussed above. For example, in all likelihood good multilective algorithms for the 
DFT can be constructed using the decomposition of the FFT and the schema for 
trading time for space given in [34]. 
Some VLSI chips, such as CPUs, interact with external devices in a 
data-dependent manner. To exclude such behavior, as we have done above, is to deny 
applications of considerable importance. A step in the direction of more freedom in 
I/O behavior has been taken by Lengauer and Mehlhorn [3]. They consider cyclic 
shift under special conditions on I/O, namely, inputs are read exactly once and called 
from a queue in a predetermined order, but at times that may be data-dependent. 
More realistic models of data-dependent behavior are needed. 
Finally, it must be said that the bounds on area and area-time that are derived 
here and elsewhere are not very tight. They often have the correct asymptotic 
dependence on the parameters of a problem, but they are less useful for problems of modest 
size. It is likely that better bounds can be obtained if additional assumptions are 
made concerning chip layouts, the order or placement of inputs supplied to a chip, or 
the chip algorithms themselves. Such assumptions should reflect practice. 
APPENDIX A: PROOF OF THEOREM 4 
THEOREM 4. Let $f: X^n \to X^m$ be computed in T cycles on a VLSI chip of area 
$A \ge \lambda^2$ and $\nu \ge 2$ planes with a (β, μ)-multilective algorithm. Then, each of the 
following inequalities must be satisfied. 
Proof. The bounds on planar circuit complexity are constructed by simulating 
with a valid planar embedding of a circuit the computations carried out by a chip 
algorithm. Given a chip, we use its transition function to construct a circuit that 
computes with logic elements what is computed over time by storing temporary 
results in memory and reusing the logic of the transition function. This amounts to 
unwinding a computation, a very simple notion. 
Consider a physical realization for the transition function of a chip, which we will 
call a layer. It generates outputs to memory cells and to external connections, and 
accepts inputs from the outputs of memory cells and from external inputs. If the chip 
executes T cycles, place T copies of the circuit (or layers) for this transition function 
on top of one another. Connections can be made from one layer to the one 
immediately above to construct a logic circuit that computes exactly the same 
function computed by the chip in the given number of cycles. 
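The idea of unwinding a computation can be illustrated with a toy example. In the sketch below a one-bit serial adder stands in for the chip's transition function (an assumption made only for illustration), `run_chip` reuses that function for T cycles with a stored state, and `unwind` builds T copies of it, each feeding its memory output to the next layer. All names are hypothetical.

```python
def adder_step(carry, bits):
    """Toy transition function: one-bit full adder. Returns (next state, output)."""
    a, b = bits
    total = a + b + carry
    return total // 2, total % 2

def run_chip(step, state, inputs):
    """Sequential view: one copy of the transition logic, reused every cycle."""
    outputs = []
    for x in inputs:
        state, y = step(state, x)
        outputs.append(y)
    return outputs

def unwind(step, T):
    """Circuit view: T layers of the transition logic chained through the state wires."""
    layers = [step] * T  # one copy of the logic per cycle

    def circuit(state, inputs):
        outputs = []
        for layer, x in zip(layers, inputs):
            state, y = layer(state, x)  # state wire runs from this layer to the next
            outputs.append(y)
        return outputs

    return circuit

# 3 + 1 with operands supplied least significant bit first over three cycles.
bit_pairs = [(1, 1), (1, 0), (0, 0)]
sequential_out = run_chip(adder_step, 0, bit_pairs)
unwound_out = unwind(adder_step, 3)(0, bit_pairs)
```

Both views produce the same outputs; the unwound version trades the memory cells for T copies of the logic, which is the reason the circuit size in the construction scales with T times the size of one layer.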
This stack of circuits is not a valid planar embedding since when viewed from 
above there may be multiple edge crossings at a point. Two ways of creating a valid 
planar embedding are considered, which lead to the two different computational 
inequalities. 
The first method keeps the layers above one another and makes all wires, pads, 
and logic elements infinitesimal in size and width. It then makes a small diagonal 
displacement of the layers relative to one another and of the ν planes on each layer, if 
necessary, so that individual logic elements and wire crossings are visible when 
viewed from above. In particular, let the displacements be such that at most two 
wires are seen to overlap at one point. This is now a valid planar embedding. 
The second method places the T layers side-by-side in a contiguous chain with the 
connections made by running wires from one layer to another. In so doing, these 
wires create crossings with the wires on the chips, as seen from above. Again displace 
the planes of individual layers so that at most two wires cross at one point. 
In each of these two constructions, if the chip algorithm is semelective or unilocal, 
the same will hold true for the embedding. If the algorithm is unilocal and the first 
construction is used, a simple transformation will make the embedding semelective. 
The same transformation applied to this construction will convert any (β, μ)-
multilective chip algorithm into a multilective planar circuit of order μ. The 
transformation is shown in Fig. 1. We now bound the number of circuit elements and edge 
crossings needed to realize these constructions. 
FIGURE 1
On a layer let $n_L$ and $w$ be the number of logic elements and straight-wire
segments. When seen from above, wires on different planes of a layer may appear to
cross. Let $n_{CR}$ be the number of such places. In addition there may be places,
exclusive of these, where two or more wire segments meet. Let $n_{CN}$ be the number of
such places. (Wires can meet on the same plane or can meet between adjacent planes.
Wires may meet at the same point at which other wires cross. We assume without
loss of generality that the angle between wires on the same plane is different from 0
and 180 degrees.) Then
$$n_L + n_{CR} + n_{CN} \le A/\lambda^2, \tag{A.1}$$
since the places at which logic elements occur, wires cross or meet, or wires connect
but do not cross, occupy disjoint regions of the surface of the chip, and each place
has area of at least $\lambda^2$, when seen from above.
Now make all logic elements infinitesimal in size and all wires infinitesimal in 
width. Let logic elements be made infinitesimal as indicated in Fig. 2. This has the 
effect of adding edges to the graph at inputs to logic elements and of increasing the 
number of points of connection of wires by at most $3n_L$. In each of the two
constructions, make displacements of the planes of a layer so as to reduce the number 
of wires that cross at one point to two. As indicated above, the places on a chip at 
which wires connect but do not cross are distinguished from places at which wires 
cross but may connect. A situation that may arise is that wires may cross a point of 
connection of several wires. At both types of places the number of wires on a plane is 
at most 4. Thus, since there are $v$ planes, a displacement at a point of crossing makes
visible at most $4v^2$ individual pairs of crossings.
If $n'_{CR}$ is the number of visible crossings of pairs of wires on the displaced layer,
then
$$n'_{CR} \le 4v^2 n_{CR}, \tag{A.2}$$
by the above arguments and the fact that connections associated with infinitesimal 
logic elements do not overlap with wires or other connections. Also, let $n'_{CN}$ be the
number of visible points of connection on a layer after displacement. Since a point of
crossing could contain $v$ points of connection, as seen from above the chip, we have
[...] (A.3)
Here we add the number of new connections that result from making logic elements
infinitesimal.
The number of wires on the chip $w$ cannot exceed
$$w \le vA/\lambda^2, \tag{A.4}$$
since each wire must occupy an area of at least $\lambda^2$ and they can reside on at most $v$
planes.
FIGURE 2
Consider the first method of constructing a planar circuit from a stack of T layers. 
The layers are prepared as indicated above. Then the layers on the stack are 
displaced uniformly, as shown in Fig. 3a, so as to make overlapping points of 
connection and pairs of crossing wires visible. Since the geometry of each copy of the 
(displaced) memoryless chip is identical, when copies are displaced in this manner, a 
single point of connection (of at most 4 wires) or pair of crossing wires creates at
most $T^2$ crossings. It follows that the graph produced by this type of displacement
will have $n_L T$ logic elements and at most $(n'_{CR} + n'_{CN})T^2$ crossings. Consequently,
we have from Eqs. (A.1), (A.2), and (A.3) that a valid embedding can be constructed
that has at most $AT/\lambda^2 + (4v^2 + v + 3)AT^2/\lambda^2 \le AT/\lambda^2 + 6(v/\lambda)^2AT^2$ logic elements.
This circuit computes the same function that is computed by the chip algorithm in T 
cycles but with multiple occurrences of input variables. Some of these occurrences 
can be converted to logic elements without substantially increasing this bound, as 
shown below. 
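The simplification of the crossing term in the bound above rests on a single arithmetic step, which can be written out as a short sketch. It assumes $v \ge 2$ (the appendix invokes this condition later in the proof) and writes $\lambda^2$ for the minimum area of a place, as in (A.1).

```latex
\begin{align*}
v + 3 \le 2v^2 \quad (v \ge 2)
  \;&\Longrightarrow\; 4v^2 + v + 3 \le 6v^2, \\
\frac{AT}{\lambda^2} + (4v^2 + v + 3)\,\frac{AT^2}{\lambda^2}
  \;&\le\; \frac{AT}{\lambda^2} + 6\Bigl(\frac{v}{\lambda}\Bigr)^{2} AT^2 .
\end{align*}
```

For $v = 1$ the first step fails, since $4 + 1 + 3 = 8 > 6$, so the condition on $v$ is needed for this simplification.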
Consider a $(\beta,\mu)$-multilective chip algorithm. Such algorithms read $n$ input
variables $\mu n$ times when multiple reads at one pad are treated as single reads. In the
construction of planar circuits given above, multiple reads at one pad are translated
into nodes associated with a given input variable at the same location on each copy
of the chip. These nodes can be connected together to make a multilective planar chip
of order $\mu$. This is done as shown in Fig. 1. Here the $T$ copies of one input pad are
shown with the total fan-out from each copy accommodated with two wires. 
Constants and variables that are read once at a pad do not contribute crossings. Let 
$x_1, x_2, \ldots, x_m$ be the variables read more than once at a given pad and let $x_i$ be read $k_i$
times, $k_i \ge 2$. Then,
$$\sum_{i=1}^{m} k_i \le T. \tag{A.5}$$
If connections to instances of $x_1, x_2, \ldots, x_m$ are made in this order, as shown in Fig. 1,
$S$ self-crossings of the new connecting edges are made, where
$$S \le \sum_{i=1}^{m} (i-1)\,k_i. \tag{A.6}$$
FIGURE 3 (panels a and b)
In addition, $F \le mT$ crossings are made of these new wires with the one wire per
copy of the port that is used to accommodate fan-out. Thus, $Q = S + F$ crossings are
made per port to coalesce the several occurrences of input nodes. Now $S \le (m-1)T$
so that $Q \le (2m-1)T$. This bound on $Q$ is maximized under (A.5) and $k_i \ge 2$ by
$k_i = 2$ and $m = T/2$. Hence, $Q < T^2$. Thus, since $A \ge \lambda^2$ and $v \ge 2$, it follows that a
multilective valid planar embedding of order $\mu$ can be constructed that computes the
same function computed by the chip in $T$ cycles and which has a size of no more
than $AT/\lambda^2 + 7(v/\lambda)^2AT^2$ logic elements, and this bound is no larger than
$8(v/\lambda)^2AT^2$.
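The per-port crossing count can be checked by brute force for small $T$. The sketch below enumerates every admissible tuple $(k_1, \ldots, k_m)$ and evaluates the bound $Q = S + F$, taking $S$ from (A.6) and $F \le mT$. The function names `crossing_bound` and `worst_case` are hypothetical, and the enumeration is only an illustration of the inequality, not part of the proof.

```python
from itertools import product


def crossing_bound(ks, T):
    """Per-port bound Q = S + F, with S from (A.6) and F bounded by m * T."""
    if any(k < 2 for k in ks) or sum(ks) > T:
        raise ValueError("need k_i >= 2 and sum(k_i) <= T, as in (A.5)")
    m = len(ks)
    # enumerate() is 0-based, so its index plays the role of (i - 1) in (A.6).
    S = sum(i * k for i, k in enumerate(ks))
    F = m * T
    return S + F


def worst_case(T):
    """Largest value of the bound over every admissible tuple for a given T."""
    best = 0
    for m in range(1, T // 2 + 1):          # k_i >= 2 forces m <= T / 2
        for ks in product(range(2, T + 1), repeat=m):
            if sum(ks) <= T:
                best = max(best, crossing_bound(ks, T))
    return best


# Usage: compare the exhaustive maximum with T**2 for a few small T.
for T in range(2, 9):
    print(T, worst_case(T), T * T)
```

In every case printed, the exhaustive maximum stays below $T^2$, in line with the conclusion $Q < T^2$.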
The second construction places the copies of the memoryless chip side-by-side, as
shown in Fig. 3b, and connects them by a set of parallel wires that run between every
two copies of the chip. Since there are at most $w$ of these wires and each can intersect
each of the original straight wire segments at most once, of which there are at most
$w$, it follows that at most $w^2$ new crossings are created for each pair of chips. The
graph thus produced has $n_L T$ logic elements and at most $n'_{CR}T + w^2(T - 1)$
crossings. Therefore the size of a valid planar embedding that is multilective of order
$\beta\mu$ is from (A.1), (A.2), and (A.4) at most $5(v/\lambda)^2AT + (vA/\lambda^2)^2T$. This in turn is at
most $6v^2(A/\lambda^2)^2T$. This establishes the second inequality. ∎
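The chain of substitutions behind the second inequality can be spelled out as follows. This is a sketch under stated assumptions: $n'_{CR}$ denotes the number of visible crossings on a displaced layer (the primed notation is an assumption), $w$ is the wire count of (A.4), and the last step uses $A \ge \lambda^2$.

```latex
\begin{align*}
n_L T + n'_{CR} T + w^2 (T-1)
  &\le (1 + 4v^2)\,\frac{AT}{\lambda^2} + \Bigl(\frac{vA}{\lambda^2}\Bigr)^{2} T
     && \text{by (A.1), (A.2), (A.4)} \\
  &\le 5\Bigl(\frac{v}{\lambda}\Bigr)^{2} AT + \Bigl(\frac{vA}{\lambda^2}\Bigr)^{2} T
     && \text{since } v \ge 1 \\
  &\le 6 v^2 \Bigl(\frac{A}{\lambda^2}\Bigr)^{2} T
     && \text{since } A/\lambda^2 \ge 1 .
\end{align*}
```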
APPENDIX B: EXTENSIONS OF THE PLANAR SEPARATOR THEOREM 
The following is one version of the planar separator theorem of Lipton and Tarjan
[20].
THEOREM B.1 [20]. Let $G = (V, E)$ be any $N$-vertex planar graph having vertex
costs $c: V \to [0, 1]$ (the closed interval) summing to $c(V)$. Then the vertices of $G$ can
be partitioned into three sets $A_1$, $A_2$, $C$ such that no edge joins a vertex in $A_1$ with a
vertex in $A_2$, neither $A_1$ nor $A_2$ has total cost exceeding $(2/3)\,c(V)$, and the separator
$C$ contains no more than $K_1\sqrt{N}$ vertices, where $K_1 = 2\sqrt{2}$.
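A square grid gives a concrete instance of the theorem: removing the middle column of a $k \times k$ grid separates the remaining vertices with only $k = \sqrt{N}$ removed vertices. The sketch below checks the three conditions for this one family of graphs, with unit cost on every vertex. The function names are hypothetical and the example is an illustration only; it does not implement the Lipton and Tarjan algorithm.

```python
import math


def grid_separator(k):
    """Split a k-by-k grid graph by removing its middle column."""
    vertices = [(r, c) for r in range(k) for c in range(k)]
    edges = [((r, c), (r, c + 1)) for r in range(k) for c in range(k - 1)]
    edges += [((r, c), (r + 1, c)) for r in range(k - 1) for c in range(k)]
    mid = k // 2
    A1 = {v for v in vertices if v[1] < mid}
    A2 = {v for v in vertices if v[1] > mid}
    C = {v for v in vertices if v[1] == mid}
    return vertices, edges, A1, A2, C


def satisfies_theorem_b1(vertices, edges, A1, A2, C):
    """Check the conditions of Theorem B.1 with unit cost on each vertex."""
    N = len(vertices)
    no_cross = all(not ((u in A1 and v in A2) or (u in A2 and v in A1))
                   for u, v in edges)
    balanced = max(len(A1), len(A2)) <= 2 * N / 3
    small = len(C) <= 2 * math.sqrt(2) * math.sqrt(N)
    return no_cross and balanced and small


# Usage: check a range of grid sizes.
for k in range(2, 9):
    print(k, satisfies_theorem_b1(*grid_separator(k)))
```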
In this section this theorem is extended to the partitioning of a planar graph into 
many disjoint regions such that it is possible to separate any one region from the rest 
by removing a separator of size $O(\sqrt{N})$. This extension will be used in later sections
to derive lower bounds to multilective planar circuit complexity. We begin with a 
two-cost version of the planar separator theorem. 
LEMMA B.1. Let $G = (V, E)$ be any $N$-vertex planar graph with vertex costs
$c: V \to [0, 1]$ summing to $c(V)$. Then the vertices of $G$ can be partitioned into three
sets $A_1$, $A_2$, $C$ such that no edge joins a vertex in $A_1$ with a vertex in $A_2$, $c(A_1)$,
$c(A_2) \le (3/4)\,c(V)$, and the cardinalities of these sets satisfy $|A_1|, |A_2| \le 5N/6$, and
$|C| \le K_2\sqrt{N}$, where $K_2 = 2\sqrt{2}\,(\sqrt{2/3} + 1)$.
Proof. Apply the previous theorem to partition the vertices of $G$ with respect to
the uniform cost measure. Let the sets in this partition be $X_1$, $X_2$, and the separator
$S_1$. Then, $|X_i| \le 2N/3$ for $i = 1, 2$, and $|S_1| \le K_1\sqrt{N}$. If $c(X_i) \le 3c(V)/4$ for $i = 1, 2$,
the result is established. Consequently, assume without loss of generality that
$c(X_1) > 3c(V)/4$ and $c(X_2) < c(V)/4$. Partition $X_1$ using the previous theorem and the
cost function $c$ into the sets $Y_1$, $Y_2$, and the separator $S_2$, where
$|S_2| \le K_1\sqrt{|X_1|} \le K_1\sqrt{2N/3}$. Then $c(Y_1) \le 2c(X_1)/3$ and $c(Y_2) \le 2c(X_1)/3$. Without
loss of generality assume that $|Y_1| \le |X_1|/2$. Let $A_1 = Y_1 \cup X_2$, $A_2 = Y_2$, and
$C = S_1 \cup S_2$. It follows from straightforward substitutions that the conditions of the
lemma are satisfied. ∎
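The substitutions mentioned at the end of the proof can be written out as follows. This is a sketch that assumes the reading $A_1 = Y_1 \cup X_2$, $A_2 = Y_2$, $C = S_1 \cup S_2$, with $c(X_1) + c(X_2) \le c(V)$ and $|X_1| + |X_2| \le N$.

```latex
\begin{align*}
c(A_1) &= c(Y_1) + c(X_2) \le \tfrac{2}{3}\,c(X_1) + c(X_2)
        = \tfrac{2}{3}\bigl(c(X_1) + c(X_2)\bigr) + \tfrac{1}{3}\,c(X_2) \\
       &\le \tfrac{2}{3}\,c(V) + \tfrac{1}{12}\,c(V) = \tfrac{3}{4}\,c(V), \\
c(A_2) &= c(Y_2) \le \tfrac{2}{3}\,c(X_1) \le \tfrac{2}{3}\,c(V) \le \tfrac{3}{4}\,c(V), \\
|A_1|  &\le \tfrac{1}{2}|X_1| + |X_2|
        = \tfrac{1}{2}\bigl(|X_1| + |X_2|\bigr) + \tfrac{1}{2}|X_2|
        \le \tfrac{N}{2} + \tfrac{N}{3} = \tfrac{5N}{6}, \\
|A_2|  &= |Y_2| \le |X_1| \le \tfrac{2N}{3} \le \tfrac{5N}{6}, \\
|C|    &\le K_1\sqrt{N} + K_1\sqrt{2N/3}
        = 2\sqrt{2}\,\bigl(1 + \sqrt{2/3}\bigr)\sqrt{N}.
\end{align*}
```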
The desired generalization of the separator theorem follows from the application of 
this result and is shown below. It is similar to a result of Bhatt and Leiserson [35] on 
multicolor separators and to a result of Leighton [36] on multicolor bifurcators. 
THEOREM B.2. Let $G = (V, E)$ be any $N$-vertex planar graph and let
$c: V \to \{0, 1\}$ be the cost function on the vertices of $G$ having value 0 or 1. Then the
vertices of $G$ can be partitioned into $P$ disjoint sets $\{A_i \mid 1 \le i \le P\}$ each of cost
$$(1/4P)\,c(V) \le c(A_i) \le (4/P)\,c(V)$$
such that for each $A_i$ there is a separator set $C_i$ of at most $K_3\sqrt{N}$ vertices and a set
$B_i$ of vertices, $B_i = V - A_i - C_i$, such that no edge joins a vertex in $A_i$ with a vertex in
$B_i$. Here $K_3 = 2\sqrt{2}\,(\sqrt{2/3} + 1)/(1 - \sqrt{5/6}) \approx 59$.
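The three separator constants can be evaluated numerically as a sanity check. The sketch below assumes the readings $K_1 = 2\sqrt{2}$, $K_2 = K_1(\sqrt{2/3} + 1)$, and $K_3 = K_2/(1 - \sqrt{5/6})$; the variable names are chosen for illustration.

```python
import math

# Constant of Theorem B.1 (planar separator theorem).
K1 = 2 * math.sqrt(2)
# Constant of Lemma B.1: two separators, the second on at most 2N/3 vertices.
K2 = K1 * (math.sqrt(2 / 3) + 1)
# Constant of Theorem B.2: separators summed along a root-to-leaf path.
K3 = K2 / (1 - math.sqrt(5 / 6))

print(f"K1 = {K1:.3f}, K2 = {K2:.3f}, K3 = {K3:.2f}")
```

Under these readings $K_3$ comes out just under 59, which agrees with the approximate value stated in the theorem.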
Proof. The sets are obtained by applying the previous theorem $P - 1$ times. At
the first application, let $X_1$, $X_2$, and $C$ be the two sets and the separator satisfying the
conditions of that theorem. Assume without loss of generality that $c(X_2) \le c(X_1)$.
Then by the nature of the cost function, the vertices of nonzero cost in $C$ can be
partitioned into two sets $S_1$ and $S_2$ so that the sets $Y_i = X_i \cup S_i$, $i = 1, 2$, have costs
that are in the same relation as the costs of $X_1$ and $X_2$ and
$$c(V)/4 \le c(Y_2) \le c(Y_1) \le 3c(V)/4.$$
If $P > 2$, apply the previous theorem to $X_1$, since it has larger cost, to produce sets
$X_{11}$, $X_{12}$, and a separator $C_1$. It follows that $|C_1| \le K_2\sqrt{5N/6}$ and that
$c(X_{1i}) \le 3c(X_1)/4$. Thus the vertices of $C_1 \cup S_1$ of nonzero cost can be partitioned
into two sets $S_{11}$, $S_{12}$ so that the costs of the two sets $Y_{1i} = X_{1i} \cup S_{1i}$, $i = 1, 2$, are in
the same relation as the costs of the sets $X_{11}$, $X_{12}$, and in addition
[...]
To break all paths between vertices in $Y_{1i}$ and vertices in the rest of the graph, it is
sufficient to remove $C$ and $C_1$, sets of total size at most $K_2(1 + \sqrt{5/6})\sqrt{N}$. To do the same
for $Y_2$ it is sufficient to remove $C$.
The creation of the $P$ sets corresponds to the building of a binary tree. Initially the
tree has only the root, to which corresponds the triple of sets $(V, \emptyset, \emptyset)$. The
application of the previous theorem partitions $V$ and generates two sons of the root.
These sons correspond to the two sets $X_1$ and $X_2$. Attach the triples $(X_i, \emptyset, S_i)$ to
these vertices and update that attached to the root to $(V, C, \emptyset)$. The sets $Y_1$ and $Y_2$
associated with the leaves are the unions of the first and third components of these
triples. On partition of $X_1$ and expansion of its node, update the triple associated with
it to $(X_1, C_1, S_1)$ and attach the triples $(X_{11}, \emptyset, S_{11})$ and $(X_{12}, \emptyset, S_{12})$ to its sons.
Again the sets $Y_{11}$ and $Y_{12}$ are the unions of the first and third components of the
triples associated with these leaves.
This process is repeated as other sets are partitioned. At each stage the $P$ sets
$\{Y_{a \ldots z}\}$ that are associated with leaves of the tree are equal to the union of the first
and third components of the triple associated with each leaf. To separate such a leaf
set from other leaf sets it is sufficient to remove the separator sets which are the
middle components of the triples encountered on the path from that vertex to the root.
It is straightforward to demonstrate that a separator set at level $d$ has a size of at
most $K_2\sqrt{(5/6)^d N}$. (The root has level 0.) Thus, at most $K_2\sqrt{N}/(1 - \sqrt{5/6})$ vertices
need to be removed to effect the desired separation.
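The total along a root-to-leaf path is a geometric series. The following sketch assumes that each application of the two-cost lemma leaves at most $5/6$ of the vertices in either part, so that a set at level $d$ has at most $(5/6)^d N$ vertices.

```latex
\begin{align*}
\sum_{d \ge 0} K_2 \sqrt{(5/6)^{d} N}
  \;=\; K_2 \sqrt{N} \sum_{d \ge 0} \bigl(\sqrt{5/6}\bigr)^{d}
  \;=\; \frac{K_2 \sqrt{N}}{1 - \sqrt{5/6}}
  \;=\; K_3 \sqrt{N}.
\end{align*}
```

The infinite sum is an upper bound for any finite path, so the depth of the tree does not enter the constant.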
Now consider the cost of the sets $\{A_i\}$ so produced. Let $m_p$ and $M_p$ be their
minimum and maximum costs and let $\mu_p$ be the second largest cost. We show by
induction that
[...]
The basis for induction holds from one application of the previous theorem. Assume
that the result is true for $p \le P - 1$. Then, given a collection of $P - 1$ sets, the
procedure given above forms a $P$th by splitting the largest in the collection. Consequently,
the smallest and largest in the new collection have costs of
[...]
and
[...]
from which the desired conclusion follows directly. ∎
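The cost bounds of Theorem B.2 can be explored with a small simulation of the largest-first splitting procedure. This is a sketch under simplifying assumptions that are not in the paper: costs are treated as real numbers rather than sums of 0/1 vertex costs, and each split sends a random fraction between 1/4 and 3/4 of the cost to one side, standing in for one application of the two-cost lemma. The names `split_largest` and `within_theorem_bounds` are hypothetical.

```python
import random


def split_largest(total_cost, P, rng, low=0.25, high=0.75):
    """Repeatedly split the part of largest cost until there are P parts."""
    parts = [float(total_cost)]
    while len(parts) < P:
        parts.sort()
        largest = parts.pop()
        f = rng.uniform(low, high)      # fraction of the cost sent to one side
        parts.extend([f * largest, (1 - f) * largest])
    return parts


def within_theorem_bounds(parts, total_cost, tol=1e-9):
    """Check c(V)/(4P) <= cost <= 4 c(V)/P for every part."""
    P = len(parts)
    return (min(parts) >= total_cost / (4 * P) - tol
            and max(parts) <= 4 * total_cost / P + tol)


# Usage: one run with 16 parts and a fixed seed.
rng = random.Random(0)
parts = split_largest(1000.0, 16, rng)
print(len(parts), within_theorem_bounds(parts, 1000.0))
```

Always splitting the largest part keeps the ratio of largest to smallest cost at most 4 in this model, which is enough to give both bounds because the costs sum to $c(V)$.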
ACKNOWLEDGMENT 
The author is pleased to acknowledge several helpful conversations with Zvi Kedem on methods for 
treating multilective algorithms. 
REFERENCES 
1. C. D. THOMPSON, Area-time complexity for VLSI, in “Proceedings, 11th ACM Ann. Sympos.
Theory of Comput., April 1979,” pp. 81-88.
2. R. P. BRENT AND H. T. KUNG, The area-time complexity of binary multiplication, J. Assoc.
Comput. Mach. 28 (3) (1981), 521-534.
3. T. LENGAUER AND K. MEHLHORN, On the complexity of VLSI computations, “VLSI Systems and
Computations” (H. T. Kung, B. Sproull, and G. Steele, Eds.) pp. 89-99, Computer Sci. Rockville,
Md., 1981. 
4. J. E. SAVAGE, Planar circuit complexity and the performance of VLSI algorithms, “VLSI Systems 
and Computations” (H. T. Kung, B. Sproull, and G. Steele, Eds.) pp. 61-68. Computer Sci. 
Rockville, Md., 1981. 
5. Z. M. KEDEM AND A. ZORAT, Replication of inputs may save computational resources in VLSI,
“VLSI Systems and Computations” (H. T. Kung, B. Sproull, and G. Steele, Eds.) Computer Sci. 
Rockville, Md., 1981. 
6. Z. M. KEDEM AND A. ZORAT, On relations between input and communication/computation in 
VLSI, in “Procs. 22nd Ann. Sympos. on Found. Comput. Sci., October 28, 1981,” pp. 37-41. 
7. A. YAO, The entropic limitations on VLSI computations, in “Proceedings 13th Ann. ACM Sympos. 
on Theory of Computing, May 11-13, 1981,” pp. 308-311. 
8. C. D. THOMPSON, “A Complexity Theory for VLSI,” Report No. CMU-CS-80-140, Dept. of 
Comput. Sci., Carnegie-Mellon Univ. Pittsburgh, Pa., August, 1980. 
9. H. ABELSON AND P. ANDREAE, Information transfer and area-time tradeoffs for VLSI 
multiplication, Comm. ACM 23 (1980), 20-23. 
10. J. E. SAVAGE, Area-time tradeoffs for matrix multiplication and related problems in VLSI models,
J. Comput. System Sci., 23 (1981), 230-242. 
11. J. VUILLEMIN, A combinatorial limit to the computing power of VLSI circuits, in “Proceedings 21st
Ann. Sympos. Found. Comput. Sci., Oct. 13-15, 1980,” pp. 294-300.
12. R. J. LIPTON AND R. SEDGEWICK, Lower bounds for VLSI, in “Proceedings, 13th Ann. ACM 
Sympos. on Theory of Computing, May 11-13, 1981,” pp. 300-307. 
13. L. G. VALIANT, personal communication. 
14. Z. KEDEM, personal communication. 
15. D. Yu. GRIGORYEV, An application of separability and independence notions for proving lower 
bounds of circuit complexity, Notes of Scientific Seminars, Steklov Math. Inst. 60 (1976), 3548. 
16. J. E. SAVAGE, Space-time tradeoffs-A survey, in “Proceedings, Third Hungarian Conf. on 
Comput. Sci., Jan. 26-28, 1981,” pp. 93-104. 
17. H. T. KUNG AND C. E. LEISERSON, Algorithms for VLSI processor arrays, “Introd. to VLSI 
Systems” (C. Mead and L. Conway, Eds.) pp. 271-292, Addison-Wesley, Reading, Mass., 1980.
18. G. M. BAUDET, personal communication. 
19. C. A. HEINTZ, personal communication. 
20. R. J. LIPTON AND R. E. TARJAN, A separator theorem for planar graphs, SIAM J. Appl. Math. 36 
(2) (1979) 177-189. 
21. J. E. SAVAGE, “The Complexity of Computing,” Wiley, New York, 1976. 
22. F. T. LEIGHTON, New lower bound techniques for VLSI, in “Proceedings, 22nd Ann. Sympos. on 
Found. of Comput. Sci., October 28, 1981,” pp. 1-12. 
23. A. AGGRAWAL, personal communication. 
24. J. E. SAVAGE, “Planar Circuit Complexity and the Performance of VLSI Algorithms,” Report No. 
69, Dept. of Comput. Sci., Brown Univ., Providence, RI., January 1981; revised April and July, 
1981; INRIA Report No. 77. 
25. J. E. SAVAGE, Computational work and time on finite machines, J. Assoc. Comput. Mach. 19 (4) 
(1972) 660-674. 
26. M. TOMPA, Time-space tradeoffs for computing functions, using connectivity properties of their 
circuits, J. Comput. System Sci. 20 (2) (1980), 118-132. 
27. J. E. SAVAGE AND S. SWAMY, Space-time tradeoffs for oblivious integer multiplication, “Lecture 
Notes in Computer Science” (H. A. Maurer, Ed.) pp. 498-504, Springer-Verlag, 
Berlin/Heidelberg/New York, 1979. 
28. J. JA’JA’, Time-space tradeoffs for some algebraic problems, in “12th Ann. ACM Sympos. on 
Theory of Computing, April 28-30, 1980,” pp. 339-350. 
29. J. E. SAVAGE, Space-time tradeoffs for banded matrix problems, J. Assoc. Comput. Mach. 31 (2) 
(1984), 422-437.
30. A. C. YAO, Some complexity questions related to distributive computing, in “Procs. 11th ACM 
Ann. Sympos. Theory of Comput., April 1979,” pp. 209-213. 
31. A. V. AHO, J. E. HOPCROFT, AND J. D. ULLMAN, “The Design and Analysis of Computer 
Algorithms,” Addison-Wesley, Reading, Mass., 1974. 
32. C. MEAD AND L. CONWAY, “Introduction to VLSI Systems,” Addison-Wesley, Reading, Mass., 
1980. 
33. F. PREPARATA AND J. VUILLEMIN, Area-time optimal VLSI networks for multiplying matrices, 
Inform. Process. Lett. 11 (2) (1980), 77-80.
34. J. E. SAVAGE AND S. SWAMY, Space-time tradeoffs on the FFT algorithm, IEEE Trans. Inform.
Theory IT-24 (1978), 563-568. 
35. S. N. BHATT AND C. E. LEISERSON, How to assemble tree machines, in “14th Ann. ACM Sympos.
on Theory of Comput., May 1982,” pp. 77-84.
36. F. T. LEIGHTON, A layout strategy for VLSI which is provably good, in “14th Ann. ACM Sympos. 
on Theory of Comput., May 1982,” pp. 85-98. 
