Abstract. The two basic performance parameters that capture the complexity of any VLSI chip are the area of the chip, $A$, and the computation time, $T$. A systematic approach for establishing lower bounds on $A$ is presented. This approach relates $A$ to the bisection flow, $\phi$. A theory of problem transformation based on $\phi$, which captures both $AT^2$ and $A$ complexity, is developed. A fundamental problem, namely, element uniqueness, is chosen as a computational prototype. It is shown under general input/output protocol assumptions that any chip that decides if $n$ elements (each with $(1+\epsilon)\log n$ bits) are unique must have $\phi = \Omega(n \log n)$, and thus $AT^2 = \Omega(n^2 \log^2 n)$ and $A = \Omega(n \log n)$. A theory of VLSI transformability reveals the inherent $AT^2$ and $A$ complexity of a large class of related problems.
1. Introduction. In the study of complexity theory, a fundamental problem, normally referred to as a computational prototype, is chosen as the representative of a class of related problems. Establishing a lower bound on some significant performance parameter of a computational prototype has always been a difficult task, but once it is accomplished, the same bound for the rest of the problems in the class is established by means of problem transformation. Employment of a computational prototype is now classical; the most well-known examples are satisfiability in the theory of NP-completeness [GJ] and element uniqueness in the RAM model [PS]. In the VLSI model of computation as formulated by [T], [BK], and [AA], the fundamental complexity measures are $A$, the area of the VLSI chip, and $T$, its computation time. VLSI computation theory addresses the problem of designing algorithms (and the corresponding architectures) that use these two resources in an optimal (or efficient) manner. In order to judge the efficiency of a VLSI algorithm, it is useful to establish lower bounds on area, time, or various functions that capture an area-time tradeoff, e.g., $AT^2$. Standard techniques exist for proving lower bounds on $T$ and $AT^2$; they are based on bounded fan-in arguments (in the case of $T$) and on information-flow arguments (in the case of $AT^2$) [T], [BP].
In this paper we will present a standard technique for proving lower bounds on A. This technique is very similar to Thompson's bisection-flow technique. Indeed, we will show that a lower bound on the bisection flow for a particular computation immediately implies a lower bound on the area of any chip that performs the computation (subject to appropriate input/output protocol constraints).
To establish a lower bound on the bisection flow for a problem $\Pi$, there are two ways to proceed. The traditional approach is to start essentially from scratch, without taking advantage of previously derived lower bounds. A different approach is to utilize facts already known about another problem and show, by means of problem transformation, that $\Pi$ is at least as hard as this problem. Until now, the first technique has been used almost exclusively; the second approach has been used only in trivial situations, for example, to observe that inverting a matrix is asymptotically at least as hard as multiplying two matrices (up to a constant factor). Our goal is to establish a framework in which the second technique, that is, problem transformation, can be efficiently employed. This framework can be used to establish nontrivial lower bounds for a large class of related problems.
This paper is organized as follows. In Section 2 we modify the bisection-flow technique of Thompson to lower bound $A$ instead of $\sqrt{A}\,T$. We investigate the reciprocal roles of area and time in these lower bounds and show how, under this reciprocity, a $\sqrt{A}\,T$ (i.e., $AT^2$) lower bound, obtained by bisection-flow arguments, implies an $A$ lower bound. In Section 3 we develop a theory of problem transformation in VLSI that is based on the bisection flow. A computational prototype, namely, element uniqueness, is introduced, and nontrivial lower bounds on the bisection flow for this problem are established. Finally, in Section 4, these results are integrated to establish nontrivial $AT^2$ and $A$ lower bounds for a large class of problems.
2. Lower Bounds Using Bisection Flow. Thompson, in his seminal thesis [T], proposed a now classical technique for analyzing VLSI complexity, as follows. Consider a problem $\Pi(s)$, where $s$ is the input size, and a chip $C_\Pi$ with area $A$ that is capable of solving $\Pi$ in time $T$. Let $l$ be a cut that partitions $C_\Pi$ into a left side ($L$) and a right side ($R$), such that each side reads (almost) half of the inputs, i.e., $s/2 - o(s)$ bits, as shown in Figure 1(a). The general framework is one in which two processors, $P_L$ and $P_R$, associated respectively with $L$ and $R$, cooperate to solve $\Pi(s)$ (see Figure 1(b)). We denote by $\phi_\Pi(s)$ the number of bits that $P_L$ and $P_R$ communicate to solve $\Pi(s)$. Of course, $\phi_\Pi(s)$ depends strongly on the distribution of input/output bits between $P_L$ and $P_R$, and this, in turn, depends on the input/output protocol of $C_\Pi$. The exact nature of this dependence will be clarified later in this paper.
As Ullman noted [U], the history of the computation performed by $C_\Pi$ can be modeled with an area-time solid, as shown in Figure 2. The communication channel between $P_L$ and $P_R$ is represented by the rectangle $F$ (indicated by the dashed line) that transects the longer of the two area dimensions. Thus, $F$ has sides of length $T$ and (at most) $\sqrt{A}$; so $A_F$, the area of $F$, is at most $\sqrt{A}\,T$. If $\phi_\Pi(s)$ bits must flow across this channel, then $A_F = \Omega(\phi_\Pi(s))$. Hence, we obtain

$$\sqrt{A}\,T = \Omega(\phi_\Pi(s)) \tag{1}$$

[...] [Y2] and for transitive functions [V]. Here, we generalize both of these results and present a new methodology for proving area lower bounds. Our approach is similar to that of Wu [W]; however, we point out the subtle changes that must be made in the input/output protocol assumptions to obtain valid area bounds.
Again, we consider the area-time solid that models the computational history of $C_\Pi$. Suppose there is a time $t_1$ at which $C_\Pi$ has read (almost) half of the inputs, i.e., $s/2 - o(s)$ bits. Let $F$ (indicated by the dashed line) be the rectangular intersection of the plane $t = t_1$ with the area-time solid, as shown in Figure 3. Clearly, $A_F = A$. This bisection also yields a two-processor system. Here, $P_B$ and $P_E$, associated respectively with the beginning ($0 \le t \le t_1$) and end ($t_1 < t \le T$) of the computation of $C_\Pi$, cooperate to solve $\Pi(s)$ (see Figure 4). We denote by $\psi_\Pi(s)$ the number of bits that $P_B$ and $P_E$ communicate to solve $\Pi(s)$. Here again, $\psi_\Pi(s)$ depends on the distribution of the input/output bits between $P_B$ and $P_E$. (We defer discussion of this distribution and its dependence on the input/output protocol.) Because the electrical circuitry of the chip must be causal (i.e., information cannot flow backward in time), this communication is strictly one-way, from $P_B$ to $P_E$. We can now state the following theorem relating $A$ to $\psi_\Pi(s)$.

THEOREM 1. Any chip that solves $\Pi(s)$ must have area satisfying $A = \Omega(\psi_\Pi(s))$.

PROOF. As above, let us first assume that there is in fact a time $t_1$ when $s/2 - o(s)$ bits have been read. Then the rectangle $F$ (dashed in Figure 3) represents the communication channel from $P_B$ to $P_E$. All information that crosses $F$ must be encoded in the chip's state (i.e., stored in its memory) at time $t_1$. Since the storage of a bit requires some constant amount of area under any realistic assumptions, $A = \Omega(\psi_\Pi(s))$. Now, if there is no such time $t_1$, then at some instant $\Omega(s)$ bits must be read simultaneously. This requires the existence of $\Omega(s)$ input ports, which would occupy $\Omega(s)$ area. Thus, $A = \Omega(s)$ in this case. But $\psi_\Pi(s) \le s/2$, since $P_B$ can simply send all of its inputs to $P_E$. Therefore, in this case, we also have $A = \Omega(\psi_\Pi(s))$. □
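The one-way system in the proof can be made concrete with a small sketch. It assumes, purely for illustration, that $P_B$ uses the trivial protocol mentioned in the proof (forward everything it has read), so the message is exactly the state the chip must hold at time $t_1$. Element uniqueness is used as the example problem; the function names `run_pb`, `run_pe`, and `message_bits` are hypothetical and do not come from the paper.

```python
def run_pb(first_half):
    """P_B: active during 0 <= t <= t_1, reads the first half of the words.
    Illustrative trivial protocol: forward every word that was read."""
    return tuple(first_half)


def run_pe(message, second_half):
    """P_E: active during t_1 < t <= T; sees only the message and its own inputs.
    Returns 1 if all words are unique, 0 otherwise."""
    seen = set()
    for word in list(message) + list(second_half):
        if word in seen:
            return 0
        seen.add(word)
    return 1


def message_bits(message, word_bits):
    """Bits crossing the plane t = t_1, i.e. the state stored on the chip."""
    return len(message) * word_bits


words = [5, 2, 7, 2]                 # n = 4 words of 3 bits each, so s = 12
msg = run_pb(words[:2])              # communication is one-way: P_B -> P_E
answer = run_pe(msg, words[2:])
crossing = message_bits(msg, word_bits=3)   # s/2 bits under this protocol
```

Under this protocol the crossing is always $s/2$ bits, which is the upper bound on $\psi_\Pi(s)$ used in the second case of the proof; a cleverer protocol could only lower it.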
The above theorem gives a convenient relationship between the area complexity of a VLSI chip and the one-way communication complexity of a two-processor system. However, because two-way communication complexity is the measure of interest in the proof of $AT^2$ lower bounds, it is convenient to relate area to this measure also. If we denote by $\psi^*_\Pi(s)$ the number of bits that $P_B$ and $P_E$ must communicate to solve $\Pi(s)$ when two-way communication is allowed, then obviously $\psi^*_\Pi(s) \le \psi_\Pi(s)$. Thus, we have the following corollary.
COROLLARY 1. Any chip that solves $\Pi(s)$ must have area satisfying

$$A = \Omega(\psi^*_\Pi(s)). \tag{4}$$
Although this bound may in general be quite weak [DGS], we find it sufficiently tight for many problems.

Input/output protocol constraints must be established in VLSI computation theory. Such constraints reflect realistic assumptions regarding the physical structure of VLSI chips and the computing environments in which these chips might be used. They also simplify the combinatorics involved in the lower-bound arguments. Here, we will investigate the reciprocal roles played by area and time in the proof of $A$ and $AT^2$ lower bounds. In particular, we will show how a spatial constraint on the input/output protocol, which may be used to bound $\phi_\Pi(s)$, corresponds to a temporal constraint, which may be used to bound $\psi^*_\Pi(s)$. The fundamental observation here is that $\phi$ ($\psi^*$) depends only on the distribution of input/output variables between $P_L$ and $P_R$ ($P_B$ and $P_E$), and that the class of allowed distributions is governed by the spatial (temporal) input/output protocol constraints.
We begin by summarizing typical input/output protocol constraints. For the purpose of this discussion, we will assume that the input is organized as $n$ words, each with $k$ bits. First, we have spatial constraints:
(A1) Unilocal: each input/output bit is available at only one port (but perhaps at several time instants).
(A2) Place-determinate: input/output data are available at prespecified (instance-independent) places.
(A3) Word-local: for any cut $l$ partitioning the chip, $o(n)$ input (output) words have some bit entering (exiting) the chip on each side of $l$.
(A4) Bit-local: for any cut $l$ partitioning the chip, $o(k)$ input (output) bit positions have some bit entering (exiting) the chip on each side of $l$.
Second, we have temporal constraints:
(B1) Semellective: each input/output bit is available at only one time instant (but perhaps at several ports).
(B2) Time-determinate: input/output data are available at prespecified (instance-independent) times (actually, only the sequence of input/output events needs to be instance-independent, which allows for ready/acknowledge input/output protocols).
(B3) Word-serial: at any time instant, at most one input (output) word has some, but not all, of its bits already read (written).
(B4) Word-parallel: at any time instant, for all but at most one $l$, either all or none of the $l$th significant bits of the input (output) words are already read (written).

S. Hornick and M. Sarrafzadeh

Now we will discuss the manner in which these constraints restrict the class of distributions of input/output variables allowed in the two-processor system. Constraint A1 ensures that any particular input/output bit resides in either $P_L$ or $P_R$, but not both. Correspondingly, constraint B1 ensures that any particular input/output bit resides in either $P_B$ or $P_E$, but not both. Constraint A2 (or B2) ensures that, for all problem instances of a given input size, any particular input/output bit resides always in the same processor. Constraint A3 distributes the input/output bits between $P_L$ and $P_R$ essentially by word (possibly with $o(n)$ words fragmented across processors). Constraint B3 corresponds to A3 but is somewhat stronger. It distributes the input/output bits between $P_B$ and $P_E$ also by word (with $O(1)$ words fragmented across processors). Constraint A4 distributes the input/output bits between $P_L$ and $P_R$ essentially by their position in their respective words (possibly with $o(k)$ positions fragmented across processors). Constraint B4, similar but stronger, distributes the input/output bits between $P_B$ and $P_E$ also by bit position (with $O(1)$ positions fragmented across processors).
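The temporal constraints B3 and B4 can be checked mechanically on a read schedule, which also makes their symmetry visible. The sketch below assumes a hypothetical representation in which `read_time[i][j]` is the instant at which bit `j` of word `i` is read, and it treats a bit as "already read" at instant `t` when its read time is at most `t`; both choices are illustrative assumptions rather than definitions taken from the paper.

```python
def is_word_serial(read_time):
    """B3: at every instant, at most one word is partially read.
    read_time[i][j] is the instant at which bit j of word i is read."""
    instants = sorted({t for word in read_time for t in word})
    for t in instants:
        partial = sum(
            1 for word in read_time
            if any(b <= t for b in word) and not all(b <= t for b in word)
        )
        if partial > 1:
            return False
    return True


def is_word_parallel(read_time):
    """B4: at every instant, at most one bit position is partially read.
    This is the B3 test applied to bit positions (columns) instead of words."""
    columns = [list(col) for col in zip(*read_time)]
    return is_word_serial(columns)


word_by_word = [[0, 1], [2, 3]]   # finish word 0, then read word 1
bit_by_bit = [[0, 2], [1, 3]]     # all low bits first, then all high bits
```

Here `word_by_word` satisfies B3 but not B4, and `bit_by_bit` satisfies B4 but not B3. Any instant $t_1$ therefore splits a word-serial schedule between $P_B$ and $P_E$ by word, with at most one fragmented word, which is the $O(1)$ fragmentation noted above.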
Because of this correspondence (see Figure 5), any theorem lower bounding $\phi$ (and hence $\sqrt{A}\,T$) that is predicated on some combination of A1-A4 immediately yields a theorem lower bounding $\psi^*$ (and hence $A$) that is predicated on a corresponding combination of B1-B4. Hereafter, we will use the notation of the spatial two-processor system to establish lower bounds on $\phi$. From the previous discussion, it is clear that the same arguments can be used to establish lower bounds on $\psi^*$.
3. Transformability in VLSI. In this section we will develop a general theory for establishing lower bounds on $AT^2$ and $A$. In Section 2 it was shown that the bisection flow captures both the $AT^2$ and the $A$ measure of complexity. Furthermore, viewing a chip as a two-processor system circumvents certain difficulties of problem transformation in the VLSI domain (e.g., routing and encoding/decoding of data), provided that the two classes of allowed data distributions are essentially the same. In the following discussion, we will make these notions more concrete.
Following the notation of Preparata and Shamos [PS], consider two problems $\Pi_1(s_1)$ and $\Pi_2(s_2)$, and assume that a two-processor system $P_{\Pi_1}$ is available that solves $\Pi_1(s_1)$. Problem $\Pi_2(s_2)$ can be solved as follows:
(1) The input to problem $\Pi_2(s_2)$ is converted into a suitable input to problem $\Pi_1(s_1)$.
(2) $P_{\Pi_1}$ is used to solve $\Pi_1(s_1)$.
(3) The output of $\Pi_1(s_1)$ is transformed into a solution to problem $\Pi_2(s_2)$.
Thus, it is said that problem $\Pi_2(s_2)$ has been transformed to problem $\Pi_1(s_1)$. If steps 1 and 3 (above) can be done by transmitting $\phi_{2,1}(s_2)$ bits between the two processors in $P_{\Pi_1}$, then $\Pi_2(s_2)$ is said to be $\phi_{2,1}(s_2)$-transformable to $\Pi_1(s_1)$; we write $\Pi_2(s_2) \xrightarrow{\phi_{2,1}(s_2)} \Pi_1(s_1)$. Note that steps 1 and 3 include the conversion of an allowed data distribution for $P_{\Pi_2}$ into an allowed data distribution for $P_{\Pi_1}$. Therefore, $\phi_{2,1}(s_2)$ also accounts for the number of bits required to do this conversion.

Now we need to search for a problem $\Pi(s)$ for which we can establish a lower bound of $\phi(s)$ on the information flow and a transformation $\Pi(s) \xrightarrow{o(\phi(s))} \Pi'(s')$, for many related problems $\Pi'(s')$. $\Pi(s)$ then serves as a computational prototype for this class of related problems. A good computational prototype for a complexity class must be a simple problem, which makes it difficult to establish a lower bound on its bisection-flow complexity. Indeed, this is the case for computational prototypes in other models of computation (e.g., satisfiability in the theory of NP-completeness). In this paper we choose element uniqueness (EU) as a computational prototype.

EU$(n, h)$: Given $n$ inputs $(x_1, \ldots, x_n)$, each of which is represented with $h + \log n - 1$ bits, decide if they are all unique ($h \ge 1$, otherwise the problem is trivial). By convention, if they are all unique, then the output (one bit) is 1, otherwise the output is 0.
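As a reference point for the definition above, the following sketch decides EU$(n, h)$ directly, ignoring all chip and communication concerns. It assumes that $n$ is a power of two so that $\log n$ is an integer; the function name `element_uniqueness` is hypothetical.

```python
import math


def element_uniqueness(words, h):
    """Decide EU(n, h): return 1 if all n words are distinct, 0 otherwise.
    Each word must fit in h + log2(n) - 1 bits (n assumed a power of two)."""
    n = len(words)
    word_bits = h + int(math.log2(n)) - 1
    if any(not 0 <= w < 2 ** word_bits for w in words):
        raise ValueError("word does not fit in h + log n - 1 bits")
    return 1 if len(set(words)) == n else 0


unique_case = element_uniqueness([0, 3, 5, 7], h=2)     # 3-bit words
repeated_case = element_uniqueness([0, 3, 3, 7], h=2)
trivial_case = element_uniqueness([0, 1, 1, 0], h=0)    # 1-bit words
```

The last call illustrates why $h \ge 1$ is required: with $h = 0$ the words have $\log n - 1$ bits, so only $n/2$ distinct values exist and the answer is always 0.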
The following framework will be used to establish a lower bound on the communication complexity of any two-processor system that solves EU$(n, h)$. Consider a decision problem $\Pi(s)$, where $s$ is the number of inputs, and let $P_\Pi$ be a two-processor system that solves $\Pi(s)$. In what follows we show that $\phi_{EU}(n, h) = \Omega(nh)$ under the word-local protocol (Lemma 1) and also under the bit-local protocol (Lemma 2). The two results will be combined in Theorem 2 to show the same lower bound for EU under the bit model of VLSI computation, where neither assumption (A3) nor assumption (A4) need hold. Consider the input data organized as an array, with each word constituting a row and with the bit positions aligned as columns. We begin by partitioning the input array as $X = [M, D]$, where $M$ (the matching part) and $D$ (the data) are blocks of $\log n - 1$ and $h$ columns, respectively (see Figure 6). The bits of $M$ will be used to enforce an appropriate matching of the input words, which will be specified later. Subsequently, we will be concerned only with the information flow induced by $D$, and all bisection arguments will be based on the bits of $D$.
LEMMA 1. Under the word-local protocol assumption (A3), $\phi_{EU}(n, h) = \Omega(nh)$.
PROOF. The proof is based on a restriction of element uniqueness to pair-wise element uniqueness. Without loss of generality, we assume that $d_i$ enters $P_L$ for $0 \le i < n/2$ and it enters $P_R$ for $n/2 \le i < n$. We will prove a lower bound on the flow by considering the restricted class of input assignments such that $m_i = i$ for $0 \le i < n/2$ and $m_i = i - n/2$ for $n/2 \le i < n$.
In essence, we have partitioned the inputs into $n/2$ pairs, where each pair contains $d_i$ and $d_{i+n/2}$ for $0 \le i < n/2$. The two members of each pair are in different processors (one in $P_L$, the other in $P_R$). Thus, the elements are not unique (output = 0) if $d_i = d_{i+n/2}$ for any $0 \le i < n/2$. It can be shown by a generalization of the argument in [MS] that the result matrix has full rank ($2^{nh/2}$), and thus the flow has a bound of $\Omega(nh)$ [GLTWZ].
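The full-rank claim can be checked numerically for tiny parameters. The sketch below builds the result matrix of the restricted pair-wise problem, with rows indexed by the data held in $P_L$ and columns by the data held in $P_R$, and computes its rank over the rationals by Gaussian elimination. It is an illustration for small $n$ and $h$ only, not the argument of [MS], and it assumes the standard rank-based reading of the flow bound (flow at least $\log_2$ of the rank). The names `eu_pairwise_matrix` and `rank` are hypothetical.

```python
from fractions import Fraction
from itertools import product


def eu_pairwise_matrix(n, h):
    """Rows: data (d_0..d_{n/2-1}) in P_L; columns: data (d_{n/2}..d_{n-1}) in P_R.
    Entry is 1 iff d_i != d_{i+n/2} for every i, i.e. all elements are unique."""
    half = n // 2
    values = list(product(range(2 ** h), repeat=half))
    return [
        [int(all(a != b for a, b in zip(left, right))) for right in values]
        for left in values
    ]


def rank(matrix):
    """Rank over the rationals, by exact Gaussian elimination."""
    rows = [[Fraction(x) for x in row] for row in matrix]
    r = 0
    for col in range(len(rows[0])):
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            if rows[i][col] != 0:
                factor = rows[i][col] / rows[r][col]
                rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
    return r


small_rank = rank(eu_pairwise_matrix(4, 1))    # compare with 2 ** (4 * 1 // 2)
larger_rank = rank(eu_pairwise_matrix(4, 2))   # compare with 2 ** (4 * 2 // 2)
```

For each pair the matrix is the all-ones matrix minus the identity, and the whole matrix is the Kronecker product of these blocks, so the rank $2^{nh/2}$ gives $\log_2(\text{rank}) = nh/2$.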
□

Under the word-local assumption, each bit of a given input word enters the same processor ($P_L$ or $P_R$). Now we consider the "opposite case," where half of the input bit positions are assigned to each processor.

LEMMA 2. Under the bit-local protocol assumption (A4), $\phi_{EU}(n, h) = \Omega(nh)$, for $h = O(\log n)$.

PROOF. The fragment of $d_i$ in $P_L$ ($P_R$) is denoted by $d_i^L$ ($d_i^R$) [...] (Figure 7). In this setting, the elements are not unique (output = 0) if [...] for $0 \le i < H/2$ and any $j$.
There are $(H/2)!$ permutations $\omega_j$, for any $j$. The result matrix for any one group is the submatrix obtained by deleting certain rows from the matrix introduced in Lemma 1. Thus, it also has full rank, i.e., $(H/2)!$. The overall result matrix is the Kronecker product of the group matrices, and its rank is therefore the product of the ranks of these matrices [...]. From equation (5) we can establish the desired bound on the flow [...], and, because $n = 2^{h/2+1}$, $\phi = \Omega(nh)$. □

Now we will extend the results of Lemmas 1 and 2 to the more general bit model of VLSI computation. In this situation any bit of any word may enter either of the two processors.

THEOREM 2. For any unilocal, place-determinate input/output protocol, $\phi_{EU}(n, h) = \Omega(nh)$ for $h = O(\log n)$.
PROOF. Our strategy is to show that, for an arbitrary (but fixed) partition of the input bits, a large portion of the input words must all be either "substantially" word-local or "substantially" bit-local. The set of input words is partitioned into two sets, the set $B$ of biased words and the set $U$ of unbiased words. Intuitively, a biased word is one with most of its bits in one processor ($P_L$ or $P_R$), and an unbiased word is one with almost the same number of bits in each processor. More formally, [...] Note that $B \cup U = D$ and $B \cap U = \emptyset$. We analyze the distribution of input bits in each processor according to the size of the sets $B$ and $U$.
Case 1. $|B| \ge 3n/4$ (thus $|U| \le n/4$).
We partition the biased words further into the left-biased $B_L$ and the right-biased [...] (1), that any unilocal, place-determinate chip with area $A$ that solves EU$(n, \epsilon \log n)$ in time $T$ satisfies $AT^2 = \Omega(n^2 \log^2 n)$, and, by virtue of equation (4), any semellective, time-determinate chip satisfies $A = \Omega(n \log n)$.

Siegel has independently proven Theorem 2 [S2]. He has also proven an analogous result for $h = \omega(\log n)$ using square tessellation arguments. The square tessellation technique, however, cannot be applied to time (since time is a unidimensional quantity), so we cannot improve the area lower bound for $h = \omega(\log n)$.
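To see what these bounds mean numerically, the sketch below substitutes $h = \epsilon \log n$ into the flow bound $nh$ and reports the resulting area and $AT^2$ figures. The constants hidden in the $\Omega$ notation are dropped, so the numbers indicate growth rates only; the function name `eu_lower_bounds` is hypothetical.

```python
import math


def eu_lower_bounds(n, eps):
    """Growth-rate figures for EU(n, eps * log n), with Omega constants dropped.
    flow ~ n*h, area ~ flow, area * time^2 ~ flow^2."""
    h = eps * math.log2(n)
    flow = n * h
    return {"flow": flow, "area": flow, "area_time_squared": flow ** 2}


bounds = eu_lower_bounds(1024, 1.0)   # n = 2**10, h = 10
```

With $n = 1024$ and $\epsilon = 1$ the flow term is $n \log n = 10240$, so the area figure is $10240$ and the $AT^2$ figure is $10240^2$, matching $A = \Omega(n \log n)$ and $AT^2 = \Omega(n^2 \log^2 n)$.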
4. Applications. In this section we apply the previous results to establish $AT^2$ and $A$ lower bounds for related problems. First, however, we prove the following lemma, which facilitates problem transformation by allowing us to relax the unilocal (semellective) assumption, A1 (B1). Instead, we now assume A1' (B1').
(A1') Bilocal: each input/output bit is available at no more than two ports (but perhaps at several time instants).
(B1') Bilective: each input/output bit is available at no more than two time instants (but perhaps at several ports).
Let $\phi'_{EU}(n, h)$ denote the bisection flow under A1', A2, and A3. Obviously, the traditional bisection technique fails to establish any bound on $\phi'_{EU}(n, h)$ because each input may enter both processors ($P_L$ and $P_R$). Nevertheless, we can still obtain a lower bound on $\phi'_{EU}(n, h)$ by employing a method similar to the bisection technique.

LEMMA 3. Under assumptions A1', A2, and A3, $\phi'_{EU}(n, h) = \Omega(nh)$.
PROOF. Consider any (convex) chip $C_{EU}$ that solves EU$(n, h)$ under A1', A2, and A3. Let us partition the chip into four sections by means of lines parallel to the shorter side of the minimum-area enclosing rectangle, such that each section contains $n/2$ input words (recall that there are now $2n$ input words: $\{x_0, x_0, x_1, x_1, \ldots, x_{n-1}, x_{n-1}\}$). The general framework is one in which the four processors ($P_1$, $P_2$, $P_3$, $P_4$) associated with the four sections of $C_{EU}$ cooperate to solve EU$(n, h)$ (see Figure 8). A straightforward modification of equation (2) implies [...], where $\phi'_{EU}(n, h) = \max(\phi_1, \phi_2, \phi_3)$. A lower bound on any one of the $\phi_i$ cannot be established independently, for it may be the case that processors to the left or right of the link associated with any $\phi_i$ have access to the entire input set and thus do not need to send or receive any information to or from the other processors. In fact, this situation occurs when $P_1$ and $P_3$ each contain a copy of $(x_0, \ldots, x_{n/2-1})$, and $P_2$ and $P_4$ each contain a copy of $(x_{n/2}, \ldots, x_{n-1})$.
Our strategy is to partition the four processors into two sets, $P_L$ and $P_R$, such that each set contains both copies of (at least) $n/16$ input words. These inputs can then be revealed to the other set only by information flow through the links connecting $P_L$ and $P_R$. Since there are a total of $n/2$ inputs in $P_1$, and each input is repeated twice, there must be at least $n/4$ distinct inputs in $P_1$. The other copies of these $n/4$ input words are in $P_1$, $P_2$, $P_3$, or $P_4$. By the pigeonhole principle, $P_1$ and $P_i$ (for some $1 \le i \le 4$) must contain both copies of at least $n/16$ input words. Let $P_L = \{P_1\} \cup \{P_i\}$ and $P_R = \{P_2, P_3, P_4\} - \{P_i\}$. We can view $P_L$-$P_R$ as a two-processor system with a flow $\phi_{LR}$ of $\Omega(nh/16)$ bits between $P_L$ and $P_R$ (see Lemma 1). Clearly, [...], and thus $\phi'_{EU}(n, h) = \max(\phi_1, \phi_2, \phi_3) = \Omega(nh)$.
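The pigeonhole step can be illustrated on a concrete placement of the $2n$ word copies. The sketch below assumes a hypothetical representation in which `sections[k]` lists the word indices whose copies sit in $P_{k+1}$; it returns the index $i$ for which $P_1$ and $P_i$ together hold both copies of the most words, along with that count. The function name `partner_section` is not from the paper.

```python
from collections import Counter


def partner_section(sections):
    """sections[k] lists the word indices whose copies are placed in P_{k+1}.
    Returns (i, count): P_1 and P_i hold both copies of `count` words,
    where i (1-based) is chosen to maximise that count."""
    first = Counter(sections[0])
    distinct = set(first)
    counts = []
    for idx, section in enumerate(sections):
        if idx == 0:
            # Both copies already inside P_1.
            counts.append(sum(1 for w in distinct if first[w] == 2))
        else:
            present = set(section)
            counts.append(sum(1 for w in distinct if w in present))
    best = max(range(len(sections)), key=lambda k: counts[k])
    return best + 1, counts[best]


# The placement described in the proof: P_1 and P_3 hold x_0..x_{n/2-1},
# P_2 and P_4 hold x_{n/2}..x_{n-1}, here with n = 8.
adversarial = [[0, 1, 2, 3], [4, 5, 6, 7], [0, 1, 2, 3], [4, 5, 6, 7]]
partner, shared = partner_section(adversarial)
```

For this placement the partner of $P_1$ is $P_3$ with four shared words, comfortably above the guaranteed $n/16$, so grouping $P_L = \{P_1, P_3\}$ forces the flow onto the links to $P_2$ and $P_4$.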
□

From the discussion of Section 2, it is clear that the temporal analog of Lemma 3 also holds under assumptions B1', B2, and B3. Furthermore, if $h = O(\log n)$, A3 (B3) may be replaced by A4 (B4) while maintaining the same flow bound.
(The roles of n and h are simply reversed in Lemma 3.)
Now we demonstrate the problem transformation methodology by means of two examples. Specifically, two fundamental problems are shown to be at least as hard (in either the $AT^2$ or the $A$ sense) as element uniqueness. We conclude with a brief catalog of related problems, together with lower bounds on their $AT^2$ and $A$ complexity, as obtained via problem transformation.
The first problem is a fundamental one in computational geometry, namely, closest pair.

CP$(n, h)$: Given a set of $n$ points $p_i = (a_i, b_i)$ for $0 \le i < n$, where each coordinate is represented with $h + \log n - 1$ bits, find the coordinates of the closest pair of points.
We want to show EU$(n, h) \xrightarrow{o(nh)}$ CP$(n, h)$. Assume there is a two-processor system $P_{CP}(n, h)$ that solves CP$(n, h)$. This system can then be used to solve EU$(n, h)$ under the bit model (assumptions A1 and A2) in the following manner:
(1) The coordinates of each point are set as $p_i = (x_i, 0)$, which is a trivial transformation.
(2) $P_{CP}$ is used to solve this (restricted) closest pair problem. (The chip $C_{CP}$ is bisected in such a way that each processor inputs half of the "meaningful" data, that is, $x_i$'s.)
(3) Once the closest pair of points is determined, $P_L$ sends all of its output bits ($O(h + \log n)$) to $P_R$. $P_R$ then computes the distance between the two points and outputs a 0 if the distance is equal to 0 and a 1 otherwise.

It is clear that the output is 1 if and only if the elements are unique.
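The three steps above can be sketched in software, with a brute-force closest-pair routine standing in for the two-processor system $P_{CP}$. The stand-in and the function names `closest_pair` and `element_uniqueness_via_cp` are hypothetical; the sketch shows only the input and output conversions, not the communication cost, and assumes at least two inputs.

```python
from itertools import combinations


def closest_pair(points):
    """Stand-in for P_CP: brute-force closest pair by squared distance."""
    return min(
        combinations(points, 2),
        key=lambda pq: (pq[0][0] - pq[1][0]) ** 2 + (pq[0][1] - pq[1][1]) ** 2,
    )


def element_uniqueness_via_cp(xs):
    """Decide EU by transformation to CP (requires len(xs) >= 2)."""
    points = [(x, 0) for x in xs]      # step 1: p_i = (x_i, 0)
    p, q = closest_pair(points)        # step 2: solve the restricted CP instance
    return 0 if p == q else 1          # step 3: distance 0 means a repeated element


with_repeat = element_uniqueness_via_cp([3, 1, 4, 1])
all_distinct = element_uniqueness_via_cp([3, 1, 4])
```

Only step 3 needs the two closest points to be brought together, which is why the output conversion costs just the bits of one pair of coordinates.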
By Theorem 2, $\phi_{EU}(n, h) = \Omega(nh)$. Since steps 1 and 3 above require the transmission of $O(h + \log n) + o(nh)$ bits, EU$(n, h) \xrightarrow{o(nh)}$ CP$(n, h)$. Theorem 3 follows immediately from the proposition. Thus, any unilocal, place-determinate chip with area $A$ that solves CP$(n, \epsilon \log n)$ in time $T$ satisfies $AT^2 = \Omega(n^2 \log^2 n)$, and any semellective, time-determinate chip satisfies $A = \Omega(n \log n)$.

Now we establish a $\phi$ lower bound on the problem of finding the size of the maximum clique in an interval graph (MCIG) by showing that EU is transformable to it. This serves as an excellent illustration of the utility of the previous results.

MCIG$(n, h)$: Given a collection of closed intervals $I_i = [l_i, r_i]$ for $1 \le i \le n$, where $l_i$ and $r_i$ are respectively the left and right endpoints of interval $I_i$, we can define a graph $G = (V, E)$, where $V = \{I_i \mid 1 \le i \le n\}$ and $E = \{(I_i, I_j) \mid I_i \cap I_j \ne \emptyset,\ 1 \le i, j \le n\}$. Such a graph is called an interval graph. Let $h + \log n - 1$ be the length of the integers used to represent the $l_i$'s and $r_i$'s, i.e., $0 \le l_i, r_i \le n2^{h-1} - 1$ for $1 \le i \le n$. The problem is to find the size of the maximum clique in this graph.
We want to show EU$(n, h) \xrightarrow{o(nh)}$ MCIG$(n, h)$. Assume there is a two-processor system $P_{MCIG}(n, h)$ that solves MCIG$(n, h)$. This system can then be used to
