We propose a new model of computation for VLSI which is a refinement of previous models and makes the additional assumption that the time for propagating information is linear in the distance. Our approach is motivated by the failure of previous models to allow for realistic asymptotic analysis. While accommodating basic laws of physics, this model tries to be most general and technology-independent. Thus, from a complexity viewpoint, it is especially suited for deriving lower bounds and trade-offs. We present new results for a number of problems including fan-in, addition, transitive functions, matrix multiplication, and sorting.
Introduction
The importance of having general models of computation for VLSI is apparent for various reasons. Among the chief ones, we must include the need for evaluating and comparing circuit performances, showing lower bounds and trade-offs on area, time, and energy, and more generally building a complexity theory of VLSI computation.
While these models must be simple and general enough to allow for mathematical analysis, they must also reflect reality independently of the size of the circuit. We justify the latter claim by observing that if 1980's circuits are still relatively small, the use of high-level languages for designing chips, combined with the possibility of larger integration and bigger chips, will make asymptotic analysis necessary in the near future.
Yet as circuits are pushed to their physical limits, constraints which could be ignored before become major problems and must be accounted for in the models. For example, the laws of physics show that the propagation of information takes time at least proportional to the distance, which invalidates the assumptions made in previous models, whereby long wires can be driven in constant time [MC80, TH80].
Generally speaking, one major flaw in those previous models has been to regard a circuit as a topological interconnection of nodes where transmission delays between adjacent nodes could be ignored. Instead, we propose to take into account the geometry of the circuit by assuming a propagation delay at least linear in the distance.
Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.
©1981 ACM 0-89791-041-9 /80/0500/0318 $00.75
The model which we propose tries to be most general and is thus primarily suited for deriving lower bounds on the complexity of circuits.
By using geometric arguments, we are able to prove new complexity results for a varied set of problems, including fan-in, addition, cyclic shift, integer multiplication, convolution, linear transform, product of matrices, and sorting. Another purpose of this paper is to introduce several new techniques of complexity analysis, one of which is due to Gérard Baudet.
In the last section, we illustrate how additional assumptions may be needed for building technology-dependent models. We have chosen the NMOS technology as an example where power considerations lead to more stringent requirements. We show how it is then possible to derive stronger lower bounds on the complexity of some problems.
The Model
Our model is for the most part a refined version of the current planar models found in the literature [TH80, BK80, VU80]. These models, which are all in fact very similar, have been used both to find lower bounds and to make comparative analyses of circuits. They differ from the traditional RAM model in charging area costs for the transmission of information. However, they show some inconsistency by failing to account for propagation delays. Our model tries to remedy this shortcoming by including all the parameters necessary for performing realistic asymptotic analyses. In particular, the length of wires becomes crucial in evaluating the time performance of a circuit. As a result, a circuit appears as a fully geometrical object rather than a topological one.
Model of computational device
We can think of a computational device as a black box which computes a boolean function $(y_1, y_2, \ldots) = F(x_1, x_2, \ldots)$. Information is treated digitally, and is communicated to the device through I/O ports in a fixed format. To each variable $x_i$ (resp. $y_i$) is associated both an input (resp. output) port and a chronological rank on the set of inputs (resp. outputs). In addition, each variable must be accompanied by information indicating when it is available on its port. This information may be independent of the inputs, which involves fixing the times and locations at which the I/O bits can be used [VU80]. On the other hand, that information may be provided by the circuit (output) or the user (input), thus allowing for self-timed computations [MC80, SE79].
Internally, a computational device is a circuit connecting nodes and wires into a directed graph, and is defined by a geometrical layout of this graph. The layout is supposed to be planar, meaning that all the nodes are on the same plane, and the wires are allowed to lie on a constant number of parallel layers. This implies that there are at most a constant number of cross-overs at any point.
Wires may intersect only at the nodes, and their width must exceed a minimum value h. Similarly, all nodes occupy a fixed area. We distinguish I/O nodes (ports) where input and output values are available from the logical nodes (gates) which compute boolean functions. The circuit is laid out within a convex region with all the I/O nodes lying on its boundary.
Model of computation
Two parameters characterize the computational complexity of a circuit: its area A and its time of computation T. For a fixed value of the inputs, T measures the time between the appearances of the first input bit and the last output bit. The maximum value of T for all possible inputs defines the time complexity of the circuit. In this paper, the time T of a circuit will always refer to this worst-case measure. Other approaches can be considered, involving average or even best time of computation. Of course this is pertinent only with self-timed circuits.
Another important parameter is the period of a circuit [VU80]. It is defined only for circuits whose inputs and outputs are available at fixed times and locations. It is then possible to pipeline the computations on several sets of inputs, and we define the period P as the minimal time interval separating two input sets. More precisely, if $(a_1, \ldots, a_N)$ and $(b_1, \ldots, b_N)$ are two sets of inputs, pipelining the computation means that the time separating the appearance of $a_i$ and $b_i$ is the same for all $i$; this interval defines the period P. Informally, this relation means that a node can produce a result only a delay $\tau$ after its inputs have been available.
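To make the role of the period concrete, here is a minimal sketch of the usual pipelining accounting. It assumes (this is not stated in the text) that a circuit of latency T fed a new input set every P time units finishes its k-th set at time T + (k - 1)P; the function name `completion_time` and its parameters are hypothetical.

```python
def completion_time(latency: float, period: float, num_sets: int) -> float:
    """Time at which the last output of the num_sets-th input set appears.

    Assumption (illustrative, not from the paper): input sets enter the
    circuit every `period` time units, and each set takes `latency` time
    units from its first input bit to its last output bit.
    """
    if num_sets < 1:
        raise ValueError("num_sets must be at least 1")
    # The first set finishes after one full latency; every later set
    # finishes one period after its predecessor.
    return latency + (num_sets - 1) * period


if __name__ == "__main__":
    # With latency 10 and period 2, five pipelined sets take 18 time units,
    # instead of 50 if the sets were processed one after the other.
    print(completion_time(10, 2, 5))
```

Under this accounting a small period can compensate for a large latency when many input sets are processed, which is why the text treats P as a parameter separate from T.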
The most drastic departure from previous models, however, comes from the next assumption, which expresses that the time of propagation across a wire is at best proportional to the length of the wire. Let $I(t)$ and $O(t)$ be the variables associated with the ends of a wire of length L. We require that
(2) $I(t + \tau) = O(t)$ for $\tau = \Omega(L)$.
The last assumption asserts the bandwidth limitation of wires.
Although we do not wish to restrict the storage capacity of the wire to one bit, we will assume that a wire can carry a number of bits at most proportional to its length. The simplest way to express it is to regard a wire of length L as a sequence of L subwires of unit length connected by nodes computing the identity function. Then for each subwire we have $I(t+1) = O(t)$, meaning that each subwire stores exactly one bit of information and has a unit delay. Note that Assumption (2) follows directly from this approach.
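The chain-of-subwires view can be sketched as a shift register. The class `Wire` and the helper `transmit` below are hypothetical names introduced for illustration, and the simulation assumes an integer length and discrete unit time steps.

```python
from collections import deque


class Wire:
    """A wire of integer length L modeled as L unit subwires.

    Each subwire holds exactly one bit and has unit delay, so the whole
    wire stores at most L bits and delays every bit by L time units.
    """

    def __init__(self, length: int):
        if length < 1:
            raise ValueError("length must be at least 1")
        # None marks a subwire that does not carry a bit yet.
        self.cells = deque([None] * length, maxlen=length)

    def step(self, bit_in):
        """Advance one time unit: read the output end, then shift in bit_in."""
        bit_out = self.cells[-1]
        # With maxlen set, appendleft drops the rightmost (output) cell.
        self.cells.appendleft(bit_in)
        return bit_out


def transmit(length: int, bits):
    """Feed `bits` into a fresh wire and return everything seen at its output."""
    wire = Wire(length)
    padding = [None] * length  # extra steps needed to flush the wire
    return [wire.step(b) for b in list(bits) + padding]


if __name__ == "__main__":
    # The first bit emerges only after 3 steps on a wire of length 3.
    print(transmit(3, [1, 0, 1, 1]))
```

In this sketch the wire is pipelined: a new bit can enter at every step, but no bit can leave earlier than L steps after it entered.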
We finally introduce the important concept of datapath. We say that there exists a datapath from an input variable x to a node V if there is a directed path to V from the input port where x is read in. Moreover, we require that information can be propagated from x to V before the circuit completes its computation.
Justification of the assumptions
Like previous models, this one strongly reflects present technologies, especially electrical ones (NMOS, CMOS, TTL); for example a circuit is taken to be planar, convex, with its I/O ports on the boundary, and we assume minimal dimensions for every part (node or wire). All these constraints find justifications in today's fabrication and packaging processes. Other stronger reasons can be advanced:
• Planarity: This is indeed a matter of choice, since three-dimensional computing devices are conceivable. Although most present-day circuits are planar, it would be very interesting to explore the features and properties of a three-dimensional model. We are still inclined to think that because of heat dissipation problems, circuit designers will be unlikely to give up planarity. Indeed, today's circuits all use the third dimension for cooling purposes.
• Convexity and I/O ports: We believe that circuits should be easy to manipulate and connect together. A good way to achieve handiness is to make circuits into closed systems, where interactions with the outside occur only through the boundary. Convexity thus appears as a natural requirement.
• Minimal size: This seems to be an intrinsic feature of any physical device (quantum mechanics argument). As a consequence, gates and wires may not be arbitrarily small, if they are to be material.
• Propagation delays: The ultimate justification for our assumption comes from a speed-of-light argument. No information can propagate faster than light. Moreover, in practice, parasitic effects reduce the speed of propagation several orders of magnitude below that limit.
We can illustrate the latter point by briefly examining the delays incurred in NMOS technologies. The information is coded as a potential, and its propagation involves loading the capacitance of a wire. Since a wire is a piece of conducting material, it has a non-null resistance and capacitance, and because of current process conditions, both are proportional to the wire length. Detailed analysis of this situation can be found in [CM81]. One major consequence is that, because of the diffusion law, delays on wires are in fact proportional to the square of the length. In this technology, it then appears necessary to decompose long wires into pieces of constant length connected by nodes (globally computing the identity function), in order to achieve delays proportional to the length of the wires. Moreover, the maximum speed of information propagation is largely dominated by the speed of light [SE79], mainly because the current intensities involved are very low, and many electrical phenomena (overheat, metal migration [CL80]) impose a limit on them.
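The contrast between quadratic and linear wire delay can be sketched numerically. The model below is an assumption made for illustration: an unbuffered wire of length L is charged a delay of r·c·L²/2 (a common distributed-RC approximation), and a buffered wire is cut into segments of constant length, each followed by a node with a fixed delay. The constants `r`, `c`, `segment`, and `gate_delay` are hypothetical and carry no physical units.

```python
import math


def unbuffered_delay(length: float, r: float = 1.0, c: float = 1.0) -> float:
    """Delay of a single wire whose resistance and capacitance grow with length.

    Assumed model: delay = r * c * length**2 / 2, i.e. quadratic in length.
    """
    return 0.5 * r * c * length ** 2


def buffered_delay(length: float, segment: float = 10.0,
                   gate_delay: float = 1.0,
                   r: float = 1.0, c: float = 1.0) -> float:
    """Delay of the same wire cut into constant-length pieces joined by nodes.

    Each piece costs a constant (its own quadratic delay plus one node delay),
    so the total grows linearly with the number of pieces.
    """
    pieces = math.ceil(length / segment)
    return pieces * (unbuffered_delay(segment, r, c) + gate_delay)


if __name__ == "__main__":
    for length in (100, 200, 400):
        print(length, unbuffered_delay(length), buffered_delay(length))
```

Doubling the length multiplies the unbuffered delay by four but only doubles the buffered one, which is the behavior the model's linear-delay assumption relies on.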
Distributing and Collecting Information
Since fanning information in and out are two of the most common operations performed by circuits, we first turn to these problems, from which we can best measure the significant departure of our model from the previous ones. These results will be basic tools in analyzing further problems. We need the following geometric lemma.
Lemma 1: If P is an arbitrary convex polygon with a boundary of length N, the maximum distance between any vertex of P and an arbitrary point in the plane is $\Omega(N)$.
We omit the proof, which is straightforward. As a result, it takes time $\Omega(N)$ to propagate a bit from any point M to N points on a convex boundary (e.g. input or output ports).
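Lemma 1 can be checked numerically on an example. The sketch below relies on a standard fact that is not spelled out in the text: a convex polygon contained in a disk of radius R has perimeter at most 2πR, so the farthest vertex from any point M is at distance at least N/(2π). The helper names and the choice of an ellipse-shaped polygon are illustrative assumptions.

```python
import math


def perimeter(poly):
    """Boundary length of a polygon given as a list of (x, y) vertices."""
    n = len(poly)
    return sum(math.dist(poly[i], poly[(i + 1) % n]) for i in range(n))


def farthest_vertex_distance(poly, m):
    """Largest distance between the point m and any vertex of the polygon."""
    return max(math.dist(v, m) for v in poly)


def ellipse_polygon(a, b, n):
    """Convex polygon with n vertices placed in angular order on an ellipse."""
    return [(a * math.cos(2 * math.pi * k / n),
             b * math.sin(2 * math.pi * k / n)) for k in range(n)]


if __name__ == "__main__":
    poly = ellipse_polygon(5.0, 2.0, 40)
    bound = perimeter(poly) / (2 * math.pi)
    # Even from the most favorable point, some vertex is at least `bound` away.
    best = min(farthest_vertex_distance(poly, (x / 2, y / 2))
               for x in range(-20, 21) for y in range(-20, 21))
    print(bound, best)
```

The constant 1/(2π) is specific to this argument; the lemma itself only claims a bound linear in N.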
Fan-out
A fan-out of degree N refers to the distribution of N copies of an information bit at N different locations on the circuit. We have the following.
Theorem 2: It takes time $T = \Omega(N^{1/2})$ to perform a fan-out of degree N.
Proof: A consequence of the fact that the maximum distance between a point and N arbitrary points in the plane is $\Omega(N^{1/2})$.
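The square-root behavior in Theorem 2 can be illustrated with a packing argument. The sketch assumes each destination occupies a unit square centered on an integer grid point (a simplification of the fixed-area nodes of the model). N disjoint unit squares whose centers lie within distance r of the source fit in a disk of radius r + √2/2, so r ≥ √(N/π) − √2/2. The function names are hypothetical.

```python
import math


def packing_lower_bound(n: int) -> float:
    """Radius of a disk whose area equals that of n unit-area nodes."""
    return math.sqrt(n / math.pi)


def best_grid_fanout_radius(n: int) -> float:
    """Distance from the origin to the farthest of the n nearest grid points.

    This is the best possible fan-out radius when destinations sit on
    distinct integer grid points and the source is at the origin.
    """
    side = math.isqrt(n) + 2  # the square [-side, side]^2 contains the n nearest points
    squared = sorted(x * x + y * y
                     for x in range(-side, side + 1)
                     for y in range(-side, side + 1))
    return math.sqrt(squared[n - 1])


if __name__ == "__main__":
    for n in (1, 10, 100, 400, 1000):
        print(n, best_grid_fanout_radius(n), packing_lower_bound(n))
```

Quadrupling N roughly doubles the radius, in contrast with the logarithmic depth of a fan-out tree in models that ignore wire length.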
Fan-in
The fan-in is essentially the reverse operation of the fan-out, as N information bits must now converge from several sources to one destination point. Yet it is a little more general, since the information may be submitted to logical operations on the way to its destination.
Typically, the problem is to compute a boolean function of N inputs and one output. However, to ensure that all the input bits are used in the computation, we give the following definition of a fan-in. Note that to perform a fan-in on a hard input always takes $\Omega(N^{1/2})$ time.
We also observe that the above lower bounds are still valid for boolean functions with an arbitrary number of outputs, as long as at least one output is a fan-in of all the input values. The addition for example falls in that category, since the last carry depends upon all the operand bits.
If the boolean function is a commutative, associative operation on N variables, these lower bounds are tight with today's circuits, as shown in 
Addition
Since our model relates the time of computation to the geometry rather than the topology of the circuit, we can show that many complete binary tree based schemes cease to have the logarithmic time complexity which they enjoyed in previous models. Notable examples include the fan-in and fan-out operations studied earlier, or the addition of two N-bit integers, to which we next turn our attention. Our results are expressed in the following.
Theorem 5: If T is the time, P the period, and A the area required by any circuit to add two N-bit integers, we have $T = \Omega(N^{1/2})$, $AT = \Omega(N)$, $ATP = \Omega(N^2)$.
The first two results come from the computation of the last carry, which can be easily shown to involve a fan-in of degree N. The proof for the last lower bound is more difficult, and requires a few technical lemmas.
To simplify the notation, we will equate all
Lemma 7: For any $i$, $1 \le i \le N$, and for any $t$, $t_i < t < t_{i+1}$, we have $Y(t) < \sum_{1 \le j \le i} \min(|Z_j|, t - t_j)$.
Proof: Let $y_m$ be the highest-order bit output so far at time $t$. We clearly have
(1) $m \ge Y(t) - 1$.
From Lemma 6, it follows that for all $j$, $1 \le j \le m$, we have $L_j < t - t_j$.
Since we also have $L_j < |Z_j|$, we derive $m + 1 = \sum_{1 \le j \le m} L_j < \sum_{1 \le j \le m} \min(|Z_j|, t - t_j)$, and
Transitive Functions
It is worthwhile to notice the serious gap existing between this model and the previous ones, in which a transitive function could be computed in logarithmic time (e.g. the CCC-scheme [PV79, PV80] or other recursive schemes [BPV80, TH80a]). In fact, our model seems to rule out any logarithmic time circuit. However, good performances on the period can be expected from pipelining the computation. Note that although we improve upon the lower bound for T, the well-known trade-off $AT^{2\alpha} = \Omega(N^{1+\alpha})$, valid for $0 \le \alpha \le 1$, remains unchanged.
Sorting and Matrix Multiplication
Although these two problems are not transitive, we can prove similar bounds on A and T. Note that C. Thompson [TH80] and J. Savage [SA79] have already established bounds on $AT^2$ similar to those known for transitive functions.
The minimal bisection argument revisited
We can improve on the general technique of minimal bisection [FH79] by introducing geometric arguments. Consider a line L partitioning the circuit into two parts $C_1$ and $C_2$, each producing roughly half the output bits. We define the minimal cross-flow I to be the minimal number of bits crossing L. More precisely, let $U_1$ (resp. $U_2$) and $V_1$ (resp. $V_2$) denote the set of input (resp. output) variables assigned in $C_1$ and $C_2$ respectively. For any fixed assignment of the inputs $U_2$, consider the total number of output sets $V_2$ obtained for all possible inputs $U_1$; we define $Z_1$ as its maximum value over all possible sets $U_2$. Similarly, we can define $Z_2$ by inverting the indices, and we call Z the maximum of $Z_1$ and $Z_2$. Finally, I is defined independently of the circuit as the minimum value of $\log_2 Z$ over all possible bisections and all possible circuits computing the function. Wlog, assume that $Z = Z_1$. It is clear that I bits must cross the separating line L from $C_1$ to $C_2$, and moreover to each of these bits corresponds a datapath going from $U_1$ to $V_2$. We can prove the following result.
Lemma 9: Any circuit computing a function with minimum cross-flow I requires time $\Omega(I^{1/2})$.
Proof: Using the above notation, we choose L to be perpendicular to a diameter of the circuit, and we call $\omega$ the number of wires used by the I bits to cross L. Consider the wire closest to the middle of the chord L. There exists a datapath from an input port of $C_1$ to an output port of $C_2$ using this wire, and an elementary geometric argument based on the convexity of the circuit shows that its length is $\Omega(\omega)$.
It follows that the time T is $\Omega(\omega)$. Finally, since I bits
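For small functions, the quantity Z_1 used in the definition of the cross-flow can be computed by brute force. The sketch below evaluates log2 Z_1 for one fixed assignment of variables to the two halves; it does not perform the minimization over all bisections and all circuits that defines I. The function name `cross_flow_bits` and the two example functions are hypothetical.

```python
from itertools import product
from math import log2


def cross_flow_bits(f, n_in, u1, v2):
    """Return log2(Z_1) for a fixed bisection.

    f    : maps a tuple of n_in bits to a tuple of output bits
    u1   : indices of the input variables assigned to C_1
    v2   : indices of the output variables assigned to C_2
    Z_1 is the largest number of distinct V_2 outputs obtained by varying
    the U_1 inputs while the remaining inputs (U_2) are held fixed.
    """
    u2 = [i for i in range(n_in) if i not in u1]
    z1 = 0
    for fixed in product((0, 1), repeat=len(u2)):
        seen = set()
        for free in product((0, 1), repeat=len(u1)):
            x = [0] * n_in
            for i, b in zip(u2, fixed):
                x[i] = b
            for i, b in zip(u1, free):
                x[i] = b
            y = f(tuple(x))
            seen.add(tuple(y[j] for j in v2))
        z1 = max(z1, len(seen))
    return log2(z1)


def swap_halves(x):
    """4-bit example: the outputs are the inputs with the two halves exchanged."""
    return (x[2], x[3], x[0], x[1])


def and_all(x):
    """Single-output example: the AND of all input bits."""
    return (int(all(x)),)


if __name__ == "__main__":
    print(cross_flow_bits(swap_halves, 4, u1=[0, 1], v2=[2, 3]))
    print(cross_flow_bits(and_all, 4, u1=[0, 1], v2=[0]))
```

In the first example both input bits of C_1 must reach C_2, while in the second a single bit summarizing the C_1 inputs suffices.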

Sorting
The problem is now to sort N numbers, each of them being represented on k bits. We will assume that $k > 2\log_2 N$, implying in particular that all the numbers can be different.
Theorem 10: Any circuit sorting N k-bit numbers, with $k > 2\log_2 N$, has an area $A = \Omega(N)$ and takes time $T = \Omega((Nk)^{1/2})$.
Proof:
To prove the result on the area, we show that the circuit must be able to realize any permutation of the N lowest-order bits. It suffices to observe that if the N numbers deprived of their lowest-order bit can take N distinct values, any permutation on these N values will induce a permutation on the N lowest-order bits. Clearly, the condition on N and k ensures this property, which shows that sorting N numbers involves computing a transitive function of degree N, and establishes the result.
We will use the result of Lemma 9 to prove the result on the time. To do so, we must evaluate the minimum cross-flow I associated with sorting. We assume that each I/O port handles all the bits of a number and not only parts of the bits.
For simplicity, we first assume that it is possible to bisect the circuit by a line into two parts $C_1$ and $C_2$, so that each part will produce $N/2$ output numbers. Let $C_2$ be the part which receives more input variables. Let $a_1, \ldots, a_{N/2}$ be the ranks of the output variables of $C_2$ in the sorted order of the N input numbers. We extend the sequence to $a_0 = 0$ and $a_{N/2+1} = N+1$. Note that these ranks are independent of the input values. Letting $\{1, \ldots, F\}$ be the range of possible keys, we next define the sequence $b_0, \ldots, b_{N/2+1}$ by the recurrence relation $b_0 = 0$, $b_{i+1} = b_i + \alpha(a_{i+1} - a_i - 1)$, where $\alpha = [F/N]$. We can verify that all $b_i$'s lie in the key range. The next step is to assign all the $N_1$ input numbers of $C_1$ and any set of $(N/2 - N_1)$ input numbers in $C_2$ to $b_1, \ldots, b_{N/2}$. Now, we know that if we assign the remaining input numbers (all of which are in $C_2$) to any values such that for all $i$, $a_{i+1} - a_i - 1$ of them lie between $b_i$ and $b_{i+1}$, then the inputs will be mapped to the $N/2$ output ports of $C_1$. Therefore, the total number Q of output sequences obtainable in $C_1$ will give a lower bound on $2^I$.
To evaluate Q, we must count the number of ways to choose $a_{i+1} - a_i - 1$ numbers between $b_i$ and $b_{i+1}$. To avoid repetitions, it is easier to assume that these numbers are all distinct. Since $a_{i+1} - a_i - 1 = 0$ implies $b_i = b_{i+1}$, we have
$Q \ge \prod_i \binom{\alpha(a_{i+1} - a_i - 1) - 1}{a_{i+1} - a_i - 1}$ for all $i$ with $a_{i+1} - a_i - 1 \ne 0$, with $a_{N/2+1} = N$ and $a_0 = 0$.
Since $\sum_{0 \le i \le N/2} (a_{i+1} - a_i - 1) = N/2 - 1$, we derive
(1) $Q \ge \min \prod_i \binom{\alpha N_i - 1}{N_i}$ with $0 \le i \le m \le N/2$, $N_i \ne 0$ and $\sum_i N_i \ge N/2 - 1$.
Hence, $Q \ge (\alpha - 1)^{N/2 - 1}$, since $\binom{\alpha x - 1}{x} \ge (\alpha - 1)^x$ for any $x \ge 1$ and $\alpha \ge 2$.
It follows that $I = \Omega(N \log_2 \alpha)$ when $\alpha \ge 2$, and finally $I = \Omega(N(k - \log_2 N)) = \Omega(Nk)$, since $k > 2\log_2 N$.
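The binomial inequality used in the last step can be checked mechanically for small values. The sketch below writes the ratio a = [F/N] of the text as `alpha` and verifies that C(alpha·x − 1, x) ≥ (alpha − 1)^x over a range of integers. The helper `cross_flow_lower_bound` additionally assumes that the key range is F = 2^k and that the brackets denote the integer part; both are interpretations, not statements from the text.

```python
from math import comb, log2


def binomial_bound_holds(alpha: int, x: int) -> bool:
    """Check C(alpha*x - 1, x) >= (alpha - 1)**x for integers alpha >= 2, x >= 1."""
    return comb(alpha * x - 1, x) >= (alpha - 1) ** x


def cross_flow_lower_bound(n: int, k: int) -> float:
    """Evaluate (N/2 - 1) * log2(alpha - 1), the bound on log2(Q) from the text.

    Assumptions: keys range over 2**k values and alpha is the integer part
    of 2**k / N; requires alpha >= 2.
    """
    alpha = (2 ** k) // n
    if alpha < 2:
        raise ValueError("need alpha >= 2, i.e. k large enough relative to N")
    return (n / 2 - 1) * log2(alpha - 1)


if __name__ == "__main__":
    print(all(binomial_bound_holds(a, x)
              for a in range(2, 21) for x in range(1, 31)))
    print(cross_flow_lower_bound(16, 8))
```

The inequality follows from C(n, x) ≥ (n/x)^x with n = alpha·x − 1, since (alpha·x − 1)/x = alpha − 1/x ≥ alpha − 1.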
We can now generalize to the case where we cannot bisect the N output variables exactly. If M is the maximum number of keys passing through an output port, at worst we can only bisect the circuit so that $C_2$ produces $N/2 + M/2$ output numbers. In this case, the same reasoning leads to an inequality similar to (1), with $0 \le i \le m \le N/2 + M/2$, $N_i \ne 0$ and
If we consider the N/2 bits of F and G input last, we observe that they can be mapped into 3N/4 distinct positions. Therefore, there exists an input bit x of rank greater than N/2 which can be mapped onto an output bit y of rank less than N/4. Wlog, assume that it is a bit from F, and let G be a permutation matrix performing this mapping.
Only N/4 of the N/2 bits of F input before x can be output before y, which implies by a simple counting argument that the circuit must memorize at least N/4 bits, hence $A = \Omega(N)$.
When the $m^2$ elements of the input matrices are now k-bit integers ($N = m^2 k$), we can use a technique borrowed from J. Vuillemin [VU80] to reduce the problem to the previous one.
One matrix is a permutation matrix, except for the non- 
Another Model
Although we believe that our model is in a sense minimal, it is not clear at all whether it can be used to precisely describe the complexity of circuits in real technologies. In this section, we consider a more restrictive model tailored to the NMOS technology. Although it differs from the general model only by three additional assumptions, we can show that it is sufficiently strengthened to give way to stronger lower bounds.
These additional assumptions are justified in [CM81] by electrical considerations. The main constraint is that the electrical power is supplied through conducting wires, and that the density of current at any point of a wire is in practice always bounded by a constant. Thus it becomes impossible to supply enough power to certain circuit layouts, which of course changes the complexity of some problems a great deal.
In the future, this problem may be solved by using vertical wires to supply power, but even in a three-dimensional model, severe problems of power supply and heat dissipation will be difficult to avoid.
7.2. Transitive Functions
It comes as no surprise that since our second model adds physical constraints to the one in which Vuillemin derived his lower bounds, we can significantly improve upon his results. Before proceeding, we will establish a preliminary result.
Proof: It has been shown in [VU80] that the circuit must have the capability of memorizing N bits. Therefore Lemma 12 implies that the circuit must have two active gates $G_1$ and $G_2$ at a distance $\Omega(N)$ apart, hence $\Pi = \Omega(N)$. We can always assume that for some values of the inputs, information will be transmitted from $G_1$ to an output port $P_1$ (same with $G_2$ and an output port $P_2$). Consider now an arbitrary input port R. Since the function is transitive, there exists a path in the circuit from R to $P_1$ and from R to $P_2$. Among all possible computations, the four paths $G_1$-$P_1$, $G_2$-$P_2$, R-$P_1$, and R-$P_2$ will be actual datapaths at least once. From Lemma 1, it then follows that T is at least proportional to $\max\{G_1P_1, G_2P_2, RP_1, RP_2\}$. The sum of these four lengths is greater than $G_1G_2 = \Omega(N)$ as shown in Fig. 7-1, which completes the proof. []
The additional assumptions
Introducing the energy as a new parameter, we make the following assumptions:
• To switch a gate requires one unit of energy.
• Memorizing a bit requires a unit of energy per unit of time.
• The energy is supplied to the circuit through its (planar) boundary, and the density of energy at any point is bounded by a constant.
Remark: In this model, these lower bounds are tight for some problems; for example optimal circuits for performing integer multiplication, based on the Shift&Add scheme, can be found.
Addition
Theorem 14: In the NMOS model, the time T required to add two N-bit integers satisfies $T = \Omega(N^{2/3})$.
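The exponent 2/3 in Theorem 14 can be read off from three bounds that the proof following this statement relies on. The derivation below is a sketch that writes the perimeter of the convex hull of the active nodes as $\Pi$ and hides all constant factors inside the $\Omega$ notation; the exact form of each inequality is taken from the proof and is not re-derived here.

```latex
\begin{align*}
T \cdot P \cdot \Pi &= \Omega(N^2)
  && \text{(Theorem 5 with $A$ replaced by $\Pi$)} \\
T &\ge P, \qquad T \ge \Pi
  && \text{(up to constant factors)} \\
T^3 \;\ge\; T \cdot P \cdot \Pi &= \Omega(N^2) \\
T &= \Omega(N^{2/3})
\end{align*}
```

The same three-way balancing shows why the bound is stronger than the $\Omega(N^{1/2})$ of the general model: the stored information is now limited by a perimeter rather than by an area.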
Proof: We can prove this result with the same technique used in the general model. Keeping the same notation, we simply introduce $\Pi$ as the perimeter of the convex hull of all the active nodes. Since the circuit cannot store more than $\Pi$ bits at any time, the result of Theorem 5 is still valid if we replace A by $\Pi$, which gives $T \ge N^2/(P\Pi)$. On the other hand, $T \ge P$ for obvious reasons, and since as we will see $T \ge \Pi$, the result follows directly. To prove the last inequality, we consider the two active nodes $G_1$, $G_2$ which are furthest apart. There exists a datapath from an input port $P_1$ to $G_1$ (resp. $P_2$ to $G_2$). Since there is a fan-in between any pair of input variables, there exists a datapath from $P_1$ and $P_2$
Conclusions
We have proposed a new model of computation which claims to be more realistic than previous ones, yet tries to remain simple and general.
Our aim has been to gather minimal requirements with which any physical computing device should comply. We feel that the ultimate achievement of such endeavour would be to lay the foundations of a general theory of physical computability. Considerations of physical complexity should also be included, as we have tried to do in this paper on a small number of well-known problems.
Of course, casting problems in a framework of minimal models can only provide negative insights, since these models are primarily suited for establishing lower bounds. Another avenue of research concerns the study of technology-dependent models granting enough refinement to allow for à la Knuth analysis of circuits. We hate to think that each technology should bring along its own model, drastically different from the others. Yet it is still hard to evaluate the level of modelling sophistication required for reflecting reality faithfully. This is all the more acute as the analysis of actual circuits should not be only asymptotic, but should also apply to arbitrary sizes.
