ABSTRACT. The problem of performing multiplication of n-bit binary numbers on a chip is considered. Let A denote the chip area and T the time required to perform multiplication. By using a model of computation which is a realistic approximation to current and anticipated LSI or VLSI technology, it is shown that
Introduction
We are interested in the design of multipliers suitable for implementation in VLSI chips. The multiplication problem has been considered by several authors (see, e.g., [8, 10, 17, 19, 25, 27]). Much attention has been paid to the trade-off between time and the number of gates, but until recently little attention has been paid to the problem of connecting the gates in an economical and regular way to minimize chip area and design costs. In this paper we give lower and upper bounds on the area-time product for multiplication circuits, assuming a model of computation which is intended to approximate current and anticipated LSI or VLSI technology. Details of the model are given in Section 2.
The lower bound on $AT$, where $A$ is the chip area and $T$ the time to perform n-bit binary multiplication on the chip, is the special case $\alpha = 1/2$ of a more general lower bound
$AT^{2\alpha} = \Omega(n^{1+\alpha})$, (1.1)
which is valid for all $\alpha \in [0, 1]$. We establish this general result in Section 3. The case $\alpha = 1$ was established independently by Abelson and Andreae [1] using a more restrictive model than ours (see also [21]). In Section 4 we sketch a design for n-bit multiplication that gives the upper bound
$AT^{2\alpha} = O(n^{1+\alpha} \log^{1+2\alpha} n)$, (1.2)
for all $\alpha \ge 0$.$^1$ Thus the exponent $1 + \alpha$ of $n$ in (1.1) and (1.2) is tight for $\alpha \in [0, 1]$.
In [3] we give upper bounds on $A$ and $T$ for the problem of adding n-bit binary numbers. From (1.1) and the results of [3] we conclude in Section 5 that binary multiplication is harder than binary addition if the complexity measure is $AT^{2\alpha}$, for any $\alpha \ge 0$ (see also [7]).
The Computational Model and Basic Assumptions
We assume the existence of circuit elements or "gates" which compute a logical function of two inputs in constant time and occupy at least a constant minimum area. Gates are connected by wires which have constant minimum width (equivalently, the wires must be separated by at least some minimal spacing). Our measure of the cost of a design is the area rather than the number of gates required. This is an important difference between our model and earlier models of [4, 26] and others. For motivation and discussion of models similar to ours, see [12, 23] .
To prove the results of this paper, various subsets of the following assumptions A1 through A8 are used. Comments and justification are given following the statement of each assumption.
A1. The computation is performed in a convex planar region R of area A.
Because of heat-dissipation, packing, and testing requirements, a two-dimensional planar model is reasonable. The convexity assumption is not restrictive in the sense that almost all existing chips or useful modular designs do have convex boundaries for packaging or modularity reasons. (The convexity assumption can be removed for part of Theorem 3.1 below by using a different proof.)
A2. Wires have minimal width $\lambda > 0$.
$\lambda$ is assumed constant, but in applications of our results it will of course depend on the technology. We also assume R has width at least $\lambda$ in every direction.
A3. At most $\nu \ge 2$ wires can overlap (or intersect) at any point in R.
A chip may consist of $\nu$ layers. Wire crossings through different layers are allowed. In fact, transistors are typically formed by crossovers of wires. Since $\nu \ge 2$, the graph of wires (edges) and gates (nodes) need not be planar in a graph-theoretic sense.
$^1$ Log denotes log to the base 2 throughout.
A4. I/O ports each contain a $\lambda \times \lambda$ square and thus have area at least $\rho \ge \lambda^2$. An I/O port can be multiplexed to handle more than one input or output variable.
If R is a complete chip, $\rho$ will be large compared to $\lambda^2$. If R is only part of a chip and I/O is to other regions on the chip, $\rho$ could be of order $\lambda^2$. We do not require each input (or output) variable to appear in a distinct input (or output) port, as required in [23]. I/O ports may be multiplexed as they often are in practice.
A5. A bit requires minimal time $\tau > 0$ to propagate along a wire or to be transmitted through an I/O port. The time for one gate computation and an arbitrary fanout of the result is included in $\tau$.
Since dimensions are limited by the minimal wire width $\lambda$ and minimal gate area, a minimal propagation time is reasonable. We do not need to assume that the propagation time increases with the length of the wire. With the (small) sizes of chips we now have or anticipate, the propagation time, which is the time needed to charge or discharge a wire, is limited by the wire capacitance rather than the velocity of light. A longer wire will generally have a larger capacitance and thus require a larger driver to maintain constant propagation time, but the driver area need not exceed a fixed percentage of the wire area and so can be ignored if $\lambda$ is increased slightly; see [15]. Although it would be reasonable to assume bounded fanout, we do not need this assumption for proving lower bounds. When proving upper bounds, we do assume bounded fanout.
A6. The times and locations at which input and output bits are available are fixed and independent of the values of the input bits.
When proving upper bounds in Section 4, we further assume that if $a_i$ and $a_j$ are any two bits in an operand such that $a_i$ is more significant than $a_j$, then $a_i$ is not input to (or output from) the chip before $a_j$, but they are allowed to be input to (or output from) the chip in parallel.
A7. Storage for one bit of information takes area at least $\beta > 0$.
$\beta$ is typically several times larger than $\lambda^2$.
A8. Each input bit is available only once.
There is no free memory outside R. If the same input bit is required at different times, it must be stored within R, taking area at least $\beta$ (see A7).
Lower Bound Results
Let $p = p_{2n} \cdots p_1$ be the 2n-bit product of n-bit integers $a = a_n \cdots a_1$ and $b = b_n \cdots b_1$.
3.1 LOWER BOUNDS FOR SHIFTING CIRCUITS. When $b = 2^j$, $p$ is $a$ shifted $j$ bits to the left. Thus any multiplier circuit must also be a shifting circuit capable of performing j-bit shifts for all $0 \le j \le n - 1$.
(3.3) and $L$ is the perimeter of the chip.
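The reduction from multiplication to shifting used here can be checked directly. The following is a minimal sketch, not part of the original argument; the helper names `bits` and `shift_via_multiply` are hypothetical, and bits are indexed from 0 in the code whereas the text indexes them from 1.

```python
def bits(x, width):
    """Return the bits of x, least significant first, padded to `width`."""
    return [(x >> i) & 1 for i in range(width)]


def shift_via_multiply(a, j):
    """Use a multiplier as a shifter by choosing the operand b = 2**j."""
    b = 1 << j
    return a * b


n = 8
a = 0b10110101
for j in range(n):  # every shift amount 0 <= j <= n - 1
    p = shift_via_multiply(a, j)
    a_bits = bits(a, n)
    p_bits = bits(p, 2 * n)
    # Output bit i + j carries input bit i; all other output bits are 0.
    assert all(p_bits[i + j] == a_bits[i] for i in range(n))
    assert sum(p_bits) == sum(a_bits)
```

Because every shift amount from 0 to n - 1 is reachable by a suitable choice of b, any lower bound for shifting circuits carries over to multipliers.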
Before proving Theorem 3.1 we need two Lemmas. Table I indicates the $p_i$'s that take the value of $a_i$ under j-bit shifts for all $n - i \le j \le n - 1$. Note that in the table all the $p_i$'s belong to S, which is divided into two parts by the chord X. By (3.6), in the ith row of the table there are at most d of
For $i \in I$, the input port for $a_i$ or the output port for $p_{i+j}$ may intersect the chord X, although their representatives do not. Define $I' = \{i \mid i \in I$, and the chord X intersects the input port for $a_i$ or the output port for $p_{i+j}$, or both$\}$.
Then $I - I' = \{i \mid i \in \{d, d + 1, \ldots, n\}$, and the input port for $a_i$ and the output port for $p_{i+j}$ do not intersect X and they lie on different sides of X$\}$.
Consider the computation of the j-bit shift. Note that the j-bit shift, which maps $a_i$ to $p_{i+j}$ for $i = 1, \ldots, n$, is an identity mapping. Hence, before the shift is complete, at least $|I - I'|$ bits of information about $a_i$, $i \in I - I'$, must cross X for computing $p_{i+j}$, $i \in I - I'$, and at least $|I'|$ bits of information about $a_i$, $i \in I'$, must be input to or output from some I/O ports intersecting X for computing $p_{i+j}$, $i \in I'$. Suppose that the chord X is of length C. Then by assumptions A2-A4, at most $\nu C/\lambda$ wires or I/O ports cross X. Thus, by assumption A5, the time T to perform the j-bit shift must satisfy the inequality
where r = M/n. Since M outputs come through one output port, assumption A5 gives
First suppose $M < n$. Then at least one wire or one I/O port crosses X, and assumptions A2 and A4 give
Suppose on the other hand that $M = n$. By assumption A2, R has width at least $\lambda$ in every direction, so we can choose a chord that is of length $C \ge \lambda$ and is perpendicular to Y. By (3.5) and (3.8) with $r = 1$, we have which gives
Since any circuit that performs integer multiplications must also be able to perform shifts, (3.1) and (3.2) hold for any n-bit multiplication chip.
Result (3.2) can sometimes give useful lower bounds which are based on the I/O characteristics of a multiplication or shifting chip. If at one time the chip inputs or outputs a total of $z$ bits along its boundary, then by assumptions A3 and A4, $L \ge z\lambda/\nu$, and (3.2) gives $AT \ge K_2(\lambda z/\nu)n$. Thus for any multiplication scheme that accepts, say, $\Omega(n^{1/2})$ input bits simultaneously along the chip boundary, we know immediately that $AT = \Omega(n^{3/2})$ (cf. the multiplication scheme in Section 4).
Result (3.1) (with a smaller constant for $K_1$) could have been established by a proof parallel to that used by Thompson [23] for the discrete Fourier transform problem. In fact, using his result that relates the area of a graph to its minimum bisection width, one can derive (3.1) without the convexity assumption in A1. Our proof above represents a new approach that incorporates geometric properties of the chip boundary in the lower bound proof. We feel that the extra convexity assumption we make is not restrictive, since most existing chips do have convex boundaries for packaging reasons. Furthermore, we note that the convexity assumption is needed for establishing results such as (3.2) that relate $AT$ to the perimeter $L$. In [6], under a similar convexity assumption, tight lower bounds on the minimum area required to lay out complete binary (or t-ary) trees are obtained.
An interesting corollary of Theorem 3.1 is that lower bounds in (3.1) and (3.2) hold for chips that perform floating-point additions, for which shifts are needed to equalize exponents. This explains why the area-time requirements for floating-point addition are much higher than those for integer addition, as observed in practical implementations. (Charles Leiserson at CMU first pointed out to one of the authors the application of Theorem 3.1 to floating-point addition.)
3.2 A LOWER BOUND ON THE AREA FOR MULTIPLIER CIRCUITS. In Theorem 3.1 we gave lower bounds on $AT^2$ and $AT$ for shifting circuits. Now, using different techniques, we give a lower bound on $A$ for multiplier circuits.
(3.16), then for all $n \ge 1$,
Under assumptions A4 and A6-A8, any n-bit multiplication must
8(n) _> L [n-log(nln2)] 6(n)>_ ,(3.
8(n) _> -~
for all $n \ge 1$.
PROOF OF THEOREM 3.2. If $n = 1$, there is at least one output port, so $A \ge \rho$, and the result holds. Hence, suppose that $n \ge 2$.
Consider the state of the computation just before the last input bit(s) is accepted. Let $m$ be the number of input bits still to be accepted, so $1 \le m \le 2n$.
It is easy to show that there are some inputs $a$ and $b$ such that the output bits
Since a port can accept only one bit at a time, the last $m$ bits must be input through $m$ different ports; so assumption A4 gives
The following corollary of Theorem 3.3 seems worth stating separately, for $AT$ is often used as a complexity measure (see, e.g., [16]).
Upper Bound Results for Multiplication
It is easy to design practical n-bit multipliers with area $A = O(n)$ and time $T = O(n)$, so
$AT^{2\alpha} = O(n^{1+2\alpha})$. (4.1)
For example, the "serial pipeline multipliers" typically used in the implementation of digital filters and signal processors achieve these area and time bounds (see [9, 14]). In this section we sketch the design of a multiplier with $A = O(n \log n)$ and $T = O(n^{1/2} \log n)$, giving
$AT^{2\alpha} = O(n^{1+\alpha} \log^{1+2\alpha} n)$, (4.2)
which is asymptotically better than (4.1). The design uses the Convolution Theorem to compute the product of two integers in a complex way, and consequently its implementation appears to be difficult. Nevertheless, the design is theoretically interesting because it shows that the exponent $1 + \alpha$ of $n$ in Theorem 3.3 is tight. We do not know if there is any practical design having $AT^{2\alpha} = o(n^{1+2\alpha})$ for $\alpha \in [0, 1]$. Straightforward implementations of "fast" algorithms, for example, the Schönhage-Strassen algorithm [22] or the "3-2 reduction" algorithm [17, 25], seem to require area at least of order $n^2$.
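To get a feel for how the two upper bounds compare, the expressions inside the O-notation can be evaluated numerically. This is only an illustrative sketch: the hidden constants are unknown and are taken to be 1 here, and the function names are hypothetical.

```python
import math


def serial_bound(n, alpha):
    """Expression in (4.1): A = n, T = n, so A * T**(2*alpha) = n**(1 + 2*alpha)."""
    return n ** (1 + 2 * alpha)


def convolution_bound(n, alpha):
    """Expression for A = n log n, T = n**(1/2) log n (log to base 2)."""
    area = n * math.log2(n)
    time = math.sqrt(n) * math.log2(n)
    return area * time ** (2 * alpha)


n = 2 ** 20
for alpha in (0.0, 0.5, 1.0):
    ratio = serial_bound(n, alpha) / convolution_bound(n, alpha)
    print(f"alpha = {alpha}: serial / convolution = {ratio:.3g}")
```

With constants ignored, the convolution-based expression is smaller for large n whenever alpha > 0, because the serial expression has the extra factor n**alpha. At alpha = 0 the measure is the area alone, where the serial design's O(n) is smaller by a factor of log n.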
In the remainder of this section we assume that (a) $n = k^2$ is a perfect square, and (b) $a_j = b_j = 0$ if $j > n/2$.
(If not, n may be increased sufficiently without affecting the asymptotic results.) Let $p$ be the smallest prime of the form $nq + 1$, $q \ge 1$, and $F_p$ the finite field of integers mod $p$. It is known that $\log p = O(\log n)$ (see [13, 24]) and that $F_p$ has an nth root of unity $u$ (see [2]). Let $w = u^k$, so $w$ is a kth root of unity. Note that in any circuit n is fixed, so we are not concerned with the complexity of finding $p$, $u$, $w$, etc.; they will be encoded into the circuit. For facilitating arithmetic in $F_p$ we assume that a $2\lceil \log p \rceil$-bit approximation to $1/p$ is encoded into the circuit. In steps 1-5 below, all arithmetic is done in $F_p$. In steps 1-3 we compute the discrete Fourier transform $a'$ of $(a_1, \ldots, a_n)$ and $b'$ of $(b_1, \ldots, b_n)$ over $F_p$; that is,
$a'_{j+1} = \sum_{i=0}^{n-1} a_{i+1} u^{ij}$ for $j = 0, \ldots, n - 1$, etc.
In step 4 we multiply the Fourier transforms. In step 5 we take the inverse transform, and in step 6 the final result is computed.
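The arithmetic behind this outline can be illustrated in software. The sketch below is not the circuit design described in the steps that follow: it uses a naive O(n^2) transform instead of the systolic matrix decomposition, it assumes that u is a primitive nth root of unity, and all function names are hypothetical. It only shows that transforming the bit vectors over F_p, multiplying componentwise, and inverting the transform recovers the product.

```python
def is_prime(m):
    """Trial-division primality test (adequate for small illustrative n)."""
    if m < 2:
        return False
    d = 2
    while d * d <= m:
        if m % d == 0:
            return False
        d += 1
    return True


def prime_factors(m):
    """Set of distinct prime factors of m."""
    factors, d = set(), 2
    while d * d <= m:
        while m % d == 0:
            factors.add(d)
            m //= d
        d += 1
    if m > 1:
        factors.add(m)
    return factors


def find_prime_and_root(n):
    """Smallest prime p = n*q + 1 and an element u of order exactly n in F_p."""
    q = 1
    while not is_prime(n * q + 1):
        q += 1
    p = n * q + 1
    for g in range(2, p):
        u = pow(g, (p - 1) // n, p)  # u**n == 1 (mod p) by Fermat's theorem
        if all(pow(u, n // f, p) != 1 for f in prime_factors(n)):
            return p, u
    raise ValueError("no primitive nth root found")


def dft(x, u, p):
    """Naive discrete Fourier transform over F_p with root of unity u."""
    n = len(x)
    return [sum(x[i] * pow(u, i * j, p) for i in range(n)) % p for j in range(n)]


def multiply(a, b, n):
    """Multiply a and b, both below 2**(n//2), via the Convolution Theorem."""
    assert n >= 2 and a < (1 << (n // 2)) and b < (1 << (n // 2))
    p, u = find_prime_and_root(n)
    a_bits = [(a >> i) & 1 for i in range(n)]  # upper half is zero
    b_bits = [(b >> i) & 1 for i in range(n)]
    fc = [x * y % p for x, y in zip(dft(a_bits, u, p), dft(b_bits, u, p))]
    u_inv, n_inv = pow(u, p - 2, p), pow(n, p - 2, p)
    c = [v * n_inv % p for v in dft(fc, u_inv, p)]  # convolution coefficients
    return sum(ci << i for i, ci in enumerate(c))  # carry propagation
```

For example, `multiply(13, 11, 16)` uses p = 17 and returns 143. The zero upper halves of the operands ensure that the cyclic convolution of length n does not wrap around, and each coefficient is at most n/2, which is less than p, so reducing mod p loses no information.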
Step 1. Let $A$, $B$, $U$, and $W$ be k by k matrices with elements
$W_{ij} = w^{(i-1)(j-1)}$.

Perform k by k matrix multiplications to compute $A' = WA$ and $B' = WB$, using a "systolic array" [11]. All computations are performed in $F_p$, so each processing element of the systolic array needs to perform multiplication and addition in $F_p$. Using a serial pipeline multiplier and a serial adder, a multiplication and addition step in $F_p$ requires no more than area $O(\log p)$ and time $O(\log p)$. Thus, step 1 can be done with area $O(n \log n)$ and time $O(n^{1/2} \log n)$.
Step 2. Compute $A'' = A' \circ U$ and $B'' = B' \circ U$, where $\circ$ denotes componentwise multiplication.
Step 3.
Step 4. Compute $C'' = A'' \circ B''$.
Step 5.
Grouping the terms on the right-hand side into $k = n^{1/2}$ groups so that the $c_i$'s in each row of the matrix $C$ belong to one group, we obtain
Given that the $c_i$'s are outputs of the systolic array that computes the matrix $C$, all the $R_i$'s can be formed in area $O(n \log n)$ and time $O(n^{1/2} \log n)$, using the result of Theorem 5.1 of Section 5 regarding addition circuits. Thus the problem of computing $p_{2n}, \ldots, p_1$ has been reduced to the problem of summing $k = n^{1/2}$ terms in the right-hand side of eq. (4.3). Hence, the final step in the computation is
Step 6. Compute $p_{2n}, \ldots, p_1$ from the $R_i$'s. Note that each $R_i$ has at most $n^{1/2} + \log n$ bits. Using (4.3), the $p_i$'s can be computed, $n^{1/2}$ of them at a time, with an $(n^{1/2} + \log n)$-bit adder. This is depicted in Figure 1. At the end of the ith addition, the first $n^{1/2}$ low-order bits in the output are output as $p_{ik}, p_{ik-1}, \ldots, p_{(i-1)k+1}$, and the remaining bits in the output are fed back to the adder to be added to the next arriving $R$ term in the $(i + 1)$st addition. With the result of Theorem 5.1 one can easily see that all the $p_i$'s can be computed in area $O(n \log n)$ and time $O(n^{1/2} \log n)$.
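The feedback scheme of Step 6 can be sketched in a few lines. This is a software illustration only, under the assumption that the terms combine as the sum of R_i * 2^((i-1)k), so that consecutive terms are offset by k bit positions; the function name `assemble_product` is hypothetical and the list is indexed from 0.

```python
def assemble_product(R, k):
    """Combine terms R[0], R[1], ... into sum(R[i] << (i * k)).

    Each pass models one use of the adder: the k low-order bits of the
    sum are emitted as finished product bits, and the remaining
    high-order bits are fed back to be added to the next term.
    """
    mask = (1 << k) - 1
    feedback = 0
    product = 0
    for i, r in enumerate(R):
        total = r + feedback                  # one adder pass
        product |= (total & mask) << (i * k)  # emit k finished bits
        feedback = total >> k                 # high bits go back to the adder
    product |= feedback << (len(R) * k)       # flush the remaining bits
    return product


R = [5, 300, 7]
k = 4
assert assemble_product(R, k) == sum(r << (i * k) for i, r in enumerate(R))
```

Only one adder of modest width is needed, because once the k low-order bits of a pass have been emitted, no later term can change them.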
This completes our outline of the multiplier with area $A = O(n \log n)$ and time $T = O(n^{1/2} \log n)$.
For $\alpha \in [0, 1]$, the exponent $1 + 2\alpha$ of $\log n$ can be reduced by using a more complicated design than the one outlined above, but we do not know what its minimal value is. For $\alpha > 1$, a design based on the "3-2 reduction" algorithm gives $AT^{2\alpha} = O(n^2 \log^{\delta} n)$ for some $\delta > 0$, which is a better upper bound than (4.2).
Concluding Remarks
In [3] we demonstrate a regular layout for look-ahead adders, giving the following result.
Thus for any $\alpha \ge 0$, the area-time product for multiplication is asymptotically larger than that for addition. We can say that multiplication is harder than addition as far as the area-time complexity is concerned.
For binary division it is easy to deduce a lower bound of the same form as (3.21), using the method of [5], and an upper bound $AT^{2\alpha} = O(n^{1+\alpha} \log^{1+2\alpha} n)$, using Newton's method.
In Section 3 we derived lower bounds on $AT^{2\alpha}$, $\alpha \in [0, 1]$, for binary multiplication. Similar lower bounds on $AT^2$ have been obtained for computation of the discrete Fourier transform by Thompson [23], and for matrix multiplication by Savage [20]. It seems that area-time complexity is, in general, a useful measure for establishing the complexity hierarchy of many classes of problems because it captures important attributes of a computation such as time and space, as well as communication. One should expect that more results along this line will be obtained in the near future.
