The complexity of the Discrete Fourier Transform (DFT) is studied with respect to a new model of computation appropriate to VLSI technology. This model focuses on two key parameters, the amount of silicon area and time required to implement a DFT on a single chip. Lower bounds on area (A) and time (T) are related to the number of points (N) in the DFT: $AT^2 > N^2/16$. This inequality holds for any chip design based on any algorithm, and is nearly tight when $T = \Theta(N^{1/2})$ or $T = \Theta(\log N)$.
Introduction
The theory of computation is valid over a synthetic domain: its formal models have relevance only if they correspond to possible computational systems. Technological changes can affect the realm of possibility. In this light, it would be surprising if the "VLSI revolution" did not spawn new theoretical models. This paper is an attempt to show that interesting complexity results are available through the use of a "VLSI model of computation".
Two parameters are of overriding interest in a VLSI design: its speed and its size.
Speed can be handled with familiar complexity tools, that is, measuring time by counting elementary operations. Size in the VLSI world is best expressed as the total area of silicon used. This is quite a different metric from a count of "active elements", "gates", or "registers". It may be the case that most of the chip is devoted to connections between such active elements. A complexity theory for VLSI must thus concern itself with the layout of active elements in the plane, along with their interconnections.
C.D.Thompson
The VLSI model: area and time.
There is a natural unit of area for VLSI. Manufacturing and physical limitations give rise to a "minimum feature width", $\lambda$. This is the width of the narrowest wire, and $\lambda^2$ is approximately the area of the smallest transistor. The 64K RAM currently available has an area of about $10^5 \lambda^2$. Chips of $10^7$ or $10^8 \lambda^2$ may be possible [Mead 78].
The choice of a unit of time is slightly more problematical. Here, following [Mead 78], it will be taken as the length of time that it takes a signal to propagate along a wire, or on-chip interconnection. This propagation time can be made independent of the length of the wire, by fitting larger drivers to longer wires. Larger drivers of course occupy more area, but need never take more than 10% of the area of the wire they drive ($1 \lambda^2$ for a wire of length $10\lambda$, $10^4 \lambda^2$ for a $10^5 \lambda$ wire). By fudging $\lambda$ upwards by 5%, the area of the driver is thus absorbed into the area of its wire.
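The claim that a 5% increase in the feature width absorbs the driver can be checked with a line of arithmetic. The sketch below assumes a wire of unit width whose length is L feature widths (L is a symbol introduced here only for illustration):

```latex
\begin{align*}
\text{wire area} &= L\lambda^2, \qquad \text{driver area} \le 0.1\,L\lambda^2,\\
L\,(1.05\lambda)^2 &= 1.1025\,L\lambda^2 \;\ge\; L\lambda^2 + 0.1\,L\lambda^2 .
\end{align*}
```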
A full exposition of the VLSI model is deferred to Section 2.
The DFT.
The computational problem studied in this paper is the Discrete Fourier Transform (DFT). The DFT is defined over any commutative ring, but only finite rings will be considered here. Elements of infinite rings have no fixed-length representation, leading to grave computational difficulties. Approximate methods are beyond the scope of this paper.
A satisfactory ring does exist for VLSI, the integers modulo m. If $m = 2^k - 1$, ordinary fixed-point arithmetic on k-bit words will produce exact answers. An N-point DFT can be performed in this ring if N divides p-1 for each prime p dividing m [Bonneau 73].
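The divisibility condition quoted above is easy to test for a candidate modulus. The following is a minimal sketch, not taken from the paper: the helper names `prime_factors` and `dft_length_allowed` are hypothetical, and the moduli in the usage lines are illustrative choices.

```python
def prime_factors(m: int) -> set[int]:
    """Return the distinct prime factors of m (trial division, small m only)."""
    factors = set()
    p = 2
    while p * p <= m:
        while m % p == 0:
            factors.add(p)
            m //= p
        p += 1
    if m > 1:
        factors.add(m)  # whatever remains is itself prime
    return factors


def dft_length_allowed(n: int, m: int) -> bool:
    """True if n divides p - 1 for every prime p dividing m."""
    return all((p - 1) % n == 0 for p in prime_factors(m))


# m = 2**8 - 1 = 255 = 3 * 5 * 17, so n must divide 2, 4 and 16.
print(dft_length_allowed(2, 255))  # True
print(dft_length_allowed(4, 255))  # False, since 4 does not divide 3 - 1
```

For a prime modulus such as 127 the condition reduces to N dividing 126.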
Formally, the DFT is a matrix-vector multiplication, $Ax = y$. The input vector is x, the output vector is y, and A is an N by N matrix of constants, $A_{ij} = \omega^{ij}$ for $0 \le i, j < N$.
The constant $\omega$ must be a principal Nth root of unity. That is, it must satisfy $\omega^N = 1$ and $\sum_{j=0}^{N-1} \omega^{jk} = 0$ for $1 \le k < N$.
The complexion of the area-time tradeoff in the computation of the DFT may be expressed in two ways. Following [Mead 78], a minimum value may be found for some particular cost function, such as the product of area with time. Alternatively, one may seek a function of area and time that describes the performance of many "good" designs. The result of this paper is expressed in both of these ways.
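Returning to the matrix form of the DFT, a small exact computation over the integers modulo m may make the definition concrete. This is a sketch under stated assumptions: it uses the standard matrix entries (root of unity raised to the power i*j) and the usual conditions for a principal Nth root of unity, and the function names, the modulus 17 and the root 4 are illustrative choices rather than anything from the paper.

```python
def is_principal_root(w: int, n: int, m: int) -> bool:
    """Check w^n = 1 and sum_{j<n} w^(j*k) = 0 (mod m) for 1 <= k < n."""
    if pow(w, n, m) != 1:
        return False
    return all(sum(pow(w, j * k, m) for j in range(n)) % m == 0
               for k in range(1, n))


def dft_mod(x: list[int], w: int, m: int) -> list[int]:
    """Compute y = A x over the integers mod m, with A[i][j] = w^(i*j)."""
    n = len(x)
    return [sum(pow(w, i * j, m) * x[j] for j in range(n)) % m
            for i in range(n)]


# N = 4 divides 17 - 1, and 4 has order 4 modulo 17.
print(is_principal_root(4, 4, 17))   # True
print(is_principal_root(16, 4, 17))  # False: 16 has order 2
print(dft_mod([1, 2, 3, 4], 4, 17))  # [10, 7, 15, 6]
```

The direct matrix-vector product takes on the order of N^2 word operations; it is used here only because it mirrors the definition.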
For cost functions of the form $AT^x$ with $0 \le x \le 2$, any chip that performs an N point DFT costs at least $\Omega(N^{1+x/2})$. This minimum is nearly achieved on chips whose arithmetic units are connected in a mesh-type pattern.
The relation $AT^2 > N^2/16$ bounds the performance of any chip of area A that computes an N point DFT in time T. At least two designs come close to this limit: those with either a perfect shuffle or a mesh-type interconnection pattern.
Outline.
Section 2 details the VLSI model of computation. In Section 3, a graph-theoretic quantity is defined that will be used to derive lower bounds on area (in Section 4) and time (in Section 5) for chips that perform DFTs. Section 6 concludes with the main result, that $AT^2 > N^2/16$ for any chip with area A that computes an N point DFT in time T.
communication are called interconnections, or wires.
Words.
The basic chunk of information considered in this paper is a word. Words are in one-to-one correspondence with elements of the finite commutative ring over which the DFT is defined. To avoid unedifying detail, the word length (in bits) is treated as a constant in this paper.
Wires, units of length and time.
A wire has unit width and transmits a word from one end to the other in unit time. If the transmission is performed bit-serially, the unit of time is proportional to the word length in bits. If the transmission is word-parallel, the unit of length is proportional to word length.
PEs.
A PE contains at most one word of storage. If larger PEs are envisioned, they must be decomposed into word-sized PEs with connecting wires.
A PE may use words from any number of connecting wires to update its own word in any way, but it may take only a constant amount of time to do so. The functions performed by a PE are thus in $R^k \times R$, where R is the ring used to define the DFT.
A PE may output words onto any number of connecting wires, but may only output its own word or any of the words it received in the last time unit. There is thus no bandwidth limitation on PEs: they may act as many-to-many switches.
There are constants o and t such that no PE occupies more than o units of area nor takes more than t units of time to perform an update on its word.
Nexi.
Wires deliver words to and from a nexus associated with each PE. There is exactly one PE per nexus. Communication between a nexus and its PE is free, costing no area or time.
Each nexus is square in aspect, with side d if d wires connect to it. This ensures that there is more than enough edge length on the nexus to accommodate all connecting wires.
CALTECH CONFERENCE ON VLSI, January 1979
The square shape does entail a large area charge for a nexus of large degree, but in this case its associated PE could be very powerful. It would be permissible, for example, for a PE with degree N to act as a "big switch", permuting N words at a time.
Charging $\Theta(N^2)$ area allows enough room for a crosspoint; fancier switches with greater delay but less area may be built from small crosspoints. (If a PE with degree N is not a "big switch", it should be decomposed into smaller PEs with lesser degrees. For example, a fan-out of N can be achieved with a tree of $\Theta(N)$ constant-degree PEs for a total of $\Theta(N)$ area and $\Theta(\log N)$ delay.)
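The fan-out tree mentioned in the parenthesis can be counted directly. The sketch below is illustrative only: `fanout_tree` is a hypothetical helper, and it assumes each PE in the tree drives at most `degree` wires.

```python
def fanout_tree(n: int, degree: int = 2) -> tuple[int, int]:
    """Return (number of PEs, depth) of a tree that delivers one word to
    n destinations when each PE drives at most `degree` wires."""
    pes, depth, need = 0, 0, n
    while need > 1:
        need = -(-need // degree)  # ceiling division: PEs needed one level up
        pes += need
        depth += 1
    return pes, depth


print(fanout_tree(8))     # (7, 3)
print(fanout_tree(1024))  # (1023, 10)
```

With binary fan-out the PE count is N - 1 and the depth is the base-2 logarithm of N (rounded up), matching the linear area and logarithmic delay quoted above.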
Input, Output PEs.
An input PE initially contains one of the N elements of the vector that is to undergo Fourier transformation. There are N input PEs.
An output PE will eventually contain one of the N elements of the result of the DFT.
The N output PEs are not necessarily distinct from the input PEs.
Wire layout.
Wires are laid out on a grid with unit spacing. Restricting wires to run along grid lines assures that unit width is available for each line if the grid is physically realized with two layers of silicon. One layer is devoted to the "x" direction, one to the "y".
Wires may bend at grid corners. This corresponds to a connection between the two layers of silicon.
At grid corners, wires may cross at right angles with no effect on each other's signals or timing. This corresponds to insulating the two layers of silicon from each other.
There is one critical assumption built into the VLSI model, that the information about the N input words is initially localized. That is, each word is stored in a compact region (a PE) of the chip. This assumption is necessary to ensure that the DFT involves some computation, for otherwise one would consider the output words a legitimate initial encoding of the input vector. A similar argument can be made for requiring each of the output words to be stored in its own PE. Localization of the input and output PEs ensures that their nexi are also localized, so that there is indeed exactly one nexus for each PE.
The choice of the word as the basic unit of information is also defensible. Recall that wires of unit width transmit one word in unit time. Wires (and PEs) of smaller capacity are clearly conceivable, and should have fractional width or fractional delay to be true to VLSI implementation costs (for bit-parallel or bit-serial transmission, respectively).
The introduction of such fractional capacities would only obscure the results of this paper, not invalidate them. None of the proofs depend upon the integral nature of the degree of a nexus or of the information capacity of any wire. The description of the VLSI model of computation is now complete. It will be seen in the sequel that it is the pattern of interconnections, not the "programming" of PEs, that limits the speed or magnifies the area of a chip.
Interconnection patterns will be analyzed with the aid of the graph-theoretic definitions developed in the next section.
Minimal Bisection Width
The minimal bisection width of a graph is, informally, the number of cuts needed to slice it in half. In other words, it is the smallest number of edges whose removal disconnects one half of the vertices from the other.
For example, the minimal bisection width of a linear graph of N nodes is 1.
The minimal bisection width of a mesh of N nodes is $N^{1/2}$.
The minimal bisection width of a star of N nodes is N/2.
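The three examples above can be confirmed on small instances by exhaustive search. This is a sketch, not an efficient algorithm: the function `min_bisection_width` and the graph constructors are hypothetical helpers, and the search tries every balanced split, so it is only practical for a few dozen vertices at most.

```python
from itertools import combinations


def min_bisection_width(n: int, edges: list[tuple[int, int]]) -> int:
    """Smallest number of edges crossing any split of the n vertices into
    parts of size n // 2 and n - n // 2 (exhaustive search, small n only)."""
    best = len(edges)
    for half in combinations(range(n), n // 2):
        side = set(half)
        crossing = sum((u in side) != (v in side) for u, v in edges)
        best = min(best, crossing)
    return best


def path(n: int) -> list[tuple[int, int]]:
    return [(i, i + 1) for i in range(n - 1)]


def star(n: int) -> list[tuple[int, int]]:
    return [(0, i) for i in range(1, n)]  # vertex 0 is the hub


def mesh(side: int) -> list[tuple[int, int]]:
    def idx(r: int, c: int) -> int:
        return r * side + c
    horiz = [(idx(r, c), idx(r, c + 1)) for r in range(side) for c in range(side - 1)]
    vert = [(idx(r, c), idx(r + 1, c)) for r in range(side - 1) for c in range(side)]
    return horiz + vert


print(min_bisection_width(8, path(8)))    # 1
print(min_bisection_width(8, star(8)))    # 4, i.e. N/2
print(min_bisection_width(16, mesh(4)))   # 4, i.e. the square root of N
```

For an odd number of vertices the split sizes differ by one, which corresponds to the relaxed bisection requirement.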
The minimal bisection width of a graph with an odd number of nodes is defined by relaxing the bisection requirement slightly. A bisection of a graph of 2N+1 nodes splits it into two disconnected subgraphs of N and N+1 nodes. The minimal bisection width of the leaves in a binary tree is 1.
In general, it is difficult to compute minimal bisection widths: the problem is NP-complete, in fact [Garey 74]. Fortunately, it is enough to know that every graph has a set of edges that realizes its minimal bisection width.
The following sections will derive bounds on area and time for any VLSI design. A graph will be associated with each design, defining a minimal bisection width, w. Lower bounds of $w^2/4$ and $N/(2w)$ will be found on area and time respectively. Thus $AT^2 > N^2/16$.
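The final step is a direct substitution. As a sketch, writing the minimal bisection width as w and both bounds as weak inequalities (an assumption made here; the text states the combined result with a strict inequality), the bisection width cancels:

```latex
\begin{align*}
A \ge \frac{w^2}{4}, \qquad T \ge \frac{N}{2w}
\quad\Longrightarrow\quad
A\,T^2 \;\ge\; \frac{w^2}{4}\cdot\frac{N^2}{4w^2} \;=\; \frac{N^2}{16}.
\end{align*}
```

The product no longer depends on w, which is why the bound applies to every design regardless of its interconnection pattern.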
Area
The total area occupied by a VLSI design is the sum of the areas of its PEs, wires, and nexi. A lower bound on wire and nexus area is derived in this section. Inclusion of PE area can at most affect the result by a constant factor, since there is one nexus per PE, and PEs have area bounded by a constant.
Associate with each VLSI design the following graph, G. Each nexus is a vertex. Each wire connecting two nexi is an edge between corresponding vertices. Denote by I the subset of vertices that are nexi of the N "input PEs".
Let w be the minimal bisection width of I in G.
The line x = a is said to account for the w square units of area of wire and nexus that lie within 1/2 unit distance of it.
As noted in Section 2, a PE can take at most t units of time to update its word. The information used to perform this update must have taken at least one unit of time to reach the PE. Thus, within at most a factor of t, it is the routing of intermediate results, rather than the arithmetic performed on them, that limits the timing of VLSI designs.
The following theorem places a lower bound on the time required to compute a DFT.
Its proof is based on a consideration of the amount of information that must be transmitted during the course of any computation of a DFT.
This property of the DFT holds for any bisection of the elements of y. In particular, it holds for the bisection of the output PEs that realizes w, the minimum bisection width.
Choose $x_1$ to be the set of input PEs that are included with $y_1$ in the bisection of the output PEs. The computation of $y_1$ will require k words of information about $x_2$, if $\mathrm{Rank}(A_{12}) = k$. A similar argument holds for $y_2$, so that N/2 words of information must pass over the w wires that separate $x_1$ from $x_2$. It takes at least $N/(2w)$ time to pass N/2 words over w wires, hence the theorem.
Conclusion
Theorems 1 and 2 can be immediately combined to give the main result of the paper.
Since $0 \le x \le 2$, the second term increases with w while the first term decreases with w.
Clearly, $w = \Theta(N^{1/2})$ achieves the minimum value, hence the theorem.
From the proof of Theorem 4, it is clear that the optimal design has $w = \Theta(N^{1/2})$, which corresponds to a mesh-type interconnection pattern.
A similar analysis may be performed for other problems, including matrix multiplication, Gaussian elimination, transitive closure, sorting, and permutation [Thompson 80].
