compactness. However, there is at present essentially no theoretical basis for optimizing the overall organization of systems implemented in this technology.
The conventional complexity theory is inadequate because its measure of cost is the number of steps of a sequential machine. No account is taken of the size of the machine (and hence the time required for each step). Possible concurrency is ignored, thereby ruling out the most important potential contribution of the silicon technology. The traditional switching theory is also inadequate. While it provides a beautiful formalism for describing elementary logic functions, its optimization methods concern themselves with logical operations rather than communication requirements. Even in current integrated circuits, the wires required for communicating information across the chip account for most of the area, and driving these wires accounts for most of the time delay. In very large scale integrated systems, the situation becomes even more extreme.

In this paper, we describe a method by which the conceptual organization of a large chip can be analyzed, and a lower bound placed on its size and cycle time before a detailed design is undertaken. The results of this analysis suggest rather general guidelines for the organization of large integrated systems.
II. METRICS OF SPACE AND TIME

A. Physical Properties
Devices used to construct monolithic silicon integrated circuits are universally of the charge-controlled type. A charge Q placed on the control electrode (gate, base, etc.) results in a current I = Q/τ flowing through the device. The transit time τ is the time required for charge carriers to move through the active region of the device.
All times in an integrated system can be formulated as simple multiples of τ. For one transistor to drive another identical to it, a charge Q must flow through its active region, requiring time τ. If the capacitance C_L of the load being driven is K times the gate capacitance C_g of the driving transistor, a time Kτ = (C_L/C_g)τ is required.
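As a concrete illustration of this relation, the following Python sketch evaluates t = (C_L/C_g)τ; the transit time and capacitance values are assumptions chosen for illustration, not figures from the text.

```python
# Sketch of the charge-control delay model described above.
# The transit time tau and the capacitances are illustrative
# assumptions, not values taken from the paper.

TAU = 0.3e-9  # assumed transit time of a minimum-size transistor, seconds


def drive_delay(c_load: float, c_gate: float, tau: float = TAU) -> float:
    """Delay for a transistor with gate capacitance c_gate to
    drive a load capacitance c_load: t = (C_L / C_g) * tau."""
    return (c_load / c_gate) * tau


if __name__ == "__main__":
    c_g = 10e-15  # assumed gate capacitance, farads
    # Driving an identical transistor takes one transit time:
    print(drive_delay(c_g, c_g))          # ~tau
    # Driving a load 100x larger takes 100 transit times:
    print(drive_delay(100 * c_g, c_g))    # ~100 * tau
```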
B. Linear Versus Hierarchical Structures
In large integrated systems it is necessary to communicate information throughout the entire system. As an example, a bit of information stored on the gate of a minimum-size transistor in a random-access memory must be communicated to the memory bus of a CPU. Since there are many words of data in the memory, there are many possible sources for each wire in the memory bus. Fig. 1 illustrates two possible approaches to organizing such a bus. In the first approach, a transistor associated with each bit drives the bus wire directly. If the bus wire has a capacitance C_bus, the time required to drive the bus wire is t = τ(C_bus/C_g). (1)
In a typical computer memory, C_bus is many orders of magnitude larger than C_g, and the delay introduced by such a scheme is very long. Since C_bus is proportional to the length of the wire, it is also proportional to S, the number of driver transistors connected to the wire.
A second scheme is shown in Fig. 1(b). Here each transistor drives a wire only long enough to reach its neighbor. Each such wire is connected to the gate of a transistor twice as large as the transistor driving it. The arrangement is repeated upward until the top level, where all sources have a path to the bus. In this scheme the delay in driving the lowest level wire is 2τ (assuming the primary capacitance is due to the gate of the larger transistor). The delay introduced by the wires at each level is the same, since each driver transistor is twice as large as those driving it. Hence the delay in driving the bus line is 2τN, where N is the number of levels in the structure. Since there are S = 2^N transistors at the lowest level, the delay may be written t = 2τ log₂ S. (2)
Comparing (2) and (1), we see that for large S the delay has been made much shorter by using a hierarchical structure.
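The contrast between the two schemes can be made concrete with a small calculation. The sketch below assumes, as the text does, that the bus capacitance grows linearly with S, and expresses both delays in units of τ.

```python
import math

# Comparison of the two bus-driving schemes of Fig. 1, in units of
# the transit time tau. As in the text, the bus capacitance is
# assumed proportional to S, the number of drivers on the wire.


def linear_bus_delay(s: int) -> float:
    """Scheme (a): one minimum-size transistor drives the whole bus.
    C_bus/C_g is proportional to S, so the delay is ~S transit times."""
    return float(s)


def hierarchical_bus_delay(s: int) -> float:
    """Scheme (b): a binary tree of drivers, each twice the size of
    the one below it; delay is 2*tau per level over log2(S) levels."""
    return 2.0 * math.log2(s)


for s in (16, 1024, 65536):
    print(s, linear_bus_delay(s), hierarchical_bus_delay(s))
# For S = 65536 the hierarchy needs ~32 transit times instead of ~65536.
```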
C. A Cost Criterion
A hierarchy such as that shown in Fig. 1(b) may be built using any integral number a of transistors driving each wire. The driver transistors will in general be a times the size of those driving them. The delay for such a structure is t = aτ log_a S = τ(a/log a) log S. All system delays are thus proportional to τ log S, with a penalty factor a/log a dependent upon the branching ratio of the hierarchy. This delay is plotted in Fig. 2, normalized to its minimum value, which is attained at a = e.
While dramatic improvements in the performance of integrated structures can be achieved by a hierarchical organization, a penalty is always paid in the area required for wires. In the simple case shown, a bus requiring one wire when driven directly requires log_a S wires when organized as a hierarchy. For this reason it is not possible to optimize a design without a cost function involving both area and time. In this paper we will use the area-time product as our basic cost function. For the above simple example, the cost function is area · time = τ(log S)² a/(log a)². The cost is minimized for a = e² ≈ 7.4.
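The stated minima at a = e and a = e² are easy to confirm numerically. The following sketch evaluates both penalty factors over a grid; natural logarithms are assumed, as the stated minima require.

```python
import math

# Numerical check of the two penalty factors derived above (natural
# logarithms, matching the stated minima at a = e and a = e^2).


def delay_penalty(a: float) -> float:
    """Per-level delay penalty a / ln(a) of an a-ary driver hierarchy."""
    return a / math.log(a)


def area_time_penalty(a: float) -> float:
    """Penalty a / (ln a)^2 in the area-time product."""
    return a / math.log(a) ** 2


grid = [1.0 + 0.001 * k for k in range(500, 20000)]  # a from 1.5 to ~21
best_delay = min(grid, key=delay_penalty)
best_cost = min(grid, key=area_time_penalty)
print(best_delay)  # ~2.718 = e
print(best_cost)   # ~7.389 = e^2
```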
D. Hierarchical Computing Systems
The analysis given above suggests a very general structure for computing systems. Lowest level cells are grouped together into modules in such a way that a cells drive their outputs onto an output wire. Each output wire is connected to a driver transistor which is a times as large as those driving the wire. Modules are grouped in such a way that a of these module drivers are connected to an intermodule communication wire.
This wire in turn is connected to a driver transistor a² times as large as the lowest level transistors. This process is continued until the appropriate size system has been realized.
III. RANDOM-ACCESS MEMORY
In this section we discuss the cost and performance of a random-access memory (RAM) of S words of log S bits each. As the unit of length we employ the minimum distance between two conducting paths. For the unit of time we choose the time it takes a basic element to charge a wire of unit length plus another transistor like itself. One unit of time is thus slightly larger than the transit time of a transistor.
A. Organization of the RAM
We organize the RAM in a hierarchical fashion. The elements of level 0 are the bits themselves, each bit consisting of two crossing wires: a select wire and a data wire. When the select wire is signaled it puts its contents on the data wire. We group a² bits into an a × a square to form a module of level 1. If the width of an element (a bit) is b₀, the elements have to drive wires of length ab₀. A module on level 1 consists of an array of crossing select and data wires, constituting the a² bits of level 0, and some additional logic and wires at the side. We group again a² of these modules into a square to form a module of level 2, etc. Fig. 3 shows three levels of the hierarchy for a = 4.
To study the memory in more detail we look at a module of level i (Fig. 4). We describe how one extracts one of its a^(2i) bits. In order to select 1 bit of storage, 2i log a address wires are required. We run i log a of them, called the row address wires, vertically along the side of the module and the other i log a, called the column address wires, horizontally. Its a² submodules are organized into a rows of a submodules each. When the select wire of the module is asserted, log a of the row address wires are used by the decoder to select one of the a rows of submodules; the select wire running through that row is asserted. The other (i − 1) log a row address wires are run horizontally into each of the a rows of submodules, where they serve as column address wires for the submodules.
Of the i log a column address wires, (i − 1) log a are run vertically into each of the a columns of submodules, where they serve as row addresses. The other log a address wires are used by the multiplexor to select one of the a data wires coming out of the columns of submodules. The signal on the selected data wire is driven onto the data wire of the module itself.
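The address bookkeeping of the last two paragraphs can be captured in a short recursive sketch. The function below is a hypothetical model, not circuitry from the paper: at each level, log a row-address bits feed the decoder, log a column-address bits feed the multiplexor, and the leftover bits are passed down with their row and column roles exchanged, as the wires are routed above.

```python
# A sketch of the recursive bit selection described above, modeled on
# a 2-D array of bits. Note that the leftover row-address bits become
# the submodule's column address and vice versa, mirroring the wiring;
# the resulting address-to-position mapping is a bijection.

def select_bit(level: int, a: int, row_addr: int, col_addr: int,
               bits, row0: int = 0, col0: int = 0) -> int:
    """Extract one bit from a level-`level` module whose bits are
    stored in the 2-D sequence `bits`, rooted at (row0, col0)."""
    if level == 0:
        return bits[row0][col0]          # a single storage bit
    side = a ** (level - 1)              # submodule side length, in bits
    log_a = a.bit_length() - 1           # log2(a); a is assumed a power of 2
    low_mask = (1 << (level - 1) * log_a) - 1
    row_sel = row_addr >> (level - 1) * log_a  # decoder: pick 1 of a rows
    col_sel = col_addr >> (level - 1) * log_a  # multiplexor: 1 of a columns
    # Leftover row bits serve as the submodule's *column* address,
    # and leftover column bits as its *row* address.
    return select_bit(level - 1, a,
                      col_addr & low_mask, row_addr & low_mask, bits,
                      row0 + row_sel * side, col0 + col_sel * side)


if __name__ == "__main__":
    a, levels = 2, 2
    bits = [[10 * r + c for c in range(4)] for r in range(4)]
    seen = {select_bit(levels, a, r, c, bits)
            for r in range(4) for c in range(4)}
    print(len(seen))  # 16 distinct bits: every address reaches its own bit
```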
If we wish to have a memory of S words with N + 1 levels (level 0 through N) we choose N = log S/(2 log a), or S = a^(2N). This gives a hierarchical structure with S bits from which we can extract 1 bit at a time. If we want the word length to be log S we employ log S of these structures in parallel. To select one word we select 1 bit in each of the log S hierarchies.
B. Area of the RAM
For the width L_i of a module on level i, the organization above yields a recurrence relation: a module consists of an a × a array of submodules of width L_{i−1}, together with the address, select, and data wires running along and through it.
Solving the relation gives a closed-form expression (3) for the width L_i.
Rather than the width itself we are interested in the width per bit. In one direction, horizontal or vertical, module i has a^i bits; therefore, we compute L_i/a^i.
An interesting property of the width per bit, as expressed by (3), is that its limit for i → ∞ is finite.
This means that the width per bit L_i/a^i is bounded from above by (4), independent of the number of levels of a RAM. Expression (3) converges in an exponential fashion towards its limit.
For small values of i, (3) is already very close to (4). Therefore, we use (4) as the width per bit for a RAM; its square is then the area per bit. By dividing the area per bit by the bit area b₀² we obtain the area overhead factor for a RAM. Fig. 5 shows this quotient as a function of a for four different values of b₀. It gives the overhead factor in the area that is due to the wires. For a memory of 64K bits with N = 2, a should be 16. Expression (4) is then equal to b₀ + 0.6. This shows that in 2-level 64K dynamic MOS memories, for which b₀ lies between 1 and 2, roughly half of the area will be occupied by wires. One may wonder why we have not discussed the area that is consumed by the wires for power and ground. The reason is that these wires can be thought of as increasing only the width b₀ of each bit; they do this by an amount that is roughly independent of a, as is shown in the following analysis.
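The "roughly half" claim can be checked directly from the quoted value (4) ≈ b₀ + 0.6 at a = 16:

```python
# Quick check of the claim above, using the value (4) = b0 + 0.6
# quoted in the text for a = 16. The wire fraction is 1 minus the
# ratio of bit area to total area per bit.

for b0 in (1.0, 1.5, 2.0):
    width_per_bit = b0 + 0.6
    overhead = (width_per_bit / b0) ** 2   # total area / bit area
    wire_fraction = 1.0 - 1.0 / overhead
    print(b0, round(overhead, 2), round(wire_fraction, 2))
# b0 = 1 -> 61% of the area is wires;
# b0 = 2 -> 41%: "roughly half" in both cases.
```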
For simplicity we assume that the wires for power and ground run in opposite directions, say parallel to the data and select wires. We compute how much one of them contributes to the width of a module i. The width of a power or ground wire is proportional to the number of bits served by it. Let the width at the highest level be u; given S and the design of the lowest level memory cell, this parameter is easy to compute. The width of the wire in a module on level i is proportional to the current it must supply and is hence u(a^(2i)/a^(2N)).
In one direction, horizontal or vertical, there are a^N/a^i such modules. The total contribution of all modules on level i is thus u(a^i/a^N). Taking the sum of this expression for i = 0, 1, ···, N gives ua/(a − 1).
There are √S bits in one direction; the increase of the bit width due to power and ground is therefore ua/((a − 1)√S), which is roughly equal to u/√S.
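A quick numerical check of this geometric-series argument (with u = 1):

```python
# Summing the per-level power/ground contributions u * a**(i - N) for
# i = 0..N and comparing with the closed form u * a / (a - 1).

def power_wire_width(a: int, n_levels: int) -> float:
    return sum(a ** (i - n_levels) for i in range(n_levels + 1))

for a in (4, 8, 16):
    exact = power_wire_width(a, 10)
    closed = a / (a - 1)
    print(a, round(exact, 4), round(closed, 4))
# The sum is essentially a/(a-1): just over u, roughly independent of a.
```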
We are interested in the optimal choice of a, but to make that choice we will have to look at the access time, which also depends on a.
C. Access Time of the RAM
Each element of level 0 drives a wire of length ab₀ to reach the periphery of its module on level 1; this takes time ab₀. Each module on level 1 drives, in the same amount of time, a wire that is a times longer to reach the periphery of its module on level 2, etc. With N being the level of the highest module, the time required to extract 1 bit of storage adds up to ab₀N. We use this figure as the access time. For a RAM of S words, the access time is then ab₀ log S/(2 log a).
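In code, the access time ab₀·log S/(2 log a) looks as follows; the parameter values are illustrative assumptions.

```python
import math

# Access time of the RAM in the paper's units: t = a * b0 * N with
# N = log2(S) / (2 * log2(a)).

def ram_access_time(s_words: int, a: int, b0: float) -> float:
    n_levels = math.log2(s_words) / (2 * math.log2(a))
    return a * b0 * n_levels

# A 64K-bit structure with a = 16 has N = 2 levels, as in the text:
print(ram_access_time(64 * 1024, 16, 1.5))  # 16 * 1.5 * 2 = 48 units
```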
D. The Cost of the RAM
We take the product of the area and the access time as the cost function of the RAM. For a RAM of S words of log S bits each, the cost function is the product of the area, obtained from the width per bit (4), and the access time derived above. Fig. 6 shows this cost function as a function of a for different values of b₀. Notice that increasing the bit size b₀ decreases the optimal choice of a: static memories have larger bits than dynamic ones. For dynamic MOS memories the optimal choice for a lies between 8 and 16, for static MOS memories (b₀ ≈ 4) between 4 and 8. One may speculate that "smart memories," structures in which part of the processing task is distributed over the memory cells, will have small branching ratios and hence relatively deep hierarchies.
IV. CONTENT ADDRESSABLE MEMORY
The basic elements of the RAM were bits. The content addressable memory (CAM) is an example of a word-organized memory. We consider a "pure" CAM. It consists of words of w bits each. We access a word by applying w bits of data to the system. We assume that there is only one word in the memory with those contents; the address of that word is produced by the memory.
A. Organization of the CAM
The basic elements are the bits, each of width b₁. The bits do not constitute the modules of level 0. The modules on level 0 of the hierarchy consist of aw words of w bits each [see Fig. 7(b)]. The w data bits are run via parallel wires vertically through the module. Out of each word comes one horizontal match wire going to the right. A word asserts its match wire if each data bit received is equal to the corresponding bit stored. There are aw words in a module of level 0; the address of the matching word leaves the module via the log(aw) address wires.
The above organization of a module of level 0 has one defect. It would require the individual bits of storage to drive wires of length wb₁, which may be greater than the desired ab₁, to reach the address wires. In Section II, we discussed that this type of communication should be achieved by a hierarchy. We, therefore, organize the driving of the match wire by the w bits in a word in the same manner as shown in Section II.
Each word is chopped up into w/a subwords of a bits each [Fig. 7(a)]. Each of the w/a subwords sends a signal to a "match tree" which has a branching ratio of a and delivers, via log_a w levels, the logical product of its inputs. The top node of the match tree can drive a wire of length b₁·a^(log_a w) = b₁w, the length of a word in the memory. Therefore, the word itself can drive a wire of length ab₁w, and we may group together aw words into module 0 [Fig. 7(b)]. Notice that the module's length is roughly equal to a times its width. This will be true for modules on higher levels as well.
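The match tree is, in effect, an a-ary AND tree over the w bit comparisons. The following sketch models it functionally; the function name and word width are illustrative.

```python
# A sketch of the match-tree idea: the AND of w bit comparisons is
# formed by a tree with branching ratio a, so no gate has more than
# a inputs and no node drives a wire much longer than a times the
# one below it.

def matches(stored: list, data: list, a: int) -> bool:
    """True iff `stored` equals `data`, computed as an a-ary AND tree."""
    # Leaf level: per-bit comparisons.
    level = [s == d for s, d in zip(stored, data)]
    # Combine a signals at a time until one match signal remains.
    while len(level) > 1:
        level = [all(level[i:i + a]) for i in range(0, len(level), a)]
    return level[0]

word = [True, False, True, True] * 8            # a stored 32-bit word
print(matches(word, list(word), a=4))           # True
print(matches(word, [not b for b in word], a=4))  # False
```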
We now describe a module of level i (Fig. 8). It contains wa^(4i+1) words and consists of a⁴ submodules of level i − 1, grouped into a² rows of a² submodules each. Each such row contains, besides the a² submodules, w data wires to transport the data to each of the submodules and log(wa^(4i−1)) outgoing address wires to transport to the right the address of the matching word. Each submodule has wa^(4i−3) words, and, hence, one row contains wa^(4i−1) words, which explains the number of address wires. A module on level i has a² of these rows and thus requires log(wa^(4i+1)) outgoing address wires; they are placed to the right of the rows.
In the CAM we have a⁴ submodules per module, in the RAM only a². This is only an apparent difference. In the CAM, for simplicity, we have combined two steps in the hierarchy; we have maintained, however, our multiplication factor a for the wire lengths. L_{i−1}, the length of a module of level i − 1, is roughly equal to a times W_{i−1}, the width of a module of level i − 1. Therefore, module i − 1 can already drive wires of length aW_{i−1}. As a consequence, we can put a² submodules into one row, as this would only require the driving of wires of length a²W_{i−1} in each row. But then we can, and this is the second step, combine a² rows, as this would require the driving of wires of a length about a²L_{i−1}, which is roughly equal to a³W_{i−1}.
B. Area of the CAM
We compute the length and the width separately.
For the length L_i of a module on level i, we have the relation [cf. Figs. 7(b) and 8]
L_i = a²L_{i−1} + a²(w + log w + (4i − 1) log a),
each of the a² rows contributing its own length plus the w data wires and log(wa^(4i−1)) address wires running across it.
The solution to this recurrence relation is, to a good approximation,
L_i = a^(2i+1)w·b₁ + (w + log w + 3 log a)·a²(a^(2i) − 1)/(a² − 1).
A module on level i has wa^(2i+1) bits in the vertical direction. The length per bit, therefore, is L_i/(wa^(2i+1)). This has the following limit for i → ∞:
b₁ + a(w + log w + 3 log a)/(w(a² − 1)).   (6)
As in the case of the RAM, L_i/(wa^(2i+1)) is already very close to the limit for small values of i; the rate of convergence is again exponential. We use (6) as the length per bit of a CAM. For the width W_i of a module on level i we find a similar recurrence relation.
In the horizontal direction there are wa^(2i) bits. The width per bit W_i/(wa^(2i)) has a finite limit for i → ∞, expression (7).
We take the product of (6) and (7) as the area per bit.
By dividing the area per bit by the bit area b₁² we obtain the area overhead factor for a CAM. Fig. 9 shows this quotient for w = 32 as a function of a for different values of b₁.
If we compare Figs. 5 and 9, we notice that for small values of a the wires in the CAM cause less overhead in area than those in the RAM. For large values of a it is the RAM that enjoys a smaller overhead in area. For equal bit sizes, i.e., with b₀ = b₁, the area overhead factors for the RAM and the CAM are about equal at a = 8.
As in the RAM, we can compute by how much we should increase the bit width b₁ if we wish to take power and ground into account. Both power and ground give an increase of u·a²/(a² − 1) to the length and the width of the CAM. This is even closer to u than in the case of the RAM. If we wish to amortize this amount over the bits, the bit width b₁ should be incremented by
2u·a²/((a² − 1)√(Sw))
for a CAM of S words of w bits each.
C. Access Time of the CAM
For the access time we take the time required to extract the address of the matching word of data from a memory of S words. With the highest level being level N, we have S = wa^(4N+1), or N = (log S − log w)/(4 log a) − 1/4.
A word of storage has a response time of (log w/log a)·ab₁; for a module of level 0 this becomes [(log w/log a) + 1]ab₁. Each new level of the hierarchy multiplies the wire lengths by a factor a² and hence requires an additional time of 2ab₁. For N levels we find, hence, an access time of
[(log w/log a) + 1 + 2N]ab₁.   (8)
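Based on expression (8) as reconstructed above, the access time can be computed as follows; parameter values are illustrative assumptions.

```python
import math

# CAM access time following expression (8) above, in the paper's
# units: t = (log2(w)/log2(a) + 1 + 2*N) * a * b1, with the number
# of levels N determined by S = w * a**(4*N + 1).

def cam_access_time(s_words: int, w: int, a: int, b1: float) -> float:
    n = (math.log2(s_words) - math.log2(w)) / (4 * math.log2(a)) - 0.25
    return (math.log2(w) / math.log2(a) + 1 + 2 * n) * a * b1

# A 2M-word CAM of 32-bit words with a = 4 (N = 1.75 levels):
print(cam_access_time(s_words=2 ** 21, w=32, a=4, b1=4.0))  # 112 units
```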
D. The Cost of the CAM
We again take the product of the area and the access time as the cost function. For a CAM of S words of w bits each, formulae (6), (7), and (8) yield the cost function. Fig. 10 shows the cost function as a function of a for a CAM; the positions of the minima are practically independent of the choice of w, provided we choose w large enough, say w ≥ 16. A change in S will basically move the curves only up and down; it will not affect the positions of their minima. We notice again that increasing the bit size will decrease the optimal choice of a. Comparing Figs. 6 and 10, we see that content addressable memories should have smaller branching ratios than random-access memories. For b₁ = 4, which seems a reasonable figure, the optimal choice of a is 4.
V. CONCLUSION
We have presented a general method for analyzing the cost and performance of recursively defined VLSI structures. Parameters of any such structure may be optimized with respect to time, area, or some combination of the two. While we have chosen the area-time product, it is clear that some other choice may be appropriate for any given application.
The results of this study indicate that as more processing is available in each module at level zero, the optimal value of a will decrease. A system with a = 4 would seem to be appropriate for memories in which substantial processing is commingled with storage.
Very general arguments were used to generate the basic recursive structure. For that reason it appears that a very large fraction of VLSI computing structures will be designed in this way. We have discussed two examples, one in which the basic elements were bits of storage, and one with words of storage at the lowest level. They gave rise to rather different recursive structures. The way in which their area and time measures were established should make it clear how to apply these techniques to other recursively defined computing structures.
Abstract: Transmission of signals on large capacitance paths in a VLSI system may result in substantial degradation of the overall system performance. In this paper, minimization of the delay times associated with driving and sensing signals from large capacitance paths by optimizing the fan-out factor of the driver stages, the gain of the input sensing stages, and the path voltage swing is examined. Examples are presented of driving signals on a high capacitance path with two driving schemes, a push-pull depletion-load driver chain and a fixed driver, and of sensing signals with two sensing schemes, a single-ended depletion-load inverter input stage and a balanced regenerative strobed latch. We conclude that minimum delay time is achieved when the delay times of the successive stages of the driver chain, the high capacitance path, and the input sensing stage are comparable. In general, transmission time of signals in a system is minimized when the delay times of the different stages of the system are comparable.
I. INTRODUCTION
THE OVERALL PERFORMANCE of VLSI systems may be seriously degraded if signals need to be transmitted from one part to other parts in the system across large capacitance paths [1]. This large fan-out situation often occurs in the case of control drivers that are required to drive a large number of inputs to memory cells or logic-function blocks across a chip, or in the case of sensing stored information from small cells of large memory arrays. A similar and even more serious problem is driving wires which go off the silicon chip to other chips or input and output devices. In such cases, the
