Cost and Performance of VLSI Computing Structures by Mead, Carver A. & Rem, Martin
COST AND PERFORMANCE OF VLSI COMPUTING STRUCTURES 
by 
Carver A. Mead and Martin Rem 
Technical Report #1584 
(Ten Figures) 
April 11, 1978 
Computer Science
California Institute of Technology 
Pasadena, California 91125 
Silicon Structures Project 
sponsored by 
Burroughs Corporation, Digital Equipment Corporation, 
Hewlett-Packard Company, Honeywell Incorporated, 
International Business Machines Corporation, 
Intel Corporation, Xerox Corporation, 
and the National Science Foundation 
The material in this report is the property of Caltech, and is subject 
to patent and license agreements between Caltech and its sponsors. 
Copyright, California Institute of Technology, 1980 
Abstract 
Using VLSI technology, it will soon be possible to implement entire com-
puting systems on one monolithic Silicon chip. What will the nature of such 
systems be? How will they be designed? What will be their cost and performance? 
Conducting paths are required for communicating information throughout
any integrated system. The length and organization of these communication
paths place a lower bound on the area and time required for system operations.
Optimal designs can be achieved in only a few of the many alternative
structures. Two illustrative systems are analyzed in detail: a RAM-based system
and an associative system. It is shown that in each case an optimum design is
possible, using the area-time product as a cost function.
This work was supported in part by BMD under Contract No. DASG60-77-C-0097
and the Office of Naval Research No. N00014-16-C-0367. California Institute
of Technology, Computer Science Department Contribution No. 1584.
1. Introduction 
Silicon integrated circuit technology is evolving continuously
toward smaller elementary devices and denser, more complex functions on
each single silicon chip. It appears that new processing and lithographic
techniques will make possible the fabrication of chips containing $10^7$ or $10^8$
individual transistors. One such chip will contain more function than today's
largest computers. A large amount of effort has been put into fabrication
questions, and much more effort will be required to reach the practical
limits of device compactness. However, there is at present essentially no
theoretical basis for optimizing the overall organization of systems implemented
in this technology.
Conventional complexity theory is inadequate because its measure of
cost is the number of steps of a sequential machine. No account is taken of
the size of the machine (and hence the time required for each step). Possible
concurrency is ignored, thereby ruling out the most important potential
contribution of the silicon technology. Traditional switching theory is also
inadequate. While it provides a beautiful formalism for describing elementary
logic functions, its optimization methods concern themselves with logical
operations rather than communication requirements. Even in current integrated
circuits, the wires required for communicating information across the chip
account for most of the area, and driving these wires accounts for most of the
time delay. In very large scale integrated systems, the situation becomes
even more extreme. In this paper, we describe a method by which the conceptual
organization of a large chip can be analyzed, and a lower bound placed on its
size and cycle time before a detailed design is undertaken. The results of
this analysis suggest rather general guidelines for the organization of all
large integrated systems.
2. Metrics of Space and Time
2.1. Physical properties
Devices used to construct monolithic silicon integrated circuits are
universally of the charge controlled type. A charge $Q$ placed on the control
electrode (gate, base, etc.) results in a current $I = Q/\tau$ flowing through the
device. The transit time $\tau$ is the time required for charge carriers to move
through the active region of the device.
All times in an integrated system can be formulated as simple multiples
of $\tau$. For one transistor to drive another identical to it, a charge $Q$ must
flow through its active region, requiring time $\tau$. If the capacitance $C_L$ of
the load being driven is $K$ times the gate capacitance $C_g$ of the driving
transistor, a time $K\tau = (C_L/C_g)\,\tau$ is required.
2.2. Linear versus hierarchical structures
In large integrated systems it is necessary to communicate information
throughout the entire system. As an example, a bit of information stored on
the gate of a minimum size transistor in a random access memory must be
communicated to the memory bus of a CPU. Since there are many words of data in
the memory, there are many possible sources for each wire in the memory bus.
Fig. 1 illustrates two possible approaches to organizing such a bus. In the
first approach (Fig. 1a), a transistor associated with each bit drives the bus
wire directly. If the bus wire has a capacitance $C_w$, the time required to drive
the bus wire is $t = \tau\,C_w/C_g$. In a typical computer memory $C_w$ is many orders
of magnitude larger than $C_g$, and the delay introduced by such a scheme is
very long. Since $C_w$ is proportional to the length of the wire, it is also
proportional to $S$, the number of driver transistors connected to the wire:
$$t = \tau S \qquad (1)$$
A second scheme is shown in Fig. 1b. Here each transistor drives a
wire only long enough to reach its neighbor. Each such wire is connected
to the gate of a transistor twice as large as the transistor driving it.
The arrangement is repeated upward until the top level, where all sources have
a path to the bus. In this scheme the delay in driving the lowest level wire
is $2\tau$ (assuming the primary capacitance is due to the gate of the larger
transistor). The delay introduced by the wires at each level is the same,
since each driver transistor is twice as large as those driving it. Hence
the delay in driving the bus line is $2\tau N$, where $N$ is the number of levels
in the structure. Since there are $S = 2^N$ transistors at the lowest level,
the delay may be written
$$t = 2\tau \log_2 S \qquad (2)$$
Comparing (2) and (1), we see that for large $S$ the delay has been made much
shorter by using a hierarchical structure.
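As a rough numerical illustration of equations (1) and (2), the following Python sketch compares the two delays in units of the transit time. The function names and the default $\tau = 1$ are assumptions made only for this example.

```python
import math

def direct_bus_delay(num_sources, tau=1.0):
    """Equation (1): every source transistor drives the full bus wire."""
    return tau * num_sources

def binary_tree_delay(num_sources, tau=1.0):
    """Equation (2): 2*tau per level, with log2(S) levels of doubling drivers."""
    return 2.0 * tau * math.log2(num_sources)

if __name__ == "__main__":
    for s in (16, 1024, 65536):
        print(f"S={s:6d}  direct={direct_bus_delay(s):8.0f}  "
              f"tree={binary_tree_delay(s):5.0f}")
```

For a bus with 65536 sources the direct scheme needs 65536 transit times, while the binary tree needs 32.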
2.3. A cost criterion
A hierarchy such as that shown in Fig. 1b may be built using any integral
number $\alpha$ of transistors driving each wire. The driver transistors will in
general be $\alpha$ times the size of those driving them. The delay for such a
structure is $t = \alpha\tau \log_\alpha S = \tau\,\frac{\alpha}{\log\alpha}\,\log S$. All system delays are thus
proportional to $\tau \log S$, with a penalty factor $\alpha/\log\alpha$ dependent upon the
branching ratio of the hierarchy. This delay is plotted in Fig. 2, normalized
to its minimum value, which is attained at $\alpha = e$.
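The shape of the penalty factor can be checked numerically. The sketch below, with a hypothetical function name, evaluates $\alpha/\ln\alpha$ divided by its minimum value $e$; natural logarithms are used, which does not affect the normalized ratio.

```python
import math

def relative_delay(alpha):
    """Penalty factor alpha/ln(alpha), divided by its minimum value e."""
    return (alpha / math.log(alpha)) / math.e

if __name__ == "__main__":
    for alpha in (2, 3, 4, 8, 16, 100):
        print(alpha, round(relative_delay(alpha), 3))
```

Among integer branching ratios the smallest penalty occurs at 3, and the values at 2 and 4 coincide because $4/\ln 4 = 2/\ln 2$.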
While dramatic improvements in the performance of integrated structures
can be achieved by a hierarchical organization, a penalty is always paid in
the area required for wires. In the simple case shown, a bus requiring one
wire when driven directly requires $\log_\alpha S$ wires when organized as a hierarchy.
For this reason it is not possible to optimize a design without a cost
function involving both area and time. In this paper we will use the area-time
product as our basic cost function. For the above simple example, the cost
function is
$$\text{area} \times \text{time} = \tau\,(\log S)^2\,\frac{\alpha}{(\log\alpha)^2}$$
The cost is minimized for $\alpha = e^2 \approx 7.4$.
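The location of this minimum follows from setting the derivative of $\alpha/(\ln\alpha)^2$ to zero: the derivative is $(\ln\alpha - 2)/(\ln\alpha)^3$, which vanishes at $\ln\alpha = 2$. A small Python check, using illustrative names that are not part of the original analysis:

```python
import math

def area_time_factor(alpha):
    """The alpha-dependent part of the cost: alpha / (ln alpha)^2."""
    return alpha / math.log(alpha) ** 2

# Analytic minimum, where ln(alpha) = 2.
best_real = math.exp(2)

# Best choice when alpha is restricted to integers.
best_integer = min(range(2, 65), key=area_time_factor)

if __name__ == "__main__":
    print(round(best_real, 3), best_integer)
```

Restricted to integers, the factor is smallest at 7, with 8 very close behind.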
The analysis given above suggests a very general structure for computing
systems. Lowest level cells are grouped together into modules in such a way
that $\alpha$ cells drive their outputs onto an output wire. Each output wire is
connected to a driver transistor which is $\alpha$ times as large as those driving
the wire. Modules are grouped in such a way that $\alpha$ of those module drivers
are connected to an inter-module communication wire. This wire in turn is
connected to a driver transistor $\alpha^2$ times as large as the lowest level
transistors. This process is continued until the appropriate size system has
been realized.
3. Random Access Memory 
In this section we discuss the cost and performance of a random access
memory (RAM) of $S$ words of $\log S$ bits each. As the unit of length we employ
the minimum distance between two conducting paths. For the unit of time we choose
the time it takes a basic element to charge a wire of unit length plus another
transistor like itself. One unit of time is thus slightly larger than the
transit time of a transistor.
We organize the RAM in a hierarchical fashion. The elements of level
0 are the bits themselves, each bit consisting of two crossing wires: a
select wire and a data wire. When the select wire is signalled, the bit puts its
contents on the data wire. We group $\alpha^2$ bits into an $\alpha \times \alpha$ square to form
a module of level 1. If the width of an element (a bit) is $b_0$, the elements
have to drive wires of length $\alpha b_0$. A module on level 1 consists of an array
of crossing select and data wires, constituting the $\alpha^2$ bits of level 0, and
some additional logic and wires at the side. We again group $\alpha^2$ of these
modules into a square to form a module of level 2, etc. Fig. 3 shows three
levels of the hierarchy for $\alpha = 4$.
To study the memory in more detail we look at a module of level $i$ (Fig. 4).
We describe how one extracts one of its $\alpha^{2i}$ bits. In order to select one bit
of storage, $2i\log\alpha$ address wires are required. We run $i\log\alpha$ of them, called
the row address wires, vertically along the side of the module and the other
$i\log\alpha$, the column address wires, horizontally. Its $\alpha^2$ submodules are organized
into $\alpha$ rows of $\alpha$ submodules each. When the select wire of the module is
asserted, $\log\alpha$ of the row address wires are used, by the decoder, to select
one of the $\alpha$ rows of submodules; the select wire running through that row is
asserted. The other $(i-1)\log\alpha$ row address wires are run horizontally into
each of the $\alpha$ rows of submodules, where they serve as column address wires
for the submodules. Of the $i\log\alpha$ column address wires, $(i-1)\log\alpha$ are run
vertically into each of the $\alpha$ columns of submodules, where they serve as row
addresses. The other $\log\alpha$ address wires are used by the multiplexor to select
one of the $\alpha$ data wires coming out of the columns of submodules. The signal on
the selected data wire is driven onto the data wire of the module itself.
If we wish to have a memory of $S$ words with $N+1$ levels (level 0 through
$N$) we choose $N = \frac{\log S}{2\log\alpha}$, or $S = \alpha^{2N}$. This gives a hierarchical structure
with $S$ bits from which we can extract one bit at a time. If we want the word
length to be $\log S$ we employ $\log S$ of these structures in parallel: to select
one word we select one bit in each of the $\log S$ hierarchies.
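A short worked example of these counts, as a sketch under the assumption that all logarithms are taken to base 2 (one address wire per address bit). The function names are hypothetical.

```python
import math

def ram_levels(num_bits, alpha):
    """N = log S / (2 log alpha), so that S = alpha**(2*N)."""
    return math.log2(num_bits) / (2.0 * math.log2(alpha))

def address_wires(level, alpha):
    """A module of level i needs 2*i*log2(alpha) address wires."""
    return 2 * level * math.log2(alpha)

if __name__ == "__main__":
    n = ram_levels(65536, 16)
    print(n, address_wires(n, 16))
```

For a 64K-bit hierarchy with $\alpha = 16$ this gives $N = 2$, and the top-level module needs 16 address wires, which equals $\log S$ as expected.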
3.2. Area of the RAM
Fig. 4 allows us to compute the size of a RAM. Let $L_i$ denote the width
of a module of level $i$; then we have the following recurrence relation:
$$L_0 = b_0$$
$$L_i = i\log\alpha + 1 + \log\alpha + \alpha L_{i-1}$$
The solution to the above relation is
$$L_i = \alpha^i b_0 + \frac{\alpha^i - 1}{\alpha - 1} + \left(\frac{2\alpha^{i+1} - \alpha^i - 2\alpha + 1}{(\alpha-1)^2} - \frac{i}{\alpha-1}\right)\log\alpha$$
Rather than the width itself we are interested in the width per bit. In
one direction, horizontal or vertical, module $i$ has $\alpha^i$ bits; we therefore
compute $L_i/\alpha^i$:
$$\frac{L_i}{\alpha^i} = b_0 + \frac{1}{\alpha-1} + \frac{2\alpha-1}{(\alpha-1)^2}\log\alpha - \frac{1}{(\alpha-1)\alpha^i}\left[\left(\frac{\alpha}{\alpha-1} + 1 + i\right)\log\alpha + 1\right] \qquad (3)$$
An interesting property of the width per bit, as expressed by (3), is that its
limit for $i \to \infty$ is finite:
$$\lim_{i\to\infty} \frac{L_i}{\alpha^i} = b_0 + \frac{1}{\alpha-1} + \frac{2\alpha-1}{(\alpha-1)^2}\log\alpha \qquad (4)$$
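The recurrence, expression (3) and the limit (4) can be cross-checked numerically. The following is a minimal sketch assuming base-2 logarithms and the recurrence $L_0 = b_0$, $L_i = i\log\alpha + 1 + \log\alpha + \alpha L_{i-1}$; the function names are illustrative only.

```python
import math

def width_by_recurrence(level, alpha, b0):
    """Iterate L_0 = b0, L_i = i*log(alpha) + 1 + log(alpha) + alpha*L_(i-1)."""
    log_a = math.log2(alpha)
    width = b0
    for i in range(1, level + 1):
        width = i * log_a + 1 + log_a + alpha * width
    return width

def width_per_bit_limit(alpha, b0):
    """Expression (4): the limit of L_i / alpha**i."""
    log_a = math.log2(alpha)
    return b0 + 1 / (alpha - 1) + (2 * alpha - 1) / (alpha - 1) ** 2 * log_a

def width_per_bit(level, alpha, b0):
    """Expression (3): the limit minus an exponentially shrinking correction."""
    log_a = math.log2(alpha)
    tail = ((alpha / (alpha - 1) + 1 + level) * log_a + 1) / (
        (alpha - 1) * alpha ** level)
    return width_per_bit_limit(alpha, b0) - tail

if __name__ == "__main__":
    for i in range(5):
        print(i, width_by_recurrence(i, 4, 1.5) / 4 ** i, width_per_bit(i, 4, 1.5))
```

The two columns printed for each level agree, and both approach the value given by (4) within a few levels.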
This means that the width per bit $L_i/\alpha^i$ is bounded from above by (4), independent
of the number of levels of a RAM. Expression (3) converges in an exponential
fashion towards its limit: for small values of $i$, (3) is already very close
to (4). We, therefore, use (4) as the width per bit for a RAM; its square is
then the area per bit. By dividing the area per bit by the bit area $b_0^2$ we
obtain the total area per bit area for a RAM. Fig. 5 shows this quotient as
a function of $\alpha$ for four different values of $b_0$. It gives the overhead factor
in the area that is due to the wires. For a memory of 64K bits with $N = 2$, $\alpha$
should be 16. Expression (4) is then equal to $b_0 + 0.6$. This shows that
in 2-level 64K dynamic MOS memories, for which $b_0$ lies between 1 and 2, roughly
half of the area will be occupied by wires.
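To make the "roughly half" estimate concrete, the sketch below evaluates expression (4) for $\alpha = 16$ and reports the share of the per-bit area that is not occupied by the $b_0 \times b_0$ cell. Base-2 logarithms and the function names are assumptions of this example.

```python
import math

def width_per_bit_limit(alpha, b0):
    """Expression (4): limiting width per bit of the RAM hierarchy."""
    log_a = math.log2(alpha)
    return b0 + 1 / (alpha - 1) + (2 * alpha - 1) / (alpha - 1) ** 2 * log_a

def wire_area_fraction(alpha, b0):
    """Share of the per-bit area not taken by the b0 x b0 cell itself."""
    return 1.0 - (b0 / width_per_bit_limit(alpha, b0)) ** 2

if __name__ == "__main__":
    for b0 in (1.0, 1.5, 2.0):
        print(b0, round(wire_area_fraction(16, b0), 2))
```

With $b_0$ between 1 and 2 the wire share comes out at roughly 0.6 down to 0.4, which brackets one half.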
One may wonder why we have not discussed the area that is consumed by
the wires for power and ground. The reason for this is that these wires
can be thought of as increasing only the width $b_0$ of each bit; they do this
by an amount that is roughly independent of $\alpha$, as is shown in the following
analysis.
For simplicity we assume that the wires for power and ground run in
opposite directions, say parallel to the data and select wires. We compute
how much one of them contributes to the width of a module $i$. The width of
a power or ground wire is proportional to the number of bits served by it.
Let the width at the highest level be $u$; given $S$ and the design of the lowest
level memory cell, this parameter is easy to compute. The width of the wire in
a module on level $i$ is proportional to the current it must supply and is hence
$u\,\alpha^{2i}/\alpha^{2N}$. In one direction, horizontal or vertical, there
are $\alpha^{N-i}$ such modules. The total contribution of all modules on level $i$
is thus $u\,\alpha^{i}/\alpha^{N}$. Taking the sum of this expression for $i = 0, 1, \ldots, N$ yields
$$\frac{u}{\alpha^N}\,\frac{\alpha^{N+1} - 1}{\alpha - 1} \approx u\,\frac{\alpha}{\alpha - 1}$$
There are $\sqrt{S}$ bits in one direction; the increase of the bit width due to
power and ground is therefore
$$\frac{u}{\sqrt{S}}\,\frac{\alpha}{\alpha - 1},$$
which is roughly equal to $u/\sqrt{S}$.
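The geometric sum in this argument is easy to verify. The sketch below assumes that the per-level contribution is $u\,\alpha^i/\alpha^N$, as read from the derivation above; the function names are hypothetical.

```python
def power_width_by_sum(u, alpha, top_level):
    """Sum of u * alpha**i / alpha**N over the levels i = 0..N."""
    return sum(u * alpha ** i / alpha ** top_level for i in range(top_level + 1))

def power_width_closed_form(u, alpha, top_level):
    """Closed form of the same geometric sum."""
    return u * (alpha ** (top_level + 1) - 1) / (alpha ** top_level * (alpha - 1))

if __name__ == "__main__":
    print(power_width_by_sum(1.0, 16, 2), power_width_closed_form(1.0, 16, 2))
```

For $\alpha = 16$ and $N = 2$ both forms give about $1.066\,u$, just below the bound $u\,\alpha/(\alpha-1) \approx 1.067\,u$.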
We are interested in the optimal choice of $\alpha$, but to make that choice
we will also have to look at the access time, which depends on $\alpha$ as well.
3.3. Access time of the RAM
Each element of level 0 drives a wire of length $\alpha b_0$ to reach the periphery
of its module on level 1; this takes time $\alpha b_0$. Each module on level 1 drives,
in the same amount of time, a wire that is $\alpha$ times longer to reach the periphery
of its module on level 2, etc. With $N$ being the level of the highest module,
the time required to extract one bit of storage adds up to $\alpha b_0 N$. We use
this figure as the access time. For a RAM of $S$ words the access time is
then $\alpha b_0\,\frac{\log S}{2\log\alpha}$.
We take the product of the area and the access time as the cost function
of the RAM. A RAM of $S$ words of $\log S$ bits each has the following area-time
product:
$$\left(b_0 + \frac{1}{\alpha-1} + \frac{2\alpha-1}{(\alpha-1)^2}\log\alpha\right)^2 \frac{\alpha b_0}{2\log\alpha}\,S\log^2 S \qquad (5)$$
Fig. 6 shows (5), normalized with respect to $S\log^2 S$, as a function of $\alpha$ for
different values of $b_0$. One notices that for increasing bit sizes the branching
ratio of the hierarchy should decrease. Static memories should therefore
have a smaller $\alpha$ than dynamic ones. For dynamic MOS memories the optimal
choice for $\alpha$ lies between 8 and 16, for static MOS memories ($b_0 \approx 4$) between
4 and 8. One may speculate that "smart memories", structures in which part
of the processing task is distributed over the memory cells, will have small
branching ratios and hence relatively deep hierarchies.
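The trend described here can be reproduced by evaluating the normalized form of (5) directly. This is a sketch, not the computation behind Fig. 6: it assumes base-2 logarithms and restricts $\alpha$ to powers of two, and the function names are illustrative.

```python
import math

def ram_cost(alpha, b0):
    """Expression (5) divided by S log^2 S."""
    log_a = math.log2(alpha)
    width = b0 + 1 / (alpha - 1) + (2 * alpha - 1) / (alpha - 1) ** 2 * log_a
    return width ** 2 * alpha * b0 / (2 * log_a)

def best_alpha(b0, candidates=(2, 4, 8, 16, 32, 64, 128)):
    """Branching ratio with the lowest normalized area-time product."""
    return min(candidates, key=lambda alpha: ram_cost(alpha, b0))

if __name__ == "__main__":
    for b0 in (1.0, 2.0, 4.0):
        print(b0, best_alpha(b0), round(ram_cost(best_alpha(b0), b0), 2))
```

Under these assumptions the best power of two is 16 for $b_0 = 1$ and 8 for $b_0 = 2$; for $b_0 = 4$ the values at 4 and 8 are nearly equal, with 8 marginally lower.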
4. Content Addressable Memory 
The basic elements of the RAM were bits. The content addressable memory
(CAM) is an example of a word organized memory. We consider a "pure" CAM.
It consists of words of $w$ bits each. We access a word by applying $w$ bits of
data to the system. We assume that there is only one word in the memory with
that contents, and the address of that word is produced by the memory.
The basic elements are the bits, each of width $b_1$. The bits do not
constitute the modules of level 0: the modules on level 0 of the hierarchy
consist of $\alpha w$ words of $w$ bits each (see Fig. 7b). The $w$ data bits are run via
parallel wires vertically through the module. Out of each word comes one
horizontal match wire going to the right. A word asserts its match wire if
each data bit received is equal to the corresponding bit stored. There are
$\alpha w$ words in a module of level 0; the address of the matching word leaves the
module via the $\log \alpha w$ address wires.
The above organization of a module of level 0 has one defect: it would
require the individual bits of storage to drive wires of length $w b_1$, which
may be greater than the desired $\alpha b_1$, to reach the address wires. In section
2 we discussed that this type of communication should be achieved by a hierarchy.
We therefore organize the driving of the match wire by the $w$ bits in a word in
the same manner as shown in section 2.
Each word is chopped up into $w/\alpha$ subwords of $\alpha$ bits each (Fig. 7a). Each
of the $w/\alpha$ subwords sends a signal to a "match tree" which has a branching ratio
of $\alpha$ and delivers, via $\log_\alpha w$ levels, the logical product of its inputs. The
top node of the match tree can drive a wire of length $b_1 \alpha^{\log_\alpha w} = b_1 w$, the
length of a word in the memory. Therefore, the word itself can drive a wire
of length $b_1 \alpha w$ and we may group together $\alpha w$ words into module 0 (Fig. 7b).
Notice that the module's length is roughly equal to $\alpha$ times its width. This
will be true for modules on higher levels as well.
We now describe a module of level $i$ (Fig. 8). It contains $w\alpha^{4i+1}$ words
and consists of $\alpha^4$ submodules of level $i-1$, grouped into $\alpha^2$ rows of $\alpha^2$
submodules each. Each such row contains, besides the $\alpha^2$ submodules, $w$ data
wires to transport the data to each of the submodules and $\log w\alpha^{4i-1}$
outcoming address wires to transport to the right the address of the matching word.
Each submodule has $w\alpha^{4i-3}$ words and hence one row contains $w\alpha^{4i-1}$ words, which
explains the number of address wires. A module on level $i$ has $\alpha^2$ of these
rows and thus requires $\log w\alpha^{4i+1}$ outcoming address wires; they are placed to
the right of the rows.
In the CAM we have $\alpha^4$ submodules per module, in the RAM only $\alpha^2$. This
is only a seeming difference: in the CAM we have, for simplicity, combined
two steps in the hierarchy; we have however maintained our multiplication
factor $\alpha$ for the wire lengths. $L_{i-1}$, the length of a module of level $i-1$,
is roughly equal to $\alpha$ times $W_{i-1}$, the width of a module of level $i-1$.
Therefore, module $i-1$ can already drive wires of length $\alpha W_{i-1}$. As a consequence,
we can put $\alpha^2$ submodules into one row as this would only require the driving
of wires of length $\alpha^2 W_{i-1}$ in each row. But then we can, and this is the
second step, combine $\alpha^2$ rows as this would require the driving of wires of
length about $\alpha^2 L_{i-1}$, which is roughly equal to $\alpha^3 W_{i-1}$.
4.2. Area of the CAM 
We compute the length and the width separately. For the length $L_i$ of
a module on level $i$ we have the relation (cf. Figs. 7b and 8)
$$L_0 = \alpha w\left(b_1 + \frac{\log w}{\log\alpha}\right)$$
$$L_i = \alpha^2\left(w + L_{i-1} + \log w\alpha^{4i-1}\right)$$
The solution to this recurrence relation is
$$L_i = \alpha^{2i+1} w\left(b_1 + \frac{\log w}{\log\alpha}\right) + (w + \log w)\,\frac{\alpha^{2i+2} - \alpha^2}{\alpha^2 - 1} + \left(4\,\frac{\alpha^{2i+2} - \alpha^2}{(\alpha^2-1)^2} + \frac{3\alpha^{2i+2} - 4i\alpha^2 - 3\alpha^2}{\alpha^2 - 1}\right)\log\alpha$$
A module on level $i$ has $w\alpha^{2i+1}$ bits in the vertical direction. The length
per bit is therefore $L_i/(w\alpha^{2i+1})$. This has the following limit for $i \to \infty$:
$$b_1 + \frac{\log w}{\log\alpha} + \frac{\alpha(w + \log w + 3\log\alpha)}{w(\alpha^2-1)} + \frac{4\alpha\log\alpha}{w(\alpha^2-1)^2} \qquad (6)$$
As in the case of the RAM, $L_i/(w\alpha^{2i+1})$ is already very close to the limit for small
values of $i$; the rate of convergence is again exponential. We use (6) as the
length per bit of a CAM.
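The convergence claim can be checked by iterating the length recurrence and comparing the length per bit with (6). The sketch assumes base-2 logarithms and the recurrence $L_0 = \alpha w (b_1 + \log w/\log\alpha)$, $L_i = \alpha^2 (w + L_{i-1} + \log w\alpha^{4i-1})$; the function names are hypothetical.

```python
import math

def cam_length_by_recurrence(level, alpha, w, b1):
    """Iterate the length recurrence of a CAM module up to the given level."""
    log_a, log_w = math.log2(alpha), math.log2(w)
    length = alpha * w * (b1 + log_w / log_a)
    for i in range(1, level + 1):
        length = alpha ** 2 * (w + length + log_w + (4 * i - 1) * log_a)
    return length

def cam_length_per_bit_limit(alpha, w, b1):
    """Expression (6): limiting length per bit."""
    log_a, log_w = math.log2(alpha), math.log2(w)
    d = alpha ** 2 - 1
    return (b1 + log_w / log_a
            + alpha * (w + log_w + 3 * log_a) / (w * d)
            + 4 * alpha * log_a / (w * d ** 2))

if __name__ == "__main__":
    alpha, w, b1 = 4, 32, 4.0
    for i in range(4):
        per_bit = cam_length_by_recurrence(i, alpha, w, b1) / (w * alpha ** (2 * i + 1))
        print(i, round(per_bit, 5), round(cam_length_per_bit_limit(alpha, w, b1), 5))
```

With $\alpha = 4$, $w = 32$ and $b_1 = 4$ the length per bit starts at 6.5 for level 0 and approaches the limit of about 6.863 within two or three levels.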
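We find for the width $W_i$ of a module on level $i$ the following recurrence
relation (cf. Figs. 7b and 8):
$$W_0 = w\left(b_1 + \frac{1}{\alpha}\right) + \log\alpha w$$
$$W_i = \alpha^2 W_{i-1} + \log w\alpha^{4i+1}$$
Its solution is
$$W_i = \alpha^{2i} w\left(b_1 + \frac{1}{\alpha}\right) + \frac{\alpha^{2i+2} - 1}{\alpha^2 - 1}\log w + \left(\frac{\alpha^{2i+2} - 1}{\alpha^2 - 1} + 4\,\frac{\alpha^{2i+2} - (i+1)\alpha^2 + i}{(\alpha^2-1)^2}\right)\log\alpha$$
In the horizontal direction there are $w\alpha^{2i}$ bits. The width per bit $W_i/(w\alpha^{2i})$
has as its limit for $i \to \infty$
$$b_1 + \frac{1}{\alpha} + \frac{\alpha^2\log\alpha w}{w(\alpha^2-1)} + \frac{4\alpha^2\log\alpha}{w(\alpha^2-1)^2} \qquad (7)$$
We take the product of (6) and (7) as the area per bit.
By dividing the area per bit by the bit area $b_1^2$ we obtain the total area
per bit area for a CAM. Fig. 9 shows this quotient for $w = 32$ as a function
of $\alpha$ for different values of $b_1$.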
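The width limit (7) can be checked in the same way as the length. The recurrence used below, $W_0 = w(b_1 + 1/\alpha) + \log\alpha w$ and $W_i = \alpha^2 W_{i-1} + \log w\alpha^{4i+1}$, is an assumption of this sketch: it is the form suggested by Figs. 7b and 8 and it is consistent with (7). Logarithms are taken to base 2, and the function names are illustrative.

```python
import math

def cam_width_by_recurrence(level, alpha, w, b1):
    """Iterate the assumed width recurrence of a CAM module."""
    log_a, log_w = math.log2(alpha), math.log2(w)
    width = w * (b1 + 1 / alpha) + log_a + log_w
    for i in range(1, level + 1):
        width = alpha ** 2 * width + log_w + (4 * i + 1) * log_a
    return width

def cam_width_per_bit_limit(alpha, w, b1):
    """Expression (7): limiting width per bit."""
    log_a, log_w = math.log2(alpha), math.log2(w)
    d = alpha ** 2 - 1
    return (b1 + 1 / alpha
            + alpha ** 2 * (log_a + log_w) / (w * d)
            + 4 * alpha ** 2 * log_a / (w * d ** 2))

if __name__ == "__main__":
    alpha, w, b1 = 4, 32, 4.0
    for i in range(4):
        per_bit = cam_width_by_recurrence(i, alpha, w, b1) / (w * alpha ** (2 * i))
        print(i, round(per_bit, 5), round(cam_width_per_bit_limit(alpha, w, b1), 5))
```

For $\alpha = 4$, $w = 32$ and $b_1 = 4$ the width per bit converges to about 4.501.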
If we compare Figs. 5 and 9 we notice that for small values of $\alpha$ the
wires in the CAM cause less overhead in area than those in the RAM. For
large values of $\alpha$ it is the RAM that enjoys a smaller overhead in area. For
equal bit sizes, i.e. with $b_0 = b_1$, the area overhead factors for the RAM and
the CAM are about equal at $\alpha = 8$.
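This comparison can be reproduced from expressions (4), (6) and (7). The sketch below computes both overhead factors for equal bit sizes, assuming base-2 logarithms and $w = 32$; the function names are hypothetical.

```python
import math

def ram_overhead(alpha, b):
    """Square of expression (4), divided by the bit area b^2."""
    log_a = math.log2(alpha)
    width = b + 1 / (alpha - 1) + (2 * alpha - 1) / (alpha - 1) ** 2 * log_a
    return (width / b) ** 2

def cam_overhead(alpha, w, b):
    """Product of expressions (6) and (7), divided by the bit area b^2."""
    log_a, log_w = math.log2(alpha), math.log2(w)
    d = alpha ** 2 - 1
    length = (b + log_w / log_a + alpha * (w + log_w + 3 * log_a) / (w * d)
              + 4 * alpha * log_a / (w * d ** 2))
    width = (b + 1 / alpha + alpha ** 2 * (log_a + log_w) / (w * d)
             + 4 * alpha ** 2 * log_a / (w * d ** 2))
    return length * width / b ** 2

if __name__ == "__main__":
    for alpha in (2, 4, 8, 16, 32):
        print(alpha, round(ram_overhead(alpha, 2.0), 2),
              round(cam_overhead(alpha, 32, 2.0), 2))
```

For a bit size of 2 the CAM overhead is the smaller one at $\alpha = 2$, the two factors nearly coincide at $\alpha = 8$, and the RAM overhead is the smaller one at $\alpha = 32$.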
As in the RAM we can compute by how much we should increase the bit
width $b_1$ if we wish to take power and ground into account. We leave it as an
exercise for the readers to convince themselves that both power and ground give
an increase of $u\,\frac{\alpha^2}{\alpha^2 - 1}$ to the length and the width of the CAM. This is even
closer to $u$ than in the case of the RAM. If we wish to amortize this amount
over the bits, the bit width $b_1$ should be incremented by
$\frac{u}{\sqrt{Sw}}\,\frac{\alpha^2}{\alpha^2 - 1}$ for a CAM of $S$ words of $w$ bits each.
4.3. Access time of the CAM
For the access time we take the time required to extract the address of
the matching word of data from a memory of $S$ words. With the highest level
being level $N$ we have $S = w\alpha^{4N+1}$ or
$$N = \frac{\log S - \log w}{4\log\alpha} - \frac{1}{4}$$
A word of storage has a response time of $\frac{\log w}{\log\alpha}\,\alpha b_1$; for a module of
level 0 this becomes $\left(\frac{\log w}{\log\alpha} + 1\right)\alpha b_1$. Each new level of the hierarchy
multiplies the wire lengths by a factor $\alpha^2$ and hence requires an additional
time of $2\alpha b_1$. For $N$ levels we hence find
$$\text{access time} = \left(2N + \frac{\log w}{\log\alpha} + 1\right)\alpha b_1 = \left(\frac{\log S + \log w}{2\log\alpha} + \frac{1}{2}\right)\alpha b_1 \qquad (8)$$
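As a worked example of (8), the following sketch evaluates both forms of the access time for a CAM of 65536 words of 32 bits. Base-2 logarithms and the function names are assumptions of this example; note that $N$ need not come out as an integer.

```python
import math

def cam_levels(num_words, w, alpha):
    """N from S = w * alpha**(4N + 1); need not be an integer."""
    return (math.log2(num_words) - math.log2(w)) / (4 * math.log2(alpha)) - 0.25

def cam_access_time(num_words, w, alpha, b1):
    """Expression (8), first form, written in terms of N."""
    n = cam_levels(num_words, w, alpha)
    return (2 * n + math.log2(w) / math.log2(alpha) + 1) * alpha * b1

def cam_access_time_direct(num_words, w, alpha, b1):
    """Expression (8), second form, with N eliminated."""
    log_a = math.log2(alpha)
    return ((math.log2(num_words) + math.log2(w)) / (2 * log_a) + 0.5) * alpha * b1

if __name__ == "__main__":
    print(cam_levels(65536, 32, 4), cam_access_time(65536, 32, 4, 4.0))
```

With $\alpha = 4$ and $b_1 = 4$ this gives $N = 1.125$ and an access time of 92 time units, and the two forms of (8) agree.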
4.4. The cost of the CAM
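We again take the product of the area and the access time as the cost
function. For a CAM of $S$ words of $w$ bits each, formulae (6), (7) and (8) yield
the cost function
$$Sw \left(b_1 + \frac{\log w}{\log\alpha} + \frac{\alpha(w + \log w + 3\log\alpha)}{w(\alpha^2-1)} + \frac{4\alpha\log\alpha}{w(\alpha^2-1)^2}\right) \left(b_1 + \frac{1}{\alpha} + \frac{\alpha^2\log\alpha w}{w(\alpha^2-1)} + \frac{4\alpha^2\log\alpha}{w(\alpha^2-1)^2}\right) \left(\frac{\log S + \log w}{2\log\alpha} + \frac{1}{2}\right)\alpha b_1$$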
Fig. 10 shows the cost function as a function of $\alpha$ for a CAM of 65K words
of 32 bits each. The curves are fairly independent of the choice of $w$ provided
we choose $w$ great enough, say $w \ge 16$. A change in $S$ will basically move the
curves only up and down; it will not affect the positions of their minima.
We notice again that increasing the bit size will decrease the optimal
choice of $\alpha$. Comparing Figs. 6 and 10 we see that content addressable memories
should have smaller branching ratios than random access memories. For $b_1 = 4$,
which seems a reasonable figure, the optimal choice of $\alpha$ is 4.
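The position of the minimum can be reproduced by multiplying expressions (6), (7) and (8). The sketch below is not the computation behind Fig. 10: it assumes base-2 logarithms, evaluates the cost per bit (area per bit times access time), and restricts $\alpha$ to powers of two. The function names are hypothetical.

```python
import math

def cam_cost_per_bit(alpha, w, b1, num_words):
    """Product of (6), (7) and (8): area per bit times access time."""
    log_a, log_w = math.log2(alpha), math.log2(w)
    d = alpha ** 2 - 1
    length = (b1 + log_w / log_a + alpha * (w + log_w + 3 * log_a) / (w * d)
              + 4 * alpha * log_a / (w * d ** 2))
    width = (b1 + 1 / alpha + alpha ** 2 * (log_a + log_w) / (w * d)
             + 4 * alpha ** 2 * log_a / (w * d ** 2))
    time = ((math.log2(num_words) + log_w) / (2 * log_a) + 0.5) * alpha * b1
    return length * width * time

def best_cam_alpha(b1, w=32, num_words=65536, candidates=(2, 4, 8, 16, 32)):
    """Branching ratio with the lowest cost among the candidates."""
    return min(candidates, key=lambda a: cam_cost_per_bit(a, w, b1, num_words))

if __name__ == "__main__":
    for b1 in (1.0, 2.0, 4.0):
        print(b1, best_cam_alpha(b1))
```

Under these assumptions the best power of two is 4 for $b_1 = 4$ and 8 for $b_1 = 1$, in line with the observation that larger bits favor smaller branching ratios.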
5. Conclusion 
We have presented a general method for analyzing the cost and performance
of recursively defined VLSI structures. Parameters of any such structure may
be optimized with respect to time, area, or some combination of the two. While
we have chosen the area-time product, it is clear that some other choice may be
appropriate for any given application.
The results of this study indicate that as more processing is available in
each module at level zero, the optimal value of $\alpha$ will decrease. A system with
$\alpha = 4$ would seem to be appropriate for memories in which substantial processing
is commingled with storage.
Very general arguments were used to generate the basic recursive structure.
For that reason it appears that a very large fraction of VLSI computing
structures will be designed in this way. We have discussed two examples, one in
which the basic elements were bits of storage and one with words of storage
at the lowest level. They gave rise to rather different recursive structures.
The way in which their area and time measures were established should make it
clear how to apply these techniques to other recursively defined computing
structures.
Fig. 1a. A bus driven directly by memory cells
Fig. 1b. A bus driver tree
Fig. 2. Delay of a hierarchical structure as a function of alpha (vertical axis: relative delay; horizontal axis: alpha, from 1 to 100)
Fig. 3. Three levels of a memory hierarchy for $\alpha = 4$
Fig. 4. A RAM module of level i (i > 0). Labelled parts: row address, decoder, $\alpha^2$ submodules, multiplexor, column address, data. Labelled dimensions: $i\log\alpha$, $\log\alpha$, 1, $\alpha L_{i-1}$, $L_i$
Fig. 5. Total area per bit of a RAM as a function of alpha (vertical axis: total area / bit area; horizontal axis: alpha, from 2 to 128)
Fig. 6. Area-time product of a RAM as a function of alpha (vertical axis: area-time product; horizontal axis: alpha, from 2 to 128)
Fig. 7a. One word of storage in the CAM
Fig. 7b. A CAM module of level zero ($\alpha w$ words, with address out at the right)
Fig. 8. A CAM module of level i (i > 0). Labelled parts: $w$ data wires in, rows of $\alpha^2$ submodules, $\alpha^4$ submodules in total, $\log w\alpha^{4i-1}$ address wires per row, $\log w\alpha^{4i+1}$ address wires out
Fig. 9. Total area per bit as a function of alpha for a CAM with word length 32 (vertical axis: total area / bit area; horizontal axis: alpha, from 2 to 32)
Fig. 10. Area-time product for a CAM of 65K 32-bit words (vertical axis: area-time product; horizontal axis: alpha, from 2 to 32)
