Cost and Performance of VLSI Computing Structures by Mead, Carver A. & Rem, Martin
IEEE TRANSACTIONS ON ELECTRON DEVICES, VOL. ED-26, NO. 4, APRIL 1979 533
Cost and Performance of VLSI Computing Structures
CARVER A. MEAD AND MARTIN REM 
Abstract--Using VLSI technology, it will soon be possible to implement entire computing systems on one monolithic silicon chip. Conducting paths are required for communicating information throughout any integrated system. The length and organization of these communication paths place a lower bound on the area and time required for system operations. Optimal designs can be achieved in only a few of the many alternative structures. Two illustrative systems are analyzed in detail: a RAM-based system and an associative system. It is shown that in each case an optimum design is possible using the area-time product as a cost function.
Manuscript received September 18, 1978; revised January 10, 1979. This work was supported in part by BMD under Contract DASG60-77-C-0097, and the Office of Naval Research under Contract N00014-16-C-0367. (California Institute of Technology, Computer Science Department Contribution 1584.)
C. A. Mead is with the Department of Computer Science, California Institute of Technology, Pasadena, CA 91125.
M. Rem is with the Department of Computer Science, California Institute of Technology, Pasadena, CA 91125, on leave from the Department of Mathematics, Eindhoven University of Technology, Eindhoven, The Netherlands.
I. INTRODUCTION
THE SILICON integrated-circuit technology is evolving continuously toward smaller elementary devices and denser, more complex functions on each single silicon chip. It appears that new processing and lithographic techniques will make possible the fabrication of chips containing $10^7$ or $10^8$ individual transistors. One such chip will contain more function than today's largest computers. A large amount of effort has been put into fabrication questions, and much more effort will be required to reach the practical limits of device compactness. However, there is at present essentially no theoretical basis for optimizing the overall organization of systems implemented in this technology.
The conventional complexity theory is inadequate because 
its measure of cost is the  number of steps of a sequential  ma- 
chine. No account is taken of the size of the machine (and 
hence the time required for each step). Possible concurrency 
is ignored, thereby ruling out the most important potential 
contribution  of  the silicon technology.  The  traditional switch- 
ing theory is also inadequate. While it provides a beautiful 
formalism for describing elementary logic functions, its opti- 
mization methods concern themselves with logical operations 
rather  than  communication  requirements. Even in  current  in- 
tegrated  circuits,  the wires required  for  communicating  infor- 
mation across the  chip  account  for  most  of  the  area,  and driv- 
ing these wires accounts for most of the time delay. In very 
large scale integrated  systems,  the  situation becomes even 
more  extreme.  In this paper, we describe a method  by which 
the conceptual organization of a large chip can be analyzed, 
and a lower bound placed on its size and cycle time before a 
detailed design is undertaken.  The results of this analysis 
suggest rather general guidelines for  the organization of large 
integrated  systems. 
II. METRICS OF SPACE AND TIME
A. Physical Properties
Devices used to construct monolithic silicon integrated circuits are universally of the charge-controlled type. A charge $Q$ placed on the control electrode (gate, base, etc.) results in a current $I = Q/\tau$ flowing through the device. The transit time $\tau$ is the time required for charge carriers to move through the active region of the device.
All times in an integrated system can be formulated as simple multiples of $\tau$. For one transistor to drive another identical to it, a charge $Q$ must flow through its active region, requiring time $\tau$. If the capacitance $C_L$ of the load being driven is $K$ times the gate capacitance $C_g$ of the driving transistor, a time $K\tau = (C_L/C_g)\tau$ is required.
B. Linear Versus Hierarchical Structures
In large integrated systems it is necessary to communicate information throughout the entire system. As an example, a bit of information stored on the gate of a minimum size transistor in a random-access memory must be communicated to the memory bus of a CPU. Since there are many words of data in the memory, there are many possible sources for each wire in the memory bus. Fig. 1 illustrates two possible approaches to organizing such a bus. In the first approach, a transistor associated with each bit drives the bus wire directly. If the bus wire has a capacitance $C_w$, the time required to drive the bus wire is $t = \tau(C_w/C_g)$. In a typical computer memory, $C_w$ is many orders of magnitude larger than $C_g$ and the delay introduced by such a scheme is very long. Since $C_w$ is proportional to the length of the wire, it is also proportional to $S$, the number of driver transistors connected to the wire:
$$t = \tau S. \tag{1}$$
A second scheme is shown in Fig. 1(b). Here each transistor drives a wire only long enough to reach its neighbor. Each such wire is connected to the gate of a transistor twice as large as the transistor driving it. The arrangement is repeated upward until the top level, where all sources have a path to the bus. In this scheme the delay in driving the lowest level wire is $2\tau$ (assuming the primary capacitance is due to the gate of the larger transistor). The delay introduced by the wires at each level is the same, since each driver transistor is twice as large as those driving it. Hence the delay in driving the bus line is $2\tau N$, where $N$ is the number of levels in the structure. Since there are $S = 2^N$ transistors at the lowest level, the delay may be written
$$t = 2\tau \log_2 S. \tag{2}$$
Comparing (2) and (1), we see that for large $S$ the delay has been made much shorter by using a hierarchical structure.
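The comparison of (1) and (2) can be tabulated with a short Python sketch. The function names are illustrative and not from the paper, and the delays are expressed in units of the transit time (tau = 1 by default).

```python
import math

def direct_bus_delay(s: int, tau: float = 1.0) -> float:
    """Eq. (1): delay when each of s cells drives the bus wire directly."""
    return tau * s

def tree_bus_delay(s: int, tau: float = 1.0) -> float:
    """Eq. (2): binary driver tree, 2*tau per level over log2(s) levels."""
    return 2.0 * tau * math.log2(s)

if __name__ == "__main__":
    for s in (16, 1024, 65536):
        print(s, direct_bus_delay(s), tree_bus_delay(s))
```

For small S the two schemes are comparable; the gap grows quickly because one delay is linear in S and the other logarithmic.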
C. A Cost Criterion
A hierarchy such as that shown in Fig. 1(b) may be built using any integral number $\alpha$ of transistors driving each wire. The driver transistors will in general be $\alpha$ times the size of those driving them. The delay for such a structure is $t = \alpha\tau \log_\alpha S = \tau(\alpha/\log\alpha)\log S$. All system delays are thus proportional to $\tau \log S$, with a penalty factor $\alpha/\log\alpha$ dependent upon the branching ratio of the hierarchy. This delay is plotted in Fig. 2, normalized to its minimum value, which is attained at $\alpha = e$.
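As a quick numerical check of the penalty factor, the following Python sketch evaluates alpha / ln(alpha) and normalizes it to its value at alpha = e, which is the quantity Fig. 2 is described as plotting. Natural logarithms are used here for convenience; the function names are illustrative.

```python
import math

def penalty(alpha: float) -> float:
    """Branching-ratio penalty factor alpha / ln(alpha), valid for alpha > 1."""
    return alpha / math.log(alpha)

def normalized_delay(alpha: float) -> float:
    """Penalty relative to its minimum, which occurs at alpha = e."""
    return penalty(alpha) / penalty(math.e)

if __name__ == "__main__":
    for alpha in (2, math.e, 4, 10, 100):
        print(round(alpha, 3), round(normalized_delay(alpha), 3))
```

Since 4 / ln 4 = 2 / ln 2, the integer choices alpha = 2 and alpha = 4 give the same delay penalty, about 6 percent above the minimum.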
Fig. 1. (a) A bus driven directly by memory cells. (b) A bus driver tree.
Fig. 2. Delay of a hierarchical structure as a function of $\alpha$.
While dramatic improvements in the performance of integrated structures can be achieved by a hierarchical organization, a penalty is always paid in the area required for wires. In the simple case shown, a bus requiring one wire when driven directly requires $\log_\alpha S$ wires when organized as a hierarchy. For this reason it is not possible to optimize a design without a cost function involving both area and time. In this paper we will use the area-time product as our basic cost function. For the above simple example, the cost function is area $\cdot$ time $= \tau(\log S)^2\,\alpha/(\log\alpha)^2$. The cost is minimized for $\alpha = e^2 \approx 7.4$.
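The location of the area-time minimum can be confirmed numerically. This is a minimal Python sketch, assuming natural logarithms and a simple grid search; the names and the grid resolution are choices made here for illustration only.

```python
import math

def area_time_penalty(alpha: float) -> float:
    """Alpha-dependent factor of the area-time product: alpha / ln(alpha)**2."""
    return alpha / math.log(alpha) ** 2

# Grid search over 1.5 <= alpha < 21.5 in steps of 0.001.
grid = [1.5 + 0.001 * k for k in range(20000)]
best_alpha = min(grid, key=area_time_penalty)

if __name__ == "__main__":
    print(best_alpha, math.e ** 2)
```

Setting the derivative of alpha / (ln alpha)^2 to zero gives ln alpha = 2, which agrees with the grid result.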
D. Hierarchical Computing Systems 
The analysis given above suggests a very general structure for computing systems. Lowest level cells are grouped together into modules in such a way that $\alpha$ cells drive their outputs onto an output wire. Each output wire is connected to a driver transistor which is $\alpha$ times as large as those driving the wire. Modules are grouped in such a way that $\alpha$ of those module drivers are connected to an intermodule communication wire. This wire in turn is connected to a driver transistor $\alpha^2$ times as large as the lowest level transistors. This process is continued until the appropriate size system has been realized.
III. RANDOM-ACCESS MEMORY
In this section we discuss the cost and performance of a random-access memory (RAM) of $S$ words of $\log S$ bits each. As the unit of length we employ the minimum distance of two conducting paths. For the unit of time we choose the time it takes a basic element to charge a wire of unit length plus another transistor like itself. One unit of time is thus slightly larger than the transit time of a transistor.
A. Organization of the RAM
We organize the RAM in a hierarchical fashion. The elements of level 0 are the bits themselves, each bit consisting of two crossing wires: a select wire and a data wire. When the select wire is signaled, the bit puts its contents on the data wire. We group $\alpha^2$ bits into an $\alpha \times \alpha$ square to form a module of level 1. If the width of an element (a bit) is $b_0$, the elements have to drive wires of length $\alpha b_0$. A module on level 1 consists of an array of crossing select and data wires, constituting the $\alpha^2$ bits of level 0, and some additional logic and wires at the side. We group again $\alpha^2$ of these modules into a square to form a module of level 2, etc. Fig. 3 shows three levels of the hierarchy for $\alpha = 4$.
To study the memory in more detail we look at a module of level $i$ (Fig. 4). We describe how one extracts one of its $\alpha^{2i}$ bits. In order to select 1 bit of storage, $2i \log\alpha$ address wires are required. We run $i \log\alpha$ of them, called the row address wires, vertically along the side of the module and the other $i \log\alpha$, the column address wires, horizontally. Its $\alpha^2$ submodules are organized into $\alpha$ rows of $\alpha$ submodules each. When the select wire of the module is asserted, $\log\alpha$ of the row address wires are used by the decoder to select one of the $\alpha$ rows of submodules; the select wire running through that row is asserted. The other $(i - 1)\log\alpha$ row address wires are run horizontally into each of the $\alpha$ rows of submodules, where they serve as column address wires for the submodules. Of the $i \log\alpha$ column address wires, $(i - 1)\log\alpha$ are run vertically into each of the $\alpha$ columns of submodules, where they serve as row addresses. The other $\log\alpha$ address wires are used by the multiplexor to select one of the $\alpha$ data wires coming out of the columns of submodules. The signal on the selected data wire is driven onto the data wire of the module itself.
If we wish to have a memory of $S$ words with $N + 1$ levels (level 0 through $N$) we choose $N = \log S/(2 \log\alpha)$, or $S = \alpha^{2N}$. This gives a hierarchical structure with $S$ bits from which we can extract 1 bit at a time. If we want the word length to be $\log S$ we employ $\log S$ of these structures in parallel. To select one word we select 1 bit in each of the $\log S$ hierarchies.
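The sizing rule above can be illustrated with a small Python sketch. It assumes base-2 logarithms for counting address wires (one wire per address bit); the helper names are hypothetical.

```python
import math

def ram_levels(s: int, alpha: int) -> int:
    """Return N such that alpha**(2*N) == s; raise if no such N exists."""
    n, size = 0, 1
    while size < s:
        size *= alpha * alpha  # each level groups alpha**2 submodules
        n += 1
    if size != s:
        raise ValueError("s is not a power of alpha**2")
    return n

def address_wires(level: int, alpha: int) -> float:
    """2*i*log2(alpha) address wires select one bit in a level-i module."""
    return 2 * level * math.log2(alpha)

if __name__ == "__main__":
    n = ram_levels(65536, 16)
    print(n, address_wires(n, 16))
```

For S = 65536 the same memory can be built with N = 2 (alpha = 16), N = 4 (alpha = 4), or N = 8 (alpha = 2); in each case the top-level module needs 16 address wires.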
B. Area of the RAM 
Fig. 3. Three levels of a memory hierarchy for $\alpha = 4$.
Fig. 4 allows us to compute the size of a RAM. Let $L_i$ denote the width of a module of level $i$; then we have the following recurrence relation:
$$L_0 = b_0, \qquad L_i = i \log\alpha + 1 + \log\alpha + \alpha L_{i-1}.$$
The solution to the above relation is
$$L_i = \alpha^i b_0 + \frac{\alpha^i - 1}{\alpha - 1} + \left(\frac{2\alpha^{i+1} - \alpha^i - \alpha}{(\alpha - 1)^2} - \frac{i + 1}{\alpha - 1}\right)\log\alpha.$$
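The closed-form solution can be checked against the recurrence $L_0 = b_0$, $L_i = i\log\alpha + 1 + \log\alpha + \alpha L_{i-1}$ by direct iteration. The Python sketch below assumes base-2 logarithms (any fixed base gives the same agreement); the function names are illustrative.

```python
import math

def width_recurrence(i: int, alpha: int, b0: float) -> float:
    """Iterate L_0 = b0, L_k = k*log(a) + 1 + log(a) + a*L_{k-1} up to k = i."""
    log_a = math.log2(alpha)
    width = b0
    for level in range(1, i + 1):
        width = level * log_a + 1 + log_a + alpha * width
    return width

def width_closed_form(i: int, alpha: int, b0: float) -> float:
    """Closed-form module width for level i."""
    a, log_a = alpha, math.log2(alpha)
    geometric = (a**i - 1) / (a - 1)
    weighted = (2 * a**(i + 1) - a**i - a) / (a - 1) ** 2 - (i + 1) / (a - 1)
    return a**i * b0 + geometric + weighted * log_a

if __name__ == "__main__":
    for i in range(4):
        print(i, width_recurrence(i, 4, 1.5), width_closed_form(i, 4, 1.5))
```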
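Rather than the width itself we are interested in the width per bit. In one direction, horizontal or vertical, module $i$ has $\alpha^i$ bits; therefore, we compute $L_i/\alpha^i$:
$$\frac{L_i}{\alpha^i} = b_0 + \frac{1 - \alpha^{-i}}{\alpha - 1} + \left(\frac{2\alpha - 1 - \alpha^{1-i}}{(\alpha - 1)^2} - \frac{i + 1}{\alpha^i(\alpha - 1)}\right)\log\alpha. \tag{3}$$
An interesting property of the width per bit, as expressed by (3), is that its limit for $i \to \infty$ is finite:
$$\lim_{i \to \infty} \frac{L_i}{\alpha^i} = b_0 + \frac{1}{\alpha - 1} + \frac{2\alpha - 1}{(\alpha - 1)^2}\log\alpha. \tag{4}$$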
This means that the width per bit $L_i/\alpha^i$ is bounded from above by (4), independent of the number of levels of a RAM. Expression (3) converges in an exponential fashion towards its limit. For small values of $i$, (3) is already very close to (4). Therefore, we use (4) as the width per bit for a RAM; its square is then the area per bit. By dividing the area per bit by the bit area $b_0^2$ we obtain the total area per bit area for a RAM. Fig. 5 shows this quotient as a function of $\alpha$ for four different values of $b_0$. It gives the overhead factor in the area that is due to the wires. For a memory of 64K bits with $N = 2$, $\alpha$ should be 16. Expression (4) is then equal to $b_0 + 0.6$. This shows that in 2-level 64K dynamic MOS memories, for which $b_0$ lies between 1 and 2, roughly half of the area will be occupied by wires.
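The 64K example can be reproduced with a few lines of Python. The sketch assumes base-2 logarithms, which is the base that yields the quoted value of about 0.6 for alpha = 16; the function names are illustrative.

```python
import math

def width_per_bit_limit(alpha: int, b0: float) -> float:
    """Expression (4): limiting width per bit of the RAM hierarchy."""
    log_a = math.log2(alpha)
    return b0 + 1 / (alpha - 1) + (2 * alpha - 1) / (alpha - 1) ** 2 * log_a

def area_overhead(alpha: int, b0: float) -> float:
    """Total area per bit divided by the bare bit area b0**2."""
    return (width_per_bit_limit(alpha, b0) / b0) ** 2

if __name__ == "__main__":
    print(width_per_bit_limit(16, 0.0))   # wiring contribution alone
    for b0 in (1.0, 1.5, 2.0):
        print(b0, area_overhead(16, b0))
```

With b0 = 1.5 the overhead factor comes out close to 2, which matches the statement that roughly half the area is wiring.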
Fig. 4. A RAM module of level $i$ ($i > 0$), showing the submodules, the multiplexor, the column address wires, and the data wire.
Fig. 5. Total area per bit of a RAM as a function of $\alpha$.
One may wonder why we have not discussed the area that is consumed by the wires for power and ground. The reason for this is that these wires can be thought of as increasing only the width $b_0$ of each bit; they do this by an amount that is roughly independent of $\alpha$, as is shown in the following analysis.
For simplicity we assume that the wires for power and ground run in opposite directions, say parallel to the data and select wires. We compute how much one of them contributes to the width of a module $i$. The width of a power or ground wire is proportional to the number of bits served by it. Let the width at the highest level be $u$; given $S$ and the design of the lowest level memory cell, this parameter is easy to compute. The width of the wire in a module on level $i$ is proportional to the current it must supply and is hence $u(\alpha^{2i}/\alpha^{2N})$. In one direction, horizontal or vertical, there are $\alpha^N/\alpha^i$ such modules. The total contribution of all modules on level $i$ is thus $u(\alpha^i/\alpha^N)$. Taking the sum of this expression for $i = 0, 1, \ldots, N$ yields
$$\frac{u}{\alpha^N} \cdot \frac{\alpha^{N+1} - 1}{\alpha - 1} \approx u\,\frac{\alpha}{\alpha - 1}.$$
There are $\sqrt{S}$ bits in one direction; the increase of the bit width due to power and ground is, therefore,
$$\frac{u}{\sqrt{S}} \cdot \frac{\alpha}{\alpha - 1},$$
which is roughly equal to $u/\sqrt{S}$.
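The power and ground estimate is a geometric sum, and a short Python sketch makes the approximation concrete. The names are illustrative; u is the wire width at the top level, as in the text.

```python
import math

def supply_width_sum(u: float, alpha: int, n: int) -> float:
    """Sum over levels i = 0..n of the per-level contribution u * alpha**i / alpha**n."""
    return sum(u * alpha**i / alpha**n for i in range(n + 1))

def supply_width_closed(u: float, alpha: int, n: int) -> float:
    """Closed form of the same sum: (u / alpha**n) * (alpha**(n+1) - 1) / (alpha - 1)."""
    return u / alpha**n * (alpha ** (n + 1) - 1) / (alpha - 1)

def supply_width_approx(u: float, alpha: int) -> float:
    """Large-n approximation u * alpha / (alpha - 1)."""
    return u * alpha / (alpha - 1)

if __name__ == "__main__":
    print(supply_width_sum(1.0, 16, 2), supply_width_approx(1.0, 16))
```

Even for N = 2 and alpha = 16 the exact sum and the approximation differ by well under one percent.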
We are interested in the optimal choice of $\alpha$, but to make that choice we will have to look at the access time, which also depends on $\alpha$.
C. Access Time of the RAM 
Each element of level 0 drives a wire of length $\alpha b_0$ to reach the periphery of its module on level 1; this takes time $\alpha b_0$. Each module on level 1 drives, in the same amount of time, a wire that is $\alpha$ times longer to reach the periphery of its module on level 2, etc. With $N$ being the level of the highest module, the time required to extract 1 bit of storage adds up to $\alpha b_0 N$. We use this figure as the access time. For a RAM of $S$ words, the access time is then $\alpha b_0 (\log S/2 \log\alpha)$.
D. The Cost of the RAM 
We take the product of the area and the access time as the cost function of the RAM. A RAM of $S$ words of $\log S$ bits each has the following area-time product:
$$\left(b_0 + \frac{1}{\alpha - 1} + \frac{2\alpha - 1}{(\alpha - 1)^2}\log\alpha\right)^2 \cdot \frac{\alpha b_0}{2 \log\alpha}\, S \log^2 S. \tag{5}$$
Fig. 6 shows (5), normalized with respect to $S \log^2 S$, as a function of $\alpha$ for different values of $b_0$. One notices that for increasing bit sizes the branching ratio of the hierarchy should decrease. Static memories, therefore, should have a smaller $\alpha$ than dynamic ones. For dynamic MOS memories the optimal choice for $\alpha$ lies between 8 and 16, for static MOS memories ($b_0 \approx 4$) between 4 and 8. One may speculate that "smart memories," structures in which part of the processing task is distributed over the memory cells, will have small branching ratios and hence relatively deep hierarchies.
Fig. 6. Area-time product of a RAM as a function of $\alpha$.
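The trend described for Fig. 6 can be explored numerically. The Python sketch below evaluates the area-time product per $S \log^2 S$, taken here as the square of the width per bit (4) times the access time factor $\alpha b_0 / (2 \log\alpha)$; this form is assembled from the area and access-time results above, base-2 logarithms are assumed, and the candidate set of power-of-two branching ratios is a choice made for illustration.

```python
import math

def ram_cost_normalized(alpha: int, b0: float) -> float:
    """Area-time product of the RAM divided by S * log(S)**2."""
    log_a = math.log2(alpha)
    width = b0 + 1 / (alpha - 1) + (2 * alpha - 1) / (alpha - 1) ** 2 * log_a
    return width**2 * alpha * b0 / (2 * log_a)

def best_alpha(b0: float, candidates=(2, 4, 8, 16, 32, 64, 128)) -> int:
    """Branching ratio among the candidates with the lowest normalized cost."""
    return min(candidates, key=lambda a: ram_cost_normalized(a, b0))

if __name__ == "__main__":
    for b0 in (1.0, 2.0, 4.0):
        print(b0, best_alpha(b0))
```

Under these assumptions the best power-of-two choice moves from 16 for b0 = 1 down to 8 for b0 = 2 and b0 = 4, and for b0 = 4 the costs at alpha = 4 and alpha = 8 differ by only about 2 percent, in line with the ranges quoted in the text.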
IV. CONTENT ADDRESSABLE MEMORY 
The basic elements of the RAM were bits. The content addressable memory (CAM) is an example of a word organized memory. We consider a "pure" CAM. It consists of words of $w$ bits each. We access a word by applying $w$ bits of data to the system. We assume that there is only one word in the memory with those contents, and the address of that word is produced by the memory.
A. Organization of the CAM 
The basic elements are the bits, each of width $b_1$. The bits do not constitute the modules of level 0. The modules on level 0 of the hierarchy consist of $\alpha w$ words of $w$ bits each [see Fig. 7(b)]. The $w$ data bits are run via parallel wires vertically through the module. Out of each word comes one horizontal match wire going to the right. A word asserts its match wire if each data bit received is equal to the corresponding bit stored. There are $\alpha w$ words in a module of level 0; the address of the matching word leaves the module via the $\log \alpha w$ address wires.
The above organization of a module of level 0 has one defect. It would require the individual bits of storage to drive wires of length $w b_1$, which may be greater than the desired $\alpha b_1$, to reach the address wires. In Section II, we discussed that this type of communication should be achieved by a hierarchy. We, therefore, organize the driving of the match wire by the $w$ bits in a word in the same manner as shown in Section II. Each word is chopped up into $(w/\alpha)$ subwords of $\alpha$ bits each [Fig. 7(a)]. Each of the $(w/\alpha)$ subwords sends a signal to a "match tree" which has a branching ratio of $\alpha$ and delivers, via $\log_\alpha w$ levels, the logical product of its inputs. The top node of the match tree can drive a wire of length $b_1 \alpha^{\log_\alpha w} = b_1 w$, the length of a word in the memory. Therefore, the word itself can drive a wire of length $b_1 \alpha w$, and we may group together $\alpha w$ words into module 0 [Fig. 7(b)]. Notice that the module's length is roughly equal to $\alpha$ times its width. This will be true for modules on higher levels as well.
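The match tree described above is a fan-in-limited AND reduction. The following Python sketch models only its logic and depth, not its layout or timing; the function names and the list-of-bits representation are illustrative assumptions.

```python
def match_tree(bit_matches, alpha):
    """AND-reduce per-bit match signals with fan-in alpha.

    Returns (match, levels), where levels counts the reduction stages,
    i.e. log_alpha(w) when the word length w is a power of alpha.
    Assumes a non-empty input list.
    """
    level = list(bit_matches)
    levels = 0
    while len(level) > 1:
        # Each node forms the logical product of up to alpha inputs.
        level = [all(level[k:k + alpha]) for k in range(0, len(level), alpha)]
        levels += 1
    return level[0], levels

def word_matches(stored, data, alpha):
    """Compare a stored word with the applied data through the match tree."""
    return match_tree([s == d for s, d in zip(stored, data)], alpha)

if __name__ == "__main__":
    stored = [1, 0] * 8                      # a 16-bit word
    print(word_matches(stored, stored, 4))   # same word applied as data
```

For a 16-bit word, a branching ratio of 4 needs two reduction stages and a branching ratio of 2 needs four.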
We now describe a module of level $i$ (Fig. 8). It contains $w\alpha^{4i+1}$ words and consists of $\alpha^4$ submodules of level $i - 1$, grouped into $\alpha^2$ rows of $\alpha^2$ submodules each. Each such row contains, besides the $\alpha^2$ submodules, $w$ data wires to transport the data to each of the submodules and $\log w\alpha^{4i-1}$ outgoing address wires to transport to the right the address of the matching word. Each submodule has $w\alpha^{4i-3}$ words, and, hence, one row contains $w\alpha^{4i-1}$ words, which explains the number of address wires. A module on level $i$ has $\alpha^2$ of these rows and thus requires $\log w\alpha^{4i+1}$ outgoing address wires; they are placed to the right of the rows.
In the CAM we have $\alpha^4$ submodules per module, in the RAM only $\alpha^2$. This is only a seeming difference. In the CAM, for simplicity, we have combined two steps in the hierarchy; we have maintained, however, our multiplication factor $\alpha$ for the wire lengths. $L_{i-1}$, the length of a module of level $i - 1$, is roughly equal to $\alpha$ times $W_{i-1}$, the width of a module of level $i - 1$. Therefore, module $i - 1$ can already drive wires of length $\alpha W_{i-1}$. As a consequence, we can put $\alpha^2$ submodules into one row, as this would only require the driving of wires of length $\alpha^2 W_{i-1}$ in each row. But then we can, and this is the second step, combine $\alpha^2$ rows, as this would require the driving of wires of a length about $\alpha^2 L_{i-1}$, which is roughly equal to $\alpha^3 W_{i-1}$.
B. Area of the CAM 
We compute the length and the width separately. For the length $L_i$ of a module on level $i$, we have the relation [cf. Figs. 7(b) and 8]
$$L_0 = \alpha w\left(b_1 + \frac{\log w}{\log\alpha}\right), \qquad L_i = \alpha^2\left(w + L_{i-1} + \log w\alpha^{4i-1}\right).$$
The solution to this recurrence relation is
$$L_i = \alpha^{2i+1} w\left(b_1 + \frac{\log w}{\log\alpha}\right) + (w + \log w)\,\frac{\alpha^{2i+2} - \alpha^2}{\alpha^2 - 1} + \left(\frac{4\alpha^{2i+2} - 4\alpha^2}{(\alpha^2 - 1)^2} + \frac{3\alpha^{2i+2} - 4i\alpha^2 - 3\alpha^2}{\alpha^2 - 1}\right)\log\alpha.$$
A module on level $i$ has $w\alpha^{2i+1}$ bits in the vertical direction. The length per bit, therefore, is $L_i/w\alpha^{2i+1}$. This has the following limit for $i \to \infty$:
$$b_1 + \frac{\log w}{\log\alpha} + \frac{\alpha(w + \log w + 3\log\alpha)}{w(\alpha^2 - 1)} + \frac{4\alpha\log\alpha}{w(\alpha^2 - 1)^2}. \tag{6}$$
As in the case of the RAM, $L_i/w\alpha^{2i+1}$ is already very close to the limit for small values of $i$; the rate of convergence is again exponential. We use (6) as the length per bit of a CAM.
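The limit (6) can be checked by iterating the length recurrence $L_0 = \alpha w(b_1 + \log w/\log\alpha)$, $L_i = \alpha^2(w + L_{i-1} + \log w\alpha^{4i-1})$ and dividing by the $w\alpha^{2i+1}$ bits in the vertical direction. This Python sketch assumes base-2 logarithms; the function names are illustrative.

```python
import math

def cam_length_recurrence(i: int, alpha: int, w: int, b1: float) -> float:
    """Length L_i of a level-i CAM module, by direct iteration."""
    log_a, log_w = math.log2(alpha), math.log2(w)
    length = alpha * w * (b1 + log_w / log_a)          # L_0
    for level in range(1, i + 1):
        # log(w * alpha**(4*level - 1)) = log w + (4*level - 1) * log alpha
        length = alpha**2 * (w + length + log_w + (4 * level - 1) * log_a)
    return length

def cam_length_per_bit_limit(alpha: int, w: int, b1: float) -> float:
    """Expression (6): limiting length per bit."""
    log_a, log_w = math.log2(alpha), math.log2(w)
    a2m1 = alpha**2 - 1
    return (b1 + log_w / log_a
            + alpha * (w + log_w + 3 * log_a) / (w * a2m1)
            + 4 * alpha * log_a / (w * a2m1**2))

if __name__ == "__main__":
    for i in range(5):
        per_bit = cam_length_recurrence(i, 4, 32, 4.0) / (32 * 4 ** (2 * i + 1))
        print(i, per_bit, cam_length_per_bit_limit(4, 32, 4.0))
```

Printing a few levels shows the exponential convergence mentioned in the text.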
We find for the width $W_i$ of a module on level $i$ the following recurrence relation [cf. Figs. 7(b) and 8]:
$$W_0 = \frac{w}{\alpha}(\alpha b_1 + 1) + \log\alpha w, \qquad W_i = \alpha^2 W_{i-1} + \log w\alpha^{4i+1}.$$
Its solution is
$$W_i = \alpha^{2i} W_0 + \frac{\alpha^{2i} - 1}{\alpha^2 - 1}\log w + \left(\frac{4\alpha^{2i+2} - 4\alpha^2}{(\alpha^2 - 1)^2} + \frac{\alpha^{2i} - 4i - 1}{\alpha^2 - 1}\right)\log\alpha.$$
Fig. 7. (a) One word of storage in the CAM. (b) A CAM module of level zero.
Fig. 8. A CAM module of level $i$ ($i > 0$).
In the horizontal direction there are $w\alpha^{2i}$ bits. The width per bit $W_i/w\alpha^{2i}$ has as its limit for $i \to \infty$
$$b_1 + \frac{1}{\alpha} + \frac{\alpha^2 \log\alpha w}{w(\alpha^2 - 1)} + \frac{4\alpha^2 \log\alpha}{w(\alpha^2 - 1)^2}. \tag{7}$$
We take the product of (6) and (7) as the area per bit.
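The width limit (7) can be checked in the same way by iterating $W_0 = (w/\alpha)(\alpha b_1 + 1) + \log\alpha w$, $W_i = \alpha^2 W_{i-1} + \log w\alpha^{4i+1}$ and dividing by the $w\alpha^{2i}$ bits in the horizontal direction. The sketch assumes base-2 logarithms; the function names are illustrative.

```python
import math

def cam_width_recurrence(i: int, alpha: int, w: int, b1: float) -> float:
    """Width W_i of a level-i CAM module, by direct iteration."""
    log_a, log_w = math.log2(alpha), math.log2(w)
    width = (w / alpha) * (alpha * b1 + 1) + log_a + log_w   # W_0
    for level in range(1, i + 1):
        # log(w * alpha**(4*level + 1)) = log w + (4*level + 1) * log alpha
        width = alpha**2 * width + log_w + (4 * level + 1) * log_a
    return width

def cam_width_per_bit_limit(alpha: int, w: int, b1: float) -> float:
    """Expression (7): limiting width per bit."""
    log_a, log_w = math.log2(alpha), math.log2(w)
    a2m1 = alpha**2 - 1
    return (b1 + 1 / alpha
            + alpha**2 * (log_a + log_w) / (w * a2m1)
            + 4 * alpha**2 * log_a / (w * a2m1**2))

if __name__ == "__main__":
    for i in range(5):
        per_bit = cam_width_recurrence(i, 4, 32, 4.0) / (32 * 4 ** (2 * i))
        print(i, per_bit, cam_width_per_bit_limit(4, 32, 4.0))
```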
By dividing the area per bit by the bit area $b_1^2$ we obtain the total area per bit area for a CAM. Fig. 9 shows this quotient for $w = 32$ as a function of $\alpha$ for different values of $b_1$.
If we compare Figs. 5 and 9, we notice that for small values of $\alpha$ the wires in the CAM cause less overhead in area than those in the RAM. For large values of $\alpha$ it is the RAM that enjoys a smaller overhead in area. For equal bit sizes, i.e., with $b_0 = b_1$, the area overhead factors for the RAM and the CAM are about equal at $\alpha = 8$.
As in the RAM we can compute by how much we should increase the bit width $b_1$ if we wish to take power and ground into account. Both power and ground give an increase of $u\,\alpha^2/(\alpha^2 - 1)$ to the length and the width of the CAM. This is even closer to $u$ than in the case of the RAM. If we wish to amortize this amount over the bits, the bit width $b_1$ should be incremented by
$$\frac{2u}{\sqrt{wS}} \cdot \frac{\alpha^2}{\alpha^2 - 1}$$
for a CAM of $S$ words of $w$ bits each.
C. Access Time of the CAM 
For the access time we take the time required to extract the address of the matching word of data from a memory of $S$ words. With the highest level being level $N$, we have $S = w\alpha^{4N+1}$, or
$$N = \frac{\log S - \log w}{4 \log\alpha} - \frac{1}{4}.$$
A word of storage has a response time of $(\log w/\log\alpha)\,\alpha b_1$; for a module of level 0 this becomes $[(\log w/\log\alpha) + 1]\,\alpha b_1$. Each new level of the hierarchy multiplies the wire lengths by a factor $\alpha^2$ and hence requires an additional time of $2\alpha b_1$. For $N$ levels we find, hence,
$$\alpha b_1\left(\frac{\log S + \log w}{2 \log\alpha} + \frac{1}{2}\right). \tag{8}$$
Fig. 9. Total area per bit as a function of $\alpha$ for a CAM with word length 32.
Fig. 10. Area-time product for a CAM of 65K 32-bit words.
D. The Cost of the CAM
We again take the product of the area and the access time as the cost function. For a CAM of $S$ words of $w$ bits each, the area is the area per bit, (6) times (7), multiplied by the $wS$ bits; together with (8) this yields the cost function
$$wS\left(b_1 + \frac{\log w}{\log\alpha} + \frac{\alpha(w + \log w + 3\log\alpha)}{w(\alpha^2 - 1)} + \frac{4\alpha\log\alpha}{w(\alpha^2 - 1)^2}\right)\left(b_1 + \frac{1}{\alpha} + \frac{\alpha^2\log\alpha w}{w(\alpha^2 - 1)} + \frac{4\alpha^2\log\alpha}{w(\alpha^2 - 1)^2}\right)\alpha b_1\left(\frac{\log S + \log w}{2\log\alpha} + \frac{1}{2}\right).$$
Fig. 10 shows the cost function as a function of $\alpha$ for a CAM of 65K words of 32 bits each. The curves are fairly independent of the choice of $w$ provided we choose $w$ great enough, say $w \geq 16$. A change in $S$ will basically move the curves only up and down; it will not affect the positions of their minima. We notice again that increasing the bit size will decrease the optimal choice of $\alpha$. Comparing Figs. 6 and 10 we see that content addressable memories should have smaller branching ratios than random-access memories. For $b_1 = 4$, which seems a reasonable figure, the optimal choice of $\alpha$ is 4.
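The CAM optimum can be explored with a short Python sketch that multiplies the length per bit (6), the width per bit (7), the number of bits wS, and the access time (8). Base-2 logarithms, the inclusion of the wS factor, and the power-of-two candidate set are assumptions made here; the function names are illustrative.

```python
import math

def cam_cost(alpha: int, w: int, s: int, b1: float) -> float:
    """Area-time product of a CAM of s words of w bits each."""
    log_a, log_w, log_s = math.log2(alpha), math.log2(w), math.log2(s)
    a2m1 = alpha**2 - 1
    length = (b1 + log_w / log_a
              + alpha * (w + log_w + 3 * log_a) / (w * a2m1)
              + 4 * alpha * log_a / (w * a2m1**2))              # (6)
    width = (b1 + 1 / alpha
             + alpha**2 * (log_a + log_w) / (w * a2m1)
             + 4 * alpha**2 * log_a / (w * a2m1**2))            # (7)
    time = alpha * b1 * ((log_s + log_w) / (2 * log_a) + 0.5)   # (8)
    return w * s * length * width * time

def best_cam_alpha(w: int, s: int, b1: float, candidates=(2, 4, 8, 16, 32)) -> int:
    """Branching ratio among the candidates with the lowest cost."""
    return min(candidates, key=lambda a: cam_cost(a, w, s, b1))

if __name__ == "__main__":
    print(best_cam_alpha(32, 65536, 4.0))
```

With w = 32, S = 65536, and b1 = 4, the lowest cost among these candidates occurs at alpha = 4, which agrees with the value quoted in the text; enlarging S to 2^20 leaves the choice unchanged.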
V. CONCLUSION 
We have presented a general method for analyzing the cost and performance of recursively defined VLSI structures. Parameters of any such structure may be optimized with respect to time, area, or some combination of the two. While we have chosen the area-time product, it is clear that some other choice may be appropriate for any given application.
The results of this study indicate that as more processing is available in each module at level zero, the optimal value of $\alpha$ will decrease. A system with $\alpha = 4$ would seem to be appropriate for memories in which substantial processing is commingled with storage.
Very general arguments were used to generate the basic recursive structure. For that reason it appears that a very large fraction of VLSI computing structures will be designed in this way. We have discussed two examples, one in which the basic elements were bits of storage, and one with words of storage at the lowest level. They gave rise to rather different recursive structures. The way in which their area and time measures were established should make it clear how to apply these techniques to other recursively defined computing structures.
Carver A. Mead, for a biography and photograph, see this issue, p. 
548.  
Martin Rem was born in The Netherlands on 
September 22,  1946. He  received the B.S. 
degree in  mathematics  and physics and  the M.S. 
degree in  mathematics  from  the University 
of  Amsterdam,  Holland,  in 1968 and 1971, 
respectively,  and the Ph.D. degree in computer 
science from Eindhoven University  of  Tech- 
nology,  Eindhoven, The Netherlands, in 1976. 
He is an Associate Professor of Mathematics 
in the Department of Mathematics, Eindhoven 
University of Technology. He is presently  a 
Visiting  Professor of  Computer Science in  the  Department of Computer 
Science, California Institute of Technology, Pasadena, CA. His major 
research interests  are  in  the  area of programming  of  machines  with  in- 
store processing, semantics of programming languages, correctness 
proofs,  and  well-structured  machine designs. 
Dr. Rem is a member of the Association for Computing Machinery, 
the  Dutch  Computer  Society NGI, and  the  Dutch  Mathematical 
Society. 