Cache Design for Low Power and Yield Enhancement, by Baker Shehadah Mohammad
 
 
 
 
 
 
 
 
 
 
 
 
 
 
Copyright 
by 
Baker Shehadah Mohammad 
2008 
 
  
 
The Dissertation Committee for Baker Shehadah Mohammad Certifies that this is the 
approved version of the following dissertation: 
 
 
Cache Design for Low Power and Yield Enhancement 
 
 
 
 
Committee: 
 
Jacob Abraham, Supervisor 
Adnan Aziz 
Mohammed G. Gouda 
Michael Orshansky 
Martin Saint-Laurent 
Nur Touba  
Cache Design for Low Power and Yield Enhancement 
 
 
by 
 
Baker Shehadah Mohammad, B.S., M.S. 
 
 
Dissertation 
Presented to the Faculty of the Graduate School of  
The University of Texas at Austin 
in Partial Fulfillment  
of the Requirements 
for the Degree of  
 
Doctor of Philosophy  
 
 
The University of Texas at Austin 
August 2008  
 
 
 
 
 
Dedication 
Dedicated to my dear parents, Shehadah and Fatima Mohammad, whose 
encouragement and the high value they placed on education, even though they had limited 
access to it, are the single biggest reason I joined graduate school.  To my beloved family, 
Fairouz, Eman, Moath, Hamza, and Sarah, whose sacrifices and support made this 
possible. 
 
 
 
 
Acknowledgements 
 
I start by thanking God the most gracious, most merciful for the continuous gifts 
that allowed me to continue my graduate studies and the opportunity to meet all the nice 
people who helped me finish this work.  This dissertation would not have been possible 
without the support and encouragement of my research advisor, Professor Jacob 
Abraham. Even during those inevitable lulls in my research, during which progress was 
measured in weeks or months, Professor Abraham’s unwavering confidence in my ability 
to complete this dissertation gave me the incentive to continue. 
Professor Adnan Aziz and Dr. Martin Saint-Laurent deserve special recognition. 
While I knew what I wanted to relate in this dissertation and other publications, I was not 
always able to do so in a concise, clear manner. Professor Aziz and Dr. Saint-Laurent 
provided invaluable guidance that ensured that my ideas were expressed completely and 
articulately, which contributed greatly to the quality of this dissertation. I thank Professor 
Mohammed G. Gouda, Professor Michael Orshansky, and Professor Nur Touba for taking 
the time to serve on my dissertation committee.  Thanks to Bassam Jamil and Hani Saleh for 
encouraging me to pursue graduate school and for being a great help during this journey. 
I am indebted to my high school math and physics teachers (Mohammad Abu 
Erar and Mohammad Abu Samra) and all the staff of Yatta Secondary School; their 
influence on my life, through both their teaching and guidance, has stayed with me 
throughout the years. 
My father, Shehadah, my mother, Fatima, and all my brothers and sisters deserve 
most of the credit for not giving up on my return to academia from industry, even when I 
often lost hope of acting on the delayed dream. It would not have been possible to conceive 
the completion of my journey toward a Doctorate of Philosophy without their constant 
encouragement. 
Being a part-time PhD student and a full-time working father is an adventure that pushes 
a man to the limits of his creativity and productivity. The fact that this experience was actually 
smooth and enjoyable was due to the sacrifices, the insistence, and the high morale of my beloved 
family.  Their sacrifices and understanding when I spent many late nights away from them, 
catching up at work or at the UT campus attending Professor Abraham’s weekly meetings, make 
this accomplishment a family achievement. 
Finally, I feel the need to acknowledge the great support, encouragement, and 
understanding I received from my colleagues and managers at the Qualcomm Austin site, 
especially Paul Bassett, Willie Anderson, Chuck Fisher, and Hong Kim. 
 
15 August 2008 
 
 
Cache Design for Low Power and Yield Enhancement 
 
Publication No._____________ 
 
Baker Shehadah Mohammad, PhD 
The University of Texas at Austin, 2008 
 
Supervisor:  Jacob Abraham 
 
 
One of the major limiters to computer system and system-on-chip (SOC) 
designs is accessing the main memory, which is typically two orders of magnitude slower 
than the processor. To bridge this gap, modern processors already devote more than half 
of the on-chip transistors to the last-level cache.  Caches have a negative impact on area, 
power, and yield.  The goal of this research is to design caches that operate at lower voltages 
while enhancing yield. Our strategy is to improve the static noise margin (SNM) and the 
writability of the conventional six-transistor SRAM cell by reducing the effect of 
parametric variations on the cell.  This is done using a novel circuit that reduces the 
voltage swing on the wordline during read operations and reduces the memory supply 
voltage during write operations.  The proposed circuit increases the SRAM’s SNM and 
write margin using a single voltage supply, with minimal impact on chip area, 
complexity, and timing.  A test chip with an 8-kilobyte SRAM block manufactured in 
45-nm technology is used to verify the practicality of the contribution and demonstrate the 
effectiveness of the new circuit’s implementation. 
  Cache organization is one of the most important factors that affect cache design 
complexity, performance, area, and power.  The main architectural choice for caches is 
whether  to  implement  the  tag  array  using  a  standard  SRAM  or  using  a  content 
addressable  memory  (CAM).   The  choice  made  has  far-reaching  consequences  on 
several  aspects  of  the  cache  design,  and  in  particular  on  power  consumption.  Our 
contribution in this area is an in-depth study of the complex tradeoffs of area, timing, 
power, and design complexity between an SRAM-based tag and a CAM-based one.  Our 
results indicate that an SRAM-based tag design often provides a better overall design 
point and is superior with respect to energy, especially for interleaved multi-threading 
processors. 
Being able to test and screen chips is a key factor in achieving high yield.  Most 
industry-standard CAD tools used to analyze fault coverage and generate test vectors 
require gate-level models.  However, since caches are typically designed using a 
transistor-level flow, an abstraction step is needed to generate the gate-level models, 
which must be equivalent to the actual (transistor-level) design.  The third contribution of 
the research is a framework to verify that the gate-level representation of custom designs 
is equivalent to the transistor-level design. 
TABLE OF CONTENTS 
Acknowledgements 
List of Figures 
List of Tables 
CHAPTER 1 Introduction 
1.0  Motivation 
1.1  Dissertation Statement 
1.2  Contribution 
1.3  Dissertation Organization 
1.4  Interaction Between Voltage, Power, and Performance 
1.5  Process Variation and its Effect on Yield 
CHAPTER 2 Overview of Memory Sub-System and SRAM Cell Design 
2.0  Memory Sub-System and Cache Hierarchy 
2.1  SRAM Cell Design and Parametric Yield Failures Type 
2.1.1  SRAM Cell Stability 
2.1.2  Write Completion 
2.1.3  SRAM Access Time 
2.2  Interaction Between Read and Write Operations 
CHAPTER 3 Related Work 
3.0  Low Voltage and High Yield Approaches in SRAM memory 
3.0.1  Process Technology Transistor Sizing and Layout 
3.0.2  Modified SRAM 
3.0.3  Voltage Islands 
3.0.4  Body Bias 
3.0.5  Read and Write Assist Circuits 
3.1  Related Work in Cache Organization CAM vs SRAM tag 
3.2  Related Work in Leakage Current Reduction 
3.2.1  Multi-threshold Voltage (MTV) 
3.2.2  Voltage Islands 
3.2.3  Well and Substrate Back Biasing 
CHAPTER 4 Power Efficient and Improved Yield SRAM Cache Memory 
4.0  Reduced Wordline Voltage and Memory Supply 
4.1  Mathematical Model 
4.2  Simulation Model using Hspice 
4.3  Timing Impact 
4.4  Circuit to Generate Reduced Voltage Swing 
4.5  Summary 
CHAPTER 5 Design of 8KB SRAM Memory Test Chip with RVS Circuit 
5.0  Test Chip Description 
5.1  Interface Signals and Logical View 
5.2  Block Level and Timer circuit Design 
5.3  Timing Simulation Results 
5.4  Testing Strategy and Chip Integration 
5.5  Expected Silicon Result 
CHAPTER 6 Cache Organization: CAM versus SRAM Tag 
6.0  Tag Array Design for High Associatively Cache 
6.1  Structural Comparison 
6.2  Area and Floor plan Comparison 
6.3  Timing Comparison 
6.4  Power Comparison 
6.5  Summary 
CHAPTER 7 Verification of Gate Level Model for Custom Memory Design in Scan Mode 
7.0  Test Pattern Tool Flow 
7.1  Custom Macro Design Flow 
7.2  Gate-Level Model and Schematics Validation for ATPG 
7.2.1  Verifying ATPG Tool Compatibility and Coverage Analysis 
7.2.2  Validation through HDL Simulation 
7.2.3  Validation with Golden Model 
7.3  Experimental Results 
7.4  Summary 
CHAPTER 8 Leakage Reduction on Wordline Logic for SRAM Memory 
8.0  Motivation 
8.1  Usage of Head and Foot Switch for leakage reduction 
8.2  SRAM-based Memory Leakage 
8.3  Design Example 
8.4  Proposed low leakage wordline logic 
CHAPTER 9 Conclusions and Future Work 
9.0  Conclusion 
9.1  Future Work 
REFERENCES 
Vita 
List of Figures 
Figure 1-1: Supply voltage versus F, active and leakage power for different Vt 
Figure 1-2: 3-D random doping fluctuation in the CMOS channel [Kuhn 24] 
Figure 1-3: Spice simulation result of ring oscillator delay normalized to TT corner 
Figure 1-4: Monte Carlo Spice simulation of 45nm SRAM cell 
Figure 2-1: Basic RISC architecture pipe stages 
Figure 2-2: Memory main blocks and cache hierarchy 
Figure 2-3: Details of SRAM 6T Cell 
Figure 2-4: SRAM cell voltage versus cell ratio for α=2, α=1, and Vtn=0.35 
Figure 2-5: Cell ratio versus SNM for α=1 and α=2 
Figure 2-6: Write margin plot when Vddwl=Vddmem 
Figure 2-7: SRAM-based memory column schematic and connectivity 
Figure 2-8: SRAM-based memory access time waveforms 
Figure 2-9: Basic SRAM-based memory block 
Figure 3-1: Schematic and SIM picture of 6T cell for 90, 65, and 45nm [Kuhn 24] 
Figure 3-2: 8T SRAM cell 
Figure 3-3: SRAM butterfly curves show the SNM enhanced as SRAM supply increase 
Figure 3-4: Read assist circuit using voltage divider to reduce wordline voltage on SRAM 
Figure 4-1: Vn1 versus wordline voltage Vddmem=1v 
Figure 4-2: SNM versus wordline voltage and Vddmem=1v 
Figure 4-3: Write completion plot of Vn2 versus wordline voltage for different Vccmem 
Figure 4-4: Circuit used for simulation to find the inverter threshold Vth 
Figure 4-5: Circuit to find memory cell voltage and simulation waveform 
Figure 4-6: Simulated SNM for different voltages and cell ratios 
Figure 4-7: 6T cell SRAM SNM for different voltage 
Figure 4-8: SNM for 45nm foundry SRAM cell using 1V Vdd and different wordline voltage 
Figure 4-9: Relative performance of SRAM cell at different process corners and voltages 
Figure 4-10: Normalized read current of SRAM cell using RVS wordline versus voltage 
Figure 4-11: Basic circuit to generate RVS low signal 
Figure 4-12: Basic RVS low circuit Spice simulation result 
Figure 4-13: Improved RVS low circuit with bypass and programmable capabilities 
Figure 4-14: Spice simulation waveform of improved and programmable RVS circuit 
Figure 4-15: Traditional and RVS high circuit and waveform for SRAM wordline 
Figure 4-16: SRAM-based memory main block showing the RVS control circuit location 
Figure 5-1: Block diagram of the test chip 
Figure 5-2: Detailed view of the test chip die showing the placement of the main blocks 
Figure 5-3: Test chip interface timing diagram 
Figure 5-4: Test chip logical organization and address decode stage 
Figure 5-5: Detail block level presentation with major interface signals 
Figure 5-6: Detailed timer circuitry with clock generation and control signals interface 
Figure 5-7: Delay control circuit with acc bit signals for read and write accelerators 
Figure 5-8: Simulation result of separation value and time to develop across different PVT 
Figure 5-9: Signals waveforms from HSIM simulation when RVS is disabled and enabled 
Figure 5-10: Chip-level integration with SR on all input and output 
Figure 5-11: Expected result from test chip 
Figure 6-1: CAM cell schematic 
Figure 6-2: SRAM-based cache operation and data flow 
Figure 6-3: CAM-based tag memory organization and data flow 
Figure 6-4: SRAM-based tag 32KB memory organization 
Figure 6-5: CAM-based tag 16KB memory organization 
Figure 6-6: Power distribution in L1 data cache tag (SRAM-based tag) for SA = 0.5 
Figure 6-7: Power distribution in L1 data cache tag (CAM-based tag) for SA = 0.5 
Figure 6-8: Switching capacitance (energy-delay²) of CAM-based tag and SRAM-based tag 
Figure 7-1: ASIC design flow with ATPG pattern generation and verification 
Figure 7-2: Common mismatch between schematic view and gate-level view of the macro 
Figure 7-3: Custom circuit design flow 
Figure 7-4: Gate-level model validation framework 
Figure 7-5: The flow to generate golden model 
Figure 7-6: Gate-level simulation test example 
Figure 8-1: Detail schematic of head/foot switch 
Figure 8-2: Foot/head switch examples 
Figure 8-3: 32KB cache organization example 
Figure 8-4: Traditional wordline driver 
Figure 8-5: Proposed wordline driver design to limit leakage current 
Figure 8-6: Detail of the new wordline driver last stage 
List of Tables 
Table 2-1: Vth for α=1 and α=2 
Table 4-1: SRAM voltage (Vn1) for different Vddwl and for Vddmem=1V 
Table 4-2: SNM of 6T cell in 45nm process technology 
Table 4-3: SNM with RVS circuit and fixed supply 
Table 5-1: Interface signals and timing 
Table 6-1: Area of L1 32KB 16 ways SRAM-based tag 
Table 6-2: Area of L1 32KB 16 ways CAM-based tag 
Table 8-1: 32KB SRAM array leakage and wordline driver leakage for different PVT 
Table 8-2: Active power to the addition on foot/head switch 
 
 
 
CHAPTER 1   Introduction 
1.0  Motivation 
Caches are becoming an increasingly important part of embedded processor 
design because of their positive impact on performance. However, caches can negatively 
impact area, power, timing, yield, and design time. The ever-increasing gap between 
processor frequencies and DRAM access times, popularly referred to as the memory wall, 
has dictated that processors use more and more on-die static random access memory 
(SRAM) to meet performance targets [1][2]. As a result, in many chips the SRAM arrays 
contain more than 70% of the devices and occupy about half of the chip’s area [3]. 
Because the primary emphasis of DRAM design is density rather than speed, the performance 
gap between the processor and the main memory continues to widen. Process 
scaling, with the ability to double the number of transistors in each generation, ultimately 
makes it possible for on-chip memory to nearly double in each generation as well. 
  The optimum operating voltage for most System on Chip (SOC) designs is 
chosen to balance performance, active power, and leakage power.  The lower end of the 
operating voltage range for such devices is determined by the stability of the storage element. 
In typical SOCs, the most frequently used storage element is the SRAM cell of the caches. 
This is why the SOC yield is determined mostly by the SRAM cell. The minimum voltage 
(Vddmin) required for the product to guarantee a certain yield is also determined by the SRAM 
cell.  Increasing process variability [4][5][6] in new technologies, coupled with 
growing reliability effects like negative bias temperature instability (NBTI) [7], 
contributes to raising Vddmin.  Moreover, the higher density of SRAM cells enabled by 
small-geometry process technologies increases the probability of cell failure, both 
because of the small geometry itself and because of the large number of cells being used. 
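
To make the effect of cell count concrete, the short Python sketch below estimates how the yield of an SRAM array falls as the number of cells grows. It is an illustration only and is not taken from the measurements in this dissertation: it assumes that cells fail independently with the same probability and that no redundancy or repair is available. The function names and the per-cell failure probability are hypothetical.

```python
import math


def array_yield(p_cell_fail: float, n_cells: int) -> float:
    """Probability that every one of n_cells works.

    Assumes independent cells with an identical failure probability
    0 <= p_cell_fail < 1 and no redundancy or repair.
    """
    # (1 - p)^N, computed through log1p to stay accurate for tiny p.
    return math.exp(n_cells * math.log1p(-p_cell_fail))


def max_cell_fail_prob(target_yield: float, n_cells: int) -> float:
    """Largest per-cell failure probability that still meets target_yield."""
    # Invert (1 - p)^N = Y to get p = 1 - Y^(1/N); expm1 keeps precision.
    return -math.expm1(math.log(target_yield) / n_cells)


if __name__ == "__main__":
    p = 1e-7  # hypothetical per-cell failure probability
    for kbytes in (8, 32, 1024):
        n = kbytes * 1024 * 8  # one cell per stored bit
        print(f"{kbytes:5d} KB array: yield = {array_yield(p, n):.4f}")
```

Under these assumptions, the same per-cell failure probability that is harmless for a small array becomes a visible yield loss for a large one, which is why the per-cell failure budget tightens as on-chip memory grows.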
  This direct relationship between supply voltage and SRAM stability highlights the 
need to address this issue, especially for chips targeted at low-power, cost-sensitive 
mobile applications such as cell phones, which must support a wide range of performance, 
power, and yield requirements. Lowering Vddmin reduces the active power 
consumption quadratically and the subthreshold leakage power exponentially.  This has 
implications for most SOC designs, especially for mobile and handheld battery-operated 
products. High power consumption shortens battery life, and the resulting heat degrades 
chip performance and necessitates more expensive packaging. 
Both academic and industrial studies have attempted to find ways to minimize the 
active and leakage power of memory sub-systems [34][35] as well as yield loss due to 
SRAM parametric failure [17] [18][20]. Chapter 3 will discuss in more detail some of the 
techniques that have been proposed.  
1.1  Dissertation Statement 
Although allocating the majority of a chip’s area to the cache has a 
positive impact on chip performance, it compromises area, power, and design 
complexity. SRAM cell electrical parameter shift, which is a product of process variation, 
often results in parametric yield failure. 
To minimize parametric yield loss and enable a lower operating voltage, and thereby 
improve the system’s power efficiency, a simple and cost-effective adaptive design is 
needed. Moreover, selecting the right cache tag architecture is an important decision 
that substantially influences memory design, power, and area. 
Leakage power during active mode is becoming a large percentage of the total 
power.  Wordline driver logic uses large devices, has low activity, and has a 
regular structure, which makes it a good candidate for active leakage power reduction. 
 
1.2  Contribution 
The focus of this dissertation is a detailed study of embedded memory and its 
impact on modern SOC performance, power, and cost.  Our main contributions are the 
following. 
1.  Improved SNM SRAM-based memory.  The proposed approach 
improves the static noise margin (SNM) of the SRAM cell while 
enhancing yield. The approach changes the wordline voltage 
level of the SRAM to reduce the minimum voltage required to achieve 
cell stability.  This is especially important for cost-sensitive mobile 
applications, where the chip has to support a wide range of performance 
and power targets. For example, an embedded processor for mobile 
devices such as cell phones needs to support high-performance applications 
like video decoding (H.264) or High Speed Downlink Packet Access 
(HSDPA); at the same time, the processor needs to run MP3 players, for 
which performance is not critical and power consumption should be kept 
minimal.  Voltage scaling is an effective way to lower both active and 
leakage power, especially when performance is not a priority [47]. 
2.  Design of a test chip to prove the practicality of the approach and to 
quantify some of the overhead cost associated with using the proposed 
design. 
3.  Cache organization as it relates to selecting a CAM-based or SRAM-based 
tag array.  This choice has far-reaching consequences on memory 
performance, area, power, and complexity.  We studied the two common 
cache organizations and showed that SRAM-based tags often provide a 
better design point than CAM-based tags [12]. 
4.  High test coverage is a key contributor to achieving high yield. Our 
contribution in this area is a detailed design flow that verifies that the 
gate-level model of the cache, which is used by automatic test pattern generation 
(ATPG) tools to generate test vectors, is equivalent to the actual design. 
The flow is built from industry-standard tools and achieves 100% 
coverage: every test vector generated from the gate-level model is 
verified against the transistor-level design [52]. 
5.  Leakage power accounts for a substantial portion of the total power of 
today’s SOCs.  Our fifth contribution is to reduce SRAM-based memory 
wordline leakage power through power gating, utilizing the capabilities of 
modern multi-threshold voltage (MTV) process technology. 
1.3  Dissertation Organization 
This dissertation is divided into nine chapters. Chapter 2 includes an overview of 
cache organization, power and SRAM cell principles, and cell stability issues.  Related 
work is discussed in Chapter 3.  Chapter 4 introduces our approach to low voltage and 
improved  yield  SRAM  cache  memory,  while  Chapter  5  describes  the  test  chip  that 
implemented  the  proposed  techniques.  Cache  organization—specifically  CAM-  and 
SRAM-based tags—will be discussed in Chapter 6.  An innovative gate-level model for 
custom memory design in scan mode is presented in Chapter 7.  Chapter 8 will describe a 
wordline  driver  logic  with  less  leakage  power.  Finally,  Chapter  9  will  conclude  this 
dissertation and discuss future work. 
Section  1.4,  which  follows,  will  elaborate  on  the  relationship  between  voltage, 
power, and performance.  
 
   
1.4  Interaction Between Voltage, Power, and Performance 
Selecting the right process technology to achieve the optimum operating point 
between performance, active power, and leakage power requires a complex balance. 
Many SOC-based chips are designed to support a wide variety of user applications. For 
example, a mobile phone application processor can be used as a high-performance 
processor when playing videos or when running compute-intensive algorithms 
like the fast Fourier transform.  At the same time, the processor is required to run in a 
low-power mode for an extended period of time when running an MP3 player, for instance.  
The operating voltage should ideally be kept to a minimum in order to save both 
leakage power and active power during the low-performance application mode. 
The minimum voltage (Vddmin) specification of an SOC is determined by SRAM stability.  
This limit on voltage scaling, even in the low-power mode, has a sizable impact on both 
active and leakage power.  In addition, NBTI [21] significantly shifts the pMOS threshold 
voltage, which reduces the drive current and shifts the inverter trip point, leading to a 
decrease in SRAM SNM. 
To illustrate how voltage, power, and performance interact, we will use the 
first-order equations for active power and saturation current.  Equation 1-1 represents the total 
power consumed by an SOC.  The factor A is the activity factor of a certain node in 
the design; the next term is the active power for a particular capacitance (C), with 
voltage (V) and frequency (F); the following term is derived from the short-circuit current of 
CMOS gates; and the last term shows the leakage power as a function of leakage current, 
Ileak, and supply voltage.  Assuming Vsupply is equal to Vswing, the relationship between 
active power and Vsupply is quadratic.  Equation 1-2 gives the saturation current of a MOS 
device, where ß is a device parameter, Vgs is the gate-to-source voltage (here equal to the 
supply voltage), α is closely related to the velocity saturation of carriers and takes a value 
between 1 and 2, and Vt is the threshold voltage.  
Equation 1-3 shows the nearly linear relationship between gate delay (tPHL) and supply 
voltage (Vdd) for a certain capacitance (C).  The tT term in Equation 1-3 is the input signal 
transition time.  These first-order equations show the value of voltage scaling: it 
reduces power quadratically while affecting the propagation delay (tPHL) only linearly [22], 
which demonstrates the relationship between power and frequency. 
  
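$$P = A\left(C\,V_{supply}\,V_{swing}\,F + I_{sc}\,V_{supply}\right) + I_{leak}\,V_{supply} \tag{1-1}$$

$$I_{dsat} = \frac{\beta}{2}\left(V_{gs} - V_{t}\right)^{\alpha} \tag{1-2}$$

where $\beta = \mu\,C_{ox}\,\frac{W}{L}$

$$t_{pHL} = \frac{C\,V_{dd}}{2\,I_{dsat}} + 0.1\,t_{T} \tag{1-3}$$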
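
As a rough numerical illustration of Equations 1-1 through 1-3, the Python sketch below evaluates the three first-order expressions at two supply voltages. The forms of the equations are written as we read them from the definitions given with them, the function names are ours, and every parameter value (capacitance, ß, α, Vt, transition time) is a hypothetical placeholder rather than a value from the 45-nm process used in this work.

```python
def total_power(a, c, v_supply, f, i_sc, i_leak, v_swing=None):
    """Equation 1-1 as read here:
    P = A*(C*Vsupply*Vswing*F + Isc*Vsupply) + Ileak*Vsupply.
    """
    if v_swing is None:
        v_swing = v_supply  # full-swing assumption: Vswing = Vsupply
    return a * (c * v_supply * v_swing * f + i_sc * v_supply) + i_leak * v_supply


def i_dsat(beta, v_gs, v_t, alpha):
    """Equation 1-2: Idsat = (beta/2) * (Vgs - Vt)**alpha, for Vgs > Vt."""
    if v_gs <= v_t:
        raise ValueError("the saturation-current model needs Vgs > Vt")
    return 0.5 * beta * (v_gs - v_t) ** alpha


def t_phl(c, v_dd, beta, v_t, alpha, t_t):
    """Equation 1-3: tpHL = C*Vdd/(2*Idsat) + 0.1*tT, with Vgs = Vdd."""
    return c * v_dd / (2.0 * i_dsat(beta, v_dd, v_t, alpha)) + 0.1 * t_t


if __name__ == "__main__":
    # Hypothetical values chosen only to show the trends.
    for vdd in (1.0, 0.7):
        p = total_power(a=0.1, c=1e-12, v_supply=vdd, f=1e9, i_sc=0.0, i_leak=0.0)
        d = t_phl(c=1e-15, v_dd=vdd, beta=1e-3, v_t=0.35, alpha=1.3, t_t=20e-12)
        print(f"Vdd={vdd:.1f} V  active power={p:.3e} W  tpHL={d:.3e} s")
```

With the short-circuit and leakage terms set to zero, the active power scales with the square of the supply voltage, while the delay grows as the supply approaches Vt because the saturation current shrinks.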
 
Leakage current is the current that flows through a transistor when it is in the off
state [23].  It plays a major role in the selection of the process technology and the
design style.  Small-geometry effects in MOS devices, such as drain-induced barrier
lowering (DIBL) and low threshold voltages, make leakage currents even more important.
Equation 1-4 shows the exponential relationship between the leakage current, Ids, and
Vgs, which is proportional to the supply voltage [23].
 
$$I_{ds} = \beta\,v_t^{2}\,e^{1.8}\,e^{\frac{V_{gs}-V_t}{n\,v_t}}\left(1 - e^{-\frac{V_{ds}}{v_t}}\right) \tag{1-4}$$
where vt (lowercase) is the thermal voltage, equal to about 26mV at room temperature, Vt
is the threshold voltage, and n is a process-dependent term that ranges in value between
1.4 and 1.5 for bulk silicon [32].
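To make the exponential dependence in Equation 1-4 concrete, the sketch below evaluates the expression and derives how much Vgs - Vt must change for a tenfold change in leakage. The function names and the value of ß are hypothetical; n = 1.5 and a thermal voltage of 26 mV are the values quoted in the text.

```python
import math


def subthreshold_current(beta, v_gs, v_t, v_ds, n=1.5, v_thermal=0.026):
    """Equation 1-4: Ids = beta * vt^2 * e^1.8 * e^((Vgs-Vt)/(n*vt)) * (1 - e^(-Vds/vt))."""
    return (beta * v_thermal ** 2 * math.exp(1.8)
            * math.exp((v_gs - v_t) / (n * v_thermal))
            * (1.0 - math.exp(-v_ds / v_thermal)))


def volts_per_decade(n=1.5, v_thermal=0.026):
    """Change in (Vgs - Vt) that changes the leakage current by a factor of ten."""
    return n * v_thermal * math.log(10.0)


BETA = 1e-3  # hypothetical device parameter
swing = volts_per_decade()
high_vt = subthreshold_current(BETA, v_gs=0.0, v_t=0.40, v_ds=1.0)
low_vt = subthreshold_current(BETA, v_gs=0.0, v_t=0.40 - swing, v_ds=1.0)
print(f"swing = {swing * 1e3:.1f} mV/decade, leakage ratio = {low_vt / high_vt:.2f}")
```

Under these assumptions, lowering Vt by roughly 90 mV raises the off-state current tenfold, which is why the choice of Vt dominates leakage power.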
Figure 1-1 shows the active power, leakage power, and frequency for varying threshold
voltages (Vt).  The graph shows that for a process technology with high Vt (which is used
to reduce leakage power), the supply voltage has a greater effect on performance than it
does for a lower-Vt process.  If leakage power can be controlled by means other than Vt,
then selecting a lower-Vt device will result in a better overall performance/power
operating point.  Additionally, a lower Vt enables the chip to run at a lower voltage
because, as Chapter 2 demonstrates, cell stability is a strong function of Vt.
In addition to improving SRAM cell stability, a lower Vt is better for low-voltage
operation because a smaller threshold voltage reduces the effect of process variation on
the transistor current.  Many techniques have been proposed to address leakage current;
for example, multi-threshold voltage (MTV) process technologies enable the designer to
select the type of device based on timing and leakage power.
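Figure 1-1: Supply voltage versus F, active and leakage power for different Vt, normalized to Vdd=1V
[Plot: normalized frequency and power versus Vdd from 0.7 to 1 V, with Fmax curves for Vt = 0.3, 0.35, 0.4, and 0.5 and curves for active power and leakage power]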
 
1.5  Process Variation and its Effect on Yield  
The successful introduction of semiconductor process technologies with smaller geometries
has become increasingly dependent on the use of advanced manufacturing techniques,
including tools that enhance the performance of silicon-based structures.  One of the
consequences of adding new manufacturing techniques to enhance performance is an increase
in the variation of process characteristics across the wafer and the chip.  Process
variation is made worse by higher levels of design complexity and by the demand for chips
with high performance but low power consumption.  The parameter variations are random in
nature and are expected to be more pronounced in the minimum-geometry transistors commonly
used in memories such as SRAM.  Consequently, the electrical parameters of a large number
of cells in a memory are expected to vary, which can result in low yield due to faulty
SRAM.  CAD tools like SPICE allow us to ascertain both the target performance and the
distributions of certain circuit properties for process technologies tailored to a
specific geometric or electrical parameter.
Two of the most important circuit parameters, with sizeable impacts on circuit
performance, leakage power, and voltage scaling, are the threshold voltage and the channel
length of the transistor.  Figure 1-2 is a 3-D representation of a MOS transistor showing
the randomness of the doping atoms in the channel [25].
Figure 1-2: 3-D random doping fluctuation in the CMOS channel [Kuhn 24]
Each process technology tries to capture these factors, among other electrical and
geometrical parameters, in order to analyze their effect on the behavior of the design,
with the goal of improving yield.  There are two main ways to analyze the effect of
process variation on circuit performance and functionality.  The first is the process
corner approach.  In this approach, the process technology has five corners, which capture
the effect of process variation on device performance.  The five corners are typically
denoted by two letters corresponding to the nMOS and pMOS transistor parameters.  For
example, typical nMOS and typical pMOS (TT) denotes that all transistor parameters are at
the mean of the distribution of the process variation.  The second corner is fast nMOS and
fast pMOS (FF); at this corner, the parameters of both transistors take the values that
correspond to the highest current.  The third corner is fast nMOS and slow pMOS (FS), the
fourth corner is slow nMOS and fast pMOS (SF), and the fifth is slow nMOS and slow pMOS
(SS).  Circuit analysis using the five corners is often referred to as corner-based
analysis.  Hspice simulation can be used to check the functionality of the SRAM cell at
the five process corners.
The second method uses a statistical approach in which Monte Carlo-based analysis is used
to analyze the effects of process variation on circuit and system performance.  Figure 1-3
depicts the result of a SPICE simulation of ring oscillator delay, normalized to the
typical (TT) process corner, together with a Monte Carlo simulation of 1,000 samples.
Since the effect of variation on a circuit differs between the two types of analysis and
depends on the role the circuit plays in the system, a deep understanding of the role of
the device and of the process corner definitions is required when corner analysis is used.
Figure 1-3: Spice simulation result of ring oscillator delay normalized to TT corner 
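The contrast between corner-based and Monte Carlo analysis can be sketched with a toy delay model. This is only an illustration under loudly stated assumptions: it is not an Hspice run, it lumps nMOS and pMOS into a single threshold voltage, it borrows the delay trend of Equations 1-2 and 1-3 with the transition-time term dropped, and the values of α, sigma, and the 3-sigma corner definition are hypothetical.

```python
import random
import statistics

ALPHA = 1.3        # hypothetical velocity-saturation index
VDD = 1.0          # supply voltage in volts
VT_TYPICAL = 0.35  # typical threshold voltage in volts
VT_SIGMA = 0.03    # hypothetical standard deviation of Vt in volts


def relative_delay(v_t):
    """Delay proportional to Vdd / (Vdd - Vt)^alpha, normalized to the typical Vt."""
    typical = VDD / (VDD - VT_TYPICAL) ** ALPHA
    return (VDD / (VDD - v_t) ** ALPHA) / typical


# Corner-based analysis: evaluate a few fixed points (assumed here to be +/- 3 sigma).
corner_vt = {
    "FF": VT_TYPICAL - 3 * VT_SIGMA,
    "TT": VT_TYPICAL,
    "SS": VT_TYPICAL + 3 * VT_SIGMA,
}
corner_delay = {name: relative_delay(vt) for name, vt in corner_vt.items()}

# Monte Carlo analysis: draw 1,000 samples and look at the resulting distribution.
rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [relative_delay(rng.gauss(VT_TYPICAL, VT_SIGMA)) for _ in range(1000)]

print({name: round(delay, 3) for name, delay in corner_delay.items()})
print("Monte Carlo mean:", round(statistics.mean(samples), 3),
      "stdev:", round(statistics.stdev(samples), 3))
```

The corners bracket the distribution with a handful of points, while the Monte Carlo run shows how the samples are actually spread between them.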
There are two main paradigms for dealing with the effects of process and environment
variability on chips: design-time optimization, and design for variability with
post-silicon tuning to adapt to variation (also known as on-line tuning).
Design-time optimization in the presence of process and environment variation, aimed at
achieving chip performance, power, and yield targets, often results in over-design, which
leads to less competitive products.  To illustrate how the design-time approach results
in excess power and a less efficient design, we used a 6T SRAM cell in a 45nm process
optimized for low power and ran Hspice Monte Carlo simulations to measure the separation
between BL and BLB, referred to as the bitline development (Vbl), at the time the sense
amplifier enable is asserted.  The sense amplifier is designed to guarantee a correct read
operation (yield) from the SRAM cell when Vbl has a certain minimum value (often 200mV).
Figure 1-4 shows the Monte Carlo simulation result for Vbl.  If C is the total capacitance
on a bitline, then the total energy expended in pre-charging and evaluating the bitlines
is given by Equation 1-6, which is derived from the following set of equations:
 
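$$\begin{aligned}
Q &= C\,V \\
\Delta Q &= C\,\Delta V, \qquad \Delta V = V_{bl} \\
I_{source} &= \frac{\Delta Q}{T} \\
P &= V_{source}\,I_{source} = V_{source}\,\frac{C\,V_{bl}}{T} \\
E &= P\,T = C\,V_{source}\,V_{bl}
\end{aligned} \tag{1-5}$$

$$Energy = C\,V_{source}\,V_{bl} \tag{1-6}$$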
where ∆Q is the charge drawn from the source to charge the bitline, and T is the time it
takes to charge the bitline.  Note that Equation 1-5 does not have a 1/2 term in the
energy expression because the bitline is both pre-charged and evaluated in the same cycle.
With the design-time optimization approach, the bitline separation on most cells will
exceed the target, which results in excess power consumption.  This is because the slowest
cell determines how long the WL pulse must be held at logic 1 to meet the target bitline
separation.
Figure 1-4: Monte Carlo Spice simulation of 45nm SRAM cell  
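The cost of over-developing the bitline follows directly from Equation 1-6, since the energy is linear in Vbl. The sketch below is illustrative only: the function names are hypothetical, and the 100 fF bitline capacitance, 1 V source, and 300 mV fast-cell separation are made-up numbers; the 200 mV target is the value quoted in the text.

```python
def bitline_energy(cap_bitline, v_source, v_bl):
    """Equation 1-6: Energy = C * Vsource * Vbl."""
    return cap_bitline * v_source * v_bl


def excess_energy_fraction(v_bl_actual, v_bl_target=0.2):
    """Fraction of the read energy spent beyond what the sense target requires."""
    return (v_bl_actual - v_bl_target) / v_bl_actual


CAP_BITLINE = 100e-15  # hypothetical bitline capacitance in farads
V_SOURCE = 1.0         # hypothetical supply in volts

target = bitline_energy(CAP_BITLINE, V_SOURCE, 0.2)
fast_cell = bitline_energy(CAP_BITLINE, V_SOURCE, 0.3)
print(f"target: {target * 1e15:.0f} fJ, fast cell: {fast_cell * 1e15:.0f} fJ, "
      f"excess: {excess_energy_fraction(0.3):.0%}")
```

With the wordline pulse sized for the slowest cell, a cell that develops 300 mV instead of the 200 mV target spends a third of its bitline energy on separation the sense amplifier does not need.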
 
CHAPTER 2 Overview of Memory Sub-System and SRAM Cell Design 
2.0  Memory Sub-System and Cache Hierarchy 
The system cache is responsible for much of the system performance improvement in today's
chips and SOCs.  The importance of the cache in an embedded system using a typical RISC
architecture is illustrated in Figure 2-1, which shows that two of the five pipeline
stages (IF and MEM) are memory-related [50].
Figure 2-1: Basic RISC architecture pipe stages 
The cache is a buffer between the very fast processor and the relatively slow memory that
serves it.  There are, in fact, several different "layers" of cache in modern processors,
each acting as a buffer for recently used information, with different capacities and
access times.  Figure 2-2 shows both the main blocks of the memory subsystem and the cache
hierarchy.  The memory management unit (MMU) is responsible for managing the data between
the disk and the main memory.  The Bus Interface Unit (BU), which controls the data
transfer between the processor and the main memory, moves data in units of the cache line
size (typically 256 bits).  The on-chip caches are often referred to by their distance
from the CPU.  For example, Level 1 (L1) is the closest large storage area, in the range
of 32KB, while L2 is the second-level cache and ranges in size from 256KB and up.  As
shown in Figure 2-2, both the size of the cache and the access time increase as distance
from the CPU increases.  Some modern high-performance processors employ an L3 cache in
excess of 4MB [47].  Most cache systems employ paging and virtual memory [50] to better
service the increased size of program data.  The translation lookaside buffer (TLB) is
used to translate a virtual address into a physical address.  The data flow between the
CPU and the memory sub-system differs among architectures based on the write policy
(write-back or write-through) and the pipeline.  The L1 cache is the most frequently used
cache.  If the data needed by the CPU is not present in the first level (L1) of the cache,
then the second level (L2) is used to look up the data.
 
Figure 2-2: Memory main blocks and cache hierarchy 
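The lookup order described above (L1 first, then L2, then main memory) can be sketched as follows. This is a deliberately simplified illustration: the function name is hypothetical, the caches are plain dictionaries with no capacity limit or eviction, and filling both cache levels on a miss is an assumed policy, not something the text specifies. Address translation and write policies are left out.

```python
def lookup(address, l1, l2, main_memory):
    """Return (data, level) for a read, searching L1, then L2, then main memory."""
    if address in l1:
        return l1[address], "L1"
    if address in l2:
        data = l2[address]
        l1[address] = data  # assumed fill policy: copy the line into L1 on an L2 hit
        return data, "L2"
    data = main_memory[address]
    l2[address] = data      # assumed fill policy: fill both levels on a miss
    l1[address] = data
    return data, "memory"


main_memory = {0x40: "cache line A"}
l1_cache, l2_cache = {}, {}
print(lookup(0x40, l1_cache, l2_cache, main_memory))  # first access misses both caches
print(lookup(0x40, l1_cache, l2_cache, main_memory))  # repeated access is served by L1
```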
 
Since the size and the access time differ between the levels of the cache, so do the SRAM
cell used and the design style employed.  For example, the L1 cache often uses a
relatively larger SRAM cell and fewer rows per memory bank to meet the target performance.
The larger SRAM cell used by the L1 cache, combined with more periphery logic due to the
fewer rows, results in lower area utilization compared to the L2 cache.  Also, because the
number of cells differs between cache levels, different approaches are used to address
yield.  For example, redundancy is often used in the higher-level caches to minimize yield
loss due to manufacturing defects.  It is not often used in the L1 cache because of the
timing overhead and the small size of the cache.
 
2.1  SRAM Cell Design and Parametric Yield Failures Type 
The 6T SRAM cell is typically the most frequently used cell in designs requiring on-chip
memory.  Its main function is to store data for the program to access; it retains the
stored data so long as power is applied.  The schematic of a 6T cell is shown in Figure
2-3.  Its design involves complex tradeoffs between the following seven factors.
1.  Minimization of cell area is key to achieving high-density memory and to
reducing power and the cost of the chip.
2.  Good cell stability at minimum voltage is important for the cell to perform
its main function, which is storing data.  A cell with poor Static Noise
Margin (SNM) can cause operational errors due to data corruption.
3.  Robust cells are needed to minimize parametric failure due to process,
voltage, and temperature variations.  This has a direct impact on overall
chip yield.
4.  Good soft error immunity is required.  In systems with high reliability
requirements, a data error due to a soft error can lead to catastrophic
failures.
5.  High cell read current is necessary to minimize access time.
6.  Minimum wordline pulse width during access helps to minimize bitline
active power.
7.  Low leakage currents are necessary, especially for battery-operated
systems, to enable long battery life during both active and standby modes.
In many cases, the above requirements conflict with one another.  For example, a cell
with high read current or with good soft error immunity necessitates larger transistors,
which result in a larger cell area.
  
 
Figure 2-3: Details of SRAM 6T Cell 
 
In addition to the above factors, sizing of the SRAM cell transistors is based on
three main criteria: read stability, write completion, and access time.
2.1.1  SRAM Cell Stability 
The SRAM cell is a regenerative bistable circuit.  When the cell is accessed, its content
is expected to stay the same.  Figure 2-3 illustrates the 6T SRAM cell, in which the
wordline node controls the access transistors, n1 and n2 are the internal nodes, and BL
and BLB are the bitlines of the cell.  If the stored state changes during an access, the
cell is declared unstable.  This can occur during a read: the wordline is turned on,
BL and BLB are both high, n2 is at logic 1 (Vdd), and n1 is at logic 0, so transistor PG1
is in saturation and PD1 is in the linear region.  The two transistors essentially form a
voltage divider that raises the voltage at n1.  For the cell to function properly under
all operating conditions, the current through PD1 needs to be greater than the current
through PG1 (I1 > I0), which guarantees that Vn1 stays below the inverter threshold (trip
point).  If this condition is not met, the memory cell will flip state and change the
stored value.  Many mathematical models try to capture these requirements.  Here we use
the first-order MOS equations [23] to illustrate the different constraints on 6T cell
sizing.  Equation 2-1 shows the read stability requirement.
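$$I_{PD1}(\text{linear}) = I_{PG1}(\text{saturation})$$

$$\mu_n C_{ox}\,\frac{W_{PD}}{L_{PD}}\left(V_{ddmem} - V_t - \frac{V_{n1}}{2}\right)V_{n1} = \frac{\mu_n C_{ox}}{2}\,\frac{W_{PG}}{L_{PG}}\left(V_{ddwl} - V_{n1} - V_t\right)^{\alpha} \tag{2-1}$$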
We define the strength of the PD transistor relative to the PG transistor as the cell
ratio (CR):

$$CR = \frac{W_{PD}/L_{PD}}{W_{PG}/L_{PG}} \tag{2-2}$$
Under normal operating conditions, the logic 1 level of the wordline (Vddwl) and the
logic 1 level of the cell (Vddmem) are the same (Vddmem = Vddwl = Vdd).  We assume that α,
which is normally a number between 1 and 2, is equal to 2.  Combining Equations 2-1 and
2-2 and solving for Vn1 gives Equation 2-3:

$$V_{n1} = \frac{\left(V_{dd} - V_t\right)\left(1 + CR \pm \sqrt{CR\,(1 + CR)}\right)}{1 + CR} \tag{2-3}$$

Only the root with the minus sign satisfies Vn1 < Vdd - Vt, so it is the physically
meaningful solution.
Equation 2-3 shows that CR, Vdd, and Vt are the three main parameters that affect Vn1.
Figure 2-4 plots Vn1 as a function of CR for different Vdd values and for α=1 and α=2.
It shows that Vn1 decreases as CR increases when Vt is equal to 0.35V; this is true across
the different voltages.  The value of Vn1 is higher for the α=1 case, which means that for
small geometries, where the value of α is close to 1.5, SRAM cell stability becomes more
challenging.
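The read-disturb voltage can be evaluated directly. The sketch below is an illustration rather than the author's plotting script: `vn1_closed_form` takes the minus root of Equation 2-3 (α = 2), and `vn1_numeric` solves the current balance of Equation 2-1, after dividing through by the PG strength and setting Vddmem = Vddwl = Vdd, by bisection for any α. Both function names are hypothetical, and for α other than 2 the balance is treated purely numerically in volts.

```python
import math


def vn1_closed_form(v_dd, v_t, cr):
    """Minus root of Equation 2-3 (alpha = 2)."""
    return (v_dd - v_t) * (1.0 + cr - math.sqrt(cr * (1.0 + cr))) / (1.0 + cr)


def vn1_numeric(v_dd, v_t, cr, alpha, iterations=60):
    """Solve CR * (X - v/2) * v = 0.5 * (X - v)**alpha for v, with X = Vdd - Vt.

    The left side minus the right side is negative at v = 0, positive at v = X,
    and increasing in between, so bisection on [0, X] finds the single root.
    """
    x = v_dd - v_t
    low, high = 0.0, x
    for _ in range(iterations):
        mid = 0.5 * (low + high)
        balance = cr * (x - 0.5 * mid) * mid - 0.5 * (x - mid) ** alpha
        if balance > 0.0:
            high = mid
        else:
            low = mid
    return 0.5 * (low + high)


for cr in (1.0, 1.5, 2.0):
    print(f"CR={cr}: Vn1(alpha=2)={vn1_closed_form(1.0, 0.35, cr):.3f} V, "
          f"Vn1(alpha=1)={vn1_numeric(1.0, 0.35, cr, 1.0):.3f} V")
```

With Vdd = 1 V and Vt = 0.35 V, the output shows the two trends described in the text: Vn1 falls as CR grows, and it is higher for α = 1 than for α = 2.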
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
Figure 2-4: SRAM cell voltage versus cell ratio for α=2, α=1, and Vtn=0.35
[Two panels, α=2 and α=1: Vn1 (V) versus cell ratio from 1 to 2, with curves for Vdd = 0.7, 0.8, 0.9, and 1 V]
For the SRAM cell to be stable, Vn1 has to be smaller than the inverter threshold (Vth).
This implies that the smaller the Vn1 value, the more stable the cell is.  To first order,
we take the static noise margin (SNM) to be:

$$SNM = V_{th} - V_{n1} \tag{2-4}$$

Equations 2-5 and 2-6 give Vth when α is equal to 2 and 1, respectively; Vth is a function
of the pMOS and nMOS threshold voltages, device sizing, and supply voltage [32].
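$$V_{th}(\alpha = 2) = \frac{V_{tn} + \sqrt{\frac{1}{K_R}}\left(V_{dd} + V_{tp}\right)}{1 + \sqrt{\frac{1}{K_R}}} \tag{2-5}$$

$$V_{th}(\alpha = 1) = \frac{V_{tn} + \frac{1}{K_R}\left(V_{dd} + V_{tp}\right)}{1 + \frac{1}{K_R}} \tag{2-6}$$

where

$$K_R = \frac{\mu_n C_{ox}\,W_n / L_n}{\mu_p C_{ox}\,W_p / L_p}$$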
 
Traditionally, a robust SRAM cell is achieved through transistor sizing, keeping the
other variables fixed.
Assuming that the voltage at n1 must remain below Vth, CR needs to be greater than 1 to
achieve an acceptable SNM, that is, greater than 0.2V.  This SNM accounts for device
mismatches and all other sources of noise, such as power supply noise and device
parameter coupling noise.  Figure 2-5 plots the SNM from Equation 2-4, using the values
of Vn1 from Figure 2-4 and the values of Vth obtained from Equations 2-5 and 2-6, which
are listed in Table 2-1.
Table 2-1: Vth for α=1 and α=2 (KR=2.2*2=4.4, Vtp=0.4, Vtn=0.35)

             vdd=0.7   vdd=0.8   vdd=0.9   vdd=1
Vth α=2      0.334     0.366     0.398     0.431
Vth α=1      0.341     0.359     0.378     0.397
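The entries of Table 2-1 can be recomputed from Equations 2-5 and 2-6. The sketch below assumes that the Vtp=0.4 in the table header is a magnitude, so it is entered as -0.4 V in the signed form the equations use; the function name is hypothetical. Under that assumption the computed values agree with the table to within about 1 mV.

```python
import math


def trip_point(v_dd, v_tn, v_tp, k_r, alpha):
    """Inverter threshold from Equation 2-5 (alpha = 2) or 2-6 (alpha = 1).

    v_tp is the signed pMOS threshold voltage (negative).
    """
    if alpha == 2:
        weight = math.sqrt(1.0 / k_r)
    elif alpha == 1:
        weight = 1.0 / k_r
    else:
        raise ValueError("Equations 2-5 and 2-6 cover only alpha = 2 and alpha = 1")
    return (v_tn + weight * (v_dd + v_tp)) / (1.0 + weight)


K_R, V_TN, V_TP = 4.4, 0.35, -0.4  # Table 2-1 parameters, Vtp entered as a signed value
SUPPLIES = (0.7, 0.8, 0.9, 1.0)

for alpha in (2, 1):
    row = [round(trip_point(v_dd, V_TN, V_TP, K_R, alpha), 3) for v_dd in SUPPLIES]
    print(f"Vth alpha={alpha}:", row)
```

Subtracting a Vn1 value from the matching Vth gives the first-order SNM of Equation 2-4.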
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
     
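Figure 2-5: Cell ratio versus SNM for α=1 and α=2
[Two panels, α=2 and α=1: SNM (V) versus cell ratio from 1 to 2, with curves for Vdd = 0.7, 0.8, 0.9, and 1 V]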
Figure 2-5 also shows that a higher cell ratio improves the SNM, but at lower operating
voltages the SNM becomes less dependent on CR.  This is mainly because the Vdd-Vt term in
Equation 2-3 becomes small at lower voltages, which reduces the dependency on CR.
Examining Equations 2-3 and 2-4 makes it clear that fast nMOS and slow pMOS (FS) is the
worst process corner for cell stability.  This process corner results in a higher Vn1
voltage and a lower inverter threshold (trip point) because the nMOS Vt is smaller at the
fast nMOS corner.
 
2.1.2  Write Completion 
During a write operation, WL is at Vdd and the write driver pulls one bitline low,
causing PG2 to go into the linear region and PU2 to go into saturation.  This creates a
voltage divider between PU2 and PG2.  For the write operation to complete correctly, the
current through PG2 (I3) needs to be greater than the current through PU2 (I2).  This
guarantees that the internal node n2 is pulled down to the inverter threshold level, so
that the cell finishes pulling n1 to Vdd.  If this condition is not satisfied due to
process, voltage, or temperature (PVT) variations, then the cell will not be writable.
Equation 2-7 illustrates this condition.
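$$I_{PG}(\text{lin}) = I_{PU}(\text{sat})$$

$$\mu_n C_{ox}\,\frac{W_{PG}}{L_{PG}}\left(V_{ddwl} - V_{bit} - V_{tn} - \frac{V_{n2}}{2}\right)V_{n2} = \frac{\mu_p C_{ox}}{2}\,\frac{W_{PU}}{L_{PU}}\left(0 - V_{ddmem} - V_{tp}\right)^{\alpha} \tag{2-7}$$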
We define the pull-up ratio (PR) as

$$PR = \frac{\mu_n\left(W_{PG}/L_{PG}\right)}{\mu_p\left(W_{PU}/L_{PU}\right)} \tag{2-8}$$
Substituting Equation 2-8 into Equation 2-7 yields

$$PR\left(V_{ddwl} - V_{bit} - V_{tn} - \tfrac{1}{2}V_{n2}\right)V_{n2} = \tfrac{1}{2}\left(V_{ddmem} - |V_{tp}|\right)^{\alpha} \tag{2-9}$$

where |Vtp| is the magnitude of the pMOS threshold voltage.  In the analysis that follows,
we assume PR = 1.5, Vtn = 0.4V, and Vtp = 0.35V.  The relationship in Equation 2-9 among
the SRAM internal node voltage Vn2, the wordline voltage, and the memory supply is one of
the factors that contribute to Vddmin.
 
Figure 2-6: Write margin plot when Vddwl=Vddmem
[Plot: Vn2 and Vtrip (V) versus supply voltage from 0.7 to 1 V, with Vddwl=Vddmem; the intersection of the two curves marks the minimum Vdd that guarantees a write]
 
Figure 2-6 plots the value of Vn2 obtained from Equation 2-9 for different supply
voltages; it also shows the trip point voltage, Vtrip, of the forward inverter.  In order
to flip the cell, Vn2 needs to be as small as possible.  The maximum value of Vn2 at which
the cell can still be flipped is Vtrip.  In this specific case, a minimum supply voltage
of 0.87V is required to complete the write.  This voltage is obtained by finding the
intersection point of the two curves shown in Figure 2-6.  Examining Equation 2-9 also
makes it clear that a higher Vddwl and a lower Vddmem result in a smaller Vn2.  The
worst-case corner for checking writability is the slow nMOS and fast pMOS corner at
minimum voltage.  Write failure can occur if a pMOS or nMOS parameter is shifted due to
process variation or if the wordline pulse is not wide enough to complete the write.
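The dependence of Vn2 on the two supplies can be checked by solving Equation 2-9 for α = 2, where it reduces to a quadratic in Vn2. The sketch below is an illustration with a hypothetical function name; it takes the smaller root, treats Vtp as a magnitude, and returns None when the quadratic has no real root. It does not attempt to reproduce the 0.87V crossover, which also depends on the Vtrip curve of Figure 2-6.

```python
import math


def vn2_write(v_ddwl, v_ddmem, v_tn, v_tp_mag, pr, v_bit=0.0):
    """Smaller root of Equation 2-9 for alpha = 2, or None if no real root exists.

    With a = Vddwl - Vbit - Vtn and b = Vddmem - |Vtp|, Equation 2-9 becomes
    v^2 - 2*a*v + b^2 / PR = 0, so v = a - sqrt(a^2 - b^2 / PR).
    """
    a = v_ddwl - v_bit - v_tn
    k = (v_ddmem - v_tp_mag) ** 2 / pr
    discriminant = a * a - k
    if a <= 0.0 or discriminant < 0.0:
        return None  # PG cannot balance the PU current in this first-order model
    return a - math.sqrt(discriminant)


PR, V_TN, V_TP_MAG = 1.5, 0.4, 0.35  # values assumed in the text

base = vn2_write(0.9, 0.9, V_TN, V_TP_MAG, PR)
boosted_wordline = vn2_write(1.0, 0.9, V_TN, V_TP_MAG, PR)
lowered_memory = vn2_write(0.9, 0.8, V_TN, V_TP_MAG, PR)
print(f"base: {base:.3f} V, higher Vddwl: {boosted_wordline:.3f} V, "
      f"lower Vddmem: {lowered_memory:.3f} V")
```

Raising the wordline supply or lowering the memory supply both reduce Vn2, which is the trend the text reads off Equation 2-9.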
It is apparent from equations 2-3 and 2-9 that there are conflicting requirements in 
the SRAM transistor sizing.  On the one hand, the PG transistor W/L needs to be as large 
as possible to improve access time and write margin of the cell.  On the other hand, it 
needs to be as small as possible to increase the SNM. 
 
2.1.3  SRAM Access Time 
In many processors, memory access time is one of the chip’s most timing-critical paths
because it defines how fast data can be moved to and from the execution units.  Balancing
the SRAM cell size, threshold voltage, and leakage to achieve the target performance is a
complex process.  Figure 2-7 shows a detailed schematic of one column of a typical SRAM
memory.  In addition to the read and write logic, precharge logic is used to precondition
the bitlines of the column for the next access.  The read operation of the memory starts
with selecting the memory entry by driving the wordline signal to logic 1.  When the
wordline of a specific memory row is at logic 1, the PG and PD transistors of the SRAM
cell start removing charge from BL or BLB, and as a result, the voltage level of one side
decreases.  Both BL and BLB are sampled by a sensitive circuit (sense amplifier) that can
resolve small voltage differentials.
The timing of memory systems differs among designs according to the location of the cache
in the memory hierarchy.  For example, the L1 cache, which is close to the processor,
tends to use a larger SRAM cell and, consequently, a synchronous timing scheme to enable
single-cycle access at the same frequency as the processor.  Figure 2-8 illustrates the
basic timing waveform for a read operation, with Ta controlled by the SRAM read current
and the bitline capacitance, both of which are determined by the SRAM transistor sizes.
If the SRAM cell read current is low due to a weak cell, then the bitline development may
not reach the level required by the sense amplifier to resolve the correct logic value.
As a result, the memory access will fail.  This failure is referred to as an access
failure.  It is therefore typical to have post-silicon tuning capabilities that control
the delay between the wordline and the sense amplifier enable, in order to allow more time
for the SRAM cell to achieve the correct voltage separation between BL and BLB.
 
 
Figure 2-7: SRAM-based memory column schematic and connectivity 
 
 
Figure 2-8: SRAM-based memory access time waveforms
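The link between read current, bitline capacitance, and access failure can be sketched with the charge relations Q = C·V and I = Q/T. The example below is a first-order illustration only: it assumes a constant read current, the function names are hypothetical, and the 100 fF capacitance and the 20 µA and 10 µA read currents are made-up values. The 200 mV target is the typical sense requirement mentioned in Chapter 1.

```python
def bitline_development_time(cap_bitline, v_bl_target, i_read):
    """Time for a constant read current to develop the target bitline separation."""
    return cap_bitline * v_bl_target / i_read


def read_succeeds(cap_bitline, i_read, sense_delay, v_bl_target=0.2):
    """True if the separation developed by the time of sensing meets the target."""
    developed = i_read * sense_delay / cap_bitline
    return developed >= v_bl_target


CAP_BITLINE = 100e-15  # hypothetical bitline capacitance in farads

nominal = bitline_development_time(CAP_BITLINE, 0.2, 20e-6)
print(f"nominal cell needs {nominal * 1e9:.1f} ns")
print("weak cell, 1.0 ns sense delay:", read_succeeds(CAP_BITLINE, 10e-6, 1.0e-9))
print("weak cell, 2.5 ns sense delay:", read_succeeds(CAP_BITLINE, 10e-6, 2.5e-9))
```

The weak cell fails with the nominal sense delay and passes once the delay is extended, which is the role of the post-silicon tuning described above.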
2.2   Interaction Between Read and Write Operations 
As explained in the previous sections, the read and write operations of SRAM-based memory
have different failure mechanisms and are affected by different factors.  Still, the read
and write operations are not wholly independent of each other.  For example, even during a
write access, some of the memory cells will be read, a process often called a “dummy
read.”  This occurs because of column multiplexing between adjacent columns, which is an
effective way to share input/output circuits and to increase area utilization.  Figure 2-9
shows a basic memory block in which Wl<3> is asserted to access the SRAM1 cell for either
a read or a write operation; the other three SRAM cells in the same row are, consequently,
accessed in read mode.  This ultimately limits the memory supply voltage (Vccmem) for the
columns undergoing a dummy read during a write operation, as this voltage has to be high
enough to maintain the correct data.
Figure 2-9: Basic SRAM-based memory block 
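The dummy-read effect of column multiplexing can be expressed in a few lines. This sketch is illustrative: the function name is hypothetical, columns are numbered from 0, and a 4:1 multiplexer is assumed to match the four cells per row described for Figure 2-9.

```python
def row_access(columns_per_mux, selected_column, operation):
    """Return the operation each cell on the asserted wordline experiences."""
    if operation not in ("read", "write"):
        raise ValueError("operation must be 'read' or 'write'")
    if not 0 <= selected_column < columns_per_mux:
        raise ValueError("selected_column is out of range")
    return {
        column: operation if column == selected_column else "dummy read"
        for column in range(columns_per_mux)
    }


# Writing the cell in column 0 still exposes the other three cells to a read.
print(row_access(4, 0, "write"))
```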
 
CHAPTER 3   Related Work  
3.0   Low Voltage and High Yield Approaches in SRAM Memory 
As noted in the previous chapter, supply voltage, cell ratio, and threshold voltage 
of  the  devices  are  the  factors  that  determine  whether  a  cell  is  robust  and  stable.  In 
addition  to  these  factors,  controlling  variability  through  process  technology  further 
reduces the device parameter shift. The SRAM cell stability and its effect on both yield 
and power have been addressed through several techniques. 
1.  Process technology and transistor sizing 
2.  SRAM cell modification 
3.  Voltage islands 
4.  Body/well biasing 
5.  Circuit techniques 
Each of the above techniques targets one or more factors to reduce the impact of 
cell stability on the overall power and yield. We will discuss each option in more detail in 
the coming sections.   
In addition to the cell response to PVT variation, manufacturing defects also impact
product yield.  One way to improve yield is to reduce the occurrence of defects, but
because of the large number of devices and the high density, defects can still occur.
The effect of defects produced during the manufacturing process can be eliminated by
introducing the capability to bypass the defects by swapping in redundant elements [61]
[62].  This can significantly enhance the yield of the manufacturing process, improve the
reliability of the outgoing product, and increase the quality of the overall system.  The
area, timing, and design-time overhead of adding redundancy makes it impractical for L1
caches but an effective approach for higher-level caches, where the area and speed
degradation overhead is less critical.
3.0.1  Process Technology Transistor Sizing and Layout 
The tradeoffs among the chief cell characteristics, including cell area, cell stability
at minimum voltage, soft error immunity, cell read current, write margin, and low leakage
current, are some of the key factors taken into account when determining the design of
the 6T cell.  Each process technology node has approved SRAM cells that are carefully
designed and, in many cases, use design rules tighter than the minimum logic design rules
(SRAM design rules versus logic design rules) to optimize area.  The devices of the SRAM
cell are the first to be manufactured and are used to qualify the process technology node.
Once a 6T cell is verified to work and its yield metric is characterized, a set of usage
guidelines is produced to enhance uniformity and reduce variation.  Figure 3-1 shows a
schematic of the 6T cell and scanning electron microscope (SEM) pictures of the 6T cell
for three technology nodes [25].  The cell aspect ratio (width/height) is 2 for sub-90nm
process technologies [24] [25] [26] [28].  This aspect ratio is used to reduce variability
by orienting the poly of all transistors in the same direction.  It also reduces loading
on the bitline because the short bitline length results in a small interconnect
capacitance.  The use of metal 2 (M2) for bitline routing avoids the resistance and
capacitance of the vias that would be required if an upper metal layer were used.  The
selection of a wider cell has the disadvantage of a longer wordline, which requires a
bigger wordline driver to meet the slew rate and timing requirements.  A wider cell also
challenges the floor plan of the design, as the SRAM arrays are no longer square but
rectangular.
The use of high-K and metal gate also helps to reduce process variation by eliminating
the need for transistor channel doping.  This is because the sub-threshold leakage is
better controlled with the high-K metal gate than with the traditional poly gate-based
transistor.
 
Figure 3-1: Schematic and SEM picture of 6T cell for 90, 65, and 45nm   [Kuhn 24]
 
3.0.2  Modified SRAM 
New SRAM cell designs have been proposed to deal with cell stability issues [31].  In
these approaches, the SRAM cell area is increased to improve stability.  For example,
Wang [26] reported a 17% increase in SRAM cell size on a 65nm process technology to
support low-power operation.  Figure 3-2 shows a schematic of a typical eight-transistor
(8T) cell, which uses eight transistors instead of six.  The fundamental difference
between the 8T cell-based and 6T cell-based designs resides in the sensing scheme.  In
the 8T cell-based design, the RBL swings from the logic 1 voltage to the logic 0 voltage.
This design type is often referred to as a large signal array (LSA).
On the other hand, the 6T cell-based design is a small signal array (SSA), because during
a read the small difference between BL and BLB is used to sense the logic value of the
selected cell.  Leland et al. [29] reported that, for the 32-nm process technology, the 8T
cell area is 1.6 times the 6T cell area: 0.1998 µm² versus 0.124 µm², respectively.  This
substantial difference in area between the 6T cell and the 8T cell has also been observed
on other process technology nodes, where the 8T cell is between 1.6 and 2 times larger.
The main reason for this area increase is the symmetry that is lost when new transistors
are added to the 6T cell.
 
 
Figure 3-2: 8T SRAM cell 
Modifying the SRAM cell is not practical for chips with large embedded SRAM, as the
increase in SRAM cell area limits the size of the memory that can be used.  Since the read
is no longer differential and the RBL has to swing from logic 1 to logic 0, the LSA-based
array consumes more power than the SSA-based array.  For the same reason, the number of
rows connected to each RBL is smaller than in the SSA-based array using the 6T cell.  The
organization of the array also differs from that of the SSA-based array and requires more
logic to combine the different RBL paths.  The 8T cell with an LSA-based array works well
for memories requiring multi-port access, such as the register files in most processors.
3.0.3  Voltage Islands and Separate Voltage Supplies 
Voltage islands are used by many SOCs to deal with the fact that the minimum operating
voltage of a chip is determined by the SRAM Vddmin [33].  The approach is based on
separating the memory supply (Vddmem) from the supply of the rest of the logic (Vddx),
with each supply controlled separately based on performance and power requirements.  The
drawback of this approach is the need for level shifters [38] on all signals crossing
between the two voltage domains.  Also, the need for two separate supplies complicates
the routing of resources and increases the design cost.
The interface between Vddmem and Vddx can vary depending on complexity, leakage, and
clock skew.  A simple approach is to place level shifters at the interface of the cache
and to have all interface signals go through them; in this approach, the power supplies
and the interface signals have a clear boundary, which simplifies the timing analysis of
the design.  A more complex approach calls for the interface between the memory supply
domain and the logic supply domain to be pushed inside the memory block; this approach is
frequently employed to save active and leakage power by putting more logic on the core
supply and less on the memory supply.
Some chips employ different power modes for the memory supply, such as active, standby,
and retention modes [39].  The active mode is the normal mode, in which Vddmem is at its
peak value to meet both the SRAM stability voltage and the access time.  The standby mode
applies when a certain bank or part of the memory is not being accessed, so that its
supply can be lowered to the standby voltage level while the memory peripheral logic stays
on standby.  The retention mode applies when the memory has not been accessed and the
wordlines have been held low for a long time.  The difference between the standby and
retention modes is that the retention voltage is lower than the standby voltage and is
applied to all the memory banks, while standby is applied on a bank boundary.  The
advantage of this scheme is the reduction of the leakage current in all unused memory
cells.  The access patterns and the memory organization determine the granularity of the
supply separation and the time between the different power modes.
Zhang et al. [39] proposed dynamically switching the power supply of the SRAM
cell to different levels based on the read or write operation. With different voltages
created between the SRAM cell wordline and its internal nodes, the cell read and write
margins can be optimized separately without compromising each other. This approach is
mainly employed for yield improvement on high-end processors and does not address
power supply scaling. It also adds the cost of level shifters, routing
resources, and multiple voltage supplies. Moreover, it is based on raising Vddmem to a
higher value than the wordline supply voltage to increase the SNM, which is not a desirable
feature for low-power mobile designs. Figure 3-3 shows the SNM increase as the memory
supply increases.
 
 
Figure 3-3: SRAM butterfly curves showing the SNM increasing as the SRAM supply increases (Zhang et al. [39])
3.0.4  Body Bias 
Mukhopadhyay et al. [41] used body bias for nMOS and well bias for pMOS to
shift the threshold voltage higher or lower based on the inter-die process corner; leakage
and ring oscillator delay monitoring are used to determine the inter-die process corner. The
main purpose of this work was to apply body bias to reduce the number of parametric
failures. Since the principal cause of parametric failures is the threshold voltage shift
induced by random doping fluctuation, reducing this variation decreases the probability
that a cell fails. Negative threshold voltage (Vt) shifts from the mean affect read and
hold failures, while positive threshold voltage shifts affect access and write failures.
Hence, sensing the process corner, and specifically the inter-die threshold voltage shift,
can determine which failures are most likely to occur.
A circuit is then activated to select the body bias that minimizes the impact of the Vt
shift, and the body voltage is applied to form a forward body bias (FBB) for high Vt
or a reverse body bias (RBB) for low Vt. This approach shifts all nMOS transistor
threshold voltages the same way, so its effectiveness in addressing the SNM issue is
limited to changing the trip point of the forward inverter inside the 6T cell. Additionally,
the approach addresses the global variation and can minimize the yield loss due to SRAM
parametric failures, especially if used along with redundancy; redundancy can be used to
fix a limited number of faulty cells in a column or row, so adding FBB and RBB increases
the chance of passing parts.
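
The selection rule described above can be sketched in a few lines. This is only an illustration of the sign convention stated in the text, not the circuit of [41]: the function name `select_body_bias` and the `guard_band` parameter are hypothetical, and the sensed inter-die Vt shift is assumed to be available as a signed value in volts.

```python
def select_body_bias(inter_die_vt_shift, guard_band=0.0):
    """Pick a body bias from the sensed inter-die Vt shift (volts).

    Returns the bias type and the failure types that the text associates
    with that direction of shift. guard_band is a hypothetical dead zone
    around zero in which no bias is applied.
    """
    if inter_die_vt_shift > guard_band:
        # High-Vt die: access and write failures dominate, so lower Vt with FBB.
        return "FBB", ("access", "write")
    if inter_die_vt_shift < -guard_band:
        # Low-Vt die: read and hold failures dominate, so raise Vt with RBB.
        return "RBB", ("read", "hold")
    return "none", ()


for shift in (+0.04, -0.04, 0.0):
    bias, failures = select_body_bias(shift, guard_band=0.01)
    print(f"Vt shift {shift:+.2f} V -> bias {bias}, likely failures {failures}")
```

The guard band is included only to show where a practical design might avoid toggling the bias for small shifts; its value here is arbitrary.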
3.0.5  Read and Write Assist Circuits  
Special circuit design techniques, such as the ones in [18] [37] [39], are used to
change the voltage applied at the wordline and the Vddmem in order to improve SRAM cell
stability and yield. Yabuuchi et al. [37] proposed SRAM read/write assist circuits to
enlarge the operating margin against wide process and temperature variations with a
single supply voltage. Their approach used a voltage divider to reduce the wordline voltage
and a dummy bitline capacitance to reduce Vddmem during write. Essentially, this approach
is intended to tune the wordline voltage to increase read stability during memory access.
Figure 3-4 shows the implementation of the wordline driver proposed in [37]. In
both cases, the WL voltage is reduced through contention, which increases the
active power. In one approach, a pMOS transistor is used to supply the current from the
supply, while in the second approach, a resistor is used. To increase the write margin,
Vddmem is reduced via charge sharing between the Vddmem column and a dummy metal
capacitance; the dummy metal capacitance is discharged after each access. With process
variation, balancing the capacitances to achieve a balanced voltage is challenging. Also,
in high-density circuits, routing resources are scarce, so adding dummy
metal is undesirable because it takes away metal tracks.
 
 
Figure 3-4: Read assist circuit using voltage divider to reduce wordline voltage on SRAM 
Yamaoka et al. in [18] used floating Vddmem during write to increase the write 
margin of the 6T cell. This approach works well for low-frequency applications, as it 
improves the cell writability by reducing the Vddmem. However, at high frequencies, the 
approach has a limited effect because the Vddmem capacitance is comparatively large and the
discharge path to reduce Vddmem has to go through the pull-up transistor (PU) of the 6T 
cell, which is a particularly small device. 
Pilo et al. [19] also proposed read and write assist circuitry to improve the
SNM and enable lower voltage operation. Read-access disturbs can be decreased by
reducing the amount of charge injection from the VDD-precharged BL to the low node of
the cell. The quicker the BL can be discharged, the less likely an unstable cell is to lose
its data when disturbed. Unstable cells are especially vulnerable during half-selected
operations. Half-selected columns are the columns whose cells share the WL selection but
are neither written to nor read out during write or read operations. They built a circuit to
assist the SRAM cell by reducing the bitline voltage of accessed cells through nFET
transistors that are shared among all cells in the same column. They also proposed
lowering the memory supply voltage on the column that is being written to. The
disadvantages of this approach are:
1.  A reference voltage is needed to generate the reduced memory supply.
2.  The bitline power increases because the bitline voltage is reduced for all the
bitlines in the accessed array.
3.  During a write operation, the SNM issue still exists when column muxing is
used, because all cells in the same row share the same wordline.
 
3.1  Related Work in Cache Organization: CAM vs. SRAM Tag
Both academic and industrial studies [10] [36] have described several of the
reasons for choosing CAM tags over SRAM tags for highly associative caches; however,
these reasons do not include detailed quantitative arguments.
 
3.2  Related Work in Leakage Current Reduction  
There are several methods to minimize the effect of leakage current.  Some of 
these include the following. 
3.2.1  Multi-threshold Voltage (MTV) 
The introduction of MTV in sub-100nm process technology is an attempt to
minimize the leakage power of logic that is not timing critical. It gives the designer an
opportunity to select the transistor type based on threshold voltage and so trade off
performance against leakage power [13] [14] [15]. All of these approaches rely on
design-time analysis of the timing-critical paths to select gates with the corresponding
threshold voltage. However, memory access is often in the timing-critical path, so
selecting an HVT cell would have a big impact on performance. Another use of MTV is to
add low-leakage transistors that have high threshold voltages (HVT) in series with the
supply to reduce the leakage current of the logic gates of a certain block [26]. An HVT
pMOS device can be placed in series with the logic 1 supply voltage (Vdd) to limit the
leakage current (a head switch), or an HVT nMOS device can be inserted in series with the
logic 0 supply (Vss) (a foot switch). Since the leakage current can only flow from the
high-potential supply, Vdd, to the low-potential one, Vss, it is sufficient to use either a
foot or a head switch to limit the leakage from a given gate. In all current approaches,
the control for the head or foot switches is a sleep signal that is activated only in sleep
or standby mode and not during active operation.
 
3.2.2  Voltage Islands  
This technique is widely used on complex SOCs, where the logic is partitioned in
such a way that certain parts can be turned off when not active. Separating different
logic on the SOC is an effective method to reduce leakage power in standby mode. It is
typical in today’s SOCs to have more than three power domains that can be controlled
independently [33]. This approach requires the software to be aware of the different
power modes and to utilize them effectively. It also introduces the need for isolation
cells and level shifters to translate between the different voltage domains.
3.2.3  Well and Substrate Back Biasing  
Since sub-threshold leakage current decreases exponentially with increasing
threshold voltage, many proposals have suggested the use of well and substrate biasing to
increase Vt and reduce leakage [12]. The drawback of this technique is that it increases
the gate leakage current and results in increased gate delays.
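
To illustrate the exponential dependence mentioned above, a minimal sketch follows. It assumes the standard first-order model in which the sub-threshold current drops by one decade for every S volts of threshold increase, where S is the sub-threshold swing. The value S = 100 mV/decade and the function name are illustrative assumptions, not figures from this work.

```python
def subthreshold_leakage_ratio(delta_vt, swing_per_decade=0.100):
    """Ratio of sub-threshold leakage after a Vt increase of delta_vt (volts).

    Assumes I_sub is proportional to 10 ** (-Vt / S), with S the
    sub-threshold swing in volts per decade (illustrative default).
    """
    return 10.0 ** (-delta_vt / swing_per_decade)


for delta_vt in (0.0, 0.05, 0.10):
    ratio = subthreshold_leakage_ratio(delta_vt)
    print(f"Vt raised by {delta_vt * 1000:.0f} mV -> leakage x{ratio:.3f}")
```

Under this assumption, raising Vt by one swing value reduces the sub-threshold leakage by a factor of ten; the gate leakage and delay penalties noted in the text are not modeled.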
 
CHAPTER 4   Power Efficient and Improved Yield SRAM Cache Memory  
4.0  Reduced Wordline Voltage and Memory Supply  
Since SRAM stability plays a central role in determining the Vddmin and chip
parametric yield, it is important to address cell stability to reduce Vddmin and to minimize
the impact of process parameter shift on yield. Our contribution is based on selectively
reducing the voltage level of the wordline (WL) to a value less than the memory supply
voltage, using a single voltage supply. By reducing the voltage on the access transistor,
its saturation current I0 (Figure 1-2) will decrease and will be less than the current
through PD, I1. This will reduce the voltage on the SRAM internal node and, hence,
increase the cell stability. As was shown in the introduction, reducing the pass gate
current makes write completion more difficult. Therefore, to address write
completion, we reduce the Vddmem value of the column that needs to be written. The write
enable signal, which is functionally needed, is used to control the circuit that reduces
Vddmem to its preset value. This mode of operation, in which the wordline voltage is
reduced and the memory supply is lowered during write for the selected column, can be
bypassed. Also, the value of the wordline voltage and the memory supply value can be
programmed through control bits.
4.1  Mathematical Model 
We first show mathematically how our approach improves the SNM of the SRAM cell.
Recall from equations 2-1 and 2-2, which describe the current in the PG and PD of the
SRAM cell during a read operation, that
  
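$$\mu_n C_{ox} \left(\frac{W}{L}\right)_{PD} \left(V_{ccmem} - V_t - \frac{1}{2} V_{n1}\right) V_{n1} = \frac{\mu_n C_{ox}}{2} \left(\frac{W}{L}\right)_{PG} \left(V_{ddwl} - V_{n1} - V_t\right)^{\alpha}$$   (4-1)
Rearranging equation 4-1 using the CR variable gives

$$CR \left(Vdd_{mem} - V_t - \frac{1}{2} V_{n1}\right) V_{n1} = \frac{1}{2} \left(V_{ddwl} - V_{n1} - V_t\right)^{\alpha}$$   (4-2)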
Assuming Vt = 0.35V and CR = 2, the solution of equation 4-2 for Vn1 with Vddmem = 1V and
different values of Vddwl is shown in Table 4-1.
Table 4-1 lists Vn1 and the SNM values for a 45nm SRAM cell. It shows how
the SNM improves as Vddwl is reduced. The SNM is calculated by subtracting the Vn1
value from Vth, which is calculated using equation 2-7 with Vdd = Vddmem = 1V. The
Vth in this case is constant because the inverter in the SRAM cell is always supplied by Vddmem.
 
Table 4-1: SRAM voltage (Vn1) for different Vddwl and for Vddmem=1V

Vddmem (V)  Vddwl (V)  Vn1 (V)  Vth (V)  SNM (V)
1           1          0.119    0.431    0.312
1           0.9        0.088    0.431    0.343
1           0.8        0.061    0.431    0.370
1           0.7        0.04     0.431    0.392
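
The entries of Table 4-1 can be cross-checked numerically. The sketch below solves the rearranged current balance, CR (Vddmem - Vt - Vn1/2) Vn1 = (1/2)(Vddwl - Vn1 - Vt)^α, for Vn1 by bisection and then forms SNM = Vth - Vn1. It is a minimal illustration rather than the tool used for the dissertation: the function names are hypothetical, α = 2 is assumed because that value reproduces the tabulated numbers, and Vth = 0.431V is taken directly from the table.

```python
def solve_vn1(vddwl, vddmem=1.0, vt=0.35, cr=2.0, alpha=2.0, tol=1e-9):
    """Solve the PD/PG current balance for the internal node voltage Vn1."""

    def balance(vn1):
        pull_down = cr * (vddmem - vt - 0.5 * vn1) * vn1   # PD in linear region
        pass_gate = 0.5 * (vddwl - vn1 - vt) ** alpha      # PG in saturation
        return pull_down - pass_gate

    # balance() is negative at Vn1 = 0 and positive at Vn1 = Vddwl - Vt,
    # where the pass gate turns off, so the root lies in between.
    lo, hi = 0.0, vddwl - vt
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if balance(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


VTH_INVERTER = 0.431  # Vth from Table 4-1 (Vddmem = 1 V)


def snm(vddwl, **kwargs):
    """First-order SNM: inverter threshold minus internal node voltage."""
    return VTH_INVERTER - solve_vn1(vddwl, **kwargs)


for vddwl in (1.0, 0.9, 0.8, 0.7):
    print(f"Vddwl={vddwl:.1f} V  Vn1={solve_vn1(vddwl):.3f} V  SNM={snm(vddwl):.3f} V")
```

Calling `solve_vn1(1.0, alpha=1.0)` gives a larger Vn1 than the α = 2 case, in line with the trend described for Figure 4-1. The sketch is meant for Vddwl at or below Vddmem, which is the range covered by the table.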
 
 
Since α and CR can differ based on the technology and the SRAM cell
design, we calculated and plotted the internal node voltage Vn1 and the SNM for the extreme
values that these variables can assume. This shows that the approach is scalable to new
technologies with a small electron-to-hole mobility ratio.
Figure 4-1: Vn1 versus wordline voltage, Vddmem=1V (panels: α = 2 and α = 1)
 
Figure 4-1 shows the Vn1 voltage as a function of Vddwl for different CR values and for
α = 1 and 2. It can be seen that for α = 1 the Vn1 level is higher for all voltages and CR
values. We used the values of Vn1 along with the Vth of the inverter to calculate the SNM
of the 6T cell for the same voltages and CR values as described above.

Figure 4-2: SNM versus wordline voltage, Vddmem=1V (panels: α = 2 with Vth = 0.431V, and α = 1 with Vth = 0.379V)
[Plots for Figures 4-1 and 4-2: x-axis Vddwl (V) from 0.7 to 1.1; y-axes Vn1 (V) and SNM (V); one curve each for CR = 1, 1.5, and 2.]

Write completion is improved by reducing Vddmem for the selected write column.
Figure 4-3 shows the result of solving equation 1.10 for different values of Vddmem and
Vddwl. The plot shows that reducing Vccmem to 0.4V lowers the minimum voltage Vddmin
from 0.9V to 0.7V (the traditional Vddmin is about 29% higher than the new one).

Figure 4-3: Write completion plot of Vn2 versus wordline voltage for different Vccmem
[Plot: x-axis Vddwl (V) from 0.7 to 1; y-axis Vn2 (V) from 0 to 0.6; curves: Vn2 with Vddwl = Vccmem, Vn2 with Vccmem = 0.4V, and Vth; annotations mark the traditional Vddmin and the new Vddmin with Vccmem = 0.4V.]

4.2  Simulation Model Using Hspice
We built simulation circuits and used Hspice with the BSIM4 model for a 45nm
low-power foundry process technology to find the SNM and the writability of the SRAM
cell. Figure 4-4 shows the schematic of the circuit used to generate the value of Vth. It
also shows the waveform output of the Hspice simulation of the circuit. We swept Vn1 from 0
to Vdd and then found the Vn1 value at which Vn1 = Vn2.
 
Figure 4-4: Circuit used for simulation to find the inverter threshold Vth (also shows the spice 
waveforms) 
 
The value of Vth for the inverter is the maximum noise from all sources that the 
cell can tolerate when it is not accessed. We refer to this as cell retention SNM.    
 
 
[Waveform y-axis: Vn2 (V)]

Vdd (V)  Vth (V)
0.7      0.29
0.8      0.336
0.9      0.382
1        0.425
To see the effect of reducing the wordline voltage on the active static noise margin,
we simulated the circuit shown in Figure 4-5 to find Vn1 for different voltages and CR values.
For each wordline voltage waveform, we plotted the corresponding Vn1 voltage, as shown
in the Spice result waveform in Figure 4-5.
 
 
 
Figure 4-5: Circuit to find memory cell voltage and simulation waveform
 
 
To validate the relationship between CR and the static SNM, we designed SRAM cells
with different cell ratios and kept the PU ratio constant. The result of the simulation for
the different voltages is shown in Figure 4-6. Although the first-order equation and the
Spice simulation differ in exact values, they agree in the overall trend.

Figure 4-6: Simulated SNM for different voltages and cell ratios
[Plot: x-axis CR (0.36, 0.72, 2); y-axis SNM (V) from 0 to 0.3; one curve each for Vdd = 0.7, 0.8, 0.9, and 1V.]
 
We used the circuit in Figure 4-5 to find the active SNM from simulation, using
the butterfly curve approach [49]. Hspice was run on a netlist of the circuit with full
parasitic extraction to determine the SNM, as shown in Figure 4-7. Table 4-2 shows the
values of SNM for each supply voltage, assuming the same voltage for all transistors.
 
Table 4-2: SNM of 6T cell in 45nm process technology 
 
Voltage (V)  0.8  0.9  1  1.1 
SNM (V)  0.112  0.132  0.14  0.17 
  
 
 
The table and graph show that the higher the supply voltage, the better the SNM and the
cell stability; this translates to better yield. However, as the voltage increases, the
active power increases quadratically.
 
Figure 4-7: 6T cell SRAM SNM for different voltage  
 
The same circuit used to generate the SNM for the different supply voltages was also
simulated with the reduced voltage swing (RVS) approach. The supply voltage for all
transistors was fixed at 1V, and the wordline voltage was varied from 0.7V to 1V.
Table 4-3 shows that reducing the wordline voltage by 100mV has the same effect on SNM
as raising the supply voltage of a traditional design by 100mV.
 
Table 4-3: SNM with RVS circuit and fixed supply 
 
Wl_hi_max(V)  SNM(V)  Vdd(V) 
0.700  0.257  1.00 
0.800  0.205  1.00 
0.900  0.177  1.00 
1.000  0.117  1.00 
 
The improvement of SNM with the RVS circuit is also shown in Figure 4-8, using the
butterfly approach.
 
 
Figure 4-8: SNM for 45nm foundry SRAM cell using 1V Vdd and different wordline voltage levels 
4.3  Timing Impact 
The typical process spread of the different corners holds true for our process.  
Lowering  the  supply  voltage  increases  the  impact  of  process  variations.  A  given  VT 
variation  has  more  of  an  impact  when  the  supply  voltage  is  low  because  Idsat  is 
proportional to VGS – VT. Figure 4-9 shows the relative performance of the SRAM cell 
at the different process corners and as a function of supply voltage. 
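
The sensitivity argument can be made concrete with a short calculation. The sketch below takes the first-order proportionality Idsat ∝ (VGS - VT) stated above at face value and evaluates the fractional current change for a fixed threshold shift at several supply voltages. The 30mV shift is an illustrative assumption, the 0.35V threshold is the value used in Section 4.1, and the function name is hypothetical.

```python
def relative_idsat_change(vgs, vt, delta_vt):
    """Fractional change in Idsat for a Vt shift, assuming Idsat ~ (Vgs - Vt)."""
    nominal = vgs - vt
    shifted = vgs - (vt + delta_vt)
    return (shifted - nominal) / nominal


for vdd in (1.0, 0.8, 0.7):
    change = relative_idsat_change(vgs=vdd, vt=0.35, delta_vt=0.03)
    print(f"Vdd={vdd:.1f} V: Idsat changes by {change:+.1%}")
```

The same 30mV shift removes a larger fraction of the gate overdrive at 0.7V than at 1V, which is the reason the process spread widens as the supply is lowered.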
Figure 4-9: Relative performance of SRAM cell at different process corners and voltages
[Plot: x-axis voltage (V) from 0.6 to 1.2; y-axis relative access time from 0 to 4.5; curves: ff/tt, fs/tt, tt, sf/tt, ss/tt.]

Since the proposed design assumes a reduced voltage swing on the pass transistor
(PG), the read current of the SRAM cell will be reduced. We simulated the SRAM cell
for different wordline voltages and at different corners. Figure 4-10 shows the relative
performance of the new design using the RVS approach with the wordline voltage Vddwl
reduced by 10%. The performance is normalized to the normal operation mode at the tt
corner. For both the FF and FS corners, the performance with the RVS circuit still exceeds
the tt performance of the traditional design. This is important because, as we have shown
before, the SRAM stability is worse for the FS and FF corners and less of a concern for the
SS corner. So, if we choose to enable the RVS technique at the FF and FS corners, there
would be no impact on the timing. At the same time, the SNM is improved for the high-risk
corners.
Figure 4-10: Normalized read current of SRAM cell using RVS wordline versus voltage measured at
different process corners
[Plot: x-axis voltage (V) from 0.6 to 1.2; y-axis relative access time from 0 to 8; curves: ff_rvs/tt, fs_rvs/tt, tt_rvs/tt, sf_rvs/tt, ss_rvs/tt.]

4.4  Circuit to Generate Reduced Voltage Swing
The proposed circuits to generate both high and low reduced voltage swing (RVS)
values are described next. The RVS low circuit is employed to reduce the SRAM memory
supply to less than Vdd during write. Conversely, the RVS high circuit is used on the
wordline driver of each row to limit the wordline logic 1 value to a certain voltage. The
approach in both RVS circuits is to stop the charge or discharge path once the
target signal reaches the set value. The signal is always actively driven and draws no dc
current. The basic RVS low circuit is shown in Figure 4-11.
 
 
Figure 4-11: Basic circuit to generate RVS low signal 
 
In this approach, no extra supply is needed, and adding one transistor per column
incurs minimal area overhead compared to the total array area. If
column multiplexing is used in the cache, then one RVS low circuit can be shared
among the same set of columns. The rvs_wr signal is a control signal that can be used to
select the low-voltage operation mode; otherwise, the value of Vccmem will always be
equal to the normal supply Vcc. The Wren signal indicates that there is a write
operation, guaranteeing that the Vccmem value is reduced only during the write operation.
The logic that generates rvs_mode can be shared among all columns of the same
memory bank. However, one limitation of the basic circuit, visible in Figure 4-12, is that
it only limits the swing of Vccmem to between Vdd and Vdd – Vt, where Vt is the threshold
voltage of the LVT transistor. Another limitation is that the LVT transistor in the
discharge path has its gate connected to Vccmem, which has a variable level and results in
a long tail on the Vccmem signal during the high-to-low transition. Figure 4-12 shows the
Spice simulation result of the basic RVS low circuit.
 
 
 
 
 
 
 
 
Figure 4-12: Basic RVS low circuit Spice simulation result  
 
The main advantages of this circuit are its low overhead and simple implementation.
We developed another circuit option, shown in Figure 4-13. The improved programmable
RVS low circuit can vary the logic 0 value based on how many bits of the Cnt[n:0] bus
are selected. Each of the Cnt[n:0] bits corresponds to a W[n:0] transistor, and each changes
how fast the fbd node can be discharged to Vt through the dotted path 1.
 
 
Figure 4-13: Improved RVS low circuit with bypass and programmable capabilities 
 
The transistor Mp1 is a minimum-size transistor and has no impact on the speed of the
circuit. A key advantage of this circuit, in addition to programmability, is that the fdb
node is isolated from the Vddmem node and is kept at Vdd until Vddmem is driven to a Vt
lower than Vdd. This enables a sharp pull-down transition on Vddmem compared to the
basic RVS circuit. Also, the nd1 and nd2 gates can be shared among all columns
of the same bank, which minimizes the area impact.
 
 
Figure 4-14:  Spice simulation waveform of improved and programmable RVS circuit 
 
The same circuit technique can be used to clip the top of the signal waveform
with RVS high, as shown in Figure 4-15. One advantage of this design is that Wl is
actively driven at all times, and the logic high of the wordline can be varied based on
the Cnt[n:0] programmable bits. The Mp3 transistor clamps the Wl node to pk0 when
both Pwr_mode_Wl and any of the Cnt[n:0] signals are at logic 1. When Wlb transitions
from logic 1 to logic 0, transistor Mn1 guarantees that pk0 is at logic 0, which activates
Mp2, and the path from Vdd to Wl begins to charge Wl up. When the difference between pk0
and Wl exceeds the Vt of Mp3, the transistor turns on and
begins to shut off transistor Mp2.
 
 
 
Figure 4-15: Traditional and reduced voltage swing high circuit and waveform for SRAM wordline 
 
Figure 4-16 shows a typical SRAM memory main block organization with the added
RVS control circuitry marked in yellow. The location of these blocks has minimal
impact on the floor plan and the added area.
 
 
 
Figure 4-16: SRAM-based memory main block showing the reduced voltage swing control circuit
location
[Figure signal labels: Bl[n], Add[n:0], clk, din[m:0]/2c, dout[m:0]/2c, Add[c:0], Wl_clk, pre_dec, Read_en, Write_en, Bl[0], Vddmem, Vdd.]


4.5  Summary
With today’s small-geometry process technology, which has greater than 30%
variation, design-time optimization often results in over-design and a less
competitive product. The alternative to design-time optimization is adaptive
design, in which the circuit is tuned to the process parameters and adjusts its response to
minimize the effect of process variation. We proposed a power-efficient, low-cost, and
yield-enhancing design methodology that can be employed in SRAM-based memory. In
this design, the area overhead is kept to a minimum, and it has no effect on the memory
access time.
In this chapter, we showed mathematically and with Spice simulation using 45nm
foundry process files that the SRAM cell SNM increases as the wordline logic 1
value decreases. We also showed that decreasing the wordline voltage while keeping all
other supplies at the traditional level has the same effect on SNM as increasing the supply
voltage. Finally, we studied the effect of reducing the wordline voltage on the memory
access time at the different process corners.
As expected, memory access at the slow process corner ran slower when we reduced
the wordline voltage. Fortunately, the effect of SNM on parametric failure is worst at the
fast corners. We showed, in fact, that the fast-corner access time has enough timing margin
to accommodate the reduced wordline voltage. We then proposed a novel circuit that can
be used to locally generate a reduced voltage signal for the wordline voltage and for the
memory supply. The end result is a memory design that can respond to
process variation and adjust its behavior to cope with the effects of parameter shifts using
a single supply voltage. This capability enables full voltage scaling and removes the
requirement for a minimum voltage based on SRAM stability. Reducing the operating voltage
results in a quadratic reduction in active power and an exponential reduction in leakage
power.
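
As a rough illustration of the quadratic claim, the sketch below applies the usual first-order relation for active power, P = C V^2 f, with capacitance and frequency held constant, to the Vddmin values discussed in Section 4.1 (0.9V for the traditional design and 0.7V with the reduced memory supply). The function name is hypothetical, and leakage power is not modeled.

```python
def active_power_ratio(v_new, v_old):
    """Ratio of active power after a supply change, assuming P ~ C * V^2 * f."""
    return (v_new / v_old) ** 2


ratio = active_power_ratio(0.7, 0.9)
print(f"Active power at 0.7 V is {ratio:.1%} of the power at 0.9 V")
print(f"Active power saving: {1.0 - ratio:.1%}")
```

Under these assumptions, operating at the lower Vddmin saves roughly 40% of the active power before any frequency change is taken into account.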
 
CHAPTER 5    Design of 8KB SRAM Memory Test Chip with RVS Circuit
5.0     Test Chip Description 
A test chip with eight kilobytes (KB) of SRAM was designed with the proposed RVS
circuit. The chip is manufactured using a 45nm low-power process technology optimized
for mobile applications. The chip is composed of four memory arrays, each with
64 wordlines (rows) and 256 columns. A 4:1 multiplexer is used between the
columns of the same array, followed by a 2:1 multiplexer between the two adjacent arrays.
Figure 5-1 shows the test chip main blocks and a detailed floor plan with a list of
interface signals. The Cs_n pin is the chip select, which is used to enable the local clock
of the chip. The addr[10:1] bits, along with waysel_rd and waysel_wr, are used to select the
set. The Wb_rd and Wb_wr signals select between read and write operations, while the
rd_rvs[1:0] and wr_rvs signals are used to control the RVS system and set the voltage
level.
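
The array organization can be checked with simple arithmetic. The sketch below confirms that four arrays of 64 rows by 256 columns hold 8KB, and that a 4:1 column multiplexer followed by a 2:1 multiplexer between adjacent arrays yields a 128-bit data path, which matches the width of wb_do[127:0] in Table 5-1. The grouping of the four arrays into two adjacent pairs is an assumption made for this check; the constant names are illustrative.

```python
ARRAYS = 4
ROWS_PER_ARRAY = 64
COLUMNS_PER_ARRAY = 256
COLUMN_MUX = 4   # 4:1 mux between columns of the same array
ARRAY_MUX = 2    # 2:1 mux between two adjacent arrays

total_bits = ARRAYS * ROWS_PER_ARRAY * COLUMNS_PER_ARRAY
total_kilobytes = total_bits / 8 / 1024

bits_per_array = COLUMNS_PER_ARRAY // COLUMN_MUX   # bits left after the column mux
array_pairs = ARRAYS // ARRAY_MUX                  # assumed: one array selected per pair
data_width = bits_per_array * array_pairs

print(f"capacity  : {total_kilobytes:g} KB")
print(f"data width: {data_width} bits")
```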
 
 
Figure 5-1: Block diagram of the test chip 
   
The acc[6:0] bus is used to tune the delay from wordline to sense amplifier assertion
during read, and it is also used to turn off the wordline after write completion.
 
Figure 5-2: Detailed view of the test chip die showing the placement of the main blocks 
 
Figure 5-2 depicts the die photo of the 8KB test chip with the main block
placement. In addition to the four memory blocks, there are two local input/output (IO)
blocks, each of which has 8:1 multiplexers between the two adjacent memory banks. The
outputs of the two local IO blocks are sampled by the global IO block, which contains the
sense amp. The area increase due to adding the RVS circuitry is 4% of the total area
(207µm x 271µm versus 207µm x 283µm). This area overhead accounts for both the RVS
control and the actual circuitry, which together lower the wordline voltage and the memory
supply voltage during write.
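
The quoted overhead follows directly from the two footprints. The short calculation below assumes that 207µm x 271µm is the block without the RVS circuitry and 207µm x 283µm is the block with it; the text does not state which is which, so this reading is an assumption, as are the constant names.

```python
WIDTH_UM = 207
HEIGHT_BASE_UM = 271   # assumed: block height without RVS circuitry
HEIGHT_RVS_UM = 283    # assumed: block height with RVS circuitry

base_area = WIDTH_UM * HEIGHT_BASE_UM
rvs_area = WIDTH_UM * HEIGHT_RVS_UM
added_area = rvs_area - base_area

overhead_vs_base = added_area / base_area    # increase relative to the original block
overhead_vs_total = added_area / rvs_area    # share of the final block area

print(f"added area       : {added_area} um^2")
print(f"overhead vs base : {overhead_vs_base:.1%}")
print(f"share of total   : {overhead_vs_total:.1%}")
```

Both ways of expressing the overhead round to about 4%, consistent with the figure given above.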
5.1  Interface Signals and Logical View 
   
The signal interface to the memory can be categorized into three types. The first is the
data bus, which consists of the input data from the CPU and the output data from the
memory to the CPU. The second is the address bus, which determines which entry out
of the 256 rows is accessed. The third is the set of control bits, which identify
whether the access is a read or a write. This third interface also includes the acceleration
bits, which are used to tune the delay from the wordline rise to sense amplifier assertion
during read access and to trigger write completion during a write operation. Since
the memory access starts off the rising edge of the clock, all interface signals go through
a latch that is transparent during the low phase of the clock. The address bits
must be set up relative to the wordline clock, which means that the setup time for
these signals is longer than the normal flop setup time.
Figure 5-3 shows the main logic of the memory block and the relative timing of 
each function. The lclk logic includes clock gating of the array clock with chip select 
(cs_n) signal. The wordline logic combines the pre-decode logic of the address bits with 
array local clock to select the corresponding wordline. After wordline is selected, the 
memory is accessed and data is steered through three levels of multiplexing to drive the 
output bus.   
 
Figure 5-3: Test chip interface timing diagram 
 
Table 5-1: Interface signals and timing

Signal         | Number of bits | Input/output | Default value | Description | Timing max (ns) | Timing min (ns)
wb_do[127:0]   | 128 | out |          | output data | |
acc[6:0]       | 7   | in  | b0110000 | control bits to change delay from wordline to sense amp asserted; bit 6 is for write and 5:0 for read | static | static
wb_din[127:0]  | 128 | in  |          | input data | 0.07 | 0.17
addr[10:8]     | 3   | in  |          | one bank out of 8 | 0.34 | 0.28
addr[7:3]      | 5   | in  |          | one out of 8 wordline | 0.22 | 0.12
addr[4:3]      | 2   | in  |          | one out of 4 groups | 0.22 | 0.12
addr[2:1]      | 2   | in  | b11      | byte enable | 0.7 | 0.24
size           | 1   | in  | b1       | always 1 for 128 bit access | 0.38 | 0.28
rd_rvs[1:0]    | 2   | in  | b00      | control RVS during read | static | static
cs_n           | 1   | in  |          | chip select | 0.12 | 0.17
nb_rd          | 1   | in  | b0       | set to 0 | static | static
wr_rvs         | 1   | in  | b0       | control RVS during write | static | static
update         | 1   | in  | b0       | set to 0 | static | static
waysel_rd_nb   | 1   | in  | b0       | set 0 | static | static
waysel_rd_wb   | 1   | in  |          | way select part of the address | -0.1 | 0.26
waysel_wr      | 1   | in  |          | way select part of the address | -0.1 | 0.29
wb_rd          | 1   | in  |          | read enable | 0.23 | 0.29
wb_wr          | 1   | in  |          | write enable | 0.23 | 0.29
 
The list of the chip interface signals with their functionality and timing appears in
Table 5-1. The decode logic for the address bits is organized in multiple stages based on
the timing criticality of the bits. The memory is divided into eight banks, with the
higher-order bits of the address (bits 10, 9, and 8) used to select one bank out of eight.
In each bank, there are 32 cache lines (entries): bits 7, 6, and 5 are decoded into 8
one-hot signals (predec765[7:0]), and bits 4 and 3 are decoded 2 to 4 (predec43[1:0]). The
final wordline selection is a logical AND of the two pre-decode signals and the local clock
(predec765 & predec43 & lclk). Figure 5-4 depicts the logical view of the memory and its
organization. The waysel signal is the most timing-critical signal in the original design
because it depends on the tag lookup; it is used as the column-mux select signal, which
relaxes the setup timing requirement on it.
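
The two-stage decode within a bank can be modeled behaviorally as follows. This is a sketch under stated assumptions, not the chip's netlist: the function names are hypothetical, the address is treated as an integer whose bit k corresponds to address bit k, the ordering of the 32 wordlines within a bank is assumed, and four one-hot lines are used for the bit 4 and 3 pre-decode.

```python
def predecode(addr):
    """Return the one-hot pre-decode buses for address bits 7:5 and 4:3."""
    upper = (addr >> 5) & 0b111   # bits 7, 6, 5 -> 1 of 8
    lower = (addr >> 3) & 0b11    # bits 4, 3    -> 1 of 4
    predec765 = [int(i == upper) for i in range(8)]
    predec43 = [int(i == lower) for i in range(4)]
    return predec765, predec43


def wordlines(addr, lclk):
    """AND the two pre-decode buses with the local clock to form 32 wordlines."""
    predec765, predec43 = predecode(addr)
    # Assumed ordering: wordline index = 4 * (bits 7:5) + (bits 4:3).
    return [p765 & p43 & lclk for p765 in predec765 for p43 in predec43]


wl = wordlines(0b10110000, lclk=1)
print("selected wordline:", wl.index(1))
print("wordlines high with clock low:", sum(wordlines(0b10110000, lclk=0)))
```

With the local clock low, no wordline is asserted regardless of the address, which reflects the role of lclk in the final AND stage.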
 
Figure 5-4: Test chip logical organization and address decode stage 
 
5.2  Block Level and Timer Circuit Design  
The detailed block-level design with the major interface signals between the sub-blocks
is shown in Figure 5-5. The pre-decode logic and the timer include all the control
logic and all the synchronization signals. The chip uses a synchronous timing scheme to
trigger read and write accesses. The sense amp and write enable signals are tracked using
the dummy bitline and the acceleration circuit, which can be controlled either through a
software register or by direct access through the racc control bits. The wordline
decode is done in two stages: the first stage decodes 3 to 8 for both the high-order and
low-order bits, and then the two 8-bit buses are combined with the clock to select one
wordline out of 64. The wordline driver, including the RVS control circuit that is
controlled by the rd_rvs[1:0] signals, is pitch-matched to the SRAM cell height. The
memory supply for the SRAM cell (Vcc_mem), which is routed parallel to the bitline, is
part of the IO sub-block. Its value is controlled by the wr_rvs signal during a write
operation and is clamped to the chip supply in all other modes of operation.

Figure 5-5: Detail block level presentation with major interface signals
[Signal labels recovered from the figure: Bit line, Bit line bar, iclk, Pre_dec1[7:0], Pre_dec2[7:0], clk, Add[6:0], Cs_n, Scan_n, Wb_rd, Wb_wr, Vcc_mem, Rd_rvs[1:0], Wr_rvs.]

Figure 5-6 shows the detailed timer design, which uses a pull-push circuit to generate the
internal clock (iclk). The iclk signal is asserted when the chip select cs_n is low; if the
chip is not in a test mode such as ATPG, iclk is deasserted when the ready signal goes high.
The ready signal is generated by control circuits (racc and the dummy column) that try to
track the SRAM array performance. The dummy bitline matches the capacitance
of the actual bitline, and the acceleration bits adjust the delay from wordline rise to
sense-amp enable during a read operation and to write-enable turn-off during a
write operation. This delay can be adjusted post-silicon, which gives the
design the flexibility to trade off speed against yield.
 
 
Figure 5-6: Detailed timer circuitry with clock generation and control signals interface

Figure 5-7 shows the delay control circuitry. A single pull-up
transistor (d_prech) preconditions the dummy bitline by pulling it to the supply
voltage during the pre-charge phase. When iclk is asserted, the pull-down stack turns
on and begins to discharge the dummy_bl node. The time it takes to pull
dummy_bl down to logic 0 and assert the sa_clk and ready signals is controlled by the
value of the acc bits. The write delay differs from the read delay, and the wren signal
is used to select the corresponding pull-down stack.
 
[Figure 5-7 signal labels: Acc<0> through Acc<7>, wren, Dummy_bl]
 
 
Figure 5-7: Delay control circuit with acc bit signals for read and write accelerators 
 
 
5.3  Timing Simulation Results 
 
[Figure 5-8 plot. X-axis: PVT corner (ss_1v_125, ss_1v_m30, sf_1v_m30, sf_1v_85, fs_1v_m30, fs_1v_85, ff_1v_85, tt_1v_85, tt_1v_m30). Left axis: separation (V), 0.00 to 0.30. Right axis: time (s), 0 to 8.00E-10. Series: bl_sep_normal, wl_r_sensb_f.]
wl_r_sensb_f
 
Figure 5-8: Simulation result of separation value and time to develop across different PVT 
 
The internal timing of the memory has built-in race conditions that pose a
challenge across the different PVT conditions. One of the most important races
in a synchronous SRAM design is between the assertion of the
sense amplifier enable and the development of the bitline differential. Since the bitline
development is measured in volts and the sense-amp delay is measured in time, we reference
both events to the point from which they diverge, the rising edge of the clock.
The bitline development starts when the wordline is at logic 1; at the same time, the timer
circuitry, which uses the dummy bitline to match the bitline capacitance, triggers
the sense amplifier enable signals. Figure 5-8 shows the measured bitline separation and
the time to develop it, both measured from wordline rise to sense-amp
enable. A full SPICE simulation, including post-layout parasitics, was run on the chip
using the HSIM high-speed SPICE simulator. The waveforms of the major signals for
both normal and RVS circuit control are shown in Figure 5-9. The simulation was run
at 500 MHz, and for both normal and RVS-enabled cycles the timing
margins were met at all corners except the slow process corner, where we showed
earlier that SNM is not an issue.
 
 
Figure 5-9: Major signals waveforms from HSIM simulation when RVS is disabled and enabled 
[Figure 5-9 waveform labels: clk, lclk; prech, wl; wren; vddmem; bl, blb; n1, n2; datain; saen_l. Annotated cycles: write cycle, read cycle, write RVS enable, read RVS enable.]

5.4  Testing Strategy and Chip Integration  
The 8KB memory chip (DUT) is integrated into a foundry shuttle that carries other
modules. A simple top-level controller routes the different module interfaces
to the chip pins. For the DUT, a shift register (SR) eight flip-flops deep on each input
and output pin holds the data entering and leaving the DUT. A slow clock
(100 MHz) is used to load and unload the SR.  In addition, a software BIST-like routine
will be used to generate data and addresses for all memory locations. The data patterns
will be all h0’s, h1’s, hA’s (1010), and h5’s (0101). The same data patterns will also be
applied with four acc settings and four RVS combinations. The RVS bus controls the
level of the reduced wordline voltage, as explained in Section 4.4.
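
The test sweep described above (four data patterns, four acc settings, four RVS combinations) can be enumerated as in the sketch below. This is only an illustration under stated assumptions: the nibble values are taken literally from the text, the acc and RVS settings are placeholder indices because the actual codes are not given, and the 32-bit word width and the names fill_word and sweep_matrix are hypothetical.

```python
from itertools import product

# Nibble values named in the text, taken literally: h0, h1, hA (1010), h5 (0101).
DATA_NIBBLES = (0x0, 0x1, 0xA, 0x5)
ACC_SETTINGS = range(4)   # placeholder indices; the actual acc codes are not given
RVS_SETTINGS = range(4)   # placeholder indices; the actual RVS codes are not given


def fill_word(nibble, width_bits=32):
    """Repeat a 4-bit pattern across a word (the 32-bit width is an assumption)."""
    word = 0
    for _ in range(width_bits // 4):
        word = (word << 4) | nibble
    return word


def sweep_matrix():
    """Yield every (data word, acc setting, RVS setting) combination."""
    for nibble, acc, rvs in product(DATA_NIBBLES, ACC_SETTINGS, RVS_SETTINGS):
        yield fill_word(nibble), acc, rvs


runs = list(sweep_matrix())
print(len(runs))
```

Each combination would then be applied to every memory location through the shift registers, so the total test time scales with the 64 combinations times the number of addresses.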
 
 
 
Figure 5-10: Chip-level integration with SR on all input and output  
 
 
 
 
5.5  Expected Test Chip Result 
The test chip has been manufactured and is awaiting packaging, but because of delays
in fabrication and wafer probing at the fab, we did not obtain silicon measurements in the
time frame of this thesis.  However, a detailed SPICE simulation and the detailed design
analysis flow that is used to qualify production chips have been used to predict the
expected results of the silicon.
 
 
 
Figure 5-11: Expected result from test chip  
 
Figure 5-11 shows the frequency vs. voltage shmoo result of the test chip.  The
proposed approach (Fig B) enables the chip to run at lower voltages than the traditional
design (Fig A) without the RVS system.  As we showed earlier in Section 4.3, for the
same process corner the chip runs slower when RVS is enabled than the
traditional design does, which is also shown in the simulation result.  This predicted
result is based on detailed Monte Carlo SPICE simulation, including full parasitic
extraction and foundry process files that have been calibrated with test chip silicon
results.  We plan to publish the actual results once they become available.
The improvement in power from using the RVS approach can be seen when the
system supply voltage is reduced by 150 mV from the traditional Vddmin of 0.8 V.  This
assumes an application that does not require high performance but cares more about power
efficiency, such as an MP3 player.  The power saving in this case is 47%, which is
calculated assuming the frequency is the same in both cases.
Another mode in which RVS saves power is when the system is activated based
on the silicon corner being fast.  As we showed in Section 1.5, in the fast corner the bitline
voltage is reduced by 2.5x more than in the slow corner (500 mV vs. 200 mV).  When RVS
is activated based on the silicon global corner, the improvement in bitline-related
power is 2.5x.
 
CHAPTER 6      Cache Organization: CAM versus SRAM Tag 
6.0   Tag Array Design for High-Associativity Cache
The question of whether CAM tag- or SRAM tag-based cache designs are better
is as old as cache design itself. However, with the introduction of the StrongARM™
processor [8], the issue has taken on greater significance in the embedded processor
space. Ever since this design demonstrated the advantages of CAM-based designs, it has
been widely believed that CAM-based caches inherently consume less power
than SRAM-based ones. Both academic and industrial studies [10] [36] have described
several reasons for choosing CAM tags over SRAM tags for highly associative
caches; however, these reasons do not include detailed quantitative arguments.
Moreover, with the introduction of multithreading and multicore SOCs, larger
caches with higher associativity are essential to achieving
high memory performance. This, combined with the use of smaller-geometry process
technology, in which wire capacitance becomes a significant contributor to power
consumption, demonstrates the need for a detailed and thorough study of the two tag options.
We will present a detailed analysis of the same cache architecture implemented in
both styles and show that, for usage patterns with moderate to high switching in address
and data, CAM tag caches consume the same or even more power than SRAM
tag caches.
In our comparison, we looked in detail at a recently completed DSP core Level 1
(henceforth L1) data cache and used data from a very similar cache from a high-
performance ARM core designed for the same process technology; both devices were
designed at Qualcomm. In both cases, the L1 data caches were 32KB, 16-way set-
associative caches, each with a 32-byte cache line size and 64 entries per way.
The L1 caches chosen from the DSP and ARM cores are both
virtually indexed and physically tagged, using 32-bit virtual addresses (VA) [50]. The 32-
byte cache line size and a minimum page size of 4KB effectively divide the address into
the tag, index, and offset fields, as shown in Figure 1. The tag bits of the VA generated
by the Address Generation Unit (AGU) must be translated into a Physical Address (PA)
through the Translation Lookaside Buffer (TLB) before the cache access can be
completed. The untranslated index and offset bits of the address are available much
earlier than the PA tag bits. This timing difference is an important factor in the critical
speed differences between the cache organizations.
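
The field split implied by these parameters can be sketched as follows. The 32-byte line gives 5 offset bits and the 64 entries per way give 6 index bits, so index plus offset occupy 11 bits, which fits inside the 12-bit offset of a 4KB page and therefore needs no translation. The function name split_address and the treatment of all remaining upper bits as the field sent to the TLB are assumptions for illustration; the exact tag width depends on the physical address width.

```python
LINE_BYTES = 32        # cache line size from the text
SETS = 64              # entries per way from the text
OFFSET_BITS = LINE_BYTES.bit_length() - 1   # 5
INDEX_BITS = SETS.bit_length() - 1          # 6


def split_address(va):
    """Split a 32-bit virtual address into (upper bits, index, offset)."""
    offset = va & (LINE_BYTES - 1)
    index = (va >> OFFSET_BITS) & (SETS - 1)
    upper = va >> (OFFSET_BITS + INDEX_BITS)   # sent to the TLB for translation
    return upper, index, offset


upper, index, offset = split_address(0x12345678)
print(hex(upper), index, offset)
```

Because the index and offset come straight from the VA, the tag, data, and state arrays can be addressed while the TLB is still translating the upper bits.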
We will compare the two cache styles in terms of
structure, timing, area, and power.
 
6.1  Structural Comparison 
The main difference between a CAM-based tag and an SRAM-based tag is that, in
the CAM tag, each entry of the tag has its own comparator. The CAM cell combines a 6T
SRAM cell with a comparator, whose topology varies according to speed, power, and area
metrics [36]. The sl and slb signals are the search lines on which the PPN address is
compared to the stored value of the tag. The match line combines several CAM cells,
typically eight of them, and is a dynamic signal with pre-charge logic that preconditions
the node to logic 1. Depending on the value of the tag, the match line either remains at
logic 1 or is discharged to logic 0. In the CAM-based tag, the higher address bits are
distributed to the selected set of tags to be compared with the stored tags. The result is
referred to as the hit way.
 
Figure 6-1: CAM cell schematic  
 
In the SRAM-based tag, the number of comparators is equal to the number of 
ways. The tag data is stored in a typical small-signal array, which is accessed using the 
lower address bits to select the appropriate sets that need to be compared to the PA.  
Figure 6-2 shows the data flow of the cache array using an SRAM-based tag. 
Figure 6-2: SRAM-based cache operation and data flow  
 
Besides the data and tag arrays, this cache stores the cache-line state in a separate, 
multi-ported memory array referred to as the state array. The cache operation starts with 
the delivery of the 32-bit VA from the AGU. The VA must be translated into a PA using 
the Translation Lookaside Buffer (TLB). The VA index bits are used to access the tag, 
data, and state arrays. The PA tag is compared to the tag values stored in the tag array 
entries, and after being qualified by the cache entry state, the hit results are used to select 
the corresponding data array entries. The state bit is used to identify the status of the 
cache line, that is, whether it is valid, invalid, or reserved. The replacement algorithm 
keeps track of each cache entry and updates the state array accordingly. 
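
The tag comparison and state qualification described above can be sketched behaviorally as below. This is an illustrative model only: the TagEntry class and lookup function are hypothetical names, and treating only the valid state as eligible to hit (so that invalid and reserved entries never hit) is an assumption.

```python
from dataclasses import dataclass

WAYS = 16


@dataclass
class TagEntry:
    tag: int
    state: str   # "valid", "invalid", or "reserved", as named in the text


def lookup(tag_set, pa_tag):
    """Compare pa_tag with all ways of one set and return a hit vector."""
    # Assumption: a way hits only if its tag matches and its state is valid.
    return [int(e.tag == pa_tag and e.state == "valid") for e in tag_set]


tag_set = [TagEntry(tag=0, state="invalid") for _ in range(WAYS)]
tag_set[3] = TagEntry(tag=0x2468A, state="valid")
tag_set[9] = TagEntry(tag=0x2468A, state="invalid")   # stale copy, must not hit

hits = lookup(tag_set, 0x2468A)
hit_way = hits.index(1) if any(hits) else None
print(hits, hit_way)
```

In a serial lookup, the data array would be accessed only after this hit vector is known, using the index for the set and the hit vector for the way.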
A key architectural decision is whether the data arrays are accessed in series with
the tag arrays or in parallel with them. For power reasons, a serial cache
lookup is typically preferred in an embedded processor. This implies that the TLB, tag, and
state arrays are accessed first and that the data arrays are accessed only after the
tag comparison results are available. In such a design, the cache data array access will
not start until the exact set and way have been selected.  Figure 6-3 shows the
organization of the CAM-based cache. It is similar in many respects to the SRAM-based
cache. For a CAM-based tag, the cache banking must be based on the index so that all the
contents of a cache line are stored with its respective CAM entry. Additionally, all 16
cache lines of each set must be stored in the same bank to ensure that only a single set of
CAM comparators is activated. Overall, these requirements allow less flexibility in the
organization of the CAM-based cache. Moreover, since the L1 in our case is pseudo-dual
ported, keeping the entire cache line in one set of a bank is important for minimizing
bank conflicts. Other banking schemes could work functionally but would require either
duplicate CAM entries or the activation of more than a single bank of CAM comparators.
Figure 6-3: CAM-based tag memory organization and data flow    
 
6.2  Area and Floor Plan Comparison 
The choice of a CAM tag instead of an SRAM tag array directly affects both the banking
options and the floor plan. SRAM-based tag array caches are more flexible with
regard to banking options, as the wordline selection occurs through the decoding of the
index bits while factoring in the hit signals from the tag array. In our design, the wordline
decoding occurs in three levels: first, the quad, which is 8KB, is selected using
EA[4:3]; second, the sub-array is selected using EA[6:5]; and third, the set of 16 ways
is selected by EA[8:7]. Finally, the hit vector selects one of the 16 ways. Each sub-
array has 64 IOs (compared to 256 in the CAM-based tag), with 4:1 column muxes in the
sub-array selected by EA[10:9]. Figure 6-4 illustrates the data array area and hierarchy
using an SRAM-based tag.
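
The decode hierarchy listed above maps onto effective-address bit fields as in the following sketch. The bit ranges come from the text; the function and key names (field, decode_sram_data_array, way_group, and so on) are hypothetical, and returning None for anything other than exactly one hit is an assumption about how a miss would be handled.

```python
def field(ea, hi, lo):
    """Extract bits hi:lo (inclusive) of the effective address."""
    return (ea >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode_sram_data_array(ea, hit_vector):
    """Map an effective address and a one-hot hit vector to array selects."""
    if sum(hit_vector) != 1:
        return None                       # miss or illegal multi-hit: no access
    return {
        "quad": field(ea, 4, 3),          # one of four 8KB quads
        "sub_array": field(ea, 6, 5),
        "way_group": field(ea, 8, 7),     # set of 16 ways
        "column_mux": field(ea, 10, 9),   # 4:1 column mux select
        "way": hit_vector.index(1),       # chosen by the tag hit vector
    }


hit = [0] * 16
hit[5] = 1
print(decode_sram_data_array(0b10110011000, hit))
```

Only the final way select depends on the tag comparison; the other fields are available directly from the address, which is what gives the SRAM-based organization its banking flexibility.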
 
 
 
 
Figure 6-4: SRAM-based tag 32KB memory organization 
 
Figure 6-5 shows how the CAM-based tag data array is organized; bits 8 and 5 of
the EA are used to select sub-arrays, while bits 10 and 9 are used to select sets of 16 ways.
 
 
[Figure 6-4 labels: 32 KB Data Array; Wordline driver/decode; 2KB; 16 way SRAM TAG; Quad; DW; IO; 4way Tag; IO; comparator. Dimensions shown: 184 um, 310 um, 750 um, 510 um. Utilization is 31.8 %. 6T size is 0.54 um^2.]
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
Figure 6-5: CAM-based tag 16KB memory organization 
 
The cache line for the SRAM tag is distributed across four double words (DW).  Each
quad contains one double word from all sets. This organization makes the fill and evict
bus routing much simpler than in the CAM-based tag, as each quad drives one DW. The
state array has 64 entries, which matches the number of entries per way, and 48 columns,
which is 3 bits per way. For the CAM-based tag, the state bits are added to the CAM
array.  The area of the 6T SRAM cell in 65 nm is 0.52 µm² and the area of the
conventional dynamic CAM cell is 4 µm². Note that the CAM cell is about
eight times the size of the SRAM cell.
 
  
 
Table 6-1 and Table 6-2 show the area for each tag implementation. The CAM-
based design occupies 18% more area than the SRAM-based tag design. This arises from 
the difference in area between the two tag designs. 
 
 
 
Table 6-1: Area of L1 32KB 16 ways SRAM-based tag 
 
SRAM-based Tag Area
                   X (µm)   Y (µm)   Area (mm²)
32KB data array    510      750      0.3825
SRAM tag           184      310      0.11408
State array        148      80       0.01184
Total area                           0.508
 
 
Table 6-2: Area of L1 32KB 16 ways CAM-based tag 
 
CAM-based TAG Area
                   X (µm)   Y (µm)   Area (mm²)
32KB data array    542      700      0.3794
CAM tag            280      700      0.196
State array        150      160      0.024
Total area                           0.599
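
The totals and the 18% figure quoted in the text can be reproduced from the tables with the short calculation below. It uses the Area column as given rather than recomputing from X and Y, because the SRAM tag area listed (0.11408 mm²) is twice the product of its listed dimensions; the reason is not stated in the source. The variable names are illustrative.

```python
# Areas in mm^2, copied from the Area column of Tables 6-1 and 6-2.
sram_cache = {"data array": 0.3825, "SRAM tag": 0.11408, "state array": 0.01184}
cam_cache = {"data array": 0.3794, "CAM tag": 0.196, "state array": 0.024}

sram_total = sum(sram_cache.values())
cam_total = sum(cam_cache.values())
overhead = cam_total / sram_total - 1.0   # fractional area increase of CAM over SRAM

print(f"SRAM-based total: {sram_total:.3f} mm^2")
print(f"CAM-based total:  {cam_total:.3f} mm^2")
print(f"CAM area overhead: {overhead:.1%}")
```

The data arrays are nearly equal in area, so almost all of the overhead comes from the tag and state storage.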
       
 
 
6.3  Timing Comparison 
As is true for most caches, generating the hit signals that determine which of the
16 ways needs to be accessed is the most critical path for both cache designs. The
CAM tag is distributed and tightly coupled with the data array sub-banks, which makes the
timing path from the TLB to the hit signal more critical.  For SRAM tags, the tag array is
compact and localized in a relatively small area, which makes the main speed path from
the TLB to the hit signal less critical. The Intel XScale [10], with a 32-way set-associative
cache, implements a speculative CAM tag search in parallel with the TLB access. This
results in a special read/write operation on the data array that retains the old data in
case the TLB access misses. Further, it requires additional temporary storage for the
previous data, which increases the cache size by 2KB (about a 6% increase in the
cache area). The output of the SRAM tag is a 16-way hit vector that is one-hot and can be
optimized for both routing and power. One more complication stemming from the
physically distributed nature of the CAM-based tag is the need to combine the hit/miss
way information, which is necessary for the replacement algorithm. Moreover, if the cache
is a dual-issue cache, such as a pseudo-dual-ported cache, the timing also becomes a challenge.
 
6.4  Power Comparison 
Most of the previous work on power comparison
between CAM- and SRAM-based tags overlooked the power associated with wire
capacitance. Our analysis assumes a 65-nm process technology from a commercial
foundry. In our comparison, we assume that the functions common to the two
implementations, such as TLB, state array, and data array access, as well as the power
associated with driving the load/store bus, are all equal. As is clear from our earlier
discussion, fill and evict operations consume more power in CAM-based tags, but these
operations occur only seldom, so their effect on the total power consumption is small.
We now turn our focus to analyzing the power associated with the tag array and
hit generation, which is the principal difference between the two designs. Figure 6-6 and
Figure 6-7 show the power distribution in the SRAM-based tag and the CAM-based tag,
respectively, using a switching factor (SF) of 0.5 in both cases. The switching factor is the
fraction of the signals that switch from cycle to cycle. For example, the PA bus is 22 bits,
so an SF of 0.5 means that 11 bits of the bus switch from low to high or from high to low
between consecutive cache accesses. Figure 6-7 illustrates the power consumed by
distributing the PA bus and state vector, which is mostly spent switching wire capacitance.
This distribution constitutes 63% of the total active power consumed by the CAM tag of the
L1 data cache, which makes the CAM-based tag more dependent on the data-switching
factor. The SRAM-based tag’s dynamic power is mostly due to gate
switching and to accessing data from the SRAM block, which is implemented as a small-
signal array. The biggest power contributor in the SRAM tag implementation is
the 16 comparisons of 22 bits each (35% of the total power).
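
The switching factor defined above can be computed from a trace of bus values as in this sketch. The 22-bit width is taken from the text; the function names and the idea of averaging over an address trace are illustrative assumptions rather than the method used to produce the figures.

```python
PA_BUS_BITS = 22   # PA bus width given in the text


def switching_factor(prev, curr, width=PA_BUS_BITS):
    """Fraction of bus bits that toggle between two consecutive accesses."""
    mask = (1 << width) - 1
    toggled = bin((prev ^ curr) & mask).count("1")
    return toggled / width


def average_switching_factor(addresses, width=PA_BUS_BITS):
    """Mean switching factor over a sequence of at least two bus values."""
    pairs = list(zip(addresses, addresses[1:]))
    return sum(switching_factor(a, b, width) for a, b in pairs) / len(pairs)


# Two values that differ in exactly 11 of 22 bits give SF = 0.5.
print(switching_factor(0b0, 0b11111111111))
```

A workload trace with a low average value from such a calculation would favor the CAM-based tag, whose power is dominated by wire capacitance on the distributed PA and state buses.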
Figure 6-8 compares the power of the two tag implementations at
different SFs. The graph shows that, for an SF of 0.6, both tag implementations consume the
same dynamic power. A smaller switching factor is more favorable to the CAM-based
tag, which consumes about 60% less power than the SRAM-based tag when SF = 0.25.
This key trend makes the choice between CAM and SRAM tags dependent on the
processor architecture and workload. For example, a shared cache for a multicore or fine-
grained multithreading SOC will have a high SF on the PA bus because it runs
different programs. On the other hand, a single-issue general-purpose processor will
have less activity on the PA bus, which makes the CAM-based tag more power-efficient.
[Figure 6-6 pie chart, power distribution in SRAM tag: tag SRAM array 10%; wire to distribute tag 11%; wire cap inside comparator 2%; sense amp 13%; from sense amp to comparator 8%; state array 64x16 3%; hit signal distribution 4%; decoder 14%; total comparator power 35%.]

 
Figure 6-6: Power distribution in L1 data cache tag (SRAM-based tag) for SA = 0.5
 
 
 
[Figure 6-7 pie chart, power distribution in CAM tag: state array 64x48 14%; reserved bit 5%; gate load on PFN 7%; wire cap to distribute state 27%; wire cap to distribute PFN 36%; search lines 1%; hit vector (only one active) 4%; 1 set compare 6%.]
 
Figure 6-7: Power distribution in L1 data cache tag (CAM-based tag) for SA = 0.5  
[Figure 6-8 plot. X-axis values: 1, 0.75, 0.5, 0.25. Y-axis: power (mW), 0 to 14.
Series values at x = 1, 0.75, 0.5, 0.25:
  power_cam_tag:    12.353   9.929   7.505   5.081
  power_sram tag:   11.83    10.95   10.07   9.19
  cam/sram:         1.04     0.91    0.75    0.55]

Figure 6-8: Switching capacitance (energy-delay²) of CAM-based tag and SRAM-based tag

6.5  Summary 
The choice of tag array used in the memory subsystem has significant
implications for power, area, and speed. In our analysis, we showed that CAM-based tags
are always larger in area (constituting about 10% to 20% of the total cache). Since
memory subsystems constitute more than 50% of the area in modern processors, this
makes the CAM-based tag area overhead between 5% and 10% of the total processor
area. CAM-based tags pose more timing challenges than SRAM tags
because of the increase in area and the physically distributed nature of the hit signal;
recall that the hit signal is relatively localized in SRAM-based tags. Using CAM-based
tags limits the banking options and affects the data array organization; column muxing
and routing resources become commonplace. The advantage of CAM tags is that they are
more power-efficient than SRAM-based tags, but only for processors with a low switching
activity factor on the physical address and state bits. This makes the choice architecture-
and workload-dependent, and these characteristics need to be weighed before choosing one
tag over the other.
With technology scaling, the impact of wire capacitance and leakage current on
both area and speed becomes increasingly important. The SRAM arrays contain more
than 90% of the devices and use 50% of the chip area. The tag array itself consumes more
than half the power of the memory subsystem. Hence, early planning and a thorough
understanding of all the factors that contribute to power, area, and speed in SRAM
memory access are essential to making the right tag selection.
 
CHAPTER 7   Verification of Gate Level Model for Custom Memory Design in 
Scan Mode 
7.0  Test Pattern Tool Flow  
The need for both high-performance and low-power memory design has ensured
that more and more embedded memories are being added to chips, with more logic being
realized in the process using a custom design style. Built-in self test (BIST) is the
predominant way to test memory. Testing is done at full speed with algorithmic pattern
generators and cycle-by-cycle response comparisons with pass or fail signatures. The
BIST engine has a simple interface with input, output, address, enable, and pass or
fail signals.  The main goal of BIST is to test the SRAM cell and the main logic around it,
such as decode, input, and output logic; however, BIST has limited effectiveness in
testing other custom circuits that exist within the memory. Manufacturing stuck-at
tests and transition tests for these custom circuit designs are an essential part of today’s
complex System on Chip (SOC) designs, where high test coverage is needed to guarantee
high yield. Standard Design for Test (DFT) tools are built around the ASIC design flow and
require a gate-level net-list describing the design in order to generate the test patterns
that will be used to validate the silicon. Figure 7-1 shows the typical ASIC flow with
automatic test pattern generation (ATPG) and verification fully integrated into the
flow; this integration makes it correct by construction.
Custom designs are crafted using transistor-level models and tools. The transistor-
level model ultimately needs to be translated into a gate-level model to be used by the DFT
tools. It is essential to have a robust and accurate flow to verify that the gate-level
net-list, with all DFT features, and the schematic are equivalent.  Our flow uses an
industry-standard RC Verilog switch-level simulator to effectively and thoroughly verify
the scan equivalency between the gate-level and transistor-level models.
One important view of the custom macro is the gate-level model typically used by
ATPG tools like TetraMax [1]. The resulting ATPG patterns are used to screen for
manufacturing stuck-at and transition faults in the custom circuit designs used on the chip.
 
 
Figure 7-1: ASIC design flow with ATPG pattern generation and verification 
 
The patterns generated by the ATPG tools, which take in gate-level models, are
used to validate the silicon. If the gate-level model and the schematic are not equivalent,
then the gate model used to generate manufacturing test patterns will not represent the
silicon. This adds to the challenge of silicon debugging because, when patterns fail, one has
to determine whether the failure is due to an incorrect representation of the schematic by
the gate model or to a genuine design failure. Debugging failing ATPG patterns can be
extremely difficult and time-consuming. This is why it is essential to verify that the gate-
level model for which we are generating patterns and the schematic of the actual design
are equivalent [2].
 
 
 
Figure 7-2: Common mismatch between schematic view and gate-level view of the macro 
 
Figure 7-2 shows typical mismatches that can occur between the schematic view
(actual design) and the gate-level model. The first mismatch is in the scan order, and the
other is the realization of logic A in the schematic versus logic Z in the gate-level model.
Figure 7-3 shows where the gate-level model view fits in the design flow of the custom
macro. The Register Transfer Level (RTL) model used to describe the custom macro often
does not include scan, and if it does, the scan is added through a manual process that
needs to be checked against the schematic for equivalency. The equivalency checking
between the RTL model and the SPICE model, which is done in box 4 of Figure 7-3, is
often targeted only at functional mode and does not cover scan mode, because the full
details of the scan-mode behavior are often not included in the RTL model. The two
models (RTL and schematic) could be equivalent in functional mode but different
in scan mode. The flow to generate the gate-level model (box 6 of Figure 7-3) from the
schematic differs from one project to another, but because of tool limitations and design
complexity, the process in many instances requires manual edits of the net-list or
additional constraints or assertions to guide the tool. Ultimately, this makes the model
creation process prone to error.
 
 
Figure 7-3: Custom circuit design flow 
 
7.1  Custom Macro Design Flow 
High-performance, low-power SOC design requires a comprehensive strategy
and multi-level optimization from software to hardware. For the hardware design, all the
different design styles need to be exploited. One of the critical decisions that must be
made when designing a chip is which portions of the logic are implemented using a
custom design style and which portions are synthesized. The choice between a
custom design style and a synthesized design is based on complex tradeoffs between
achieving higher density, better timing, and lower power on the one hand and added
complexity, resources, and schedule on the other. Memories (SRAM/CAM), TLBs, and
register files are good candidates for a custom design style.
Once a part of the logic is chosen to be realized using the custom design flow, that logic
is separated out and put into a new design hierarchy. For custom semiconductor chip
designs, the RTL and transistor-level models are developed and verified using separate
CAD tool suites for most of the design, but are intended to model the same function.
Once complete, the RTL and transistor-level models must be checked to
ensure that they represent the same Boolean function [54]. One way to do this is to
translate the transistor-level net-list into a gate-level model through model abstraction
and then use Verilog equivalency-checking tools, like Verplex from Cadence, to verify
functionality. Another way to verify equivalency between the RTL and the schematic
SPICE net-list is to use switch-level simulators, like ESPCV from Synopsys. Both
approaches have advantages, and a typical flow will use multiple approaches to verify
correctness.
An illustrative custom design flow is shown in Figure 7-3. It starts with a
high-level description of the intended functionality and the different specifications, such
as timing, area, and power. The next step is to generate a schematic for the logic using
schematic-capture CAD tools like the Cadence Virtuoso custom design platform. The
schematic can be a mix of standard library cells and custom cells that are built up from
the transistor level, typically with a complex hierarchical structure in the design. After
completing the optimized design that implements the required functionality described in
the RTL, the functionality of the custom design can be verified against the behavioral RTL
using the ESPCV tool from Synopsys.  ESPCV is a switch-level simulator that can read in
a design in both behavioral RTL format and transistor-level net-list format and attempts
to perform a symbolic, formal verification of their equivalency.
Since the RTL quite often does not fully model the details of the DFT features
built into the design, these features must be disabled during this check and therefore go
unverified. Many tools in the design flow do not deal well with transistor-level designs,
so once the design is complete, the transistor-level net-list is translated into a gate-
level net-list for these tools. Most of the logic in a typical custom design can be
automatically translated into logic gates using logic abstraction tools like the Verplex
tools from Cadence. However, structures with more complex behaviors, like SRAM cells,
sense amps, and complex latch structures, cannot be automatically translated and must be
manually modeled by the designer. This manual modeling process introduces the potential
for errors, and even the abstraction tools themselves are not error-free, so it is desirable
to have an efficient gate-level model validation flow.
 
7.2  Gate-Level Model and Schematics Validation for ATPG 
The gate-level model and schematic validation process consists of three steps:
first, the ATPG tool is run to generate patterns; second, HDL Verilog simulation is used to
validate the patterns against the gate-level model; and third, the patterns are validated
through ESPCV with an RC switch-level model generated from the SPICE net-list.  Figure 7-4
gives an overview of the gate-level model validation flow.  This flow is described from the
standpoint of verifying ATPG patterns and DFT functionality, but the same principles can
also be applied to functional-mode verification.
7.2.1  Verifying ATPG Tool Compatibility and Coverage Analysis 
 
Before attempting to generate ATPG patterns, the ATPG tool first performs a 
thorough validation of the gate-level model from a DFT-compatibility standpoint. The 
main goal of this step is to ensure that the gate-level model passes a series of scan design 
rule checks. After the DFT DRC stage, ATPG patterns are generated with the goal of
achieving  100%  fault  coverage.  In  a  custom  macro,  there  are  often  circuits  that  are 
difficult to control or observe, so the coverage is likely to be less than 100%.  
To achieve accurate fault coverage, there are times when nonstandard gates need 
to be changed with APTG-friendly standard gates. For example, bit line keepers must be 
modeled in such a way that the tool understands that they preserve a node’s state while 
not  actively  driving  it.  Many  of  the  clocking  and  control  strategies  used  in  custom 
designs may confuse the tools, so that during the DRC checks and pattern generation, the 
ATPG tool may detect that there are errors, causing broken scan chains and invalid input 
control. These false errors can prevent successful pattern generation until the offending 
circuits are remodeled in a tool friendly manner. Finally, the actual memory cell array can 
be modeled using built-in memory-primitive models, which have advantages over using a 
detailed  cell-level  model  in  terms  of  simulation  time  and  complexity.  This  memory 
model enables the tool to test shadow logic outside the memory model.  
Once patterns have been successfully created, the tool can create a test generation 
pattern output file and verilog test bench, which will be used in last two steps. 
 
Figure 7-4: Gate-level model validation framework 
 
7.2.2  Validation Through HDL Simulation 
The second step in the verification flow is to use an HDL simulator such as VCS or
ModelSim to validate that the ATPG tool correctly interpreted the gate-level model
by simulating the application of the ATPG patterns to the design. Problems in the
gate-level model, invalid ATPG input constraints, and other issues can result in ATPG
patterns that do not produce the expected output results.
Failures in the ATPG pattern validation can be debugged using standard RTL
simulation debug tools by creating VCD or FSDB waveform files for viewing in any
waveform viewer such as Novas or nWave. The FSDB dump file is also used in step
three if ESPCV finds mismatches in the transistor-level verification.
7.2.3  Validation With Golden Model 
 
Even with verification of the ATPG patterns against the gate-level model in
RTL simulation, there may be failures on actual silicon tests due to
1) ATPG results being predicted in a zero-delay RTL environment
2) An imperfect gate-level model creation flow.
In this flow, ESPCV is applied to the problem of verifying that the gate-level
model correctly reflects the transistor-level design. ESPCV is a symbolic simulator that
has been tailored to perform custom-circuit equivalence checking. It is designed to
provide functional verification coverage of a Verilog reference design against a SPICE
net-list or a Verilog switch-level design.
ESPCV provides two modes: binary mode and symbolic mode. The tool is
primarily intended for use as a symbolic simulator to verify the very complex functional
modes of the block under all possible input stimuli. For ATPG pattern verification, our
flow uses the binary mode of ESPCV to quickly simulate the application of the ATPG
patterns to the design. ESPCV binary mode is much faster than transistor-level simulators
such as HSIM or NanoSim, which have also been used for this sort of verification.
The flow to generate the golden model is shown in Figure 7-5. First, the ESPCV
utility translates the SPICE net-list to a golden RC Verilog switch-level net-list, using a
configuration file that has port information. This net-list is annotated with transistor
widths and lengths and process information. This simple step makes it possible to run
ESPCV’s RC mode algorithm, which dynamically resolves drive strength and
automatically calculates net delays to correctly resolve the behavior of circuits such as
SRAM cell write operations and timing delay chains. Compared to traditional transistor
simulators, ESPCV can provide both functional accuracy and simulation speed. This
makes it possible to simulate many more patterns and to gain much higher confidence in
the equivalence of the two models. For most designs, running all patterns through ESPCV
can give up to 100% confidence.
Using the same Verilog test bench for the VCS gate-level verification and the
ESPCV simulations also simplifies debugging of failures.
 
Figure 7-5: The flow to generate golden model 
 
7.3  Experimental Results 
Table 1 compares the run time of this flow with that of transistor-level simulation. Even
for large designs, ESPCV can verify the gate-level model with many patterns. It can also
simulate normal Verilog files that have delay parameters. ESPCV is fast enough to verify
the custom macro gate-level model.
 
 
Table 1: Simulation time

Circuit     Number of   Simulation time (min)
            patterns    Transistor-level   RC Verilog switch-level
                        simulator          simulator
Circuit A   7           1442               1.83
Circuit B   167         40324              271
Circuit C   411         Not testable       762
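
As a rough illustration of the comparison in Table 1, the speedup of the RC Verilog switch-level simulation over transistor-level simulation can be computed directly from the tabulated run times. The sketch below uses Python, which is not part of the original flow; the dictionary layout and the function name `speedup` are hypothetical and chosen only for this example.

```python
# Run times in minutes, copied from Table 1. None marks "Not testable".
runs = {
    "Circuit A": {"patterns": 7, "transistor": 1442.0, "switch": 1.83},
    "Circuit B": {"patterns": 167, "transistor": 40324.0, "switch": 271.0},
    "Circuit C": {"patterns": 411, "transistor": None, "switch": 762.0},
}


def speedup(entry):
    """Ratio of transistor-level to switch-level run time, or None if unavailable."""
    if entry["transistor"] is None:
        return None
    return entry["transistor"] / entry["switch"]


speedups = {name: speedup(entry) for name, entry in runs.items()}
for name, ratio in speedups.items():
    print(name, "n/a" if ratio is None else f"{ratio:.0f}x")
```

For Circuit A and Circuit B the ratio comes out to roughly 790 and 150 times, respectively. No ratio is defined for Circuit C because the transistor-level simulator could not run it.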
 
This flow was run on several custom memory blocks, and it found that the outputs of
many simulation runs with the gate-level model did not match the expected values.
Figure 7-6 is a snapshot of one of the test results. It reports the pattern
number, expected output, output pin name, and time of each failure. From this result and
the internal dump values, users can easily locate the incorrect part of the model.
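
Mismatch records of the form shown in Figure 7-6 lend themselves to simple post-processing. The following Python sketch is only an illustration and is not part of the described flow: the regular expression is inferred from the two failure lines in the figure, the meaning of the leading numeric fields is not documented in the text and is therefore ignored, and the names `MISMATCH` and `parse_mismatches` are hypothetical.

```python
import re

# Pattern inferred from the failure lines in Figure 7-6 (an assumption, not a
# documented tool format). The numeric fields before "(exp=" are skipped.
MISMATCH = re.compile(
    r"\(exp=(?P<exp>[01xXzZ]), got=(?P<got>[01xXzZ])\)\s*//\s*pin\s+"
    r"(?P<pin>\S+),\s*scan cell\s+(?P<cell>\d+),\s*T=\s*(?P<time>[\d.]+)\s*ns"
)


def parse_mismatches(log_text):
    """Return one dict per mismatch record found in the log text."""
    # Records may wrap across lines, so collapse all whitespace first.
    flat = " ".join(log_text.split())
    return [
        {
            "pin": m["pin"],
            "scan_cell": int(m["cell"]),
            "expected": m["exp"],
            "got": m["got"],
            "time_ns": float(m["time"]),
        }
        for m in MISMATCH.finditer(flat)
    ]


# The two failure records from Figure 7-6, wrapped as they appear there.
sample_log = """
// *** ERROR during scan pattern 1 (detected
during load of pattern 2)
1 87 6 (exp=0, got=1)  // pin
IU_S_sout[64], scan cell 6, T=        6204.00 ns
1 87 10 (exp=1, got=0)  // pin
IU_S_sout[64], scan cell 10, T=        6244.00 ns
"""

mismatches = parse_mismatches(sample_log)
for record in mismatches:
    print(record)
```

Grouping the parsed records by pin or scan cell gives a quick view of where the gate-level model and the switch-level model disagree before opening the waveform dump.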
 
    ... Starting ESP simulation:
    //            0.00 ns : Begin test_setup
    esp Info:  DC initialization complete.
    //           30.00 ns : Begin patterns, first pattern = 0
    //           30.00 ns : ...begin scan load for pattern 0
    // *** ERROR during scan pattern 1 (detected during load of pattern 2)
    1 87 6 (exp=0, got=1)  // pin IU_S_sout[64], scan cell 6, T=        6204.00 ns
    1 87 10 (exp=1, got=0)  // pin IU_S_sout[64], scan cell 10, T=        6244.00 ns

Figure 7-6: Gate-level simulation test example

7.4   Summary

Generating a gate-level model from custom circuits for testing has traditionally
been a complex, manual process that depends heavily on the skill of the test engineer
and the designer. This chapter describes a method for validating gate-level
models against schematics for custom macro designs. The proposed framework uses a
three-step validation process to ensure correct functionality. The flow accepts a SPICE
net-list from the schematic and generates a golden RC Verilog switch-level net-list. Then, the
test patterns generated by the ATPG tool are simulated with the RC Verilog switch-level
net-list. Our flow effectively verifies the scan equivalence between the two
models, the gate-level net-list and the schematic.
 
CHAPTER 8   Leakage Reduction on Wordline Logic for SRAM Memory 
8.0  Motivation  
The use of batteries in hardware targeted for handheld and cell phone applications
requires that the product meet stringent energy requirements. Leakage current (i.e.,
the current flowing through the device during its “off” state) has increased drastically
with technology scaling [1]. Leakage minimization in standby mode is important for
chips in general, but is critical for handhelds and mobile phones because these products
have long idle times and limited energy to spare. The leakage power often determines the
standby time a product can last before its battery is drained.
Modern SOCs have multiple functional and hardware acceleration units with
complex power management that controls the power to the different parts of the
system [20]. There are different leakage mechanisms in today’s scaled devices; the three
major ones are sub-threshold leakage, gate leakage, and reverse-biased
drain-substrate and source-substrate junction band-to-band tunneling (BTBT) leakage
[1].
Threshold voltage (Vt) scaling, together with Vt reduction due to short channel effects
(SCE) [1], results in an exponential increase in the sub-threshold current. Extrapolating
from Zhang et al [4, Fig. 1(a)], more than 30% of the total power in a 65-nm part in
active operation mode will be consumed by leakage.
A simplified sub-threshold current equation is shown in Equation 1:

$$I_{ds} = I_{ds0}\, e^{\frac{V_g - V_t}{n \nu_t}} \left[ 1 - e^{-\frac{V_{ds}}{\nu_t}} \right] \qquad (1)$$

where
$I_{ds0} = \mu \phi \frac{W}{l} \nu_t^{2} e^{1.8}$ is the current at threshold,
$\mu$ is the effective carrier mobility,
$\frac{W}{l}$ is the device width to length ratio,
$\phi$ is a process dependent constant, and
$\nu_t = \frac{kT}{q}$ is the thermal voltage (26 mV at 300 K).
 
From Equation 1, we conclude that the leakage current increases exponentially
with decreasing threshold voltage (Vt); it also scales linearly with transistor width (W),
depends exponentially on the thermal voltage (νt), and has a complex relationship with the
channel length l. For long channel MOS devices, it is linear in the reciprocal of the channel
length, but due to the short channel effect [56], the threshold voltage also
changes with channel length. Up to some limit, the leakage therefore falls exponentially with
increasing channel length, and beyond that limit it returns to the linear relationship. The
gate voltage, Vg, which is equal to the supply voltage, also has an exponential relationship
to the leakage current.
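
The exponential dependence on Vt in Equation 1 can be made concrete with a small numerical sketch. The Python code below is illustrative only: the slope factor n = 1.5, the normalization of the current at threshold to 1, and the bias and threshold values are all assumptions chosen for the example, not values taken from the process data used in this work.

```python
import math

K_BOLTZMANN = 1.380649e-23    # Boltzmann constant, J/K
Q_ELECTRON = 1.602176634e-19  # elementary charge, C


def thermal_voltage(temp_kelvin=300.0):
    """Thermal voltage kT/q in volts (about 26 mV at 300 K)."""
    return K_BOLTZMANN * temp_kelvin / Q_ELECTRON


def subthreshold_current(vg, vt, vds, ids0=1.0, n=1.5, temp_kelvin=300.0):
    """Equation 1, with ids0 normalized to 1 and n = 1.5 as assumed values."""
    nu_t = thermal_voltage(temp_kelvin)
    return ids0 * math.exp((vg - vt) / (n * nu_t)) * (1.0 - math.exp(-vds / nu_t))


# Assumed illustrative bias: gate at 0 V, drain-source voltage of 1.2 V.
i_low_vt = subthreshold_current(vg=0.0, vt=0.30, vds=1.2)
i_high_vt = subthreshold_current(vg=0.0, vt=0.40, vds=1.2)

# The ratio depends only on the Vt difference, n, and the thermal voltage.
reduction = i_low_vt / i_high_vt
print(f"Leakage reduction for +100 mV of Vt: {reduction:.1f}x")
```

Under these assumed values, raising Vt by 100 mV lowers the modeled sub-threshold current by roughly an order of magnitude, which is the effect that high-Vt head and foot switches rely on.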
8.1  Usage of Head and Foot Switch for Leakage reduction 
Certain applications, such as an MP3 player, span multiple functional units and
do not require high performance but do require a long run time, which makes them
challenging. For this kind of application, active leakage becomes a large percentage of the
total power consumed. Global power collapse, or even power-domain collapse, cannot be
used, since the unit needs to stay on for relatively short times and a full power collapse
requires software intervention to correctly go in and out of the different power modes. This
challenge requires a more distributed and precise control of the power supply to effectively
reduce leakage power during active mode.
The use of low leakage transistors that have high threshold voltages (HVT) in
series with the supply voltage is a well-known technique for reducing the leakage
current of the logic gates of a certain block [26]. An HVT pMOS device can be used in
series with the logic 1 supply voltage (Vdd) to limit the leakage current (head switch), or
an HVT nMOS device can be inserted in series with the logic 0 supply (Vss) (foot
switch). Since the leakage current can only flow from the high potential supply,
Vdd, to the low potential supply, Vss, it is sufficient to use either a foot or a head switch to
limit the leakage from a given gate.
 
 
Figure 8-1: Detail schematic of head/foot switch 
Figure 8-1 shows the detailed schematic connection of the logic gates to the
supply voltages. The transistors with the gate connected to sleep signals are the foot and
head switches. The sizing of the foot and head switches is based on a tradeoff among
leakage, speed degradation, and area overhead. Sizing typically limits the speed
degradation due to the addition of the series foot or head switch to 2% to 3%. Since
electron mobility in nMOS is greater than hole mobility in pMOS, it is better
from an area and speed perspective to use an nMOS foot device rather than a pMOS head
device. Still, most chips use a single tub process, which means that all transistors share the
same substrate, so a foot switch requires extra routing resources to isolate Vss_sub from the
Vvss nodes. If a head switch is used, Vdd_sub can be connected to Vvdd.
The leakage savings from foot or head switches result from the fact that
HVT device leakage is much lower than that of normal Vt devices, and the total width
of the transistors connected to the supply is limited by the foot or head switch width, both
of which are much smaller than the total width of the logic gates. Furthermore, the stacking
effect, which limits the leakage current through a negative gate to source voltage, also adds
to the savings.
 
 
 
 
Figure 8-2: Foot/head switch examples 
It is important to guarantee that the logic implementation of a head or foot switch
is designed with no potential dc current between the Vdd and Vss supplies. Figure 8-2 shows
cir1 and cir2 with correct implementations of alternating head and foot switches, while cir3
and cir4 have the potential for dc current from Vdd to Vss through the
second inverter when node n1 floats during sleep mode.
8.2  SRAM-based Memory Leakage 
SRAM arrays contain more than 90% of the devices and use 50% of the chip area [5].
Together with the fact that most of the cache circuit elements are idle at any given time,
this makes SRAM a good candidate for leakage power reduction. The SRAM cell is
normally designed with small transistor sizes to optimize performance, area, and leakage.
For many power-sensitive chips, high Vt transistors are also used to further reduce the
SRAM cell leakage.
Wordline logic is the second largest portion of the memory after the 6-T SRAM 
cells [4]. The large load presented by the wordline to the wordline logic dictates that the 
wordline logic will use wide devices for performance and area utilization. This makes the 
wordline leakage a significant part of the total leakage power consumed by the memory 
subsystem in both active and inactive operation modes.  
We present a scheme that reduces this leakage power consumption by 20 times;
our approach assumes existing power gating techniques, which are applicable only in
standby mode. We exploit high- and low-Vt devices to achieve this result without any
performance overhead; furthermore, our solution is completely transparent to the
software and logic that interface with the SRAM. Since at most one wordline can be active in
any one cycle, the pfets on the inverters driving the inactive wordlines are always leaking. We
will present data for an advanced commercial process that demonstrates that this leakage
is at least comparable to, and sometimes even greater than, the collective leakage current of
all the 6T cells in the array.
One mechanism for power reduction is to dynamically gate the power supplies to
the wordline logic along the memory addressable unit or bank. Several authors have
proposed such a solution [1], [2]. However, they only address leakage power in standby
modes, such as sleep (during which SRAM state is restored on wakeup) or stop (during
which SRAM contents are invalidated). These modes are controlled by software and have area,
speed, and software complexity overhead. The modes offer only coarse control over
leakage minimization; they operate at a unit level, so even if only one entry of the SRAM
needs to be active, the SRAM is precluded from being in a power-save mode.
Zhang et al [39] addressed the wordline logic leakage by using long channel
devices on the driver. This does reduce leakage but has penalties on speed due to the
increased gate capacitance and reduced drive capability.
 
8.3  Design Example 
We illustrate our approach on a simple, single-ported 32-kilobyte (KB) SRAM. It
is typical for cache organization to use a multi-level hierarchy to minimize active power.
For our illustrative example, we assume the SRAM is divided into 16 banks, and each
bank is divided into two sub-banks (1KB each), with the wordline logic of these two
sub-banks sharing the pre-decode and differing only in the last decode stage. Figure 8-3
shows the assumed cache organization. Figure 8-4 illustrates the gate-level logic of the final
wordline driver.

Figure 8-3: 32KB cache organization example
 
Let Cnfet be the gate capacitance of one nfet pass-gate (PG) in the SRAM; the two
access devices of each cell add a load of 2Cnfet to the wordline. The wire capacitance per
cell is approximately equal to one PG capacitance (this estimate is for the wide 6T SRAM
cells designed in sub-90 nm, where the cell’s aspect ratio is close to 2 with the bit-line
direction being the shorter side [4]). Hence, each cell contributes a total of 3Cnfet
capacitance to the wordline.

Figure 8-4: Traditional wordline driver
The currently used wordline logic shown in Figure 8-4 can be sized for minimum
delay using the theory of logical effort [3, page 184], which, in essence, tells us that for
optimum delay on a path, devices should be sized so that each stage sees a stage effort of
4. For a memory block with $2^n$ wordlines and $2^m$ bit-lines, this means that the final
inverter on the wordline logic (NVT_inv1 in Figure 8-4) should have input capacitance
equal to

$$C_{wl\_in} = 2^{m} \cdot (3 C_{nfet}) / 4 \qquad (2)$$

In our illustrative design, n = 6 and m = 7, so each SRAM sub-bank is 1 KB. For
this, the wordline inverter’s input capacitance can be calculated (using Equation 2) to be
96Cnfet. The pass gate is at a minimum size, with a long channel to ensure read stability.
Cnfet is equal to 0.15 fF for the technology we are using. Substituting in Equation 2, we see
the input capacitance of the inverter that actually drives the wordline is 96·0.15 = 14.4 fF, or
approximately 15 fF. The gate capacitance in 65-nm technologies is on the order of 1 fF/µm,
so the total wordline driver size is about 15 µm. Assuming holes have roughly half the
mobility of electrons, and that equal rise and fall times are desirable, the pfet width will be
equal to 10 µm, and the nfet width will be equal to 5 µm.
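
The sizing arithmetic above can be retraced in a few lines. The Python sketch below is only an illustration of Equation 2 with the numbers quoted in the text; the function name, parameter names, and returned dictionary keys are hypothetical, and the 2:1 pfet to nfet split encodes the stated assumption that holes have roughly half the mobility of electrons.

```python
def size_wordline_driver(n=6, m=7, c_nfet_ff=0.15, cnfet_per_cell=3,
                         stage_effort=4, gate_cap_ff_per_um=1.0,
                         pfet_to_nfet_ratio=2.0):
    """Apply Equation 2 and split the driver width between pfet and nfet."""
    cells_per_wordline = 2 ** m
    sub_bank_bytes = (2 ** n) * cells_per_wordline // 8

    load_in_cnfet = cells_per_wordline * cnfet_per_cell   # wordline load
    input_in_cnfet = load_in_cnfet / stage_effort          # Equation 2
    input_cap_ff = input_in_cnfet * c_nfet_ff
    total_width_um = input_cap_ff / gate_cap_ff_per_um

    # Wider pfet compensates for the lower hole mobility.
    nfet_width_um = total_width_um / (1.0 + pfet_to_nfet_ratio)
    pfet_width_um = total_width_um - nfet_width_um
    return {
        "sub_bank_bytes": sub_bank_bytes,
        "input_in_cnfet": input_in_cnfet,
        "input_cap_ff": input_cap_ff,
        "total_width_um": total_width_um,
        "pfet_width_um": pfet_width_um,
        "nfet_width_um": nfet_width_um,
    }


sizing = size_wordline_driver()
for key, value in sizing.items():
    print(f"{key}: {value}")
```

Without rounding, the driver comes out at about 14.4 µm total, split into about 9.6 µm of pfet and 4.8 µm of nfet; the text rounds these to 15 µm, 10 µm, and 5 µm.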
 
Figure 8-5: New wordline driver design with HVT head and foot switch to limit leakage current  
 
 
 
Figure 8-6: Detail of the new wordline driver last stage 
 
8.4  Proposed Low leakage Wordline Logic 
For our illustrative design, when the array is not being accessed, all wordlines are
off. On a read or write operation, a bank is selected through decoding of the index bits,
which are part of the address bits, and exactly one of the wordlines is asserted. In this case,
the pfet on each remaining wordline driver is always leaking except when the power is
turned off.
Our research has been executed using data from a 65-nm process from a
commercial foundry that includes devices with three values of Vt (low, normal, and high)
[7,8]. Because of confidentiality agreements, we cannot divulge exact values of leakage
and their dependence on process, voltage, and temperature (PVT); instead, we
present representative values.
The leakage per µm of gate width for a pfet will be referred to as L nA/µm, where
the value of L depends on the process technology and on the PVT point. The reported
value at nominal voltage and 25C from both IBM and TSMC [7,8] is 7 nA/µm for
nMOS. The 6T SRAM cell is designed by the foundry; the devices in the cell are
minimum width devices with longer channel lengths and higher threshold implants,
which makes the leakage very small. We will refer to the total cell leakage as Ls pA per cell.
For the typical corner at 25C, the leakage current per cell is reported to be 10 pA. For the
2KB bank in our illustrative design, the total leakage currents of all of the wordline drivers
and of all of the array cells would be

$$I_{wl\_leak} = 10 \cdot 2 \cdot 2^{6} \cdot L \ (\mathrm{nA}) = 1.28 \cdot L \ (\mu\mathrm{A})$$
$$I_{sram\_leak} = 2 \cdot 2^{6} \cdot 2^{7} \cdot L_s \ (\mathrm{pA}) = 0.016384 \cdot L_s \ (\mu\mathrm{A})$$
_ 2 2 2 0.016384 ( ) sram leak s s I L L A µ = = i i i i  
 
 
 
 
 
 
 
 
 
Table 8-1: 32KB SRAM array leakage and wordline driver leakage for different PVT

|------- SRAM leakage data -------|  |------------- Wordline leakage data -------------|
Number of  Leakage per  Total SRAM     Number of  Total PMOS    PMOS leakage  Total wordline  Total WL leakage /
SRAM       SRAM cell    cells leakage  wordline   width for     per µm        driver leakage  total SRAM
cells      (pA)         (µA)           drivers    wordline      (nA/µm)       (µA)            leakage
                                                  drivers (µm)
262144     10           2.62           2048       20480         0.3           6.14            2.3
262144     20           5.24           2048       20480         0.5           10.24           2
262144     50           13.11          2048       20480         0.65          13.31           1
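
The entries of Table 8-1 follow from the cache organization and the per-device leakage values. The Python sketch below reproduces them as a consistency check; it is an illustration rather than the script used to produce the table, and the constant and function names are hypothetical. The 10 µm pfet width per driver is the rounded value from the sizing discussion.

```python
SRAM_BYTES = 32 * 1024
N_BANKS = 16
SUB_BANKS_PER_BANK = 2
WORDLINES_PER_SUB_BANK = 2 ** 6
PFET_WIDTH_UM = 10.0  # rounded pfet width of one wordline driver

n_cells = SRAM_BYTES * 8
n_drivers = N_BANKS * SUB_BANKS_PER_BANK * WORDLINES_PER_SUB_BANK
total_pfet_width_um = n_drivers * PFET_WIDTH_UM


def leakage_row(cell_leak_pa, pfet_leak_na_per_um):
    """Return (SRAM leakage in uA, wordline leakage in uA, WL/SRAM ratio)."""
    sram_ua = n_cells * cell_leak_pa * 1e-6                    # pA -> uA
    wl_ua = total_pfet_width_um * pfet_leak_na_per_um * 1e-3   # nA -> uA
    return sram_ua, wl_ua, wl_ua / sram_ua


# (leakage per cell in pA, pfet leakage in nA/um) for the three table rows.
corners = [(10, 0.3), (20, 0.5), (50, 0.65)]
rows = [leakage_row(cell, pfet) for cell, pfet in corners]
for sram_ua, wl_ua, ratio in rows:
    print(f"SRAM {sram_ua:.2f} uA, wordline {wl_ua:.2f} uA, ratio {ratio:.1f}")
```

The computed cell count (262144), driver count (2048), and total pfet width (20480 µm) match the table, and the ratio column shows the wordline driver leakage ranging from about equal to more than twice the array leakage across the three points.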
 
Table 8-2: Active power due to the addition of the foot/head switch

Head switch   Foot switch   Total gate cap of     Voltage   Power C*V^2*AF
size (µm)     size (µm)     head and foot (fF)    (V)       (µW/GHz)
48            24            57.6                  1.2       41.47
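
The power figure in Table 8-2 can be retraced from the C*V^2*AF expression in its header. The Python sketch below is illustrative only. The activity factor is not stated in the text; a value of 0.5 is assumed here because it reproduces the tabulated 41.47 µW/GHz. Likewise, the gate capacitance per µm is derived from the table entries rather than quoted from the text. The function and variable names are hypothetical.

```python
def switch_power_uw_per_ghz(gate_cap_ff, vdd, activity_factor):
    """C * V^2 * AF: femtojoules per cycle, which equals microwatts per GHz."""
    return gate_cap_ff * vdd ** 2 * activity_factor


head_width_um = 48.0
foot_width_um = 24.0
total_gate_cap_ff = 57.6
vdd = 1.2
assumed_activity_factor = 0.5  # assumption: not given in the text

# Capacitance per unit width implied by the table entries.
cap_per_um = total_gate_cap_ff / (head_width_um + foot_width_um)

power = switch_power_uw_per_ghz(total_gate_cap_ff, vdd, assumed_activity_factor)
print(f"Implied gate capacitance: {cap_per_um:.2f} fF/um")
print(f"Switching power: {power:.2f} uW/GHz")
```

Under this assumption the table entries imply about 0.8 fF/µm of gate capacitance for the head and foot devices.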
 
 
 
CHAPTER 9   Conclusions and Future Work 
9.0  Conclusions 
The objective of this dissertation is to better understand the different factors that
affect SRAM-based memory power, area, performance, and yield in order to identify
opportunities for improvement. Starting from the architectural level and going through
circuit implementation down to the layout and floor plan steps of the design, there are
complex tradeoffs among the different factors and their effects at the chip and system
levels. Moreover, the complexity of today’s SOCs makes it essential to have a robust and
comprehensive testing strategy for all gates on the chip to achieve high coverage and to
minimize the shipment of defective parts to customers.
Chapters 1 and 2 introduced the basics of SRAM-based memory design and the
challenges presented by limited voltage scaling caused by SRAM stability and
technology scaling. The SRAM functionality and parametric yield failures were also
introduced and analyzed, along with the different factors that affect each one.
Access failure and write completion failure mostly occur due to a slow nMOS
transistor and fast pMOS, while read stability failure (destructive read) is generally caused
by fast nMOS and slow pMOS devices. We also surveyed the different approaches used in
today’s chips to address the voltage scaling limit and the yield loss due to SRAM failures.
There are four main approaches to achieving the goal of power-efficient and
high-yield SRAM-based memory: modified SRAM, voltage islands, body/well bias, and read
and write assist circuits.
We described a method of using assist circuits that can minimize the effect of
SRAM cell parametric variation on the memory behavior. The approach used a reduced
voltage swing (RVS) high circuit that reduces the WL voltage. The approach also
employed an RVS low circuit to reduce the memory supply during write operations. The
approach was selectively activated based on silicon behavior. It improved the
SNM of the cell by 20%, with minimal impact on timing and area.
The second contribution was a detailed study of the cache-tag organization and its
impact on area, power, timing, and design complexity. Our results showed that
CAM-based tags often result in better design points than SRAM-based tags.
The third contribution relates to leakage power in SRAM memory. The high area
percentage occupied by the memory, the regular structure of the memory, and the low
ratio of active to idle circuits all make the memory an ideal candidate for leakage
reduction during active operation mode. We showed that the wordline leakage is
a substantial percentage of the total memory leakage, and we proposed a power gating
technique to minimize it.
  
9.1  Future Work 
The variability of small geometry process technology makes a design time
optimization approach for critical circuits impractical. Adaptive and tunable designs that
can respond to process variation are key to making competitive products.
This dissertation described a method that can be used in SRAM-based memory to
minimize the parametric yield loss and to enable lower voltage operation. The test chip
designed proved the practicality of the approach and quantified the area and timing
overhead. It did not include the feedback loop that can automatically activate the RVS
system to adjust the voltages on the chip based on process corners.
Much research in the field of process monitoring and identification [58] [59] [60]
describes how to identify the silicon behavior. Future work would complete the
system to provide Automatic Reduced Voltage Swing (ARVS). This requires
process monitors that can identify silicon behavior and activate the control logic that
regulates the described RVS system.
REFERENCES 
[1]  Wilkes, M. The memory gap and the future of high performance memories,  ACM 
Computer Architecture News, vol. 29, March 2001, pp. 2-7. 
[2]  Wulf, W.; McKee, S. Hitting the memory wall: Implications of the obvious, ACM 
Computer Architecture News, March 1995, pp. 20-24. 
[3]  Gerosa, G. et al. A sub-1W to 2W low-power IA processor for mobile internet 
devices and ultra-mobile PCs in 45nm Hi-k metal gate CMOS, Proc. of  ISSCC, 
2008, pp. 256-258. 
[4]  Borkar S. et al.  Parameter variation and impact on circuits and microarchitecture, 
in Proc. of DAC, 2003, pp 338-342. 
[5]  J. Bhavnagarwala et al.  The impact of intrinsic device fluctuations on CMOS 
SRAM cell stability, IEEE J. Solid-State Circuits, volume 36, April 2001, pp 
658–665. 
[6]  Kapre, R. et al. SRAM Variability and Supply Voltage Scaling Challenges, IEEE 
International Reliability Physics Symposium, 2007, pp 23-38. 
[7]  Kumar, S.V., Kim, C.H., and Sapatnekar A..  Impact of NBTI on SRAM Read 
Stability and Design for Reliability, in Proc. ISLPED, 2006. 
[8]  Furber, S. et al.  ARM3 - 32b RISC processor with 4kbyte on-chip cache, In G. 
Musgrave and U. Lauther, editors, Proc. IFIP TC 10/WG 10.5 Int. Conf. on VLSI, 
1989, pp. 35-44. 
[9]  Montanaro J. et al.  A 160-MHz, 32-b, 0.5-W CMOS RISC microprocessor, IEEE 
J. Solid-State Circuits, volume 31, November 1996, pp. 1703-1714. 
[10]  Clark, L.T. et al. An Embedded 32-b Microprocessor Core for Low-Power and 
High-Performance Applications, IEEE J. Solid-State Circuits,  volume 36, 
November 2001, pp. 1599-1608.
[11]  Zhang, M. and Asanovic, K.. Highly Associative Caches for Low-Power 
Processors, Kool Chips Workshop, 33rd International Symposium on 
Microarchitecture, December 2000. 
[12]  Mohammad, B.; Bassett, P.; Aziz, A; and Abraham J. Cache Organization for 
Embedded Processors: CAM-vs-SRAM., IEEE International SOC Conference, 
September 2006,  pp. 299 – 302. 
[13]  Kao, J.; Chandrakasan, P.  Dual-Threshold Voltage Techniques for Low-Power
Digital Circuits, IEEE Journal of Solid-State Circuits, volume 35, July 2000, pp.
1009-1018.
[14]  Rao, R.M; Burns, J.L. Analysis and optimization of enhanced MTCMOS scheme, 
Proc. VLSI Design, January 2004; pp. 234-239. 
[15]  Amelifard, B; Fallah, F.; Pedram, M.  Leakage minimization of SRAM cells in a 
dual-Vt and dual-Tox technology, Trans. On VLSI Systems, 2008. 
[16]  Ricci, F. et al.  A 1.5 GHz 90 nm Embedded Microprocessor core, Symposium on 
VLSI Circuits Digest of technical Papers, 2005, pp. 12-15.  
[17]  Mukhopadhyay, S.; Mahmoodi, H, and Roy, K.  Modeling of failure probability 
and statistical design of SRAM array for yield enhancement in nanoscaled 
CMOS, Proc. of  TCAD , December 2005, pp. 1859-1880. 
[18]  Yamaoka, M et al.  90-nm process-variation adaptive embedded SRAM modules 
with power-line-floating write technique, IEEE J. Solid-State Circuits, volume 42, 
March 2006, pp. 705-711. 
[19]  Pilo, H. et al. An SRAM Design in 65nm technology node featuring read and 
write-assist circuits to expand operating voltage, IEEE J. Solid-State Circuits, 
volume 42, April 2007, pp. 813-819.
[20]  Tsukamoto, Y. et al.  Worst-case analysis to obtain stable read/write DC margin 
of high density 6T-SRAM-array with local Vth variability, Proc. ICCAD, 
November 2005, pp. 398. 
[21]  J. G. Massey.  NBTI: what we know and what we need to know - a tutorial 
addressing the current understanding and challenges for the future, In IEEE 
International Integrated Reliability Workshop Final Report, 2004; pp. 199–211. 
[22]  Sakurai, T.; Newton, A.R.  Alpha-power law MOSFET model and its applications
to CMOS inverter delay and other formulas, IEEE J. Solid-State Circuits, volume 25, April
1990, pp. 584-594.
[23]  Chandrakasan, A.; Bowhill, W.J.; and Fox, F.  Design of High-Performance
Microprocessor Circuits, IEEE Press 2000.
[24]  Samuel K. et al.  65nm CMOS High Speed, General Purpose and Low power 
Transistor Technology for High Volume Foundry Application, Symposium on 
VLSI Technology, June 2004; pp. 92-93. 
[25]  Kuhn, K. Reducing variation in advanced logic technologies: Approaches to 
process and design for manufacturability of nano scale CMOS, Proc. IEDM, 
December 2007, pp. 471-474.  
[26]  Wang, Y. et al. A 1.1 GHz 12 µA/Mb-Leakage SRAM in 65 nm ultra-low-power 
CMOS technology with integrated leakage reduction for mobile applications, 
Proc. ISSCC; January 2008; pp. 172-179. 
[27]  Steegen, A et al.  65nm CMOS technology for low power applications, Electron
Devices Meeting, IEDM Technical Digest. IEEE International, December 2005
pp. 64-67.
[28]  Cheng, K.L. et al.  A highly scaled, high performance 45nm bulk logic CMOS
technology with 0.242 µm² SRAM, Proc. IEEE IEDM, December 2007, pp. 243-246.
[29]  Leland C. et al.  Stable SRAM Cell Design for the 32 nm Node and Beyond , 
Symposium on VLSI Technology, June 2005; pp. 128-132. 
[30]  Verma, N; Chandrakasan A.  A 65nm 8T Sub-Vt SRAM Employing
Sense-Amplifier Redundancy, IEEE ISSCC, February 2007; pp. 327-330.
[31]  Koichi T. et al.  A Read-Static-Noise-Margin-Free SRAM Cell for Low-Vdd and 
High-Speed Applications, IEEE J. Solid-State Circuits, volume 41, January 2006, 
pp. 113-121. 
[32]  Weste, N. and Harris, D.  CMOS VLSI Design: A Circuits and Systems 
Perspective, Addison-Wesley, 2005. 
[33]  Royannez, P;Mair H., and Dahan F.  90nm low leakage SOC design techniques 
for wireless applications, IEEE ISSCC Multimedia, 2005; pp. 138-141. 
[34]  Zhang, K. et al. SRAM design on 65-nm CMOS technology with dynamic sleep 
transistor for leakage reduction, IEEE J. Solid-State Circuits, volume 40, April 
2005; pp. 895-901. 
[35]  Ohbayashi, S. et al.  A 65-nm SoC Embedded 6T-SRAM Designed for
Manufacturability With Read and Write Operation Stabilizing Circuits,
IEEE J. Solid-State Circuits, volume 42, April 2007; pp. 820-829.
[36]  Pagiamtzis, K.; Sheikholeslami, K.  Content-Addressable Memory (CAM) 
Circuits and Architecture, IEEE J. Solid-State Circuits, volume 41, March 2006; 
pp.712-727. 
[37]  Yabuuchi, M. et al.  A 45nm Low-Standby-Power Embedded SRAM with
Improved Immunity Against Process and Temperature Variations, IEEE J.
Solid-State Circuits, April 2007, pp. 820-829.
[38]  Tran, C.Q.  Low-power High-speed Level Shifter Design for Block-level 
Dynamic Voltage Scaling Environment, IEEE International Conf. on Integrated 
Circuit and Technology, 2005, pp. 229-232.
[39]  Zhang, K. et al.  A 3-GHz 70-Mb SRAM in 65-nm CMOS Technology With 
Integrated Column-Based Dynamic Power Supply, IEEE J. Solid-State Circuits, 
volume 41, April 2006, pp. 146-152. 
[40]  Wang Y. et al.  A 1.1GHz 12µA/Mb-Leakage RAM Design in 65nm Ultra-Low-
Power CMOS with Integrated Leakage Reduction for Mobile Applications, in 
IEEE ISSCC, February 2007, pp. 324-327. 
[41]  Mukhopadhyay, S; Kim, K.; Mahmoodi, H.; Roy, K.  Design of a Process 
Variation Tolerant Self-Repairing SRAM for Yield Enhancement in Nanoscaled 
CMOS, IEEE J. Solid-State Circuits, volume 42, June 2007, pp. 1370-1376. 
[42]  M. Miyazaki, G. Ono, T. Hattori, K. Shiozawa, K. Uchiyama, and K.Ishibashi.  A 
1000-MIPS/W microprocessor using speed-adaptive threshold-voltage CMOS 
with forward bias, in IEEE ISSCC Dig.Tech. Papers, February 2000, pp. 420-421. 
[43]  Chandrakasan, A., and V. De.  Adaptive body bias for reducing impacts of die-to-
die and within-die parameter variations on microprocessor frequency and leakage, 
IEEE J. Solid-State Circuits, volume 37, November 2002, pp.422-423. 
[44]  Keshavarzi A., Ma S., Narendra S., Bloechel B., Borkar S., and V. De. 
Effectiveness of reverse body bias for leakage control in scaled dual Vt CMOS 
ICs, in Proc. ISLPED, Aug. 2001, pp. 207-212. 
[45]  Calhoun, A.; and Chandrakasan, A.  Analyzing Static Noise Margin for
sub-threshold SRAM in 65nm CMOS, ESSCIRC, September 2005, pp. 1673-1679.
[46]  Frank, A.  Power-constrained CMOS scaling limits, IBM Journal of Research and 
Development, volume 46, December 2001, pp. 235-244 
[47]  Mohammad, B; Saint-Laurent, M; Bassett P.; Abraham J.  Cache Design for Low 
Power and High Yield, ISQED, March 2008, pp. 103-107.
[48]  Weiss, A.; Wuu, J and Chin, V.  The On-chip 3-MB Subarray-based Third Level 
Cache on an Itanium Microprocessor, IEEE J. Solid-State Circuits, volume 37, 
November 2002, pp. 1523-1529. 
[49]  Seevinck, E., List, F.J., and Lohstroh, J.  Static Noise Margin Analysis of MOS 
SRAM Cells, IEEE J. Solid-State Circuits, volume 22, October 1987, pp. 748-754. 
[50]  Hennessy, J. and Patterson, D.  Computer Organization & Design, 3rd ed., 
Morgan Kaufmann, 2005. 
[51]  Zhang, K. et al.  A 3-GHz 70-Mb SRAM in 65-nm CMOS technology with 
integrated column-based dynamic power supply, IEEE J. Solid-State Circuits, 
volume 41, January 2006, pp. 146-152. 
[52]  Mohammad, B; Seok, G.; Kim, H.  Verification of gate level model for custom 
design in scan mode,  IEEE Microprocessor Test and Verification Conf., 
December 2007. 
[53]  TetraMAX, Version Z-2007.03-SP1, Synopsys Inc., 2007.  
[54]  Zarrineh, K.; Ziaja, T.A.; and Majumdar, A.  Automatic Generation and Validation 
of Memory Test models for High Performance Microprocessors, ICCD conf., 
2001, pp. 526-529. 
[55]  Kundu, S.  Gate maker: A transistor to gate level model extractor for simulation, 
automatic test pattern generation and verification, International Test Conf., 
October 1998, pp. 372-381. 
[56]  Roy, K.; Mukhopadhyay, S.; Mahmoodi-Meimand, H.  Leakage current mechanisms 
and leakage reduction techniques in deep-submicron CMOS circuits, Proceedings 
of the IEEE, volume 91, Feb. 2003, pp. 305-327.
[57]  Wang, W.; Raghunathan, A.; Lakshminarayana, G.; Jha, N.  Input space adaptive 
design: a high-level methodology for optimizing energy and performance, IEEE 
Transactions on Very Large Scale Integration (VLSI) Systems, June 2004, pp. 590-602. 
[58]  Das et al.  A self-tuning DVS processor using delay-error detection and 
correction, IEEE J. Solid-State Circuits, volume 41, Jan. 2006, pp. 792-804. 
[59]  Datta, R.; Abraham, J.A.; Diril, A.U.; Chatterjee, A.; Nowka, K.  Adaptive design for 
performance-optimized robustness, IEEE International Symposium on Defect and 
Fault Tolerance in VLSI Systems, June 2004, pp. 590-602. 
[60]  Chen, Q.; Meterelliyoz, M.; Roy, K.  A CMOS thermal sensor and its applications in 
temperature adaptive design, ISQED, March 2006. 
[61]  Melzner, H.; Olbrich, A.  Maximization of good chips per wafer by optimizing 
redundancy, IEEE Transactions on Semiconductor Manufacturing, May 2007, 
pp. 68-76. 
[62]  Bickford, J. et al.  SRAM redundancy: Silicon area versus number of repairs trade-
off, IEEE Advanced Semiconductor Manufacturing Conference, May 2008, 
pp. 387-392.
Vita 
Baker Shehadah Mohammad was born in Yatta-Hebron, Palestine, the son of 
Shehadah and Fatimah Mohammad.  He is a senior staff engineer at Qualcomm Austin, 
where he is engaged in designing the next-generation Qualcomm DSP processor.  Prior to 
joining Qualcomm, he worked on a wide range of processors at Intel Corporation.  He 
has more than 12 years of experience in processor design with an emphasis on circuit and 
physical design.  He received the B.S. degree from the University of New Mexico, 
Albuquerque, and the M.S. degree from Arizona State University, Tempe, both in 
Electrical Engineering. 
 
 
 
Permanent address:  13316 Kinder Pass,  
Austin, TX 78727 USA 
 
This dissertation was typed by Baker Shehadah Mohammad. 
 
 
 
 