Energy-efficient SOC design technology and methodology by David W. Flynn (7210841)
Energy-efficient SOC design technology and methodology
by 
David Walter Flynn
Doctoral Thesis 
Submitted in partial fulfilment of the 
requirements for the award of 
Doctor of Engineering of Loughborough University 
March 2007, final updates June 2007 
© by David Flynn 2007 
Energy efficient SOC design technology and methodology 
System-On-Chip Technology Demonstrators 
Developed within the EngD Research Programme 
SOC#1 TSMC 180nm G: Standard Low-Power
SOC#2 TSMC 130nm G: Dynamic Voltage Scaling
SOC#3 UMC 130nm SP/LL: Dynamic Voltage Scaling
SOC#4 TSMC 65nm LP: Dynamic Voltage Scaling & Leakage Management
SOC#5 TSMC 90nm G: Enhanced Leakage Management
Abstract 
This thesis covers the portfolio of research projects addressing dynamic and static power
reduction applicable to mass-market designs. It covers work over the period 2001-2006
while employed in the Research and Development Group at ARM Ltd in Cambridge, UK,
and seconded during 2005 to ARM Inc in Sunnyvale, California, USA.
The Research Programme has focussed on developing design styles and
methodologies for synthesizable microprocessor and support Intellectual Property (IP) to
address energy-efficient chip implementation using both dynamic and static power
reduction, while minimizing the changes required to tools and library components.
A canonical System-On-Chip (SOC) design, representative of portable battery-powered
low-power customer designs, was specified and developed in the first year of the
research project and has been extended and developed over the five years. This has
resulted in 5 technology demonstrator chips and evaluation platforms being
successfully implemented and fabricated. The evaluation platforms and demonstration
software have been developed by the author to serve as technology
demonstrators for the product groups in ARM, to showcase new licensable control
software and hardware IP as well as new power management standard library cells and
multi-voltage components.
ARM Ltd has a number of advanced technology licensees and partners who have built
tools, proprietary design flows and their own memory and library technology. Addressing
the significant customer base of less expert licensees who do not have such in-house
expertise has been the primary requirement for this research, in order to equip them with
EDA tools from partner companies and Physical IP building blocks from ARM that allow
advanced dynamic and static power management techniques to be applied to
synthesizable designs.
This EngD thesis covers the public material in relation to the technical innovation, the
real-world engineering challenges, the subset of the patents filed that have been granted
or have received notice of acceptance, and the published results, conference papers and
publications. The "Intelligent Energy Manager" control software, the ARM1176-family
DVFS-ready CPUs, the Intelligent Energy Controller, and the baseline specifications for
the "Multi-Voltage Kits" and "Power Management Kits" are all ARM licensable products
that have been developed from this research project.
Finally, the major achievement of the programme has been the invitation to contribute as
a primary author to "The Low Power Methodology Manual", published by Springer in 2007,
aimed at practical real-world engineering application of the methodology developed.
Acknowledgements 
Loughborough University 
Thanks to Loughborough staff supporting this EngD research programme
• Prof Simon Jones - project supervision 2001-2002
• Dr Vince Dwyer - research programme oversight and supervision
Synopsys Inc
Thanks are due to many staff at Synopsys for collaborative support for the programme
• Michael Keating - Synopsys Fellow, now in Advanced Technology Group
o Primary sponsor for the Synopsys low-power and EDA collaboration
• Alan Gibbons - Principal Engineer and design flow specialist
o Primary UK engineering resource working on ARM flow development
o Implementation lead on the SALT926 project
• Pin Han-Chen - Back-end design expert in IP and PHY team
o Implementation lead on the DVS926 project and test-shuttle liaison
ARM Limited 
Thanks to ARM staff who worked on the underlying projects that make up this EngD
portfolio
• Dr Dipesh Patel - R&D Group manager 2004-6
o Seconded to R&D in 2003 to productize the early DVFS IP
• Dr Kris Flautner - IEM Software Architect and now Director of Research
o Research partner on the DVFS control aspects and low-power expert
• Dr Rob Aitken - Research Director for the ARM Physical IP
o With Artisan prior to the ARM acquisition in 2005
• David Howard - Principal VLSI engineer
o Seconded to R&D in 2005-2006 for ULTRA926 and SALT926 programmes to
work on the experimental physical IP design and layout
• John Biggs - Consultant Engineer in R&D
o Significant support in design flow development, back end library and chip
implementation on the SALT926 silicon in particular, and reviewer-in-chief
• Sachin Idgunji - Senior R&D engineer in Physical IP division, from 2006
o Design implementation consultant to ULTRA926 project (while at Synopsys)
o Power analysis and timing closure expert for the SALT926 project
• Ashwin Rebello - Sponsored MSc internship student 2006
o 65nm silicon evaluation support and leakage modelling
• David Roberts - PhD student at University of Michigan
o 2003, 2004 and 2005 summer internships, DVFS evaluation and Linux
porting to the DVS926 reference design in particular
Table of Contents 
Abstract
Acknowledgements
CMOS Power Fundamentals
Dynamic and Leakage Power Mitigation Approaches
State-of-the-art at the start of the EngD research programme
References
Bibliography

Chapter 1 - Eng-D Context
Introduction
Background
Commercial Environment
Project Team Management Responsibilities
Low power portable product collaborations
Licensable IP Business Model
IP Security and Abstract Models
Commercial and Business Cycles (2002 down-turn)
2005 Acquisition of Artisan Components (New Physical IP)
Collaborative Partnership Relationships

Chapter 2 - Canonical SOC Design - 5 Generations
Requirements
SOC#1 - sARS2 "Audio Reference System Design"
SOC#2 - DVS926 "Dynamic Voltage Scaling Demonstrator"
SOC#3 - ULTRA926 "DVFS Reference System Design"
SOC#4 - ATLAS926-65LP "Leakage and DVFS Demonstrator"
SOC#5 - SALT926-90G "Leakage/Physical-IP Demonstrator"

Chapter 3 - Design for Low Dynamic Power - DVFS
Technology Dependent Design Constraints
Standard Operating Conditions
Performance Scaling Design Requirements
Voltage/Frequency Scaling Range
Variable Clock Latency Management
Power/Test/Reset/Clocking Requirements in detail
Detailed IEM System Clock Generation
Pre-compensating for DVS Clock Tree Latency Variation
Static Timing Analysis Strategy
AVS Current and Target Clock Generation
Dynamic Clock Control Interface Specification

Chapter 4 - Design for State-Retention Power-Gating - SRPG
Principles of Power gating design methodology and flow
Chip architecture for power gating
Power networks and their control
Power domain interface signal isolation methods
State retention and restoration methods
Power gating control
Power gating design verification - RTL simulation
Design-For-Test considerations
Power-Gating/State-Retention in the SALT project

Chapter 5 - Physical IP for Low Power Design
Library IP support for DVFS
Library IP support for (MTCMOS) Power Gating
Library IP characterization for Power Gating
Library IP support for State Retention
Library IP support for dynamic well-bias
Library IP support for Super-Cut-off CMOS (SCCMOS)

Chapter 6 - Evaluation Platforms
Design Goals
180nm sARS2 Evaluation Board
Software Development Board for IEC
130nm DVS926 Voltage Scaling Test Platform
130nm DVS926 Voltage Scaling Demonstration Platform
"IEM" Voltage Scaling Exhibition Platform
130nm ULTRA926 Voltage Scaling Demonstration Platform
65nm ATLAS926 DVS/Leakage Demonstration Platform
Diagnostic and Analysis Boards
90nm SALT926 Leakage Test Platform

Chapter 7 - Technology Demonstrator Evaluation and Analysis
sARS2 180nm SOC Evaluation
DVS926 130nm Silicon Evaluation
DVS926 130nm DVFS Energy Savings
ULTRA926 130nm Silicon Evaluation
ULTRA926 130nm DVFS Energy Savings
ATLAS926-65LP Silicon DVFS Evaluation
ATLAS926-65LP Silicon Leakage Mitigation Evaluation
ATLAS926-65LP Thermal Leakage Characteristics

Chapter 8 - Patents Filed/Granted
US 6,883,102 Power Management Control API
US 6,950,951 RTL Power Control Interface
US 2004/0153762 Bus Based State Save and Restore
US 2004/0139361 IEC Performance Available Response
US 2004/0138833 IEC PWM Dynamic Performance Scaling
US 2006/0152268 StateSaver Retention Register

Chapter 9 - Publications and Conferences
Chapter for "Closing the Gap between Custom and ASIC", 2002
Canadian Microelectronics Corp Keynote, Banff 2002
Microprocessor Reports DVS/AVS Article, Jan 2003
DesignCon 2003, SA2-3 Hardware/Software paper/presentation
HotChips 2003, Intelligent Energy Management with ARM926
EE-Times Jan 2004, Design and Evaluation of power-efficient SOCs
DATE 2004, Energy Efficient SOC with Dynamic Voltage Scaling
IEE/ACM Colloquium, SOC Design Test & Technology, Sep 2004
Synopsys EDA Interoperability Conference, Oct 2004
DAC 2005 All-day Low Power Tutorial - DVFS and Leakage
ARM Developers Conference, Leakage Control, Oct 2005
DAC 2006 Leakage Technology Demonstrators (TSMC & UMC)
ARM Developers Conference, Low Power Panel, Oct 2006

Chapter 10 - Conclusions and Future Work
Evaluation against the research goals
Contributions to knowledge and industrial application
Future Work

Appendix A - Script for RTL emulation of SRPG
Appendix B - External non-confidential publications
Energy Efficiency Challenges 
Managing the power dissipation of complex System-On-Chip (SOC) designs has been a
prime design consideration of ARM-based product designers for some years now. It is well
understood that reducing both the peak and average power consumption will reduce the
manufacturing and packaging costs as well as improve the reliability and battery life.
However, the thriving market for ever more sophisticated mobile wireless devices such as cell
phones, media players, Personal Digital Assistants and cameras is placing ever increasing
demands on the battery. Consumers want more and more features in their mobile devices but
still demand a convenient form factor and long battery life. Unfortunately battery technology is
not developing fast enough to meet this demand, and this shortfall is relentlessly driving the
demand for cheap, low power, energy efficient SOCs with ever greater integration.
CMOS Power Dissipation Fundamentals 
There are three major sources of power dissipation in digital CMOS circuits and they can be
broken down into
• dynamic power dissipation from the switching activity to charge and discharge
capacitive loads, plus the transitory short-circuit power caused when CMOS output drivers
switch (P_switching + P_short-circuit)
• and leakage power dissipation (P_leakage) due to the fact that the transistors are not
perfect switches and do leak some current when logically "off"
$P_{total} = P_{switching} + P_{short\text{-}circuit} + P_{leakage}$    (1)
Addressing Dynamic Power
In order to minimise the dynamic power dissipation term of equation (1), not only should
the clock frequency (f_clk) be lowered but also the switching activity (α) and, where
possible, the supply voltage (V_DD).
One of the simplest ways to reduce the switching activity (α) is to inhibit registers from being
clocked when it is known that their output will remain unchanged. In a typical SOC as much
as 30% of the switching power is dissipated in the clock tree, so this technique, known as
Clock Gating (CG), can yield a significant saving in both power dissipation and energy
consumption.
As power is the rate of doing work, the average power dissipation of a system can be
reduced by slowing the rate at which work is done. In practice this means lowering the clock
frequency (f_clk) when the maximum system performance is not required. This technique,
known as Dynamic Frequency Scaling (DFS), leads to a linear reduction in average power
dissipation.
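The effect of clock gating and DFS on switching power can be illustrated numerically. The sketch below assumes the textbook switching-power model P = α·C·V_DD²·f_clk, which is not spelled out in the text above. The function name and every component value are hypothetical and chosen only for illustration.

```python
def switching_power(alpha, c_load, vdd, f_clk):
    """Estimate switching power in watts as alpha * C * VDD^2 * f_clk.

    The model and all values used below are illustrative assumptions,
    not figures taken from the thesis.
    """
    return alpha * c_load * vdd ** 2 * f_clk


# Hypothetical baseline: 20% activity, 1 nF switched capacitance, 1.2 V, 200 MHz.
baseline = switching_power(alpha=0.2, c_load=1e-9, vdd=1.2, f_clk=200e6)

# Clock gating modelled as a lower effective switching activity.
clock_gated = switching_power(alpha=0.1, c_load=1e-9, vdd=1.2, f_clk=200e6)

# DFS modelled as half the clock frequency at an unchanged supply voltage.
half_clock = switching_power(alpha=0.2, c_load=1e-9, vdd=1.2, f_clk=100e6)

print(f"baseline power     : {baseline * 1e3:.1f} mW")
print(f"clock gated / base : {clock_gated / baseline:.2f}")
print(f"half clock / base  : {half_clock / baseline:.2f}")
```

In this simple model, halving either the activity or the clock frequency halves the switching power, which is the linear behaviour described for DFS.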
Figure 0.1A - Leakage and Dynamic power trends from ITRS roadmap
[Plot not reproduced: gate length and dynamic power trends by year (to 2010-2015), with a possible trajectory if high-k dielectrics reach mainstream production]
Source: Kim, N.S., Austin, T., Blaauw, D., Mudge, T., Flautner, K., Hu, J.S., Irwin, M.J., Kandemir, M.,
Narayanan, V., "Leakage current: Moore's law meets static power", IEEE Computer,
Vol. 36, Issue 12, 2003
Figure 0.1B - Leakage and Dynamic power trends from ITRS roadmap
$I_{LEAK} = I_{SUB} + I_{GATE} + I_{GIDL} + I_{REV}$
However, DFS does not reduce the energy consumption for a given task, as the work done
remains constant. For some very "leaky" processes, the total energy consumption may in
fact increase due to spending longer in active mode.
Real energy savings can however be achieved when, at the same time as reducing the clock
frequency, the voltage is also reduced to a level that is just high enough to support this
lowered clock frequency. This results in less work to do in charging the internal capacitances
to the supply voltage (V_DD) and so less energy is consumed. This technique, known as
Dynamic Voltage and Frequency Scaling (DVFS), leads to a quadratic reduction in energy
consumption and a cubic reduction in average power dissipation. It should be noted that, as it
is not possible to scale the voltage and frequency instantaneously, there is some
energy overhead in moving between the various performance levels. There is also a
technology dependent "floor" below which voltage scaling becomes unsafe and logic may
behave unreliably or be subject to soft errors.
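The quadratic and cubic relationships can be checked with a short calculation. This is a minimal sketch under idealised assumptions that are not from the text: switching energy per cycle is C_eff·V_DD², the supply voltage can be scaled in direct proportion to the clock frequency, and leakage, transition overhead and the voltage "floor" are ignored. All names and values are hypothetical.

```python
def switching_energy(c_eff, vdd, cycles):
    """Switching energy in joules for a task of `cycles` clock cycles."""
    return c_eff * vdd ** 2 * cycles


def average_power(c_eff, vdd, f_clk):
    """Average switching power in watts at a given voltage and frequency."""
    return c_eff * vdd ** 2 * f_clk


# Hypothetical operating point and workload.
C_EFF = 0.2e-9          # effective switched capacitance per cycle (F)
CYCLES = 1_000_000      # fixed amount of work
V_FULL, F_FULL = 1.2, 200e6


def scaling_ratios(v_scale, f_scale):
    """Return (energy ratio, power ratio) relative to full voltage and frequency."""
    energy = (switching_energy(C_EFF, V_FULL * v_scale, CYCLES)
              / switching_energy(C_EFF, V_FULL, CYCLES))
    power = (average_power(C_EFF, V_FULL * v_scale, F_FULL * f_scale)
             / average_power(C_EFF, V_FULL, F_FULL))
    return energy, power


# DFS: frequency halved, voltage unchanged.
dfs_energy, dfs_power = scaling_ratios(v_scale=1.0, f_scale=0.5)

# Idealised DVFS: voltage and frequency both halved.
dvfs_energy, dvfs_power = scaling_ratios(v_scale=0.5, f_scale=0.5)

print(f"DFS : energy x{dfs_energy:.3f}, power x{dfs_power:.3f}")
print(f"DVFS: energy x{dvfs_energy:.3f}, power x{dvfs_power:.3f}")
```

With both voltage and frequency scaled by one half, the energy ratio is the square of the scale factor and the power ratio is its cube, whereas frequency scaling alone leaves the task energy unchanged.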
Addressing Leakage Power
The other source of power dissipation is leakage power, which is predominantly due to the fact
that transistors are not perfect switches and so can never be completely turned off.
Although leakage power used to be considered insignificant when compared to dynamic
power at 180 and 130nm technology, it has become significant at 90nm, and potentially
dominant at 65nm and below, so can no longer be ignored.
Figure 0.1A shows the trend projections for leakage power in relation to dynamic power for
projected technology scaling from the International Technology Roadmap for Semiconductors
(revised annually).
Leakage power is dissipated in both active mode and standby mode, and the currents which
go to make up the total leakage are increasing fast (Figure 0.1A). In some applications, it
may be more energy efficient to run fast and stop rather than to lower the voltage and
frequency, due to the high active leakage currents.
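The leakage penalty of spending longer in active mode can be sketched as follows. The model is a deliberate simplification and not taken from the text: leakage current is treated as constant while the block is active, leakage is assumed to be zero once the task has finished (for example because the block is then switched off), and only a frequency-only slowdown at fixed voltage is compared. All names and values are hypothetical.

```python
def active_energy(c_eff, vdd, i_leak, cycles, f_clk):
    """Energy in joules to finish `cycles` of work: switching plus active leakage."""
    run_time = cycles / f_clk                  # seconds spent in active mode
    dynamic = c_eff * vdd ** 2 * cycles        # independent of clock frequency
    leakage = i_leak * vdd * run_time          # grows with time spent active
    return dynamic + leakage


# Hypothetical "leaky" block: 50 mA active leakage at 1.2 V.
PARAMS = dict(c_eff=0.2e-9, vdd=1.2, i_leak=50e-3, cycles=1_000_000)

run_fast_then_stop = active_energy(f_clk=200e6, **PARAMS)
run_slow = active_energy(f_clk=100e6, **PARAMS)

print(f"run fast then stop: {run_fast_then_stop * 1e6:.0f} uJ")
print(f"run at half clock : {run_slow * 1e6:.0f} uJ")
```

Because the switching energy is the same in both cases, the slower run costs more in total: it simply leaks for twice as long. Whether lowering the voltage as well recovers that loss depends on how strongly leakage falls with voltage, which this sketch does not model.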
There are four main sources of leakage current in a CMOS transistor, shown in Figure 0.1B:
(1) Sub-threshold Leakage (I_SUB): the current which flows from the drain to the source
of a transistor operating in the weak inversion region
(2) Gate Leakage (I_GATE): the current which flows directly from the gate through the oxide to
the substrate due to gate oxide tunnelling and hot carrier injection
(3) Gate Induced Drain Leakage (I_GIDL): the current which flows from the drain to the
substrate, induced by a high field effect in the MOSFET drain caused by a high V_DG
(4) Reverse Bias Junction Leakage (I_REV): caused by minority carrier drift and generation of
electron/hole pairs in the depletion regions
Of the various components which go to make up the total leakage current (I_LEAK), it is currently
the sub-threshold leakage (I_SUB) which is dominant. However, the gate leakage (I_GATE) is
becoming significant but may yet be mitigated by high-k dielectric materials in future (the
lower dotted projection in Figure 0.1A).
The most effective techniques for mitigating sub-threshold leakage are Power Gating (often
but confusingly called Multi-Threshold CMOS, MTCMOS) and Back-bias (a form of what is
often named Variable Threshold CMOS, VTCMOS), described below.
For the 
Sub-threshold Current (I_SUB)
The MOSFET is said to be in the weak inversion region when V_GS is below V_th. In this
region the V_DS drops across the reverse-biased substrate-drain junction. Hence the electric field
along the channel is small. Since the electric field is small, the drift component of the current in
the channel is small and the current is dominated by diffusion. The carriers move by diffusion
along the channel, similar to charge transport across the base of bipolar transistors. Weak
inversion dominates modern device off-state leakage due to the low V_th. The sub-threshold
current can be expressed as shown in the equations below.
Sub-threshold leakage occurs when a CMOS gate is not turned completely off. Its value is
given by
$I_{SUB} = I_0 \, e^{(V_{GS} - V_T)/(\alpha V_{th})}$
where V_T is the device threshold voltage, V_th is the thermal voltage kT/q (25.9 mV at room
temperature), V_GS is the gate-source voltage, and I_0 is the current when V_GS = V_T. The parameter
α is a function of the device fabrication process and ranges from 1.0 to 2.5.
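$I_{SUB} = \mu_0 C_{ox} \frac{W}{L} (m - 1) V_T^2 \, e^{(V_{GS} - V_{th} + \eta V_{DS})/(m V_T)} \left(1 - e^{-V_{DS}/V_T}\right)$
• where V_th is the threshold voltage (in this form the roles of V_th and V_T are swapped relative to the expression above)
• V_T = kT/q is the thermal voltage
• C_ox is the gate oxide capacitance
• μ_0 is the zero bias mobility
• m is the sub-threshold swing coefficient (also called the body effect coefficient)
• η is the drain-induced barrier lowering coefficient
and W and L are the width and length parameters for the transistor. The V_DS dependence is
prominent in short channel devices. In long channel devices the sub-threshold current is
independent of the drain voltage for V_DS larger than a few V_T.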
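To give a feel for the exponential sensitivity in the simpler expression I_SUB = I_0·exp((V_GS − V_T)/(α·V_th)), the following sketch evaluates it for two threshold voltages. The device values (I_0, α, the two thresholds) are hypothetical and chosen only for illustration. The physical constants are the exact SI values for the Boltzmann constant and the elementary charge.

```python
import math

BOLTZMANN_J_PER_K = 1.380649e-23      # Boltzmann constant k
ELECTRON_CHARGE_C = 1.602176634e-19   # elementary charge q


def thermal_voltage(temp_kelvin):
    """Thermal voltage kT/q in volts."""
    return BOLTZMANN_J_PER_K * temp_kelvin / ELECTRON_CHARGE_C


def subthreshold_current(i0, vgs, vt, alpha, temp_kelvin=300.0):
    """I_SUB = I_0 * exp((V_GS - V_T) / (alpha * kT/q)), with V_T the threshold."""
    return i0 * math.exp((vgs - vt) / (alpha * thermal_voltage(temp_kelvin)))


# Hypothetical device parameters.
I0 = 1e-6        # current at V_GS = V_T (A)
ALPHA = 1.5      # process-dependent parameter, within the quoted 1.0 to 2.5 range

# Off-state leakage (V_GS = 0) for a higher and a lower threshold voltage.
leak_high_vt = subthreshold_current(I0, vgs=0.0, vt=0.40, alpha=ALPHA)
leak_low_vt = subthreshold_current(I0, vgs=0.0, vt=0.30, alpha=ALPHA)
leakage_ratio = leak_low_vt / leak_high_vt

# Gate voltage change needed for a tenfold change in current.
swing_mv_per_decade = ALPHA * thermal_voltage(300.0) * math.log(10) * 1e3

print(f"thermal voltage at 300 K : {thermal_voltage(300.0) * 1e3:.2f} mV")
print(f"leakage ratio for 100 mV lower threshold: {leakage_ratio:.1f}x")
print(f"sub-threshold swing      : {swing_mv_per_decade:.0f} mV/decade")
```

Under these assumed values a 100 mV reduction in threshold voltage raises the off-state current by more than an order of magnitude, which is why low-threshold devices dominate off-state leakage.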
Dynamic and Leakage Power Mitigation Approaches
ARM licenses and delivers both synthesizable components (CPUs and system level components
and interconnect) and physical "Standard Cell" and memory library components.
The research programme in this thesis has been driven by the requirement to allow the benefits
of dynamic and leakage power management to be applied with as few changes as possible to
the Electronic Design Automation (EDA) tools and flows that customers acquire or internally
develop.
Dynamic Voltage and Frequency Scaling (DVFS)
Dynamic power management has been addressed by developing cell-level hardware components
and design and implementation methodologies to address the voltage-scaled timing relationships,
all built within a licensable Operating System software component. The voltage-scaled power and
energy savings are only realized by a high-level set of performance monitoring and prediction
policies that determine the real-time processing workload and build the knowledge to set the
operating performance levels just high enough to allow task deadlines to be met and the
frequency and voltage to be scaled effectively [described in detail in the Appendix B papers, pages
B3-B93].
DVFS is a system level design challenge, and a performance setting and monitoring control
approach has been taken to abstract the ARM control functions as licensable Intellectual Property
hardware and software components, abstracting away the customer-specific voltage scaling and
power supply management details.
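The idea of setting the performance level "just high enough" to meet a deadline can be sketched in a few lines. This is not the Intelligent Energy Manager policy, whose details are in the Appendix B papers. It is a hypothetical illustration in which the predicted workload is given as a cycle count and the available performance levels are an invented list of clock frequencies.

```python
# Hypothetical set of supported performance levels (clock frequency in MHz).
PERFORMANCE_LEVELS_MHZ = (50, 100, 150, 200)


def pick_performance_level(predicted_cycles, deadline_s,
                           levels_mhz=PERFORMANCE_LEVELS_MHZ):
    """Return the lowest clock (MHz) that finishes the predicted work in time.

    If no level is fast enough, the highest level is returned so that the
    task at least runs at full speed.
    """
    required_mhz = predicted_cycles / deadline_s / 1e6
    for level in sorted(levels_mhz):
        if level >= required_mhz:
            return level
    return max(levels_mhz)


# 1.2 million cycles due in 10 ms needs about 120 MHz, so 150 MHz is selected.
print(pick_performance_level(predicted_cycles=1_200_000, deadline_s=0.010))

# A light workload can run at the lowest level.
print(pick_performance_level(predicted_cycles=400_000, deadline_s=0.010))
```

A real policy would also have to predict the workload from past behaviour and account for the time and energy cost of switching levels, both of which are left out here.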
Leakage mitigation using State-Retention Power-Gating (SRPG)
Although analogue techniques such as Dynamic Threshold Scaling or adaptive well bias
schemes appear attractive, the requirement to provide reliable portable Physical IP components
that are free from production complexities (to avoid device latch-up for example) focussed
the research work on extensions to standard logic cell libraries to support effective
leakage mitigation.
Building on a foundation of power gating switch cells, isolation cells and retention registers, the
challenge was to build a standard-cell abstraction extension to standard synthesis and
place-and-route implementation flows to allow design and verification without having to dive down to
transistor level views and analysis (something that is possible for expert companies with their own
transistor level design expertise, but not appropriate for licensable Physical IP delivered to less
expert design teams and customers).
A State-Retention paradigm that allows seamless state saving and restoring before and after
power gating respectively has been chosen and developed. It can be applied transparently to
RTL design and supports very energy efficient state retention and restoration appropriate for fast
real-time responsive systems.
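The save-before, restore-after behaviour can be pictured with a small behavioural model. The sketch below is purely illustrative and is not the control scheme developed in the thesis: the class, its attributes and the exact ordering of isolation, save, switch-off, switch-on, restore and de-isolation are assumptions made for the example.

```python
class PowerGatedDomain:
    """Toy behavioural model of a power-gated domain with retention storage."""

    def __init__(self, registers):
        self.registers = dict(registers)  # live register state
        self.retained = None              # contents of the retention storage
        self.powered = True
        self.isolated = False

    def power_down(self):
        """Assumed order: isolate outputs, save state, then remove power."""
        self.isolated = True                  # clamp outputs seen by live logic
        self.retained = dict(self.registers)  # save state into retention storage
        self.registers = None                 # live state is lost with the power
        self.powered = False

    def power_up(self):
        """Assumed order: restore power, restore state, then release isolation."""
        if self.powered:
            raise RuntimeError("domain is already powered")
        self.powered = True
        self.registers = dict(self.retained)  # restore the saved state
        self.isolated = False


# Hypothetical register contents.
domain = PowerGatedDomain({"pc": 0x8000, "r0": 42})
domain.power_down()
state_while_off = domain.registers
domain.power_up()
print(domain.registers)
```

From the point of view of the surrounding logic, the domain comes back with the same register contents it had before it was switched off, which is the "seamless" property described above.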
Industrial State-of-the-Art at the start of the EngD Research Programme
A clear distinction has to be drawn at the outset of this thesis between cutting-edge academic
research and the realities of commercially viable and re-usable solutions that can be widely
licensed and deployed:
• Academic research - university (including some industrial collaboration) research that
has driven forward the techniques and theoretical benefits of a variety of low power and
high performance techniques applicable to microprocessors, multi-media systems or cell
library designs. The primary references for the power management techniques adopted
for this research are introduced and discussed in this section.
• 'Expert' Industrial Usage - where Integrated Device Manufacturers (IDMs) apply some of
the academic research work to key market/application areas using their detailed
knowledge and access to in-house wafer fabrication data, internal and pre-production
external design automation tools, and source-level IP, libraries, memories etc. to support
transistor level analysis and sign-off of complete solutions. ARM has the privilege of
delivering synthesizable processor and system IP to such customers and gains visibility
of the techniques and challenges presented and the bespoke solutions often adopted for
in-house design flows. Understanding these approaches but steering clear of proprietary or
potentially patented techniques is important; the references are typically to publicly
announced products and lag behind the early confidential material. The leading-edge
industrial references that were public at the time are introduced below. Interaction with
such advanced design groups also has the benefit of early visibility of the EDA tool
enhancements influenced by such cutting-edge groups.
• Mainstream deployment - requiring widely-available synthesizable and physical IP
together with mainstream EDA tools and methodology. This is ARM's primary business
and is the focus of the "applied research" covered in this thesis. The process technology
often lags behind that of the leading-edge IDMs and relies on design tools that help
abstract the detailed transistor problems away from the designer/implementer, such that the
methodology and tools are as close as possible to the synthesis/place-and-route/timing
analysis flows they are familiar with. The abstraction away from transistors and the
commercially sensitive/valuable design data then supports licensing to end designers
through Foundry models where the implementation flows rely solely on "front-end" views
and abstracted models, and the transistor-level back-end internals are merged in at the
licensed Foundry.
One key reference used to plan the technology demonstrators was the International Technology
Roadmap for Semiconductors. The ITRS 2001 roadmap [1] was the starting reference for this
thesis, focussing on the 90nm and 65nm technology schedules. This full roadmap is updated
every two years (2003, 2005) with an intermediate update published in the even years.
Academic Prior Art for Dynamic Voltage Scaling
A comprehensive survey of low power design techniques was published in IEEE Transactions
on VLSI Systems by Benini, Bogliolo and De Micheli [2]. This usefully applied voltage scaling to
Intel's ARM-based StrongARM SA-1100 processor as one of the primary low power design examples.
The author's research builds largely on work that ARM initiated and supported at the Berkeley
Wireless Research Center, University of California, Berkeley [3], [4]. An ARM810 cached
microprocessor design was made available to Thomas Burd and team and used as the reference
design for dynamic voltage and frequency scaling. The hardware control techniques focussed on
a ring-oscillator approach, the frequency of operation varying with the voltage [5]. The research for
the DVFS work covered in this thesis has focussed on the alternative approach of setting a target
frequency (better understood by conventional static timing analysis tools) and then supporting
table look-up or adaptive dynamic voltage control schemes.
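The table look-up approach mentioned above can be illustrated with a short sketch: given a target frequency, choose the lowest characterised supply voltage that supports it. The table contents, names and error handling here are hypothetical and are not taken from any of the designs described in this thesis.

```python
import bisect

# Hypothetical characterisation table, sorted by frequency:
# (maximum supported frequency in MHz, supply voltage in volts).
VOLTAGE_TABLE = ((60, 0.80), (120, 0.90), (180, 1.05), (240, 1.20))


def voltage_for_frequency(target_mhz, table=VOLTAGE_TABLE):
    """Return the lowest table voltage whose frequency limit covers the target."""
    max_freqs = [freq for freq, _ in table]
    # Index of the first entry whose frequency limit is >= the target.
    index = bisect.bisect_left(max_freqs, target_mhz)
    if index == len(table):
        raise ValueError(f"{target_mhz} MHz exceeds the characterised range")
    return table[index][1]


print(voltage_for_frequency(120))   # exactly on a table entry
print(voltage_for_frequency(121))   # rounds up to the next entry
```

An adaptive scheme would replace the fixed table with a voltage found by closed-loop measurement on the silicon, which this sketch does not attempt to model.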
In order to offer a licensable system level solution, a major component is the algorithms to set and
monitor deadlines in order to exploit the potential for reduced frequency and hence reduced
headroom power rail control. The research behind the software control techniques has been built
on the work at the University of Michigan, in Prof Trevor Mudge's research group [6], notably
Krisztian Flautner's PhD research [7], [8] (Dr Flautner was recruited to ARM in 2000 and ARM
subsequently licensed the technology for use in hardware and software products).
Commercial Prior Art for Dynamic Voltage Scaling
A number of companies had worked on chip level voltage scaling, where the entire design inside
the pad-ring could be voltage scaled and the level shifters in the input/output pads provide the
analogue voltage interfacing from the core chip voltage to the IO pad voltages.
For example, Intel had published the design challenges from their perspective in 1999 [9] and
announced a 1 GHz Mobile Pentium III processor with proprietary SpeedStep frequency/voltage
scaling technology in 2001 [10]. Transmeta had announced their low-power Crusoe processor
with "LongRun" dynamic voltage and frequency scaling technology [11], [12].
The primary ARM licensee working on delivering products with DVFS was Texas Instruments; the
OMAP1510 application processor plus DSP for wireless applications was unveiled in 2001 [13]
and DVS control support was unveiled in 2002 [14]. The ARM925 CPU used in this product was a
derivative of the ARM926EJ-S CPU that has been used for all but the first of the research
projects undertaken by the author in this research programme.
As the projects developed, the adaptive voltage scaling work grew out of work at the University of
Colorado, Boulder [15], which was subsequently exploited jointly with National Semiconductor Inc.
Academic Prior Art for Power Gating and State Retention
Power-gating is the preferred name used in this thesis for the less descriptive Multiple Threshold
CMOS, "MTCMOS", used in academic research. The original technique was proposed as early as
1993 by Mutoh et al, applied to 0.5 micron technology [16]. Subsequently, techniques for
preserving the state of registers while logic was power gated were described by the same team at
NTT Labs as applied to DSP subsystems [17].
Kao and Chandrakasan at MIT published a paper at the Design Automation Conference in 1997
dealing with the issue of designing and sizing MTCMOS switch transistors [18]. A follow-on paper
that described the system issues to be addressed to make MTCMOS work practically without
sneak leakage paths was presented at the European Solid-State Circuits Conference, ESSCIRC
2001 [19]. This paper outlines the practical application of high-Vt switches to both combinatorial
logic and sequential circuits.
Stan at the University of Virginia described the concept of "Multi-Voltage" CMOS, MVCMOS, as an
enhancement over logic-level MTCMOS, where gate control voltages above and below the
standard supply rails are used, in a paper published at ISLPED in 1998 [20].
Run-time power gating is a technique to dynamically switch off functional sub-systems to cut
leakage power while some of the design remains active, described at the MICRO-35 conference in
2002 by a group from the University of Rochester [21].
Commercial Prior Art for Power Gating and State Retention 
Research work published by NTT from 1996 at ISSCC is the definitive starting point for the power
gating work [17]. It introduces the concept of "balloon latches" to preserve state and the
application of high-Vt MTCMOS (header) switches to isolate leaky logic on a virtual power rail.
In 2002 Zyuban from IBM published a study of retention register designs [22], and although this is
largely focussed on a preference for level-sensitive scan latch design, it provided a thorough
review of the area cost impact of different retention latch designs and control alternatives. Zyuban
was one of the group who also contributed to an ISLPED paper in 2004 that described run-time
power gating applied to the Power-4 CPU architecture [23].
From a generic cell library perspective, Virtual Silicon Inc announced libraries with internal power
gating and support for gate negative-bias [24] under the "Mobilize™" brand (Virtual Silicon was
subsequently acquired by MOSAID Inc in 2005). The approach of delivering a "fine-grain" leakage
controlled library was commercially interesting as it sought to re-use standard EDA synthesis
flows as nearly as possible, but at the expense of larger cell area and reduced performance. The
ability to characterize the switch IR-drop effects entirely within the cell was highly attractive
compared to the complexities of requiring instance-based library cells with shared power gating.
Sequence Design was the first of the EDA companies to announce tools to support designing
with shared power switch component "cells" in 2004 [25].
RTL Design for Low Power Prior Art 
Many of the preceding techniques focus on the low-level technology components and circuits, or
higher-level system design approaches and methods. But for the purposes of this thesis the
Register-Transfer Level, RTL, of coding is that of most relevance, as this applies to the
synthesizable IP that ARM licenses and must be cleanly and efficiently supported by EDA tools
and by "standard cell" library components and abstractions.
Because a lot of the dynamic power is dissipated in the clock tree in high-speed designs,
designers often code in a level of Architectural Clock Gating where clocks to subsystems are
gated explicitly. By ensuring that synthesizable RTL coding of clock enable functions is
carefully constrained, the clock gating can be automated; the principles were outlined back in 1994
by Benini et al [26]. Such techniques are now supported widely by EDA synthesis and
optimization tools.
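As an illustrative sketch (not taken from the research projects), the effect of enable-based clock gating on register clock activity can be modelled in a few lines of Python. The function names and the enable trace are hypothetical; the model only counts clock events delivered to a register bank with and without gating, and ignores the clock tree itself and the overhead of the gating cell.

```python
def count_clock_events(enables, gated):
    """Count clock events delivered to a register bank over a run of cycles.

    enables: iterable of booleans, one per cycle (True = register loads new data).
    gated:   if True, the clock is delivered only on cycles where enable is set;
             if False, the clock is delivered every cycle and the enable merely
             selects between new data and recirculated data.
    """
    enables = list(enables)
    if gated:
        return sum(1 for en in enables if en)
    return len(enables)


def clock_activity_saving(enables):
    """Fraction of register clock events removed by gating on the enable."""
    enables = list(enables)
    ungated = count_clock_events(enables, gated=False)
    if ungated == 0:
        return 0.0
    gated = count_clock_events(enables, gated=True)
    return 1.0 - gated / ungated


if __name__ == "__main__":
    # Hypothetical trace: the register is only updated one cycle in four.
    trace = [True, False, False, False, True, False, False, False]
    print(clock_activity_saving(trace))
```

The saving scales directly with how rarely the enable is asserted, which is why the coding style of the enable function matters for automated gating.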
RTL coding is fundamentally built on the concepts of "global" power that is implicitly always on,
and "ideal" clocks that have zero latency. Multi-voltage approaches, whether they are based on
dynamic voltage scaling or power-rail gating for reducing leakage power, however require careful
management of explicit power rails and clock tree latencies that vary with voltage.
Power formats that can annotate the multi-voltage intent on to standard synthesizable RTL
design are finally starting to become industry-standardized in early 2007 [27], [28].
References 
[1] International Technology Roadmap for Semiconductors, ITRS. Initially ITRS 2001 and
subsequent 2003, 2005 editions available on-line: http://www.itrs.net/reports.html
[2] Benini L., Bogliolo A., and De Micheli G., "A survey of design techniques for
system-level dynamic power management," IEEE Trans. VLSI Systems, vol. 8, no. 3,
pp. 299-316, June 2000. http://portal.acm.org/citation.cfm?coll=GUIDE&dl=GUIDE&id=340746
[3] Burd T. D., Brodersen R. W., "Design issues for dynamic voltage scaling", Proceedings
of the 2000 International Symposium on Low Power Electronics and Design, Italy,
pages 9-14, ISBN 1-58113-190-9. http://portal.acm.org/citation.cfm?id=344181
[4] Burd T. D., Brodersen R. W., "Energy Efficient Microprocessor Design", Kluwer
Academic Press, 2002, 376 p., ISBN 978-0-7923-7586-9
[5] http://bwrc.eecs.berkeley.edu/Presentations/Retreats/Winter_Retreat_Jan_2000/Monday%20PM/burd-dvs.pdf
[6] Mudge T., "Power: A First-Class Architectural Design Constraint", IEEE Computer, vol.
34, no. 4, April 2001. http://doi.ieeecomputersociety.org/10.1109/2.917539
[7] Flautner K., Reinhardt S., Mudge T., "Automatic performance setting for dynamic
voltage scaling", Proceedings of the 7th Annual International Conference on Mobile
Computing and Networking, 2001, pages 260-271, ISBN 1-58113-422-3
[8] Flautner K., Mudge T., "Vertigo: Automatic performance-setting for Linux", Proceedings
of the Fifth Symposium on Operating Systems Design and Implementation (OSDI'02),
December 2002
[9] Borkar S., "Design Challenges of Technology Scaling", IEEE Micro, vol. 19, issue 4, 1999
[10] "Intel unveils 1GHz Mobile Pentium III",
http://www.findarticles.com/p/articles/mi_m0EKF/is_12_47/ai_72068750
[11] "Technology behind the Transmeta Crusoe processors",
http://www.charmed.com/PDF/CrusoeTechnologyWhitePaper_1-19-00.pdf
[12] Transmeta "LongRun", www.transmeta.com/crusoe/lowpower/longrun.html
[13] OMAP1510, http://focus.ti.com/pdfs/vf/Wireless/omap1510_bulltn.pdf
[14] OMAP1510 DVS control, http://focus.ti.com/lit/an/slva123/slva123.pdf
[15] Dhar S., Maksimovic D., Kranzen B., "Closed-loop adaptive voltage scaling controller
for standard-cell ASICs," ISLPED'02, August 2002.
http://ece-www.colorado.edu/~pwrelect/pubarch/avs-islped02.pdf
[16] Mutoh S. et al., "1V high speed digital circuit technology with 0.5um multi-threshold
voltage CMOS", Proc. 6th IEEE ASIC Conference, pages 186-189, September 1993
[17] Mutoh S. et al., "A 1V multi-threshold voltage CMOS DSP with an efficient power
management technique for mobile phone applications", ISSCC 1996, pages 168-169
[18] Kao J., Chandrakasan A., Antoniadis, "Transistor sizing issues and tool for
multi-threshold CMOS technology", Proc. 34th Design Automation Conf., pp. 409-414,
June 1997
[19] Kao J., Chandrakasan A., "MTCMOS sequential circuits", Proc. ESSCIRC, 2001,
pp. 332-339. http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1471397
[20] Stan M., "Low-Threshold CMOS Circuits with Low Standby Current", Proceedings of the
International Symposium on Low-Power Electronics and Design, Monterey, CA,
IEEE/ACM, 1998, pp. 97-99
[21] Dropsho S. et al., "Managing static leakage energy in microprocessor functional units",
Proceedings 35th IEEE/ACM Intl. Symposium on Microarchitecture (MICRO-35),
pp. 321-332, Nov 2002
[22] Zyuban V., Kosonocky S., "Low Power Integrated Scan-Retention Mechanism",
Proceedings of the International Symposium on Low Power Electronics and Design,
August 2002, pp. 98-102
[23] Hu Z. et al., "Microarchitectural techniques for power gating of execution units",
Proceedings of the International Symposium on Low-Power Electronics and Design,
Monterey, CA, IEEE/ACM, August 2004, pp. 32-37
[24] Mobilize, http://www.embeddedstar.com/press/content/2004/7/embedded15474.html
[25] Sequence Design, Inc., "Leakage Power Solutions", NanoCool Low Power Design
Seminar, Tokyo, Nov 2004
[26] Benini L., Siegel P., De Micheli G., "Automatic synthesis of gated clocks for power
reduction in sequential circuits", IEEE Design & Test of Computers magazine, 1994,
11(4), pp. 32-40
[27] Accellera, "Unified Power Format (UPF) Standard 1.0", February 2007.
http://www.accellera.org/apps/group_public/download.php/887/upf.v1.0.pdf
[28] Si2 Low Power Coalition, "Common Power Format (CPF)", March 2007.
http://www.si2.org/?page=766
Bibliography 
• Keating M., Bricaud P., "Reuse Methodology Manual for System-on-a-Chip Designs",
Kluwer Academic Publishers, 1998, ISBN 0-7923-8175-0
• Chandrakasan A., Brodersen R., "Low-Power CMOS Design", IEEE Press, 1998,
ISBN 0-7803-3429-9
• Narendra S., Chandrakasan A., "Leakage in Nanometer CMOS Technologies",
Springer, 2006, ISBN 0-3872-5737-3
• Pedram M., Rabaey J., "Power Aware Design Methodologies", Kluwer Academic
Publishers, Norwell, MA, 2002, ISBN 1-4020-71523
• Al-Hashimi B., "System-on-Chip: Next Generation Electronics", IEE, London, 2006,
ISBN 0-8634-1552-0
• van der Meer P., van Staveren A., van Roermund A. H. M., "Low-Power Deep
Sub-Micron CMOS Logic", Springer, 2004, ISBN 978-1-4020-2848-9
1. Eng-D Context 
Introduction 
The focus of this research and development program is to understand and improve what is
required for effective and successful IP deployment and re-use in system level design. This
requires identifying, documenting, and where possible, solving problems with the IP or working
with Electronic Design Automation, EDA, companies to improve their tools and data
representations. A primary output is that of defining and documenting best practices.
Background 
ARM primarily designs and licences processors for embedding in chip designs, supported with
development tools and bus technology and peripheral IP. Key attributes to the commercial
success involve not only the obvious hardware characteristics such as performance, power and
area but also less tangible criteria such as clean product integration, systems level software
modelling and design partitioning (aspects that all lead to better time-to-royalty).
Commercial Environment 
ARM has grown considerably as an organization, survived a major industry down-turn in 2002
and diversified into new product areas largely due to a number of acquisitions. The author's role
has changed a number of times, in particular leading technical due diligence on potential
acquisitions, only some of which have been successful.
Despite this the author has been able to maintain a role within the Research and Development group,
although with a focus often on shorter-term advanced product development rather than actual
research.
The ARM business model is built on partnership with leading technology companies and, with a
significant portion of licensing business based on synthesizable IP, the need to work closely with
EDA companies has required ARM to build collaborative projects with EDA partners. Synopsys
Inc was chosen as the collaborative partner in the work described in this portfolio report as the
majority of ARM's customers use front-end synthesis tools from this company.
Project Team and Management Responsibilities 
As an "ARM Fellow" in the R&D group the author's responsibilities essentially have been
technical but include line management responsibility for a small group and a mentoring role for a
series of engineers seconded to the research group for periods of 6 months to 2 years. In order to
take on the chip developments that have run over the duration of the Eng-D projects the author
has also had the responsibility to negotiate with EDA and Foundry partners to build collaborative
virtual teams to see through both the implementation and the successful evaluation of the
technology demonstrators that have been demonstrated to customers.
Figure 1.1A - Technology Demonstrator Research Vehicles
[Panels: ARS2 180nm (2001-2), DVS926 130nm (2002-3), ULTRA926 130nm (2003-4), ATLAS926 65nm (2005-6)]
Figure 1.1B - IP Creation/Implementation/Integration model
[Diagram: core-based design with an integration focus. Three columns, Creation (IP Provider),
Implementation (IP Licensee) and Integration (IP End User), each run from requirements and
RTL/HDL through a logical netlist to physical GDSII.]
Low power, battery-power end market focus research collaborations 
The most important market for ARM is in licensing microprocessors and support fabric and
peripherals into the very cost sensitive battery power product area. Energy efficiency is in fact the
primary metric for this market area rather than low power consumption per se, so the application of
techniques to minimize dynamic energy consumption and standby or leakage power consumption
has remained the primary R&D goal and context for the work over the last 5 years. As the
technology has advanced the R&D focus has moved to real-world deployment of voltage scaling
for the 0.18u and 0.13u technology nodes, and then leakage management techniques for 90nm
and 65nm.
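To illustrate why energy rather than power is the relevant metric, the following sketch uses the standard first-order CMOS switching model, P = a.C.V^2.f, which is a textbook approximation rather than a result from this work. All names and numeric values (effective capacitance, cycle count, supply voltages) are illustrative assumptions, and leakage is ignored.

```python
def dynamic_power(c_eff, vdd, freq_hz, activity=1.0):
    """First-order switching power in watts: activity * C * Vdd^2 * f."""
    return activity * c_eff * vdd ** 2 * freq_hz


def task_energy(c_eff, vdd, cycles, activity=1.0):
    """First-order switching energy in joules for a task of `cycles` clock cycles.

    Energy = power * time = (a * C * Vdd^2 * f) * (cycles / f), so f cancels.
    """
    return activity * c_eff * vdd ** 2 * cycles


if __name__ == "__main__":
    C_EFF = 1e-9        # assumed effective switched capacitance (farads)
    CYCLES = 1_000_000  # assumed task length in clock cycles

    # Halving the clock frequency at a fixed supply halves the power...
    print(dynamic_power(C_EFF, 1.2, 100e6) / dynamic_power(C_EFF, 1.2, 200e6))
    # ...but the energy for the task is unchanged, since it takes twice as long.
    # Lowering the supply voltage is what reduces the energy per task.
    print(task_energy(C_EFF, 0.9, CYCLES) / task_energy(C_EFF, 1.2, CYCLES))
```

In this simplified model, slowing the clock alone saves no battery energy for a fixed workload, whereas scaling the supply voltage reduces the energy quadratically; this is the motivation for pairing frequency scaling with voltage scaling.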
The basic canonical design taken as the starting point for the Eng-D in 2001 has proved a
valuable starting point. Every year a low-power technology demonstrator chip has been designed,
developed and manufactured that has been used to build credibility with end customers.
Representative low-power silicon developed and evaluated (Fig 1.1A): 
• 2001-2: TSMC 180nm 1.8V ARS2 (ARM946) embedded SOC design
• 2002-3: TSMC 130nm-G 1.2V DVS926 Dynamic Voltage Scaling
• 2003-4: UMC 130nm-HS/LL 1.2V ULTRA926 optimized Dynamic Voltage Scaling
• 2005-6: TSMC 65nm-LP 1.2V Fine-Grain Leakage + Dynamic Voltage Scaling
• 2005-6: TSMC 90nm-G 1.0V Coarse-Grain Power Gating Leakage Control
Widely (re)licensable IP and partnership model 
There are a number of licensing models that are supported on different commercial terms and
conditions that affect visibility of certain design data when it comes to integrating a core in a
system. In simple terms the key license agreements are:
• Architectural License - rights to design and implement new micro-architectures subject to
compliance with Architectural Validation Suites provided by ARM.
• Technology License - the rights to design and manufacture a specified micro-architectural
implementation typically of a synthesizable core (or a hard macro).
• Single-Use/Multi-Use Design License (SUDL/MUDL) - the rights to design with a
pre-qualified core with a licensed semiconductor Foundry partner using only abstracted model
views, where the physical layout of the core is merged before fabrication.
Design for re-use techniques and the low-power methodologies developed as part of the Eng-D 
portfolio have had to address the diversity of partners and design flows in order to ensure that 
customers both of synthesizable and pre-hardened (and pre-verified) IP components have all the 
design views required to integrate in a timely and risk-managed way. 
One of the technology demonstrators used a foundry core (UMC 130nm ARM926EJ) while the
rest have all focussed on synthesizable RTL IP (again ARM926EJ). Synthesizable processor
development is now the primary focus for ARM and the deployment of soft cores is illustrated in
Fig 1.1B.
Fig 1.2A - 2002Q3 Announcement
[Press release image: "ARM Holdings plc Q3 Trading Update", 02 October 2002, reporting a further
deterioration in semiconductor market conditions, deferred investment decisions by partners and a
slowdown in licensing activity, with revenues for the three months to 30 September 2002 expected
to be approximately £33 million.]
Fig 1.2B - 2002Q3 Announcement and Share-price impact
[Share-price chart]
Satisfying design integration views for the single-/multi-use design environment is the most
challenging and is of immediate use to technology licensees. In figure 1.1B the implementation
flow for ARM (or architectural licensees) is shown in the left column, that of the semiconductor
licensee in the middle and the system integrator on the right. 'Brick walls' are effectively enforced
at the boundaries to ensure that only pre-verified IP views are provided to the next column
together with sufficient support to allow the next stage implementation to be verified successfully.
IP security requirements and the need for abstract models 
The ARM business model as depicted in Fig 1.1B requires customers who are only licensed to
integrate ARM CPU cores to work with abstracted model views in order to prevent the
synthesizable technology-independent CPU designs from being inadvertently or maliciously
copied or distributed. Implementation licensees take on the legal responsibility for protecting
synthesizable deliverables under the terms of their license; integrators are often much smaller
companies or sub-contractors who do not necessarily have the financial resources or established
relationship to take on the "crown-jewels" RTL and validation suites.
Therefore the need to ensure that abstract models hide the detailed IP design and 
implementation is a requirement of the methodologies and best-practice developed under this 
research programme. In particular the models need to include: 
• Functional model(s) - for simulation. These range from high-level behavioural models 
that simply generate realistic bus traffic through to implementation-specific models that 
are cycle accurate and contain both functional and, for example, scan-test accurate 
models to allow detailed macro-cell integration in a larger SOC device. 
• Timing models - for synthesis constraints, static timing analysis (STA) and timing
accurate simulation. Abstracted models need to hide the internal paths but provide
accurate context-dependent timing such that input ramp times and load-dependent output
transitions are handled accurately.
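One common way for an abstracted timing model to capture this context dependence is a two-dimensional lookup table indexed by input transition time and output load, interpolated between characterized points. The sketch below assumes that style of model; the table values, axis units and function names are hypothetical and do not describe the actual model deliverables.

```python
from bisect import bisect_right


def _bracket(axis, x):
    """Return index i of the table segment [axis[i], axis[i + 1]] used for x.

    The index is clamped to the table, so points outside the characterized
    range are linearly extrapolated from the nearest segment.
    """
    i = bisect_right(axis, x) - 1
    return max(0, min(i, len(axis) - 2))


def lookup_delay(slews, loads, table, slew, load):
    """Bilinearly interpolate table[i][j], indexed by input slew i and output load j."""
    i = _bracket(slews, slew)
    j = _bracket(loads, load)
    ts = (slew - slews[i]) / (slews[i + 1] - slews[i])
    tl = (load - loads[j]) / (loads[j + 1] - loads[j])
    d00, d01 = table[i][j], table[i][j + 1]
    d10, d11 = table[i + 1][j], table[i + 1][j + 1]
    return (d00 * (1 - ts) * (1 - tl) + d01 * (1 - ts) * tl
            + d10 * ts * (1 - tl) + d11 * ts * tl)


if __name__ == "__main__":
    # Hypothetical characterization data: slew in ns, load in pF, delay in ns.
    slews = [0.1, 0.5]
    loads = [0.01, 0.05]
    delays = [[1.0, 2.0],
              [3.0, 4.0]]
    print(lookup_delay(slews, loads, delays, slew=0.3, load=0.03))
```

A table of this kind exposes only pin-to-pin behaviour as a function of the integration context, so the internal paths of the core remain hidden from the integrator.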
Commercial and business cycles (2002 downturn, subsequent growth) 
When embarking on the Eng-D programme in 2001 the basic research and development direction
had been agreed and established between ARM and Loughborough. Commercial realities have
meant that the author has been called upon to help support other parts of the organisation and
work flexibly with both the product and business development teams. The research agenda has in
fact remained largely unchanged from the company perspective and the research projects have
moved into mainstream engineering and products over the last three years.
The industry downturn - ARM 2002Q3 (see Fig 1.2A, 1.2B) resulted in some lay-offs and serious
cost-cutting and re-focussing on core business strengths. From a research programme
perspective this drove me deeper into collaborations with EDA and foundry licensees and
partners to maximise the benefits for ARM while minimizing the financial burden. As mask and
fabrication costs rise relentlessly it has proved invaluable to have forged relationships with other
companies to share the costs and derive mutual value.
Fig 1.3A - 2004 Announcement to acquire Artisan Components from 2005
[Press release image: "ARM And Artisan Combine To Deliver System-On-Chip IP Solutions",
23 August 2004, announcing a definitive agreement under which ARM will acquire Artisan
Components, Inc.]
Fig 1.3B - 2003 PowerWise DVS collaboration with National Semiconductor Inc
[Press release image: "National Semiconductor and ARM Release PowerWise Interface Open-Standard
Specification", October 6, 2003, releasing PowerWise Interface (PWI) technology as an open
standard for the interconnect between digital processors and power management integrated
circuits.]
The author has been asked to take on extra responsibilities over the last five years which have
certainly broadened commercial and technical experience:
• technical due diligence for several potential acquisitions in the USA that ARM considered
but chose not to acquire (in the case of a DSP company) or was out-bid by a competitor
(in the case of the configurable logic company, Triscend¹, where a number of months
were spent working in California in late 2003 before the company was acquired by Xilinx
in March 2004²)
• integration and building communication channels in a subsequent acquisition (see next
sub-section) where the author was seconded to California for 6 months and now
continues to support remotely from Cambridge.
These have all had a significant effect on the timescales to complete the work toward the Eng-D
portfolio, but thankfully these have all been technically well aligned to the underlying low power
and IP deployment brief for my research.
2005 acquisition of Artisan Components and new Physical IP business 
The decision to acquire Artisan Components Inc³ (see Fig 1.3A), headquartered in Sunnyvale,
California, was a major one and the author became involved in helping understand the technology
and low power specific product portfolio. The scrutiny of investors and analysts was intense and
resulted in a six month secondment to the "ARM Physical IP Division", as Artisan had become,
from July 2005.
This was an intense learning experience, but being located in the midst of a number of major ARM
customers and close to the headquarters of both Synopsys Inc and Cadence Inc, the major EDA
partners used by the majority of ARM customers, did have significant benefits.
The opportunity to live in a different country has been an enriching experience but not an easy
one from the point of view of personal commitments. Although this delayed the Eng-D completion
further it provided a whole new layer of data and resources once the Standard Cell and Memory
Physical IP was all part of the broader ARM portfolio - and now are a part of ARM's problems
that could not be blamed on a third party!
Collaborative Partnership relationships built up 
ARM has given the author the freedom and responsibility to build up the external relationships to
support the multi-way R&D projects. The key people and organizations worked with include:
Synopsys Inc - Michael Keating, VP Engineering and now Synopsys Fellow in Advanced 
Technology Group, Rich Goldman VP Strategic Alliances, Alan Gibbons Principal Engineer. The 
partnership has helped sponsor design tools and the Synopsys-hosted secure "DesignSphere 
Access" web-based design cluster that has been used to great benefit for the Synopsys projects 
that underpin this research. 
1 http://www.arm.com/news/4741.html
2 http://www.xilinx.com/prsrls/xilcorp/0435triscendacquisition.htm
3 http://www.arm.com/news/6015.html
Fig1.4A - 2006 Collaboration success announcement with TSMC 
IO"'~"V ••• ,,", ...... _. 
P .... _. 
AI_.IOOt ...... 
.. ;:"j:" ' j;C:.T jLT'" ............. >_mn", 
'1.)"",100& 
T.IWM> &.ImM(lnolItl'" .... ""'.t1IIrl"1I COmo_(J"8MC) ~CJ ARM ,. ..... rCin'Um,o In,,1 <OU'1)o, .bQ.n on .. CI~ 
fWlAOMe~ r (rIrtI) I_~r .... 1 UWI " ... "''''''(1 11'1 ClrHr\alloC 'Hl,lt'IIOf\' fro UOIl'oo,n.mlc ~a '1II1(.II04~ ' 
Th .......... ",0 CO"'bOf.oon "'"",1140 If' .. "'non " " C/'Ilp (I MO 0t'I thotI AAMOlO P"OC .... or o..-n",,,,,,uno 
.(tWInc-o _ manatl,menlMC:n.noIOO", e,. 'lICIf'!WIO.".,..rnw: ¥OIUo-- eM treou,M"I' .c: ..... ll lunnlqU ••• 
u .. 1.IICI'lIPllt .... O •• 1114' el:lllll1lO ope •• ", .-I"', 1_ .. 1 PO'~I' ~ I_I for .Ie" mocu 0'I0CItJlt.1I0n 11"1 
., ... (e •• , _ AA" If.' , ... t~CI .. ClJ'flaII'Ik ~ ..... Cllon 0' _ SO p.",ent 8Ionltl,.,.".....- _ "'t. 
TSMC t!;l.P tOW"''--O. ClfOe •••• ao..ncea ~·g'b~ .. c:hnologyklrtM. ""lIIIeo .laMby' , ...... ". try .. 
111(110' ot t 1:1"" ••• .,. c. ....... ""'. " '0 
.......... . , IIIC"n.tv 'e ..... ."o.I I ....... O".n\ c:t .. ""'''' "'e'nll " ... mkO"'~* Infll*1ri " mOl:*t O~e" .ptO" 
._IK.OIWOC ••••• to oa'_' 0'''''' Llnelio"'" _Cl " "onnanu'" .akI O_a r".,,,,, ARIot r..."., -ARM'I'M! 
,....C .r. pa""""D Of'! &StH" anod . &rtm "ClWIOlOOV CI~mot"" '"0 IN, PfGtKI CI"'"orKII ........ 
• 1O~nl I.a~ and ~."* P"""" f.Clutlien' ..... _ Ur'I tIC"I ......... OVO ... UOM tltl'WllC_1 C-"DOrMliOtl 
,,.0 wnIJMmel"l1lliOn etA.lI~ rune"n., ,UI,.,... • 
"On. Of TSMC', '-Y ~f_"b_,." _ ,,, ... tlMC. 0fI 1>1'"",,"0 CH.If ,.M< .. aNI mos. 01 _ p~.1n 
,.0,,- .. ,... DfI"","o"'m 10 the CI"lp" (ommunltt, -..ala 1IIC1 w .... ,,1'ftI01 Cll,"IO' eto.tAo" •• ~ 
PtoCt\M:t '"110( .... 0. TSMC 'OUt coU.bO,abOr'I ~ARIoII Mmcln ... _. ~nO 001lbt "".1 ._c." PrG(" •• ' 
lltC"nootogIIe., ~_" ....... Inn~ o.""n ' II'«!nIQil,l", '''11 PrO< • • .., .. ou.Otfor.JI .... c.n 1<'11 ..... 
ClIIUMt .no 'IO~'" CIOWt<f ._r.o.,......,..n I, .b.aIVl"""'1MoI1O COMPJofll •• on lMo lSOCtwl06001C-tII .. _O _00_ " 
-rn. "'I cIIIIP IMOfPOf""'!oO¥o'o ~, m.mewymKfo ........ I ............ ,_ntlOfl nlpolrop., a"'llI.OiIllon (.n. In 
a .. ",-srv, WMCft I1 ~'K"""'O 10' M\,QIin" vonllo9" 
.-. 
Fig1 .4B - 2006 Collaboration announcement with ArchPro Design Automation 
ARM AND ARCHPRO VERIFY 65-NANOMETER MULTI-VOLTAGE SOC BEFORE SILICON
[dateline not legible] - July 24, 2006 - ARM [(LSE:ARM); (Nasdaq:ARMHY)] and ArchPro today announced success in a joint R&D project to validate
advanced multi-voltage power management design techniques. Employing ARM® Intelligent
Energy Manager (IEM) technology and validated with ArchPro's Multi-Voltage Simulator
Tool (MVSIM) resulted in an EDA tool for pre-tapeout verification of power-managed designs.
The project produced working 65-nanometer (nm) silicon for a complex reference system-on-chip
(SoC) design jointly verified by the two companies. The SoC used an ARM processor with
IEM technology demonstrating a number of active and sleep modes, to vary power and
performance.
"MVSIM has proven its ability to verify sophisticated voltage schemes prior to tapeout
National Semiconductors Inc - PowerWise⁴ DVFS announced October 2003 (see Fig 1.3B) as
part of the consortium required to build credibility for the ARM Intelligent Energy Manager (IEM)⁵
hardware and software products that grew out of the R&D group work from 2001. Gordon
Mortensen, the Engineering Manager, and Ravi Ambatipudi, the Product Manager, have been the
primary points of contact.
United Microelectronics Corp, Taiwan - collaboration relationship built with Dr Patrick Lin, Chief
SOC Architect⁷, and his team at UMC, resulting in the ULTRA926⁸ silicon announced in November
2004 (Dr Dar-Sun Tsien and Ming Hsu in particular).
Taiwan Semiconductor Manufacturing Corporation - collaboration relationship built with Dr Cliff
Hou⁹ and the Design Services team at TSMC on the ATLAS¹⁰ project announced in June 2006
(see Fig 1.4A) (Dr LC Lou, Ken Wang and Helen Chang in particular).
ArchPro Design Automation¹¹, an EDA start-up based in California and India - a relationship which resulted
in collaborative work on the 65nm Multi-voltage (ATLAS) project verification¹² announced in July
2006 (see Fig 1.4B). The author was invited to join the Technical Advisory Board at ArchPro in
August 2004, with permission granted by ARM. The primary relationship has been built up with the
founder and CTO, Srikanth Jadcherla, from the inception of the company.
4 http://www.arm.com/news/3800.html
5 http://www.arm.com/products/esd/iemhome.html
6 http://www.ieee-socc.org/SOC2003/Program/Saturdav/saturdav.html
7 http://www.umc.com/English/news/20040115.asp
8 http://www.arm.com/igonline/news/partnernews/7051.html
9 http://www.bjic.org.cn/01zttg/kaivuan/00318a.htm
10 http://www.arm.com/igonline/news/ARMnews/14019.html
11 http://archpro-da.com/companv.htm
12 http://www.arm.com/news/14061.html
2. Canonical SOC Design - 5 generations 
As part of the R&D work on advanced core deployment flows, an initial test chip has been
developed as a baseline 'canonical' design. This test case integrates a hardened CPU and
Tightly-Coupled Memories (TCMs), with the associated high-speed CPU clock domain, together with
representative system blocks (system bus clock domain) and basic peripherals.
Requirements 
ARM has traditionally built test chips in order to verify IP in silicon. The focus is on a vehicle to run
validation suites for a specific core processor, implementing test co-processors and 'trick-boxes'
in hardware together with a minimal memory controller interface for code and data.
The requirements specification drawn up for the first SOC, "Reference System Design #1" (RSD1),
is driven by methodology and best-practice system reuse criteria rather than by an actual product
design, but the underlying design has been chosen to be representative of real-world problems and
complexities:
• Real-time system design requiring careful hardware and software partitioning. The project is
primarily hardware focussed to start with, but with the ability to run audio synthesis or
decompression algorithms.
• Multiple clock domains, including semi-synchronous CPU and bus clocks, and asynchronous
subsystems. Typically ARM processor test chips have focussed on the CPU clock domain, but
the design challenges for system integrators are those of integrating, verifying and testing
designs on a derived system bus clock.
• Hardened IP deployment - ensure integration at 'black-box' level only, not using internal views.
This "core-based" design approach matches well that of licensees that work with Foundry
pre-qualified IP, and many partner companies that license RTL but have specific internal groups
that harden particular cache configuration cores for wider deployment by product groups.
• Rapid-prototyping environment supporting early functional testing and software development in
FPGA. The aim here is to ensure that the design practices target both FPGA and SOC tools.
• Design taken to layout and timing extraction - with a potential for a test shuttle fabrication run
where sponsored, or "virtual tape-out" to ensure the back-end flow and parasitic extraction are
fully understood - rather than relying on optimistic layout estimates.
• Derive best-practice guidelines for test integration for the hardened IP within the system.
• Sufficiently small die area and pad count to target a single "shuttle" die size and low-cost plastic
BGA packaging (a 408-pin BGA developed for this program was used for the initial four technology
demonstrators, and a pre-existing 388-pin BGA for the fifth in the series).
Figure 2.1A - 180nm sAR52 ARM946 Reference Design 
Figure 2.1B - 180nm sAR52 ARM946 Reference Design Memory Map
(simple, efficient decode)
0xC0000000 - 0xFFFFFFFF  ROM
0x80000000 - 0xBFFFFFFF  IO   IO space, privileged access [AHB and APB]
0x40000000 - 0x7FFFFFFF  EXP  Expansion space, ABORT if unused
0x00000000 - 0x3FFFFFFF  RAM  Off-chip SSRAM, on-chip TCRAM
(DF EngD ARR01 Figure 4)
SOC#1 - the sARS2 "Audio Reference Design" 
The first project was a single-voltage "standard" low power design. To minimize board-level
hardware complexity, a software-controlled audio playback system was chosen as a reasonably
demanding real-time application target (Fig 2.1A):
• ARM946 CPU with dual 8Kbyte Instruction and Data caches
o 200MHz CPU clock target
• AMBA AHB on-chip interconnect clocked at 100MHz
o 32-bit SDRAM interface off chip for 32Mbyte+ bulk memory
o 16-bit Flash EPROM interface for external ROM boot
o 32-bit Synchronous SRAM fast memory interface
o "Software-DMA" interface to audio subsystem
• AMBA APB on-chip peripheral bus clocked at 50MHz
o Interrupt controller and System counter timers
o Programmable UART for terminal interfacing
o GPIO to support LEDs/Switches and other interfaces
• Dual PLL design
o Dual crystal clock provision
o 400MHz System PLL for CPU and bus IP with external divider control to control
CPU in 12MHz steps from 84MHz to 264MHz
o 384MHz fixed rate PLL for reference clock dividers to Audio (48MHz), Timers
(1MHz) and UART (3.84MHz)
• JTAG serial debug agent connection
o Support external debugger connection and memory-mapped diagnostics
o Support on-board flash programming
• Audio subsystem
o Stereo subsystem to merge up to 8 sound streams in hardware with hardware
stereo image positioning
o Support for 16-bit PCM or 8-bit log format audio data
o Over-sampling digital anti-alias filtering
o Digital pulse-density modulation drive of standard output pads
o 2-pads per channel support for simple R-C off-chip filtering
The design was kept small enough to target a single Xilinx XCV2000E device to support
functional verification, albeit at 1/10th of the target CPU performance, but enough to write basic
waveform generation algorithms in C, using the SSRAM interface rather than the full
SDRAM/Flash memory controllers.
A simple memory map was used as the starting point (Fig 2.1B).
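The "simple efficient decode" of the memory map can be illustrated in software. The following is a minimal sketch, assuming the four regions of Fig 2.1B are each 1 Gbyte and are selected by the top two address bits; the function name `decode_region` and the region labels are hypothetical and are not taken from the RTL.

```python
# Hypothetical model of the top-level address decode in Fig 2.1B.
# Assumption: four equal 1-Gbyte regions selected by address bits [31:30].
REGIONS = ("RAM", "EXP", "IO", "ROM")  # index 0..3 = bits [31:30]


def decode_region(address: int) -> str:
    """Return the region name for a 32-bit bus address."""
    if not 0 <= address <= 0xFFFFFFFF:
        raise ValueError("address must fit in 32 bits")
    return REGIONS[address >> 30]


if __name__ == "__main__":
    for addr in (0x00000000, 0x40000000, 0x80000000, 0xC0000000):
        print(f"0x{addr:08X} -> {decode_region(addr)}")
```

Decoding on two bits only is what keeps the decoder small; any finer partitioning inside a region would be handled by the block that owns it.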
The project was used to develop best-practice Test/Reset/Clock domain controllers, to investigate
multiple-domain clock gating in the EDA implementation flows, and to get the basic FPGA and
SOC design flow and verification infrastructure in place.
Figure 2.2A - 130nm DVS926 Reference Design
[Block diagram: TC-RAM voltage domain (2 x 16Kbyte tightly coupled RAM, with power-down); dynamic voltage scaling cached CPU (with power-down); always-powered SOC domain containing the AMBA AHB/APB subsystem, Dynamic Performance Controller and Dynamic Voltage Controller]
Figure 2.2B - 130nm DVS926 Reference Design
[Block diagram: VDDRAM domain; VDDCPU domain with DVFS ARM926EJ (16K instruction cache + 16K data cache); dynamic clock generator and PLL; prototype DVFS control; memory controllers with SRAM/Flash/CompactFlash x2 and SDRAM interfaces; AHB/APB bridge; sound subsystem]
SOC#2 - the DVS926 Dynamic Voltage Scaling Demonstrator 
The research group had completed a software-based dynamic voltage-and-frequency control
demonstration system and the canonical design was enhanced to become a vehicle for evaluating 
both the design and technology implications for dynamic energy management using DVFS. 
The target technology for the project was TSMC 130nm "G" process (1.2V nominal). The
"Intelligent Energy Manager (IEM)" platform needed a Linux operating system environment for real-
world benchmarking so the CPU was upgraded to one with MMU and caches large enough to run 
operating systems such as Linux efficiently: 
• ARM926EJ CPU with dual 16Kbyte Instruction and Data caches
o 240MHz CPU clock target
• AMBA AHB on-chip interconnect clocked at 120 (and 60) MHz
o 32-bit SDRAM interface off chip for 32Mbyte+ bulk memory
o 16-bit Flash EPROM interface + expansion banks
o Audio subsystem reused but with hardware DMA
• AMBA APB on-chip peripheral bus clocked at 60MHz
o Interrupt controller extended to match Linux environment
o 1MHz System counter timers and 1kHz Real-Time-Clock
o Dual UARTs to support both console and diagnostic channel
o 48 GPIO to support wide expansion interfacing
• Dual PLL design (re-using the Reference Clock generator from sARS2):
o 480MHz System PLL for CPU and bus IP with external divider control to control
CPU in 12MHz steps from 160 to 300MHz
• JTAG serial debug agent connection (with rigorous validation)
• Dynamic Voltage Scaling CPU support
o Analogue level shifters on system bus interface
o Digital clock generation scheme to dynamically modulate CPU clock and manage
DVFS phase alignment
o Four performance levels: 240/180/120/60MHz (worst case)
o Transparent latch based retiming to guarantee bus hold times
• Hardware API support for abstracting system-specific interfaces:
o Fractional dynamic performance setting interface
o Fractional accumulator dynamic performance counters
• Prototype National Semiconductors "Adaptive Voltage Scaling" control
The memory map was enhanced to be closely compatible with the Intel StrongARM-1100 ASIC in
order to simplify the Linux operating system porting to this platform.
The project was used to specify EDA tools for "standard-cell" dynamic voltage scaling 
implementation and static timing analysis and verification requirements, to understand clock 
latency variation with DVS, and to prototype the basis of the ARM "Intelligent Energy Controller 
(IEC)" product development. 
Figure 2.3A - 130nm ULTRA926 Reference Design
[Block diagram: dynamic voltage RAM domain with state retention (TCMs, VDDRAM); dynamic voltage CPU domain with power-down (ARM926EJ with cache RAMs, VDDCPU); level shifters and clamps on the domain boundaries; always-powered SOC domain (VDDSOC) with AMBA AHB/APB subsystem, Dynamic Performance Controller, Dynamic Performance Monitor, Adaptive Power Controller and clock/reset control]
Figure 2.3B - 130nm ULTRA926 SOC Design
[Block diagram: VDDRAM domain; VDDCPU domain with DVFS ARM926EJ (16K instruction cache + 16K data cache); Intelligent Energy Manager DVS/clock control; SRAM/CompactFlash x2 and SDRAM interfaces; AHB/APB bridge; IEC; real-time clock]
SOC#3 - the ULTRA926 DVFS reference system design
UMC offered silicon to ARM based on their 130nm "Fusion" process that supports mixing of dual
threshold "Low Leakage (LL)" and "High Speed (HS)" transistors. ARM had already qualified a
"Foundry" ARM926EJ core with dual 16K caches on the HS process (pre-hardened and optimized
for 288MHz worst case).
The opportunity was taken to approach this as a true reference design, with considerable work
going into transistor level simulation to understand the limits of voltage scaling and the predicted clock
and I/O latency across the voltage boundaries.
The author was responsible for assembling a five-way consortium to implement the design: 
• UMC financed the project and fabricated/packaged the ULTRA926 silicon
• ARM enhanced the reference canonical design, characterized the CPU for voltage scaling,
optimized the clock generators to provide finer performance control, and had a
production-quality IEC controller design ready to integrate
• Artisan Components provided the low power libraries and TCM memories, and designed
the level shifters and isolation cells to ARM's requirements, plus new low-power PLLs and
standard 3.3V IO cells
• National Semiconductors contributed production quality "PowerWise" Adaptive Voltage
Scaling IP and serial power control interface
• Synopsys Professional Services took on the SOC implementation, timing sign-off and chip
finishing (with ARM support)
The design was kept software-compatible with the DVS926 design, apart from enhancements to
the IEC API and the dynamic voltage scaling. The only noteworthy differences in the hardware
design were:
• ARM926EJ CPU optimized to 288MHz worst case (1.08V)
• More optimal dynamic CPU performance scaling frequencies:
o 100% (288MHz worst case)
o 83% (240MHz)
o 67% (192MHz)
o 50% (144MHz)
• Dual PLL system clock generator (576MHz + 480MHz)
o Configurable PLL dividers support up to 360MHz on the CPU
• Register-based retiming interface (latches removed)
o Dynamically advances reduced speed CPU clock with respect to AHB
o Improved timing analysis and sign-off
o New level shifters with integrated isolation clamp functionality
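The relationship between the listed frequencies and the quoted percentages can be checked with a few lines of code. This is only an illustrative sketch: the frequency values come from the list above, while `select_level` is a hypothetical helper showing how a fractional performance request might be mapped to the lowest adequate level; it is not the IEC hardware API.

```python
# Performance levels listed for the ULTRA926 CPU (MHz, worst case).
LEVELS_MHZ = (144, 192, 240, 288)
F_MAX_MHZ = 288


def percent_of_max(freq_mhz: float) -> int:
    """Express a frequency as a rounded percentage of the maximum level."""
    return round(100 * freq_mhz / F_MAX_MHZ)


def select_level(requested_fraction: float) -> int:
    """Hypothetical policy: lowest level that still meets the request."""
    if not 0.0 < requested_fraction <= 1.0:
        raise ValueError("requested_fraction must be in (0, 1]")
    target = requested_fraction * F_MAX_MHZ
    for freq in LEVELS_MHZ:  # ascending order
        if freq >= target:
            return freq
    return F_MAX_MHZ


if __name__ == "__main__":
    print([percent_of_max(f) for f in LEVELS_MHZ])
    print(select_level(0.6))
```

Rounding a request up to the next available level is one conservative choice, since it never delivers less performance than was asked for.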
The project was used to improve both the voltage scaling control and the design methodology
considerably compared to the original DVS926 development, which required many iterations of
layout to close timing. The project also helped finalize the specification of the first DVS-enabled
ARM CPU product - the ARM1176EJ-S with IEM support.
Figure 2.4A - 65nm ATLAS926 Reference Design 
[Block diagram: cache RAM voltage domain (with power-down, VDDRAM); dynamic voltage scaling CPU domain (ARM926EJ, with power-down, VDDCPU) with level shifters, clamps and isolation/retention/power-gating control; level-2 RAMs (save/restore scratchpad); always-powered SOC domain with AMBA AHB/APB subsystem, dynamic sleep control API, Intelligent Leakage control State Machine (ILSM), Dynamic Performance Monitor, system clock generator and 1.2GHz PLL]
Figure 2.4B - 65nm ATLAS926 SOC Design
[Block diagram: VDDRAM domain (16K instruction + 16K data cache); VDDCPU domain (MV ARM926EJ with retention/power-gating); always-powered SOC domain (VDDSOC) with bus re-timing interface, dynamic clock generator and PLL, Intelligent Leakage control State Machine, IEC, real-time clock, interrupt controller, timers x2, DMA, 64K RAM (scratchpad/IEM state save), GPIO x48, UART_0, UART_1, SSI, static memory controller (SRAM, Flash, CompactFlash x2 interface), dynamic memory controller (SDRAM interface), and 2-port and 3-port ICMs; pads on VDDPADS]
SOC#4 - the ATLAS926-65LP Leakage & DVFS demonstrator 
TSMC offered ARM R&D early technology access to their 65nm "LP (Low Power)" process. This is
in fact a process that has greatly reduced leakage compared to the 65nm Generic technology, but
the different gate-oxide material used requires a 1.2V nominal power supply. The low leakage
characteristics are great for standby battery life, but the dynamic power is higher due to the 1.2V
supply rail compared to the 1.0V "G" process (nominally 1.44 x the dynamic power), and the transistors
are inherently slower so cannot hit the same peak performance as 65G. Although 65LP is a low
leakage process to start with, the project was approached as a test case for comparing leakage
management techniques. The DVFS architecture from the ULTRA926 design was also extended to
support voltage scaling of the CPU standard cell logic with a level-shifted interface to all the cache
RAMs; the High-Vt RAM cells were predicted to have no safe voltage scaling headroom below the
10% tolerance on the 1.2V supply. The chip was designed to allow limited experimental voltage
scaling of the RAM supply to confirm this in silicon.
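The "1.44 x" figure and the lower tolerance limit follow directly from the supply voltages quoted above. A short worked check, assuming dynamic power scales with the square of the supply voltage at fixed frequency and switched capacitance (the function name is illustrative only):

```python
def dynamic_power_ratio(v_new: float, v_ref: float) -> float:
    """Dynamic power ratio at equal frequency and capacitance (P ~ V^2)."""
    return (v_new / v_ref) ** 2


# 65LP at 1.2V nominal versus the 1.0V "G" process.
ratio_lp_vs_g = dynamic_power_ratio(1.2, 1.0)

# Lower edge of the 10% supply tolerance on the 1.2V rail.
v_min_tolerance = 1.2 * (1 - 0.10)

if __name__ == "__main__":
    print(f"dynamic power ratio: {ratio_lp_vs_g:.2f}")
    print(f"minimum rail within tolerance: {v_min_tolerance:.2f} V")
```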
The leakage techniques evaluated in the project included: 
• Multiple Vt cell library implementation 
• On-chip power gating (MTCMOS switched cells) 
• On-chip state retention registers (with automatic save and restore) 
• Off-chip power gating (with scan-based IP state save and restore to memory) 
The "Intelligent Energy Controller" hardware API was enhanced to provide four levels of leakage 
management to support transparent operation through the software Wait-For-Interrupt mechanism.
The author was responsible for RTL design and verification of the design and hand-off with 
synthesis constraints to an implementation team in Taiwan that also provided an early "Fine-Grain" 
MT-CMOS power-gated library for the joint project. 
The design was kept software-compatible with the ULTRA926 design, apart from further
enhancements to the IEC API and re-designing the dynamic voltage scaling frequency synthesis
around a new PLL:
• ARM926EJ CPU with voltage scaled standard-cell region
o A complex task handling level shifters on time critical cache RAM interfaces
• More optimal dynamic CPU performance scaling frequencies:
o 100% (240MHz worst case)
o 80% (192MHz), 60% (144MHz), 40% (96MHz), 20% (48MHz) steps
• Single high-speed PLL system clock generator (1GHz)
o 20-slot shift register architecture to allow 65LP implementation
o Synchronous pre-compensated CPU-to-AHB re-timing
The project proved an excellent opportunity to understand 65LP technology and to understand and
compare the relative real-time and leakage power cost functions for the basic approaches to
leakage mitigation. Subsequently this chip became an important technology demonstrator for the
Design Automation Conference in 2006.
Figure 2.5A - 90nm SALT926 Reference Design
[Block diagram: cache RAM voltage domain (with power-down); dynamic threshold scaling CPU domain (ARM926EJ, with power-down) with level shifters, clamps and isolation/retention/power-gating control; level-2 RAMs (save/restore scratchpad); always-powered SOC domain with AMBA AHB/APB subsystem, dynamic sleep control API, Intelligent Leakage control State Machine (ILSM), Dynamic Performance Monitor, system clock generator and 1.2GHz PLL]
Figure 2.5B - 90nm SALT926 Reference Design
[Block diagram: VDDRAM domain (16K instruction + 16K data cache); VDDCPU domain (MV ARM926EJ with retention/power-gating); SOC domain (VDDSOC) with bus re-timing interface, clock generator and PLL, Intelligent Leakage control State Machine, IEC, real-time clock, interrupt controller, timers x2, DMA, 64K RAM (scratchpad/IEM state save), GPIO x48, UART_0, UART_1, SSI, static and dynamic memory controllers, USB OTG + PHY, and 2-port and 3-port ICMs; pads on VDDPADS]
SOC#5 - the SALT926-09G Leakage and physical IP demonstrator
Many customers require "G (Generic)" technology rather than the "LL/LP" variants in order to
achieve sufficient CPU performance for high-end products. Addressing the significant leakage
power at 90nm and below is of significant interest in order to weigh up the relative merits of active
versus standby power for G versus LP.
ARM had acquired Artisan Components at the end of 2004 and needed to showcase new
low-power cell libraries and leakage management components that were in development for what has
become the "PMK (Power Management Kit)" product add-on to standard cell libraries.
TSMC 90G is an industry standard technology reference point so was chosen for the project. As it is a
1.0V nominal voltage process, and with power-gating responsible for reducing headroom by an
appreciable degree at the Standard Cell devices, all Dynamic Voltage Scaling was removed from
the design and attention focussed on aggressive leakage management to showcase the relative
merits of various leakage mitigation techniques:
• Clock-Gating (restart within clock cycles)
• State-Retention Power-Gating (restart within a microsecond)
o Keep register state powered in high-Vt "balloon", power gate logic
• Scan-based state save to on-chip or external memory (10's of microseconds)
o Some energy save/restore cost, VDDCPU leakage cut to zero
• Software-activated cache clean (millisecond restart)
o Power down cache and logic, energy cost to reload cache
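The trade-off between leakage saving and restart time in the list above can be pictured as choosing the deepest sleep mode whose restart latency fits the time available. The sketch below is purely illustrative: the latency figures are assumed representative values for the orders of magnitude quoted (clock cycles, about a microsecond, tens of microseconds, about a millisecond), and `deepest_mode` is a hypothetical policy function, not the IEC leakage-management API.

```python
from typing import NamedTuple


class SleepMode(NamedTuple):
    name: str
    restart_us: float  # assumed representative restart latency, microseconds


# Ordered from shallowest (least leakage saved) to deepest (most saved).
MODES = (
    SleepMode("clock gating", 0.01),
    SleepMode("state-retention power gating", 1.0),
    SleepMode("scan-based state save", 50.0),
    SleepMode("cache clean and power down", 1000.0),
)


def deepest_mode(max_restart_us: float) -> SleepMode:
    """Pick the deepest mode that can restart within the latency budget."""
    allowed = [m for m in MODES if m.restart_us <= max_restart_us]
    if not allowed:
        raise ValueError("no sleep mode meets the restart budget")
    return allowed[-1]


if __name__ == "__main__":
    for budget in (0.05, 5.0, 200.0, 5000.0):
        print(f"{budget:>7} us -> {deepest_mode(budget).name}")
```

A real policy would also weigh the energy cost of entering and leaving each mode against the expected idle time, which this sketch ignores.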
A number of Physical IP customers have requested worked examples of back-bias and dynamic
threshold scaling. The academic analysis¹ and modelling have always looked attractive, but typically
only IC companies that own wafer fabrication facilities are able to reach production with threshold
scaling. Support was added to the system architecture for implementation on Triple-Well
technology to support:
• Reverse bias - for leakage characterization
o plus experimental reduced performance operating mode named "Long-Life"
• Experimental forward bias operating mode named "Turbo"
Because currently no timing models support both power rail and threshold voltage scaling
concurrently, the CPU is stopped whenever changing threshold voltages.
This project again reused the canonical design and extended the ATLAS design where possible to
maintain software compatibility. Synopsys also wanted to understand and evaluate power gating of
peripheral IP and contributed a USB OTG MAC and PHY to the project, so the Sound/DMA system
was stripped out and replaced in the address map by the USB block. The base-line Linux OS port
should be unaffected, ready to allow for new USB device drivers to be developed and added.
1 Martin, S. M., Flautner, K., Mudge, T., Blaauw, D., "Combined dynamic voltage scaling and
adaptive body biasing for lower power microprocessors under dynamic workloads", IEEE/ACM
International Conference on Computer Aided Design, Nov 2002.
Figure 2.6 Summary by SOC Design Generation
SOC  Name      CPU     Target MHz  Technology  Low Power Management
#1   sARS2     ARM946  200         180nm       Clock Gating
#2   DVS926    ARM926  240         130nm       DVFS+CG
#3   ULTRA926  ARM926  288         130nm       DVFS+CG
#4   ATLAS926  ARM926  250         65nm (LP)   DVFS + Fine-Grain Leakage
#5   SALT926   ARM926  300         90nm        VTCMOS + Coarse-Grain Leakage
This turned out to be a highly complex project to complete and tape out:
• Experimental power gating and retention standard cell library
o Derived from existing library plus extended characterization work
• CPU performance tuned around special 1.2GHz PLL clock synthesis
o 133% (400MHz) "Turbo" forward bias VTCMOS experimental mode
o 100% (300MHz) worst case standard sign-off
o 66% (200MHz) "Long-Life" back-bias VTCMOS experimental mode
o 33% (100MHz) Bus-speed operating mode (suits scan save/restore)
• SOC integration with power-gated USB controller and OTG PHY macro
• Switch to a 388-pin BGA package
o In order to re-use a tester and load board in the ARM Austin design centre
Summary 
Figure 2.6 summarizes in tabular form the five generations of reference system-on-chip design
developed over the research programme lifetime. Although the 90nm design appearing after the
65nm project may appear out of order from a process technology road-map perspective, the
SALT926 chip is the most advanced from an aggressive leakage power management perspective.
The techniques and design methodologies are the primary driver for the order in which the projects
were developed.
3. Low Dynamic Power Design - DVFS 
Technology Dependent Design Constraints
The available voltage scaling range and associated safe operating frequencies are highly
technology dependent, and to support AVS one needs to understand the typical process characteristics
as well as worst case conditions.
All the EDA tools and models expect to calculate delay (and hence operating frequency) by being
given an appropriate set of Process/Voltage/Temperature conditions. For a dynamic performance
scaling system design the requirements start from establishing a range of useful operating
frequencies and then determining how low the voltage can be reduced at these operating points
in order to calculate the predicted power and energy savings.
Dynamic power is linearly proportional to frequency at a given voltage. Reducing frequency
results in timing slack that can be made use of by voltage scaling, which reduces dynamic power
by a square-law function. However, supply voltage cannot be reduced safely below a certain point
where RAMs and registers start to become unreliable.
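The linear and square-law dependencies described above can be made concrete with the usual first-order model of dynamic power, P = C x V^2 x f. The sketch below assumes that model and ignores leakage and short-circuit power; the 0.8 voltage scale used in the example is an arbitrary illustrative value, not a characterized operating point.

```python
def relative_dynamic_power(freq_scale: float, volt_scale: float) -> float:
    """Dynamic power relative to the reference point, assuming P ~ V^2 * f."""
    return freq_scale * volt_scale ** 2


def relative_energy_per_task(volt_scale: float) -> float:
    """Energy for a fixed number of cycles: the frequency term cancels
    because run time grows as 1/f, leaving only the V^2 dependence."""
    return volt_scale ** 2


if __name__ == "__main__":
    # Halving frequency alone halves power but saves no energy per task.
    print(relative_dynamic_power(0.5, 1.0), relative_energy_per_task(1.0))
    # Using the timing slack to lower the voltage saves both.
    print(relative_dynamic_power(0.5, 0.8), relative_energy_per_task(0.8))
```

The second case is the reason frequency scaling is paired with voltage scaling: only the voltage reduction lowers the energy needed to complete a given piece of work.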
This chapter documents the extended range characterization and analysis for the ULTRA926
DVFS project using the "LF027" UMC L130E HS FSG ARM926EJ foundry core with dual 16Kbyte
caches implemented by ARM Ltd and supplied as a pre-verified hard-macro CPU core.
Standard Operating Conditions 
Two port-level extracted timing models are provided with the standard Foundry design kit for the
LF027 core:
'Worst case' - for static timing analysis of input setup times and output settling times (design
critical paths)
• 1.2V - 10% = 1.08V
• 125 degrees C (commercial grade hot)
• slow-slow process corner
'Best case' - for static timing analysis of input and output hold times (to fix any race conditions)
• 1.2V + 10% = 1.32V
• -40 degrees C (commercial grade cold)
• fast-fast process corner
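The two sign-off corners can be written down as data, with the rail voltages derived from the nominal supply and its tolerance. This is a small illustrative sketch; the dictionary layout and key names are assumptions made for the example, not a library or tool format.

```python
NOMINAL_V = 1.2   # nominal supply for the process, volts
TOLERANCE = 0.10  # +/-10% supply tolerance

# Hypothetical representation of the two standard operating conditions.
CORNERS = {
    "worst": {
        "voltage_v": round(NOMINAL_V * (1 - TOLERANCE), 2),
        "temperature_c": 125,
        "process": "slow-slow",
        "used_for": "setup / critical path analysis",
    },
    "best": {
        "voltage_v": round(NOMINAL_V * (1 + TOLERANCE), 2),
        "temperature_c": -40,
        "process": "fast-fast",
        "used_for": "hold / race analysis",
    },
}

if __name__ == "__main__":
    for name, corner in CORNERS.items():
        print(name, corner)
```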
Due to the wide spread in both timing characteristics and buffer tree latencies, the "typical"
process, room temperature and nominal 1.2V operating voltage are useful to understand for
voltage scaling.
Figure 3.1A - Standard Voltage Characterization for clock period and latency
Voltage  Process    Period  Latency(max)  Latency(min)  Latency(spread)  Fmax   Latency/
(V)                 (ns)    (ns)          (ns)          (ns)             (MHz)  Period
1.080    slow/slow  3.472   1.741         1.531         0.210            288.0  0.501
1.200    typ/typ    2.075   1.064         0.939         0.125            481.9  0.513
1.320    fast/fast  1.468   0.733         0.649         0.084            681.2  0.499
Figure 3.1B - Extended Voltage Characterization for clock period and latency
Voltage  Process    Period  Latency(max)  Latency(min)  Latency(spread)  Fmax   Latency/
(V)                 (ns)    (ns)          (ns)          (ns)             (MHz)  Period
0.660    typ/typ    6.17    2.973         2.654         0.319            162.0  0.482
0.730    typ/typ    4.80    2.360         2.100         0.260            208.2  0.491
0.800    typ/typ    3.92    1.962         1.748         0.214            255.1  0.501
0.940    typ/typ    2.89    1.483         1.316         0.167            345.3  0.512
1.200    typ/typ    2.07    1.064         0.939         0.125            481.9  0.513
1.320    typ/typ    1.88    0.960         0.844         0.116            531.3  0.510
Figure 3.1C - Extended Voltage Delay behaviour for typical silicon
[Plot: "UMC HS130 DVS PFC, typical/typical, room temperature"; delay (ns, 0 to 9) against voltage (V, 0 to 1.5); series Period[TT] (1/Fmax) and Latency[TT]; polynomial trend line for the period y = -24.193x^3 + 84.707x^2 - 100.41x + 42.447]
Fig 3.1A tabulates the results of full transistor-level characterization for the standard library
operating conditions (±10% voltage tolerance).
Clock tree latency is the combinational delay of the clock tree from input port to the best/worst of the
final register cell clocks in the layout. The period measurement is determined from the worst case
register-to-register paths in the design.
For voltage scaling systems design the clock tree latency scaling with voltage is a key concern as
this affects the macro-cell interface timing for SOC integration. The spread in clock buffer tree
latency was extracted (determining the earliest and latest flip-flop in the entire core to be clocked)
and reported as the "spread"; the right-most column expresses the maximum latency as a proportion
of the minimum clock period.
In terms of analysis, typical silicon at room temperature and nominal voltage is predicted to
run about 1.7 times as fast as the worst case corner at which the SOC design must be signed off
to guarantee meeting timing.
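The derived columns of Fig 3.1A, and the "about 1.7 times" figure, can be reproduced from the characterized period and maximum latency values. A minimal sketch using the three rows of Fig 3.1A (the helper names are illustrative):

```python
# (period_ns, latency_max_ns) from Fig 3.1A.
FIG_3_1A = {
    "slow/slow @ 1.08V": (3.472, 1.741),
    "typ/typ @ 1.20V": (2.075, 1.064),
    "fast/fast @ 1.32V": (1.468, 0.733),
}


def fmax_mhz(period_ns: float) -> float:
    """Maximum clock frequency in MHz for a minimum period in ns."""
    return 1000.0 / period_ns


def latency_ratio(period_ns: float, latency_ns: float) -> float:
    """Clock tree latency as a proportion of the minimum clock period."""
    return latency_ns / period_ns


# Typical silicon at nominal voltage relative to the worst case sign-off corner.
typical_speedup = (fmax_mhz(FIG_3_1A["typ/typ @ 1.20V"][0])
                   / fmax_mhz(FIG_3_1A["slow/slow @ 1.08V"][0]))

if __name__ == "__main__":
    for corner, (period, latency) in FIG_3_1A.items():
        print(f"{corner}: {fmax_mhz(period):.1f} MHz, "
              f"latency/period = {latency_ratio(period, latency):.3f}")
    print(f"typical vs worst case: {typical_speedup:.2f}x")
```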
Extended Range Operating Conditions - typical silicon
A series of extended characterization points were generated below 1.08V. Two linear steps of
0.94V and 0.80V were used as the starting point and then half steps at 0.73V (predictably a limit for
the technology) and 0.66V (where RAM model reliability is potentially extrapolated beyond its
operational limit).
All the extended range analysis was performed with the Synopsys NanoSim¹ fast HSPICE simulator
on a transistor level model of the cached processor core with extracted parasitic loads.
The clock tree latencies appear to scale with voltage in line with the clock period (staying close to
50% of it). See Fig 3.1B.
Graphical Analysis of typical-corner process 
The first stage of analysis focuses on typical silicon, as this reflects the
behaviour of the majority of the production manufactured silicon. Fig 3.1C illustrates graphically
the timing characteristic for both the worst case register-to-register path time and the underlying
clock buffer tree latency. Both are monotonic and show unexpectedly good tracking (latency
consistently of the order of 48% to 51% of the cycle time).
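The cubic trend line shown on Fig 3.1C can be evaluated against the characterized points as a sanity check. The sketch below uses the trend-line coefficients from the figure and the Fmax values of Fig 3.1B; it only compares the fit with the data inside the characterized range (0.66V to 1.32V), and extrapolating the polynomial outside that range is not supported by anything in the source.

```python
def period_ns_typical(voltage_v: float) -> float:
    """Trend line from Fig 3.1C: clock period (ns) versus supply voltage (V)."""
    v = voltage_v
    return -24.193 * v**3 + 84.707 * v**2 - 100.41 * v + 42.447


# Characterized typical-corner Fmax (MHz) from Fig 3.1B, keyed by voltage (V).
FMAX_TYPICAL_MHZ = {
    0.660: 162.0, 0.730: 208.2, 0.800: 255.1,
    0.940: 345.3, 1.200: 481.9, 1.320: 531.3,
}


def fit_errors_ns() -> dict:
    """Difference between the trend line and the measured period, per point."""
    return {v: period_ns_typical(v) - 1000.0 / f
            for v, f in FMAX_TYPICAL_MHZ.items()}


if __name__ == "__main__":
    for v, err in fit_errors_ns().items():
        print(f"{v:.3f} V: fit {period_ns_typical(v):.3f} ns, error {err:+.3f} ns")
```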
Extended Range Operating Conditions - slow silicon
For design sign-off with a reduced voltage rail the worst case timing analysis also needs to be
performed with a timing model that accurately reflects the delays and latency for slow corner
silicon under worst case conditions.
A series of extended characterization points were generated below 1.08V. Two linear steps of
0.94V and 0.80V were used as the starting point and a further half step at 0.73V was added. See
Fig 3.2A.
1 http://www.synopsys.com/products/mixedsignal/nanosim/nanosim.html
Figure 3.2A - Extended Slow-corner Voltage PFC for clock period and latency
Voltage  Process    Period  Latency(max)  Latency(min)  Latency spread  Fmax    Latency/
(V)                 (ns)    (ns)          (ns)          (ns)            (MHz)   Period
0.730    slow/slow  7.721   3.626         3.237         0.389           129.5   0.470
0.800    slow/slow  6.198   2.963         2.636         0.327           161.3   0.478
0.940    slow/slow  4.442   2.187         1.931         0.256           225.1   0.492
1.080    slow/slow  3.472   1.741         1.531         0.210           288.0   0.501
1.320    slow/slow  2.645   1.333         1.162         0.171           378.1   0.504
[Plot: UMC HS130 DVS PFC, slow/slow, max rated temperature. Delay (ns) against Voltage (V) for
Period[SS] (1/Fmax) and Latency[SS], with a polynomial trend line fitted to Period[SS]:
y = -32.061x^3 + 114.69x^2 - 139.84x + 61.132]
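The derived columns of Figure 3.2A follow directly from the characterized period and latency figures. The short Python sketch below re-derives them as a cross-check; the function name `derive` and the tuple layout are illustrative assumptions, and only the numeric data is taken from the figure.

```python
# Slow-corner characterization points transcribed from Figure 3.2A:
# (voltage in V, clock period in ns, max latency in ns, min latency in ns)
SLOW_CORNER = [
    (0.730, 7.721, 3.626, 3.237),
    (0.800, 6.198, 2.963, 2.636),
    (0.940, 4.442, 2.187, 1.931),
    (1.080, 3.472, 1.741, 1.531),
    (1.320, 2.645, 1.333, 1.162),
]


def derive(period_ns, lat_max_ns, lat_min_ns):
    """Return (Fmax in MHz, latency spread in ns, max latency / period)."""
    fmax_mhz = 1000.0 / period_ns          # 1 / period, converted from ns to MHz
    spread_ns = lat_max_ns - lat_min_ns    # earliest-to-latest flip-flop clock arrival
    ratio = lat_max_ns / period_ns         # latency as a proportion of the cycle
    return fmax_mhz, spread_ns, ratio


for volts, period, lat_max, lat_min in SLOW_CORNER:
    fmax, spread, ratio = derive(period, lat_max, lat_min)
    print(f"{volts:.3f} V  Fmax={fmax:6.1f} MHz  "
          f"spread={spread:.3f} ns  latency/period={ratio:.3f}")
```

The latency/period ratio stays between roughly 0.47 and 0.50 across the whole voltage range, which is the tracking behaviour the later 50% latency estimate relies on.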
Figure 3.2B - Extended Fast-corner Voltage PFC for clock period and latency
Voltage  Process    Period  Latency(max)  Latency(min)  Latency spread  Fmax    Latency/
(V)                 (ns)    (ns)          (ns)          (ns)            (MHz)   Period
-        fast/fast  3.287   1.660         1.485         0.175           304.2   0.505
0.940    fast/fast  2.100   1.079         0.960         0.119           476.2   0.514
1.320    fast/fast  1.468   0.733         0.649         0.084           681.2   0.499
[Plot: UMC HS130 DVS PFC, fast/fast, min rated temperature. Delay (ns) against Voltage (V) for
Period[FF] (1/Fmax) and Latency[FF], with a polynomial trend line fitted to Period[FF]:
y = 6.7614x^2 - 16.944x + 12.053]
Extended Range Operating Conditions - checking fast corner linearity 
The only sign-off library required is the standard 1.32V/fast-fast/-40C model provided in the
standard design kit, but for completeness a couple of low-voltage characterizations were
performed to check that the behaviour was still monotonic and followed the same basic curve as the
slow and typical analysis. See Fig 3.2B.
Caveats 
The low-voltage figures will need to be de-rated further to cope with the effects of degraded
transition times across the buffer tree boundary between the 1.2V SOC domain and the
scaled-voltage CPU domain.
All the analysis of the core is performed on the macro-cell in isolation. Even with careful power ring
design and attention to package and PCB design outside the SOC, the effects of IR voltage drop
must again be factored into appropriate de-rating of the 'ideal' PFC modelling.
In both cases the careful choice of master clock frequency may be used to set the maximum
operating frequency level safely for the less-than-ideal voltage delivered to the DVS subsystem
on-chip.
Performance Scaling Design Requirements
The data from the technology-specific voltage and frequency range analysis described in the 
previous section are used to drive the decisions over which performance points are energy 
efficient. The final frequencies selected need to be carefully chosen for a CPU such as the 
ARM926EJ where the bus interface must be synchronous to the core clock. The bus is typically 
clocked using a divided-down version of the CPU clock but all transfers between the bus and core 
domain occur on the CPU clock edge when a clock qualifier (HCLKEN) is asserted to indicate 
that this CPU clock edge captures input signals and updates output signals. 
NOTE: future IEM-enabled CPU cores will manage asynchronous bus interfaces and allow
arbitrary CPU and interface clock relationships. However a synchronous relationship between bus
and CPU at maximum clock rate is still highly desirable to ensure the maximum communication
bandwidth without the overheads of synchronization protocols that are required for the generic
case of asynchronous clocking.
Frequency Range and 'Granularity' 
The technology demonstrator is required to allow evaluation of both the worst-case guaranteed
frequency (288MHz CPU) and typical silicon, which will operate up to 480+ MHz.
The IEM technology treats the CPU maximum frequency as 100% performance and under
software control may reduce and adjust the required performance level dynamically. Although
simple clock division by 2 and 4 is easy, more complex clock frequency generation is required to
exploit energy-efficient performance points over the voltage scaling range; for this specific
technology, operation below 144MHz looks to be at the edge of safe operating margins.
Figure 3.3A - Synchronous performance ratios for a 12X PLL Clock
Bus = 66MHz
CPU/Bus ratio  Perf     MHz
6              100.0%   400.00
4              66.7%    266.67
3              50.0%    200.00
2              33.3%    133.33
Figure 3.3B - Synchronous performance ratios for a 12X and 10X PLL Clock 
CPU/Bus ratio Perf 
6 100.0% 
5 83.3% 
4 66.7% 
3 50.0% 
2 33.3% 
Figure 3.3C - Voltage Scaling Range - slow silicon
Interpolated [SS] latencies
Period (ns)  Voltage (V)  Latency (ns)  Bus Ratio  Freq(MHz)
3.571        1.050        1.786         6          280.00
4.286        0.950        2.143         5          233.33
5.357        0.850        2.679         4          186.67
7.143        0.750        3.571         3          140.00
10.714       (unsafe?)    (unsafe?)     2          93.33
Figure 3.3D - Energy saving estimates from slow-corner analysis
Estimated [SS] energy calculations
Freq(MHz)  Voltage (V)  V^2    Work Ratio  Duration  Energy Scaling
280.00     1.05         1.10   6.00        1.00      6.62
233.33     0.95         0.90   5.00        1.20      5.42
186.67     0.85         0.72   4.00        1.50      4.34
140.00     0.75         0.56   3.00        2.00      3.38
Figure 3.3E - Energy calculations for standard voltage operation
Estimated [SS] energy calculations
Freq(MHz)  Voltage (V)  V^2    Work Ratio  Duration  Energy Scaling
288.000    1.080        1.166  6.00        0.97      6.80
288.000    1.200        1.440  6.00        0.97      8.40
288.000    1.320        1.742  6.00        0.97      10.16
In an IEM system the decision to run the processor at a lower frequency (say 66%) results in
efficient operation so long as the energy consumed by the task, which runs for 1.5x longer
(100%/66%), is reduced.2 Given a working voltage range in the order of 0.75 to 1.32 volts there is
no advantage in supporting frequencies below those which can operate in this range.
The Granularity definition refers to the number of performance steps available. From evaluation work
this matters especially within the range 50-100% of max performance.
The approach adopted in the case of a synchronous interface to the CPU is to treat the operating
performance levels as multiples of the clock frequency of the bus and memory controllers in the
system. In order to ensure that the wide frequency range does not exceed the limits of the memory
controller interface, the bus frequency is limited to a maximum of 75MHz.
To address fine granularity of control, a master clock division ratio of 12 is adopted which gives
easy derivation of half/third/quarter/sixth/twelfth divider ratios. The worst-case CPU performance
target is 288MHz, but to evaluate silicon that is likely to be typical process rather than worst case
the design frequencies should be configurable to run faster on the bench. For 400MHz operation the
clock frequency sub-multiples that can be directly synthesized are shown in Fig 3.3A.
Improving performance control granularity
All the above frequency ratios can be produced from a master clock in the range 550-800 MHz.
Because voltage scaling is most efficient close to the 100% level, the introduction of a second clock
source phase-locked to 10x the bus frequency facilitates the provision of a performance point at
5x the bus frequency. The additional complexity in clock generation is deemed worth the design
effort to produce the extra precision in clock generation - see Fig 3.3B.
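The synchronous performance points can be enumerated with a few lines of Python. This is a minimal sketch rather than part of the design flow: the function names are hypothetical, and the rule that a ratio is sourced from the 12x PLL when it divides 12 (and from the 10x PLL otherwise) is an assumption drawn from the divider ratios described above.

```python
def performance_points(bus_mhz, ratios=(6, 5, 4, 3, 2), max_ratio=6):
    """Return (CPU/bus ratio, percent of max performance, CPU MHz) per ratio."""
    return [(r, 100.0 * r / max_ratio, bus_mhz * r) for r in ratios]


def pll_multiplier(ratio):
    """PLL multiple of the bus clock assumed to source a given CPU/bus ratio.

    Ratios that divide 12 can be derived from the 12x PLL; the 5x point
    needs the second PLL locked to 10x the bus frequency.
    """
    return 12 if 12 % ratio == 0 else 10


bus = 400.0 / 6  # bus clock for 400MHz operation at the 6x ratio
for ratio, pct, mhz in performance_points(bus):
    print(f"{ratio}x  {pct:5.1f}%  {mhz:7.2f} MHz  "
          f"(from {pll_multiplier(ratio)}x PLL)")
```

With a 48MHz bus the same function gives the 288/240/192/144MHz worst-case design points.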
Voltage/Frequency scaling range
By interpolating from the maximum-frequency period graphical analysis of the slow process
corner (and in this case simply multiplying by 50% for the estimate of clock tree latency) the
following range of voltage/frequency operating points is estimated - see Fig 3.3C.
Below the 50% performance scaling point there is no more voltage scaling headroom left for safe
extrapolation.
Energy saving estimates from slow-corner analysis
By taking into account the square of the voltage, the relative frequency and the extended durations
at the proposed operating frequencies, the relative energy savings with frequency and voltage
scaling are tabulated in Fig 3.3D.
For normal usage a nominal power supply of 1.2V would be specified, so the basic reference point
is calculated, in this case for the characterized PVT at 288MHz (not 280MHz), by scaling the
duration to complete the work; this is shown in Fig 3.3E.
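The energy scaling figures can be reproduced as the product of the squared voltage, the work ratio and the relative duration. The sketch below is an illustration under stated assumptions: the function name is hypothetical, and the duration is taken to be the 280MHz reference frequency divided by the operating frequency, which matches the Duration columns of Fig 3.3D and Fig 3.3E.

```python
def energy_scaling(voltage, work_ratio, freq_mhz, ref_mhz=280.0):
    """Relative energy = V^2 x work ratio x duration.

    Duration is the stretch in run time relative to the reference
    frequency (assumed here to be the 280MHz point of Fig 3.3D).
    """
    duration = ref_mhz / freq_mhz
    return voltage ** 2 * work_ratio * duration


# (frequency in MHz, voltage in V, work ratio) transcribed from Fig 3.3D
scaled_points = [(280.00, 1.05, 6), (233.33, 0.95, 5),
                 (186.67, 0.85, 4), (140.00, 0.75, 3)]
for freq, volts, work in scaled_points:
    print(f"{freq:7.2f} MHz @ {volts:.2f} V -> "
          f"{energy_scaling(volts, work, freq):.2f}")

# Reference point of Fig 3.3E: nominal 1.2V supply at 288MHz
nominal = energy_scaling(1.2, 6, 288.0)
half_speed = energy_scaling(0.75, 3, 140.0)
print(f"nominal {nominal:.2f}, 50% point {half_speed:.2f}, "
      f"ratio {half_speed / nominal:.2f}")
```

Because the work ratio multiplied by the duration is constant for a fixed workload, the result is governed by the V^2 term alone, which is why no further saving is available once the voltage can no longer be reduced.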
2 Dynamic power consumption is proportional to the CV^2 f term. Energy is saved as long as the
voltage can be lowered such that the V^2 reduction outweighs the 1/f factor in the extended duration
of the workload. In summary there is no energy efficiency to be won when the voltage can no longer
be reduced proportionally.
Figure 3.4A - Voltage Scaling Range - slow silicon
Period (ns)  Voltage (V)  Latency (ns)  Bus Ratio  Freq(MHz)
2.500        1.050        1.250         6          400.00
3.000        0.920        1.500         5          333.33
3.750        0.830        1.875         4          266.67
5.000        0.770        2.500         3          200.00
7.500        -            -             2          133.33
Figure 3.4B - Energy saving estimates from typical-corner analysis
Freq(MHz)  Voltage (V)  V^2    Work Ratio  Duration  Energy Scaling
400.00     1.05         1.10   6.00        1.00      6.62
333.33     0.92         0.85   5.00        1.20      5.08
266.67     0.83         0.69   4.00        1.50      4.13
200.00     0.77         0.59   3.00        2.00      3.56
Figure 3.4C - Energy saving estimates for standard voltage conditions
Freq(MHz)  Voltage (V)  V^2    Work Ratio  Duration  Energy Scaling
420.000    1.080        1.166  6.00        0.95      6.67
420.000    1.200        1.440  6.00        0.95      8.23
420.000    1.320        1.742  6.00        0.95      9.96
These must be treated as rough working estimates, but there is certainly the potential for halving the
energy when not running in peak conditions for the CPU portion of the design.
Energy saving estimates from typical process analysis
Typical silicon is estimated to run at about 420MHz at a voltage of 1.08V. Taking a 400MHz
master clock to exercise typical silicon, the same analysis is performed to confirm the estimates -
see Fig 3.4A, Fig 3.4B, Fig 3.4C.
The theoretical energy savings are slightly lower between nominal and 50% (8.23:3.55 compared
to 8.40:3.38) but again confirm the performance levels are a useful working set.
Variable Clock latency management
The CPU presents a wide (Harvard) bus interface, which poses a design challenge because both
the frequency (and SOC control) and the inherent clock buffer tree delays vary with voltage.
Synchronous approach 
Only (pseudo-) synchronous multiples of the bus clock are used in this design: 
• In DVS-emulation mode, 6 x AHB clock rate (and stopped). 
• In DPS, DVS and AVS mode, 6 x, 5x, 4x and 3x AHB clock rate (and stopped). 
Pre-compensation of the CPU clock in relation to the system bus clock is used such that the
processor is run "early" with respect to the system interface timing reference (AMBA HCLK
rising edge) in preparation for reduced voltage operating points. The cost is a slightly tighter input
setup constraint on all inputs sampled from the system bus, but because the system bus is run at
one sixth of the processor clock frequency the constraint is of the order of 4 CPU cycles
compared with 6 in the non pre-compensated case.
To indicate the active processor clock edge on which transfers between the HCLK and CPUCLK
domains are initiated, the HCLKEN qualifier must be asserted to the processor in the cycle preceding
the active clock edge when HCLK rises.3
Hold-time management at the AHB interfaces 
The SOC interface to the AHB interface on the CPU is entirely referenced to the rising-edge of 
HCLK. 
In this design the CPU is clocked sufficiently early with respect to HCLK to guarantee that input
hold times to the CPU are never violated at the lowest voltage operating point.
In order to make sure the interface timing to the SOC is never violated a set of registers is added 
to the interface outputs from the voltage scaling domain: Outputs from the CPU domain have a 
rising-HCLK register added such that output transitions are only enabled when HCLK is high. This 
guarantees that "early" clocking of the CPU domain can never cause early output changes that 
could violate the AHB timing relationship to HCLK for the SOC voltage domain. 
3 And this HCLKEN qualifier must be pre-compensated earlier for reduced performance points, 
as must nRESET. 
Figure 3.5A - Latency Pre-compensation for reduced voltage/frequency operation
           Freq    Ratio  Period  Precomp  Latency  AHB target
           (MHz)          (ns)    (ns)     (ns)     (MHz)
CPU 100%   288.00  6      3.47    5.21     1.79
CPU 83%    240.00  5      4.17    5.21     2.14
CPU 66%    192.00  4      5.21    5.21     2.68
CPU 50%    144.00  3      6.94    5.21     3.57
PLL:       576.00  12     1.74
Bus:       48.00   1      20.83   15.63             64.00
Latency pre-compensation specification
Analysing the worst-case latency at reduced voltage provides the basic pre-compensation timing
by which the CPU clock domain must be advanced. At higher voltages the latencies all decrease,
so there is no danger of invalidating interface timing.
The requirement to run the x10 PLL as well as the x12 PLL complicates the design. Although both
are on the same process, voltage and temperature (on the fixed-power 1.2V SOC domain), the output
relationships are such that, even when locked to the same bus clock, the hand-over between PLL
sources must be handled by a synchronization handshake. As such there will be an uncertainty of
one clock period at the 83% (240MHz) point in the CPU waveform, so a conservative pre-compensation
of 3 cycles of the x12 PLL clock is chosen.
The AHB timing constraints must therefore be set to 9/12 of the clock period (i.e. the AHB target
frequency must be constrained to meet 12/9 of the actual frequency, i.e. 64MHz for the 48MHz
worst-case design target). See Fig 3.5A.
The highlighted latency figure for the 83% performance point is in fact a minimum figure with up
to one clock cycle of "uncertainty" due to the requirement to synchronize the secondary PLL clock
divider to the HCLK waveform produced from the primary PLL clock. This is still guaranteed
to be less than the 5.21ns pre-compensation specified.
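The pre-compensation arithmetic can be checked with a short Python sketch. The function name and its parameters are hypothetical; the calculation assumes only what is stated above, namely a pre-compensation of 3 cycles of the x12 PLL clock and a 48MHz worst-case bus clock.

```python
def precompensation(bus_mhz, pll_mult=12, precomp_cycles=3):
    """Return (pre-compensation ns, constrained AHB period ns, AHB target MHz)."""
    pll_period_ns = 1000.0 / (bus_mhz * pll_mult)
    precomp_ns = precomp_cycles * pll_period_ns
    bus_period_ns = 1000.0 / bus_mhz
    # What remains of the bus cycle: (12 - 3)/12 = 9/12 of the period.
    constrained_ns = bus_period_ns - precomp_ns
    ahb_target_mhz = 1000.0 / constrained_ns
    return precomp_ns, constrained_ns, ahb_target_mhz


precomp, constrained, target = precompensation(48.0)
print(f"pre-compensation   {precomp:.2f} ns")
print(f"AHB constraint     {constrained:.2f} ns")
print(f"AHB target         {target:.2f} MHz")

# Worst-case latencies (ns) per CPU/bus ratio, transcribed from Fig 3.5A
LATENCY_NS = {6: 1.79, 5: 2.14, 4: 2.68, 3: 3.57}
for ratio, latency in LATENCY_NS.items():
    print(f"{ratio}x: latency {latency:.2f} ns, "
          f"margin {precomp - latency:.2f} ns")
```

For the 48MHz bus this gives 5.21ns of pre-compensation, a 15.63ns constrained period and a 64MHz AHB synthesis target, in line with Fig 3.5A.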
Debug clock domain synchronization with variable latency
Debug signals are synchronized to HCLK (not CPUCLK) and then qualified with HCLKEN at the
core to ensure valid state advance clocking. Limiting the debug synchronization frequency to the AHB
HCLK rate causes minimal synchronization timing penalty (HCLK >> TCK frequency for the
debug agent) and avoids the expensive requirement for clock-tree balancing of the debug clock
synchronizer to the variable-voltage VDD_CPU domain. 4
4 The Debug domain TCK must be synchronized using HCLK and generates RTCK, the return
TCK handshake, to support DPS or DVS/AVS, especially when PLL clock
resynchronization delays may be introduced.
Figure 3.6A - Power-Test-Reset-Clock control structuring
[Block diagram. Inputs: testmode, clk_xtal12 (12MHz), clk_pllx12 (576MHz), clk_pllx10 (480MHz, with a
10MHz reference), pll_bypass, dvs_level, cpu_perf. CLKGEN_IEMx12 Dynamic Clock Generator: CPU
frequencies 288/000/192/144/0 MHz, AHB 48 MHz, APB 48 MHz. CLKGEN_IEMx10 Dynamic Clock
Generator, phase-locked to AHB 48 MHz: CPU frequencies 000/240/000/000/0 MHz. The 'a' and 'b'
clock and HCLKEN outputs of the two generators are ORed together; rst_por_n, hclk and hreset_n
are also shown.]
Power/Test/Reset/Control Requirements in detail - see Fig 3.6A
• Top-level module of the SOC design
• Contains the high-frequency state machine used to derive the reference clocks to the system
• Designed to be synthesized and placed as a 'pseudo-hardened' block to constrain the timing of
the clock generation and ensure low-skew and cleanly defined clock and reset sources - to
ensure clock tree buffering and balancing start from a known relationship.
• Subsequently this can be black-boxed as a LIB component for chip-level STA if required - but
the netlist is usable by the STA and ATPG tools
• Instantiate independently for each power domain and control the clocks and associated resets
to this domain as well as controlling test configuration
• Support development of each sub-system with an independent power domain in isolation -
providing resets and clocks to each domain suitable for standard RTL development and
verification. The expertise necessary to balance clocks, and manage buffer tree latencies on
clocks and resets of hardened subsystems, is encapsulated in the PTRC design and
implementation
• Assumed to be powered at all times that the SOC is powered, but supports handling of switching
voltage rail(s) to subsidiary voltage domains
• When domains are to be powered off, active-low resets shall be asserted and clocks then
clamped to zero
• When domains are to be powered up, once the voltage rail is stable and safe the clock
shall be started and the (active-low) resets de-asserted
• All clocks identified with "clk_" prefix
• Resets to be asynchronously asserted and synchronously de-asserted from the domain clock
for the appropriate domain
• All resets identified with "rst_" prefix; all resets are conventionally active-low
assertion ("_n" postfix)
• All clocks and resets fully controllable from SOC pins to ensure clean test and timing analysis
flows
• Testability control - support fully controllable clocks and resets
• testmode forces all clocks to be multiplexed to the externally controllable (crystal) clock input
• testmode forces all resets to be multiplexed to the externally controllable active-low reset input
• pll_bypass forces the internal clock to be multiplexed to the externally controllable (crystal) clock
input, not the PLL synthesized clock
• pll_bypass forces the 'PLL locked' signal to be overridden (as the PLL is switched out)
Figure 3.7A - PLL configuration for 12X PLL
Master IEMPLLx12 frequency (MHz)
Xtal (MHz)    12       13.33333
Ratio = x36            480.00
Ratio = x42   504.00   560.00
Ratio = x48   576.00   640.00
Ratio = x54   648.00   720.00
Ratio = x60   720.00   800.00
CPU 100% performance (MHz)
Xtal (MHz)    12       13.33333
Ratio = x36            240.00
Ratio = x42   252.00   280.00
Ratio = x48   288.00   320.00
Ratio = x54   324.00   360.00
Ratio = x60   360.00   400.00
AHB/SDRAM frequency (MHz)
Xtal (MHz)    12       13.33333
Ratio = x36            40.00
Ratio = x42   42.00    46.67
Ratio = x48   48.00    53.33
Ratio = x54   54.00    60.00
Ratio = x60   60.00    66.67
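The three sections of Figure 3.7A are related by fixed division ratios, which the following Python sketch makes explicit. The function name is hypothetical; the divide-by-2 for the 100% CPU clock and the divide-by-12 for the AHB/SDRAM clock are inferred from the tabulated values rather than stated in the figure. The sketch computes every crystal/ratio combination, including the one the figure leaves blank.

```python
def pll_config(xtal_mhz, ratio):
    """Return (master PLL MHz, CPU 100% MHz, AHB/SDRAM MHz).

    The master clock is the crystal frequency times the PLL ratio; the
    CPU runs at half the master clock at 100% performance and the bus at
    one twelfth of it (divisions inferred from Figure 3.7A).
    """
    master = xtal_mhz * ratio
    return master, master / 2, master / 12


for xtal in (12.0, 40.0 / 3):          # 12MHz and 13.33333MHz crystals
    for ratio in (36, 42, 48, 54, 60):
        master, cpu, ahb = pll_config(xtal, ratio)
        print(f"xtal {xtal:8.5f} MHz  x{ratio}:  master {master:6.2f}  "
              f"CPU {cpu:6.2f}  AHB {ahb:5.2f}")
```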
• Static Timing Analysis support
• pll_bypass allows controllable clock path visibility for all derived SOC clocks without the
STA tools needing to understand the 'black-box' functionality of the PLL components
• Simple static case analysis using externally controlled clocks and testmode
• Identified clocked state machine register outputs for synchronized resets and all derived
clocks to allow named ports to be controlled with STA clock forcing
CPU Subsystem Clock Specifications (PLL_IEM domain)
The following clocks and clock generation protocols are required:
• Free-running AMBA clock for AHB HCLK (also serves as APB PCLK in this design). This
runs continuously and the processor and debug clocks are phase-aligned to this primary
clock. All the synthesizable memory controllers and peripherals use this primary clock in the
SOC design
• Independent target processor clock frequency, required to support the Adaptive Power controller
module, with the frequency under the control of an external (asynchronous) dynamic power
controller. This target frequency is set by the Intelligent Energy Controller to the desired
processor frequency and used as part of the voltage control feedback loop to determine if the
voltage is sufficient to safely support operation at this target performance level
• Dynamically switching, glitch-free CPU clock, carefully controlled to align with AHB transfer
HCLK edges using the HCLKEN enable. The CPU frequency is independently switched
under control of the Intelligent Energy Controller in a phase-aligned manner to the
free-running AMBA clock
• Power management protocol support for isolation (interface signal clamping) and CPU
sequencing and synchronous de-assertion of reset. Additional support is added to ensure that
the CPU is always clocked for a minimum of three cycles after power-down prior to reset
de-assertion - to meet the clock gating specifications on the CPU subsystem
• Instantiate independently for each power domain and control the clocks and associated
resets to this domain as well as controlling test configuration
Figure 3.8A - PLLx12 Clock generation
[Timing diagram. Signals: clk_pllx12; counter stages Q1 (:= ~Q6), Q2 (:= Q1), Q3 (:= Q2),
Q4 (:= Q3), Q5 (:= Q4), Q6 (:= Q5); CPUx6 [100%] and HCLKENx6; CPUx4 [66%] and HCLKENx4;
CPUx3 [50%] and HCLKENx3]
Figure 3.8B - PLLx10 Clock generation
[Timing diagram. Signals: clk_pllx10; counter stages Q1 (:= ~Q5), Q2 (:= Q1), Q3 (:= Q2),
Q4 (:= Q3), Q5 (:= Q4); CPUx5 [83%] and HCLKENx5]
Detailed IEM System Clock Generation: CPU Clock Generator (a): CLKGEN_IEMx12
When Dynamic Voltage Scaling is enabled the CPU clock must be dynamically
reprogrammable to the supported frequencies between 0 and 100%. The primary clock
generation for Maximum, Minimum and the bus clock rates is provided by CPUCLK1. This is
clocked by the CPUPLL1 (12x bus speed) clock.
To ensure the clock generator is fast and efficient the underlying divider is based on a 6-stage
"twisted-ring counter" - a simple shift-register structure with an inverted feedback loop (referred
to as a "Johnson counter" 5). Q1 to Q6 reflect the underlying HCLK bus frequency and can be
tapped off for clock balancing for both the AHB IP and the SDRAM memory controller external
device clock in particular.
This clock generator is able to produce the following CPU clock ratios: 
6 x bus speed (6/6 = 100%, max performance) 
4 x bus speed (4/6 = 66% of max performance) 
3 x bus speed (3/6 = 50% of max performance) 
At each frequency point the appropriate HCLKEN signal timing is generated which is the 
qualifier to indicate that the next rising edge of CPUCLK is aligned to the external HCLK bus 
IP transfer edge. See Fig 3.8A. 
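The behaviour of the twisted-ring divider can be illustrated with a small behavioural model in Python. This is a sketch of the generic Johnson counter technique and not the RTL of the design; the function names are illustrative. Stage Q1 loads the inverse of the last stage and every other stage loads its predecessor, as in the stage equations of Fig 3.8A.

```python
from itertools import islice


def johnson_states(stages):
    """Yield successive states (Q1..Qn) of a twisted-ring (Johnson) counter."""
    state = [0] * stages
    while True:
        yield tuple(state)
        # Q1 := ~Qn, Q2 := Q1, ..., Qn := Qn-1
        state = [1 - state[-1]] + state[:-1]


def sequence_length(stages):
    """Count the clock cycles before the counter returns to its initial state."""
    gen = johnson_states(stages)
    first = next(gen)
    count = 1
    for state in gen:
        if state == first:
            return count
        count += 1


for state in islice(johnson_states(6), 12):
    print("".join(str(bit) for bit in state))

print("6-stage sequence length:", sequence_length(6))
print("5-stage sequence length:", sequence_length(5))
```

An n-stage counter repeats every 2n input clocks, so the 6-stage counter divides the 12x PLL clock by 12 and each stage output toggles at the bus frequency with a 50% duty cycle; a 5-stage counter does the same for a 10x clock.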
Detailed IEM System Clock Generation: CPU Clock Generator (b): CLKGEN_IEMx10 
For energy efficiency at reduced performance and voltage an extra dynamic performance
setting of 5/6 of maximum operating frequency is highly desirable. A second PLL clock at 10x
the bus clock frequency is more power efficient than a single, much higher frequency, master
PLL clock.
Again, to keep the clock generator fast and efficient the underlying divider is based on
a 5-stage Johnson counter (twisted-ring counter). Q1 to Q5 reflect the underlying HCLK bus
frequency and are synchronized to the master clock generator clock described in the previous
section.
This clock generator is able to produce the following CPU clock ratio:
5 x bus speed (5/6 = 83% of max performance)
The HCLKEN signal timing is generated as the qualifier to indicate that the next rising edge
of CPUCLK is aligned to the external HCLK bus IP transfer edge. See Fig 3.8B.
5 http://en.wikipedia.org/wiki/Ring_counter 
Figure 3.9A - Latency Precompensation
[Timing diagram. Labels: AMBA HCLK; latched outputs to AMBA; inputs from AMBA; DVS dependent
clock latency; CPUCLK rising; HCLKEN active; re-clock outputs; AHB cycle synthesis constraints;
fixed length AMBA AHB cycle period]
(Pre-)Compensating for DVS Clock Tree Latency variation
When scaling the voltage of a sub-system within an SOC, not only do the set-up and hold times
degrade with lower voltage but the clock tree (clock enable, reset, etc.) latencies also all scale in
a not insignificant manner.
Because the ARM926EJ CPU core used in this design has a synchronous bus interface, this
creates a challenge to ensure the rest of the SOC design can be completed with standard static
timing analysis methodologies. The hardened processor core has a deep internal clock buffer tree
and it is necessary to align the clocks for the maximum frequency (at worst-case process and
temperature and worst-case max voltage) in order to meet performance at 100%, but avoid
having to compensate for significant negative set-up times on inputs at lower voltages.
The approach adopted in this design is to advance the clock to the CPU sufficiently to guarantee
that the CPU worst-case latencies at the lowest reduced voltage and frequency operating point
are met in advance of the AMBA HCLK rising edge.
All outputs from the DVS domain are guaranteed by design to be set up in advance of the rising
edge of HCLK. A set of registers clocked by rising HCLK is added on the 1.2V VDD SOC
domain side of the level shifters that bridge the interface between the dynamic voltage scaled
domain and the standard SOC design implementation for the rest of the chip.
This is shown in Fig 3.9A.
• In summary, the DVS-domain CPU is clocked in advance of HCLK rising by an amount
guaranteed to exceed the worst-case low-voltage latency and hold times
• The AHB SOC subsystem is constrained to a tighter frequency to guarantee that all
inputs to the DVS-domain CPU are valid for the worst-case low-voltage latency and set-up
times. This is not a difficult constraint to meet because the HCLK domain is run at the
low frequency of 1/6th of the primary CPU sign-off frequency [normally the IP is
synthesized for 133MHz HCLK and SDRAM controller clock rates]
• The addition of the retiming registers clocked by the rising edge of HCLK then guarantees
there are no output hold time violations to the AMBA subsystem HCLK domain. These
add minimal delays to the significantly slower low-to-high voltage level shifters
The pre-compensation times for the CPU were determined from the worst-case latency analysis
tabulated in Figure 3.2A.
Figure 3.10A - Fmax Timing 'strobes' for AHB subsystem
[Timing diagram. Signals: CPUCLK, HCLK, HCLKEN, HRESETN. Annotation: Constrain AHB to CPU Fmax/4]
Figure 3.10B - Fmax DVS-mode Timing 'strobes'
[Timing diagram. Signals: CPUCLK, HCLK, HCLKEN, HRESETN. Annotation: Constrain AHB to CPU Fmax/6]
Figure 3.10C - Fmin DVS-mode Timing 'strobes'
[Timing diagram. Signals: CPUCLK, HCLK, HCLKEN, HRESETN. Annotation: Constrain AHB to CPU Fmin/2]
Static Timing Analysis strategy
With CPU, RAM and SOC domains all set to standard voltage/de-rating corners the timing
analysis must be verified with the CPU at Fmax.
Then sign off the timing with the new extended low-voltage slow-slow corner with the CPU at Fmin.
Current and Target clock generation
• Figure 3.10A shows the primary interface clock relationship of the CPU clock to the
AHB bus clock in order to constrain bus read and write data access times correctly
o The AHB clock period must be over-constrained to 4/6 of the cycle time in
order to ensure input data (and read data in particular) will be valid and set up
to the CPU edge that samples the data (when the HCLKEN qualifier is
active)
• Figure 3.10B shows the primary interface clock relationship with the CPU clock
enable in advance of each active (rising) edge of the AHB bus clock when the CPU
is run at full voltage and at maximum frequency (6 x bus clock)
• Figure 3.10C shows the primary interface clock relationship with the CPU clock
enable in advance of each active (rising) edge of the AHB bus clock when the CPU
is run at lowest voltage and at reduced frequency (2 x bus clock)
With Dynamic Voltage Scaling the voltage/frequency operating points must not be violated -
the CPU must never be clocked faster than the operating voltage supports. Typically a look-up
table of the voltages required to support each level of performance is provided to control the
power supply, and for guaranteed operation over temperature and process spread a
conservative characterization process is required.
With Adaptive Voltage Scaling the operating voltage is made continuously adjustable such
that it can be varied to compensate for process and temperature within the power supply
control loop. In order to support dynamic performance scaling the concept of a secondary
'target' clock is introduced. When requesting an increase in performance the CPU must be
maintained at the current performance value while the target clock is used to probe a
'Hardware Performance Monitor' (HPM) of some form that is used in the power supply control
feedback loop to ascertain when a safe operating voltage at the new performance setting has
stabilized. At this point it is safe to switch the processor to the higher frequency clock. When
reducing performance the CPU and target clocks are immediately switched to the lower clock
frequency and the power supply then reduces the operating voltage to support the new target
frequency that is monitored by the HPM.
The clock generators must support concurrent clock synthesis of current and target
frequencies.
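The ordering rule for performance changes can be summarized in a short Python sketch. It is illustrative only: the step names and the function are hypothetical, and the look-up table simply reuses the slow-corner voltage estimates of Fig 3.3C as placeholder values.

```python
# Illustrative look-up table (performance % -> supply voltage in V), using
# the slow-corner estimates of Fig 3.3C; a real table would come from a
# conservative characterization process.
DVS_TABLE = {100: 1.05, 83: 0.95, 66: 0.85, 50: 0.75}


def switch_sequence(current, target):
    """Return the ordered steps for changing performance from current to target (%)."""
    if target == current:
        return []
    if target > current:
        # Raising performance: the voltage must be safe BEFORE the CPU clock changes.
        return [
            ("set_target_clock", target),          # target clock probes the HPM
            ("raise_voltage", DVS_TABLE[target]),
            ("wait_voltage_ready", target),        # supply loop reports stable
            ("switch_cpu_clock", target),
        ]
    # Lowering performance: the clocks drop first, then the voltage follows.
    return [
        ("switch_cpu_and_target_clock", target),
        ("lower_voltage", DVS_TABLE[target]),
    ]


for step in switch_sequence(50, 100):
    print("up:  ", step)
for step in switch_sequence(100, 66):
    print("down:", step)
```

In both directions the CPU is never clocked faster than the present voltage supports, which is the invariant the clock generators and power supply control loop have to preserve.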
Figure 3.11A - Performance Level Request Coding
Perf    Level
1xxxx   100%
01xxx   83%
001xx   66%
0001x   50%
00001   0% / Idle
Figure 3.11B - Voltage Ready Request Coding
Voltage Ready
1xxxx   100%
01xxx   83%
001xx   66%
0001x   50%
00001   0% / Idle
Figure 3.11C - Shift-register Design Implementation
[Block diagram. Labels: AHBCLKSR and CPUCLKSR shift registers with synchronous reload of the clock
and clock-enable waveform patterns; CPU/AHB clock latency matching (-2ns, -1ns, 0ns, +1ns, +2ns);
buffered, nrst_por synchronized active-low reset; clk_cpuhclken; clock buffering and clock tree
(50%, +ve edge); cpuactive (non-zero input); input synchronizers; clock mux]
Clock switching to the CPU must be clean and never violate minimum pulse widths at the
selected current or target frequencies. The requirement to introduce the secondary PLL adds
complexity, and the approach adopted is to logically OR the clocks and clock enables between
the two clock dividers and provide a synchronizer and handshake interface between the two
(semi-synchronous) clock generators using an "active"/"hold-off" request-acknowledge
sequence.
Dynamic Clock Control Interface Specification
Performance Level request
Gray-coded 5-bit buses are used to convey the target performance clock setting from the
Intelligent Energy Controller. These should be synchronized to the local domain clock before
gating with the clock enables. See Figure 3.11A.
Voltage Ready Level
Gray-coded 5-bit buses are used to convey the currently safe operating voltage level from
the Dynamic Voltage Controller. These should be synchronized to the local domain clock
before gating with the clock enables. See Figure 3.11B.
Dynamic Performance Monitoring
The Intelligent Energy Controller should use the "AND" of the Target Performance level and
the Voltage Ready buses to determine the currently active performance level. Although these
control levels have a few cycles of synchronization penalty there are no extended PLL
out-of-lock delay times, so the performance switching is effectively instantaneous.
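The effect of ANDing the two buses can be modelled in a few lines of Python. The encoding used here is an assumption made for illustration: each level is modelled as a thermometer-style word (its own bit and all lower bits set), which is one coding consistent with the leading-one patterns of Fig 3.11A and Fig 3.11B; the actual bus encoding of the design is not reproduced here, and the function names are hypothetical.

```python
LEVELS = [100, 83, 66, 50, 0]  # level selected by each bit, MSB first


def encode(level):
    """Assumed 5-bit code: the bit for `level` and all lower bits are set."""
    idx = LEVELS.index(level)
    return (1 << (5 - idx)) - 1


def decode(word):
    """Priority decode: the most significant set bit selects the level."""
    for idx, level in enumerate(LEVELS):
        if word & (1 << (4 - idx)):
            return level
    return None  # no bit set: not a valid code


def active_level(target_word, ready_word):
    """Currently active level = decode(target AND voltage-ready)."""
    return decode(target_word & ready_word)


for target, ready in [(100, 66), (66, 100), (83, 83), (50, 0)]:
    level = active_level(encode(target), encode(ready))
    print(f"target {target:3d}%  voltage ready {ready:3d}%  ->  active {level}%")
```

Under this assumed coding the AND always resolves to the lower of the requested and voltage-ready levels, so the CPU clock can only rise once the supply has reported that the higher level is safe.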
Implementation Details 
A shift-register design for the clock generators is used. A pair of shift registers is preloaded
with the appropriate clock and clock enable waveform every HCLK cycle for the selected
target/CPU frequencies, and this is then simply shifted out at the PLL clock rate. This keeps
the implementation fast and efficient for the high-speed clock. Figure 3.11C shows the
shift-register style design used for the ATLAS926 DVFS clock controller.
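The preload-and-shift idea can be modelled behaviourally in a few lines of Python. This is a minimal sketch under stated assumptions, not the ATLAS926 implementation: the 20-bit window, the divide-by-5 pattern and all class and function names are illustrative choices.

```python
def divider_pattern(length: int, period: int, high: int) -> list:
    """Waveform for one preload window: `high` ones at the end of each `period`."""
    return [1 if (i % period) >= period - high else 0 for i in range(length)]


class ShiftRegisterClockGen:
    """Shift register reloaded once per window and shifted once per PLL cycle."""

    def __init__(self, length: int):
        self.length = length
        self.bits = [0] * length

    def preload(self, pattern: list) -> None:
        if len(pattern) != self.length:
            raise ValueError("pattern must fill the whole register")
        self.bits = list(pattern)

    def shift(self) -> int:
        """Emit the head bit and move the remaining bits up by one place."""
        out = self.bits[0]
        self.bits = self.bits[1:] + [0]
        return out


def run_window(gen: ShiftRegisterClockGen, pattern: list) -> list:
    """Preload one window (assumed to be one HCLK cycle) and shift it all out."""
    gen.preload(pattern)
    return [gen.shift() for _ in range(gen.length)]


WINDOW = 20  # assumed number of PLL cycles per preload window
gen = ShiftRegisterClockGen(WINDOW)
clock_out = run_window(gen, divider_pattern(WINDOW, period=5, high=2))
waveform = "".join(str(b) for b in clock_out)
rising_edges = sum(
    1 for prev, cur in zip([0] + clock_out, clock_out) if prev == 0 and cur == 1
)
print(waveform, rising_edges)
```

Changing frequency in this model only means preloading a different pattern at the next window boundary, which is why no PLL relock time is involved.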
Summary 
The challenges in designing for dynamic frequency and voltage scaling can be overcome, but
integration timing interfaces either need detailed design and careful timing verification, or need
to be treated as asynchronous interfaces with associated cross-clock-domain handshakes or
synchronizers. Although an efficient synchronous timing relationship can be designed in as
described in this chapter, this does require expert design resource for SOC integration. The
author chose to specify fully asynchronous interfaces for the ARM1176 IEM-ready product IP that
is licensed commercially to customers.
4. Design for State-Retention Power-Gating 
Principles of Power-gating design methodology and flow
Leakage power dissipation grows with every generation of CMOS process technology. This leakage power is not
only a serious challenge to battery powered or portable products but increasingly an issue that has to be
addressed in mains-powered or tethered equipment too, where added leakage power generates increased
heat and often requires specialized packaging or cooling.¹
It is highly desirable to add mechanisms to fully or partially switch the leaky power rails to reduce the
leakage power that is dissipated.
In this chapter, a set of best practice approaches to architecting a SOC for one or more power-gated regions
is introduced and described in detail. The next chapter deals with the detailed transistor design
and analysis issues, as these are very technology dependent; here the primary interest is in understanding
how RTL designers can design for power-gating implementations in as technology-independent and portable
a manner as possible.
RTL design conventionally assumes ideal power rails and logical communication between modules.
Switching or gating the power rails needs special care and attention in the RTL partitioning, coding and
verification.
Dynamic and Leakage power profiles 
Clock gating can be handled fairly transparently from an implementation and tools perspective. Power-gating
is more invasive than clock-gating in that it affects inter-block interface communication and adds
not insignificant time delays to safely enter and exit power gated modes.
For the purposes of describing power-gating principles in this chapter, the concepts of entry and exit from
such power modes are introduced:
• SLEEP events initiate entry to the low power mode
• WAKE events initiate return to active mode
Such events may be scheduled explicitly by control software as part of device drivers or operating system
idle tasks, or alternatively initiated in hardware by timers or system level power management controllers.
System designers frequently choose more explicit and appropriate names, but the concepts should be clear
from what follows.
Figure 4.1A shows an example activity profile for a sub-system that needs to be power managed. In the
case of basic clock gating the leakage power component is shown at the base of the graph.
¹ Kim N. S. et al., "Leakage current: Moore's law meets static power", IEEE Computer, Dec 2003,
(Vol 36, No 12) pp 68-75. http://doi.ieeecomputersociety.org/10.1109/MC.2003.1250885
Figure 4.1A - Activity profile with Clock Gating
Figure 4.1B - Activity profile with ideal Power-gating
Figure 4.1C - Activity profile with non-ideal Power-gating
[Each panel plots power against time for Activity 1, Activity 2 and Activity 3, with dynamic power drawn above a leakage power base. The interval between Activity 1 and Activity 2 is clock gated in 4.1A and power gated in 4.1B and 4.1C.]
Architecturally one is faced with trade-offs between
• the degree of leakage power savings that are possible
• the entry and exit time penalties incurred
• the energy dissipated entering and leaving such leakage saving modes
• the activity profile (proportion and frequency of times asleep or active)
Figure 4.1B shows an example activity profile for the same sub-system with basic power-gating implemented.
The real-time response time between the WAKE event and having clocks running may be significant and
cannot be ignored at the system design level.
Figure 4.1C shows more realistically that the leakage power savings are not perfect and instantaneous; the full
leakage power savings take some time to reach target levels due partly to the (hotter) thermal profile of the
preceding activity and the non-ideal nature of the power-gating technology. Therefore the achievable
savings are compromised to some extent.
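The trade-offs listed above can be turned into a first-order break-even estimate. The Python sketch below is an illustration only: it assumes the idealised step profile of Figure 4.1B (leakage drops instantly to its gated level), and the function names and all numeric values are hypothetical rather than measured data.

```python
def net_energy_saved(p_leak_active_w, p_leak_gated_w, t_sleep_s, e_entry_j, e_exit_j):
    """Leakage energy avoided while asleep, minus the cost of entering and leaving."""
    return (p_leak_active_w - p_leak_gated_w) * t_sleep_s - (e_entry_j + e_exit_j)


def break_even_sleep_time(p_leak_active_w, p_leak_gated_w, e_entry_j, e_exit_j):
    """Shortest sleep interval for which power-gating does not lose energy."""
    saving_rate_w = p_leak_active_w - p_leak_gated_w
    if saving_rate_w <= 0:
        return float("inf")  # gating never pays back
    return (e_entry_j + e_exit_j) / saving_rate_w


# Illustrative numbers only: 10 mW leakage when clock gated, 0.5 mW when
# power gated, 20 uJ to enter the mode and 30 uJ to leave it.
P_ON, P_GATED, E_IN, E_OUT = 10e-3, 0.5e-3, 20e-6, 30e-6
t_be = break_even_sleep_time(P_ON, P_GATED, E_IN, E_OUT)
print(f"break-even sleep time: {t_be * 1e3:.2f} ms")
```

The non-ideal behaviour of Figure 4.1C reduces the effective saving rate early in the sleep interval, so a real break-even time would be longer than this estimate.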
Impact of Power-gating on classes of sub-system 
A cached CPU subsystem for example can typically be dormant or inactive for long periods (Fig 4.2A):
• power-gating the entire CPU provides very good leakage power reduction
• but wake-up-time response to an interrupt has significant system level design implications (may
even require deeper FIFOs or scheduled time-slots)
• if the cache contents are lost every time the CPU is woken up then there is likely to be a significant
energy cost in having to repeat all the bus activity to refill the cache
• net energy savings depend on the 'sleep'/'wake' activity profile as to how much energy was saved
when power gated versus that burnt in reloading state
Alternatively a peripheral subsystem may have a much better defined profile under control of a device driver
and operating system power management scheme (Fig 4.2B):
• power-gating most of the block but maintaining key state may give best leakage power reduction
• the device driver may be required to explicitly load/restore key state or initiate hardware sequencer
control
• the real-time and recovery costs can be easily profiled and optimized
• net energy savings are relatively easy to quantify from (existing) device driver statistics
Finally by way of illustration a more complex multi-processor CPU cluster would be an example of a
hierarchical power-gating subsystem where one or more processors may be power gated off completely
(Fig 4.2C):
• power-gating individual CPUs provides very good system level leakage power reduction
• in such an MP cluster, coherency has to be re-established when a CPU is re-powered. Therefore
the fact that the local cache contents have been lost every time it was power gated is not a problem.
The CPU is awoken clean and reset, ready to execute and cache the next task it is given
• optimized net energy savings may well require adaptive shutdown algorithms that vary the number
of CPU cores power gated and active with varying workload
Figure 4.2A - Cached CPU sub-system
[Blocks: CPU cache memories, CPU core, bus interface.]
Figure 4.2B - Peripheral sub-system
[Blocks: buffer and configuration memory, peripheral engine core, bus interface.]
Figure 4.2C - Multi-processor CPU cluster sub-system
[Blocks: four CPU cores, each with CPU cache memories; control; bus coherency management and interface.]
Figure 4.2D - Functional Partitioning for Power-gating
[Blocks: control, power-gating controller, power-switching fabric, power-gated functional block.]
Figure 4.2E - Functional power gated subsystem packaged for integration
[Blocks as in Figure 4.2D, packaged with inputs, isolated outputs and the supply connections.]
Power-gating Design Partitioning
There are two parts to architect in power-gating systems:
• the IP block(s) or subsystem(s) that will be power gated
• the controller that is responsible for sequencing the power control to the power-gated subsystem
The controller typically must be supplied with power from an "always-on" supply to ensure it has the ability to
manage the Wake and Sleep interfaces on behalf of the power-gated block. The control signals for the
power-gating hook-up need to be explicitly coded in the RTL and, providing a defined control protocol is used,
may be verified with a protocol test-bench. See Fig 4.2D.
More often it may be desirable to package the IP-specific power controller with the power managed block itself,
as the power sequencing may be tailored to an implementation specific configuration rather than being a
generic reusable component. However the hierarchical partitioning must be handled carefully to ensure the
controller and power-gated portions can be cleanly mapped onto separate power supplies and potentially even
separate physical regions. The implicit SOC-level power supplies are shown powering the
controller and the power switching circuitry, and the concept of packaging the output isolation functionality
with the module is introduced - ensuring that the composite component can be safely deployed in a SOC
with clean safe logic outputs. See Fig 4.2E.
With more complex state retention mechanisms that may require functional state save and state restore
operations, the power gated unit or module may have explicit control pins. Otherwise the interface for
the power gated regions can be largely transparent to the RTL coding.
RTL Design for Power-gating
The primary considerations are the hierarchical partitioning of the design to ease the implementation and
verification and the degree of power-gating granularity within the subsystem hierarchy. The concept of a
power-island, or a bounded region of design hierarchy with controlled power rails, is introduced.
Any region that is locally power gated or externally switched as a power rail introduces a set of constraints
into the system design that affect not only dynamic and leakage power modes of operation, but also
• the functional control of clocks and resets
• interface isolation
• implementation and analysis constraints
• state-dependent verification for each supported power state
• power state transition coverage to ensure all legal state entry and exit arcs are tested
• manufacturing and production test implications
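As a rough illustration of the transition-coverage point in the list above, the Python sketch below checks a recorded sequence of power states against a table of legal arcs. The state names (RUN, SLEEP, OFF), the legal-arc table and the function name are assumptions made for this example and are not defined in the thesis.

```python
# Hypothetical power states: RUN (active), SLEEP (power gated), OFF (rail switched).
LEGAL_ARCS = frozenset({
    ("RUN", "SLEEP"),
    ("SLEEP", "RUN"),
    ("RUN", "OFF"),
    ("OFF", "RUN"),
})


def arc_coverage(trace, legal_arcs=LEGAL_ARCS):
    """Classify the state-to-state arcs seen in a trace of sampled power states."""
    seen = {(a, b) for a, b in zip(trace, trace[1:]) if a != b}
    return {
        "covered": seen & legal_arcs,       # legal arcs that were exercised
        "missing": set(legal_arcs) - seen,  # legal arcs never exercised
        "illegal": seen - legal_arcs,       # arcs outside the table
    }


full = arc_coverage(["RUN", "SLEEP", "RUN", "OFF", "RUN"])
partial = arc_coverage(["RUN", "SLEEP", "OFF"])
print(sorted(full["missing"]), sorted(partial["illegal"]))
```

In a simulation flow the trace would come from sampling the power controller state; a non-empty "missing" set indicates untested entry or exit arcs, and a non-empty "illegal" set indicates a sequencing error.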
The most basic form of power-gating control, and that with the lowest long-term leakage power, is that of
providing an externally switched power rail. It is important to ensure that the SOC architecture provides clear
hierarchical visibility of all switched power rails as this affects not only power but clocks, resets and any
signals that pass through a power island.
Where a number of on-chip subsystems share a power rail the next level of refinement is to support on-chip
power-gating at the complete subsystem level. Power-gating a complete subsystem is typically easier from
an implementation and verification perspective - the larger the block that is switched the greater the leakage
power savings for long periods of inactivity in general. This results in a number of power islands on a shared
power supply rail.
Figure 4.3A - Conceptual Hierarchical Power-gating
[Block diagram: rail-switched CPU sub-system containing the cache memory subsystem and a power-gated CPU core; the core contains the integer core, MAC and VFP units.]
Figure 4.3B - Conceptual Hierarchical Power-gating Function Table
Cache   CPU     MAC     VFP     Power State
(OFF)   (OFF)   -       -       Shutdown (Cache cleaned, VDDCPU off)
ON      OFF     -       -       Deep Sleep (Cache preserved)
ON      ON      OFF     OFF     Normal Operation
ON      ON      ON      OFF     DSP workload
ON      ON      OFF     ON      Graphics workload
ON      ON      ON      ON      Intensive multimedia mode
Figure 4.3C - Conceptual Hierarchical Power-gating
[Block diagram: the same rail-switched CPU sub-system with the cache memory subsystem, power-gated CPU core, integer core, MAC and VFP units.]
Figure 4.3D - Conceptual Hierarchical Power-gating Function Table
Cache   CPU     MAC     VFP     Power State
(OFF)   (OFF)   (OFF)   (OFF)   Shutdown (Cache cleaned, VDDCPU off)
ON      OFF     OFF     OFF     Deep Sleep (Cache preserved)
ON      ON      OFF     OFF     Normal Operation
ON      ON      ON      OFF     DSP workload
ON      ON      OFF     ON      Graphics workload
ON      ON      ON      ON      Intensive multimedia mode
However power-gating at finer granularity within the subsystem tends to provide optimal savings when
certain functional units may be effectively turned off for even short periods. For example a complex
co-processor in a CPU might be a good candidate for local power-gating if it is only used by certain threads on
demand. Each sub-component of a sub-system needs to be treated as a true power island due to the
interface boundary conditions and testability.
Chip architecture for power-gating
A scalable approach to chip architecture is valuable as a system-on-chip design today typically becomes a
component in a larger or more complex integration in a subsequent product generation.
It is recommended that module boundaries be enforced at the power island level, even if more
advanced EDA flows support process by process tagging of voltage and operating conditions. Ensuring one
has clean visibility of the boundaries of a power-gated block for verification and testability purposes is
important to ensure a clean top down implementation flow.
Hierarchy and Power-gating 
Although one can in theory arbitrarily nest power gated modules within power gated subsystems which are
in turn nested on a shared switched power rail, there are considerable benefits in not inferring multiple levels
of power switching fabric. As will be described in Section 4.3, power-gating is intrusive and adds in some
voltage drop and degradation of performance.
Even if a functional power management view is presented at the architectural level, the implementation is
improved if this is mapped onto a single level of power-gating at implementation. For example a CPU
conceptually has all the core logic power gated, and within it a number of functional units that can be
individually powered down independently - a Multiply-Accumulate and a Vector Floating Point unit in this
case. See Fig 4.3A and function table at Fig 4.3B.
From an implementation standpoint the switching fabric is flattened. There is never any
case when the MAC or VFP functional units would be switched on without the CPU core itself, so the switch
control semantics are adjusted to AND the control terms rather than cascade the switch elements.
See Fig 4.3C. The tabular power mode description now requires explicit control of the nested power gated
functional units at Fig 4.3D.
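The AND-ing of control terms described above can be sketched as follows. This is an illustrative Python model, not the thesis implementation: the request and enable names are hypothetical, and the cache supply is left out because it sits on the rail-switched supply rather than on a gated rail.

```python
from itertools import product


def flat_switch_enables(cpu_req: bool, mac_req: bool, vfp_req: bool) -> dict:
    """Flattened switch control: each nested unit is ANDed with its parent term."""
    return {
        "cpu": cpu_req,
        "mac": cpu_req and mac_req,  # MAC switch never on without the core
        "vfp": cpu_req and vfp_req,  # VFP switch never on without the core
    }


def units_need_core(enables: dict) -> bool:
    """Invariant from the text: MAC or VFP powered implies the CPU core is powered."""
    return enables["cpu"] or not (enables["mac"] or enables["vfp"])


invariant_holds = all(
    units_need_core(flat_switch_enables(c, m, v))
    for c, m, v in product([False, True], repeat=3)
)
print(flat_switch_enables(True, True, False), invariant_holds)
```

With this arrangement every switch hangs directly off the same supply, so no functional unit sees the voltage drop of two switch elements in series.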
Recommendations: 
• Map power gated regions to explicit module boundaries
• When partitioning a hierarchical power-gating design ensure that the power-gating control terms
can be mapped back to a flat switching control fabric rather than design to cascade power-gating
more than one or two levels
Pit-falls:
• Avoid control signals passing through power-gated or power-down regions to other power regions
that are not hierarchically switched with the first region
• Avoid excessively fine power-gating granularity unless absolutely required for aggressive leakage
power management. Every interface adds implementation and verification challenges and
complicates the system level production test challenges
Figure 4.4A - Power Networks and Control
[Block diagram. Recoverable labels: VDDCPU, VDDSOC, VSS; power-gating controller; power-gated CPU standard cell area; non-gated cache memories.]
Figure 4.4B - External Power Rail switching
[Block diagram. Power supply side: PSU regulator driving the "always-on" VDDSOC power rail, and a PSU regulator (with enable) driving the externally switched VDD power rail. SOC side: PSU control interface, rail-switched sub-system, VSS ground rail.]
Power networks and their control 
In the design of a processor-based SOC the CPU system may well introduce a number of power networks:
• An independent power rail to the entire cached CPU subsystem - this allows the CPU to be
completely turned off for long-term "sleep" modes of operation
• Power gated supply to the logic to support short-term leakage savings modes where the cache
memory can be left retained but all the leaky standard cell logic turned off locally
• Optionally support state-retention on registers in the standard cell portion of the design by some
form of retention power supply from the non power gated rail
• Non power-gated supply to the control circuitry and buffering that supports the power-gating fabric
itself and any state retention control signalling
Finally one will need a SOC-level supply that is always on to control the external rail switching handshake
with the power supply - and optionally include the on-chip power-gating support on the same rail.
Fig 4.4A illustrates the power networks, with an independent "VDDCPU" and an always-powered "VDDSOC"
sharing a common VSS ground connection. In this example the power-gated standard cell area
has a non-gated state retention supply shown to indicate an active supply rail within a power gated region.
External power rail switching
External power rail switching offers the best long-term leakage power savings - but introduces a significant
turn-on delay to allow voltage regulation to stabilize and settle within specification.
Only a few voltage rails can typically be externally switched; every power supply incurs (external) regulator
cost and area on the circuit board - usually inductors and capacitors required to implement high-efficiency
switched mode power supplies. Every power rail also requires on-chip power grid or ring supplies that cost
area and complicate the power planning and physical floor-planning. Most SOCs already have at least
three power rails:
• IO power (1.8/2.5/3.3V for external memory interfaces typically)
• "Always-on" SOC core rail (technology dependent logic and memory power rail)
• Clean analog power supply rail to PLLs etc.
• Optionally one may have a "keep-alive" voltage supply to a real-time clock
Adding more than a couple of external switched power rails adds significant complexity and end-product cost.
Typically a shared ground/VSS connection approach to the chip and board works best for external power
rail switching. Although there are typically independent VSS pins for both the IO pad-ring and the chip core,
in order to decouple the worst of any output simultaneous switching activity from the logic and memory, these
are typically grounded on the circuit board into a shared "0-volt" ground plane. Treating any other power
supplies as switched positive supplies relative to the common ground minimizes complexities when adding
power-gating. See Fig 4.4B.
Figure 4.5A - On-chip Virtual Rail Power-gating
[Block diagram of the SOC: "always-on" VDDSOC power rail, power gate control, power-gated "virtual VDD" power rail, power-gated sub-system, VSS ground rail.]
External power rail switching incurs significant delays on wake-up events - from the order of tens of
microseconds to milliseconds or even longer potentially. Seeking to specify much faster supply switching
times is not necessarily desirable - the inrush currents to re-charge all the capacitive nodes in the powered
down subsystem result in noise injection into other (powered) regions of the chip, and the resulting
"ground-bounce" in a shared ground system can introduce problems that are hard to quantify until very late in the
implementation and analysis parts of the design flow.
Translating such latencies into clock cycles at RTL level is not simple. Normally the clocks should be
suppressed until a switched power rail is stable and within specified tolerance. For a design operating in the
hundreds of MHz region this may be the equivalent of tens of thousands of clock cycles, with the actual
delays being highly dependent on the power supply technology (that may have to be multi-sourced in
production).
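The order of magnitude quoted above follows from simple arithmetic: a settling time in microseconds multiplied by a clock frequency in MHz gives a cycle count directly. The Python sketch below shows this conversion; the function name and the example settling times are illustrative assumptions, not PSU specifications from the thesis.

```python
import math


def wakeup_cycles(settle_time_us: float, clock_mhz: float) -> int:
    """Clock cycles to hold off while the rail settles (us x MHz = cycles), rounded up."""
    return math.ceil(settle_time_us * clock_mhz)


# Illustrative settling times for two hypothetical PSU sources at 300 MHz.
for settle_us in (100, 1000):
    print(settle_us, "us ->", wakeup_cycles(settle_us, 300), "cycles")
```

Because the supply may be multi-sourced, a hold-off counter sized this way would normally be made programmable rather than fixed at the value for one PSU.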
Separate power rails become a necessity when one introduces the concepts of dynamic voltage scaling
(Chapter 5). They may also be highly desirable in processor-based designs where large banks of memory can
be given their own supply, which may then be switched to intermediate RAM retention operating conditions,
for example.
Recommendations: 
• Minimize the number of external switched independent power rails - each one must be justified
from an end-product requirement given the associated additional power supply real-estate costs
and on-chip power distribution
• Switched (positive) supply rails with a common ground - common VSS mesh
• In systems implementing voltage scaling an independent rail must be provided for each, and the
power rail switching is simply mapped as an extreme case of the voltage scaling supply - to zero
Pit-falls:
• Design for significant external power rail switching times: tens or hundreds of thousands of clock
cycle latencies must be factored into wake-up and will be dependent on the external PSU
specifications
• Although multiple rails appear elegant from a system design perspective they introduce verification
and deployment challenges in production. Independent supply rails have independent voltage
control regulators, and independent rails can exhibit vastly different load regulation characteristics
when active, wait-stated or halted compared to logic powered at interfaces. Worst case conditions
across interfaces need careful verification
On-chip power-gating 
Given a restricted number of external (switched) power rails, on-chip power-gating further segments the
supplies to functional areas of the hierarchy.
The terminology often used to describe power gated supply rails is that of "virtual rails". A power gated
ground rail is a Virtual Ground or Virtual-VSS rail, while a power-gated supply rail would typically be
referred to as a Virtual-VDD rail. See Fig 4.5A.
Figure 4.6A - "Fine-Grain" Intra-cell Power-gating
[Cell schematic. Recoverable labels: "TURN ON" control, VSS ground.]
On-chip power-gating of a much smaller region of a design is potentially much faster but not instantaneous.
The current required to re-power a small power gated region is much less significant than that typical for a
full switched power rail, but time must be budgeted to manage the minimization of power-gating transients and
noise injection as seen by other logic and memory.
Therefore it is realistic to see power-gating in terms of numbers of clock cycles for very small regions and
tens or even hundreds of clock cycles for more significant gate counts. Trying to turn on a number of small
power-gated regions at the same time is no better than a large block, and this may lead to the requirement to
have the power-gating of multiple independent sub-blocks centrally managed in order to ensure the delay
times are minimized with respect to total sub-system demands.
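One simple form of central management is to serialise wake requests so that only one region draws inrush current at a time. The Python sketch below illustrates that policy only; the region names, the cycle counts and the strictly one-at-a-time rule are assumptions for illustration, and a real controller might overlap small regions within a current budget.

```python
def serialise_wakeups(regions):
    """Plan back-to-back turn-on of regions given as (name, turn_on_cycles) pairs.

    Returns (name, start_cycle, ready_cycle) tuples in the order supplied,
    which is taken here to be priority order.
    """
    plan = []
    cycle = 0
    for name, turn_on_cycles in regions:
        plan.append((name, cycle, cycle + turn_on_cycles))
        cycle += turn_on_cycles
    return plan


# Hypothetical regions and turn-on times in clock cycles.
plan = serialise_wakeups([("mac", 10), ("vfp", 25), ("dsp", 40)])
for name, start, ready in plan:
    print(f"{name}: switch on at cycle {start}, usable from cycle {ready}")
```

The last region in the list sees the sum of all earlier turn-on times as added latency, which is the cost that has to be weighed against the reduced supply noise.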
At the finest level of granularity the power-gating can even be factored into the standard cell library such that
each (high-leakage) gate has a local power switch built in series with the power rail inside the cell. In theory
the RTL designer needs to know nothing more than the fact that the cells have a very fast internal
power-gating function built in - but it turns out that the designer will in fact have to handle the timing or delay of
such functionality just as carefully to ensure that the implementation does not have over-constrained
switching demands placed on the power-gating functionality. See Fig 4.6A.
Power-gating has an impact on both performance and area, as will be discussed in Chapter 4, due to the nature
of the switching transistor fabric, so at the outset it must be understood that there may well be a
headline-performance limitation compared to implementations that sit on an externally regulated voltage rail.
Mapping on-chip power-gating to the cached processor example:
• Power rail to the cached CPU subsystem (externally switched potentially)
• Power gated supply to the CPU logic - a local Virtual-VDDCPU rail
• Non-power-gated supply to switching control sequencer and switch fabric buffering
• Non-power-gated supply to isolation regions and buffer trees
• Power gated supply to the CPU floating-point unit - a local Virtual-VDDVFP rail
• Power gated supply to the CPU Multiply-Accumulator - a local Virtual-VDDMAC rail
See retention support in a later section of this chapter for details of additional control buffering and state
preservation power requirements for non-power-gated supplies.
Recommendations: 
• Design for technology-dependent power-gating times: tens or hundreds of clock cycle latencies
may need to be factored into wake-up times dependent on the area switched and the switching
fabric control characteristics
• Therefore design for "wait-states" across boundaries where there are dynamically power gated
functional units such that the implementation-dependent delay times can be safely managed and
latency constraints set
Pit-falls:
• Every power-gated rail introduces verification and test challenges so the number of power gated
regions needs to be carefully justified and factored into project timescales
• Although multiple power-gated supplies appear elegant from a system design perspective they
introduce verification and deployment challenges in production. Each power gated rail requires
independent control sequencing, and introduces "wait-state" implications across boundaries that
require independent verification
Figure 4.7A - Basic (PMOS) Header Switch
[Schematic: PMOS switch in series with the VDD supply.]
Figure 4.7B - Basic (NMOS) Footer Switch
[Schematic: NMOS switch in series with the VSS ground.]
"Header" and "Footer" switches 
The transistor structures for power-gating are described in Chapter 4. The switches are highly technology
specific. A lot of the academic papers on "MTCMOS" - Multi-Threshold CMOS, where high-Vt switches are
used as series power switches for faster leaky low-Vt gates - advocate both P-channel "Header" switches
gating the VDD supply and N-channel "Footer" switches gating the VSS ground. However two such high-Vt
power switches in series with the gate do cause a more significant IR voltage drop in the supply as seen by
the gate. In many practical designs where performance cannot be sacrificed unduly, switching of only the
supply rail or the ground can be tolerated. The leakage current reduction comes from the first of any
switches added.
The basic Header Switch is shown in Fig 4.7A, footer switch in Fig 4.7B.
With Header switch fabric the internal nodes and outputs of a power gated block collapse down towards the
ground rail; with footer switches the internal nodes and outputs all charge towards the supply rail. There is
no guarantee the power gated nodes will ever fully discharge to ground or charge to the supply, as
equilibrium is reached when the leakage currents through the switches are balanced by the sub-threshold
leakage of the switched cells.
Recommendations:
• Only bother switching the supply rail or ground, rather than both, dependent on the appropriate
switch fabric characteristics for the technology in order to minimize the IR drop due to
power-gating
• Decide early on in the design phase whether header or footer switches most naturally fit with the
system design as this will affect the isolation and interface protocols across the power gated
interfaces
• Header switches are the most appropriate choice for switches if they are power-gating an
externally switched power-rail. This ensures that outputs behave the same whether power gated or
externally switched and the input parking condition is consistent regardless of the "depth of sleep"
• In systems implementing voltage scaling header-switches are favourable. Level shifters typically
share a common ground across interfaces, therefore using a virtual-VDD rail and a common
ground maintains consistency when it comes to clamping signals at boundaries whether scaling or
gating power supplies
Pit-falls:
• Seek to avoid "sneak paths" between driven inputs and header or footer power gated regions that
inject current into the inputs of switched regions if the "natural" discharge potential is the opposite sense
(e.g. active-high signals asserted into a header-switched region when it is powered down)
• Beware of mixing "Virtual-ground" power-gating with externally switched power rails or voltage
scaling. This results in sleep mode-dependent input and output states
Power-gatlng control networks 
The control networks for power-gatlng at the RTL level are not straightforward The control sequenclng 
needs to be explicit and match the technology-dependent control of the power sWitching fabnc The power 
gated portion of the design typically IS transparent to thiS, much In the way that scan testability IS not VISible 
In the RTL and Implementation-specific ports for scan data and enable(s) only appear later In the 
Implementation flow 
In the Implementallon such power control networks are typically hlgh-fanout nets that have Implied buffer 
trees Such buffer trees need to be powered from the pnmary rail not the power-gated virtual rail to ensure 
all power sWitches see valid and safe dnve signals If there are multiple control nets for phased tum-on of 
networks, as discussed In Chapter 5, then multiple control ports must be made explicit In the RTL design 
and careful hook-up of such control ports to the power-gatlng fabnc netlist supported In the Implementation 
and venficatlon methodologies 
Best practice RTL design will require the designer to ensure controllability of resets for testability. Ideally all derived or resynchronized resets (or presets) are multiplexed from an external controllable primary reset control pin or, with reduced test coverage, asynchronous reset assertion is inhibited when configured in a test access mode.
For similar reasons the designer needs to provide controllability of power-gating control networks. It is crucial that scan test patterns cannot accidentally toggle state machine outputs that activate power-gating of sub-systems. Power control signals therefore need to be gated or multiplexed when in test rather than functional mode, and optionally controlled externally if power-gating leakage analysis is to be performed on a tester.
From a verification standpoint, assertions and coverage should be added in order to validate the correct sequencing and polarity of the control networks.
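The test-mode gating described above can be illustrated with a small behavioural sketch. It is written in Python rather than RTL, and the names power_gate_request, fsm_request, test_mode and tester_override are hypothetical, introduced only for this illustration. The assumption is a simple multiplexer: in test mode the controller state machine output is ignored and an optional external pin takes over.

```python
def power_gate_request(fsm_request: bool, test_mode: bool,
                       tester_override: bool = False) -> bool:
    """Return the power-gating request that reaches the switch fabric.

    Functional mode: the controller state machine output passes through.
    Test mode: the state machine output is ignored, so scan patterns that
    toggle it cannot power down a sub-system. A hypothetical external pin
    (tester_override) may still force gating for leakage measurement.
    """
    if test_mode:
        return tester_override
    return fsm_request


# Functional mode: the controller is in charge.
print(power_gate_request(fsm_request=True, test_mode=False))
# Test mode: a scan pattern toggling the state machine output has no effect.
print(power_gate_request(fsm_request=True, test_mode=True))
```

An assertion in the RTL environment would check the same property: while test mode is active, the fabric control may only follow the external override.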
Recommendations and Pitfalls for power-gating control
Recommendations:
• Explicit control in the RTL controller needs to be tailored to the technology-specific power-gating fabric chosen.
• Assertions should be provided for the power-gating control ports, to match the chosen switch technology, to ensure functional verification and coverage in the RTL design environment before the power-gating fabric is implemented.
• Power-gating control signals must be made controllable during test.
Pit-falls:
• Uncontrollable power-gating control signals may well break testability later in the implementation flow.
• Hierarchical designs with power-gating networks that pass through power-rail switched regions have the potential to construct power-gating control networks that function in power-gating but fail with rail switching.
Figure 4.9A - Basic (AND) Clamp-Low Isolation cell
Figure 4.9B - Basic (OR) Clamp-High Isolation cell
Figure 4.9C - Basic (Pull-Down) Clamp-Low
Figure 4.9D - Basic (Pull-Up) Clamp-High
Figure 4.9E - Intra-cell clamp high in Fine-Grain footer-switched cell
Power domain interface signal isolation methods
Every interface to a power-gated region needs management of the signals that traverse this boundary, and outputs that are unsafe or un-driven are the primary concern.
Signal Isolation
Power-gated or externally switched power rails result in outputs that exhibit non-logic values. Worse than 'X' values in simulation, these un-driven outputs are unsafe inputs to non power-gated logic and can cause high currents to flow in the gates that constitute the receiving interface, and potentially may even "crow-bar" that supply rail.
When using header switches for power-gating, an AND-gate function which forces the power-gated signal low is what is required to park the output safely at logic '0'; with footer switches the desired function is that of an OR-gate function which parks the output at logic '1'. Clamp library cells must be designed to ensure they cause no sneak leakage paths when the input that is power gated floats, and typically they have extra attributes that EDA tools need to ensure these never get optimized away, buffered incorrectly or inverted as part of logic optimization.
Conceptual AND-style isolation clamp-low: when "VALID" is true, the signal passes to the output; when false, the output is clamped low, as shown in Fig 4.9A.
Conceptual OR-style isolation clamp-high: when "INVALID" is true, the output is clamped high; when false, the signal passes to the output, as shown in Fig 4.9B.
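The two gate-style clamps can be sketched as behavioural functions. This is a Python illustration rather than a library cell description, and the FLOATING marker and the function names are assumptions made for the example. An un-driven source is modelled as None so that opening a clamp onto a floating net is flagged rather than silently passed on.

```python
from typing import Optional

FLOATING = None  # hypothetical marker for an un-driven (powered-down) output


def clamp_low(signal: Optional[int], valid: bool) -> int:
    """AND-style clamp (header-switched source): park the output at 0."""
    if not valid:
        return 0                      # isolated: output forced low
    if signal is FLOATING:
        raise ValueError("clamp opened while the source is still un-driven")
    return signal                     # valid: signal passes through


def clamp_high(signal: Optional[int], invalid: bool) -> int:
    """OR-style clamp (footer-switched source): park the output at 1."""
    if invalid:
        return 1                      # isolated: output forced high
    if signal is FLOATING:
        raise ValueError("clamp opened while the source is still un-driven")
    return signal                     # not isolated: signal passes through


# While the source region is powered down the receiver sees a safe value.
print(clamp_low(FLOATING, valid=False), clamp_high(FLOATING, invalid=True))
```

The exception path corresponds to the sequencing error of releasing isolation before the gated supply has been restored.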
Special purpose clamp gates as described do add delay and potentially add to critical paths - for example on cache memory interfaces that interface to power-gated control logic. A low-cost alternative that does not add full gate delays appears to be that of a much simpler pull-down transistor clamp when using header switches, or pull-up transistor clamp when using footer switches, that forces the power-gated signal to the parked value without actually intercepting it. However this effectively introduces multiple drivers on the power-gated net that require careful sequencing to avoid contention when switching, and exclusive logical control at power up. Even if the pull-up or pull-down transistors are relatively weak devices the total number may become significant.
Conceptual pull-down style clamp-low: when "INVALID" is true, the output is clamped low; when false, the signal passes to the output, as shown in Fig 4.9C.
Conceptual pull-up style clamp-high: when "VALID" is true, the signal passes to the output; when false, the output is clamped high, as shown in Fig 4.9D.
In the simple fine-grain switched cell the same control signal can potentially be used to turn off the power footer switch, say, and enable the pull-up, as the layout and timing can minimize contention; it is not possible to guarantee this for layout-dependent networks. An example fine-grain cell with footer power switch and integrated output pull-up is shown in Fig 4.9E.
Finally, ideally one wants to isolate outputs until power-gating has stabilized to avoid output glitches. This is obviously logically possible with the clamp cells but not with pull transistors, which would fight the output values whenever these powered back up in an active state.
Therefore pull-up and pull-down clamps are not recommended for portable RTL design, despite the lower area and timing cost, and the "gate-style" cell styles are favored, and described in the rest of this section. However with careful use these have value in specialized situations where the signalling protocols are understood and the pull-up or pull-down clamps do not cause contention by design when enabled.
Figure 4.10A - Power-gating and clamping (power-gated region on a gated VDD supply: inputs forced "low" when isolated, outputs driven "low" (inactive) when isolated)
Output or Input Isolation 
As explained, it is the electrical problem of floating outputs that must be addressed. Clamping at the output or the input appears to make little difference functionally, but there are pros and cons either way.
Library isolation cells or clamps typically need to be placed in the power domain that remains powered relative to the power-gated region, and the EDA tools and implementation methodology for any control signal buffering or output drive strength optimization phases need to respect this.
Isolating or clamping a power-gated output to a safe level allows the cases of fan-out to multiple inputs to be handled cleanly in one place. Isolating the inputs in the powered domain may well result in duplication of the clamp functionality multiple times.
Therefore it is desirable to limit fan-out across power-gated interfaces to a minimum, which ensures clean implementation visibility of where signals may safely be buffered (after the isolation clamp, not before).
From the reusable IP perspective, the desirability of integrated clamps "within" the subsystem, such that the complications of isolation or non-logic value signals are hidden from the SOC-level integration interfaces, has been introduced earlier. From an RTL design perspective this requires that a "VDDSOC" power rail interface module is introduced into the design that can be mapped to the appropriate power domain, so that all the isolation clamps and any local high fan-out net buffering can be implemented in this region.
Interface Protocols 
Inputs to a power-gated region do not require explicit isolation for fear of electrical failure or breakdown, but applying clocks or driven inputs to a power-gated domain effectively wastes power by charging up some nodes that then leak back into the power-gated ground or current return path.
Therefore it is desirable to define the module boundaries between power-gated regions in the RTL design with interface signalling that maintains the best state-dependent leakage states across interfaces when isolated for power-gating.
Taking the example of power-gating regions using header switches, the preference is for active-high interface signal protocols. The de-assertion case then matches the preferred isolation clamp values for parking signals. Parking the clock at zero is ideal because then the clock network is driven low logically before power-gating, which then minimizes leakage currents into this high buffer strength network.
The only RTL design level exception to the active-high signalling preference for header-switched regions is that of resets. When powering back up after power-gating, registers will typically need to be reset (see the State Retention section later for qualification of this) to initialize state correctly. By having an active-low reset asserted through power-gating and the stabilization period when power is reapplied, correct behaviour is guaranteed. Most standard cell registers are available with active-low asynchronous set or reset control inputs so this is a natural fit. In fact as long as the module boundaries are specified in the RTL to have active-high signalling and active-low reset protocols then the appropriate buffer tree inversions can be handled in the implementation regardless of the actual library cell preferences. See Fig 4.10A.
For footer-switched power-gated regions the inverse is in fact true. With a virtual ground the protocols that fit naturally at the interfaces are active-low signalling and active-high asynchronous resets. Again, specifying this in the RTL interface descriptions then ensures that the implementation flow handles any inversions or preferences.
Recommendations and Pitfalls for power-gating interface isolation
Recommendations:
• Minimize the output fan-out across power-gated interfaces. Point-to-point connection from power-gated output to isolation receiver minimizes timing closure effort and buffering.
• Design for isolation cells on interfaces rather than pull-up or pull-down style clamps unless using very specialized interface protocols (where the "multiple-driver" challenges are worth the implementation complications).
Assuming a power-gated or externally switched power rail and common ground:
• Specify active-high signalling protocols on outputs, such that the "parked" or isolated condition is that of "zeros" on outputs, i.e. avoid active-low signalling protocols.
• Specify active-high clocking and signalling protocols on inputs, to avoid wasting power dissipation in the power-gated module, i.e. avoid active-low input signalling protocols.
• The exception to active-high signal levels is that of asynchronous resets. Active-low reset(s) that are cleanly asserted before the power is switched off and de-asserted after power is restored provide guaranteed initialization behaviour. (See the State Retention section for more complex reset and state restore factors.)
Alternatively, for a gated or switched ground and common supply rail:
• Specify active-low signalling protocols on outputs, such that the "parked" or isolated condition is that of "ones" on outputs, i.e. avoid active-high signalling protocols.
• Specify active-low clocking and signalling protocols on inputs, to avoid wasting power dissipation in the power-gated module, i.e. avoid active-high input signalling protocols.
• Active-high asynchronous resets fit most naturally with switched ground approaches and should be asserted before the power is switched off and de-asserted after power is restored and stable. (See the State Retention section for more complex reset and state restore factors.)
Pit-falls:
• Isolation clamps on outputs will behave as expected when locally power-gated. But if the power rail can be externally switched for alternative modes of power saving then these clamps could become un-powered and result in floating outputs despite the clamp in the design.
• Isolation clamps on clocks between voltage regions tend to force a single point of entry into a subsystem. Clock tree balancing on either side of the isolation clamp becomes an issue if there is a wide fan-out on the clock in the island that remains powered.
State retention and restoration methods 
State retention with power-gating is more challenging to support in a seamless way but can be highly desirable. Although in some cases one can afford to power off a subsystem, throw away all state and simply apply reset to re-initialize the block when power is reapplied, there is typically an energy or real-time cost in resuming operation.
Retaining the state of registers over a sleep/wake power-gated episode is both power efficient and may reduce wake-up latency, compared to losing all register state and having to reset the power-gated subsystem and run again without any contextual state built up over time.
How essential this is depends on the subsystem characteristics. A Digital Signal Processing unit that is primarily data-flow driven may usefully be able to start afresh supplied with new input data. However a peripheral or cached processor typically has a lot of residual state, and it may well waste time and energy to require a significant amount of bus traffic to reload the state that was lost.
Although one could use software approaches to reading specific register state and saving it away to memory, and subsequently reading it back and writing it back into the registers, which may be appropriate in some cases, this section focuses on hardware solutions that may be transparently overlaid on an RTL design.
State Retention with Power-gating, or SRPG as it is often referred to, builds on the basic power-gating described so far and is able to offer an illusion similar to that of clock gating: an RTL design resumes with the state it had at the last active event (the state just before the sleep event that initiated the power-gating).
RTL design relies on the principle of latches or registers that exhibit state preservation or retention between activation events. The explicit activation events in RTL are basically the clocks and resets into a process that infers registered state, and in the case of transparent latches, sensitivity lists that include enables and data terms for inferred level-sensitive storage, for example.
Clock gating can be overlaid on such RTL designs simply by intercepting the clock and suppressing clock pulses whenever the particular enable term for the register is not active. Provided the reset and enable terms have been coded cleanly for synthesis in the RTL description, the register state values exhibit the same behaviour when implemented with clock gating tools and implementation methodologies. There is an area cost associated with the clock gating latches that need to be implemented when gating clocks, but this is typically offset by the power saving that is gained by suppressing clock edges to flip-flops that have their clocks gated.
Retention Registers 
An elegant approach to providing state retention while power-gating is to replace a standard register with a special version that supports a locally "always-powered" island with some form of "shadow register" that can preserve and restore the register state between power sleep and wake events.
A D-type flip-flop is typically built as a master-slave pair of latches, often optimized for speed and minimum setup and hold times to the sampling clock edge. Library designers and vendors have come up with a number of variants of retention registers that trade off the performance and area costs to support state retention functionality. This may involve a third latch implemented in High-Vt technology with some extra control functionality to capture and restore state to the high-speed register core. Alternatively the slave latch may be isolated and become the retention node at some cost to the clock-to-output timing.
Figure 4.13A - Retention Registers
The retention box in the figure indicates the additional retention latch with associated control signals to support saving and restoring the register state.
Whatever the library element approach, some form of control interface must be added to the register to allow the suppression of clocks or state saving and restoring functionality, and this may well impose extra constraints on which state the clock must be in for save and restore functionality.
In the simplest approach one could imagine constraining the implementation to only using High-Vt registers, connecting these to the un-switched power rail, and simply power-gating all the leaky (Low- or Mixed-Vt) combinatorial logic between register stages from the power-gated power rail. However in any reasonably sized block the reset and clock networks typically have to be implemented with high-leakage low-Vt buffer trees which contribute a significant portion of the leakage for the block; as soon as these high-fanout nets are power gated then the clocks and resets float and would corrupt standard registers.
Real-world retention registers do all have some area overhead, typically being 20-30% larger. If they incorporate guard bands to isolate the retention state as robustly as possible from power-gating transients then the area increase may be much greater, potentially as much as 50%. In a design with a large proportion of registers the area impact can be significant.
In addition to the area overhead there is a control requirement as well. From a control IP perspective there are one or more extra signals required to sequence state capture and control, and rarely can these be shared with the power-gating control line: for safe operation the save state operation and, more problematically, the safe restoration of state back to the main storage element need to be safely away from the power-gating transients and unknown propagation.
Examples of control signal approaches include:
• a pair of level-sensitive signals of the form of Save and Restore
• a single edge-sensitive Retain control that captures on one edge and restores on the other
By way of example a conceptual register with conventional reset and scan multiplexing is shown in Fig 4.13A. A retention latch structure is incorporated - with a non-power-gated power supply connection not shown - and control signals to support copying the active register state into this shadow latch before power-gating, and restoring the state value after power-gating before restarting the clock.
Designers are used to the fact that registers with integrated scan multiplexers can automatically be substituted and hooked up later in the implementation flow. From the perspective of the power-gated region this is the ideal abstraction for full state retention. The control signals need to be implemented as always-on networks to avoid state corruption during periods of power-gating but otherwise can be treated transparently to the RTL design.
However the control IP must manage the explicit sequencing of the save and restore signalling as part of the power management control state machine.
One other detail needs understanding from the RTL design perspective. RTL coding for synthesis requires that asynchronous resets have precedence over the clock edge and clock gating terms; retention transparent to the RTL design requires that neither the clock nor reset is activated during retention. Given that both clock and reset trees are likely to be power gated and uncontrollable during power down for best leakage power saving, there is an underlying assumption that retention has priority over clock and reset. Evaluation of the available cells in the power-gating library is important to ensure the high-level retention scheme coded into the RTL design is not corrupted in implementation due to floating clocks and resets.
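The save/restore behaviour and the priority of retention over clock and reset can be made concrete with a small behavioural model. This is a Python sketch, not a description of any particular library cell: the class and method names are hypothetical, level-sensitive Save and Restore controls are assumed, and None stands in for a simulation 'X'.

```python
class RetentionRegister:
    """Behavioural model: main register plus always-powered shadow latch."""

    def __init__(self):
        self.q = None        # main (power-gated) storage; None models 'X'
        self.shadow = None   # retention latch on the un-switched supply
        self.powered = True

    def clock(self, d, reset=False):
        """Clock edge with asynchronous-reset precedence when powered."""
        if not self.powered:
            return           # floating clock/reset must not disturb retention
        self.q = 0 if reset else d

    def save(self):
        """Copy the live state into the shadow latch before power-gating."""
        if self.powered:
            self.shadow = self.q

    def power_down(self):
        self.powered = False
        self.q = None        # main storage loses its value

    def power_up(self):
        self.powered = True  # q stays 'X' until restore or reset

    def restore(self):
        """Copy the retained value back before the clock restarts."""
        if self.powered:
            self.q = self.shadow


reg = RetentionRegister()
reg.clock(1)
reg.save()
reg.power_down()
reg.clock(0, reset=True)     # spurious activity while gated is ignored
reg.power_up()
reg.restore()
print(reg.q)
```

The model deliberately ignores clock and reset while powered down; a real cell needs to be checked for this property as noted above.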
Partial and Full State Retention 
Full state retention has been described so far from a methodology perspective. As long as clean control sequencing is provided it is possible to make an RTL design behave transparently with respect to state retention. All state is guaranteed to be the same before and after power-gating.
Partial state retention appears much more attractive. If only the "architecturally visible" state is saved and restored then the associated retention register area cost should be much more acceptable. However this adds verification complexity to ensure that the interaction of retained and non-retained state is always legal and safe - and cannot result in state-machine deadlock or lost data of value.
Best practice RTL design¹ advocates storage that can be initialized - typically with asynchronous set or reset top-level signals. With partial retention this becomes mandatory because the state space is too large to prove that all X's can be safely flushed in a design from many retained state conditions as opposed to just from (power-on) reset.
In order to design RTL that is portable across different styles of retention register, where, for example, the precedence of reset may be higher than that of state retention controlled save and restore, it becomes important to separate out the reset signals explicitly for retention and non-retention storage. It then becomes possible to architect fully the "power-on-reset" full initialization and the restart function after power-gating, which restores retained state for retention areas and initializes all the non-retention registers that would otherwise come back with "X" values.
The power control sequencer for partial retention must therefore drive independent (named) resets to the appropriate portions of the subsystem. Some rigorous functional testing will be required to ensure that there are no illegal combinations of states that might cause deadlock.
A more subtle complication arises from a potential interaction with clock gating that will be implemented further down the design flow. All the state bits that make up clock gating enable terms need to be retained themselves or be re-initialized to a safe and restartable condition such that the transparent latch contents of integrated clock gating library cells can cleanly be regenerated - without the need for retention-enabled variants of clock gates as well.
State Retention using standard Scan-Flops
Scan chains that are implemented for manufacturing test are potentially reusable as a mechanism to emulate state retention without any register area overhead except for some extra control sequencing. It becomes possible to implement the design using standard multiplexed scan-flops and once state is scanned out the entire subsystem can be power gated off.
From an RTL design perspective there are of course challenges. The number of registers is typically only known after initial implementation (including optimization of constant value registers for example).
¹ Keating, M., "Reuse Methodology Manual" (see Bibliography)
Figure 4.15A - Retention using manufacturing scan chains (system inputs, multiplexed scan test and scan save/restore paths; scan chains configured for scan retention, with balancing flop(s) as required; save and restore state data)
Figure 4.15B - RTL Emulation of scan retention state
    `define CPU_SCAN_LEN 257 /* set to implementation length once known */
    `ifdef RTL_SLEEP_EMULATE
    parameter scan_reg_length = `CPU_SCAN_LEN;
    reg [15:0] scanword [0:scan_reg_length-1];
    integer i;
    initial /* initialize the scan chain to count pattern, or more draconian X */
    begin
      for (i=0; i < scan_reg_length; i=i+1)
      begin
        scanword[i] <= i; // or 16'hXXXX;
      end
    end
    always @(posedge CLK) /* emulate scan shift CPUSI -> CPUSO */
    begin
      if (CPUSE == 1'b1) /* when controlled SCAN ENABLE is active */
      begin
        for (i=1; i < scan_reg_length; i=i+1)
        begin
          scanword[i] <= scanword[i-1];
        end
        scanword[0] <= CPUSI[15:0];
      end
    end
    assign CPUSO[15:0] = scanword[scan_reg_length-1];
    `endif
Therefore the control sequencer needs to be parameterized to manage implementation-dependent counters, and explicit control of scan enable and scan chains needs to be provided that will later be hooked up with the netlist.
• The state needs to be saved somewhere, so an on-chip or off-chip area of memory is typically required, sufficient to hold the number of scanned bits required to hold the state during retention.
• There is a real-time delay cost in both saving and restoring state, and this grows depending on the size of the block to be scanned out and back in, and is a function of how many scan chains are to be implemented.
• If more than one chain is used (typically byte- or bus-width is more efficient) then it is necessary to add sufficient extra registers to balance up the scan chains such that they can share the same shift enable signal.
• There is also an energy cost in shifting out and back in the register state. Typically the patterns shifted are highly dependent on the condition the subsystem was in at the request for sleep. In the pathological case the worst-case patterns must be handled to ensure that the dynamic power and IR voltage drop constraints are not violated.
• For long term sleep the lower leakage power of power-gating or power-rail switching an entire subsystem, yet supporting state-restored continuation, can be highly desirable from a product perspective compared to the energy costs of restarting with all state reset after power-gating.
An example simplified to 4-bit save and restore data is shown in Fig 4.15A. If the number of scan registers does not divide exactly by the width of the retention scan data path then one or more flops must be added. When balanced, the state can be saved to memory ("write data") and later restored from memory ("read data") such that every register has the original state bit re-scanned.
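The balancing arithmetic is simple enough to sketch. The following Python helper is illustrative only: the function name and the register counts are hypothetical, and it assumes every chain must have the same length so that all chains can share one shift enable.

```python
def balance_scan_chains(num_flops: int, num_chains: int):
    """Return (chain_length, padding_flops) for equal-length scan chains.

    num_chains is the width of the retention scan data path; padding_flops
    is the number of extra balancing registers that must be added.
    """
    if num_flops < 0 or num_chains <= 0:
        raise ValueError("need a non-negative flop count and at least one chain")
    padding = (-num_flops) % num_chains          # flops needed to round up
    chain_length = (num_flops + padding) // num_chains
    return chain_length, padding


# Hypothetical block with 10 scan flops on a 4-bit save/restore path.
print(balance_scan_chains(10, 4))
# Hypothetical CPU with 4103 scan flops on 16 chains.
print(balance_scan_chains(4103, 16))
```

The chain length returned is also the number of shift cycles the control sequencer has to count for each save or restore operation, which is why it has to be a parameter of the controller.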
Functional testing and simulation at the RTL level before netlist implementation is a challenge, but not insurmountable.
One approach is to add some conditional code into the RTL design which is only compiled in for simulation when emulating scan-based retention. Behavioural shift registers are modelled and can be implemented with simple test sequences or even checksums to verify that exactly the same state is reloaded that was saved, for example.
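The checksum idea can be sketched outside the simulator as well. The Python model below is an assumption-laden illustration rather than the thesis flow: the class and function names are hypothetical, a word-wide shift register stands in for the parallel scan chains, and a CRC-32 from the standard zlib module is used as the checksum.

```python
import zlib


class ScanChainModel:
    """Word-wide behavioural shift register standing in for the scan chains."""

    def __init__(self, length: int, width: int = 16):
        self.width = width
        self.mask = (1 << width) - 1
        # Count pattern as the initial contents, so every word is distinct.
        self.words = [i & self.mask for i in range(length)]

    def shift(self, scan_in: int) -> int:
        """One shift-enabled clock: return scan-out, take scan-in."""
        scan_out = self.words[-1]
        self.words = [scan_in & self.mask] + self.words[:-1]
        return scan_out

    def checksum(self) -> int:
        nbytes = (self.width + 7) // 8
        data = b"".join(w.to_bytes(nbytes, "little") for w in self.words)
        return zlib.crc32(data)


def save_state(chain: ScanChainModel) -> list:
    """Shift the whole chain out into a memory image."""
    return [chain.shift(0) for _ in range(len(chain.words))]


def restore_state(chain: ScanChainModel, memory: list) -> None:
    """Shift the memory image back in, in the order it was written."""
    for word in memory:
        chain.shift(word)


chain = ScanChainModel(length=257)
before = chain.checksum()
image = save_state(chain)
chain.words = [0xFFFF] * len(chain.words)   # emulate state lost while gated
restore_state(chain, image)
print(chain.checksum() == before)
```

Writing the memory image in scan-out order and reading it back in the same order returns every word to its original position, which is the property the checksum comparison confirms.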
An example of providing an RTL model of a CPU to be implemented with 16 scan chains for retention support, which can allow early verification of the control functionality, is shown in Fig 4.15B.
At a later stage a netlist simulation should be performed to ensure that the implementation-specific scan chains and control signals really are wired up correctly and that the correct length scan chain has been implemented and balanced.
Recommendations and Pitfalls for state retention
Recommendations:
• State retention has the potential to offer very fast wake-up times after power-gating and allow transparent continuation. This has to be traded off against an impact on the implementation sequential cell area.
• Ideally implement full retention at the module level rather than for specific sequential processes in order to end up with a block-level verification approach.
• Ideally register all inputs to a state retention module or region as this then ensures that all internal clock gating enable terms can be regenerated after power-gating - which is essential to ensure clock gating behaves correctly in the implementation flow.
• Ensure that all registers in a full or partial retention design can be initialized. Registers without set or reset functionality can be tolerated in designs that can be exhaustively tested from power-on-reset, but with retention the state space grows beyond what can be verified.
• Retention controls must be made controllable during scan test.
• If partial retention is implemented then it is strongly advised that separate resets are coded for the retained and the non-retained storage portions of the design. This allows clean verification visibility of power-on reset and restore/re-initialize operation.
• When implementing partial retention ensure that state machines and sequencers have no dependencies on non-retained state, in order to avoid state-dependent deadlock or invalid state conditions. (The state space to verify can be enormous if many retention state values must be tested with non-retained state.)
• Where the area impact of specialized retention registers is too high then reusing the manufacturing scan chains is a realistic option. Although this requires some care to map cleanly onto the netlist implementation after test structures have been generated, this can be managed relatively cleanly in RTL-coded control state machines.
Pit-falls:
• Poor power-gating inrush current management or retention power supply noise has the potential to corrupt retention registers, resulting in unsafe/invalid state on restart. Great care must be taken in the RTL power control to ensure power is reapplied to powered-down blocks safely and gently.
• Partial retention requires much more rigorous reset and restore validation to ensure there are never deadlock conditions between maintained (architectural) state and re-initialized non-retained state.
• Clock gating enable terms that affect retention state themselves need to have retention registers on their entire fan-in state in order to ensure that "next state" sequencing behaves correctly.
• Using both edges of the clock in a design with retention is strongly discouraged. Clock gating elements have internal transparent latch structures and these potentially fail if power gated. As soon as both edges of the clock are used, some integrated clock gates will be required to retain their enable state rather than being able to resample this from logic terms based on retention or reinitialized registers.
• In the case of scan-based save and restore, care needs to be taken to ensure any bus-based serialization of data can support wait-states. Scan-enable is a very basic form of serialization control and any generic or reusable design requires the clock to be started and stopped on demand in order to ensure there is never any data loss on scanning state in and out.
Figure 4.17A - Power-gating Control sequencing (waveforms: CLOCK, N_ISOLATE, N_RESET, N_PWRON)
Figure 4.17B - Power-gating Control sequencing with Retention (waveforms: CLOCK, N_ISOLATE, SAVE, N_PWRON, RESTORE)
Power-gating control 
Having described the aspects of designing subsystems for power-gating without and with state retention, we turn to the explicit RTL that is required to control the additional functionality.
Power control sequencing
Putting together the power-gating and retention control requirements, the RTL sequencing details become clear.
To power gate a region without retention:
• Flush through any bus or external operations in progress
• Stop the clocks, in the appropriate phase to minimize leakage into the power-gated region
• Assert the isolation control signal to park all outputs in a safe condition
• Assert the power-gating control signal to power down the block
To restore power:
• De-assert the power-gating control signal to power back up the block
o Optionally sequence multiple control signals for phased power-up depending on the current inrush management approach and technology
• De-assert reset to ensure clean initialization following the gated power-up
• De-assert the isolation control signal to restore all outputs
• Restart the clocks, without glitches or violating minimum pulse width design constraints
The sequence is shown in waveform representation in Fig 4.17A.
To power gate a region with retention:
• Flush through any bus or external operations in progress
• Stop the clocks, in the appropriate phase to minimize leakage into the power-gated region
• Assert the state retention save condition (pulse or edge-triggered is technology dependent)
• Assert the isolation control signal to park all outputs in a safe condition
• Assert the power-gating control signal to power down the block
To restore power and retained state:
• De-assert the power-gating control signal to power the block back up
  o Optionally sequence multiple control signals for phased power-up depending on the
    current inrush management approach and technology
• De-assert reset to ensure clean initialization following the gated power-up
• Assert the state retention restore condition (pulse or edge-triggered is technology dependent)
• De-assert the isolation control signal to restore all outputs
• Restart the clocks, without glitches or violating minimum pulse width design constraints
The sequence is shown in waveform representation in Fig 4.17B.
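The retention sequences listed above can be sketched as a small Verilog state machine. This is a minimal illustration under stated assumptions, not the controller developed in the thesis: the module name pg_seq_sketch, all port names, and the one-clock-per-step timing are hypothetical, and a real design would hold each step until the technology-specific settling time (or an acknowledge) has been met.

```verilog
// Hypothetical sketch: fixed-order power-gating sequencer with retention.
// Polarities follow Fig 4.17B (active-low N_ISOLATE and N_PWRON).
// Each state name describes the outputs already driven in that state.
module pg_seq_sketch (
  input  wire clk,        // always-on controller clock
  input  wire nrst,       // controller reset, active low
  input  wire sleep_req,  // high: power down and stay down; low: wake up
  input  wire bus_idle,   // high when bus/external operations are flushed
  output reg  clk_en,     // enable for the gated region's clock gate
  output reg  save,       // retention save, active high
  output reg  n_restore,  // retention restore, active low
  output reg  n_isolate,  // output clamps, active low
  output reg  n_pwron,    // power switch control, low = powered
  output reg  n_reset     // reset to the gated region, active low
);
  localparam [3:0] S_RUN = 4'd0, S_CLK_OFF = 4'd1, S_SAVE = 4'd2,
                   S_ISOLATE = 4'd3, S_PWR_OFF = 4'd4, S_PWR_ON = 4'd5,
                   S_REL_RESET = 4'd6, S_RESTORE = 4'd7, S_UNISOLATE = 4'd8;
  reg [3:0] state;

  always @(posedge clk or negedge nrst) begin
    if (!nrst) begin
      state     <= S_RUN;
      clk_en    <= 1'b1;  save      <= 1'b0;  n_restore <= 1'b1;
      n_isolate <= 1'b1;  n_pwron   <= 1'b0;  n_reset   <= 1'b1;
    end else begin
      case (state)
        S_RUN: if (sleep_req && bus_idle) begin
          clk_en <= 1'b0;                       // stop the clocks
          state  <= S_CLK_OFF;
        end
        S_CLK_OFF:   begin save <= 1'b1; state <= S_SAVE; end       // save state
        S_SAVE:      begin save <= 1'b0; n_isolate <= 1'b0;         // park outputs
                           state <= S_ISOLATE; end
        S_ISOLATE:   begin n_pwron <= 1'b1; n_reset <= 1'b0;        // power down
                           state <= S_PWR_OFF; end
        S_PWR_OFF:   if (!sleep_req) begin
          n_pwron <= 1'b0;                      // power back up
          state   <= S_PWR_ON;
        end
        S_PWR_ON:    begin n_reset <= 1'b1; state <= S_REL_RESET; end  // release reset
        S_REL_RESET: begin n_restore <= 1'b0; state <= S_RESTORE; end  // restore state
        S_RESTORE:   begin n_restore <= 1'b1; n_isolate <= 1'b1;       // release clamps
                           state <= S_UNISOLATE; end
        S_UNISOLATE: begin clk_en <= 1'b1; state <= S_RUN; end         // restart clocks
        default:     state <= S_RUN;
      endcase
    end
  end
endmodule
```

All outputs are registered so that the power, clamp and retention controls change only on clock edges; clk_en is intended to drive an integrated clock gate so that the clock restarts without glitches.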
Figure 4.18A - Power-gating Control with Request and Acknowledge
[Waveform diagram; only the CLOCK trace label is legible in the source]
Handshake Protocols
As has been made clear earlier in this chapter, power-gating takes time, and safe switching and settling
times must be designed into the controller IP.
The simplest way is to embed synchronous delays or counters into the control IP to add in enough clock
cycles to meet the power-up or power-down times. However, embedding such time constants in the RTL
makes IP reuse or portability much harder, even when migrating a working product design onto a
next-generation technology node where the power-gating transistor characteristics affect timing.
For best practice it is advisable to design in both request and acknowledge (or "power valid") interface signals
to support double-edge synchronization handshaking between power being requested and subsequently
acknowledged as safe.
In simple small power-gated blocks one might just tie the power acknowledge back to the request.
In more aggressive designs, where analog IP is provided in the power control fabric to sense when gated
power rails are safe and within specification, the acknowledge is driven directly by these sensors. The advantage
is that one does not have to design for the worst-case delays, because the sensing can be adaptive to the
environmental conditions and to how partially discharged a power-gated region is.
The sequence is shown in waveform representation in Fig 4.18A.
Designing such acknowledge handshakes into the control IP leaves the implementation options as
wide open as possible and allows the selection of implementation technology to be handled later.
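The request/acknowledge scheme described above can be sketched in Verilog as follows. This is an illustrative sketch only: the module name pwr_handshake_sketch and its port names are hypothetical, the acknowledge is assumed to be asynchronous to the sequencer clock, and the block is assumed to start powered down at reset.

```verilog
// Hypothetical sketch: power request/acknowledge handshake.
// pwr_ack is treated as asynchronous and passed through a two-flop synchronizer.
module pwr_handshake_sketch (
  input  wire clk,      // sequencer clock (always-on domain)
  input  wire nrst,     // sequencer reset, active low
  input  wire want_on,  // power state wanted by the sequencer (high = on)
  input  wire pwr_ack,  // acknowledge from the switch fabric or rail sensor
  output reg  pwr_req,  // request to the switch fabric (high = power on)
  output wire stable    // high once the last request has been acknowledged
);
  reg ack_meta;  // first synchronizer stage
  reg ack_sync;  // second synchronizer stage, safe to use in logic

  always @(posedge clk or negedge nrst) begin
    if (!nrst) begin
      ack_meta <= 1'b0;
      ack_sync <= 1'b0;
      pwr_req  <= 1'b0;
    end else begin
      ack_meta <= pwr_ack;
      ack_sync <= ack_meta;
      // Interlock: the request may only change once the previous edge,
      // rising or falling, has been acknowledged. A wake-up that arrives
      // while the region is still powering down is therefore held off
      // until the power-down has completed.
      if (pwr_req == ack_sync)
        pwr_req <= want_on;
    end
  end

  assign stable = (pwr_req == ack_sync);
endmodule
```

For a simple small block, pwr_ack can be wired straight to pwr_req outside this module; with a sensed acknowledge the same logic adapts to the actual rail settling time without any coded delay.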
Recommendations and Pitfalls for power-gating controllers
Recommendations:
• Design the control sequencers or state machines with request and acknowledge handshakes for reuse
across different generations of power-gating technology
• Build in interlocks to ensure safe wake-up at the time when a region is in the process of power-gating
(i.e. synchronize both edges of power request and acknowledge to ensure the control does not
overrun the switch fabric and control networks)
• Any power-gating acknowledge timing is technology and network area dependent, so it should be
synchronized to the sequencer clock (unless such a low clock speed is used that this
asynchronous timing would never be a problem)
Pitfalls:
• Because powering down is typically dependent on semiconductor process and temperature, the
"hazardous" case is typically the time window in which an IP block is just being put to sleep when a
wake-up condition occurs before the block is fully power gated
• Partial retention of state requires great care, and for verification purposes will ideally require
independent resets for retention and non-retention register state
Figure 4.19A - RTL coding suitable for power-gating post-processing
always @ (posedge clk or negedge nrst) begin
  if (!nrst)
    state <= 4'b0101;
  else
    state <= next_state;
end
Figure 4.19B - RTL coding after power-gating post-processing
always @ (posedge clk or negedge nrst
`ifdef RTL_PG_EMULATE
          or negedge PWR
`endif
          ) begin
`ifdef RTL_PG_EMULATE
  if (!PWR)
    state <= 4'bXXXX;
  else
`endif
  if (!nrst)
    state <= 4'b0101;
  else
    state <= next_state;
end
Power-gating design verification - RTL simulation
Providing one uses a rigorous RTL coding style and a consistent naming scheme for clocks and resets, it
is possible to automatically annotate synthesizable RTL code with simple scripts to provide:
• functional modelling of power-gating (including forcing outputs to X when power gated)
• functional modelling of save and restore
• functional modelling of the precedence of power-gating/retention/reset
This is highly desirable for functional testing and allows test code and vectors to be generated for both the
RTL and the implementation net-list.
Inferring Power-gating Behaviour in RTL
Providing a consistent clean (RMM compliant, see footnote 3) design style has been used to code up a consistent
asynchronous reset (or set) and synchronous clocking style for all sequential statements in the RTL
subsystem, it is possible to script a conditional set of power-gating behaviours that allow rigorous
simulation modelling:
• Force 'X' on all state outputs when power-gated (potentially when not PWR_REQ or PWR_ACK)
• Ensure state is set to 'X' by power-gating to verify that explicit reset resets state after power-gating
• Model correctly the priorities of power-gating/reset/clocking to ensure correct sequencing
Providing the asynchronous "initialization" section of the RTL has been coded cleanly, it has proved
straightforward to augment the RTL to infer simulation behaviour to match the net-list,
e.g. RTL for synthesis shown in Fig 4.19A.
This can be automatically converted to code to support simulation testing with power-gating controller RTL,
as shown in Fig 4.19B.
When simulating with RTL_PG_EMULATE defined, the PWR power-gating (enabled) RTL input is added to the
sensitivity list and becomes the highest priority term in the sequential process descriptions, forcing
state unknown whenever power is removed (PWR de-asserted).
3 Keating, M. "Reuse Methodology Manual", (see Bibliography)
Figure 4.20A - RTL coding suitable for retention post-processing
always @ (posedge clk or negedge nrst) begin
  if (!nrst)
    state <= 4'b0101;
  else
    state <= next_state;
end
Figure 4.20B - RTL coding after retention post-processing
`ifdef RTL_PG_EMULATE
reg [3:0] state_SAVE = 4'bXXXX; // declare new state
wire PWR;
assign PWR = ...; // right-hand side not recoverable from the source
`endif
always @ (posedge clk or negedge nrst
`ifdef RTL_PG_EMULATE
          or negedge PWR or posedge SAVE or negedge NRESTORE
`endif
          ) begin
`ifdef RTL_PG_EMULATE
  if (!PWR)
    state <= 4'bXXXX;
  else if (SAVE)
    state_SAVE <= state;
  else if (!NRESTORE)
    state <= state_SAVE;
  else
`endif
  if (!nrst)
    state <= 4'b0101;
  else
    state <= next_state;
end
Inferring Power-gating and Retention Behaviour in RTL
Providing a consistent clean (RMM compliant) design style has been used to code up a consistent
asynchronous reset (or set) and synchronous clocking style for all sequential statements in the RTL
subsystem, it is possible to script a conditional set of power-gating and retention behaviours that allow
rigorous simulation modelling:
• Force 'X' on all state outputs when power-gated (potentially when not PWR_REQ or PWR_ACK)
• Sample state to extra inferred retention state variables for the "SAVE" operation
• Re-initialize state from retention state variables on the "RESTORE" operation
• Initialize retention state variables to 'X' to capture an invalid RESTORE before SAVE operation
• Model correctly the priorities of power-gating/retention/reset/clocking to ensure correct sequencing
Providing the asynchronous "initialization" section of the RTL has been coded cleanly, it has proved
straightforward to augment the RTL to infer simulation behaviour to match the net-list,
e.g. RTL for synthesis shown in Fig 4.20A.
This can be automatically converted to code of the form shown in Fig 4.20B.
When simulating with RTL_PG_EMULATE defined, the PWR power-gating (enabled) RTL
input is added to the sensitivity list together with an active-high SAVE and an active-low NRESTORE pair of
signals (in this example, chosen to match the retention library IP components). Added shadow state variables
are instantiated, and the save, restore and power-gating priorities are added to the sequential process
descriptions to model in RTL the underlying behaviour.
An example perl script is included in Appendix-A to show how the simple automation of adding this
template code to cleanly coded RTL has been achieved.
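The emulation behaviour described above can be exercised with a short simulation. The sketch below is an assumption-laden illustration rather than output of the Appendix-A script: it hard-wires the emulation terms of Fig 4.20B into a hypothetical module ret_reg_sketch (no `ifdef), drives PWR, SAVE and NRESTORE directly from a hypothetical testbench ret_reg_tb, and uses arbitrary unit delays.

```verilog
// Hypothetical sketch: 4-bit register with the Fig 4.20B emulation terms
// always enabled, followed by a small self-checking stimulus.
module ret_reg_sketch (
  input  wire       clk, nrst, PWR, SAVE, NRESTORE,
  input  wire [3:0] next_state,
  output reg  [3:0] state
);
  reg [3:0] state_SAVE = 4'bXXXX;  // shadow state, X until the first SAVE

  always @(posedge clk or negedge nrst or negedge PWR
           or posedge SAVE or negedge NRESTORE) begin
    if (!PWR)           state      <= 4'bXXXX;     // power removed: state unknown
    else if (SAVE)      state_SAVE <= state;       // sample into shadow state
    else if (!NRESTORE) state      <= state_SAVE;  // reload from shadow state
    else if (!nrst)     state      <= 4'b0101;     // normal asynchronous reset
    else                state      <= next_state;  // normal clocked update
  end
endmodule

module ret_reg_tb;
  reg        clk = 0, nrst = 1, PWR = 1, SAVE = 0, NRESTORE = 1;
  reg  [3:0] next_state = 4'b1001;
  wire [3:0] state;

  ret_reg_sketch dut (.clk(clk), .nrst(nrst), .PWR(PWR), .SAVE(SAVE),
                      .NRESTORE(NRESTORE), .next_state(next_state),
                      .state(state));

  initial begin
    #1 nrst = 0;  #1 nrst = 1;          // pulse reset: state = 4'b0101
    #1 clk  = 1;  #1 clk  = 0;          // one clock: state = next_state
    #1 SAVE = 1;  #1 SAVE = 0;          // save state into the shadow register
    #1 PWR  = 0;                        // power down: state forced to X
    #1 if (state !== 4'bxxxx) $display("FAIL: state not X while power gated");
    #1 PWR  = 1;                        // power up: state is still X
    #1 NRESTORE = 0;  #1 NRESTORE = 1;  // restore from the shadow register
    #1 if (state === 4'b1001) $display("PASS: state restored");
       else                   $display("FAIL: state = %b", state);
    $finish;
  end
endmodule
```

Removing the SAVE pulse from the stimulus leaves state_SAVE at X, so the final check fails; this is the "invalid RESTORE before SAVE" case that the X-initialized retention variables are intended to expose.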
Design For Test considerations
Standard best-practice considerations are primarily about providing test modes that make clocks and resets
externally controllable such that standard Automatic Test Pattern Generation tools can generate high
coverage test vectors.
The primary additional consideration for the system and RTL designer is to extend these DFT rules for
State-Retention Power-Gating. The additional questions that need to be posed are:
• Power-gating switch functionality - do the switches fully switch on and off?
• Isolation functionality - are isolation interfaces safely low leakage, buffered and driven correctly?
• Retention-register functionality - clean state save, retain and restore operation?
• Control signal connectivity for all the above - do controls and handshakes correctly control the
analogue switching?
Power Gates and DFT
Functional problems that need checking at implementation time are basic things such as control signal
polarity and ensuring that control buffers are all on the correct power domains, etc.
Manufacturing problems are tougher. Control buffer or switch transistor faults may lead either to some power
gates not being switched on properly, resulting in IR-drop "hot-spots" in the gated power rails to some cells,
or to one or more power gates being permanently on, which then causes leakage power modes to malfunction
and potentially results in product standby battery time failures.
As has been recommended earlier in this chapter, making (all) power-gating externally controllable ensures
that the power-gate controls do not get inadvertently switched when shifting in scan patterns.
At-speed testing is the only automated way of ensuring that power-gated regions behave correctly. Any poor
impedance or broken switches will tend to induce timing faults in critical paths. From a DFT perspective this
usually requires slightly specialized test control of clock switching between serial scan clocks that set up the
register state scenarios for path testing and the functional at-speed clocks that capture state at full
performance.
Providing a mechanism exists to turn off power-gates externally, a fairly coarse quiescent current
measurement test (IDDQ) may be possible to support in production testing where standby battery life is a
major concern. But such tests are expensive in terms of time, and given the leakage spread across
process on shrinking geometries this may have to be approached later in the assembled product test flow.
Functional test vectors to cover the control state machine sequencing are the best way of proving that the
control state machines have no breaks in connectivity or stuck-at-zero/-one faults. These are also very
valuable in the early implementation flow testing to make sure that the correct signal polarities between the
explicit RTL state machine controllers and the technology-specific power switching fabric have been
correctly understood and connected.
When more complex power gate control structures are employed, as in the case of sequenced slow weak
turn-on followed by strong low-impedance power-gating, functional test is required in order to ensure
that any switching transients are managed properly and avoid the potential corruption of retained state that is
described in the next section.
Isolation and DFT
Isolation or clamping across power-gated interface boundaries should be straightforward from a functional
test perspective. The implementation and verification flow must provide the power rail integrity checking that
buffer trees are implemented on the correct power supply, etc. The functional test coverage required from the
RTL perspective is to ensure that the correct sequencing is applied to the power-gated interfaces and that
the correct control signal polarity is used with the technology-specific isolation cells in the library.
There is only one extra requirement from the RTL design perspective: interface clamping signals are typically
controlled by state machine register outputs, so these need to be made controllable so that clamp
signals are not inadvertently toggled during scan test.
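One simple way to meet this requirement is a test-mode override on the clamp control. The following is only a sketch: the module name iso_ctrl_dft_sketch and the test_mode and test_n_isolate inputs are hypothetical names, and how they are driven (external pins or decoded test-mode inputs) is left open.

```verilog
// Hypothetical sketch: make a state-machine-driven clamp control
// directly controllable during scan test.
module iso_ctrl_dft_sketch (
  input  wire fsm_n_isolate,   // clamp control from the power state machine register
  input  wire test_mode,       // high during scan test
  input  wire test_n_isolate,  // value forced from a test control input
  output wire n_isolate        // clamp control sent to the isolation cells
);
  // In test mode the clamp control no longer depends on the contents of the
  // scanned state machine registers, so shifting patterns cannot toggle it.
  assign n_isolate = test_mode ? test_n_isolate : fsm_n_isolate;
endmodule
```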
State Retention
The state save, hold and restore functions are typically controlled by state machine register outputs. If
these are not made explicitly controllable, then shifting patterns through the sequencing logic on one power
domain may cause the scan flops to stop behaving as scan flops in the power-gated region.
Functional and manufacturing tests require proof that both zeroes and ones can be safely captured and
restored, and ideally the state can be serially scanned to prove retention latch integrity.
Therefore making the retention control mechanisms visible to the test tools is important. Mapping these to
external pins may not always be possible, so some form of coded test mode control inputs may be the best
alternative to allow ATPG tools to gain visibility and controllability.
Functional test vectors to cover the control state machine sequencing are the best way of proving that the
control state machines have no breaks in connectivity or stuck-at faults in the save and restore networks.
These are also very valuable in the early implementation flow testing to make sure that the correct signal
polarities between the explicit RTL state machine controllers and the technology-specific state capture and
restore functions have been correctly understood and connected.
As described in the preceding section, power-gating must be functionally sequenced correctly to avoid
sudden IR-drop transients that could corrupt retention registers inadvertently.
In the case of scan-based state retention this has to be tested with functional vectors to guarantee that the
scan-flops have been correctly balanced, that the scan chain lengths do indeed match the RTL-implemented
counters and that there are no arbitrary inversions or scan test control functions that have crept in from the
implementation flow. Finally, such functional test vectors are important to ensure that no unknown-value
propagation "leaks" into the net-list due to incorrect isolation or un-initialized storage.
Recommendations and Pitfalls for SRPG DFT
Recommendations:
• Clock and reset signals must be made externally controllable during test
• Power-gating control signals must also be made externally controllable during test
• Isolation control signals need to be made controllable during scan test
• Retention controls must be made controllable during scan test
• Functional test programs/vectors should be validated on an RTL emulation of the final design
• Support external IDDQ measurement on visible power rails in the case where "stuck-on" power gates could
potentially cause product malfunction in the end-customer system
Pitfalls:
• Power-gated quiescent current measurements can only be "relative" to full-on current
measurements due to the wide spread in leakage currents across the fabrication process
Figure 4.23A - Power State Machine design for SALT90G project
[State diagram. Run states: "TURBO" (wells forward-biased), "NORMAL" (standard), "POWERSAVER" (wells back-biased). Idle states: "HALT" (clocks stopped), "SNOOZE" (state retention with power gating), "HIBERNATE" (scan retention, external power rail switching), "SHUTDOWN" (external power rail switching of CPU and cache RAM). Transitions are labelled with "SLEEP" and "WAKE" events; a RESET entry is also shown.]
Power-Gating/State-Retention in the SALT project
The SALT technology demonstrator project was used to evaluate and determine the best-practice
approaches to power-gating and state retention that have been described in this chapter. In this section
more details on the system design and RTL for this 90nm technology node are described, as well as some of
the lessons learned that have been described more generically in the pitfalls to beware of.
Leakage modes supported
Most battery-powered ARM-processor based designs have to deal carefully with the balance between
sufficient performance to support product features, which tends to require more leaky process technology,
and the need to provide a number of leakage power reduction "stand-by" modes that trade off depth of
sleep against real-time penalties and the cost of energy to enter and exit such power states. For the SALT project
four low-power modes were designed, listed here in order of increasing leakage savings (and correspondingly
increasing real-time and recovery energy penalties):
• HALT: SLEEP causes architectural clock gating, with fast WAKE restart
• SNOOZE: State Retention Power-gating entry on SLEEP, WAKE as fast as power-gating safely
allows. The cache memory state is preserved, logic is power gated
• HIBERNATE: scan-based state save to memory, allowing VDDCPU power rail switch-off on
SLEEP, and scan-based serial restore on WAKE when the power rail is switched back on. A 32-bit
AMBA-based bus protocol was used to block write to the SOC memory-map and block read back. A 32-bit
CRC was added to provide integrity checking; it is saved away with the scanned data and used to
protect against restarting with corrupted state via an error mechanism. The cache memory state is
preserved, logic is powered down
• SHUTDOWN: The only mode not transparent to the operating system. Explicit cache clean code
must be called to write back any dirty data in the cache memories before both the VDDCPU and
VDDRAM supplies can be power-rail switched
In addition, support for active leakage reduction was included to support externally managed threshold
scaling using back-bias control. Both P- and N-wells for the CPU standard cell area were exposed at
pin-level to support experimental analysis of delay and leakage power characteristics.
No "multi-dimensional" delay models for both supply and dynamic well-bias threshold scaling were available
or feasible to generate, so a set of delay de-rating curves were derived from characterizing the library cells
while sweeping the P and N well bias voltages, and from this a realistic constrained upper frequency limit
was determined for the CPU in this "power saver" mode of operation. In addition a "Turbo" mode of
operation with some forward bias was also evaluated to support a higher-leakage mode of operation for
(short-term) peak performance operation. The latter mode of operation would need on-chip thermal
management to avoid potential thermal runaway, but was valuable as an experimental implementation test
case.
The system design was then enhanced to force state changes between "normal" and back- or forward-biased
modes of active power management to go through HALT/SNOOZE/HIBERNATE state changes, in
order to ensure that well-bias voltages were only changed while the design was static and un-clocked, as
shown in Fig 4.23A.
Figure 4.24A - Power domain partitioning for SALT90G project
[Block diagram. VDDRAM domain: 21 x Cache/MMU RAMs. VDDCPU domain: SRPG std-cell CPU with dynamic threshold scaling support (well-bias pins VBBCPU-PWELL and VBBCPU-NWELL). VDDSOC domain: bus interface with clamps and scan retention control, CPU leakage management control state machine, system clock generator (CPU CLK, BUS CLK), OTG leakage management control state machine, and SRPG std-cell USB-OTG core.]
Figure 4.24B - IP re-partitioning for SALT90G project CPU subsystem
[Block diagram. VDDRAM domain: 21 x Cache/MMU RAMs. VDDCPU domain: SRPG std-cell CPU with dynamic threshold scaling support (VBBCPU-PWELL, VBBCPU-NWELL). VDDSOC region grouped with the CPU: bus interface with clamps and scan retention control, and CPU leakage management control state machine. External connections shown: CPU CLK, BUS CLK, bus interface, RESET.]
Design partitioning
The RTL design was partitioned to allow the three primary power supplies to be mapped to the RTL design:
• VDDSOC is the "always-on" supply that powers the digital side of the PLLs, the clock generators
and the power management control blocks, plus all the real-time peripherals including the real-time
clock and timers that can generate wake-up events as part of their interrupt service requests
  o Within this power domain the USB OTG subsystem is power gated to evaluate RTL
    approaches to isolation and power-gating control (state retention in local RAM)
• VDDRAM is an external switched power rail that supplies the Cache and MMU RAMs. In this
project this also allowed detailed leakage and active power consumption profiles to be measured
and allowed (limited) reduced voltage headroom retention analysis
• VDDCPU is an external switched power rail that supplies the CPU standard cell area. Support was
included for both full state-retention with power-gating, providing fast local leakage reduction, and
scan-based full state save and restore, enabling external power rail switching to cut leakage
completely. In this project this separated supply also allowed detailed leakage and active power
consumption profiles to be measured, as well as the energy cost functions to get in and out of each
power saving state
See Fig 4.24A.
Subsequently the CPU sub-system design was re-implemented to ease design re-use. The Bus Interface
and State Retention/Power-Gating controller are integrated into a "VDDSOC" region that is grouped with the
CPU such that only VDDSOC interface signals are visible to the SOC-level design. Although this makes the
4-supply-rail CPU subsystem slightly more complex to implement, the timing and internal power-gating and
isolation interfaces are then all abstracted away from the top-level SOC design. Any changes or
enhancements to the low-power states supported by the IP block are properly independent of the top-level
system design, providing the wait-state and handshake protocols with the top-level clock generator are
cleanly defined. See Fig 4.24B.
Power-gating control and handshakes
The CPU power-gating control system was initially designed as a top-level module that managed the
interfaces to the external power supplies, local header-switch power-gating control, isolation, state retention
and the handshake with the system clock generator to switch CPU clock frequencies and manage bus-clock
synchronous scan clock pulses for hibernation save and restore functionality.
In the state diagram included below the "SNOOZE" states are all labeled "LSLEEP" as the light sleep SRPG
control, and the "HIBERNATE" states are labeled "DSLEEP" for the deep sleep control flow.
In order to support a wide range of power management library components and current in-rush management
experiments, every control signal, whether for power-gating, isolation, save, restore or even reset, was
driven as a request signal and had an explicit acknowledge signal. All the acknowledge signals were treated
as asynchronous and had local synchronizers to the state machine clock domain.
This ensured the design was free of locally coded delays or counts and allowed the acknowledge signals to
be tied directly to the requests for some implementations, or built as true handshakes when serially buffered
nets were implemented for some control schemes, which also provided some form of integrity check on
control signal connectivity.
Figure 4.25A - [State diagram of the CPU power-gating control state machine for the SALT project; the caption text is not legible in the source. Each state lists the values of the outputs EXTVDD, PWRGATE, NRESET, NRESTORE, SAVE, HIBERSCAN, HIBERSAVE, NISOLATE and CLKEN. Legible state labels include INIT and "HIBERNATE".]
A couple of design notes that may prove interesting:
• The initialization sequence pulses the save/restore signalling to flush out any X-s from the shadow
retention flops. This may be useful when running functional test programs or vectors on the net-list
• All timing-dependent state machine transitions include a holding term that waits for the output
asserted in that state to be acknowledged, in order to maintain the timing-independent
request/acknowledge sequencing
• The power-gating assertion and de-assertion is "demand-driven". In the SALT project there were
extra diagnostic control inputs to control the switch fabric, which allow the power-gating to be
soft-sequenced or forced fully on and off, and only the power-gating acknowledge input to the state
machine is used to determine when power is safely restored
See Fig 4.25A.
Isolation
Several different isolation techniques were employed in the project.
The VDDRAM region had input isolation cells instantiated as "Generic Library Cells", which are in fact
wrappers for either behavioural simulation models or technology-specific clamp cells from the underlying
"Power Management Kit" for the standard cell library. This provided explicit instantiation of the cells on the
(many) critical path signals into the memories from the CPU core logic, and supported clean visibility of the
clocks and select signals that need to be managed carefully in the implementation flow. Clock balancing
across isolation cells is not straightforward, as such cells typically limit the flexibility the clock buffering tools
have to restructure the buffer trees.
The VDDCPU region included specialized output isolation cells that pulled down all output signals at the interface
when locally power-gating, to guarantee clean SOC interface signals. However, when the CPU rail is
switched off (Hibernate) these isolation cells lose their VDDCPU power and the outputs could again float.
Simple bus repeater or "hold" cells were added in the VDDSOC interface to avoid any further gate delays on
the interface, and the bus interface module had explicit resets asserted by the isolate control signal to force
logic-0 clamping of all bus interface protocol signals.
Separate RTL isolation signals were used for the RAM and CPU regions in order to avoid the potential
problems of routing an "always-on" clamp control signal through the potentially power-rail-switched CPU
to get to the RAM, even though both signals were driven by the same state machine output port.
The OTG block used the alternative of instantiated "AND" gate cells in the RTL, with suitable "don't-touch"
attributes added to prevent logical optimization across these isolation boundaries.
Subsequently these isolation interfaces have been managed completely transparently to the RTL by adding
suitable isolation attributes at the module level, as supported by later EDA tools, but for basic functional
simulation and verification there is merit in wrapping the clamp functionality in a technology-independent
wrapper module, which allows EDA-independent design portability to FPGA and SOC tool flows, for example.
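A technology-independent clamp wrapper of the kind described above might look like the following sketch. Everything named here is an assumption for illustration: the module name iso_clamp_low_sketch, the USE_TECH_CLAMP_CELLS macro, and the TECH_CLAMP_LOW cell with its pin names stand in for whatever the chosen library's power management kit actually provides.

```verilog
// Hypothetical sketch: technology-independent clamp-to-zero isolation wrapper.
module iso_clamp_low_sketch #(
  parameter WIDTH = 1
) (
  input  wire [WIDTH-1:0] d,          // signals from the power-gated region
  input  wire             n_isolate,  // active-low clamp control (always-on domain)
  output wire [WIDTH-1:0] q           // clamped signals seen by always-on logic
);
`ifdef USE_TECH_CLAMP_CELLS
  // Technology mapping: one library clamp cell per bit.
  // TECH_CLAMP_LOW and its pin names are placeholders, not a real library cell.
  genvar i;
  generate
    for (i = 0; i < WIDTH; i = i + 1) begin : g_clamp
      TECH_CLAMP_LOW u_clamp (.A(d[i]), .NISO(n_isolate), .Z(q[i]));
    end
  endgenerate
`else
  // Behavioural model for RTL simulation and FPGA flows: an AND-style clamp
  // forces the outputs to logic 0 while n_isolate is low, even when d is X.
  assign q = d & {WIDTH{n_isolate}};
`endif
endmodule
```

Because the wrapper boundary is explicit, the synthesis "don't-touch" attribute and the choice of clamp cell can be applied in one place, while the surrounding RTL stays unchanged between FPGA and SOC flows.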
Figure 4.26A - RTL post-processing for single-pin NRETAIN ret'n flops
`ifdef RTL_PG_EMULATE
reg [3:0] state_SAVE = 4'bXXXX; // declare new state
wire PWR;
assign PWR = ...; // right-hand side not recoverable from the source
`endif
always @ (posedge clk or negedge nrst
`ifdef RTL_PG_EMULATE
          or negedge PWR or negedge NRETAIN
`endif
          ) begin
`ifdef RTL_PG_EMULATE
  if (!PWR)
    state <= 4'bXXXX;
  else if (!NRETAIN)
    state <= state_SAVE;
  else
`endif
  if (!nrst)
    state <= 4'b0101;
  else
    state <= next_state;
end
`ifdef RTL_PG_EMULATE
always @ (negedge NRETAIN) // capture state on falling NRETAIN
  state_SAVE <= state;
`endif
Retention 
The SALT project incorporated a number of techniques to evaluate the area/time/energy cost functions for 
different approaches to allow comparison and analysis on the same silicon, as many designers 
Full state retention was implemented for the CPU. Given a fully validated CPU core the only safe approach
to verifying that the processor could be restarted with arbitrary control and data state was to maintain every
register bit state. The option of adding a separate reset or reinitialize signal for "non-architectural" state (i.e.
the programmer's model) was not feasible without a serious verification project phase.
In order to understand the soft-error effects of uncontrolled turn-on power-gating, a non-real-time diagnostic
mechanism was introduced into the power control sequencing. This allowed SRPG save and restore to
be controlled normally, and also re-used the "Hibernate" scan functionality to checksum and save away
the entire register state after the SAVE operation and then checksum and save away the entire register contents
after the RESTORE operation. This allowed error analysis for both random and location-sensitive problems, and
of the efficacy of the soft-start power-gating sequencing when not overridden. This turned out to be a valuable
way of quantifying the safety margins of the retention flops and allowed them to be subjected to thermal and
voltage shocks while in retention mode.
On the other hand the USB OTG core is partitioned with persistent USB end-point data held in SRAM while
all state in the control state machines and FIFO interfaces to the PHY is transitory. Therefore it is not
required to provide retention registers for the power-gated logic, provided the control state machine
guarantees to flush any transmit or receive buffers to the PHY interface before power-gating, and
reinitializes all register state when restoring gated power.
The RTL annotation scheme was adapted to force Xs at outputs on power-gating and on all register state
to ensure correct reset sequencing when waking up after power-gating. The OTG block can indicate that it is able to
sleep so that a device driver knows when it is power-gating the core, but the block can be awoken by USB
activity detected in the PHY. Verifying that the associated tight real-time constraints were met, so that retention
was not required, took a significant amount of simulation work to prove that the sequence from power-gating through
re-initialization to restart was within the USB specifications.
Inferring Power-gating and retention behaviour to RTL for single-pin control retention flops 
In the SALT project the retention register library elements had a single-pin control to manage save and
restore in an edge-triggered manner. Retention state is captured on the falling edge of an active-low
NRETAIN signal and restored on the rising edge of the same NRETAIN.
Building on the same worked example described in section 4.7, the retention register intent could be inferred
on the synthesizable RTL by post-processing the source files (see Figure 4.26A).
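The edge-triggered save and restore behaviour described above can be sketched as a small behavioural model. This is a minimal illustration in Python rather than the SALT library cell or the post-processed RTL itself: the class name RetentionFlop, its method names and the use of None to stand in for an unknown (X) value are assumptions introduced here, and the reset value simply mirrors the worked example.

```python
X = None  # stands in for an unknown (Verilog X) value; an assumption of this sketch


class RetentionFlop:
    """Toy model of an edge-triggered, single-pin retention register."""

    def __init__(self, reset_value=0b0101):
        self.reset_value = reset_value
        self.state = X        # main register contents, lost when power is gated
        self.saved = X        # contents of the always-on retention latch
        self.nretain = 1      # active-low control, high in normal operation
        self.powered = True

    def reset(self):
        # Reset is only effective when powered and not in retention.
        if self.powered and self.nretain:
            self.state = self.reset_value

    def clock(self, next_state):
        # Clock edges only advance state when powered and not in retention.
        if self.powered and self.nretain:
            self.state = next_state

    def set_nretain(self, level):
        if self.nretain and not level:
            # Falling edge: capture the current state in the retention latch.
            self.saved = self.state
        elif not self.nretain and level and self.powered:
            # Rising edge with power present: restore the saved state.
            self.state = self.saved
        self.nretain = level

    def set_power(self, on):
        self.powered = on
        if not on:
            self.state = X    # power-gated logic loses its contents


# Usage: save, power down, power up, restore.
flop = RetentionFlop()
flop.reset()
flop.clock(0b1001)
flop.set_nretain(0)     # save on falling edge
flop.set_power(False)   # main register contents become unknown
flop.set_power(True)
flop.set_nretain(1)     # restore on rising edge
```

The model deliberately ignores timing and simply gives retention priority over reset and clocking, which matches the priority order used in the RTL emulation of Figure 4.26A.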
Figure 4.27A - RTL post-processing for single-pin NRETAIN ret'n flops
// main SOC interface
.pwr_override(diag_pwr_override),
.pwr_req(cpu_pwr),
.pwr_ack(cpu_pwr_ack),
.save(cpu_save),
.nrestore(cpu_nrestore),
`ifdef RTL_PG_EMULATE
.cpuclk(clk_cpu & ~(|diag_pwrstate[12:11])),       // inhibit clock edges
.cpuresetn(nrst_cpu | (|diag_pwrstate[12:10])),    // inhibit asynch reset
.dbgntrst(nrst_cpudbg | (|diag_pwrstate[12:10])),  // inhibit asynch reset
`else
.cpuclk(clk_cpu),
.cpuresetn(nrst_cpu),
.dbgntrst(nrst_cpudbg),
`endif
Emulating state-retention power-gating of RTL sub-systems
If one does not have access to the source RTL, which is often the case for many designers working with
high-value IP components, then this is a real issue. For many ARM-based designs the end-customer works
with a pre-compiled/obfuscated technology-independent behavioural model of the CPU and only later
switches in the detailed technology-specific version that may be provided from an internal IP implementation
group or a Foundry, for example.
So an alternative approach, when working with subsystems that have encrypted or protected IP models, is to
intercept the clocks and resets to the module and emulate retention and state retention in the early design
and verification flow.
In the SALT project the leakage management power controller had some high-order flags in the state field
that reflect whether the processor is power gated or in hibernation-scan control. In this case a conditional set
of connections to the clock and reset(s) is coded which inhibits the clock (AND out the clock term) and
the active-low resets (OR in a reset inhibit) such that the RTL or event-driven model has all clock and reset
transitions inhibited during the power-gating and scan sequencing that would otherwise spuriously advance
or re-initialize state.
See Fig 4.27A.
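The clock and reset interception can be illustrated with a few lines of bit-level logic. The following Python sketch mirrors the AND/OR gating described above using the bit ranges that appear in Fig 4.27A; the function names reduce_or and gate_cpu_pins are hypothetical, and the meaning of the individual power-state bits is not given in the text, so only the reduction-OR of each range is modelled.

```python
def reduce_or(value, hi, lo):
    """Return 1 if any bit of value[hi:lo] is set (like Verilog |value[hi:lo])."""
    width = hi - lo + 1
    return 1 if (value >> lo) & ((1 << width) - 1) else 0


def gate_cpu_pins(clk_cpu, nrst_cpu, nrst_cpudbg, pwrstate):
    """Inhibit clock edges and asynchronous resets while the power-state flags are set."""
    clk_inhibit = reduce_or(pwrstate, 12, 11)   # flags that stop the clock
    rst_inhibit = reduce_or(pwrstate, 12, 10)   # flags that mask the active-low resets
    return {
        "cpuclk": clk_cpu & (1 - clk_inhibit),      # AND out the clock term
        "cpuresetn": nrst_cpu | rst_inhibit,        # OR in a reset inhibit
        "dbgntrst": nrst_cpudbg | rst_inhibit,      # OR in a reset inhibit
    }


# Usage: with no flags set the pins pass through unchanged;
# with bit 11 set the clock is held low and the resets are held inactive.
normal = gate_cpu_pins(clk_cpu=1, nrst_cpu=0, nrst_cpudbg=0, pwrstate=0)
gated = gate_cpu_pins(clk_cpu=1, nrst_cpu=0, nrst_cpudbg=0, pwrstate=1 << 11)
```

Because the gating sits entirely outside the protected model, the same wrapper connections apply whether the CPU is an obfuscated behavioural model or the technology-specific netlist.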
5. Physical IP for Low Power Design
The DVS926 and follow-on ULTRA926 projects drove the specification both of library components
for dynamic power management and of the extended voltage scaling characterization requirements for
the full standard cell libraries and memory compiler technology.
For the 90nm and 65nm node technology demonstrators, leakage mitigation was the primary
challenge and the ATLAS926 and SALT926 projects provided vehicles to evaluate both fine-grain
(intra-cell) power gating and coarse-grain shared power switch networks respectively, as well as a
number of different state retention approaches.
All the chips, with the exception of the TSMC ATLAS926 project, were implemented with physical IP
from Artisan Components. Artisan was a collaboration partner for the first three projects, but had
been acquired by ARM before the SALT926 project, so the latter project allowed cell-level
engineering and experimentation rather than being constrained to customer deliverables and views.
Library IP Support for Dynamic Voltage and Frequency Scaling 
The primary components specified and negotiated as add-on cells to the Standard Cell libraries
include:
• Low-to-High Level Shifters
  o Active components with dual VDD supply rails to provide correct voltage-swing
    drive from a lower supply voltage to a higher supply rail
  o These are often instantiated on critical timing interfaces (between cache RAMs and
    core logic in a CPU for example) so need to be as fast and efficient as possible
  o From a methodology perspective these components were specified to be placed in
    the (output buffer) driver voltage domain - the rail with the higher current supply
  o Optionally specified either with or without guard-bands between the voltage
    domains; the area overhead is significant with guard-bands but for a number of
    customers this is preferred to smaller shifters without the guard-band safety
• High-to-Low Level Shifters
  o In fact simply re-characterized cells that overdrive a weak voltage domain in a
    lower supply domain from a higher voltage domain
  o No "bidirectional" capability of supporting arbitrary High/Low to High/Low shifting
  o Rigorous methodology enforced to ensure direction always clear and defined
• Isolation or "Clamp" Gates
  o Support power-down regions for isolating interface signals from floating nets
  o AND-type functionality chosen to clamp active-high signals to inactive
  o This matched most naturally the shared-ground multiple VDD shifter architecture
• Level shifters with integrated isolation clamping
  o A valuable "boundary" EDA cell variant for DVS interfaces with power-down
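The rule that the shift direction must always be clear and defined can be expressed as a simple check on the operating ranges of the two domains. The sketch below is a hypothetical illustration only: the function name boundary_cell, the returned descriptions and the millivolt ranges used in the usage lines are assumptions made here and are not taken from the project libraries or flow.

```python
def boundary_cell(src_range_mv, dst_range_mv, src_can_power_down=False):
    """Pick the kind of boundary cell for a signal crossing between two domains.

    Each range is (min_mv, max_mv) over all operating points of that domain.
    A crossing whose direction can change is rejected, since no bidirectional
    shifter is available.
    """
    src_min, src_max = src_range_mv
    dst_min, dst_max = dst_range_mv

    if src_min == src_max == dst_min == dst_max:
        shifter = None                       # same fixed voltage: no shifting needed
    elif src_max <= dst_min:
        shifter = "low-to-high shifter"      # source never above destination
    elif src_min >= dst_max:
        shifter = "high-to-low shifter"      # source never below destination
    else:
        raise ValueError("overlapping ranges: shift direction is not defined")

    if not src_can_power_down:
        return shifter
    if shifter is None:
        return "isolation clamp"
    return shifter + " with isolation clamp"


# Usage with illustrative voltages: a scaled CPU domain driving a fixed RAM domain.
cpu_to_ram = boundary_cell((900, 1200), (1200, 1200), src_can_power_down=True)
ram_to_cpu = boundary_cell((1200, 1200), (900, 1200))
```

Rejecting overlapping ranges outright, rather than guessing a direction, reflects the methodology point that an ambiguous crossing is a design error to be resolved at the architecture level.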
Figure 5.1A - Novel Buffered Header Power Switch
Figure 5.1B - Novel Buffered Header Power Switch - double height
(Layout annotations: re-buffered nSLEEPOUT; N-well connection by abutment; P-well connection by abutment; VVDD grid connection by abutment; nSLEEP grid control by abutment, with nSLEEPIN input at the bottom)
From a memory perspective, where typically the voltage scaling headroom was negligible or
certainly less than the standard cell library voltage scaling range:
• RAMs with integrated input level shifters
  o Emulated by "pseudo-hardening" discrete Low-to-High level shifters with integrated
    isolation clamps around standard compiled RAM until compilers could be
    augmented to tile such shifters automatically
  o Proved invaluable to get good multi-voltage timing closure across independent
    logic and memory domains - any latency variation between voltage regions
    minimized by design.
Standard cell libraries required extended voltage characterization:
• Most simply an extra set of "Non-linear Delay Model" (NLDM) timing files was required to
be characterized at a couple of fixed reduced operating voltage points; the voltages are
highly technology dependent and expensive to build on-demand for customers however.
• Synopsys was proposing "Scalable Polynomial Delay Models" (SPDM) around the time of
the DVS926 project - these allowed accurate delay interpolation for intermediate operating
voltages but turned out to be very expensive in terms of machine time required to build the
underlying polynomial delay characteristics.
• Current Source models are now the favoured industry approach - Synopsys have not only 
"Composite Current Source" (CCS) models for timing but also for power and noise; these 
are much more efficient to build from a library provider perspective but still involve huge 
quantities of data, even in the timing libraries that package this up for customer design 
views. 
All the technology demonstrator projects covered in this report were in fact signed-off using NLDM 
timing and power library views but are proving valuable test cases for CCS timing and power 
closure subsequently. 
Library IP Support for (MTCMOS) Power Gating 
The favoured approach to power gating has been distributed shared P-MOS header switching.
Although N-MOS footer switches have higher electron mobility, for the 90 and 65nm technologies
targeted for the leakage demonstrators the system-level benefits of shared ground throughout, and
therefore standardizing on active-high signalling with clamping to zero, were the driving factor.
Fig 5.1A shows the underlying header switch design approach, with the integrated buffering
supporting ease of chaining or cascading switches without added "always-on" control networks, and
the ability to add some internal switching control path delay to avoid peak inrush switch currents by
design.
Fig 5.1B shows the switch layout optimized for area and switch efficiency; the double-height cell
layout allowed sharing of the inverter chains to minimize area impact, and the switches were
implemented as many "fingers" of transistors to get the best on-current to leakage ratio - here
optimized for 90nm TSMC Generic technology.
Figure 5.2A - Distributing Virtual and Un-switched Power
(Cells shown: INVX1 and UBUFX6)
Figure 5.2B - Novel Buffered Header Power Switch with Weak-Start
(Layout annotations: double-height switches; chained start-up control networks (buffered))
The approach taken with the entire prototype power gating library was to add an extra track in
addition to the "normal" power rail. An existing 9-track library was used as the starting point and a
tenth power track added - not ideal from a cell area perspective - but avoiding all the challenges of
supplying "always-on" supply to retention flops and the control buffer networks to control power
gating, isolation and retention that must not themselves be power gated or "float".
The dual-height Header switch is designed to sit under a low-impedance VDD power mesh and
locally switch (a pair of) Virtual-VDD rails, VVDD. Although a number of experiments were
conducted with strategies to use columns of switches to turn on each VVDD row at a time, the only
safe strategy adopted was to provide a controlled turn-on of the VVDD as a gridded supply allowing
best current sharing between switches to minimize "hot-spots" for voltage drop. The switch cell is
shown supporting virtual rail grid connectivity by switch cell abutment.
A final complication shown in Fig 5.1B is that of the switch providing the well-taps for the adjacent
library cells in a final layout. A tap-less baseline cell library was used for the project; the area
efficiency of cells is improved by not having well ties explicitly provided inside each cell, but this
adds a requirement in the EDA flow that well ties or tap cells must be added within
technology-dependent radii of such cells to ensure they behave correctly as originally characterized.
Incorporating the well ties into the switch cells guaranteed that the well connections could be made
by design provided they were placed less than 75 micron distant from each other (radially) for the
90nm technology for this project. Because the project was to exploit variable well bias evaluation
the P-well and N-well connections are again chained together by abutment.
Fig 5.2A shows examples of how the underlying cell library was converted from 9-track to the
experimental 10-track variant with distributed always-on VDD distribution. Conventional cells such
as the INVX1 inverter shown simply had a dummy 10th track (VDD) added and their pre-existing
track-9 supply renamed VVDD; when this virtual rail is switched off the cell loses power with
associated reduction in leakage current.
An extra set of "Un-gated" power cells which were given a "U-" prefix were introduced into the 
library: a higher-strength non-inverting buffer, UBUFX6, is shown alongside the inverter. This is 
simply rewired to use the always-on VDD supply track 10 rather than track 9 that the parent 
(BUFX6) library cell used to clone this cell variant. 
All cells in the new 10-track library variant have to be re-characterized to take account of the extra 
parasitic capacitance and to update the cell area parametrics. 
Fig 5.2B shows the double-height switch cell enhanced to include a specialized start-up circuit that
supports connection by abutment of not only the standard switch structures but also a buffered 
control chain that feeds a turn-on request "up" the chain and returns an acknowledge back down 
the chain to support placement-controlled turn-on management. The starter network only turns on a 
set of three weak switch "fingers" while the remaining 27 fingers are turned on by the main chained 
switch control to provide the low-IR switch functionality once the virtual grid is safely powered up. 
Figure 5.3A - Schmitt Trigger Virtual Rail Sensing
Figure 5.3B - Soft-start weak transistor header switch network
Figure 5.3C - Low IR-drop strong transistor header switch network
The SALT virtual power grid switches were developed and simulated using trial loads of standard 
cells instantiated in between column switches arranged in columns every 50um to match the 
standard power supply vertical metal pitch for standard single voltage designs. 
For the trial implementations the virtual power grid switch network for the SALT926 CPU was:
• Standard cell area: 280 rows x 2000um wide
• 50um column switch grid (to start with)
• Switches 6.5um wide
• 11200 switches (in fact 5600 double-height cells)
• 72800um of switch cell
• 36.5 rows of switches in total
Total current capacity is high (typical conditions) for 50mV drop:
• 11.42uA/switch x 5600 = 63.9mA
• Gate current 520.3pA/switch x 5600 = 2.9uA
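The figures in this list follow from a few lines of arithmetic, reproduced below as a worked check. This is only a sketch: the function and key names are introduced here for illustration, and the per-switch currents are read as microamps and picoamps per double-height cell, as the quoted totals imply.

```python
def switch_grid_summary(rows=280, row_width_um=2000, column_pitch_um=50,
                        switch_width_um=6.5,
                        on_current_per_cell_a=11.42e-6,
                        gate_current_per_cell_a=520.3e-12):
    """Recompute the trial switch-grid figures from the quoted parameters."""
    columns = row_width_um // column_pitch_um           # switch columns across a row
    switches = rows * columns                           # one switch per row per column
    double_height_cells = switches // 2                 # each cell spans two rows
    switch_length_um = switches * switch_width_um       # total row length used by switches
    equivalent_rows = switch_length_um / row_width_um   # the same, in full-row equivalents
    return {
        "columns": columns,
        "switches": switches,
        "double_height_cells": double_height_cells,
        "switch_length_um": switch_length_um,
        "equivalent_rows": equivalent_rows,
        "on_current_ma": on_current_per_cell_a * double_height_cells * 1e3,
        "gate_current_ua": gate_current_per_cell_a * double_height_cells * 1e6,
    }


summary = switch_grid_summary()
```

With the quoted parameters this gives 40 columns, 11200 switches in 5600 double-height cells and 72800um of switch cell, which is a little over 36 row-equivalents out of 280 (roughly 13% of the standard cell rows), with a total on-current close to 64mA and a gate current close to 2.9uA.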
The approach of over-designing the power gating in order to guarantee no local high-current IR-drop
hot-spots was chosen to reduce risk, with the knowledge that in an optimized solution smaller
current switches could selectively be swapped in once the high-current cell placements were
analyzed - which would then reduce the off-current leakage compared to the initial grid.
Because the design was to implement retention registers which must not be corrupted by turn-on
currents in the power gating grid (considerable given the complete capacitance of all power gated
cells once these have been gated off and discharged) an active approach to inrush transient
management was taken for the project:
A Virtual VDD grid-sensing Schmitt trigger cell was specified - see Fig 5.3A.
• Analog voltage-sense cell for the Virtual-VDD supply rail
  o Generate a "VVDD-ready" signal when start-up voltage reaches ~90%
• Integrated AND-gate to allow gating "ready" with "nSLEEP"
  o One cell sufficient to control main power gating network
  o Necessary in order to ensure turn-off time not dependent on network discharge.
• A diagnostic override (OR-gating) structure to allow instantaneous "ready" signalling
  o To allow experimental analysis of retention state soft error rates with and without
    soft-turn-on sensing.
Fig 5.3B shows the chained structure implemented by script-based placement of the weak starter
switch structures - in this case distributed every 10 switch columns, and with their control path
chained serially to provide a gentle turn-on characteristic.
Fig 5.3C shows conceptually how the high-current switch columns are arranged and controlled, and
only turned on once the level-sensing Schmitt trigger indicates the starter network is on and it is safe to
turn on the strong header switches. Only one starter column is shown for this part of the VVDD grid
switch network, which is simply a replication of further switch columns to the right of this.
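The two-stage turn-on can be summarised as combinational control logic around the rail-sensing cell. The Python sketch below is a simplified illustration under stated assumptions: the function names are hypothetical, the sense threshold is taken as 90% of VDD from the bullet list above, and the hysteresis of a real Schmitt trigger is not modelled.

```python
def vvdd_ready(vvdd, vdd, nsleep, override=False, threshold=0.9):
    """'Ready' output of the rail-sensing cell.

    sensed   : virtual rail has charged to the threshold fraction of VDD
    override : diagnostic OR-gating that forces 'ready' immediately
    nsleep   : AND-gating so that 'ready' drops as soon as sleep is requested,
               without waiting for the virtual rail to discharge
    """
    sensed = vvdd >= threshold * vdd
    return bool(nsleep and (sensed or override))


def switch_enables(vvdd, vdd, nsleep, override=False):
    """Return (weak_starters_on, strong_switches_on) for the header network."""
    weak_on = bool(nsleep)                               # starters follow nSLEEP directly
    strong_on = vvdd_ready(vvdd, vdd, nsleep, override)  # strong switches wait for 'ready'
    return weak_on, strong_on


# Usage: early in wake-up only the weak starters conduct; once the rail is
# nearly charged the strong switches join them.
early = switch_enables(vvdd=0.30, vdd=1.0, nsleep=1)
charged = switch_enables(vvdd=0.95, vdd=1.0, nsleep=1)
asleep = switch_enables(vvdd=0.95, vdd=1.0, nsleep=0)
```

Gating "ready" with nSLEEP is what makes turn-off immediate: the virtual rail may still be close to VDD when sleep is requested, but the strong switches are released at once.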
Figure 5.4A - Column-based Switch Fabric Deployment 
Figure 5.4B - Close-up of optimized connection by abutment
The SALT virtual power switch grid is only one example of how such a shared or distributed power
gating network might be built - e.g. row switches rather than the column approach of Fig 5.4A.
The project determined four classes of power gating implementation, each of which requires more
EDA tool support than the last but is more appropriate for synthesizable IP design flows:
"Concealed" Power Gating 
• Custom 10 Track Library (easy distribution of true VDD) 
• SALT project header switch example - "physical only" cells
• Not in the netlist - hard to analyze 
• Pre-placed by perl script 
• Connection by abutment, as shown in Fig 5.48. 
• Power domains in RTL Hierarchy, need manual power gating control hook-up 
"Explicit" Power Gating 
• Standard 9-Track library (VDD must be explicitly routed) 
• Fully characterized headers (pre-production Power Management Kit) 
• Switches exist in the netlist - can be analyzed 
• Pre-placed by perl script 
• Connection by pre routing in back-end router 
• Power domains in RTL hierarchy, again need manual power gating control hook-up 
"Inferred" Power Gating 
• Standard 9-Track library (tap-less) 
• Fully characterized headers (pre-production Power Management Kit) 
• In the netlist - can be analyzed 
• Automatically placed and routed by synthesis tools 
• Minimal switch topology scripting - intelligence now in the tools 
• Power domains inferred from RTL 
"Automated" Power Gating 
• Standard 9-Track library (90GT) 
• Fully characterized headers (production Power Management Kit) 
• In the netlist - can be analyzed 
• Automatically placed and routed by front-end synthesis tools 
• Require minimal IC-Compiler (Synopsys synthesis/place & route) scripting
• Power domains specified during implementation 
Library IP Characterization for Power Gating 
Adding switches in series with power rails results in context-sensitive voltage drop for power-gated
library cells. One approach would simply be to over-constrain the timing in order to compensate for
some level of IR drop that would only be known once the layout and extraction is complete.
Figure 5.5A - Single-pin control Retention Register - single height 
Figure 5.5B - StateSaver Retention Register - Double Height
For the project the base library was re-characterized at 5% and 10% voltage de-rating from
the standard voltage sign-off conditions: for the 90nm library that has a worst-case voltage
corner specification of 0.9V (1.0V nominal minus 10% voltage), two more worst-case timing
libraries were built:
• 0.85V, slow corner, highest temperature
• 0.80V, slow corner, highest temperature
Re-synthesizing and routing the ARM926 CPU configuration with these libraries resulted in
de-rating of the performance by:
• -10% of FMAX using the 0.85V slow corner library
• -29% of FMAX using the 0.80V slow corner library
For the project the 50mV IR drop was taken as the absolute maximum target derating so the
synthesis and analysis was all completed using the 0.85V worst-case library for setup times,
and the standard 1.1V, fast, low temperature library used for hold-timing closure.
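The relationship between the sign-off corner, the two extra libraries and the IR-drop budget can be checked with simple arithmetic. The sketch below works in millivolts to avoid rounding issues and assumes that the additional 5% and 10% de-rating steps are percentages of the 1.0V nominal supply, which is the reading that reproduces the 0.85V and 0.80V corners; the function names and the 300MHz baseline used in the usage lines are purely illustrative.

```python
def worst_case_corners_mv(nominal_mv=1000, signoff_pct=10, extra_pct=(5, 10)):
    """Return the sign-off corner and the further de-rated corners, in millivolts.

    Assumption: each extra de-rating step is a percentage of the nominal supply.
    """
    signoff_mv = nominal_mv * (100 - signoff_pct) // 100
    derated_mv = [signoff_mv - nominal_mv * pct // 100 for pct in extra_pct]
    return signoff_mv, derated_mv


def derated_fmax_mhz(fmax_mhz, loss_pct):
    """Apply a percentage FMAX loss to a baseline frequency."""
    return fmax_mhz * (100 - loss_pct) / 100


signoff_mv, (lib_a_mv, lib_b_mv) = worst_case_corners_mv()
ir_budget_mv = signoff_mv - lib_a_mv      # headroom covered by the first extra library

# Illustrative only: effect of the quoted 10% and 29% losses on a 300MHz baseline.
fmax_a = derated_fmax_mhz(300, 10)
fmax_b = derated_fmax_mhz(300, 29)
```

On this reading the first extra library sits exactly 50mV below the sign-off corner, which is why it could be used directly as the setup-timing view for the 50mV IR-drop target, while the sharply larger FMAX loss of the second library shows how steeply delay rises with further drop.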
By the end of the SALT926 project the CCS-Timing models were fully supported for the static
timing analysis and scheduled to be supported for the front-end placement-aware synthesis
from mid 2007 - which will allow customer designs to work with standard library deliverables
and not require extra reduced voltage headroom characterization "specials".
Library IP support for State Retention
State retention register designs typically add a High-Vt low-leakage latch structure that has a
retention power supply to a conventional low-Vt high-performance Master-Slave register that
can be power-gated off. A well publicised version is known as the "Balloon-flop"* which has a
pair of extra control signals:
• A SAVE signal to transfer state to the retention latch
• A RESTORE signal to copy state back into the register when re-powered up
The signals need careful sequencing and have to be driven by non-power-gated control buffer
networks to ensure no state corruption.
For the SALT926 project two new styles of retention register were investigated and
implemented: a single-pin-control retention register, invented by a colleague in the ARM
Austin design centre, and a more radical alternative co-invented in Cambridge which simply
reused the existing scan-enable and reset pins.
Fig 5.5A shows the layout of the retention register used for the final tape-out:
• An edge-sensitive NRETAIN control pin captures state on the falling edge before power
gating and restores state on the rising edge.
Fig 5.5B shows the layout of the "StateSaver" retention register design:
• An asynchronous pulse on SCAN-ENABLE copies register state to the retention latch
• The NRESET control doubles as the retention restore when SCAN-ENABLE is low
and a true reset when SCAN-ENABLE is high
* Mutoh S. et al., "A 1-V Multithreshold-Voltage CMOS Digital Signal Processor for Mobile
Phone Applications", IEEE Journal of Solid-State Circuits, vol. 31, no. 11, pp. 1795-1802, 1996.
Figure 5.6 - TSMC90G Dynamic Well Bias Delay Simulations
(Plot: relative delay v VBB, VDD, Vt; x-axis VBB (V); series HVt, RVt and LVt, each at 0.80V, 0.85V and 0.90V)
The single-pin control design was most easily supported by the EDA tools flow. 
The StateSaver version has the potential to be retrofitted to legacy IP and mixed with
standard reset-able/scan-able registers with the only constraint that the Scan-Enable buffer
network should be non-power-gated and therefore ideally implemented in low-leakage High-Vt
cells. EDA support that does not break the Design-for-Test tools is still currently in the R&D
groups and support is not yet productized.
In both cases there is an area overhead penalty per register - 30% to 50% overhead per
retention register over a conventional register depending on the register variant and drive
strength.
The 10-track prototype cells provided transparent retention supply routing to the registers 
which kept the design power planning simple for the project. Production registers revert to 
standard track height and add extra retention supply ports that must be explicitly power routed 
in the design flow. The current requirements for the retention supplies are low compared to 
the at-speed functional register operation so do not require a major supply grid. 
Library IP support for Dynamic Well Bias 
A number of customers have expressed interest in the leakage savings possible with the addition of 
back-bias to the cells - but only IDMs with their own fabrication plants have been able to productize 
this for mass manufacture. 
The SPICE models and manufacturing guarantees offered on Foundry processes are complete 
when wells are tied off to standard supply rails but less well specified for forward or reversed biased 
transistor operation. And the EDA library views are well characterized for supply voltage variation 
but not the extra dimension of well bias for reduced leakage operation with back-bias or the more 
risky faster/leakier operation with forward-bias (and the potential of thermal runaway). 
The SALT project took on board the challenge of providing experimental control of well and bulk
voltages for the CPU logic domain - for both standby and active leakage current analysis.
A representative netlist design based on a ranking of gates used in the SALT926 CPU
implementation was simulated at 0.1V steps for the range -0.9V to approaching +0.8V to
understand the derating effect on the standard 0.9V WC library and the 0.85V and 0.8V special
characterized libraries for the project.
Fig 5.6 shows the graphical plot of aggregate timing delay across the range. This analysis then
allowed the specification of two active modes of chip operation under software control:
• SAVER mode, running at 200MHz compared to the worst-case 300MHz operation to 
support running with controlled back-bias 
• TURBO mode, running at 400MHz to support 
Because no EDA timing models support dynamic changing of supply voltage and P-well/N-well bias,
architectural design support was put in to ensure the Wait-For-Interrupt mechanism was used to
keep clocks stopped and held idle until the power supply acknowledges safe and stable well supply
voltages.
Figure 5.7A - GIDL for 65LP LVT and HVT header switch transistors
(Plot: HVT vs LVT; x-axis Vgs (V), 0 to 2.5; log-scale y-axis, 1.0E-12 to 1.0E-03; series LVT and HVT)
Figure 5.7B - HVT and SCC-LVT header switch area efficiency
(Plot: HVT vs LVT; x-axis Vgs (V), 0 to 2.5; log-scale y-axis, 1.0E-12 to 1.0E-03; series LVT (Area = A), HVT (Area = A), LVT (Area = 10xA), HVT (Area = 10xA))
Adding in well bias introduces a number of design flow and evaluation platform challenges.
Analyzing the leakage and performance benefits is only part of the work. There are also serious
issues with respect to manufacturing variation and the danger of latch-up without very careful
external power supply sequencing and decoupling.
Library IP support for Super Cut-off CMOS (SCCMOS) 
All the leakage mitigation IP developed and described in this chapter has been based on
Multi-Threshold CMOS, MTCMOS, technology, where Low-Vt but leaky transistors are used
on critical paths to meet timing constraints and High-Vt transistors are used on non-timing-critical
paths and as the series power gates in order to ensure power gating leakage power is
reduced as low as possible.
The ATLAS926-65LP used three different cell library transistors with High, Standard and Low
Vt characteristics (with the fine-grain power gating libraries adding High-Vt switch transistors
to the leakier Standard and Low Vt implementation cells) while the SALT926 project used a
pair of low and high Vt cell libraries and High-Vt transistors for the power gates.
Every Vt variant requires another expensive implant mask layer and processing steps in the
manufacturing flow, and the variability introduced with each implant does not track the other
Vts precisely, so the overall circuit design timing and leakage variation varies in a complex
way.
A number of high-volume customers are very sensitive to mask costs and are interested in
the trade-offs associated with active and leakage power and extra mask steps and yield
impacts.
To address this some detailed work on transistor simulations of 90nm and 65nm
technologies was performed in order to understand non-MTCMOS approaches to leakage
power management.
Fig 5.7A shows the Gate-Induced Drain Leakage characteristics simulated for the TSMC
65nm LP process, the process used for the MTCMOS ATLAS65LP project.
• The gate voltage is overdriven from the 1.0V nominal logic signalling level to show
the effect on drain current. On the log scale a reduction in leakage current of several orders
of magnitude can be obtained by driving the gate voltage into "super-cut-off" operation.
• With a gate voltage of approx 1.5V the "off" leakage for the Low-Vt transistor is as
good as (in fact better than) the High-Vt transistor can ever achieve with a gate bias
of about 1.2V.
This is for a P-MOS header switch in this example.
Fig 5.7B shows the area trade-off for the HVt and LVt switches. The Low-Vt switches can
carry a higher current per unit area. Even scaling the HVt switches to 10x the area still only
matches the on-current performance of the Low-Vt unit-area switch, while the off-current
leakage at the optimal gate bias voltage is proportionally worse.
Providing voltage control signals that are beyond the supply rails of course adds complexity
and risk.
The SCCMOS approach is being investigated further by focussing on a "ring-switch" topology
style where any out-of-logic-range control signalling voltages can be managed and
encapsulated within strips of switches applied to the periphery of blocks to be power gated,
such that the routing and buffering of analogue control signals can be managed locally and
not add implementation/testability and verification complexity to the IP block implementation
flows.
6. Evaluation Platforms 
A series of development boards were created over the lifetime of the research programme to
support:
• Basic testing of the silicon functionality
• Software development environment for in-depth test program development
• Evaluation and measurement platform for dynamic and leakage power analysis (and
comparison with predicted results from simulation)
• Dynamic Voltage Scaling power supply evaluation system to support the collaboration with
National Semiconductor and their initial "PowerWise" AVS power control silicon samples
Later on in the programme the scope grew to encompass more functions and become the basis of the
"Intelligent Energy Manager" demonstration platform:
• Demonstration systems for lead customer access
• Exhibition systems for stand-alone operation - booting from compact flash and displaying
on a VGA-resolution monitor screen
• Customer "loan" system for detailed energy measurement using proprietary work-loads
• Systems built and given to lead partners to allow them to demonstrate their silicon to
customers independently
• Battery power experimental test systems to allow battery-life measurements for low-power
active and standby energy characterisation
This chapter of the report introduces the basic architecture for the first board and the evolution and
enhancement of the design to match each stage of the development of the canonical design for the
SOC.
Design Goals 
To use simple off-the-shelf components and low-power modules wherever possible to simplify
board design and component procurement:
• SDRAM DIMM modules and sockets rather than memory components
• Conventional (0.1" pitch) sockets and DIN-64/-96 way connectors
• Independent power supply and system board components supporting detailed current
measurements of power rails independently and supporting third-party power supplies
• Minimal on-board 16-bit wide Flash EPROM to support basic diagnostic monitor
• Basic RS-232 serial communications to support communication with the diagnostic monitor
through a standard terminal emulator
• Basic clock oscillator support
• Multi-ICE debug agent support through a standard header connector used from the ARM
development tools to debug the chip and memory system and flash the on-board memory
• Support for audio output analogue interface to allow real-time software development and
testing
Figure 6.1 - 180nm sARS2 evaluation board 
180nm sARS2 Evaluation Board 
The first board in the programme was specified by the author but designed and built by Synopsys
engineers in Hillsboro, WA.
The SOC technology was 180nm so 1.8V core logic and 3.3V IO and memory systems were
required for the external interfaces:
• A pair of on-board linear voltage regulators, provided with removable links on the power
rails to allow power measurement.
Four memory subsystems were supported:
• A pair of 8-bit wide 32Kbyte flash memories were fitted as the 16-bit bootstrap memory and
diagnostic monitor¹
• On-board SRAM devices - a pair of 1Mbit 8-bit wide SRAMs arranged to provide
256Kbytes of 16-bit-wide static memory
• On-board Synchronous SRAM (SSRAM) bank to support 32-bit wide fast static memory
access in order to enable pseudo-dual-ported DMA access for the audio subsystem
• SDRAM DIMM socket for 32Mbytes of 32-bit wide SDRAM bulk memory for program
development
A ZIF socket was procured for the project to meet the specification for the 408-pin BGA package
used for this chip, with the aim of re-using this for subsequent SOC iterations. This was essential as
no wafer or package level test was possible (primarily for cost reasons) and silicon had to be tested in
functional usage.
There is a large footprint area reserved for this on the circuit board but the sprung-loaded contacts 
can be released by unscrewing the socket and a known-good chip can then be surface-mount 
soldered onto the same footprint. 
The only board-level peripherals were: 
• RS232 level converter and 9-pin D-connector 
• 8 LEDs connected to GPIO lines 
• 8 switches connected to GPIO lines
• Simple stereo D-A filtering for audio connection on 3.5mm jack socket
All other signals were made available at header connectors shown in the lower half of the board,
laid out to support logic analyser connection and external interfacing to memory mapped expansion
regions.
Acknowledgement to Larry Rogers at Synopsys for schematic capture and board layout for the
sARS2 board.
Only 3 boards were built for this first collaborative project.
¹ In the figure opposite these are in fact the two surface mount sockets. Due to a timing
problem in the JTAG interface on this design the external Multi-ICE debug agent connection
to the debugger was unusable, so rather than using the ARM CPU core to program the flash
memories under control of the debugger, the memory devices were in fact programmed in an
external EPROM programmer and then fitted in sockets on the board rather than being
soldered down directly.
Figure 6.2 - Intelligent Energy Controller FPGA prototype platform 
Software development board for "IEC"
Due to the long fabrication and packaging schedule for the first Dynamic Voltage Scaling silicon 
(the 130nm DVS926 chips) an alternative hardware platform was devised for the software 
engineering team which emulated the DVFS environment in order to support the development of 
the device drivers and control stack ahead of the actual chips.
The requirements included:
• Support for Linux OS with "reasonable" performance
• Register-compatible interface to the performance control and monitoring functions of the 
prototype "Intelligent Energy Controller" in the first silicon. 
• Emulate the behaviour of DVFS in terms of clock and voltage settling times and to emulate 
reduced performance levels reasonably accurately 
The only FPGA platform that could be sourced to support 200MHz operation (from cache and 
external DDR memory) was a development board that ARM had built around the Altera Excalibur 
device which has a pre-hardened ARM922T CPU with 8Kbyte caches and memory controllers built 
within a "stripe" alongside the programmable logic. 
It was not possible to reprogram the PLL clock to the CPU dynamically so a pulse-width-modulated 
scheme was developed that controlled the Run and Stop duty cycle of the CPU to emulate 
performance levels below 100%. 
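The run/stop scheme can be illustrated with a short sketch. This is an illustrative model only, not the FPGA logic used on the Excalibur board: the function names, the PWM period of 16 slots and the list of performance levels are assumptions made for the example. Only the 200MHz full-speed figure comes from the text.

```python
def run_stop_pattern(performance_percent, period_slots=16):
    """Return one PWM period as a list of booleans, one per slot.

    True means the CPU is allowed to run in that slot, False means it is
    held stopped. period_slots is a hypothetical value for illustration.
    """
    if not 0 <= performance_percent <= 100:
        raise ValueError("performance level must be between 0 and 100")
    run_slots = round(period_slots * performance_percent / 100)
    return [slot < run_slots for slot in range(period_slots)]


def effective_mhz(full_speed_mhz, pattern):
    """Average clock rate seen by software over one PWM period."""
    return full_speed_mhz * sum(pattern) / len(pattern)


if __name__ == "__main__":
    # The PLL stays at full speed; only the run/stop duty cycle changes.
    for level in (100, 75, 50, 25):
        pattern = run_stop_pattern(level)
        print(level, effective_mhz(200, pattern))
```

The point of the sketch is that the clock frequency never changes: lower performance levels are approximated purely by the fraction of time the CPU is allowed to run.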
The synthesizable performance controller and monitors from the SOC design were re-synthesized
using FPGA tools and, apart from minor adjustments for the overall memory map to work with the
Altera part, a full emulation of the programming environment was built.
The configuration for the PLD was put into flash memory such that the device auto-configured on
power-on-reset.
The board itself is only part of the software development platform. This Excalibur "Logic Module" in
fact has stacking connectors to allow it to be plugged in as a daughter card into an application
development back plane that supports video card, Ethernet interface and the other Linux peripherals
expected in the OS development environment.
Only two systems were built and commissioned.
They were successfully used to develop the driver interfaces, to port the Intelligent Energy Manager
control stack and policies, and even to allow some real-time profiling. (The FPGA proved ideal for
such purposes, allowing selected real-time signals to be routed out to a logic-analyzer connector for
measurement and tuning.)
Figure 6.3 - 130nm DVS926 Voltage Scaling Test platform
130nm DVS926 Voltage Scaling Test Platform 
The next board in the programme was designed entirely in-house by the author to re-use
components used for commercial ARM development systems, and the circuit board was laid out by
a local contractor to meet complex power plane requirements.
The SOC technology was 130nm so multiple (or at least, in the first instance, separable) 1.2V core
logic supplies and 3.3V IO pad ring supplies were required. Before any voltage scaling
experiments could be started the primary requirement was to logically test the SOC devices; no
wafer test was possible for cost reasons so raw untested die had been packaged.
The decision was made to separate the supplies to the pad-ring and the memories in order to
measure the proportion of power consumed in the interfaces and the memories themselves.
• The power regulation was moved off the main board through a 64-way connector (lower
connector of Fig 6.3) to allow both bench and standalone power supplies to be used and
developed independently. Carefully splitting all the power planes and supplies on the
main board supports explicit, independent measurement of current and power. A fixed 1.2V
functional test supply is shown at the bottom of the figure.
Two on-board memory subsystems were supported:
• A single 16-bit wide 64Mbit SMT Flash EPROM to support 8 Mbytes of diagnostic monitor
and application programs or operating system bootstrap. Programming is via the JTAG
controlled debug agent (simulated in more detail on this chip after the problems with the
sARS2 chip debug port)
• Standard PC-style SDRAM DIMM socket for 64/128Mbytes of 32-bit wide SDRAM bulk
memory for program development (64Mbytes per side to support double-sided DIMM
modules)
The footprint for the same ZIF socket was reused from the previous project - the same 408-pin 
BGA package was used for this chip but with different pin-out due to the multiple power rails and 
interfaces. 
There is a large footprint area reserved for this on the circuit board but the sprung-loaded contacts 
leave the board undamaged when the socket is removed, allowing known-good tested parts to be 
soldered to the same board subsequently. 
The only board-level peripherals were dual channel RS232 level converter and 9-pin D-connectors. 
One 96-way connector provided access to 48 General Purpose IO signals, audio and Synchronous
Serial IO ports, and allowed external switches and LEDs to be connected for diagnostic purposes
(connector on the right hand edge of Fig 6.3).
A second 96-way connector provided access to the full static memory address, control and data 
buses to support logic analysis and memory expansion (top edge of Fig 6.3). 
Only 3 boards were built and used to functionally test DVS926 silicon and develop the diagnostic 
monitor code in order to characterize the DVFS behaviour. 
Figure 6.4 - 130nm DVS926 Voltage Scaling Demo platform 
130nm DVS926 Voltage Scaling Demonstration Platform 
After the successful testing of both DVS926 silicon and the evaluation board, a follow-on evaluation
board was developed to support internal and external software developers and to provide a number
of demonstration systems to be shown to customers across the regional offices and even loaned
out to customers for internal power and energy benchmarking.
The schematics were re-used from the previous board design and then a number of enhancements
added:
• A "daughter" card power supply system was developed. This provided individual regulated
rails to all the SOC core and IO voltage rails, together with link-selectable current
monitoring facilities for each rail to support detailed power measurement and analysis.
• Linear regulators were used (shown at the bottom of Fig 6.4) to ease design and
construction rather than switch-mode supplies that would have exhibited better power
efficiency; the benefit was noise-free measurements after the regulators, with regulation
over a wide range of current loads for testing in both running and halted states
• A rudimentary dynamic voltage scaling system was built into this PSU board to support open-
loop table driven voltage scaling of CPU and Memory, which was used for basic DVS
control at different frequencies
• Two sets of 5 LEDs were provided to indicate dynamic performance request and the values
of the National AVS slack detector built into the CPU scaled voltage domain.
• Switches were added to control the "ready" level acknowledge to the Intelligent Energy
Controller on board to simulate different power supply ready behaviour
• National Semiconductor in parallel developed a switch-mode Adaptive Voltage Scaling
power supply daughter board with their first PowerWise interface IC (LP5550).
The Main board was also enhanced to add a pair of memory-mapped CompactFlash slots (top of
Fig 6.4). This required a pair of PLDs to be developed that provided both hot-powered insertion
protection and the fairly complex memory card timing sequencing and control for 8- and 16-bit
cards. This proved very valuable and allowed Ethernet, VGA and Flash memory card drivers to be
developed that provided a set of standalone demonstration modes to showcase the DVS926 silicon
running MPEG-4 decoding in software with DVFS.
A batch of 20 boards was built using known-good devices that had been verified on the previous
test board with its ZIF socket; the BGA devices were soldered down directly. 5 had manufacturing
short circuits or other problems; the remaining 15 were commissioned and used globally by sales
staff with customers and at trade shows.
Figure 6.5 - DVFS Voltage Scaling Exhibition Board 
"IEM" Voltage Scaling Exhibition Platform 
For the first ARM Developers' Conference in Santa Clara, CA, a smaller footprint demonstration
vehicle for the IEM software and hardware was developed.
The basic circuitry and schematics were reused and re-partitioned across a set of small circuit
boards designed to be stacked and allow modular use with and without the CompactFlash
subsystem. The opportunity was also taken to allow larger Flash EPROM images, to support more
comprehensive Linux bootstrap systems and file-system images.
A footprint the size of a business card was chosen as the basic building block and the following set
of cards were developed:
• System SOC board with DVS926, 8Mbyte Flash EPROM, crystal clock and Multi-ICE
debug connector on the top side, and a laptop SODIMM memory socket on the underside
of the board. DVS926 test-chips were pre-tested using the ZIF-socketed functional test
board so could be soldered onto a small circuit-board footprint without the need for the
large socket keep-out area. With minimal trace lengths for the resultant system many of the
series termination impedance matching resistors were cost-reduced from the original larger
board layouts.
• Miniature stacking connectors at either end of this and all modules to route power and
ground supplies, dynamic voltage scaling control, and expansion static memory interface,
UART and GPIO signals.
• A dual-slot CompactFlash stacking card with PLDs and card sockets on top and bottom
sides of this board
• An add-on 32Mbyte Flash Memory board that by means of a pull-down signal on the card
disables and overlays the 8-Mbyte Flash EPROM on the CPU board.
• A development "mother-board" that provides a plug-compatible 64-way power supply
connector to National Semiconductor or ARM dynamic voltage scaling boards, plus the dual
RS-232 connectors, power-on-reset circuitry and push-button, plus a 96-way interface
connector compatible with the GPIO/Sound/SSI connector on the previous DVS926 boards.
• Finally, an optional battery powered base-board to be used in place of the motherboard
described above that allows 3 NiMH 1.2V AA cells to power the entire stack of boards and
provides dual RS-232 and 3 20-way headers for simple expansion interfacing and logic
analysis. This provided a fixed 1.2V core voltage supply rather than full dynamic voltage
scaling.
A CPU + CompactFlash + Motherboard are shown stacked together at the lower centre of Fig 6.5
(with a VGA interface card plugged in for MPEG playback demonstrations) surrounded by, in
clockwise order, the alternative battery-powered baseboard, the system/CPU card, the dual
CompactFlash card, the development motherboard and the optional Flash EPROM expansion
module.
10 exhibition systems were built and commissioned for world-wide exhibition and trade-show use.
Figure 6.6 - 130nm ULTRA926 Voltage Scaling Demo platform 
130nm ULTRA926 Voltage Scaling Demonstration Platform 
The ULTRA926 chips were designed with the same BGA package and pin-out as the DVS926.
Fig 6.6 shows a DVS926 Demonstration board reused with a ZIF socket to screen and test the
ULTRA926 devices.
The testing turned out not to be straightforward due to a problem with the new PLLs that were used
for the project:
• The silicon failed to phase-lock to the 12MHz crystal source.
• External PLL behaviour could only be observed indirectly via SDRAM clocks and power
supply interface signalling
• By plugging in a known-good DVS926 chip it was eventually possible to program up the
Flash EPROM with a derivative of the diagnostic monitor built for the ULTRA926 SOC.
• Eventually it was possible to prove that ULTRA926 devices were in fact functional, despite
the SDRAM clocks being unstable, by capturing RS-232 characters at one
eighth of the programmed 9600-baud (1200-baud); the baud rate dividers filtered out
much of the PLL instability
Armed with this information a post-mortem with the PLL designers in San Diego pinpointed an error
in the Verilog model for the PLL: the silicon implemented a divide-by-8 function on the reference
clock input rather than the divide-by-1 in the model used for simulation and sign-off of the chip.
• After experimenting with how to overdrive the crystal input pad circuitry using a bench-top
waveform generator set at exactly 8 times the 12MHz reference clock frequency and
carefully adjusting the amplitude and DC offset, the ULTRA926 silicon then behaved
functionally correctly from SDRAM
• A simple modification to the clock circuitry was devised for the ULTRA926 build of the
board that used a custom-made 3.3V CMOS 96MHz oscillator and a passive signal
conditioning network to overdrive the CMOS crystal pad input
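The arithmetic behind the 1200-baud observation and the 96MHz replacement oscillator can be checked with a short sketch. The function names are hypothetical, and treating every derived clock as scaling linearly with the PLL reference is a simplifying assumption; the 12MHz, 9600-baud and divide-by-8 figures come from the text.

```python
CRYSTAL_HZ = 12_000_000    # board crystal frequency (from the text)
MODELLED_REF_DIVIDE = 1    # reference divider in the simulation model
SILICON_REF_DIVIDE = 8     # reference divider found in the silicon


def observed_baud(programmed_baud, modelled_divide, actual_divide):
    """Baud rate on the wire if clocks run slower by actual/modelled."""
    return programmed_baud * modelled_divide // actual_divide


def required_input_hz(intended_ref_hz, actual_divide):
    """External clock needed so the PLL sees the intended reference."""
    return intended_ref_hz * actual_divide


if __name__ == "__main__":
    print(observed_baud(9600, MODELLED_REF_DIVIDE, SILICON_REF_DIVIDE))  # 1200
    print(required_input_hz(CRYSTAL_HZ, SILICON_REF_DIVIDE))  # 96000000
```

Both results match the figures reported above: serial characters appeared at 1200 baud, and an external source at 8 x 12MHz = 96MHz restored the intended reference frequency.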
After overcoming the clock generation problems the ULTRA926 DVFS demonstration board
became a useful evaluation platform. The four dynamic CPU performance levels between 50% and
100% of the 288MHz design target operating frequency exhibited better energy savings compared
to the original DVS926 design, and the system operated up to 360MHz on the bench at room
temperature with the close-to-typical fabricated silicon.
A limited production batch of 8 cards was built with surface-mounted pre-tested ULTRA926
devices and hand modified, with a yield sufficient to show the system successfully at the Design
Automation Conference and ARM Developer Conferences in 2006; some boards were then supplied to
UMC for their own use.
Figure 6.7 - 65nm ATLAS926 Voltage Scaling Demo platform 
65nm ATLAS926 DVS/Leakage Demonstration Platform 
The ATLAS926 chips were again designed to use the same BGA package and pin-out as the
DVS926. Fig 6.7 shows a DVS926 Demonstration board reused with surface-mounted ATLAS926
silicon.
The system again required some basic modifications to support the specific voltage requirements
introduced as the result of 65LP technology:
• The Generic IO pads were specified at 2.5V rather than the 3.3V used for previous boards
and required for the standard SDRAM and Flash memory components interfaced direct to
the chip
• The analogue power supply to the on-board 1GHz PLLs in particular needed a clean
2.5V +/-10% supply rail.
• After negotiation with TSMC it was agreed to run the pad-ring at around 2.8V, not high
enough to stress the IO devices for evaluation usage, and high enough to give clean
clocking and signalling at the interfaces of the Flash and SDRAM. (Flash EPROM requires
a minimum of 3.3V - 10% for safe programming, the SDRAM similarly is unsafe below 3.0
volts and the clock rise-time specifications require careful attention.)
• A set of modifications to the Power Supply daughter boards were devised to run the
SDRAM and Flash supply rails at around 3.10V (minor adjustment to the regulator
reference resistor dividers) and the SOC IO ring at 2.8V, and the filtered PLL supply voltage
exhibited clean start-up behaviour.
• The 5 "Slack value" LEDs from the DVS926 and ULTRA926 adaptive voltage scaling
interface were re-used on the ARM PSU board to provide visual indication of:
o 2-bit Leakage mode (Halt/SRPG/Scan-Hibernate/Shutdown)
o 2-bit Performance level (Turbo/Normal/Long-Life)
o Wake up event (enabled interrupt following Wait-For-Interrupt)
An ATLAS926-specific version of the diagnostic monitor was developed with the appropriate IEC
performance control values and support for the new leakage management states.
MPEG-4 video decode demonstration software was successfully ported by the author to the ATLAS
board, but due to a subtle change to the Synopsys DesignWare UART functionality in the release of
the IP used for this project the Linux serial drivers fail with some form of interrupt that cannot be
cleared, so only the basic Linux kernel is ported to this system.
A batch of 8 "production" cards were built with surface-mounted pre-tested ATLAS926 devices and
the system was successfully shown at the Design Automation Conference and ARM Developer
Conferences in 2006, with some boards then supplied to TSMC for their own use.
Figure 6.8 - Diagnostic and Analysis boards
Figure 6.8A - Measurement "break-out" board
Figure 6.8B - Diagnostic LED and Switch module
Diagnostic and Analysis Boards 
The "canonical" board design with a common package footprint has proved a valuable and cost-
effective approach for the EngD research programme. 
Using somewhat old-fashioned 0.1" pitch industrial quality connectors for all the boards apart from 
the miniaturised exhibition system it has been possible to construct simple interface plug-in boards 
for power analysis and diagnostics purposes to aid basic software development - using standard 
prototyping "grid-board" rather than having to resort to further printed circuit board development and 
manufacture. 
Fig 6.8A shows the simple power rail intercept board that was used to monitor current and voltage
with the range of National, ARM and bench power supplies used for voltage scaling
characterization.
The lower edge of Fig 6.8B shows a basic diagnostic board that interfaces 8 LEDs and 8 switches
to generic GPIO lines. It is used for basic bootstrap code development and supports power-on-
self-test diagnostics that set appropriate patterns on the LEDs to indicate good/bad status when
working through the memory and peripheral checks from power up. This board proved invaluable for
the basic manufacturing test of new boards.
The left hand side of Fig 6.8B also shows a specially modified power supply module connected to
the system board. In this case the CPU/Cache-RAM supply regulator has been bypassed and a
1.2V NiCd or NiMH rechargeable cell is connected in place through a switch controlled by a low
voltage relay. The relay mounted up below this provides zero-ohm switching between the battery
source and the core supply 1.2V fixed voltage linear regulator.
This circuitry was designed to enable concrete energy consumption measurements. After fully
charging the rechargeable cell the true integration of current and voltage over time can be
measured for repeating MPEG4 movie playback using different DVFS algorithms for dynamic
energy savings and different leakage modes for Wait-for-Interrupt periods at the end of the data
dependent frame decoding and rendering.
This method provides much more accurate energy consumption measurements than
taking the product of measured average current and measured average voltage, given the
high-speed burst-nature of current consumption for a cached CPU core.
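A small numerical sketch shows why the product of averages can mislead when current and voltage vary together. The two-phase waveform below is invented for illustration and is not measured data; the function names are likewise hypothetical.

```python
def integrated_energy(samples, dt):
    """Energy in joules: sum of v * i * dt over (volts, amps) samples."""
    return sum(v * i for v, i in samples) * dt


def average_product_energy(samples, dt):
    """Energy estimated as mean(V) * mean(I) * total elapsed time."""
    n = len(samples)
    mean_v = sum(v for v, _ in samples) / n
    mean_i = sum(i for _, i in samples) / n
    return mean_v * mean_i * n * dt


if __name__ == "__main__":
    # Hypothetical trace sampled every 1ms: a busy phase at 1.2V drawing
    # 200mA, followed by an idle phase at 0.8V drawing 50mA.
    samples = [(1.2, 0.20)] * 500 + [(0.8, 0.05)] * 500
    dt = 1e-3
    print("integrated      :", integrated_energy(samples, dt))
    print("average product :", average_product_energy(samples, dt))
```

With this invented trace the integrated figure is about 0.14J while the product of averages gives about 0.125J, an underestimate of roughly 11%. The two figures differ because high current coincides with high voltage, which averaging each quantity separately cannot capture.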
Figure 6.9A - 90nm SALT926 packaged silicon and new ZIF socket
Figure 6.9B - 90nm SALT926 Test Board prototype
90nm SALT926 Test Board
At the time of writing up this thesis the SALT926 silicon has just arrived back from packaging and
the first of the new test boards has been built up.
This silicon is packaged in a 388-pin BGA package unlike all the earlier test chips, and incorporates
a 480MHz USB OTG PHY interface that requires special PCB layout, so the board for this project is
being designed by Synopsys using the Author's SDRAM, Flash and Serial RS-232 circuitry.
As can be seen from the socket in Fig 6.9A this is a larger footprint package and has a centre
cluster of 16 (grounded) balls that can be removed from the package for experimental purposes.
Because leakage power is exponentially related to die temperature, the balls that are under the
centre of the die act as a thermal heat-sink for the plastic package and substrate. By removing
these balls the effect of higher die temperature can be analysed once the board is ready.
(The socket acts as a form of heat-sink so the experiment may only be conclusive with directly
surface mounted packages with and without the 16-ball thermal sink.)
Fig 6.9B shows the prototype SALT90G board:
• SDRAM interface and DIMM socket to the top side of the circuit board
• RS-232 sockets, reset switches and PLL configuration to the right hand side
• In the lower right area are GPIO and diagnostic connector links. The chip has a 16-bit GPIO
interface and a multi-purpose 16-bit bidirectional diagnostic channel that can be configured
for 8 different modes of operation, from manufacturing scan test to leakage state machine
visibility for real-time measurements of on-chip power gating entry and exit times and
tracking soft errors in State-Retention entry and exit
• USB OTG mini 5-pin connector to the lower edge of the board
• A small on-board switched-mode power supply is implemented lower left to generate the 1V
core supply rails and 1.8V analogue PLL supply in addition to the 3.3V IO and memory
supplies
• The left hand side circuitry shares the 16-bit static memory interface between an 8Mbyte
on-board Flash memory and an experimental 128x128 OLED panel display which has a
parallel data and command interface to an integrated frame-store. This has been chosen as
an alternative to the CompactFlash approach to external VGA display cards (after all
manufacturing for these appears to have been discontinued)
• To the immediate left of the SOC socket is the Multi-ICE debug agent connector provided
as the primary diagnostic and programming interface
Five boards are being manufactured initially and the first 20 die have been packaged. Only limited
screening of the packaged devices has been possible, to isolate some devices without wire-bond
short circuits.
7. Technology Demonstrator Evaluation and Analysis 
SARS2 180nm SOC Evaluation
The first SOC design was used to "pipe-clean" the standard design flow, and should have been a
straightforward integration exercise. However, to simulate full customer design flows with a
pre-verified foundry CPU core (ARM946 with dual 8K-byte caches in this case) the processor was
treated as a "black-box" design with the following design views:
• Timing Model - a pin-level (Liberty) timing library view (worst and best case timing for
sign-off of setup and hold times)
• Design Simulation Model (an encrypted cycle accurate model with Verilog timing "shell" to
support back annotation of delays from the final SOC layout)
• Physical Model representing footprint, signal and power ports and "keep-out" regions
over the memories
All the above views were insufficient in certain ways and a detailed report was written to feed
back to the ARM CPU hardening team to address deficiencies and issues.
The silicon was packaged untested and had to be debugged on the evaluation board designed for
it. None of the packaged devices would communicate with the Multi-ICE Debug Agent and yet
there were signs of life on the PLL and externally visible SDRAM clock.
This was a major problem given that the Flash EPROM devices soldered onto the board were
blank and were only programmable through the Multi-ICE debugger JTAG interface.
Two approaches were taken to understand the problems:
• The software diagnostic monitor that had been prepared and simulated was blown into
Flash EPROM devices using a bench programmer; the surface mount EPROMs were
de-soldered from the board and surface mount sockets procured and eventually soldered to
the evaluation board. The UART based diagnostic monitor was successfully ported.
• The JTAG interface timing was analyzed with a storage oscilloscope and eventually a
race hazard on the clock synchronization was uncovered. No workaround was possible.
The latter was traced directly to a clock latency modelling problem with the abstracted timing view
and SDF back-annotation, where a negative setup time violation was "rounded" to zero.
SARS2 Lessons Learned
This was a very painful learning experience on what should have been a straightforward SOC design.
The JTAG validation test-bench coverage was increased to check signatures in detail before the next
project was verified, and design review guidelines were developed to address timing sign-off
criteria.
Figure 7.1A - RSD926 DVFS Evaluation (Chip#1, 22°C) - V limit [plot: Vcpu vs CORECLK (MHz), room temperature]
Figure 7.1B - RSD926 DVFS Evaluation [plot: Icpu vs CORECLK (MHz), room temperature]
Figure 7.1C - RSD926 DVFS Evaluation (Chip#1, 22°C) - Power (mW) [plot: core power vs CORECLK (MHz), room temperature]
DVS926 Silicon Evaluation 
Silicon Verification 
Compared to the first SOC design the DVS926 silicon carried a lot more risk, but after the clock
skew hazards unearthed in the first project there had been much greater review scrutiny in the
timing sign-off for this DVFS design.
Again no wafer test was available for cost reasons so the untested packaged devices had to be
screened in an untested board, with untested diagnostic firmware and unknown bonding quality.
Good signs of life were detected from five devices tested, and the Multi-ICE debug agent
established JTAG connection successfully with the ARM926 CPU on four of these devices. The
SOC, CPU and RAM supplies were all set to a fixed, shared 1.2V supply for functional testing.
• Memory-mapped control registers could be successfully read and written
• SDRAM clocks were eventually balanced and timing configured to support download and
upload of data - after a number of scares with byte mask hold times causing some byte
write corruption
• The Flash Programming application was then developed and downloaded to SDRAM to
support programming of the basic diagnostic bootstrap loader
• Finally the terminal-based monitor application was developed and tested and
programmed into Flash memory to support stand-alone operation
DVFS analysis 
In order to test and measure the power and energy savings with Voltage Scaling, and to verify that
the level shifter IP and complex clock pre-compensation scheme functioned correctly across some
degree of reduced CPU/RAM supply range, the diagnostic monitor was augmented to add:
• Performance control programming of the IEC prototype interface to support 100%, 75%,
50% and 25% levels, and a 0% halt mode using wait-for-interrupt for leakage power
measurements
• A stand-alone Dhrystone-based cache-intensive workload to exercise the CPU and
cache memories intensively and repetitively in order to allow average current and power
measurements to be made accurately, and thermal stability to be exercised
• A UART-based protocol to support communication with a LabVIEW configuration to allow
test-bench automation of run/test with success or fail/time-out reporting, allowing an
IEEE488 (HPIB) power supply to be used to sweep voltage and measure current
The measured limits of voltage headroom (Fig7.1A), current (Flg7.1 S), and power (Fig7.1C) are 
shown graphically The slight "hook" In the voltage headroom curve was unexpected but clean 
monotonlc operation In the range 1 32 to 0 BV was encouraging for tYPical SIlicon 
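The automated voltage sweep described above can be sketched roughly as follows. This is an illustration only, not the actual test-bench code: `run_workload` is a hypothetical callback standing in for the UART/LabVIEW "set supply, run, report success or fail/time-out" step, and the stand-in device and its 800mV limit are assumptions chosen to make the sketch self-contained.

```python
from typing import Callable, Optional, Sequence


def find_min_voltage(
    run_workload: Callable[[int], bool],
    voltages_mv: Sequence[int],
) -> Optional[int]:
    """Step the supply downwards and return the lowest passing voltage (mV).

    run_workload(v_mv) is a hypothetical hook: it should set the supply to
    v_mv, run the diagnostic workload and return True on success, or False
    on failure or time-out.
    """
    lowest_pass = None
    for v_mv in sorted(voltages_mv, reverse=True):
        if not run_workload(v_mv):
            break  # first failure ends the sweep; lower voltages are not tried
        lowest_pass = v_mv
    return lowest_pass


# Stand-in for real hardware (assumption): the device works down to 800 mV.
def fake_device(v_mv: int) -> bool:
    return v_mv >= 800


sweep_mv = list(range(1320, 590, -60))  # 1320 mV down to 600 mV in 60 mV steps
vmin_mv = find_min_voltage(fake_device, sweep_mv)
```

Stopping at the first failure, rather than probing every voltage, reflects the fact that a failed run may leave the device needing a reset before the next measurement.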
Figure 7.2A - RSD926 DVFS Evaluation (Chip#1, 22°C) - Energy Consumption
[Plot: DVS926 - Normalized Energy Measurements (Power x Duration to complete, for cached workload). Y axis: normalized energy, 0% to 120%; X axis: Frequency (MHz), 0 to 300. Series: Energy: fixed 1.2V DFS; Energy: Open Loop DVS; Energy: AVS limit +5% (25C); Energy: AVS limit (25C).]
Figure 7.2B - RSD926 DVFS Evaluation (Chip#1, 22°C) - Energy Savings
[Bar charts: Measured Voltage/Power/Energy Analysis, 0% to 120%, plotted against MHz. Legend: Voltage Headroom at 1.2V; Operational Voltage (DVFS); Wasted Power at 1.2V; Operational Power (DVFS); Wasted Energy at 1.2V; Operational Energy (DVFS).]
DVS926 DVFS Energy Savings 
Energy Measurement 
In order to relate the power measurements to energy consumption (the real metric for the IEM product
development) the same Dhrystone workload was run for a fixed number of iterations at each of the
performance levels. The cache-intensive nature of the workload in fact resulted in close to inverse linear
scaling; running at half the performance duly resulted in having to run the workload twice as long.
Fig 7.2A shows the measured power x duration values for energy consumption:
• In the case of no voltage scaling the energy consumption at lower performance levels was
roughly the same (as expected when running at half the power for twice as long).
• The lowest curve shows the limit of operation, at room temperature with the near-typical
silicon.
• The next curve above this, the "AVS" curve, was plotted with the extra (5% or so) voltage
headroom associated with the on-chip process and temperature slack detector control loop functionality.
• The curve above this is the reference with +10% voltage headroom for table-driven open-loop or
non-adaptive dynamic voltage scaling.
Energy savings of 50% for the cached CPU were measurable even with open-loop voltage margins.
The energy saving at the 25% performance level was negligible compared to the 50% performance
point, so only the 100%/75%/50% performance levels were useful to IEM control software.
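The reasoning behind these curves can be illustrated with a small first-order model. This is a sketch under stated assumptions, not the measured data: dynamic power is taken to scale as f x V^2 (the usual CMOS approximation), run time as 1/f, leakage is ignored, and the 0.7 x nominal supply used for the 50% level is a hypothetical value chosen only for illustration.

```python
def relative_energy(freq_fraction: float, voltage_fraction: float) -> float:
    """Energy for a fixed workload relative to full speed at nominal voltage.

    Assumes dynamic power scales as f * V^2 and that run time scales as 1/f
    (the close-to-inverse-linear behaviour reported for the cached workload).
    Leakage is ignored in this sketch.
    """
    relative_power = freq_fraction * voltage_fraction ** 2
    relative_duration = 1.0 / freq_fraction
    return relative_power * relative_duration


# Frequency scaling only: half the power for twice as long, same energy.
dfs_energy = relative_energy(0.5, 1.0)

# Frequency and voltage scaling: a hypothetical 0.7 x nominal supply at 50%.
dvs_energy = relative_energy(0.5, 0.7)
```

In this model the frequency term cancels, so the energy for a fixed workload depends only on the square of the supply voltage; this is why frequency scaling alone saves no energy and why the benefit stops once no further voltage headroom is available.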
Fig 7.2B shows the results displayed graphically as bar charts, in the form that became the basis of
an EE Times article written by the author for publication in January 2004¹.
DVS926 Lessons Learned 
The lack of good low voltage characterization had been a concern going into the project. From the
analysis it is apparent that there was no value in providing the 25% performance level: there was
insufficient voltage headroom to provide any power saving below the 50% performance level. Knowing
this beforehand would have resulted in a different dynamic clock generator design.
The analogue level shifters and isolation clamps had not been simulated at transistor level. There was
indeed a problem when the CPU was stopped and held in Wait-For-Interrupt mode, which resulted in
leakage power measurements much higher than expected from the standard-cell models; this was
traced to floating nodes in the level shifters that caused currents to flow.
The silicon was fully functional for DVFS operation, and once the Linux port was completed the
evaluation platform was made available to the Intelligent Energy Manager software team and eventually
to end customers, to allow commercially sensitive workload profiles to be run and measured with realistic
multi-tasking application software.
Although only typical silicon resulted from the project, the ability to increase the PLL master frequency in
5% steps up to +25% was found to be useful to emulate the effect of running closer to the limit, as if on
slower silicon. This also allowed the increased resolution in voltage/frequency points for Fig 7.1A-C.
¹ http://www.eetimes.com/in_focus/mixed_signals/OEG20040122S0028
Figure 7.3A - ULTRA926 DVFS Evaluation (Chip#2, 22°C) - Current (mA)
[Plot: Current. Y axis: current, 0 to 100 mA; X axis: Voltage, 0.6V to 1.32V. Series: I(288MHz), I(240MHz), I(192MHz), I(144MHz).]
Figure 7.3B - ULTRA926 DVFS Evaluation (Chip#2, 22°C) - Power (mW)
[Plot: Power. Y axis: power, 0 to 140 mW; X axis: Voltage, 0.6V to 1.32V. Series: P(288MHz), P(240MHz), P(192MHz), P(144MHz).]
Figure 7.3C - ULTRA926 DVFS Evaluation (Chip#2) - Dhrystone/second
Freq (MHz)    KDhry/second
144           217.00
192           288.32
240           360.58
288           432.91
Figure 7.3D - ULTRA926 DVFS Evaluation (Chip#2) - Work Duration
Freq (MHz)    1MDhry (milliseconds)
144           4608.30
192           3468.31
240           2773.30
288           2309.97
ULTRA926 Silicon Evaluation
Silicon Verification
Compared to the DVS926 silicon the ULTRA926 project had the benefit of a full set of detailed
transistor-level simulations of the cached CPU core that had been used to design the dynamic
clock generator. Therefore measuring actual power and energy usage compared to predictions
was a large part of this project.
However the evaluation work on this chip ran into two unforeseen problems:
• All of the first 8 devices screened suffered from what appeared to be power supply short
circuits across the VDDRAM supply network to VSS. On more careful inspection this was
found to be a few ohms, and by using a high current limit on the supply some life could be
detected on the SOC. Checking more devices in Taiwan led to the discovery that a few
parts showed higher impedance on the RAM supply, and on re-screening 50 packaged
devices 4 were found to be usable.
• The PLLs were found to suffer from a problem where they would not lock to frequency,
and a swept-frequency signal was found on what should have been the steady SDRAM
clock drivers. The root cause of this and the workaround in the form of an external
96MHz oscillator are described in Chapter 6 for the ULTRA926 evaluation system. Suffice
to say that a couple of functional parts were eventually running cache diagnostics and
ready for analysis.
After updating the memory configuration programming, the diagnostic monitor was largely reused
from the DVS926 project but with a couple of enhancements:
• Performance control programming of the IEC prototype interface updated to support
100%, 83%, 67% and 50% levels for the new, faster 288MHz worst-case design.
• 0% halt mode using wait-for-interrupt for leakage power measurements (without the
floating-node leakage power problems of the previous chip).
The measured current (I/V) and power (P/V) curves are shown in Fig 7.3A and Fig 7.3B
respectively, and show good monotonicity. The measurements were taken at the power supply
connections to the board power connector, so the IR-drops across the circuit board, BGA
socket, bonding and on-chip power rails are all additional to this, while the transistor
simulations do not take any of these into account. Maintaining operation down to 0.72V with
typical silicon at room temperature was therefore encouraging.
Fig 7.3C and Fig 7.3D show the detailed measurements of the cache-intensive Dhrystone application
workload and how this translates into workload duration for the energy consumption analysis.
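The translation from Dhrystone rate to work duration can be checked with a few lines of arithmetic. The sketch below assumes, from the "1MDhry" column heading, that the work unit in Fig 7.3D is one million Dhrystone iterations; the function and variable names are illustrative only.

```python
# Measured Dhrystone rates from Fig 7.3C, in thousands of Dhrystones per second.
KDHRY_PER_SEC = {144: 217.00, 192: 288.32, 240: 360.58, 288: 432.91}


def duration_ms(kdhry_per_sec: float, iterations: int = 1_000_000) -> float:
    """Milliseconds needed to complete `iterations` Dhrystone loops."""
    dhry_per_sec = kdhry_per_sec * 1000.0
    return iterations / dhry_per_sec * 1000.0


durations = {mhz: duration_ms(rate) for mhz, rate in KDHRY_PER_SEC.items()}
```

The values computed this way agree with the Fig 7.3D entries to within about 0.1 ms, the small differences presumably coming from rounding of the tabulated rates.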
Figure 7.4A - ULTRA926 DVFS Evaluation (Chip#2, 22°C) - Energy
V       E(288MHz)   E(240MHz)   E(192MHz)   E(144MHz)
        (mJ)        (mJ)        (mJ)        (mJ)
0.72    (unsafe)    (unsafe)    (unsafe)    37.44
0.78    (unsafe)    (unsafe)    42.12       43.68
0.84    (unsafe)    48.38       49.14       50.40
0.90    (unsafe)    56.16       56.70       57.60
0.96    63.36       63.36       64.80       65.28
1.02    71.40       72.22       73.44       73.44
1.08    81.00       81.65       82.62       84.24
1.14    91.20       91.66       92.34       93.48
1.20    100.80      102.24      104.40      105.60
1.26    112.14      113.40      115.29      115.92
1.32    124.08      126.72      126.72      129.36
Figure 7.4B - ULTRA926 DVFS Evaluation (Chip#2, 22°C) - Energy
[Histogram: Energy. Y axis: 0 to 140; X axis: Voltage, 0.72V to 1.32V. Series: E(144MHz), E(192MHz), E(240MHz), E(288MHz).]
Figure 7.4C - ULTRA926 DVFS Evaluation Overclocked - Energy
CPU MHz   Vmin (limit)   I (mA)   KDhry/sec   Energy (mJ)   Energy consumed
180       0.777          33       271.003     95            58%
240       0.842          47       363.636     109           67%
300       0.932          65       454.546     133           82%
360       1.030          86       542.005     163           100%
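The final column of the table above appears to be each energy figure expressed as a percentage of the 360MHz value. The short sketch below reproduces it under that assumption; the names are illustrative and not taken from the original analysis.

```python
# Fig 7.4C: CPU MHz -> energy (mJ) measured at the minimum working voltage.
ENERGY_AT_VMIN_MJ = {180: 95, 240: 109, 300: 133, 360: 163}


def relative_energy_percent(table: dict) -> dict:
    """Energy at each frequency as a rounded percentage of the fastest point."""
    reference = table[max(table)]
    return {mhz: round(100 * mj / reference) for mhz, mj in table.items()}


percent = relative_energy_percent(ENERGY_AT_VMIN_MJ)
```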
ULTRA926 DVFS Energy Savings
Energy Consumption Analysis
Fig 7.4A shows the tabulated power x duration measurement data at the four
performance levels designed into the ULTRA926 SOC. These were gathered across 5% steps of the
VDDCPU supply, from 60% to 110% of the nominal (1.2V) supply rail. An entry marked "(unsafe)"
indicates that the processor is outside its safe operating range.
The frequencies chosen for the ULTRA926 project are in fact all multiples of a master 48MHz bus
clock at the default PLL multiplier configuration setting:
• 6x (100%) for worst case 288MHz "FMax" sign-off
• 5x (83.3%) for 240MHz
• 4x (66.7%) for 192MHz
• 3x (50%) for 144MHz
The energy measurements all correlate very cleanly: taking a row for a given operating voltage,
the energy consumption is similar at each of the frequencies, with some increase
approaching 5% in the 144MHz case.
Fig 7.4B shows the energy consumption plotted in histogram form to display visually the close-to-linear
energy efficiency relationship measured for the device.
Even with 10% voltage margins added back for safety there are still energy savings of the order of 50%
possible on workloads that can be run for twice as long at half the frequency (50.4 milli-Joules
for 144MHz at 0.84V compared to 100.8 milli-Joules for 288MHz at 1.2V).
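The 50% figure quoted above follows directly from the Fig 7.4A entries. The sketch below works it through, reading the "10% voltage margin" as 10% of the 1.2V nominal supply added to the lowest working voltage at 144MHz (0.72V + 0.12V = 0.84V); that reading, and the helper names, are assumptions made for illustration.

```python
NOMINAL_MV = 1200

# Subset of Fig 7.4A: energy in mJ, indexed by (frequency MHz, supply mV).
ENERGY_MJ = {
    (288, 1200): 100.80,
    (144, 720): 37.44,
    (144, 840): 50.40,
}


def energy_saving(baseline_mj: float, scaled_mj: float) -> float:
    """Fractional energy saved relative to the baseline operating point."""
    return 1.0 - scaled_mj / baseline_mj


# Lowest working supply at 144MHz plus a margin of 10% of nominal.
margin_mv = NOMINAL_MV // 10
safe_mv = 720 + margin_mv

baseline = ENERGY_MJ[(288, NOMINAL_MV)]
saving_with_margin = energy_saving(baseline, ENERGY_MJ[(144, safe_mv)])
saving_at_limit = energy_saving(baseline, ENERGY_MJ[(144, 720)])
```

With the margin applied the saving is 50%; at the measured limit of operation, with no margin, the same calculation gives roughly 63%.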
The ULTRA926 CPU subsystem was implemented on UMC 130HS process technology, which is
faster but leakier than the 130LL Low Leakage technology used to implement the rest of the
SOC. The leakage power becomes apparent in the energy "losses" for the 83%, 67% and 50%
performance levels compared to 100%. For leakier process technology nodes this is a reminder
of the balance that must be struck between running slower in order to reduce voltage and
power as much as possible, and the expense of burning leakage power for longer.
The "shuttle" silicon for this project was confirmed as close to typical process by UMC. To explore
the case of slower silicon, in order to observe and understand the energy savings possible, the
master PLL frequency configuration was changed to raise FMax to 360MHz.
The measured minimum voltages for the four supported fractional performance levels are
tabulated in Fig 7.4C, together with the energy consumption savings possible while emulating
"moving the silicon" closer to the edge, as if slower.
Figure 7.5A - ULTRA926 160x120 MPEG4 Decode Workload
Figure 7.5B - ULTRA926 DVFS analysis test bench
MPEG-4 QQVGA Movie Playback Workload Analysis
In order to build a repeatable dynamic workload environment a 25-frame-per-second 160x120
(quarter-quarter-VGA) movie and a software-only MPEG-4 decoder application were developed and programmed
into the on-board Flash EPROM. The small display is representative of mobile-phone and small PDA
devices, and is sufficient to run realistic application workloads.
The display hardware was in the form of a memory-mapped Compact Flash VGA card that was
supported by the Demonstration and Exhibition boards developed from the original DVS926 evaluation
board. The CF-VGA card has an integrated frame-store buffer, and only changes from the previous frame
need to be updated, which avoids an expensive complete video buffer copy in software every frame.
Fig 7.5A shows example display output as shown on an external VGA monitor:
• The centre window shows a frame of the 20 second (500 frame) compressed movie.
• The lower window displays a rolling display of the dynamic performance control for every frame
of the movie. The Yellow (lower) trace shows the fractional performance level requested by the
movie player, which varies according to the complexity of the motion estimation and decode
on the video stream, and how far ahead of the next real-time frame display interrupt
the decoder is running.
• The Red (upper) trace shows the actual performance level, quantized in hardware to the next
highest available frequency. Although hard to see, it is just possible to discern the 50%
(half-height), 67%, 83% and occasional 100% levels dynamically chosen for frames in the example
movie. It will be observed that the video player can keep up with the simple frames of the movie
much of the time at 50% or 67% of the 288MHz CPU core, but on complex video sequences
this grows occasionally to 83% and even 100% (around half-way along the rolling frame history
window).
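The quantization of a requested performance level to the next highest available frequency can be sketched as below. This is only an illustration of the behaviour described above, not the hardware implementation: the available levels are the 3x to 6x multiples of the 48MHz bus clock listed earlier, and the function name and clamping behaviour for out-of-range requests are assumptions.

```python
BUS_CLOCK_MHZ = 48
MULTIPLIERS = (3, 4, 5, 6)  # 144, 192, 240 and 288 MHz
FMAX_MULTIPLIER = MULTIPLIERS[-1]


def quantize_request(requested_fraction: float) -> int:
    """Return the lowest available CPU frequency (MHz) that meets the request.

    requested_fraction is the performance asked for by software, as a
    fraction of full speed. Requests below 50% are raised to the 50% level
    and requests above 100% are clamped to full speed.
    """
    for multiplier in MULTIPLIERS:
        if requested_fraction <= multiplier / FMAX_MULTIPLIER:
            return BUS_CLOCK_MHZ * multiplier
    return BUS_CLOCK_MHZ * FMAX_MULTIPLIER
```

For example, a request for 55% of full performance would be served at 192MHz (the 67% level), which is why the Red trace always sits on or above the Yellow trace.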
DVFS (and PLL) Testbench
Fig 7.5B shows the equipment test-rig used to evaluate the ULTRA926 devices.
• A Hewlett-Packard (Agilent) programmable power supply, shown as the lower test rack, was
used to provide the voltage and current testing of the dynamic and fixed power rails.
• An HP Frequency Generator (balanced on top of the PSU) is also shown. This was used to
develop the workaround for the PLL reference frequency divider, which in the end required
overdriving a sinusoidal 96MHz signal with managed DC-offset voltage into a 12MHz crystal
oscillator circuit with internal harmonic filtering, after much experimentation.
Lessons learned
The detailed low voltage characterization at the design phase of the ULTRA926 project resulted in
highly usable DVFS performance, and the choice of four performance levels between 50% and 100%
offered much better algorithmic control compared to the previous DVS926 project.
The chip proved a useful testing ground for the Artisan Analogue IP, including picking up the
divergence between the simulation model and the silicon for the PLL design ahead of other customers.
This chip design became the case-study for a tutorial the author was asked to present at DAC2005.
Figure 7.6A - ATLAS926-65LP evaluation test bench
Figure 7.6B - ATLAS926-65LP Evaluation (Chip#21, 22°C) - Power (mW)
[Plot: DVFS Power Consumption, CPU+RAM. Y axis: power, 0 to 80 mW; X axis: Operating Voltage (20mV steps). Series: 240MHz, 192MHz, 144MHz, 96MHz, 48MHz.]
Figure 7.6C - ATLAS926-65LP Evaluation (Chip#21, 22°C) - Energy (%)
[Plot: DVFS Energy Consumption (relative to 1.08V), CPU+RAM. X axis: Operating Voltage (20mV steps).]
ATLAS926-65LP Silicon DVFS Evaluation
The ATLAS926 silicon suffered from a problem that the JTAG Multi-ICE connection failed,
regardless of speed of connection and JTAG clock duty cycle. After some hours tracing this
problem, a Verilog coding error in the edge synchronization of the on-chip debug clock was
discovered, which had not been picked up in the self-checking simulation test-bench. A workaround
was devised that inverted the JTAG clock (TCK) to the board and inverted the return clock
(RTCK) from the board, such that the CPU sampled the debug data on the alternate edge of the
clock. A small daughter board was built that intercepted the 20-pin Multi-ICE connector and
proved to fix this problem transparently (shown mounted vertically in Fig 6.7).
Subsequently the monitor code could be tested and updated, and the Flash programming all
worked reliably.
• The monitor was enhanced to support the 100%/80%/60%/40%/20% DVFS performance
levels.
• A number of leakage control functions were added, described on the next page.
The DVFS design for the 65nm LP process was complicated by the fact that the standard cell
logic portion of the CPU had headroom on the 1.2V nominal voltage process to support voltage
scaling, but the (cache) RAMs were not guaranteed to have any voltage scaling headroom. In
order to support this, level shifters had to be added onto every signal to be "up-shifted" from the
CPU domain to the RAM domain, with RAM outputs down-shifted back to the CPU. Understanding
the timing impact of adding the level shifters onto the critical path cache access circuitry, and
validating that the timing relationships at the RAMs were not violated by voltage scaling,
had been hard to sign off before tape-out.
The VDDCPU and VDDRAM supplies were separately bonded out of the ATLAS design and
supported interception for independent current monitoring and safe working voltage testing.
Fig 7.6A shows the DVS test-rig adapted for the ATLAS silicon.
The limits of voltage scaling were mapped, and the current measurements for reliable operation
were captured and used to derive the measured steady-state power graphs shown in Fig 7.6B at
20mV intervals. The graphs are basically monotonic, but the current resolution of 1mA resulted in
slightly quantized current readings, giving the staircase appearance of the graphs. Below
1.08V a shallower gradient is just apparent; this shows the effect of the RAM voltage not
being scaled below this voltage while the standard cell logic is scaled a further 200mV.
Fig 7.6C shows the energy consumption after the workload duration scaling has been factored in,
normalized to the 1.08V (1.2V - 10%) operating point. Again the energy efficiency gradient is
slightly shallower below this voltage, due to the RAM energy cost remaining constant below this
point.
Figure 7.7A - ATLAS926-65LP Leakage Evaluation (Chip#21, 22°C)
[Bar chart: Halt-Mode Leakage. Y axis: leakage power, 0 to 700 uW; X axis: Voltage (V). Series: CG-CPU(uW), RAM(uW).]
Figure 7.7B - ATLAS926-65LP Leakage Evaluation (Chip#21, 22°C)
[Bar chart: Power Gating. Y axis: leakage power, 0 to 80 uW; X axis: Voltage (V). Series: PG-CPU(uW), RAM(uW).]
Figure 7.7C - ATLAS926-65LP Leakage Evaluation (Chip#21, 22°C)
[Two plots: Leakage Mitigation Analysis, on linear and log scales. X axis: Voltage. Series include RETENTION and HIBERNATE.]
ATLAS926-65LP Silicon Leakage Mitigation Evaluation
The leakage mitigation techniques for the ATLAS design were analysed in detail. The leakage
control states are all entered by the processor executing a Wait-For-Interrupt instruction,
completing all outstanding bus transactions and signalling that clocks may be stopped. The
metrics of interest are not only how much leakage power may be saved over and above simply
stopping the clocks, but also the real-time wake-up cost of each mode.
• HALT mode allows stopping the clock, has the fastest real-time wake-up time following
an interrupt and allows the baseline static leakage to be measured. This is equivalent to
high-level (architectural) clock gating. Fig 7.7A shows the leakage power measured at
5% voltage steps from 55% to 110% of nominal for both the mixed-Vt standard cell logic
and the High-Vt cache RAM partitions of the CPU. In fact, because voltage scaling has
real-time costs, the only "instantaneous" wake-up condition is one where the CPU is
re-awoken at the same scaled voltage at which it was halted. However the analysis is proving
useful to understand how to use header switches to power gate between a reduced-voltage
"retention" supply and the standard full-current rail.
• LIGHT SLEEP (SRPG) mode offers a transparent "emulation" of clock gating. After the
clock is stopped the state is internally retained in a low-leakage, always-powered "balloon
latch" and then all the leaky logic is switched off using local power gates (fine-grain
intra-cell footer switches in this project). The real-time wake-up cost is impacted by the times
required to safely power the logic back up and restore state from the balloon to the active
register latches. Fig 7.7B shows the leakage power measured at 5% voltage steps from
55% to 110% of nominal for both the mixed-Vt standard cell logic and the High-Vt cache
RAM partitions of the CPU. At 1.2V, power gating reduces the leakage power by more than
8-fold compared to HALT-mode clock gating, and the RAM now
becomes the dominant leakage power component. Because only the Low-Vt and
Standard-Vt standard cells have fine-grain power gating support, the residual leakage
power comes from the always-on networks that control the power gating, the save and
restore control networks, the balloon latches and the remaining High-Vt logic on
non-critical timing paths.
• DEEP SLEEP mode is the scan-based state save and restore mode that allows the state
to be stored in (slower, lower power) memory on-chip or even off-chip. This has higher
real-time latency penalties than SRPG but allows the CPU logic rail to be powered off
completely. The RAM power measurements are the same as in Fig 7.7B, while the
power-gated logic component reduces to zero. For this particular technology it can be seen that,
as the RAM leakage power dominates, savings are only of the order of 30% over SRPG,
but on higher leakage technologies this will be more valuable.
Fig 7.7C shows the leakage power plotted on linear and log scales, and the effect of the "floor"
on VDDRAM of 1.08V required to ensure retention, for Clock Gating, State-Retention
Power-Gating and scan-based rail switching.
Figure 7.8A - ATLAS926-65LP Leakage/Temperature analysis (Chip #21)
[Two plots: "HALT(CG) vs Temp" (upper) and "SRPG vs Temp" (lower). Y axis: leakage power; X axis: Temp (C), 0 to 80. Series: 0.6V, 0.7V, 0.8V, 0.9V, 1.0V, 1.1V, 1.2V, 1.3V.]
Figure 7.8B - ATLAS926-65LP CG/PG Ratio (Chip #21)
[Plot: HALT/PG vs Temp. Y axis: ratio, 0 to 35; X axis: Temp, 0 to 80. Series: VDD = 0.6V, 0.7V, 0.8V, 0.9V, 1.0V, 1.1V, 1.2V, 1.3V.]
ATLAS926-65LP Thermal Leakage Characteristics
All the initial leakage measurements and analysis were at room temperature. The die-size is
small (4x4mm) and the CPU dynamic power dissipation is less than 75mW at 1.2V for the DVFS
analysis work, so the thermal characteristics are optimistic compared to a larger real-world SOC
design with higher-power integrated subsystems. Because leakage grows exponentially
with temperature, a follow-on set of measurements was conducted in an environmental
control chamber that supported controlled temperatures between 0°C and 80°C.
Understanding the temperature characteristics is also important from the perspective of the
complications of effects such as "temperature-inversion", where delay characteristics become
non-monotonic in relation to temperature and highly dependent on the supply voltage.
Fig 7.8A shows the leakage power graphs measured at 10-degree Celsius intervals for the
standard-cell portion of the CPU subsystem (i.e. without the RAM portion). The upper graph
shows the measurements for clock gating using the HALT mode. The lower graph shows the
measurements for state-retention power-gating using the LIGHT SLEEP leakage mitigation
mode. As expected the effects of increased temperature are dramatic:
• Clock-gated (CG) leakage grows more than six-fold between room temperature and
80°C, to more than 1mW for this "typical" silicon.
• SRPG leakage grows less strongly, roughly quadrupling over the same temperature
range to a little over 40uW.
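The growth factors quoted above can be turned into a rough "doubling interval" if the leakage is assumed to follow a pure exponential in temperature. This is a simplified sketch, not a fit to the measured curves: the exponential model, the use of 22°C (from the figure captions) as room temperature, and the rounded six-fold and four-fold factors are all assumptions.

```python
import math


def doubling_interval_c(growth_factor: float, delta_t_c: float) -> float:
    """Temperature rise (deg C) that doubles leakage, assuming exp growth.

    Model: P(T) = P0 * exp(k * (T - T0)), so k = ln(growth) / delta_T and the
    doubling interval is ln(2) / k.
    """
    k = math.log(growth_factor) / delta_t_c
    return math.log(2.0) / k


ROOM_C, HOT_C = 22.0, 80.0  # room temperature taken from the figure captions

cg_doubling = doubling_interval_c(6.0, HOT_C - ROOM_C)    # clock gating, ~6x
srpg_doubling = doubling_interval_c(4.0, HOT_C - ROOM_C)  # SRPG, ~4x
```

Under these assumptions the clock-gated leakage doubles roughly every 22°C and the SRPG leakage roughly every 29°C; since the clock-gated figure is quoted as "more than" six-fold, its true doubling interval would be somewhat shorter.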
Fig 7.8B shows the relative savings of SRPG over Clock Gating. This is a much
more interesting metric from the perspective of understanding the relative cost functions of how
much leakage power can be saved at particular operating conditions versus the energy costs to
enter and exit deeper levels of leakage mitigation and their associated wake-up latencies. Again
this is shown for the standard-cell portion of the processor where the power gating is applied:
• The savings grow from roughly 20-fold at room temperature to 30-fold at a temperature of
60°C. The experiment was conducted over two days to allow full thermal stability to be
reached by the board and the plastic package, to ensure that the measurements closely
tracked die temperature.
• The decrease in the effectiveness of SRPG over static leakage at elevated temperatures
is related to the proportion of non-power-gated High-Vt logic to power-gated Low and
Standard-Vt cells; the High-Vt leakage grows faster at elevated
temperatures compared to the lower Vt's, so starts to dominate the leakage characteristic.
• The thermal profile matches well the die-temperatures of SOC products in low-cost
plastic packages, and the results are valuable to understand for leakage management
control algorithms.
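The trade-off between leakage saved and the energy cost of entering and leaving a deeper mode can be expressed as a break-even idle time. The sketch below is illustrative only: the leakage figures are the approximate 80°C values quoted earlier for the standard-cell logic (about 1mW clock-gated, about 40uW with SRPG), while the 10uJ transition energy is a made-up placeholder, since no enter/exit energy is given in the text.

```python
def break_even_idle_s(
    transition_energy_uj: float,
    p_shallow_uw: float,
    p_deep_uw: float,
) -> float:
    """Idle time (s) beyond which the deeper mode saves net energy.

    transition_energy_uj is the extra energy spent entering and leaving the
    deeper mode; the two powers are the leakage in each mode (microwatts).
    """
    saved_power_uw = p_shallow_uw - p_deep_uw
    if saved_power_uw <= 0:
        raise ValueError("deeper mode must leak less than the shallower mode")
    return transition_energy_uj / saved_power_uw  # uJ / uW = seconds


# Leakage figures are the approximate 80 C values quoted above; the 10 uJ
# enter/exit cost is a made-up placeholder, not a measured value.
t_break_even = break_even_idle_s(10.0, 1000.0, 40.0)
```

A leakage management control algorithm could compare the expected idle period against such a threshold before selecting a deeper mode; because the saved power grows with temperature, the break-even time shortens as the die heats up.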
Figure 7.9A - ATLAS926-65LP Leakage/Temperature analysis (Chip #21)
[Two bar charts on log scales, TSMC 65LP (ATLAS Silicon #21), typical silicon, 1.2V nominal. Upper: "ARM926 CPU+CACHE leakage power", series CPU_HALT(uW) and RAM_HALT(uW). Lower: "ARM926 CPU+CACHE - SRPG leakage", series CPU_SRPG(uW) and RAM_HALT(uW). Y axis: leakage power, 1 to 10000 uW; X axis: Temperature (C), -10 to 100.]
Figure 7.9B - ATLAS926-65LP CG/PG Ratio (Chip #21)
[Bar chart: RELATIVE LEAKAGE POWER SAVINGS (OVER PORTABLE BATTERY PRODUCT TEMP RANGE), TSMC 65LP (ATLAS Silicon #21), typical silicon, 1.2V nominal. Y axis: HALT:SRPG ratio, 0 to 10; X axis: Temperature (C), -10 to 100. Series: HALT:SRPG RATIO, with polynomial trend line.]
Fig 7.9A shows the leakage power graphs measured at 10-degree Celsius intervals for the
complete CPU subsystem, both standard cells and the High-Vt RAM portion. The upper
temperature was extended up to 100°C, the lower end reduced to -10°C, and the graphs are
plotted on log scales to show more clearly the RAM proportion of the leakage power and how this
varies with temperature.
The upper graph shows the measurements for clock gating using the HALT mode. The lower
graph shows the measurements for state-retention power-gating using the LIGHT SLEEP
leakage mitigation mode. As expected the effects of increased temperature are dramatic:
• Clock-gated (CG) leakage grows more than ten-fold over a +70°C temperature change
for this "typical" silicon.
• SRPG leakage is dominated by the RAM leakage power; the cached CPU leakage
power reaches 100uW at only +30°C above room temperature.
Fig 7.9B shows the relative savings of SRPG over Clock Gating, taking into
account both RAM and logic:
• The savings grow to roughly 8-fold at a temperature of 40°C. Again the experiment was
conducted over a couple of days to ensure thermal equilibrium for the board and the
plastic package, so that the measurements closely tracked die temperature.
• The decrease in the effectiveness of SRPG over static leakage at elevated temperatures
is related to the proportion of non-power-gated High-Vt logic to power-gated Low and
Standard-Vt cells; the High-Vt leakage grows faster at elevated
temperatures compared to the lower Vt's, so starts to dominate the leakage characteristic.
Because the RAM leakage is significant, this depresses the "peak" ratio temperature from
the +60°C figure established in Fig 7.8B for the logic alone.
• The thermal profile still matches well the die-temperatures of portable battery powered
SOC products in low-cost plastic packages, and the results are valuable to understand
for leakage management control algorithms.
Lessons learned
The ATLAS926 project has proved a very useful vehicle to understand 65-nanometer low leakage
processes, and to explore the merits of both dynamic voltage scaling on this higher voltage 1.2V
technology and the relative merits of a number of leakage mitigation schemes under development.
The chip proved a useful testing ground for RTL design techniques for state retention and on-chip power
gating, strengthened the relationship with TSMC and opened up 45nm collaboration potential.
This chip design was demonstrated running MPEG-4 movie decode applications at DAC2006 and the
results were presented at joint ARM/TSMC customer booth sessions during the conference.
Finally the project provided a prototype development platform for the leakage extensions to the IEC and
the basis for a follow-on coarse-grain leakage project on leakier 90nm "G" technology.
8. Patents Filed/Granted
A number of patentable ideas have been generated during the research programme.
System Control for power management
• US 6,883,102, developed with two colleagues in the Austin Design Centre
• 5 US patent applications for dynamic performance control, including
o US 7,181,633
o US 7,194,647
• 1 US patent application for AMBA bus-based state save and restore
o US 2004/0153762, close to notice of allowance
RTL coding for power management
• US 6,950,951, a protocol for RTL design that reuses the asynchronous reset coding for
synthesis as a power ready control acknowledge for multi-voltage design
Physical IP for leakage management
• US 7,154,317, developed with David Howard for a zero-pin overhead StateSaver
retention register
• 3 other patent applications in progress covering
o power control networks
o reduced leakage isolation clamps
o leakage optimized registers
Only the published patents or those with notice of allowance are discussed in this chapter.
Figure 8.1 - US 6,883,102 - Power Management Control API Patent
[Figure: front page of United States Patent US 6,883,102 B2, "Apparatus and Method for Performing Power Management Functions", filed Dec. 18, 2001, granted Apr. 19, 2005; inventors include David Walter Flynn, Cambridge (GB); assignee ARM Limited, Cambridge (GB). The abstract describes a power management controller with an emulation mode, so that some power management software can be tested before all aspects of the power management hardware have been designed.]
US 6,883,102 Power Management Control API Patent 
Developed to support the work of the ARM10 project team in the ARM Austin Design Center.
A control interface was developed that mapped the control API (Application Programmer
Interface) through a co-processor or memory-mapped interface to abstract the design-specific
power supply control interface. A secondary mode of operation was added to allow the software
emulation of entry to and exit from various levels of dynamic performance and various depths of
sleep, addressing a verification challenge voiced by Operating System companies on simpler
hardware-focussed designs.
An embodiment of the interface was built into the ARM1020E processor macro-cell design.
The API maps between an extensible set of performance levels or power states and a simple
generic interface to power supply controllers with hardware handshake request and acknowledge
support.
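The emulation mode described above can be illustrated with a small behavioural model. This is a minimal sketch only: the class names, state names and methods are hypothetical and are not taken from the patent or from the ARM1020E implementation. It assumes a controller that either drives a request/acknowledge handshake to a power supply controller or, in emulation mode, records the requested state without touching the supply.

```python
from dataclasses import dataclass, field


class SupplyController:
    """Hypothetical power supply controller with a request/acknowledge handshake."""

    def __init__(self):
        self.level = "RUN"

    def request(self, level):
        # A real controller would wait for the supply to settle before
        # acknowledging; this model acknowledges immediately.
        self.level = level
        return True  # acknowledge


@dataclass
class PowerManagementController:
    """Maps abstract power states onto a generic supply interface (illustrative)."""

    supply: SupplyController
    emulate: bool = False                      # emulation mode: no hardware access
    state: str = "RUN"
    log: list = field(default_factory=list)

    def set_state(self, new_state):
        if self.emulate:
            # Software observes the state change, but the supply is untouched.
            self.log.append(("emulated", new_state))
        else:
            if not self.supply.request(new_state):
                raise RuntimeError("supply did not acknowledge request")
            self.log.append(("hardware", new_state))
        self.state = new_state
        return self.state
```

With `emulate=True` the operating system code path that calls `set_state` can be exercised on hardware that has no switchable supply, which is the verification use case the text describes.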
Original Submission:
Proposed title
Communications channel and method for communication of power-down state-change events
Novel approach
Architecturally defined co-processor accessible communications channel to provide a memory-map
independent programming interface to a system-specific power controller
• hardware handshake protocols ensuring safe power domain isolation and reset control
• software handshake protocols for OS system portability across SoC designs
• software transparent scheme to support portable OS handler code to support technology
dependent features (low-leakage CMOS using clock-gating or high-performance CMOS
processes supporting logic/memory power-down)
Applicability
ARM software architectural mode for (applications) processors that run the system power
management
Visibility
Software visible CP15 access mechanism ensuring OS 'lock-in' and easy policing of patent
infringement
Figure 8.2 - US 6,950,951 RTL Power Control Interface Patent
[Figure: front page of United States Patent US 6,950,951 B2, "Power Control Signalling", inventor David Walter Flynn, Cambridge (GB); assignee ARM Limited, Cambridge (GB); filed Apr. 30, 2002, granted Sep. 27, 2005. The abstract describes an active-high power enable request from a power controller triggering a power supply unit, with an active signal applied to the reset input holding each power domain in an inactive reset state until the power supply signal becomes valid. The accompanying block diagram shows, for each power domain, a PSU, a power controller (PWR_CTL) and the domain itself, linked by PWR_EN, PWR_VALID, ACTIVE and nRESET signals.]
US 6,950,951 RTL Power Control Interface Patent 
Original Submission:
The novel approach is to re-use the reset signalling - explicit and visible to the RTL designer
- as the "power valid" signalling protocol to each and every power domain in a system.
Industry standard cell libraries typically offer register functions with active-low asynchronous
resets and this is exploited in typical embodiments.
In essence the protocol specifies
• active-high power domain 'request' signalling
• active-low reset handshake to power domain
• active-high signalling for all outputs from the power domain to other domains
• active-high signalling on all input signals to the domain other than reset(s)
• optional (AND) gating of outputs from other domains using active-low reset handshake
• optional (AND) gating of power-enable and domain reset to minimize power, explicitly to
support implementations where there are fewer physical power domains than explicit
system design domains
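The protocol rules listed above can be sketched as simple Boolean functions. This is an illustrative model under stated assumptions, not the patented circuit: the function names are hypothetical, and the signal names (PWR_EN, PWR_VALID, nRESET) are used only as labels for the request, supply-valid and reset-handshake roles.

```python
def domain_n_reset(pwr_en, pwr_valid):
    """Active-low reset handshake for one domain.

    nRESET is released (1) only when power has been requested (PWR_EN)
    and the supply reports valid (PWR_VALID); otherwise it is held low.
    """
    return int(bool(pwr_en) and bool(pwr_valid))


def clamp_outputs(outputs, n_reset):
    """AND-gate each active-high output with the active-low reset handshake.

    While nRESET is low (domain off, or supply not yet valid) every output
    seen by other domains is forced to the inactive (0) level.
    """
    return [int(bool(bit) and bool(n_reset)) for bit in outputs]
```

Because every signal is active-high and reset is active-low, a powered-down domain presents all-zero (inactive) outputs to its neighbours using nothing more than AND gates, which is why familiar reset coding styles are sufficient for the RTL designer.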
Highly applicable to
• Soft-IP creation within ARM to support power domain switching and at platform level for
controllable sub-systems
• Potentially a fundamental foundation for AMBA 3.0 interconnect specification (which must
be power aware)
• Very appropriate to provide patent cover in negotiations with off-chip power supply
regulator manufacturers for external analogue power devices
• Potentially a hook into EDA vendors for tools support for ARM low-power IP
Discoverability/Visibility
• Compared to the implicit usage of the concept in ARM102xE, the aim is to make this
explicitly visible for soft-IP
• Especially easy to 'police' as the protocol should be visible between SOC designs and
external power controllers - and compatibility with external power controllers employing
this protocol
Novelty/Non-obviousness
• The inventive step claimed builds significantly on simple prior art of 'voltage valid'
power supply signalling by allowing the HDL (Hardware Design Language) designer
to work with design rules that can support multiple power/voltage domains using
familiar reset coding styles
Figure 8.3 - US 2004/0153762 A1 Bus-based state save and restore
[Figure: front page of United States Patent Application Publication US 2004/0153762 A1, "Hardware Driven State Save/Restore in a Data Processing System", Flynn et al., filed Oct. 24, 2003, published Aug. 5, 2004; assignee ARM Limited, Cambridge (GB). The abstract describes state data from a circuit being saved to a memory via a system bus under control of a state saving controller; the state data may be captured within scan chains provided for production test, with these scan chains supplying respective bits to the multi-bit state saving data words stored to the memory. The accompanying diagram shows a conventional CPU in an SOC design with its AHB interface.]
US 2004/0153762 A1 Bus-based state save and restore 
US Patent Office - final updates in progress to gain allowance
Novel method for high-speed CPU state save/restore for leakage-sensitive applications where the
CPU must be powered down/up for standby efficiency, with extensibility for on-chip diagnosis
(Dominic Symes' extension proposal).
Builds on the author's Patent US 5,525,971 granted for the AMBA Test Interface Controller
originally developed for full-custom core test purposes
• Configure scan-ready CPU core for 32 scan-chains: re-use the 32 HRDATA input pins
as scan-in, 32 HWDATA outputs as scan-out
• Use standard ATPG or core-based test tools for test coverage
• 'Wrap' this CPU core with a thin AMBA (AHB) bus-master sequencer that can also
control the scan enable signal
When power management requires CPU to be frozen/powered down
• Sequencer writes all the D-type state to memory in simple burst mode using standard
AMBA protocols using a programmed base address (looks to system like a bus
master doing lots of buffered word writes)
When power manager needs to wake up the core then, when voltage stable
• Sequencer reads 32-bit data from memory and shifts into scan chains using standard
burst reads (looks like lots of cache fills to system)
• Once fully reloaded simply disable scan-enable and re-enable the CPU to start with
the entire state restored [This is much easier than trying to use ARM code to reload
CPU, Cache control, MMU state, VFP, and internal control state]
Diagnostic usage
D. Symes introduced the idea of being able to not only save state but then to re-load the
scan-chains in the same manner from diagnostic code "sessions" - including complete test context
switching
• Appropriate for safety-critical (automotive) applications where the CPU core may
need to have one of a number of diagnostic tests run every millisecond and
potentially run whenever the CPU is idle
e.g. An ARM7TDMI-S (with 3700+ D-types including full register-bank etc) could be
state saved with 116 word writes (for balanced scan chains)
• Which at 120MHz (0.18u target) is only about a microsecond
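The figures quoted above can be checked with a short calculation. This is a sketch of the arithmetic only; it assumes perfectly balanced scan chains and an ideal burst that transfers one 32-bit word per bus clock cycle with no wait states, neither of which is stated explicitly in the text.

```python
import math

D_TYPES = 3700          # register count quoted in the text ("3700+")
SCAN_CHAINS = 32        # one scan chain per data-bus bit
BUS_CLOCK_HZ = 120e6    # 120 MHz target quoted in the text

# Each word write captures one bit from every chain, so the number of
# writes equals the length of the longest (balanced) chain.
words = math.ceil(D_TYPES / SCAN_CHAINS)

# Assumption: one word transferred per bus clock cycle.
save_time_us = words / BUS_CLOCK_HZ * 1e6

print(words, round(save_time_us, 2))
```

Under these assumptions the result is 116 word writes taking a little under one microsecond, consistent with the "about a microsecond" estimate.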
Patent attributes
• Highly visible as AMBA core is compliance checked
Figure 8.4 - US 7,181,633 IEC Performance Available Response
[Figure: front page of United States Patent Application Publication US 2004/0139361 A1, "Data Processing Performance Control", inventor David Walter Flynn, Cambridge (GB); filed Nov. 19, 2003, published Jul. 15, 2004. The abstract describes a processor generating control signals that cause one or more further circuits, such as a clock generator and a voltage controller, to adopt operational states supporting different performance levels, with those circuits returning current operation signals indicating the clock signals and the maximum voltage level currently available. The accompanying block diagram shows the Dynamic Performance Monitor (option), the Dynamic Performance Controller, CPU and AMBA clock generators, the SOC-specific clock generator, the adaptive voltage controller with technology-dependent slack detector, the PSU and dynamic voltage scaler, and an isolated voltage domain with level-shifters.]
US 7,181,633 IEC Performance Available Response
US Patent Office - Granted: 20-February-2007
During the development of the Intelligent Energy Controller hardware abstraction layer a novel
approach was developed to the interface protocol between
• The programmed performance level from the Intelligent Energy Controller (IEC)
• The available performance level(s) from the SOC-specific dynamic clock generator
This supports variable latency PLL lock times on dynamic clock generation and support
for minimum bus-clock frequencies always available to ensure "forward progress"
• The available level(s) of voltage from the SOC-specific dynamic voltage controller
This supports variable latency power supply settling times on dynamic voltage scaling
generation and support for reduced level power available to ensure "forward progress"
• The actual performance level fed back to the IEC to support automated hardware
fractional performance level accumulation to cope with variable latency clocks and power
supply ready timing
A thermometer coding scheme was adopted for each interface. This supports independent bit
synchronizing across the multiple clock domains for each of the above interfaces and allows very
simple logical gating to determine available performance levels.
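The "very simple logical gating" that thermometer coding allows can be sketched as follows. This is an illustrative model only: the function names, the four-level width and the LSB-first bit ordering are assumptions, not details taken from the IEC specification.

```python
def thermometer(level, width):
    """Encode level (0..width) as a thermometer code, lowest bit first.

    Level 3 of 4 is encoded as [1, 1, 1, 0]: one bit per level, with all
    bits below the current level set.
    """
    if not 0 <= level <= width:
        raise ValueError("level out of range")
    return [1 if i < level else 0 for i in range(width)]


def available_level(target, clock_avail, voltage_avail):
    """Bitwise-AND the three codes and count the set bits.

    Because each input is a thermometer code, the AND is itself a
    thermometer code for the highest level that all three support.
    """
    granted = [t & c & v for t, c, v in zip(target, clock_avail, voltage_avail)]
    return sum(granted)
```

A change of one performance level flips exactly one bit of the code, so each bit can pass through its own synchronizer when crossing a clock domain, and combining the requested, clock-available and voltage-available levels needs only one AND gate per bit.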
Figure 8.5 - US 7,194,647 IEC PWM Dynamic Performance Scaling
[Figure: front page of United States Patent Application Publication US 2004/0138833 A1, "Data Processing Performance Control", inventor David Walter Flynn, Cambridge (GB); filed Nov. 19, 2003, published Jul. 15, 2004. The abstract describes performance control of a processor core achieved by modulating between a processing mode power supply configuration, in which the processor core is clocked, and a holding mode power supply configuration, in which the processor core is not clocked, so that a target performance level may be achieved and energy consumption whilst in the holding mode can be reduced. The accompanying block diagram shows the Dynamic Performance Monitor (option), the Dynamic Performance Controller, CPU and AMBA clock generators, the SOC-specific clock generator, the adaptive voltage controller with technology-dependent slack detector, the PSU and dynamic voltage scaler, and an isolated voltage domain with level-shifters.]
8.5 US 7,194,647 IEC PWM Dynamic Performance Scaling
US Patent Office - Granted: 20-March-2007
By adding a "BUSY" status signal to a CPU core or processing subsystem to indicate on
de-assertion that the system is ready to go idle and all buffered bus transactions are complete,
and mandating the use of active-high protocol signalling such that all signalling is in an
inactive (low) state, the fractional performance control interface developed for the IEC may
be used to emulate a multiple-step Dynamic Voltage Scaling system simply by controlling a
power supply to switch between a normal voltage, V_RUN (sufficient to support 100% CPU
performance), and a retention-level voltage, V_HOLD. The control signal would typically be
modulated using a Pulse-Width-Modulation (PWM) approach using the logical OR of any IEC
hardware "PANIC" wake-up request, the "BUSY" status from the processing subsystem and
the target PWM mark-to-space ratio of the IEC target performance level.
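The supply-select logic described in this paragraph can be sketched per PWM slot. This is a minimal behavioural sketch under stated assumptions: the function names, the slot-counter formulation and the parameter names are hypothetical; only the OR of PANIC, BUSY and the PWM mark, and the V_RUN / V_HOLD levels, come from the text.

```python
def supply_select(slot, period, on_slots, busy, panic):
    """Return 'V_RUN' or 'V_HOLD' for one PWM slot.

    slot     -- index of the current time slot
    period   -- number of slots in one PWM period
    on_slots -- slots per period spent at V_RUN (the "mark" of the ratio)
    busy     -- processing subsystem still has work or bus transactions pending
    panic    -- hardware wake-up request
    """
    pwm_high = (slot % period) < on_slots
    run = panic or busy or pwm_high      # logical OR, as described in the text
    return "V_RUN" if run else "V_HOLD"


def duty(period, on_slots):
    """Fraction of slots spent at V_RUN when BUSY and PANIC both stay low."""
    runs = sum(
        supply_select(s, period, on_slots, False, False) == "V_RUN"
        for s in range(period)
    )
    return runs / period
```

For example, `duty(8, 2)` gives a quarter of the slots at V_RUN, while an asserted BUSY or PANIC overrides the PWM and keeps the subsystem at V_RUN until outstanding work has drained.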
The "slot" time resolution of the PWM would be in 10's of microseconds and would benefit from a
fast closed-loop power supply.
No changes would be required to the IEC programmer's interface, which would allow operating
systems to work with traditional multiple-voltage designs.
The key advantages are summarised as
• Simple on-off switching of the maximum CPU clock signal is all that is required to
emulate reduced rate clocking (allowing smooth interpolation of clock performance
levels)
• Support for power supply reduction when the clock is stopped to give both dynamic
and leakage energy savings
• No level shifters required between bus and processing subsystem - simply AND-gate
style clamp gates
• Standard SOC static timing analysis and verification - at the maximum clock rate of
the design
• Re-uses the isolated voltage domain to support state save/restore for isolation when
the processing subsystem is to be switched off completely to prevent any leakage
• OS/API transparent, providing a wait-for-interrupt and bus transaction flush call is
provided in the "idle" task to support both conventional voltage scaling and MAX/MIN
switched power supply as described above
• Real time wake-up support by synchronizing the PWM scheme with any panic
wake-up
• Multi-processor ready - can be shared by one or more processors on a shared switched
voltage domain. In this case all CPU "BUSY" status indicators must be OR-ed
together
Figure 8.6 - US 7,154,317 Zero-pin control overhead "StateSaver" Register
[Figure: front page of United States Patent Application Publication US 2006/0152268 A1, "Latch Circuit Including a Data Retention Latch", inventors David Walter Flynn and David William Howard, Cambridge (GB); filed Jan. 11, 2005, published Jul. 13, 2006. The abstract describes a latch circuit including a functional path latch, which may be in the form of a standard flip-flop, together with a data retention latch. The reset signal nReset and the scan enable signal SE are used to control these latches to perform reset, scan, save and restore functions, so that the functional path latch can be powered down without a loss of data. The accompanying diagram shows the balloon latch.]
US 7,154,317 Zero-pin control overhead "StateSaver" Retention Register 
US Patent Office - Granted: 26-December-2006
Patent Idea - StateSaver DFF
This is a type of balloon flop, but has the unusual feature that it requires no more inputs than
a standard DFF. This is achieved by using the existing "nReset" and "SE" pins to control the
restore and save functions respectively.
The nReset pin controls both reset and restore operations, while the SE pin controls both
scan-enable and save functions.
This behaviour is asynchronous and is independent of the Clk pin. The term "StateSaver
register" refers to this new DFF design and "StateSaver latch" refers to the "balloon" latch
within it.
There are several benefits of this design
• The cell will be handled by existing EDA tools with minimal changes to the
methodology because all the pins are common to DFFs in many libraries
• Since there are no extra "save" or "restore" or "sleep" pins on the DFFs, there is no
need to build large networks of buffers to drive them. Other retention flops typically
have one or more of these pins, and some of these buffer trees must be powered
from the non-switched supply, adding to the static leakage problem
• The cell is interchangeable with a normal (non-retention) DFFR in a netlist, so it will
be possible to swap cell types easily prior to layout
• The StateSaver latch node can be buffered with a single inverter to create a Scan Out
pin. This pin exists in some DFF designs to avoid adding unnecessary load onto the Q
output when connecting up the scan chains. The Scan Out or "SO" pin is typically
implemented as Q AND'ed with SE; this prevents it toggling and wasting power when
not in scan mode. Since the StateSaver latch only toggles when SE is high, that
power-saving functionality is already built into this design, and therefore just a single
inverter is needed to drive this output
System benefits of "StateSaver" registers in IP/SOC design
For many (sub-)system designs only the registers holding architecturally defined state in
fact have to be replaced by StateSaver registers, reducing the area penalty that all
registers would incur. For example, for a CPU core all state visible to the programmer
must be preserved across power gating sleep periods while other state (micro-TLBs and
other transparently cached state, for example) could simply be discarded as long as the
state is re-initialized when power is re-applied. It is obviously important that
non-StateSaver state must be initialized to an inactive condition rather than "unknown" or
"X-value".
9. Publications and Conferences 
Most of the specifications and reports generated from the research project portfolio have commercial
confidentiality access restrictions. However the papers that have been published and public presentations
that directly result from the research work are summarized in this chapter and the papers attached as
Appendix B.
In summary, this chapter documents
• The initial sARS2 technology demonstrator work resulted in
o The invitation to contribute a chapter on Synthesizable ARM IP to a book edited by Kurt
Keutzer and David Chinnery (Berkeley University, CA)
o A keynote address at the Canadian Microelectronics Corporation 2002 workshop
• The DVS926 technology demonstrator resulted in a number of conference presentations and led
to ARM product announcements in relation to the Intelligent Energy Manager product and the
productized ARM1176JZ(F)-S product with integrated Dynamic Voltage Scaling support
o Microprocessor Reports Adaptive Voltage Scaling article built up for Max Baron
o DesignCon 2003 DVS Hardware and Software presentation and paper
o HotChips 2003 presentation, Stanford University, CA
o EE-Times article Jan 2004 publishing the energy savings results
o DATE 2004 paper and presentation on detailed DVS926 power/energy savings
o IEE/ACM Colloquium low-power keynote, Sept 2004 at Loughborough University
o EDA Interoperability Conference presentation, Oct 2004
• The ULTRA926 technology demonstrator generated material for
o DAC 2005 All-day Low-power Tutorial using the ULTRA926 as worked DVFS example
o SOC Conference 2005, Newport Beach, low power design flow panellist
o DAC 2006 UMC Voltage Scaling and Energy Savings presentation and demonstration
• Leakage Mitigation work
o ARM Developers Conference, Santa Clara 2005, leakage management approaches
o DAC 2006 TSMC Voltage Scaling and Energy Savings presentation and demonstration
o ARM Developers Conference, Santa Clara 2006, low power panellist
• Finally, a contract is in place as primary author for a 2007 publication in preparation
o Low Power SOC Methodology Manual to be published by Springer
o Scheduled for DAC 2007 launch, volume publication from July 2007
o ISBN 978-0-387-71818-7
Figure 9.1A - "Creating Synthesizable ARM Processors with Near-Custom Performance"
[Figure: the book in which the chapter appears, subtitled "Tools and Techniques for High-Performance ASIC Design", David Chinnery and Kurt Keutzer.]
Figure 9.1B - Canadian Microelectronics Corporation keynote
[Figure: proceedings page of the Canadian System-on-Chip Workshop 2002, July 5, 2002, Banff Conference Centre, Banff, Canada. The Morning Program (Chair: Hugh Pollitt-Smith, CMC) lists "Core-based SOC Design" by David Flynn, ARM Ltd, alongside talks from CMC and Cadence speakers.]
"Creating Synthesizable ARM Processors with Near-Custom Performance" 
Guest author of Chapter 17 (with Michael Keating at Synopsys adding the EDA material) of "Closing the
Gap between ASIC and Custom" compiled by Kurt Keutzer and David Chinnery, University of California,
Berkeley, published by Kluwer 2002, ISBN 1-4020-7113-2 (Figure 9.1A).
The request to write this had come via Michael Keating at Synopsys as a result of the sARS2 technology
project collaboration and earlier work as a contributor to his "Reuse Methodology Manual", which has wide
industry recognition as best practice for design for synthesis.
The chapter is based on material generated and learnt from the experience of re-architecting the
ARM7TDMI processor from a full-custom design to a widely licensable core for standard synthesis design
flows, with interface re-design for ease of integration. The final drafts were written up around the time of
the 2002 DATE conference in Paris and Mike Keating helped expand the Synopsys-related EDA flow
sections before a technical writer in Mountain View finalised the style and presentation.
The book went to the publishers (Kluwer Publishing, Inc.) on 1st April 2002, and was published on 1st May
2002 (see footnotes 1 and 2).
Canadian Microelectronics Corporation keynote presentation 
The Canadian Microelectronics Corporation is a non-profit corporation that provides Canadian Universities 
with design tools and resources, manufacturing technology and centralized IP repository by taking on the 
procurement and licensing framework on behalf of individual university research groups. 
CMC licensed the ARM7TDMI CPU macro-cell for university SOC project integration in 2002 and invited a 
keynote presentation on SOC design with microprocessor IP cores. 
The presentation proved a useful opportunity to bring together the learning experiences from the sARS2 
reference design particularly with respect to the 
The presentation was given at a SOC Workshop hosted by CMC in July in Banff - see Figure 9.1B. The
presentation was archived with the Canadian System-On-Chip Workshop 2002 proceedings (see footnote 3).
1 http://www.eecs.berkeley.edu/-chinnerv/asicvscustomspeed/bookchapters.html 
2 http://www.springer.com/uklhome/engineering/circuits+%26+systems?SGWID-3-40604-22-33340599-detailsPage=ppmmedialloc 
3 www.cmc.ca/news/events/cwsoc2002proceedings.html 
9-3 
Energy efficient SOC design technology and methodology 
Figure 9.2A - Dynamic & Adaptive Voltage Scaling paper built for MPR 
-'"-R 0 C E 5 
www.M PRrin Unc.(om 
A ALOG AND C PU W IZARDS 
R EDUCE DI GITAL P OWER 
Nallonal Sermconductol arid ARM Increase Battery LIfe 
By '\/II X Bur.,II (I11110J-OI/ 
o R 
"'knlor~ SI"<."I,'· .. I is no longer gu ilt ), of IUllIt ing p«x.,,'.s.sor Ix·rfonmnu.', Thl' i n("'lllnll~ litl. 
has bt-C'n [I\\'Mde'd 10 b,lIIt'ry cap.ldty. CI:II \1 I,.r tdtphon(,$, POAs, notebooks, and pol1ablr 
IlHlllinll"di:1 u"',,kcs (oull.! bring higlll'r mkropllk.cssor ren'nues Rnd more rcw,m.ling 
I. 1."11101 • nd CPU Wizard ... Rrducr Dicital PQI\C'I 
Figure 9.2B - DesignCon paper jointly with National Semiconductor
[Figure: DesignCon 2003 archive schedule (January 27-30, 2003), presented by the International Engineering Consortium. Tuesday, January 28, 11:00 am - 11:50 am: "A Combined Hardware-Software Approach for Low-Power SoCs: Applying Adaptive Voltage Scaling and the Vertigo Performance-Setting Algorithms"]
Dynamic & Adaptive Voltage Scaling paper built for Micro-Processor Reports 
Max Baron at MPR approached ARM to understand and write up the IEM + Adaptive Voltage Scaling
technology announced in November 2002. The author worked on the system diagrams, based on the
DVS926 project work and the IEM control framework, and also helped furnish the power and energy
savings predicted and achievable, justified by measured data from early National Semiconductor AVS
silicon. (See Figure 9.2A for an example of the system diagrams produced.)
The article was turned around at high speed in order to meet publication in the second (bi-weekly) January
edition of Micro-Processor Reports, a respected industry reference for ARM customers:
Analog and CPU Wizards Reduce Digital Power 
Max Baron - Principal Analyst (01/21/2003)
In November 2002, National Semiconductor Corp. and ARM announced a strategic business
relationship to jointly develop and market power-efficient systems that, they claim, will increase the
battery life of handheld portable devices in several stages - from 25% to as much as 400%. The two
companies' joint effort will leverage ARM's penetration in the mobile-phone market and National
Semiconductor's expertise in analog design and power management.
NSC and ARM's joint project aims to create circuits, software, and tools that address three
energy-consumption tasks: frequency reduction, minimal voltage levels to support it, and reduction of leakage
current.
Microprocessor Report readers can access the full story here: 
https://www.mdronline.com/publications/epw/issues/epw158.html
DesignCon 2003 Paper and Presentation
A joint paper between ARM and National Semiconductor was presented at the end of January⁴:
• Krisztian Flautner presented the IEM software framework (Vertigo codename from original 
University of Michigan PhD project that was the control basis for ARM DVFS) 
• Mark Rives from National Semiconductor presented the Adaptive Voltage Scaling IP and early 
results from 0.18um test chip project 
• The author presented the system design, implementation and analysis work and detailed 
information on the synchronous design challenges with DVFS and proposed solutions. 
ANALOGandDSP - News and technical information about...
Published on: 1/2/2005
By Krisztian Flautner, Principal Research Engineer, David Flynn, a Fellow in the Research and
Development group, ARM, Inc., and Mark Rives, Principal Applications Engineer, NatSemi Corp.
4 www.analoganddsp.com/results.asp?entryid=6067
Figure 9.3A - HotChips 2003 presentation 
[Figure: Hot Chips 15 final program. A Symposium on High-Performance Chips, August 17-19, 2003, Memorial Auditorium, Stanford University, Palo Alto, CA. Monday, August 18, 2003, Session 2 (Embedded) includes "Intelligent Energy Management: an SOC Design Based on ARM926EJ-S", David Flynn, ARM]
Figure 9.3B - HotChips 2003 DVS926 project publicised
HotChips 2003 "Intelligent Energy Management" presentation 
ARM was invited to present a paper at the HotChips conference in 2003, and the IEM prototype work based on
the DVS926 project was chosen as the technology to highlight. This carried some risk: the silicon only
arrived out of fabrication the month before the conference and, due to chip packaging issues,
was not available to show physically at the conference. The device sprang to life the week after the
conference, in time for the ARM Partner Meeting hosted in Cambridge following on from HotChips.
Figure 9.3A describes this particular industry conference and context. The conference has historically focussed
on high-speed processors, typically fabricated as stand-alone components, so the ARM IP macrocells
appear in the Embedded session/track.
The presentation was split into three primary sections:
• The overall case for DVFS, focussing on the energy metric rather than the dynamic power dissipation that
is well understood by the CPU community. Because results were not yet available from the
DVS926 project, measured data from a 0.18um ARM926 testchip were used to set the context, as
well as predicted figures from the IEM prototype project.
• The performance control software. This is difficult to convey, yet it is the fundamental concept that
must be understood in order to exploit the dynamic performance and voltage scaling hardware. For
an audience that is primarily hardware-focussed, a number of slides were developed to build up
the case for how the operating system forms a view of the varying application load and task
deadlines without having to touch the application code.
o A projected movie was presented with both the MPEG frame playback and an annotated
dynamic performance prediction graph overlaid to show visually how the OS-based task
performance monitoring and scheduling behaves with IEM policies.
• The hardware control approach, and the underlying design challenges to be addressed to support
voltage scaling within a standard EDA tools environment:
o The Intelligent Energy Controller design was presented and the interfaces to the clock
generator and DVS/AVS power controller discussed, along with how the interfaces are
abstracted to give a unified fractional view of performance requested and monitored, taking
into account the granularity of control and PSU-specific ramp-time constants.
o How the National Semiconductor Adaptive Voltage Scaling is interfaced and controlled.
o An overview of the implementation complexities from a tools and library characterization
perspective.
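The unified fractional view of performance described above can be illustrated with a short sketch. It is not the Intelligent Energy Controller implementation: the function names and the round-up policy are assumptions made for illustration, and the 288/240/192/144 MHz levels are borrowed from the ULTRA926 evaluation platform slide in Figure 9.9B rather than from the DVS926 design.

```python
# Illustrative sketch only: maps a requested fraction of peak performance onto
# the slowest discrete clock level that still satisfies the request.
# LEVELS_MHZ, select_level and delivered_fraction are hypothetical names.

LEVELS_MHZ = (144, 192, 240, 288)  # ascending; values quoted for ULTRA926


def select_level(requested_fraction, levels_mhz=LEVELS_MHZ):
    """Return the slowest level (MHz) that meets the requested fraction."""
    if not 0.0 <= requested_fraction <= 1.0:
        raise ValueError("requested_fraction must lie in [0, 1]")
    peak = levels_mhz[-1]
    for level in levels_mhz:
        # Round up to the next available level so the request is always met.
        if level >= requested_fraction * peak:
            return level
    return peak  # not reached for valid input; kept as a safe default


def delivered_fraction(level_mhz, levels_mhz=LEVELS_MHZ):
    """Fraction of peak performance actually delivered at a given level."""
    return level_mhz / levels_mhz[-1]
```

In this sketch a request for 55% of peak performance selects the 192 MHz level, because 144 MHz delivers only 50%. The gap between the requested and the delivered fraction is the slack that the control software can spend idle.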
Figure 9.3B shows a slide from the presentation looking forward to the actual silicon results.
Acknowledgements to Dr Kris Flautner for work on the instrumented movie, and to Clive Watts for 
integrating the marketing material for the IEM software and National Semiconductor PowerWise hardware. 
Archives | HOT CHIPS 15
www.hotchips.org/archives/hc15
Intelligent Energy Management: an SOC Design Based on ARM926EJ-S, David Flynn (ARM)
Figure 9.4A - DVS/AVS results for EE-times, Jan 2004 
EE Times:
Design and evaluation of power-efficient SoCs
David Flynn
(01/22/2004 4:42 PM EST)
URL: http://www.eetimes.com/showArticle.jhtml?articleID=18310592
The latest portable devices - from mobile phones to media players - offer a host of new Internet, multimedia
and gaming features that place a significant strain on batteries. So the quest to optimize system-wide power
use and maximize battery life has led four companies - ARM, Artisan, National and Synopsys - to collaborate
on the design of power-saving intellectual property (IP) and systems-on-chip (SoC) that reduce dynamic
power consumption based on application software workload, available silicon performance and environmental
conditions.
Here's how they met the challenge.
In 2003, ARM and Synopsys collaborated with National and Artisan on an SoC test chip that can dramatically
increase the battery life of portable devices. The SoC was based on IP that intelligently and dynamically
adjusts performance and power consumption to maximize energy conservation.
The chip addressed dynamic frequency and voltage scaling, implemented multiple power and clock domains
and targeted a 0.13-micron process from Taiwan Semiconductor Manufacturing Co. (TSMC).
The design was partitioned into three primary on-chip power domains:
• A voltage-scaled CPU power domain, which featured level shifters, a retiming interface to the main SoC
and a separate VDD CPU power ring and pads for supply rails. This scaled-voltage domain includes
National's PowerWise hardware performance monitor, which was realized using the same cell library as the
CPU.
• A voltage-scaled memory power domain, which featured a dual 16-kbyte instruction, a data scratch pad
RAM suitable for a state save-and-restore when the CPU is turned off (for example, when implementing
software-controlled leakage management), isolation clamp cells to the CPU, a VDD RAM power ring and
individual supply pads.
• A standard SoC fixed-voltage, "always-on" power domain for the rest of the chip, which featured SDRAM
and flash memory controllers, real-time peripherals and power control.
The test chip was implemented and verified in four steps, using Synopsys' Galaxy Design Platform (Fig. 1).
See related chart: The test chip implemented a system architecture for a power-efficient SoC design.
See related chart: The test chip power/energy evaluation summarizes the results normalized to the nominal 1.2V operating voltage.
DVS/AVS Evaluation results for EE-times, Jan 2004 
The collaborative partnership of ARM, Artisan, Synopsys and National Semiconductor, plus TSMC who
had fabricated the DVS926 system-on-chip implementation, chose to publicise the evaluation results from
the ARM silicon analysis.
An article was written for EE-times which discussed how the multi-voltage SOC design was partitioned and
implemented and the power and energy savings achievable on "near-typical" silicon - and in particular
how National's Adaptive Voltage Scaling technology had been integrated in order to support both process
and temperature compensation to minimize voltage headroom.
The diagrams and graphed results, which were the primary technical input, were produced at the end of 2003, and
the final article, incorporating data to satisfy each collaborating party, was finally published in January 2004.
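The distinction between power and energy savings, and the normalization to the nominal 1.2V supply used in the published charts, can be sketched with the first-order CMOS switching model, in which dynamic energy per clock cycle scales with the square of the supply voltage. The sketch ignores leakage and short-circuit current. The function names are hypothetical, and the 0.9V operating point is a value chosen for illustration, not a measured DVS926 result.

```python
# First-order dynamic switching model only; leakage is ignored.
# Function names and the 0.9 V example point are illustrative assumptions.

NOMINAL_V = 1.2  # nominal operating voltage used for normalization


def relative_energy_per_cycle(supply_v, nominal_v=NOMINAL_V):
    """Dynamic energy per cycle relative to nominal (scales with V squared)."""
    return (supply_v / nominal_v) ** 2


def relative_dynamic_power(supply_v, frequency_fraction, nominal_v=NOMINAL_V):
    """Dynamic power relative to nominal voltage at full frequency."""
    return relative_energy_per_cycle(supply_v, nominal_v) * frequency_fraction


if __name__ == "__main__":
    # Frequency scaling alone: power falls, energy per cycle does not.
    print(relative_dynamic_power(NOMINAL_V, 0.5))
    print(relative_energy_per_cycle(NOMINAL_V))
    # Hypothetical scaled-voltage point: energy per cycle falls as well.
    print(relative_energy_per_cycle(0.9))
```

In this model, halving the clock frequency at a fixed voltage halves the power but leaves the energy for a fixed number of cycles unchanged. Only the accompanying voltage reduction lowers the energy per task, which is why the results are better judged on energy than on power.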
Figure 9.5A - DATE, Feb 2004 
[Figure: DATE (Design, Automation and Test in Europe) Conference & Exhibition, February 16-20, 2004. Advance Programme, session 10D "Low-Power Design" (Designers' Forum), including "IEM926: An Energy Efficient SoC with Dynamic Voltage Scaling", K Flautner, D Flynn and D I Patel, ARM Ltd, UK]
Figure 9.5B - IEE/ACM Colloquium keynote, Sept 2004
[Figure: Postgraduate seminar programme, including the 14:15 - 14:45 keynote talk "Hardware and Software approaches to low-power SOC and system design", David Flynn, ARM]
DATE, Feb 2004 
The Design, Automation and Test in Europe (DATE) conference is an important forum for customers and licensees of ARM
technology. A more detailed paper on the measurable energy savings for a cached CPU core such as the
widely licensed ARM926 was chosen as an important paper to target for this conference.
DATE as a conference favours papers that cover real-world experimental results and analysis, and the
measured results from the DVS926 silicon were just available in time to meet the submission deadlines
that DATE sets in order to ensure peer review and acceptance.
The paper was co-authored with three colleagues: Kris Flautner, who was the architect behind the DVFS
prediction control software, and Dipesh Patel and Dave Roberts, who had helped work on the
implementation and evaluation aspects respectively. See Figure 9.5A.
The paper generated a lot of interest and became a valuable reference to use when engaging with
prospective licensees and customers of the IEM software and IEM-enabled microprocessor cores.
Abstract 
One of today's most successful embedded devices, the mobile phone, embodies a set of challenging
design requirements: long battery life, small size, high performance and low cost. The single
parameter that complicates the simultaneous fulfilment of all of these design goals is energy efficiency
of the system, since batteries only hold a finite amount of charge. To operate within the allotted energy
budget, systems must be optimized for energy consumption during design and also at run-time.
Increasingly it is not sufficient to statically optimize for worst-case conditions but designers must
enable systems to adapt to conditions at runtime. The Intelligent Energy Manager™ (IEM) technology
provides an integrated solution for addressing energy management of SoC devices. In this paper we
present data about the energy consumption characteristics of a multiple power-domain based SoC
which includes PDA functionality built around an ARM926EJ-S core.
IEE/ACM Colloquium keynote, Sept 2004 
Loughborough University, the IEE Professional Network on System-On-Chip and the Association for
Computing Machinery (ACM) Special Interest Group on Design Automation hosted a postgraduate
colloquium focussed on Embedded Software, Hardware implementations and hardware-software
co-design from the SoC perspective.
The invitation to present the afternoon keynote session gave the opportunity not only to present the
systems-based approach to managing both hardware and software components, but also to show live the
demonstration system running MPEG video decode, displayed on the conference projector, which helped
make the material come to life. (Ideal for the after-lunch slot!) See Figure 9.5B.
5 http://csdl.computer.org/dl/proceedings/date/2004/2085/03/208530324.pdf
Figure 9.6A - Synopsys EDA Interoperability Conference Oct 2004
[Figure: 14th Synopsys EDA Interoperability Developers' Forum, October 21, 2004, Sun Conference Center, Santa Clara, CA at Agnews Historic Park. Keynote: "The Next Big IC Design Discontinuity", Professor A. Richard Newton, University of California Berkeley. The Forum covers EDA tool interoperability, including the Liberty, Milkyway, SDC and SystemVerilog formats. Agenda: General Session 7:45AM - 2:15PM; Breakout Sessions 2:30PM - 5:00PM, including Track 1 - Low Power Forum]
Figure 9.6B - Synopsys EDA IDF Oct 2004 - Low power track based on DVS926
3:00PM - Breakout Sessions
Track 1: Low Power Forum - The Elements of the Most Energy Efficient SoC Design
• ARM IEM and System-Level Power Management - Clive Watts and David Flynn, ARM Ltd
• Design for Lower Power: Synopsys/ARM IEM Reference Methodology and Other Design Techniques - Barry Pangrle, Synopsys
• PowerWise Technology for Power Supply and Voltage Control - Gordon Mortensen, National Semiconductor
• Silicon IP for Low Power Design - Rob Aitken, Artisan Components
Slide: IEM Technology and Interoperability
IEM provides Dynamic Performance Scaling, but...
• IEM-enabled (CPU) cores require
o Level-shifter interfaces, isolation clamps
o Library IP, fast-settling PLLs
o Extended characterization for cell libraries, memory
• Power-aware RTL coding
o Global power/ground in current HDLs inadequate
o Careful hierarchical design for now
• Latency-aware clock design
o Clock tree balancing becomes dynamic challenge
o Either asynchronous or complex clock generation
• Comprehensive implementation and analysis EDA
o Plus verification, test and production yield extensions
Synopsys EDA Interoperability Developers Conference Oct 2004
A presentation was invited by Synopsys to cover. 
Clive Watts, the product manager at the time for the Intelligent Energy Manager software product, covered
the basic overview of how the software control framework and policy manager fit into the system and OS
design.
The author presented the primary technical requirements on the EDA tools and library IP components to
allow clean design flows and interoperability with different front-end and back-end tools.
Figure 9.6A describes this particular industry conference and context.
The systems-level challenge of DVFS and IEM proved a useful framework for the Low Power forum:
• Systems level challenges from the perspective of IP provider and system-on-chip integrator. The
DVS926 project was used as the basis for the technical justification:
o As shown in the summary slide in Figure 9.6B, the challenges to EDA, library and
power controller suppliers, as well as the need to enhance RTL coding to support
multi-voltage design, were all introduced and then addressed from the collaborative technology
demonstrator project perspective.
• EDA implementation and analysis - using the worked examples from the DVS926 project that
Synopsys had worked on and learned from.
• Interfacing to on-chip and off-chip power supply controllers - again using the DVS926 project that
National Semiconductor had used as the reference system design for the PowerWise prototype.
• The physical library components (primarily level shifters and isolation clamps) and the expensive
library re-characterization to support DVFS timing constraints and analysis specified as a result of
the DVS926 project. This section was presented by Rob Aitken, who was later to become the ARM
Physical IP R&D director after Artisan Components was acquired by ARM. (The ULTRA926 project
was running confidentially at this stage and provided the requirements specification for the
"Multi-Voltage Kit" components discussed as the basic interoperability components requiring EDA
support.)
Increasingly, the success of the IEM hardware and software was being seen commercially as dependent
on the design tools and flows to support both design and verification. Visibility at such EDA conferences
and the opportunity to present the challenges and requirements in a customer forum has proved, and will
continue to prove, very valuable in getting tools and methodologies in place to enable successful
multi-voltage product designs in a wider market than just the major IDMs.
Figure 9.7A - DAC 2005 - All-day low power tutorial
[Figure: DAC 2005 conference program, Technical Sessions, Friday Tutorial]
FRIDAY, June 17, 2005, 09:00 AM - 05:00 PM | Room: 208AB
TRACK: POWER
FRIDAY TUTORIAL
#4 - Advancements in Energy-Efficient Design
Organizer(s): Barry Pangrle
Key trends in power management for energy-efficient design will be described
along with the latest research techniques that address them.
Key Trends:
-Lower Voltages
-Lower Threshold Voltages
-Multiple Threshold Voltages
-Dynamic Voltage Scaling
-Dynamic Frequency Scaling
-Adaptive Voltage Scaling
Power Management Areas
-Analysis and Minimization Techniques for Total Leakage
-Power Optimization Using Multiple Supply and Threshold Voltages
-Asynchronous Level Converters and Level Converting Logic Circuits for Multi-VDD Design
-Analysis and Design of Level-Converting Flip-Flops for Dual-VDD/Vth Integrated Circuits
-Voltage-scalable SRAM with Timing Speculation and Error Correction
-Dual Vth, Dual VDD, and sizing optimization
-A New Algorithm for Improved VDD Assignment in Dual VDD systems
Speaker(s): David Flynn - ARM Ltd., Cambridge, UK
David Tamura - National Semiconductor, Santa Clara, CA
David Blaauw - Univ. of Michigan, Ann Arbor, MI
Barry Pangrle - Synopsys, Inc., Mountain View, CA
Figure 9.7B - DAC 2005 - DVFS (AM) and Leakage Mitigation (PM) 
Overview
• Systems level challenge
• SW/SOC/PSU design, EDA/Library implications
• Dynamic Power/Energy Management - morning session (based on ULTRA926)
• Dynamic Voltage and Frequency Scaling
• Real-world design issues
• Static/Leakage Power Management - afternoon session (leakage mitigation)
• Multiple power management states
• Work in progress ...
DAC 2005 - All-day low power tutorial 
The ULTRA926 DVFS project and the leakage mitigation work in progress were presented at the Friday tutorial:
TUTORIAL 4) Advancements in Energy-Efficient Design (See Figure 9.7A)
Organizer: Barry Pangrle
Speakers: David Flynn - ARM Ltd., Cambridge, UK (See Figure 9.7B)
David Tamura - National Semiconductor, Santa Clara, CA
David Blaauw - Univ. of Michigan, Ann Arbor, MI
Barry Pangrle - Synopsys, Inc., Mountain View, CA
DAC 2005 - Low Power Panellist
From EE-times conference panel report by Ron Wilson⁶:
"A more ambitious design was described in a panel sponsored by Synopsys, and including speakers from ARM and
Toshiba. This design was also an ARM processor: in this case an ARM1176 demonstration. ARM fellow David Flynn 
and Synopsys fellow Mike Keating described a design that employed more invasive techniques and more manual 
assistance than the Cadence-described flow. 
In addition to the usual logic optimization and multi-threshold techniques, this design used dynamic voltage-frequency
scaling - a technique that ARM for one has been discussing for a year or so, but that adds considerable complexity to
the tool flow. It also employed power gating, which though simple in principle, also adds remarkable complexity.
Both techniques require very careful thought in partitioning the design, Flynn warned. Decisions based on timing 
requirement perceptions may have implications not only for the end performance of the chip, but also for the layout 
complexity and the amount of energy actually saved. 
Voltage scaling of course requires level shifters between voltage regions. But if voltage is scaled dynamically, nets that 
cross the boundaries must be checked for timing, signal integrity, electro-migration and other issues across all the 
possible combinations of voltages. Clamps may be needed to hold the outputs of a block at a known level while the 
block is inactive. And if frequencies are scaled as well , signals may have to be resynchronized between blocks. 
Even just turning the power off to a block can be complex, the researchers observed. Power gating can be done either 
on a very fine granularity, by placing a switch at the foot of each ground path, or on a block level, creating a virtual 
ground for each block. In the former case, the switch is now in the signal path, and must be sized to meet timing 
constraints. "That one transistor that turns off the power can be larger than a two-input NAND gate," Keating said.
In addition, state must be saved somehow. In this design, Artisan registers with low-leakage shadow flip-flops were
used, so that on a signal the state of the register could be transferred to the shadow or restored from it. And output
clamps are needed to keep outputs from a powered-down block from crowbarring the next block down the line. Further,
the only tool available today to predict the transient behaviour of the power grid is SPICE.
Altogether there is considerable energy expenditure in powering down a block. This makes it necessary to have enough 
application knowledge to know when the block can be inactive long enough to make the effort worthwhile. 
The impression from the panel was that such aggressive techniques were feasible, and the rewards worth it, but today 
the undertaking is not for the faint-hearted. Major steps are being taken by both tool and library vendors to automate the 
hard spots, but it appears the architectural-level decisions, specifically about partitioning and knowing when a block can 
be slowed or shut down, will remain challenging."
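The trade-off described in the panel report, where powering a block down costs energy that must be recovered during the idle period, can be sketched as a simple break-even calculation. The function names and all numeric values used with them are hypothetical and are not taken from the ARM1176 demonstration design.

```python
# Illustrative break-even model for power gating; all figures are hypothetical.


def break_even_idle_time(transition_energy_j, leakage_on_w, leakage_gated_w):
    """Idle time in seconds beyond which gating a block saves net energy."""
    saved_power_w = leakage_on_w - leakage_gated_w
    if saved_power_w <= 0:
        raise ValueError("gating must reduce leakage power")
    # Energy spent on save, switch-off, switch-on and restore is repaid at
    # the rate of the leakage power that gating removes.
    return transition_energy_j / saved_power_w


def worth_gating(expected_idle_s, transition_energy_j, leakage_on_w, leakage_gated_w):
    """True if the expected idle period exceeds the break-even time."""
    return expected_idle_s > break_even_idle_time(
        transition_energy_j, leakage_on_w, leakage_gated_w
    )
```

With an assumed 3 µJ cost for the save, switch and restore sequence and 0.3 mW of leakage removed, the block must stay idle for about 10 ms before gating pays off. This is the application knowledge the panel refers to: the scheduler needs an estimate of the idle period before it commits to a power-down.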
6 http://www.eetimes.com/news/design/technology/showArticle.jhtml?articleID=164900311
Figure 9.8A - ARM Developers Conference, Oct 2005 
[Figure: ARM Developers' Conference, October 4-6, 2005, Santa Clara Convention Center, Santa Clara, CA, USA]
1:00 PM - 1:45 PM
A Combined Hardware & Software Approach to Leakage Control
Presented by Synopsys, Mike Keating; ARM, Kevin McIntyre and David Flynn
Room 206
Power and energy management is an important aspect for consideration in any SoC or overall
product design. As leakage current will play a more significant role in the future, measures must
be taken to control leakage and ensure power and energy usage remain as low as possible. On the
current generation of process technologies, leakage accounts for about 20% of the overall
power/energy budget; on subsequent process geometries, this is nearing 40-50%. Future SoCs on
process technologies of 60nm and beyond will require additional techniques in both hardware and
software to manage leakage. This presentation explores some of the areas ARM is actively working
on to minimize leakage current.
Figure 9.8B - International SOC Conference Panel, Nov 2005
[Figure: 3rd International System-on-Chip (SoC) Conference & Exhibit, November 1 & 2, 2005, Radisson Hotel Newport Beach, California. Panel, 4:45 pm - 5:30 pm: "Low Power Design Challenges in Complex SoC & ASIC Designs". Moderator: Ron Wilson, Editor, EE Times. Panelists include David Flynn, Engineering Fellow, ARM]
ARM Developers Conference Leakage Mitigation presentation, Oct 2005
This inaugural conference hosted in Santa Clara was targeted at both software and ASIC developers and
engineers in the Silicon Valley area. Working with the product manager of the IEM embedded software
and an EDA tools expert, a presentation was produced covering the extensions to the IEM product roadmap to support a
range of leakage mitigation techniques. See Figure 9.8A.
Without publicly naming the leakage projects at this stage, the technology portion of the presentation
covered the leakage mitigation techniques under evaluation:
ARM926-based
• Key to meeting shuttle die area targets (-> 4x4mm -> 3.5x3.5 mm)
• Reuse Linux port and IEM 'policy stack' development environment
• Reuse existing IEM evaluation and development systems
Multiple voltage domains
• DVFS support (more for 1.2V LP process than 1V G/OD)
• Allow independent V/I measurement across domains
Support power-gating (MTCMOS) strategies
• RTL handshakes to drive distributed/fine-grain power switches
• First prototype of IECv2 for intelligent leakage control
• Support for novel Retention Registers for "Light sleep"
• Support for novel AMBA-based block hibernate/wake "Deep sleep"
Optional support of (VTCMOS) forward/reverse bias control
• Despite lack of EDA multi-dimensional timing models
3rd International SOC Conference panellist, Nov 2005
The invitation to represent ARM at this SOC conference hosted in Newport Beach was accepted, and the
opportunity was used to build credibility for the potential synergy between physical and synthesizable IP from
ARM in the year that Artisan Components had been acquired.
The panel was entitled "Low Power Design Challenges in Complex SOC and ASIC Designs". See Figure
9.8B. The moderator was Ron Wilson, who subsequently reported the highlights in EE-Times⁷.
NEWPORT BEACH, Calif. - A panel discussion at the 3rd International System-on-Chip Conference here this
week attempted to skip past the academic science projects and the rosy vendor marketing and explore what was
really feasible today in the challenging area of power management. ...
7 http://www.savantcompany.com/PR/Savant-news19.htm
9-17 
Energy efficient SOC design technology and methodology 
Fig 9.9A - Design Automation Conference Jul 2006 - ATLAS65LP presentations
2006 43rd DESIGN AUTOMATION CONFERENCE
July 24 - 28, 2006, Moscone Center, San Francisco, California
Leakage Power Analysis
• Correlating measurements
• IV analysis for CPU+RAM
• NB beyond functional range
• HALT
• Std cell + RAM leakage
• LIGHT SLEEP
• Lvt Std cell power gated
• Retention registers active
• + RAM leakage
• DEEP SLEEP
• Just RAM leakage
Compared to ~0.3mW leakage power, 25C typical
Fig 9.9B - Design Automation Conference Jul 2006 - ULTRA926 presentations
ULTRA926 "IEM" Evaluation platform
• Fully functional silicon
• 128 MByte SDRAM
• 8 MByte FLASH
• VGA display support
• Measurable power rails
• Intelligent Energy Manager
• Software MPEG-4 playback
• Recalculate performance every frame of 25 frames per second
• Graph predicted versus actual
• 288/240/192/144MHz levels
• Wait-for-Interrupt if any slack
The Design Automation Conference in San Francisco in July 2006 proved to be an important showcase for
the technology demonstrators developed with both TSMC on its advanced 65nm-LP process and UMC on its
130nm "Fusion" technology.
The schedule for screening the untested test chips and getting them assembled on surface-mount evaluation
platforms proved very tight, and the exhibition systems were hand-couriered out the weekend the exhibition
was being set up. Two sets each of the TSMC and UMC evaluation systems were commissioned so that
both the foundry partner and ARM had systems and some form of backup was available.
The ARM booth in the exhibition centre ran demonstrations of both ULTRA926 and ATLAS926 running
IEM application workloads with voltage scaling, and both TSMC and UMC exhibited systems running
real-time dynamic performance scaling of MPEG video workloads.
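The per-frame scaling listed for the ULTRA926 platform in Fig 9.9B (recalculate the performance level for every frame at 25 frames per second, choose from the 288/240/192/144MHz levels, and wait for interrupt if any slack remains) can be illustrated with a minimal sketch. This is not the IEM policy code: the function names, the use of a predicted cycle count as the input, and the fallback to the top level when no level meets the deadline are all assumptions made purely for illustration.

```python
# Illustrative sketch only: names and the prediction input are hypothetical,
# not part of the IEM software described in the thesis.
LEVELS_MHZ = (144, 192, 240, 288)  # performance levels quoted in Fig 9.9B
FRAME_RATE = 25                    # frames per second quoted in Fig 9.9B


def cycles_per_frame(mhz):
    """CPU cycles available in one frame period at the given clock level."""
    return mhz * 1_000_000 // FRAME_RATE


def pick_level(predicted_cycles):
    """Return the lowest level that finishes the frame within its deadline."""
    for mhz in LEVELS_MHZ:
        if predicted_cycles <= cycles_per_frame(mhz):
            return mhz
    # Assumption: if no level is fast enough, run flat out at the top level.
    return LEVELS_MHZ[-1]


def slack_cycles(predicted_cycles, mhz):
    """Cycles left over in the frame, which would be spent in wait-for-interrupt."""
    return max(0, cycles_per_frame(mhz) - predicted_cycles)


if __name__ == "__main__":
    for work in (5_000_000, 6_000_000, 20_000_000):
        level = pick_level(work)
        print(work, level, slack_cycles(work, level))
```

Choosing the lowest sufficient level first and idling only for the remainder reflects the ordering on the slide: scale performance to the workload, then wait for interrupt if there is still slack.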
Design Automation Conference 2006 TSMC ATLAS926-65LP
The author presented the ARM evaluation results for the 65nm leakage and DVFS project at a number of
TSMC-hosted customer presentations.
An example of the leakage results disclosed, showing the leakage power reduction for the mitigation
schemes implemented on this technology demonstrator, is shown in Figure 9-9A.
Considerable customer interest and follow-up resulted from these presentations.
Design Automation Conference 2006 UMC ULTRA926-130
The author presented the ARM evaluation results for the 130nm DVFS project at a number of UMC
exhibition floor presentations.
Alongside the presentations a running evaluation system was displayed; the setup is described in
Figure 9-9B.
Again considerable customer interest and follow-up resulted from these presentations, and UMC requested
a follow-on technology demonstrator project.
Figure 9.10A - ARM Developers Conference, Oct 2006 - Panel
ARM DEVELOPERS' CONFERENCE '06
October 3-5, 2006
Santa Clara Convention Center, Santa Clara, CA
WEDNESDAY, OCTOBER 4, 2006
12:45 PM - 1:30 PM
Fast Track to Lower Power Luncheon Panel
Panel of experts in IP, tools and methodology will present the challenges and solutions
for implementing high-performance, low-power SoCs that are fueling the growth of the
electronic business, based on results from extensive collaborative technology
development. In addition to explaining the science behind the architectural and design
challenges, the panel will present examples of design solutions using special IP and
design automation, which enable leading-edge product performance and battery life.
The panelists will also share their insights into the future of ultra low-power and
high-performance design and automation. Examples of topics to be covered include:
dynamic voltage scaling, power gating, power grid creation and analysis, Intelligent
Energy Manager (IEM™) technology, verification, and physical IP.
Hosts:
John Chilton, General Manager, IP & Systems Group, Synopsys
Mike Muller, CTO, ARM
Panel Speakers:
Frederic Nyer, ST Microelectronics
Alan Gibbons, Principal Engineer, Synopsys
David Flynn, ARM Fellow
Stephen Meier, VP, Engineering, Synopsys
Rob Aitken, ARM Fellow
Figure 9.10B - ARM Developers Conference 2006 - Panel slides
Technology Demonstrator Results So Far ...
Four primary levels of IEM leakage management:
HALT/SRPG/SCAN-HIBERNATE plus SHUTDOWN for CPU
Working 65nm LP silicon evaluated (see on ARM Booth)
Supports DVFS for IEM dynamic energy saving
Important with 1.2V LP
SRPG savings of 85%+ measurable
compared to the static ~0.3mW leakage
90nm G project just out of fabrication
Distributed header power gating with inrush control
Dynamic threshold scaling leakage management
Both CPU plus USB subsystem leakage management
State retention noise immunity and
The second ARM Developers Conference in Santa Clara provided the forum to showcase the ATLAS926
and ULTRA926 working silicon to the wider ARM developer community and to begin to reveal publicly
details of the SALT926 project alluded to the year before. The original plan had been to have
demonstrable silicon in time for this conference, but at this stage the silicon had only just emerged from
fabrication and was awaiting packaging.
With the aim of encouraging vendors and partners, rather than ARM staff, to present at the ARM conference,
a presentation was co-written with Synopsys, which Alan Gibbons delivered for the system-level power
management track:
Techniques for Aggressive Leakage Management in ARM Systems
Presented by Alan Gibbons, Synopsys
Room 210
3:00 PM - 3:45 PM
At 65nm and below, minimizing the static power dissipation through aggressive techniques such as
coarse grain MTCMOS power gating and threshold voltage scaling can yield the significant reductions
in power consumption that are necessary to drive high-performance, complex applications on mobile
platforms. ARM and Synopsys have jointly developed a comprehensive low power technology
demonstrator that employs these advanced low power techniques. Various alternative approaches to
MTCMOS power gating and threshold voltage scaling are discussed together with a detailed
description of the implementation flow and the results.
ARM Developers Conference, Oct 2006 - Panellist
ARM and Synopsys jointly sponsored a low power panel to address the progress and remaining
challenges in delivering synthesizable IP, methodologies, tools and physical IP libraries, which provoked
lively discussion and a lot of follow-up questions and dialogue. See Figure 9-10A.
The system-level leakage management slides presented included a summary of the system control
challenges, the EDA requirements for analysis and verification tools, and visibility of the R&D projects
under way to produce worked SOC examples. See Figure 9-10B for an example slide that showed the
technology being evaluated in detail.
10. Conclusions and Future Work
Evaluating the "Energy efficient SOC design technology and methodology" goals
The results for the work covered in this thesis are summarised in this chapter. The focus of the work
has very much been on building on the earlier academic research framework in order to demonstrate
real-world applicability in standard EDA design flows. The applied R&D has resulted in a number of
novel technology components and techniques that can now be reused with
commercial customers and EDA partners.
In order to evaluate the value of the applied R&D portfolio of projects, evidence has been gathered
in the following areas:
• Licensable products resulting from the research projects
o The "Intelligent Energy Controller" (IEC) hardware API to manage dynamic
performance setting and monitoring in a SOC-reusable product [1]
o The "Intelligent Energy Manager" (IEM) software product development platform,
deployment model and OS/SOC portable interfaces [2]
o The ARM1176JZ(F)-S CPU multi-voltage design partitioning and interface
definition for DVFS SOC deployment, setting the foundation for the multi-voltage
CPU specifications for the ARM "Cortex™" family [3]
o Adaptive Voltage Scaling "PowerWise" system-level design and deployment model
devised in collaboration with National Semiconductor Inc [4]
o Physical IP specifications for the ARM/Artisan "Power Management Kit" in the form
of level-shifters with integrated isolation functionality, and power-gating and
state-retention standard cell components licensed to a number of foundries and
customers [5]
• Results and expertise developed from the representative technology demonstrators
o Functional, measurable silicon on technology nodes from 180nm to 65nm over 5
generations [6][7][8]
o Dynamic Voltage and Frequency Scaling designs that allowed precise power and
energy measurement, analysis and quantification of energy savings. The designs drove
the specifications of physical IP components and provided real-world
examples for EDA companies to work with in order to develop methodologies
suitable for non-expert customers [7]
o Leakage-management techniques that can be applied as a "standard-cell" design
flow, rather than requiring expert transistor-level knowledge and visibility, to
support a number of system-level leakage mitigation schemes [8]
Figure 10.1A - Low Power Methodology Manual, Springer 2007
Figure 10.1B - Low Power Methodology Manual primary author contributions
Introductory chapters and USB IP Design example, authored by Mike Keating
• Chapter 1: Overview of the challenges and basic approach to low power design
• Chapter 2: Basic power reduction techniques
• Chapter 3: Multi-voltage design, focusing on architecture and design issues
• Chapter 8: IP design for power gating, a USB subsystem example
Authored by David Flynn: Power Gating
• Chapter 4: Power gating overview
• Chapter 5: RTL design for power gating
• Chapter 6: Worked example of a power gated chip design at the RTL level
• Chapter 7: Architectural design issues in power gating
Authored by David Flynn: Dynamic Voltage and Frequency Scaling
• Chapter 9: RTL and system design for dynamic voltage and frequency scaling
• Chapter 10: Worked examples of voltage and frequency scaling
Authored by David Flynn: Physical Library and Memory IP
• Chapter 12: Physical cell library and memory requirements for multi-voltage design
• Chapter 13: Retention register design and data retention in memories
Authored by Alan Gibbons:
• Chapter 11: Implementation issues in low power design: synthesis, place and route,
timing analysis and power analysis
Authored by Kaijian Shi:
• Chapter 14: The design of the power switching network
• Industry text book contributions
o Invited chapter on low-power synthesizable CPU design for Keutzer & Chinnery,
"Closing the Gap between Custom and ASIC", Kluwer, 2002 [9]
o Authored 7 of the 14 chapters of the "Low Power Methodology Manual", launched
by Springer at the Design Automation Conference, June 2007, to be published in
July 2007. Remaining chapters co-authored by Keating M., Gibbons A., Aitken R.
and Shi K. ISBN 978-0-387-71818-7. See Figures 10.1A and 10.1B.
• Credibility with end-customers and EDA tools providers
o Increasingly the challenge of low power has to be addressed as a system-level
problem that spans transistors/physical IP, RTL design, and on- and off-chip
power supply technology. It also requires low-power implementation and
analysis tools, as well as operating system level software, device drivers and
extensible control schemes. The SOC-level technology demonstrators have
provided several generations of hardware platforms that have enabled a number of
OS providers to port and demonstrate their power-management schemes and
measured energy savings [10][11]
o For an IP company, the technology demonstrator programme has provided the
credibility to engage with expert OEMs and advanced technology licensees with regard
to low power. The programme has enabled ARM to share
results and work more closely with industry-leading mobile phone manufacturers to jointly
understand the significant usage profiles that affect active and standby battery life
and the value of different energy management approaches [12][13]
o The board-level evaluation and development platforms were not only important and
influential from an exhibition and trade-show perspective but also became a
valuable "loan" resource, allowing potential customers to benchmark their own
proprietary and commercially sensitive workloads under the
project's Linux OS environment. This was the key for some early adopters to
evaluate the IEM software product
o Following the acquisition of Artisan Components, the programme has been a driver
for "synergy" between the synthesizable IP (processor, data engines and AMBA
bus components) and the cell library and memory enhancements to derive lower
power, more energy efficient SOC implementations. The early "Multi-Voltage Kits"
and subsequent "Power Management Kits" embody a number of the components
and inventions developed under the auspices of research for this thesis [5]
Canonical SOC Design - 5 Generations
A number of ARM licensees and customers have shown interest in joint development projects, but the
problem has always been that the data from such collaborations is typically commercially
restricted and cannot be shared with other companies or EDA partners.
The major benefit to ARM of basing the research and development upon an in-house canonical
design was that there was no dependence upon external, commercially restricted customer IP, which
meant that the results from analysis were under ARM's control. Rather than having to
incorporate the full complexity of an end-product SOC design, the research has been able to focus
on representative complexity in terms of multiple clock domains, power domains and baseline
operating system platforms, sufficient to focus on best-practice implementation methodologies. All
the dynamic and leakage energy savings for an ARM926 processor core are of higher value for
larger and more complex CPU cores that typically consume a greater proportion of the end-product
power budget. Chapter 2 summarizes the successful evolution of the initial design across four more
generations with increasing power management sophistication:
• SOC#1 - sARS2 "Audio Reference System Design"
o 180nm TSMC standard low-power synthesis/place-and-route baseline design
• SOC#2 - DVS926 "Dynamic Voltage Scaling Demonstrator"
o 130nm TSMC technology design for DVS and experimental Adaptive Voltage
Scaling prototype, with Linux OS support
• SOC#3 - ULTRA926 "DVFS Reference System Design"
o 130nm UMC technology higher performance design for DVS using a
re-characterized Foundry CPU core plus AVS
• SOC#4 - ATLAS926-65LP "Leakage and DVFS Demonstrator"
o 65nm TSMC Low-Leakage, 1.2V technology design for leakage mitigation and DVS
with experimental leakage management control IP
• SOC#5 - SALT926-90G "Leakage/Physical-IP Demonstrator"
o 90nm TSMC Low-Leakage, 1.0V technology design for aggressive leakage
mitigation and with experimental State-Retention/Power-Gating control and
analysis management IP
Best Practice Design for Low Dynamic Power - DVFS
The major achievements to grow out of the voltage scaling technology demonstrator designs
described in Chapter 3 are summarized as:
• Multi-voltage partitioning of DVS-enabled processor products, initially the ARM1176, with
RAM, CPU and SOC power domains and with asynchronous bus interface layering to support
variable clock latency management
• Power/Test/Reset/Clocking requirements in detail, as developed with EDA partners, and
worked examples based on an ARM926-based SOC
• The first realization of a complete Linux-based Intelligent Energy Manager System-on-Chip
with IEC and IEM deployment
• Adaptive Voltage Scaling reference design for joint ARM/National Semiconductor
engagement with customers
• Novel dynamic performance control techniques and prototypes
Best Practice Design for State-Retention Power-Gating - SRPG
The major achievements to grow out of the SRPG demonstrator designs described in Chapter 4
are summarized as:
• Multi-level leakage management techniques transparently overlaid on ARM
"wait-for-interrupt" instruction set support
• Leakage mitigation state machine controllers for safe SRPG sequencing
• Zero-area overhead state save and restore functionality re-using manufacturing scan
• State-retention register integrity analysis hardware checking
• Novel power gates, retention registers and isolation buffer cell designs
• Advanced 65nm TSMC 'LP' technology demonstrator to allow both leakage plus dynamic
voltage scaling support for early customer evaluation of dynamic and static energy
management techniques and relative benefits
Physical IP for Low Power Design
Chapter 5 includes the primary novel IP developed during the project:
• Library IP support for DVFS - level shifters with integrated isolation clamps
• Library IP support for (MTCMOS) Power Gating - in the form of switches with integrated
"always-on" control buffering
• Library IP characterization for Power Gating - de-rated for power gating IR-drop in order to
facilitate timing closure when there are "standard cell" power gates in series with power rails
• Library IP support for State Retention - in particular a novel design of scan-flop that
reuses existing ports to control retention and so supports a standard design flow, with
the only requirement being that the scan-enable network must be implemented with always-on
low-leakage buffer cells
• Library IP support for dynamic well-bias - example worked design with well taps
automatically placed within power gating cells
Evaluation Platforms
The evaluation boards developed as described in Chapter 6 were a major factor in being able to
port operating systems (Linux in-house) and bring up measurable/demonstrable systems for
detailed energy management and benchmarking. These are summarized as:
• 180nm sARS2 Evaluation Board
• Software Development Board for IEC
• 130nm DVS926 Voltage Scaling Test Platform
• 130nm DVS926 Voltage Scaling Demonstration Platform
• "IEM" Voltage Scaling Exhibition Platform
• 130nm ULTRA926 Voltage Scaling Demonstration Platform
• 65nm ATLAS926 DVS/Leakage Demonstration Platform
• (90nm SALT926 Leakage Test Platform during the writing-up phase of this thesis)
Technology Demonstrator Evaluation and Analysis
All the technology demonstrators, with the exception of the first (simple!) design, were fully functional
and, together with the evaluation boards, allowed dynamic and static power and energy analysis
across a number of process technologies and geometries:
• sARS2 180nm "standard" low-power SOC flow [suffered from a JTAG debug clock
balancing fault that meant evaluation was based on a Flash-ed EPROM monitor]
• DVS926 130nm Silicon DVFS Evaluation of 240MHz cached CPU
• DVS926 130nm DVFS Energy Savings and co-development of Adaptive Voltage Scaling
• ULTRA926 130nm Silicon Evaluation of 288MHz cached CPU
• ATLAS926-65LP Silicon DVFS and Leakage power/energy Evaluation
o Analysis of SRPG leakage with virtual rail voltage scaling
o Analysis of SRPG leakage across commercial temperature ranges
Patents Filed/Granted
A number of inventions have been patented from the dynamic and static energy management
projects, of which 6 are public in that they are granted or have confirmation of grant notices, as
described in Chapter 8:
• US 6,883,102 Power Management Control API
• US 6,950,951 RTL Power Control Interface
• US 2004/0153762 Bus Based State Save and Restore
• US 7,181,633 IEC Performance Available Response
• US 7,194,647 IEC PWM Dynamic Performance Scaling
• US 7,154,317 StateSaver Retention Register
Publications and Conferences
For commercial reasons publications and papers have been limited. A number of the early
engagements were under confidential arrangements with ARM partners, and some of the detailed
technical work was held back while patent filing proceeded. Chapter 9 provides details of the
conferences and articles that were public, many of which were invited as the work began to show
credible results and working technology demonstrators were exhibited:
• Chapter for "Closing the Gap between Custom and ASIC", 2002
• Canadian Microelectronics Corp Keynote, Banff 2002
• Microprocessor Reports DVS/AVS Article, Jan 2003
• DesignCon 2003, SA2-3 Hardware/Software paper/presentation
• HotChips 2003, Intelligent Energy Management with ARM926
• EE-Times Jan 2004, Design and Evaluation of power-efficient SOCs
• DATE 2004, Energy Efficient SOC with Dynamic Voltage Scaling
• IEE/ACM Colloquium, SOC Design Test & Technology, Sep 2004
• Synopsys EDA Interoperability Conference, Oct 2004
• DAC 2005 All-day Low Power Tutorial - DVFS and Leakage
• ARM Developers Conference, Leakage Control, Oct 2005
• DAC 2006 Leakage Technology Demonstrators (TSMC & UMC)
• ARM Developers Conference, Low Power Panel, Oct 2006
• Primary author for "Low Power Methodology Manual", Springer 2007
Research contributions to knowledge and industrial application
The identifiable contributions of the research can best be summarized as:
• An approach to RTL design for multi-voltage implementation where Isolation, Reset,
Clocking, Retention and Test semantics are overlaid on a Power Domain and can be
described in conventional synthesizable HDL-coded design. This has proven to work well
for a series of technology demonstrators and is shaping the future of how ARM can verify
and deliver multi-voltage CPU designs and next generation designs with partial retention
• Novel system-level interfaces and hardware abstractions for dynamic performance scaling
appropriate to third-party voltage scaling technology
• Novel retention schemes and retention register designs with minimal implementation
requirements in the form of control sequencing
• Software-transparent retention mechanisms demonstrable with both retention register
designs (fast wake, higher area cost) and scan-based state save and restore (higher
energy cost for retention and higher wake-up latency, at zero area cost)
The approach by Springer to author the primary content for the "Low Power Methodology Manual"
will hopefully enable a number of these techniques to become mainstream in the industry, and
underpin much of the power format standardization work that the EDA companies are actively
engaged in as the thesis is completed.
Future Work
Testability of power gating in production is an issue for high-volume customers. At-speed test
techniques can be used to test for delay "hot-spots" where broken power gates result in excessive
IR voltage drop. However, for battery-powered products, malformed switches that do not turn off
would definitely cause standby life problems, and traditional leakage current tests are no longer able to
discriminate such failures accurately on a tester due to the variation in die-to-die leakage.
• Research and develop on-chip sensing mechanisms to provide Built-In-Self-Test facilities
for power gating designs
Address the reliability of power gating transistors. A number of advanced customers have begun to
express concern about the potential stress mechanisms on power gates.
• Research the reliability issues for the 'header' and 'footer' switch transistors to understand
and develop techniques and mechanisms to mitigate switch wear-out
Fault tolerance to cope with errors at low voltage.
• Apply and extend the techniques developed in this thesis to the "Razor" [14] collaborative
research between ARM Ltd and the University of Michigan
Silicon-on-Insulator (SOI). SOI is a specialized and currently expensive technology that has been
optimized for higher performance than traditional bulk CMOS. As a result, SOI has a significant
static leakage power problem that will affect standby battery life.
• Following on from ARM Ltd's acquisition of an SOI library business, develop optimized
power gating and retention physical IP along the lines of the Power Management Kit
components developed for bulk CMOS
References
[1] Intelligent Energy Controller http://www.arm.com/pdfs/DTO0005C_iec_r0p0_to.pdf
[2] Intelligent Energy Manager http://www.arm.com/products/esd/iem_home.html
[3] ARM1176JZ-S IEM-enabled CPU http://www.arm.com/products/CPUs/ARM1176.html
[4] PowerWise AVS http://www.national.com/appinfo/power/powerwise.html
[5] ARM PMK http://www.arm.com/products/physicalip/standardcell.html
[6] DVS926-130 http://ieeexplore.ieee.org/xpls/abs_all.jsp?tp=&arnumber=1269261
[7] ULTRA926-130 http://www.synopsys.com/sps/pdf/scott_paper.pdf
[8] ATLAS926-65LP http://www.dspdesignline.com/news/190500623
[9] http://www.eecs.berkeley.edu/~chinnery/asic_vs_custom_speed/book_contents.html
[10] ARM IEM OS support http://www.arm.com/pdfs/0171.2%20IEM%20Flyer.pdf
[11] ARM IEM SW support http://www.arm.com/products/esd/iem_softwareoverview.html
[12] DVS926 at PATMOS03 http://www2.polito.it/ricerca/eda/patmos03/Slides/Anlun.pdf
[13] http://www.arm.com/pdfs/DAI0172A_iem_hw_control_system_in_the_arm1176jzfs_dev_chip_app_note.pdf
[14] Ernst D. et al., "Razor: A Low-Power Pipeline Based on Circuit-Level Timing Speculation," Proc. 36th Int'l
Symp. Microarchitecture, IEEE CS Press, 2003, pp. 7-18
http://csdl.computer.org/dl/proceedings/micro/2003/2043/00/20430007.pdf
Appendix A. Perl script for RTL emulation of SRPG
Chapter 4 introduced the concept of post-processing synthesizable RTL to emulate the behaviour
of both power gating and state retention control for simulation.
A technique was jointly developed with Mike Keating at Synopsys to parse and annotate RTL as
described in Chapter 4.
This appendix documents the author's script developed for RTL designed for the SALT926 project
for use with the single-pin NRETAIN controlled retention register IP.
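Before the full listing, the central rewriting step can be shown in isolation. The sketch below is a hedged illustration in Python rather than the author's Perl: it takes the header line of a clocked always block, adds the power and retention control events to the front of its sensitivity list, and wraps the modified and original headers in a simulation-only conditional. The function name, the default signal names PWR and NRET, and the macro name are assumptions chosen to mirror the customisation variables in the script that follows; only the header line is handled here.

```python
import re

# Matches the start of a Verilog always block's sensitivity list: "always @ ("
ALWAYS_HEAD = re.compile(r"always\s*@\s*\(")


def emulate_retention_always(always_line, power="PWR", nretain="NRET",
                             sim_flag="RTL_PG_EMULATE"):
    """Wrap an always-block header so that, for simulation only, the block
    also wakes on the falling edge of the power and retention controls.

    The signal and macro names are illustrative defaults (assumptions).
    """
    new_head = "always @ ( negedge %s or negedge %s or " % (power, nretain)
    # Replace only the first header found on the line.
    new_line = ALWAYS_HEAD.sub(new_head, always_line, count=1)
    # Keep the original header in the `else branch so synthesis sees no change.
    return "`ifdef %s\n%s\n`else\n%s\n`endif\n" % (sim_flag, new_line, always_line)


if __name__ == "__main__":
    print(emulate_retention_always("always @(posedge CLK or negedge RESETn)"))
```

Keeping the untouched header in the `else branch means the same source file can be used for synthesis without the emulation macro defined.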
Appendix: Script for RTL emulation of SRPG
#!/usr/bin/perl
use strict;
#####################################################################
# File: conv_retl_flop.pl
# Purpose: To make each register in the input Verilog file have an
# additional retention register
#
# Note: This script can't handle memory arrays
#
# Usage: % conv_retl_flop.pl file1.v file2.v file3.v
# --> ret_file1.v, ret_file2.v, ret_file3.v
# Writer: dflynn based on an original by trihn
#####################################################################
#--These variables are for customizing the generated output file-----
my $sim_flag = "RTL_PG_EMULATE";
my $nretain = "NRET";
my $power = "PWR";
my $postfix = "_RET"; #postfix for retention registers
my $ret_file = "ret_"; #prefix for the new output file
#--------------------------------------------------------------------
# global vars
my @ff = ();
my %signal_size = (); #store all signals (type reg or integer) and their sizes
my $local_signals = "";
my $if_block = ""; #if-block that resets register
my $count = 0; #count line number
my $tmp = "";
my @arrayl = ();
my $size = "";
my $always = "";
my $new_always = "";
my $line = "";
my $file = "";
#regexpr for terms in verilog
my $s = "(\\s|\\r|\\n|\\t)*"; #space, tab, endline
my $id = "(\\d|\\w|\\-|\\\')+";
#my $range = "((\\[$s$id$s\\])|(\\[$s$id$s\\:$s$id$s\\]))";
my $range = "((\\[(.+)\\])|(\\[(.+):(.+)\\]))";
my $var = "$id$s($range)*";
my $concat = "(\\{$var $s (, $s $var $s)* \\})";
my $term = "(($var)|($concat))";
my $local_declaration = "(integer|reg) ($range)? $s $term ($s , $term)* ;";
########################## MAIN ##############################
foreach $file (@ARGV) {
  &add_retention_register($file);
}
########################## END MAIN ##############################
sub add_retention_register {
  open INFILE, $_[0];
  open OUTFILE, ">" . "$ret_file" . "$_[0]";
  #reset all of the variables
  @ff = ();
  %signal_size = ();
  $local_signals = "";
  $if_block = "";
  $count = 0;
  $tmp = "";
  @arrayl = ();
  $size = "";
  $always = "";
  $line = "";
  while ($line = <INFILE>) {
    $count++;
    #Insert 3 new input ports
    if( $line =~ /(module) $s ($term) $s \(/x) {
      $tmp = $line;
      $line =~ s/(module) $s ($term) $s \(//x;
      print "$3 \n";
      print OUTFILE "\n`ifdef $sim_flag\n";
      print OUTFILE "module $3 ( ";
      print OUTFILE "$power, $nretain, \n";
      print OUTFILE "`else\n";
      print OUTFILE "$tmp";
      print OUTFILE "`endif\n";
      while ( not ($line =~ /\) $s ; /x) ) {
        $line = <INFILE>;
        $count++;
        print OUTFILE "$line";
        #-------try to handle v2k port declaration here---------
        if ($line =~ /(output) $s
                      (reg) $s
                      ($range)? $s
                      ($term) $s
                      (, $s $term $s)*
                     /x) {
          $line =~ s/^$s (output) $s (reg) (\s|\t|\r|\n)/,/x; #remove reg or integer from the string
          $line =~ s/;(.*)/,/; #remove everything after the ';'
          if( $line =~ s/\[(.+)\]//) { $size = $&; } #remove the range
          else { $size = "";}
          @arrayl = split(/,/, $line);
          for($tmp = 0; $tmp < scalar(@arrayl); $tmp++) {
            $arrayl[$tmp] =~ s/$s//g; #remove space, tab,
            $signal_size{$arrayl[$tmp]} = $size;
          }
        }
        #--------------------------------------------------------
      }
      print OUTFILE "\n`ifdef $sim_flag\n";
      print OUTFILE "input $power, $nretain;\n";
      print OUTFILE "`endif\n";
    }
    #find and store registers' name and registers' sizes
    elsif( $line =~ /^$s(reg) $s
                      ($range)? $s
                      ($term) $s
                      (, $s $term $s)*
                      (;)
                     /x) {
      print OUTFILE "$line";
      $line =~ s/^$s(reg) (\s|\t|\r|\n)/,/x; #remove reg or integer from the string
      $line =~ s/;(.*)/,/; #remove everything after the ';'
      if( $line =~ s/\[(.+)\]//) { $size = $&; } #remove the range
      else { $size = "";}
      @arrayl = split(/,/, $line);
      for($tmp = 0; $tmp < scalar(@arrayl); $tmp++) {
        $arrayl[$tmp] =~ s/$s//g; #remove space, tab,
        $signal_size{$arrayl[$tmp]} = $size;
      }
    }
    #find reset block
    elsif ($line =~ /always $s
                     (@) $s
                     (\() $s
                     (posedge|negedge) $s
                     ($term) $s
                     ( ( or $s
                         (posedge|negedge) $s
                         $term $s
                       )
                       |
                       ( $term $s
                       )
                     )
                     (\)) $s
                     (begin ($s:$s$term)?)?
                    /x) {
      print OUTFILE "//--------------------------- $count -----------------------------\n";
      $always = "$&\n";
      $new_always = "$&\n";
      $new_always =~ s/always $s
                       (@) $s
                       (\()/always @ \( negedge $power or negedge $nretain or /x;
      @ff = (); #empty the array
      $if_block = "";
      $local_signals = "";
      $line = <INFILE>; $count++;
      if($line =~ /^ $s begin/x) {
        $always .= "$line";
        $new_always .= "$line";
        $line = <INFILE>;
      }
      #Local signal like integer for 'for' loop
      #they need to be place outside of reset block
      if($line =~ /$local_declaration/x) {
        $local_signals .= $line;
A-4 
-- -- -------------------
Energy efficient sac design technology and methodology 
} 
else { 
$if block .= $line; } -
#Find the end of the reset block (it is ended with "else" or 
"end" ) 
} 
} 
while (not ($line =~ /(\s|\r|\n|\t)(else|end)(\s|\r|\n|\t)/)) {
    $count++;
    $line = <INFILE>;
    #Local signal like integer for 'for' loop
    #they need to be placed outside of reset block
    if($line =~ /$local_declaration/x) {
        $local_signals .= $line; }
    else {
        $if_block .= $line; }
} 
# find non-blocking assignment in reset block
while ($if_block =~ /($term) $s
                     (<=) $s
                     ((.)+) $s
                     (;) /xg) {
    push @ff, $1;
}
print OUTFILE "`ifdef $sim_flag\n";
$tmp = &process_reset(@ff);
print OUTFILE "`endif\n";
print OUTFILE "`ifdef $sim_flag\n";
print OUTFILE "$new_always";
print OUTFILE "`else\n";
print OUTFILE "$always";
print OUTFILE "`endif\n";
#print OUTFILE "$always";
print OUTFILE "$local_signals\n";
print OUTFILE "`ifdef $sim_flag\n";
print OUTFILE "$tmp";
print OUTFILE "`endif\n";
print OUTFILE "$if_block\n";
else { 
} 
} 
print OUTFILE "$line"; 
close INFILE; 
close OUTFILE;
#for debugging 
my $k = ""; 
my $v = ""; 
while (($k, $v) = each %signal_size) { 
# print "$k --------------->$v\n"; 
} 
# This procedure takes an array of ff and adds power up/down, retain
# conditions for those signals
sub process_reset {
my @ff_list = ();
my $tmp = "";
my @tmp_array = ();
my $ret_ff = "";
my $str = "";
my $bare_name = "";
foreach $tmp (@_) {
    if ( $tmp =~ /$concat/x ) {
        #if $tmp is a concat of signals { x, y , z , .. }
        #then it is split into individual signals
        @tmp_array = split(/,|\{|\}/, $tmp);
        foreach $tmp (@tmp_array) {
            $tmp =~ s/$s//g;
            if( $tmp ne "") {
                push @ff_list, $tmp; }
        }
    }
    else {
        $tmp =~ s/$s//g;
        push @ff_list, $tmp;
    }
}
#create some shadow ff for each ff
foreach $tmp (@ff_list) {
    if ($tmp =~ /(.+) ($range)/x) {
        $ret_ff = $1 . "$postfix";
        $bare_name = $1;
        #$tmp =~ s/\[(.*)\]//; #remove the range from the signal name
    }
    else {
        $ret_ff = $tmp . "$postfix";
        $bare_name = $tmp; }
    if($signal_size{$bare_name} ne "already_declared") {
        print OUTFILE "reg ";
        print OUTFILE " $signal_size{$bare_name} ";
        print OUTFILE "$ret_ff;\n";
        $signal_size{$bare_name} = "already_declared";
    }
}
#generate a SAVE process
print OUTFILE "always @(negedge $nretain)\n";
print OUTFILE "begin \n";
foreach $tmp (@ff_list) {
    if ($tmp =~ /(.+) ($range)/x) {
        $ret_ff = $1 . "$postfix";
        $bare_name = $1;
        #$tmp =~ s/\[(.*)\]//; #remove the range from the signal name
    }
    else {
        $ret_ff = $tmp . "$postfix";
        $bare_name = $tmp; }
    print OUTFILE " ";
    print OUTFILE " $ret_ff <= $bare_name;\n";
}
print OUTFILE "end \n";
$str = "if (!$power) begin\n";
foreach $tmp (@ff_list) {
    $str .= " $tmp <= 32'bx;\n";
}
$str .= "end\n";
$str .= "else if (!$nretain) begin \n";
foreach $tmp (@ff_list) {
    if ($tmp =~ /(.+) ($range)/x) {
        $ret_ff = $1 . "$postfix" . "$2"; }
    else {
        $ret_ff = $tmp . "$postfix"; }
    $str .= " $tmp <= $ret_ff;\n";
}
$str .= "end else\n";
return $str;
}
Appendix B. External non-confidential publications 
DVS926 130nm project and Adaptive Voltage Scaling
DesignCon 2003
    ARM / National Semiconductor joint paper
Microprocessor Reports, Jan 2003
    ARM material authored for Max Baron article
Hot Chips 2003 (presentation)
    Intelligent Energy Manager DVFS presentation
Design Automation and Test in Europe, DATE 2004
    DVFS results published with measured energy savings

ULTRA926 130nm project and DVFS Methodology
San Jose Synopsys Users Conference 2005 panel (presentation)
    Joint project disclosed
Synopsys Advanced Technology Group (invited presentation)
    Guest presentation at a 45nm off-site meeting
Design Automation Conference DAC 2005 low power tutorial (presentation)
    Design for DVFS (morning) tutorial
Design Automation Conference DAC 2005 low power tutorial (presentation)
    Design for Leakage (afternoon) tutorial
Design Automation Conference DAC 2005 low power panel (presentation)
    Joint ARM/Synopsys/UMC panel announcing collaboration progress
European Synopsys Users Group, ESNUG 2005
    Joint paper with Synopsys and UMC on ULTRA926 implementation

ATLAS926 65nm LP project and Leakage Mitigation
Design Automation Conference, DAC 2006 low power panel (presentation)
    Joint ARM/Synopsys/TSMC panel, ATLAS project announcement

Leakage Mitigation and early SALT926 90nm disclosures
ARM Designers Conference 2006 ARM/Synopsys (presentation)
    Joint ARM/Synopsys announcement of joint leakage R&D project
Boston Synopsys Users Group 2005 (ARM contributor)
    Joint author for ARM/Synopsys presentation on SALT implementation
DesignCon 2003
System-on-Chip and ASIC Design Conference

A Combined Hardware-Software Approach for Low-Power SoCs:
Applying Adaptive Voltage Scaling and Intelligent Energy Management Software

Krisztián Flautner
David Flynn
ARM Limited

Mark Rives
National Semiconductor Corporation
Abstract
Increased functionality and performance demands are challenging System-on-Chip (SoC)
designers to seek better methods for optimizing available battery power in portable
applications. Key areas of exploration include dynamic voltage scaling and improved
software algorithms for the control of power modes. Dynamic voltage scaling can be
improved by adaptively monitoring hardware performance to minimize the applied
supply voltage for any given clock frequency. While adaptive voltage scaling optimizes
power use based on temporal environmental conditions, Intelligent Energy Management
(IEM) algorithms optimize power consumption based on the dynamic workload of the
processor. IEM software and hardware monitor the execution and communication
characteristics of workloads and predictively set the performance of the processor to the
level that minimizes energy use, while still meeting application deadlines. The combined
use of adaptive voltage scaling and IEM provides the optimum trade-off between
performance and battery life for portable devices.
Authors' Biography
Krisztián Flautner, Principal Research Engineer, ARM, holds a Ph.D. degree in Computer
Science and Engineering from the University of Michigan. His thesis explored the
relevance of multithreading for interactive desktop workloads and described the
implementation of an automatic power-management algorithm for processors supporting
dynamic voltage scaling. Dr. Flautner's research interests are focused on simple ideas
that enable high-performance low-power computers to support advanced software
environments. In the research group at ARM Limited, he is currently working on the next
generation ARM architecture.
David Flynn has been with ARM for 11 years and is a Fellow in the Research and
Development group based in Cambridge, UK, specializing in System-on-Chip IP
deployment and methodology. He is the original architect behind ARM's synthesizable
CPU family and the AMBA on-chip interconnect standard. His current research focus is
low-energy system-level design. He holds a number of patents in on-chip bus, low power
and embedded processing sub-system design (8 US, 21 worldwide) and has a BSc (1st) in
Computer Science from Hatfield Polytechnic, UK.
Mark Rives, Principal Applications Engineer, National Semiconductor Corporation,
joined National in 1995 where he is responsible for supporting advanced development
projects in the Portable Power Group. His previous experience includes the design and
support of National's direct IF-sampling Diversity Receiver Chipset in addition to a wide
range of both system level and IC design projects from pro audio gear to pacemakers. He
received his B.S. in Electrical Engineering from Mississippi State University in 1987.
Introduction
Low power consumption is arguably the most important feature of embedded processors,
which significantly impacts the cost and physical size of the end device. Even though the
processor may not be the most power-hungry component of a system, it is essential to
manage processor power in order to reduce overall system power consumption. Better
processor power efficiency can increase the available power budget for features such as
color screens and backlights, which are growing in popularity on portable devices.
Historically, low power consumption in embedded processors has been achieved through
simple designs, limited use of speculation, and employing a number of low-power sleep
modes that reduce idle-mode power consumption. Embedded processors are now
performing more sophisticated tasks, which require ever-higher performance levels. As a
result, new processor designs are more dependent on sophisticated architectural
techniques (such as prediction and speculation) to achieve high performance.
Unfortunately, such techniques can also significantly increase the processor's power
consumption.
Process technology trends are also complicating the power story. Until recently, CMOS
transistors consumed negligible amounts of power under static conditions. However, as
process geometries shrink to provide increasing speed and density, their static (leakage)
power consumption has also increased. Current estimates suggest that static power
accounts for about 15%-20% of the total power on chips implemented in 0.13 µm
high-speed processes. Moreover, as process technology moves below 0.1 µm, static power
consumption is set to increase exponentially, and will soon dominate the total power
consumed by the processor.
[Plot: leakage power (vertical axis, 0 to 1200) against minimum gate length (0.2, 0.15, 0.1
and 0.05 µm), with one curve each for 105°C, 75°C, 50°C and 25°C]
Figure 1. Normalized leakage power through an inverter
The circuit simulation parameters, including threshold voltage, were
obtained from the Berkeley Predictive Spice Models [1]. The leakage
power numbers were obtained by HSPICE simulations.
Figure 1 shows projections for leakage power increase in future process technologies.
There is a strong correlation between the operating temperature and the amount of
leakage power. However, regardless of the temperature, all lines exhibit exponential
trends. In embedded processors, where the majority of transistors are usually dedicated to
memory structures (such as caches), leakage is a particularly important problem to attack,
since the static power consumption of these structures can dominate overall power
consumption.
Power Saving Opportunities
A way to bridge the gap between high performance and low power is to allow the
processor to run at different performance levels depending on the current workload. An
MPEG video player, for example, requires about an order of magnitude higher
performance than an MP3 audio player. Even greater savings can be achieved by
reducing the processor's supply voltage as the clock frequency is reduced. Dynamic
Voltage Scaling (DVS) exploits the fact that the peak frequency of a processor
implemented in CMOS is proportional to the supply voltage, while the amount of
dynamic energy required for a given workload is proportional to the square of the
processor's supply voltage [2]. Reducing the supply voltage while slowing the
processor's clock frequency yields a quadratic reduction in energy consumption, at the
cost of increased run time.
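The scaling argument above can be checked with a few lines of arithmetic. The sketch
below assumes only the idealized relations stated in the text (peak frequency proportional
to supply voltage, dynamic energy per workload proportional to the square of the supply
voltage). The function name and the normalized units are illustrative and do not come
from the paper; leakage and any minimum operating voltage are ignored.

```python
def dvs_scaling(speed_fraction):
    """Idealized DVS model: frequency ~ V, dynamic energy per workload ~ V^2.

    speed_fraction is the clock frequency relative to peak (0 < f <= 1).
    Returns (relative_voltage, relative_energy, relative_run_time), all
    normalized to running the same workload at peak frequency.
    """
    if not 0.0 < speed_fraction <= 1.0:
        raise ValueError("speed_fraction must be in (0, 1]")
    relative_voltage = speed_fraction          # assumption: f proportional to V
    relative_energy = relative_voltage ** 2    # assumption: E proportional to V^2
    relative_run_time = 1.0 / speed_fraction   # same cycle count, slower clock
    return relative_voltage, relative_energy, relative_run_time


for f in (1.0, 0.5, 0.25):
    v, e, t = dvs_scaling(f)
    print(f"speed {f:.2f}: voltage {v:.2f}, energy {e:.4f}, run time {t:.1f}x")
```

Under these assumptions, running at half speed with half the supply voltage uses a
quarter of the dynamic energy and takes twice as long.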
[Diagram: required processing speed, power consumption and energy consumption over time,
each drawn for Traditional Power Management and for Dynamic Voltage Scaling; annotated
"Significantly lower total energy consumption"]
Figure 2. Traditional Power Management vs. Dynamic Voltage Scaling
Often, the processor is running too fast. For example, it is pointless from a
quality-of-service perspective to decode the 30 frames of a video in half a second, when the
software is only required to display those frames during a one second interval.
Completing a task before its deadline is an inefficient use of energy [3]. The key to taking
advantage of this trade-off is the use of performance-setting algorithms that aim to reduce
the processor's performance level (clock frequency) only when it is not critical to meet
the application's deadlines. Figure 2 illustrates a significantly lower total energy
consumption using dynamic voltage scaling compared with traditional gated-clock power
management, for the same workload. Note that with DVS, the lower supply voltage
reduces static power even when the clock is gated off.
Static leakage power can also be substantially reduced if the processor does not always
have to operate at its peak performance level. One technique for accomplishing this is
adaptive reverse body biasing (ABB). Combined with dynamic voltage scaling, this can
yield substantial reductions in both leakage and dynamic power consumption [4]. The
key enabler for controlling both DVS and ABB is knowledge about how fast a given
workload needs to run. This information can be provided by performance-setting
algorithms that take various operating system and optional application-specific
information into account to provide an estimate for the necessary performance level of
the processor.
Just as performance-setting algorithms optimize power consumption based on workload
variations, significant power efficiency can also be gained if the processor does not have
to operate under worst-case assumptions but can tune its operating parameters to
temporal environmental conditions [5]. Processors are designed to operate reliably over a
wide range of temperature levels and variations of the silicon substrate. Increased voltage
levels must be used to assure the large safe-operating range at the cost of reduced power
efficiency. By monitoring the margin between expected and actual operating conditions,
the voltage level of the processor can be reduced without sacrificing operational stability.
This closed-loop monitoring of system margin will be referred to as adaptive voltage
scaling (AVS).
While DVS, AVS, and ABB are effective ways of managing the processor's power
consumption, integrating these ideas into SoC designs has proven to be a significant
challenge. The key issue is that not all parts of the SoC can be scaled in equal measure.
Consequently, multiple voltage and frequency domains with asynchronous interfaces are
required. Moreover, these extra parameters complicate testing and validation processes
and require special support from synthesis tools.
System-on-Chip Implementation
Figure 3 depicts the system architecture required to implement the systemic power
reduction schemes in a SoC design combining Intelligent Energy Management (IEM)
with Adaptive Power Control (APC). The IEM interfaces with the CPU through the
AMBA Peripheral Bus, allowing it to be easily added to any ARM-based SoC design.
[Block diagram. Blocks: IEM, SoC-specific Clock Management Unit, Adaptive Power
Controller, Hardware Performance Monitor, External Switching Power Supply. Labelled
signals include APB, cpuclk, pclk, hclk, current_perf, clk_control, sclk and data]
Figure 3. IEM + AVS Architecture
The IEM software and hardware monitor the system workload to generate a performance
request. The APC can then set the correct operating voltage in either open-loop or
closed-loop mode without processor intervention. The APC will transparently provide the
fastest possible response while assuring that the processor will always receive the minimum
safe operating voltage for any given clock frequency. The APC would also coordinate all
clock switching, including the verification of stable supply voltage. The IEM provides a
uniform software interface to simplify implementation and reuse. The APC provides an
open-standard interface to the external power supply.
ARM Limited and National Semiconductor Corporation have agreed to work together to
offer synthesizable intellectual property (IP) to implement the IEM and AVS
functionality for SoC designers. Work is underway to assure support from both design
tool vendors and operating system vendors to allow SoC designers to implement this
power saving technology transparently. The following sections provide more detail on the
key components of the solution.
Intelligent Energy Management
Completing a task before its deadline, and then idling, is significantly less energy
efficient than running the task more slowly so that the deadline is met exactly. The goal is
to reduce the performance level of the processor without allowing applications to miss
their deadlines. The central issue is how the right level of performance can be predicted
for the application.
The Intelligent Energy Management (IEM) framework provides a hardware and software
mechanism for achieving these goals: it standardizes the interface for setting the
processor's performance level, specifies counters for measuring the amount of work that
is being accomplished, and includes operating system and application-level algorithms
for predicting future behavior.
The IEM software layer has the ability to combine the results of multiple algorithms and
arrive at a single global decision. The policy stack illustrated in Figure 4 supports
multiple independent performance-setting policies in a unified manner. The primary
reason for having multiple policies is to allow the specialization of performance-setting
algorithms to specific situations, instead of having to make a single algorithm perform
well under all conditions. The policy stack keeps track of commands and
performance-level requests from each policy and uses this information to combine them
into a single global performance-level decision when needed.
Policy (performance control) stack        Policy event handlers
          Command     Performance
Level 2   SET_IFGT    80                  Common events
Level 1   IGNORE      0                   • On reset
Level 0   SET         25                  • On task switch
                                          • On perf change
Figure 4. Performance policy stack
The different policies are not aware of their positions in the hierarchy and can base their
performance decisions on any event in the system. When a policy requests a performance
level, it submits a command along with its desired performance to the policy stack. The
command specifies how the requested performance should be combined with requests
from lower levels on the stack. It can specify to ignore (IGNORE) the request at the
current level, to force (SET) a performance level without regard to any requests from
below, or set a performance level only if the request is greater than anything below
(SET_IFGT). When a new performance level request arrives, the commands on the
stack are evaluated bottom-up to compute the new global performance level. In Figure 4,
the evaluation would yield the following: at level 0 the global prediction is set to 25, at
level 1 it remains at 25, and level 2 changes the prediction to 80.
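The bottom-up evaluation described above can be sketched in a few lines. This is a
minimal illustration rather than the IEM implementation: the `Command` enum, the
`evaluate_policy_stack` function, the list representation of the stack and the starting
value of 0 are all assumptions made for the example. Only the three command names and
the Figure 4 values (25, 0 and 80) come from the text.

```python
from enum import Enum


class Command(Enum):
    IGNORE = "IGNORE"      # this level contributes nothing
    SET = "SET"            # force this level's request, whatever lies below
    SET_IFGT = "SET_IFGT"  # apply only if greater than the result so far


def evaluate_policy_stack(stack, initial=0):
    """Evaluate (command, performance) entries bottom-up.

    stack[0] is level 0, the bottom of the stack. `initial` is an assumed
    starting value used when no lower level has set a performance yet.
    """
    level = initial
    for command, performance in stack:
        if command is Command.SET:
            level = performance
        elif command is Command.SET_IFGT and performance > level:
            level = performance
        # Command.IGNORE leaves the running result unchanged
    return level


# The stack from Figure 4, listed from level 0 upwards.
figure4 = [
    (Command.SET, 25),
    (Command.IGNORE, 0),
    (Command.SET_IFGT, 80),
]
print(evaluate_policy_stack(figure4))
```

With the Figure 4 entries the running result is 25 after level 0, 25 after level 1 and 80
after level 2, matching the walk-through in the text.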
Using this system, performance requests can be submitted at any time and a new result
computed without explicitly having to invoke all the performance-setting policies. While
policies can be triggered by any event in the system and may submit a new
performance request at any time, there are sets of common events of interest to all. On
these events, instead of recomputing the global performance level each time a policy
modifies its request, the performance level is computed only once after all interested
policies' event handlers have been invoked. Currently the common events are
reset, task switch, task create, and performance change. The performance change event is
a notification which is sent to each policy and does not usually cause any changes to the
performance requests on the stack.
There are significant benefits in using multiple performance-setting policies, each
optimized for a particular situation, instead of a single one that needs to be optimal under
all circumstances. Figure 5 provides some qualitative insight into the characteristics of
the IEM algorithm vs. LongRun, a conventional algorithm implemented in the Crusoe
processor's firmware. The biggest difference between the two algorithms is that while
LongRun keeps on ramping the performance level up and down in fast succession, the
IEM algorithm stays close to a target performance level.
[Plots: performance level over time under LongRun and under IEM, shown for the entire
movie and for a 1 second movie segment]
Figure 5. Performance-setting during MPEG video playback of Red's Nightmare
To achieve the most effective energy reduction with minimal intrusion, application
monitoring and performance-setting decisions need operating system involvement.
[Bar charts for two movies, "Danse De Cable MPEG" and "Legendary MPEG": fraction of time
spent at each performance level, shown for LongRun and for IEM]
Figure 6. MPEG video playback LongRun vs. IEM
Figure 6 illustrates the fraction of time spent at each of the processor's four performance
levels (300, 400, 500, and 600 MHz) using the Crusoe's built-in LongRun power
manager, contrasted with IEM during playbacks of two MPEG movies. The data for both
algorithms were collected on the same hardware. However, during the IEM
measurements, the built-in LongRun power manager was disabled. While the playback
quality of the different runs was identical, it can be seen that IEM spends significantly
more time below peak performance than LongRun. During the first movie, IEM switches
mostly between two performance levels. The machine's minimum 300 MHz and 400
MHz clock frequencies are sufficient for the first movie, while during the second, it
settles on the processor's third performance level at 500 MHz. LongRun, on the other
hand, chooses the machine's peak performance setting for the dominant portion of
execution time during both movies.
Voltage Scaling Methods and Benefits
Currently, proprietary dynamic voltage scaling (DVS) solutions offer improved
performance by reducing the supply voltage as the clock frequency is reduced. Open-loop
DVS, as shown in Figure 7, allows the processor to set the supply voltage based on a
table of frequency/voltage pairs. This table must be determined by characterization to
assure sufficient margin for all operating conditions and process corners.
[Block diagram. Blocks: System Processor with a frequency/voltage table, Clock Management
Unit, Companion Power Supply. Labelled signals: VDD, VDD OK, VSET]
Figure 7. Proprietary Open-Loop DVS
In operation, the processor must determine the desired operating frequency, request a new
voltage, wait for the voltage to stabilize, and then switch itself to the new frequency. The
switch may be made immediately when changing from a higher frequency to a lower
frequency. When switching from a lower frequency to a higher frequency, the power
supply voltage must be high enough to support the new frequency prior to changing the
clock. Power supply stability can be assured either by a time delay or by an analog
measurement. Use of a time delay is risky, since there will always be a desire to
implement the minimum possible delay for enhanced processor response time.
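The ordering constraint in this sequence can be made explicit with a small sketch. The
table values, the step names and the `plan_transition` function below are hypothetical and
serve only to illustrate the ordering; a real table comes from characterization of the
specific design. Lowering the voltage after a downward frequency switch is one safe
ordering assumed here, since the text states only that the switch itself may be immediate.

```python
# Hypothetical frequency/voltage table (MHz -> volts), for illustration only.
FV_TABLE = {10: 1.1, 20: 1.2}


def plan_transition(current_mhz, target_mhz, table=FV_TABLE):
    """Return the ordered steps for an open-loop DVS frequency change."""
    if current_mhz not in table or target_mhz not in table:
        raise KeyError("frequency not in characterized table")
    if target_mhz == current_mhz:
        return []
    volts = table[target_mhz]
    if target_mhz > current_mhz:
        # Speeding up: the higher voltage must be stable before the clock changes.
        return [
            ("set_voltage", volts),
            ("wait_voltage_stable", volts),
            ("set_frequency", target_mhz),
        ]
    # Slowing down: the clock may switch immediately; the voltage is lowered after.
    return [
        ("set_frequency", target_mhz),
        ("set_voltage", volts),
    ]


print(plan_transition(10, 20))
print(plan_transition(20, 10))
```

The `wait_voltage_stable` step stands for either of the two options named in the text,
a fixed time delay or an analog measurement of the supply.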
Open-loop operation can be simplified by creating an Adaptive Power Controller (APC)
module to off-load the voltage scaling and clock management from the processor. The
APC approach supports a common software API allowing the DVS function to be easily
accessed by applications or the operating system. Providing a standard interface to the
external power supply also simplifies system design and facilitates second-source options
for the power supply component. An architecture using an APC is shown in Figure 8.
[Block diagram. Blocks: System Processor, APC (Performance vs Voltage Table, Clock
Management Unit, APC Controller), Companion Power Supply. Labelled signals: VDD,
VDD OK, VSET]
Figure 8. Vendor Independent Open-Loop DVS
Closed-loop or Adaptive Voltage Scaling (AVS) is a new approach, which offers
improved performance and ease of implementation compared to open-loop DVS. In the
closed-loop system shown in Figure 9, the voltage is set automatically by monitoring the
system's performance margin and adjusting the supply voltage adaptively.
[Block diagram. Blocks: System Processor, APC (Performance vs Voltage Table, Clock
Management Unit, APC Controller, Hardware Performance Monitor), Companion Power Supply.
Labelled signals: VDD, VSET]
Figure 9. Closed-Loop AVS
Since the system is closed-loop in nature, a much finer degree of control over the voltage
is possible when compared to the discrete table values in an open-loop system. Response
time of the AVS system can be much faster, since it is limited only by the external power
supply. The performance measuring circuitry can be used to verify the power supply
stability to offer the fastest possible switching from one clock frequency to the next.
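The closed-loop idea can be illustrated with a toy feedback loop. Everything in this
sketch is an assumption made for illustration: the step size, target margin, hysteresis
band, voltage limits and the simple monitor model (margin equals supply minus the voltage
currently required) are not taken from the paper or from any real AVS controller.
Voltages are held as integer millivolts to keep the arithmetic exact.

```python
def avs_step(voltage_mv, margin_mv, target_mv=20, hysteresis_mv=20,
             step_mv=10, v_min_mv=900, v_max_mv=1800):
    """One iteration of a hypothetical closed-loop supply adjustment.

    margin_mv is the measured performance margin expressed in millivolts;
    positive means the supply is higher than currently needed.
    """
    if margin_mv < target_mv:
        voltage_mv += step_mv      # too little margin: raise the supply
    elif margin_mv > target_mv + hysteresis_mv:
        voltage_mv -= step_mv      # excess margin: lower the supply
    return min(max(voltage_mv, v_min_mv), v_max_mv)


def settle(voltage_mv, required_mv, iterations=200):
    """Run the loop against a toy monitor: margin = supply - requirement."""
    for _ in range(iterations):
        voltage_mv = avs_step(voltage_mv, voltage_mv - required_mv)
    return voltage_mv


cool = settle(1800, 1500)   # start at the worst-case supply, part needs 1500 mV
warm = settle(cool, 1600)   # the requirement rises, e.g. as the die heats up
print(cool, warm)
```

In this model the supply settles a small margin above whatever the part currently
requires, and follows the requirement upward when conditions change, which is the
behaviour the text attributes to closed-loop operation.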
Closed-loop operation also offers improved power savings since the operating voltage
margin may be reduced due to the continuous voltage updates. Any temperature effects
are inherently compensated by the necessary change in supply voltage. This allows the
AVS-equipped SoC to be operated at a lower voltage at room temperature since the
voltage will be increased automatically as the temperature increases.
For example, in a 1.8V system with +/-5% tolerance, the system must operate at 85C and
1.71V. For a 200mA load, this equates to 342mW. Even though the system will operate
at a lower voltage at 25C (say 1.5V), normally at least 1.71V must be provided to assure
85C operation. With closed-loop AVS, the system can be run on 1.5V at 25C with no
problems since the AVS technology will increase the voltage if necessary. This allows a
25C power of 300mW, a saving of 42mW or 14%, even at the maximum clock
frequency.
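The figures in this example can be reproduced directly. The sketch below assumes, as the
quoted numbers imply, that the load current stays at 200mA at both voltages; the variable
and function names are illustrative only.

```python
def supply_power_mw(voltage_v, load_ma):
    """Power in milliwatts for a given supply voltage and load current."""
    return voltage_v * load_ma


nominal_v = 1.8
tolerance = 0.05
load_ma = 200.0   # assumed constant at both voltages

worst_case_v = nominal_v * (1.0 - tolerance)        # 1.71 V
fixed_mw = supply_power_mw(worst_case_v, load_ma)   # about 342 mW
avs_mw = supply_power_mw(1.5, load_ma)              # 300 mW
saving_mw = fixed_mw - avs_mw                       # about 42 mW

print(f"fixed supply: {fixed_mw:.0f} mW, AVS at 25C: {avs_mw:.0f} mW")
print(f"saving: {saving_mw:.0f} mW")
print(f"relative to AVS power:   {100 * saving_mw / avs_mw:.1f}%")
print(f"relative to fixed power: {100 * saving_mw / fixed_mw:.1f}%")
```

The 14% quoted in the text corresponds to the 42mW saving expressed relative to the
300mW AVS figure; relative to the 342mW fixed-supply figure the same saving is about
12.3%.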
Figure 10 shows measured data for a closed-loop AVS system running at 32MHz,
16MHz, 8MHz, and 4MHz. The plot shows voltage vs. time where the highest voltage is
associated with 32MHz operation. The three traces show how the closed-loop AVS
voltage is automatically adjusted with changing temperature.
[Plot "Vavs vs. Temperature": voltage (0 to 4) against time (-1 to 4 s), with traces for
Vavs at +25C, -25C and +60C across successive 32MHz, 16MHz, 8MHz and 4MHz periods]
Figure 10. AVS Voltage vs. Temperature
Figure 11 compares the power used by a system with a fixed 3.3V supply with the power
used with AVS. The effect of reduced margin at 25C can be seen in reduced power, even
at the highest clock frequency.
[Plot "AVS and Fixed 3.3V Power": power (0 to 45) against time (-1 to 4 s), with traces
Power AVS and Power 3.3V]
Figure 11. Power Reduction with AVS
Figure 12 shows the percentage of power used by the AVS system relative to that used by
the fixed 3.3V system. As the clock frequency is decreased, the power savings provided
by AVS can increase to 80% or more when using a switching regulator for the power
supply.
[Plot "AVS Power as a Percent of 3.3V Power": AVS/3.3V power (0 to 100%) against time
(-1 to 4 s)]
Figure 12. Relative Power Usage of AVS vs. a 3.3V Fixed Supply
SoC Design Flow Issues
Until now, dynamic voltage scaling has only been commercially exploited in stand-alone
CPU integrated circuits. Supporting voltage scaling of processing sub-systems within a
system-on-chip design requires enhancements to both EDA tools and design
methodology. Key issues include:
• Multiple physical power domains
• Synchronous clock relationships across boundaries
• Standard-cell library and RAM compiler design views
• Static timing verification
• Manufacturing test
Multiple power domains require careful handling at interfaces where some form of
analog level-shifting is required between different voltages. Also, many EDA tools treat
voltage rails as special global resources which are implicitly connected, which makes
separating voltage domains a manually intensive design step.
Best-practice SoC design flows typically assume synchronous clocking relationships
between sub-systems in order to allow top-level static timing-closure and analysis,
automatic test structure insertion and test pattern generation. Ideally, multiple voltage
domains should be treated as asynchronous because the tolerancing of buffered clocks
across the top-level system becomes near-impossible where sub-systems can have
variable voltage with respect to each other. Different sub-systems have inherently
variable clock buffer latencies.
Cell libraries and memory compilers are normally characterized and modeled for a
process and temperature range across a tightly toleranced (+/- 5% to +/- 10%) supply
voltage. To ensure design integrity with voltage scaling, more comprehensive timing
models are required. The design tools make this harder because in order to design for
multiple performance levels, the target sub-system frequency must first be specified.
Then, the power supply requirements must be determined to provide sufficient voltage to
maintain operation, either statically or adaptively. However, from a design-flow
perspective, it is necessary to work the other way around: start with a defined voltage and
then calculate the achievable performance from the static timing analysis at this precise
voltage. Characterizing RAMs at low voltage is complicated by the fact that sense
amplifier performance degrades non-linearly with respect to logic gate speeds.
Verification of static timing and functional test are complicated with voltage scaling of
parts of the SoC design. The EDA tools need to be guided.
Implications for Front-end Design
Front-end design tools typically read in RTL descriptions in Verilog or VHDL of the
hardware design, for both simulation and synthesis. Such HDL descriptions have no
concept of multiple power rails; a global view of power and ground is assumed.
Similarly, clocks and resets are treated as ideal signals in the HDL. These are later
buffered as carefully balanced high-fanout buffer-tree networks.
The boundaries between voltage domains must be handled with detailed management of
hierarchy and the instantiation of explicit voltage level-shifter cells between different
voltage rails. The onus is on the designer to carefully abstract out the top-level
management of clocks, resets, test scan chains and power management such that
individual sub-systems can be synthesized and even hardened independently using
standard ASIC design flows.
Implications for Back-end Design
The layout tools need to understand separate voltage rails, and this may require manual
intervention and careful inspection and review of the conversion from the front-end logical
design flow to the place and route implementation phase.
In the worst case, the cell library may need to be replicated with special cell and
power-rail naming schemes to ensure that optimization, setup and hold timing fixes applied to
the post-routed top-level design do not accidentally stray over voltage domains or
level-shifter boundaries.
Design verification needs to be extended beyond standard ASIC design flows to cover the
extra complication of analog level-shifter integrity. This is especially relevant to power
domains that can be powered off completely. These must not draw static currents from
driven inputs, and need outputs clamped during power down and power up (i.e. operating
outside valid logic state operation).
An ARM926EJ-S based design with independent voltage scaling of the cached CPU,
which tackles all these design tool issues, is scheduled for fabrication in February 2003.
Conclusions
The ARM Intelligent Energy Manager (IEM) provides continuous predictive monitoring
of the CPU workload. It attempts to run the clock frequency at the lowest available value
while still completing the work prior to its deadline. The correct performance level is set
by predictive algorithms that are embedded in the operating system kernel to monitor all
processes.
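The selection rule described above (run at the lowest available clock that still meets the
deadline) can be sketched in a few lines of Python. This is only an illustration under
stated assumptions: the function name, the list of available frequencies and the idea that
the workload is known as a cycle count are all hypothetical; the IEM's actual predictive
algorithms are not described here.

```python
def lowest_sufficient_clock(work_cycles, deadline_s, available_hz):
    """Pick the slowest available clock that completes the work before the deadline.

    work_cycles, deadline_s and available_hz are illustrative inputs (assumptions);
    a real predictor would have to estimate the outstanding work.
    """
    for freq in sorted(available_hz):
        if work_cycles / freq <= deadline_s:
            return freq
    # No level meets the deadline: fall back to the fastest clock available.
    return max(available_hz)


# 30 million cycles due in 0.5 s: 50 MHz is too slow (0.6 s), 100 MHz suffices (0.3 s).
chosen = lowest_sufficient_clock(30e6, 0.5, [200e6, 50e6, 100e6])
```

Running faster than the chosen level would finish the work early without benefit, which is
the slack that voltage scaling then converts into an energy saving.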
National Semiconductor Corporation's AVS technology accepts the IEM's performance
request and sets the lowest possible operating voltage for any resulting clock frequency.
Sufficient margin is always present to assure proper operation. Since National's
hardware performance monitor is always adjusting the voltage for sufficient margin at
any given clock frequency, the effects of process and temperature variation are inherently
corrected. If the temperature rises, the margin will decrease and the voltage will be
increased to compensate.
The combination of these two technologies will provide optimum power savings for
embedded processors in portable systems.
The proposed dynamic voltage scaling system also forms the basis for techniques that
address a chip's static (leakage) power consumption. Some of our initial investigations
are described in [4].
References
[1] http://www-device.eecs.berkeley.edu
[2] T. Mudge. Power: A First Class Architectural Design Constraint. IEEE Computer,
vol. 34, no. 4, April 2001.
[3] K. Govil, E. Chan, and H. Wasserman. Comparing Algorithms for Dynamic
Speed-Setting of a Low-Power CPU. Proceedings of the First International Conference on
Mobile Computing and Networking, November 1995.
[4] S. Martin, K. Flautner, D. Blaauw, and T. Mudge. Combined Dynamic Voltage
Scaling and Adaptive Body Biasing for Optimal Power Consumption in
Microprocessors under Dynamic Workloads. Proceedings of the International
Conference on Computer Aided Design (ICCAD 2002), San Jose, CA, November
2002.
[5] S. Dhar, D. Maksimovic, and B. Kranzen. Closed-loop Adaptive Voltage Scaling
Controller for Standard-cell ASICs. Proceedings of the 2002 International
Symposium on Low-Power Electronics and Design (ISLPED 2002), Monterey, CA,
August 2002.
A Reed Electronics Group publication
MICROPROCESSOR REPORT
www.MPRonline.com
THE INSIDER'S GUIDE TO MICROPROCESSOR HARDWARE
ANALOG AND CPU WIZARDS REDUCE DIGITAL POWER
National Semiconductor and ARM Increase Battery Life
By Max Baron {1/21/03-01}
Memory speed is no longer guilty of limiting processor performance. The infamous title
has been awarded to battery capacity. Cellular telephones, PDAs, notebooks, and portable
multimedia devices could bring higher microprocessor revenues and more rewarding
improvements in performance and functions, if only batteries could be made to last
longer. Increases in battery capacity are still creeping along the roadmap, a line of
progress that, if plotted, would look almost horizontal compared with one charting the
evolution of microprocessors. Until a small, practical fuel cell or similar miracle comes
along, microprocessor developers must come up with power-reduction methods.
Answering the call to arms, National Semiconductor Corp. and ARM announced, in
November 2002, a strategic business relationship to jointly develop and market
power-efficient systems that, they claim, will increase the battery life of handheld
portable devices in several stages, from 25% to as much as 400%. The two companies'
joint effort will leverage ARM's penetration in the mobile phone market and National
Semiconductor's expertise in analog design and power management.
The overall handset market (including mobile phones, smart phones, and handheld
devices) is expected to grow to more than 525 million devices by 2006, an increase of 31%
from 2002, according to market researcher In-Stat/MDR.
Power Management Beyond Clock Gating
Faced with the need to find sockets in portable-device markets, microprocessor and ASSP
vendors have used clock gating to temporarily turn off unneeded peripherals, blocks of
on-chip memory, and, during idle periods, even the processor itself. ARM and NSC
propose to obtain further power reductions by intelligent control of frequency, supply
voltage, and leakage current.
Combined control of frequency and voltage can reduce both power and energy
requirements. Frequency reduction alone contributes linear savings in power but does
not, by itself, reduce the amount of energy required to complete a task. Reducing
frequency is justified, however, whenever a task's early completion will not improve
perceived performance or, because of dependencies on other tasks, will even yield
incorrect results. Lower frequencies can be supported by lower voltage levels, which have
a quadratic effect on reducing power requirements and contribute to lowering energy
consumption. In most current schemes that use frequency-voltage power reduction, the
voltage is delivered in open-loop mode, sans feedback from chip internals.
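The distinction between power and energy drawn above can be checked numerically. The
sketch below assumes the standard first-order CMOS switching model, P = C * V^2 * f, which
is not stated in the article itself; the effective capacitance, voltages, frequencies and
cycle count are made-up illustrative values.

```python
import math


def dynamic_power(c_eff, vdd, freq_hz):
    """First-order CMOS switching power (assumed model): P = C * V^2 * f."""
    return c_eff * vdd ** 2 * freq_hz


def task_energy(c_eff, vdd, freq_hz, cycles):
    """Energy to run a fixed number of cycles: power multiplied by run time."""
    return dynamic_power(c_eff, vdd, freq_hz) * (cycles / freq_hz)


C_EFF = 1e-9      # hypothetical effective switched capacitance, farads
CYCLES = 1e6      # hypothetical task length, clock cycles

full = task_energy(C_EFF, 1.2, 200e6, CYCLES)
half_freq = task_energy(C_EFF, 1.2, 100e6, CYCLES)   # frequency halved, same voltage
half_both = task_energy(C_EFF, 0.6, 100e6, CYCLES)   # frequency and voltage halved

# Halving frequency alone halves power but leaves task energy unchanged;
# halving the voltage as well cuts task energy to a quarter.
assert math.isclose(half_freq, full)
assert math.isclose(half_both, full / 4)
```

Under this model the frequency term cancels out of the task energy, which is why the
savings come from the voltage reduction that a lower frequency makes possible.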
Companies such as AMD, Intel, and Transmeta have obtained good results using this type
of frequency-voltage management in addition to clock gating. Intel has used the approach
in its PXA250 chip, which can be switched through several frequencies and voltages,
depending on workload and peripheral activity. For IA-32 chips, Intel uses its SpeedStep
© IN-STAT/MDR, JANUARY 21, 2003, MICROPROCESSOR REPORT
technology, which establishes two frequency-voltage points to save battery power. For its
Crusoe chip, Transmeta has introduced LongRun, a table-based set of multiple
frequency-voltage points that helps Crusoe track workloads more efficiently than by using
a few steps. AMD, with PowerNow!, uses an approach that is similar to Transmeta's.
As always, a few problems must be overcome. Accurate frequency points can be obtained by
reference to a system clock, and good power supplies can provide reliably accurate
voltage. Nevertheless, operating frequency must be guard-banded to ensure the application
will run at speed, and there must be reliable data to determine frequency points. Power
supplies must be guard-banded also: open-loop voltage sources must use fairly wide
voltage guard bands to ensure appropriate voltage levels at all points inside a chip. The
guard bands must cover IR drops and fabrication tolerances, and they must ensure chip
operation under all foreseeable environmental conditions. Voltage guard-band definitions
require good knowledge and predictability of the process used to fabricate the chips.
Leakage current is rapidly becoming a sizeable sink of wasted power. As chips begin to be
fabricated on processes of 130nm and less, leakage current will rapidly climb to more
than 20% of total microprocessor power. Clock gating and frequency-voltage matching are
no longer sufficient; creating and controlling separate power domains become important.
To avoid misinterpretation, MPR will define chip areas under the control of one clock
frequency as "frequency domains" and will similarly define "voltage domains" as areas
supplied with a common voltage. A "power domain" is one whose power can be turned off to
minimize a system on chip's leakage current. Chip areas can belong to one or more domains.
The Hardware-Software, Mixed-Signal Solution
NSC and ARM's joint project aims to create circuits, software, and tools that address
three energy-consumption tasks. First, the seemingly trivial problem of matching
frequency to workload must be solved. Second, the approach must determine the absolute
minimum supply voltage needed over process and temperature variations to generate
reduced-width voltage guard bands. Third, the end results must support creation of power
domains used to minimize leakage current.
Figure 1. Frequency-voltage power-management block diagram shows policy-using ARM's
Intelligent Energy Management (IEM) unit and the Adaptive Power Controller (APC) and
Hardware Performance Monitor (HPM) that help reduce voltage guard bands.
Figure 1 shows a conceptual block diagram of power management by matching frequency and
voltage to workloads. The first product, based on NSC's PowerWise technology, targets
embedded SoC devices in mobile phones. The power-reduction architecture uses an
ARM-designed Intelligent Energy Management (IEM) block that combines software and
hardware to monitor the system workload and generate appropriate performance/frequency
requests. The IEM interfaces with the CPU through the AMBA peripheral bus, allowing it to
be added to AMBA-based SoC designs. Using a performance request from the IEM, an Adaptive
Power Controller (APC) can set the correct operating voltage in either open-loop or
closed-loop mode and, in its turn, interface with the clock-management unit to enable
transitions to new frequencies.
The APC receives commands from the Hardware Performance Monitor (HPM) when new, higher
frequencies can be deployed and enables the new clock frequencies on the CPU core and, if
applicable, in on-chip cache, memory, and peripherals. The HPM can require new voltage
levels and fine-tune them by communicating to the external power supply and its
PowerWise-compliant power-management chips. The HPM is implemented as an on-chip
3,000-gate macrocell, but little else is known about its internals, which NSC is keeping
confidential. Nevertheless, enough background on this topic is publicly available to let
us fill in some of the likely principles of operation and the components that may be used
to implement them.
Can You Hear Me Now?
The problem of defining a minimal guard band for voltage can be stated in a few words:
ensure that voltage levels support frequencies across the core. But efficient solutions
can become very sophisticated.
The simplest solution is to deliver open-loop voltage levels that are high enough to
ensure across-the-chip operation at every frequency; essentially, the open-loop solution
used today. A timer can replace the HPM to enforce a delay from a new voltage-level
requirement to stable conditions that can support a higher frequency. This rather
primitive solution will reduce consumed energy compared with the absence of voltage
control, but it will be inferior to results obtained by feedback from the powered
circuits to the power supply.
An improved approach based on local sensing of voltage in one or more spots and analog
feedback to the power supply may sound attractive. The sensor and its analog output will
be subject to high-frequency interference from surrounding digital circuits. It may not
function at all, since low core voltages will require feedback accuracy in the millivolt
range.
A feedback that is easier to implement could measure local propagation delays to
determine when local voltage is
high enough to support a higher frequency. A ring oscillator suggests itself, its
frequency measured between two clocks of a known period and sent as digital feedback to
the power supply.
National Semiconductor seems to have opted for yet a different approach: while the core
is still operating under its previous stable frequency, the next-higher test frequency is
sent to the HPM, its results checked again and again. Voltage is increased in steps until
the HPM reports that the test frequency yielded a correct result by issuing a vdd_ok
signal to the APC.
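The stepping behaviour described above can be pictured as a simple loop. The sketch below
is only a guess at the control flow, since NSC keeps the HPM internals confidential: the
function names, the millivolt step size and the pass/fail callback standing in for the
vdd_ok signal are all assumptions made for illustration.

```python
def raise_voltage_until_ok(vdd_ok, start_mv, max_mv, step_mv):
    """Step the supply upward until the test frequency passes.

    vdd_ok is a hypothetical callback modelling the HPM's go/no-go signal for the
    next-higher test frequency at a given supply voltage (in millivolts).
    Returns the first passing voltage, or None if max_mv is reached without a pass.
    """
    for millivolts in range(start_mv, max_mv + 1, step_mv):
        if vdd_ok(millivolts):
            return millivolts
    return None


# Illustrative monitor: the test frequency is assumed to pass at 930 mV and above.
def example_monitor(millivolts):
    return millivolts >= 930


settled = raise_voltage_until_ok(example_monitor, start_mv=700, max_mv=1200, step_mv=25)
# With 25 mV steps from 700 mV, the first step at or above 930 mV is 950 mV.
```

Only after the loop reports a passing voltage would the clock be switched to the higher
frequency, so the core never runs at a frequency its supply cannot yet support.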
NSC's choice may yield more information than a ring oscillator does, since, hidden in the
HPM, it may have placed bistables, parasitics, and maybe even a hot spot, the better to
simulate conditions across a wider area of the chip. The test frequency yields a go/no-go
answer. It must be augmented by additional HPM logic that can deliver voltage adjustments
as the chip's temperature rises and IR drops change, owing to changing demands in supply
current. Feedback to the external power supply is based on a local spot on the chip only
and may require several sensors for the tightest voltage performance. Downward frequency
shifts are less problematic, since the lower frequencies are supported by higher voltages.
The combined function of IEM, HPM, and APC circumvents differences in process,
fabrication, and environmental condition; providers of synthesizable microprocessors will
use it to advantage. The closed-loop adaptive-voltage feature can also improve upon the
results obtained by fully integrated semiconductor houses.
Policies, Policies, Who Sets the Policies?
Having selected NSC's adaptive voltage scaling (AVS) architecture and ARM's IEM, one must
next be concerned with selecting and applying the appropriate frequency for each
workload. The architects at ARM have introduced a system-architecture stack that provides
hardware/software support for the IEM. The IEM comprises a set of counters, timers, and
other undisclosed logic that can be used to monitor the workload and the processor's
performance. The IEM also includes operating-system and application-level algorithms for
predicting future behavior.
Figure 2 diagrams ARM's concept of the IEM performance-policy stack, most of which is
implemented in software. Its purpose is to store multiple algorithms that can best
minimize energy consumption for given workload behaviors. The IEM software is intended to
examine several algorithms suggested by active workloads and system processes, and
generate the best SoC-wide control policy, based on their combined requirements. The
approach is general enough to support specialized coprocessors and multiple processors.
IEM policy descriptors contain one field that defines the way they should affect
decisions and one that indicates the level of performance that must be delivered.
Mnemonic SET is a unilateral request to deliver the associated performance; SET_IFGT
requires that the associated performance level be delivered only if it is the greatest
performance level required by the policies suggested by the active processes.
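One way to read the two mnemonics is as commands applied in order over a stack of policy
levels. The sketch below is a hypothetical interpretation, not ARM's implementation: the
bottom-to-top evaluation order, the treatment of IGNORE as a no-op and the function name
are assumptions; only the mnemonics and the example values (taken from the Figure 2
labels) come from the article.

```python
def resolve_performance(policy_stack):
    """Combine policy descriptors into one performance request.

    policy_stack is a list of (command, perf) pairs ordered from level 0 upward.
    Assumed semantics: SET overrides the running value, SET_IFGT raises it only
    when greater, IGNORE leaves it unchanged.
    """
    perf = 0
    for command, level in policy_stack:
        if command == "SET":
            perf = level
        elif command == "SET_IFGT":
            perf = max(perf, level)
        elif command != "IGNORE":
            raise ValueError(f"unknown policy command: {command}")
    return perf


# Values as labelled in Figure 2: level 0 SET 25, level 1 IGNORE 0, level 2 SET_IFGT 80.
figure2_stack = [("SET", 25), ("IGNORE", 0), ("SET_IFGT", 80)]
request = resolve_performance(figure2_stack)   # 80 under the assumed semantics
```

Under this reading a higher level can raise the request without being able to lower one
that a unilateral SET has already established.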
Policies can be recorded by tracing the execution of workloads and thus appear to be
automatic; they can also be demanded by programmers of applications and by system
processes. The final decision-maker must be the operating system. ARM has very wisely
refrained from introducing as architecture extensions any of the counters/timers it uses
for performance monitoring. At this time (January 2003), for general-purpose computing,
the policy features are good beta- (if not alpha-) level starting points, since the
industry still has much to learn about power-management policy-setting algorithms. Right
now, however, cellular telephones with fewer applications and with known system processes
may be a good fit.
Power Domains Reduce Leakage Current
The best way to reduce leakage current is to turn off the power supply, and that's
exactly the idea behind an architecture that uses power domains. SoC blocks that are not
being used can be turned off under operating-system control. The implementation of power
domains is similar to hot-swapping boards and requires special on-chip interfaces.
Figure 3 shows an example of a typical SoC using power domains to control an ARM926EJ
core's leakage current. The design defines the ARM core and its cache RAMs as a power
domain that can also be a voltage domain. A tightly coupled memory with state retention
(TCMS) must be used to restore processor state upon power-up, following a period during
which the processor's power was turned off.
TCMS must belong to a different power domain; it can be maintained at a lower level of
voltage, enough to keep the data intact while the core is turned off in its suspend mode.
The TCMS, however, belongs to the same voltage domain as the ARM926EJ to enable, when
required, correct operation with the core at frequency and voltage. A logic-level clamp
between the core and TCMS ensures correct operation during power up and power down.
Logic-level clamps are also used to avoid driving large currents into the core as it goes
down and to minimize the probability of latch-up. One should note that a logic-level
clamp is really an AND gate forced into a given state during power transitions; it is not
a voltage clamp in the context of linear circuits.
The process of turning off power to the CPU involves saving machine state and placing the
CPU in reset mode.
Policy (performance control) stack (Command, Perf):
  Level 2: SET_IFGT, 80
  Level 1: IGNORE, 0
  Level 0: SET, 25
Policy event handlers (common events): on reset; on task switch; on perf change
Figure 2. ARM's Intelligent Energy Management (IEM) conceptual block diagram shows
prioritizing policy-stack and policy-event handlers.
[Block diagram. Legible labels: battery voltage supply feeding PowerWise regulators;
dynamic-voltage-scaled RAM with state retention (TCMS, 0.7-1.2 V); dynamic-voltage-scaled
CPU with power-down (ARM926EJ, cache RAMs and hardware performance monitor, 0.7-1.2 V
plus off); clamps and level shifters; retiming interface; AMBA AHB/APB subsystem;
Intelligent Energy Manager; NSC Adaptive Power Controller with PowerWise Interface (PWI);
clock management unit with PLL(s); SoC supply (1.2 V); I/O supply (3.3 V).]
Figure 3. Block diagram of typical SoC uses power domains to remove power from the
ARM926EJ core and to restore power to it. To save energy, TCMS memory is maintained with
a lower voltage than it needs when it is interfaced to the core.
Clocks that could enable the CPU to read and use incorrect logic levels from TCMS memory
are turned off. CPU power-up follows the inverse procedure. With clocks and interfaces
enabled, coming out of reset state, the CPU can use a vector that points it to the
correct TCMS address, from which it can start restoring its state.
The SoC uses four voltage domains: CPU core and cache RAM; TCMS; on-chip bus and
peripherals; and I/O to external logic. Closed-loop adaptive voltage is provided only for
the CPU/TCMS domains. This simple approach is justified because most peripherals operate
at lower frequencies and voltages, and some peripherals count on frequency stability. I/O
voltage levels must be kept within specifications to conform to external voltage
standards.
The CPU and TCMS domains are connected to system signals and clocks via level-shift
clamps to compensate for voltage changes in different domains. These are different from
the logic-level clamps that must be used at boundaries of power domains. In addition to
level-shift clamps, connections to the AMBA bus must use retiming interfaces to deal with
changing frequencies.
The IEM implements the programmer's model and performs dynamic performance monitoring to
assist the policy-stack software. The performance monitor hardware counts cycles received
by the CPU to estimate the amount of real work that has been done during the elapsed
time. The IEM block also outputs the required performance setting for the target rate of
workload execution.
Putting It All in Perspective
The architects at ARM and NSC claim that, on the basis of existing silicon, they expect
to see energy savings of 30% for peak workloads and 60% for midrange workloads over
energy use in fixed-voltage schemes. Energy savings from reduced guard bands will depend
on the particular design but will deliver further gains of 10-15%.
Figure 4 shows ARM's estimate of power distribution for the ARM920T processor, in which
instruction and data caches consume 44% of total power. The remaining 56% is split among
the integer core, memory management units, bus interface unit, and other essential CPU
circuitry. The relationships among CPU, peripherals, and caches may change in the future,
to the detriment of the CPU. Higher operating frequencies will exact larger cache RAM and
consume more energy. Commercially viable ASSPs already have tens of peripherals on chip.
[Pie chart. Legible segments: CP15 2%; BIU 8%; PATag RAM 1%; D-Cache 19%; Clocks; Other;
IMMU; DMMU.]
Figure 4. ARM920T power distribution shows dominant power consumption attributed to
cache RAM and ALU.
A cellular phone may be able to use the technology to 
affect 40% of its active devices' power. Assuming a 50% aver-
age energy reduction using NSC and ARM's short-term prod-
uct, the overall savings are a significant 20%, considering that 
turning the devices off completely would improve consump-
tion by only 40%. 
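The arithmetic behind the 20% figure is a simple product of the two percentages quoted
above. The short sketch below just restates it; the function name is illustrative and
the inputs are the article's own numbers.

```python
import math


def overall_saving(affected_share, reduction_within_share):
    """Fraction of total power saved when only part of the system is affected."""
    return affected_share * reduction_within_share


# 40% of the phone's active-device power is affected, and that part is halved.
saving = overall_saving(0.40, 0.50)
assert math.isclose(saving, 0.20)

# Upper bound for comparison: switching the affected devices off entirely.
ceiling = overall_saving(0.40, 1.00)
assert math.isclose(ceiling, 0.40)
```

Seen this way, the short-term product is estimated to capture half of the saving that
would be possible by removing the affected devices altogether.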
Price & Availability
The collaboration between National and ARM includes a licensing agreement to enable easy
deployment of APC. Under the terms of the agreement, ARM will license National's APC
along with its Intelligent Energy Manager to key customers, beginning in 2003. National
will also market and license its APC. License terms have not been disclosed. For more
information please visit www.nsc.com and www.arm.com.
Closed-loop adaptive voltage is only one component in the project ARM and NSC have
undertaken. Clock gating, new cell libraries to minimize leakage current, and power
domains to turn it off will further reduce consumed energy. Microarchitecture control
based on workload behavior will play a major role: for example, partitioning cache RAM
into power domains can trim active cache size to match active applications, drastically
cutting consumption.
Power-efficient technology involves process, physical design, logic, microarchitecture,
operating-system and applications software, and, now, analog expertise. ARM and NSC have
embarked on an important project. An equally gifted software company should join them.
To subscribe to Microprocessor Report, phone 480.609.4551 or visit www.MDRonline.com
Intelligent Energy Management (IEM) 
• Conserving power whilst running = saving energy 
• Running only fast enough to do the work just in time 
• Adapting to changing software workloads 
[Chart: power over time, on a 0-100% scale, for conventional on/off power control and for
dynamic voltage scaling, with the resulting energy use for each.]
[Block diagram. Legible labels: Configuration Interface; DPM (Dynamic Performance
Monitor); DPC (Dynamic Performance Controller).]
[Figure: power-domain block diagram. Labels: Tightly Coupled Memories (TCMs); CLAMP; L-SHIFT; CPUCLK; RESETN; ARM926EJ CPU; Cache RAMs; Dynamic Voltage RAM with state retention; Dynamic Voltage CPU with power-down]
IEM926: An Energy Efficient SoC with Dynamic Voltage Scaling
Krisztián Flautner, David Flynn, David Roberts, Dipesh I. Patel
{krisztian.flautner, david.flynn, david.roberts, dipesh.patel}@arm.com
ARM Limited, 110 Fulbourn Road, Cambridge, UK CB1 9NJ
Abstract
One of today's most successful embedded devices, the mobile phone,
embodies a set of challenging design requirements: long battery life,
small size, high performance and low cost. The single parameter that
complicates the simultaneous fulfilment of all of these design goals is
energy efficiency of the system, since batteries only hold a finite
amount of charge. To operate within the allotted energy budget, systems
must be optimized for energy consumption during design and also at
run-time. Increasingly it is not sufficient to statically optimize for
worst-case conditions but designers must enable systems to adapt to
conditions at run-time. The Intelligent Energy Manager™ (IEM) technology
provides an integrated solution for addressing energy management of SoC
devices. In this paper we present data about the energy consumption
characteristics of a multiple power-domain based SoC which includes PDA
functionality built around an ARM926EJ-S core.
1. Introduction
Power consumption is arguably the most important feature of embedded
processors, with significant impact on the cost and physical size of the
end device. Historically, low power consumption in embedded processors
has been achieved through simplicity, limited use of speculation, and
through the use of low-power sleep modes that reduce idle-mode power
consumption. Embedded processors are now performing more sophisticated
tasks, which require ever-higher performance levels. As a result, new
processor designs are more dependent on sophisticated architectural
techniques (such as prediction and speculation) to achieve high
performance. Unfortunately, such techniques can also significantly
increase the processor's power consumption.
One way to bridge the gap between high performance and low power is to
allow the processor to run at different performance levels depending on
the workload's requirements. An MP3 audio player, for example, requires
about an order of magnitude less performance than an MPEG video player.
The difference in performance requirements can be exploited to save
energy with the use of dynamic voltage scaling (DVS). DVS exploits the
fact that the peak frequency of a processor implemented in CMOS is
proportional to the supply voltage, while the amount of dynamic energy
required for a given workload is proportional to the square of the
processor's supply voltage. Reducing the supply voltage while slowing the
processor's clock frequency yields a quadratic reduction in energy
consumption, at the cost of increased run time [5].
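The scaling argument above can be checked numerically. The following is a minimal sketch, assuming the idealised model stated in the text (peak frequency proportional to supply voltage, dynamic energy per workload proportional to voltage squared) and ignoring leakage and any minimum-voltage floor. The function name and the normalised units are illustrative and do not come from the paper.

```python
def dvs_tradeoff(voltage_scale):
    """Idealised DVS model; voltage_scale is V / V_max, in (0, 1]."""
    frequency_scale = voltage_scale          # assumption: f is proportional to V
    run_time_scale = 1.0 / frequency_scale   # same cycle count at a lower clock
    energy_scale = voltage_scale ** 2        # assumption: E is proportional to V^2
    return run_time_scale, energy_scale


for v in (1.0, 0.75, 0.5):
    t, e = dvs_tradeoff(v)
    print(f"V = {v:.2f} V_max: run time x{t:.2f}, dynamic energy x{e:.4f}")
```

Under these assumptions, halving the voltage doubles the run time and cuts dynamic energy to a quarter. The measurements later in the paper show where real silicon departs from this model.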
The IEM technology includes software components that can accurately
predict the minimum necessary performance level of the processor for the
running workload, thus a reduction of performance does not necessarily
imply any degradation of quality [3]. In this paper we show the potential
energy savings that can be achieved on a real SoC using dynamic voltage
scaling.
2. The IEM926 test chip
The IEM926 test chip was explicitly designed to support DVS and fast
clock-switching and includes on-chip peripherals that are similar to the
ones found on PDA devices. The test chip includes the following
components, graphically illustrated in Figure 1:
• ARM926EJ-S processor with caches (16K I and D)
• 16K I and D Tightly Coupled Memories (TCMs)
• 240, 180, 120, 60, 0 MHz processor performance levels
• A DMA subsystem
• The Intelligent Energy Controller prototype
• SDRAM and Flash memory controllers and basic peripherals (including
on-board audio) to support a minimal Linux environment
• Interface to National Semiconductor's PowerWise™ controller to support
open- and closed-loop DVS [6]
The SoC is partitioned into three power domains. The system bus and
peripheral bus subsystems are in a single power domain supplied with a
fixed 1.2V. The CPU domain is the only domain whose frequency and voltage
can be varied dynamically, and it includes a separate power domain for the
TCMs, which can be placed in a low-power state retention mode while the
main processor is powered off. The design includes clamps between the
TCMs and the core to support this mode of operation; however, when
running, both the TCM and core run at the same voltage and frequency.
FIGURE 1. Components of the IEM926 SoC
Proceedings of the Design, Automation and Test in Europe Conference and Exhibition Designers' Forum (DATE'04)
1530-1591/04 $20.00 © 2004 IEEE
FIGURE 2. IEM926 die photo
The test chip was manufactured using the TSMC 0.13G process. A picture of
the 5x5mm die is shown in Figure 2, showing the processor (middle box),
two PLLs (top left and right corners) and instruction and data TCMs
(middle right box). The system includes two PLLs: one controls the
frequency of the processor and another provides a fixed frequency for the
peripherals.
2.1 Clocking strategy
The two main challenges of the SoC design were to support fast switching
between the available frequency levels and to support dynamic frequency
changes on a core with only synchronous bus interfaces. The first issue
was solved by the use of frequency division of one of the PLLs running at
480MHz to four frequency levels: 240, 180, 120, and 60 MHz. On the bench
the chips successfully operate at 300, 225, 150, and 75 MHz by running the
PLL at 600MHz. To simplify the system design, the system bus and
peripherals run at a fixed 25% of the peak frequency of the processor.
Generating a frequency at 75% of peak is a challenge with a single PLL,
further complicated by the need for synchronous interfaces to the buses.
The solution employed in this chip relies on a skewed clock that has an
uneven duty cycle (3/8 comprised of 1:1.5, 1:1.5, 1:2 core to PLL clock
ratios), ensuring that bus and core clocks are aligned on the rising edge
of each bus clock transition. In the following figures, the datapoints
corresponding to a wide variety of frequencies were generated by under-
and over-driving the 480MHz PLL by -10% to +25% in 5% increments and then
dividing by the four ratios (1, 3/4, 1/2, 1/4).
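The set of measurement frequencies described above can be enumerated with a short script. This is a sketch only: it assumes, from the 480MHz/240MHz and 600MHz/300MHz pairs quoted in the text, that the peak core clock is half the PLL clock, and the function name is illustrative.

```python
def frequency_sweep(nominal_pll_mhz=480.0):
    """Return (pll_mhz, [core frequencies in MHz]) for each PLL setting."""
    ratios = (1.0, 0.75, 0.5, 0.25)      # core frequency as a fraction of peak
    points = []
    for percent in range(-10, 30, 5):    # -10% to +25% in 5% steps
        pll = nominal_pll_mhz * (100 + percent) / 100
        peak = pll / 2                   # assumption: peak core clock is PLL / 2
        points.append((pll, [peak * r for r in ratios]))
    return points


for pll, freqs in frequency_sweep():
    print(f"PLL {pll:6.1f} MHz -> core {freqs}")
```

The sweep yields eight PLL settings and 32 core frequencies, running from 54 MHz (25% of a 216 MHz peak) to 300 MHz.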
2.2 Voltage levels
Figure 3 shows the minimum voltage levels sufficient for sustaining a wide
range of frequencies on the core at room temperature. The peak frequency
of the core is set between 216 and 300 MHz and scaled to 75%, 50%, and
25%. Theoretical models suggest a linear relationship between voltage and
frequency. Our measurements broadly confirm these expectations, with two
important differences: voltages for frequencies corresponding to 75% of
peak lie above the linear predictions, and voltages for minimum (25%)
frequencies do not substantially decrease below the levels at 50% and in
fact show an increase for low frequencies.
FIGURE 3. Core voltage vs. frequency
The former irregularity is explained by the clocking technique employed on
the SoC: at the 75% peak frequency point the core is actually operating
slightly (a little over 6%) faster than 75% due to the interface with
synchronous buses. The higher actual frequency in turn necessitates a
higher operating voltage, which explains the divergence. The irregularity
at low frequencies is as yet unexplained but is likely to be caused by the
level-shifters employed in the system. We have also observed that the
voltage characteristics when caches are turned off are substantially the
same as in the graph above, thus in this case, the sense-amplifiers are
not the cause of the lower limit on voltage scaling.
3. Power and energy
Figure 4 shows that there is a linear relationship between the core's
frequency and the amount of work done per unit time in a processor-bound
workload. As expected, running at 25% of peak frequency causes this
workload to run four times longer. In general, bus-bound applications
exhibit a flatter slope, meaning that due to the uneven scaling of bus
frequencies, with reduced frequency the workloads' run-time increases at a
lower rate than that of processor-bound applications.
FIGURE 4. Performance vs. frequency
FIGURE 5. Power and energy consumption at different frequency and voltage levels (horizontal axis: core frequency in MHz)
Figure 5 shows the ARM926EJ-S core's power consumption and energy use
(including on-chip cache and RAM structures) when running Dhrystone on a
wide range of frequency and voltage levels. Energy consumption is
normalized to the amount used at the statically characterized maximum
operating point (240MHz at 1.2V). Our results show that a factor of 10
(90%) power and more than a factor of two (65%) energy saving is
achievable by running the cores at their minimum levels (25% of peak
frequency). However, there is very little payback on running the core
below the half-frequency point, since voltage cannot be significantly
reduced and consequently the energy consumption remains about the same. On
the other hand, if heat management (thus average power consumption) is an
issue, then more power savings can be achieved by further lowering the
frequency; this behaviour is shown at the bottom of the power curve.
Our measurements match the theory: the power consumed during a workload is
proportional to the frequency times the square of the voltage at which it
is run. Since energy is the integral of power consumption, the longer
execution time due to lowered operating frequency cancels out the
frequency term and thus energy consumption is proportional to the square
of the operating voltage.
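The cancellation described above can be written out as a short derivation. This is a sketch under assumptions the paper does not state explicitly: a workload of fixed cycle count $N$, constant $f$ and $V$ during the run, an effective switched capacitance $C$, and leakage neglected.

```latex
\begin{align*}
P &\propto C V^{2} f, \qquad t = \frac{N}{f},\\
E &= \int_{0}^{t} P \,\mathrm{d}\tau = P\,t \;\propto\; C V^{2} f \cdot \frac{N}{f} = C N V^{2}.
\end{align*}
```

Frequency therefore affects energy only indirectly, through the minimum voltage it permits.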
The amount of energy saved for a given workload depends on the peak
frequency and voltage levels of the core. Table 1 illustrates our results
when running a workload at the minimum 25% (second column) and 50% (third
column) for three maximum frequencies. Results in the second column show
that there is more energy reduction if the peak frequency and voltage
levels are higher. One reason for this is that our hardware does not
function below 0.7V and this voltage level can already be achieved at 50%
of maximum frequency for the 240MHz and 216MHz configurations. This
implies that the 25% frequency levels of these configurations do not
significantly reduce energy consumption any further.
TABLE 1. Energy savings at different (f, V) points
Max speed   Workload speed   Energy        Workload speed   Energy
(MHz)       at 25% (MHz)     reduction     at 50% (MHz)     reduction
300         75               [illegible]   150              54%
240         60               56%           120              48%
216         54               45%           108              45%
However, even on cores with lower minimum voltage levels, the primary
benefit of voltage scaling is towards the top end of the frequency range.
This is a consequence of the scaling equations and the quadratic
relationship between energy consumption and operating voltage [4]. The
third column of Table 1 shows the energy savings for workloads running at
half of maximum frequency. While the difference between the reported
energy savings in each row is less than in the first case, the trend is
clear: higher maximum frequency and voltage enables more energy savings
when slower operating levels are used.
3.1 Operating margins
Data in the previous sections were collected for minimum voltages at room
temperature. However, there is no guarantee that the same voltage levels
would be sufficient to run the processor at the specified frequencies
under different conditions (or that different chips would behave the same
way). To deal with uncertainty and variations due to the ambient
environment, silicon, IR-drop, etc., designers include operating margins
in the voltages that are specified for each frequency level.
The first graph in Figure 6 shows the energy impact of the operating
margins on the IEM926 processor running at the four different frequency
levels that are achievable in a single configuration. For each frequency,
the energy consumption of the Dhrystone workload is plotted using five
different voltage levels: the limit voltage (below which the system fails
to operate) and 5%, 10%, 15%, and 20% above the minimum. The knee in the
line at 120MHz shows the limited energy savings at the 60MHz level due to
the hard limit on minimum voltage that is near the same level as at
120MHz. Typical tolerance levels on supply voltages are between 10%-15%,
which translates into 20%-25% energy overhead when the processor is not
running close to worst-case conditions. The energy consumption of the
different configurations is normalized to the amount consumed at the
statically characterized level (240MHz at 1.2V), which corresponds closely
to the line with 15% voltage overhead.
FIGURE 6. The energy impact of operating margins (two graphs; horizontal axis: frequency in MHz)
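As a rough cross-check of the margin figures quoted above, the sketch below assumes that energy for a fixed workload scales with the square of the supply voltage (the relationship stated in Section 3) and computes what fraction of the energy spent at a margined voltage is attributable to the margin. The function name is illustrative, and the measured values in Figure 6 need not match this idealised model exactly.

```python
def margin_energy_overhead(margin):
    """Fraction of the energy at V_limit * (1 + margin) that the margin accounts for."""
    relative_energy = (1.0 + margin) ** 2    # assumption: E is proportional to V^2
    return 1.0 - 1.0 / relative_energy


for margin in (0.05, 0.10, 0.15, 0.20):
    print(f"{margin:.0%} margin: {margin_energy_overhead(margin):.1%} of energy")
```

Under this model a 10% to 15% voltage margin accounts for roughly 17% to 24% of the energy consumed, which is of the same order as the 20%-25% overhead reported in the text.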
The second graph in Figure 6 shows the energy consumption of the workload
without voltage scaling. In these experiments, the operating voltage was
kept at the statically characterized level and at levels 5% and 10% above
and below for all frequency points. The results show that without voltage
scaling the energy consumption for a workload is not reduced and in fact
may increase at lower frequencies. We believe that this behaviour is due
to on- and off-chip bus interactions and extra overhead incurred during
some memory transactions. While the +5% and +10% voltage levels may be
beyond the amounts that are incorporated into the processor's operating
margins, such overshoots may be a function of the power regulator.
Accurate power delivery is an important component of an energy efficient
system, as even a small increase over the necessary voltage level incurs
significant energy overhead.
4. Conclusion and future work
Our results show that voltage scaling enables significant reduction of the
energy consumption of the core implemented in a 130nm process. Our ongoing
work quantifies the system-wide impact on energy consumption under real
workloads, operating systems, and performance-setting policies. Our
initial results indicate that when running at the peak level, the
processor accounts for 75% of the energy used on the IEM926 SoC. Our data
confirms that while designing with worst-case parameters may be necessary,
actually running a chip with worst-case voltage levels wastes energy, in
our case up to 25%. Our ongoing research explores on-chip structures [1]
and microarchitectural techniques [2] for reducing operating margins.
5. Acknowledgements
The IEM926 design was done as a joint project between ARM, Synopsys, and
TSMC. We thank Anwar Awad and Han Pin-Hung Chen of Synopsys for their
implementation work.
References
[1] S. Dhar, D. Maksimovic, and B. Kranzen. Closed-Loop Adaptive Voltage
Scaling Controller For Standard-Cell ASICs. Proceedings 2002 Int'l
Symposium on Low Power Electronics and Design (ISLPED-2002), August 2002.
[2] D. Ernst, N. S. Kim, S. Das, S. Pant, R. Rao, T. Pham, C. Ziesler,
D. Blaauw, T. Austin, T. Mudge, and K. Flautner. Razor: A Low-Power
Pipeline Based on Circuit-Level Timing Speculation. Proceedings of the
36th Symposium on Microarchitecture (MICRO-36), San Diego, CA, 2003.
[3] K. Flautner and T. Mudge. Vertigo: Automatic Performance-Setting for
Linux. Proceedings of the 5th Symposium on Operating Systems Design and
Implementation (OSDI 2002), Boston, MA, 2002.
[4] N. S. Kim, T. Austin, D. Blaauw, T. Mudge, K. Flautner, J. S. Hu,
M. J. Irwin, M. Kandemir, and V. Narayanan. Leakage Current: Moore's Law
Meets Static Power. IEEE Computer, December 2003.
[5] T. Mudge. Power: A First Class Architectural Design Constraint. IEEE
Computer, vol. 34, no. 4, April 2001.
[6] http://www.national.com/appinfo/power/powerwise.html
Summary: IEM/DVFS challenge
San Jose 2005
• IP design and RTL
• Multiple voltage domains->multiple clock domains
• Partitioning and interfaces require special care
• What clock frequency/voltages are energy efficient?
• Library enhancements
• Level Shifters, Isolation cells
• Multi VDD Characterization (Std Cells and Memories)
• Multi VT Cells
• Tools
• Multi-Voltage aware Galaxy Platform
• Multi-VT Optimization
• Jointly developed IEM Reference Methodology
IEM Leakage mitigation next...
Customers want ARM to support many of the following:
1. Mixed Vt libraries (2, 3 even 4 VTs!)
   • EDA tools support now
2. Multi-rail switched domains
   • External power down of sub-systems
3. On-chip Power Gating (MTCMOS)
   • Local fine/coarse grain power header/footer switches
4. Retention Registers (a.k.a. "Balloon Flops")
5. Reverse Bias memory leakage support
6. Dynamic Threshold Scaling (VTCMOS)
   • Support for both forward and reverse bias operation
[Figure label: IEM Architected state save/restore + PSU handshakes]
[Figure label: IEM Architected Sleep/Wake + PSU handshakes]
Voltage/Power level analysis (MHz/V)
• Given PVT derive accurate cycle time
• Approx 1 day per analysis point
[Figure: Frequency Analysis]
[Figure: logarithmic current axis, 1.0uA to 100000.0uA; horizontal axis starting at -40]
Voltage scaling limit analysis (F/V/°C)
• To understand typical silicon limit of DVS
• ~3 days per analysis point
• Understand the temperature inversion point
Typical Power analysis (uW/MHz/V)
• To understand typical silicon DVS power
• ~3 days per analysis point
• Key to understanding energy efficient frequencies
[Figure: Average power (uW/MHz), 0.0 to 600.0, against vdd, 0.40V to 1.20V]
SOC and IP Design challenge
Advancements in Energy-Efficient Design
System Level Dynamic Power/Energy Management
David.Flynn@arm.com
Overview
• Systems level challenge
• SW/SOC/PSU design, EDA/Library implications
• Dynamic Power/Energy Management
• Dynamic Voltage and Frequency Scaling
• Real-world design issues
• Static/Leakage Power Management
• Multiple power management states
• Work in progress...
© 2005 ARM. Presenter: David Flynn. DAC 2005 Tutorial #4: Advancements in Energy-Efficient Design
CMOS Power and Energy in a Nutshell
• Power and Energy consumption trends of a workload running at different
frequency and voltage levels.
• DFS: frequency scaling only
• DVFS: frequency & voltage scaling
[Figure: power and energy against frequency; annotations: "Useful for DVS", "Avg. power - heat", "Need DVS to save energy"]
$E = \int P \, dt$
$f \propto (V_{dd} - V_t)^{\alpha} / V_{dd}$, with $\alpha \approx 1.3$
$P = C V_{dd}^{2} f + V_{dd} I_{leak}$
$V_t / V_{max} \le 0.3$
Must reduce voltage to save energy and extend battery life!
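The difference between DFS and DVFS can be illustrated with the slide's power equation, P = C·Vdd²·f + Vdd·Ileak. The sketch below is illustrative only: the capacitance, leakage current and operating points are hypothetical values, and leakage current is treated as independent of voltage, which is a simplification.

```python
def workload_energy(cycles, freq_hz, vdd, c_eff, i_leak):
    """Energy in joules for a fixed cycle count, using P = C*Vdd^2*f + Vdd*Ileak."""
    run_time = cycles / freq_hz
    power = c_eff * vdd ** 2 * freq_hz + vdd * i_leak
    return power * run_time


# Hypothetical parameters and operating points, chosen only for illustration.
CYCLES, C_EFF, I_LEAK = 240e6, 1e-9, 5e-3
full = workload_energy(CYCLES, 240e6, 1.2, C_EFF, I_LEAK)
dfs = workload_energy(CYCLES, 120e6, 1.2, C_EFF, I_LEAK)   # frequency scaled only
dvfs = workload_energy(CYCLES, 120e6, 0.8, C_EFF, I_LEAK)  # frequency and voltage scaled
print(f"full {full:.4f} J, DFS {dfs:.4f} J, DVFS {dvfs:.4f} J")
```

With these numbers, DFS leaves the dynamic energy unchanged and slightly increases the total because the circuit leaks for twice as long, while DVFS reduces the total substantially.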
Performance scaling for energy efficiency
[Figure: Utilization, Work, Power and Energy traces, each on a 0% to 100% scale, plotted against Time]
• Reduced processing rate enables more efficient operation
• Need software intelligence to meet (multiple) task deadlines
Design for DVFS
• Need to determine set of energy efficient performance points / clock
frequencies
• Not easy: EDA tools will give you 1/frequency (eventually) from set of
Process/Voltage/Temperature conditions
• Voltage headroom on sub 1.2V process technologies needs great care -
especially RAM
• Energy efficiency is product of power x time to complete workload to a
deadline
• Leakage impact must be factored
Design for DVFS - stage 1A
• Determine FMAX at worst case conditions
• Worst case process/temperature
• VNOM - 10% voltage
• Standard Cell-based set-up timing sign-off
• (Confirm with Transistor-level simulation)
• Analyse clock latency for synchronous design interfacing
• Repeat analysis at wider operating voltages
• Transistor simulation of caches for accuracy
Design for DVFS - "Slow" analysis
[Figure: UMC HS130 DVFS Analysis, slow/slow, max rated temperature. Delay (ns), 0.000 to 10.000, against Voltage (V), 0.50 to 1.50; series: Period[SS] (1/Fmax) and Latency[SS]]
Design for DVFS - stage 1B
• Determine typical silicon characteristics
• Typical process / room temperature
• VNOM initial voltage
• Standard Cell-based power sign-off corner
• (Confirm with Transistor-level simulation)
• Also analyse clock latency for synchronous design interfacing
• Repeat analysis at wider operating voltages
• Transistor simulation for accuracy...
Design for DVFS - "Typ" analysis
[Figure: UMC HS130 DVFS Analysis, typical/typical, room temperature. Delay (ns), 0.000 to 10.000, against Voltage (V), 0.500 to 1.500; series: Latency[TT] and a polynomial fit to Period[TT] (1/Fmax)]
Design for DVFS - stage 1C
• Determine fast corner characteristics
• Fast silicon / lowest rated temperature
• VNOM +10% initial voltage
• Standard Cell-based hold-time sign-off corner
• (Confirm with Transistor-level simulation)
• Also analyse clock latency for synchronous design interfacing
• Repeat analysis at operating voltage extremes
• Transistor simulation for accuracy...
Design for DVFS - "Fast" analysis
[Figure: UMC HS130 DVFS Analysis, fast/fast, min rated temperature. Delay (ns), 0.000 to 10.000, against Voltage (V), 0.500 to 1.500; series: Period[FF] (1/Fmax), Latency[FF] and a polynomial fit to Period[FF] (1/Fmax)]
DVFS Frequency range analysis
[Figure: Frequency Analysis. Fmax[FF], Fmax[TT] and Fmax[SS] against Voltage]
DVS Clock Tree Latency Analysis
[Figure: Clock Tree Latency Analysis. Voltage (V), 0.5 to 1.5, against Latency (ns); series: Latency[FF], Latency[TT], Latency[SS]]
Power Analysis - stage 2
• Typical silicon power analysis is most important to understand for
product design
• Dynamic power analysis requires representative workload code/vectors
• Application dependent in final product
• Static power analysis requires representative halt-state vectors
• Data dependent analysis
• Prerequisite to determining energy efficiency
• Run at lower power for longer to complete workload...
Dynamic Power Analysis
• Boot cached system and cache test code
• Infamous Dhrystone benchmark, etc...
• Need to ensure cached behaviour analysed
• Normalise the results as power/MHz
• To facilitate
• Subtract out the leakage component
• To factor back in for final power
• Slow and tedious simulation job...
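The normalisation steps listed above (express dynamic power per MHz, subtract leakage, then factor leakage back in) can be sketched as follows. The function names and the numbers are hypothetical, and the sketch assumes the estimate is made at the same supply voltage as the simulation.

```python
def dynamic_uw_per_mhz(total_uw, leakage_uw, freq_mhz):
    """Dynamic power per MHz after subtracting the leakage component."""
    return (total_uw - leakage_uw) / freq_mhz


def total_power_uw(uw_per_mhz, freq_mhz, leakage_uw):
    """Factor leakage back in to estimate total power at a chosen frequency."""
    return uw_per_mhz * freq_mhz + leakage_uw


# Hypothetical simulation result: 50 mW total at 200 MHz, of which 2 mW is leakage.
per_mhz = dynamic_uw_per_mhz(50_000.0, 2_000.0, 200.0)
print(per_mhz, total_power_uw(per_mhz, 100.0, 2_000.0))
```

Separating the two components matters because only the dynamic term scales with clock frequency; the leakage term is paid for as long as the domain is powered.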
Typical Power analysis (uW/MHz/V)
• To understand typical silicon DVS power
• ~3 days per analysis point (NanoSim)
• Key to understanding energy efficient frequencies
• Leakage component subtracted
[Figure: Average power per MHz (uW/MHz), 0.0 to 600.0, against vdd, 0.40V to 1.40V]
Detailed leakage analysis (uA/°C)
• Given PVT derive accurate leakage prediction
• 3-4 hours per point to simulate (NanoSim)
[Figure: leakage against temp; series: TT, 1.20V; FF, 1.32V; SS, 1.08V]
Safe Voltage Scaling Range? (stage 3)
• Memory typically the critical limitation
• Write operation margins
• Sense-amp read safety
• Soft Error Rate highly voltage sensitive
• Off-chip supply and load regulation
• Beware "temperature inversion" point where slow and fast timing cross
over
• Ensure minimum operating voltage above this
Analysing delay versus temperature
[Figure: Typ/Typ Silicon scaling with Temperature. Vertical axis 0 to 700; horizontal axis Voltage (V), 0 to 1.4; series: -40C, +25C, +125C]
Operating Frequencies - Stage 4
• Key characteristics now understood (at last!):
• FMAX determined by worst case silicon
• Understand DVFS frequency/voltage
• Understand DVFS power scaling
• Clock generation choices:
• Fast-switching PLLs to switch frequencies
• Master PLL with digital divider alternative
• Must handle clock/voltage switching carefully
• Clocking while changing voltage desirable
• Serious real-time impact otherwise
Worked example
• In this case FMAX worst case 288MHz/"100%"
• Below 144MHz/50% no safe voltage headroom
• Instant-switching clock generator chosen
• Single PLL at multiple of SDRAM/bus speed
• 4 levels of frequency scaling chosen (plus 0%)
Performance Level   Frequency (MHz)
100%                288
83%                 240
67%                 192
50%                 144
0%                  0
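The percentage column in the table above follows directly from the worst-case FMAX. A minimal sketch, with illustrative names, that reproduces it:

```python
FMAX_MHZ = 288
LEVELS_MHZ = (288, 240, 192, 144, 0)


def performance_percent(freq_mhz, fmax_mhz=FMAX_MHZ):
    """Performance level as a rounded percentage of the worst-case FMAX."""
    return round(100 * freq_mhz / fmax_mhz)


table = [(performance_percent(f), f) for f in LEVELS_MHZ]
print(table)
```

The non-zero levels are spaced 48 MHz apart, which is consistent with deriving them from a single PLL running at a multiple of the bus speed.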
Determining Energy Efficiency
• DFS simply scales workload over time
• Lower temperature is a benefit
• But beware leaking for longer period...
• DVFS analysis takes product of time to complete workload and power
consumed
• Benefits from dominant VDD² term scaling
• For "open-loop" control need worst-case Si
• For "closed-loop" control can compensate for actual process and
temperature and benefit from typical Si.
Energy Saving Analysis - Stage 5
[Figure: Voltage headroom/Energy savings (compared to WC 1.08V 125C). Bar chart, 0.0% to 120.0%; the legend and category labels are illegible in this copy]
DVFS Design Requirements 
• DVFS requires detailed analysis work 
• Slow/Fast analysis required for timing as usual
• Extra Typical analysis required to understand
battery life implications and sensible operating
performance points
• Energy savings - for the DVFS sub-system 
can be significant and important for designers 
who stay on lower-cost 130nm+ technologies 
• Compete on Max MHz and Average battery life 
Standard-cell design 
plus Transistor-level analysis? 
• Usual design challenge (for CPU) is to attain FMAX for
worst case process/temp/VDDnom-10%
• Applications processors compete on MHz etc. 
• Re-characterizing library and memories at extended 
PVT operating points is tough 
• AND pessimism tends to build up 
• So take the Std-Cell/RAM synthesized design and 
resort to HSPICE transistor level analysis
• PathMill, TimeMill, NanoSim ...
• Very long machine run-times, full cache memories
require workstations with large physical memory
Joint IEM technology project 
ARM/Synopsys/National Semiconductor/UMC "ULTRA926" 
• IEM technology 
demonstrator 
• ARM926-based SOC 
• 144-288MHz DVFS 
• DVS and AVS support 
• Detailed power and 
energy simulation 
• Correlate with Silicon
plan late Q2 
DVFS Implementation challenges
• IP design and RTL 
• Multiple voltage domains->multiple clock domains 
• Partitioning and interfaces require special care 
• What clock frequency/voltages are energy efficient? 
• Library enhancements 
• Level Shifters, Isolation cells 
• Multi VDD Characterization (Std Cells and Memories) 
• Multi VT Cells 
• Tools 
• Multi-Voltage aware synthesis/verification/STA/test
• Multi-VT Optimization requires extra care 
• IP requires implementation methodology 
ARM1176EJZ Implementation 
"Reference Methodology" 
IP/RTL Design 
• Hierarchical structuring 
• DVFS Interfaces 
• Isolated voltage islands 
Multi-voltage implementation 
• Floor-planning 
• MV-timing constraints and STA 
• Physical IP "Multi Voltage Kits"
Intelligent software control is key
"Intelligent Energy Manager" resides in the OS 
kernel and derives task performance 
requirements from kernel calls 
Key is not having to change/touch the applications 
But if you can, then you can do even better "prediction"!
DVFS in summary! 
Implementation = IP + Libraries + Tools + Flow 
(oh yes + PSU + SW + OS + effort!) 
• Significant energy efficiency gains in many 
products 
• e.g. Smartphone is an MP3 player for 95% of on-time?
• Requires voltage headroom and careful 
production support 
• Design for worst case, optimize for typical Si 
SOC and IP Design challenge 
Advancements in Energy-Efficient Design 
System Level Leakage Mitigation 
David.Flynn@arm.com
Overview 
• Systems level challenge 
• SW/SOC/PSU design, EDA/Library implications
• Dynamic Power/Energy Management 
• Dynamic Voltage and Frequency Scaling 
• Real-world design issues 
• Static/Leakage Power Management 
• Multiple power management states 
• Work in progress ... 
IEM Leakage mitigation next...
Customers want ARM to support many of the following:
1. Mixed Vt libraries (2, 3 even 4 VTs!)
• EDA tools support now
2. Multi-rail switched domains
• External power down of sub-systems
3. On-chip Power Gating (MTCMOS) 
• Local fine/coarse grain power header/footer switches 
4. Retention Registers (a.k.a. "Balloon Flops") 
5. Reverse Bias memory leakage support 
6. Dynamic Threshold Scaling (VTCMOS) 
• Support for both forward and reverse bias operation 
Leakage Reduction - SOC Implications 
IP implementation and deployment challenges:
1. Mixed Vt libraries 
• Reduce leakage on non-critical paths, no system issues 
2. Multi-Rail Power-Switched domains 
• Clamp interfaces, long power ramp times, state save req. 
3. Switched 'Virtual Rail ' Power-Gating 
• Clamps, area cost and delay analysis , state save/restore 
4. Retention Registers 
• Area cost, extra power routing, but energy efficient re-start 
5. Reverse Bias Power Switching 
• Area/power cost, superior leakage, VBB generation req. 
6. Dynamic Threshold scaling (VtCMOS etc) 
• Well/Bulk routing overheads, triple-well process, PSU+ 
Multi-Threshold CMOS (MTCMOS)
• Multi-threshold CMOS
• dual-Vt technology ... high-Vt
transistors gate power supplies
• Per-cell (very area expensive) 
• Characterize within the cell 
System: Architected 
state save/restore 
+PSU handshakes 
• Add extra sleep 'mode' 
• Shared (less area expensive) 
• Virtual power rails shared 
• Distributed power switches 
• Rows/columns/grids 
• IR-analysis serious challenge 
• Issues: 
• tx sizing, distribution strategy, 
analysis flows, tool flow support, 
loss of state, ... 
• Safe power switch-on (in-rush 
current, phased control timing) 
MTCMOS power architecture 
• e.g. 'Footer' switches added to high-speed domain
[Figure: MTCMOS power architecture. Supply rails: 3V3/2V5, 1V0, 0V8-1V0, and 0V8-1V0 + Retention. CPU sub-system and RAM (Cache) sub-system, with MTCMOS 'Footer' switches to 0V]
State Retention Registers 
• Retention flops (aka Balloon flops)
• High-Vt slave latch on separate power
• Seek to minimize CK->Q delay
• State retained
• Issues: extra power rail support implementation, analysis, ...
• EDA front end support
• 'sleep' attributes
• In-rush currents / noise immunity challenge for production
System: Architected
Sleep/Wake
+PSU handshakes
+PSU handshakes 
Retention Register power architecture
• Retention Register supports state save/restore 
[Figure: Retention Register power architecture. Supply rails: 3V3/2V5, 1V0, 0V8-1V0, and 0V8-1V0 + Retention on an un-switched supply rail; 0V]
Variable Threshold CMOS (VtCMOS)
• Substrate Bias control, Dynamic/Adaptive Body Bias, etc.
• State retained ...
• Leakage reduced with reverse bias voltage
(Peak processing potentially using forward bias)
• Triple-Well process cost/complexity
• On-chip VBB generator(?)
• Off-chip Bulk/Well voltage controller:
• Need carefully managed handshakes (VDD + Vpwell/nwell)
• Issues: Library support, implementation tool flows, complex
multi-dimensional timing models
• Voltage scaling orthogonal
• Timing analysis becomes 'multi-surface'
• Valuable to address gate leakage
IEM: Architected
Perf levels + Sleep
+PSU handshakes
Energy Profiles - 1 (state loss) 
[Figure: energy profile for a sleep state with state loss; labels: Cost of State Save, Cost of Reset + State Restore, Minimal Leakage, Real-Time Impact, SLEEP]
Energy Profiles - 2 (state retention)
[Figure: energy profile for a sleep state with state retention; labels: Higher Leakage, Quick Return]
IEM Dynamic and Static Power States 
• Run states (IEM-Dynamic Performance Control) 
• TURBO - operate at 'maximum' performance 
• "Forward-biased" thermally controlled (e.g. docked) 
• NORMAL - operate at 'standard ' performance 
• Standard active state (normal battery operation) 
• SLOW - battery saving reduced performance 
• "Reverse-biased" reduced leakage operation 
• Sleep states (IEM-Leakage control , Wake-on-Interrupt) 
• HALT - clocks stopped, static leakage, quick wake 
• SNOOZE - local power gating, retention regs 
• HIBERNATE - state save/restore, ext PSU switching 
• OFF - explicit state restore required, turn cache RAMs off 
Leakage Summary + wrap up 
• Many leakage mitigation techniques 
• IP providers must support many 
• Software needs unified OS API
• To hide the hardware specific sleep states 
• Analysis must take into account: 
• Leakage power reduction for each sleep state 
• Energy cost of each sleep state entry/exit 
• Real-time impact of each sleep state entry/exit 
• State-dependent leakage 'bounds ' 
• Effect of the thermal profile of run-states 
Key to product battery life ...
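The analysis points in the leakage summary (leakage reduction, entry/exit energy cost and real-time impact per sleep state) can be combined into a break-even estimate. The Python sketch below is one hypothetical way to do so: the state names echo the slides, but the data structure, function names and all numbers are assumptions for illustration.

```python
# Sketch: break-even idle time and sleep-state selection.
# State names echo the slides; every number used with this code is a placeholder.
from dataclasses import dataclass


@dataclass
class SleepState:
    name: str
    leakage_mw: float       # leakage power while in the state
    transition_uj: float    # energy cost of entry plus exit
    wake_latency_us: float  # real-time impact of leaving the state


def break_even_us(state, idle_leakage_mw):
    """Idle time beyond which the state saves energy versus simply idling."""
    saved_mw = idle_leakage_mw - state.leakage_mw
    if saved_mw <= 0:
        return float("inf")
    # uJ / mW gives milliseconds, so scale to microseconds
    return state.transition_uj / saved_mw * 1000.0


def best_state(states, idle_leakage_mw, idle_us, max_wake_us):
    """Lowest-energy state that meets the wake-latency bound; None means stay idle."""
    best, best_energy_uj = None, idle_leakage_mw * idle_us / 1000.0
    for s in states:
        if s.wake_latency_us > max_wake_us:
            continue  # would violate the real-time constraint
        energy_uj = s.transition_uj + s.leakage_mw * idle_us / 1000.0
        if energy_uj < best_energy_uj:
            best, best_energy_uj = s, energy_uj
    return best
```

A unified OS API of the kind the slide asks for could hide such a table behind a single call that takes the predicted idle time and the tolerable wake latency.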
synopsys· 
Dynamic Voltage and Frequency 
Scaling: 
The "IEM" System Design 
Perspective 
David Flynn 
ARM 
"What I need (to compete) is (peak) performance but not at the 
expense of great battery life .. . " 
* Performance challenge is worst-case timing closure 
* Need FMAX, but run as an MP3 player most of the time 
"I need to port OS quickly and to run 3rd party software" 
* Products typically differentiated by the software 
and 3rd party applications cannot be profiled in advance 
* Applications need to run without instrumentation/rewrite
Dynamic Voltage and Frequency Scaling is attractive but a 
complex systems problem .. . 
Power and Energy consumption trends of a workload running at different 
frequency and voltage levels. 
DFS: frequency scaling only, DVFS: frequency & voltage scaling
[Figure: three plots against frequency contrasting DFS and DVFS; the voltage headroom region is marked "Useful for DVS"]
$E = \int P \, dt$
$f \propto (V_{dd} - V_t)^{\alpha} / V_{dd}$, with $\alpha \approx 1.3$
$P = C V_{dd}^{2} f + V_{dd} I_{leak}$
$V_t / V_{max} \approx 0.3$
Avg. power - heat
Need DVS to save energy
Must reduce voltage to save energy and extend battery life!
"Intelligent Energy Manager" 
Implementation = IP + Libraries + Tools + Flow 
(oh yes + PSU + SW + OS + effort!)
IP/RTL Design 
Hierarchical structuring 
DVFS Interfaces 
Isolated voltage islands
Multi-voltage implementation 
Floor-planning 
MV-timing constraints and STA 
ARM (Metro™) Physical IP
IP design and RTL 
Multiple voltage domains->multiple clock domains 
Partitioning and interfaces require special care 
What clock frequency/voltages are energy efficient? 
Library enhancements 
Level Shifters , Isolation cells 
Multi VDD Characterization (Std Cells and Memories) 
Multi VT Cells 
Tools 
Multi-Voltage aware Galaxy Platform 
Multi-VT Optimization 
Jointly developed IEM Reference Methodology 
(1) Timing/Power: Analysis across all corners
[Figure: plots of leakage against temperature and of average power (uW/MHz) against voltage, across process corners and for typical silicon/conditions]
IEM technology demonstrator 
ARM926-based SOC 
144/196/240/288MHz DVFS
DVFS control/analysis support
(plus Adaptive Voltage Scaling)
Detailed power and energy simulation 
Correlate with Silicon early Q3 
RTL to GDSII Design Methodology for Dynamic Frequency and 
Voltage Scaling Enabled SoC - A Case Study 
Dave Scott and Sachin Idgunji
Staff Design Consultants
Synopsys Professional Services
Dar-Sun Tsien, Ph.D.
Sr. Director of Design Methodology, UMC
Dave Flynn
ARM Fellow, ARM
ABSTRACT
As the merits of dynamic frequency and voltage scaling for SoC energy efficiency are
becoming more widely known, design teams are updating their methodologies to
accommodate the requirements of variable supply voltage. Two of the most important
methodology areas to consider are power planning and static timing analysis. A recent
demonstration SoC shows how these parts of the design flow can be tailored to achieve
significant power savings.
Synopsys Professional Services and UMC collaborated to create this technology
demonstrator SoC based on ARM's Intelligent Energy Management solution that uses
dynamic frequency and voltage scaling on the ARM926EJ-S processor core. The
demonstrator chip contains internal memories and peripherals in addition to the processor
core that makes the chip a good representative of a typical SoC. The technology
demonstrator chips fabricated in UMC 130nm CMOS technology are expected to be
available in June 2005.
SNUG Europe 2005 
This paper describes the design challenges of frequency and voltage scaling and the
applied design flow from RTL to GDSII, with special focus on design methodologies for
power planning and static-timing analysis.
1. Introduction 
To demonstrate power-management capabilities for a UMC 130nm fabrication process,
five partners collaborated in designing the ULTRA926 technology demonstration chip.
This paper describes the design methodology used to implement the chip.
Each of the five partners supplied key technology for the complete SoC solution. ARM
Ltd provided the Intelligent Energy Manager (IEM) technology. Artisan (now part of
ARM) provided the standard cell Metro libraries. National Semiconductor provided the
PowerWise™ Advanced Power Controller (APC). Synopsys Inc provided the Galaxy
Design Platform and the resources of Synopsys Professional Services to implement the
design from RTL to GDSII. Finally, UMC provided the 130nm Fusion process, which is
specifically designed for the integration of high-speed and low-leakage transistors in a
single SoC die.
The SoC is based on an ARM926EJ-S™ processor and incorporates most of the building
blocks typical of today's SoCs. This technology demonstrator is therefore an accurate
representation of the most common types of SoC in development today.
Implementing a device of this type poses many challenges. Some of these challenges are
generic to all VDSM SoC developments. Others are specific to the dynamic frequency-
and voltage-scaling (DFVS) technology implemented in this device. This paper focuses
on two of the most critical DFVS challenges: power planning and static timing analysis.
2. Background on Energy-Efficient SoC Design 
Power consumption has become increasingly important for a variety of reasons. Power
issues in mainstream deep submicron designs may limit functionality or performance and
severely affect manufacturability and yield. Higher power dissipation increases junction
temperature, which slows transistors and increases interconnect resistance. Design
techniques aimed at improving performance may therefore fall short if power is not
considered. Lower-than-expected performance decreases device yield. Additionally,
higher power dissipation requires more elaborate system-level measures for thermal
management. In general, these power issues are increasing SoC and system costs.
Managing power consumption at appropriate points in the SoC design flow keeps these
costs under control.
As processes shrink, the problem of power and thermal management grows, and the
relative impact of these effects begins to dominate the more traditional tradeoffs of speed
and area. At a higher level, the applications that SoCs enable are requiring more energy
efficiency as consumers demand longer battery life and more functionality in mobile and
hand-held equipment.
The ULTRA926 technology demonstrator is designed to show that energy-efficient
design is possible within a mainstream design methodology. Compared to conventional
designs, ULTRA926-type SoCs can reduce power consumption by as much as 60 percent
with no performance compromises.¹
2.1 Sources of energy consumption
CMOS SoCs consume two types of power: dynamic and static. Dynamic power is the
power consumed in switching logic states, both internal to the cells (internal power) and
for driving the chip's nets and external loads (switching power):
Dynamic power $\propto C V^{2} F$
where C is the load (capacitance), V is the voltage swing and F is the number of logic-
state transitions.
As semiconductor structures become smaller, device and interconnect capacitances
decrease, allowing for higher performance and lower power. Countering these factors are
power increases due to larger designs and higher switching rates.
Static power (leakage power) is consumed while transistors are not switching:
Static power $= V \cdot I_{STAT}$
Although transistors have some reverse-biased diode leakage from drain to substrate, the
larger portion of leakage power is due to the sub-threshold current through a transistor
that is turned off. This sub-threshold current results from the conduction between source
and drain through the transistor channel.
The sub-threshold leakage current is problematic because it increases as transistor
threshold voltages ($V_{th}$) decrease. In fact, the move to 130 nm and beyond could boost
leakage power in some designs as high as 50 percent of the total chip power.
3. Strategies for reducing power consumption 
As CMOS technologies scale down, the main approach for reducing power consumption
has been to scale down the supply voltage VDD. Voltage scaling is a good technique for
controlling a chip's dynamic power because of the quadratic effect of voltage on power
consumption. However, just reducing the power supply degrades circuit speed because
the switching delay time is proportional to the load capacitance and the ratio Vth/VDD. To
maintain sufficient drive strength for fast switching, Vth must also decrease. This
decrease leads to the leakage power increase. Fortunately, a power-aware design flow can
balance timing requirements with various power goals.
1 Analyses for the ARM926 core are presented in Appendix A
The ULTRA926 project minimized power consumption by using multiple VDD levels and
multiple threshold voltages. The design also uses high-level power-management
techniques such as a power-aware operating system, sleep mode, compiler optimizations
and low-power memory access strategies.
3.1 Dynamic strategies
While these traditional methods are effective, the innovative aspects of the ULTRA926
design lie in the use of methodologies and IP associated with dynamic frequency and VDD
scaling. This article therefore focuses on these aspects of the design.
To understand the high-level strategy applied to the ULTRA926 chip, consider the
traditional approach outlined in Figure 1. In this approach, parts (or all) of a chip have
their clocks turned off when not in use. When the parts are in use, they run at full speed.
[Figure: utilization (0% to 100%) against time for run/idle operation, alternating between IDLE and ON]
Figure 1 Traditional System Power Management
Figure 2 contrasts this all-or-nothing approach with a strategy of intelligent energy
management, in which a task runs only fast enough to meet the application requirement.
Reducing both clock speed and VDD as appropriate for each task saves more power than
the full-speed/idle approach.² Determining the appropriate reduction requires a prediction
of how soon the application needs a task to complete.
2 Assume a task requires 1000 clock cycles to complete, but 2000 clock cycles of time are available. The
run/idle method will have 1000 transitions at the full voltage consuming power and 1000 cycles consuming
no power. The scaling method will scale both the frequency and the voltage so that it has 1000 transitions at
the scaled voltage consuming power and 0 cycles consuming no power. As long as the scaled voltage is less
than the full voltage, the scaling method will consume less power than the run/idle method.
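The footnote's comparison can be sketched numerically. The Python sketch below assumes dynamic energy per transition proportional to V² with capacitance normalised to 1, and treats idle cycles as consuming nothing, as the footnote does; the two voltages in the usage example are illustrative assumptions.

```python
# Sketch of the footnote's comparison: relative dynamic energy ~ transitions * V^2.
# Capacitance is normalised to 1; idle cycles are assumed to cost nothing.


def dynamic_energy(transitions, voltage):
    """Relative dynamic energy for a number of clocked transitions at one voltage."""
    return transitions * voltage ** 2


def scaling_vs_run_idle(task_cycles, v_full, v_scaled):
    """Ratio of scaled-voltage energy to run/idle energy for the same task."""
    run_idle = dynamic_energy(task_cycles, v_full)   # run at full speed, then idle
    scaled = dynamic_energy(task_cycles, v_scaled)   # same transitions, lower voltage
    return scaled / run_idle


if __name__ == "__main__":
    ratio = scaling_vs_run_idle(1000, 1.08, 0.80)
    print(f"scaled / run-idle energy ratio: {ratio:.3f}")
```

The task length cancels out of the ratio, so in this simplified model the saving depends only on how far the voltage can be lowered at the reduced frequency.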
[Figure: performance (0% to 100%) against time, contrasting Run/Idle with IEM. Annotations: run task as slow as possible; reduce voltage to lower level; run task in time available; reduce voltage to match time]
Figure 2 Intelligent Energy Management concept
3.2 Potential power reduction from dynamic strategies 
While the actual power of the ULTRA926 chip has not yet been measured, results from
similar designs indicate that the reductions should be significant compared to devices that
use non-dynamic strategies. Reducing clock speed from 288 to 144 MHz obviously cuts
power requirements by half, for example. Not quite as obvious is the magnitude of the
reductions due to scaling the supply voltage to the minimum acceptable level at the same
time. This dual approach cuts power consumption to about 40 percent of full power.
Note that these power reductions apply only to the chip's dynamic-voltage-and-
frequency-scaling subsystems. Normally in such SoCs, some of the chip will not be
voltage scalable. Components such as external memories typically operate at a fixed
voltage, for example, so design partitioning and planning must take into account the
system-level power savings.
4. The ULTRA926 SoC
As Figure 3 shows, the ULTRA926 SoC features an ARM926EJ-S™ processor core
connected to an AMBA bus system. The latter comprises high-speed AHB and lower-
speed APB peripheral buses. The AHB subsystem contains the interfaces to the external
static and dynamic memory subsystems, while the APB connects to a wide range of
common peripheral devices such as timers, UARTs, real-time clock and interrupt
controller. Clock speeds are 288, 240, 192 and 144 MHz.
[Figure: block diagram showing 18k Instruction RAM and 16k Data RAM on VDDRAM, isolation clamps, SRAM/FLASH and SDRAM interfaces]
Figure 3 ULTRA926 test chip architecture
(IP key: ARM - light blue, National - dark blue, Artisan - yellow, Synopsys - purple)
The chip is partitioned into three primary power domains: voltage/frequency-scaled CPU
and memory power domains and a standard fixed-voltage domain for the rest of the chip.
The independent power domains allow precise voltage control and current measurement
for the CPU and RAM. Standard cells and level shifters operate in the 0.7-1.2V (VDD)
range. The project used two libraries characterized for different threshold voltages, and
each of these libraries was characterized over the required VDD range.
4.1 The SoC's power-control elements 
The chip's power-control elements include the ARM Intelligent Energy Controller
(shown at upper right in Figure 3) that works with energy-management software to
balance processor workload and energy consumption. Other ARM power-related
elements include a dynamic clock generator and a unit for power, clock, reset and test.
Additionally, the ARM core's voltage domain includes a hardware element from National
Semiconductor that monitors performance and communicates with voltage regulators to
scale the supply voltage to the minimum operating level at each operating frequency.
This monitor hardware essentially consists of a delay line to measure the minimum
acceptable performance level. The monitor thus calibrates the control system with the
actual performance of the silicon. This approach compensates for silicon performance
variations due to the manufacturing process as well as run-time performance changes due
to temperature fluctuations.
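The closed-loop behaviour described above can be pictured as a search for the lowest supply voltage at which a pass/fail monitor still reports adequate performance. The Python sketch below is a simplified illustration only: `monitor_passes` is a hypothetical stand-in for the delay-line hardware, and the step size and voltage limits are assumptions, not the controller's actual algorithm.

```python
# Sketch of a closed-loop (adaptive) voltage search driven by a pass/fail monitor.
# `monitor_passes` is a hypothetical stand-in for the delay-line monitor hardware.


def settle_voltage(monitor_passes, v_max, v_min, step):
    """Lower the supply in steps while the monitor passes; return the last good voltage."""
    steps = int(round((v_max - v_min) / step))
    v_good = v_max
    for i in range(1, steps + 1):
        v = round(v_max - i * step, 6)
        if not monitor_passes(v):
            break  # silicon is too slow at this voltage for the target frequency
        v_good = v
    return v_good


if __name__ == "__main__":
    # Hypothetical silicon that meets the target frequency at 0.93 V or above.
    print(settle_voltage(lambda v: v >= 0.93, v_max=1.08, v_min=0.70, step=0.01))
```

Because the pass/fail result comes from the silicon itself, faster parts or cooler conditions settle at a lower voltage than a predetermined worst-case table would allow.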
5. RTL to GDSII Implementation
The ULTRA926 SoC implementation methodology followed the traditional 6-step
approach consisting of logic synthesis, floorplanning (including power planning), cell
placement, clock tree synthesis, routing, analysis and signoff. While each step required
the design team to take some account of the multi-voltage nature of the design, this paper
focuses on design planning and static timing analysis.
[Figure: implementation flow]
Logic Synthesis
Floorplanning (with power planning)
Cell Placement
Clock Tree Synthesis
Routing
Analysis
Note that Synopsys now offers tool features specifically targeted at multi-VDD designs.
Because these tools were not available to the ULTRA926 team, this project used
methodologies that are now included in the latest tool revisions.
6. Design Planning 
The implementation of multiple VDD domains in the ULTRA926 chip requires that the
domains be planned with separate power grids and include the ability to power-down a
grid altogether to eliminate leakage power. Note that multiple supply voltages must be
provided either through separate power pins or by analog voltage regulators that are
integrated into the device. The ULTRA926 chip uses the former approach.
Completely powering-down a logic domain requires the use of power isolation cells,
because the outputs from a powered-down section into an active power domain should
never be allowed to float. Power isolation logic ensures that all inputs to the active power
domain are clamped to a stable value.
To meet system requirements with multiple voltage domains, it is necessary to evaluate
whether interfaces between different domains require level shifters (for signals) and/or
power isolation cells. Additionally, a state-retention technique may be required in blocks
that are powered-down so that these blocks can resume operation when powered-up.
Powering-down various domains' voltages or scaling their voltages dynamically may also
require power-sequencing circuitry to ensure correct operation of the chip. The processor
core in the ULTRA926 chip can be powered-down, but the chip's internal RAM is
always powered so that it can save the processor state.
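The interface evaluation described above can be sketched as a small rule check per domain crossing. The Python sketch below is a simplified assumption-laden illustration: the `Domain` attributes, the rule that domains with identical voltage ranges track each other, and the 1.2V figure for the fixed domain are all assumptions introduced here, not the project's actual checks.

```python
# Sketch: decide which protection cells a domain crossing needs.
# Domain attributes and rules are simplified assumptions for illustration.
from dataclasses import dataclass


@dataclass(frozen=True)
class Domain:
    name: str
    v_min: float
    v_max: float
    can_power_down: bool


def crossing_cells(src, dst):
    """Return the set of special cells needed on signals going from src to dst."""
    cells = set()
    # Assumption: domains with identical voltage ranges are scaled together,
    # so only differing ranges call for level shifting.
    if (src.v_min, src.v_max) != (dst.v_min, dst.v_max):
        cells.add("level_shifter")
    # Outputs of a domain that can be switched off must not float into the receiver.
    if src.can_power_down:
        cells.add("isolation_cell")
    return cells


if __name__ == "__main__":
    cpu = Domain("CPU", 0.7, 1.2, can_power_down=True)
    ram = Domain("RAM", 0.7, 1.2, can_power_down=False)
    soc = Domain("SOC", 1.2, 1.2, can_power_down=False)
    for src, dst in [(cpu, soc), (soc, cpu), (cpu, ram), (ram, cpu)]:
        print(f"{src.name} -> {dst.name}: {sorted(crossing_cells(src, dst))}")
```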
The multi-voltage concept can range from a set of fixed voltages to a fully adaptive
approach. In between is a dynamic approach in which the voltage is scaled to
predetermined levels, with no regard to the actual silicon performance. The adaptive
approach uses the hardware performance monitors described earlier to ensure that voltage
levels meet performance targets rather than simply predetermined levels that should work
for most implementations. Note that these alternatives can be combined in a single SoC
and include low-leakage/high-speed tradeoff methodologies.
Both level shifters and isolation cells created design planning issues that the ULTRA926
team had to handle. For level shifters, the cell height differed from that of the other
standard cells, so the level shifters were placed manually between power domains.
Isolation cells had to be placed at the right voltage domains, so both cell orientation and
location were important.
7. Static Timing Analysis 
While traditional sign-off requires the use of a best- and worst-case corner value around a
nominal timing corner, multi-voltage designs have many more corners. The design's
various regions have different libraries, different and/or variable operating voltages and
different operating conditions. Thus, timing varies from one voltage value to another and
from one block to another. Multiple STA runs are therefore needed to cover all the
possible corners.
The design was implemented in UMC L130E HS FSG process.
Two Liberty timing models are provided with the standard Foundry design kit for the
LF027 core:
Worst case for static timing analysis of setup times and output settling times (design
critical paths)
• 1.2V - 10% = 1.08V
• 125 degrees C (commercial grade hot)
• slow-slow process corner
Best case for static timing analysis of hold times and input arrival times (to fix any race
conditions)
• 1.2V + 10% = 1.32V
• -40 degrees C (commercial grade cold)
• fast-fast process corner
Based on the system requirements to process workloads, the following voltage and
frequency points were used:
Voltage   Process     Period   Latency(max)   Latency(min)   Latency(spread)   Fmax
(V)                   (ns)     (ns)           (ns)           (ns)              (MHz)
0.730     slow/slow   7.721    3.626          3.237          0.389             129.5
0.800     slow/slow   6.198    2.963          2.636          0.327             161.3
0.940     slow/slow   4.442    2.187          1.931          0.256             225.1
1.080     slow/slow   3.472    1.741          1.531          0.210             288.0
1.320     slow/slow   2.645    1.333          1.162          0.171             378.1
Table 1
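Reading the Table 1 entries as volts, nanoseconds and MHz, the columns are related by simple arithmetic: Fmax is the reciprocal of the clock period, and the latency spread is the difference between maximum and minimum clock-tree latency. A short Python sketch of that cross-check follows; the helper names are hypothetical and the values are transcribed from the table.

```python
# Sketch: cross-check Table 1, where Fmax = 1 / period and spread = max - min latency.
TABLE_1 = [
    # (voltage V, period ns, latency max ns, latency min ns)
    (0.730, 7.721, 3.626, 3.237),
    (0.800, 6.198, 2.963, 2.636),
    (0.940, 4.442, 2.187, 1.931),
    (1.080, 3.472, 1.741, 1.531),
    (1.320, 2.645, 1.333, 1.162),
]


def fmax_mhz(period_ns):
    """Maximum clock frequency in MHz for a given minimum period in ns."""
    return 1000.0 / period_ns


def latency_spread_ns(latency_max_ns, latency_min_ns):
    """Difference between longest and shortest clock-tree latency."""
    return latency_max_ns - latency_min_ns


if __name__ == "__main__":
    for volts, period, lat_max, lat_min in TABLE_1:
        print(f"{volts:.3f} V: Fmax {fmax_mhz(period):.1f} MHz, "
              f"spread {latency_spread_ns(lat_max, lat_min):.3f} ns")
```

The spread more than doubles between 1.320V and 0.730V, which is the clock-tree effect of voltage scaling that the surrounding text draws attention to.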
Due to the wide spread in both timing characteristics and buffer tree latencies, the typical
process, room temperature and nominal 1.2V operating voltage are useful to understand
for voltage scaling. The latency variation in the above table brings out the impact of
scaling voltages on the clock tree within the ARM926 core.
Besides the intra voltage domain STA, the timing analysis required careful analysis of
paths across the 3 voltage domains:
1 SoC level from and to ARM926 Core
2 SoC level from and to the RAM (tightly coupled memories - TCM)
3 ARM926 Core from and to the RAM
For paths in 1 and 2 above, this was achieved by running the ARM926 Core and the
RAM (TCM) at 6x (288 MHz/1.08V), 5x (240MHz/0.95V), 4x (192MHz/0.85V) and
3x (144MHz/0.8V) of the AHB clock (48 MHz/1.08V). The retiming interface between
the SoC level domain and the scaled voltage domains (ARM926/RAMs) was designed to
account for the skew/delay variation resulting from the scaled voltage points.
The paths between the ARM926 Core from and to the TCM were tuned with the above
corner points. Even though both the voltage domains operated at the same voltage level³,
the clock tree segments were separately built, and to account for any variation in skew
resulting from difference in the clock tree topology, it was essential to run STA across
the 2 interfaces at each operating voltage point.
3 When the ARM926 core is in the active state the core and RAM domains are at the same voltage, but the
core domain can be switched off completely whilst the RAM domain remains powered to save processor
state
Table 2 shows the other corner values for the ULTRA926 design. The operating voltage
points for the ARM926 core and the RAM move together, although the ARM core can
also be powered down completely while the RAM cannot. Each of the sets of values in
the table requires a separate STA run. Additionally, timing analysis begins with a base
STA run at the process's highest voltage (1.32V for the UMC 130nm process) for all
blocks to check the design's best timing.
Table 2: STA corner values
Dynamic Voltage and Frequency Scaling in the ULTRA926 required the following
clocks and control logic in the design:
• Free-running AMBA clock for AHB HCLK (also serves as APB PCLK in this
design). This runs continuously, and the processor and debug clocks are
phase-aligned to this primary clock. All the synthesizable memory controllers and
peripherals use this primary clock in the SOC design.
• Independent target processor clock frequency, required to support the Adaptive Power
Controller module, with the frequency under the control of an external
(asynchronous) dynamic power controller. This target frequency is set by the
Intelligent Energy Controller to the desired processor frequency and used as part
of the voltage control feedback loop to determine if the voltage is sufficient to
safely support operation at this target performance level.
• Dynamically switching, glitch-free CPU clock, carefully controlled to align with
AHB transfer HCLK edges using the HCLKEN enable. The CPU clock is
independently switched in frequency under control of the Intelligent Energy
Controller in a phase-aligned manner to the free-running AMBA clock.
• Power management protocol support for isolation (interface signal clamping) and
CPU sequencing and synchronous de-assertion of reset.
One of the key challenges in the design was to meet timing requirements on specific
paths from the DVFS control logic (at the SoC level) to the ARM926 core / TCM
memory macro, because of the spread between the setup and hold margins on certain
pins on the ARM926 timing models and the memory timing models inside the TCM. The
hold requirement dictated the final operating frequency of the interface. As the
ARM926 and the TCMs were scaled down in voltage, the increase in the hold
requirement due to the increase in clock latency inside these blocks required the reset
sequencing and the clamp control signaling to operate at a lower frequency.
Multi-Voltage STA was performed using PrimeTime. PrimeTime provides several
features that simplify multi-voltage analysis and signoff. Though instance-specific
operating conditions could be applied to time this design, with the given libraries that
were characterized for each voltage corner, a link-path-per-instance approach was used to
set up the operating conditions for the cells that ran on different rail voltages.
A set of NLDM libraries, one library per operating point as described in Table 2, was used
during STA. Because NLDM libraries have limited accuracy for multi-rail cells, each
library used was characterized at the required PVT to run STA.
To specify different NLDM library cells for different instances in the design, the
following is required. Set the variable link_path_per_instance to a list. Each list element
consists of a list of instances and the corresponding link paths that override the default
link path for each of those instances. An example is provided below.
pt_shell> set link_path_per_instance [list \
    [list {U_ULTRA_SOC} ..... ] \
    [list {U_ULTRA_SOC/U_ARM} "* umc_130_hs_0_9_slow"] \
    [list {U_ULTRA_SOC/U_TCM} "* umc_130_hs_0.8_slow"]]
The listed instances in the case of the ULTRA926 were the hierarchical blocks that
represent individual voltage islands.
The STA was very similar to a single-voltage run. The cells were also given the
operating conditions that were part of the library that the cells were linked to.
PrimeTime SI uses the input slew, output load, and operating conditions of each cell to
calculate slew and delay for the cell. It also reports transition times in terms of
local-library thresholds.
To check the settings, the command report_cell lists the specific operating conditions for
each cell along with the supply voltage information, and report_port lists the conditions
for each port. For further setup analysis, the shell command check_timing -signal_level
-verbose finds nets for which the driver signal level does not match the load signal level. Since
this command reports violating driver/load pairs, it shows nets where the user might need
to insert level shifters. This feature also makes it possible to diagnose inconsistencies
based on input_voltage and output_voltage groups and inconsistencies based on rail
voltages or input/output signal_level.
8. Conclusions 
By combining key technology from five collaborating companies we have implemented
ULTRA926, a technology demonstration vehicle that will show how energy savings of
up to 60 percent may be achieved with the UMC 130nm fabrication process. This level of
energy savings is vitally necessary to enable SoC designers to increase battery life while
supplying the desired functionality in mobile and hand-held devices. The technology
described here allows the industry to migrate to smaller process geometries, where power
and thermal management would otherwise pose significant hazards.
9. Acknowledgements 
The authors would like to thank the many people at the five partner companies whose
hard work and dedication made the implementation of the ULTRA926 device possible.
Special thanks should also go to the SNUG reviewers for their kind guidance during the
writing of this paper.
A. Analyses for the ARM926EJS_1616 core 
ARM926EJS_1616 Frequency and Voltage analysis 
The ARM926EJS_1616 processor core was synthesized for a 288MHz target frequency
for 1.2V - 10% voltage tolerance, slow silicon, worst case temperature. Detailed PathMill
simulations were performed at different operating process/voltage/temperature points to
allow the characteristic for dynamic voltage and frequency scaling to be derived. This is
required to work out what frequencies are energy efficient in the overall system-on-chip
design.
Voltage (V)  Process  Period (ns)  Fmax (MHz)
0.730        ss       7.721        129.5
0.800        ss       6.198        161.3
0.940        ss       4.442        225.1
1.080        ss       3.472        288.0
1.320        ss       2.645        378.1
0.660        tt       6.172        162.0
0.730        tt       4.803        208.2
0.800        tt       3.920        255.1
0.940        tt       2.896        345.3
1.080        tt       2.357        424.3
1.200        tt       2.075        481.9
1.320        tt       1.882        531.3
0.730        ff       3.287        304.2
0.940        ff       2.100        476.2
1.320        ff       1.468        681.2
[Figure: Frequency Analysis. Fmax[TT] in MHz (logarithmic axis, 100 to 1000) versus Voltage]
ARM926EJS_1616 Dynamic Power
The following dynamic power values for ARM926EJS_1616 were obtained from
nanosim simulations using a range of VDD values. All simulations were carried out at
25degC using typical SPICE models for the UMC 130nm HS process. The simulations
used a standard "Dhrystone" vector set running with a clock frequency of 10MHz, and
average power was measured during the fourth Dhrystone loop in these vectors (by which
time the instruction sequence is loaded into the cache). These figures exclude leakage
current.
VDD     Average Current  Average Power
0.75v   193.2uA/MHz      144.9uW/MHz
0.90v   237.2uA/MHz      213.5uW/MHz
1.08v   291.8uA/MHz      315.1uW/MHz
1.20v   333.1uA/MHz      399.7uW/MHz
1.32v   372.7uA/MHz      491.9uW/MHz
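As a consistency check on the dynamic power figures, average power per MHz should equal VDD multiplied by average current per MHz. The Python sketch below recomputes the power column and also derives an effective switched capacitance from I = C x V x f; that capacitance is an illustrative derived quantity, not a figure quoted in the paper, and the names used are assumptions.

```python
# (VDD in volts, average current in uA/MHz, average power in uW/MHz), typical silicon, 25degC
DYNAMIC_POWER = [
    (0.75, 193.2, 144.9),
    (0.90, 237.2, 213.5),
    (1.08, 291.8, 315.1),
    (1.20, 333.1, 399.7),
    (1.32, 372.7, 491.9),
]


def power_uw_per_mhz(vdd, current_ua_per_mhz):
    """Average power per MHz from supply voltage and average current per MHz."""
    return vdd * current_ua_per_mhz


def effective_capacitance_pf(vdd, current_ua_per_mhz):
    """Effective switched capacitance: uA/MHz divided by volts gives pF."""
    return current_ua_per_mhz / vdd


for vdd, current, power in DYNAMIC_POWER:
    print(f"{vdd:.2f} V: {power_uw_per_mhz(vdd, current):6.1f} uW/MHz "
          f"(table {power}), C_eff {effective_capacitance_pf(vdd, current):.1f} pF")
```

The effective capacitance stays within a narrow band across the voltage range, which is why the dynamic power per MHz grows roughly with the square of VDD.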
[Figure: Average vdd current. Average Current per MHz (uA/MHz) versus vdd (0.40v to 1.40v)]
[Figure: Average power. Average Power per MHz (uW/MHz) versus vdd (0.40v to 1.40v)]
ARM926EJS_1616 Leakage Power 
Leakage current was measured by running nanosim in highest accuracy mode, and
recording VDD current at t=0. It is appreciated that leakage current is state-dependent, but
previous experiments have shown that this method gives fairly consistent results when
measurements are taken at different times in the simulation. Once again these simulations
were run using the typical models at 25degC.
VDD     Leakage Current  Leakage Power
0.75v   512.7uA          384.5uW
0.90v   680.4uA          612.4uW
1.08v   955.9uA          1032.4uW
1.20v   1201.5uA         1441.8uW
1.32v   1517.6uA         2003.2uW
[Figure: Leakage current vs vdd. Leakage current (uA) versus vdd (0.40v to 1.40v)]
[Figure: Leakage power vs vdd. Leakage power (uW) versus vdd (0.40v to 1.40v)]
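To put the leakage figures in proportion, the dynamic and leakage tables can be combined into a total power estimate. The sketch below assumes typical silicon at 25degC (the conditions of both simulations) and scales the per-MHz dynamic figure linearly with clock frequency; the 288 MHz / 1.08V pairing comes from the operating points given earlier, and the function names are illustrative.

```python
# Typical silicon, 25degC, keyed by VDD in volts
DYNAMIC_UW_PER_MHZ = {0.75: 144.9, 0.90: 213.5, 1.08: 315.1, 1.20: 399.7, 1.32: 491.9}
LEAKAGE_UW = {0.75: 384.5, 0.90: 612.4, 1.08: 1032.4, 1.20: 1441.8, 1.32: 2003.2}


def total_power_mw(vdd, freq_mhz):
    """Dynamic plus leakage power in mW at a tabulated VDD and a given clock."""
    dynamic_uw = DYNAMIC_UW_PER_MHZ[vdd] * freq_mhz
    return (dynamic_uw + LEAKAGE_UW[vdd]) / 1000.0


def leakage_fraction(vdd, freq_mhz):
    """Share of total power that is leakage (0.0 to 1.0)."""
    return (LEAKAGE_UW[vdd] / 1000.0) / total_power_mw(vdd, freq_mhz)


print(f"{total_power_mw(1.08, 288):.1f} mW total at 288 MHz / 1.08 V")
print(f"{100 * leakage_fraction(1.08, 288):.2f} % of that is leakage")
```

Under these assumptions leakage is a small fraction of active power for this 130nm core, so it matters mostly when the core is idle.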
ARM926EJS_1616 Open-loop (DVFS) and Closed-loop (AVS) Energy analysis
The worst case frequency analysis sets the operating frequencies for the system-on-chip
design. In this case a set of fixed frequency dividers was chosen to avoid significant
phase-locked-loop relock delays.
The frequencies selected are:
Performance  Frequency
Level        (MHz)
100%         288
83%          240
67%          192
50%          144
0%           0
Using the worst-case frequency analysis, the minimum voltages required to operate the
processor relative to the worst-case 1.08V Fmax of 288 MHz are derived and interpolated.
Energy to complete a task is the product of the power to complete the task at a particular
frequency and voltage and the time or duration to complete the task. The open-loop
voltage-scaled power and duration for workloads are derived graphically from the
simulation analysis to predict the typical silicon/room temperature energy savings for
table look-up power supplies able to support worst case process and temperature.
Finally the analysis of predicted energy savings when using the on-chip Adaptive Voltage
Scaling closed-loop control is also shown in the graph below. The effect of reducing
voltage to match the process and temperature for typical silicon at room temperature is
derived from the typical silicon voltage and power analysis. Note that in this case the
energy efficiency at the lowest performance point (144MHz) is in fact no better than at
192MHz, because there is no voltage headroom left for typical silicon and the lower
dynamic power at 144/192MHz is counteracted by the 192/144 scaling in duration to
complete the task.
For all the predicted energy savings a safe operating margin must be added for safe
operation and IR-drop effects. However, by referencing all the savings back to 1.08V WC
rather than a nominal 1.20V operating point, the relative savings for real-world power
supply tolerance are not unduly optimistic.
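The energy argument above can be illustrated numerically. The sketch below assumes that dynamic power scales as f x V^2 and ignores leakage, which is a first-order simplification rather than the simulation-based analysis used in the paper; the frequency/voltage pairs are the open-loop operating points quoted earlier for the ULTRA926, and all names are illustrative.

```python
REF_FREQ_MHZ = 288
REF_VDD = 1.08

# Open-loop (DVFS) operating points quoted for the ULTRA926: (MHz, volts)
POINTS = [(288, 1.08), (240, 0.95), (192, 0.85), (144, 0.80)]


def relative_power(freq_mhz, vdd):
    """Dynamic power relative to the reference point, assuming P ~ f * V^2."""
    return (freq_mhz / REF_FREQ_MHZ) * (vdd / REF_VDD) ** 2


def relative_duration(freq_mhz):
    """Time to run a fixed number of cycles, relative to the reference clock."""
    return REF_FREQ_MHZ / freq_mhz


def relative_task_energy(freq_mhz, vdd):
    """Energy = power x duration; the frequency terms cancel, leaving (V/Vref)^2."""
    return relative_power(freq_mhz, vdd) * relative_duration(freq_mhz)


for freq, vdd in POINTS:
    print(f"{freq} MHz / {vdd:.2f} V: "
          f"{100 * relative_task_energy(freq, vdd):.1f} % of reference task energy")

# With no voltage headroom (same voltage at both frequencies) the energy is unchanged:
same_voltage_ratio = relative_task_energy(144, 0.80) / relative_task_energy(192, 0.80)
```

In this simplified model `same_voltage_ratio` is 1: lowering frequency alone reduces power but stretches the task by the same factor, which matches the observation that 144MHz is no more energy efficient than 192MHz once the voltage can go no lower.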
[Figure: Voltage headroom/Energy savings (compared to WC 1.08V 125C). Bar chart on a 0% to 120% scale, grouped by operating frequency. Legend: AVS_TT_saving, AVS_TT_Energy, TT_saving, TT_Energy, WC_Vslack, WC_V/1.08V]
System Design 
for Leakage Mitigation 
Fast Track to Low Power 
David Flynn 
ARM Fellow, Cambridge R&D 
ARM Ltd. 
Leakage Mitigation 
Leakage mitigation is a SYSTEM-level problem 
OS + software policies 
Entry/Exit energy/real-time cost functions 
Control IP road map for Intelligent Energy 
Manager (IEM) 
IP interface/partitioning enhancements to 
facilitate clean control 
EDA challenge 
Implementation/verification and significant 
analysis issues 
Physical IP
Additional standard cells, enhanced RAM, 
power gates 
External Power supply control IP etc ... 
Voltage scaling and well bias 
Joint Best Practice MV Reference Designs 
ARM926EJ-S™ technology-based IEM SoC
Dual 16K caches, performance-tuned leakage-managed CPU system
Linux support platform (SDRAM/Flash/Peripherals integrated)
IEM controller enhanced for leakage/back-bias control (plus DVFS)
Target technologies (TSMC) 
"Generic" 90nm 1-volt, good performance but leaky process technology 
"Low Leakage" 65nm 1.2V portable lower performance technology 
Multiple implementation approaches 
• Shared Header switches on 90nm, intra-cell Footer switches at 65nm 
Lots of "science experiments" for real-world analysis 
Support detailed evaluation of power gating and state retention quality 
Representative MV design exploration vehicle with Synopsys 
Technology Demonstrator Results So Far ... 
Four primary levels of IEM leakage management: 
• HALT/SRPG/SCAN-HIBERNATE plus SHUTDOWN for CPU
Working 65nm LP silicon evaluated (see on ARM Booth) 
Supports DVFS for IEM dynamic energy saving 
important with 1.2V LP 
SRPG Savings of 85%+ measurable 
compared to the static ~0.3mW leakage
90nm G project just out of fabrication 
Distributed header power gating with inrush control 
Dynamic threshold scaling leakage management 
Both CPU plus USB subsystem leakage management
What ARM's Customers Need...
Help architecting "sleep" states onto SRPG technology 
Evaluating the leakage savings versus design complexity 
Evaluating the real-time costs for entry & exit from deeper sleep states 
Evaluating the energy-savings achievable depending on usage profiles 
What to do in the RTL ... 
Physical IP components that work well with tools 
Extra components to support Power gating, Isolation and State retention 
Proven implementation and verification methodologies 
Power Gating is just as much Multi-Voltage design as Dynamic Voltage 
Scaling 
Anything that reduces risk in production 
Better design-for-battery-life support a.s.a.p.
Aggressive Leakage Management for
ARM Based Systems
Alan Gibbons
Synopsys Inc.
David Flynn
ARM Ltd.
ARM Developers' Conference '06
Agenda
• Overview of ARM's Intelligent Energy Manager (IEM) technology
• Leakage power challenge and mitigation techniques
• Synopsys-ARM Technology Demonstrator (SALT)
• Driving new capability into ARM IP and Synopsys' Galaxy Design Platform
• Summary
What Consumers Care About 
• Users want more features in their mobile devices 
• MP3, Camera, Video, GPS ...
• But also need long battery life 
• Convenient form factor, affordable price 
• Battery technology is not evolving fast enough! 
• Need to conserve energy 
The Performance vs. Power Dilemma
[Diagram labels:]
Lowest leakage and/or dynamic power
Lower Cost
Increase Performance
Thermal management
Packaging, cooling, cost
Increased leakage
IR-drop
Electromigration
Power Dissipation
$E = \int_{0}^{t} \left( C V_{DD}^{2} f + V_{DD} I_{leak} \right) dt$
Total Power Dissipation = Static Power Dissipation + Dynamic Power Dissipation
• Minimize I_leak by:
  • Reducing operating voltage
  • Fewer leaking transistors
• Minimize I_switch by:
  • Reducing operating voltage
  • Less switching cap
  • Less switching activity
Improving Dynamic Energy Efficiency
• Dynamic Frequency Scaling (DFS)
  • Reduce operating frequency if possible
  • Reduces average power (but not task energy)
  • Eliminates NOPs
• Dynamic Voltage & Frequency Scaling (DVFS)
  • Requires DFS
  • Reduces voltage if frequency is reduced
  • Reduces task energy
  • Based on characterized frequency - voltage pairs (lookup table)
• Adaptive Voltage Scaling (AVS)
  • Closed loop optimization of VDD at run-time
  • Can save energy even at fixed frequency
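The lookup-table form of DVFS can be sketched as a simple selection over characterized frequency/voltage pairs. The pairs below are the ULTRA926 operating points from the SNUG paper earlier in this appendix; the selection policy (pick the slowest point that still meets the requested performance) and all names are assumptions made for illustration, not a description of the IEM software.

```python
# Characterized (frequency MHz, voltage V) pairs, sorted by ascending frequency
DVFS_TABLE = [(144, 0.80), (192, 0.85), (240, 0.95), (288, 1.08)]


def select_operating_point(required_mhz):
    """Return the lowest characterized (MHz, V) pair that meets the requirement."""
    for freq_mhz, vdd in DVFS_TABLE:
        if freq_mhz >= required_mhz:
            return freq_mhz, vdd
    raise ValueError(f"no operating point supports {required_mhz} MHz")


for demand in (100, 200, 288):
    freq, vdd = select_operating_point(demand)
    print(f"need {demand} MHz -> run at {freq} MHz / {vdd:.2f} V")
```

An AVS scheme would replace the fixed voltage column with a closed-loop measurement, which is why it can save energy even when the frequency does not change.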
ARM IEM Technology
Hardware and software solution for energy management
Dynamic control of voltage and frequency scaling.
• IEM™ software connects to OS kernel and collects data.
• Multiple policies categorize the software workload.
• Prediction of future performance requirement is made.
• Suitable operating point (Voltage and Frequency) is set.
[Diagram: Intelligent Energy Controller connected to a Dynamic Voltage Controller and a Dynamic Clock Generator]
ARM IEM Principles
• Batteries have finite amounts of energy stored in them
• Running fast and then idling wastes energy
[Figure: Voltage versus Time for Task 1, Idle, Task 2 and Task 3, with the Saved Energy region marked]
IEM System Implementation
[Figure: IEM system implementation diagram]
Agenda
• Leakage power challenge and mitigation techniques
Trends in power dissipation
• Static power dissipation cannot be ignored
• It is significant at 90nm and dominant at 65nm
• Leakage currents are rising fast
• Must be controlled by circuit design and optimization tools
• Transistors are not perfect switches
• They always "leak"
• Especially high performance (low Vt)
• Currently sub-threshold leakage (I, ) dominates
• Multi-threshold and Power Gating most effective
• However gate leakage (I4) is becoming significant
• Mitigated by high K dielectric material?
Leakage Mitigation Challenges
• Leakage mitigation is a system-level problem
• Operating System and software policies
• Entry and exit procedures
• Energy and real-time cost functions
• Control IP
• "Intelligent Leakage Control" (part of roadmap for IEM)
• IP interface and partitioning enhancements to facilitate clean control
• EDA challenge
• Implementation and verification with support for in-depth analysis
• Physical IP
• Additional standard cells, enhanced memories, power gates
• External Power supply control IP
• Voltage scaling and well bias
• Etc.
Some Leakage Mitigation Techniques
• Lower Operating Voltage
• MultiVt
• Cell sizing
• Non minimum size gate lengths
• VTCMOS
• Stack Effect
Power Gating: Coarse Grain vs. Fine Grain
• Fine grain: one switch per cell
• Simple to implement
• Large area overhead on cells
• Switch adds 2-4x area of original cell
• Clamps needed in every cell
• Effect on timing easy to characterise
• Coarse grain: distributed switches shared by many cells
• More complex to implement
• Reduced area overhead
• Switches are shared - so can be smaller
• Clamps only needed at macro cell outputs
• Effect on timing harder to characterise
• Less performance impact
State Retention Control
• IP design assumes explicit clock and reset in RTL coding
• Implicit always-on power and state preservation between clocks
• Designer needs to add RTL control of SRPG sequencing
• Typically as part of the clock/reset/test/isolation/power control per MV domain
• Ideally transparent to the (legacy) RTL IP subsystem
[Timing diagram: Clock, Save, Restore and Power signals. Sequential RTL state is saved before power gating and restored afterwards]
Agenda
• Synopsys-ARM Technology Demonstrator (SALT)
SALT - Technology Demonstrator
• ARM-Synopsys R&D partnership program
• ARM926EJ-S™ based system with integrated USB PHY and sub-system
• Support for AMBA®-based block hibernate and wake: "Deep sleep"
• Real silicon (TSMC90G) for evaluation and further tool/IP development
• Linux OS platform using IEM 'policy stack' development environment
• Leakage Power Management
• Coarse grain MTCMOS power gated CPU and USB core
• Multiple Vt library
• On-chip power-gating with PMOS header switches and retention registers - "light sleep"
• Off-chip power-gating with scan-based save/restore to RAM - "deep sleep"
• Dynamic threshold scaling
~DD 
" 
RSD92 Jb n 
URD926~ 
ARM DEVELO PERS' 
O"'lf(IHNCf '0& 
PSU PSU 
uDWSYS 
ARM'% synopsys' c:::::::::j:1 c:=J 
CPUCLK " 
timing Intertace ~ 
ARM DEVEL(;/ PERS' 
~OfI{I o....~ CO",HHNCl'Ol 
(dynamic) BUSCLK 
S:._:. "" __ 1. 
127 
PADs 
DMA 
Controll 
M I 
Eneregy efficient SOC design technology and methodology 
Power Gating Considerations
• Impact to design performance and size
• Maintain design performance while gaining power savings
  → Coarse Grain MTCMOS with PMOS header switch
• Coarse grain power switches for leakage mitigation
• Disconnect local power from global power via power gates (MTCMOS switches)
• Different switch topologies possible (columns, rings)
  → Maximize control over voltage drop (regularly spaced columns of switches)
• System overhead of state retention
• Protocol management for scheduling sleep states
• Managing depth of sleep states
• Energy impact of state save and restore
  → Intelligent reuse of existing scan structures to facilitate state saving
• Controlling wake-up
• In-rush current management with synchronization of state save and device reset
  → Closed loop analog voltage sensing, coupled with extensive PrimeRail analysis
Processor Implementation
• Favour header switches for system reasons
• Suit active-high interface protocols
• Output isolation matches external power-down (deep sleep support)
• Strained silicon offers potential for high-mobility P-channel devices
• Switch columns placed every 50µm
• Starter columns every 10 switch columns (500µm)
• Route start signal up and down "starter" columns
• State retention synthesis with Power Compiler™
• All state retained within the CPU
• IR drop analysis with PrimeRail
• IR drop analysis across switched power mesh
• Power-on sequence analysis
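The switch column pitch quoted above (a switch column every 50µm, with every tenth column a starter column) can be turned into a simple placement calculation. The sketch below is illustrative only: the block width is a hypothetical value, and placing the first column at the block edge as a starter column is an assumption rather than a detail given on the slide.

```python
SWITCH_PITCH_UM = 50   # one header switch column every 50 um
STARTER_EVERY = 10     # every 10th switch column is a starter column (500 um)


def switch_columns(block_width_um):
    """Return (x position in um, column kind) for each header switch column."""
    columns = []
    count = int(block_width_um // SWITCH_PITCH_UM) + 1
    for index in range(count):
        kind = "starter" if index % STARTER_EVERY == 0 else "regular"
        columns.append((index * SWITCH_PITCH_UM, kind))
    return columns


# Hypothetical 1200 um wide power-gated block
cols = switch_columns(1200)
starters = [x for x, kind in cols if kind == "starter"]
print(f"{len(cols)} switch columns, starter columns at {starters} um")
```

For the hypothetical 1200µm block this yields 25 columns with starter columns at 0, 500 and 1000µm.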
In-Rush Current Management
• Closed loop, sequenced power-up of the design
• Combination of regular and 'starter' switch columns
• Analog voltage-sense cell (Schmitt Trigger) for the Virtual-VDD supply rail. Generate a
"Ready" signal when start-up voltage reaches 90%
• Power-Up Sequence Analysis (PrimeRail)
• Rush current, wake-up time calculation and IR drop analysis
• What-if analysis to fine tune power-up sequence
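The reason for powering up through a small set of starter switches first can be seen with a first-order RC model of the virtual VDD rail. This is a rough sketch under stated assumptions, not the PrimeRail analysis used on the project: the rail is modelled as a single capacitance charged through the on-resistance of whichever switches are enabled, and all component values below are hypothetical.

```python
import math


def time_to_threshold_s(r_ohm, c_farad, fraction=0.9):
    """Time for an RC-charged rail to reach `fraction` of VDD: -R*C*ln(1 - fraction)."""
    return -r_ohm * c_farad * math.log(1.0 - fraction)


def peak_inrush_a(vdd, r_ohm):
    """Peak in-rush current at the instant the switches turn on (rail at 0 V)."""
    return vdd / r_ohm


VDD = 1.0               # volts (hypothetical)
RAIL_CAP_F = 10e-9      # virtual-rail capacitance (hypothetical)
R_STARTER_OHM = 10.0    # starter columns only (hypothetical)
R_ALL_OHM = 0.5         # all switch columns in parallel (hypothetical)

ready_time = time_to_threshold_s(R_STARTER_OHM, RAIL_CAP_F)
print(f"starter only: peak {peak_inrush_a(VDD, R_STARTER_OHM):.2f} A, "
      f"Ready (90%) after {ready_time * 1e9:.0f} ns")
print(f"all switches: peak {peak_inrush_a(VDD, R_ALL_OHM):.2f} A")
```

In this model the starter columns limit the peak current at the cost of a longer wake-up time, and the remaining columns are enabled only after the "Ready" threshold is reached, when the voltage across them is small.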
SALT Header Switch Columns
[Figure: SALT header switch columns. Recoverable signal labels: VDD, SLEEP, READY, START]
Determine Header Pitch
• Built a representative test circuit and ran lots of HSPICE
• Varied the load and also the number of headers
• Measured effects on signal delay, IR drop and leakage
• Layout sweet spot of 30 headers in a double height cell
• Need header cell every 50um for <5% IR drop at 250MHz
AHB Based Scan-Hibernate
[Diagram: ARM CPU scan chains, with extra flops to balance chains, connected through a Bus Master to Off Chip SDRAM]
• Bus transaction based save & restore to memory
• Bus Master implements CRC-32 on the fly
• Useful diagnostic check for "soft errors" whilst power gated
• Can be used to explore effects of in-rush current induced voltage drop
• Could also be used to restore a check point
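The save/restore check described above can be sketched in software. The example below models scan-chain state as a list of 32-bit words written to memory while a running CRC-32 is accumulated, then verified on restore. The word size, byte order and the use of the CRC-32 polynomial implemented by Python's zlib module are assumptions for illustration; the slide does not specify how the hardware bus master computes its CRC.

```python
import zlib


def save_state(scan_words):
    """Copy scan words to 'memory' and accumulate a CRC-32 on the fly."""
    memory = []
    crc = 0
    for word in scan_words:
        memory.append(word)
        crc = zlib.crc32(word.to_bytes(4, "little"), crc)
    return memory, crc


def restore_state(memory, expected_crc):
    """Read words back, recompute the CRC-32 and reject corrupted state."""
    crc = 0
    restored = []
    for word in memory:
        crc = zlib.crc32(word.to_bytes(4, "little"), crc)
        restored.append(word)
    if crc != expected_crc:
        raise ValueError("CRC mismatch: saved state was corrupted")
    return restored


state = [0xDEADBEEF, 0x00000001, 0x12345678]
memory, crc = save_state(state)
assert restore_state(memory, crc) == state
```

A single flipped bit in the stored image changes the recomputed CRC, so a restore from corrupted memory is rejected rather than silently loading bad processor state.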
Agenda
• Driving new capability into ARM IP and Synopsys' Galaxy Design Platform
ARM Power Management Kit (PMK) 
ARM Artisan™ Physical IP for ARM IEM
[Diagram labels: Physical IP (Power Management Kit, Standard Cells and Memories); PowerWise Interface; Power Management Unit]
• IEM methodology optimizes energy consumption using DVFS
• ARM® Artisan® Physical IP used for implementation in Galaxy™
• Power Management Kit to support power islands with different voltage levels
and power down
• Standard cell library and memories optimized for dynamic
and leakage power
• Standard cells and memories characterized over extended operating range,
including low voltages
Comprehensive Low Power Kit
                                                        Dynamic Power  Leakage Power
                                                        Control        Control
Clock gating standard cells                             ✓
  → no dynamic power on FFs
Voltage level shifters                                  ✓              ✓
  → lower voltages for timing uncritical blocks
Extended operating range                                ✓              ✓
  → Dynamic voltage and frequency scaling (DVFS) support
Multi VT support in standard cells                                     ✓
  → Use high Vt (low leakage) cells with reduced performance in uncritical paths
Power gates                                                            ✓
  → Shut-down of parts of the ...
Retention flip-flops                                                   ✓
  → Minimal leakage current by maintaining FF state w/o loss of performance in normal mode
Back-bias support                                                      ✓
  → Minimal leakage current for inactive or low performance blocks
Synopsys' Galaxy™ Low Power Design Platform
Galaxy™ Low Power Solution
RTL Synthesis
• Voltage aware power synthesis
• Power & Voltage aware test
• MTCMOS shutdown synthesis
Physical Implementation
• Power-aware Placement
• Low-Power CTS
• Concurrent MCMM
• MTCMOS Shutdown
Verification & Analysis
• Power aware functional verification
• Static MV compliance checks
• Full chip power analysis
• Dynamic IR Drop & EM analysis
• In-rush current analysis
Multi-Vt Leakage Optimization
[Figure: Leakage Current versus Gate Delay]
Meet timing with lowest % of Low VT cells - It's all about the quality of
timing optimizations
Supporting MTCMOS Power Gating
[Flow: RTL to GDSII]
• Power gating representation
• Library, RTL, constraints modeling
• Optimization
• Power gating representation understood throughout flow
• Leakage optimization
• Power gating verification
• HDL syntax checker
• Functional verification
• Formal verification
Implementation of MTCMOS Power Gating
• Inference of power intent from RTL
• Specification of power domains
• Mapping to a state retention style
• Always-on network synthesis
• Multi-voltage design planning
• Power gating switch cell placement
• Sleep net and always-on net buffering
• Power domain aware placement
• Multi-voltage placement optimization
• Rush current management
• Sleep/Wake timing protocol
• Static and dynamic IR drop analysis
• Multi-voltage design integrity
State Retention Synthesis
Flow: RTL / Gate → Map Retention Register → Hookup Power Gating Ports → MW Design with State Retention

Power Compiler Script Sample:
set target_library "retention.db"
set power_enable_power_gating true
read_verilog RTL.v
set_power_gating_style -type DRFF
compile
hookup_power_gating_ports
report_power_gating_style
Predictable Power Synthesis
• Dynamic & Leakage Power Correlation
• Intelligent Clock-tree Power Estimation
• Optimized for ICC Power-aware Placement and CTS
Power QoR during implementation is not enough, it must
correlate @ signoff
Predictable Power Network Design
[Figure labels: Low Power Design; IR-Drop Correlation; PNA: 150mV]
PrimeRail Power-Up Analysis Flow 
• Power management cell modeling 
• I-V curve characterization w/ HSPICE
• Power-up sequence description
• Power net extraction and net merge
• Extraction for both PG & virtual PG nets
• Net merge with PM cell circuit model
• Rail analysis
• Rush current, wake-up time calculation
• Voltage drop analysis with rush current 
• What-if analysis 
Power-up sequence Vs. number of PM cells, peak 
current and peak voltage drop 
[Figure panels: PM Cell Modeling; Rail Analysis (Rush Current, Wake-up Time)]
Power Specification @ RTL 
• Modeling Logical Power Domains @ RTL
• Simulate power down of a block
• Infer physical information such as power rail connection
• Infer consistent constraints as early as possible.
• Reason & Verify possible states of power networks.
• Infer sub-circuit paths with targeted cell types to model "always on" structures
• Manage this intent through the entire flow.
Synopsys Donates Technology for Low Power Design to Accellera Standards Organization
Open Power Standards 
Summary 
• By working through a real comprehensive technology demonstrator we
are now in an excellent position to:
• Quantify the benefits of various leakage mitigation techniques in ARM based
systems.
• Enhance ARM's Intelligent Energy Manager technology with support for
leakage mitigation.
• Drive increased automation into Synopsys' Galaxy Design Platform for the
rapid implementation and deployment of high performance with low power
ARM based system chips
• Provide a comprehensive low power physical IP portfolio through ARM's
Power Management Kit
• Define a set of best practices that provide a complete low power design
solution based on open industry standards
Aggressive Leakage Management in ARM Based Systems 
John Biggs - ARM 
Alan Gibbons - Synopsys 
ABSTRACT 
The management of power consumption for battery life is widely considered to be the limiting
factor in supporting the concurrent operation of high performance, complex applications on
mobile platforms. At 65nm and below, minimizing the static power dissipation through
aggressive techniques such as coarse grain MTCMOS power gating and threshold voltage
scaling can yield the significant reductions in power consumption that are necessary. ARM and
Synopsys have jointly developed a comprehensive low power technology demonstrator that
employs these advanced low power techniques. Various alternative approaches to MTCMOS
power gating and threshold voltage scaling are discussed together with a detailed description of
the implementation flow and the results.
NOTE : 
Although this was presented by a colleague in the research and development group, John Biggs, the
primary content for the State Retention and Power Gating was provided by the Author (see the 
acknowledgment in Section 5.) 
One of the Engineering Doctorate programme goals is to mentor and encourage others to present the R&D 
work to wider audiences. 
In follow on, the Author co-presented a Tutorial (MC4) with Alan Gibbons at the San Jose Synopsys Users 
Group conference on April 2, 2007. 
Table of Contents
1 Introduction
1.1 Power Dissipation
1.2 Dynamic Power
1.3 Leakage Power
1.4 Leakage Power Mitigation Techniques
2 Synopsys ARM Leakage Technology Demonstrator
2.1 SALT Design
2.2 SALT Library
2.3 SALT Implementation
3 Key Implementation Challenges
3.1 Power Gating
3.2 In-Rush Current Management
3.3 State Retention
3.4 Variable Threshold CMOS (VTCMOS)
4 Conclusions and Future Work
5 Acknowledgments
6 References
Table of Figures
Figure 1 - Trends in Power Dissipation [3]
Figure 2 - Components of leakage current in an MOS transistor
Figure 3 - Fine Grain Power Gating
Figure 4 - Coarse Grain Power Gating
Figure 5 - SALT Architecture
Figure 6 - Leakage Current vs. Gate Width and Length (TSMC90G)
Figure 7 - SALT926 CPU Floor Plan Showing Power Gates in Columns
Figure 8 - Conceptual Representation of In-Rush Current Management Circuit
Figure 9 - Soft Start
Figure 10 - PMK Retention Register
Figure 11 - Scan Hibernate
SNUG Boston 2006
1 Introduction
Managing the power dissipation of complex SoCs has been a prime design consideration for
some years now. It is well understood that reducing both the peak and average power
consumption will reduce the manufacturing and packaging costs as well as improve the
reliability and battery life.
However, the thriving market for ever more sophisticated mobile wireless devices such as cell
phones, media players, PDAs and cameras is placing ever increasing demands on the battery.
Consumers want more and more features in their mobile devices but still demand a convenient
form factor and long battery life. Unfortunately battery technology is not developing fast enough
to meet this demand and this shortfall is what is driving the demand for cheap, low power,
energy efficient SoCs.
1.1 Power Dissipation 
There are three major sources of power dissipation in digital CMOS circuits and they can be
broken down into dynamic power dissipation (P_switching + P_short-circuit) and leakage power
dissipation (P_leakage) as summarized by equation (1).
$$P_{average} = \underbrace{\alpha C_L V_{DD}^2 f_{clk}}_{P_{switching}} + \underbrace{V_{DD} I_{SC}}_{P_{short\text{-}circuit}} + \underbrace{V_{DD} I_{leak}}_{P_{leakage}} \qquad (1)$$
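As an illustration of equation (1), the three terms can be evaluated numerically. The following is a minimal Python sketch; the function name and every operating-point value (activity, capacitance, currents) are hypothetical figures chosen only to show the arithmetic, not measurements from the SALT design.

```python
def average_power(alpha, c_load, v_dd, f_clk, i_sc, i_leak):
    """Evaluate the three terms of equation (1).

    alpha  : switching activity (0..1)
    c_load : switched capacitance in farads
    v_dd   : supply voltage in volts
    f_clk  : clock frequency in hertz
    i_sc   : average short-circuit current in amps
    i_leak : leakage current in amps
    Returns (p_switching, p_short_circuit, p_leakage, p_total) in watts.
    """
    p_switching = alpha * c_load * v_dd ** 2 * f_clk
    p_short_circuit = v_dd * i_sc
    p_leakage = v_dd * i_leak
    p_total = p_switching + p_short_circuit + p_leakage
    return p_switching, p_short_circuit, p_leakage, p_total


# Hypothetical operating point (assumed values, for illustration only).
p_sw, p_sc, p_lk, p_total = average_power(
    alpha=0.15, c_load=1e-9, v_dd=1.0, f_clk=250e6, i_sc=2e-3, i_leak=5e-3
)
print(f"switching={p_sw:.4f} W, short-circuit={p_sc:.4f} W, "
      f"leakage={p_lk:.4f} W, total={p_total:.4f} W")
```

Only the switching term depends on the clock frequency and the activity, which is why the techniques in the next section target those two factors and the supply voltage.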
1.2 Dynamic Power 
In order to minimise the dynamic power dissipation term of equation (1), not only should the
clock frequency (f_clk) be lowered but also the switching activity (α) and, where possible, the
supply voltage (V_DD) should be reduced too.
One of the simplest ways to reduce the switching activity (α) is to inhibit registers from being
clocked when it is known that their output will remain unchanged. In a typical SoC as much as
30% of the switching power is dissipated in the clock tree so this technique, known as Clock
Gating (CG), can yield a significant saving in both power dissipation and energy consumption [1].
As power is the rate of doing work then the average power dissipation of a system can be
reduced by slowing the rate at which work is done. In practice this means lowering the clock
frequency (f_clk) when the maximum system performance is not required. This technique, known
as Dynamic Frequency Scaling (DFS), leads to a linear reduction in average power dissipation
but unfortunately does not reduce the energy consumption for a given task as the work done
remains a constant. For some very "leaky" processes, the total energy consumption may in fact
increase due to spending longer in active mode.
However, if at the same time as reducing the clock frequency, the voltage is also reduced to a
level that is just high enough to support this lowered clock frequency, then there is less work to
do in charging the internal capacitances to the supply voltage (V_DD) and so less energy is
consumed. This technique, known as Dynamic Voltage and Frequency Scaling (DVFS), leads to
a quadratic reduction in energy consumption and a cubic reduction in average power
dissipation [2]. It should be noted that, as it is not possible to dynamically scale the voltage and
frequency instantaneously, there is some energy overhead in moving between the various
performance levels.
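The difference between DFS and DVFS can be checked with a small calculation. The Python sketch below is an idealised model, not the authors' analysis: it assumes the supply voltage can be scaled in direct proportion to the clock frequency, treats the leakage current as a constant, and uses hypothetical values for activity, capacitance and cycle count.

```python
def task_metrics(v_dd, f_clk, cycles, alpha=0.15, c_load=1e-9, i_leak=0.0):
    """Dynamic power and per-task energy for a fixed number of clock cycles.

    All default parameter values are hypothetical. Leakage current is held
    constant for simplicity; in practice it also varies with voltage.
    """
    run_time = cycles / f_clk                        # seconds to finish the task
    power_dyn = alpha * c_load * v_dd ** 2 * f_clk   # switching term of equation (1)
    return {
        "run_time": run_time,
        "power_dyn": power_dyn,
        "energy_dyn": power_dyn * run_time,          # independent of f_clk
        "energy_leak": v_dd * i_leak * run_time,     # grows as the task runs longer
    }


CYCLES = 1e9
baseline = task_metrics(v_dd=1.0, f_clk=250e6, cycles=CYCLES, i_leak=5e-3)
dfs = task_metrics(v_dd=1.0, f_clk=125e6, cycles=CYCLES, i_leak=5e-3)   # half frequency
dvfs = task_metrics(v_dd=0.5, f_clk=125e6, cycles=CYCLES, i_leak=5e-3)  # half frequency and voltage

for name, m in (("baseline", baseline), ("DFS", dfs), ("DVFS", dvfs)):
    print(f"{name:8s} P_dyn={m['power_dyn']:.4f} W  "
          f"E_dyn={m['energy_dyn']:.4f} J  E_leak={m['energy_leak']:.4f} J")
```

Under these assumptions, halving the frequency alone halves the dynamic power but leaves the dynamic energy unchanged while doubling the leakage energy, whereas halving both voltage and frequency gives a quarter of the dynamic energy and an eighth of the dynamic power.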
1.3 Leakage Power 
The other source of power dissipation is leakage power which is predominantly due to the fact
that transistors are not perfect switches and so can never be completely turned off.
Although leakage power used to be considered insignificant when compared to dynamic power,
at 90nm it has become significant and at 65nm it is dominant, and so can no longer be ignored.
Figure 1 - Trends in Power Dissipation [3]
Leakage power is dissipated in both active mode and standby mode and the currents which go to
make up the total leakage are increasing fast (Figure 1). In some applications, it may be more
energy efficient to run fast and stop rather than to lower the voltage and frequency due to the
high active leakage currents.
There are four main sources of leakage currents in a CMOS transistor (Figure 2):
1. Sub-threshold Leakage (I_SUB): the current which flows from the drain to the source
of a transistor operating in the weak inversion region.
2. Gate Leakage (I_GATE): the current which flows directly from the gate through the oxide to
the substrate due to gate oxide tunnelling and hot carrier injection.
3. Gate Induced Drain Leakage (I_GIDL): the current which flows from the drain to the
substrate induced by a high field effect in the MOSFET drain caused by a high V_DG.
4. Reverse Bias Junction Leakage (I_REV): caused by minority carrier drift and generation of
electron/hole pairs in the depletion regions.
$$I_{LEAK} = I_{SUB} + I_{GATE} + I_{GIDL} + I_{REV} \qquad (2)$$
Figure 2 - Components of leakage current in an NMOS transistor
Of the various components which go to make up the total leakage current (I_LEAK) it is currently
the sub-threshold leakage (I_SUB) which is dominant. However, the gate leakage (I_GATE) is
becoming significant but may yet be mitigated by high K dielectric material such as TiO2 and
Ta2O5 [4].
The most effective techniques for mitigating sub-threshold leakage are Power Gating and
VTCMOS, both of which will be described later.
1.4 Leakage Power Mitigation Techniques
There are a number of leakage mitigation techniques available to reduce the various leakage
currents in both active and standby modes [5]. Some techniques such as Dual VT and VTCMOS
rely on additional support in the manufacturing process to lower the leakage whilst others such
as Power Gating and Stack Effect are stand-alone circuit techniques.
1.4.1 Lower V_DD
Again by referring to equation (1), it can be seen that leakage power will
reduce with the lowering of the supply voltage (V_DD). However, any
reduction in V_DD also reduces V_GS which impacts the MOSFET gate drive
(V_GS - V_T). It can be seen from equation (3) that a reduction in (V_GS - V_T)
significantly reduces the MOSFET's drive strength (I_DS).
(3) 
Some of this loss in performance can be regained by lowering the threshold voltage (V_T) to
restore the loss in gate drive, however lowering the threshold voltage (V_T) results in an
exponential increase in the sub-threshold leakage current (I_SUB) and hence overall the leakage
power increases - see equation (4). So, in order to manage the overall leakage power, the number
of high leakage low V_T transistors should be kept to a minimum.
$$I_{SUB} = \mu C_{ox} V_{th}^2 \frac{W}{L} \, e^{\frac{V_{GS} - V_T + \eta V_{DS}}{n V_{th}}} \left(1 - e^{-\frac{V_{DS}}{V_{th}}}\right) \qquad (4)$$
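The exponential dependence of sub-threshold leakage on the threshold voltage can be illustrated with a deliberately simplified model. The Python sketch below does not evaluate the full expression of equation (4); it assumes only that leakage changes by one decade per sub-threshold swing S, and the default swing of 100 mV/decade is an assumed, illustrative value rather than a TSMC90G figure.

```python
def leakage_ratio(delta_vt_mv, swing_mv_per_decade=100.0):
    """Multiplicative change in sub-threshold leakage for a threshold shift.

    delta_vt_mv         : change in threshold voltage in mV (negative = lower V_T)
    swing_mv_per_decade : assumed sub-threshold swing S (hypothetical value)
    Simplified model: I_SUB is proportional to 10 ** (-V_T / S).
    """
    return 10 ** (-delta_vt_mv / swing_mv_per_decade)


for shift in (-100.0, -50.0, 0.0, 50.0, 100.0):
    print(f"V_T shift {shift:+6.0f} mV -> leakage x {leakage_ratio(shift):.3f}")
```

With this assumed swing, lowering V_T by 100 mV costs a tenfold increase in leakage, which is the reason for keeping the number of low V_T transistors to a minimum.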
1.4.2 Dual VT 
It is now quite common to use a "Dual VT" flow during synthesis to ensure
that the total number of low V_T transistors is kept to a minimum by only
deploying low V_T cells when required.
This usually involves an initial synthesis targeting a prime library in the
conventional manner followed by an optimization step targeting one (or
more) additional libraries with differing thresholds [6].
As more often than not there is a minimum performance which must be met
before optimizing power then in practice this usually means targeting the high performance, high
leakage library first and then relaxing back any cells not on the critical path by swapping them
for their lower performing, lower leakage equivalents.
If however minimizing leakage is more important than achieving a minimum performance then
this process can be done the other way around by targeting the low leakage library first and then
swapping in higher performing, high leakage equivalents in speed critical areas.
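The "relax back" step of the performance-first flow can be sketched as follows. This Python example is a toy model and not the synthesis tool's algorithm: it assumes each cell has an independent timing slack (real tools re-time whole paths after every swap), and the cell names, slacks, delay penalties and leakage figures are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    name: str
    slack_ps: float        # timing slack available at this cell (hypothetical)
    hvt_penalty_ps: float  # extra delay if swapped to the high-V_T equivalent
    lvt_leak_nw: float     # leakage of the low-V_T version
    hvt_leak_nw: float     # leakage of the high-V_T version
    vt: str = "LVT"        # start from the high performance, high leakage library


def relax_to_hvt(cells):
    """Swap every cell whose slack covers the delay penalty to high V_T.

    Simplification: slacks are treated as independent per cell.
    Returns the total leakage in nW after the swaps.
    """
    for cell in cells:
        if cell.slack_ps >= cell.hvt_penalty_ps:
            cell.vt = "HVT"
    return sum(c.hvt_leak_nw if c.vt == "HVT" else c.lvt_leak_nw for c in cells)


netlist = [
    Cell("U1", slack_ps=0.0, hvt_penalty_ps=20.0, lvt_leak_nw=100.0, hvt_leak_nw=10.0),
    Cell("U2", slack_ps=50.0, hvt_penalty_ps=20.0, lvt_leak_nw=100.0, hvt_leak_nw=10.0),
    Cell("U3", slack_ps=15.0, hvt_penalty_ps=20.0, lvt_leak_nw=80.0, hvt_leak_nw=8.0),
]
total_leak_nw = relax_to_hvt(netlist)
print({c.name: c.vt for c in netlist}, total_leak_nw)
```

Only the cell with enough slack (U2 in this made-up netlist) is swapped, so the critical-path cells keep their low V_T and the leakage saving comes entirely from non-critical logic.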
1.4.3 Power Gating
A far more aggressive and effective technique for leakage mitigation is to
simply cut the power supply to any inactive transistor.
Fundamentally, this is done by placing switches in the power network, the
ground network, or both. However, the exact placement and sizing of these
switches must be done with great care so as not to have an adverse impact
on performance. These switches are known as "power gates" and can be
distributed throughout the power/ground network in either a "coarse grain"
or a "fine grain" manner.
Fine Grain power gating is when the switch is placed locally inside every standard cell in the
library (Figure 3). Since this switch must supply worst case current required by the cell, it has to
be quite large in order to not impact performance. In order to keep this area overhead to a
minimum, fine grain power gates are usually implemented as "footer" switches in the ground as
NMOS transistors have a lower on-resistance than PMOS and so will be smaller.
Figure 3 - Fine Grain Power Gating
Although the area overhead of each cell is quite large (often 2x-4x the size of the original cell),
overall the area overhead of fine grain power gating will be much less as it is only necessary to
power gate the high leakage, low threshold cells.
As not all cells are power gated and some will remain powered, it is important to ensure that the
inputs to these cells do not float in order to avoid crowbar currents. This means that every power
gated cell must have additional circuitry to "clamp" its outputs to a valid CMOS logic level. In
the case of fine grain power gates this means adding a weak PMOS pull up on each output
(Figure 3).
The key advantage of fine grain power gating is that the timing impact of the IR drop across the
switch and the behavior of the clamp is easy to characterize as they are contained within the cell.
This means that it is still possible to use a traditional design flow to deploy fine grain power
gating although care must be taken over the routing of the sleep signal. However, the larger
footprint of the power gated cells means that swapping between high threshold (non power
gated) and low threshold (power gated) cells is more complex than that of the traditional Dual VT
flow.
Coarse Grain power gating is when the switch is placed such that it is shared amongst a number
of cells (Figure 4). The sizing of a coarse grain switch is much more difficult than a fine grain
switch as the exact switching activity of the logic it supplies is not known and can only be
estimated. Also, it is common to have distributed coarse grain power gating where the outputs of
all the switches are joined to create a "virtual" power or ground. This just complicates the
switch sizing calculations still further as each power gated cell is in fact fed by a number of
switches connected in parallel.
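A first-order view of why the parallel connection matters for sizing can be sketched with a lumped model. The Python example below is an assumption-laden simplification: it treats all switches as identical resistors sharing the load current equally and ignores the resistance of the power mesh itself, and the current, on-resistance and drop budget are hypothetical values. Units are chosen so that mA x ohm gives mV directly.

```python
import math


def virtual_rail_drop_mv(i_load_ma, r_on_ohm, n_switches):
    """IR drop across n identical switches in parallel (lumped model).

    i_load_ma  : total block current in mA (estimated, not known exactly)
    r_on_ohm   : on-resistance of a single switch in ohms
    n_switches : number of switches feeding the virtual rail
    """
    return i_load_ma * r_on_ohm / n_switches


def switches_needed(i_load_ma, r_on_ohm, max_drop_mv):
    """Smallest number of parallel switches that keeps the drop within budget."""
    return math.ceil(i_load_ma * r_on_ohm / max_drop_mv)


# Hypothetical block: 100 mA estimated peak current, 50 ohm switches, 50 mV budget.
n_min = switches_needed(i_load_ma=100, r_on_ohm=50, max_drop_mv=50)
print(n_min, virtual_rail_drop_mv(100, 50, n_min))
```

Because the answer scales directly with the estimated load current, any error in the switching activity estimate carries straight through to the switch count, which is the sizing difficulty described above.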
Figure 4 - Coarse Grain Power Gating
The size of a coarse grain switch will be much less than the sum of the equivalent fine grain
switches of the logic it supplies. This is because for a given block of logic the switching activity
will not only be far less than 100% but, due to the propagation delay through the cells, the
switching activity will also be distributed in time. As coarse grain power gating switches do not
have the same area overhead as fine grain it is possible to use the slightly larger PMOS "header"
switch in the power supply instead. This not only has the advantage of a common ground plane
but also means that the outputs of power gated blocks can be clamped to this common ground,
which is convenient for multi voltage design. Also, with coarse grain power gating, not as many
clamps are needed as they are only required at the block outputs rather than on every cell.
Unlike fine grain power gating, when the power is switched in coarse grain power gating, the
power is disconnected from all logic, including the registers, resulting in the loss of all state. If
state is to be preserved whilst the power is disconnected then it must be stored somewhere which
is not power gated. Most commonly this is done locally to the registers by swapping in special
"retention" registers which have an extra storage node that is separately powered. There are a
number of retention register designs which trade off performance against area. Some use the
existing slave latch as the storage node whilst others add an additional "balloon" latch storage
node. However, they all require one or more extra control signals to save and restore the state [7].
The key advantage of retention registers is that they are simple to use and are very quick to save
and restore state. This means that they have a relatively low energy cost of entering and leaving
standby mode and so are often used to implement "light sleep". However in order to minimize
the leakage power of these retention registers during standby, it is important that the storage
node and associated control signal buffering is implemented using high threshold low leakage
transistors.
If very low standby leakage is required then it is possible to store the state in main memory and
cut the power to all logic including the retention registers. However, this technique is more
complex to implement and also takes much longer to save and restore state. This means that it
has a higher energy cost of entering and leaving standby mode and so is more likely to be used to
implement "deep sleep".
One of the key challenges in power gating is managing the in-rush current when the power is
reconnected. This in-rush current must be carefully controlled in order to avoid excessive IR
drop in the power network as this could result in the collapse of the main power supply and loss
of the retained state.
In summary, although fine grain power gating is easier to implement, it has the disadvantage of
requiring a completely new cell library with the integrated power gates which have a significant
area impact. Coarse grain power gating on the other hand is more complex to implement and
verify [7]. It may require special tooling but has the advantage of less area overhead and only
requires the addition of retention registers, isolation clamps and power gates to the library.
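The trade-off between the entry/exit energy cost and the standby leakage of the different sleep styles can be made concrete with a break-even calculation. The Python sketch below uses invented numbers throughout (overhead energies in nJ, standby powers in µW, so that µW x ms gives nJ); it only illustrates the reasoning that a cheap-to-enter mode suits short idle periods and a low-leakage mode suits long ones.

```python
# (entry + exit energy overhead in nJ, standby power in uW); all values hypothetical.
MODES = {
    "halt": (0.0, 500.0),             # clocks stopped, full leakage
    "light_sleep": (1000.0, 100.0),   # retention registers: quick to save/restore
    "deep_sleep": (50000.0, 10.0),    # state saved to memory: slow, lowest leakage
}


def idle_energy_nj(mode, idle_ms):
    """Total energy spent over an idle period of idle_ms in the given mode."""
    overhead_nj, standby_uw = MODES[mode]
    return overhead_nj + standby_uw * idle_ms   # uW * ms = nJ


def best_mode(idle_ms):
    """Mode with the lowest total energy for the expected idle time."""
    return min(MODES, key=lambda mode: idle_energy_nj(mode, idle_ms))


for idle in (1, 10, 1000):
    print(f"idle {idle:5d} ms -> {best_mode(idle)}")
```

With these assumed figures the preferred mode moves from halt to light sleep to deep sleep as the expected idle time grows, which matches the qualitative argument in the text.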
1.4.4 VTCMOS 
Variable Threshold CMOS (VTCMOS) is another very effective way of
mitigating standby leakage power. By taking advantage of the body effect and
reverse biasing the substrate, it is possible to reduce the standby leakage by up
to three orders of magnitude. However, VTCMOS adds complexity to the
library views and requires two additional power networks to separately control
the voltage applied to the wells. Unfortunately, the effectiveness of reverse
body bias has been shown to be decreasing with scaling technologyl']
1.4.5 Stack Effect 
The Stack Effect, or self reverse bias, can help to reduce sub-threshold leakage
when more than one transistor in the stack is turned off. This is primarily
because the small amount of sub-threshold leakage causes the intermediate
nodes between the stacked transistors to float away from the power/ground rail.
This results in a slightly negative gate-source voltage (which reduces the
sub-threshold leakage) as well as a reduced drain-source potential (less DIBL)
which, together with the reduced body-source potential (more body effect),
increases the threshold, again lowering leakage. The leakage of a two transistor stack has been
shown to be an order of magnitude less than that of a single transistor [10]. Also this stacking
effect makes the leakage of a logic gate highly dependent on its inputs and so there is a minimum
leakage state for a particular circuit which could be applied just prior to halting the clocks.
1.4.6 Long Channel Devices
Using non-minimum length channels will reduce the active leakage as well as
standby leakage by avoiding the V_T roll off that occurs in short channel devices.
Unfortunately, long channel devices have larger area and therefore greater gate
capacitance which has an adverse effect on performance and dynamic power
consumption. This means that there may not be a reduction in total power
dissipation unless the switching activity of the long channels is low. Therefore,
switching activity must be taken into account when choosing gates whose
transistor lengths are to be increased. However, the properties of long channel
devices make them very suitable for the implementation of power gates.
2 Synopsys ARM Leakage Technology Demonstrator 
Synopsys and ARM have a long history of working together on lowering the barriers to the 
adoption of advanced methodologies for the rapid deployment of ARM synthesizable IP with
Synopsys toolst11t)1t 11 It I2It 131. 
The Synopsys ARM Leakage Technology demonstrator known as "SALT" was an R&D
collaboration implemented in TSMC90G to explore the practical details of implementing some
of the more aggressive leakage mitigation techniques described above. Specifically we chose to
implement Coarse Grain Power Gating together with Dual VT and VTCMOS as these techniques
are the most effective at combating standby leakage power dissipation in the 90nm node.
2.1 SALT Design
The design of the SALT technology demonstrator was based on an established ARM926EJS
reference systemt)1 with the addition of a prototype next generation Intelligent Energy Controller
("IEC") for leakage control and a Synopsys DesignWare OTG PHY (Figure 5). The
ARM926EJS was partitioned into two voltage domains to allow the RAMs to remain powered
whilst the core logic was switched off. The design also implemented in-rush current
management with a "soft-start" to avoid any adverse IR drop in the power supply during start up.
The SALT design has support for four levels of standby leakage power management:
1. Halt - simple stopping of the clocks.
2. Light Sleep - the CPU is power gated and the state retained in retention registers.
3. Deep Sleep - the CPU is switched off and the state retained in RAM.
4. Shutdown - both CPU and RAM are switched off so the state is not retained.
The sequencing of the various control signals for entering and leaving these sleep modes is
managed by the Intelligent Energy Controller ("IEC").
The implementation of Deep Sleep uses a novel scan based technique together with a dedicated
AMBA bus master to store the state in any AHB connected memory. This will be described in
more detail later.
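The four standby levels listed above can be summarised as a small lookup table. The Python sketch below is only a reading aid: the class and field names are hypothetical, it is not a model of the IEC, and the assumption that the RAM stays powered in Halt and Light Sleep is inferred from the two-domain partitioning rather than stated explicitly for each mode.

```python
from dataclasses import dataclass
from enum import Enum


@dataclass(frozen=True)
class ModeInfo:
    clocks_running: bool
    cpu_powered: bool
    ram_powered: bool
    state_held_in: str  # where the CPU state lives while in this mode


class SleepMode(Enum):
    HALT = ModeInfo(False, True, True, "cpu registers")
    LIGHT_SLEEP = ModeInfo(False, False, True, "retention registers")
    DEEP_SLEEP = ModeInfo(False, False, True, "ram")
    SHUTDOWN = ModeInfo(False, False, False, "lost")


def state_retained(mode):
    """True if the CPU state survives a stay in this mode."""
    return mode.value.state_held_in != "lost"


for mode in SleepMode:
    info = mode.value
    print(f"{mode.name:11s} cpu_powered={info.cpu_powered!s:5s} "
          f"ram_powered={info.ram_powered!s:5s} state: {info.state_held_in}")
```

Laid out this way, the only mode that loses state is Shutdown, and Light Sleep and Deep Sleep differ only in where the state is kept.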
Figure 5 - SALT Architecture
2.2 SALT Library 
The SALT technology demonstrator targeted an experimental "R&D" library based on Artisan's
SAGE-X standard cell library in TSMC90G process. In order to support VTCMOS it was
necessary to target a triple well process and add deep nwell to each cell. Also it was decided to
add an extra 10th track supplying true V_DD to the top of each cell in the library in order to
simplify the distribution of the un-switched power to the "always on" buffers and retention
registers. In addition to these modifications, a power management kit consisting of the following
cells was also created, drawn to the same standard cell rules:
• Power gates to disconnect the power from the logic.
• Isolation Clamps to preserve CMOS logic levels on the power gated outputs.
• Always On Buffers to drive power management signals, clocks and reset.
• Retention Registers to retain the state whilst power gated.
• Schmitt Trigger for in-rush current management.
• Well Ties and Deep nwell End Caps for VTCMOS support.
The power gates were implemented as PMOS "headers" in order to have a common ground
plane and also so active high power gated outputs get clamped inactive when isolated.
To use the deep nwell layer, it was necessary to also create a set of physical only deep nwell
capping cells. These must be placed around the standard cell region to ensure that there is
sufficient nwell overlap of deep nwell at the ends of each standard cell row to meet the design
rules.
Finally this new "R&D" experimental library was recharacterised at lower voltage to account for
the estimated IR drop across the PMOS "header" switches.
This collection of additional cells became known as a "Power Management Kit" and formed the
basis of a prototype library which has now been productized (without the 10th track!) as ARM's
PMK.
2.3 SALT Implementation 
The implementation employed the 2005.09 XG Galaxy design platform and the flow was largely
based on the ARM Synopsys IEM Reference Methodology'] which extends the standard ARM
Synopsys Galaxy Reference methodology to have support for multiple voltage domains and dual
VT. The only additional functionality that was required over and above this flow was the ability
to perform state retention synthesis, size and place the power gates, implement the in-rush
current management circuitry and add deep nwell capping cells.
Although there was full support for state retention synthesis in the tools, the placement of the 
power gates, implementation of the in-rush current management and support for deep nwell were 
all somewhat manual steps. 
In order to minimize the impact on the tools and flow it was decided to implement these power 
management cells as "physical only" cells which could be placed in Jupiter during the floor
planning stage. This will be described in more detail later. 
3 Key Implementation Challenges
3.1 Power Gating
In coarse grain power gating there is a clear trade off between the size, number and spacing of
the switches: simplistically, the fewer there are the bigger they need to be. However, it is not
quite as simple as that as some subtle short channel effects come into play. For example,
increasing the gate length by a small percentage can significantly reduce the leakage current and
the leakage per unit width generally goes down as the transistor width is reduced (Figure 6).
After much simulation it was decided that a switch transistor of width 0.55µm and length 0.13µm
(TSMC90G) provided a good trade-off between area and the I_ON to I_OFF ratio and so the power
gates were built using multiple transistors of this size in parallel.
[Figure panels: Normalized leakage vs. W (channel width, µm); PMOS leakage vs. L (channel length, µm)]
Figure 6 - Leakage Current vs. Gate Width and Length (TSMC90G)
For several reasons, not least layout convenience, the number of transistors in a power gate was
chosen to be 30. With the on-resistance (RON) of each power gate thus fixed, the spacing could
be determined. This was again done by running many HSPICE simulations on a representative test
circuit, varying the number of headers and the load that they were supplying. The effects on
signal delay, IR drop and leakage were then measured. It was found that a power gate was required
approximately every 50µm in order to have less than a 5% IR drop in the switched power supply
at 250MHz.
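As a rough illustration of why fixing the transistor count also fixes RON, the sketch below treats one power gate as 30 identical 0.55µm transistors in parallel. The per-transistor on-resistance and the load current used in the example are hypothetical placeholders, not values taken from the SALT design or the TSMC90G library.

```python
# Sketch only: R_UNIT_OHMS is an assumed value chosen for illustration.
N_TRANSISTORS = 30      # transistors per power gate (from the text)
UNIT_WIDTH_UM = 0.55    # width of each switch transistor (from the text)
R_UNIT_OHMS = 6000.0    # hypothetical on-resistance of one transistor


def power_gate_width_um(n=N_TRANSISTORS, unit_width_um=UNIT_WIDTH_UM):
    """Total switch width of one power gate."""
    return n * unit_width_um


def power_gate_r_on(n=N_TRANSISTORS, r_unit=R_UNIT_OHMS):
    """On-resistance of n identical transistors connected in parallel."""
    return r_unit / n


def ir_drop_v(load_current_a, n_gates, r_on=None):
    """IR drop across n_gates power gates that share the load equally."""
    if r_on is None:
        r_on = power_gate_r_on()
    return load_current_a * r_on / n_gates
```

Under these assumed numbers each power gate has 16.5µm of switch width, and adding more gates (closer spacing) lowers the drop in direct proportion, which is the relationship the HSPICE sweeps explored with real device models.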
The power gates were laid out as double height cells in such a way that they would easily stack
in columns with all the vertical connectivity done by abutment. This meant there was enough
room to integrate all the necessary re-buffering of the control signals. A script was then written
to place these power gates in columns every 50µm throughout the VCPU placement region (the
40 or so columns of header cells can be clearly seen in Figure 7).
Figure 7 - SALT926 CPU Floor Plan Showing Power Gates in Columns
Once the power gate network was sized and placed, extensive PrimeRail analysis was performed
to verify the IR drop from the pads through the VDD mesh, and across the power gates. It was
found to be 18mV, well within the 50mV budget (assuming a 20% switching activity).
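The 50mV budget is consistent with the 5% IR drop target if the nominal supply is 1.0V. That supply value is an assumption here, as the text does not state it explicitly. The check can be sketched as:

```python
def ir_drop_budget_mv(vdd_v, max_fraction):
    """IR drop budget in millivolts for a given supply and allowed fraction."""
    return vdd_v * max_fraction * 1000.0


def ir_drop_margin_mv(measured_mv, budget_mv):
    """Remaining headroom; a negative value means the budget is exceeded."""
    return budget_mv - measured_mv


budget_mv = ir_drop_budget_mv(1.0, 0.05)        # assumed 1.0 V nominal supply
margin_mv = ir_drop_margin_mv(18.0, budget_mv)  # 18 mV from the PrimeRail run
```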
3.2 In-Rush Current Management
The "soft start" was implemented by building two networks of power gates, a daisy chain of
weak "starter" power gates and a network of full power gates. These were then linked by a
Schmitt trigger which senses the level of the switched "virtual" VDD and, when the level reaches
approximately 90% of the un-switched "true" VDD, it engages the main power gate network and
asserts a "ready" signal (Figure 8). The Schmitt trigger cell in the R&D experimental library also
had an integrated AND function to gate in the SLEEP signal to ensure that the READY signal is
de-asserted as soon as SLEEP is asserted without having to wait for the virtual VDD network to
discharge.
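The control behaviour described above can be captured in a small behavioural model. This is a sketch rather than the actual cell: the class name, the lower hysteresis threshold and the choice to drive the main network directly from READY are all assumptions made for illustration.

```python
class SoftStartController:
    """Behavioural model: starter gates first, main gates once VVDD nears VDD."""

    def __init__(self, vdd=1.0, rise_fraction=0.9, fall_fraction=0.7):
        self.rise_threshold = vdd * rise_fraction  # ~90% of true VDD (from the text)
        self.fall_threshold = vdd * fall_fraction  # hypothetical hysteresis level
        self._schmitt_high = False

    def update(self, sleep, vvdd):
        """Return (starter_on, main_on, ready) for the current inputs."""
        # Schmitt trigger with hysteresis sensing the switched "virtual" rail.
        if vvdd >= self.rise_threshold:
            self._schmitt_high = True
        elif vvdd <= self.fall_threshold:
            self._schmitt_high = False
        # Integrated AND: SLEEP forces READY low without waiting for discharge.
        ready = self._schmitt_high and not sleep
        starter_on = not sleep
        main_on = ready  # assumption: main network follows READY
        return starter_on, main_on, ready
```

Stepping the model through a wake-up shows the starters turning on as soon as SLEEP is de-asserted, with the main network and READY following only once the virtual rail has crossed the upper threshold.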
[Figure 8 schematic labels: STARTER and MAIN power gate networks, VDD, SLEEP, READY]
Figure 8 - Conceptual Representation of In-Rush Current Management Circuit
The whole circuit was simulated using NanoSim to verify the in-rush current and switch on
times. It was found that the maximum in-rush current was no more than 80mA and it took just
under 100ns from de-asserting SLEEP to bring the switched "virtual" VDD up to operating
voltage and for the Schmitt trigger to fire and assert READY (Figure 9).
Figure 9 - Soft Start
3.3 State Retention 
The design of SALT employed aggressive coarse grain power gating to disconnect the power
from both the ARM926EJS processor and the OTG USB core when in standby mode. However,
to ensure a quick return from standby back into active mode, it is necessary to preserve the state
whilst the power is gated. Two state retention techniques were implemented in SALT: one for
"light" sleep, where the state was stored locally in retention registers, and the other for "deep"
sleep, where the state was scanned out and stored in memory.
3.3.1 Retention Registers 
For most designs it is not strictly necessary to preserve the contents of every storage element
whilst in standby as only the salient "architectural" state needs to be preserved. In the case of
the ARM926EJS, this essential state is in effect the state relating to the programmer's model.
However, unless this essential state is explicitly marked in the source RTL, it is very difficult to
infer during implementation. In the SALT implementation it was decided to simply convert
every register in the ARM926EJS into a retention register to ease the verification process.
This was done using Power Compiler in the following manner:
set power_enable_power_gating true
set_power_gating_style -type DRFF
set_power_gating_signal -type DRFF nrestore
compile_ultra -scan
hookup_power_gating_ports -type DRFF -port_naming_style nrestore
As previously mentioned there are a number of styles of retention register design which trade off
speed, power and area. The simplest design uses the existing slave latch as the storage node
which must be kept powered during power gating and should be implemented using high
threshold, low leakage transistors. However, although this design has a minimal area overhead
and only requires one additional control signal, it does unfortunately suffer from a loss in
performance due to the high threshold transistors of the storage node being on the data path. This
performance impact can be avoided by keeping the high threshold, low leakage transistors off the
data path by adding a "balloon" latch storage node off to one side. Although this design results
in minimal impact on performance, there is an area overhead and unfortunately it requires two
additional control signals.
The retention register used in SALT was a prototype of the one that is now available in ARM's
Power Management Kit. The design of this "PMK" retention register manages to retain the
performance of the "balloon" style whilst having the same simple control as the "live slave"
(Figure 10).
[Figure 10 annotation: "These cells must remain powered"]
Figure 10 - PMK Retention Register
3.3.2 Scan Hibernate
In order to reduce the leakage still further the power to the ARM926EJS can be shut off
completely. In this case, the state cannot be stored locally in retention registers and must be
stored elsewhere before the power is disconnected.
A novel bus transaction based technique was developed to save and restore state to any AHB
connected memory. This technique (called "Scan Hibernate") involved padding out the number
of retention registers to ensure that the number was a multiple of 32 so that the state could be
scanned out and presented in a series of 32 bit words to a dedicated AMBA bus master to be
saved to memory (Figure 11). The design of this dedicated bus master included an
implementation of the "CRC-32" algorithm to check the integrity of the restored data.
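The save and restore path can be modelled in software to show how the padding and integrity check fit together. This is a sketch under assumptions: the function names are hypothetical, the bit ordering within each word is arbitrary, and Python's `zlib.crc32` stands in for whichever CRC-32 variant the bus master actually implements.

```python
import zlib

WORD_BITS = 32


def pad_to_word_multiple(state_bits):
    """Pad the bit list with zeros so its length is a multiple of 32."""
    padding = (-len(state_bits)) % WORD_BITS
    return state_bits + [0] * padding


def pack_words(state_bits):
    """Group the padded state bits into 32-bit words, first bit as the MSB."""
    padded = pad_to_word_multiple(state_bits)
    words = []
    for i in range(0, len(padded), WORD_BITS):
        word = 0
        for bit in padded[i:i + WORD_BITS]:
            word = (word << 1) | (bit & 1)
        words.append(word)
    return words


def crc32_of_words(words):
    """CRC-32 over the words serialised as big-endian bytes."""
    return zlib.crc32(b"".join(w.to_bytes(4, "big") for w in words))


def save_state(state_bits):
    """Return the words to write to memory and their CRC-32."""
    words = pack_words(state_bits)
    return words, crc32_of_words(words)


def restore_state(words, expected_crc, n_bits):
    """Check the CRC-32, then unpack the original n_bits of state."""
    if crc32_of_words(words) != expected_crc:
        raise ValueError("CRC-32 mismatch: restored state is corrupt")
    bits = []
    for w in words:
        bits.extend((w >> shift) & 1 for shift in range(WORD_BITS - 1, -1, -1))
    return bits[:n_bits]
```

For example, a 45 bit state is padded to 64 bits and saved as two words. Flipping any bit of the saved image makes `restore_state` raise instead of silently reloading corrupt state.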
[Figure 11 labels: scan chains 31 to 0, each of length n, through the ARM core; "Extra flops to balance chains"]
Figure 11 - Scan Hibernate
An interesting use of this "Scan Hibernate" system is to verify the integrity of the state restored
from the retention registers. This can be done by storing the state to memory as well as the
retention registers before entering light-sleep mode and then storing the restored state to memory
immediately after return to active mode. By comparing the two images of the state from before
and after power gating it is possible to verify whether any state got corrupted. This is a very
useful diagnostic technique which can be used to explore the low voltage operation of the
retention registers as well as the effects of in-rush current induced IR drop.
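The comparison of the two state images amounts to a word-by-word XOR. A minimal sketch, assuming each image is available as a list of 32 bit words (the function name and bit numbering are illustrative, not taken from the design):

```python
def corrupted_bits(before_words, after_words, word_bits=32):
    """Return (word_index, bit_index) for every bit that differs."""
    if len(before_words) != len(after_words):
        raise ValueError("state images must have the same length")
    diffs = []
    for index, (before, after) in enumerate(zip(before_words, after_words)):
        delta = before ^ after  # set bits mark corrupted positions
        for bit in range(word_bits):
            if (delta >> bit) & 1:
                diffs.append((index, bit))
    return diffs
```

Mapping each reported position back to its scan chain and register would then identify which retention registers lost state at a given voltage or wake-up condition.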
3.4 Variable Threshold CMOS (VTCMOS) 
The implementation of VTCMOS requires a triple well process so that (assuming a p-type
substrate) "deep" nwells can be placed under the pwells in order to isolate them from each other
so that they can be held at different potentials. In addition to this extra process step, VTCMOS
also requires a "tapless" library with floating wells so that special cells which have independent
contact with the wells can be placed at regular intervals to set the body bias. These special well
bias cells then need to be all connected together with two extra power meshes, VDDB for the
nwell and VSSB for the pwell.
As all the power gates in SALT were arranged in columns placed at regular intervals, it was
convenient to make the well bias connections by incorporating them into the layout of each
power gate cell. This meant that the implementation of VTCMOS almost came for free as all the
vertical connectivity of these extra power nets was done by abutment between the power gate
cells, just like the SLEEP signal in the in-rush current management.
To complete the VTCMOS implementation, it was necessary to place a ring of special deep
nwell "capping" cells around the standard cell region in order to meet the minimum nwell
overlap of deep nwell as prescribed by the TSMC90G rules.
4 Conclusions and Future Work 
As we move down the process generations, leakage currents are fast becoming a significant
source of power dissipation in both active and standby modes. Various techniques for mitigating
leakage power were investigated and Power Gating, Dual VT and VTCMOS were found to be the
most effective. These techniques were then explored further in practical detail through an
ongoing collaborative R&D program with Synopsys to investigate aggressive leakage mitigation
techniques on ARM based systems. The first phase of this program was focused on
understanding the technology and yielded the design described in this paper (which at the time of
writing was still in fab). Being an R&D project, the demands of this design were a little ahead of
the capabilities of both the tools and the library, and so certain back roads had to be taken in order
to complete the tape out. However, many valuable lessons have been learned, some of which
have already been factored into the latest releases of ARM's Power Management Kit and
Synopsys tools.
When the silicon comes back, we plan to investigate the real time impact and entry/exit energy
costs of the various sleep modes in order to further develop the next generation "Intelligent
Leakage Controller". Also, we plan to verify the effectiveness of the in-rush current
management by using the diagnostic features of the "scan hibernate" system, as well as the
benefits of VTCMOS in both forward and back bias modes.
The second phase of this collaborative program is focused on defining a set of best practices for
the rapid deployment of ARM IP with Synopsys tools to provide a complete low power design
solution based on open industry standards for our mutual customers.
5 Acknowledgments 
The authors would like to express a special thanks to both ARM and Synopsys for their
continued support with this collaboration. In addition they would like to specifically thank
Dave Flynn, Mike Keating, Dave Howard, Sachin Idgundi and last but not least Rich Goldman.
6 References 
[1] Greenhalgh, P. "Power Management Techniques for Soft IP" SNUG Europe 2004
[2] Biggs, J. Uttley, P. "Rapid implementation of an IEM enabled ARM1176JZF-S with
Galaxy Power" SNUG 2005
[3] Kim, N.S. Austin, T. Blaauw, D. Mudge, T. Flautner, K. Hu, J.S. Irwin, M.J. Kandemir,
M. Narayanan, V. "Leakage current: Moore's law meets static power" IEEE Computer
Vol. 36, Issue 12, 2003
[4] S. Borkar, "Design Challenges of Technology Scaling" IEEE Micro Vol. 19, Issue 4, 1999.
[5] K. Roy, S. Mukhopadhyay, H. Mahmoodi-Meimand, "Leakage Current Mechanisms and
Leakage Reduction Techniques in Deep-Submicrometer CMOS Circuits", IEEE
Proceedings 2003
[6] V. Sundararajan, P. Keshab "Low Power Synthesis of Dual Threshold Voltage CMOS
VLSI Circuits" ISLPED 1999.
[7] Kao, J. Chandrakasan, A. "MTCMOS Sequential Circuits" ESSCIRC 2001
[8] Calhoun, B. Honore, F. and Chandrakasan, A. "A leakage reduction methodology for
distributed MTCMOS," IEEE J. Solid-State Circuits, vol.39, no.5, pp.818-826, May 2004.
[9] A. Keshavarzi, C. F. Hawkins, K. Roy, and V. De, "Effectiveness of reverse body bias for
low power CMOS circuits" Symp. VLSI Design 1999.
[10] Y. Ye, S. Borkar, and V. De, "New technique for standby leakage reduction in
high-performance circuits," Symp. VLSI Circuits, 1998.
[11] Whitfield, T. Gent, C. Christy, R. "Implementation of an ARM Core with Performance
Scaling for Intelligent Energy Management" SNUG Europe 2004
[12] Biggs, J. Gibbons, A. "Enabling Core Based Design" SNUG Europe 2002
[13] Flynn, D. Flautner, K. Patel, D. Roberts, D. "IEM926: An Energy Efficient SoC with
Dynamic Voltage Scaling" DATE 2004

