Radiation Hardened Clock Design by Chellappa, Srivatsan (Author) et al.
  
Radiation Hardened Clock Design 
by  
Srivatsan Chellappa 
 
 
 
 
 
A Dissertation Presented in Partial Fulfillment 
of the Requirements for the Degree 
Doctor of Philosophy 
 
 
 
 
 
Approved July 2015 by the 
 Graduate Supervisory Committee: 
 
 Lawrence Clark, Chair  
Keith Holbert 
 Yu Cao 
Umit Ogras 
 
 
 
 
 
 
 
 
 
 
 
ARIZONA STATE UNIVERSITY 
 
August 2015  
 
ABSTRACT 
Clock generation and distribution are essential to CMOS microchips, providing 
synchronization to external devices and between internal sequential logic. Clocks in 
microprocessors are highly vulnerable to single event effects, and designing reliable, 
energy-efficient clock networks for mission-critical applications is a major challenge. This 
dissertation studies the basics of radiation hardening, the essentials of clock design, and the 
impact of particle strikes on clocks in detail, and presents design techniques for hardening 
complete clock systems in digital ICs. 
Since the sequential elements play a key role in deciding the robustness of any 
clocking strategy, hardened-by-design implementations of triple-mode redundant (TMR) 
pulse clocked latches and physical design methodologies for using TMR master-slave flip-
flops in application specific ICs (ASICs) are proposed. A novel temporal pulse clocked 
latch design for low power radiation hardened applications is also proposed. Techniques 
for designing custom RHBD clock distribution networks (clock spines) and ASIC clock 
trees for a radiation hardened microprocessor using standard CAD tools are presented. A 
framework is provided for analyzing the vulnerabilities of clock trees in general and for 
studying the parameters that contribute the most to a tree’s failure, including the impact on 
the controlled latches. This framework is then used to design an integrated temporally redundant clock 
tree and pulse clocked flip-flop based clocking scheme that is robust to single event 
transients (SETs) and single event upsets (SEUs). Subsequently, designing robust clock 
delay lines for use in double data rate (DDRx) memory applications is studied in detail. 
Several modules of the proposed radiation hardened all-digital delay locked loop are 
designed and studied.  
Many of the circuits proposed in this body of work have been implemented 
and tested in a standard low-power 90-nm process.  
                                            
 
 
DEDICATION 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
To my parents 
P. S. Chellappa and Anandhi Chellappa 
 
 
 
 
 
 
 
 
 
 
 
 
                                            
 
 
ACKNOWLEDGMENTS 
 
I would like to express my gratitude to many people who have played an integral 
role in helping me during this endeavor over all these years. 
First of all, I would like to thank my parents, P. S. Chellappa and C. Anandhi, for 
their love, encouragement, and support throughout my graduate studies and research. Their 
relentless efforts and hard work have been a constant source of motivation during tough and 
challenging times in life. 
I would also like to thank Dr. Clark for this opportunity, and my committee members, 
Drs. Holbert, Cao, and Ogras, for their time and support. I am also indebted to my colleagues 
Chandarasekaran Ramamurthy, Vinay Vashishtha, Aditya Gujja, Sandeep 
Shambhulingiah, Sushil Kumar, Chris Lieb, Dan Patterson, Tom Mozdzen, Nathan 
Hindman, Xiaoyin Yao, Satendra Maurya, Jerin Xavier, Yitao Chen, and Harshad Navale 
for their invaluable contributions, discussions, and support on all our projects. I must also 
thank the helpful administrative staff at the EE department, Donna, Darleen, Esther, Jenna, 
Toni, and Lynn, for helping me with all the administrative procedures at ASU. 
 I would also like to thank all my friends and family for all the great moments that 
make life worth living. To all of you, I just want to say, “I may forget all the jokes we 
shared, but I can never forget the laugh we had.” Thank You. 
Finally, I would like to thank the Almighty for his grace and blessings, without 
which none of this would have been possible. 
 
                                            
 
 
TABLE OF CONTENTS 
                                                                                                                            Page 
LIST OF FIGURES                                                                                                   xii 
CHAPTER 
1. INTRODUCTION .............................................................................................1 
1.1. Introduction .................................................................................................. 1 
1.2. Radiation Environment in Space ................................................................. 1 
1.3. Effect of Radiation Particles on Circuits..................................................... 3 
1.3.1. Single Event Effects in CMOS ............................................................. 6 
1.3.1.1. SEE Mechanism ............................................................................. 6 
1.3.1.2. Types of Single Event Effects ........................................................ 9 
1.4. Radiation Hardening .................................................................................. 11 
1.4.1. Radiation Hardening by Process (RHBP) ........................................... 11 
1.4.2. Radiation Hardening by Design (RHBD) ........................................... 12 
1.4.2.1. Design Techniques for Mitigating SEE Effects ........................... 12 
1.5. Outline........................................................................................................ 15 
2. VLSI CLOCKING ...........................................................................................16 
2.1. VLSI Clocking Basics................................................................................ 16 
2.2. Sequential Element Design ........................................................................ 17 
2.2.1. Latch ................................................................................................... 18 
2.2.2. Flip-Flop ............................................................................................. 19 
2.2.3. Pulse-Clocked Latch ........................................................................... 21 
  
2.2.4. Timing Constraints for Sequential Designs ........................................ 22 
2.3. Clock Distribution ...................................................................................... 24 
2.3.1. Buffered Clock Distribution Trees ...................................................... 24 
2.3.2. Clock Mesh ......................................................................................... 26 
2.3.3. Clock Spine ......................................................................................... 27 
2.4. Clock Generation ....................................................................................... 28 
2.4.1. PLL ..................................................................................................... 29 
2.4.2. DLL ..................................................................................................... 32 
2.4.3. Voltage Controlled Oscillator (VCO) ................................................. 34 
2.4.3.1. VCO Analog Implementations .................................................... 35 
2.4.3.2. Digital VCO Implementation (DCO)........................................... 39 
2.4.4. Phase Detector .................................................................................... 41 
2.4.4.1. Analog Phase Detection ............................................................... 42 
2.4.4.2. Digital Phase Detection................................................................ 43 
2.4.4.3. Time To Digital Converter (TDC) ............................................... 46 
2.4.5. Loop Filter .......................................................................................... 48 
2.4.5.1. Analog Loop Filter Implementations ........................................... 48 
2.4.5.2. Digital Loop Filters ...................................................................... 52 
2.4.6. Feedback ............................................................................................. 54 
2.4.6.1. Frequency Divider ....................................................................... 54 
2.4.7. Classifications of PLLs / DLLs .......................................................... 57 
2.4.7.1. Analog or Linear PLL .................................................................. 57 
2.4.7.2. Digital PLL or DLL (DPLL)........................................................ 58 
  
2.4.7.3. Master-Slave Delay Locked Loop (DLL) ................................... 59 
2.4.7.4. All-digital PLL/DLL (ADPLL) ................................................... 60 
2.4.7.5. Software PLL or DLL (SPLL) ..................................................... 61 
2.5. Chapter Summary ...................................................................................... 62 
3. RADIATION HARDENED SEQUENTIAL ELEMENT DESIGN ...............63 
3.1. Introduction ................................................................................................ 63 
3.2. Prior Work: Self-Correcting TMR Flip-Flop Design ................................ 64 
3.2.1. Physical Design Using Fences for TMR Separation with TMR Flip-Flops ..... 65 
3.3. TMR Pulse Clocked Latch Design ............................................................ 66 
3.3.1. AES Overview and Design ................................................................. 68 
3.3.2. Self-Correcting TMR Pulse Latch Macro Design .............................. 70 
3.3.3. Optimal Pulse Width Determination ................................................... 71 
3.3.4. TMR Pulse Latch Test Mode .............................................................. 73 
3.3.5. Pipeline Stage Unification Using Pulse Latches ................................. 74 
3.3.5.1. Pipeline Depth and Pipeline Stage Unification Overview ........... 74 
3.3.5.2. PSU Using Pulse Latches............................................................. 76 
3.3.6. AES Implementation and Test Chip Design ....................................... 77 
3.3.7. Results and Analysis ........................................................................... 79 
3.3.7.1. Pipeline Collapse Analysis .......................................................... 79 
3.3.7.2. Experimentally Measured Performance ....................................... 81 
3.3.7.3. AES Beam Test Results ............................................................... 83 
3.4. Temporal Pulse Clocked Flip-Flop ............................................................ 84 
  
3.4.1. Design of the Temporal Pulse Clocked Flip-Flop .............................. 85 
3.4.1.1. Derivation of the Design .............................................................. 86 
3.4.1.2. Clock Generator Design ............................................................... 87 
3.4.1.3. Hardness Validation Simulation Methodology ............................ 89 
3.4.1.4. Pulse Width Design and Timing .................................................. 90 
3.4.2. Physical Implementation ..................................................................... 92 
3.4.2.1. Multiple Node Charge Collection Mitigation .............................. 92 
3.4.2.2. Physical Design and Multi SET Simulation Result ..................... 93 
3.4.2.3. Test Silicon and Results ............................................................... 93 
3.4.3. Comparison with Previous Work ........................................................ 93 
3.4.4. Future Work ........................................................................................ 94 
3.5. Chapter Summary ...................................................................................... 95 
4. RADIATION HARDENED CLOCK DISTRIBUTION .................................96 
4.1. Contribution of this Work .......................................................................... 96 
4.2. Single Event Effects in the Clock Network ............................................... 97 
4.3. Radiation Hardened Custom Clock Spine ............................................... 100 
4.3.1. Global Clock Distribution ................................................................. 101 
4.3.2. Unit Clock Driver Design ................................................................. 103 
4.3.3. Clock Spine Physical Design ............................................................ 106 
4.3.4. Experimental Verification ................................................................. 108 
4.3.4.1. Test Chip Design........................................................................ 108 
4.3.4.2. Single Event Testing and Results .............................................. 109 
4.4. ASIC Clock Trees for RHBD Applications ............................................. 112 
  
4.4.1. ASIC Clock Tree Synthesis .............................................................. 112 
4.4.2. Radiation Hardened ASIC Clock Trees ............................................ 114 
4.4.3. HERMES2 Clock Trees .................................................................... 115 
4.4.4. HERMES2 Physical Design Methodology for Spatial Redundancy 116 
4.4.4.1. HERMES2 TMR Clock Tree Synthesis .................................... 118 
4.4.5. HERMES2 Clock Tree Characteristics ............................................. 119 
4.4.5.1. Comparison of the Three TMR Tree Structures ........................ 119 
4.4.5.2. Tree Insertion Delays ................................................................. 121 
4.4.5.3. Inter-Tree Clock Skew ............................................................... 122 
4.4.5.4. Impact of SET vs Tree Level ..................................................... 123 
4.4.5.5. Node Capacitance vs Driver sizes.............................................. 124 
4.4.6. Clock Tree Risk Analysis ................................................................. 126 
4.4.6.1. Risk ............................................................................................ 127 
4.4.6.2. Empirical Analysis of the Impacting Factors............................. 128 
4.4.6.3. Techniques to Reduce Rtree ........................................................ 133 
4.4.6.4. Risk Analysis of Individual Flip-Flops ...................................... 135 
4.4.6.5. Case Study: Under Driven Clock Node ..................................... 136 
4.4.6.6. Case Study: Delay Filter in the Clock Path ............................... 137 
4.4.6.7. Case Study: Redundant Clock Trees.......................................... 138 
4.4.6.8. HERMES2 Risk Analysis .......................................................... 139 
4.4.7. Temporally Redundant Clock Trees ................................................. 141 
4.4.8. Proposed Integrated Temporal Clocking and TMR Pulse FF Methodology .... 141 
  
4.4.8.1. Experimental Implementation .................................................... 143 
4.4.8.2. Analysis of Power Consumed .................................................... 144 
4.5. Chapter Summary .................................................................................... 146 
5. RADIATION HARDENED DDRX CLOCK GENERATION .....................147 
5.1. DDR SDRAM Memory Interfaces .......................................................... 147 
5.2. An RHBD DLL for DDR2 and DDR3 .................................................... 149 
5.3. RHBD AD-DLL Architectural Overview ................................................ 151 
5.4. Digital Delay Line (DDL) ........................................................................ 153 
5.4.1. Coarse Delay Line............................................................................. 154 
5.4.2. Fine Delay Line................................................................................. 155 
5.4.2.1. Interpolator ................................................................................. 156 
5.4.2.2. Fine Delay Line Response with Systematic Process Corners .... 158 
5.5. RHBD Time-to-Digital Converter ........................................................... 159 
5.6. TC-23 Test Structures .............................................................................. 161 
5.6.1. Digitally Controlled Oscillator ......................................................... 162 
5.6.2. Frequency Step Control .................................................................... 163 
5.6.2.1. Circuit Operation and Timing .................................................... 164 
5.6.3. Frequency Divider ............................................................................ 166 
5.7. TC-23 Test Structure Overview ............................................................... 168 
5.7.1. Time to Digital Converter Test Structure ......................................... 169 
5.8. Proposed AD-DLL Top Level Architecture ............................................ 171 
5.9. Modes of Operation ................................................................................. 172 
5.9.1. Coarse Lock Acquisition (CLA) Mode............................................. 173 
  
5.9.2. Lock Mode ........................................................................................ 177 
5.10. Control Unit Design ............................................................................... 179 
5.10.1. DDL Multiplexer Select Design ...................................................... 180 
5.10.2. TDC Multiplexer Select .................................................................. 181 
5.10.3. Generate CD and FD Selects .......................................................... 183 
5.11. RTL Implementation and Future Work ................................................. 186 
5.12. Conclusion ............................................................................................. 186 
6. SUMMARY ..................................................................................................187 
 
                                            
 
 
LIST OF FIGURES 
FIGURE                                                                                                                               Page 
1.1. Cartoon Showing the Space Radiation Environment (Illustration By K.Endo, Nikkei 
Science Inc., Japan). .................................................................................................... 2 
1.2. (A) Ion Strike at the Output of an Inverter. (B) Funnel Formation and Charge 
Collection Mechanisms in the Semiconductor Following an Ion Strike. .................. 5 
1.3. Funneling in an N+/P Silicon Junction Following an Ion Strike Showing Contours Of 
A) Electrostatic Potential, B) Electron Concentration after [Hseih81a]. .................... 8 
1.4.(A) Implementation of a Triple-Modulo Redundancy (TMR) Based Hardware 
Redundancy Scheme And (B) Temporal Redundancy Based on Delayed Sampling in 
Flip-Flops After [Mavis02]. Note T Represents The Delay Introduced................. 13 
2.1. Clock Network in Synchronous VLSI Designs. ........................................................ 16 
2.2. (A) Latch Operation. (B) Static Latch Schematic. [Chandra01] ............................... 18 
2.3. (A) D Flip-Flop Constructed from Two Latches. (B) D Flip-Flop Operation. ........... 19 
2.4. Master-Slave Flip-Flop Schematic[Chandra01]. ....................................................... 20 
2.5.(A) Pulse Latch Schematic and (B) Working. Delay can be Generated by Buffers 
Depending on the Delay Required. Note the Pulse Generator can be Shared Across 
Multiple Latches. ...................................................................................................... 21 
2.6.(A) Tree Based Clock Distribution.(B) Symmetrical H Tree Based Clock 
Distribution. (C) RC Matched H-Tree Clock Distribution Network for a 
Microprocessor after [Rest98]. ................................................................................. 25 
2.7. (A) Clock Grid; (B) Clock Grid With Clock Gating. Clock Gating can be Added to 
the Tree as Well, and May be Implemented at Multiple Levels. .............................. 26 
2.8.Clock Distribution in the Intel Penryn Processor Using Multiple Clock Spines after 
[Varg07]. ................................................................................................................... 27 
2.9. Basic Architecture of a PLL after [Egan08] .............................................................. 29 
2.10. Mathematical Model of the PLL. ............................................................................. 30 
2.11. Basic Architecture of a DLL .................................................................................... 31 
2.12. Mathematical Model of the DLL. ............................................................................ 32 
  
2.13. Voltage Transfer Characteristic of the (A) VCO (B) VCDL after [Song10]. ........ 34 
2.14.(A) LC Resonant Tank Based VCO Circuit. Note that the Control Voltage VCNTL  
Controls the Capacitance of the Variable Capacitor. (B) the Equivalent Circuit of the 
Implementation. ........................................................................................................ 36 
2.15.(A) Relaxation Oscillator in Astable Multivibrator Topology Implemented using 
Operational Amplifiers. (B) Ring Oscillator Topology Implemented using Voltage 
Controlled Delay Elements. Note VCNTL Controls the Delay Of Each Stage And 
Hence the Frequency Of Oscillations. ...................................................................... 37 
2.16.(A) Current Starved Inverter Delay Element (B) Symmetrical Load Differential 
Delay Element. .......................................................................................................... 38 
2.17.DCO Implementation using Digital Switching Capacitors. ...................................... 40 
2.18.(A) DCO Implemented Using Multiplexers (B) Alternative Implementation with 
Delay Elements Integrated into the Multiplexer. ...................................................... 41 
2.19. Analog Multiplier Phase Detector. .......................................................................... 42 
2.20.(A) XOR Phase Detector (B) Waveforms of the XOR Phase Detector Output........ 43 
2.21.PFD Circuit Implemented using (A) D Flip-Flops (B) Using Logic Gates after 
[Best07]. .................................................................................................................... 44 
2.22.(A)Two State Phase Detector.(B) Three State Phase Detector Implementations after 
[Burn07]. ................................................................................................................... 45 
2.23. A Mux-DDL Based Time-To-Digital Converter (TDC) Implementation. .............. 47 
2.24.Passive Loop Filter Implementations [Egan08] with (A) Single Pole And (B) Two 
Poles. ......................................................................................................................... 48 
2.25.Active Loop Filter Implementations with their Corresponding Transfer Functions 
(A) Active Lag Filter I (B) Active Lag Filter II (C) Active PI Filter [Egan08]........ 49 
2.26. PLL Loop with PFD, Charge Pump, Loop Filter Based Implementation after 
[Gard80]. ................................................................................................................... 50 
2.27.(A)K Counter (B) Waveforms of the UP-Counter, DOWN-Counter, Carry Borrow 
States Showing The Filtering Operation after [Best07]. ........................................... 51 
2.28.(A)Up/Down Counter (B) Digital Integrator Implemented using an Accumulator 
Structure. ................................................................................................................... 52 
2.29. A Basic Frequency Synthesizer System. [Best07]. .................................................. 54 
  
2.30. A Divide by 2 Circuit Using D Flip-Flop and its Functional Waveforms. .............. 55 
2.31. A Programmable Divider Circuit with the Cycle Stretch Circuit. ........................... 56 
2.32. Basic Architecture of an Analog PLL. ..................................................................... 57 
2.33. Basic Architecture of (A) Digital PLL Frequency Synthesizer (B) Digital DLL. ... 58 
2.34. Master-Slave DLL Architecture. Note DO is the Delayed Version of DI as the 
Master and Slave Share the Same Control Voltage. ................................................. 59 
2.35. (A) Basic Architecture of an ADPLL used in a Microcontroller.(B) The All Digital 
PLL As Implemented in the Part 74HC/HCT297. The DCO used in this Circuit is an 
Increment/Decrement Counter After [Best07]. ......................................................... 60 
3.1. Schematic of the Self Correcting TMR Master-Slave Flip-Flop after [Hind11]. ...... 64 
3.2. Proposed Placement Methodology using Interleaved Serpentine Fences.................. 65 
3.3. Schematic of the TMR Pulse Latch with Redundant Pulse Generators and Majority 
Gated Latch Feedback. .............................................................................................. 67 
3.4. Fully Pipelined (Loop-Unrolled) Advanced Encryption Standard Architecture as 
Implemented. All Stages have very Similar Configurations and Timings. Even And 
Odd Pipeline Stages are Identical. Mathematical Transformations for the 
Combinational Logic are Shown in Boxes. .............................................................. 68 
3.5. (A) Block Diagram Showing Self-Correcting Pulse-Clocked Latch Macros in the 
AES Design. Multiplexers that Select Between Data and Test Mode Input are also 
shown. The Pulse Generator is Modified to Allow Pipe Stage Unification using the 
Open Control Signal. (B) Layout Showing the Custom Designed Block Consisting 
of A, B, And C Pulse Latch Copies and the Respective Pulse Generator. Spatial 
Separation Between the Latter Provides a Means to Reduce Multi-Bit Upsets. ...... 70 
3.6 Statistical Analysis of the Pulse Latch and Pulse Generator Showing Worst-Case 
Pulse Width for Proper Data Capture Mean and Sigma as well as Pulse Generator 
Variation as Determined by MC Simulation. ........................................................... 72 
3.7. Timing Waveforms Showing Error Correction with a TMR Latch During Test Mode 
Simulation. ................................................................................................................ 73 
3.8. Conventional Pipeline (A), Bypassing Master-Slave FF (B) And Pipeline Depth 
Collapse Using PSU (C). Note that when the OC (Open Control) = 1, the FF is 
Bypassed and the Pipeline Stage is Logically Removed from the Design. .............. 74 
3.9. Operation in Normal and PSU Mode. Note the Values of TC2Q And TD2Q Also 
include Delay Across One Buffer Stage at the Output of the Latch for Providing the 
  
required Drive. All Numbers Indicate the Worst Case Timing for the Latch Farthest 
from the Pulse Generator. ......................................................................................... 76 
3.10 AES Pipeline using Pulse-Clocked Latches for Pipeline Stage Synchronization. 
Maintaining the Pulse-Clock High Removes that Latch from the Pipeline, Providing 
Feed-through without Added Bypass Delay. A 256-Bit Key Width Mandates 15 
Pipeline Stages for Fully Pipelined Operation. The 7 Different PSU Configurations 
Possible are also shown. ........................................................................................... 77 
3.11. (A) Die Photomicrograph (Left) with Layout Inlaid. (B) Spatially Separated 
Combinational Logic in AES TMR with Fence and Floorplan Sizes. (C) TMR 
Separation Flow Generated Powerplan with Offsets for Flip-Flop Placements. ...... 78 
3.12. (A) Primetime Analysis of the Different PSU Cases at Different Supply 
Voltages.(B) Primetime Critical Path Delay Results for Different PSU Cases as a 
Function Of VDD. ...................................................................................................... 79 
3.13.(A) Energy per Operation at 1.2 V.(B) FPGA Setup for Testing the Designed AES 
Chip. .......................................................................................................................... 80 
3.14. (A) Measured Test Chip FMAX Vs. VDD at High Operating Voltages. The Upper 
Values are Measured Using the PSU Mode. (B) Measured Test Chip FMAX Vs. VDD 
at Low Operating Voltages. The Trend does not Match the High Voltage Behavior. ..... 82 
3.15. Measured Hardware Results for 1/FMAX At VDD = 0.7 V for Different Pipeline 
Stages Using PSU. Passing Points are shown. FMAX of the Uncollapsed Pipeline at 
0.7 V is Limited by the Chip IO. Testing PSU Beyond 4 Stages is Limited by the 
Minimum Possible FPGA Clock Frequency. ............................................................ 83 
3.16. (A) Concept Showing Redundant FFs Clocked by Temporally Separated Clocks. 
(B) Erroneous SET at CLK Input Generates False Edges on Delayed Version of 
Clocks. ...................................................................................................................... 85 
3.17. Block Diagram of Pulse Based FF. Note the TMR Pulse Generators (PGA, PGB 
And PGC) are Shared across 16 Latches in Parallel. The D input is Temporally 
Sampled by Delayed Pulses and Voted Out at the Output. ....................................... 86 
3.18. Waveform showing a Low to High SET during the Low Phase of the Clock. False 
Temporal Pulse Edges Generated, Sample the Incorrect D Value (Logic 0) causing 
an Error to be Latched at the Output. ........................................................................ 87 
3.19. Proposed FF Design with Muller-C Elements in the Pulse Generator. This Design is 
Hard to SETs on the Clock and Data Inputs. ............................................................ 88 
3.20. Simulation Methodology for Testing SET Immunity. All 4 SET Cases are Covered 
(Two (1,2) Clock Edges and Two (3,4) Clock Phases). ........................................... 89 
  
3.21. Simulation Waveforms for the Proposed FF with a Sample Low SET at the D Input. 
Two Redundant Copies Latch the Correct Input Value and the Error is Voted Out. 89 
3.22. Statistical Design of the Pulse Latch Using Monte-Carlo Simulations for the 
Optimal Pulse Width Required. ................................................................................ 90 
3.23. Timing Parameters Of the Proposed FF. Note the Tdead is Half of 
[Nase06][Matush10][Shambhu14]. Data Must be Held Until the PCLKC Closing 
Edge for Hardness. .................................................................................................... 91 
3.24. Floor Plan View of the Proposed FF. The Vertical Stripes are M2 Tracks. Various 
M2 Tracks are Shared to Minimize Routing. ............................................................ 92 
3.25. Test Chip Die Photo with the Proposed FF Test Structures. ................................... 93 
4.1. (A) Clock, Unhardened Flip-Flop (LMX) and Logic Cross-Sections (B) Clock, 
Hardened (TMR) and Logic Cross-Sections at Different LETs after [Hans09]. ...... 99 
4.2. The Clock Distribution Network. Note that the Signals E5Gclk, E4Gclk, E3Gclk, 
E2Gclk And E1Gclk are Large Nodes that are Shorted Throughout the Chip and 
Hence Have Large Node Capacitances; 23 Wires are used to Equally Distribute the 
E5Gclk Signal Throughout the Clock Spine to Reduce Node Delay. .................... 101 
4.3. Radiation Induced Jitter on the Egclk (A Global Clock Node). A Simulated SET with 
Charge Equivalent to LET of 30 MeV-cm2/mg produces Clock Jitter of Less than 1 
ps. ............................................................................................................................ 102 
4.4. An SET Strike at the Input of One of 102 Inverters in the PLL-to-E5Gclk Buffer 
Produces a Jitter of 0.63 ps at the Input of the Clock Spine. The Straight Line shows 
E5Gclk Node without the SET while the Dashed Line Shows the E5Gclk Affected 
Due to the Impact Of the SET. There is 10 µm Separation between Inverters. ...... 103 
4.5. The Unit (Local) Clock Network that Produces the Individual Clock Signals from the 
Egclk (A). Schematic of the Unit Clock Checker Circuit (B). ............................... 104 
4.6. Simulation Waveforms of SET Hits at Different Parts of the Local Clock Network 
Shown on Different Clock Cycles. Note that False Clock Hit Signals are Produced 
When the Copy of the Enable Signal is Hit. Since the Enable and its Copy are 
Spatially Separated, the Probability of an SET Affecting both Nodes Simultaneously 
is Very Small. Other SETs that Affect at the Local Clock Edge are Correctly 
Detected by the Checker. ........................................................................................ 105 
4.7. The Layout of the Complete Clock Spine with the PLL to E5Gclk Buffers. The 
Global Clock Nodes (E5Gclk, E4Gclk, E3Gclk, E2Gclk And E1Gclk) are Laid 
Gridded to Minimize Skew. .................................................................................... 106 
4.8. Die Microphotograph With the Clock Spine Layout Overlaid. ............................... 107 
  
4.9. Clock Spine Driving Different Kinds of Logic Circuitry in the Test Chip. ............ 108 
4.10. (A) The Heavy Ion Beam Test Board at the Texas A&M University Cyclotron. (B) 
Proton Test Setup at the Lawrence Berkeley Labs Cyclotron. ............................... 111 
4.11. SEU Hardened Inverter (A) Schematic And (B) Layout Using Resistive Hardening 
and Well Based Isolation after [Baze00]. ............................................................... 113 
4.12. TMR Clock Trees in HERMES2 and their Span. .................................................. 115 
4.13. Clock Trees (A) ClkA (B) ClkB (C) ClkC In HERMES2. .................................... 116 
4.14. Distribution of Clock Nodes in Each Level of the HERMES2 Clock Trees. ........ 119 
4.15. Distribution of (A) All Sequential Elements and (B) TMR FFs in Each Level of the 
Trees in HERMES2. ............................................................................................... 120 
4.16. PDF of the Rising Insertion Delay from the Clock Source to all Flip-Flops in the 
Design for (A) PLMasterGClkA, (B) PLMasterGClkB and (C) PLMasterGClkC 
Tree. ........................................................................................................................ 121 
4.17. PDF of the Inter-Tree Clock Skew to All Flip-Flops in HERMES. ...................... 122 
4.18. (A), (B), (C) Fan out of an SET Analyzed by Stage for the Three Trees 
PLMasterGClkA, PLMasterGClkB and PLMasterGClkC. .................................... 123 
4.19. (A), (B), (C) Fan Out Of An SET Analyzed By Stage for the Three Trees 
PLMasterGClkA, PLMasterGClkB and PLMasterGClkC. .................................... 124 
4.20. Drive Quality of the Tree Nodes in (A) ClkA and (B) ClkB in HERMES2. ........ 125 
4.21. Drive Quality of the Tree Nodes in ClkC. ............................................................. 126 
4.22. Unique Clock Path from Clock Source to a Flip-Flop. .......................................... 135 
4.23. Under Driven Node in the Clock Path from Clock Source to a Flip-Flop. ............ 136 
4.24. Delay Filter Used at the Clock Input of the Flip-Flop. .......................................... 137 
4.25. Redundancy based SET Protection Schemes Showing (A) Semi Redundant and (B) 
Fully Redundant Clock Networks. .......................................................................... 138 
4.26. (A) Cumulative Risk per Level of the ClkA Tree (B) Risk of the Individual Nodes in 
ClkA Tree Vs Insertion Delay. ............................................................................... 139 
4.27. Cumulative Risk per Level of the (A) ClkB Tree (B) ClkC Tree. ......................... 140 
  
4.28. Proposed Integrated TMR Pulse FF and Temporally Redundant Clock Tree Design.
 ................................................................................................................................. 142 
4.29. TMR AES Design in TC-23 Implemented with (A) Non-Redundant or Single Clock 
Tree and (B) Triple Redundant Clock Trees. .......................................................... 143 
4.30. Comparison of Overall Power Consumed due to Sequential Elements in AES When 
Implemented with TPFF Methodology [Sushil15], Proposed and BISER FFs. ..... 146 
5.1. Clock Frequencies for SDR, DDR, DDR2 and DDR3 Memories. .......................... 147 
5.2. Quadrature Clocks Produced by the DLL for DDR2. .............................................. 149 
5.3. Proposed DLL Architecture Top Level Block Diagram. ......................................... 150 
5.4. DDL Block Diagram. ............................................................................................... 152 
5.5. The Coarse DDL using a Multiplexing Chain. AO Gates on a 14 Track Pitch are 
used. ........................................................................................................................ 153 
5.6. Quadrature Clock Pulse Width. ............................................................................... 154 
5.7. Coarse Chain Delay vs Coarse Select from Post Layout Extracted Simulations. Note 
Each Stage Adds a Delay of 60ps to the Total Delay of the Chain. ....................... 155 
5.8. Interpolator Based Fine Delay Stage Circuit. .......................................................... 156 
5.9. Simulated Fine Delay Edges (Rising And Falling) Produced by the Interpolator 
Based Fine Delay Line. ........................................................................................... 157 
5.10. Simulated Fine Edge Delay at Process Corners. Note the Impact of Edge Rates for 
P/N Mismatched Corners Impacts but is Quite Small. ........................................... 158 
5.11. (A) RHBD TDC using TMR Flip Flops and Triplicated Delay Lines. (B) Outputs 
are Voted as Shown. ............................................................................................... 160 
5.12. TC-23 Test Structure of the Coarse Delay Line to Generate a Hardened Coarse 
Clock. ...................................................................................................................... 161 
5.13. Layout of the TMR DCO Test Structure in TC-23. ............................................... 162 
5.14. Block Diagram of the Frequency Step Control Logic. .......................................... 164 
5.15. Divider Modes. ...................................................................................................... 166 
5.16. TC-23 Frequency Divider Circuit Diagram. .......................................................... 167 
5.17. TC-23 Coarse Frequency Generator Test Structure Top Level Block Diagram. .. 168 
  
5.18. Coarse Frequency Generator Test Structure Physical Implementation. ................ 169 
5.19. (A) Single Redundant TDC Test Structure Circuit and (B) Layout. ..................... 170 
5.20. Top Level Architecture Of The Proposed AD-DLL. Note all Clock Signals are 
Highlighted in Blue, while Control Signals are shown in Black. ........................... 171 
5.21. Loop Operation in the CLA Mode. The Multiplexer Selects and Other Control 
Signals have not been shown for Clarity. ............................................................... 174 
5.22. Timing Waveforms for the CLA Mode Operation. ............................................... 175 
5.23. Timing Waveforms for the Lock Mode Operation. ............................................... 176 
5.24. Multiplexer Configurations for Different Modes. ................................................. 177 
5.25. Multiplexer Configurations for Different Modes. ................................................. 178 
5.26. Design of the DDL Multiplexer Select Sub-Block. The Flow Chart and 
Corresponding Waveforms in each of the DLL Modes is shown. .......................... 180 
5.27. Design of the TDC Multiplexer Select Sub-Block. The Flow Chart and the 
Waveforms for each Mode are Presented. .............................................................. 182 
5.28. Design of the TDC Multiplexer to Incorporate Both Modes. ................................ 183 
5.29. Algorithm For Generating the Coarse and Fine Delay Selects in CLA Mode. ..... 184 
5.30. Modelsim Waveforms of the DLL Working in the CLA Mode. Note the RefClock 
and SysClock are Locked and in Sync After DLLlock is achieved. The fdbkclock is 
Twice the Frequency of the RefClock and is also Phase Locked with the RefClock.
 ................................................................................................................................. 185 
 
CHAPTER 1. INTRODUCTION 
1.1. Introduction 
In recent technologies, electronic systems operating in radiation environments are 
more vulnerable to radiation effects, due to decreasing feature sizes and supply voltages. 
Single event effects (SEEs) are caused when radiation particles such as protons, neutrons, 
alpha particles, or heavy ions strike sensitive diffusion regions in VLSI designs. 
Historically, SEEs were troublesome for military and space applications as radiation-
induced single-event transient (SET) phenomenon has been identified as the primary 
failure mechanism behind several spacecraft malfunctions in recent years [Koga93, 
Ecof94, Sore00, Prit02]. However, other critical applications like biomedical, industrial 
and banking also demand highly reliable systems [Nara06]. The study and analysis of 
radiation effects on circuits has been a major area of research. The technique of designing 
and fabricating electronic systems to withstand radiation is called radiation hardening. This 
chapter provides an overview of the radiation environment, radiation effects on devices 
and circuits and techniques to achieve radiation hardness. 
1.2. Radiation Environment in space 
The space environment contains phenomena that are potentially hazardous to 
humans and electronic systems. The natural environment is not static and includes the 
variations caused by solar flares or coronal mass ejections. This complex environment 
creates a multitude of issues for the electronics used in a satellite [Barth03]. The spectrum 
of radiation environments typically consists of charged particles originating from various 
sources such as   
i) Protons and other heavy nuclei associated with solar events  
ii) Trapped radiation (by the Earth’s Van Allen belts)  
iii) Galactic Cosmic Rays that consist of interplanetary protons, electrons and 
ionized heavy nuclei 
iv) Neutrons (primarily cosmic ray albedo-neutrons or CRAN particles) 
v) Photons (γ-rays, X-rays, UV/EUV, optical, infra-red and radio waves) 
Solar energetic particles (SEP) are large fluxes of atomic particles, primarily 
electrons and protons with energies of the order of MeV that are accelerated and expelled 
from the sun due to solar flares. Solar flare durations are highly variable on the scale of 
minutes to a few days in response to events such as storms or sub-storms in the Sun. 
Trapped particles, which are 93% protons, 6% alpha particles, and about 1% heavy nuclei, 
contribute the most to radiation effects in low and medium Earth orbits that pass through 
the Van Allen belts [Stass88]. The Van Allen belts are toroids of particles trapped by the 
earth’s magnetic field and consist of a low altitude “inner belt” from 100 to ~6000 km with 
high energy (tens of MeV/nucleon) protons and electrons, and a high altitude “outer belt” 
of up to 60,000 km with mostly high energy (1-10 MeV/nucleon) electrons [Guss96]. Fig 
1.1 illustrates the space environment and the Van Allen belts with respect to the earth. 

Fig 1.1. Cartoon showing the space radiation environment (Illustration by K. Endo, Nikkei 
Science Inc., Japan). Labeled regions: solar flare, galactic cosmic rays, solar protons and 
heavy ions, Van Allen belts, plasmasphere and magnetosphere. 

Galactic cosmic rays (GCR) can comprise ions of all elements of the periodic 
table and are typically found in much lower fluxes than trapped particles, but can have 
energies as high as TeVs/nucleon. The GCR intensity is modulated by the 11-year solar 
cycle [Barth03]. GCRs are about 87% protons, 12% helium, with the remainder composed 
of heavy ions through actinides [Fred96].  
CRAN particles are primarily secondary cosmic ray neutrons produced by the 
interaction of GCR with the earth’s atmosphere at about 55 km above the earth’s surface. 
These have a half-life of 11.7 minutes and decay into an electron, a proton 
and an anti-neutrino. Secondary neutrons are the most important contributor to single event 
effects at altitudes below 60,000 feet. 
The rest of the electromagnetic spectrum consists of 
X-rays (wavelengths 10Å – 100Å), extreme ultraviolet or EUV (100Å – 1000Å), 
ultraviolet (1000Å – 3500Å), the visible spectrum (3500Å – 7000Å) and the infra-red 
spectrum (0.7 – 7mm). Each type of radiation has a characteristic spectrum and preferred 
interaction mode with matter that gives rise to various effects such as photo-ionization, 
photoelectron emission, the Compton effect, etc. Photon interactions are not a primary 
concern for satellites in the natural space environment [Fred96]. 
1.3. Effect of radiation particles on circuits 
Radiation can cause degradation, malfunction, loss of function or even permanent 
damage in electronic circuits and devices [Kerns88].  The manner in which radiation 
interacts with solid material depends on the type, kinetic energy, mass, and charge state of 
the incoming particle and the mass, atomic number and density of the target material. 
When an ion travels through a material, it loses its kinetic energy predominantly 
through coulombic interactions with the electrons of that material and thus leaves a trail of 
ionization in its wake (ions can also interact directly with material nuclei but this reaction 
probability is usually significantly lower than the electronic interaction). The higher the 
energy of the ion, the farther it travels before being "stopped" by the material. The distance 
required to stop an ion (its range) is both a function of its energy and the properties of the 
material (primarily the material’s density) in which it is traveling. The stopping power or 
linear energy transfer (LET) is a function of the material through which a charged particle 
is traveling and refers to the energy loss of the particle per unit length in the material. The 
LET (MeV-cm2/mg) is a function of both the ion’s mass and energy and density of the 
target material.  
$$\mathrm{LET} = \frac{1}{\rho}\,\frac{dE}{dx} \quad (\mathrm{MeV \cdot cm^2/mg}) \qquad (1)$$
where $dE/dx$ is the energy loss per unit length and $\rho$ is the material density in mg/cm3. The 
maximum LET value near the end of the particle’s range is called the Bragg peak 
[Hseih81].  
Radiation particles interact with material, depositing charge by two major 
mechanisms: direct ionization and indirect ionization. In direct ionization, a high energy 
charged particle interacts directly with the electrons in the target material, breaking them 
free from their bound states, creating a dense track of free charge. During indirect 
ionization, the high energy particle collides with a nucleus in the target material, freeing 
that nucleus from its bound location. This recoiling nucleus is the charged particle that 
creates the dense track of charge. 
The two major radiation effects on MOS circuits and devices are single event 
effects (SEEs) [Mavis02] and total ionizing dose (TID) effects [Barn06]. TID effects and TID 
hardening are beyond the scope of this work and will not be discussed further. All material 
presented in this work focuses on mitigating SEEs. 
 
Fig 1.2. (a) Ion strike at the output of an inverter. (b) Funnel formation and charge 
collection mechanisms in the semiconductor following an ion strike. Panel labels: heavy 
ions, protons and neutrons; ion path; funnel; drift; depletion region; diffusion; recombination. 

1.3.1. Single Event effects in CMOS 
By definition SEEs are caused by a single radiation particle strike. All single-event 
effects are caused by the same fundamental mechanism: collection of charge at a sensitive 
region of a microcircuit following the passage of an energetic particle through the device 
as shown in Fig 1.2(a). The three major classes of particles responsible for SEE are heavy 
ions, protons and neutrons. Radiation effects from heavy ions are most often due to direct 
ionization while the vast majority of SEEs from protons are due to indirect ionization 
through collisions with heavier nuclei. SEEs from neutrons are entirely due to indirect 
ionization, as neutrons carry no charge and therefore cannot ionize directly [Sagg05]. 
1.3.1.1. SEE Mechanism  
There are three stages involved in the formation of a SEE: charge generation, 
charge collection and circuit response. Charge generation is decided by the particle’s mass 
and energy and the properties of the materials it passes through. Charge is generated from 
a single event phenomenon generally within a few microns of the junction. In silicon, one 
electron-hole pair is produced for every 3.6 eV of energy lost by the impinging radiation. 
As silicon has a density of 2328 mg/cm3, it is easy to calculate from equation (1) that an 
LET of 97 MeV-cm2/mg corresponds to a charge deposition of 1 pC/µm. Hence the amount 
of charge deposited per micron of track length in silicon can be given by the formula  
$$Q = 0.01036 \cdot \mathrm{LET} \quad (\mathrm{pC}/\mu\mathrm{m}) \qquad (2)$$
where LET is expressed in MeV-cm2/mg. 
Thus the collected charge (Q) for these events is from 1-100 fC depending on the type of 
ion, its trajectory, and its energy over the path through or near the junction. The most 
sensitive semiconductor device structure is the reverse-biased junction. In the worst case the 
junction is floating (as in DRAMs, dynamic logic circuits, and some analog designs) and 
is extremely sensitive to any charge collected from a radiation event. 
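The arithmetic behind equation (2) can be checked with a short sketch. It uses only the constants quoted above (silicon density, 3.6 eV per electron-hole pair) together with the elementary charge. The function names and the track length argument are illustrative assumptions and are not part of the original work.

```python
# Minimal sketch (not from the dissertation): charge deposited by an ion in
# silicon, using the constants quoted in the text and equations (1) and (2).

ELEMENTARY_CHARGE_C = 1.602176634e-19  # charge of one electron, in coulombs
SI_DENSITY_MG_PER_CM3 = 2328.0         # silicon density quoted in the text
EV_PER_EH_PAIR = 3.6                   # energy per electron-hole pair (text)


def charge_per_micron_pc(let_mev_cm2_per_mg: float) -> float:
    """Deposited charge per micron of ion track in silicon, in pC/um."""
    # Equation (1) rearranged: dE/dx = LET * rho, in MeV/cm.
    de_dx_mev_per_cm = let_mev_cm2_per_mg * SI_DENSITY_MG_PER_CM3
    # MeV/cm -> eV/um (1 MeV = 1e6 eV, 1 cm = 1e4 um).
    de_dx_ev_per_um = de_dx_mev_per_cm * 1.0e6 / 1.0e4
    # One electron-hole pair per 3.6 eV of energy lost.
    pairs_per_um = de_dx_ev_per_um / EV_PER_EH_PAIR
    return pairs_per_um * ELEMENTARY_CHARGE_C * 1.0e12  # C -> pC


def deposited_charge_fc(let_mev_cm2_per_mg: float, path_length_um: float) -> float:
    """Charge deposited over an assumed (hypothetical) track length, in fC."""
    return charge_per_micron_pc(let_mev_cm2_per_mg) * path_length_um * 1.0e3


if __name__ == "__main__":
    print(f"pC/um per unit LET: {charge_per_micron_pc(1.0):.5f}")
    print(f"pC/um at LET = 97:  {charge_per_micron_pc(97.0):.3f}")
    print(f"fC over 1 um at LET = 30: {deposited_charge_fc(30.0, 1.0):.1f}")
```

The coefficient for unit LET comes out close to the 0.01036 used in equation (2), and an LET of 97 MeV-cm2/mg gives approximately 1 pC/µm, as stated. The charge actually collected at a node also depends on funneling and diffusion, which this sketch does not model.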
There are basically three mechanisms that act on the charge deposited by an 
energetic particle strike: 1) carriers can move by drift in response to applied or built-in 
fields in the device, 2) carriers can move by diffusion under the influence of carrier 
concentration gradients within the device, or 3) carriers can be annihilated by 
recombination through direct or indirect processes. These three mechanisms are of course 
not unique to the particle strike problem and are in fact the governing processes of charge 
transport in semiconductors under most operating conditions [Dodd03]. As discussed, when 
a particle strikes a microelectronic device, the most sensitive regions are reverse-biased p/n 
junctions, as illustrated in Fig 1.2(b).  
Charge generated along the particle track can locally collapse the junction electric 
field due to the highly conductive nature of the charge track creating a “field funnel” as 
shown in Fig 1.3 [Hseih81a]. This funneling effect can increase charge collection at the 
struck node by extending the junction electric field away from the junction and deep into 
the substrate, such that charge deposited some distance from the junction can be collected 
through the efficient drift process. The high field present in a reverse-biased junction 
depletion region can very efficiently collect the particle-induced charge through drift 
processes, leading to a transient current at the junction contact. Strikes near a depletion 
region can also result in a significant transient current as carriers diffuse into the vicinity 
of the depletion region field where they can be efficiently collected. Note that even for 
direct strikes, diffusion plays a major role as carriers generated beyond the depletion region 
can diffuse back toward the junction. The sensitive volume is the approximate volume 
where collection of generated carriers from an ion track occurs, and results in unwanted 
current flow in associated circuit nodes. 
Funneling does not require a direct strike on a depletion region. Near misses can 
also cause funneling if a high enough carrier density diffuses into the depletion region to 
collapse it [Hseih81a]. Due to the differences in the hole and electron mobility, funneling 
occurs in reverse biased n/p diodes, but is much weaker or nonexistent in equivalent p/n 
diodes. The applied voltage at the struck junction is not a constant and in fact the struck 
node may switch from being reverse-biased to zero-biased. In such cases, funneling may 
play a role in the early-time response of the circuit by helping initially flip the node voltage, 
but it is the late-time collection by diffusion that ensures the node stays flipped. When this 
funnel-assisted drift or diffusion following an energetic particle strike causes the charge 
state of a node to change, it causes an error in the circuit operation. 

Fig 1.3. Funneling in an n+/p silicon junction following an ion strike showing contours of 
(a) electrostatic potential and (b) electron concentration, after [Hseih81a]. 

The device characteristic that determines the upset sensitivity of a device is its 
critical charge (Qcrit). This is the minimum amount of charge that must be collected at the 
terminal of the device to cause the single event effect. 
1.3.1.2. Types of single event effects 
Single-event effects may be broadly characterized as either non-destructive 
(causing a soft error) or destructive SEE (resulting in a hard error). The error is “soft” 
because the circuit/device itself is not permanently damaged by radiation – if new data is 
written to the bit, the device will store it correctly – in contrast, a “hard” error is manifested 
when the device is physically damaged such that improper operation occurs, data is lost 
and the damaged state is permanent. 
Examples of non-destructive SEE include single-event transients, single event 
upsets in memory circuits (SEU), multi bit upsets (MBU) and single event functional 
interrupts (SEFI). Destructive SEE include such phenomena as single-event latchup (SEL, 
which can be either destructive or non-destructive depending on circuit design), single-
event burnout (SEB), and single-event gate rupture (SEGR). 
A single event transient (SET) is defined as a momentary voltage excursion 
(voltage glitch) at a node in an integrated circuit due to the passage of a charged particle 
[Heil89, Bene04, Gadl04]. Under certain conditions, the voltage spike can propagate 
through combinational logic away from where it was generated and eventually appear at 
the circuit’s output. It may also be captured either locally if it is generated within a latch, 
or non-locally if it first propagates through the circuit before being captured at the input of 
a latch. When a charged particle strikes one of the sensitive nodes of a memory cell, such 
as a drain in an off state transistor, it generates a transient current pulse that can turn on the 
gate of the other complementary transistor. This effect can produce an inversion in the 
stored value, in other words, a bit flip in the memory cell causing a single event upset 
(SEU) [Sagg05]. Once a SET is captured by a latch or flip-flop it becomes a SEU, and it is 
impossible to distinguish SEUs that result from SETs that have propagated from logic 
circuits, from SEUs that have been generated within the latch or flip-flop itself [Axne86]. 
If the radiation event is of very high energy, more than a single SRAM bit might be 
affected, creating a multi-bit upset (MBU) as opposed to a single bit upset (SBU). MBUs 
are defined as the occurrence of two or more bit upsets, appearing within the same clock 
cycle from a single particle hit, to distinguish from random multiple hits within a single 
cycle [Muss96]. While MBUs are usually a small fraction of the total observed SEU rate, 
their occurrence has implications for memory architecture in systems utilizing error 
correction. 
Another type of soft error occurs when the bit that is flipped is in a critical system 
control register, such as those found in field programmable gate arrays (FPGAs) or DRAM 
control circuitry, so that the error causes the product to malfunction. This type of soft error, 
called a single event functional interrupt (SEFI) [Koga97], directly impacts the IC reliability 
since each SEFI leads to a direct product malfunction, as opposed to typical memory soft 
errors that may or may not affect the final product operation depending on the algorithm, data 
sensitivity, etc. Digital functions most likely to cause SEFIs are clock and control trees, 
phase locked loops, counters, address registers and poorly regulated power networks. 
Single event latch-up is a steady high current state that results when a parasitic 
silicon controlled rectifier (SCR) (p-n-p-n) structure is triggered into a regenerative 
forward bias [Dodd03]. A circuit in latch-up will continue to malfunction until the event is 
shut down. If the latch-up current is large enough this can be a destructive event. 
The last failure modes related to SEEs are single event gate rupture (SEGR) and 
single event burnout (SEB) [Dodd03]. Both mechanisms are destructive and lead to 
hard failures. In SEGR, the gate oxide is rendered conductive (breakdown) by a hit 
to the gate region, while in SEB the junction breaks down when the 
event causes avalanche and thermal runaway. 
The rate at which soft errors occur is called soft error rate (SER). The unit of 
measure commonly used with SER and other hard reliability mechanisms is failure in time 
(FIT). One FIT is equivalent to one failure in 10^9 device hours.  
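The FIT definition above translates directly into a few conversions. The sketch below is illustrative only: the function names are hypothetical, and the constant failure rate it assumes is a common simplification rather than a result of this work.

```python
# Minimal sketch (not from the dissertation): conversions based on the
# definition 1 FIT = 1 failure per 1e9 device hours.
# A constant failure rate over time is assumed throughout.

DEVICE_HOURS_PER_FIT = 1.0e9


def fit_rate(failures: float, device_hours: float) -> float:
    """Failure rate in FIT from an observed failure count and device hours."""
    return failures / device_hours * DEVICE_HOURS_PER_FIT


def mean_time_to_failure_hours(fit: float) -> float:
    """Mean time to failure of one device, in hours, for a given FIT rate."""
    return DEVICE_HOURS_PER_FIT / fit


def expected_failures(fit: float, num_devices: int, hours: float) -> float:
    """Expected failures across a population of devices over an interval."""
    return fit * num_devices * hours / DEVICE_HOURS_PER_FIT


if __name__ == "__main__":
    # Hypothetical numbers chosen only to exercise the functions.
    print(fit_rate(failures=1, device_hours=1.0e9))
    print(mean_time_to_failure_hours(1000.0))
    print(expected_failures(fit=1000.0, num_devices=100, hours=8760.0))
```

Device hours are the product of the number of devices and the hours each one operates, so a large population observed for a short time and a single device observed for a long time contribute equally under this assumption.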
1.4. Radiation hardening 
Radiation effects can be mitigated by using design techniques at all levels of the 
system [Kerns88]. From the basic structure level to the circuit level to the system level, 
there are methods that can be implemented to mitigate all types of radiation effects. 
While it is not possible to discuss all the published techniques in soft error 
mitigation, the most common techniques are outlined in the subsequent sections. 
1.4.1. Radiation hardening by process (RHBP) 
Radiation hardened by process (RHBP) is a term that describes a method to harden 
a device to TID and/or SEE using the fabrication process. This is done either through 
carefully selecting the starting material or by modifying the process and/or the design of 
device primitives used to create a semiconductor device. Modification is typically done by 
adding or changing a process step without impacting the performance or normal operating 
characteristics of the device. Modifying fabrication processes is expensive and may only 
be feasible for specific applications. This method of hardening is also not considered for 
discussion in this work. 
1.4.2. Radiation Hardening by Design (RHBD) 
Radiation Hardened by Design (RHBD) uses design techniques implemented in a 
standard commercial foundry to make a non-hardened process hard to a certain degree. 
RHBD techniques promise to improve the power/performance of such integrated circuits 
by utilizing state-of-the-art commercial foundry silicon processes [Lacoe00]. The work 
carried out in this dissertation and presented in this report is based on RHBD principles. 
1.4.2.1. Design techniques for mitigating SEE effects 
o Layout and Electrical level based techniques: 
 Built-in sensors for ionization detection: The bulk-built in current sensor (BICS) 
works as a monitor that senses the current at the bulk terminal. During normal 
operation, the current in the bulk is approximately zero. When a charged particle 
generates a current in the bulk, it is sensed by the BICS and the system control 
logic is notified to perform some fault tolerant technique to tolerate the detected 
SET and to reset the bulk-BICS. These techniques however are not suitable for 
modern processes and are hence not discussed in further detail. 
 Transistor resizing for charge dissipation: The idea of transistor resizing is to 
enlarge the width of some transistors in order to increase the capacitance of the 
most sensitive nodes, so that the node’s critical charge is 
increased, making it harder for an SET to upset it [Dodd95]. 
o Logic-level based techniques: 
 Hardware (Spatial) redundancy using majority voting : A triplicate voting 
system as shown in Fig. 1.4(a) compares the outputs of three identical devices 
bit by bit, relying on the fact that while each bit is equally vulnerable to upset, 
the probability of the same bit upsetting in two independent devices is very low 
[Hind09] [Hind11]. 

Fig 1.4. (a) Implementation of a triple-modulo redundancy (TMR) based hardware 
redundancy scheme and (b) temporal redundancy based on delayed sampling in 
flip-flops after [Mavis02]. Note T represents the delay introduced. 

 Time redundancy using temporal filtering : Temporal voting is employed against 
SETs in which the same device or data path is polled 3 times and the results 
stored and voted as shown in Fig 1.4 (b). One of the side effects of using this 
technique is that it limits the maximum speed at which the circuit can operate. If 
the delay is long enough it will filter out the transient [Mavis02] [Weav04]. 
 Error correcting codes (ECC) for detection and correction of bit-flips [Fuji90]: 
In an error detection and correction (EDAC) scheme, redundant bits are added to a
data word to enable the system to detect and correct errors in the data (caused by 
SEU or SEFI) using ECC schemes such as Hamming codes [Chen84]. 
 Hardened memory cells for bit-flip avoidance: Memory
elements can be protected against SEUs by modifying their original design with
extra resistors or transistors that are able to recover the stored value if a particle strikes
one of the drains of a transistor in the “off” state [Yao10] [Knud06].
o System level techniques: 
 Recovery and recomputation : Microprocessors maintain checkpoints that detect 
faults at various stages and try to recover the information [LaBel06]. Forward 
error recovery is detecting an error and continuing on in time, while attempting 
to mitigate the effects of the faults that may have caused the error. Backward 
error recovery is detecting an error and retracting back to an earlier system state 
or time. 
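The spatial and temporal voting schemes listed above share the same 2-of-3 majority function. The following is a minimal Python sketch of that idea, not the circuits of Fig. 1.4: the function names, the 8-bit register value and the SET window (300 ps to 500 ps) are hypothetical values chosen only for illustration.

```python
def majority(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority vote across three equal-width words."""
    return (a & b) | (b & c) | (a & c)


def temporal_vote(node, t_ps: int, delta_ps: int) -> int:
    """Sample the same node at t, t + delta and t + 2*delta, then vote."""
    return majority(node(t_ps), node(t_ps + delta_ps), node(t_ps + 2 * delta_ps))


def node_with_set(t_ps: int) -> int:
    """Hypothetical node held at 1; an SET pulls it to 0 for 300 <= t < 500 ps."""
    return 0 if 300 <= t_ps < 500 else 1


# Spatial redundancy: three copies of one register, one copy upset by an SEU.
golden = 0b10110010
copies = (golden, golden ^ 0b00001000, golden)   # bit 3 flipped in the second copy
voted_spatial = majority(*copies)                # the two good copies outvote the upset

# Temporal redundancy: the first sample lands inside the SET.
voted_long_delay = temporal_vote(node_with_set, 350, 400)   # samples 0, 1, 1
voted_short_delay = temporal_vote(node_with_set, 350, 50)   # samples 0, 0, 0
```

With a sampling delay longer than the transient (400 ps against a 200 ps SET) only one sample is corrupted and the vote recovers the correct value; with a 50 ps delay all three samples fall inside the transient and the vote is wrong, which is why the delay must exceed the expected SET duration.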
1.5. Outline 
 This chapter provides a brief overview of the radiation environment, the different
effects of radiation on circuits and common mitigation techniques. A basic understanding
of these principles is crucial to the design of fault tolerant systems. Chapter 2 discusses the
basics of clock design in VLSI systems and the components of clock networks: sequential
elements, clock distribution and clock generation. Chapter 3 describes different
approaches for designing radiation hardened sequential elements: flip-flops and pulse
latches. Chapter 4 discusses the design of radiation hardened clock distribution
networks, both custom and CAD tool based clock trees for microprocessors, and proposes
techniques to evaluate the vulnerabilities of clock trees. Chapter 5 elaborates on radiation
hardened clock generation using programmable frequency generators and proposes the
design of an RHBD AD-DLL for future use in DDR clocking systems. Chapter 6 concludes
this dissertation.
CHAPTER 2. VLSI CLOCKING 
This chapter describes the basics of clocking in VLSI systems. The clock network 
consists of different parts such as clock generation and distribution. In addition, the design 
of sequential elements has a profound effect on the clock network and depends upon the 
clocking scheme chosen in the system. This chapter studies in detail the different aspects 
of the clock network: clock generation, distribution and sequential element design. This 
chapter intends to lay the foundation for most of the radiation hardened designs and circuits 
proposed in subsequent chapters. 
2.1. VLSI Clocking Basics 
Most synchronous digital ICs are finite state machines consisting of cascaded banks 
of sequential registers with combinational logic in between each of the register sets to form 
a pipeline as shown in Fig. 2.1. A periodic signal called the clock is used to define the time 
reference for the movement of data throughout the system [Fried01].  
Fig. 2.1. Clock network in synchronous VLSI designs: a clock generator drives the clock distribution network, which clocks the sequential elements (FFs) on either side of the combinational logic.
The clock controls data propagation through the system by enabling or disabling the
various memory storage elements or sequential elements. As the clock signal fans out to
the large number of sequential elements and provides timing synchronization across
different parts of the system, it is critical for both the functionality and performance of the 
entire chip. Clocks also provide synchronization to external devices and control the inputs 
and outputs to the chip. This large fanout, the greater on-chip distances travelled and higher 
operating speeds impose stringent constraints on clock network design.  
Clock design consists of generating the clock of desired frequency (or creating a 
clock source) and distributing it reliably to all the sequential and memory elements (or 
clock sinks) throughout the chip. A phase locked loop (PLL) usually generates the clock at 
the desired frequency that is then distributed to the thousands of clock sinks, ideally all 
receiving the clock at the same time through the clock distribution network. The registers 
transition state at the clock edges and data propagates synchronously throughout the
system.  
In the following sections, we study various commonly used design and 
implementation techniques of each different section of the clock network: sequential 
elements, clock distribution and clock generation. 
2.2. Sequential Element Design 
Sequential elements are used extensively in VLSI systems for both data 
synchronization and data storage. All sequential elements use the clock signal to transition 
from one state to another. Sequential elements may be static or dynamic. Static registers 
preserve state as long as the circuit is powered while dynamic registers store data for a 
short period of time. Only static registers are presented in this work as we primarily focus 
on radiation hardened designs. 
Sequential elements are classified into two major categories: level sensitive circuits 
and edge-triggered circuits [Chandra01]. Level sensitive circuits (such as latches) change 
state at a particular logic level of the clock while edge-triggered circuits (e.g., flip-flops)
transition at a given change in the clock state such as a rising or a falling edge. 
2.2.1. Latch  
The latch is the most basic level sensitive sequential storage element in use. The 
latch propagates the data input D, to its output Q when the clock CLK is asserted and when 
the clock is de-asserted, the latch retains the previously sampled value at Q. Thus the latch 
operates in two modes, transparent (D propagates to Q) and opaque (Q is retained) 
depending on the state of the clock. The phase of the clock for which the latch may be
transparent can be either positive (called transparent high latch) or negative (called
transparent low latch). Fig. 2.2(a) illustrates the operation of a transparent high latch.
Fig. 2.2. (a) Latch operation, showing tSU, tH, tC2Q and tD2Q on the CLK, D and Q waveforms. (b) Static latch schematic [Chandra01].
The core of a latch is a bistable circuit, i.e., a circuit having two stable states that
represent a logic 0 or 1. The data is stored in the latch by changing the bistable circuit to
the required state. Fig. 2.2(b) shows the most commonly used static latch design. The clock
edge at which the latch transitions from the transparent mode to the opaque mode is called 
the closing edge of the latch (typically the falling edge in a transparent high latch). To 
ensure that the proper data has been captured, the data D must be set up to its correct value 
for a minimum duration called the setup-time (tSU) before the latch becomes opaque and 
the data must also be held stable for a minimum hold time (tH), after the closing edge of 
the clock CLK. The time taken for the data to propagate from D to Q when the latch is
already transparent is called the data latency (tD2Q), and the time taken for the data to
appear at Q after the opening edge of the clock CLK (the rising edge for a transparent high
latch) is called the latch latency (tC2Q).
2.2.2. Flip-Flop 
The D flip-flop is the most ubiquitous sequential circuit used in digital designs. The
edge triggered D flip-flop is constructed from two latches that are transparent in
complementary phases of the clock in a master-slave configuration [Chandra01]. Thus in
a master-slave flip-flop, the master and slave latches are in different modes, one transparent
and the other opaque, at any given point in time. The MSFF can be either positive-edge triggered or
negative-edge triggered.
Fig. 2.3. (a) D flip-flop constructed from two latches. (b) D flip-flop operation, showing tSU, tH and tC2Q on the CLK, D and Q waveforms.
Fig. 2.3(a) shows the positive edge triggered master-slave flip-flop (MSFF)
constructed from a transparent low latch (master) followed by a transparent high latch 
(slave). When the clock CLK is low, the master latch is transparent sampling the changes 
at its input D. The slave is however opaque and retains the previously sampled value. At 
the rising edge of the clock, the master latch transitions to the opaque state while the slave 
becomes transparent, propagating the value stored by the master to the output. 
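The master-slave behavior described above can be illustrated with a small behavioral model. This is a zero-delay Python sketch under simplifying assumptions (no setup, hold or clock-to-Q timing is modeled), and the class and function names are hypothetical.

```python
class Latch:
    """Level-sensitive latch, transparent when clk equals transparent_level."""

    def __init__(self, transparent_level: int):
        self.transparent_level = transparent_level
        self.q = 0

    def update(self, d: int, clk: int) -> int:
        if clk == self.transparent_level:   # transparent: D propagates to Q
            self.q = d
        return self.q                       # opaque: Q is retained


class MasterSlaveFF:
    """Positive-edge triggered FF: transparent low master, transparent high slave."""

    def __init__(self):
        self.master = Latch(transparent_level=0)
        self.slave = Latch(transparent_level=1)

    def update(self, d: int, clk: int) -> int:
        m = self.master.update(d, clk)      # master samples D while CLK is low
        return self.slave.update(m, clk)    # slave passes it on while CLK is high


def simulate(d_wave, clk_wave):
    ff = MasterSlaveFF()
    return [ff.update(d, clk) for d, clk in zip(d_wave, clk_wave)]


clk = [0, 0, 1, 1, 0, 0, 1, 1]
d   = [1, 1, 1, 0, 0, 0, 0, 1]
q = simulate(d, clk)   # Q changes only at the two rising edges (indices 2 and 6)
```

In this trace D changes while the clock is high (indices 3 and 7), but Q does not follow, because the master is opaque during that phase.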
Similar to the latch, the minimum time for which the data has to be stable before the
rising edge of the clock for it to be reliably stored is called the flip-flop setup time (tSU) and
the minimum time for which the data has to be held constant after the rising edge of the
clock is the hold time (tH). The output Q is available tC2Q after the rising edge of the clock
as shown in Fig. 2.3(b). Fig. 2.4 shows the standard implementation of the MSFF included
in most standard cell libraries.
Fig. 2.4. Master-slave flip-flop schematic [Chandra01] (signals: D, CLK, CLKN, CLKB, Q, QN).
 
2.2.3. Pulse-clocked Latch 
As the conventional FF consists of two latches operating as a master-slave pair, its
overall area and power are considerably larger than those of latch based designs. Thus
for large designs with size or power restrictions, a pulse latch or pulse clocked latch with
only a single latch is used [Shiba06] to emulate edge-triggered operation.
Fig. 2.5. (a) Pulse latch schematic and (b) operation. The delay can be generated by buffers, depending on the delay required. Note that the pulse generator can be shared across multiple latches.
A pulse latch is a latch clocked by a pulse generator. The pulse generator generates
a high pulse for a brief duration at the rising edge of the clock as shown in Fig. 2.5. Thus
the pulse latch emulates master-slave flip-flop operation by being transparent during the brief
high clock pulse and opaque for the rest of the clock period. The pulse generator can be shared by
multiple latches, thereby reducing the area and power overhead. However, since the latch 
is transparent in the high phase of the pulse, the data must be held constant during this time 
duration, thereby significantly increasing the hold time required. As with a latch, the timing 
parameters of the pulse latch (tSU, tH and tC2Q) are measured to the closing edge of the pulse. 
The pulse latch or pulse-clocked latch is commonly used in high performance 
microprocessors such as the Pentium 4 [Kurd01] and Itanium [Naff02]. Integrity of the 
pulse shape is crucial for reliable operation and pulse degradation may result in setup/hold 
violations [Lee08].  
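The pulse latch operation described above can be sketched in discrete time. The Python model below assumes one common pulse generator structure, the system clock ANDed with an inverted, delayed copy of itself; this structure and the function names are assumptions made for illustration, not necessarily the circuit of Fig. 2.5.

```python
def pulse_generator(clk_wave, delay_steps: int):
    """pulse[t] = clk[t] AND NOT clk[t - delay_steps].

    The delayed clock is assumed to be 0 before t = delay_steps.
    """
    pulse = []
    for t, clk in enumerate(clk_wave):
        delayed = clk_wave[t - delay_steps] if t >= delay_steps else 0
        pulse.append(clk & (1 - delayed))
    return pulse


def pulse_latch(d_wave, pulse_wave, q0: int = 0):
    """Transparent high latch clocked by the pulse instead of the system clock."""
    q, out = q0, []
    for d, p in zip(d_wave, pulse_wave):
        if p:                 # transparent only during the short pulse
            q = d
        out.append(q)
    return out


clk = [0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1]
d   = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
pulse = pulse_generator(clk, delay_steps=1)   # one-step pulse at each rising edge
q = pulse_latch(d, pulse)
```

Because the latch is transparent for the whole pulse width, D must stay stable for at least `delay_steps` time steps after the rising edge, which is the increased hold time noted in the text.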
2.2.4. Timing constraints for sequential designs 
In synchronous systems, sequential elements introduce a period of time where no 
useful logic can be evaluated, known as dead time (tDEAD).
$$ t_{DEAD} = t_{C2Q} + t_{SU} \qquad (3) $$
The maximum clock frequency (fCLK), or minimum clock period tCLK, for the system is then
a function of the dead time,
$$ \frac{1}{f_{CLK}} = t_{CLK} \geq t_{DEAD} + t_{CLmax} \qquad (4) $$
where tCLmax is the largest worst case combinational logic delay in the chip.
A setup time violation occurs when data from the previous flip-flop does not
propagate through the combinational logic to the next FF in time to meet its setup time.
As this violation is due to the propagation delay through the logic elements between the
flops, it is a frequency dependent problem and can be addressed by lowering the frequency.
A hold time violation occurs when the hold time constraints imposed by the sequential
elements are violated. This means the data from the sender flop races through the shortest
combinational logic path, whose delay (tCLmin) is also called the contamination delay, and violates the hold
time of the subsequent receiver flop. Thus the following min delay constraint is imposed
on the system [Chandra01].
$$ t_{H} \leq t_{C2Q} + t_{CLmin} \qquad (5) $$
Hold time errors are frequency independent, so they cannot be fixed by slowing the clock; fixing these
violations is therefore of utmost importance when designing the chip. A hold time violation can be solved by adding
delay elements to the short paths between flip-flops to increase tCLmin.
Ideally the clock signals at the sending and receiving flops transition at the same
time. However, in practical designs, the clocks to the sending and receiving flops may
be temporally offset with respect to each other, changing the design parameters
considerably. The difference in clock arrival times between two sequentially adjacent 
registers is called clock skew tSKEW. The periodicity of the clock signal may also be affected 
by the deviation of its edges from their expected transition time causing jitter, tJIT. 
In the presence of clock skew and jitter, equation (4) becomes
$$ \frac{1}{f_{CLK}} = t_{CLK} \geq t_{DEAD} + t_{CLmax} + t_{SKEW} + t_{JIT} \qquad (6) $$
and equation (5) becomes
$$ t_{H} \leq t_{C2Q} + t_{CLmin} - t_{SKEW} \qquad (7) $$
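The max-delay and min-delay constraints of this section can be checked numerically. The sketch below evaluates the dead time, the minimum clock period and the hold slack; the function names and all delay values (in picoseconds) are hypothetical and chosen only to show how skew can turn a passing hold check into a failing one.

```python
def min_clock_period(t_c2q, t_su, t_cl_max, t_skew=0, t_jit=0):
    """Minimum clock period: dead time plus worst-case logic delay, skew and jitter."""
    t_dead = t_c2q + t_su                        # dead time, eq. (3)
    return t_dead + t_cl_max + t_skew + t_jit    # eq. (6); eq. (4) when skew = jitter = 0


def hold_slack(t_h, t_c2q, t_cl_min, t_skew=0):
    """Hold slack; a non-negative value means the min-delay constraint is met."""
    return t_c2q + t_cl_min - t_skew - t_h       # eq. (7); eq. (5) when skew = 0


# Hypothetical delays in picoseconds.
period_ideal = min_clock_period(t_c2q=80, t_su=40, t_cl_max=700)
period_real = min_clock_period(t_c2q=80, t_su=40, t_cl_max=700, t_skew=50, t_jit=30)
f_max_ghz = 1e3 / period_real                    # 1 / (period in ps) expressed in GHz

slack_ideal = hold_slack(t_h=60, t_c2q=80, t_cl_min=20)              # +40 ps: met
slack_skewed = hold_slack(t_h=60, t_c2q=80, t_cl_min=20, t_skew=50)  # -10 ps: violated
```

In this example, skew and jitter lengthen the minimum period from 820 ps to 900 ps, while the same 50 ps of skew turns a 40 ps hold margin into a 10 ps violation that no reduction in clock frequency can repair.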
 
2.3. Clock Distribution 
The goal of a robust clock distribution network in VLSI circuits is to permit the 
system to operate as fast as possible without creating any unnecessary timing uncertainties 
or decreasing functional chip yield. As all sequential logic is designed to change state in 
temporal reference to the clock, minimizing the clock delay and clock skew between 
different points in the chip is crucial for reliable operation. Clock timing skew is both 
systematic and random, but the former can be mitigated through good design. Different 
approaches to clock distribution have been developed based on chip requirements such as 
clock frequency, permissible skew, physical die area and the power budget.  
2.3.1. Buffered Clock Distribution Trees 
The most common approach to achieving equipotential clocks at all clock sinks is 
to use a buffered clock distribution tree [Fried01]. This consists of inserting buffers from 
the clock source to each of the clock sinks to create unique clock paths, forming a tree 
structure (see Fig. 2.6(a)). Large buffers may be used to drive the initial levels of the tree, 
while distributed buffers fan out to drive each individual register or memory group. Thus 
the clock source is described as the root of the tree, buffers that drive a group of registers 
are the branches and each clock sink is called a leaf node. The tree buffers are designed to 
provide enough current to drive the network capacitance while maintaining uniform clock 
slopes. The buffer drive strengths and capacitance of each node are progressively decreased
as the signal propagates to the lower levels of hierarchy in the tree. This minimizes 
reflections of the high speed clocks at branching points. 
The most common type of buffered clock tree is the symmetric H-tree or X-tree
structure [Fried01]. In an H-tree (or X-tree) design, the primary clock is buffered to the
center of the main H structure and then transmitted to the corners of the H (or X) structure. These
clocks are then buffered progressively into smaller H structures until the clock fans out to all
sinks (see Fig. 2.6(b)). Minimal clock skew is achieved by maintaining identical or delay
matched paths to each of the sinks in the symmetrical H topology. Fig. 2.6(c) shows a
balanced clock distribution H-tree in a microprocessor [Rest98].
Fig. 2.6. (a) Tree based clock distribution, from the clock source (root) through the branches to the clock sinks (leaves). (b) Symmetrical H-tree based clock distribution. (c) RC matched H-tree clock distribution network for a microprocessor, after [Rest98].
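The matched-path property of the symmetric H-tree can be illustrated with a short recursive construction. The Python sketch below is a geometric idealization: it uses Manhattan wire length from the root as a stand-in for delay and ignores buffers and RC effects; the function name and dimensions are hypothetical.

```python
def h_tree(x, y, half_w, half_h, levels, length=0.0):
    """Return [(sink_x, sink_y, wire_length_from_root)] for a symmetric H-tree.

    Each level routes from the center of an H to its four corners, and each
    corner becomes the center of an H with half the dimensions.
    """
    if levels == 0:
        return [(x, y, length)]
    sinks = []
    for sx in (-1, 1):              # left or right along the horizontal bar
        for sy in (-1, 1):          # up or down along the vertical leg
            sinks += h_tree(x + sx * half_w, y + sy * half_h,
                            half_w / 2, half_h / 2, levels - 1,
                            length + half_w + half_h)
    return sinks


sinks = h_tree(0.0, 0.0, half_w=4.0, half_h=4.0, levels=3)
lengths = {length for _, _, length in sinks}     # a single value: all paths match
positions = {(sx, sy) for sx, sy, _ in sinks}    # 64 distinct sink locations
```

Three levels yield 4^3 = 64 sinks, each at the same wire length from the root (8 + 4 + 2 = 14 units here), which is the geometric basis for the low skew of the H-tree.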
2.3.2. Clock Mesh  
In the clock mesh distribution network, the buffered clock tree signals are shunted 
to create a clock grid or mesh structure as shown in Fig. 2.7 (a).  The grid may be driven 
by multiple branches of buffers and is typically routed early in the design. The grid has the
advantage that since the branch resistances are in parallel, the overall effective clock skew
is minimized: differences between driver circuits, either designed in or due to effects such
as power supply noise, are averaged out. The grid, however, consumes significantly more
power and requires more metal wire area. A mesh architecture is used mainly in high
performance systems such as the IBM Power4 [Restle01] and Sun SPARC V9 [Heald00].
Power dissipation increasingly limits modern ICs. Since gating the clocks 
eliminates active power dissipation in the sequential and intervening combinational logic, 
it is the most effective low power technique. Gated clocks must follow the tree topology,
since individual branches are gated independently, as evident in Fig. 2.7(b). Thus, even
mesh networks generally end in a tree as shown.
Fig. 2.7. (a) Clock grid; (b) clock grid with clock gating. Clock gating can be added to the tree
as well, and may be implemented at multiple levels.
2.3.3. Clock Spine 
Finally, in many designs, the entire distribution network is compressed into a “clock
spine” or multiple spines. Since a clock spine is not spatially distributed, i.e., interspersed
with other blocks, its mesh or tree structures can be designed independently. Clock gating
is employed at different levels to gate different sections of the spine or different spines. Fig.
2.8 shows the clock distribution scheme on the 45nm Intel Penryn processor using multiple
spines [Varg07]. The spine also reduces the chip level design effort to balancing global
skew across different spines, as the spine typically has very little local skew at the different
outputs. However, a disadvantage of the clock spine is that the final clock outputs must
travel further to the clock sinks, i.e., the driven sequential elements, while maintaining
balanced skew.
Fig. 2.8. Clock distribution in the Intel Penryn processor using multiple clock spines, after
[Varg07].
2.4. Clock Generation 
Current microprocessors and high performance digital circuits operate at or above 
gigahertz clock frequencies. Crystal oscillators can generate low jitter clocks only over a
frequency range from tens of MHz to 200 MHz, so faster clocks must be synthesized on chip
from such a reference. The most commonly used circuits for
generating clocks of different frequencies are the phase locked loop (PLL) and delay locked 
loop (DLL) [Chandra01].  
A PLL is a negative feedback system in which an oscillator-generated clock signal 
is synchronized to track a reference clock signal, such that the rising/falling edges of the 
oscillator-generated clock align (“lock”) to the rising/falling edges of the reference clock 
[Gard79]. Thus the PLL is often used to synchronize the internal clock edges to the 
reference clock (which is typically generated from a crystal oscillator) such that the timing 
windows (tSU + tH) in all the flip-flops are with respect to the reference clock.
Another important application of the PLL is to multiply the frequency of the reference
clock to produce an on-chip clock that is faster than the reference and is synchronized in
phase or ‘phase locked’ with the reference. Apart from frequency synthesis, the locking
property of the PLL also has numerous other applications in communication systems (such
as frequency, amplitude, or phase modulation/demodulation, analog or digital), tone
decoding, clock and data recovery, self-tunable filters, motor speed control, etc.
DLLs are a special category of PLL circuits that do not generate their own clock 
signal but rather delay an input clock signal. DLLs generate variable delays to delay the
input signal so that it is phase locked with the edges of the reference signal. DLLs are also
commonly used in IO interfaces to hide clock distribution delays, for clock phase
manipulation and for multiphase clock generation.
Timing uncertainties in PLL and DLL circuits arise mainly from the
jitter and clock feed-through of the input reference clock and from the noise added by the core
circuits, such as the delay line, oscillator, phase detector and clock
distribution circuits. These jitter effects include both random jitter components from
the thermal noise of devices and deterministic jitter components coupled from
cross-talk and power supply noise.
2.4.1. PLL  
The PLL compares both frequency and phase of the oscillator-generated clock 
(called internal clock) and reference clock, automatically raising or lowering the frequency 
in a controlled oscillator until the internal clock is matched to the reference in both 
frequency and phase. A generic PLL (shown in Fig. 2.9) has three fundamental components
connected in a feedback loop: voltage controlled oscillator (VCO), phase detector (PD), and
loop filter (LPF) [Gard79].
 
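Fig. 2.9. Basic architecture of a PLL, after [Egan08]. The phase detector (PD) compares the reference clock with the internal clock returned through the feedback path; its phase error output passes through the loop filter (LPF) to form the control voltage of the voltage controlled oscillator (VCO), which generates the internal clock.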
The PLL operates as follows: The phase detector detects the phase difference 
between the reference clock and the internal clock and produces the phase error depending 
on whether the internal clock is ‘leading’ or ‘lagging’ the reference clock. The phase error 
can be in the form of Up or Down pulses indicating that the VCO output frequency needs 
to be increased or decreased to match the reference clock edge. The VCO is an oscillator 
whose frequency fosc  is proportional to the input control voltage. The input control voltage 
is generated from the PD output by filtering, using a low pass filter in analog domain or an 
integrator in the digital domain. 
The feedback path often includes a frequency divider which divides the internal (system)
clock frequency by a factor N to equal the reference clock frequency for comparison. Thus
the system clock can be N times faster than the reference clock. When in lock, the VCO
generates an output frequency and phase such that the phase detector detects no phase error 
between the reference and feedback inputs.  
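The locking behavior described above, including frequency multiplication by the divider ratio N, can be illustrated with a linear phase-domain simulation. The Python sketch below assumes the simplest possible loop, with no loop filter dynamics (H(s) = 1), so that the loop gain K_PD·K_VCO is lumped into one hypothetical parameter `k_loop`; all numeric values are illustrative. With this filter the loop locks in frequency but retains a static phase error.

```python
import math


def simulate_pll(f_ref, n_div, f_free, k_loop, dt, steps):
    """Euler integration of a linear phase-domain PLL with H(s) = 1.

    k_loop is the lumped gain K_PD * K_VCO in rad/s per radian of phase error
    (a hypothetical parameter). Returns (output frequency in Hz, phase error in rad).
    """
    theta_ref = 0.0
    theta_out = 0.0
    w_out = 2 * math.pi * f_free
    phase_error = 0.0
    for _ in range(steps):
        # Phase detector: reference phase versus divided-down output phase.
        phase_error = theta_ref - theta_out / n_div
        # VCO: frequency shifts from its free running value with the control signal.
        w_out = 2 * math.pi * f_free + k_loop * phase_error
        # Both phases advance over one time step.
        theta_ref += 2 * math.pi * f_ref * dt
        theta_out += w_out * dt
    return w_out / (2 * math.pi), phase_error


f_lock, phase_err = simulate_pll(f_ref=100e6, n_div=10, f_free=900e6,
                                 k_loop=1e9, dt=1e-10, steps=5000)
```

With a 100 MHz reference and N = 10, the output settles at 1 GHz even though the free running frequency is 900 MHz. The residual phase error is the price of using a loop filter without an integrator in this sketch.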
The general PLL behavior can be described using two parameters known as the
type and order. The PLL type specifies the open loop behavior of the PLL and is defined
as the number of poles at the origin in the PLL open loop transfer function from its
mathematical model (Fig. 2.10) [Song10],
$$ TF_{open\_loop\_PLL} = \frac{H(s) K_{PD} K_{VCO}}{s}, \qquad (8) $$
where $TF_{open\_loop\_PLL}$ is the open loop transfer function of the PLL, H(s) is the loop filter
transfer function, KPD is the phase detector gain and KVCO is the VCO gain. The VCO, as will be seen subsequently, adds
a pole (or integration function), resulting in the KVCO/s contribution.
Fig. 2.10. Mathematical model of the PLL: phase detector (KPD), loop filter H(s), VCO (KVCO/s) and ÷N counter (1/N) in the feedback path, with input phase θ1(s) and output phase θ2(s).
The PLL order, on the other hand, is defined by the number of poles in its closed
loop transfer function $TF_{closed\_loop\_PLL}$, given by
$$ TF_{closed\_loop\_PLL} = \frac{\theta_2(s)}{\theta_1(s)} = N \, \frac{\dfrac{H(s) K_{PD} K_{VCO}/N}{s}}{1 + \dfrac{H(s) K_{PD} K_{VCO}/N}{s}}, \qquad (9) $$
where N is the frequency division applied to the loop, as can be seen in Fig. 2.10. Some
implementations of PLLs use current controlled oscillators instead of VCOs. In such cases 
the output signal of the phase detector is a current rather than a voltage signal. The 
operating principle however remains the same. In this study we will discuss some 
implementations of VCOs only. 
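As a worked illustration of the type and order definitions in (8) and (9), consider the simplest case of a loop filter with no poles, H(s) = 1. This choice is an assumption made here only for illustration, with the shorthand K = K_PD K_VCO:

```latex
\begin{align*}
TF_{open\_loop\_PLL} &= \frac{K}{s}
  && \text{one pole at the origin: type 1} \\
TF_{closed\_loop\_PLL} &= N \, \frac{K/(Ns)}{1 + K/(Ns)} = \frac{K}{s + K/N}
  && \text{one pole at } s = -K/N \text{: order 1}
\end{align*}
```

Under the same definitions, a loop filter that contributes one additional pole raises the order to 2, and the type also rises to 2 if that pole lies at the origin.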
 
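Fig. 2.11. Basic architecture of a DLL. The phase detector (PD) compares the reference clock with the internal clock returned through the feedback path; its phase error output passes through the loop filter (LPF) to form the control voltage of the voltage controlled delay line (VCDL), which delays the reference clock to produce the internal clock.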
2.4.2. DLL  
DLLs have been widely used in high-speed memory interface circuits and clock
multipliers, for clock de-skewing applications [Dehng00] and for multi-phase clock
generation [Moon00]. In many cases, DLLs offer better jitter performance than PLLs
due to the reduced impact of reference clock jitter and power supply or substrate noise.
Typically DLLs are designed to lock to 1 or ½ input clock cycles, making them sensitive 
to input clock duty cycles. 
The major difference between the DLL and PLL is that DLLs do not generate their
own clock signals from a VCO; instead they employ a voltage controlled delay line
(VCDL) to delay or reproduce the input reference signal. This is advantageous from a
jitter perspective because any jitter is transferred to the output of the delay line once and
then leaves the system, rather than being re-circulated as in a VCO. Thus, while the VCO
accumulates jitter indefinitely, the VCDL only accumulates jitter proportional to the total
delay of the delay line. Hence, variations in temperature, supply voltage, and
manufacturing process can affect the stability and operating performance of PLLs, while
DLLs are more immune to such issues.
Fig. 2.12. Mathematical model of the DLL: phase detector (KPD), loop filter H(s) and VCDL (KVCDL), with reference clock phase θ1(s), outputs θ2(s) and θ2'(s), and a feedback path.
The fundamental components of a DLL are the VCDL, PD and LPF as shown in 
Fig. 2.11. The VCDL produces a delayed version of the reference clock, whose phase is 
then compared with the original reference clock to produce the phase error, similar to the 
PLL. This phase error can then be used to produce the control voltage that modulates the 
delay in the VCDL. The feedback path often includes the clock distribution network and a 
frequency divider to make the DLL lock to a multiple of the reference edge. Thus the DLL 
can produce multiple clock edges that are phase shifted from the reference clock and yet 
are phase locked to the reference clock. 
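The DLL locking process described above can be illustrated with a simple behavioral model. The Python sketch below assumes a linear VCDL whose delay is set by a normalized control value, a loop filter modeled as a discrete integrator, and lock to one reference period; the function name, the delay range and the integrator gain are hypothetical.

```python
def simulate_dll(t_ref, d_min, d_max, k_int, steps, n_taps=4, v_init=0.0):
    """Behavioral DLL that adjusts the VCDL delay until it equals one reference period.

    Assumed linear VCDL: delay(v) = d_min + (d_max - d_min) * v, with v in [0, 1].
    Returns the locked delay and the delays of n_taps equally spaced taps.
    """
    v = v_init
    for _ in range(steps):
        delay = d_min + (d_max - d_min) * v
        error = t_ref - delay                        # PD: delayed edge vs. next reference edge
        v = min(1.0, max(0.0, v + k_int * error))    # integrating loop filter, clamped
    delay = d_min + (d_max - d_min) * v
    taps = [delay * (i + 1) / n_taps for i in range(n_taps)]
    return delay, taps


# Hypothetical values in picoseconds: 1 GHz reference, 500 ps to 1500 ps delay range.
locked_delay, tap_delays = simulate_dll(t_ref=1000.0, d_min=500.0, d_max=1500.0,
                                        k_int=2e-4, steps=200)
```

Once the total delay equals one reference period, equally spaced taps along the delay line provide clock phases at 90, 180, 270 and 360 degrees of the reference, which is how the DLL produces multiple phase-shifted yet phase-locked edges.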
As shown by the mathematical model in Fig. 2.12, the open loop
transfer function of the DLL ($TF_{open\_loop\_DLL}$) is, similar to that of the PLL, given by
$$ TF_{open\_loop\_DLL} = H(s) K_{PD} K_{VCDL} \qquad (10) $$
The DLL order is also defined by the number of poles in its closed loop transfer
function, which is 1 less than that of a PLL with the same number of poles in the loop filter. This
is because in the PLL, the VCO adds an integration (1/s) operation, while the VCDL is a
linear function with no poles. With the open loop transfer function of (10), the closed loop
transfer function $TF_{closed\_loop\_DLL}$ of the DLL is given by
$$ TF_{closed\_loop\_DLL} = \frac{\theta_2'(s)}{\theta_1(s)} = \frac{H(s) K_{PD} K_{VCDL}}{1 + H(s) K_{PD} K_{VCDL}}. \qquad (11) $$
Most components of the DLL are identical to those of the PLL. Thus in this chapter and 
throughout the report, concepts and components described for the PLL can be similarly 
applied to the DLL. Hence, unless explicitly differentiated as a PLL or DLL, the term 
PLL/DLL is used to study features common to both. 
2.4.3. Voltage Controlled Oscillator (VCO)  
The VCO is the most critical part of the PLL for achieving low output-jitter and 
good overall performance. A voltage controlled oscillator generates a periodic signal with 
a frequency that varies (ideally linearly) with the input control voltage VCNTL. In most PLL 
circuits, a VCO is implemented by adding an inverting feedback path to the delay line or 
VCDL. The main design goal of VCO circuits is to achieve a wide, linearly controllable
frequency range with minimal jitter injection. An ideal VCO/VCDL would have the
following specifications:
 Low noise
 Low power
 Integrated
 Wide tuning range
 Small die area occupancy
The linear dependence of the frequency on the control voltage is given by
$$ \omega_{out} = \omega_0 + K_{VCO} V_{CNTL} \qquad (12) $$
where ω0 is the free running frequency, KVCO is the gain of the VCO, expressed in
rad/s-V, and VCNTL is the control voltage.
Fig. 2.13 shows the VCO and VCDL transfer characteristics.
Fig. 2.13. Voltage transfer characteristic of the (a) VCO and (b) VCDL, after [Song10].
2.4.3.1. VCO analog implementations 
There are two types of VCOs that one may choose to design: Resonant oscillators and 
waveform oscillators. Resonant oscillators usually have two topologies: 
 Crystal oscillator 
 LC tank oscillator 
A crystal oscillator is an electronic oscillator circuit that uses the mechanical 
resonance of a vibrating crystal of piezoelectric material to create an electrical signal with 
a very precise frequency. The most common type of piezoelectric resonator used is the 
quartz crystal. Crystal oscillator based VCO circuits are commonly designed in Clapp or 
Colpitts oscillator configuration [Vittoz88]. Even though crystal oscillator based circuits 
are commonly used for external clock generation, the crystals cannot be integrated on a 
semiconductor IC and are hence not considered as a part of this study. 
VLSI LC circuits offer very low jitter VCO solutions where the frequency is mainly 
determined by the inductance and capacitance of the circuit. Resistance and therefore noise 
can be minimized without impacting the oscillation frequency. The key limitations of LC 
based VCOs are the limited tuning range of the circuits and the need for on-chip inductors. 
An LC oscillator VCO circuit, shown in Fig. 2.14, is based on a resonant LC tank with a
differential gain circuit in positive feedback. The closed loop gain circuit serves as a
negative resistance to compensate for the inductor and capacitor losses. The oscillation
frequency ωout is given by
$$ \omega_{out} = \sqrt{\frac{1}{LC}} \qquad (13) $$
where L is the inductance and C the capacitance of the circuit.
Fig. 2.14. (a) LC resonant tank based VCO circuit. Note that the control voltage VCNTL
controls the capacitance of the variable capacitor. (b) The equivalent circuit of
the implementation.
Resonant oscillators such as the LC tank oscillator and the crystal oscillator (which is neither
integrated nor tunable) are not directly scalable with IC generations and are therefore not
considered for implementation in this work.
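For reference, the limited tuning range of the LC tank follows directly from (13). The Python sketch below evaluates the resonant frequency for a hypothetical 1 nH inductor and a variable capacitor swept from 1 pF to 2 pF; the component values and function name are illustrative assumptions, not values from this work.

```python
import math


def lc_resonant_frequency(l_henry: float, c_farad: float):
    """Return (omega in rad/s, f in Hz) of an ideal LC tank, per eq. (13)."""
    omega = 1.0 / math.sqrt(l_henry * c_farad)
    return omega, omega / (2 * math.pi)


# Hypothetical tank: 1 nH inductor, varactor tuned between 1 pF and 2 pF.
omega_hi, f_hi = lc_resonant_frequency(1e-9, 1e-12)
omega_lo, f_lo = lc_resonant_frequency(1e-9, 2e-12)
tuning_ratio = f_hi / f_lo   # sqrt(2): doubling C lowers f by only about 29 %
```

Because the frequency depends on the square root of the capacitance, a 2:1 change in capacitance moves the frequency by a factor of only about 1.41 (roughly 5.0 GHz down to 3.6 GHz in this example).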
Waveform oscillators have two topologies: 
 Relaxation oscillator 
 Ring oscillator  
A relaxation oscillator is an oscillator based upon the behavior of a physical
system's return to equilibrium after being disturbed. That is, a dynamical system within the
oscillator continuously dissipates its internal energy. The period of the oscillations is set
by the time it takes for the system to relax from each disturbed state to the threshold that
triggers the next disturbance. Almost all relaxation oscillators implemented in electronic
circuits operate on the principle of storing energy in a capacitor and repeatedly dissipating
that energy to set up oscillations. Relaxation oscillators can be implemented in various
ways using RC integrating circuits, Schmitt triggers with negative feedback or
astable multivibrators as shown in Fig. 2.15(a).
Fig. 2.15. (a) Relaxation oscillator in astable multivibrator topology implemented using
operational amplifiers. (b) Ring oscillator topology implemented using 2N+1 (an odd number of) voltage
controlled delay elements. Note that VCNTL controls the delay of each stage and hence
the frequency of oscillations.
The benefits of relaxation oscillators include a large frequency tuning range, low cost and
a very linear frequency versus voltage relationship. These oscillators, however, have very
poor frequency stability at high frequencies and are more susceptible to phase noise
than LC oscillators. One of the most popular VLSI implementations of a
waveform oscillator is the ring oscillator topology. This is studied in detail in this report
and the VCO topology implemented in this work is based on the ring structure.
A ring oscillator is a ring of identical cascaded delay elements with inverting 
feedback between the elements that enclose the ring. VCO ring oscillators are derived from 
a common inverter ring oscillator by replacing the inverters with voltage controlled delay 
elements (see Fig 2.15(b)). The frequency (𝑓𝑜𝑢𝑡) of the VCO is determined by the delay of 
a ring inverter, as 
$$ f_{out} = \frac{1}{2(2N+1)T_d} , \qquad (14) $$
where 2N+1 is the number of delay stages and Td is the delay of each stage. The ring 
oscillator topology without the feedback is the most commonly used VCDL topology. 

Fig 2.16.(a) Current starved inverter delay element (b) Symmetrical load differential delay 
element. 

Voltage controlled delay elements used in ring oscillator VCOs may be current-starved 
inverters or differential delay elements. The delay of each ring inverter is controlled by the 
varying gate voltage VCNTL (VCNTLP, VCNTLN in Fig 2.16(a)) of its current starved transistor. 
The maximum discharge current (and hence delay of the ring inverter) is limited by the 
current starved transistor. Lowering VCNTL reduces the current and hence increases the 
propagation delay of the delay element. In a voltage controlled differential delay element 
(shown in Fig 2.16(b)), the VCNTL controls the resistance of the symmetrical load structure. 
Changing VCNTL changes the resistance of the PMOS load that varies the current in the 
branches of the differential element. The differential delay element has the advantage of 
greater power supply noise rejection. The gain of the ring oscillator based VCO (KVCO) can 
be given by the differential  
$$ K_{VCO} = \frac{\partial}{\partial V_{CNTL}} \left( \frac{\pi}{(2N+1)T_d} \right) . \qquad (15) $$
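As a minimal numerical sketch of Eqs. (14) and (15), the following Python fragment evaluates the ring oscillator frequency and estimates the VCO gain by finite differences. The stage delay model (delay inversely proportional to the control voltage above a threshold) and all function names and parameter values are illustrative assumptions and are not taken from this work.

```python
import math

def ring_vco_frequency(n, t_d):
    """Eq. (14): f_out = 1 / (2 (2N+1) T_d) for a ring of 2N+1 stages."""
    return 1.0 / (2.0 * (2 * n + 1) * t_d)

def stage_delay(v_cntl, t0=50e-12, v_t=0.3):
    """Hypothetical delay model (assumption): delay shrinks as V_CNTL rises above v_t."""
    return t0 / (v_cntl - v_t)

def vco_gain(n, v_cntl, dv=1e-6):
    """Eq. (15) by central difference: d/dV [pi / ((2N+1) T_d)], in rad/s per volt."""
    def omega(v):
        return math.pi / ((2 * n + 1) * stage_delay(v))
    return (omega(v_cntl + dv) - omega(v_cntl - dv)) / (2.0 * dv)

# Five-stage ring (N = 2) with 100 ps per stage
f_out = ring_vco_frequency(2, 100e-12)
k_vco = vco_gain(2, 0.8)
```

With this assumed delay model the angular frequency is linear in VCNTL, so the estimated gain is constant over the control range. A real current starved stage would give a voltage dependent gain.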
2.4.3.2. Digital VCO Implementation (DCO) 
Digital implementations of voltage controlled oscillators (called digitally controlled 
oscillators or DCOs) are used in digital PLLs (DPLLs) and all-digital PLLs (ADPLLs) for 
frequency synthesis. DCOs use an M-bit control word (CNTL[M:0]) for controlling the 
output frequency. Most often DCOs use the ring oscillator structure with digital frequency 
tuning for control. Similar to the VCDL, the digitally controlled delay line (DCDL) 
provides the necessary variable delay in a DLL. 
Fig 2.17 shows a DCO implementation using digital switching capacitors for 
frequency control. The digital control word determines the capacitance to be added or 
removed from the circuit which determines the frequency of oscillation.  
Fig 2.17. DCO implementation using digital switching capacitors. 

Another commonly used DCO architecture is shown in Fig 2.18(a). It uses an array 
of multiplexers to select the VCO tap that most closely matches the phase of the reference 
[Best07]. The control logic converts the control word to the appropriate thermometer code 
that is used as the multiplexer select signals. Each delay element presents a unit delay, the 
smallest step by which the frequency can be adjusted. This topology is simple and 
guarantees a monotonic response. However, as the number of unit delay elements increases, 
the size of the multiplexer array also increases, adding more coarse delay. An alternative 
implementation of the multiplexer based DCO is shown in Fig 2.18(b), where 
the multiplexers have been embedded in the delay elements, making it easier to balance the 
delay among different paths. The delay can be increased or decreased by switching a single 
multiplexer, reducing the chances of generating a large transient change in delay and 
eliminating potential sources of timing uncertainty. Two disadvantages of this architecture 
are increased step size and the additional control logic needed to convert the output of the 
phase detector to thermometer code. 
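The conversion of a control word to thermometer code can be sketched as follows. This is only an illustration of the encoding; the function name and the list representation of the select signals are assumptions and do not describe the control logic used in this work.

```python
def to_thermometer(value, width):
    """Return a thermometer code with `value` leading ones out of `width` select bits."""
    if not 0 <= value <= width:
        raise ValueError("value must lie between 0 and width")
    return [1 if i < value else 0 for i in range(width)]

# Control word 3 on a 7-tap delay line
select = to_thermometer(3, 7)
```

Adjacent control words differ in exactly one select bit, which is why a single multiplexer switches per delay step and the response is monotonic.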
2.4.4. Phase Detector 
Phase detector (PD) circuits are commonly used in PLLs for comparing the delay 
time or the phase of two signals. The outputs of phase detector circuits are discrete-time 
signals that are usually low pass filtered to form voltage domain signals. The low pass 
filtered PD output voltage is proportional to the difference in phase between its two 
inputs, i.e., 
$$ v_d = K_d(\theta_i - \theta_o) \qquad (16) $$
where Kd is called the phase detector gain and θi and θo are the two phase inputs to the PD 
with output vd. 

Fig 2.18.(a) DCO implemented using multiplexers (b) alternative implementation with 
delay elements integrated into the multiplexer. 

2.4.4.1. Analog phase detection  
The phase detector used in analog PLLs is typically some form of analog multiplier, 
the most common implementations of which are the double balanced diode mixer (diode 
ring) and the four-quadrant multiplier (Gilbert cell) [Egan08]. The analog multiplier phase 
detectors work on the following principle: consider an analog multiplier that 
multiplies two sinusoidal signals X1 and X2 with amplitudes A1 and A2 respectively. 
$$ Y = X_1 \cdot X_2 = \frac{A_1 A_2}{2}\cos(\phi_1 - \phi_2) + \frac{A_1 A_2}{2}\cos(2\omega t + \phi_1 + \phi_2) . \qquad (17) $$
The high frequency component in the output signal can be eliminated through low-pass 
filtering and the phase detector output (Fig. 2.19) mainly depends on the phase difference 
of the two input clock signals as 
$$ Y = \frac{A_1 A_2}{2}\cos(\phi_1 - \phi_2) . \qquad (18) $$
Fig 2.19. Analog multiplier phase detector. 

Analog phase detectors are highly dependent on the analog voltage of the signals 
and a radiation strike could potentially disrupt their operation, making them highly 
vulnerable to SETs. Hence for the purpose of this dissertation they are not studied in further 
detail. 
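The multiplier principle of Eqs. (17) and (18) can be checked numerically. In the sketch below, both inputs are assumed to be cosines of the same frequency, and averaging over an integer number of cycles stands in for the low-pass filter. The function name and sampling parameters are illustrative assumptions.

```python
import math

def multiplier_pd_average(a1, a2, phi1, phi2, cycles=4, samples_per_cycle=64):
    """Average of X1*X2 over whole cycles; the 2*omega*t term averages to zero."""
    n = cycles * samples_per_cycle
    total = 0.0
    for k in range(n):
        wt = 2.0 * math.pi * k / samples_per_cycle
        x1 = a1 * math.cos(wt + phi1)
        x2 = a2 * math.cos(wt + phi2)
        total += x1 * x2
    return total / n

# The filtered output depends only on the phase difference (0.7 rad here)
y = multiplier_pd_average(2.0, 3.0, 0.9, 0.2)
```

The result agrees with (A1 A2 / 2) cos(φ1 − φ2), which also shows that the output scales with both signal amplitudes.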
2.4.4.2. Digital Phase detection 
Phase detector circuits can be implemented in digital form, where only the 
polarity of the delay or phase difference is presented at the phase detector output [Best07] 
[Song10]. Ideally the output would be a multi-bit digital word proportional to the phase 
difference, but multi-bit designs require precise alignment. Instead, single bit designs that 
simply indicate whether the phase of the input leads, lags or is within a pre-determined 
range of the reference can be used together with a loop filter that performs integration over 
time. The digital XOR gate can be used as a simple clock phase detector as shown in Fig 
2.20. 
During the ‘lock’ condition, the duty cycle of the XOR phase detector output (Y) is 50%. 
When the internal clock (say X2) lags behind the reference clock (say X1), the duty cycle of 
Y becomes larger than 50% and the average value of Y is considered positive. When X2 
leads X1, the average value of Y becomes less than that during the 50% duty cycle and it is 
considered negative. 

Fig 2.20.(a) XOR phase detector (b) Waveforms of the XOR phase detector output. 

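The duty cycle behaviour of the XOR phase detector can be illustrated with two sampled square waves. The sketch assumes ideal 50% duty cycle clocks of equal frequency. The function name and the representation of the phase offset as a fraction of the period are assumptions made for illustration.

```python
def xor_pd_duty(shift_fraction, samples=1000):
    """Duty cycle of X1 XOR X2 when X2 is delayed by shift_fraction of a period."""
    shift = round(shift_fraction * samples)
    high = 0
    for k in range(samples):
        x1 = k < samples // 2                       # reference clock, 50% duty
        x2 = ((k - shift) % samples) < samples // 2  # delayed internal clock
        high += x1 != x2
    return high / samples

lock = xor_pd_duty(0.25)   # quarter-period offset gives the 50% lock point
late = xor_pd_duty(0.30)   # X2 lags further, duty cycle rises above 50%
early = xor_pd_duty(0.20)  # X2 leads relative to lock, duty cycle falls below 50%
```

The 50% duty cycle occurs when the two clocks are a quarter period apart, so an XOR based loop locks with that static offset between its inputs.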
Another type of digital phase detection uses the phase frequency detector (PFD) 
[Burn07]. The PFD detects both the phase and the frequency difference between the divided 
feedback signal and the reference signal. Fig. 2.21(a) illustrates the typical PFD implemented 
using D flip-flops. The PFD could also be implemented using JK flip-flops or logic gates. 
When the reference clock is leading the internal clock, the output signal ‘Up’ 
generates a positive pulse, while QB and Dn (for ‘down’) remain at logic ‘0’. When the 
reference clock is lagging the internal clock, the output signal Dn generates a positive 
pulse, while Up (and QA) stays low. When the frequencies of both inputs are equal, the 
circuit produces almost no Up/Dn pulses. In general the circuit generates pulses at QA and 
QB whose width equals the phase difference between the two inputs. Such PFD circuits 
are commonly used in the charge-pump based PLLs discussed in later sections. Fig 2.21 
(b) shows the gate level implementation of the PFD. 

Fig 2.21.PFD circuit implemented using (a) D flip-flops (b) using logic gates after 
[Best07]. 

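A behavioural sketch of the D flip-flop based PFD is given below. It assumes ideal flip-flops and a reset path with zero delay, so it does not model the finite reset pulse of a real circuit. The function name and the representation of the clocks as lists of rising-edge times are assumptions.

```python
def pfd_pulses(ref_edges, fb_edges):
    """Return the total (Up, Dn) pulse widths for lists of rising-edge times."""
    events = sorted([(t, "ref") for t in ref_edges] + [(t, "fb") for t in fb_edges])
    up = dn = False
    up_start = dn_start = 0.0
    up_total = dn_total = 0.0
    for t, source in events:
        if source == "ref" and not up:
            up, up_start = True, t        # reference edge sets Up (QA)
        elif source == "fb" and not dn:
            dn, dn_start = True, t        # internal clock edge sets Dn (QB)
        if up and dn:                     # both high: reset both flip-flops
            up_total += t - up_start
            dn_total += t - dn_start
            up = dn = False
    return up_total, dn_total

# Reference leads the internal clock by 2 time units in each of three cycles
leading = pfd_pulses([0, 10, 20], [2, 12, 22])
```

In this example only Up pulses appear and their total width equals the accumulated lead of the reference, which matches the behaviour described above.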
A two state phase detector (Fig 2.22(a)) operates by generating an increment signal when 
the input leads the reference and a decrement when the input lags [Burn07]. Ideally the 
transition from increment to decrement occurs at the precise phase where the input signal 
switches from leading to lagging, though in practice there will be some offsets. The major 
disadvantage of the two state phase detector is that the output is uncertain for certain values 
of phase difference. These dead zones are regions in the phase characteristic where the 
output of the PD is either delayed or ambiguous because of failure to meet the setup and 
hold time requirements of the input circuit. When the reference and input signals fall in 
this range the increment and decrement signals may go to a wrong value, causing the loop 
to go into a wrong state. 

Fig 2.22.(a) Two state phase detector. (b) Three state phase detector implementations after 
[Burn07]. 

A three state phase detector shown in Fig. 2.22(b) has a locked state window width 
of . It consists of a pair of two state phase detectors which define the edges of the phase 
detector window. The first flip flop compares the phase of the input signal to the reference 
and the second compares the input to a delayed version of the reference. When the input 
leads the reference, an increment is generated and when it lags the delayed reference, a 
decrement is generated. If the input falls between two reference signals no output is 
generated and the phase detector is considered to be in the locked state. 
The accuracy of the three state phase detector depends primarily on the width of 
the locked state window, which determines how far the input signal can drift from the 
reference before the loop corrects it. In most cases the peak phase error is approximately 
equal to the window width, although this relationship only holds when the window width is 
equal to or greater than the minimum delay step size. 
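The decision made by the three state phase detector can be written as a short function. The sketch assumes ideal flip-flops with no setup or hold violations. The names and the use of edge arrival times are assumptions for illustration, with `window` standing for the locked state window width.

```python
def three_state_pd(t_input, t_ref, window):
    """Classify an input edge against the reference and the delayed reference."""
    if t_input < t_ref:
        return "increment"   # input leads the reference
    if t_input > t_ref + window:
        return "decrement"   # input lags the delayed reference
    return "locked"          # input falls between the two reference edges

state = three_state_pd(1.0, 0.0, 2.0)
```

An input edge anywhere inside the window produces no correction, which is why the residual phase error of the loop is bounded by roughly the window width.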
2.4.4.3. Time to Digital Converter (TDC) 
An important circuit used in digital phase detection is the time-to-digital converter 
(TDC). The TDC is used to determine the phase difference between the generated system 
clock and the reference or input clock as a digital code. This is necessary for determining 
if the PLL/DLL is tracking the reference clock. The quantified phase difference is then 
used to compute the fine settings needed for the subsequent clock edge for a perfect 
PLL/DLL lock. The TDC is constructed from the digital delay line as shown in Fig. 2.23. 
The multiple temporally separated clock edges produced by the delay line clock different 
flip-flops, with the reference clock at the data input of each flip-flop. The flip-flops for 
which the reference clock meets the setup time store a logic ‘1’ state, while the others store a 
logic ‘0’ state. Thus the time difference between the reference clock and the system clock 
is converted into a digital thermometer code. This digital code can be analyzed by the 
control logic for creating the subsequent coarse and fine settings for PLL/DLL tracking. 
The reference clock is buffered to drive the D inputs. The reference clock is delayed 
such that a 0° phase difference in the two input clocks to the TDC produces a thermometer 
code of (MSB-LSB)/2 or 16 for a 32 bit TDC under PVT conditions. This centers the 
thermometer code at the center of the fine delay chain and enables us to determine if the 
system clock is leading or lagging the reference clock. A thermometer code greater than 
16 would denote that the generated system clock is leading the reference clock and a code 
less than 16 would indicate lagging.   
 
Fig 2.23. A Mux-DDL based time-to-digital converter (TDC) implementation. 
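The quantization performed by the TDC can be sketched behaviourally. The model below assumes ideal flip-flops, a uniform tap delay and a reference delayed by half the chain so that aligned clocks give a mid-scale code. All names and values are illustrative assumptions. Whether a code above mid-scale corresponds to a leading or a lagging system clock depends on the tap ordering and polarity of the actual circuit, and the sketch makes no claim about the convention used in this work.

```python
def tdc_sample(t_ref, t_sys, step, taps=32, setup=0):
    """Return the captured bits and the number of ones for one pair of clock edges."""
    t_data = t_ref + (taps // 2) * step   # delay-matched reference at the D inputs
    bits = [1 if t_sys + i * step >= t_data + setup else 0 for i in range(taps)]
    return bits, sum(bits)

aligned_bits, aligned_code = tdc_sample(0, 0, 1)   # aligned clocks
_, shifted_code = tdc_sample(0, 3, 1)              # clocks offset by three tap delays
```

The code moves away from mid-scale by one count per tap delay of offset, so the resolution of the measurement is set by the unit delay of the delay line.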
 
2.4.5. Loop Filter 
The loop filter in a PLL performs several functions such as removing unwanted high 
frequency noise from the phase detector output, stabilizing the feedback loop, and in some 
cases increasing the loop gain at lower frequencies. The filter also helps to determine the 
dynamic performance of the loop. The filter transfer function is denoted H(s). 
In most PLL designs a first order loop filter is used. This makes the PLL a second 
order PLL (the VCO performs the second integration function) [Song10]. Three different 
types of loop filters are used in PLL circuits: the passive lead-lag filter, the active lead-lag 
filter and the active proportional-integral (PI) filter.  
2.4.5.1. Analog loop filter implementations 
In an analog PLL, filters are usually implemented using a one or two pole low pass 
architecture constructed from passive or active components. The two widely 
used passive loop filters are shown with their respective transfer functions in Fig 2.24. 

Fig 2.24.Passive loop filter implementations [Egan08] with (a) single pole and (b) two poles. 
(a) Components R1, C1: 
$$ H(s) = \frac{1/(sC_1)}{R_1 + 1/(sC_1)} = \frac{1}{1 + sR_1C_1} = \frac{1}{1 + s\tau_1} , \quad \text{where } \tau_1 = R_1C_1 $$
(b) Components R1, R2, C1: 
$$ H(s) = \frac{R_2 + 1/(sC_1)}{R_1 + R_2 + 1/(sC_1)} = \frac{1 + sR_2C_1}{1 + sC_1(R_1 + R_2)} = \frac{1 + s\tau_2}{1 + s(\tau_1 + \tau_2)} , \quad \text{where } \tau_1 = R_1C_1 ,\ \tau_2 = R_2C_1 $$

The passive filter is quite simple and is often satisfactory for many purposes. The passive filter 
has very linear performance, relatively low noise and unlimited frequency range. It is 
however hard to integrate on a chip when the resistance and capacitance values are large. 
Active filters (Fig 2.25) require a high gain DC amplifier but provide better tracking 
performance. They are however frequency limited and consume more power than passive 
implementations. 

Fig 2.25.Active loop filter implementations with their corresponding transfer functions (a) 
active lag filter I (b) active lag filter II (c) active PI filter [Egan08]. 
(a) Components R1, C1, R2, C2: 
$$ H(s) = \frac{R_2 + 1/(sC_2)}{R_1 + 1/(sC_1)} = \frac{C_1}{C_2}\,\frac{1 + sR_2C_2}{1 + sR_1C_1} = \frac{C_1}{C_2}\,\frac{1 + s\tau_2}{1 + s\tau_1} , \quad \text{where } \tau_1 = R_1C_1 ,\ \tau_2 = R_2C_2 $$
(b) Components R1, C1, R2, C2: 
$$ H(s) = \frac{\dfrac{R_2 \cdot \frac{1}{sC_2}}{R_2 + \frac{1}{sC_2}}}{\dfrac{R_1 \cdot \frac{1}{sC_1}}{R_1 + \frac{1}{sC_1}}} = \frac{R_2}{R_1}\,\frac{1 + sR_1C_1}{1 + sR_2C_2} = \frac{R_2}{R_1}\,\frac{1 + s\tau_1}{1 + s\tau_2} , \quad \text{where } \tau_1 = R_1C_1 ,\ \tau_2 = R_2C_2 $$
(c) Components R1, R2, C2: 
$$ H(s) = \frac{R_2 + 1/(sC_2)}{R_1} = \frac{1 + sR_2C_2}{sR_1C_2} = \frac{1 + s\tau_2}{s\tau_1} , \quad \text{where } \tau_1 = R_1C_2 ,\ \tau_2 = R_2C_2 $$
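The frequency behaviour of a lead-lag loop filter can be explored with a few lines of code. The sketch below assumes the standard passive lead-lag network (R1 in series, R2 and C1 to ground) with H(s) = (1 + sτ2)/(1 + s(τ1 + τ2)), τ1 = R1C1 and τ2 = R2C1. The component values and the function name are illustrative assumptions.

```python
import math

def passive_lead_lag(f_hz, r1, r2, c1):
    """Complex response H(j*2*pi*f) of the passive lead-lag filter."""
    s = 2j * math.pi * f_hz
    tau1 = r1 * c1
    tau2 = r2 * c1
    return (1 + s * tau2) / (1 + s * (tau1 + tau2))

dc_gain = abs(passive_lead_lag(0.0, 9e3, 1e3, 1e-9))    # unity at DC
hf_gain = abs(passive_lead_lag(1e12, 9e3, 1e3, 1e-9))   # approaches R2 / (R1 + R2)
```

The gain falls from unity at DC to R2/(R1 + R2) at high frequency, which is how the filter suppresses the high frequency content of the phase detector output while keeping a finite high frequency gain.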
One additional component of interest that is usually studied with loop filters in 
analog PLLs is the charge pump [Gard80]. The use of PFD in place of a conventional PD 
allows us to use a charge pump along with the low pass filter. The purpose of the charge 
pump is to convert the logic states of the PFD into clearly specified analog signal levels.  
A charge pump is a three position electronic switch controlled by the three states of 
the PFD. If the ‘Up’ output signal of the PFD is high, a positive current of magnitude +IP 
is delivered to the loop filter while a negative current of –IP is delivered if the Dn output is 
high. The Up signal is high when the input reference signal is operating at a higher 
frequency than the feedback signal. The charge pump forces current into the loop filter 
causing the VCO control voltage to increase the VCO output frequency. This causes the 
feedback frequency to move towards the input reference signal.  
 
Fig 2.26. PLL loop with PFD, charge pump, loop filter based implementation after 
[Gard80]. 
 
Likewise, the ‘Dn’ signal is high when the input reference signal is operating at a 
lower frequency than the feedback signal. The charge pump sinks current out of the loop 
filter causing the VCO control voltage to decrease which in some time (due to the 
integration in the loop filter) brings down the feedback frequency to match with the input 
reference signal. If both the terminals are low corresponding to a ‘phase locked’ state, the 
loop filter is isolated from the charge pump or the charge state of the loop filter is 
maintained to keep it at the locked state. Charge pumps in PLLs are capable of extremely 
accurate phase tracking of the input signal. It is this property of the charge pump that makes 
mixed signal PLL solutions (with digital PFD, charge pump and analog loop filters and a 
VCO/DCO) the most popular implementation. The charge pump output stage of the PFD 
is shown in Fig 2.26. A charge pump typically uses current mirrors for implementing the 
two current sources. 

Fig 2.27.(a) K counter (b) Waveforms of the UP-counter, DOWN-counter, carry and borrow 
states showing the filtering operation after [Best07]. 

2.4.5.2. Digital loop filters 
As seen before, different digital PDs generate different types of output signals. The 
XOR phase detector, PFD, two-state, three-state phase detectors produce different outputs 
and hence it becomes evident that not every digital loop filter is compatible with all types 
of digital PDs.  
The simplest loop filter is built from an ordinary UP/DOWN counter (Fig 2.28(a)). 
On each Up pulse generated by the phase detector, the content N of the UP/DOWN counter 
is incremented by 1 and every Dn pulse generated by the PD decrements the count by 1. 
Because the content N is the weighted sum of the Up and the Dn pulses (the Up pulses 
have an assigned weight of +1 and the Dn pulses, -1) this filter can roughly be considered 
an integrator having the transfer function H(s) as  
$$ H(s) = \frac{1}{sT_i} \qquad (19) $$
where Ti is the integrator time constant. This is however a crude approximation since the 
Up and Dn pulses do not carry any information about the actual size of the phase error; they 
only tell whether the phase of the internal clock is leading or lagging the reference clock. 
Fig 2.28(b) shows a digital integrator implemented using an accumulator. 

Fig. 2.28.(a) Up/Down counter (b) digital integrator implemented using an accumulator 
structure. 

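The integrating behaviour of the UP/DOWN counter can be sketched as a running sum. The encoding of the phase detector output as +1 (Up), -1 (Dn) or 0 (no pulse) and the function name are assumptions made for illustration.

```python
def up_down_counter(pulses, start=0):
    """Accumulate PD pulses (+1 for Up, -1 for Dn, 0 for none) and return the count trace."""
    count = start
    trace = []
    for pulse in pulses:
        count += pulse
        trace.append(count)
    return trace

trace = up_down_counter([1, 1, -1, 1, 0, -1])
```

The count only records the sign history of the phase error, not its magnitude, which is the reason the integrator model of Eq. (19) is described as a crude approximation.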
One of the most important digital loop filters is the K counter [Best07]. This loop 
filter always works together with the XOR or the JK flip-flop phase detector. The K counter 
shown in Fig 2.27(a) consists of two independent counters which are usually referred to as 
the UP-counter and the DOWN-counter. In reality however both counters count upwards. K 
is the modulus of both counters, i.e., the contents of both counters are in the range 0 to K-1. 
The frequency of the clock signal (K clock) is by definition M times the center frequency 
f0 of the PLL, where M is typically 8, 16 or 32. The operation of the K counter is controlled 
by the Up/Dn signal. If this signal is high, the UP-counter is active, while the DOWN-
counter stays frozen. In the opposite case, the DOWN-counter counts up while the UP-
counter stays frozen. Both counters recycle to 0 when their contents exceed K-1. The most 
significant bit (MSB) of the UP-counter is used as the carry output while the most 
significant bit of the DOWN-counter is used as the borrow output. When the count in a 
counter is K/2 or greater, its corresponding output is high. The positive going edges of the 
carry and borrow signals are used to control the frequency of the DCO. When the Up/Dn 
signal is high during a larger fraction of the time, the UP-counter gets more clock pulses on 
average than the DOWN-counter. The average number of carries then becomes larger than 
the average number of borrows per unit time, producing a filtered signal for controlling the 
DCO frequency.  
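The filtering action of the K counter can be illustrated with a behavioural model. The sketch assumes that K is a power of two, that a high Up/Dn sample enables the UP-counter, and that carry and borrow are the counter MSBs. The function name and the sampled representation of the Up/Dn signal (one sample per K clock) are assumptions.

```python
def k_counter(up_dn_samples, k):
    """Count rising edges of carry and borrow for a sampled Up/Dn control signal."""
    up = dn = 0
    carries = borrows = 0
    prev_carry = prev_borrow = False
    for sample in up_dn_samples:
        if sample:
            up = (up + 1) % k        # UP-counter active, DOWN-counter frozen
        else:
            dn = (dn + 1) % k        # DOWN-counter active, UP-counter frozen
        carry = up >= k // 2         # MSB of the UP-counter
        borrow = dn >= k // 2        # MSB of the DOWN-counter
        if carry and not prev_carry:
            carries += 1
        if borrow and not prev_borrow:
            borrows += 1
        prev_carry, prev_borrow = carry, borrow
    return carries, borrows

# Up/Dn high for three quarters of the time, K = 8
result = k_counter([1, 1, 1, 0] * 16, 8)
```

With the control signal high 75% of the time the model produces three times as many carries as borrows, while a 50% duty cycle produces equal numbers, so the difference between the two rates tracks the average phase error.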
  
 
2.4.6. Feedback  
The use of PLLs for frequency synthesis is ever increasing in the field of 
communications. Originally the frequency synthesizer was a system creating a set of 
frequencies that were integer multiples of a mostly fixed reference frequency. Such 
synthesizers are referred to as integer-N frequency synthesizers. In contrast to integer-N 
frequency synthesizers, structures capable of creating frequencies that are N.f times a 
reference frequency are called fractional-N synthesizers. Here, N is the integer part and f 
is the fractional part. Such frequency synthesizers use a frequency divider in the feedback 
part of the PLL circuit. The use of such dividers allows us to have internal clock frequencies 
much higher (~ GHz) than the reference frequency for use in microprocessor applications. 
Fig 2.29 shows a typical frequency synthesizer system. 

Fig 2.29. A basic frequency synthesizer system [Best07]. 

2.4.6.1. Frequency Divider  
A frequency divider is used to divide the frequency of the internal clock (generated 
by the VCO) to produce a divided-by-N clock that is used in phase and frequency 
comparison with the reference clock. The simplest binary frequency division circuit is the 
toggle flip-flop, which can be constructed from the common D flip-flop as shown in Fig 
2.30. 
VLSI programmable ratio frequency divider circuits are widely used for achieving 
variable frequency division ratios in fractional-N frequency division operations. The 
programmable frequency divider circuits can be constructed based on the binary divider 
and a cycle-stretch circuit.  
A programmable divider with a ratio between 1/2^N and 1/(2^N+M) is shown in Fig 2.31. 
Such a circuit consists of an N-bit asynchronous binary divider and a feedback path in the 
first stage employing a NAND gate. The feedback path offers the so called cycle stretch 
operation, such that one additional clock cycle is inserted (cycle stretch) in the dividing 
process every time the hard-coded stretch location code is met during the division counting. The 
maximum number for the divisor would be 2^N-1. For example in a divide-by-3 
implementation, under normal conditions such a divider serves as a divide-by-2 circuit. 
However when the divide counter matches the stretch code, an extra cycle is inserted by 
making QFB ‘0’. 
 
Fig 2.30. A divide by 2 circuit using D flip-flop and its functional waveforms. 
 
Other common circuits used in frequency division are binary counters. An n-bit 
binary counter divides the input clock frequency by 2^n. There are two types of binary 
counters: synchronous and asynchronous. Each has its own advantages and disadvantages 
for a PLL design. Each successive stage of an asynchronous counter operates at a lower 
frequency. However, an asynchronous counter results in high jitter, since jitter is 
accumulated at every stage. This is due to the fact that the output of one stage is fed to the 
clock input of the next stage. In contrast, a synchronous counter results in reduced jitter, but 
with a power overhead. The reason for the high power consumption of the synchronous 
design is that each stage of a synchronous counter operates at the input clock’s frequency 
(which is high). 
  
 
Fig 2.31. A programmable divider circuit with the cycle stretch circuit. 
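The division by 2^n of an asynchronous (ripple) counter can be sketched with cascaded toggle flip-flops, each clocked by the rising edge of the preceding stage. The model is purely behavioural, ignores propagation delay and jitter, and its function name is an assumption.

```python
def ripple_divider(n_stages, input_edges):
    """Count rising edges at the last stage of n cascaded toggle flip-flops."""
    q = [0] * n_stages
    out_edges = 0
    for _ in range(input_edges):
        for i in range(n_stages):
            q[i] ^= 1                  # toggle on the rising edge of this stage's clock
            if q[i] == 0:              # falling output does not clock the next stage
                break
            if i == n_stages - 1:      # rising edge on the final output
                out_edges += 1
    return out_edges

# 64 input cycles through a 3-bit ripple counter give 8 output cycles
edges = ripple_divider(3, 64)
```

The inner loop also shows why later stages toggle less often: stage i is only reached on every 2^i-th input edge, which is the source of both the power saving and the stage by stage jitter accumulation.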
 
2.4.7. Classifications of PLLs / DLLs  
Depending upon the required constraints and specifications, PLLs/DLLs may be 
implemented in analog, as a hybrid of analog and digital, as all-digital, as a hybrid of 
analog and software, or purely in software. The major classifications of PLLs/DLLs based 
on implementation are [Guan96] [Best07]: 
 Analog or Linear PLL 
 Digital PLL/DLL (DPLL) 
 All digital PLL/DLL (ADPLL) 
 Software PLL/DLL (SPLL) 
2.4.7.1. Analog or Linear PLL 
The analog PLL or linear PLL (LPLL) is the classical form of the PLL. All components of 
the LPLL operate in the continuous-time domain. The phase detector is typically some form 
of analog multiplier. The general architecture of a purely analog PLL is shown in Fig 2.32. 
 
Fig 2.32. Basic architecture of an analog PLL. 
 
2.4.7.2. Digital PLL or DLL (DPLL) 
The digital PLL, in spite of its name, is really an analog PLL or DLL with a digital 
phase detector. Digital PLLs are mostly hybrids built from linear and digital circuits 
[Egan08]. The most common implementation of a DPLL uses the PFD with a charge pump 
and a passive low pass filter to control the frequency of a VCO as shown in Fig 2.33 (a). 
The digital DLL (Fig 2.33 (b)) consists of a digitally controlled VCDL that may be 
controlled by the control voltage. 
 
Fig 2.33. Basic architecture of (a) digital PLL frequency synthesizer (b) digital DLL.  
 
2.4.7.3. Master-Slave Delay Locked Loop (DLL) 
The most common DLL architecture is the master-slave DLL (MS-DLL) as shown 
in Fig. 2.34. In the MS-DLL, both periodic and non-periodic signals can be used as the 
reference signal to the DLL. In the MS-DLL architecture, the master and the slave delay 
lines share the same control voltage. Thus the master delay line is locked to the reference 
signal while the slave delay line follows the master. As the master and slave delay lines 
may have different delay lengths, the slave delay is a scaled version of the master delay. 
However, delay differences in the master and slave delay lines may cause skew between 
the master and slave and have to be compensated for. 
 
Fig 2.34. Master-Slave DLL architecture. Note DO is the delayed version of DI as the master 
and slave share the same control voltage.  
 
2.4.7.4. All-digital PLL/DLL (ADPLL) 
The all-digital PLL/DLL (ADPLL) is exclusively built from digital function blocks; 
hence, it does not contain any passive components like resistors and capacitors [Stas06]. 
The VCO is replaced by the DCO or a more specific implementation called the numerically 
controlled oscillator (NCO) in some systems. The DLL implementation uses a digitally 
controlled delay line (DCDL). A generic ADPLL architecture used in microcontrollers for 
frequency synthesis is shown in Fig 2.35(a) and a specific implementation is shown in Fig 
2.35(b). The implementation carried out in this dissertation is based on all digital PLLs.  

Fig 2.35. (a) Basic architecture of an ADPLL used in a microcontroller. (b) The all digital 
PLL as implemented in the part 74HC/HCT297. The DCO used in this circuit is 
an increment/decrement counter after [Best07]. 

2.4.7.5. Software PLL or DLL (SPLL) 
The software PLL or DLL is normally implemented on a hardware platform such as 
a microcontroller or a digital signal processor (DSP) [Best07]. The PLL or DLL function is 
realized by software. This offers the greatest flexibility because a vast number of different 
algorithms can be developed. The SPLL can be tailored to perform similarly to a DPLL, or to 
execute a function that none of the hardware variants is able to perform. 
When comparing SPLLs with hardware PLLs, we should recognize first that the 
LPLL or DPLL actually is an analog computer that continuously performs some arithmetic 
operations. When a computer algorithm has to take over that job, it must replace these 
continuous-time functions with a discrete-time process. According to the sampling theorem, 
the algorithm of the SPLL must be executed at least two or even four times in each cycle of 
the reference signal. The implementation of the loop filter is typically a difference equation. 
The design and analysis of the loop filter is done using the z-transform. Thus the SPLL can 
compete with the hardware implementations only if the required algorithms execute 
fast enough on the hardware platform used to run the program. If the given algorithm 
performs too slowly on a relatively slow microcontroller, the designer is forced to resort 
to more powerful hardware and a price tradeoff comes into play. The high speed and low 
cost of hardware PLLs make it difficult for the SPLL to compete. Software 
PLLs are beyond the scope of this discussion and are hence not discussed in further detail. 
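To illustrate what a loop filter implemented as a difference equation looks like, the sketch below discretizes a PI filter of the assumed form H(s) = (1 + sτ2)/(sτ1) with the backward Euler substitution s → (1 − z^-1)/Ts. This gives y[n] = y[n-1] + ((Ts + τ2)/τ1) e[n] − (τ2/τ1) e[n-1]. The choice of filter, the discretization method and all names are assumptions and are not taken from this work.

```python
def pi_filter_response(errors, t_s, tau1, tau2):
    """Run the backward-Euler difference equation of a PI loop filter over phase errors."""
    b0 = (t_s + tau2) / tau1   # weight of the current error sample
    b1 = tau2 / tau1           # weight of the previous error sample
    y_prev = 0.0
    e_prev = 0.0
    outputs = []
    for e in errors:
        y = y_prev + b0 * e - b1 * e_prev
        outputs.append(y)
        y_prev, e_prev = y, e
    return outputs

# Response to a constant phase error: a proportional jump followed by a ramp
step = pi_filter_response([1.0, 1.0, 1.0], t_s=0.5, tau1=2.0, tau2=1.0)
```

Each output sample costs two multiplications and two additions, which gives a feel for the per-sample workload the processor must sustain at the rate required by the sampling theorem.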
 
2.5. Chapter Summary 
This chapter has studied the basics of VLSI clocking: sequential element design 
(latches and flip-flops) and clock distribution using tree, mesh or spine based topologies. 
Finally, clock generation using different PLL or DLL topologies was studied in detail along 
with their constituent circuits such as VCOs, PDs, LPFs and dividers. Since the work 
considered in this dissertation is intended for radiation hardened applications, analog 
circuits and analog based circuit components are ill suited for it. Hence 
purely digital components and digital design based implementations are considered for 
further investigation in subsequent chapters. 
  
 
CHAPTER 3. RADIATION HARDENED SEQUENTIAL ELEMENT DESIGN 
Hardening the flip-flops and latches is the most straightforward way to improve the 
soft-error robustness of sequential logic circuits. This chapter describes the essentials of 
creating radiation hardened designs for sequential elements. As was described in Chapter 
1, both TMR and temporal radiation hardening techniques are considered for common 
sequential elements: flip-flops and pulse latches. In particular, in this chapter, we introduce 
(and will study in greater detail in Chapter 4) CAD methodologies for designing large 
designs with the previously proposed self-correcting TMR flip-flop. We then design, 
implement and analyze a TMR pulse latch and use it with standard ASIC flows to design an 
AES engine. Finally, a novel temporal pulse latch design is proposed and analyzed in detail.  
3.1. Introduction 
Many existing hardened latch and FF designs, e.g., the dual interlocked cell (DICE) 
[Calin96] and the built-in soft error resilience (BISER) FF [Zhang06], mitigate SEUs but not 
SETs at their D or clock inputs. Simultaneous multiple node charge collection (MNCC) 
from a single charge track has been known to thwart hardened latch redundancy for some 
time [Black08][Warren09]. Poorly designed ‘hardened’ FFs with insufficient critical 
(redundant) node spacing have demonstrated upset rates similar to those of unhardened FFs 
[Gasp13]. 
 
3.2. Prior Work: Self-Correcting TMR Flip-Flop Design 
The TMR self-correcting flip-flop considered for implementation in this work (see 
Fig. 3.1) follows that in [Hind11], but omits scan capability to keep the area penalty over 
the baseline flip-flop down to four poly tracks. The TMR flip-flop consists of three 
redundant copies of the master-slave flip-flop that are spatially separated. The majority 
voter in the slave latch feedback path is driven by the slave storage nodes of the other redundant 
copies. At the clock falling edge, the slave latch goes from transparent mode to 
feedback mode and the voter corrects the data based on the majority of the latch states. Thus 
at the falling edge of the clock (for a positive edge triggered FF as shown in Fig. 3.1), the 
stored data is restored to its correct state. This self-correcting feature also allows clock 
gating in the triple redundant self-correcting MSFF. The flip-flops are designed as 4-bit 
 
 
Fig.3.1. Schematic of the Self Correcting TMR master-slave flip-flop after [Hind11]. 
 
macros that provide storage interleaving to mitigate MNCC as in [Hind11] where they were 
shown to be extremely hard in both proton and heavy ion testing. 
3.2.1. Physical Design using Fences for TMR separation with TMR flip-flops 
Full TMR logic with triplicated sequential logic, combinational logic and clocks 
necessitates that redundant logic be spatially separated while maintaining communication 
within the TMR flip-flops, such as in [Nathan09]. The design scheme in [Hind11] used 
interleaved domains composed of multiple cell rows each. The fence based physical 
separation scheme discussed here further automates that approach, while better leveraging 
the APR tool optimization capabilities. 
During physical design, we ensure the necessary spatial separation for redundant 
copies by using fences in Cadence Encounter (Synopsys has a similar facility). A fence is 
a contiguous bounded geometry that provides a hard placement constraint. Blocks 
associated with a particular fence can only be placed within that boundary. Arbitrary 
 
Fig. 3.2. Proposed placement methodology using interleaved serpentine fences. 
 
shaped fences provide separation for heavily interacting logic with limited inter-domain 
connections. Thus the key to achieving spatial separation between redundant logic is to 
have three non-overlapping fences, one for each copy. 
Since fences must be contiguous, it is necessary to interleave them in a serpentine 
fashion as shown in Fig. 3.2. The vertical bars can be as small as the smallest spacer cell, 
which precludes placement of regular logic there. The multi-bit flip-flops now span 
vertically across all three fences. As in the first scheme, their placement is restricted to 
keep their constituent flip-flops within their respective fences. A and C copies have twice 
the cell row height, but B occurs twice as often making the areas equal. The combinational 
logic, though restricted to their respective fences can be much closer to the flip-flops from 
any domain. Thus interleaved serpentine fences provide separation between redundant 
copies but better facilitate inter-domain connections. The only restriction on the fence 
dimension is that the B regions must be as tall as the B flip-flops in the multi-bit flip-flop 
implementation for fence and flip-flop alignment. This design technique of using fences 
for physical separation is used extensively in HERMES2 and will be discussed in further 
detail in Chapter 4. 
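The area argument for the serpentine fences can be checked with a small sketch. It assumes a repeating A-B-C-B band order with A and C bands two cell rows tall and B bands one cell row tall; the band order and the helper names are illustrative assumptions, not taken from the physical design scripts.

```python
from collections import Counter

# Assumed band pattern for one serpentine period: (domain, height in cell rows).
# A and C bands are twice as tall as a B band, and B occurs twice per period.
BAND_PATTERN = [("A", 2), ("B", 1), ("C", 2), ("B", 1)]


def serpentine_rows(n_periods):
    """Return the domain owning each cell row for n_periods repetitions."""
    rows = []
    for _ in range(n_periods):
        for domain, height in BAND_PATTERN:
            rows.extend([domain] * height)
    return rows


def area_per_domain(rows):
    """Count the cell rows owned by each redundant domain."""
    return Counter(rows)


rows = serpentine_rows(3)
areas = area_per_domain(rows)
print(dict(areas))
```

Under these assumptions each domain owns the same number of cell rows, and every B band is adjacent to both an A band and a C band, which is what lets a multi-bit flip-flop span all three fences.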
3.3. TMR Pulse Clocked Latch Design 
Replacing master slave flip-flops with pulsed-clocked latches has shown clock and 
sequential circuit power reductions of over 40% [Clark01] [Tshanz01]. It reduces clock 
power via reduced capacitive loading C and improves performance (flip-flop dead time) 
by eliminating the master latch. Pulse-clocked latches sample data in the pulse’s 
transparent high phase, providing better timing through time-borrowing during the 
transparency. Pulse-clocked latches, albeit with local pulse generation for best clock 
 
fidelity, have become almost standard in high performance designs. The TMR pulse-
clocked latch employs three redundant latches with majority gates in the latch feedback, 
providing self-correction (Fig. 3.3).  
The clock pulses are generated by three spatially separated redundant pulse 
generators. Multi-bit TMR pulse-clocked latch blocks share a common pulse generator, 
reducing the overall clock distribution and providing good pulse fidelity. Consequently, 
the 16-bit TMR pulse clocked scheme discussed in this report affords a 40% reduction in 
sequential circuit and clock power (due to loading) compared to a flip-flop implementation 
and reduces the overall circuit area. Moreover, since correction occurs in the feedback 
 
Fig.3.3. Schematic of the TMR pulse latch with redundant pulse generators and majority 
gated latch feedback. 
 
mode during the negative phase of the pulse, the pulsed-clock affords a larger correction 
window (greater than half the clock cycle) than previous master-slave implementations 
[Hind11].  
The prototype vehicle used in this work is an Advanced Encryption Standard (AES) 
engine [NIST01]. The circuit techniques employ pulse-clocking for reduced power and 
area, with testability features that ensure correct operation and that allow 
the design to function in a non-redundant, unhardened mode. The designed pulse latch 
macro also has an additional pipe-collapse mode for lower energy operation. Novel 
physical design approaches provide both sequential and combinational node separation for 
MNCC immunity. 
3.3.1. AES Overview and Design 
The Advanced Encryption Standard (AES) is a cryptographic standard that works on 
128, 192 or 256-bit key streams that are merged with a plain text to generate the encrypted 
 
Fig.3.4. Fully pipelined (loop-unrolled) Advanced Encryption Standard architecture as 
implemented. All stages have very similar configurations and timings. Even and 
odd pipeline stages are identical. Mathematical transformations for the 
combinational logic are shown in boxes. 
 
cipher text [7] [8]. The high level block diagram of a fully pipelined AES is shown in 
Fig. 3.4, where the pipeline operations are also shown. The data is XORed with the output 
of the sub-key unit. The sub-key is generated based on mathematical transformations. This 
output is processed through two substitution blocks and a translation block, followed by 
another XOR block. The first substitution block substitutes the data byte-wise based on a 
lookup table (here synthesized combinational logic). The translation block rearranges the 
bits in each byte based on circular shifting transformations. The second 
substitution block substitutes four bytes, based on a polynomial modulo function instead 
of a lookup. Each of these operations is performed on a 4x4 array of data bytes. Studies 
on the effect of pipeline depth on AES power and performance show that full pipelining 
provides the highest performance (as expected) [Chitu05]. The number of such stages N is 
11, 13 or 15 depending on the key width. 
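The stage counts quoted above follow from the AES round counts (10, 12 or 14 rounds for 128, 192 or 256-bit keys). The sketch below assumes that the additional stage is the initial key addition that precedes the first round; that mapping of rounds to pipeline stages is an assumption made for illustration.

```python
# Round counts per key width as defined by the AES standard.
AES_ROUNDS = {128: 10, 192: 12, 256: 14}


def pipeline_stages(key_bits):
    """Stages N of a fully unrolled pipeline.

    Assumes one stage for the initial key addition plus one stage per round.
    """
    return AES_ROUNDS[key_bits] + 1


for key_bits in (128, 192, 256):
    print(key_bits, pipeline_stages(key_bits))
```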
Our AES implementation uses a 15-stage pipeline with a 128-bit initialization 
vector and a 256-bit encryption key (Fig. 3.4). Data and cipher texts are 64 bits wide and are 
streamed continuously using counter mode, where the 64-bit data is XORed with the key pipeline MSB 
or LSB. In this mode, keys and initialization vectors are loaded separately. A test mode 
allows operation directly into the pipeline and observation of the output. The design is 
radiation hardened, using fine-grained triple mode redundant (TMR) self-correcting latches 
for pipeline storage. A further test mode provides the ability to override the redundancy and 
operate a single pipeline at a time. This allows us to compare the behavior in hardened 
and unhardened modes. 
 
3.3.2. Self-correcting TMR Pulse Latch Macro Design  
Fig. 3.5 shows the 16-bit TMR latch macro designed for use in the AES. 16 was 
chosen as the optimal number of latches to be clocked by a single pulse generator based on 
the opposing constraints of pulse fidelity and area utilization. In addition, the TMR pulse 
latches also provide a test mode, controlled by the multiplexer select TMA/B/C (see Fig. 
3.5) which allows for the testing of each pipeline copy by forcing the inputs of the other 
two copies to logic 1 and 0 individually by controlling TMA/B/CIN inputs. The pulse 
 
(a)                                                                               (b) 
Fig. 3.5. (a) Block diagram showing self-correcting pulse-clocked latch macros in the AES 
design. Multiplexers that select between data and test mode input are also 
shown. The pulse generator is modified to allow pipe stage unification using the 
Open Control signal. (b) Layout showing the custom designed block consisting 
of A, B, and C pulse latch copies and the respective pulse generator. Spatial 
separation between the latter provides a means to reduce multi-bit upsets. 
 
generator also has an open control signal that allows forcing the clock high to hold the latch 
transparent, acting as a buffer. This permits entirely bypassing a particular pipeline stage. 
Thus, different pipeline stage depths can be created by pipeline stage unification (PSU) for 
speed testing and for providing additional energy savings beyond dynamic voltage scaling.  
The latch blocks were characterized in Nanotime, incorporating time-borrowing 
during synthesis and automated place and route (APR), easing timing closure. Separate top 
level A/B/C clock trees ensure that the design is robust to clock SETs. The three pulse 
generator copies are separated by about 4 µm to reduce the probability of MNCC upsets 
affecting redundant copy pulsed-clocks. The latch copies are separated by 7.84 µm. 
3.3.3. Optimal Pulse Width Determination 
One key aspect of the pulse latch is determining the optimal pulse width required 
for reliable operation. Random and systematic (PVT corner) variations require statistical 
analysis of both the latch write time and the maximum pulse width. The former determines 
the minimum pulse width for a successful write, whereas the latter determines the hold 
time required to prevent a hold failure throughout the chip. The number of hold buffers in 
the chip also increases with the pulse duration, so the pulse width must be minimized during top 
level timing optimization.  
We used HSPICE Monte-Carlo simulations to provide target design margins that 
account for pulse degradation due to the aforementioned variations. The narrowest pulse 
width required to prevent a latch failure was determined from the mean required write pulse 
width μMIN-WRITE of 100.4 ps plus 5σMIN-WRITE (σMIN-WRITE = 7.9 ps), giving 139.7 ps (see Fig. 
3.6). The pulse generator was then designed to have a worst-case pulse width greater than 
 
this at its −4σ tail, ensuring that the worst case pulse generator (producing the shortest 
pulse) could still sufficiently clock the worst case latch (slowest latch) in the entire chip. A 
disadvantage of applying the statistical method to analyze the optimal pulse width required 
is that this methodology ignores the localized correlated and deterministic nature of the 
variations and instead considers systematic variation as a system-wide global parameter. 
This design margin, though larger than required since circuit variations in both pulse 
generators and latches tend to be correlated, was ensured deliberately to provide greater 
reliability. The pulse generator was designed to generate pulse widths based on μPG-WIDTH 
= 186.8 ps minus 4σPG-WIDTH (σPG-WIDTH = 9.6 ps) = 147.3 ps. The minimum hold margin 
required for the chip was then determined as the largest pulse produced by the pulse 
generator, i.e., μPG + 4σPG = 225 ps. Clock skew and design guard band increased the hold 
 
Fig.3.6 Statistical analysis of the pulse latch and pulse generator showing worst-case pulse 
width for proper data capture mean and sigma as well as pulse generator 
variation as determined by MC simulation. 
 
time actually used. Due to concerns about on-die VDD variations, particularly with wire-
bond packaging, we used worst-case voltage margins. Nanotime was used with these 
parameters for liberty timing file generation. The resulting files include the time-borrowing 
capability of the latches of 87.6 ps. 
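The margin arithmetic in this section can be reproduced with a short sketch. It uses only the rounded means and sigmas quoted above, so the computed tails land within about 1 ps of the quoted 139.7 ps and 147.3 ps values rather than matching them exactly; the helper name is illustrative.

```python
def tail(mean_ps, sigma_ps, k):
    """Return the value k standard deviations away from the mean (in ps)."""
    return mean_ps + k * sigma_ps


# Slowest latch: mean write pulse width plus five sigma.
min_write = tail(100.4, 7.9, 5)

# Pulse generator: shortest pulse (minus four sigma) and longest pulse (plus four sigma).
pg_short = tail(186.8, 9.6, -4)
pg_long = tail(186.8, 9.6, 4)

# The shortest generated pulse must still exceed the slowest latch requirement.
write_margin = pg_short - min_write

print(f"min write pulse  : {min_write:.1f} ps")
print(f"shortest PG pulse: {pg_short:.1f} ps")
print(f"longest PG pulse : {pg_long:.1f} ps (sets the hold requirement)")
print(f"write margin     : {write_margin:.1f} ps")
```

The longest pulse sets the minimum hold requirement before clock skew and guard band are added, which is why a wider pulse translates directly into more hold buffers.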
3.3.4. TMR Pulse Latch Test Mode 
The test mode is controlled by the multiplexer select TMA/B/C and allows for the 
testing of each pipeline copy by forcing the inputs of the other two copies to logic 1 and 0 
individually by controlling TMA/B/CIN inputs, thereby negating the majority voting 
impact on the third (tested) copy. Fig. 3.7 shows the A pipeline being tested by forcing 
logic 0 at the B pipeline input and logic 1 at the C pipeline input. The B and C copies 
 
Fig.3.7. Timing waveforms showing error correction with a TMR latch during test mode 
simulation. 
 
correct at the end of the pulse, after a delay of 202.8 ps and 340.1 ps respectively. The 
difference in correction delays is due to the pull-up and pull-down ratios.  
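The reason forcing the other two copies to opposite values exposes the tested copy follows directly from the majority function, as the following sketch shows. The function names are illustrative.

```python
def majority(a, b, c):
    """Two-of-three majority vote on single-bit values."""
    return (a & b) | (b & c) | (a & c)


def observed_in_test_mode(tested_bit):
    """Voted value when the other two copies are forced to logic 0 and logic 1."""
    return majority(tested_bit, 0, 1)


for bit in (0, 1):
    print(bit, observed_in_test_mode(bit))
```

With one forced 0 and one forced 1, the two forced copies always disagree, so the vote is decided by the tested copy alone.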
3.3.5. Pipeline Stage Unification using Pulse Latches 
3.3.5.1. Pipeline Depth and Pipeline Stage Unification Overview 
The optimal processor pipeline depth is a complicated function of the micro-
architecture, circuit design, and power budget. There does not appear to be a clear best 
choice, which is demonstrated by the fact that closely related designs, e.g., the Pentium Pro 
and Pentium 4 used 10 and 20 stages, respectively [Hint01]. Later implementations of the 
Pentium 4 rose to over 30 stages, enabling greater reliance on static circuits. One penalty 
of deep pipelining is the added power due to the flip-flops and latches. This has led to much 
more modest pipeline depths in embedded processor designs. 
 
Fig. 3.8. Conventional pipeline (a), bypassing master-slave FF (b) and pipeline depth 
collapse using PSU (c). Note that when OC (open control) = 1, the FF is 
bypassed and the pipeline stage is logically removed from the design. 
 
As a result, the ability to adjust the pipeline depth to match the application 
requirements has been proposed. In pipeline stage unification (PSU) 
[Efth02][Jacob04][Koppa02][Shim06] the pipeline stage flip-flops are bypassed to 
combine multiple stages into one, collapsing the overall depth as shown in Fig. 3.8. 
Combining (unifying) M pipeline stages at a time shortens the pipeline to N/M stages, but 
creates longer critical timing paths, where length increases linearly with M. Overall 
pipeline latency is constant, however, since it is based on the sum of combinational logic 
block delays, and so combinational logic energy dissipation is also constant. DVS provides 
power reduction as VDD² × F, whereas PSU provides linear energy reduction by reducing 
flip-flop power as pipeline stages are combined. Consequently, PSU is most applicable to 
total energy limited applications, rather than those minimizing energy per operation. Once 
at the minimum usable VDD (VDDMIN) operating frequency reductions provide further linear 
power savings, albeit at the expense of throughput. PSU can provide additional energy 
savings when at VDDMIN. 
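The scaling relationships described above can be summarized in a small sketch. The stage count, stage delay and operating points are hypothetical, and rounding N/M up when M does not divide N is an assumption.

```python
import math


def unified_depth(n_stages, m):
    """Pipeline depth after unifying M stages at a time (N/M, rounded up)."""
    return math.ceil(n_stages / m)


def critical_path(stage_delay_ns, m):
    """Critical path length grows linearly with the number of unified stages."""
    return m * stage_delay_ns


def latency(n_stages, stage_delay_ns):
    """Total latency is the sum of the combinational block delays."""
    return n_stages * stage_delay_ns


def dvs_power_ratio(vdd, f, vdd_ref, f_ref):
    """Dynamic power relative to a reference point, scaling as VDD^2 * F."""
    return (vdd ** 2 * f) / (vdd_ref ** 2 * f_ref)


# Hypothetical example: 12 stages of 2 ns each, unified three at a time.
depth = unified_depth(12, 3)
path = critical_path(2.0, 3)
print(depth, path, depth * path == latency(12, 2.0))

# Hypothetical DVS point: half the voltage and half the frequency.
print(dvs_power_ratio(0.6, 50.0, 1.2, 100.0))
```

In this simple model the latency is unchanged by unification, while halving both voltage and frequency cuts dynamic power to one eighth.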
One significant problem with using PSU in a processor is the difficulty in 
combining pipeline stages containing disparate circuit functions. For instance, many 
microprocessor stages contain register files or memories, which require clock edges to 
initiate read, write, or pre-charge operations. However, modern SOC designs include a 
large number of blocks with much more uniform pipeline stages, e.g., for digital signal 
processing functions. These may be more amenable to PSU. 
 
3.3.5.2. PSU using Pulse Latches  
PSU can be implemented with standard master slave flip-flops (MSFFs) as shown 
in Fig. 3.8. However, implementing PSU using MSFFs has the disadvantage of adding an extra 
delay through the bypass mux, as shown in Fig. 3.8 (b). Moreover, the extra MSFF clock 
loading increases the computation energy. Pulse clocked latches, in contrast, are very 
amenable to implementing PSU: by continuously asserting the pulse clock high, the latch 
remains in the transparent condition, acting as a buffer (refer to Fig. 3.9). In the 
design presented here, this logic is implemented by modifying the pulse-clock generator 
with the Open Control signal (see Fig. 3.5). Implementing the latches as a macro 
ensures high pulse fidelity by controlling the pulse clock routing and line-to-line coupling. 
The possible PSU configurations for the 15 stage AES pipe are shown in Fig. 3.10. 
 
Fig.3.9. Operation in normal and PSU mode. Note the values of TC2Q and TD2Q also include 
delay across one buffer stage at the output of the latch for providing the required 
drive. All numbers indicate the worst case timing for the latch farthest from the 
pulse generator. 
[Figure annotations: Q = D when OC = 1; Q = D at pulse high when OC = 0. Waveforms shown: System CLK, Open Control, Pulse, Data D, Output Q. Time borrowing available TPW – TSU = ~88 ps; TC2Q = 298.2 ps; TD2Q = 179.2 ps.]
 
3.3.6. AES Implementation and Test Chip Design 
The AES design was synthesized using RTL Compiler and auto-placed and routed 
in SOC Encounter. The overall design approach leverages an ASIC auto-place and route 
(APR) design flow using standard and full-custom cells. The latter include the pulse-
clocked latch macros, as well as register file memory (used for buffering and test mode 
 
Fig. 3.10 AES pipeline using pulse-clocked latches for pipeline stage synchronization. 
Maintaining the pulse-clock high removes that latch from the pipeline, providing 
feed-through without added bypass delay. A 256-bit key width mandates 15 
pipeline stages for fully pipelined operation. The 7 different PSU configurations 
possible are also shown. 
[Figure content: latch macros 1 to 15 along the pipeline and the stage grouping for each PSU configuration.
Pipeline depth   Stage grouping
1 stage          1 1 1 1 1 1 1 1 1 1 1 1 1 1
2 stages         2 2 2 2 2 2 2
3 stages         3 3 3 3 2
4 stages         4 4 4 2
5 stages         5 5 3
6 stages         6 6 2
7 stages         7 7
Latch macro in pipeline mode: OC = 0, clocked by the PG clock. Latch macro in transparent mode: OC = 1, clock held at '1'.]
 
control). All custom cells are compatible with the foundry supplied LP 7-track standard 
cells. The library cells were re-characterized using ELC at different voltages for DVS and 
multi-voltage PSU analysis. RTL Compiler was used for synthesis while Encounter was 
employed for APR. To avoid MNCC SET generation, the combinational logic is also 
spatially separated. This is accomplished by a physical design flow that keeps the cells 
constituting the redundant copies in separate regions via fences that impose a hard 
placement constraint. The fences are shaped to maintain separation while achieving high 
density. Iterative timing driven placement was used to determine suitable fence geometries. 
The resulting dumbbell shapes for each pipeline stage are shown in Fig. 3.11. The three 
copies have approximately equal areas, despite different geometries. The latch macros 
occupy the spaces between pipeline stages, as they communicate with all three copies. 
Customized power plans and user-guided placements of the multi-bit latch macros reduce 
routing congestion and clock power. Clock tree synthesis achieved a clock skew of ±30 ps. 
Hold time buffer insertion provided a conservative minimum delay of about 600 ps. 
 
(a)     (b)      (c) 
Fig. 3.11. (a) Die photomicrograph (left) with layout inlaid. (b) Spatially separated 
combinational logic in AES TMR with fence and floorplan sizes. (c) TMR 
separation flow generated powerplan with offsets for flip-flop placements. 
 
Primetime analysis of the fully extracted design post-route showed 408 MHz with zero 
timing slack at 1.2 V VDD. 
3.3.7. Results and Analysis 
Circuit simulations show that the pulse clocked latches dissipate 42% less energy 
per operation than flip-flops from the library with equivalent buffering, in line with 
[Clark01][Tshanz01]. Designing with pulse clocked latches increased the required hold 
margins significantly. Approximately 8k hold buffers were required per non-redundant 
pipeline, reducing the power savings, since the AES stages have many latch to latch paths. 
3.3.7.1. Pipeline Collapse Analysis 
Maximum operating frequency FMAX determined by Primetime using fully 
extracted parasitics for various voltages and with all stages of pipeline collapse is plotted 
in Fig. 3.12. It can be seen here that delay scales with voltage as expected, with larger 
 
(a)                                                                (b) 
Fig. 3.12. (a) Primetime analysis of the different PSU cases at different supply voltages. (b) 
Primetime critical path delay results for different PSU cases as a function of 
VDD. 
 
delays and consequently lower frequencies as VDD is reduced. FMAX at VDD of 1.5 V, 1.2 
V, 0.9 V, and 0.6 V are 606 MHz, 408 MHz, 272 MHz, and 67 MHz, respectively, for a 
non-collapsed pipeline as shown in Fig. 3.12 (a). The FMAX reduction due to DVS from 1.5 
to 0.6 V is 89%. Further reduction in speed at VDD = 0.6 V using full (7 stage) PSU is 83%. 
Referring to Fig. 3.10, one stage is the non-PSU case and 2 stage refers to stages combined 
as groups of 2, etc. Delay per unified stage case is plotted in Fig. 3.12(b), which is linear 
as expected. 
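The quoted DVS reduction can be recomputed from the FMAX values above. The second helper assumes an ideal model in which delay is exactly linear in the number of unified stages; that idealization is an assumption, and it gives a figure slightly above the 83% quoted for seven-stage PSU.

```python
# FMAX in MHz for the non-collapsed pipeline at each VDD (values quoted above).
FMAX_MHZ = {1.5: 606, 1.2: 408, 0.9: 272, 0.6: 67}


def reduction_pct(f_from, f_to):
    """Percentage reduction in frequency when going from f_from to f_to."""
    return 100.0 * (1.0 - f_to / f_from)


def ideal_psu_reduction_pct(m):
    """Frequency reduction if delay were exactly M times the single-stage delay."""
    return 100.0 * (1.0 - 1.0 / m)


dvs_reduction = reduction_pct(FMAX_MHZ[1.5], FMAX_MHZ[0.6])
print(f"DVS 1.5 V -> 0.6 V: {dvs_reduction:.1f}%")
print(f"Ideal 7-stage PSU : {ideal_psu_reduction_pct(7):.1f}%")
```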
While the stages alternate in structure, the synthesis, APR, and optimizations 
treated the design as flat, i.e., each stage is different in the implementation. Nonetheless, 
only the first stage shows significant timing differences from the other stages, due to 
reduced logic in the first stage of the AES engine. Key and data critical timing paths do not 
show significant differences between stages, as they are driven to the same timing slack 
during optimization.  
 
(a)                                                        (b) 
Fig. 3.13. (a) Energy per operation at 1.2 V. (b) FPGA setup for testing the designed AES 
chip. 
[Plot (a): energy per operation (nJ), 0 to 2.5, versus AES stages per pipeline stage (1 stage to 7 stage). Series: total flip-flop based AES energy, total pulse latch based AES energy, combinational logic energy, total flip-flop energy, total pulse latch energy. Panel (b): test FPGA and DUT.]
 
Energy per operation at VDD = 1.2 V is shown in Fig. 3.13 (a). Power dissipation 
due to the combinational logic is 68% of the overall power when using flip-flops, and 
increases to 79% when using the pulse-clocked latches. Since the combinational logic is 
such a large fraction of the total power, PSU improvements are modest, providing a total 
energy savings at two pipeline stages per unification of 7.6%. When using conventional 
flip-flops, the savings are greater, at 16.4%, since flip-flop clocks are gated off in 
transparent stages. The large savings already provided by pulse-clocked latches significantly 
diminish the potential for further switching energy reduction with PSU. The 
majority of the savings (6.4%) is obtained with the pulse-clocked latch scheme 
implemented in our test chip with four stage collapse. Using conventional flip-flops, the 
savings is 13.8% with four stages. Five stages increase this to 15.1%.  
3.3.7.2. Experimentally Measured Performance 
The test board uses a Xilinx Kintex-7 FPGA to drive the chip-on-board (COB) 
device under test (DUT) (Fig. 3.13 (b)). The DUT board is soldered to the main board to 
achieve small bond-wire inductance. COB packaging allows the standard foundry CMOS 
I/O to operate at speeds up to 167 MHz with 2.5 V I/O. The I/O speed limitation is 
overcome in silicon testing using PSU. 10 Gb/s throughput through the 64 bit input and 
output buses was achieved by running the design in AES counter mode. Internal throughput 
is much higher and indicative of the performance in an embedded application where off-
chip bandwidth is not limiting. 
The PSU modes allow us to test the pipeline at faster rates, despite the I/O 
limitation. Due to the limited improvement in energy dissipation beyond that obtained by 
the use of pulse-clocked latches and DVS, we believe the primary value of PSU lies in providing 
 
a test mode to debug speed paths. We use it here to determine the non-PSU FMAX by 
ascertaining the FMAX in the PSU modes. Because the pipeline stage delays are nearly 
identical as shown above, the as-fabricated delays can be determined without higher speed 
clocks and I/O. Collapsing to groups of four pipeline stages shows a fully pipelined FMAX 
of 500 MHz, with the core VDD = 1.5 V (Fig. 3.14(a)). 500 MHz provides an equivalent 
throughput of 64 Gb/s. The measured results at high voltage track the Primetime analysis, 
but with a degradation in the silicon results of about 20% at 1.5 V. We attribute some of 
this to variability induced pulse width variations as well as variations in the critical timing 
paths in each stage. The perfect time-borrowing and stage delays assumed by the static 
timing analysis makes it optimistic, particularly for this design, which has thousands of 
identical, critical path delays. However, this cannot account for all of the analysis 
optimism. Energy per operation scales, as expected, with VDD² and is linear with the 
number of collapsed PSU stages. 
 
 
(a)                                                                        (b) 
Fig. 3.14. (a) Measured test chip FMAX vs. VDD at high operating voltages. The upper values 
are measured using the PSU mode. (b) Measured test chip FMAX vs. VDD at low 
operating voltages. The trend does not match the high voltage behavior. 
[Plot (a): frequency (MHz), axis ticks 100 and 1000, versus VDD (V), 0.9 to 1.6. Plot (b): frequency (MHz), axis ticks 10 and 100, versus VDD (V), 0.58 to 0.78.]
 
The foundry supplied I/O drivers (presumably the level shifters) limit the core 
VDDMIN to 570 mV even at a reduced I/O voltage of 1.8 V. Fig. 3.14 (b) shows the measured 
low voltage behavior of the test die near VDDMIN. While the expected frequency vs. VDD 
relationship using PSU was observed at high voltages, we did not observe different speeds 
for pipeline collapse at voltages near VDDMIN. The hardware results observed with PSU 
at VDD = 0.7 V are shown in Fig. 3.15. This leads us to conclude that the low voltage behavior 
is dominated by the I/O speed, outside the AES pipeline, again presumably at the level 
shifters. 
3.3.7.3. AES Beam Test Results 
Primetime timing analysis predicted a maximum speed of 408 MHz at nominal VDD 
of 1.2 V and nominal process. Simulations and Primetime power analysis show that 
combinational logic contributes to 79% of the power dissipation, a very large percentage, 
achieved by the pulsed-clocking. 
 
Fig. 3.15. Measured hardware results for 1/FMAX  at VDD = 0.7 V for different pipeline 
stages using PSU. Passing points are shown. FMAX of the uncollapsed pipeline 
at 0.7 V is limited by the chip IO. Testing PSU beyond 4 stages is limited by the 
minimum possible FPGA clock frequency. 
[Plot: combined stage delay (ns), 0 to 250, versus pipeline stage unification (1 stage to 7 stage), with one point annotated as an IO limited datapoint.]
 
The test board drives the chip-on-board (COB) device under test (DUT) controlled 
by a Xilinx Kintex-7 FPGA. COB packaging allows the standard foundry CMOS I/O to 
operate at speeds up to 167 MHz with proper voltage selections. The I/O speed limitation 
is overcome in silicon testing using the pipeline collapse test mode that allows 
measurement of the total combinational logic delay. Using pipeline collapse, the core 
circuit performance was determined to allow 500 MHz at VDD = 1.5 V. Broad beam 63 
MeV proton tests at UC Davis at a flux up to 8.92×10⁷ particles/cm²-s produced a total of 
32 errors for the non-TMR mode with a total fluence of 2.75×10¹¹ protons/cm². No errors 
were observed in the TMR mode with a total fluence of 3.1×10¹¹ protons/cm². Both tests 
varied VDD from 0.7 V to 1.4 V. 
3.4. Temporal Pulse Clocked Flip-Flop  
This section presents a novel pulse-clock based RHBD flip-flop design that is hard 
to both SETs and SEUs. The design uses triple-mode redundant latches, combined with 
appropriate clocking to provide redundancy in both space and time. The FF is 40% smaller 
and 30% more power efficient than delay-based, e.g., temporal FFs implemented on the 
same 90-nm foundry process. The design uses constituent circuit interleaving to provide 
robustness against MNCC. The multiple-bit FF macro shares common pulse-clock 
generation circuitry, so density is comparable to designs that are robust to SEU only. The 
multi-bit flip-flop macro has been fabricated and tested functional as shift registers on a 
90-nm foundry LP process. 
 
3.4.1. Design of the Temporal Pulse Clocked Flip-Flop 
The core of most successful hardened latch circuits is some form of circuit or timing 
(temporal) redundancy. Temporal hardening uses time based sampling or feedback to 
mitigate both SEU and SETs via the use of multiple clocks or delay elements within the 
latches [Mavis02]. The penalty of the temporal approaches is a significant increase in FF 
setup and/or dead-time. The delay filtered DICE (DF-DICE) [Nase06] combines delay 
circuits with C-elements to filter SETs at the FF inputs while the DICE storage element 
provides SEU mitigation. Temporal filtering is costly because making a circuit slow is 
increasingly difficult, since circuit delay (τ) is  
 
(a) 
 
(b) 
Fig.3.16. (a) Concept showing redundant FFs clocked by temporally separated clocks. (b) 
Erroneous SET at CLK input generates false edges on delayed version of clocks. 

 
$$\tau = \frac{C V_{DD}}{2 I_{EFF}}$$
 
where IEFF is the average switching current consumed with output capacitance C 
and supply voltage VDD. Long channels or stacked gates generate low drive and increase C 
on the preceding gate. Increasing C to generate required delays negatively impacts power 
dissipation. 
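A quick numerical sketch of the delay expression illustrates why generating long delays is expensive. The capacitance, voltage and current values below are hypothetical and chosen only to show the scaling.

```python
def gate_delay(c_farads, vdd_volts, ieff_amps):
    """Gate delay tau = C * VDD / (2 * IEFF), in seconds."""
    return c_farads * vdd_volts / (2.0 * ieff_amps)


# Hypothetical values: 2 fF load, 1.2 V supply, 50 uA effective drive current.
tau = gate_delay(2e-15, 1.2, 50e-6)
tau_double_c = gate_delay(4e-15, 1.2, 50e-6)

print(f"tau          = {tau * 1e12:.1f} ps")
print(f"tau with 2xC = {tau_double_c * 1e12:.1f} ps")
```

Delay grows only linearly with C, so a delay element of hundreds of picoseconds needs either a large switched capacitance or a very weak drive.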
3.4.1.1. Derivation of the Design 
One interesting, but to our knowledge never implemented, variation of the 
temporally hardened FF is shown in Fig. 3.16 (a). Three redundant FFs clocked by 
temporally separated clocks mitigate SEUs and SETs [Mavis02]. By separating successive 
clock edges by more than the SET duration, any SET at the D inputs is sampled by at most one FF. 
The majority gate voted output provides a correct Q in all cases. 
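The temporal sampling argument can be illustrated with a behavioural sketch: three samples of D are taken at separated clock edges and voted. The model below is an idealized abstraction (instantaneous sampling, a single rectangular SET on D), and all names and timing values are illustrative assumptions.

```python
def majority(a, b, c):
    """Two-of-three majority vote on single-bit values."""
    return (a & b) | (b & c) | (a & c)


def d_with_set(t_ps, true_value, set_start_ps, set_width_ps):
    """Value on D at time t: inverted during the SET, true_value otherwise."""
    in_set = set_start_ps <= t_ps < set_start_ps + set_width_ps
    return (true_value ^ 1) if in_set else true_value


def temporal_sample(t0_ps, delta_ps, true_value, set_start_ps, set_width_ps):
    """Sample D at t0, t0 + delta and t0 + 2*delta, then vote."""
    samples = [
        d_with_set(t0_ps + k * delta_ps, true_value, set_start_ps, set_width_ps)
        for k in range(3)
    ]
    return majority(*samples)


# Edge separation (600 ps) longer than the SET (500 ps): the vote is always correct.
ok = all(temporal_sample(0, 600, 1, start, 500) == 1
         for start in range(-600, 1800, 50))
print(ok)

# Edge separation (400 ps) shorter than the SET: two samples can be corrupted.
print(temporal_sample(0, 400, 1, 0, 500))
```

This sketch covers only SETs on the D input; it does not model a transient on the clock itself.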
 However, this design is not immune to clock SETs, as shown in Fig. 3.16 (b). Here 
an SET produces a positive clock edge, clocking a new value into the FF. Note that an SET 
 
Fig.3.17. Block Diagram of Pulse based FF. Note the TMR pulse generators (PGA, PGB 
and PGC) are shared across 16 latches in parallel. The D input is temporally 
sampled by delayed pulses and voted out at the output. 

 
induced at the temporal filter output affects the DF-DICE design similarly. Our design 
replaces TMR FFs with TMR pulse-clocked latches (Fig. 3.17). Pulse-clocked latches 
emulate FFs and can substantially reduce power consumption. Power savings are 
maximized when multiple latches share one clock pulse generator, since pulse generation 
is power expensive. This sharing is critical here, since the clock generator requires delay 
elements to provide the appropriate TMR clock edge temporal separation. A single 
pulse-clock generator drives each 16-bit FF macro. Sixteen latches were chosen as this 
provides good area utilization while still ensuring pulse fidelity. The pulsed clocks PCLKA, 
PCLKB, and PCLKC are local to the cell, so clock pulse waveform quality is well 
controlled. 
3.4.1.2. Clock Generator Design 
A key aspect of the design is avoiding the issue pointed out in Fig. 3.16 (b). To 
mitigate clock SETs, it is imperative that an SET on the global clock does not propagate to 
 
Fig. 3.18. Waveform showing a low to high SET during the low phase of the clock. False 
temporal pulse edges generated, sample the incorrect D value (logic 0) causing 
an error to be latched at the output. 

more than one PCLK. The clock and its delayed versions, D1Clk and D2Clk, are passed 
through pulse generators (PG) PGA, PGB and PGC. The SET duration tSET that the circuit 
is immune to is 500 ps; the delay t of the delay circuit is set to provide that clock rising edge separation. 
A naïve clock delay and pulse generation (PG) circuit implementation suffers from 
the same issue of a clock SET propagating to all latches as in Fig. 3.16 (b) (Fig. 3.18). We 
filter each PCLK with a C-element (Fig. 3.19). This provides the correct rising to rising 
edge delay, while also filtering out any transient of the input clocks. The filter cost is low, 
as the delay circuits dominate the area and power, and are required in both the clock SET 
mitigated and unmitigated versions. Correct operation is also provided in the event that 
collected charge at a clock tree node upsets the CLK input. 
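
The filtering action of a Muller C-element can be illustrated with a small discrete-time model. This is a generic sketch, not the schematic of Fig. 3.19: it assumes the C-element inputs are a signal and a delayed copy of that signal, uses an arbitrary 100 ps time step, and sets the delay to 5 steps (500 ps) to match the stated tSET.

```python
def c_element(a, b, prev):
    """Muller C-element: output follows the inputs when they agree, else holds."""
    return a if a == b else prev


def filter_waveform(wave, delay_steps):
    """Pass a sampled 0/1 waveform and its delayed copy through a C-element."""
    out = []
    prev = wave[0]
    for i, a in enumerate(wave):
        # Delayed copy of the input; before the delay elapses it holds wave[0].
        b = wave[i - delay_steps] if i >= delay_steps else wave[0]
        prev = c_element(a, b, prev)
        out.append(prev)
    return out


DELAY_STEPS = 5  # 5 steps of 100 ps = 500 ps (assumed step size)

# A 300 ps glitch during the low phase, followed by a real rising edge.
clk_with_glitch = [0] * 5 + [1] * 3 + [0] * 12 + [1] * 20
filtered = filter_waveform(clk_with_glitch, DELAY_STEPS)

# A 700 ps glitch is wider than the delay and is not removed.
wide_glitch = [0] * 5 + [1] * 7 + [0] * 28
filtered_wide = filter_waveform(wide_glitch, DELAY_STEPS)

print(filtered)
print(filtered_wide)
```

In this model the narrow glitch never reaches the output and the real rising edge appears one delay later, whereas a transient wider than the delay passes through.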
 
Fig. 3.19. Proposed FF design with Muller-C elements in the pulse generator. This design 
is hard to SETs on the Clock and Data inputs. 

3.4.1.3. Hardness Validation Simulation Methodology 
The TMR latches provide SEU hardness. A simulation methodology applying SETs 
at different clock and latch states validated proper mitigation of SETs on the clocks and 
inputs (Fig. 3.20). Type 1 and type 2 SETs (high going and low going, respectively) advance 
and delay the clock rising edge, respectively. Type 3 and type 4 SETs glitch the clock low 
while it is high and glitch it high while it is low, respectively. The proposed design 
 
Fig. 3.20. Simulation methodology for testing SET immunity. All 4 SET cases are covered 
(two (1,2) clock edges and two (3,4) clock phases). 

Fig. 3.21. Simulation waveforms for the proposed FF with a sample low SET at the D input. 
Two redundant copies latch the correct input value and the error is voted out. 

operates correctly for any CLK SET, with no false edges generated. In the event that one of 
the PCLK generators or buffer nodes collects charge and generates an SET, only that clock 
is affected. The FF Q remains correct due to the majority voter.  
Any SET up to a width of tSET at the D input is mitigated, as is an SEU in one latch 
(see Fig. 3.21). In this case, two of the latches capture the correct value at their respective 
PCLK falling edge, while depending on the SET timing, one latch may capture the 
incorrect value. 
3.4.1.4. Pulse Width Design and Timing 
A key problem with pulse-clocking is the high quality of the pulse clocks required 
over systematic, i.e., process, voltage, and temperature (PVT) as well as random process 
variations. We used the Monte Carlo (MC) simulation methodology described in Section 3.3.3 
for validation, using 1000 trials to establish the mean (µ) and sigma (σ) values. The 
minimum latch pulse width required, as well as the pulse width generated by the PG, have the 
normal distributions shown in Fig. 3.22.  
 
Fig. 3.22. Statistical design of the pulse latch using Monte-Carlo simulations for the 
optimal pulse width required. Plotted: PDF of pulse width required by latch (µ = 97.53 ps, 
σ = 7.65 ps) and PDF of pulse width generated (µ = 153.5 ps, σ = 5.7 ps). Annotations: 
(µlatch + 3σlatch) = 128.1 ps; (µPG − 3σPG) = 136.7 ps; margin of 8 ps. 

The minimum pulse-width (tPW) required by the latch has µ = 97.53 ps and σ = 7.65 ps. 
A 4σ design provides a 129 ps pulse width. The PG generated clock pulse width has µ = 153.5 
ps with σ = 5.7 ps. The resulting targets provide margin to over 6 σ combined. 
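
The pulse-width arithmetic above can be reproduced with a short sketch. It assumes, as the text states, that both quantities are normally distributed, and it uses only the reported µ and σ values. The helper names are hypothetical. It computes the latch requirement at 4σ and expresses the distance from that requirement to the PG mean in units of the PG sigma.

```python
MU_LATCH, SIGMA_LATCH = 97.53, 7.65  # ps, minimum pulse width required by the latch
MU_PG, SIGMA_PG = 153.5, 5.7         # ps, pulse width generated by the PG


def required_width(mu, sigma, k):
    """Pulse width that covers the latch requirement out to k sigma."""
    return mu + k * sigma


def sigmas_below_mean(mu, sigma, threshold):
    """How many sigma the threshold lies below the generated mean."""
    return (mu - threshold) / sigma


latch_4sigma = required_width(MU_LATCH, SIGMA_LATCH, 4)
pg_sigmas = sigmas_below_mean(MU_PG, SIGMA_PG, latch_4sigma)

print(f"latch requirement at 4 sigma: {latch_4sigma:.2f} ps")
print(f"PG mean exceeds it by {pg_sigmas:.2f} PG sigma")
```

The 4σ latch requirement evaluates to about 128.1 ps, consistent with the rounded 129 ps design target.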
Hardening to SETs as well as SEUs implies significant FF timing sacrifices. In the 
contemporary temporally hardened FF tSETUP increases by 2t for hardened operation; 
the worst case is an SET at D just as the FF has setup, which requires t internally. We 
measure conventionally, from the incoming clock edge (Fig. 3.23), so that tSETUP ≈ –(t + tPW + 
2tINV) as the first internal clock follows a delay. 
 The PCLK generation delay is similar to the latch internal delay. tCLK2Q follows the 
second clock, so the FF dead-time follows the separation of the clock rising edges, similar 
to the dead-time for other temporal FFs (cf. [Nase06], [Matush10]). For hardened operation 
tHOLD = 3t + tPW. This is a significant disadvantage over other SET mitigated FFs, where 
 
Fig. 3.23. Timing parameters of the proposed FF. Note the tDEAD is half of 
[Nase06][Matush10][Shambhu14]. Data must be held until the PCLKC closing 
edge for hardness. 

tHOLD ≈ 0. Otherwise the last sampling latch may be incorrect, yielding unhardened 
operation for the overall FF. 
3.4.2. Physical Implementation 
3.4.2.1. Multiple Node Charge Collection Mitigation 
We mitigate MNCC by separating the A, B, and C copies of each latch, additionally 
isolating them from their respective clock PG circuits so that only one should collect charge 
from ions below about 30 MeV-cm2/mg LET [Shambhu14]. We use vertical interleaving 
since it separates N+ diffusions by N-wells. The latter tend to collect charge so that it is 
very unlikely to have a track that causes collection at two N+ nodes separated by 
two track heights, here approximately 4 µm (two 7-track 90-nm cell heights). We avoid 
wasted layout area by interleaving the cells in the multi-bit FF cell, as in [Matush10] 
[Shambhu14]. 
 
Fig. 3.24. Floor plan view of the proposed FF. The vertical stripes are M2 tracks. Various 
M2 tracks are shared to minimize routing. Dimensions labeled in the figure: 8.96 µm, 
3.64 µm, and 15.68 µm. 

3.4.2.2. Physical Design and Multi SET simulation result 
Fig. 3.24 illustrates the layout floorplan of the 16-bit FF macro showing the 
interleaved constituent cells. The cell height matches the foundry library at 1.96 µm. The 
PG delay elements, Muller C-elements, and PGs occupy 9 rows and are interleaved. Two 
rows of decoupling capacitors (fill) separate the sensitive clock generation nodes. The M2 
vertical wires share tracks where possible to minimize impact on IC routing. The PGs and 
2-bit TMR latches are 8.96 µm and 3.64 µm wide, respectively. 
3.4.2.3. Test Silicon and Results 
Fig. 3.25 shows the fabricated 90-nm test die photo with the test structure layout 
overlaid. The silicon has tested fully functional at voltages from 0.6 V to 1.5 V, 
with the lower voltage limited by the I/O pad ring level shifters. 
3.4.3. Comparison with Previous Work 
The use of pulse-clocked latches reduces the redundant circuit size below that of 
even SEU hard only designs. Due to the ability to share the delay circuits, which reside in 
 
Fig. 3.25. Test chip die photo with the proposed FF test structures. 

the local clocking circuits, the proposed multi-bit FF has fewer delay elements per bit than 
any previous fully SET hard temporal design. Table 3.1 compares the measured size, 
power, and soft-error mitigation. The baseline FF from the foundry library is soft. The 
BISER FF does not mitigate input SETs, but still requires the equivalent of five latches. 
Amortizing the pulse clock generation and delay elements provides savings even over this 
design. Temporal FFs require two to four delay elements per FF as does the DF-DICE, 
although the DF-DICE could share the clock filter. In the proposed design SETs at the PG 
output are mitigated, but an SET at the DF-DICE clock filter is not. Consequently, the soft 
error rate of the proposed FF will be lower. Moreover, it is extremely difficult to spatially 
separate critical nodes in DICE designs, rendering them potentially as soft as the baseline 
hardened FF [Gasp13]. Like the temporal design in [Shambhu14] the proposed FF 
mitigates all SEU and SET, but it is almost 40% smaller and the power consumption is 
reduced 70% per bit of storage. 
3.4.4. Future Work 
Through application of pulse-clocking and interleaving constituent circuits in a 
multi-bit FF implementation the design presented here provides high SEU and SET 

                      Baseline FF   BISER FF   Temporal FF   Proposed FF
Area (µm2)            12.6          36.3       67.97         37.31
Energy (fJ/cycle)     1.77          5.91       9.66          7.77
Clock SET mitigated
D SET mitigated
SEU mitigated

Table 3.1. Comparison between various FFs. The proposed FF is hard to input D/CLK 
transients and has an area comparable to the BISER FF. The proposed FF is 40% 
smaller per bit stored than temporal FF proposed in [Shambhu14]. 

immunity, including protection against MNCC. The design compares favorably with 
previous designs and is the smallest per bit of any published temporally hardened FF 
design. The design is shown to be hard through simulation and has been fabricated and tested 
fully functional in a 90-nm foundry bulk CMOS process. However, as will be discussed in 
Chapter 4, this design is further optimized for power by considering an integrated globally 
redundant clock tree and pulse latch design that consumes 20% less power without 
compromising hardness. 
3.5. Chapter Summary 
This chapter proposed using self-correcting TMR master-slave flip-flops with faster 
design cycles. The chapter also describes circuit techniques for designing hardened 
pulse-clocked flip-flops. The design of a self-correcting TMR pulse latch is presented and its 
pulse width statistically optimized. The TMR pulse latch proposed also provides the 
designer with features such as an open control signal for pipeline stage unification and test 
mode for debug. This is then used in the design of an AES engine and its performance 
analyzed and hardness verified with broad beam testing. This chapter also proposes a novel 
temporal pulse latch design using temporally separated clock pulses that is 40% smaller 
and consumes 70% less power than a previously proposed temporal flip-flop. This design 
is then implemented in a 90-nm test chip for verification.  
  
CHAPTER 4. RADIATION HARDENED CLOCK DISTRIBUTION 
This chapter focuses on designing radiation hardened custom and ASIC clock 
distribution networks and analyzes the vulnerability of clock trees in general. A robust 
clocking and pulse clock FF methodology for SET and SEU mitigation is also proposed. 
4.1. Contribution of This Work 
Though single event effects on CMOS combinational and sequential circuits have 
been studied and analyzed in great detail [Beau93, Mass93, Koba09, Seif06], their effects 
on clock networks have not been widely addressed, and then only through simulation. In 
this chapter, we study the effects of particle strikes in clock networks and propose 
techniques for designing both custom radiation hardened clock distribution networks 
(RHBD clock spine) and ASIC clock trees for a radiation hardened microprocessor 
designed using standard CAD flow methodologies. 
In particular, a radiation hardened clock spine is designed and experimentally 
verified. The design is not only tolerant to most SEEs, but also detects single event upsets 
(SEUs) and single event transients (SETs) causing incorrect clock assertions. This is useful 
for characterization, but more importantly allows error recovery or mitigation in circuits 
where logic upsets cause clocks to be asserted, or to fail to assert, such as in memory 
circuits. The radiation hardened clock spine was fabricated on two 90 nm test chips and 
tested for circuit hardness in the presence of protons and heavy ions with LET exceeding 
100 MeV-cm2/mg.  
In designs where a custom clock distribution spine cannot be used, standard vendor 
supplied CAD tools are commonly used to build ASIC clock trees. This chapter also 
discusses techniques for designing and analyzing radiation hardened ASIC clock trees in 
conjunction with the clocking scheme and the sequential element chosen, using standard 
CAD flow methodologies. The design characteristics of such synthesized and routed trees 
are studied in detail. The proposed ASIC trees were also built on a 90-nm test chip, 
with experimental verification pending. This chapter also provides a framework for 
analyzing the vulnerabilities of clock trees in general, and studies the parameters that 
contribute most to the risk of the tree’s failure, including the impact on controlled latches. 
This is then used to design an integrated temporally redundant clock tree and pulse clocked 
flip-flop based clocking scheme that is robust to SETs and SEUs. 
4.2. Single Event Effects in the Clock Network 
SEEs affect the clock network primarily by producing three particular modes of 
circuit failure: radiation-induced clock jitter, clock glitching and incorrect clock assertions. 
1) Radiation Induced Clock Jitter—when the radiation particle induced charge is 
collected close to the clock edge, the clock edge may deviate from its expected transition 
time causing increased clock jitter. Basically, the driving transistors’ current to the clock 
node is enhanced or diminished by the amount of collected charge. The net effect is a 
timing push out or pull in of the affected edge. This directly affects the clock jitter, 
requiring larger setup and hold times to accommodate the worst-case impact.  
2) Clock Glitching—the collected charge causes the clock to transition to the wrong 
state, introducing a new clock edge. In edge sensitive circuit paths, this can lead to the 
wrong data being sampled. For flip-flops, this can also induce setup or hold time errors in 
the subsequent logic. This is referred to as “radiation induced race” in [Seif05], [Dash09]*. 
Edge triggered circuits are not limited to flip-flops. Memory circuits pre-charge dynamic 
bit lines in one clock phase, reading or writing in the other clock phase. At high clock 
frequencies, these high capacitance nodes, e.g., the bit lines, require the entire clock phase 
for correct operation. Clock glitches interrupt these operations and may produce incorrect 
results that may, in turn, not be captured by error detection and correction schemes 
[Yao10]. 
Since local clock nodes generally feed more than one sequential element, a 
radiation strike on one such node may cause multiple errors to be propagated. It has been 
shown that even hardened flip-flops, e.g., those using DICE storage [Calin96], may be 
prone to errors induced on high fan-out nodes such as clocks and resets [Warren09], 
[Knud06]. Designs that are not prone to such upsets are larger and dissipate more power 
[Mavis02], [Matush07]. 
3) Incorrect Clock Assertions in Gated Clocks—all modern commercial designs 
save power by using clock gating, which is, in general, more effective at finer granularities. 
Basically, clocks to circuits that are unused in that clock cycle are gated off, which 
eliminates both sequential circuit (flip-flop and latch) as well as combinational logic power 
as studied in Chapter 2. It is possible to “grid” clocks at all levels, which reduces skew, 
and more importantly in a radiation tolerant design, increases clock node capacitance and 
thus SET resilience. However, fine-grained clock gating results in relatively low drive, low 
capacitance clock nodes, which are more susceptible to SETs. Moreover, clock gating 
requires latching the clock enables through the active clock phase. These latches are in turn 
susceptible to SEU, and the logic path controlling the gating to SET. Depending on the 
sequential circuit approach used, hardened designs may not use this due to issues with 
clock SETs [Mohr07]. Thus an SET in the combinational logic can fan into the clock 
network through the clock gaters. 
It has been asserted that for commercial circuit architectures using master-slave 
flip-flops, SEEs affecting clock nodes contribute to about 20% of the total chip errors while 
the number rises to about 90% for pulse latch based designs [Seif05]. It has also been 
shown that at low LETs unhardened flip-flops are more vulnerable than the clock network 
[Hans09]. However, at higher LETs (> 40 MeV-cm2/mg) the cross-section and hence the 
vulnerability of the clock network is greater than the cross-section of flip-flops in the 
design as shown in Fig. 4.1(a). The vulnerable cross-section changes when we consider 
hardened flip-flops (TMR) and the clock network has a cross-section that is more than two 
orders of magnitude greater than the TMR flip-flops as shown in Fig. 4.1(b). Thus robust 
radiation hardened clock distribution networks are crucial. The clock network has 
considerable overall capacitance, and a primary function of the network is to increase the 
overall drive through many stages, so that the large overall capacitance presented by the 
sequential circuits can be driven with good edge rates. Consequently, the hardening 
 
Fig. 4.1. (a) Clock, unhardened flip-flop (LMX) and logic cross-sections. (b) Clock, 
hardened (TMR) and logic cross-sections at different LETs after [Hans09]. 

approach must differ throughout the clock network. At tree nodes with relatively low 
capacitance, limiting the impact of collected charge is important. In contrast, clock meshes 
have very large drive and capacitance, making them more immune by their size. Thus, at 
the end of the network, where individual clocks are gated and control the sequential 
elements, errors must be detected, or mitigated by the receiving circuits themselves. 
4.3. Radiation Hardened Custom Clock Spine 
The RHBD custom clock spine described here is divided into two sections: the 
global clock distribution network, which carries the clock in an H-tree configuration 
throughout the spine itself, and the large number of local clocks that synchronize the IC 
logic. The latter can be further gated at the circuit level, in triple mode or dual mode 
redundant (TMR or DMR) logic circuits in the test die. The circuit approach to designing 
the global clock network is based on fault-tolerance.  
The global clock nodes have large capacitances and are driven by a large number 
of spatially distributed drivers. An SET on these global nodes cannot be detected but due 
to the large capacitance, their impact is negligible. The approach to designing the local 
clock drivers emphasizes fault detection, where an SET is detected on the clock or the 
enable signals and a corresponding error signal is produced to indicate a faulty clock to the 
subsequent non-redundant circuits, which can be dealt with locally. 
4.3.1. Global Clock Distribution  
Referring to Fig. 4.2, the clock signal from the PLL is buffered by 102 spatially 
dispersed drivers, over 10 µm apart, that drive a common node E5Gclk, which is the 5th 
stage early global clock. Any SET at this early stage of the clock spine will affect all clocks. 
The multiple drivers ensure that an SET at any one of these drivers perturbs less than 1% 
of the drive of the E5Gclk node, which bounds the resulting jitter. This is demonstrated in Fig. 4.3, where an 
 
 
Fig. 4.2. The clock distribution network. Note that the signals E5Gclk, E4Gclk, E3Gclk, 
E2Gclk and E1Gclk are large nodes that are shorted throughout the chip and 
hence have large node capacitances; 23 wires are used to equally distribute the 
E5Gclk signal throughout the clock spine to reduce node delay. 

SET at the EGclk global clock node produces negligible phase noise (less than expected 
from power supply noise). Fig. 4.4 shows the effect of an SET on one of the driver nodes 
that buffer the clock, resulting in less than 1 ps of jitter, which is less than that due to the 
PLL. Thus, high capacitance top level clocks have sufficient drive and capacitance to be 
essentially SET immune if driven by a large number of spatially distributed gates. 
In the test chips, the clock begins at an unhardened PLL (foundry supplied) that is 
shielded during testing. The primary spine input, the E5Gclk, is buffered up by spatially 
distributed inverters driven in the clock spine by 23 wires that distribute and buffer the PLL 
clock output to reduce clock skew due to wire RC in different sections of the spine.  
 
Fig. 4.3. Radiation induced jitter on the EGclk (a global clock node). A simulated SET with 
charge equivalent to LET of 30 MeV-cm2/mg produces clock jitter of less than 1 
ps. 

The E5Gclk is buffered to the local clock networks using 5 inversion stages, each 
connecting to a global clock node (E4Gclk, E3Gclk, E2Gclk, E1Gclk and EGclk). Each of 
these global nodes is driven from the previous stage by 38 spatially dispersed inverters (< 
3% jitter) that also provide sufficient drive fan-up to supply the large number of local clock 
networks. The interconnection of these stages is a mesh, which as mentioned, limits skew, 
while driving increasing capacitance in these un-gated levels. This large capacitance and 
drive makes them essentially immune to SET induced glitches. 
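
The "< 1%" and "< 3%" figures quoted for the 102-driver and 38-driver stages are consistent with a simple first-order reading: if N identical drivers share a node, an SET that disturbs one of them affects roughly 1/N of the total drive. The sketch below only checks that arithmetic. The assumption of identical, equally weighted drivers is ours.

```python
def single_driver_fraction(n_drivers):
    """Fraction of the total node drive contributed by one of n identical drivers."""
    return 1.0 / n_drivers


pll_to_e5gclk = single_driver_fraction(102)   # PLL-to-E5Gclk buffer stage
global_stage = single_driver_fraction(38)     # each later global clock stage

print(f"102 drivers: {pll_to_e5gclk:.2%} of the drive per driver")
print(f"38 drivers:  {global_stage:.2%} of the drive per driver")
```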
4.3.2. Unit Clock Driver Design 
The unit clock networks produce the local clocks for each of the IC logic sub blocks. 
The unit clocks can each be enabled or disabled as needed to conserve power by two 
enables Enclk and Engclk, the local and global enables respectively, as shown in Fig. 
4.5(a). The transparent latches are required to hold the enables through the clock high 
   
Fig. 4.4. An SET strike at the input of one of 102 inverters in the PLL-to-E5Gclk buffer 
produces a jitter of 0.63 ps at the input of the clock spine. The straight line shows 
E5Gclk node without the SET while the dashed line shows the E5Gclk affected 
due to the impact of the SET. There is 10 µm separation between inverters. 
(Plot axes: voltage (V) versus time (ns).) 

assertion. The latches also represent the primary SEU cross-section in the clock spine. 
Additionally due to their smaller drive and capacitance, the enable signals are vulnerable 
to SETs. When driving TMR logic, a clock can be asserted or fail to assert without causing 
a logic error since the TMR logic is self-correcting.  
When driving non-redundant circuits, e.g., memories, the unit clocks and drivers 
have checking circuits that detect when an inadvertent clock edge occurs, whether due to 
an SET or SEU in the controlling latches. These produce a Clock_hit signal to indicate 
such a clock error event. The checker compares the unit clock signal by XORing it with a 
 
Fig. 4.5. The unit (local) clock network that produces the individual clock signals from the 
EGclk (a). Schematic of the unit clock checker circuit (b). 

redundant copy of its corresponding enable (from one of the redundant control logic copies) 
to determine if the unit clock was correctly asserted. The design of the checker circuit is 
shown in Fig. 4.5(b). The memory circuits respond to an inadvertent clock or clock 
assertion failure as described in [Yao10], which also describes local checking in the cache 
circuits for phase and other timing errors. The unit clock networks and their corresponding 
checkers are spatially separated to reduce the probability of multiple node charge collection 
from upsetting two critical nodes in the same circuit at the same time. An SET on the copy 
of the enable signal would trigger a false Clock_hit signal when the original enable would 
 
Fig. 4.6. Simulation waveforms of SET hits at different parts of the local clock network 
shown on different clock cycles. Note that false clock hit signals are produced 
when the copy of the enable signal is hit. Since the enable and its copy are 
spatially separated, the probability of an SET affecting both nodes 
simultaneously is very small. Other SETs that affect at the local clock edge are 
correctly detected by the checker. 

not have been affected. Such false errors trigger an error response in the affected memory 
circuits, but this cross-section is relatively small. 
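
The checker behavior described above can be summarized in a small logic-level sketch. It is a simplification of Fig. 4.5(b) that ignores latching and timing: each checker XORs the observed unit clock assertion with the redundant enable copy, and the individual hits are ORed into a single Clock_hit. The function names and the scenario labels are hypothetical.

```python
def checker_hit(unit_clock_asserted, enable_copy):
    """Flag a mismatch between the observed clock assertion and the enable copy."""
    return unit_clock_asserted ^ enable_copy


def spine_clock_hit(checker_hits):
    """OR the individual checker outputs into one Clock_hit signal."""
    return int(any(checker_hits))


# (unit clock asserted, redundant enable copy) for one clock cycle.
cases = {
    "enabled, clock asserts": (1, 1),
    "gated, clock stays low": (0, 0),
    "SET/SEU asserts a gated clock": (1, 0),
    "enabled clock fails to assert": (0, 1),
}
hits = {name: checker_hit(clk, en) for name, (clk, en) in cases.items()}

for name, hit in hits.items():
    print(f"{name}: hit={hit}")
print("Clock_hit:", spine_clock_hit(hits.values()))
```

An upset of the enable copy alone, with the clock itself correct, produces the same mismatch as a genuine clock error, which is the false Clock_hit case noted in the text.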
The possible radiation induced errors in the unit clock network and the checker 
circuits are simulated using HSPICE in Fig. 4.6. Any SET on the enable or the local clock 
path at the falling edge of the clock is detected by the checker. The models of the SETs 
used to carry out the SPICE simulation are implemented in Verilog-A following the 
approaches in [Cha93], [Fulk07].  
4.3.3. Clock Spine Physical Design 
The test chip clock is generated by an unhardened, foundry supplied PLL. The PLL 
controls are TMR and voted at the periphery. The global nodes E5Gclk, E4Gclk, E3Gclk, 
 
Fig. 4.7. The layout of the complete clock spine with the PLL to E5Gclk buffers. The global 
clock nodes (E5Gclk, E4Gclk, E3Gclk, E2Gclk and E1Gclk) are laid gridded to 
minimize skew. 

E2Gclk, E1Gclk and EGclk have 2.76 pF, 2.98 pF, 3.83 pF, 5.92 pF, 7.6 pF and 12.69 pF 
of total capacitance, respectively. These nodes are laid out as a grid inside the spine as 
shown in the right side of Fig. 4.7. The 152 primary outputs of the clock spine are signals 
Clk<151:0> that control the clocks with the individual enables. Additionally, copy clocks 
Clkx<151:0> are also generated internal to the local clock distribution network and are 
used in the clock spine checkers to validate the original copy of the clock. Their enables 
are generated by redundant logic. The local clock nodes (Gclks, local Clks) have relatively 
smaller capacitances than the global nodes (E5Gclk, E4Gclk, E3Gclk, E2Gclk, E1Gclk and 
EGclk). The single stage Gclk node has about 34.17 fF while the final clock out from the 
spine has about 48.11 fF. These smaller capacitances necessitate the clock error checking 
circuitry discussed earlier. Systematic skew at the Gclks is less than 1 ps. 
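
To illustrate why the capacitance difference matters, the sketch below estimates the voltage step ΔV = Q/C that a given collected charge would cause on each node if nothing restored it. It uses the standard silicon conversion of about 10.36 fC/µm per MeV-cm2/mg. The 1 µm collection depth is an assumption, and ignoring the restoring current of the drivers makes the result an upper bound rather than a simulation of the circuit.

```python
FC_PER_UM_PER_LET = 10.36  # fC/um per MeV-cm2/mg in silicon (standard conversion)

# Node capacitances quoted in the text, in fF. The key labels are descriptive only.
NODE_CAP_FF = {
    "E5Gclk": 2760.0,
    "E4Gclk": 2980.0,
    "E3Gclk": 3830.0,
    "E2Gclk": 5920.0,
    "E1Gclk": 7600.0,
    "EGclk": 12690.0,
    "Gclk (single stage)": 34.17,
    "Clk (spine output)": 48.11,
}


def deposited_charge_fc(let, depth_um):
    """Charge deposited along depth_um of track for a given LET, in fC."""
    return let * FC_PER_UM_PER_LET * depth_um


def undriven_voltage_step(let, depth_um, cap_ff):
    """Upper-bound voltage step Q/C in volts (fC / fF = V), ignoring the drivers."""
    return deposited_charge_fc(let, depth_um) / cap_ff


LET = 30.0      # MeV-cm2/mg, as in the Fig. 4.3 simulation
DEPTH_UM = 1.0  # assumed collection depth

steps = {
    name: undriven_voltage_step(LET, DEPTH_UM, cap)
    for name, cap in NODE_CAP_FF.items()
}

for name, dv in steps.items():
    print(f"{name}: {dv:.3f} V")
```

Under these assumptions the global nodes move by tens of millivolts at most, while the same charge on a local node exceeds the supply voltage, which is consistent with relying on checkers at the local level.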
All NMOS transistors in the clock spine use annular layout to mitigate total ionizing 
dose induced standby leakage power increases. P type guard rings are also used to isolate 
diffusions and the N wells for single event latchup mitigation. Since most of the devices 
 
Fig. 4.8. Die microphotograph with the clock spine layout overlaid. 

are large, and the spine has considerable white space to allow driver separation, there is no 
added area cost. Empty space is filled with decoupling capacitance, which reduces the 
power supply noise and its impact on jitter. 
4.3.4. Experimental Verification 
4.3.4.1. Test Chip Design 
The RHBD clock spine was designed and fabricated on both the IBM trusted 
foundry standard and low standby power 90 nm processes. It was packaged using wire bonding 
in a ceramic pin grid array. All the logic circuits including the I/O are implemented as TMR 
circuits that self-correct by voting except the standard foundry provided non-rad-hard PLL 
used to generate the clock. The PLL was shielded from the radiation particles during 
testing. The spine was clocked to a maximum of 1 GHz with VDD = 1.2 V. The die 
microphotograph and the test die floor plan are shown in Fig. 4.8. 
 
Fig. 4.9. Clock spine driving different kinds of logic circuitry in the test chip. 

The test chips contain three types of logic circuits (see Fig. 4.9). First, the test 
engine uses fine grained TMR logic [Hind09]. Each of the TMR logic copies is clocked by 
a separate clock, so if one clock incorrectly asserts, or fails to assert, the logic operation is 
unaffected. Secondly, DMR logic is protected by local error detection, which allows 
operations to restart. In this logic, if a clock is incorrectly asserted and affects the operation 
result, the resulting state upset is cleaned up and the operation restarted [Clark11].  
Third, most clocks drive cache memory banks. The aforementioned clock checking 
determines if clocks are inadvertently asserted or fail to assert, which would impact the 
cache operation. In the cache, there is also extensive error checking that protects from short 
clocks and glitched clocks, as well as other periphery SET errors [Yao10].  
4.3.4.2. Single Event Testing and Results 
The device under test (DUT) is mounted on a daughter card that is controlled by an 
FPGA board, which manages the testing. A divided clock output from the test chip 
provides test visibility to the internal clock and allows us to validate adequate jitter 
performance. During each beam test, the programmable test engine runs different tests on 
the chip while all logic state, including the clock checking circuits, is monitored. The tests 
can use the foundry supplied (unhardened) PLL or bypass mode, which uses the reference 
clock input directly to the buffer supplying E5Gclk. 
Broad beam testing used topside incident ions and protons using de-lidded packages 
as shown in Fig. 4.10(a). Test die (using the low standby power, lower transistor drive 
current process) were exposed to heavy ion beams at the cyclotron at Texas A&M 
University in air at room temperature using N, Ne, Cu, Ar, Kr and Au ions with 15 MeV/u. 
Beam fluences from 5×10^5 to 2×10^7 particles/cm2 were used. Beam angles from 0° to 79° 
(0° being the normal incidence) were used for low mass ions, limited to 65° for high mass 
ions, based on the range  calculated by the controlling Seuss software [Horv11]. The correct 
die topside overburden was used based on the foundry supplied metal stack and polyimide 
thicknesses. The beam was incident on the die front (metallization) side. The effective LET 
ranged from 1.4 to 219.8 MeV-cm2/mg. Table 4.1 shows the effective LET and range of 
the ions in silicon for all the angles tested. For high speed I/O, the daughter board and the 
FPGA were mounted one below the other as shown in Fig. 4.10(a). 
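As a rough cross-check on the tilt angles, the effective LET of a thin sensitive layer is commonly approximated by the cosine law, LETEFF ≈ LET(0°)/cos θ. The sketch below applies that approximation to the normal-incidence neon entry of Table 4.1. The function name is illustrative and not part of the test software; the tabulated values were produced by the facility software with the die overburden included, so they need not match the simple law exactly.

```python
import math


def effective_let(let_normal, angle_deg):
    """Cosine-law estimate of the effective LET (MeV-cm2/mg) at a tilt angle."""
    return let_normal / math.cos(math.radians(angle_deg))


# Neon at normal incidence is listed as 2.9 MeV-cm2/mg in Table 4.1.
neon_53 = effective_let(2.9, 53.0)
```

For neon at 53° the estimate lands within a few hundredths of the tabulated 4.8 MeV-cm²/mg.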
While over 50,000 errors were detected in the memories, only one clock spine error 
was reported by the DUT during testing. This occurred using Au ions at an angle of 53°, 
i.e., at an LETEFF = 152 MeV-cm²/mg. In the bypass mode, no clock spine errors were found 
in the proton tests. However, errors due to clock SETs at the clock root (prior to E5GClk) 
were detected when running with the (unhardened) PLL on and unshielded, leading us to 
conclude that the PLL was susceptible to upsets. 

TABLE 4.1 EFFECTIVE LET AND RANGE OF HEAVY IONS IN SILICON FOR TESTED ANGLES 
(LETEFF in MeV-cm²/mg, range in µm; a dash marks a cell with no entry)

Angle |     Neon     |   Nitrogen   |   Krypton    |     Gold     |    Copper    |    Argon
      | LETEFF Range | LETEFF Range | LETEFF Range | LETEFF Range | LETEFF Range | LETEFF Range
0°    |   2.9  238.1 |   1.4  350.3 |  31.3   92.8 |  90.4   77.6 |  22.5   91.4 |   9.2  151.5
30°   |   3.3  212.8 |    -      -  |    -      -  |    -      -  |    -      -  |    -      -
42°   |   3.9  177.2 |   1.9  260.6 |  42.7   66.3 | 122.5   54.5 |  30.3   68.0 |  12.5  110.0
53°   |   4.8  144.4 |   2.3  211.9 |  53.3   51.8 | 152.0   42.3 |  38.1   52.0 |  15.6   87.4
65°   |   7.1   93.4 |   3.3  146.7 |  78.3   32.8 | 219.8   26.0 |  55.8   33.6 |  23.0   56.5
72°   |  10.0   63.9 |   4.5  108.0 | 111.6   20.7 |    -      -  |  79.9   21.3 |  32.0   39.0
79°   |    -      -  |    -      -  |    -      -  |    -      -  | 147.8    8.5 |    -      -
The clock spine was also tested using 13.5 MeV/u and 49.3 MeV/u normal 
incidence proton beams at the 88-inch cyclotron at Lawrence Berkeley National Laboratory. 
Most tests used a fluence of 5×10¹¹ protons/cm², and the total fluence over all tests exceeded 
1.75×10¹³ protons/cm². Flux ranged from 2.3×10⁸ to 8.9×10⁸ protons/cm²/s. These die were 
polyimide-free. The FPGA test board was separated from the beam using 3-foot cables (see 
Fig. 4.10(b)). The cables and the inability to shield the PLL from the protons limited the 
clock frequency, primarily by requiring PLL bypass, to 100 MHz. Power supply voltages of 1, 
1.2 and 1.4 V were used. No clock spine errors were detected in these tests, although many 
PLL errors were detected in the tests that used the PLL.  
   
Fig. 4.10. (a) The heavy ion beam test board at the Texas A&M University cyclotron. (b) 
Proton test setup at the Lawrence Berkeley Labs cyclotron. 
 
4.4. ASIC Clock Trees for RHBD Applications 
As the clock network feeds a large number of flip-flops, a single clock transient can 
impact multiple sections of the chip. This rippling effect was observed in the study of clock 
and reset transients on an RHBD Tilera processor, where a reset driver SET caused 100 
flip-flops to clear to zero [Holm09]. This work also identified the clock and reset trees 
(synthesized distributed H trees in this case) as the most crucial components to 
harden in RHBD designs. 
The clock distribution network and sequential circuits also generally consume 
about 40% of the overall chip’s power budget [Magen04] (32% in the Alpha 21264 
[Gowan98]). Thus, optimal design of the clock distribution is critical and can have a 
significant impact on the power and performance of the chip. Clock balancing and tuning 
are usually done after the other logic has been synthesized and the complete timing 
requirements determined. Clock trees need to be designed to minimize clock skew and 
wirelength, while providing sufficient drive to all the clock sinks in the design. Hence many 
distribution strategies for performing automated clock tree synthesis have been developed.  
4.4.1. ASIC Clock Tree Synthesis 
Clock tree synthesis (CTS) generates the buffered clock signals to drive all the 
sequential elements and registers in the design with minimal localized skew, rise-fall slopes 
and insertion delay along the clock path. Clock tree synthesis determines the optimal tree 
depth, branches, fanout, and the physical design and routing of the clock signal, subject to 
the register placement and other routing constraints. CTS is one of the most crucial steps 
in a design and determines the overall robustness of the design.  
 
Two major considerations impact the quality of the clock tree: the overall structure 
of the tree and the physical placement of the cells and the interconnecting routes. The tree 
structure determines the optimal length for placement of buffers and the fanout of the 
branches. The physical placement of the cells is used for RC delay balancing of the wires 
for minimal skew.  
As with the clock spine, the synthesized clock network is also divided into the 
global clock distribution network (also called predriver network) and the local clock 
distribution network (final stages). The global distribution network is usually un-gated and 
has an activity factor of 1. The local trees can be gated at different depths along the tree 
based on their functionality and physical proximity. 
In general, clock distribution networks are inspired by three commercial 
microprocessor clock network topologies: grids like the DEC 21264 [Gies97], trees like 
the IBM S/390 [Webb97, Rest98], and length matched serpentines (spines) like the Intel 
P6 [Gean98].  

Fig. 4.11. SEU hardened inverter (a) schematic and (b) layout using resistive hardening 
and well based isolation after [Baze00]. 
4.4.2. Radiation hardened ASIC clock trees  
Analysis of clock networks is critical as a clock SET creates a timing edge that 
clocks all the sequential elements that the particular clock node fans out to. This could 
cause a flip-flop to miss a correct data transition or sample the wrong state, 
resulting in a wrong machine state. Circuit design techniques such as building clock trees 
using hardened inverter designs with well based separation for resistive hardening, as 
proposed in [Baze00] and shown in Fig. 4.11, could be used. However, this approach has at 
least a 3X area penalty and 5X increased power consumption.  
Designing robust clocks is dependent on the clock scheme chosen, the sequential 
element circuits and their sensitivity to clock upsets. Radiation hardened ASIC designs 
usually require multiple clock trees for redundancy, which increases area, power and the 
complexity of designing such trees with closely specified timing margins. Radiation 
hardened design techniques such as TMR or temporal redundancy based schemes have 
very different timing and physical design requirements [Baze02]. 
In the following section, we discuss the design and characteristics of a radiation 
hardened clock tree for an integrated TMR and DMR based microprocessor (HERMES2) 
using commercial ASIC design tools and study the characteristics of the clock trees 
generated. 
 
4.4.3. HERMES2 Clock Trees 
The radiation hardened microprocessor HERMES2 has both TMR and DMR design 
blocks that provide SEE detection and correction. HERMES2 has three clock trees, called 
ClkA, ClkB and ClkC spanning from the root clock pins PLMasterGClkA, 
PLMasterGClkB and PLMasterGClkC. While all three trees clock the TMR master-slave 
flops in the design, the ClkA and ClkB trees also fan out to clock the DMR logic. Thus the 
trees ClkA and ClkB are about 2x bigger than the ClkC tree. Fig. 4.12 shows the concept 
of the three trees designed. The different tree levels and the number of nodes denoted are 
from the final analysis of the trees generated by the CTS. 
 
Fig. 4.12. TMR clock trees in HERMES and their span.
[Figure labels: root pin PLMasterGClkA drives the ClkA tree (26 levels, 2250 nodes, 
1832 DFFs + 1142 latches); PLMasterGClkB drives the ClkB tree (23 levels, 2204 nodes, 
1823 DFFs + 1134 latches); PLMasterGClkC drives the ClkC tree (24 levels, 774 nodes, 
38 latches). All three trees clock the TMR FFs (3984).]
 
4.4.4. HERMES2 Physical Design Methodology for Spatial Redundancy 
As all redundancy schemes (TMR, DMR) for radiation hardening depend on 
the particle affecting only one of the copies at any given time, any SET that spans 
multiple copies thwarts all such schemes. Consequently, the redundant copies of the same 
logic that are used for detecting and correcting such particle impacts should be, and in this 
design are, spatially separated to reduce the probability of multi-bit or multi-node upsets. 
The key to designing a redundancy based (TMR or DMR) microprocessor such as 
HERMES2 is achieving this required spatial separation between the redundant copies of 
both the clock and logic during APR and physical design by guided placement.  
HERMES2 was synthesized using RTL Compiler and auto-placed and routed in 
SoC Encounter. TMR self-correcting master-slave flip-flops used in HERMES2 were 
custom designed as discussed in chapter 3 and laid out in a vertical (row-multiple) 
multi-bit implementation with 4 flip-flops sharing the same clock. This provides a spatial 
separation of at least 3 standard cell rows (~7.84 µm) between any two redundant copies 
of the TMR flip-flop. Spatial separation between redundant combinational logic during cell 
placement is accomplished by a physical design flow that keeps the cells constituting the 
redundant copies in separate regions via fences that impose a hard placement constraint. 
The TMR fences (A/B and C) are serpentine shaped to maintain separation while achieving 
good placement density. The A and B regions of the TMR can also span out to 
accommodate each of the DMR logic portions in one contiguous shape. The multi-bit 
flip-flop macros span across the three fences. Restricted placement of the flip-flops ensures 
that they are placed in the correct regions and do not have any offsets with respect to the 
fence boundaries. The physical span of the three HERMES2 clock trees at the chip top level 
is shown in Fig. 4.13. 

Fig. 4.13. Clock trees (a) ClkA (b) ClkB (c) ClkC in HERMES2. 
[Figure labels: TMR region, DMR A region, DMR B region.]
 
4.4.4.1. HERMES2 TMR clock tree synthesis 
The CTS for HERMES2 is a modified version of the CTS of the standard ASIC 
design flows. Redundant clock trees ClkA, ClkB and ClkC are synthesized with the 
placements of the TMR flip-flops spanning over the fences fixed. The clock trees were 
specified to minimize localized skew between any two subsequent registers. The three trees 
were given identical constraints to minimize intra-tree and inter-tree skew. Standard CTS 
algorithms used by SoC Encounter were used to synthesize the trees in multiple iterations, 
each done to minimize localized skew based on static timing analysis. The fences guarantee 
spatial separation and hence the three trees are forced via specialized CAD flows to be 
placed inside their respective fences. To accomplish this during each iteration, CAD 
flows relocate the clock buffer cells placed by the CTS to their respective fences. This 
could potentially degrade timing and hence further rounds of optimization are performed. 
Extra buffers added during these optimizations are again relocated back to their fences. 
This process is repeated until successive optimizations no longer improve timing 
and no cells are outside their assigned fences.  
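The relocate-and-reoptimize loop described above can be sketched as follows. This is a minimal illustration, not the actual SoC Encounter flow: the cell records, the `optimize` and `measure_skew` callables and the toy tool model are all hypothetical stand-ins for the CTS optimization and static timing analysis steps.

```python
def fence_constrained_cts(cells, optimize, measure_skew, max_rounds=20):
    """Iterate until timing stops improving and every cell is in its fence.

    cells: dict name -> {"tree": "A"|"B"|"C", "region": "A"|"B"|"C"}
    optimize(cells): hypothetical timing-fix step; may add stray buffers.
    measure_skew(cells): hypothetical timing query returning skew in ps.
    Returns the number of rounds that were run.
    """
    best_skew = measure_skew(cells)
    for round_no in range(1, max_rounds + 1):
        # Relocate every clock buffer back into the fence of its own tree.
        for cell in cells.values():
            cell["region"] = cell["tree"]
        # Re-optimize, since relocation may have degraded timing.
        optimize(cells)
        skew = measure_skew(cells)
        stray = any(c["region"] != c["tree"] for c in cells.values())
        if skew >= best_skew and not stray:
            return round_no  # no further improvement and all cells legal
        best_skew = min(best_skew, skew)
    return max_rounds


def make_toy_tool():
    """Toy tool model: the first two optimizations each add a stray buffer."""
    state = {"calls": 0, "skew": 40.0}

    def optimize(cells):
        state["calls"] += 1
        if state["calls"] <= 2:
            # New ClkA buffer dropped into region B, outside its fence.
            cells["fix%d" % state["calls"]] = {"tree": "A", "region": "B"}
            state["skew"] -= 10.0

    def measure_skew(cells):
        return state["skew"]

    return optimize, measure_skew


cells = {"buf0": {"tree": "A", "region": "B"}}
optimize, measure_skew = make_toy_tool()
rounds = fence_constrained_cts(cells, optimize, measure_skew)
```

In the toy run the loop stops in the third round, the first one in which the optimizer neither improves skew nor leaves a buffer outside its fence.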
TABLE 4.2 OVERVIEW OF THE CLOCK TREES IN HERMES2 

                                        ClkA         ClkB         ClkC
Levels                                  26           23           24
Total # Nodes (also # of cells)         2250         2204         774
Total # sequential elements             6958         6941         4022
4 BIT TMR FFs                           962          962          962
Crossover Clocks (to TMR 4bit flops)    151 (ClkC)   151 (ClkA)   151 (ClkB)
1 BIT TMR FFs                           136          136          136
Total # DFFs                            1832         1823         0
Total # Latches                         1142         1134         38
Depth of First FF                       16           15           17
Depth of Last FF                        26           23           24
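The totals in Table 4.2 can be cross-checked with a few lines of arithmetic. The sketch below assumes that each 4-bit TMR flip-flop group contributes four clock sinks per tree, an interpretation that is not stated explicitly but reproduces the tabulated totals; the dictionary layout and names are illustrative.

```python
# Per-tree counts copied from Table 4.2.
trees = {
    "ClkA": {"tmr4": 962, "tmr1": 136, "dffs": 1832, "latches": 1142},
    "ClkB": {"tmr4": 962, "tmr1": 136, "dffs": 1823, "latches": 1134},
    "ClkC": {"tmr4": 962, "tmr1": 136, "dffs": 0, "latches": 38},
}


def sink_count(tree):
    """Clock sinks of one tree, assuming 4 sinks per 4-bit TMR group."""
    return 4 * tree["tmr4"] + tree["tmr1"] + tree["dffs"] + tree["latches"]


totals = {name: sink_count(tree) for name, tree in trees.items()}
tmr_ffs = 4 * trees["ClkA"]["tmr4"] + trees["ClkA"]["tmr1"]
grand_total = sum(totals.values())
```

Under that assumption the per-tree totals come out as 6958, 6941 and 4022, the TMR flip-flop count as 3984, and the sum over the three trees as 17921.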
 
 
4.4.5. HERMES2 Clock Tree Characteristics 
Table 4.2 gives an overview of the three clock trees in HERMES2, as was 
shown graphically in Fig. 4.12. The salient characteristics of each clock tree 
are analyzed in detail in the following subsections.  
An additional feature of interest is that some clocks from one clock tree (e.g., ClkA) 
control TMR flip-flops that are associated with another region (B or C). This is a design 
requirement that ensures the cross-checking of different logic copies in 
HERMES2 and is present in the TMR-DMR cross checkers. This is represented by the 
row Crossover Clocks (to TMR 4bit flops) in Table 4.2. 
4.4.5.1. Comparison of the Three TMR Tree structures 
As shown previously, ClkA and ClkB are the larger clock trees in HERMES2 as 
they fan out to all TMR flip-flops in the TMR regions as well as the non-redundant 
master-slave flip-flops (called DFFs in Table 4.2) and latches in the DMR regions as shown 
in Fig. 4.12. Fig. 4.14 compares the three trees by levels. The ClkA tree has the largest tree 
depth at 26 levels. Most sequential elements are at a depth of about 20 levels from the 
clock source (pins PLMasterGClkA/B/C). The total spread of the flip-flops is large, with 
the first flip-flop from the clock source found at the 16th level and the last at the 26th 
level in ClkA. The three clock trees together contain a total of 5228 nodes (A: 2250, B: 
2204 and C: 774). The nodes are distributed roughly evenly, with the largest number of 
nodes in the 18th level of the ClkA and ClkB trees and the 19th level of ClkC. 

Fig. 4.14. Distribution of clock nodes in each level of the HERMES2 clock trees. 
[Plot: number of nodes (0 to 450) versus tree level (0 to 30) for ClkA, ClkB and ClkC.]
The initial levels of all three trees have a fanout of 1 (the first seven levels for ClkA 
and ClkB, and the first 10 levels for ClkC). This is a consequence of the design decision to 
place the source clock pins of the three trees at the west boundary of HERMES2, where they 
would be adjacent to the XOR clock source. This ensures that all the trees generated are 
deskewed to the west boundary of the chip, thus avoiding routes over HERMES2 at the top 
level when the clock generator is placed. HERMES2 has a total of 17921 sequential clock 
sinks (the sum of the per-tree totals in Table 4.2, so each TMR flip-flop is counted once 
per tree), comprising 3984 TMR flip-flops, 2314 latches and 3655 non-redundant D-FFs that 
reside in the DMR logic. Fig. 4.15 shows the distribution of the TMR flip-flops and all 
sequential elements at different levels. It is interesting to note that while trees ClkB and 
ClkC have very different numbers of clock sinks, they have almost the same number of 
levels. This could be due to the three trees having been given identical timing constraints 
during CTS, resulting in the ClkC tree being buffered with more levels than necessary. 
All three trees have the most clock sinks (sequentials) at level 20. The first sequential 
element in the ClkA tree is at level 16, while in the ClkB tree it is at level 15 and in ClkC 
at level 17. ClkA drives a total of 962 4-bit TMR flip-flop groups, of which 811 are in the 
A-domain and 151 are associated with the C-domain (present in the crossover checkers). 
Similarly, ClkB drives 811 B-domain and 151 A-domain groups, and ClkC drives 811 C-domain 
and 151 B-domain TMR flip-flop groups, as discussed.  

Fig. 4.15. Distribution of (a) all sequential elements and (b) TMR FFs in each level of the 
trees in HERMES2. 
[Plots: (a) total sequential elements (0 to 2500) and (b) total number of TMR FFs (0 to 1600) 
versus tree level (0 to 30) for ClkA, ClkB and ClkC.]
4.4.5.2. Tree Insertion Delays 
The fully routed and extracted clock trees were simulated in HSPICE. The insertion 
delay from the clock source (PLMasterGClkA, PLMasterGClkB or PLMasterGClkC) to 
all the registers in the design was analyzed in detail as shown in the PDFs in Fig. 4.16. The 
ClkA and ClkB trees have greater mean insertion delays (1.255 ns and 1.242 ns respectively) 
than ClkC (1.204 ns). The standard deviation (σ) of the insertion delay, which corresponds to 
the intra-tree skew between registers, is the largest for the ClkA tree (25 ps) and the smallest 
for the ClkC tree (16 ps). This was expected given that it is easier to balance a small tree. 
This also shows that the block level analysis may be questionable given that it is quite 
challenging. The worst case intra-tree clock skew is in the ClkA tree, with the fastest clock 
arriving at 1.205 ns and the slowest clock arriving at 1.339 ns, resulting in a worst case skew 
of about 134 ps.  

Fig. 4.16. PDF of the rising insertion delay from the clock source to all flip-flops in the 
design for (a) PLMasterGClkA, (b) PLMasterGClkB and (c) PLMasterGClkC tree. 
[Panel statistics: (a) ClkA: µ = 1.255 ns, σ = 25 ps, max = 1.339 ns, min = 1.205 ns; 
(b) ClkB: µ = 1.242 ns, σ = 21 ps, max = 1.312 ns, min = 1.183 ns; 
(c) ClkC: µ = 1.204 ns, σ = 16 ps, max = 1.246 ns, min = 1.158 ns.]
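The worst case intra-tree skew is simply the spread between the slowest and the fastest clock arrival within a tree. A minimal sketch using the minimum and maximum insertion delays reported in Fig. 4.16, expressed in integer picoseconds (the names are illustrative):

```python
# (min, max) rising-edge insertion delay per tree, in ps, from Fig. 4.16.
insertion_delay_ps = {
    "ClkA": (1205, 1339),
    "ClkB": (1183, 1312),
    "ClkC": (1158, 1246),
}


def intra_tree_skew(min_delay, max_delay):
    """Worst case skew between any two sinks of the same tree."""
    return max_delay - min_delay


skew_ps = {name: intra_tree_skew(lo, hi)
           for name, (lo, hi) in insertion_delay_ps.items()}
worst_tree = max(skew_ps, key=skew_ps.get)
```

This gives 134 ps for ClkA, 129 ps for ClkB and 88 ps for ClkC, so the smallest tree is also the best balanced one.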
The tool however focuses more on minimizing the localized skew between the 
clocks of the register sending the data and the register receiving it. One of the major reasons 
for the differences seen between the trees is that each tree is designed independently of 
the others, but with the same design constraints. Thus the given constraints force Encounter 
to build similar trees, but the tool is not forced to minimize the skew between the three trees. 
4.4.5.3. Inter-Tree Clock Skew 
Inter-tree clock skew is determined by measuring the skew between the clock trees 
at a particular TMR flip-flop. Since the correction of the data due to the majority voter 
happens after the temporally second clock’s falling edge arrives at each TMR flop, the 
worst case inter-tree clock skew is determined as  
$$\Delta_{inter} = \max\left(\left|T_{dClkA} - T_{dClkB}\right|,\ \left|T_{dClkB} - T_{dClkC}\right|,\ \left|T_{dClkA} - T_{dClkC}\right|\right) \qquad (1)$$
where TdClkA, TdClkB and TdClkC are the tree insertion delays to a particular TMR flip-flop. 
The PDF of inter-tree clock skews is shown in Fig. 4.17.  

Fig. 4.17. PDF of the inter-tree clock skew to all flip-flops in HERMES. 
[Plot: the x-axis is the worst case skew, from 0 to 150 ps.]
The largest inter-tree skew is 140.4 ps and the smallest is 3.9 ps. This does not affect 
the functionality of the system or the timing of each individual pipeline. However, in the 
event of an SET striking one copy, the FF will take at most (140.4 ps + TClk2Q + TMAJ) to 
arrive at the final output, where TClk2Q and TMAJ are the Clk-to-Q delay of the FF and the 
propagation delay of the majority gate, respectively. 
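A minimal sketch of the inter-tree skew computation at a single TMR flip-flop, taking the skew as the largest pairwise difference between the three insertion delays. The example insertion delays and the Clk-to-Q and majority gate delays are hypothetical placeholders; only the 140.4 ps worst case skew comes from the analysis above.

```python
from itertools import combinations


def inter_tree_skew(td_clk_a, td_clk_b, td_clk_c):
    """Largest pairwise difference between the three tree insertion delays."""
    delays = (td_clk_a, td_clk_b, td_clk_c)
    return max(abs(x - y) for x, y in combinations(delays, 2))


def worst_case_correction_delay(skew, t_clk2q, t_maj):
    """Upper bound on the time for the voted output to settle after an SET."""
    return skew + t_clk2q + t_maj


# Hypothetical insertion delays (ps) of ClkA, ClkB and ClkC at one flip-flop.
example_skew_ps = inter_tree_skew(1255, 1242, 1204)

# 140.4 ps is the reported worst case skew; 100 ps and 50 ps are placeholders.
bound_ps = worst_case_correction_delay(140.4, 100.0, 50.0)
```

Taking the absolute value of each pair before the maximum keeps the result independent of which tree happens to be the slowest.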
4.4.5.4. Impact of SET vs Tree Level 
The impact of an SET on a particular node in the clock tree is studied by analyzing 
its fanout. Nodes at lower levels, i.e., nearer the root of the tree, fan out to a 
larger number of nodes, while nodes at higher levels of the tree fan out to only a subset of 
the sequential circuits. As mentioned, the initial levels of the trees have a fanout of 1 and 
any of them can be considered the root of the tree. Hence, a strike on any of these nodes 
affects all clock nodes (Fig. 4.18). A study of just the sequential elements affected by a 
strike on a particular node (Figs. 4.19(a), (b) and (c)) follows a trend similar to that shown 
in Fig. 4.18. The use of multi-bit cells also increases the impact of an SET, as the 
SET now affects all flops in the multi-bit group that share the same clock. 

Fig. 4.18. (a), (b), (c) Fan out of an SET analyzed by stage for the three trees 
PLMasterGClkA, PLMasterGClkB and PLMasterGClkC. 
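The fanout-based impact can be computed by walking the subtree below the struck node and summing the sequential elements it clocks. The sketch below uses a small hypothetical tree (node names, topology and sink counts are made up for illustration); a 4-bit multi-bit cell is modeled as four flops hanging off a single clock node.

```python
def sinks_affected(children, sinks, node):
    """Count the sequential elements clocked through `node`."""
    total = sinks.get(node, 0)
    for child in children.get(node, []):
        total += sinks_affected(children, sinks, child)
    return total


# Hypothetical tree: the first two levels have a fanout of 1.
children = {
    "root": ["b0"],
    "b0": ["b1", "b2"],
    "b2": ["b3"],
}
# Flops clocked directly by each node; "b1" drives one 4-bit TMR group.
sinks = {"b1": 4, "b2": 1, "b3": 8}

impact = {n: sinks_affected(children, sinks, n)
          for n in ("root", "b0", "b1", "b2", "b3")}
```

Both fanout-of-1 nodes at the top reach every sink, which is why any of them can be treated as the root when assessing the impact of a strike.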
4.4.5.5. Node Capacitance vs Driver sizes 
A study of the buffer size used to drive every node shows how Encounter 
buffers nodes to achieve the required drive and timing performance. For this, the total 
capacitance of every node in the trees was calculated from the extracted SPICE files as 
$$C_{NODE} = \sum C_{WIRE} + \left(\sum W_{P\_LOAD} + \sum W_{N\_LOAD}\right) C_{GATE/\mu m} \qquad (2)$$
where CNODE is the total capacitance on the node, WP_LOAD is the total width of all load PMOS 
devices, WN_LOAD is the total width of all load NMOS devices and ∑CWIRE is the total wire 
load across all metal layers connected to that node, determined by reducing the parasitic RC 
networks in the extracted SPICE netlist. CGATE/µm for the foundry 90 nm technology is 1.5 
fF/µm. The output drive of the cells used to drive these nodes was normalized to the 
standard cell inverter for an even comparison. The resulting plot for tree ClkA is shown in 
Fig. 4.20(a) and for tree ClkB in Fig. 4.20(b). The driver sizes are quantized (due to the 
use of standard cells) and hence we see narrow bands on the plot. A custom clock spine 
designed with a fanout ratio of 4, 6, or 8 has a linear relationship with all spine nodes evenly 
buffered as indicated on Figs. 4.20(a) and (b). 

Fig. 4.19. (a), (b), (c) Fan out of an SET analyzed by stage for the three trees 
PLMasterGClkA, PLMasterGClkB and PLMasterGClkC. 

Fig. 4.20. Drive quality of the tree nodes in (a) ClkA and (b) ClkB in HERMES2. 
[Plots: capacitive load (0 to 200 fF) versus normalized driver width (0 to 13 µm), with 
regions of underdriven and overdriven nodes marked against the expected linear 
relationships of FO4, FO6 and FO8 custom clock spines.]
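The node capacitance of Equation (2) and its comparison against a fanout reference line can be sketched as below. The 1.5 fF/µm gate capacitance is the value quoted for the 90 nm process; the wire capacitances and device widths are hypothetical, and defining the effective fanout as node capacitance divided by the gate capacitance of the normalized driver width is an assumption about how the FO4, FO6 and FO8 reference lines are drawn.

```python
C_GATE_PER_UM = 1.5  # fF/um, quoted for the foundry 90 nm technology


def node_capacitance(wire_caps_ff, pmos_widths_um, nmos_widths_um):
    """Equation (2): wire capacitance plus gate capacitance of all loads (fF)."""
    load_width = sum(pmos_widths_um) + sum(nmos_widths_um)
    return sum(wire_caps_ff) + load_width * C_GATE_PER_UM


def effective_fanout(c_node_ff, driver_width_um):
    """Assumed metric: node load relative to the driver's own gate capacitance."""
    return c_node_ff / (C_GATE_PER_UM * driver_width_um)


# Hypothetical node: two wire segments and two load inverters.
c_node = node_capacitance([12.0, 6.0], [4.0, 4.0], [2.0, 2.0])
fanout = effective_fanout(c_node, 3.0)
```

With these made-up numbers the node carries 36 fF and sits on the FO8 line; a node well above the chosen reference line would count as underdriven and one well below it as overdriven.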
The more interesting case is the ClkC tree. The largest underdriven nodes 
in this tree have an almost linear relationship with the driver sizes at a fanout of 14, as can 
be seen in Fig. 4.21. We believe that since the ClkC tree drives fewer sequentials 
than the other trees, but under the same timing constraints, the CTS tool tries to 
fix timing by underdriving nodes to make them slower to match the other trees. However, 
with a sample size of just one tree, we are unable to conclusively confirm this. It is also 
possible that the placement constraints cause underdriven nodes and that these are not fixed 
because they are RC limited, among other reasons. 

Fig. 4.21. Drive quality of the tree nodes in ClkC. 
[Plot: capacitive load (0 to 200 fF) versus normalized driver width (0 to 10 µm), with 
regions of underdriven and overdriven nodes marked against the expected linear 
relationships of FO4, FO6, FO8 and FO14 custom clock spines.]

4.4.6. Clock Tree Risk Analysis 
In the previous sections, we have analyzed many salient features of the generated 
clock trees on HERMES2. However, to adequately understand and estimate the quality of 
the clock tree in terms of its SET hardness, we need to develop a methodology to 
empirically analyze the tree’s vulnerability. The vulnerability of clock trees to SETs can 
be determined by analyzing the risk posed to the IC by a particle strike on every node of 
the tree. This was studied for a particular implementation of an SRAM arbiter clock tree in 
[Chip12]; however, the analysis was not generalized nor were any design guidelines 
provided that are applicable to other clock networks or implementations. The goal of this 
section is to derive empirical equations to enable circuit designers to assess the 
vulnerabilities of clock trees, to compare competing design strategies and ideally design 
robust clock trees.  
4.4.6.1. Risk  
Mathematically, risk is defined as the product of the probability of failure and the expected 
loss in case of failure. In case of multiple events that may lead to failures, risk R is defined as 
$$R = \sum_{i=1}^{N} \left(P_i \, Cost_i\right) \qquad (3)$$
where N is the total number of events, Pi is the probability of the ith event occurring and 
Costi is the cost or impact in the case that the ith event occurs. For a clock tree with N nodes 
we can define the risk Rtree as  
$$R_{tree} = \sum_{i=1}^{N} \left(P_{SET_i} \, I_i\right) \qquad (4)$$
where PSETi is the probability of an SET on the ith node in the tree and Ii is the impact of the 
SET on that particular node. The probability of an SET on a given node i of the clock tree 
can be studied by examining the conditions under which a particle strike causes an SET of 
considerable magnitude. Thus, PSETi depends on three conditions: a particle strikes the 
given node, the strike causes an SET on that node and the SET width is large enough to 
propagate through the tree to affect the state of the machine. Mathematically, this can be 
expressed as  
$$P_{SET_i} = P_{strike\_i} \, P_{state\_i} \, P_{pulse\_i} \qquad (5)$$
where  
• Pstrike_i is the probability of a particle strike on a given node.  
• Pstate_i is the probability that the particle strike induces an SET at that node 
depending on its state. 
• Ppulse_i is the probability that the induced SET is large enough to propagate through 
the gates. 
• Ii is the impact of the SET on a particular node in terms of FFs affected. 
In the case of this analysis being applied to combinational logic, a fourth important 
probability term PTWi is needed to model the probability that the particle strike occurs in a 
particular timing window that could result in the glitch being captured by a sequential 
element. However, in the case of clock trees, the entire clock phase is vulnerable as a strike 
on any clock node can result in a new clock edge as discussed in section 4.2. Hence PTWi is 
taken as 1 and is therefore not carried in the model. 
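The risk model of Equations (4) and (5) reduces to a weighted sum over the tree nodes. A minimal sketch, in which every probability and impact value is a hypothetical placeholder chosen only to show the arithmetic:

```python
def set_probability(p_strike, p_state, p_pulse):
    """Equation (5): probability of a propagating SET on one node."""
    return p_strike * p_state * p_pulse


def tree_risk(nodes):
    """Equation (4): sum of SET probability times impact over all nodes."""
    return sum(
        set_probability(n["p_strike"], n["p_state"], n["p_pulse"]) * n["impact"]
        for n in nodes
    )


# Hypothetical two-node tree: a root buffer and a leaf buffer.
nodes = [
    {"p_strike": 0.002, "p_state": 0.5, "p_pulse": 0.5, "impact": 1000},
    {"p_strike": 0.001, "p_state": 0.5, "p_pulse": 0.2, "impact": 4},
]
risk = tree_risk(nodes)
```

In this made-up example the root contributes 0.5 and the leaf 0.0004 to the total, showing how a node with a large fanout dominates the risk even when its strike probability is comparable to that of a leaf.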
4.4.6.2. Empirical analysis of the impacting factors 
Pstrike_i is the probability that a particle would strike a given node and is dependent 
on the ionizing particle flux. This definition also includes particle strikes with sufficient 
energy (LET) near the node, which would still cause a transient upset at the node. 
Empirically, Pstrike_i depends on the cross section area of the node i. The error cross section 
of a node is given by the ratio of the number of upsets on a particular node to the particle 
fluence. Thus Pstrike_i also depends on the energy of the particle and the flux (or total 
fluence) of the particles. For example, from [Ming04], we can empirically derive the 
probability of a cosmic ray neutron strike on a particular node as  
$$P_{strike\_i} \propto A_i \int_{E_{n,min}}^{E_{n,max}} F_n(E_n)\, dE_n \qquad (6)$$
where Ai is the area of node i, En,max and En,min are the maximum and minimum neutron 
energies and Fn(En) is the altitude dependent neutron flux at energy En. Similarly, we 
extend the analysis to all particles by using the following generalizing equation 
$$P_{strike\_i} \propto A_i \, P_{strike}(F, E_{particle}) \qquad (7)$$
where Pstrike(F, Eparticle) is the probability of a given particle being oriented to strike a point 
in space with energy Eparticle under a given flux F. This parameter is technology and circuit 
independent and depends primarily on the flux of particles and their energies.  
Some of the charge induced at the node is collected by the drain of an off-state 
transistor and is dependent on the type of device and process used. Previous works by 
[Ming06], [Hazu00], [Shiv02] and [Kehl11] model this effective charge induced on the 
node as a negative exponential function e^(-q/Qs), where q is the total charge induced and Qs 
is the charge collection efficiency of the device, in fC. Thus the effective model of Pstrike_i 
is  
$$P_{strike\_i} = K_{particle}\, A_i\, P_{strike}(F, E_{particle})\, e^{-q/Q_s} \qquad (8)$$
where Kparticle is the normalized fitting constant for each particle. Similar works on this by 
[Hazu00] and [Gill05] also define Pstrike_i as penv (for neutron analysis) and Unode,path 
respectively. However, it is important to note that, while such works compute the SER of 
latches and memory cells, this work presents a similar analysis for clock networks 
in particular. 

TABLE 4.3 LOGIC COMBINATIONS FOR SET VULNERABILITY 

Node state   SET type   SET induced at output node
0            SETLOW     No
0            SETHIGH    Yes
1            SETLOW     Yes
1            SETHIGH    No
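Equation (8) can be evaluated directly once its factors are known. The sketch below only transcribes the formula; all argument values are hypothetical placeholders, and the function and parameter names are illustrative rather than taken from any of the cited models.

```python
import math


def p_strike(k_particle, area, p_env, q, q_s):
    """Equation (8): K_particle * A_i * P_strike(F, E) * exp(-q / Q_s).

    k_particle: fitting constant for the particle type
    area:       area A_i of node i
    p_env:      environment term P_strike(F, E_particle)
    q, q_s:     induced charge and charge collection efficiency (fC)
    """
    return k_particle * area * p_env * math.exp(-q / q_s)


# Hypothetical comparison of two nodes that differ only in area.
small_node = p_strike(1.0, 2.0, 0.5, 5.0, 10.0)
large_node = p_strike(1.0, 4.0, 0.5, 5.0, 10.0)
```

Because the model is linear in the node area, doubling the area doubles the strike probability, while a larger induced charge q lowers it through the exponential term.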
Pstate_i is the probability that a particle strike induces an SET depending on the logic 
state of the node. Since any node on the clock tree alternates between the two logic states 
1 and 0, a low going SET (SETLOW) on a node that is at logic 0 would not produce an SET, 
nor would a high going SET (SETHIGH) on a node at logic 1, as shown in Table 4.3. Thus 
Pstate_i for any node in the ungated clock tree is equal to ½. In the case of combinational 
logic circuits other than the clock tree, this probability would not be a constant but would 
vary based on the gate function and the logic states of other inputs as well. Unlike clock 
networks, combinational logic usually has a logical masking property depending on the gates 
in the path [Karn04]. For example, in the case of the AND gate that must be used in clock 
gaters for gated clocks, the Pstate_i at its clock input is ¼ as given by Table 4.4. Further 
cases of logical masking for combinational logic can be found in [Mass00], [Karn04] and 
[Berg13] and will not be discussed in further detail in this report. 

TABLE 4.4 LOGIC COMBINATIONS FOR SET VULNERABILITY OF GATED CLOCK NODES 

Node state (clock)   Enable   SET type   SET induced at output node
0                    0        SETLOW     No
0                    0        SETHIGH    No
0                    1        SETLOW     No
0                    1        SETHIGH    Yes
1                    0        SETLOW     No
1                    0        SETHIGH    No
1                    1        SETLOW     Yes
1                    1        SETHIGH    No
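The values of ½ and ¼ follow from enumerating the rows of Tables 4.3 and 4.4. The sketch below does this enumeration, assuming (as the tables implicitly do) that every combination of node state, enable and SET polarity is equally likely; the helper names are illustrative.

```python
from fractions import Fraction
from itertools import product

SET_TYPES = ("SETLOW", "SETHIGH")


def forced_value(set_type):
    """Logic value that the transient momentarily forces onto the node."""
    return 0 if set_type == "SETLOW" else 1


def p_state_ungated():
    """Table 4.3: an SET appears when the forced value differs from the state."""
    cases = list(product((0, 1), SET_TYPES))
    hits = sum(1 for state, s in cases if forced_value(s) != state)
    return Fraction(hits, len(cases))


def p_state_and_gated():
    """Table 4.4: strike on the clock input of an AND-based clock gate."""
    cases = list(product((0, 1), (0, 1), SET_TYPES))
    hits = sum(
        1 for clk, en, s in cases
        if (forced_value(s) & en) != (clk & en)  # AND output is disturbed
    )
    return Fraction(hits, len(cases))
```

The enable input masks the transient whenever it is low, which halves the vulnerability of the gated node relative to an ungated one.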
Ppulse_i is the probability that the SET induced is large enough to propagate through 
the tree to reach a receiving sequential element. This would require the SET pulse width 
tSET to be larger than the inertial delay (tID) of the cell [Gils08]. Typically a generic logic 
gate exhibits the following behavior when an input pulse of width Din is applied to its input 
[Wang08]: 
• Propagation with no attenuation, if Din ≥ 2τP 
• Propagation with attenuation, if 2τP ≥ Din ≥ τP 
• No propagation, if Din ≤ τP 
where τP is the propagation delay of the gate and is dependent primarily on its drive and 
capacitive load. Thus the output pulse width Dout of the logic gate is approximated by the 
model [Wang10] as  
$$D_{out} = \begin{cases} 0 & \text{if } D_{in} \le \tau_P \\ 2\left(D_{in} - \tau_P\right) & \text{if } \tau_P \le D_{in} \le 2\tau_P \\ D_{in} & \text{if } D_{in} \ge 2\tau_P \end{cases} \qquad (9)$$
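The three propagation regimes can be sketched as a piecewise function and chained along a path of gates. The attenuated branch is taken here as 2(Din − τP), the form that is continuous at both breakpoints; this choice, the function names and the example delays are assumptions of the sketch.

```python
def output_pulse_width(d_in, tau_p):
    """Pulse width at a gate output for an input pulse of width d_in."""
    if d_in <= tau_p:
        return 0.0                      # pulse is filtered out
    if d_in < 2.0 * tau_p:
        return 2.0 * (d_in - tau_p)     # pulse propagates attenuated
    return float(d_in)                  # pulse propagates unchanged


def propagate(d_in, gate_delays):
    """Pulse width after a chain of gates with the given propagation delays."""
    width = float(d_in)
    for tau_p in gate_delays:
        width = output_pulse_width(width, tau_p)
        if width == 0.0:
            break                       # nothing left to propagate
    return width


# Hypothetical path of two gates, each with a 100 ps propagation delay.
narrow = propagate(150.0, [100.0, 100.0])
wide = propagate(250.0, [100.0, 100.0])
```

In this example the 150 ps pulse is attenuated to 100 ps by the first gate and removed by the second, whereas the 250 ps pulse passes both gates unchanged.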
In many cases, the attenuated pulse width might be small enough for subsequent gates to 
completely terminate the propagating SET before they reach a sequential element. Thus 
we define tID_i as the largest pulse width that is mitigated by gate i in the
network. tID_i is directly proportional to the propagation delay of each gate. Underdriven
nodes in the tree have large propagation delays and hence mitigate SETs of larger width.
Since only SETs of duration larger than tID_i may propagate down the tree, we can 
approximate Ppulse_i as 
\[
P_{pulse\_i} = P[t_{SET} \ge t_{ID\_i}].
\tag{10}
\]
P[tSET ≥ tID_i] is the probability that the duration of the SET pulse width generated is 
greater than tID_i . This by definition can be computed as { 1- P[tSET < tID_i] }, where P[tSET 
< tID_i] is the probability of the SET width being less than tID_i. This is the cumulative 
distribution function of P[tSET] evaluated up to tID_i as 
\[
P[t_{SET} < t_{ID\_i}] = \int_{0}^{t_{ID\_i}} P[t_{SET}]\,dt .
\tag{11}
\]
P[tSET] is the probability that the particle induced SET (tSET) is a particular width t. 
Characterizing the P[tSET] is more challenging as there is little agreement on the 
range of SET pulse widths in literature. [Baze06] in the 130-nm technology node, 
concludes that the majority of SET widths do not exceed 500ps for minimum drive 
inverters, while [Bene06] at the same technology node shows SET widths exceeding 2.5 
ns at nominal VDD (exceeding 4.5ns at 1.1 V VDD). Using laser based experiments, [Nara06] 
estimates the largest SETs measured on the same technology to lie between 747.5 and 
812.5 ps. In this work, since we are primarily interested in the empirical trends to compare 
and contrast competing clock networks, we do not attempt to quantify the SET widths and 
instead focus more on their overall impact probability on the final analyses. The
induced pulse width tSET has been shown to increase linearly with the LET of
the impinging particle [Bene05], [Eaton04]. [Norm96] shows an exponential decrease
in the flux of particles with increasing energies for both proton and neutron spectra. Thus 
the probability of occurrence of large energy particles and therefore the linearly related tSET 
is taken to be an exponentially decreasing function. Empirically this implies that the 
probability of occurrence of an SET of a particular width decreases exponentially as the 
width of the SET increases. 
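
A minimal sketch of how (10) and (11) evaluate under this exponential assumption follows. It models P[tSET] as an exponential density with a mean SET width `mean_set_ps`; that parameter, its 200 ps value and the function name are assumptions made for illustration and are not values fitted in this work.

```python
import math


def p_pulse(t_id_ps: float, mean_set_ps: float) -> float:
    """P[t_SET >= t_ID] for an exponentially distributed SET width."""
    # Eq. (11): CDF of the exponential density evaluated up to t_ID.
    cdf = 1.0 - math.exp(-t_id_ps / mean_set_ps)
    # Eq. (10): probability that the SET is wide enough to propagate.
    return 1.0 - cdf


MEAN_SET_PS = 200.0  # hypothetical mean SET width in ps
for t_id in (0.0, 50.0, 160.0, 500.0):
    print(f"t_ID = {t_id:5.0f} ps -> P_pulse = {p_pulse(t_id, MEAN_SET_PS):.3f}")
```

The useful property for comparing clock networks is the trend: Ppulse_i falls monotonically as the inertial delay of the node increases.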
 
Ii is the impact of a particle strike on a particular node. For a clock tree, this value 
corresponds to the number of sequential elements the particular node affects. This was 
studied in detail for HERMES2 in section 4.4.5.4. Thus SETs at the root of the tree would 
have a larger impact on the tree than SETs at the tree leaves. 
Thus, combining equations (5), (8) and (10) we obtain the risk Rtree for the entire tree as 
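\[
R_{tree} = \sum_{i=1}^{N} \left\{ \left( K_{particle} A_i \, P_{strike\_i}(F, E_{particle}) \, e^{-q/Q_s} \right) \left( P_{state\_i} \right) \left( 1 - P[t_{SET} < t_{ID\_i}] \right) I_i \right\}
\tag{12}
\]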
which for a given technology, flux and LET can be rewritten as  
\[
R_{tree} = \sum_{i=1}^{N} \left\{ \left( K_{strike\_i} A_i \right) \left( P_{state\_i} \right) \left( 1 - \int_{0}^{t_{ID\_i}} P[t_{SET}]\,dt \right) I_i \right\},
\tag{13}
\]
where Kstrike_i is assumed as a fitted parameter to account for the technology, flux, and 
particle LETs. In the subsequent sub-sections, we analyze different types of clock trees 
using (13) and try to estimate the effect of design decisions for different types of clock trees.
4.4.6.3. Techniques to reduce Rtree 
Analyzing the various parameters of equation (13), in this section we determine
various techniques for designing robust clock trees by reducing their risk. Empirically, we
can rewrite (13), retaining only the important factors that designers can control, as
\[
R_{tree} = \sum_{i=1}^{N} \left\{ K_{risk} A_i \left( 1 - \int_{0}^{t_{ID\_i}} P[t_{SET}]\,dt \right) I_i \right\},
\tag{14}
\]
where Krisk is a lumped constant that corresponds to the other factors.
As designers, we can identify the following factors as being crucial to reducing Rtree:
• Reducing N: Reducing the number of vulnerable nodes in the tree reduces its risk.
This technique is commonly implemented by using redundant nodes, as only the
non-redundant nodes contribute to the risk.
• Reducing Ai: Reducing the size of each node driver reduces its probability of
being hit; however, clock design constraints require the drivers to be large
enough to drive the necessary loads. Moreover, variability is a function of 1/√size.
• Increasing tID_i: A possible technique to protect the tree would be to increase tID_i
of every node. This would mean that we increase the propagation delay of each cell in
the tree, thereby increasing the insertion delay and reducing the quality of the
clock slopes. This is contrary to general clock design principles and variability
minimization. However, as will be seen in section 4.4.6.6, other circuit techniques
can be employed to increase the tID_i of a node without reducing the quality of the
clock.
• Reducing Ii: Reducing the number of sequential elements that a node fans out to
can reduce the overall tree risk. However, as we see in the subsequent sections, in
systems that are concerned with even a single flip-flop being hit, we have minimal
flexibility to change Ii.
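
The effect of these levers on (14) can be sketched numerically. The example below assumes an exponentially distributed SET width (so the bracketed term in (14) becomes exp(−tID_i / mean width)), sets Krisk to 1, and uses a small made-up tree; the `ClockNode` type, the node values and the 200 ps mean width are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class ClockNode:
    area: float      # A_i, driver drain area (arbitrary units)
    t_id_ps: float   # t_ID_i, inertial delay limiting an SET on this node (ps)
    impact: int      # I_i, number of sequential elements the node fans out to


def tree_risk(nodes, mean_set_ps: float = 200.0, k_risk: float = 1.0) -> float:
    """Eq. (14) with an exponential SET width distribution (an assumption)."""
    return sum(
        k_risk * n.area * math.exp(-n.t_id_ps / mean_set_ps) * n.impact
        for n in nodes
    )


# Hypothetical tree: one root, two branch drivers, four leaf drivers.
baseline = (
    [ClockNode(4.0, 30.0, 8)]
    + [ClockNode(2.0, 30.0, 4) for _ in range(2)]
    + [ClockNode(1.0, 30.0, 2) for _ in range(4)]
)

# Reducing N: the leaf drivers are made redundant and drop out of the sum.
fewer_nodes = baseline[:3]
# Increasing t_ID_i: every node is filtered up to 160 ps instead of 30 ps.
filtered = [ClockNode(n.area, 160.0, n.impact) for n in baseline]

for label, tree in (("baseline", baseline), ("reduced N", fewer_nodes),
                    ("larger t_ID", filtered)):
    print(f"{label:12s} R_tree = {tree_risk(tree):.2f}")
```

The absolute numbers carry no meaning; only the relative change between the variants is of interest.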
 
4.4.6.4. Risk Analysis of individual Flip-flops 
To study the risk RFF of each flip-flop in the design, we study the risk of
only the cells that are present in the clock path that drives the particular flip-flop. As shown
in Fig. 4.22, we create a unique chain of cells (I0 to I5) from the clock source to the FF
comprising only those clock nodes that would cause an upset at that particular flip-flop.
We ignore the effects of particle strikes on branches that would not contribute to an upset
at the particular flip-flop under study. Equation (14) can be modified as
\[
R_{FF} = \sum_{i=1}^{N_p} \left\{ K_{risk} A_i \left( 1 - \int_{0}^{t_{ID\_i}} P[t_{SET}]\,dt \right) I_i \right\}
\tag{15}
\]
where Np is the number of nodes in the particular clock path. However, the impact Ii of an
SET on any node in this case is 1, as we only focus on one flip-flop. Hence (15) becomes
\[
R_{FF} = \sum_{i=1}^{N_p} \left\{ \left( K_{risk} A_i \right) \left( 1 - \int_{0}^{t_{ID\_i}} P[t_{SET}]\,dt \right) \right\}.
\tag{16}
\]
 
Fig. 4.22. Unique clock path from clock source to a flip-flop (clock source, cells I0 to I5, flip-flop CLK input).
 
4.4.6.5. Case Study: Under driven clock node 
In the case of an ASIC clock tree, with under and overdriven nodes, we see that each
cell in the path has a different inertial delay, tID_i. Underdriven nodes are generated by the
CTS tool as it tries to improve the timing. This was studied in detail in section 4.4.5.5 on
the three clock trees in HERMES2. Thus, in this case the cell with the worst or the largest
tID_i would form the limiting case pulse width for an SET that propagates from anywhere
above it in the chain. As shown in Fig. 4.23, we consider for example that the cell I3 has
the largest inertial delay tID3; hence all cells prior to I3 in the tree have tID3 as their inertial
delay limit. Consequently the risk of this particular flip-flop is
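\[
R_{FF} = \sum_{i=0}^{3} \left\{ \left( K_{risk} A_i \right) \left( 1 - \int_{0}^{t_{ID3}} P[t_{SET}]\,dt \right) \right\}
+ \sum_{i=4}^{5} \left\{ \left( K_{risk} A_i \right) \left( 1 - \int_{0}^{t_{ID\_i}} P[t_{SET}]\,dt \right) \right\}
\tag{17}
\]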
It is important to note that any under driven node with a large tID_i can mitigate an 
SET of up to tID_i only at its input and on nodes prior to its input in the chain. It cannot 
mitigate an SET at its output or on devices down in the chain. Under driven nodes may 
provide some SET mitigation but they do so at a significant cost: they have poor signal
slope and generally lead to degraded clocks, larger variability and larger power
consumption. Moreover, they produce the longest SETs. For example, in HERMES2, the ClkA
tree mitigates an SET of up to 160 ps at its clock source due to the presence of underdriven
nodes.

Fig. 4.23. Under driven node in the clock path from clock source to a flip-flop (cells I0 to I3 are limited by tID3; cells I4 and I5 by tID4 and tID5).
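
The underdriven-node case of (17) can be compared against the plain per-node form of (16) with a short sketch. It again assumes an exponentially distributed SET width with a hypothetical 200 ps mean and Krisk = 1; the areas are made-up values, and the 160 ps inertial delay of the underdriven cell simply reuses the ClkA figure quoted above.

```python
import math


def node_risk(area: float, limit_ps: float, mean_set_ps: float = 200.0,
              k_risk: float = 1.0) -> float:
    """K_risk * A_i * P[t_SET >= limit] for an exponential width distribution."""
    return k_risk * area * math.exp(-limit_ps / mean_set_ps)


def ff_risk_independent(areas, t_ids_ps) -> float:
    """Eq. (16): every node is limited only by its own inertial delay."""
    return sum(node_risk(a, t) for a, t in zip(areas, t_ids_ps))


def ff_risk_underdriven(areas, t_ids_ps) -> float:
    """Eq. (17): nodes up to the slowest cell inherit its inertial delay."""
    k = max(range(len(t_ids_ps)), key=lambda i: t_ids_ps[i])
    return sum(
        node_risk(a, t_ids_ps[k] if i <= k else t)
        for i, (a, t) in enumerate(zip(areas, t_ids_ps))
    )


areas = [4.0, 4.0, 2.0, 1.0, 2.0, 2.0]             # hypothetical A_0..A_5
t_ids = [30.0, 30.0, 30.0, 160.0, 30.0, 30.0]      # cell I3 is underdriven
print(ff_risk_independent(areas, t_ids))
print(ff_risk_underdriven(areas, t_ids))
```

Only the nodes at or above the underdriven cell benefit; cells I4 and I5 keep their own, smaller inertial delays.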
4.4.6.6. Case Study: Delay Filter in the clock path 
Another well proven technique to improve the tID_i  of the final (leaf) clock node is 
to use delay filters (DF) [Nase06]. A delay filter produces a large delay that filters any SET 
of duration less than the filter delay. Delay filters are identical to the guard gate technique
of SET protection discussed in [Bala05]. Thus, similar to the underdriven nodes case, the
large delay of the delay filter results in a large tID_i (denoted in Fig. 4.24 as tID_DF). The DF
thus mitigates any SET of up to tID_DF on all nodes prior to the filter in the clock path,
reducing the overall risk of the flip-flop to  
\[
R_{FF} = \sum_{i=0}^{5} \left\{ \left( K_{risk} A_i \right) \left( 1 - \int_{0}^{t_{ID\_DF}} P[t_{SET}]\,dt \right) \right\}.
\tag{18}
\]
 
Fig. 4.24. Delay filter (DF) used at the clock input of the flip-flop (all cells from the clock source to the DF are limited by tID_DF).
 
4.4.6.7. Case Study: Redundant Clock Trees 
Redundant nodes with majority voting at the sequential elements provide excellent 
SET protection as the redundancy reduces the risk of any one node state being upset. 
Redundant flip-flops considered for this analysis have two or more copies of clocks and 
majority voting circuits at the sequential stages for correction [Teif08] [Hind11][Carm01]. 
However, redundant flip-flops which share the same clock for both copies are not protected
against clock SETs, as reiterated in [Furu10]. These are not considered clock
redundant for this analysis and hence are analyzed as non-redundant flip-flops from a clock
SET perspective.

Fig. 4.25. Redundancy based SET protection schemes showing (a) semi redundant and (b)
fully redundant clock networks. In (a) the common path I0 to I2 from the clock source feeds the redundant paths IA3 to IA5 and IB3 to IB5; in (b) only the common node I0 is shared and the redundant paths are IA1 to IA5 and IB1 to IB5.

The redundant branches do not pose any risk and hence the clock risk to
the flip-flop arises only from the common path as shown in Fig. 4.25 (a). The risk of the
FF in the semi redundant RFF_semi case would be
\[
R_{FF\_semi} = \sum_{i=0}^{2} \left\{ \left( K_{risk\_i} A_i \right) \left( 1 - \int_{0}^{t_{ID\_i}} P[t_{SET}]\,dt \right) \right\},
\tag{19}
\]
while the risk for the fully redundant clock tree RFF_fully as shown in Fig 4.25 (b) reduces 
to the risk of just one node (clock source) getting hit as  
\[
R_{FF\_fully} = \left( K_{risk\_0} A_0 \right) \left( 1 - \int_{0}^{t_{ID\_0}} P[t_{SET}]\,dt \right).
\tag{20}
\]
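
The progression from (16) to (19) and (20) amounts to shrinking the set of nodes that remain in the sum. The sketch below makes that explicit for the six-cell path of Fig. 4.25, assuming an exponentially distributed SET width, Krisk = 1 and made-up areas and inertial delays; none of these values come from HERMES2.

```python
import math


def path_risk(areas, t_ids_ps, mean_set_ps: float = 200.0,
              k_risk: float = 1.0) -> float:
    """Sum of K_risk * A_i * P[t_SET >= t_ID_i] over the listed nodes."""
    return sum(
        k_risk * a * math.exp(-t / mean_set_ps)
        for a, t in zip(areas, t_ids_ps)
    )


areas = [4.0, 4.0, 2.0, 1.0, 2.0, 2.0]   # hypothetical A_0..A_5
t_ids = [30.0] * 6                       # hypothetical inertial delays (ps)

non_redundant = path_risk(areas, t_ids)              # Eq. (16): I0..I5
semi_redundant = path_risk(areas[:3], t_ids[:3])     # Eq. (19): common path I0..I2
fully_redundant = path_risk(areas[:1], t_ids[:1])    # Eq. (20): clock source only

print(non_redundant, semi_redundant, fully_redundant)
```

In this simplified view the redundant branches contribute nothing, which mirrors the assumption that the voting flip-flops tolerate an SET on any single clock copy.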
4.4.6.8. HERMES2 Risk analysis 
In this section, the tree risk is systematically analyzed for each of the three trees. To
simplify this analysis, we consider the impact of only two
parameters of the trees: the area of the node, Ai, and the number of nodes it fans out to, Ii.
We have already seen that the CTS uses buffers of different sizes based on their 
placement location and wire load to buffer the tree and adjust timing. As the initial levels 
of the tree have a fanout of 1, any of these nodes can be considered the root, since a particle
strike on any of these nodes would propagate to each and every sequential element in the
design. Given these same impact numbers, the risk of each of these nodes is proportional
to its node area. Hence, in Fig 4.26 (a) we see dramatic swings attributable to the different
(and sometimes arbitrarily sized) drivers employed at the initial levels (up to level 8 in the
tree). However, as the tree starts fanning out, we see a clearly decreasing risk trend from the
root of the tree to the sequential elements.

Fig. 4.26. (a) Cumulative risk per level of the ClkA tree (x-axis: level; y-axis: cumulative level risk, ∑AiIi). (b) Risk of the individual nodes in the ClkA tree vs insertion delay (x-axis: insertion delay from clock source; y-axis: Ai * Ii, log scale).
From 4.4.5.1 we observe that the flip-flops in the tree are spread across multiple 
levels (from 16-26 in ClkA tree). From 4.4.5.2, it is evident that the flip-flops are more 
closely grouped and so when analyzed the  
Obviously, the root of the tree presents higher risk (due to its large Ii) than the 
higher levels in the tree. This is in spite of the large number of nodes in the higher levels, 
and consequently higher cumulative output drain area, since this drain area is split across 
many nodes. Similarly, Fig. 4.27 shows the same trends on the ClkB and ClkC trees. 
  
  
 
Fig. 4.27. Cumulative risk per level of (a) the ClkB tree and (b) the ClkC tree (x-axis: level; y-axis: cumulative level risk).
 
4.4.7. Temporally Redundant Clock Trees 
Radiation hardened designs using temporal sampling usually depend on having 
multiple clocks for sampling the data at multiple sample windows. These clocks are 
produced from a single clock source and could be generated locally in the flip-flop 
[Mavis02] or using local clock managers or delay elements [Avir12]. Such designs have 
the disadvantage that a particle strike at a common clock node causes the flip-flop to sample 
the wrong value of data. Thus as was shown in 4.4.6.7, the common path (to the three 
clocks) in the tree is still vulnerable to upset. In case of locally generated temporal clocks 
[Mavis02], the whole clock tree up to the flip-flop’s clock pin is vulnerable, while in case 
of local clock generators the path from the clock source to the clock input of the local clock 
generator is vulnerable [Avir12]. 
In the next section, we propose an integrated temporally redundant clock tree and
pulse flip-flop methodology to mitigate SETs and SEUs. This scheme is an improved
version of the temporal pulse clocked flip-flop (TPFF) discussed in Chapter 3 along with
a temporally redundant clocking scheme that reduces the clock risk. The proposed
integrated TMR pulse flip-flop with temporally redundant clock tree methodology tries
to combine the advantages of [Sushil15] while significantly reducing the power
overhead due to the large number of local delay elements required.
4.4.8. Proposed Integrated Temporal Clocking and TMR Pulse FF methodology 
The proposed design shown in Fig. 4.28, generates the three delayed clocks D1Clk, 
D2Clk and D3Clk globally at the root of the clock tree or the clock source. The three 
temporally redundant clocks are then distributed to the distributed pulse generators using
three independent clock trees. The three clock trees have driving nodes that are spatially
separated to prevent a single SET from affecting two different clock nodes at any given time.
From 4.4.6.6 we have already seen that using delay filters [Nase06] or guard gates
[Bala05] in the tree reduces the clock risk on all nodes prior to the filter in the tree. In this
case, as in [Sushil15], the design protects against SETs on the global clock node propagating to all
three copies of the clock. However, unlike in that design, the delayed clocks are distributed
throughout the chip rather than a single global clock that is split at the end. This eliminates the
need for individual delay elements in each of the pulse generators to generate the redundant 
clocks resulting in large power savings. Additionally one delay element in the clock source 
is also reduced by sharing the GClk with all C-elements to produce D1Clk, D2Clk and 
D3Clk. 
The three redundant clock trees TreeA, TreeB and TreeC, as shown in section 4.4.6.7, 
do not contribute to the clock risk as the TMR pulse flip-flops are completely hard to an 
SET on a single clock copy. Thus the clock risk is greatly reduced by such an integrated 
clocking and TMR pulse flip-flop scheme along with very large power savings, which is 
quantified in subsequent sections. 
 
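Fig. 4.28. Proposed integrated TMR pulse FF and temporally redundant clock tree design. (Block diagram: the clock source derives D1CLK, D2CLK and D3CLK from GClk through C-elements; TreeA, TreeB and TreeC distribute them to pulse generators PGA, PGB and PGC, whose pulse clocks PCLKA, PCLKB and PCLKC clock three copies of the x16 latches holding Data D<15:0>; outputs QA, QB and QC are majority voted to produce Q<15:0>.)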
 
4.4.8.1. Experimental implementation 
To study the additional power cost of having three redundant clock trees, over a 
single tree in the chip, we synthesized, placed and routed different implementations of the 
AES design that was studied in Chapter 3. The TMR AES design was re-designed with 
minor modifications to implement both a single redundant clock tree to clock all three 
copies of the sequential elements and with three redundant clock trees where each clocks
one redundant copy. Fig. 4.29 shows the CTS generated trees for both the single tree and
triple redundant tree versions.

Fig. 4.29. TMR AES design in TC-23 implemented with (a) non-redundant or single clock
tree (GClkA source) and (b) triple redundant clock trees (GClkA, GClkB and GClkC sources).
4.4.8.2. Analysis of Power Consumed  
Power consumed in the clock tree has been shown to be linearly related to the 
number of clock sinks in the design [Vitt97]. Thus in ideal cases, clocking the three copies 
with either a single non-redundant clock tree or three smaller redundant clock trees should 
dissipate about the same amount of power. However, as we have shown in section 4.4.5.5
the CTS tool tries to adjust timing by over or under-driving nodes resulting in sub-optimal 
clock trees. Consequently, we find that the total power consumed by the three trees in the 
implementation is 20.2% more than that of the non-redundant tree. Table 4.5 provides the 
detailed analysis of the power consumed by the trees in the different design schemes. The 
clock activity factor for this comparison was taken as 1 or 100% to account for power 
savings in the un-gated global clock trees.  
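
The power figures of Table 4.5 follow directly from the supply voltage and the average currents, and the overhead of the three trees follows from their ratio. The short sketch below recomputes both from the tabulated Vdd and Iavg values; the dictionary keys and variable names are illustrative.

```python
VDD_V = 1.2  # supply voltage from Table 4.5

# Average supply current per clock tree in mA, as listed in Table 4.5.
iavg_ma = {
    "non_redundant": 12.47,
    "TREEA": 4.96,
    "TREEB": 5.14,
    "TREEC": 4.88,
}

# P = Vdd * Iavg; volts times milliamperes gives milliwatts.
power_mw = {name: VDD_V * current for name, current in iavg_ma.items()}

triple_redundant_mw = sum(power_mw[t] for t in ("TREEA", "TREEB", "TREEC"))
overhead_pct = 100.0 * (triple_redundant_mw / power_mw["non_redundant"] - 1.0)

for name, value in power_mw.items():
    print(f"{name:14s} {value:7.3f} mW")
print(f"three trees    {triple_redundant_mw:7.3f} mW")
print(f"overhead       {overhead_pct:7.2f} %")
```

With the rounded table entries the overhead evaluates to roughly 20%, in line with the figure quoted above.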
Though the power consumed by three trees is larger than that of a single tree, it 
dramatically reduces the overall power in the design, as the need for delay elements in the 
multitude of pulse generators has now been eliminated. The overall power consumed by 
all the TMR pulse latches (~6816 or 426 16-b macros) in the AES design as implemented
with both (a) the non-redundant clock tree and TMR pulse FF as designed in [Sushil15]
and (b) the proposed integrated scheme is enumerated in Table 4.6.

TABLE 4.5 COMPARISON OF A NON REDUNDANT CLOCK TREE VS THREE REDUNDANT TREES
Clock tree                  Cells in tree  Vdd (V)  Frequency (MHz)  Time period (ns)  Iavg (mA)  Power (mW)  Energy (pJ)  Total energy (pJ)
Non redundant clock tree    544            1.2      250              4                 12.47      14.964      3.741        3.741
TR clock trees: TREEA       231            1.2      250              4                 4.96       5.952       1.488        4.494 (TREEA + TREEB + TREEC)
TR clock trees: TREEB       266            1.2      250              4                 5.14       6.168       1.542
TR clock trees: TREEC       223            1.2      250              4                 4.88       5.856       1.464
Fig. 4.30 shows the overall power consumed by the AES design when implemented 
with the TPFF as proposed in [Sushil15] and the proposed integrated methodology. For 
this analysis the TPFF proposed in [Sushil15] was redesigned with only two delay 
elements. We see that the overall power is at least 22% lower than in this prior work, and
we obtain even higher power savings at low data activity factors. When this methodology
is compared with the BISER flip-flop [Zhang06], we see almost identical power
consumption (within 1%) at higher activity factors and the largest penalty (<20%) at low activity
factors. The proposed approach, however, is hard to clock and data SETs, unlike the BISER,
which can only protect against SEUs.
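
The two savings columns of Table 4.6 can be recomputed from its three total-energy columns. The sketch below does so; note that the normalization is an inference from the numbers rather than something stated in the text: the savings against [Sushil15] appear to be relative to the [Sushil15] energy, while the BISER comparison appears to be relative to the proposed scheme. The tuple layout and function names are illustrative.

```python
# (activity factor, SR_tree + 426*[Sushil15], TR_tree + 426*[proposed],
#  BISER + SR_tree), all energies in pJ, copied from Table 4.6.
rows = [
    (0.00, 347.10, 212.38, 171.41),
    (0.25, 408.44, 273.73, 250.48),
    (0.50, 472.77, 338.05, 324.77),
    (0.75, 534.11, 399.40, 399.75),
    (1.00, 604.83, 470.11, 474.73),
]


def savings_vs_prior(prior_pj: float, proposed_pj: float) -> float:
    """Percentage saved relative to the prior [Sushil15] implementation."""
    return 100.0 * (prior_pj - proposed_pj) / prior_pj


def savings_vs_biser(biser_pj: float, proposed_pj: float) -> float:
    """Percentage difference to BISER, normalized to the proposed scheme."""
    return 100.0 * (biser_pj - proposed_pj) / proposed_pj


for activity, prior, proposed, biser in rows:
    print(f"{activity:4.2f}  {savings_vs_prior(prior, proposed):6.2f} %"
          f"  {savings_vs_biser(biser, proposed):7.2f} %")
```

A negative value in the second column means the proposed scheme consumes more energy than BISER at that activity factor.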
TABLE 4.6 STUDY OF OVERALL POWER SAVINGS BY THE PROPOSED INTEGRATED METHODOLOGY
Data      BISER   16 b TPFF  16 b TPFF         SR_tree + 426*  TR_tree + 426*  BISER +   Savings [Sushil15]  Savings BISER
activity  power   w/o PG     with PG (with 2)  [Sushil15]      [proposed]      SR_Tree   - [proposed]        - [proposed]
factor    (fJ)    (fJ)       (fJ)              (pJ)            (pJ)            (pJ)      (%)                 (%)
0         24.60   490        800.00            347.10          212.38          171.41    38.81               -19.29
0.25      36.20   633        1048.00           408.44          273.73          250.48    32.98               -8.49
0.5       47.10   784        1305.00           472.77          338.05          324.77    28.50               -3.93
0.75      58.10   932        1547.00           534.11          399.40          399.75    25.22               0.09
1         69.10   1088       1797.00           604.83          470.11          474.73    22.27               0.98
* 426 is the number of 16-b TMR macros used in the implemented AES design
 
4.5. Chapter Summary 
This chapter describes design techniques for designing both custom and CAD tool 
based ASIC radiation hardened clock distribution networks. In particular, the chapter 
proposes the design and implementation of an RHBD clock spine, with circuits for detecting
radiation induced errors, that was fabricated and tested using two 90 nm process variants. The design
was exposed to heavy ions and protons and was experimentally found to be hard to over
100 MeV-cm²/mg with only one recorded clock error. The test results demonstrate that the
clock spine radiation hardening techniques are effective. The chapter also proposes design 
techniques for designing redundant ASIC clock trees using standard CAD flow 
methodologies which in conjunction with the design of RHBD flip-flops can be used for 
radiation hardened systems. The design characteristics of such trees are also studied in 
detail. The chapter also proposes a framework for analyzing the risk of clock trees and uses 
the framework to design an integrated temporally redundant clocking scheme with TMR 
pulse clocked flip-flops.  
 
Fig. 4.30. Comparison of overall power consumed due to sequential elements in AES when
implemented with TPFF methodology [Sushil15], proposed and BISER FFs (x-axis: data activity factor; y-axis: energy (pJ); series: SRtree w TPFF, proposed integrated methodology, SRTree w BISER FFs).
 
CHAPTER 5. RADIATION HARDENED DDRX CLOCK GENERATION 
In this chapter, we discuss the design of a proposed RHBD AD-DLL for DDRx 
applications. The architecture, modes of operation and description of custom blocks such 
as coarse delay line, fine delay line and time-to-digital converter are presented. Finally the 
block RTL is discussed with one mode of operation implemented.
5.1. DDR SDRAM Memory Interfaces 
Double data rate synchronous dynamic random access memories (DDR SDRAM) 
perform data transfers at both the rising and falling edges of the I/O clock. With technology 
progression, the data transfer rates of memories have increased, leading to DDR2, DDR3 
and DDR4 standards. The current industry standard in SDRAM memories is the DDR3, 
while the DDR4 standard is currently being designed. Table 5.1 compares the different 
SDRAM DDR memory standards.  
DDR2 memories have higher bus speeds leading to four data transfers per memory clock
cycle as compared to the SDRAM. DDR3 memories can transfer data at up to twice the data
rate of DDR2 memories, or about 8 data transfers per memory clock cycle. Thus at a memory clock frequency
of 100 MHz, the DDR SDRAM has a maximum data transfer rate of 1600 MB/s while
the DDR2 memory has 3200 MB/s and the DDR3 has 6400 MB/s. Fig. 5.1 shows the
internal clocks for the different SDRAM standards. Such high data transfer rates require
accurate synchronization with the external clock with precise internal timing control and
very low clock jitter. Implementation of different types of SDRAM standards requires the
use of self-calibration and phase locked loops or delay locked loops to produce multiple
frequency clocks or quadrature clocks. This work describes the design of an all-digital delay
locked loop (AD-DLL) architecture and the associated circuitry for use in radiation
hardened environments. Note that the allowed jitter has decreased commensurately
as the data rates have increased. However, the numbers are still achievable using digital circuits. Since
it is much more straightforward to SEE harden digital circuits, and digital circuits are much
more portable across process technologies, the studies here focus on all-digital clock data
recovery approaches.

Fig 5.1. Clock frequencies for SDR, DDR, DDR2 and DDR3 memories (memory clock and the SDR, DDR, DDR2 and DDR3 I/O data clocks).

Table 5.1. Comparison of DDR, DDR2, DDR3 memories
JEDEC standard name   Memory clock (MHz)   Cycle time (ns)   I/O bus clock (MHz)   Data rate (MT/s)   Peak transfer rate (MB/s)   Clock period jitter tJIT (ps)
DDR-200               100                  10                100                   200                1600                        +/- 250
DDR-266               133.33               7.5               133.33                266.66             2133.3                      +/- 250
DDR-333               166.66               6                 166.66                333.33             2666.6                      +/- 250
DDR2-400 B/C          100                  10                200                   400                3200                        +/- 125
DDR2-533 B/C          133.33               7.5               266.66                533.33             4266.6                      +/- 125
DDR2-667 C/D          166.66               6                 333.33                666.66             5333.3                      +/- 125
DDR3-800 D/E          100                  10                400                   800                6400                        +/- 100
DDR3-1066 E/F/G       133.33               7.5               533.33                1066.6             8533.3                      +/- 90
DDR3-1333 F/G/H/J     166.66               6                 666.6                 1333.3             10666.6                     +/- 80
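
The data rate and peak transfer rate entries of Table 5.1 can be reproduced from the memory clock alone. The sketch below assumes a 64-bit (8-byte) data bus, which is not stated in the text but is what makes 200 MT/s correspond to 1600 MB/s; the constant and function names are illustrative.

```python
# Data transfers per memory clock cycle for each generation, as described above.
TRANSFERS_PER_MEMORY_CLOCK = {"DDR": 2, "DDR2": 4, "DDR3": 8}

BUS_WIDTH_BYTES = 8  # assumed 64-bit module data bus


def data_rate_mt_s(generation: str, memory_clock_mhz: float) -> float:
    """Data rate in mega-transfers per second."""
    return memory_clock_mhz * TRANSFERS_PER_MEMORY_CLOCK[generation]


def peak_transfer_mb_s(generation: str, memory_clock_mhz: float) -> float:
    """Peak transfer rate in MB/s for the assumed bus width."""
    return data_rate_mt_s(generation, memory_clock_mhz) * BUS_WIDTH_BYTES


for generation in ("DDR", "DDR2", "DDR3"):
    print(generation,
          data_rate_mt_s(generation, 100.0), "MT/s,",
          peak_transfer_mb_s(generation, 100.0), "MB/s")
```

At a 100 MHz memory clock this gives 1600, 3200 and 6400 MB/s for DDR, DDR2 and DDR3 respectively, matching the first column of each group in Table 5.1.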
5.2. An RHBD DLL for DDR2 and DDR3 
The design proposed in this chapter is a radiation hardened by design (RHBD) all 
digital delay locked loop (AD-DLL) for use in DDR2 and DDR3 applications. An all-
digital (AD) implementation is chosen for the DLL architecture as it is easier to harden 
against single event effects. The AD-DLL is exclusively built from digital blocks;
hence, it does not contain any passive components, e.g., resistors and capacitors. The
traditional voltage controlled delay line (VCDL) is replaced by a digitally controlled delay
line (digital delay line or DDL), more specifically called a numerically controlled delay
line, as was discussed in Chapter 2. It is also easier to design with standard digital design
flows and is scalable across technologies. It is designed to reach speeds needed for memory
standards from SDR memories and DDR to DDR2 and DDR3 memories. The DDR4
memory spec is still a work in progress, and further improvements to the proposed design
can enable the same design to service DDR4 applications as well. Simulation results below
show that we appear to have sufficient jitter margin to scale beyond DDR3, but the full
DDR specification is still evolving. Fig. 5.2 shows the required quadrature clock
generation. Basically, it is necessary to generate higher speed clocks, which are kept in
phase with the input clock. The basic task at hand is to generate the quadrature data
recovery clocks with low jitter.

Fig. 5.2. Quadrature clocks produced by the DLL for DDR2. (Waveforms: input clock and quadrature clocks at 0°, 90°, 180° and 270°; annotations mark the delay introduced, the edge offset detected, and the offset correction computed and employed by the control logic.)

Fig. 5.3. Proposed DLL architecture top level block diagram. (Blocks: digital delay line, mode mux, TDC, TDC mux and control logic in DLL copy A, with DLL copies B and C; signals: input (reference) clock, output clocks A, B and C, mode selects, delay control signals (coarse, fine), output control, fine edges and TMR voting signals.)
5.3. RHBD AD-DLL Architectural Overview 
The top level block diagram of the proposed RHBD AD-DLL is shown in Fig. 5.3. 
The architecture consists of three single redundant copies of the core DLL, combined in a 
self-correcting, triple modular redundant (TMR) structure to mitigate soft-errors in the 
logic or signaling. Each DLL consists of a DDL, a time-to-digital converter, and associated 
control circuits. Digital signals from parts of the DLL are locally voted using majority gates
and radiation hardened TMR flip-flops to prevent SETs from upsetting internal states. Majority
voting is also used to synchronize the different copies, so that they stay in lock with each 
other, even if one experiences a transient induced loss of lock. Each single redundant copy 
of the DLL produces an output clock resulting in 3 system clocks ClkA, ClkB, and ClkC. 
Since the DLLs are feed-forward circuits, errors due to particle strikes do not accumulate 
and hence no voting is necessary in the output clocks. In the proposed DLL, the quadrature 
clocks required are produced by digitally controlled delay lines (DDLs). Each DDL in turn 
consists of a coarse and a fine delay line. A time-to-digital converter (TDC) measures the 
edge offset between the system clock (DLL clock) and the reference clock at every cycle. 
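
The local voting described above can be illustrated with a bitwise two-of-three majority function. This is a behavioral sketch only: the function name and the example control words are hypothetical, and it does not model the actual majority gates or TMR flip-flops of the design.

```python
def majority(a: int, b: int, c: int) -> int:
    """Bitwise two-of-three majority vote of three redundant copies."""
    return (a & b) | (b & c) | (a & c)


# Three copies of a hypothetical 6-bit control word; copy C has one upset bit.
copy_a = 0b101101
copy_b = 0b101101
copy_c = 0b100101

voted = majority(copy_a, copy_b, copy_c)
print(f"{voted:06b}")
```

Because each bit is voted independently, a single upset copy is outvoted and, when the voted value is fed back to all three copies, the copies are brought back into agreement.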
The 0° quadrature clock is produced by multiplexing the input reference to the 
output with “zero” delay (i.e., clock and data delay matched). This results in no phase error 
accumulation as in a pure DLL. Quadrature Clocks 90°, 180° and 270° are produced by the 
DDL configured in an oscillator mode by feeding the DDL output back to its inputs for 
three cycles. This allows the use of a single coarse and fine delay line for all four quadrature 
clocks instead of three copies of the circuits.  
 
Dithering is employed to get faster locking and accurate clock edges. Dithering 
allows a total jitter of about +/- two fine delay steps (approximately 10ps in this 
implementation) of total single input clock phase error, rather than four times that amount. 
The edge offset is measured at the end of the first cycle and the required delay setting for 
the correct edges is computed based on the fine delay line error measured by the TDC with 
5ps accuracy. The subsequent clock is corrected to produce the required correct edge based 
on precise phase error measurement using the TDC as shown in Fig. 3.3. Thus the peak to 
peak jitter (Jitterabs) generated is equal to  
\[ Jitter_{abs} = 2\,T_{finestep} \]
where Tfinestep is the fine step delay or the smallest resolution delay of the DDL.  
While an overview of the architecture is presented in this section, the detailed
description of the architecture along with its working and modes of operation is discussed in
section 5.8. The constituent circuits designed and their features are discussed in detail in
the subsequent sections. The complete top level architecture is then presented again in
greater detail as a hierarchy of its top level circuits.

Fig 5.4. DDL block diagram.
5.4. Digital Delay Line (DDL) 
The DDL has a coarse delay line for delay increments of ~60ps and a fine delay 
line for delay increments of ~5ps when implemented on the prototype low standby power 
(LSP) 90-nm foundry process. The DDL uses TMR for protection against
SETs: it comprises three single redundant delay lines (coarse and fine) and decoders for decoding the
coarse and fine selects, and it outputs a majority voted clock as shown in Fig. 5.4. Note the
voted outputs in the figure. The TMR control logic determines the coarse and fine settings, 
and is fully synthesizable.  
The circuit operates by rotating through the coarse and fine delay lines for each 
quadrature clock, multiplexing the feedback for the last 3 cycles (oscillator mode) and 
feeding the input clock forward during the first. During the oscillator mode to produce
quadrature clock edges, the delay produced by the delay line becomes the width of each
clock phase as shown in Fig. 5.6.

Fig. 5.5. The coarse DDL using a multiplexing chain. AO gates on a 14 track pitch are
used.
5.4.1. Coarse Delay Line 
The coarse DDL generates large delays controllable by the coarse select word. The 
delay is generated by a multiplexer based delay line as shown in Fig. 5.5 in up to 64 
approximately 60 ps steps. Each of the multiplexer stages can propagate the input clock or 
the multiplexer output of the previous stage depending on the coarse select signal. Variable 
delay is generated by selecting the number of multiplexer stages in the critical input clock 
path (as shown in blue), i.e., the entry point into the DDL. In an ideal design, the overall 
delay is an integer multiple of the multiplexer propagation delay. In a practical design, each 
multiplexer also adds variability to the delay; thus the delay variability depends on the total
delay selected and varies from device to device. We characterize this delay variability using a
digitally controlled oscillator (DCO) test structure in TC-23 as discussed below.  
 
Fig. 5.6. Quadrature Clock pulse width. 
 
 The 64 bit coarse select word is coded as a thermometer code. The bit position of 
the last ‘1’ in the code determines the number of stages in the path. The control logic that 
determines the coarse delay needed provides a 6 bit encoded Coarse Select Address to the 
coarse delay line. A 6:64 decoder within the delay line decodes this 6 bit word to produce 
the individual Coarse Select signals for each of the 64 stages. Fig. 5.7 shows the simulated 
ideal variation of delay as the coarse select word is changed. 
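The decoding step above can be sketched as a small behavioral model. This is a minimal sketch, not the synthesized decoder: it assumes that a coarse select address a (0 to 63) places a + 1 multiplexer stages in the clock path, and it uses the approximate 60 ps stage delay quoted in the text. The function names are hypothetical.

```python
# Behavioral sketch of the 6:64 coarse select decoder and the ideal delay.
# Assumption: address a (0..63) puts a + 1 stages in the clock path.
COARSE_STEP_PS = 60.0   # approximate per-stage delay quoted in the text
N_STAGES = 64


def decode_coarse_select(address: int) -> list[int]:
    """Decode a 6 bit address into a 64 bit thermometer code."""
    if not 0 <= address < N_STAGES:
        raise ValueError("coarse select address must be in 0..63")
    return [1 if i <= address else 0 for i in range(N_STAGES)]


def stages_in_path(code: list[int]) -> int:
    """The position of the last '1' sets the number of stages in the path."""
    last_one = max(i for i, bit in enumerate(code) if bit)
    return last_one + 1


def ideal_coarse_delay_ps(address: int) -> float:
    """Ideal delay: an integer multiple of the stage delay."""
    return stages_in_path(decode_coarse_select(address)) * COARSE_STEP_PS
```

Under these assumptions the ideal delay spans one stage delay at address 0 to 64 stage delays at address 63; per-stage variability is not modeled.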
5.4.2. Fine Delay Line 
 The fine delay line is used to tune the DDL delay in small delay increments to 
achieve high resolution, i.e., low jitter. The fine delay line produces multiple fine edges 
and depending on the phase error, the control circuits select the appropriate edge to increase 
or decrease the delay. In this work, the fine delay line is constructed from a chain of 32 
interpolator circuits. A 32:1 multiplexer selects the fine tune output controlled by external 
TMR logic, which will be synthesized. 
 
Fig. 5.7. Coarse chain delay vs coarse select from post layout extracted simulations. Note 
each stage adds a delay of 60ps to the total delay of the chain. 
[Plot axes: x, # of stages in the coarse delay line (0 to 64); y, delay (ns), 0.00 to 4.50.]
 
5.4.2.1. Interpolator 
The interpolator outputs edges interpolated between each of its two input edges. In 
this implementation, the edge-to-edge delays (rising and falling) are evenly spaced 
(excepting inaccuracies due to device variations) at 1/4th the propagation delay of an 
inverter. The interpolator circuit is shown in Fig. 5.8. Eight such stages are connected in a 
chain forming the fine delay line producing 32 fine edges. The input and output edges of 
an inverter (herein called the stage inverter) are converted to the same polarity using an 
edge equalization circuit.  
This circuit has three inverter stages in one branch and four inverter stages in the 
other, to match the propagation delays. This converts the alternate rising and falling edges 
of the stage inverter into uniform polarity rising or falling edges leaving the interpolator. 
 
 
Fig. 5.8. Interpolator based fine delay stage circuit. 
 
Three and four inversion stages were used in the inverting and non-inverting paths, respectively, 
rather than the minimum two and three stages, to keep the loading of the 
first stage of both equalization branches identical.  
Consequently, the difference between the two input edges to the interpolator is 
equal to the propagation delay of the stage inverter. The interpolator then produces four 
edges that are evenly spaced at 1/4th the propagation delay of the stage inverter. The post 
layout extracted delay was simulated at approximately 20ps. Thus, each interpolated fine 
delay edge is evenly spaced at 5ps. SPICE optimization to tune the paths resulted in less 
than 400fs difference between the edge-to-edge delays as simulated with post-layout 
extracted parasitics. The fine edges (rising and falling) produced are shown in Fig. 5.9. 14 
track custom cells were designed for the interpolator as this would result in lower 
variability than the 7 track standard cells. Large inverters minimize delay variations. The 
edge slew rate in both rising and falling cases was designed to be ~30ps irrespective of the 
 
Fig. 5.9. Simulated fine delay edges (rising and falling) produced by the interpolator based 
fine delay line. 
[Plot axes: x, time (ps); y, V (V), 0 to 1.2.]
 
input slope. Two additional dummy stages, one at the beginning of the chain for input slope 
balancing and one at the end of the chain for load balancing are added. 
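The edge spacing described above can be illustrated with a short calculation. This is a sketch under the stated numbers only: eight stages, four interpolated edges per stage, and the approximately 20 ps stage delay quoted in the text. The function name and parameters are hypothetical, and the dummy stages and device variation are not modeled.

```python
# Ideal fine edge positions for a chain of interpolator stages.
# Assumptions: identical stages, no variation, dummy stages ignored.
def fine_edge_times_ps(stage_delay_ps: float = 20.0,
                       n_stages: int = 8,
                       edges_per_stage: int = 4) -> list[float]:
    """Return the ideal time of each fine edge relative to the first."""
    step_ps = stage_delay_ps / edges_per_stage   # 1/4 of the stage delay
    return [k * step_ps for k in range(n_stages * edges_per_stage)]


edges = fine_edge_times_ps()
```

With the default values the list holds 32 edges spaced 5 ps apart.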
5.4.2.2. Fine Delay Line Response with Systematic Process Corners 
The fine delay circuit is designed to produce reliable edge delays at all PVT corners. 
The circuit was iteratively optimized at the TT process corner, and its performance was then 
checked at the other corners. The goal was to avoid sizing 
the transistors in a fashion that would cause the fine edge of any one stage to overlap or 
cross the neighboring fine edge at any corner. The circuit is also designed so that at the fast 
corners, the fine delays produced are large enough to distinguish 
the different edges from one another. This requires the mean of each fine step to be greater 
than 4σ of its variation (σ of each step is approximately 300 fs). The process corner 
simulation results comprise Table 5.2. Fig. 5.10 shows each of the fine step sizes across 
corners.  
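The 4σ criterion can be checked directly against the step sizes listed in Table 5.2. The sketch below is only an arithmetic illustration: the step values are copied from the table, σ is taken as the approximate 300 fs quoted above, and the helper names are hypothetical.

```python
# Check that every fine step exceeds 4 sigma of its variation.
SIGMA_PS = 0.3  # approximate per-step sigma from the text (300 fs)

# (rise-rise, fall-fall) fine edge spacing in ps, from Table 5.2
FINE_STEP_PS = {
    "TT": (3.060, 3.055),
    "FF": (2.561, 2.552),
    "FS": (3.060, 3.067),
    "SF": (3.103, 3.084),
    "SS": (3.749, 3.741),
}


def resolvable(step_ps: float, sigma_ps: float = SIGMA_PS, k: float = 4.0) -> bool:
    """True when the mean step is larger than k sigma."""
    return step_ps > k * sigma_ps


# Worst-case margin per corner, expressed in units of sigma.
margins = {corner: min(steps) / SIGMA_PS for corner, steps in FINE_STEP_PS.items()}
worst_corner = min(margins, key=margins.get)
```

With these numbers the smallest step occurs at the FF corner and still exceeds the 1.2 ps (4σ) threshold.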
 
Fig. 5.10. Simulated fine edge delay at process corners. Note that the impact of edge rates at the 
P/N mismatched corners is visible but quite small. 
[Plot axes: x, fine edge (0 to 32); y, step size (ps), 0 to 4.5. Series: TT, FF, FS, SF, SS, Linear (TT).]
 
5.5. RHBD Time-to-Digital Converter  
The time-to-digital converter (TDC) is used to measure the phase difference 
between the generated system clock and the reference (input) clock. The quantified and 
filtered phase differences are then used to compute the fine settings needed for subsequent 
clock edges to ensure DLL lock with minimum jitter. 
The TDC is constructed from the interpolator based fine delay line. The multiple 
temporally separated fine clock edges each clock a flip-flop, with the reference clock at the data 
input of each flip-flop. The flip-flops for which the reference clock meets the setup time 
store a logic ‘1’ (high) state, while the others capture a logic ‘0’ (low) 
state. Thus the time difference between the reference clock and the system clock is 
converted into a digital thermometer code as the circuit name implies. The RHBD TDC 
designs use three copies of the fine delay line along with TMR flip-flops for SET 
protection. The reference clock is buffered to drive the D inputs. The reference clock is 
delayed such that a 0° phase difference in the two input clocks to the TDC produces a 
thermometer code of (MSB-LSB)/2 or 16 for a 32 bit TDC under PVT conditions. This 
centers the thermometer code at the center of the fine delay chain and enables us to 
determine if the system clock is leading or lagging the reference clock. A thermometer 
 
Table 5.2. Designed delays of the fine delay line across corners. 
                        TT (ps)   FF (ps)   FS (ps)   SF (ps)   SS (ps)
Stage Inverter Delay    12.037    10.194    13.399    10.867    14.660
Fine Edge Rise-Rise      3.060     2.561     3.060     3.103     3.749
Fine Edge Fall-Fall      3.055     2.552     3.067     3.084     3.741
 
code greater than 15 denotes that the generated system clock is leading the reference clock 
and conversely a code less than 15 indicates phase lag. Majority gates as shown in Fig. 
  
Fig. 5.11. (a) RHBD TDC using TMR flip flops and triplicated delay lines. (b) Outputs are 
voted as shown. 
[Figure labels: (a) DDL Fine Delay Line Copies A, B and C, driven by DLL Clock A, B and C, each producing Fine Clock [0:31]; the Reference Clock, through a buffer for drive and delay matching, at the D inputs of the RHBD TMR FFs; outputs TDC out [0], TDC out [1]. (b) Maj gates voting TDC out [0], [1], [2] into TDC Thermometer code [0] and TDC out [1], [2], [3] into TDC Thermometer code [1].]
 
5.11 are used to vote between the Nth, (N+1)th and (N +2)th stages to correct any bubbles 
that may be accidentally introduced in the TDC thermometer code by noise. 
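The bubble correction described above can be sketched as a sliding majority vote over the raw flip-flop outputs. This is a behavioral sketch, not the TC-23 circuit: the treatment of the last two bits (which have no N+1 and N+2 neighbours) is an assumption made here by repeating the final bit, and the function names are hypothetical.

```python
# Sliding three-input majority vote over a raw TDC thermometer code.
def majority(a: int, b: int, c: int) -> int:
    """Three-input majority gate."""
    return (a & b) | (b & c) | (a & c)


def correct_bubbles(raw: list[int]) -> list[int]:
    """Vote bit N with bits N+1 and N+2 to remove isolated bubbles.

    Assumption: the code is padded by repeating its last bit so that the
    final two positions also see three inputs.
    """
    padded = raw + [raw[-1], raw[-1]]
    return [majority(padded[i], padded[i + 1], padded[i + 2])
            for i in range(len(raw))]


# A 32 bit code with 16 leading ones and a single bubble at position 5.
raw_code = [1] * 16 + [0] * 16
raw_code[5] = 0
voted_code = correct_bubbles(raw_code)
```

In this model the bubble is removed and the voted code is again monotonic. Because each output looks ahead by two positions, the 1-to-0 transition in the voted code lands one position earlier than in the raw code.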
5.6. TC-23 Test Structures 
The test chip (denoted TC-23) has multiple test structures that implement the 
proposed DLL components. Delays, delay variations and the effect of process variations 
will be measured from the post-fabrication circuit testing. Experimental validation of the 
TC-23 test structures can be used to analyze the effect of process variability on the DDL 
(coarse and fine), TDC and frequency control logic. TC-23 has the following test structures 
implemented: 
i) 64 stage coarse delay chain implemented as a digitally controlled oscillator 
ii) 32 stage fine delay chain for clocking a TDC 
 
Fig. 5.12. TC-23 test structure of the coarse delay line to generate a hardened coarse clock. 
 
iii) Single redundant TDC using the 32 stage fine delay chain 
iv) RHBD Frequency step logic for synchronous frequency stepping of coarse delay 
line (to be used as part of the TMR control logic) 
v) RHBD Frequency divider (used for both testing the clocks and in a possible RHBD 
PLL implementation)  
5.6.1. Digitally controlled oscillator  
TC-23 has a RHBD coarse delay based DDL configured as a digitally controlled 
oscillator or DCO test structure to study the effect of process variability on the delay step 
size as shown in Fig. 5.12. Three 64 stage multiplexer chains can be configured as three 
unhardened ring oscillators or as one TMR RHBD ring oscillator. Any change in delay due to a 
step change is measured as a change in oscillation frequency. An external reset 
(PADLLResetCN) synchronizes all three oscillator copies. The three oscillator outputs are 
 
Fig. 5.13. Layout of the TMR DCO test structure in TC-23. 
[Layout labels: Coarse Delay DDL A, B and C; Decoder + HTREE buffer A, B and C (two instances each).]
 
independently testable along with the majority voted output. This allows us to study the 
effect of process variations on each individual copy of the delay line along with effect of 
TMR on the oscillation frequency. The layout comprises Fig. 5.13. 
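The way a delay step is recovered from the measured oscillation frequency can be sketched as follows. The sketch relies on the standard ring oscillator relation that one period equals two traversals of the loop, and it assumes the measured signal may have been divided down by a known ratio before reaching the test equipment. The function names and the example numbers are hypothetical, not measured data.

```python
# Recover the DCO loop delay and the per-step delay from measured frequencies.
def loop_delay_ps(measured_freq_hz: float, divide_ratio: int = 1) -> float:
    """Loop delay of a ring oscillator: period = 2 x loop delay."""
    f_osc_hz = measured_freq_hz * divide_ratio   # undo the frequency divider
    return 1e12 / (2.0 * f_osc_hz)


def step_delay_ps(freq_before_hz: float, freq_after_hz: float,
                  divide_ratio: int = 1) -> float:
    """Delay added by one coarse step, from frequencies before and after it."""
    return (loop_delay_ps(freq_after_hz, divide_ratio)
            - loop_delay_ps(freq_before_hz, divide_ratio))


# Illustrative numbers only: a 1000 ps loop becomes a 1060 ps loop.
f_before = 1e12 / (2.0 * 1000.0)
f_after = 1e12 / (2.0 * 1060.0)
step_ps = step_delay_ps(f_before, f_after)
```

The loop delay obtained this way includes the H-tree buffer and selection path, so only the difference between two settings isolates the stage delay.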
The multiplexers used in the ring oscillator were designed with double cell height 
TSMC 90 standard cell AO cells for reduced variability and lower delay. The individual 
oscillator rings are spatially separated by 23.5 um for hardness. A 6:64 decoder decodes 
the set coarse select address to produce the coarse select signals for each copy of the ring 
oscillator. The selected output is fed back to the input of the oscillator by a balanced H-
tree buffer. The H-tree buffer is sized for low skew and the smallest propagation delay and 
is laid out within the layout of the address decoder. Two rows of decaps separate the 
different copies of decoders, providing at least 4 µm of spatial separation between the two 
decoder instances.  
5.6.2. Frequency Step Control  
In the digitally controlled ring oscillator, changing the coarse step asynchronously 
can cause the output of the delay determinant stage multiplexer (the multiplexer that 
propagates the input into the delay line and determines the delay and number of stages in 
the clock path) to change its output state. When changed asynchronously, it can cause a 
stable glitch or bubble to enter and begin traversing the ring. This bubble in turn, causes 
multiple edges to stably propagate through the ring, creating abrupt frequency changes. 
Consequently, changes in the coarse select word can occur only during a permissible timing 
window, during which the delay selection multiplexer has the same logic value at both 
of its inputs.  
 
5.6.2.1. Circuit Operation and Timing 
To change the frequency without introducing bubbles, the frequency step control 
logic is designed to detect a step change command and apply it only during the permissible 
timing window. In the DLL, a change in frequency involves either increasing or decreasing 
 
Fig. 5.14. Block diagram of the frequency step control logic. 
 
the frequency one step at a time, i.e., using an increment/decrement circuit. Fig. 5.14 shows 
the frequency step control logic.  
In TC-23 there is a shortage of available pins that were not previously allocated to 
other test blocks. Thus, we re-use signals. The backup high speed XOR clocks are unused 
when the DCO is running, so their pins are used as the inputs to asynchronously raise or lower 
the frequency (synchronized internally). The rising edge at the output of the delay line is 
used as a clock to sample the command input and detect a rising edge for frequency change 
(controlled by PAClkIn0 in TC-23 as shown). This is then used to enable the next clock to 
synchronously sample the next frequency for a change. Pin PAClkIN1 on TC-23 controls 
the type of frequency change desired (increment/decrement). PABypassCH is used to 
determine if we would load a new value to the Frequency Control Register (that stores the 
Coarse Frequency Step) or load an incremented/decremented address for decoding.  
The circuit timing is critical to the functioning of the circuit. A new control setting 
for the delay line (Coarse Select) must be applied only when the multiplexer has a logic 
zero at both inputs and at its output. The critical timing condition to be met for this is  
$$(T_{DMIN} + T_{HBUF}) < T_{CG} + T_{MULTIPLEXER} + T_{FCR} + T_{DEC} < 2(T_{DMIN} + T_{HBUF})$$
where $T_{CG}$, $T_{MULTIPLEXER}$, $T_{FCR}$, and $T_{DEC}$ are the propagation delays of the clock 
gater, multiplexer at the output of the clock gater, frequency control register, and the 
decoder, respectively. $T_{DMIN}$ is the minimum delay of the delay line when the delay line 
has the least number of stages in the delay path (i.e., the fastest frequency). $T_{HBUF}$ is the delay 
through the H-tree buffer that feeds the input to all the coarse stages. This circuit is an 
integral part of the TMR control logic for changing the coarse step in the DLL. 
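The timing condition can be expressed as a simple check. This is a sketch for illustration: the argument names mirror the delays defined above, and the numeric values in the example are hypothetical, not extracted TC-23 delays.

```python
# Check the permissible window for applying a new coarse select setting.
def step_window_ok(t_dmin: float, t_hbuf: float, t_cg: float,
                   t_mux: float, t_fcr: float, t_dec: float) -> bool:
    """True when the control path delay falls inside the permitted window.

    The control path (clock gater + multiplexer + frequency control
    register + decoder) must take longer than one minimum loop delay
    and less than two.
    """
    loop = t_dmin + t_hbuf
    control_path = t_cg + t_mux + t_fcr + t_dec
    return loop < control_path < 2.0 * loop


# Hypothetical delays in ps: loop = 160, control path = 240.
example_ok = step_window_ok(t_dmin=60, t_hbuf=100, t_cg=50,
                            t_mux=40, t_fcr=80, t_dec=70)
```

Because the window is defined against the minimum loop delay, a design that passes this check at the fastest setting also has at least that much margin at slower settings, assuming the control path delay does not change.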
 
5.6.3. Frequency Divider 
The high frequency unhardened and hardened clocks generated by the DCO are 
divided down to provide a frequency that is easily measured by the test equipment. Since 
the interesting part is the jitter and variability of the stage delays, this can be extracted from 
a divided clock. The TC-23 test structure has a synchronous frequency divider producing 
a divide-by-2, 4 or 8 clock output.  The test structure has a RHBD divider (using TMR 
FFs) for dividing the majority voted clock and an unhardened divider (using standard cell 
FFs) for dividing the unhardened clocks. Two independent dividers allow independent 
testing of the unhardened and hardened clocks, as the majority voting in the FFs used in the 
RHBD divider would cancel out any variation in between the three copies of the clocks 
that we wish to measure. Thus, the mismatch between the TMR blocks can be determined, 
as well as the impact of triplication on the jitter. The divider provides four modes of 
operation using two internal signals, F1 and F0 (Fig. 5.15). 
The circuit diagram of the frequency divider is shown in Fig. 5.16. The divider is 
implemented with toggle flip-flops to produce equal clock high and clock low phases, i.e., a 
50% duty cycle. The undivided clock path has a delay buffer that matches the clock-to-Q delay (TC-Q) of the 
flip-flop used in the divider, so that the divided and undivided clocks see equal delays. Signals F1 and F0 select between a 
  
Fig. 5.15. Divider modes. 
F1  F0  Mode
0   0   Divide by 2   (divided clock)
0   1   Divide by 4   (divided clock)
1   0   Divide by 8   (divided clock)
1   1   No division
When F1 AND F0 => No division
 
divided or undivided clock. The hardened frequency divider has the same circuit topology 
as the unhardened divider, with the only change being that all the standard cell flip-flops 
used in the unhardened divider are replaced with self-correcting TMR RHBD flip-flops. 
The divider is designed as a state machine which produces the appropriate toggle 
input to a toggle flip flop. The FFX0 and FFX1 flip-flops shown in Fig. 5.16 keep track 
of the state variables and produce toggle high/low depending on whether we are counting or 
changing states.  
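The divider behavior can be sketched at the sample level. This is a behavioral model, not the FFX0/FFX1 state machine of Fig. 5.16: it assumes the mode entries in Fig. 5.15 denote divide-by-2, 4 and 8 with F1 = F0 = 1 meaning no division, and it toggles the output every ratio/2 rising edges to give a 50% duty cycle. All names are hypothetical.

```python
# Behavioral model of a toggle-based divide-by-2/4/8 clock divider.
# Assumed mode encoding (F1, F0) taken from the divider mode table.
DIVIDE_RATIO = {(0, 0): 2, (0, 1): 4, (1, 0): 8, (1, 1): 1}


def divided_clock(samples: list[int], f1: int, f0: int) -> list[int]:
    """Divide a sampled clock; the output toggles every ratio/2 rising edges."""
    ratio = DIVIDE_RATIO[(f1, f0)]
    if ratio == 1:
        return list(samples)          # undivided clock is passed through
    out, level, count, prev = [], 0, 0, 0
    for s in samples:
        if s == 1 and prev == 0:      # rising edge of the input clock
            count += 1
            if count == ratio // 2:
                level ^= 1            # toggle flip-flop changes state
                count = 0
        prev = s
        out.append(level)
    return out


def count_rising_edges(samples: list[int]) -> int:
    """Count 0 to 1 transitions, assuming the signal starts low."""
    prev, edges = 0, 0
    for s in samples:
        edges += int(s == 1 and prev == 0)
        prev = s
    return edges


clock = [0, 1] * 8                    # eight input clock periods
```

Over eight input periods, this model produces four, two and one output rising edges in the divide-by-2, 4 and 8 modes, respectively.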
 
 
 Fig. 5.16. TC-23 frequency divider circuit diagram. 
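[Figure labels: flip-flops FFX0 and FFX1 holding state variables X0 and X1; a toggle flip-flop clocked by Clock; a delay element on the undivided clock to match the C-Q of the FF; an output multiplexer controlled by F1 and F0 selecting between the divided clock and the undivided clock to produce DivClk; PADLLBypassCH.]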
 
5.7. TC-23 test structure overview  
The TC-23 DCO frequency step control and the frequency divider are integrated 
into one test structure (Figs. 5.17 and 5.18 show the structure and layout, respectively). 
External pins can be used to select an external XOR clock or the oscillator clock. The on-
board register file (RF) loads the starting frequency address into the frequency control 
registers (which are part of the top entry). The oscillator can be reset, and the individual 
clocks and/or the TMR clock are available for testing in radiation environments. The 
frequency can be raised or lowered as desired and the desired divider outputs selected using 
external pins. All inputs are majority voted for protection.  
 
Fig. 5.17. TC-23 coarse frequency generator test structure top level block diagram. 
 
5.7.1. Time to Digital Converter Test Structure  
TC-23 also has a single redundant 32 bit TDC test structure for analyzing the TDC 
and fine delay line characteristics. A 32 stage fine delay line with 2 dummy stages for load 
balancing (identical to that described in Section 5.4.2) is used as a TDC with standard cell 90nm 
flip-flops to characterize the phase difference between two analog inputs from external pad 
pins. These external pad pins are routed to the TDC with nominally identical routing (this 
can be tuned experimentally by externally varying their initial phase difference) and are 
used to provide test inputs representing the DLL clock and the reference clock. 
 
Fig. 5.18. Coarse frequency generator test structure physical implementation. 
[Layout labels: TMR Coarse Delay Line; TMR Frequency step control; Freq Div.]
 
The TDC provides the phase difference between the two analog input signals with 
a 32 bit thermometer code. One logic state is used to represent the case of underflow (logic 
‘0’ in all flip-flops). The TDC is centered at the middle of the fine delay line with a phase 
difference of 0° producing a thermometer code of 16. A thermometer to binary converter 
is used to convert the 32 bit thermometer code to a 5 bit binary code that can be measured through external 
pins. The thermometer to binary converter is a 32:5 priority encoder that is synthesized. 
Fig. 5.19(a) shows the block diagram and Fig. 5.19(b) shows the physical implementation 
of the TDC.   
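The thermometer to binary conversion can be sketched as a priority encoder. The exact output coding of the synthesized 32:5 encoder is not given in the text, so this sketch makes two assumptions: the output is the zero-based index of the highest ‘1’ in the code, and the all-zero (underflow) case is reported with a separate valid flag. The function name is hypothetical.

```python
# Priority encoder sketch: 32 bit thermometer code to a 5 bit value.
def thermometer_to_binary(code: list[int]) -> tuple[int, bool]:
    """Return (index of the highest set bit, valid).

    Assumption: an all-zero code is the underflow case and is flagged
    with valid = False rather than being given its own 5 bit value.
    """
    for i in range(len(code) - 1, -1, -1):   # highest position has priority
        if code[i]:
            return i, True
    return 0, False


centered = thermometer_to_binary([1] * 16 + [0] * 16)
underflow = thermometer_to_binary([0] * 32)
```

A priority encoder of this form also tolerates a residual bubble below the highest ‘1’, since only the topmost set bit determines the output.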
 
Fig. 5.19. (a) Single redundant TDC test structure circuit and (b) layout. 
[Layout labels: Fine Delay Line; 5:32 thermometer encoder.]
 
5.8. Proposed AD-DLL Top Level Architecture 
The basic top-level architecture of the AD-DLL has been updated. The top level 
architecture of the proposed AD-DLL is as shown in Fig. 5.20. The key components of the 
AD-DLL are the digital delay line (DDL), which comprises the fine delay line (FDL) 
and the coarse delay line (CDL), the time-to-digital converter (TDC), the DDL and TDC 
input select multiplexers and the control unit. The CDL, FDL and TDC are custom blocks 
that were previously presented and implemented on TC-23. One key change is that the 
configuration of the DDL has been changed to have the FDL at the beginning to drive the 
CDL, rather than that which was proposed originally (where the CDL outputs to the FDL). 
As shown in Fig. 5.20, the DDLInClock now feeds the FDL. This provides the advantage 
that the interpolators used in the FDL and TDC are now merged. Thus the TDC receives 
 
Fig. 5.20. Top level architecture of the proposed AD-DLL. Note all clock signals are 
highlighted in blue, while control signals are shown in black. 
[Figure labels: FDL, CDL, TDC, Control Unit, Digital Delay Line (DDL), DDL Mux Sel, TDC Mux Sel; signals RefClock, FdbkClock, DDLInClock, TDCInClock, Fine edges (32), TDCVal (32), FDSel (5), CDSel (6), DLLReset, DLLLock.]
 
the fine edges generated by the FDL for delay comparison, rather than generating its own. 
This eliminates mismatch between the TDC and FDL, which required that two separate 
fine delay edges be synchronized across PVT corners. Now the TDC offset measured and 
the fine step adjustment required are computed with respect to the same timing edges. 
Throughout this chapter, we use the following definitions: 
• RefClock: The reference clock signal to which the DLL locks. 
• FdbkClock: The DDL generated clock that is fed back into the DDL when 
configured as an oscillator (DCO). Once locked, the FdbkClock is twice the 
frequency of the RefClock. 
• SysClock: The FdbkClock divided by two using a frequency divider. 
The SysClock is phase and frequency locked with the RefClock. 
• DDLInClock: The clock that is fed at the input of the DDL. Since the FDL is the 
first stage of the DDL, the DDLInClock produces the fine edges for use in the TDC. 
• TDCInClock: The clock that triggers the TDC flip-flops for capturing the 
TDC range. Since the DDLInClock produces the fine edges and the TDCInClock 
samples the time difference, these two clocks must come from 
two different clock sources at any given time in all modes of the DLL. As we discuss later on, in any given 
mode, the RefClock and FdbkClock are fed to the inputs of the DDL and TDC 
alternately. 
5.9. Modes of operation 
The AD-DLL has two modes of operation: Coarse Lock Acquisition (CLA mode) 
and the Lock mode. The CLA mode is entered when the DLL loses lock or when the DLL 
 
is reset. In the CLA mode, the prime goal of the DLL is to first achieve the closest coarse 
delay setting which brings the system into the fine tuning range. Once the DLL achieves 
CDL lock, the FDL is tuned to reach a fine setting close to the actual frequency. Once this 
is achieved, the DLL is said to be locked and can then exit the CLA mode and move to the 
Lock mode. In the Lock mode, the prime goal of the DLL is to track the RefClock. In the 
Lock mode, the control logic implements a low pass digital loop filter for maintaining lock. 
This DLL mode also implements dithering to achieve better jitter characteristics. As the 
design is of a DLL rather than a traditional PLL, the jitter does not accumulate beyond one 
RefClock cycle. The mode of the DLL is determined by the DLLLock signal (0 = CLA 
mode; 1 = Lock mode). Each of these two modes, and the operation and 
characteristics of the AD-DLL in each, is explained in detail below.  
5.9.1. Coarse Lock Acquisition (CLA) Mode 
In the CLA mode, the DLL is not locked. This mode initially calibrates the coarse 
delay line to the required delay setting and then calibrates the fine delay line to get to the 
closest fine delay step required. Initially in this mode, the coarse delay line is set to different 
coarse delay values and the resulting output clock edges are compared with the next 
RefClock edge in the TDC to determine if the delay is satisfactory. Upon entering this 
mode, the CDL is set to its mid-range value (in this case 32 for a possible 64 stage CDL). 
Then the delays are progressively increased or decreased one stage delay at a time until the 
desired delay is achieved. This coarse delay adjustment is carried out as needed until the 
difference between the FdbkClock and the RefClock edges falls within the TDC range. Once the 
measured delay is within the TDC range, the CDL is fixed and only the FDL is 
subsequently adjusted to achieve the necessary delay lock. While we initially determine 
 
coarse delays that get within the TDC range of operation, this may not be the optimal coarse 
setting. Hence, once we reach the TDC range, the coarse and fine delays are recalibrated 
to bring the FDL values near the mid-range of the TDC (settings 8-24 out of a possible 0-
32 overall TDC range). This gives the FDL maximum margin during lock mode, to track 
the input RefClock while simultaneously avoiding running off the max-min values. There 
is sufficient overlap in the delays to allow this centering. 
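The acquisition sequence described above can be sketched as a behavioral search. This is a simplified model, not the control unit: it treats the DDL delay as coarse x 60 ps + fine x 5 ps (the approximate step sizes quoted earlier), takes the TDC range as half the fine range on either side of the current delay, and recenters by trading one coarse step for 12 fine steps. All names, and the idea of passing a target delay directly, are assumptions made for illustration.

```python
# Behavioral sketch of coarse lock acquisition followed by recentering.
COARSE_STEP_PS = 60.0    # approximate coarse step from the text
FINE_STEP_PS = 5.0       # approximate fine step from the text
N_COARSE = 64
N_FINE = 32
FINE_PER_COARSE = int(COARSE_STEP_PS / FINE_STEP_PS)   # 12 fine steps


def ddl_delay_ps(coarse: int, fine: int) -> float:
    """Idealized DDL delay for a given coarse and fine setting."""
    return coarse * COARSE_STEP_PS + fine * FINE_STEP_PS


def acquire_lock(target_ps: float) -> tuple[int, int, int]:
    """Return (coarse, fine, coarse steps taken) for a target delay."""
    coarse, fine = N_COARSE // 2, N_FINE // 2          # start at mid-range
    tdc_half_range_ps = (N_FINE // 2) * FINE_STEP_PS   # assumed TDC reach
    coarse_steps = 0
    # Step the CDL one stage at a time until the error is within TDC range.
    while abs(target_ps - ddl_delay_ps(coarse, fine)) > tdc_half_range_ps:
        coarse += 1 if target_ps > ddl_delay_ps(coarse, fine) else -1
        if not 1 <= coarse <= N_COARSE:
            raise ValueError("target delay is outside the coarse range")
        coarse_steps += 1
    # With the CDL fixed, pick the closest fine setting.
    fine = round((target_ps - coarse * COARSE_STEP_PS) / FINE_STEP_PS)
    fine = max(0, min(N_FINE - 1, fine))
    # Recalibrate so that the fine setting sits near mid-range (8 to 24).
    while fine > 24 and coarse < N_COARSE:
        coarse, fine = coarse + 1, fine - FINE_PER_COARSE
    while fine < 8 and coarse > 1:
        coarse, fine = coarse - 1, fine + FINE_PER_COARSE
    return coarse, fine, coarse_steps
```

For example, `acquire_lock(2600.0)` steps the coarse setting up from 32 and then recenters, ending with a fine setting inside the 8 to 24 window. In the actual DLL each coarse step costs two RefClock cycles, which this model does not represent.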
Each coarse lock acquisition step requires two RefClock cycles. Since every 
FdbkClock has to be phase locked to the RefClock, every FdbkClock during CLA is 
launched off a given RefClock edge. The first reference clock rising edge is input to the 
delay line (DDL) as shown conceptually in Fig. 5.21(a). We ensure that all stages in the 
DDL are at logic ‘0’ and the desired delay is set in the CDL prior to this. Thus, the rising 
 
Fig. 5.21. Loop operation in the CLA mode. The multiplexer selects and other control 
signals have not been shown for clarity. 
[Panel labels: (a) first rising edge of RefClock; (b) oscillator mode for one RefClock cycle, fine edges produced by FdbkClock; (c) second rising edge of RefClock; (d) second falling edge of RefClock. Each panel shows the FDL, CDL and TDC with RefClock, FdbkClock and the fine edges.]
 
edge of the RefClock ripples through the DDL. The output of the DDL is a delayed falling 
edge. This is fed back to the input of the DDL, effectively configuring it as a DCO as shown 
in Fig. 5.21(b). The DCO produces oscillations whose frequency is based on the delay set 
in the CDL (The FDL is constantly held at mid setting). Thus the fine edges are produced 
by the FdbkClock. At the next RefClock rising edge as shown in Fig. 5.21(c), the TDC is 
triggered (by RefClock edge) to capture the difference between the FdbkClock and 
RefClock. Based on the TDC output, the control logic decides to increment or decrement 
the delay in the coarse lock acquisition cycle. At the next falling edge (Fig. 5.21(d)) of the 
RefClock, the multiplexer select is enabled to propagate the RefClock to the input of the 
DDL. Thus the RefClock (at logic ‘0’) resets all the stages of the delay line to logic low to 
set up for the next coarse lock acquisition cycle. During this low phase of the RefClock 
and the DDL, the number of stages in the delay line can be changed without introducing a 
 
Fig. 5.22. Timing waveforms for the CLA mode operation. 
[Waveform labels: RefClock, DDLMuxSel, DDLInClock. Annotations: in CLA mode the first RefClock rising edge is launched into the DDL; the TDC measures this difference; at this point DDLInClock is made 0 to set up for sending the rising RefClock edge through the DDL at the next CLA cycle; one CLA cycle = 2 RefClock cycles; one half phase is available to change the coarse step without causing propagating bubbles; the next CLA cycle uses the updated CDL setting.]
 
propagating glitch through the delay line. This eliminates the complex timing windows 
within which the CDL could be changed as previously required, by providing an entire 
RefClock phase to set the CDL.  
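The sequencing of the DDL input multiplexer over one CLA cycle can be sketched as a small edge-driven state machine. The sketch follows the description above (RefClock selected before the first rising edge, feedback selected after it, RefClock selected again at the second falling edge) and assumes a select value of 1 routes RefClock and 0 routes FdbkClock. The function name and the edge encoding are hypothetical.

```python
# Edge-driven sketch of DDLMuxSel in the CLA mode.
# Assumed convention: 1 selects RefClock, 0 selects FdbkClock (DCO loop).
def cla_ddl_mux_select(ref_edges: list[str]) -> list[int]:
    """Return the select value after each RefClock edge ('rise' or 'fall')."""
    select = 1                      # RefClock is selected before the first edge
    rising = falling = 0
    trace = []
    for edge in ref_edges:
        if edge == "rise":
            rising += 1
            if rising == 1:
                select = 0          # edge launched; close the loop as a DCO
        elif edge == "fall":
            falling += 1
            if falling == 2:
                select = 1          # RefClock low phase resets the delay line
                rising = falling = 0
        else:
            raise ValueError("edge must be 'rise' or 'fall'")
        trace.append(select)
    return trace


two_cycles = cla_ddl_mux_select(["rise", "fall", "rise", "fall"] * 2)
```

The trace repeats every four RefClock edges, matching the statement that one CLA cycle takes two RefClock cycles.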
The settings are incremented or decremented by one stage at a time as it is easier to 
implement a shift register based coarse delay select rather than using a dedicated coarse 
select decoder. Thus increasing or decreasing the number of coarse stages is a right or left 
shift that consumes less area, time, and power. More importantly, changing the CDL setting 
can now be accomplished without timing races or hold failures in 
the combinational logic of a decoder introducing glitches that can 
catastrophically affect operation. 
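A shift register based coarse select can be sketched as follows. This is an illustrative model assuming the thermometer code is stored with its ones at the low-index end, that at least one stage always remains selected, and that the code saturates at all ones. The function name is hypothetical.

```python
# Shift register sketch for stepping the coarse select thermometer code.
def step_coarse(code: list[int], increase: bool) -> list[int]:
    """Shift the thermometer code by one stage.

    Increasing shifts a '1' in at the low end; decreasing shifts a '0'
    in at the high end. Assumption: at least one stage stays selected.
    """
    if increase:
        return [1] + code[:-1]          # saturates when the code is all ones
    if sum(code) <= 1:
        return list(code)               # keep the minimum of one stage
    return code[1:] + [0]


mid_code = [1] * 32 + [0] * 32          # 32 of 64 stages selected
```

Only one bit of the code changes per step, which is why no decoder glitches can appear on the select lines in this scheme.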
One key operational constraint to ensure in the CLA mode is to prevent the DLL 
from frequency aliasing. This occurs when the DLL locks on to a frequency that is a 
multiple of the required frequency. This is accomplished by ensuring that in the DCO 
 
Fig. 5.23. Timing waveforms for the Lock mode operation. 
[Waveform labels: RefClock, DDLMuxSel, DDLInClock. Annotations: in Lock mode every RefClock rising edge is launched into the DDL; one lock cycle = 1 RefClock cycle; the TDC measures this difference; the fine delay setting can be changed in this phase.]
 
mode, the FdbkClock has exactly two clock periods during one RefClock period. Thus, if the 
FdbkClock edges are within the TDC range but the FdbkClock has either fewer than 
two or more than two falling edges in one RefClock cycle, that delay setting of the CDL is 
ignored and the DLL proceeds to the next frequency. 
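The aliasing check reduces to counting FdbkClock falling edges inside one RefClock cycle. The sketch below assumes the edge times are available as a list measured from the start of the RefClock cycle, which is a modeling convenience and not how the hardware counters receive them. Names and example values are hypothetical.

```python
# Detect frequency aliasing by counting FdbkClock falling edges.
def is_aliased(fdbk_falling_edges_ps: list[float], ref_period_ps: float,
               expected_edges: int = 2) -> bool:
    """True when the falling edge count in one RefClock cycle is not two."""
    count = sum(1 for t in fdbk_falling_edges_ps if 0.0 <= t < ref_period_ps)
    return count != expected_edges


REF_PERIOD_PS = 4000.0
locked = is_aliased([1000.0, 3000.0], REF_PERIOD_PS)                   # 2 edges
too_fast = is_aliased([500.0, 1500.0, 2500.0, 3500.0], REF_PERIOD_PS)  # 4 edges
too_slow = is_aliased([2000.0], REF_PERIOD_PS)                         # 1 edge
```

Only the first case is accepted; the other two correspond to settings that would be ignored during acquisition.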
The inputs to the DDL and TDC are controlled by select multiplexers 
(DDLMultiplexer and TDCMultiplexer). These multiplexers are precisely timed to 
select the necessary input to the DDL and TDC as was seen in Figs. 5.21 (a-d). The design 
of the control signals for timing these edges (DDLMultiplexerSel and TDCMultiplexerSel 
in Fig. 5.20) will be discussed in detail in the later sections. 
5.9.2. Lock Mode 
When the DLL enters the lock mode, the coarse and fine delay lines have been set 
to their lock setting. The DLL is now phase and frequency locked to the RefClock. The 
main goal of the lock mode is to track long-term reference frequency variations. As the 
 
Fig. 5.24. Multiplexer configurations for different modes. 
[Figure labels: Control Unit with sub-blocks Generate TDCMuxSel, Generate DDLMux Sel, Generate_CD_FD select and TDC priority encoder; signals RefClock, FdbkClock, TDCVal (32), TDCout (5), FDSel, CDSel, DDLMuxSel, TDCMuxSel, DLLReset, DLLLock.]
 
DLL does not accumulate jitter from its previous phase to the next, and is phase locked to 
the RefClock, every rising edge of the RefClock is input to the DDL. The DLL is 
configured as a DCO from then until the next rising edge of the RefClock. When delay 
locked, the DCO ring must produce two oscillations and the third rising edge of the 
FdbkClock is compared with the RefClock in the TDC. This time offset is output to the 
control logic, which determines the next FDL setting for the DLL. This is computed by the 
loop filter and the feedback algorithm that also implements dithering. In this mode, 
only the FDL is tuned to track the RefClock. However, if the FDL select gets too close to 
either of the FDL edges (0-7 or 24-31), the CDL is adjusted to bring this value back 
close to the FDL center range. This coarse adjustment only happens one step at a time, and 
if a time delay greater than one coarse step is needed, the DLL exits the Lock mode and 
returns to the CLA mode.  
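The recentering rule in the Lock mode can be sketched as a single update step. The loop filter and dithering are not modeled: the sketch assumes the filter has already produced a signed correction in fine steps, and that one coarse step is worth 12 fine steps (60 ps / 5 ps, from the approximate step sizes quoted earlier). The function name and return convention are hypothetical.

```python
# Sketch of one Lock mode update with coarse recentering.
FINE_PER_COARSE = 12   # assumed: one ~60 ps coarse step = twelve ~5 ps fine steps
N_COARSE = 64


def lock_mode_update(coarse: int, fine: int,
                     fine_correction: int) -> tuple[int, int, bool]:
    """Apply a correction in fine steps; return (coarse, fine, still locked)."""
    if abs(fine_correction) > FINE_PER_COARSE:
        return coarse, fine, False        # more than one coarse step: unlock
    fine += fine_correction
    if fine >= 24:                        # too close to the upper FDL edge
        coarse, fine = coarse + 1, fine - FINE_PER_COARSE
    elif fine <= 7:                       # too close to the lower FDL edge
        coarse, fine = coarse - 1, fine + FINE_PER_COARSE
    if not 1 <= coarse <= N_COARSE:
        return coarse, fine, False        # ran out of coarse range
    return coarse, fine, True
```

For instance, `lock_mode_update(32, 22, 4)` moves the coarse setting up by one and returns the fine setting to the center window, leaving the total modeled delay unchanged apart from the correction.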
Unlike the CLA mode, in locked mode, since every rising edge of the RefClock is 
input to the DDL, all fine edges are produced by the FDL as delayed versions of the 
RefClock. Thus the TDCInClock is triggered by the FdbkClock that is produced by the 
DDL. FdbkClock is twice the frequency of the RefClock and so in the Lock mode, we 
Mode   Fine edges for TDC produced by   TDC clocked by
CLA    FdbkClock                        RefClock
Lock   RefClock                         FdbkClock
Fig. 5.25. Multiplexer configurations for different modes. 
[Multiplexer schematic labels: in both modes the DDL multiplexer selects between FdbkClock and RefClock under DDLMuxSel to drive DDLInClock; the TDC multiplexer, under TDCMuxSel, drives TDCInClock from 0 or RefClock in the CLA mode and from 0 or FdbkClock in the Lock mode.]
 
should see two FdbkClock falling edges in every RefClock cycle. The controller 
continuously checks for this condition, and if this is not the case at any given point in time, 
frequency aliasing is detected. DLLLock is then deasserted and the DLL moves back to 
CLA mode to recapture. Implementation details such as the TDC and DDL multiplexer 
selects and the digital loop filter algorithm used will be discussed in detail later. 
5.10. Control Unit Design 
As the other custom blocks of the DLL, such as the TDC, CDL and FDL, have been 
discussed in detail previously, this section focuses on the design of the control 
unit, which is written in Verilog HDL to be synthesized and routed in commercial EDA tools 
(Cadence Encounter). Fig. 5.24 shows the constituent sub-blocks of the control unit as 
designed. The TDC sampled data (TDCVal) is a 32 bit thermometer code that is converted 
into a 5 bit binary number by the TDC priority encoder sub-block. The measured value of 
the TDC is then used by the Generate_CD_FD_select sub-block to compute the necessary 
value of coarse and fine delays. In the CLA mode this sub-block varies the CDL and FDL 
selects to achieve the necessary delay for lock and in the Lock mode this sub-block 
implements a digital low pass filter and primarily alters the FDL select to track the locked 
RefClock frequency. The TDC priority encoder is the same design as the one in the 
previous proposal implemented in TC-23. 
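As an illustration of the thermometer-to-binary conversion, the following is a minimal behavioral sketch of a priority encoder. The bit ordering (bit 0 is the first TDC stage) and the mapping of an all-zero input to 0 are assumptions; the actual encoder is the Verilog block referred to above.

```python
def thermometer_to_binary(tdc_val, width=32):
    """Return the index (0..width-1) of the highest set bit of tdc_val.

    A priority encoder looks only at the highest set bit, so a bubble
    (an isolated 0 below the top 1) does not change the result.
    Assumption: an all-zero input is encoded as 0.
    """
    if not 0 <= tdc_val < (1 << width):
        raise ValueError("tdc_val does not fit in the thermometer code width")
    return max(tdc_val.bit_length() - 1, 0)
```

For a 32 bit input the result always fits in 5 bits, since the largest index is 31.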
One of the key control unit functions is to implement the multiplexer selects that 
control the inputs of the DDL and the TDC at the appropriate times. The sub-blocks 
GenerateTDCMultiplexerSel and GenerateDDLMultiplexerSel determine the mode the 
DLL is currently in and generate the appropriate select signals. The different multiplexer 
configurations are shown in Fig. 5.25, with different inputs in different modes. The 
following sections discuss the design of the multiplexer selects in detail. 
5.10.1. DDL Multiplexer Select Design 
As shown in Fig. 5.25, the mode the DLL is in determines the multiplexer select 
behavior. In the CLA mode, before the first rising edge of the RefClock, the DDL 
multiplexer select is always high. This allows the RefClock rising edge to propagate to 
the DDL input. This RefClock rising edge triggers the de-assertion of the DDL multiplexer 
select. FdbkClock now propagates to the DDL input, making the DDL a DCO. All stages in 
the DDL are ensured to be low before the first rising edge of the RefClock. Thus the 
FdbkClock (which has one extra inversion) is asserted high and does not cause any 
bubbles to propagate through the DDL when the DDL multiplexer select is de-asserted. 
The second falling edge of the RefClock asserts the DDL multiplexer select, which again 
connects RefClock to the DDLInClock. The falling edge of the RefClock propagates 
through the DDL, forcing all stages low, and thus sets up for the next CLA cycle, which 
begins at the next RefClock rising edge. 

Fig. 5.26. Design of the DDL Multiplexer select sub-block. The flow chart and 
corresponding waveforms in each of the DLL modes are shown. 
[Flow chart, branching on DLLLock. CLA mode (N): at the first rising edge of RefClock, 
reset DDLMuxSel = 0; count RefClock rising and falling edges; at the second falling edge 
of RefClock, set DDLMuxSel = 1 and reset all counts to 0. Lock mode (Y): at the first 
rising edge of RefClock, reset DDLMuxSel = 0; count RefClock rising and FdbkClock falling 
edges; at the second falling edge of FdbkClock, set DDLMuxSel = 1 and reset all counts 
to 0.] 
[Waveforms of RefClock, DDLMuxSel, DDLInClock and FdbkClock. CLA mode: the first 
RefClock rising edge is launched into the DDL (DDLInClock = RefClock); the DDL is then 
set as a DCO for almost 3 clock phases (DDLInClock = FdbkClock); DDLInClock is then made 
0 to set up for sending the RefClock rising edge through the DDL at the next clock cycle. 
The TDC measures the resulting difference. One coarse calibration cycle = 2 RefClock 
cycles, with one half phase available to change the coarse step without causing 
propagating bubbles. Lock mode: the first RefClock edge is launched into the DDL and 
ripples through it; the DDL is then made a DCO and FdbkClock is fed to DDLInClock.] 

In the Lock mode, the DDL multiplexer select is high before the rising edge of the 
RefClock. This allows the rising RefClock edge to propagate through to the DDLInClock. 
This edge also de-asserts the DDL multiplexer select, which places the DDL in DCO mode. 
At the second falling edge of the FdbkClock, the DDL multiplexer select is again asserted, 
causing the RefClock’s logic ‘low’ to propagate all the way through the DDL, setting up 
the DDL for the next RefClock cycle. All the RefClock and FdbkClock rising and falling 
edges are counted by counters that are reset appropriately. 
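The select behavior in both modes can be summarized by a small event-driven model. This is only a behavioral sketch of the sequence described above, not the synthesized logic; the class name, the edge labels ('ref_rise', 'ref_fall', 'fdbk_fall') and the single shared falling-edge counter are assumptions.

```python
class DDLMuxSelect:
    """Behavioral model of DDLMuxSel (1 selects RefClock, 0 selects FdbkClock)."""

    def __init__(self):
        self.sel = 1        # high before the first RefClock rising edge
        self.ref_rises = 0  # RefClock rising edges seen in this cycle
        self.falls = 0      # RefClock falls (CLA) or FdbkClock falls (Lock)

    def on_edge(self, edge, dll_lock):
        """Process one edge: 'ref_rise', 'ref_fall' or 'fdbk_fall'."""
        counted_fall = 'fdbk_fall' if dll_lock else 'ref_fall'
        if edge == 'ref_rise':
            self.ref_rises += 1
            if self.ref_rises == 1:
                self.sel = 0            # DDL now fed by FdbkClock (DCO)
        elif edge == counted_fall and self.ref_rises >= 1:
            self.falls += 1
            if self.falls == 2:
                self.sel = 1            # reconnect RefClock to DDLInClock
                self.ref_rises = 0      # reset all counts
                self.falls = 0
        return self.sel

    def ddl_in_clock(self, ref_clock, fdbk_clock):
        """Multiplexer output for the current select value."""
        return ref_clock if self.sel else fdbk_clock
```

In CLA mode the select returns high at the second RefClock falling edge, so one calibration cycle spans two RefClock cycles; in Lock mode it returns high at the second FdbkClock falling edge, within a single RefClock cycle.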
5.10.2. TDC Multiplexer Select 
The TDC multiplexer select is complementary to the DDL multiplexer select. The 
TDC multiplexer select logic is designed to ensure that, at any given instant, the two 
inputs to the TDC, namely the fine edges and the TDCInClock, do not originate from the 
same clock. As with the DDL multiplexer select, in the CLA mode, the TDCInClock receives 
the second rising edge of the RefClock and logic ‘low’ at all other times. This is ensured 
by asserting the TDC multiplexer select at the second falling edge of the RefClock and 
de-asserting it at the (subsequent) first falling edge of the next CLA cycle. In Lock mode, 
the TDC multiplexer select is asserted at the second falling edge of the FdbkClock and 
de-asserted by the (subsequent clock cycle’s) first falling edge, as shown in Fig. 5.27. 
This ensures that the third rising edge of the FdbkClock propagates to clock the TDC. The 
fine edges are produced by the first RefClock rising edge propagating through the DDL. 
As evident in Fig. 5.25, the inputs to the TDC multiplexer are different in different 
modes. This is implemented by series multiplexers as shown in Fig. 5.28, with the mode 
the DLL is currently in (denoted by the DLLLock) choosing the TDC multiplexer inputs. 
To maintain equal delays (via equal loading) between the TDC and DDL multiplexers, the 
DDL multiplexer has a dummy series multiplexer that matches the TDC input multiplexer. 
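The series multiplexer arrangement amounts to the following selection, shown here as a minimal sketch. The function name and argument order are assumptions; the source selection per mode follows Fig. 5.25.

```python
def tdc_in_clock(ref_clock, fdbk_clock, tdc_mux_sel, dll_lock):
    """Series multiplexers driving TDCInClock.

    DLLLock chooses the clock source (RefClock in CLA mode, FdbkClock in
    Lock mode); TDCMuxSel then passes that clock or forces logic 0.
    """
    source = fdbk_clock if dll_lock else ref_clock
    return source if tdc_mux_sel else 0
```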
 
Fig. 5.27. Design of the TDC Multiplexer select sub-block. The flow chart and the 
waveforms for each mode are presented. 
[Flow chart, branching on DLLLock. CLA mode (N): at the first falling edge of RefClock, 
set TDCMuxSel = 1; count RefClock falling edges; at the second falling edge of RefClock, 
reset TDCMuxSel = 0 and reset the count to 0. Lock mode (Y): at the second falling edge of 
FdbkClock, set TDCMuxSel = 1; count FdbkClock falling edges; at the first falling edge of 
RefClock, reset TDCMuxSel = 0 and reset the count to 0.] 
[Waveforms of RefClock, TDCMuxSel and TDCInClock in CLA mode, and of RefClock, 
FdbkClock, TDCMuxSel and TDCInClock in Lock mode. CLA mode: the first RefClock edge 
must not be sent to the TDC; the second RefClock edge clocks the TDC to measure the 
FdbkClock-RefClock offset; the first falling edge is used to set the multiplexer select and 
the second falling edge to reset it. TDCInClock equals RefClock (CLA mode) or FdbkClock 
(Lock mode) while the select is set and is 0 otherwise. The TDC measures the difference 
marked in the waveforms.] 

5.10.3. Generate CD and FD Selects 
The Generate_CD_FD_Select sub-block in the Control unit generates the 
appropriate coarse and fine delay line selects. In the CLA mode, this sub-block initially 
incrementally varies the coarse delay, and subsequently the fine delay, to achieve a 
frequency lock. Once the DLL is in the lock mode, this sub-block tracks the RefClock 
signal, by adjusting the fine delay selects. This is implemented as a low-pass filter to avoid 
tracking high frequency noise on the RefClock.  
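To make the filtering idea concrete, one hypothetical form of such a low-pass filter is sketched below. The dissertation does not specify the filter: the moving-average structure, the window of 8 samples, the threshold and the direction of the fine-delay step are all assumptions chosen for illustration only.

```python
from collections import deque


class MovingAverageFilter:
    """Average the TDC offset from its target over the last `window` samples."""

    def __init__(self, window=8, target=15):
        self.samples = deque(maxlen=window)
        self.target = target

    def update(self, tdc_out):
        """Add one TDC reading and return the averaged offset from target."""
        self.samples.append(tdc_out - self.target)
        return sum(self.samples) / len(self.samples)


def fd_step(avg_error, threshold=1.0):
    """Return -1, 0 or +1: step the fine delay only for a sustained offset.

    The sign convention (positive error steps the select down) is assumed.
    """
    if avg_error >= threshold:
        return -1
    if avg_error <= -threshold:
        return 1
    return 0
```

With these assumed values, a single noisy reading of 19 among seven readings of 15 averages to 0.5 and causes no step, whereas eight consecutive readings of 17 average to 2.0 and cause one fine-delay step.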
The algorithm used for generating the CD/FD selects in the CLA mode is shown 
in Fig. 5.29. Initially, as the DLL is reset, the signals DLLLock, CDLock and FDLock are 
de-asserted, placing the DLL in CLA mode. Here the initial value of the FD select is set to 
15 (mid value of the possible 32 selects) and the coarse delay is varied in steps from the 
mid value (32) up or down until coarse lock is achieved. However, we need to ensure that 
the control unit counts only two falling edges in each RefClock cycle. A greater or lesser 
number of edges indicates frequency aliasing. Once the difference between the RefClock and 
FdbkClock is significant enough to be recorded by the TDC, the DLL has achieved CDLock 
and proceeds to achieve FDLock. For FDLock we need the measured TDC offset to be at 
the mid value 15. (TDC < 15 indicates lag, TDC > 15 indicates lead and TDC = 1 indicates 
the signals have no phase difference.)  

Fig. 5.28. Design of the TDC Multiplexer to incorporate both modes. 
[Schematic of series multiplexers with signals RefClock, FdbkClock, logic 0, DLLLock and 
TDCMuxSel, and output TDCInClock.] 

Hence incrementing or decrementing the FDSel continues until the TDC measures 
the required offset of 15. At this point CLA mode could be exited. However, it is important 
to ensure that FDselect is not close to an edge of the FD select range to provide greater 
margin to track the RefClock signal as it moves. 
 
Fig. 5.29. Algorithm for generating the coarse and fine delay selects in CLA mode. 
[Flow chart. DLLLock? TRUE: Lock mode (TBD). FALSE: CLA mode, which tests CDLock. 
While CDLock is FALSE, the FdbkClock count is tested (cases <2, =2 and >2; a count other 
than 2 indicates frequency aliasing) and then TDC Out (cases <1, 1<TDCout<30 and >30); 
CD select is incremented or decremented by 1 in the out-of-range cases, and CDLock = TRUE 
is set when 1<TDCout<30. Once CDLock is TRUE, TDC Out is compared with 15 (cases <15, 
=15 and >15); FD select is incremented or decremented by 1 until TDC Out = 15, which sets 
FDLock = TRUE. Coarse adjustment on FD select: if FD select < 8, FD select = FD select + 8 
and CD select = CD select - 1; if FD select > 23, FD select = FD select - 8 and 
CD select = CD select + 1; if 8 < FD select < 23, DLLLock = TRUE.] 

The coarse and fine delay lines have nominal delays of 40 ps and 5 ps, respectively 
(i.e., one CD step equals 8 FD steps). Thus, the controller ensures that the FD select is at 
least 8 settings away from either edge by readjusting the CD select up or down as needed 
to bring it within this range. Once the CD and FD are locked with the fine delay in the 
acceptable range, DLLLock is asserted to indicate that the DLL has now locked to the 
RefClock, and the DLL exits the CLA mode to the Lock mode.  
The algorithm for the Lock mode is not discussed in this work; it is open for future 
implementation and is hence marked as TBD in Fig. 5.29. 
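The CLA-mode search can be sketched as a single update step per calibration cycle. This is a behavioral illustration under stated assumptions, not the Verilog Generate_CD_FD_Select block: the names ClaState and cla_step, the direction of each increment or decrement, and the treatment of the boundary readings 1 and 30 as in range are assumptions. The nominal step sizes (40 ps and 5 ps) and the FD select thresholds (8 and 23) are taken from the text and Fig. 5.29.

```python
from dataclasses import dataclass

CD_STEP_PS = 40                        # nominal coarse delay step
FD_STEP_PS = 5                         # nominal fine delay step
FD_PER_CD = CD_STEP_PS // FD_STEP_PS   # 8 fine steps per coarse step


@dataclass
class ClaState:
    cd_sel: int = 32       # coarse select starts at its mid value
    fd_sel: int = 15       # fine select starts at its mid value
    cd_lock: bool = False
    fd_lock: bool = False
    dll_lock: bool = False


def total_delay_ps(s):
    """Nominal delay contributed by the current selects."""
    return s.cd_sel * CD_STEP_PS + s.fd_sel * FD_STEP_PS


def cla_step(s, fdbk_fall_count, tdc_out):
    """Advance the CLA-mode search by one calibration cycle (mutates s)."""
    if s.dll_lock:
        return s                       # Lock mode is left as TBD in the text
    if not s.cd_lock:
        if fdbk_fall_count > 2:        # aliasing; step direction is assumed
            s.cd_sel += 1
        elif fdbk_fall_count < 2:
            s.cd_sel -= 1
        elif tdc_out < 1:              # offset below the TDC range
            s.cd_sel -= 1
        elif tdc_out > 30:             # offset above the TDC range
            s.cd_sel += 1
        else:
            s.cd_lock = True
    elif not s.fd_lock:
        if tdc_out > 15:
            s.fd_sel -= 1
        elif tdc_out < 15:
            s.fd_sel += 1
        else:
            s.fd_lock = True
    elif s.fd_sel < 8:                 # too close to the lower FD edge
        s.fd_sel += FD_PER_CD
        s.cd_sel -= 1
    elif s.fd_sel > 23:                # too close to the upper FD edge
        s.fd_sel -= FD_PER_CD
        s.cd_sel += 1
    else:
        s.dll_lock = True
    return s
```

Because one coarse step nominally equals 8 fine steps, the recentering branches trade one coarse step for 8 fine steps and leave total_delay_ps unchanged, so the fine select gains tracking margin without disturbing the nominal lock point.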
 
Fig. 5.30. ModelSim waveforms of the DLL working in the CLA mode. Note the RefClock 
and SysClock are locked and in sync after DLLLock is achieved. The FdbkClock 
is twice the frequency of the RefClock and is also phase locked with the 
RefClock. 
5.11. RTL Implementation and Future Work 
The proposed modules for both the custom and logic macros of the DLL are modeled 
in Verilog HDL and implemented. Currently, the Verilog model of the DLL locks on to the 
RefClock frequency in the CLA mode and generates the DLLLock signal. However, the 
DLL design does not yet track a time-varying RefClock signal. Fig. 5.30 shows that the 
FdbkClock begins at the initial frequency and progressively locks on to the frequency of 
the RefClock.  
The Verilog HDL has to be expanded to implement the Lock mode DLL 
functionality. At that point, the design can be synthesized and the non-idealities of the 
design measured and compensated for, to obtain a working gate-level implementation of 
the DLL. This will be triplicated for hardness. Once the synthesized design satisfactorily 
passes functional verification, it can be placed and routed in standard CAD tools 
(Cadence SOC Encounter). 
5.12. Conclusion 
This chapter presents a very high resolution all-digital approach for clock and data 
recovery in DDR memory channels that is amenable to a RHBD implementation using 
TMR to mitigate soft errors. The design is fully redundant and can run through a SET 
applied in any copy. The timing is voted in the feedback path, keeping the DDLs in 
synchronization. Basic building blocks have been fabricated in the TC-23 90-nm TSMC 
test chip. The top level architecture proposed in this chapter is being implemented and 
awaits further exploration.  
CHAPTER 6. SUMMARY 
This dissertation provides techniques for designing hardened-by-design circuits for 
the entire clock network. In particular, the design of pulse-clocked latches for hardened 
ICs, hardened custom clock spines and ASIC clock distribution networks, methodology to 
evaluate the vulnerabilities of clock trees, an integrated redundant clock tree and temporal 
pulse clocked latch design for low power radiation hardened applications and an approach 
to designing an all-digital DLL for RHBD DDR applications are some of the unique 
contributions of this dissertation. A brief summary of all the chapters in this dissertation 
is provided below.  
Chapter 1 provided a brief overview of the radiation environment, the different 
effects of radiation on circuits and common mitigation techniques. Chapter 2 provided 
background details and discussed the basics of a clock network. In particular, the concept 
and designs for the different components of the clock network: sequential elements (latches 
and flip-flops), clock distribution (using tree, mesh or spine based topologies) and clock 
generation (using different PLL or DLL topologies) were presented in detail. The 
subsequent chapters provided detailed techniques and circuits to harden these components 
of the clock network. 
Chapter 3 focused on the different RHBD approaches for designing sequential 
elements: flip-flops and pulse latches. This chapter proposed CAD techniques for using the 
state-of-the-art self-correcting TMR master-slave flip-flop in structured ASIC designs with 
little design effort and faster design cycles. The design of hardened pulse-clocked latches 
was proposed for the first time. This was then implemented using an AES engine as a test 
vehicle and its hardness verified by broad beam testing of the test chip. A novel 
temporal pulse latch design using temporally separated clock pulses that is 40% smaller 
and consumes 70% less power than previously designed temporal flip-flops was also 
proposed and implemented for experimental verification. 
Chapter 4 discussed design techniques for hardened clock distribution networks 
using both custom and CAD tool based ASIC clock trees for microprocessors. 
Characteristics of both a RHBD clock spine and structured ASIC trees in the HERMES2 
microprocessor were analyzed. The spine was fabricated and tested using two 90 nm 
process variants and was found to be experimentally hard to over 100 MeV-cm²/mg with 
only one recorded clock error. An empirical approach to evaluating the vulnerability of 
clock trees was also proposed. This framework for analyzing the tree vulnerability was then 
used to design an integrated redundant clock tree and temporal pulse-clocked latch based 
clocking architecture that is hard to both SETs and SEUs. This technique was found to 
be ~22% lower in power than the design proposed in Chapter 3.  
Finally, in Chapter 5, a high resolution all-digital approach for clock and data 
recovery for DDR memories was presented. The architecture and circuits presented are 
amenable to a RHBD implementation using TMR to mitigate soft errors. Basic building 
blocks have been fabricated on the TC-23 90-nm TSMC test chip. The top level 
architecture and control timings have been proposed, and their implementation awaits 
further exploration.   
REFERENCES 
[Alex96] D. R. Alexander, "Design issues for radiation tolerant microcircuits for space", 
Short Course Nuclear and Space Radiation Effects Conf.,  1996. 
[Avir12] Avirneni, Naga Durga Prasad, and Arun K. Somani. "Low overhead soft error 
mitigation techniques for high-performance and aggressive designs." Computers, IEEE 
Transactions on 61.4 (2012): 488-501. 
[Axne86] Axness, C. L.; Weaver, H. T.; Fu, J. S.; Koga, R.; Kolasinski, W. A.; , 
"Mechanisms Leading to Single Event Upset," Nuclear Science, IEEE Transactions on , 
vol.33, no.6, pp.1577-1580, Dec. 1986. 
[Bala05] Balasubramanian, A., et al. "RHBD techniques for mitigating effects of single-
event hits using guard-gates." Nuclear Science, IEEE Transactions on 52.6 (2005): 2531-
2535. 
[Barn06] Barnaby, H. J.; , "Total-Ionizing-Dose Effects in Modern CMOS Technologies," 
Nuclear Science, IEEE Transactions on , vol.53, no.6, pp.3103-3121, Dec. 2006 
[Barn09] H. J. Barnaby, M. L. McLain, I. S. Esqueda, and X. J. Chen, “Modeling ionizing 
radiation effects in solid state materials and CMOS devices,” IEEE Trans. Circuits Syst. I, 
vol. 56, pp. 1870–1883, Aug. 2009. 
[Barth03] Barth, J.L.; Dyer, C.S.; Stassinopoulos, E.G.; , "Space, atmospheric, and 
terrestrial radiation environments," Nuclear Science, IEEE Transactions on , vol.50, no.3, 
pp. 466- 482, June 2003. 
[Baze00] Baze, Mark P., Steven P. Buchner, and Dale McMorrow. "A digital CMOS 
design technique for SEU hardening." Nuclear Science, IEEE Transactions on47.6 (2000): 
2603-2608. 
[Baze02] Baze, M. P., et al. "SEU hardening techniques for retargetable, scalable, sub-
micron digital circuits and libraries." Single Event Effects Symp., Manhattan Beach, CA, 
April, klabs. org. Vol. 5. 2002. 
[Baze-06] M. P. Baze, J. Wert, J. W. Clement, M. G. Hubert, A. Witulski, O. A. Amusan, 
L. Massengill, D. McMorrow, “Propagating SET Characterization Technique for Digital 
CMOS Libraries,” IEEE Trans. Nucl. Sci, vol. 53, no. 6, pp. 3472-3478, Dec 2006. 
[Beau93] W. Beauvais, P. McNulty, W. A. Kader, and R. Reed, “SEU parameters and 
proton-induced upsets,” in Proc. Sec. European Conf. on Radiation and its Effects on 
Components and Systems, pp. 540-545, Sept. 1993.  
 
 190 
 
[Bene04] Benedetto, J.; Eaton, P.; Avery, K.; Mavis, D.; Gadlage, M.; Turflinger, T.; 
Dodd, P.E.; Vizkelethyd, G.; , "Heavy ion-induced digital single-event transients in deep 
submicron Processes," Nuclear Science, IEEE Transactions on , vol.51, no.6, pp. 3480- 
3485, Dec. 2004. 
[Bene05] Benedetto, J. M., et al. "Variation of digital SET pulse widths and the 
implications for single event hardening of advanced CMOS processes." Nuclear Science, 
IEEE Transactions on 52.6 (2005): 2114-2119. 
[Bene-06] J. M. Benedetto, P. H. Eaton, D. G. Mavis, M. Gadlage, T. Turflinger, “Digital 
Single Event Transient Trends With Technology Node Scaling,” IEEE Trans. Nucl. Sci, 
vol. 53, no. 6, pp. 3462-3465, Dec 2006. 
[Best07] R. E. Best,  Phase-Locked Loops, Design and Applications,  2007 :McGraw-Hill 
[Black08] J. Black, et al., “Characterizing SRAM Single Event Upset in Terms of Single 
and Multiple Node Charge Collection,” IEEE Trans. Nuc. Sci., 55, 6, pp. 2943 – 2947, 
Dec. 2008. 
[Burn07] J. R. Burnham, Design and Analysis of jitter-tolerant digital delay-locked loops 
and fixed delay lines, Ph.D. Dissertation, Stanford University, June 2007. 
[Calin96] T. Calin, M. Nicolaidis and R.Velazco, “Upset hardened memory design for 
submicron CMOS technology,” IEEE Trans. Nucl. Sci., vol. 43, no. 6, pp. 2874-2878, Dec. 
1996. 
[Calin96] T. Calin, M. Nicolaidis, and R. Velazco, “Upset Hardened Memory Design for 
Submicron CMOS Technology,” IEEE Trans. Nuc. Sci., pp. 2874-2878, 43, 6, Dec. 1996 
[Carm01] Carmichael, Carl. "Triple module redundancy design techniques for Virtex 
FPGAs." Xilinx Application Note XAPP197 1 (2001). 
[Cha93]H. Cha and J. H. Patel, “A logic-level model for particle hits in CMOS circuits,” 
Intl Conf. on Computer Design, pp. 538-542, Oct. 1993.  
[Chandra01] A. Chandrakasan, W. J. Bowhill, and F. Fox,  Design of High-Performance 
Microprocessor Circuits,  2001 :IEEE Press. 
[Chen84] Chen, C. L.; Hsiao, M. Y.; , "Error-Correcting Codes for Semiconductor Memory 
Applications: A State-of-the-Art Review," IBM Journal of Research and Development , 
vol.28, no.2, pp.124-134, March 1984 
[Chip12] Chipana, Raul; Kastensmidt, F.L.; Tonfat, Jorge; Reis, R., "SET susceptibility 
estimation of clock tree networks from layout extraction," Test Workshop (LATW), 2012 
13th Latin American , vol., no., pp.1,6, 10-13 April 2012. 
 191 
 
[Chitu05] C. Chitua, and M. Glesner, “An FPGA implementation of the AES-Rijndael in 
OCB/ECB modes of operation” Microelectronics Journal, 2005, pp. 139–146. 
[Clark01] L. Clark, et al., “An embedded microprocessor core for high performance and 
low power applications,” IEEE JSSC, 36, 11, pp. 498-506, Nov. 2001. 
[Clark11] L. Clark, D. Patterson, N. Hindman, K. Holbert, and S. Guertin, “A dual mode 
redundant approach for microprocessor soft error hardness,” IEEE Trans. Nucl. Sci., vol. 
58, no. 6, Dec. 2011, pp. 3018-3025. 
[Dash09] R. Dash, R. Garg, S. P. Khatri and G. Choi, “SEU hardened clock regeneration 
circuits,” Intl Symp on Quality of Elec. Design, 2009, pp. 806-813. 
[Dehng00] G.K. Dehng, et al. "Clock-deskew buffer using a SAR-controlled delay-locked 
loop.", IEEE Journal of Solid-State Circuits, 2000, pp 1128-1136. 
[Dodd03] Dodd, P.E.; Massengill, L.W.; , "Basic mechanisms and modeling of single-
event upset in digital microelectronics," Nuclear Science, IEEE Transactions on , vol.50, 
no.3, pp. 583- 602, June 2003 
[Dodd95] P. Dodd and F. Sexton, “Critical charge concepts for CMOS SRAMs,” IEEE 
Trans. Nucl. Sci., vol. 42, no. 6, pp. 1764–1771, Dec. 1995. 
[Eaton04] Eaton, P., et al. "Single event transient pulsewidth measurements using a 
variable temporal latch technique." IEEE transactions on nuclear science 51.6 (2004): 
3365-3368. 
[Ecof94] R. Ecoffet, S. Duzellier, P. Tastet, C. Aicardi, and M. Labrunee, “Observation of 
heavy ion induced transients in linear circuits,” Proc. IEEE NSREC Radiation Effects Data 
Workshop Record, pp. 72–77, 1994. 
[Efth02] A. Efthymiou, and J. D. Garside, "Adaptive pipeline depth control for processor 
power-management," Proc. IEEE ICCD, 2002, pp. 454-457. 
[Egan08] W. F. Egan,  Phase-Lock Basics,  2008 :Wiley. 
[Fred96] A. R. Frederickson, “Upsets related to spacecraft charging,” IEEE Trans. Nucl. 
Sci., vol. 43, no. 2, pp. 426-441, 1996. 
[Fried01] Friedman, Eby G. "Clock distribution networks in synchronous digital integrated 
circuits." Proceedings of the IEEE 89.5, 2001, pp 665-692. 
[Fuji90] Fujiwara, E.; Pradhan, D.K.; , "Error-control coding in computers," Computer , 
vol.23, no.7, pp.63-72, July 1990. 
 192 
 
[Fulk07] D. Fulkerson, D. Nelson, R. Carlson, and E. Vogt, “Modeling ion-induced pulses 
in radiation-hard SOI integrated circuits,” IEEE Trans. Nucl. Sci., vol. 54, no. 4, pp. 1406-
1415, Aug. 2007. 
[Furu10] Furuta, J.; Hamanaka, C.; Kobayashi, K.; Onodera, Hidetoshi, "A 65nm Bistable 
Cross-coupled Dual Modular Redundancy Flip-Flop capable of protecting soft errors on 
the C-element," VLSI Circuits (VLSIC), 2010 IEEE Symposium on , vol., no., pp.123,124, 
16-18 June 2010.  
[Gadl04] Gadlage, M.J.; Schrimpf, R.D.; Benedetto, J.M.; Eaton, P.H.; Mavis, D.G.; 
Sibley, M.; Avery, K.; Turflinger, T.L.; , "Single event transient pulse widths in digital 
microcircuits," Nuclear Science, IEEE Transactions on , vol.51, no.6, pp. 3285- 3290, Dec. 
2004 
[Gard79] F. M. Gardner,  Phaselock Techniques,  1979 :Wiley 
[Gard80] Gardner, F.; , "Charge-Pump Phase-Lock Loops," Communications, IEEE 
Transactions on , vol.28, no.11, pp. 1849- 1858, Nov 1980[Moon00] Y. Moon, J. Choi, K. 
Lee, D.-K. Jeong, and M.-K. Kim, "An all-analog multiphase delay-locked loop using a 
replica delay line for wide-range operation and low-jitter performance," IEEE J. Solid-
State Circuits, vol. 35, pp. 377-384, Mar. 2000.  
[Gasp13] N. Gaspard, et al., “Technology Scaling Comparison of Flip-Flop Heavy-Ion 
Single-Event Upset Cross Sections,” IEEE Trans. Nuc. Sci., 60, 6, pp. 4368-4373, Dec. 
2013. 
[Gean98] Geannopoulos, George, and Ximing Dai. "An adaptive digital deskewing circuit 
for clock distribution networks." Solid-State Circuits Conference, 1998. Digest of 
Technical Papers. 1998 IEEE International. IEEE, 1998. 
[Gies97] Gieseke, Bruce A., et al. "A 600 MHz superscalar RISC microprocessor with out-
of-order execution." Solid-State Circuits Conference, 1997. Digest of Technical Papers. 
43rd ISSCC., 1997 IEEE International. IEEE, 1997. 
[Gill05] Gill, Balkaran S., et al. "Node sensitivity analysis for soft errors in CMOS logic." 
Test Conference, 2005. Proceedings. ITC 2005. IEEE International. IEEE, 2005. 
[Gils08] Wirth, Gilson, Fernanda L. Kastensmidt, and Ivandro Ribeiro. "Single event 
transients in logic circuits—load and propagation induced pulse broadening." Nuclear 
Science, IEEE Transactions on 55.6 (2008): 2928-2935. 
[Gowan98] Gowan, M.K.; Biro, L.L.; Jackson, D.B., "Power considerations in the design 
of the Alpha 21264 microprocessor," Design Automation Conference, 1998. Proceedings , 
vol., no., pp.726,731, 19-19 June 1998.  
[Guan96] Guan-Chyun Hsieh; Hung, J.C.; , "Phase-locked loop techniques. A survey," 
Industrial Electronics, IEEE Transactions on , vol.43, no.6, pp.609-615, Dec. 1996. 
 193 
 
[Guss96] Gussenhoven, M.S.; Mullen, E.G.; Brautigam, D.H.; , "Improved understanding 
of the Earth's radiation belts from the CRRES satellite," Nuclear Science, IEEE 
Transactions on , vol.43, no.2, pp.353-368, Apr 1996. 
[Hans09] Hansen, D.L.; et.al. , "Clock, Flip-Flop, and Combinatorial Logic Contributions 
to the SEU Cross Section in 90 nm ASIC Technology," Nuclear Science, IEEE 
Transactions on , vol.56, no.6, pp.3542,3550, Dec. 2009.  
[Hazu00] Hazucha, Peter, Christer Svensson, and Stephen A. Wender. "Cosmic-ray soft 
error rate characterization of a standard 0.6-/spl mu/m cmos process." Solid-State Circuits, 
IEEE Journal of 35.10 (2000): 1422-1429. 
[Heald00] Heald, R., et al. "Implementation of a 3rd-generation SPARC V9 64 b 
microprocessor.", International Solid-State Circuits Conference, 2000, pp. 412-413. 
[Heil89] S. J. Heileman, W. R. Eisenstadt, R. M. Fox, R. S. Wagner, N. Bordes, and J. M. 
Bradley, “CMOS VLSI single event transient characterization,” IEEE Trans. Nucl. Sci., 
vol. 36, no. 6, pp. 2287-2291, 1989. 
[Hind09] Hindman, N.D.; Pettit, D.E.; Patterson, D.W.; Nielsen, K.E.; Xiaoyin Yao; 
Holbert, K.E.; Clark, L.T.; , "High speed redundant self-correcting circuits for radiation 
hardened by design logic," Radiation and Its Effects on Components and Systems 
(RADECS), 2009 European Conference on , vol., no., pp.465-472, 14-18 Sept. 2009 
[Hind11] Hindman, N.D.; Clark, L.T.; Patterson, D.W.; Holbert, K.E., "Fully Automated, 
Testable Design of Fine-Grained Triple Mode Redundant Logic," Nuclear Science, IEEE 
Transactions on , vol.58, no.6, pp.3046,3052, Dec. 2011 
[Hint01] G. Hinton, et al. "A 0.18-μm CMOS IA-32 processor with a 4-GHz integer 
execution unit," IEEE JSSC, Nov 2001, vol.36, no.11, pp.1617-1627. 
[Holm09] Cabanas-Holmen, M.; Cannon, E.H.; Kleinosowski, A.; Ballast, J.; Killens, J.; 
Socha, J., "Clock and Reset Transients in a 90 nm RHBD Single-Core Tilera Processor," 
Nuclear Science, IEEE Transactions on , vol.56, no.6, pp.3505,3510, Dec. 2009 
[Horv11] V. Horvat, Seuss, Radiation Effects Testing Facility, Cyclotron Institute, Texas 
A&M University, Online: cyclotron.tamu.edu/ref/beams.php, updated Dec. 21, 2011.  
[Hsieh81] Hsieh, C. M.; Murley, P. C.; O'Brien, R. R.; , "Dynamics of Charge Collection 
from Alpha-Particle Tracks in Integrated Circuits," Reliability Physics Symposium, 1981. 
19th Annual , vol., no., pp.38-42, April 1981. 
[Hsieh81a] Hsieh, C.M.; Murley, P.C.; O'Brien, R.R.; , "A field-funneling effect on the 
collection of alpha-particle-generated carriers in silicon devices," Electron Device Letters, 
IEEE , vol.2, no.4, pp.103-105, April 1981. 
 194 
 
[Hugh03] Hughes, H.L, Benedetto, J.M, "Radiation effects and hardening of MOS 
technology: devices and circuits," Nuclear Science, IEEE Transactions on , vol.50, no.3, 
pp. 500- 521, June 2003.  
[Hugh64] H. L. Hughes and R. R. Giroux, “Space radiation affects MOSFET’s,” 
Electronics, vol. 37, p. 58, 1964. 
[Jacob04] H. Jacobson, “Improved clock-gating through transparent pipelining,” Proc. 
ISLPED, 2004, pp. 26-31. 
[Karn04] Karnik, T.; Hazucha, P., "Characterization of soft errors caused by single event 
upsets in CMOS processes," Dependable and Secure Computing, IEEE Transactions on , 
vol.1, no.2, pp.128,143, April-June 2004. 
[Karn04] Karnik, T.; Hazucha, P.; , "Characterization of soft errors caused by single event 
upsets in CMOS processes," Dependable and Secure Computing, IEEE Transactions on , 
vol.1, no.2, pp. 128- 143, April-June 2004 
[Karn04] T. Karnik and P. Hazucha, “Characterization of Soft Errors Caused by Single 
Event Upsets in CMOS Processes,” IEEE Trans. Dep. and Secure Comp., 1, 2, pp. 128–
143, April, 2004. 
[Kehl11] Kehl, N.; Rosenstiel, W., "An Efficient SER Estimation Method for 
Combinational Circuits," Reliability, IEEE Transactions on , vol.60, no.4, pp.742,747, 
Dec. 2011 
[Kerns88] Kerns, S.E.; Shafer, B.D.; Rockett, L.R., Jr.; Pridmore, J.S.; Berndt, D.F.; van 
Vonno, N.; Barber, F.E.; , "The design of radiation-hardened ICs for space: a compendium 
of approaches," Proceedings of the IEEE , vol.76, no.11, pp.1470-1509, Nov 1988. 
[Knud06] J. Knudsen and L. Clark, “An area and power efficient radiation hardened by 
design flip-flop,” IEEE Trans. Nucl. Sci., vol. 53, no. 6, pp. 3392-3399, Dec. 2006.  
[Knud06] Knudsen, J. E.; Clark, L. T.; , "An Area and Power Efficient Radiation Hardened 
by Design Flip-Flop," Nuclear Science, IEEE Transactions on , vol.53, no.6, pp.3392-3399, 
Dec. 2006. 
[Koba09] D. Kobayashi, T. Makino, and K. Hirose, “Analytical expression for temporal 
width characterization of radiation-induced pulse noises in SOI CMOS logic gates,” Proc. 
IRPS, pp. 165-169, 2009. 
[Koga93] R. Koga, S. D. Pinkerton, S. C. Moss, D. C. Mayer, S. Lalumondiere, S. J. 
Hansel, K. B. Crawford, and W. R. Crain, “Observation of single event upsets in analog 
microcircuits,” IEEE Trans. Nucl. Sci., vol. 40, no. 6, pp. 1838–1844, Dec. 1993. 
 195 
 
[Koga97] Koga, R.; Penzin, S.H.; Crawford, K.B.; Crain, W.R.; , "Single event functional 
interrupt (SEFI) sensitivity in microcircuits,", RADECS 97. Fourth European Conference 
on , vol., no., pp.311-318, 15-19 Sep 1997. 
[Koppa02] J. Koppanalil, et al. "A case for dynamic pipeline scaling," Proc. Intl. Conf. on 
Compilers, Architecture, Synthesis for Embedded Systems, ACM, 2002, pp. 1-8. 
[Kurd01] Kurd et al., “A multigigahertz clocking scheme for the Pentium 4 
microprocessor,” IEEE Journal of Solid-State Circuits, vol. 36, no. 11, pp. 1647–1653, 
Nov. 2001.  
[LaBel96] LaBel, K.A.; Gates, M.M.; , "Single-event-effect mitigation from a system 
perspective," Nuclear Science, IEEE Transactions on , vol.43, no.2, pp.654-660, Apr 1996. 
[Lacoe00] Lacoe, R.C.; Osborn, J.V.; Koga, R.; Brown, S.; Mayer, D.C.; , "Application of 
hardness-by-design methodology to radiation-tolerant ASIC technologies," Nuclear 
Science, IEEE Transactions on , vol.47, no.6, pp.2334-2341, Dec 2000 
[Lacoe03] R. Lacoe, “CMOS scaling, design principles and hardening-by-design 
methodologies,” presented at the IEEE Nuclear and Space Radiation Effects Conf., 
Monterey, CA, Jul. 2003, short course. 
[Lee08] Lee, H., Paik, S., Shin, Y, “Pulse width allocation with clock skew scheduling for 
optimizing pulsed latch-based sequential circuits”, In Proceedings of the 2008 IEEE/ACM 
International Conference on Computer-Aided Design, pp. 224-229. 
[Magen04] Magen, Nir, et al. "Interconnect-power dissipation in a microprocessor." 
Proceedings of the 2004 international workshop on System level interconnect prediction. 
ACM, 2004.  
[Mass00] Massengill, L.W.; Baranski, A.E.; Van Nort, D.O.; Meng, J.; Bhuva, B.L., 
"Analysis of single-event effects in combinational logic-simulation of the AM2901 bitslice 
processor," Nuclear Science, IEEE Transactions on , vol.47, no.6, pp.2609,2615, Dec 2000. 
[Mass93] W. Massengill, M. Alles, and S. Kerns, “SEU error rates in advanced digital 
CMOS,” in Proc. Sec. European Conf. on Radiation and its Effects on Components and 
Systems, pp. 546-553, Sept. 1993.  
[Matush10] B. Matush, T. Mozdzen, L. Clark and J. Knudsen, “Area efficient temporally 
hardened by design flip flop circuits,” IEEE Trans. Nucl. Sci., vol. 57, no. 6, pp. 3588-
3595, Dec. 2010. 
[Mavis02] D. Mavis and P. Eaton, “Soft error mitigation techniques for modern 
microcircuits,” Proc. IEEE IRPS, Aug. 2002 pp. 216-225, 2002. 
 196 
 
[Mavis02] Mavis, D.G.; Eaton, P.H.; , "Soft error rate mitigation techniques for modern 
microcircuits," Reliability Physics Symposium Proceedings, 2002. 40th Annual , vol., no., 
pp. 216- 225, 2002 
[Mavis02] Mavis, David G., and Paul H. Eaton. "Soft error rate mitigation techniques for 
modern microcircuits." IEEE international reliability physics symposium. 2002 
[Ming04] Ming Zhang; Shanbhag, N.R., "A soft error rate analysis (SERA) methodology," 
Computer Aided Design, 2004. ICCAD-2004. IEEE/ACM International Conference on, 
pp.111-118, 7-11 Nov. 2004. 
[Ming06] Zhang, Ming, and Naresh R. Shanbhag. "Soft-error-rate-analysis (SERA) 
methodology." Computer-Aided Design of Integrated Circuits and Systems, IEEE 
Transactions on 25.10 (2006): 2140-2155. 
[Mohr07] K. Mohr and L. Clark, “Experimental characterization and application of circuit 
architecture level single event transient mitigation,” Proc. IRPS, pp. 312-317, 2007.  
[Muss96] Musseau, O.; Gardic, F.; Roche, P.; Corbiere, T.; Reed, R.A.; Buchner, S.; 
McDonald, P.; Melinger, J.; Tran, L.; Campbell, A.B., "Analysis of multiple bit upsets 
(MBU) in CMOS SRAM," Nuclear Science, IEEE Transactions on, vol.43, no.6, pp.2879-
2888, Dec 1996. 
[Naff02]  S. Naffziger et al., “The implementation of the Itanium 2 microprocessor,” IEEE 
Journal of Solid-State Circuits, vol. 37, no. 11, pp. 1448–1460, Nov. 2002. 
[Nara06] Narasimham, B.; Ramachandran, V.; Bhuva, B.L.; Schrimpf, R.D.; Witulski, 
A.F.; Holman, W.T.; Massengill, L.W.; Black, J.D.; Robinson, W.H.; McMorrow, D., "On-
Chip Characterization of Single-Event Transient Pulsewidths," Device and Materials 
Reliability, IEEE Transactions on, vol.6, no.4, pp.542-549, Dec. 2006. 
[Nara06] Narayanan, V.; Xie, Y., "Reliability concerns in embedded system designs," 
Computer, vol.39, no.1, pp. 118-120, Jan. 2006.  
[Nase06] Naseer, Riaz, and Jeff Draper. "DF-DICE: a scalable solution for soft error 
tolerant circuit design." Circuits and Systems, 2006. ISCAS 2006. Proceedings. 2006 IEEE 
International Symposium on. IEEE, 2006. 
[NIST01] National Institute of Standards and Technology, Advanced Encryption Standard 
AES, Federal Information Processing Standards Publication FIPS 197, 2001, 
http://csrc.nist.gov/publications/fips. 
[Norm96] E. Normand, “Single-event effects in avionics,” IEEE Trans. Nuc. Sci., vol. 43, 
no. 2, pp. 461-474, 1996. 
[Prit02] B. E. Pritchard, G. M. Swift, and A. H. Johnston, “Radiation effects predicted, 
observed, and compared for spacecraft systems,” Proc. IEEE NSREC Radiation Effects 
Data Workshop Record, pp. 7–17, 2002. 
[Rest01] Restle, P. J., et al. "A clock distribution network for microprocessors." Solid-State 
Circuits, IEEE Journal of 36.5, 2001, pp 792-799. 
[Rest98] Restle, P.J.; Jenkins, K.A.; Deutsch, A.; Cook, P.W.; "Measurement and modeling 
of on-chip transmission line effects in a 400 MHz microprocessor," Solid-State Circuits, 
IEEE Journal of , vol.33, no.4, pp.662-665, Apr 1998. 
[Rodb13] K. Rodbell, “Where Radiation Effects in Emerging Technologies Really 
Matter,” NSREC Short Course, 2013. 
[Sagg05] Saggese, G.P.; Wang, N.J.; Kalbarczyk, Z.T.; Patel, S.J.; Iyer, R.K., "An 
experimental study of soft errors in microprocessors," Micro, IEEE, vol.25, no.6, pp. 30-
39, Nov.-Dec. 2005. 
[Seif05] N. Seifert, et al., “Radiation induced clock jitter and race,” Proc. Int. Rel. Phys. 
Symp., April 2005, pp. 215-222. 
[Seif06] N. Seifert, et al., “Radiation-induced soft error rates of advanced CMOS bulk 
devices,” Proc. Int. Rel. Phys. Symp., March 2006, pp. 217-225. 
[Shiba 06] S. Shibatani and A. Li, “Pulse-latch approach reduces dynamic power,” EE 
Times, July 2006. 
[Shim06] H. Shimada, H. Ando, and T. Shimada, “A hybrid power reduction scheme using 
pipeline stage unification and dynamic voltage scaling,” Proc. IEEE COOL Chips, 2006, 
pp. 201-214. 
[Shiv02] Shivakumar, Premkishore, et al. "Modeling the effect of technology trends on the 
soft error rate of combinational logic." Dependable Systems and Networks, 2002. DSN 
2002. Proceedings. International Conference on. IEEE, 2002. 
[Song10] H. J. Song, "VLSI High-Speed I/O Circuits," Xlibris, ISBN#978-1-4415-5987-
6, 2010. 
[Sore00] R. Harboe-Sorensen, F. X. Guerre, H. Constans, J. Van Dooren, G. Berger, and 
W. Hajdas, “Single event transient characterization of analog ICs for ESA’s satellites,” in 
Proc. IEEE 5th Eur. Conf. Radiation and Its Effects on Components and Systems, 2000, 
pp. 573–581.  
[Stas06] R.B. Staszewski, and P. T. Balsara. "All-digital frequency synthesizer in deep-
submicron CMOS". John Wiley & Sons, 2006. 
[Stas88] Stassinopoulos, E.G., Raymond, J.P., "The space radiation environment for 
electronics," Proceedings of the IEEE , vol.76, no.11, pp.1423-1442, Nov 1988. 
[Sushil15] S. Kumar, S. Chellappa and L. T. Clark, “Temporal Pulse-clocked Multi-bit 
Flip-flop Mitigating SET and SEU” (Accepted ISCAS 2015). 
[Teif08] Teifel, J., "Self-Voting Dual-Modular-Redundancy Circuits for Single-Event-
Transient Mitigation," Nuclear Science, IEEE Transactions on, vol.55, no.6, 
pp.3435-3439, Dec. 2008. 
[Tshanz01] J. Tschanz, et al., “Comparative delay and energy of single edge-triggered & 
dual edge-triggered pulsed flip-flops for high performance microprocessors,” Proc. 
ISLPED, 2001, pp. 147-152. 
[Varg07] Varghese George, et al., "Penryn: 45-nm next generation Intel core 2 processor," 
Solid-State Circuits Conference, 2007. ASSCC '07. IEEE Asian, pp.14-17, 12-14 Nov. 
2007.  
[Vitt97] Vittal, A.; Marek-Sadowska, M., "Low-power buffered clock tree design," 
Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, 
vol.16, no.9, pp.965-975, Sept. 1997. 
[Vittoz88] E. A. Vittoz, M. B. R. Degrauwe and S. Bitz, "High performance crystal 
oscillator circuits: theory and application," IEEE J. Solid-State Circuits, vol. 23, pp. 774, 
1988. 
[Wang08] Fan Wang; Agrawal, V.D., "Soft Error Rate Determination for Nanometer 
CMOS VLSI Logic," System Theory, 2008. SSST 2008. 40th Southeastern Symposium on, 
pp.324-328, 16-18 March 2008. 
[Wang10] Wang, Fan, and Vishwani D. Agrawal. "Soft error rate determination for 
nanoscale sequential logic." Quality Electronic Design (ISQED), 2010 11th International 
Symposium on. IEEE, 2010. 
[Warren09] K. Warren, et al., “Heavy ion testing and single event upset rate prediction 
considerations for a DICE flip-flop,” IEEE Trans. Nucl. Sci., vol. 56, no. 6, pp. 3130-3137, 
Dec. 2009.  
[Weav04] C. Weaver, J. Emer, S. Mukherjee, and S. Reinhardt, “Techniques to reduce the 
soft error rate of a high-performance microprocessor,” Proc. ISCA, 2004, pp. 264-27. 
[Webb97] Webb, Charles F., et al. "A 400-MHz S/390 microprocessor." Solid-State 
Circuits, IEEE Journal of 32.11 (1997): 1665-1675. 
[Yao10] X. Yao, L. Clark, D. Patterson, K. Holbert, “A 90 nm bulk CMOS radiation 
hardened by design cache memory,” IEEE Trans. Nucl. Sci., vol. 57, no. 4, pp. 2089-2097, 
Aug. 2010.  
[Yao10] Yao, X.; Clark, L.T.; Chellappa, S.; Holbert, K.E.; Hindman, N.D., "Design and 
Experimental Validation of Radiation Hardened by Design SRAM Cells," Nuclear Science, 
IEEE Transactions on, vol.57, no.1, pp.258-265, Feb. 2010. 
[Zhang06] Ming Zhang; Mitra, S.; Mak, T.M.; Seifert, N.; Wang, N.J.; Quan Shi; Kee Sup 
Kim; Shanbhag, N.R.; Patel, S.J., "Sequential Element Design With Built-In Soft Error 
Resilience," Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, vol.14, 
no.12, pp.1368-1378, Dec. 2006. 
