Susceptibility of Redundant Versus Singular Clock Domains Implemented in SRAM-Based FPGA TMR Designs by Pellish, Jonathan et al.
To be presented by Melanie Berg at the Hardened Electronics and Radiation Technology Conference, April 4-8, 2016, Monterey, CA.
Melanie Berg, AS&D in support of NASA/GSFC
Melanie.D.Berg@NASA.gov
Kenneth LaBel, NASA/GSFC
Jonathan Pellish, NASA/GSFC
Susceptibility of Redundant Versus 
Singular Clock Domains Implemented 
in SRAM-Based FPGA TMR Designs
https://ntrs.nasa.gov/search.jsp?R=20160004768 2019-08-31T03:50:41+00:00Z
Acronyms
• Combinatorial logic (CL)
• Design under analysis (DUA)
• Device under test (DUT)
• Distributed triple modular redundancy 
(DTMR)
• Edge-triggered flip-flops (DFFs)
• Field programmable gate array (FPGA)
• Global triple modular redundancy (GTMR)
• Hardware description language (HDL)
• Input – output (I/O)
• Linear energy transfer (LET)
• Mean time to failure (MTTF)
• Operational frequency (fs)
• Radiation Effects and Analysis Group 
(REAG)
• Single error correct, double error detect (SECDED)
• Single event functional interrupt (SEFI)
• Single event effects (SEEs)
• Single event transient (SET)
• Single event upset (SEU)
• Single event upset cross-section (σSEU)
• Static random access memory (SRAM)
• Static timing analysis (STA)
• Triple modular redundancy (TMR)
Problem Statement
• Triple modular redundancy (TMR) can be implemented in a 
variety of topologies.
• This presentation focuses on the trade-offs between 
implementing TMR with:
– Multiple clock domains (one clock per TMR domain):  
i.e., global TMR (GTMR) and
– A single clock shared across TMR domains: 
i.e., distributed TMR (DTMR).
• For many organizations, GTMR is the mitigation strategy of 
choice because of its redundant clock topology.
• However, as FPGA devices become larger and more complex, 
clock skew between separate domains is increasing and 
becoming impossible to control.  
• Unfortunately, mismanaged clock skew can cause timing 
violations or circuit race conditions in synchronous designs.
Race conditions from clock skew weaken mitigation and can 
cause system malfunction!
Abstract
We present the challenges that arise when using
redundant clock domains due to their clock-skew.
Radiation data show that a singular clock domain
(DTMR) provides an improved TMR methodology
for SRAM-based FPGAs over redundant clocks.
Clock Skew
Clock Skew within One Clock Domain
• A clock domain is defined 
as a group of circuitry that 
is connected to the same 
clock tree.  
• The clock tree only feeds 
DFF clock pin inputs. 
• It is mandatory to balance 
the clock tree domain so 
that all connected DFFs 
receive the controlling 
clock edge at virtually the 
same moment in time. 
CL: combinatorial logic
Tcomb: CL circuit delay
Tskew: clock skew
DFF: flip-flop
The difference in time of a clock edge’s arrival at one DFF 
with respect to its arrival at another DFF is defined as clock 
skew (Tskew). 
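The definition of Tskew can be sketched in a few lines of Python. This is only an illustration: the sign convention (capture-edge arrival minus launch-edge arrival), the function names, and the arrival times in picoseconds are all assumptions and do not come from any measured clock tree.

```python
def clock_skew(t_launch_edge_ps, t_capture_edge_ps):
    """Tskew between two DFFs: capture-edge arrival minus launch-edge arrival.

    Assumed sign convention: positive when the capture DFF sees the
    clock edge later than the launch DFF.
    """
    return t_capture_edge_ps - t_launch_edge_ps


def worst_case_skew(arrivals_ps):
    """Largest arrival-time difference between any two DFFs on one clock tree."""
    times = list(arrivals_ps.values())
    return max(times) - min(times)


# Hypothetical clock-edge arrival times (ps) at three DFFs on the same tree.
arrivals = {"DFF0": 120, "DFF1": 135, "DFFx": 150}

skew_0_to_x = clock_skew(arrivals["DFF0"], arrivals["DFFx"])  # positive skew
skew_x_to_0 = clock_skew(arrivals["DFFx"], arrivals["DFF0"])  # negative skew
tree_skew = worst_case_skew(arrivals)
```

A balanced clock tree keeps the worst-case value small relative to the shortest data path between any two DFFs.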
Synchronous Data Capture
Tcomb: combinational logic delay.
Tclkq: delay of data output from DFF.
TPHOLD: Path hold time.
Tsetup: Data stable time prior to clock edge.
Thold: Data stable time post clock edge.
Launch DFF Capture DFF
Positive Skew and Data Capture
In a system with positive skew, there is a possibility that: 
– DFFx will capture the wrong data (cycle ahead); or 
– DFFx will capture while data is changing 
(metastability).
Tskew: clock skew
Positive Skew and TPHOLD
• TPHOLD is elongated to accommodate positive clock skew.
• TPHOLD elongation takes time away from Tcomb… Tcomb is shortened.
• Might violate timing constraints.
Negative Skew and Data Capture
In a system with negative skew, there is the possibility 
that data can be captured during its computation time.
– Tsetup is violated.
– This can cause metastability.
– Data is invalid.
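The effect of positive and negative skew on data capture can be illustrated with the usual textbook setup and hold slack checks. The sketch below assumes that Tskew is the capture-edge arrival minus the launch-edge arrival, that a negative slack means a violation, and that all delay values (in ns) are hypothetical. It is not the timing model of any specific FPGA or STA tool.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class PathTiming:
    """Timing of one launch-DFF to capture-DFF path (all values in ns, hypothetical)."""
    t_clkq: float      # Tclkq: clock-to-output delay of the launch DFF
    t_comb_min: float  # shortest combinatorial logic (CL) delay
    t_comb_max: float  # longest CL delay (Tcomb)
    t_setup: float     # Tsetup of the capture DFF
    t_hold: float      # Thold of the capture DFF
    t_skew: float      # Tskew: capture-edge arrival minus launch-edge arrival
    t_period: float    # clock period


def setup_slack(p):
    """Negative result: data is still being computed at the capture edge."""
    return (p.t_period + p.t_skew) - (p.t_clkq + p.t_comb_max + p.t_setup)


def hold_slack(p):
    """Negative result: new data races through and overwrites the old data."""
    return (p.t_clkq + p.t_comb_min) - (p.t_hold + p.t_skew)


nominal = PathTiming(t_clkq=0.5, t_comb_min=0.2, t_comb_max=6.0,
                     t_setup=0.3, t_hold=0.2, t_skew=0.0, t_period=10.0)
positive = replace(nominal, t_skew=1.0)    # capture clock arrives late
negative = replace(nominal, t_skew=-4.0)   # capture clock arrives early

for name, path in (("nominal", nominal), ("positive", positive), ("negative", negative)):
    print(name, "setup ok:", setup_slack(path) >= 0, "hold ok:", hold_slack(path) >= 0)
```

With these made-up numbers, the positive-skew case fails only the hold check (wrong data captured) and the negative-skew case fails only the setup check, matching the two failure modes described above.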
Triple Modular Redundancy (TMR)
DTMR and GTMR Topologies
• With DTMR and GTMR 
all circuits are 
triplicated.
• Voters are placed after 
the internal flip-flops 
(DFFs).
• DTMR: one clock shared 
across all TMR domains.
• GTMR: three separate 
clocks, one per TMR domain.
GTMR violates synchronous design protocol because of 
clock skew and sharing data across clock domains without 
synchronization.
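The voters placed after the DFFs perform a bitwise two-out-of-three majority function. A minimal sketch follows; the function name and the register values are illustrative assumptions, and the model ignores timing entirely.

```python
def majority_vote(a, b, c):
    """Bitwise two-out-of-three majority of three redundant register values."""
    return (a & b) | (a & c) | (b & c)


# One corrupted domain is outvoted by the other two.
masked = majority_vote(0b1010, 0b1010, 0b0110)

# Two domains that agree on a wrong bit defeat the voter.
defeated = majority_vote(0b1010, 0b1110, 0b1110)
```

The voter only masks a fault when at most one domain is wrong for a given bit, which is why a domain that is out of sync due to skew reduces the protection that TMR provides.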
Challenges of GTMR System 
Implementation
GTMR: Multiple Clock Domains 
Exacerbate Skew
• Three separate clock domains merge into one clock domain at each 
voter.
• For GTMR, this is replicated for each clock domain.
Figure: TMR0 clock domain merge example. Each clock domain will have a unique skew with respect to DFFx.
This is a complete violation of synchronous design rules.
System Implementation: 
Sources of Clock Skew
• Board Level:
– One board clock source (oscillator): routes 
from board clock source must be the same 
length to FPGA clock inputs.
– Three board clock sources: Don’t!
• Internal to the FPGA:
– Clock pin to clock tree routing differences,
– Skew within a single clock tree, and
– GTMR has additional skew from use of 
different clock trees.
GTMR Skew Management in Various 
FPGA Devices
• Board level and routing skew can be managed.  
However, clock skew within a single tree and between 
different trees is based on the manufacturer’s product 
and can be difficult or impossible to control.
• Clock skew was less of a problem in smaller Xilinx 
FPGA devices (e.g., Virtex 5 and smaller).
• Clock skew is now a challenge in the larger family of 
Xilinx devices (e.g., 7 series and above).
– More skew within one clock tree (especially as distance 
between DFFs increases).
– More skew between separate clock trees.
• GTMR could never be implemented in the 
Microsemi/Actel FPGA product lines because there is 
too much skew between clock trees.
Detection of Tskew with GTMR
• GTMR Tskew is difficult to detect due to the following:
– Many static timing analysis (STA) tools do not accurately report 
hold time violations across clock domains – hence the user can 
be unaware Tskew exists.
– Tskew can be temperature and voltage related.  Hence, a design 
can work during ground testing yet have failures during 
operation in its target environment.
– In the presence of clock skew, usually two out of three of the 
domains are in sync.  Hence, the design will appear to operate 
normally.
– Not all nodes will contain the same skew:
• Some nodes may have positive skew,
• Some nodes may have negative skew, and
• Some nodes may have negligible skew.
• Due to state space explosion, fault injection and 
simulation will not provide sufficient coverage.
Tskew System Effects
• Significantly large Tskew: can cause one domain to 
always be out of sync with the other two 
domains.
• Marginal Tskew: can cause metastable circuits.
• Variable Tskew: can cause pockets of Tskew such 
that some portions of the circuit contain:
– Positive Tskew
– Negative Tskew, and
– Negligible Tskew.
• As the designer decreases overall Tskew (e.g., via 
board-level and routing management), pockets of 
Tskew start to exist.
• This is more prominent in large FPGA devices 
such as the Xilinx 7-series.
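The first effect in the list above (one domain always out of sync) can be illustrated with a toy simulation. The sketch below is an assumption-laden model, not the NEPP counter-array: a single voted 8-bit counter is triplicated, a skewed domain is modeled as always capturing the previous voted value, and an SEU is modeled as one bit flip in one domain on one cycle.

```python
def majority_vote(a, b, c):
    """Bitwise two-out-of-three majority."""
    return (a & b) | (a & c) | (b & c)


def run_tmr_counter(cycles, stale_domain=None, upset=None, width=8):
    """Return the voted counter output for each cycle.

    stale_domain: index (0-2) of a domain that, in this toy model of skew,
                  always captures the old voted value instead of the new one.
    upset:        (cycle, domain, bit) tuple for a single injected bit flip.
    """
    mask = (1 << width) - 1
    state = [0, 0, 0]
    outputs = []
    for cycle in range(cycles):
        voted = majority_vote(*state)
        outputs.append(voted)
        next_value = (voted + 1) & mask
        new_state = [next_value] * 3
        if stale_domain is not None:
            new_state[stale_domain] = voted          # out-of-sync domain
        if upset is not None and upset[0] == cycle:
            new_state[upset[1]] ^= 1 << upset[2]     # injected SEU
        state = new_state
    return outputs


expected = list(range(6))
skew_only = run_tmr_counter(6, stale_domain=2)
seu_only = run_tmr_counter(6, upset=(2, 0, 0))
skew_and_seu = run_tmr_counter(6, stale_domain=2, upset=(2, 0, 0))
```

In this model the skewed domain alone is masked by the voter, and so is the SEU alone. When both occur, the voter sees two wrong domains and the counter permanently falls one count behind. A design with one domain out of sync therefore appears to operate normally while behaving like an unmitigated design under upset.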
Accelerated Heavy Ion Testing
Accelerated Radiation Testing: DTMR 
versus GTMR
• Device under test (DUT): Xilinx Kintex-7 FPGA 
(XC7K325T).
• The base design (DUA) was the counter-array 
created by NASA Electronics Parts and 
Packaging (NEPP) Program.  
• The counter-array DUA had three versions based 
on its inserted TMR scheme: 
– No TMR, 
– GTMR, and 
– DTMR.  
• The TMR DUAs were physically partitioned 
across TMR domains in order to reduce shared 
resources (single points of failure).  
Considerations Taken Prior to and 
during Testing
• TMR topologies for the DTMR and GTMR designs 
were analyzed to verify that the redundancy was 
implemented correctly. 
• Both DTMR and GTMR were partitioned in 
the same manner.
• Major difference is that DTMR has one shared 
clock and GTMR has three separate clocks.
• Multiple bit upsets (MBUs) should not make a 
significant difference in this testing because both 
designs should be statistically equally likely to fail 
due to MBUs.
DUA Failures
• Failure is referenced at the system level, not at the 
component level. 
• Hence, SEUs can occur and disrupt components; 
however, if the next state is the expected next 
state, then no system failure exists. 
• In a TMR’d design, system failures can occur 
from: 
– Single configuration bit-SEUs that control shared 
resources that cross TMR domains; 
– Multiple configuration bit-SEUs that can span across 
TMR domains; 
– GTMR architectures with race conditions; and
– Single event functional interrupts (SEFIs).
Heavy-Ion Results
LET: linear energy transfer
MFTF: Mean fluence to failure
[Figure: heavy-ion results plot; vertical axis: MFTF (particles/cm²).]
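For readers unfamiliar with the MFTF metric, the sketch below shows one common way such a value is computed. It assumes that MFTF is the average fluence accumulated before each observed system failure and that the system-failure cross-section can be approximated as its reciprocal. The fluence values are made-up placeholders, not data from this test campaign.

```python
import math


def mean_fluence_to_failure(fluences_at_failure):
    """Average fluence (particles/cm^2) accumulated before each system failure."""
    return sum(fluences_at_failure) / len(fluences_at_failure)


def failure_cross_section(mftf):
    """Approximate system-failure cross-section (cm^2) as 1 / MFTF (assumption)."""
    return 1.0 / mftf


# Placeholder fluence-to-failure values from three hypothetical runs.
runs = [2.0e6, 4.0e6, 3.0e6]

mftf = mean_fluence_to_failure(runs)
sigma = failure_cross_section(mftf)

# "A decade better" corresponds to an MFTF ratio of about 10.
decades_better = math.log10(10 * mftf / mftf)
```

A higher MFTF means the design survives more particles per unit area before failing, so a one-decade increase in MFTF corresponds to a one-decade decrease in the approximate cross-section.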
Heavy-Ion Results: Low LETs
LET: linear energy transfer
MFTF: Mean fluence to failure
Low LET: GTMR ≅ DTMR, and both are a decade better than No TMR.
[Figure: heavy-ion results plot at low LETs; vertical axis: MFTF (particles/cm²).]
Heavy-Ion Results: Higher LETs
LET: linear energy transfer
MFTF: Mean fluence to failure
LET > 5 MeV·cm²/mg: GTMR ≅ No TMR, and both are a decade worse than DTMR.
[Figure: heavy-ion results plot at higher LETs; vertical axis: MFTF (particles/cm²).]
It Is A Clock Skew Problem
• If results were due to significant clock skew, one 
GTMR clock domain would always be out of sync 
and GTMR would always have results similar to 
No TMR.
• Because GTMR results are near DTMR at low LET 
and approach No TMR as LET increases, the data suggest 
that the failures are mostly due to clock skew.
• MBUs are not considered a significant source of 
failure because both systems are partitioned in 
the same manner.  
Conclusion (1)
• Theoretically, GTMR should be the strongest TMR 
mitigation scheme.  
• For this reason, it has been suggested as the 
TMR strategy of choice for SRAM-based FPGAs.  
• However, the uncontrollable clock skew between 
GTMR clock domains can cause race conditions 
that inevitably weaken GTMR mitigation.   
• For small (less complex) designs implemented in 
FPGAs that contain clock trees with minimal 
Tskew, GTMR can be realizable.  
Conclusion (2)
• As device and design area increase, as with 
modern devices such as the Xilinx Kintex-7, GTMR 
clock skew also increases.  
• The increase in skew increases the potential for 
race conditions. 
• Some race conditions can be uncontrollable and 
unrecognizable by manufacturer-supplied design 
tools.  
• Consequently, Kintex-7 GTMR versus DTMR heavy-
ion data show that GTMR is an ineffective and 
unreliable mitigation solution.  
• In conclusion, we suggest that DTMR is a more 
applicable TMR strategy for larger commercial 
SRAM-based FPGA devices.
Acknowledgements
• Some of this work has been sponsored by the 
NASA Electronic Parts and Packaging (NEPP) 
Program and the Defense Threat Reduction 
Agency (DTRA).
• Thanks are given to the NASA Goddard Radiation 
Effects and Analysis Group (REAG) for their 
technical assistance and support. REAG is led by 
Kenneth LaBel and Jonathan Pellish.
Contact Information:
Melanie Berg: NASA Goddard REAG FPGA 
Principal Investigator:
Melanie.D.Berg@NASA.GOV
