Improved dependability for dynamically reconfigurable hardware: restoration of the reliability index via replication and error correction by José Martins Ferreira & Manuel G. Gericota
Trondheim, Norway

J. M. Martins Ferreira [ jmf@fe.up.pt ]
FEUP / DEEC, Rua Roberto Frias, P-4200-465 Porto
HIBU, Frogsvei 41, N-3603 Kongsberg
Manuel G. Gericota [ mgg@dee.isep.ipp.pt ]
ISEP / DEE, Rua Dr. António Bernardino de Almeida, P-4200-072 Porto
[ this presentation is available online at http://www.fe.up.pt/~jmf/dak-2004.ppt ]
October 19-21 2004 (2 / 26)
DAK-forum 2004
Outline of the presentation
• Introduction and motivation
• Causes of failure
• Concurrent fault detection
• Fault detection latency and fault tolerance
• Fault masking and fault correction
• Research directions
• Conclusion
Introduction and motivation
• Dynamically reconfigurable FPGAs:
– Production tests cannot guarantee fault-free operation
– Application areas include mission-critical systems
– The cost / benefit of spatial redundancy is different from static implementations
Causes of failure
• Post-production failure modes may be permanent or temporary; examples:
– Electromigration phenomena may lead to permanent physical damage
– Single-event upsets (SEUs) may cause permanent malfunction if not mitigated (modification of SRAM contents changes design and data information)
– See
Fault detection
• Dynamic reconfiguration enables concurrent fault detection
– Modifications in the configuration memory may be detected by scrubbing
– Structural faults that emerge in the field may be detected by release-to-test strategies
Fault detection: Scrubbing
• Errors in the on-chip configuration memory may be detected by partial readback (and corrected by partial reconfiguration)
• Scrubbing prevents “design” errors that might lead to functional failure
• Data stored in flip-flop registers is not writable via the configuration memory, so scrubbing does not correct “data” errors
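
The readback-and-compare loop behind scrubbing can be sketched as follows. This is only an illustration: it assumes the configuration memory can be modelled as a list of frames, and `read_frame` / `write_frame` are hypothetical callbacks standing in for the device's partial readback and partial reconfiguration ports. None of these names come from the presentation.

```python
from typing import Callable, List, Sequence


def scrub(golden: Sequence[bytes],
          read_frame: Callable[[int], bytes],
          write_frame: Callable[[int, bytes], None]) -> List[int]:
    """Compare each configuration frame with its golden copy.

    Frames that differ are rewritten and their indices are returned.
    Flip-flop contents are not part of the frames, so "data" errors
    are outside the reach of this loop.
    """
    corrected = []
    for index, expected in enumerate(golden):
        if read_frame(index) != expected:      # partial readback
            write_frame(index, expected)       # partial reconfiguration
            corrected.append(index)
    return corrected


# Hypothetical in-memory model of a device with one upset frame.
golden_frames = [bytes([0x0F, 0xF0]), bytes([0xAA, 0x55]), bytes([0x00, 0xFF])]
device_frames = list(golden_frames)
device_frames[1] = bytes([0xAA, 0x54])         # simulated SEU: one bit flipped

fixed = scrub(golden_frames,
              read_frame=lambda i: device_frames[i],
              write_frame=lambda i, data: device_frames.__setitem__(i, data))
```

After the call, `fixed` lists the frames that were rewritten and `device_frames` matches the golden copy again.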
Fault detection: Release-to-test
• The basic idea underlying release-to-test strategies consists of non-intrusively replicating a given functional block in another area,
Replication of active resources
• Concurrent fault detection based on release-to-test approaches must provide functional and state replication
• Replication at CLB-level
– Facilitates state transfer and requires a minimal amount of spare resources
– The relative position of the replicated CLB
CLB replication 
• Replicating the functional configuration of a CLB is done with minimal overhead
• In free-running clock circuits, placing the inputs of the two CLBs in parallel ensures common state acquisition
• Gated-clock circuits need an auxiliary block to provide state transfer (paralleling CLB outputs is not a problem)
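
A small behavioural model may help to see why the clocking scheme matters. It is a sketch under simplifying assumptions: a CLB storage element is reduced to a D flip-flop with clock enable, and the `FlipFlop` class and `states_match_after_edge` helper are hypothetical names introduced here for illustration.

```python
class FlipFlop:
    """D flip-flop with clock enable, standing in for a CLB storage element."""

    def __init__(self, state: int = 0) -> None:
        self.state = state

    def clock(self, d: int, enable: int = 1) -> None:
        # The state is only updated on a clock edge when the enable is active.
        if enable:
            self.state = d


def states_match_after_edge(enable: int) -> bool:
    original = FlipFlop(state=1)   # CLB in operation, holding a 1
    replica = FlipFlop(state=0)    # freshly configured copy with a different state
    d = original.state             # example input: the logic feeds the state back
    # Inputs in parallel: both cells see the same D and the same enable.
    original.clock(d, enable)
    replica.clock(d, enable)
    return original.state == replica.state


free_running = states_match_after_edge(enable=1)   # replica acquires the state
gated_idle = states_match_after_edge(enable=0)     # replica keeps its own state
```

With the enable active on every edge the replica picks up the state after one clock. While the enable is inactive the two cells can stay different, which is the situation the auxiliary block has to handle.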
Example: Replicate and release-to-test
[Figures not preserved]
ITC’99 benchmarks: ∆f and size
Circuit reference | Maximum frequency deviation (%) | Total size of the reconfiguration files (bytes) | Ratio between the total size of the reconf. files by CLB (%) (horizontal>vertical)
B13 | -4,3 / -42,8 | 258 827 / 332 954 | 28,6
B14 | -13,5 / -47,8 | 5 195 444 / 6 070 485 | 16,8
[Sub-column headings not preserved]
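
The ratio column can be checked against the two size values of each circuit. The sketch below assumes that the ratio is the relative excess of the larger total over the smaller one. That is an interpretation of the recovered figures, not something the slide states, and `size_ratio_percent` is a name introduced here.

```python
def size_ratio_percent(larger_bytes: int, smaller_bytes: int) -> float:
    """Relative excess of the larger file set over the smaller one, in percent."""
    return (larger_bytes / smaller_bytes - 1.0) * 100.0


b13 = size_ratio_percent(332_954, 258_827)
b14 = size_ratio_percent(6_070_485, 5_195_444)
print(f"B13: {b13:.1f} %  B14: {b14:.1f} %")
```

Under that assumption the computed values agree with the 28,6 % and 16,8 % given for B13 and B14.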
[Table not preserved: mean size of the reconfiguration files, and ratio between the mean size value of the reconf. files by CLB (%) (horizontal>vertical)]
Structural fault detection in CLBs
• Test vector application / response capturing is
Replication without auxiliary relocation block
Partial reconfiguration file size and reconfiguration time for each step in the replication of synchronous circuits with free-running clock and of combinational circuits
Step | File size | Reconfiguration time
Copy of the internal logic functionality and place of the input signals in parallel | 12 163 | 10,457
Place of the CLB outputs in parallel | 3 993 | 3,433
Disconnect of the original CLB outputs | 1 073 | 0,923
Disconnect of the original CLB inputs and test configuration | 18 392 | 15,813
Total | 35 621 | 30,625

Replication using the auxiliary relocation block
Partial reconfiguration file size and reconfiguration time for each step in the replication of synchronous circuits with clock enable
Step | File size | Reconfiguration time
Copy of the internal logic functionality and place of the input signals in parallel | [not preserved] | [not preserved]
[step name not preserved] | 2 145 | 1,844
Disconnect of all the auxiliary relocation circuit signals | 2 217 | 1,906
Place of the CLB outputs in parallel | 4 129 | 3,550
Disconnect of the original CLB outputs | 1 333 | 1,146
Disconnect of the original CLB inputs and test configuration | [not preserved] | [not preserved]
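
As a consistency check on the figures for replication without the auxiliary relocation block, the per-step values can be summed and compared with the Total row. The snippet is only a sketch: the shortened step names are labels chosen here, and the units of size and time are not given on the slide, so the ratio is reported as plain time per unit of file size.

```python
# (file size, reconfiguration time) per step, as read from the table for
# replication without the auxiliary relocation block.
steps = {
    "copy logic, inputs in parallel": (12_163, 10.457),
    "outputs in parallel": (3_993, 3.433),
    "disconnect original outputs": (1_073, 0.923),
    "disconnect original inputs, test configuration": (18_392, 15.813),
}

total_size = sum(size for size, _ in steps.values())
total_time = sum(time for _, time in steps.values())

# Time per unit of file size; a near-constant value suggests that the
# reconfiguration time is dominated by shifting the file into the device.
time_per_unit = {name: time / size for name, (size, time) in steps.items()}
```

The sizes add up to the stated total of 35 621, the times match 30,625 to within rounding, and the time per unit of size is close to 0,00086 for every step.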
Partial reconfiguration file size and reconfiguration time of the test configurations
[Table values not preserved]

Shifting time for test vector application
Number of vectors | Length (bits)
0,0665201340 [cell boundaries not preserved]

Shifting time for the test vector responses from a CLB under test
Number of cells of the BS register in an XCV200
4,088401 022 [cell boundaries not preserved]

Mean time for the test of a 1 176-CLB matrix
Occupation type: 25% synchronous + 50% combinational + 25% empty
TCK = 20 MHz: 43 679,188 ms
TCK = 33 MHz: 26 472,235 ms

The mean time to test the full CLB matrix is also the worst-case fault detection latency
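
The two mean test times differ by the ratio of the two TCK frequencies, which suggests that the test is dominated by boundary-scan shifting. The sketch below rescales one figure to the other under that assumption. The inverse-proportional model and the `scale_test_time` helper are illustrative and not taken from the presentation.

```python
def scale_test_time(time_ms: float, tck_from_mhz: float, tck_to_mhz: float) -> float:
    """Rescale a shift-dominated test time from one TCK frequency to another."""
    return time_ms * tck_from_mhz / tck_to_mhz


CLBS = 1_176
t20_ms = 43_679.188                        # mean matrix test time at TCK = 20 MHz
t33_ms = scale_test_time(t20_ms, 20.0, 33.0)
per_clb_ms = t20_ms / CLBS                 # average time per CLB at 20 MHz
print(f"33 MHz: {t33_ms:.3f} ms, per CLB at 20 MHz: {per_clb_ms:.3f} ms")
```

The rescaled value agrees with the 26 472,235 ms given for TCK = 33 MHz, and the 20 MHz figure corresponds to roughly 37 ms per CLB.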
Fault detection latency x fault masking
• A fault detection latency higher than 40 s may be acceptable in some applications, but may be a problem in many others
• Fault masking by spatial redundancy may solve the problem until the defective CLB
Spatial redundancy
• N-NMR implementations enhance reliability by allowing voter failure
• Earlier NMR implementations were a form of static redundancy, but dynamic reconfiguration brings an added value
– Just-in-time implementation saves space
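
Fault masking with replicated voters can be illustrated with a short model. It assumes N = 3 and reads "N-NMR" as three modules followed by three voters, with a further vote standing in for the next stage. The function names and the XOR fault mask are hypothetical choices made for this sketch.

```python
def majority(a: int, b: int, c: int) -> int:
    """Bitwise 2-out-of-3 majority vote."""
    return (a & b) | (a & c) | (b & c)


def tmr_with_triplicated_voters(module_outputs, voter_faults=(0, 0, 0)):
    """Three voters each vote on the three module outputs.

    A faulty voter is modelled by XOR-ing a fault mask onto its result;
    the final vote stands in for the next stage, which outvotes it.
    """
    voted = [majority(*module_outputs) ^ fault for fault in voter_faults]
    return majority(*voted)


# One module delivers a wrong word and, independently, one voter is faulty.
masked_module_fault = tmr_with_triplicated_voters((0b1010, 0b1010, 0b0010))
masked_voter_fault = tmr_with_triplicated_voters((0b1010, 0b1010, 0b1010),
                                                 voter_faults=(0b1111, 0, 0))
```

In both cases the correct word `0b1010` is still delivered, which is the sense in which a single module or voter failure is masked.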






Fault detection and correction in N-NMR via replication of CLBs
• The CLB testing approach previously described enables the identification of a defective CLB (structural fault)
• Replication will be used to remove the defective CLB from operation (and to reestablish the reliability index)
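
The presentation does not define the reliability index. Purely as an illustration, the sketch below assumes classic triple modular redundancy with an ideal voter and three independent modules of equal reliability `r`, and compares the system before a fault, with one defective module still in place, and after that module has been replaced by a healthy replica. The value `r = 0.95` is an arbitrary example.

```python
def tmr_reliability(r: float) -> float:
    """Probability that at least 2 of 3 independent modules work (ideal voter)."""
    return 3 * r**2 - 2 * r**3


def degraded_reliability(r: float) -> float:
    """One module is known to be defective: both remaining modules must work."""
    return r**2


r = 0.95                             # arbitrary example value
before = tmr_reliability(r)          # all three modules healthy
degraded = degraded_reliability(r)   # defective CLB still in operation
restored = tmr_reliability(r)        # defective CLB replaced by a spare replica
```

Under these assumptions the degraded system is less reliable than a fresh one, and replacing the defective module brings the figure back to its original value.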
[Figure labels: scan chain, majority voters]
• An internal scan chain capturing the
Error correction
• If an inconsistency is detected:
– A scrubbing procedure is launched to read-and-compare the configuration bitstream for the affected area
– If no error is found, each CLB in the affected module / voter is tested (a defective CLB will be replicated and removed from service)
• Error correction via CLB replication or
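
The two-stage procedure described above (scrub first, then test the CLBs of the affected area) can be sketched as a small decision routine. All callbacks (`scrub_area`, `clbs_in`, `test_clb`, `replicate_clb`) are hypothetical placeholders for the corresponding hardware operations and are not names used in the presentation.

```python
from typing import Callable, Iterable, Optional


def correct_error(area: str,
                  scrub_area: Callable[[str], bool],
                  clbs_in: Callable[[str], Iterable[str]],
                  test_clb: Callable[[str], bool],
                  replicate_clb: Callable[[str], None]) -> Optional[str]:
    """Return a description of the corrective action, or None if nothing was found."""
    # Step 1: read-and-compare the configuration bitstream of the affected area.
    if scrub_area(area):               # True: a configuration error was found and fixed
        return "configuration rewritten"
    # Step 2: no configuration error, so look for a structural fault.
    for clb in clbs_in(area):
        if not test_clb(clb):          # False: the CLB failed its test
            replicate_clb(clb)         # move its function to a spare CLB
            return f"{clb} replicated and removed from service"
    return None


# Stub environment: the bitstream is intact, one CLB is structurally defective.
retired = []
action = correct_error(
    "module_A",
    scrub_area=lambda area: False,
    clbs_in=lambda area: ["R1C1", "R1C2", "R2C1"],
    test_clb=lambda clb: clb != "R1C2",
    replicate_clb=retired.append,
)
```

With these stubs the routine skips the scrubbing branch, finds the defective CLB during testing and records it in `retired`.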







Research directions
• One-chip “self-healing” architectures may be achieved via self-reconfiguration (a microprocessor core controls the self-reconfiguration port and scan chains)
• Dual-chip or multi-chip architectures may monitor the error detection circuitry of each other - See
Conclusion
• The CLB replication and test procedure proposed enables concurrent non-intrusive fault detection, but fault detection latency prevents true fault tolerance
• Combining the proposed fault detection techniques with spatial redundancy enables low-overhead fault tolerance for DR-FPGAs (and self-healing for SR-FPGAs)
• Dependability will also be improved by runtime defragmentation of the FPGA logic space (using the proposed CLB replication and test procedure)
• See
