Reconfiguration and Fine-Grained Redundancy for Fault Tolerance in FPGAs by Campregher, N et al.
RECONFIGURATION AND FINE-GRAINED REDUNDANCY FOR FAULT TOLERANCE IN
FPGAS
Nicola Campregher§, Peter Y. K. Cheung§, George A. Constantinides§ and Milan Vasilko‡
§Department of EEE, Imperial College, London, UK
‡School of Design, Engineering and Computing, Bournemouth University, UK
ABSTRACT
As manufacturing technology enters the ultra-deep submi-
cron era, wafer yields are destined to drop due to higher
occurrence of physical defects on the die. This paper pro-
poses a yield enhancement scheme based on the use of spare
interconnect resources in each routing channel to tolerate
functional faults. By using a node-covering technique and
integer-linear programming (ILP) methods, the scheme is
shown to provide minimal area and timing overheads. Sig-
nificant yield improvements can thus be achieved.
1. INTRODUCTION
The area occupied by wiring channels and interconnect con-
figuration circuits in an FPGA is significant, occupying 50
to 90 percent of the chip area [1]. With current trends aim-
ing to reduce the area occupied by wiring segments in the
routing channels, wire width and wire spacing have been reduced.
This has in turn led to a higher occurrence of wiring
defects, such as breaks and shorts, and hence to a decrease in
manufacturing yield [2]: fewer functioning devices are obtained
at fixed manufacturing costs.
The nature of FPGA devices provides two separate ways
of dealing with defects. The first, more obvious, method is
based on hardware redundancy, and it capitalizes on the high
regularity of the FPGA array to swap a faulty resource with
a spare functioning one. The second method is based on ex-
ploiting the reconfiguration properties of the device, tweak-
ing the design to fit around the defective resource. Both
methods come at great area and timing overheads; only a
limited number of the proposed schemes have proved suc-
cessful and have been implemented by manufacturers [3, 4,
5].
In this work we propose a new fault tolerant scheme
based on both reconfiguration and hardware redundancy. We
propose a new approach to fault tolerance based on modify-
ing some of the underlying characteristics of a given FPGA
architecture, and we demonstrate the principle using a sim-
ple routing architecture.
This work is partially supported by EPSRC grant no. EP/C549481/1.
Our fault tolerance scheme has the following notable advantages:
• Estimated device yields increase from 40% to 85% for
large devices built at 90nm as predicted by our yield
analysis tool [2].
• Estimated worst case timing degradation of 8.5%.
• Semi-permanent defect correction through configura-
tion readback.
• Support for multiple non-localized defects.
• Can be extended to support dynamic fault tolerance.
The major disadvantage of this fault tolerance scheme
comes in the need for a small configuration controller, which
requires silicon area and will slow down the power-up se-
quence. At the time of publication an in-depth analysis of
the overheads of the configuration controller has not been
performed: these however are likely to be outweighed by
the considerable yield advantages achieved by this scheme.
This paper is structured as follows. Section 2 provides
a brief account of some of the most successful work carried
out in the field of fault tolerance in FPGAs. Section 3
introduces the fault tolerance scheme and gives details of
its implementation. The results for our pilot architecture are
presented in Section 4, and finally Section 5 concludes the
paper and suggests directions for future research.
2. PREVIOUS WORK
Research carried out in the field of FPGA fault tolerance
can be divided into two main categories. The first exploits
the highly regular structure of the FPGA array. The second
is based on the reconfiguration properties of FPGA devices.
In the first category, one of the earliest works was proposed
in [6]. The research first analyzed the possibility of
modifying the FPGA row selector and extending it to sup-
port the swapping of an entire row for a spare one in the case
of a fault. Other methods, based on a finer redundancy, were
proposed by two different authors in [7, 8]. Both these works
propose widening the routing channels to include a spare in-
terconnect to be swapped for a defective one. The main dif-
ference between these works is in the swapping procedure:
the earlier work utilizes fuse blowing, while the second work
uses a more sophisticated technique to shift all routing tracks
in a channel using a more elaborate switch matrix.
1-4244-0312-X/06/$20.00 ©2006 IEEE.
Many approaches have been presented that exploit the
reconfiguration properties of FPGA devices. The most comprehensive,
which uses roving areas to take part of the chip off-line,
test it and repair it, was presented in [9]. A method
based on pre-compiled partial configurations has been pro-
posed in [10], whereby segments of the design are placed
and routed multiple times and then the configuration that
avoids the defect is chosen for programming the device.
3. FAULT TOLERANCE SCHEME
This section describes the proposed fault tolerance scheme,
its motivation, and the line of thought that led to its
development.
3.1. Motivation
The occurrence of defects during manufacturing is a random
process. While data exist regarding the density and cluster-
ing of defects, it is impossible to formulate prediction mod-
els regarding the location of the defects within a die. As
such, the probability of obtaining even two defective devices
which exhibit the same functional fault in exactly the same
location is negligible.
As FPGAs continue their expansion into the semicon-
ductor market, they are more and more often utilized in medium
volume products. One of the biggest challenges offered by
fault tolerance is therefore ensuring that the same bitstream
produced can be successfully matched to tens, hundreds, and
potentially even thousands of non-identical devices. There-
fore the development of a fault tolerant scheme has the fol-
lowing primary goals:
• increase number of usable devices from wafers,
• ensure no extra design burden for the customer,
• maintain timing closure.
Area, despite being an important issue for most research,
is not listed as a primary goal. This is because, in order
to increase the number of usable devices obtained from the
manufacturing process, the yield advantage only has to overcome
the area overhead. It is therefore more important to increase
yields than to reduce area overheads.
Under these constraints the most reasonable approach
is to automatically manipulate the design before programming
the FPGA. By generating ad-hoc bitstreams for each
device, the “uniqueness” factor of fault tolerance is eliminated.
This can be achieved either through a configuration
controller or through an on-board placer and router, to be run
with prior knowledge of the fault location [11]. However,
other factors affect these types of approaches, most notably
the overhead required to manipulate the bitstreams in a reasonable
amount of time before programming. With designs
getting more and more complex, a full re-generation of the
bitstream is infeasible.
Therefore hardware redundancy, despite coming with greater area
and timing penalties, has been the preferred method for
implementing fault tolerance. As device performance improves
as a result of manufacturing technology and architectural
improvements, the timing degradation resulting from
the extra switching required to avoid the fault can be within
acceptable limits for non-performance-dependent applications.
Any form of hardware redundancy does, however,
restrict the performance of devices, and in a fast moving
semiconductor industry even the smallest degradations are
crucial.
3.2. Preliminary analysis
The first step of our analysis was conducted to understand
exactly how the highly redundant nature of FPGAs affects
the probability of a design being successfully placed and
routed. VPR, an open source place and route tool [12],
was modified in order to provide fault injection. This is
achieved by making an interconnect resource unavailable to
the router, to simulate the presence of a catastrophic interconnect
fault. Selected benchmarks from the MCNC suite [13]
were placed and routed in a minimal FPGA (minimum
array and channel width), then placed and routed in a faulty
FPGA (an FPGA without one interconnect resource). The tool
was run once for every track used by the original design, so
as to simulate operation in the presence of an interconnect
fault on each used track in turn. The results were split into three categories:
• Successful - The design was successfully placed and
routed with the same timing characteristics of the orig-
inal design.
• Timing failure - The design was successfully placed
and routed but the timing was affected when com-
pared to the original design.
• Failed - The design could not be successfully placed
and routed.
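The three categories above can be sketched as a small classification routine. This is only an illustrative sketch, not the modified VPR flow itself: the `place_and_route` callback, its `faulty_track` parameter, and the rule that any increase in delay over the fault-free baseline counts as a timing failure are all assumptions made for the example.

```python
from enum import Enum


class Outcome(Enum):
    SUCCESSFUL = "Successful"
    TIMING_FAILURE = "Timing failure"
    FAILED = "Failed"


def classify(routed, baseline_delay, faulty_delay):
    """Classify one fault-injection run against the fault-free baseline."""
    if not routed:
        return Outcome.FAILED
    # Assumption: any delay above the baseline counts as a timing failure.
    if faulty_delay > baseline_delay:
        return Outcome.TIMING_FAILURE
    return Outcome.SUCCESSFUL


def fault_sweep(used_tracks, baseline_delay, place_and_route):
    """Run one P&R per used track, with that track removed, and tally outcomes.

    place_and_route is a hypothetical callback returning (routed, delay).
    """
    tally = {outcome: 0 for outcome in Outcome}
    for track in used_tracks:
        routed, delay = place_and_route(faulty_track=track)
        tally[classify(routed, baseline_delay, delay)] += 1
    return tally
```

Dividing each tally by the number of used tracks gives per-benchmark percentages of the kind plotted in Figure 1.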
The benchmarks were placed and routed using two routing
architectures. The first, labelled single, is a full connectivity,
low performing architecture: this is the ideal fault-tolerant
architecture because of the high routing flexibility.
The second architecture taken into consideration is a segmented
one, labelled seg, which resembles commercial FPGA
architectures and is a compromise between routability and
performance. Details of these two routing architectures will
follow later.
Figure 1 shows the outcome of the place and route anal-
ysis. In the worst-case scenario of maximum array usage
the percentage of faults causing the design to fail routing
can be as high as 60% for the segmented architecture. The
spread of timing variation is shown in Figure 2, and in both
architectures some designs successfully place and route but
exhibit very high timing degradation.
On average, the single architecture only fails the place
and route process in under 10% of cases. However, if the
experiment is repeated using one extra track in each routing
channel, all designs route successfully and within timing
constraints.
[Figure 1: bar chart. Vertical axis: percentage of P&R passes (0% to 100%). Horizontal axis: the single and seg architectures for each of alu4, ex5p, tseng, bigkey, dsip, des, apex2, frisc, misex3, and the average. Series: Successful, Failed Timing, Failed routing.]
Fig. 1. Design classification in the presence of a fault
[Figure 2: bar chart. Vertical axis: percentage of P&R passes (0% to 18%). Horizontal axis: percentage timing variation, in bins 0-5%, 5-10%, 10-20%, 20-30%, 30-40%, 40-50%, >50%. Series: Single, Seg.]
Fig. 2. Spread of timing variation for designs successfully
placed and routed but with timing failure
Figure 3 shows a portion of the single architecture. The
striking feature of this type of architecture is the high con-
nectivity from all pins of connection block (a) (A,B,C,D,N)
to all pins of connection block (b) (E,F,G,H,O). More impor-
tantly, due to the nature of the architecture it is possible to
replicate all pin-to-pin connections by simply widening the
routing channel and introducing spare resources, as shown
in Figure 3. It is this feature which enables designs to be
placed and routed while maintaining similar timing characteristics
to the original design. Duplicating pin-to-pin connections
has thus been identified as being the key to achieving
fault tolerance without disrupting the timing.
[Figure 3: connection block (a) with pins A, B, C, D, N, the switch matrix, and connection block (b) with pins E, F, G, H, O; the spare resources are marked.]
Fig. 3. Single architecture connectivity
[Figure 4: connection block (a) with pins A, B, C, D, N, the switch matrix, and connection block (b) with pins E, F, G, H, O.]
Fig. 4. An example architecture with limited connectivity
This raises the question of how much redundancy is really
needed in FPGAs. Our analysis has shown that there is
no need to introduce many more spare resources; rather, it is
better to maximize and improve the use of the available
ones left over by the placer and router, and perhaps introduce
spares at a very fine grain level. This has led to the development
of our fault tolerant architecture, discussed in the next
subsection.
3.3. The proposed fault tolerant FPGA
The highly regular structure and high connectivity of the
single architecture enables all pin-to-pin connections to be
replicated with relative ease. Full connectivity, however,
is subject to larger area and considerable timing penalties.
High performance architectures thus have limited connec-
tivity, as shown in the sample architecture in Figure 4; the
architecture is designed taking into account the routability
and performance only. We propose here to re-evaluate the
connectivity of FPGA devices taking into account fault tol-
erance as a parameter.
Consider the architecture shown in Figure 4. The connectivity
in each connection block and the switch matrix can
be expressed mathematically using adjacency matrices, as
shown in Figure 5. A matrix entry of “1” signifies that a
connection exists between a pin and an interconnect (in a
connection block) or between one interconnect and another
(in the switch matrix). Conversely, a matrix entry of “0”
means no connection exists. The product of all three matrices
yields the overall connectivity matrix, where each entry
indicates the number of possible routes connecting the
corresponding pair of pins.
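As a small illustration of this product, the sketch below multiplies the three adjacency matrices using the values transcribed from Figure 5. The helper `matmul` and the variable names are introduced here for illustration only. With the values as transcribed, the product agrees with the overall connectivity matrix printed in Figure 5 except for entry (C, O), which evaluates to 0 rather than 1; this may be a transcription artifact.

```python
def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))]
            for i in range(len(x))]


# Adjacency matrices as transcribed from Fig. 5.
CB_A = [  # rows: pins A, B, C, D, N; columns: tracks l1..l5
    [0, 1, 0, 0, 1],
    [1, 0, 0, 1, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 0, 1],
    [0, 0, 1, 1, 0],
]
SWITCH = [[1 if r == c else 0 for c in range(5)] for r in range(5)]  # identity
CB_B = [  # rows: tracks l1..l5; columns: pins E, F, G, H, O
    [1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 1, 0, 0, 1],
    [0, 0, 0, 1, 1],
]

# Overall connectivity: entry [i][j] counts the routes from pin i to pin j.
overall = matmul(matmul(CB_A, SWITCH), CB_B)
for pin, row in zip("ABCDN", overall):
    print(pin, row)
```

Row B, for example, evaluates to [1, 2, 0, 0, 1]: pin B reaches pin F by two distinct routes, one through track l1 and one through track l4.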
In order to improve fault tolerance we aim to re-evaluate
the overall connectivity matrix by increasing each non-zero
matrix entry by one or more, and derive the individual ad-
jacency matrices accordingly. Integer Linear Programming
was used to solve this problem, and the formulation is shown
in the next subsection.
3.3.1. ILP formulation
This section introduces the Integer Linear Programming (ILP)
formulation used to split an overall connectivity matrix into
the product of three adjacency matrices, each representing a
connection block or a switch matrix. In order to linearize the
problem, the formulation has been split into two stages: in the
first stage the overall connectivity matrix is split into the
product of a binary permutation matrix and an intermediate
matrix; in the second stage the intermediate matrix is further
divided into two binary permutation matrices. The formulation
of the first stage, which depicts the most general case of the
problem, is shown in this section.

Connection Block A
     l1  l2  l3  l4  l5
A     0   1   0   0   1
B     1   0   0   1   0
C     1   0   1   0   0
D     0   1   0   0   1
N     0   0   1   1   0

Switch matrix
     l1  l2  l3  l4  l5
l1    1   0   0   0   0
l2    0   1   0   0   0
l3    0   0   1   0   0
l4    0   0   0   1   0
l5    0   0   0   0   1

Connection Block B
      E   F   G   H   O
l1    1   1   0   0   0
l2    1   0   1   0   0
l3    0   0   1   1   0
l4    0   1   0   0   1
l5    0   0   0   1   1

Overall Connectivity
      E   F   G   H   O
A     1   0   1   1   1
B     1   2   0   0   1
C     1   1   1   1   1
D     1   0   1   1   1
N     0   1   1   1   1

A,B,C,D,N = LUT (a) pins; E,F,G,H,O = LUT (b) pins
Fig. 5. Modelling connectivity using adjacency matrices
Consider an architecture where each routing channel is
t tracks wide and each LUT has p pins which connect to
the routing channel. The overall connectivity matrix for this
kind of architecture is modelled using a matrix C of p rows
by p columns, which is the product of a binary permutation
matrix A of size p × t and an arbitrary matrix B of size t × p.
The main constraint is shown in (1).
\[ \forall i, \forall j: \quad \sum_{k=1}^{t} a_{ik} b_{kj} \ge c_{ij} + 1. \tag{1} \]
As both a and b are variables in our system, the product
$a_{ik} b_{kj}$ in (1) is nonlinear. We therefore introduce
a dummy variable d, constrained by (2), to express (1) in
linear form. Substituting d for the product in (1) yields (3).
\[ d_{ikj} \le a_{ik} b_{kj}. \tag{2} \]
\[ \sum_{k=1}^{t} d_{ikj} \ge c_{ij} + 1. \tag{3} \]
Considering $a_{ik} \in \{0, 1\}$, (2) can be replaced by (4),
which is then linearly expressed by (5) and (6), where U is
an upper bound on b.
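\[ a_{ik} = 0 \Rightarrow d_{ikj} \le 0, \qquad a_{ik} = 1 \Rightarrow d_{ikj} \le b_{kj}. \tag{4} \]
\[ d_{ikj} \le a_{ik} U. \tag{5} \]
\[ d_{ikj} \le b_{kj}. \tag{6} \]
[Figure 6: the example architecture with connection block (a) (pins A, B, C, D, N), the switch matrix, and connection block (b) (pins E, F, G, H, O); the extra switches and spare resources are marked.]
Fig. 6. Example architecture with improved connectivity for
fault tolerance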
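The equivalence between the product bound (2) and the linear pair (5) and (6) can be checked by brute force for small values. The following is only a sanity-check sketch, assuming that b takes integer values between 0 and U; the function names are hypothetical.

```python
def max_d_linear(a, b, upper):
    """Largest d allowed by the linear pair d <= a*U and d <= b."""
    return min(a * upper, b)


def linearization_is_exact(upper):
    """Compare the linear pair with d <= a*b for binary a and 0 <= b <= U."""
    return all(max_d_linear(a, b, upper) == a * b
               for a in (0, 1)
               for b in range(upper + 1))


print(linearization_is_exact(upper=3))
```

The check relies on U being a genuine upper bound on b: if b were allowed to exceed U, constraint (5) would cap d at U even when a = 1, and the linear form would no longer match the product.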
Finally, in order to preserve the regular structure of the
FPGA, the algorithm needs to ensure that each pin connects
to at most a fixed number of interconnects and, vice versa,
that each interconnect connects to at most a fixed number of
pins. These conditions are expressed by (7) and (8), where
N1 and N2 are the intended numbers of connections per
interconnect and per pin respectively, if a is a
connection block adjacency matrix. Similar constraints are
used for switch block permutation matrices.
\[ \sum_{i} a_{ik} \le N_1 \tag{7} \]
\[ \sum_{k} a_{ik} \le N_2 \tag{8} \]
The aim of the formulation is to obtain an optimized solution
that minimizes area. The ILP objective is thus to minimize
the number of non-zero entries in the adjacency matrices,
as they represent the number of switches present in each
connection block and switch block. Considering the fact
that transistors in connection blocks and switch matrices are
likely to have different properties, the size and performance
of each was based on the work presented in [14]. The final
objective equation is shown in (9), where T1 and T2 are the
relative sizes of the transistors in the resources being modelled
by matrices A and B.
\[ \min \sum_{i,k} T_1 a_{ik} + \sum_{k,j} T_2 b_{kj} \tag{9} \]
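To make the first-stage formulation concrete, the sketch below solves a tiny instance by exhaustive search instead of an ILP solver. It is an illustration under stated assumptions and not the authors' implementation: the function name `solve_stage_one`, the 2 × 2 test matrix, and the parameter values are all hypothetical, and enumeration is only practical for very small p and t.

```python
from itertools import product


def solve_stage_one(c, t, upper, n1, n2, t1=1, t2=1):
    """Exhaustively search for the cheapest (A, B) with A*B >= C + 1 entrywise.

    A is binary (p x t); B is integer (t x p) with entries in 0..upper.
    Returns (cost, A, B), or None if no feasible pair exists.
    """
    p = len(c)
    best = None
    for a_flat in product((0, 1), repeat=p * t):
        a = [a_flat[i * t:(i + 1) * t] for i in range(p)]
        # Constraints (7) and (8): bounded column sums and row sums of A.
        if any(sum(a[i][k] for i in range(p)) > n1 for k in range(t)):
            continue
        if any(sum(row) > n2 for row in a):
            continue
        for b_flat in product(range(upper + 1), repeat=t * p):
            b = [b_flat[k * p:(k + 1) * p] for k in range(t)]
            # Constraint (1): every pin pair gains at least one extra route.
            feasible = all(
                sum(a[i][k] * b[k][j] for k in range(t)) >= c[i][j] + 1
                for i in range(p) for j in range(p))
            if not feasible:
                continue
            cost = t1 * sum(a_flat) + t2 * sum(b_flat)  # objective (9)
            if best is None or cost < best[0]:
                best = (cost, a, b)
    return best


# Hypothetical 2-pin, 2-track instance.
C = [[1, 0], [0, 1]]
cost, A, B = solve_stage_one(C, t=2, upper=2, n1=2, n2=2)
print(cost, A, B)
```

For this instance the minimum cost is 6 when up to two pins may share a track (n1 = 2), and it rises to 8 when each track may connect to only one pin (n1 = 1), which shows how the regularity constraints (7) and (8) trade area against structure.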
The result of solving the ILP problem for the sample
architecture shown in Figure 4 is shown in Figure 6. Due
to limitations in space, only connections in the horizontal
channels are shown. For a complete architecture the vertical
connections are also considered.
3.3.2. Fault avoidance
The fault avoidance is based on a node covering scheme.
Each point-to-point connection is “covered” by another option,
so that if a track becomes unavailable as a result of a
fault, the “covering” track is used instead. The transformation
required is very simple, and can easily be performed
during bitstream loading. An in-depth analysis of the bitstream
controller is planned for future work.
[Figure 7: two diagrams, labelled Original (top) and Fault tolerant (bottom), each showing connection block (a) with pins A, B, C, D, N, the switch matrix, and connection block (b) with pins E, F, G, H, O; the extra switches, the spare resources and the faulty line are marked.]
Fig. 7. Example of fault avoidance through re-routing
[Figure 8: bar chart. Vertical axis: percentage of P&R passes (0% to 100%). Horizontal axis: alu4, ex5p, tseng, bigkey, dsip, des, apex2, frisc, misex3, average. Series: Successful, Failed Timing.]
Fig. 8. Percentage of faults causing design degradation
An example transformation is shown in Figure 7. The
original design, depicted on the top diagram, utilizes a faulty
track (second from the bottom of connection block (b)). The
transformation algorithm automatically swaps the signal to
utilize a different track. This in turn “knocks” another sig-
nal, which originally utilized the covering track, onto its own
cover. The final result is depicted on the bottom diagram of
Figure 7.
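The chain of swaps described above can be sketched as follows. This is a minimal illustration, not the bitstream controller itself: the dictionaries `assignment` (track to signal) and `cover` (track to covering track), and the function name `reroute`, are hypothetical representations chosen for the example.

```python
def reroute(assignment, cover, faulty):
    """Shift signals along the chain of covering tracks, starting at the fault.

    assignment maps track -> signal (None if the track is unused);
    cover maps track -> its covering track. Returns a new assignment.
    """
    result = dict(assignment)
    displaced = result[faulty]
    result[faulty] = None  # the faulty track carries nothing afterwards
    track = faulty
    visited = {faulty}
    while displaced is not None:
        track = cover[track]
        if track in visited:
            raise ValueError("cover chain loops back without reaching a spare")
        visited.add(track)
        # The displaced signal takes the cover; its occupant is knocked on.
        displaced, result[track] = result[track], displaced
    return result


# Tracks 0..2 carry signals, track 3 is a spare; each track is covered by the next.
before = {0: "s0", 1: "s1", 2: "s2", 3: None}
cover = {0: 1, 1: 2, 2: 3, 3: 0}
print(reroute(before, cover, faulty=1))
```

In this example the signal on faulty track 1 moves to track 2, and the signal previously on track 2 is knocked onto the spare track 3, while track 0 is left untouched.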
4. ANALYSIS
Our fault tolerant architecture, developed by applying the
technique presented in Section 3.3.1 to the seg architecture,
allows all of the designs taken into account to be placed
and routed successfully, as shown in Figure 8. On average,
over 70% of faults did not affect the designs in any way.
The remaining faults only caused minor timing variations,
as shown in Figure 9. All of the timing variations are within
8.5% of the original design, while the majority are contained
within 5%.
[Figure 9: bar chart. Vertical axis: percentage of P&R passes (0% to 16%). Horizontal axis: percentage timing variation, in bins 0-2.5%, 2.5-5%, 5-7.5%, 7.5-10%.]
Fig. 9. Percentage of faults causing timing variations
[Figure 10: line chart. Vertical axis: minimum size transistors per tile (0 to 6000). Horizontal axis: routing channel width (8 to 22). Series: single line area for Fc = 0.5, 0.75 and 1, and single line area with fault tolerance for Fc = 0.5, 0.75 and 1.]
Fig. 10. Routing area analysis of fault tolerant architectures
Figure 10 depicts the area overhead incurred by the fault
tolerant architecture. The curves show the total routing
area required to implement the architecture with and
without fault tolerance. The graph also shows how varying
the parameter Fc, the fraction of pins each track connects
to, affects the overall area requirements (Fc = 1 means the
interconnect track connects to all pins, while Fc = 0
means the track does not connect to any pins). The total area
is shown in minimum transistor size count, an area measuring
technique introduced in [14]. This method takes into
account the different sizes of connection block, buffer, and
switch block transistors, and it offers a much more realistic
measure of the total area required to implement an architecture.
For the analysis, no buffer sharing was assumed, i.e.
every connection needs an individual buffer.
The results shown in Figure 10 only depict the routing
area requirements. In order to calculate the total tile area
the transistor counts of the logic block would need to be
included. Our sample routing architecture is, for a routing
channel width of 16, 31% larger than the seg architecture
from which it is derived. When logic is included, this value
is reduced to 19%.
[Figure 11: line chart. Vertical axis: number of good dies per wafer (10 to 90). Horizontal axis: array size in arbitrary units (100 to 200). Series: without fault tolerance, with fault tolerance. A vertical dotted line marks the maximum array size.]
Fig. 11. Using the proposed fault tolerance scheme can almost
double the productivity for very large devices at 90nm
The rather large area overhead is, however, outweighed
by the great yield increase, which is reflected in the total number
of working devices obtained from a wafer. Figure 11 shows the
variation in the number of working dies per wafer as a function of
array size, shown here in arbitrary units. The size of the largest
devices built at 90nm, the technology being analyzed here, is
shown by the vertical dotted line in Figure 11. If no hardware
redundancy is used, 14 working dies per wafer can be expected
at this array size. Using our fault tolerance
scheme the total number of working dies can be increased to
26, thereby almost doubling the productivity. Details on the
yield analysis framework used here can be found in [2].
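The trade-off between yield gain and area overhead can be illustrated with a first-order calculation. This is only a rough sketch and not the yield analysis framework of [2]: it assumes that the number of dies per wafer scales inversely with die area, ignores wafer-edge effects, and combines the 40% and 85% yield estimates quoted in the introduction with the 19% total area overhead reported above.

```python
def relative_good_dies(yield_before, yield_after, area_overhead):
    """Ratio of good dies per wafer after vs. before, to first order.

    Assumes dies per wafer scale inversely with die area;
    area_overhead is fractional (0.19 for 19%).
    """
    return (yield_after / yield_before) / (1.0 + area_overhead)


def break_even_yield(yield_before, area_overhead):
    """Yield the larger, fault tolerant die must exceed to pay for its area."""
    return yield_before * (1.0 + area_overhead)


print(round(relative_good_dies(0.40, 0.85, 0.19), 2))  # first-order estimate
print(round(break_even_yield(0.40, 0.19), 3))          # break-even yield
print(round(26 / 14, 2))                               # ratio read from Fig. 11
```

Under these assumptions the estimate is a gain of about 1.79 times, comparable to the ratio of 26 to 14 (about 1.86) reported from Figure 11, and the scheme pays for its area as soon as the yield of the fault tolerant device exceeds roughly 47.6%.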
5. CONCLUSIONS
A new approach to fault tolerance in FPGAs has been pre-
sented. The method proposes to re-evaluate the routing ar-
chitecture of FPGA devices to include fault tolerance as a
measuring parameter as well as performance and routabil-
ity.
The scheme proposed is based on node-covering tech-
niques to replace faulty tracks by spare ones. Area overhead
is limited by minimizing the number of extra switches re-
quired to implement the node-covering.
It has been shown that even in the worst-case scenario
timing variation is within 8.5% of the original design. Using
yield analysis techniques presented in past research it has
also been possible to show that yields can be increased
significantly, almost doubling the total number of working dies
per wafer despite the area overhead.
The scheme requires a configuration controller to implement
the fault avoidance. At the time of publication an extensive
study on the performance of the controller has not been
carried out. Future work will also include exploring details
of efficient Built-in-Self-Test techniques to identify cheaply
and quickly the location of faults in the FPGA.
6. REFERENCES
[1] S. Brown, R. Francis, J. Rose, and Z. Vranesic, Field Programmable Gate Arrays, MA: Kluwer, 1992.
[2] N. Campregher, P.Y.K. Cheung, G.A. Constantinides and M. Vasilko, “Analysis of yield loss due to random photolithographic defects in the interconnect structure of FPGAs,” in Thirteenth ACM International Symposium on Field-Programmable Gate Arrays, Monterey, CA, 2005.
[3] “Altera Says Redundancy Technology Increases Yields,” 2000. [Online]. Available: http://www.reed-electronics.com/electronicnews/article/CA94271.html
[4] C. McClintock, A. L. Lee, and R. G. Cliff, “Redundancy circuitry for logic circuits,” 2000.
[5] S. T. Reddy, M. Mejia, A. L. Lee, and B. B. Pedersen, “Programmable logic device with redundant circuitry,” 2002.
[6] F. Hatori, T. Sakurai, K. Nogami, K. Sawada, M. Takahashi, M. Ichida, M. Uchida, I. Yoshii, Y. Kawahara, T. Hibi, Y. Saeki, H. Muroga, A. Tanaka, and K. Kanzaki, “Introducing redundancy in field programmable gate arrays,” in Proceedings of the IEEE 1993 Custom Integrated Circuits Conference, 1993, pp. 7.1.1–7.1.4.
[7] F. Hanchek and S. Dutt, “Methodologies for tolerating cell and interconnect faults in FPGAs,” IEEE Transactions on Computers, vol. 47, no. 1, pp. 15–33, 1998.
[8] A.J. Yu and G.G.F. Lemieux, “Defect tolerant FPGA switch block and connection block with fine-grain redundancy for yield enhancement,” in Proceedings of FPL 2005, 15th International Conference on Field Programmable Logic and Applications, Tampere, Finland, 2005, pp. 255–262.
[9] M. Abramovici, J. M. Emmert, and C. E. Stroud, “Roving STARs: An integrated approach to on-line testing, diagnosis, and fault tolerance for FPGAs in adaptive computing systems,” in NASA/DoD Workshop on Evolvable Hardware, D. Keymeulen, Ed. Long Beach, CA: IEEE Computer Society, 2001, pp. 73–92.
[10] V. Lakamraju and R. Tessier, “Tolerating operational faults in cluster-based FPGAs,” in Field Programmable Gate Arrays; FPGA ’00, ACM/SIGDA. Monterey, CA: ACM, 2000, pp. 187–194.
[11] S. Dutt, V. Shanmugavel, and S. Trimberger, “Efficient incremental rerouting for fault reconfiguration in field programmable gate arrays,” in 1999 IEEE/ACM International Conference on Computer-Aided Design. Digest of Technical Papers (Cat. No.99CH37051), 7-11 Nov. 1999. San Jose, CA, USA: IEEE, 1999, pp. 173–6.
[12] V. Betz and J. Rose, “VPR: a new packing, placement and routing tool for FPGA research,” Field-Programmable Logic and Applications. 7th International Workshop, FPL ’97. Proceedings, pp. 213–22, 1997.
[13] S. Yang, “Logic synthesis and optimization benchmarks, version 3.0,” Microelectronics Centre of North Carolina, 1991.
[14] V. Betz, “Architecture and CAD for speed and area optimization of FPGAs,” PhD Thesis, University of Toronto, 1998.
