As device size shrinks to the nanometer range, FPGAs are increasingly prone to manufacturing defects. We anticipate that the ability to tolerate multiple defects will be very important at 45nm and beyond. One common defect point is in the lookup table (LUT) configuration bits, which are crucial to the correct operation of FPGAs. In this work we present an error analysis technique that efficiently calculates the number of critical bits needed to implement each LUT. We perform this analysis using a scalable overlapping window-based method called DCOW (Don't-care Computation with Overlapping Windows), which allows for accurate and efficient don't-care lower bound calculations. This new windowing technique can approximate the complete don't cares within 2.34%, and can be used for many logic synthesis operations. In particular, we apply DCOW to our FPGA mapping algorithm to reduce the number of possible faults. This gives the design a much higher chance of functioning correctly when implemented on a faulty FPGA. By using our algorithm, we are able to reduce the number of possible faults by more than 12% with no area increase.
INTRODUCTION
As device size shrinks to the nanometer range, device failures are almost guaranteed. Circuits based on nanotechnology are expected to outperform current technology in terms of density (area), power consumption, and computation speed. These circuits will not only be more susceptible to errors, but will also have high defect rates. Systems employing nano components will presumably have to deal with non-negligible error rates [1], [2], and [7]. The error effects can be corrected using techniques such as hardware/time redundancy [16], error-correcting information encoding [8] and [19], software-based fault tolerance [20], or combinations thereof [15] and [18].
Even at the current device size, the defect rate is significant, and it would be extremely useful to have a secondary use for the defective chips. One approach to using partly defective FPGAs is to choose designs that avoid the defects. This approach is called the Application-Specific FPGA (ASFPGA) and is provided by Xilinx in their EasyPath program [23] . The idea is that the designer creates a bitstream as usual, and then Xilinx selects defective FPGAs that are compatible with the given bitstream. The advantage here is that there is no need for any change in the FPGA architecture, the tool chain, the final bitstream, or methods used when developing the design.
Currently, algorithms for improving reliability [6] , [9] , [17] are only evaluated using Monte Carlo simulation where a single new fault is injected for each set of inputs. This simulation attempts to evaluate whether a fault will get masked, and in turn these reliability improvement algorithms try to reduce the detectability of an error. However, reducing the detectability of a fault does not guarantee the prevention of the fault; i.e., just because a fault is highly unlikely to be detected, it does not mean that it will not occur. For most applications, a circuit which has a fault that only appears every 10,000 clock cycles is just as broken as a circuit that has a fault every clock cycle in the sense that both circuits are not functional. This is especially true when it comes to optimizing FPGA configuration bit errors, as in [6] and [9] , since it is very unlikely that a configuration bit error will correct itself.
In this work we will show that we can efficiently determine which bits of the bitstream are critical, and we can efficiently map the designs to significantly reduce the number of critical configuration bits (defined in the next section). First, in Section 2 we will present the background and problem formulation. Then, in Section 3 we will present a study of how configuration bit errors propagate in LUT-based FPGAs. This will show that most errors can be captured using a windowing technique. In Section 4 we present a novel windowing technique, DCOW (Don't-care Computation with Overlapping Windows), that can efficiently estimate the complete don't care (DC) set for each node in the circuit. In Section 5 we will use DCOW to create a technology mapper that can improve the reliability of a circuit. The results from the technology mapper will be presented in Section 6.
BACKGROUND AND PROBLEM FORMULATION
There are several important factors involved in determining how errors in the configuration bits can be observed on the outputs. For an error to be detectable, it has to be both controllable and observable. The controllability set [11] (whose complement is the controllability don't care set (CDC)) represents all the input patterns that are produced by the environment. The observability set [11] (whose complement is the observability don't care set (ODC)) consists of all the input patterns that represent situations where an output is observed by the environment. Combining the CDC and ODC sets gives the complete set of don't cares (DC). This combined set can be used to determine which configuration bits are not critical. A LUT configuration bit is a critical bit if the corresponding LUT input pattern is controllable, and the inversion of this bit can lead to an observable error on at least one of the circuit outputs. For configuration bit $C_1$ in Figure 1 to be controllable, there has to exist an input to the circuit which results in $i_1 = 0$ and $i_2 = 0$. For an error on configuration bit $C_1$ to be observable at the circuit outputs, the LUT function $f$ should first be replaced by $f \oplus \delta$ to test both the correct ($\delta = 0$) and faulty ($\delta = 1$) functioning of this LUT. Therefore, an input to the circuit has to exist which results in $i_1 = 0$ and $i_2 = 0$, while at the same time toggling $\delta$ leads to a toggling of at least one primary output.
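The definition of a critical bit can be illustrated with a brute-force check on a toy circuit. The sketch below is only an illustration under stated assumptions: the two-LUT netlist, the names `LUTS`, `simulate`, and `critical_bits`, and the truth-table encoding are all hypothetical and are not taken from the paper or from ABC. It flips each configuration bit in turn and declares the bit critical if some primary input vector changes a primary output.

```python
from itertools import product

# Hypothetical 2-LUT circuit: g = AND(i1, i2), out = OR(g, i1).
# Each LUT is (fanin names, truth table); the table index is the fanin
# pattern read as a binary number with the first fanin as the MSB.
LUTS = {
    "g":   (("i1", "i2"), [0, 0, 0, 1]),
    "out": (("g", "i1"),  [0, 1, 1, 1]),
}
ORDER = ["g", "out"]   # topological order, PI to PO
PIS = ["i1", "i2"]
POS = ["out"]

def simulate(luts, pi_values):
    """Evaluate every LUT once and return the primary output values."""
    values = dict(pi_values)
    for name in ORDER:
        fanins, table = luts[name]
        index = 0
        for f in fanins:
            index = (index << 1) | values[f]
        values[name] = table[index]
    return tuple(values[o] for o in POS)

def critical_bits(luts):
    """Return the set of (lut, bit) pairs whose inversion is observable."""
    critical = set()
    for name, (fanins, table) in luts.items():
        for bit in range(len(table)):
            flipped = table.copy()
            flipped[bit] ^= 1
            faulty = {**luts, name: (fanins, flipped)}
            for vec in product((0, 1), repeat=len(PIS)):
                pis = dict(zip(PIS, vec))
                if simulate(luts, pis) != simulate(faulty, pis):
                    critical.add((name, bit))
                    break
    return critical
```

In this toy circuit, bit 2 of `out` is never addressed because $g = 1$ implies $i_1 = 1$ (a controllability don't care), and bits 2 and 3 of `g` are masked by the OR with $i_1$ (an observability don't care), so only five of the eight configuration bits are critical.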
In our work we present an effective mapping algorithm to improve the probability that a design will function correctly when implemented on a faulty FPGA. More concretely, we can formulate the FPGA reliability improvement problem as follows:
Problem: Given a circuit, use technology mapping to improve reliability by reducing the number of critical bits in the LUT configuration bits used to implement the circuit. The number of critical bits provides a way to estimate the likelihood that a design will function correctly when implemented on a faulty FPGA, in which some configuration bits are faulty due to transient and permanent errors. As in [6] and [9] , in this paper, we address only FPGA LUT configuration bit errors, but we plan to address interconnect errors in our future work.
MOTIVATION FOR WINDOWING
To be able to quickly estimate the controllability and observability of configuration bit errors, it is important to analyze how they relate to different properties in the circuit. To evaluate these properties, we simulated a fault on every configuration bit of MCNC circuits with fewer than 25 combinational logic inputs (latches were treated as primary inputs). The restriction on the number of inputs allowed every possible circuit input to be simulated. The set consisted of the following 8 circuits: alu4, apex4, misex3, pdc, s298, spla, ex1010, and ex5p. These benchmarks were each mapped depth optimally to the 6-LUT architecture using the ABC FPGA mapper, "fpga" [13]. ABC performs mapping by first converting the design into an AND-Inverter Graph (AIG), and then performing several iterations of cut selection (depth-optimal followed by area recovery).
We first analyzed how many nodes can be affected by a configuration bit error. The data in Figure 2 shows that the errors do not propagate very far. Over 82% of the time, the error propagates to ten or fewer nodes. This means it is possible to correctly calculate (82% of the time) the observability by only examining ten of the transitive fanout nodes. The transitive fanout (TFO) of node $n$ can be defined recursively (PO to PI) using $TFO(n) = \{n\} \cup \left( \bigcup_{f \in fanouts(n)} TFO(f) \right)$, where $fanouts(n)$ are the fanouts of $n$. Similarly, the transitive fanin (TFI) of node $n$ can be defined recursively (PI to PO) using $TFI(n) = \{n\} \cup \left( \bigcup_{f \in fanins(n)} TFI(f) \right)$, where $fanins(n)$ are the fanins of $n$.
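The recursive TFO and TFI definitions translate directly into a memoized traversal. The following is a minimal sketch on a hypothetical five-node netlist; the names `FANOUTS`, `invert`, and `transitive` are illustrative assumptions and the graph is assumed to be acyclic.

```python
# Hypothetical netlist: node -> list of its fanout nodes (a DAG).
FANOUTS = {
    "a": ["c"],
    "b": ["c", "d"],
    "c": ["e"],
    "d": ["e"],
    "e": [],
}

def invert(graph):
    """Build the fanin map from a fanout map (or vice versa)."""
    inverse = {n: [] for n in graph}
    for n, outs in graph.items():
        for f in outs:
            inverse[f].append(n)
    return inverse

def transitive(node, neighbours, memo=None):
    """Return {node} united with transitive(f) for every neighbour f."""
    if memo is None:
        memo = {}
    if node not in memo:
        result = {node}
        for f in neighbours[node]:
            result |= transitive(f, neighbours, memo)
        memo[node] = frozenset(result)
    return memo[node]

FANINS = invert(FANOUTS)
tfo_b = transitive("b", FANOUTS)  # TFO of node b
tfi_e = transitive("e", FANINS)   # TFI of node e
```

Passing the fanout map yields the TFO and passing the fanin map yields the TFI, so one function covers both definitions.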
Figure 2. Error spreading count
When the error spreads to a lot of nodes, it is useful to see if it also spreads to the outputs and becomes observable. Figure 3 shows how far the error spreads compared to the amount of masking. In this figure, each dot represents a LUT. The x-coordinate represents the percent of critical configuration bits of this LUT that are not masked, and the y-coordinate represents the total number of other LUTs reached by all possible configuration errors in this LUT. We can see that if the error spreads to more than ten LUTs, then it is likely that all the corresponding configuration bits are critical.
Figure 3. Error spreading statistics
In order to evaluate where all the possible faults can occur, the complete DC for each node has to be calculated. If the circuit is small enough, a complete simulation can be performed to compute the complete DC. But for most circuits, the number of inputs is too large to perform a full simulation, and a full DC computation is very costly [12]. In these cases we will use a windowing technique [14] to perform the evaluation. Windowing works by creating a small environment (defined by a TFI and a TFO set) around the node under test and performing a complete simulation in this environment. By restricting the size of the circuit under test using the windowing technique, we can reduce the size of the fault coverage problem. Since the window is significantly smaller than the whole circuit, the window-based fault coverage will overestimate the real faults by underestimating the complete DC of the LUT node in the window. Unlike Monte Carlo simulation-based methods [6], [9], [17], if a configuration bit is determined not to be critical in a window, then it is guaranteed to be not critical in the circuit. So, we never have false negatives using the windowing approach.
The quality of the windowing solution is determined by how accurately it can be used to generate the complete DC set for each node in the circuit. Since windowing is guaranteed to produce a DC lower bound and simulation produces a DC upper bound, the difference (scaled to total number of configuration bits in the circuit) can be used to evaluate the quality of the solution.
Specifically, the quality of the solution can be evaluated using the following equation:

$$\text{gap} = \frac{\sum_{n} \left( |DC_s(n)| - |DC_w(n)| \right)}{\sum_{n} 2^{|inputs(n)|}}$$

For a LUT node $n$ with $|inputs(n)|$ inputs, $|DC_w(n)|$ ($|DC_s(n)|$) is the size of the DC set computed by the windowing formulation (Monte Carlo simulation), $\sum$ is a summation over all the nodes in the circuit, and $2^{|inputs(n)|}$ is the number of configuration bits of the node and also the maximum possible size of the DC. Thus, the gap is the difference between the two DC evaluations scaled to the maximum possible size of the DC. For the simulation-based DC calculation, we used a complete simulation for circuits with fewer than 25 inputs and a 100M-vector random simulation for the rest of the circuits. To generate a baseline for previous methods, we evaluated a quick 100K Monte Carlo simulation and the windowing from [14]. The windowing in [14] creates windows by combining a fanout cone and a fanin cone with the intermediate nodes. The results of using these methods to calculate the DCs for the MCNC benchmarks are shown in Table 1. Assuming the large 100M simulation provides a good estimation of the sum of the DCs, a quick simulation overestimates by 3.31%, while the windowing from [14] underestimates by 6.65%. This is an encouraging starting point which shows that the baseline windowing technique can be greatly improved.
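As a worked illustration of the gap metric described above (the summed difference between the simulation-based and window-based DC sizes, divided by the total number of configuration bits), the sketch below evaluates it on made-up numbers. The function name `dc_gap` and the three example nodes are hypothetical and do not come from the paper's tables.

```python
def dc_gap(nodes):
    """Gap between simulation and windowing DC sizes, scaled to total bits.

    nodes: list of (num_inputs, dc_window_size, dc_sim_size), one per LUT.
    """
    total_bits = sum(2 ** k for k, _, _ in nodes)
    difference = sum(dc_s - dc_w for _, dc_w, dc_s in nodes)
    return difference / total_bits

# Hypothetical circuit with two 6-input LUTs and one 4-input LUT.
example = [(6, 10, 12), (4, 3, 4), (6, 20, 20)]
gap = dc_gap(example)  # (2 + 1 + 0) / (64 + 16 + 64) = 3 / 144
```

Because windowing yields a lower bound and simulation an upper bound on each DC set, every per-node difference is non-negative and the gap shrinks toward zero as the two estimates converge.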
DON'T-CARE COMPUTATION WITH OVERLAPPING WINDOWS (DCOW)
To improve the DC computation of the previous methods, we introduce a new windowing method, called DCOW (Don't-care Computation with Overlapping Windows), which has three major innovations: basic window creation (Section 4.1), information passing between windows (Section 4.2), and growing windows in an intelligent manner (Section 4.3).
Input-Bounded Windowing
Our main improvement relates to how we generate the window. Instead of adding a set number of fanin and fanout levels, as in [12] and [14] , we construct an input-bounded window which tries to maximize the volume (as defined by the number of nodes covered).
Since the window has to grow in both directions, the maximum volume K-feasible cut method from FlowMap [4] cannot be used directly, and we grow the window using a heuristic. Among the nodes that can be added to the window, we select the one that adds the fewest new inputs to the window. This allows the window to grow as large as possible while maintaining the limit on the number of inputs. This volume-maximizing idea is effective for two reasons. First, the more nodes the window covers, the more likely it is to capture the complete DC information. Second, this heuristic attempts to limit the fanins of the window by selecting reconvergent nodes. This in turn helps restrict the controllability set size by reducing the possible inputs to the node. These two properties allow DCOW to estimate the DC within 3.19%, which corresponds to reducing the estimated gap by over 2X when compared to the baseline windowing in [14].
The windowing heuristic generates a lot of ties by only examining nodes that are immediately connected to the window. This is especially a problem when selecting the first node to add to the window. But by performing a simple lookahead of one node, we can further reduce the average DC estimate gap to 2.66%, further reducing the estimated gap by 19.6% when compared to the baseline windowing in [14].
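The greedy input-bounded growth described in this section can be sketched as follows. This is a simplified illustration, not the DCOW implementation: the netlist, the function names, and the alphabetical tie-breaking are assumptions, and the one-node lookahead is omitted. At each step the candidate whose addition leaves the window with the fewest inputs is added, as long as the input limit is respected.

```python
def window_inputs(window, fanins):
    """Nodes outside the window that directly feed a node inside it."""
    return {f for n in window for f in fanins[n] if f not in window}

def grow_window(seed, fanins, max_inputs):
    """Greedily grow a window around `seed` (the seed itself is not checked)."""
    fanouts = {n: set() for n in fanins}
    for n, ins in fanins.items():
        for f in ins:
            fanouts[f].add(n)
    window = {seed}
    while True:
        # Candidates: non-PI nodes adjacent to the window in either direction.
        frontier = set()
        for n in window:
            frontier |= set(fanins[n]) | fanouts[n]
        candidates = sorted(c for c in frontier - window if fanins[c])
        best = None
        for c in candidates:
            cost = len(window_inputs(window | {c}, fanins))
            if cost <= max_inputs and (best is None or cost < best[0]):
                best = (cost, c)
        if best is None:
            return window
        window.add(best[1])

# Hypothetical netlist: node -> fanins (primary inputs have no fanins).
FANINS = {
    "i1": [], "i2": [], "i3": [], "i4": [],
    "a": ["i1", "i2"],
    "b": ["i2", "i3"],
    "c": ["a", "b"],
    "d": ["c", "i4"],
}
window = grow_window("c", FANINS, max_inputs=3)
```

In this example, after `a` is added, `b` is preferred over `d` because `b` shares the input `i2` with `a` (a reconvergent path), so the window reaches three nodes without exceeding three inputs.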
Overlapping Windows
The first step in evaluating whether a configuration bit is in the DC set is to determine if it is not controllable. The controllability set can be minimized by overlapping the windows to further restrict the controllability set of each window. Let us define $C(n, W)$ to be the set of possible inputs to LUT $n$ when performing a complete simulation of window $W$. If two windows disagree on the controllability of node $n$, i.e., $x \in C(n, W_2)$ but $x \notin C(n, W_1)$, then the window input which corresponded to $x \in C(n, W_2)$ was not a valid input to $W_2$. Thus, the final controllability of node $n$ is defined as the intersection over all windows containing $n$:

$$C(n) = \bigcap_{W \ni n} C(n, W)$$
Figure 4. Overlapping window example
For example, consider Figure 4. It has two 4-input windows defined by their inputs, $W_1 = \{a, i_3, i_4, i_5\}$ and $W_2 = \{i_1, i_2, b, c\}$ (where $a$, $b$, and $c$ are outputs from LUT nodes labeled $a$, $b$, and $c$) with a common node $e$. Both windows must agree on the controllability set for $e$. Consider input $x = \{a, b, c\} = \{0, 0, 1\}$ to node $e$. We can derive that $x \in C(e, W_2)$ but $x \notin C(e, W_1)$; then the following inputs to $W_2$ are not valid: $\{0, 0, 0, 1\}$, $\{0, 1, 0, 1\}$, $\{1, 0, 0, 1\}$, and $\{1, 1, 0, 1\}$. The more overlapping windows we use, the more accurate our solution will become. This can further reduce the estimated gap of the DCOW-based DC calculation method by 2.7% when compared to the baseline windowing in [14].
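The agreement rule between overlapping windows can be sketched with two small exhaustive simulations. Since the logic inside Figure 4 is not given in the text, the gate functions below (`b = i3 OR i4`, `c = i3 AND i5`, `a = i1 AND i2`) are purely hypothetical choices that happen to reproduce the situation in the example; the function names are also assumptions.

```python
from itertools import product

def simulate(num_inputs, pattern_of):
    """Complete window simulation: node-input pattern -> window input vectors."""
    produced = {}
    for vec in product((0, 1), repeat=num_inputs):
        produced.setdefault(pattern_of(vec), []).append(vec)
    return produced

# Hypothetical logic. W1 inputs are (a, i3, i4, i5); W2 inputs are (i1, i2, b, c).
def e_pattern_w1(vec):
    a, i3, i4, i5 = vec
    return (a, i3 | i4, i3 & i5)   # (a, b, c) as seen by node e in W1

def e_pattern_w2(vec):
    i1, i2, b, c = vec
    return (i1 & i2, b, c)         # (a, b, c) as seen by node e in W2

w1 = simulate(4, e_pattern_w1)
w2 = simulate(4, e_pattern_w2)

# Final controllability of e: patterns that every window can produce.
final = set(w1) & set(w2)

# W2 input vectors that produce a pattern ruled out by W1 are not valid.
invalid_w2 = sorted(v for p, vecs in w2.items() if p not in final for v in vecs)
```

Under these assumed gates, `c = 1` forces `b = 1` in $W_1$, so every pattern with $b = 0$ and $c = 1$ is dropped from the final controllability set, and the four $W_2$ vectors of the form $(i_1, i_2, 0, 1)$ are marked invalid.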
Simulation Guided Growing
After computing the controllability for each node, the next step in evaluating whether a configuration bit is critical is to determine if it is observable. The observability set is much more difficult to represent accurately. The main difficulty of a window-based observability calculation is that it is important to cover all the fanouts of a node. This is the only way to determine if the error will get masked. If not enough nodes are encompassed by the window, then the window will not be very effective. From the results in Figure 2 and Figure 3, we see that we do not need to cover very many TFO (transitive fanout) nodes to determine if the error is observable. In fact, if the fanout window captures at least five TFO nodes, there is an 85.4% chance that any error escaping the window will reach the outputs. If the window covers at least ten TFO nodes, this probability increases to 99.4%.
Unlike previous methods which selected several levels of TFO nodes [12] , [14] , and [21] , we use simulation to help select which TFO nodes should be included in the window. We first perform a quick simulation to see where the errors can spread to without reaching the outputs (false positive errors), and then we guide windowing to cover the TFO nodes which have the most false positive errors. Overall, this technique can further improve the DCOW-based DC calculation gap by 20.3% when compared to the baseline windowing in [14] .
Summary of Windowing
The summary in Figure 5 shows how our DCOW method compares to the currently available techniques. Initially, obtaining a solution that is close to simulation might seem like a lot of work. But, the real DC calculation lies between the "DCOW" line and the "Sim 100M" line, and windowing is a scalable way to get an accurate lower bound of the complete DC set. And a DC lower bound is much more valuable than an upper bound, as shown in the following theorem:
Theorem 1: Given a LUT, let $f$ be its original function and $f'$ be the resulting function after a configuration bit error. If $f \oplus f'$ is covered by the DC lower bound computed by DCOW, then the configuration bit error is not critical.
The proof of this theorem is omitted due to the page limit. Note that the DC lower bound computed by DCOW can be used in other logic optimization procedures, such as resynthesis and rewiring.
Figure 5. Summary of methods
The results in Table 2 show that the improved windowing technique can measure the DC of a node within 2.34% on average. When comparing the estimated DC gap, DCOW amounts to a 2.88X improvement over the baseline windowing from [14] and a 41% improvement over the 100K simulation-based DC calculation.
In terms of runtime, it took less than 10 minutes to perform the windowing estimation for all circuits, compared to 48 minutes for the 100K Monte Carlo simulation. In the next section we will present a technology mapping extension to ABC that uses windowing to improve the probability that a design will function correctly when implemented on a faulty FPGA.
RELIABILITY IMPROVEMENT IN MAPPING
Using our newly created DC computation method, DCOW, we developed DCOWMap, a method for improving reliability through technology mapping. This method extends ABC [13] using a windowing technique to simplify reliability evaluation (shown in Figure 6). Since technology mapping is just a covering problem, it does not affect the functionality of internal nodes. This property ensures that the mapping of one node does not affect the DC set of any other node. This independence integrates very well with the dynamic programming-based technology mapping used in ABC.
Figure 6. Windowing technique for FPGAs
By performing this technology mapping-based reduction we can evaluate the amount of reliability gains that can be achieved through a localized search. We implemented DCOWMap by adding an extra two steps to the ABC FPGA mapper (command "fpga"). Each step works by creating a DCOW-based window around each node in topological order (PI to PO), performing a complete simulation on the window and then selecting the best cut implementation for the node. The two steps of mapping differ only in how they perform cut selection. In the first step we use a critical bit flow formulation, and in the second step we use an exact critical bit formulation.
Critical Bit Flow Cut Selection
The first step selects the best cut implementation by using a flow heuristic commonly used for area flow calculations [5], [10], [13]. We recursively define critical bit flow (CBF) as follows:

$$CBF(n) = CriticalBits(n) + \sum_{f \in fanins(n)} \frac{CBF(f)}{NumFanout(f)}$$

where $CriticalBits(n)$ is the number of critical bits of the LUT implementation of $n$ (also equal to the size of the complement of $n$'s DC set), $fanins(n)$ are the fanins of $n$, and $NumFanout(f)$ is the number of fanouts of node $f$ in the current selected mapping. The notion of a flow captures the fact that each incoming flow is shared by the fanouts of the fanins, which gives a global view of the cost function during technology mapping similar to that used in [13] and [5]. This global view is then complemented with the use of the exact critical bit cut selection in the second step.
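A flow of this kind can be computed in one pass over the nodes in topological order. The sketch below assumes the usual area-flow form, in which a node's flow is its own critical bit count plus each fanin's flow divided by that fanin's fanout count; the netlist, the critical bit counts, and the function name are hypothetical.

```python
def critical_bit_flow(order, fanins, critical_bits, num_fanout):
    """Compute the flow of every node, visiting nodes from PIs to POs."""
    cbf = {}
    for n in order:
        shared = sum(cbf[f] / num_fanout[f] for f in fanins[n])
        cbf[n] = critical_bits[n] + shared
    return cbf

# Hypothetical mapped netlist: x and y are primary inputs (no critical bits).
FANINS = {"x": [], "y": [], "g": ["x", "y"], "h": ["g"], "k": ["g", "y"]}
CRITICAL_BITS = {"x": 0, "y": 0, "g": 4, "h": 2, "k": 3}
NUM_FANOUT = {"x": 1, "y": 2, "g": 2, "h": 1, "k": 1}
ORDER = ["x", "y", "g", "h", "k"]

cbf = critical_bit_flow(ORDER, FANINS, CRITICAL_BITS, NUM_FANOUT)
```

Here node `g` drives both `h` and `k`, so each of them is charged half of the four critical bits of `g`.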
Exact Critical Bit Cut Selection
The second step selects the best cut implementation by recursively calculating the exact number of critical configuration bits it would take to implement each cut. This is based on the local view heuristic in [10] , known as the exact area. The exact number of critical bits (ECB) of a cut is defined as the sum of the number of critical bits of the LUTs in the maximum fanout free cone (MFFC) of the cut, and can be recursively defined as follows:
where $CriticalBits(n)$ is the number of critical bits of the LUT implementation of $n$, $fanins(n)$ are the fanins of $n$, and $MFFC(n)$ is the set of nodes in the MFFC rooted at $n$. This can be calculated in a DFS order very efficiently since the selection of a single cut does not usually make a drastic change to the mapping solution of the circuit.
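The prose definition of ECB, the sum of critical bits over the LUTs in the MFFC of a node, can be sketched with a reference-counting traversal. This is one common way to find an MFFC and is an assumption here, not necessarily how DCOWMap or ABC computes it; the netlist and all names are hypothetical.

```python
def mffc(root, fanins, num_fanout):
    """Nodes that would become unused if `root` were removed (root included)."""
    refs = dict(num_fanout)      # local copy of the fanout reference counts
    cone = set()
    stack = [root]
    while stack:
        n = stack.pop()
        cone.add(n)
        for f in fanins[n]:
            refs[f] -= 1
            # A fanin joins the cone once nothing outside the cone uses it.
            if refs[f] == 0 and fanins[f]:   # primary inputs are skipped
                stack.append(f)
    return cone

def exact_critical_bits(root, fanins, num_fanout, critical_bits):
    """Sum of critical bits over the MFFC rooted at `root`."""
    return sum(critical_bits[m] for m in mffc(root, fanins, num_fanout))

# Hypothetical netlist: g feeds h and k, h feeds only k, k is a primary output.
FANINS = {"x": [], "y": [], "g": ["x", "y"], "h": ["g", "y"], "k": ["h", "g"]}
NUM_FANOUT = {"x": 1, "y": 2, "g": 2, "h": 1, "k": 1}
CRITICAL_BITS = {"x": 0, "y": 0, "g": 4, "h": 2, "k": 3}

ecb_k = exact_critical_bits("k", FANINS, NUM_FANOUT, CRITICAL_BITS)
```

In this example `g` belongs to the MFFC of `k` but not to the MFFC of `h`, because `g` also drives `k` directly.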
RESULTS
We evaluated DCOWMap using the 20 largest MCNC benchmark circuits. The results in Table 3 were evaluated using either a full simulation of all possible inputs (latches are treated as primary inputs) for circuits with fewer than 25 combinational logic inputs (alu4, apex4, misex3, pdc, s298, spla, ex1010, and ex5p) or using a Monte Carlo simulation with 100M random inputs for the rest of the circuits. These benchmarks were each mapped depth optimally to the 6-LUT architecture. Since the number of inputs to each window is relatively small, DCOWMap was able to map the largest circuit, clma, in less than 200 seconds. This corresponds to evaluating 50 nodes each second. These results show that with no area increase, the critical bits can be reduced by over 12% (spla, des, and ex5p were reduced by over 20%). Since the area was unaffected, both versions (ABC baseline and DCOWMap) would be implemented on the same FPGA with the same number of possible failures. Thus, on average, the optimized version can tolerate 12% more permanent faults in the LUT configuration bits.
To verify that our mapping solution did not degrade the placement solution, we placed and routed the mapped circuits using VPR [3] . We observed no increase in routing failure and no critical path delay increase (in fact, critical path delay decreased by 0.4%).
each configuration bit for that specific input vector instead of all possible input vectors. Second, by selecting random configuration bits to test instead of testing all of them, the fault rate stays artificially low. For example, this would allow a minor change, like the masking of 200 configuration bits, to appear like a 2X improvement (20K * 2% = 400 versus 20K * 1% = 200). In the best case, the ROSE fault evaluation calculates the amount of masking that exists for transient configuration bit faults, but configuration bit faults are not transient. In this case we need to ensure that the circuit functions correctly under all inputs, not just one. In our evaluation we use a new 100M input vector to evaluate each node and report the total number of critical bits.
CONCLUSION
The contributions of this work can be broken down into three parts. First, we presented a quantitative error propagation study where we identified several critical properties that enable efficient DC calculations. Second, we presented a scalable DC computation method using an improved windowing method (DCOW), with a 2.88X improvement over the windowing in [14] . Third, we used the windowing technique to create a reliability-driven technology mapper, DCOWMap, which is able to reduce the total number of critical configuration bits by 12.3% with no area overhead. Our work shows that it is possible to quickly increase the probability that a design will function correctly when implemented on a faulty FPGA.
The source code for the technology mapping algorithm and all the reliability evaluators can be downloaded at: http://cadlab.cs.ucla.edu/software_release/reliability/
ACKNOWLEDGEMENTS
Financial support from Altera, Magma, and NSF grant CNS-0725354 is gratefully acknowledged.
