Abstract-Negative bias temperature instability (NBTI) significantly affects nanoscale integrated circuit performance and reliability. The degradation in threshold voltage (V th) due to NBTI is further affected by the initial value of V th from fabrication-induced process variation (PV). Addressing these challenges in embedded FPGA designs is possible, as FPGA reconfigurability can be exploited to measure the exact timing degradation of an FPGA due to the joint effect of NBTI and PV at run time with low overhead. The gathered information can then be used to improve the run-time performance and reliability of FPGA designs without targeting the pessimistic worst case.
I. INTRODUCTION
Negative bias temperature instability (NBTI) is a leading reliability challenge at the nanoscale that causes degradation in the threshold voltage (V th) of a PMOS transistor, which gradually increases delay. NBTI occurs when a PMOS transistor is stressed under high temperatures with V gs = −V dd, causing a high oxide electric field (E ox); this stress causes some Si-H bonds at the Si-SiO 2 interface to break, leaving unpaired valence electrons in Si atoms. These broken bonds are called interface traps. The existence of such traps increases the absolute value of V th in PMOS transistors.
NBTI degradation is further affected by the initial value of V th, which may deviate from the nominal value due to process variation (PV). The initial value of V th determines the amount of E ox; the smaller its initial value, the higher E ox, which increases NBTI degradation. However, the joint NBTI/PV effect, i.e., combining initial V th with the expected degradation due to NBTI, shows that variation in V th is always the dominating factor, which means that transistors with smaller initial V th will still have smaller V th even after NBTI degradation. Variations in V th are expected to increase further according to the ITRS [1], so addressing the joint NBTI/PV effect becomes increasingly critical.
The regular structure of FPGAs and their reconfigurability can be exploited to measure the joint effect of PV and NBTI at run time. The results can be used to optimize circuit placement and routing with awareness of NBTI and PV effects to improve performance. Chip-wise PV measurement and enhanced placement can be used for performance and reliability optimization of standalone and embedded FPGA systems [2] .
More specifically, the operating system or system middleware can conduct periodic tests at run time on the configurable fabric in order to measure the effect of NBTI and PV on the performance by implementing ring oscillators and observing variations in oscillator frequency. When the system needs to implement some circuit in a programmable core to accelerate the execution of a specific critical region in a software application, the embedded CAD tools can be invoked to perform the required placement and routing with consideration of NBTI/PV effects, making the placed and routed circuit faster than considering the worst case.
Although past work has addressed NBTI and PV as independent problems, no work, to the best of our knowledge, studies the joint effect on the speed of circuits placed and routed in FPGAs. The closest to our work is the study conducted by Cheng et al. [3] . The authors proposed a chip-wise placement for FPGAs considering the effect of PV. A PV map is assumed for each chip. The critical path delay of a circuit is evaluated considering PV information during placement. However, their work does not consider the joint NBTI/PV effect. Moreover, we use actual PV data that we measure from 15 FPGA chips, rather than assuming some arbitrary PV data.
In this paper, we propose joint NBTI/PV optimization techniques for FPGAs, targeting the placement phase of the FPGA design flow. Our techniques include NBTI/PV-aware timing analysis, region-based delay estimation, and a new move-acceptance optimization procedure. We integrate the proposed techniques into T-VPlace [4], a timing-driven placement engine based on simulated annealing, and then assess the FPGA circuit performance improvement when PV and NBTI effects are taken into consideration. The PV data used in the experiments are based on measurements of 15 Xilinx Virtex-II Pro FPGA chips. The proposed techniques significantly reduce the joint NBTI/PV effect on most studied circuits for the different measured PV maps. They can also be easily integrated with other FPGA placement algorithms, as they target the common optimization steps used in FPGA placement.
The rest of the paper is organized as follows. In Section II, we describe the long-term NBTI model and the joint NBTI/PV effect. In Section III, we discuss limitations of existing FPGA placement techniques that make them unsuitable for addressing NBTI/PV effects, taking T-VPlace as an example, and we present our joint NBTI/PV-aware optimization techniques. Results and discussion are presented in Section IV. We conclude the paper in Section V.
II. JOINT NBTI/PV EFFECT
II.A. NBTI Model
In this work, we adopt the long-term NBTI model proposed by Bhardwaj et al. [5]. The model is shown in Table I. It predicts the upper bound of the dynamic (stress/recovery) NBTI effect, and its accuracy has been validated experimentally [5]. This model has been used extensively in past work to investigate the NBTI effect [6], [7], [8]. In this model, the long-term NBTI effect is a function of the initial value of V th, so the model can be extended to study the joint effect of fabrication-induced PV and run-time NBTI degradation.
II.B. Joint NBTI and PV
We have conducted detailed analysis of the joint PV/NBTI effect. Figure 1 summarizes the NBTI effect as a function of the initial V th . It shows that the NBTI effect has more significant impact on devices with smaller initial V th . This phenomenon can be explained using the long-term NBTI model (see Table I [5]). Therefore, due to the NBTI effect, the V th variation gradually decreases. This study also indicates that to minimize the NBTI effect, circuits more susceptible to the NBTI effects should be placed in the FPGA components with lower initial V th values. 
III. NBTI/PV-AWARE OPTIMIZATION FOR FPGAS
In this section, we describe the proposed joint NBTI/PV-aware optimization techniques for FPGA placement. The proposed solutions have been integrated into T-VPlace [4] for performance evaluation. T-VPlace is a simulated-annealing-based, timing-driven placement engine in the VPR placement and routing tool [9].
In the following sections, using T-VPlace as an example, we first summarize the limitations of existing FPGA placement techniques, and then present the proposed joint NBTI/PV-aware optimization techniques.
III.A. Assessing T-VPlace by Considering NBTI and PV
Most existing FPGA placement algorithms, such as T-VPlace in the VPR tool, perform timing analysis assuming that all of the components in an FPGA chip have homogeneous timing properties. In other words, the timing variation due to the joint NBTI/PV effect is ignored. As a result, if a circuit block is placed in a fast location, the circuit latency will be less than the delay in the nominal case, reducing the timing criticality for connections in the path to which the block belongs. For this reason, it is important to perform NBTI/PV-aware timing analysis in order to provide more accurate timing information.
The timing analysis method in VPR is based on building a delay lookup matrix that assumes nominal values for all routing resources. With PV, however, variations in performance of routing resources exist according to their locations in a chip. Although delay estimation is only used to update the criticalities of connections, it is important to account for the effect of PV to attract critical portions of a circuit (nets and blocks) to regions in the chip that are faster than others, and thereby reduce the joint NBTI/PV effect.
Another important issue is related to moves/swaps in the placement phase. The existing cost function that is used in T-VPlace to assess placement quality does not account for the NBTI/PV effect. An appropriate cost function would not only favor movements that reduce the timing and wiring costs for a circuit's placement, but also those movements that result in moving NBTI-critical blocks to locations in a chip that are faster than others. This requirement stems from the discussion in Section II-B regarding the joint NBTI/PV effect. An enhanced move-acceptance policy should seek to overcome the performance degradation from the joint effect as well as reducing timing and wiring costs.
III.B. NBTI/PV-Aware Delay Estimation and Timing Analysis
We next describe the proposed NBTI/PV-aware timing analysis. In the proposed approach, the delays in the circuit's timing graph edges that represent internal paths within CLBs are updated to reflect the joint NBTI/PV effect based on the location of a CLB and the simulated NBTI effect. Timing analysis is improved by developing a PV-aware delay estimation technique. We divide the chip's PV map into rectangular regions that can be considered homogeneous in terms of the effect of PV. Delay estimates are then calculated based on the delay lookup matrix that is built by VPR, and PV regions in a chip. PV regions are identified using a split-and-merge procedure [10] (Algorithm 1).
For the implementation of the isHomogeneous() function, we ensure that a specific percentage of blocks in a PV region are at the same variation level. We assume that there are 3 levels of V th variations in a chip: below the average, near the average, and greater than the average. Our study shows that using more levels does not provide additional benefits. A region satisfies the homogeneity condition if the blocks within the same level are φ% of all blocks in the region. This condition allows identifying relatively slow and fast PV regions in a chip. We found experimentally that values of φ in the range 60-80 provide good placement results.
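The homogeneity test and the split phase of the region identification can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the procedure of Algorithm 1: the merge phase is omitted, the quadrant split, the `band_sigmas` width of the "near the average" level, and all function names are hypothetical, and `min_size` stands in for a minimum region size.

```python
from statistics import mean, pstdev

def classify(vth, avg, band):
    """Return 0 (below average), 1 (near average) or 2 (above average)."""
    if vth < avg - band:
        return 0
    if vth > avg + band:
        return 2
    return 1

def is_homogeneous(levels, region, phi):
    """True if at least phi percent of the blocks in region share one level."""
    x0, y0, x1, y1 = region  # half-open rectangle [x0, x1) x [y0, y1)
    counts = [0, 0, 0]
    for y in range(y0, y1):
        for x in range(x0, x1):
            counts[levels[y][x]] += 1
    total = (x1 - x0) * (y1 - y0)
    return 100.0 * max(counts) >= phi * total

def split_regions(levels, region, phi, min_size):
    """Recursively quarter a region until it is homogeneous or too small."""
    x0, y0, x1, y1 = region
    w, h = x1 - x0, y1 - y0
    if is_homogeneous(levels, region, phi) or w < 2 * min_size or h < 2 * min_size:
        return [region]
    xm, ym = x0 + w // 2, y0 + h // 2
    quads = [(x0, y0, xm, ym), (xm, y0, x1, ym),
             (x0, ym, xm, y1), (xm, ym, x1, y1)]
    out = []
    for q in quads:
        out.extend(split_regions(levels, q, phi, min_size))
    return out

def pv_regions(vth_map, phi=70.0, min_size=5, band_sigmas=0.5):
    """Split a 2-D V_th map (list of rows) into rectangular PV regions."""
    flat = [v for row in vth_map for v in row]
    avg = mean(flat)
    band = band_sigmas * pstdev(flat)  # assumed width of the "near average" level
    levels = [[classify(v, avg, band) for v in row] for row in vth_map]
    full = (0, 0, len(vth_map[0]), len(vth_map))
    return split_regions(levels, full, phi, min_size)
```

A full split-and-merge pass would additionally merge adjacent regions whose union still satisfies the homogeneity condition, which reduces the number of regions the delay estimator has to handle.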
We categorize connection delays into two types: intra-region and inter-region delays. Intra-region delays are for connections between blocks that reside in the same PV region. They are calculated from three quantities: T del,global, the delay derived from the lookup matrix; V th,region, the average threshold voltage in the PV region; and V th,average, the average threshold voltage for the chip.
Inter-region delays, on the other hand, are for connections between blocks in different PV regions. There are two cases to consider, according to whether or not the blocks are on the same horizontal/vertical axis. Delay along a horizontal/vertical axis can be predicted more accurately than in the diagonal case because we can better estimate the number of switch boxes along a horizontal or vertical axis. These switch boxes are the main source of delay. The inter-region delay for a horizontal/vertical connection is computed from the following quantities: n is the number of regions the connection passes through, numSBs[i] is the number of switch boxes in the i-th region, TotalNumSBs is the total number of switch boxes in all regions of the connection, V th,region[i] is the threshold voltage in the i-th region, and V th,average is the average threshold voltage in the chip.
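To make the role of these quantities concrete, the following Python sketch scales the lookup-matrix delay by region. It assumes, purely for illustration, that delay scales linearly with the ratio of the regional V th to the chip average and that each region contributes in proportion to its share of switch boxes; the exact functional form used in the delay equations may differ, and the function names are hypothetical.

```python
def intra_region_delay(t_global, vth_region, vth_average):
    """Scale the lookup-matrix delay for a connection inside one PV region.

    Assumption: delay is proportional to vth_region / vth_average.
    """
    return t_global * vth_region / vth_average

def axis_inter_region_delay(t_global, num_sbs, vth_region, vth_average):
    """Scale the delay of a horizontal/vertical connection crossing n regions.

    num_sbs[i]    -- switch boxes the connection uses in region i
    vth_region[i] -- threshold voltage of region i
    Each region is weighted by its share of the total switch-box count.
    """
    total_num_sbs = sum(num_sbs)
    weight = sum(n * v / vth_average for n, v in zip(num_sbs, vth_region))
    return t_global * weight / total_num_sbs
```

With a single region, `axis_inter_region_delay` reduces to `intra_region_delay`, so the two cases stay consistent under this assumed scaling.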
For the diagonal case, it is more difficult to estimate the number of switch boxes in each region because the router may route a net in several different ways. We consider two common cases: left and right dogleg-shaped connections, as shown in Figure 2. In each case, we estimate the length of the connection in each region, and we scale the delay by the expected contributions of those regions. We then choose the maximum of the two cases. The delay in each case is computed from Length[i], the length of the dogleg-shaped connection in the i-th region, and TotalLength, the total length of the connection. The length of a connection in a PV region is the number of blocks it crosses in that region, counted along the direction of the connection.
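The diagonal case can be sketched in the same spirit. The Python fragment below weights each region by the fraction of the dogleg's length that falls inside it and keeps the larger of the left and right estimates. The linear V th ratio is an illustrative assumption, as are the function names and the `(lengths, vths)` tuple layout.

```python
def dogleg_delay(t_global, lengths, vth_region, vth_average):
    """Delay estimate for one dogleg-shaped route.

    lengths[i]    -- blocks the route crosses in region i (Length[i])
    vth_region[i] -- threshold voltage of region i
    Assumption: each region scales the delay by vth_region[i] / vth_average,
    weighted by lengths[i] / TotalLength.
    """
    total_length = sum(lengths)
    weighted = sum(l * v for l, v in zip(lengths, vth_region))
    return t_global * weighted / (total_length * vth_average)

def diagonal_inter_region_delay(t_global, left, right, vth_average):
    """Take the maximum over the left and right dogleg estimates.

    left and right are (lengths, vth_region) tuples for the two routes.
    """
    return max(dogleg_delay(t_global, *left, vth_average),
               dogleg_delay(t_global, *right, vth_average))
```

Taking the maximum is the conservative choice: the placer does not know which dogleg the router will pick, so it budgets for the slower one.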
III.C. New Move-Acceptance Procedure
The existing T-VPlace algorithm accepts placement movements either probabilistically or according to the change in the cost as shown in lines 10-13 in Algorithm 2. If ΔC < 0, i.e., the cost is reduced, then a move is accepted, otherwise, the move might be accepted or not, based on a random distribution. The probability of accepting moves decreases as the annealing temperature, T , decreases.
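The baseline acceptance rule described above is the standard simulated-annealing (Metropolis) criterion. A minimal Python sketch follows; the function name and the optional `rng` argument are illustrative and not part of VPR.

```python
import math
import random

def accept_move(delta_c, temperature, rng=random):
    """Standard annealing acceptance test.

    Moves that lower the cost are always accepted; others are accepted
    with probability exp(-delta_c / temperature), which shrinks as the
    temperature drops.
    """
    if delta_c < 0:
        return True
    return rng.random() < math.exp(-delta_c / temperature)
```

For a cost increase of 1.0, the acceptance probability is about 0.90 at T = 10 but only about 4.5e-5 at T = 0.1, which is why uphill moves become rare late in the anneal.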
Our new move-acceptance procedure favors moves that can reduce the NBTI/PV delay degradation effect. We define NBTI-criticality for a logic block (CLB) as the delay in the CLB when NBTI is considered. For this purpose, we perform a one-time analysis of the timing graph that represents a CLB at the beginning of the placement phase; NBTI-criticality is the critical path delay in that timing sub-graph.
We seek to place a more NBTI-critical block in a fast location (relatively small V th ) in order to reduce the delay through the block and its contribution to the performance degradation of the circuit. To measure the potential for reducing the NBTI/PV delay effect, we define α and β as follows:
In these definitions, V th,old-loc and V th,new-loc are the threshold voltages at the original location of the block and at its potential new location, respectively. From NBTI-crit and To NBTI-crit are the NBTI-criticality values of the block selected to be moved and of the block at the potential new location. If the values of α and β are larger than 1, then the new location can reduce the joint NBTI/PV effect on the block's delay. However, we do not accept all such movements, as this might lead to placement results where blocks on the critical path are far from each other. Moreover, accepting all moves that satisfy this condition might affect routability. Instead, we accept a move randomly and give it a higher probability of acceptance if the potential improvement in the block's delay is large enough. The procedure in Algorithm 2 is used for this purpose.
In Algorithm 2, ξ 1 depends on the amount of variability in V th that exists in a chip. V th in a chip is confined to the range […]. The higher the value of ξ 1, the more conservative we are in accepting moves to a faster new location. However, if a swap is to be performed and α > ξ 1, the block that currently resides in the new location will be moved to a slower location. For this reason, we use the second condition, β > ξ 2, to ensure that swapping is performed between blocks that have the correct levels of NBTI-criticality. To obtain a potential improvement, the value of ξ 2 should be larger than 1. The range for ξ 2 […]. We use R limit > ξ 3 to ensure that we do not apply the procedure when the blocks have a small range within which to move; little improvement can be expected in this case because the candidate locations have similar variation characteristics. Our experiments show that values of ξ 3 in the range 3-5 give good results. If these conditions are met, the move might be accepted randomly, with a probability that increases as the potential benefit increases, as shown in line 5 in Algorithm 2. The value of ξ 4 indicates how much weight is given to the probability that the move will not be accepted. We found that a value between 1.5 and 2.0 gives good results. If the new procedure is not completed because the conditions are not met, then the original move-acceptance procedure is executed. The overhead of the new procedure is small, as it adds only a few mathematical operations.
Algorithm 2 (surviving lines)
1-3: […]
4: if α > ξ 1 AND β > ξ 2 AND R limit > ξ 3 then
5-7: […]
8: end if
9: else
10: r = random(0, 1)
11: if r < e^(−ΔC/T) then
12:    S = S new
13: end if
14: end if
15: end if
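The eligibility test that guards the new procedure (the condition α > ξ 1 AND β > ξ 2 AND R limit > ξ 3) can be sketched in Python. The ratio forms of α and β below are an assumption inferred from the statement that values larger than 1 indicate a beneficial move; the probabilistic acceptance step that follows the test is not modeled, and the function name is hypothetical.

```python
def nbti_pv_candidate(vth_old_loc, vth_new_loc, from_crit, to_crit,
                      r_limit, xi1, xi2, xi3):
    """Return True if a swap qualifies for the NBTI/PV-aware acceptance step.

    Assumed definitions (not confirmed by the text):
      alpha = vth_old_loc / vth_new_loc  (> 1: the new location is faster)
      beta  = from_crit / to_crit        (> 1: the moved block is more
                                          NBTI-critical than the block it displaces)
    """
    alpha = vth_old_loc / vth_new_loc
    beta = from_crit / to_crit
    return alpha > xi1 and beta > xi2 and r_limit > xi3
```

For example, moving a block from a location with V th = 0.33 V to one with 0.30 V gives α = 1.1 under this definition; with ξ 1 = 1.05 the first condition holds, and the swap then still has to pass the β and R limit checks.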
IV. RESULTS
We conduct physical measurements on a set of 130 nm Xilinx FPGAs to characterize the PV effect on FPGAs. For NBTI modeling, all experiments using VPR are based on a 45 nm technology, 1.0 V supply voltage, 80 °C, and a 5-year lifetime. Ideally, we would use PV data measured from FPGAs that are also based on 45 nm technology and integrate it with the 45 nm NBTI model. However, due to limited resources, we were only able to use PV data from FPGAs based on 130 nm technology. We scale the PV data as described in Section IV-E. The experiments are done using the 20 largest MCNC benchmark circuits, i.e., ex5p to clma, that come with the VPR download. We assume an FPGA chip size of 50 × 44 CLBs, which is similar to the chips used in the measurements.
Because the original T-VPlace algorithm is a simulated annealing placement engine, it is inherently random, which makes it difficult to achieve consistent improvements in every benchmark used in the experiments. For this reason, we report the percentage of benchmarks that show improvement using our proposed techniques, together with the average improvement and NBTI/PV-effect reduction among these benchmarks for each FPGA chip, as presented in Sections IV-C and IV-D.
IV.A. Measuring PV
We measure the effect of PV on 15 Xilinx Virtex-II Pro FPGAs based on 130 nm technology. In our experiments, we use two DN6000K10 prototyping boards from The DiniGroup, Inc. [11] , as shown in Figure 3 . One board has 9 Xilinx FPGA chips, while the other board has 6 chips.
We place 1196 ring oscillators (ROs), each built from 15 inverters, in each FPGA. We ensure that all ROs have the same relative placement and routing by using hard macros and the LOC constraint supported by Xilinx tools. We assume that V th is the main parameter affected by PV, and we extract V th variation maps from RO frequency readings using the alpha-power law model [12]. Table II shows statistics for the different tested chips.
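One way to turn an RO frequency reading into a V th estimate with the alpha-power law is sketched below in Python. It assumes that RO frequency is proportional to (V dd − V th)^α at a fixed supply voltage and that a reference RO with known V th is available (for example, the chip-average frequency paired with the nominal V th). The function name and the numeric values in the usage note are illustrative only.

```python
def vth_from_ro_frequency(freq, freq_ref, vth_ref, vdd, alpha):
    """Estimate V_th at one RO site from its measured frequency.

    Alpha-power law assumption: f is proportional to (vdd - vth) ** alpha, so
        freq / freq_ref = ((vdd - vth) / (vdd - vth_ref)) ** alpha
    which is solved here for vth.
    """
    overdrive_ref = vdd - vth_ref
    return vdd - overdrive_ref * (freq / freq_ref) ** (1.0 / alpha)
```

With illustrative values vdd = 1.5, vth_ref = 0.40 and alpha = 1.3, an RO running 5% faster than the reference maps to a V th below 0.40 V, and one running 5% slower maps to a V th above it.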
IV.B. Modeling NBTI/PV Effect in FPGA
The change in V th that comes from NBTI and/or PV is modeled by reflecting the change on the equivalent ON-resistance of a specific circuit component such as a pass transistor, a transmission gate, or a buffer. We use appropriate fitting functions that have been validated using HSPICE simulations. An Elmore delay model is then used to calculate the delay through specific FPGA circuit components to extract timing information required to perform timing analysis instead of using static values. Using an Elmore delay model allows us to efficiently account for the NBTI/PV effect for routing resources that have pass transistor switches, in addition to the buffered resources. Switching probabilities for the different nets and LUT inputs in a circuit are obtained using the FPGA PowerModel tool [13] , and used in the long-term NBTI model.
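The modeling chain can be illustrated with a short Python sketch: a V th shift changes the ON-resistance of a switch, and the Elmore model turns the resulting RC ladder into a delay. The inverse-overdrive resistance fit used here is a hypothetical stand-in for the HSPICE-validated fitting functions, which are not given in the text; the Elmore formula itself is the standard one for an RC ladder.

```python
def on_resistance(r_nominal, vdd, vth_nominal, delta_vth):
    """Hypothetical first-order fit: R_on is inversely proportional to the
    gate overdrive, so a positive V_th shift (PV and/or NBTI) raises it."""
    return r_nominal * (vdd - vth_nominal) / (vdd - vth_nominal - delta_vth)

def elmore_delay(resistances, capacitances):
    """Elmore delay at the far end of an RC ladder.

    Stage i has series resistance resistances[i] followed by a grounded
    capacitance capacitances[i]; each resistance is multiplied by all the
    capacitance downstream of it (including its own stage).
    """
    delay = 0.0
    downstream = sum(capacitances)
    for r, c in zip(resistances, capacitances):
        delay += r * downstream
        downstream -= c
    return delay
```

Because the delay is linear in each resistance, a location-dependent V th shift only needs to be applied to the affected switch resistances before the ladder is evaluated.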
IV.C. Combination of Proposed Techniques
To assess the effectiveness of the proposed joint NBTI/PV-aware optimization techniques, we calculate the percentage of MCNC benchmarks with improvement (R bench.), the average performance improvement (A PI), and the average reduction in the effect of NBTI/PV (A RE) for each FPGA chip by considering only the benchmarks that show improvement.
For a specific FPGA chip c, Benchs is the number of benchmark circuits that are placed and routed successfully, BWI is the set of benchmarks with improvement, and N is the total number of benchmarks of BWI . Improvements are calculated by comparing the critical path delay T crit of a placed and routed circuit using our proposed techniques with that of a circuit placed using the original T-VPlace algorithm, considering NBTI/PV effect in both cases.
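The three per-chip metrics can be sketched in Python as follows. The per-benchmark formulas are assumptions consistent with the descriptions in this section: improvement is the relative drop in T crit against the original T-VPlace under the NBTI/PV effect, and reduction is the fraction of the NBTI/PV-induced slowdown that is recovered, capped at 100%. All names are hypothetical.

```python
def improvement(t_base, t_new):
    """Percent T_crit improvement over original T-VPlace (both under NBTI/PV)."""
    return 100.0 * (t_base - t_new) / t_base

def reduction(t_base, t_new, t_ideal):
    """Percent of the NBTI/PV-induced slowdown that is recovered.

    t_ideal is T_crit from original T-VPlace with no NBTI/PV effect;
    the sketch assumes t_base > t_ideal. Values above 100% are clipped.
    """
    raw = 100.0 * (t_base - t_new) / (t_base - t_ideal)
    return min(raw, 100.0)

def summarize_chip(results):
    """results: (t_base, t_new, t_ideal) for every successfully routed benchmark.

    Returns (R_bench, A_PI, A_RE), averaging only over benchmarks that improve.
    """
    bwi = [r for r in results if r[1] < r[0]]  # benchmarks with improvement
    r_bench = 100.0 * len(bwi) / len(results)
    if not bwi:
        return r_bench, 0.0, 0.0
    a_pi = sum(improvement(b, n) for b, n, _ in bwi) / len(bwi)
    a_re = sum(reduction(b, n, i) for b, n, i in bwi) / len(bwi)
    return r_bench, a_pi, a_re
```

Under these definitions, the raw reduction exceeds 100% exactly when the new placement under NBTI/PV is faster than the original placement with no NBTI/PV effect, which is the case that the cap handles.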
For cases where Reduction exceeds 100% (this happens when T crit for a circuit using the proposed techniques under NBTI/PV effect is less than T crit when no effect of NBTI/PV is considered using the original T-VPlace algorithm), we clip its value to 100%. Table III shows the values of the different parameters for the proposed techniques. ξ 1 and ξ 2 are calculated using θ 1 and θ 2 in the table according to the following equations:
The higher the values of θ 1 and θ 2, the more conservative we are in accepting moves, because they result in higher values of ξ 1 and ξ 2. The expressions in Equations (9) and (10) are useful because they draw on chip- and circuit-specific information to derive the required parameters.
Table IV gives a summary of the results for all chips. A considerable reduction in the joint NBTI/PV effect can be observed: 60% of the FPGA chips show more than 60% reduction in the joint NBTI/PV effect, and this reduction can be up to 100% in some chips, e.g., chip 4. For all chips except chip 2, more than 50% of the benchmark circuits show improvement, with the average improvement ranging from 4% to 9% (6.5% on average). Although the average improvement reaches at most 9%, some benchmarks show individual improvements of more than 25%.
We study the execution time overhead of the proposed joint NBTI/PV-aware optimization techniques using the PV map of chip 2. This chip consistently shows the largest number of PV regions. We run the experiments using the largest benchmark circuit, clma, and we compare the placement execution time of the original T-VPlace with that of the modified T-VPlace with our proposed techniques integrated. Table V shows the results obtained on a workstation with an Intel Core 2 CPU running at 2.66 GHz with 4 MB of L2 cache and 4 GB of RAM. The execution time overhead in the worst case can reach 3X. However, when using a moderate value for MinRegionSize, i.e., 5, the number of regions generated for about 67% of the tested chips is within the range 4-14, which limits the worst-case overhead to 1.8X. Moreover, because routing dominates the execution time of the FPGA compilation process (routing can be about 2-38X slower than placement [14]), the overhead in the placement phase does not significantly affect the total compilation time.
IV.D. Standalone Move-Acceptance Procedure
The proposed new move-acceptance procedure, described in Section III-C, can be used as a standalone technique if there are constraints on the FPGA compilation time. We study the cases shown in Table VI for the standalone new move-acceptance procedure, and we compare the results with those of all of the proposed techniques combined. Equations (9) and (10) are used to calculate ξ 1 and ξ 2 based on the values of θ 1 and θ 2 in the table.
Figure 4 shows the comparison between the new move-acceptance procedure and all proposed techniques combined. We use the different cases shown in Table VI. In all cases, we use the same parameter values from Table III, except for the values of θ 1 and θ 2. Each bar in the graph represents the normalized aggregate benefit for all chips (normalized to the best case, which is C5). We define the aggregate benefit among all chips as the average sum of products of R bench. and A PI. This measure summarizes the large volume of results well because R bench. weights A PI for each chip. The combined techniques give better results than the new move-acceptance procedure, except for case C1. However, as a standalone technique, the new move-acceptance procedure in C4 and C5 delivers more than 90% of the best-case benefit.
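The aggregate-benefit measure can be sketched in a few lines of Python. The sketch reads "average sum of products" as the mean over chips of R bench. × A PI, which is an interpretation rather than a confirmed formula; the function names and the sample numbers in the usage note are hypothetical.

```python
def aggregate_benefit(chips):
    """Mean over chips of R_bench * A_PI.

    chips: list of (r_bench, a_pi) pairs, one per FPGA chip.
    """
    return sum(r * a for r, a in chips) / len(chips)

def normalize(benefits):
    """Normalize per-case aggregate benefits to the best case."""
    best = max(benefits.values())
    return {case: b / best for case, b in benefits.items()}
```

For instance, two chips with (R bench., A PI) of (50, 4) and (100, 6) give an aggregate benefit of (200 + 600) / 2 = 400 under this reading.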
IV.E. Scaling the PV Effect
To assess the effectiveness of the proposed joint NBTI/PV optimization techniques for different levels of PV, we use linear scaling for the original PV maps that have been obtained from RO frequency measurements on the tested chips. Although this does not accurately represent how PV scales for future nodes, it is useful in studying the performance of the proposed techniques for different scales of PV.
We identify the mean V th in the chip. We then scale the variations from the mean by a scaling factor SF :
$$\Delta V_{th,new} = \Delta V_{th,old} \times (1 + SF)$$
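Applied to a whole PV map, this scaling keeps the chip mean fixed and stretches every deviation from it. A minimal Python sketch follows, assuming the map is stored as a list of rows and that SF is given as a fraction (0.1 for 10%); the function name is illustrative.

```python
def scale_pv_map(vth_map, sf):
    """Scale each block's deviation from the chip-mean V_th by (1 + sf)."""
    flat = [v for row in vth_map for v in row]
    avg = sum(flat) / len(flat)
    return [[avg + (v - avg) * (1.0 + sf) for v in row] for row in vth_map]
```

For example, a two-block map with V th values 0.28 and 0.32 has a mean of 0.30; with sf = 0.5 the deviations of ±0.02 become ±0.03, giving 0.27 and 0.33.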
We vary SF from 10% to 30%; at the same time, we vary ξ 1 by varying θ 1 according to the scaling level. Higher scaling widens the range of values for α, so we increase ξ 1 accordingly in order to guarantee that we accept only moves that have high potential benefit. Table VII shows the results. The parameters used for the placement techniques are from Table III, except for θ 1 and θ 2. We found that θ 2 = 0.25 gives good results, as do θ 1 values of 0.6, 0.7, and 0.8.
The proposed techniques remain effective across the different scales of PV. About 90% of the chips for the different scaling factors have 50-90% of the circuits showing improvement. For SF = 10%, 67% of the chips show 50-100% average reduction in the joint NBTI/PV effect. For SF = 20%, 60% of the chips show 57-99% average reduction in the joint NBTI/PV effect. For SF = 30%, 67% of the chips show 52-96% average reduction in the joint NBTI/PV effect.
V. CONCLUSIONS
Fabrication-induced PV and the NBTI aging effect have a significant impact on circuit performance and reliability. In this paper, we presented joint NBTI/PV-aware optimization techniques for FPGAs, which take the effect of NBTI/PV into consideration. A set of analysis and optimization techniques has been proposed and implemented in FPGA placement, including NBTI/PV-aware timing analysis, region-based delay estimation, and a new move-acceptance procedure. Experimental results show that the proposed solution can achieve considerable reduction in the joint NBTI/PV effect. The proposed approach is suitable for run-time reconfigurable systems that use embedded FPGAs as accelerators for timing-critical applications.
