Abstract-We propose a new circuit-level reliability evaluation methodology. The proposed methodology is based on a divide-and-conquer approach, which combines device-level accuracy with block-level efficiency. At the core of the reliability estimation engine lies a Monte Carlo algorithm that works with failure times modeled as Weibull and lognormal distributions for the major wearout mechanisms: time-dependent dielectric breakdown, negative bias temperature instability, electromigration, thermal cycling, and stress migration. As a case study, we demonstrate how the proposed reliability evaluation technique can be applied to a Network-on-Chip router to identify the most vulnerable subblocks, which represent the reliability bottlenecks of the router.
I. INTRODUCTION
Reliability has become a fundamental challenge in the design of integrated circuits [1]. One of the main difficulties in reliability-aware design is the estimation of reliability itself. Evaluating reliability is challenging because it is affected by numerous factors, including aging mechanisms (e.g., time-dependent dielectric breakdown (TDDB) [2], negative bias temperature instability (NBTI) [3], electromigration (EM) [4], thermal cycling (TC), and stress migration (SM) [5]), process variations, dynamic power and thermal management, workload, and system architecture and configuration. While significant work has been carried out to estimate reliability [6]-[18], we next discuss the two approaches that are closest to our work. An extensive review of previous reliability simulation tools can be found in [6].
The RAMP approach [8], [9] models the mean time to failure (MTTF) of a processor microarchitecture as a function of the temperature-related failure rates of individual on-chip structures. It divides the processor into several discrete structures (e.g., ALU, register files, etc.) and applies analytical models to each structure. Then, it combines the structure-level MTTFs to compute the overall MTTF of the entire processor, which is treated as a series failure system. Because the lifetime distributions of the failure mechanisms are assumed to be exponential, the reliability is calculated by applying the sum-of-failure-rates (SOFR) model. This assumption is not realistic because the failure rates of units increase with time due to aging. To address this limitation of the SOFR model, RAMP 2.0 [10], [11] uses lognormal distributions, which are harder to deal with analytically. One of the main limitations of RAMP as an architecture-level approach is its limited accuracy. In addition, it may estimate equal MTTFs for blocks of different sizes whose activity factors cancel out the area proportionality factor.
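To make the SOFR model concrete, the following minimal sketch computes the MTTF of a series system from per-structure MTTFs. It assumes exponential lifetimes (constant failure rates), as described above. The function name and the numeric values are hypothetical and are not taken from RAMP.

```python
def sofr_mttf(structure_mttfs):
    """MTTF of a series system under the SOFR model.

    Assumes every structure has an exponential lifetime, so its failure
    rate is the constant 1 / MTTF and the system rate is the sum of rates.
    """
    if not structure_mttfs:
        raise ValueError("at least one structure MTTF is required")
    total_failure_rate = sum(1.0 / mttf for mttf in structure_mttfs)
    return 1.0 / total_failure_rate


# Hypothetical per-structure MTTFs in years (illustrative values only).
system_mttf = sofr_mttf([30.0, 60.0, 20.0])
print(f"series-system MTTF: {system_mttf:.2f} years")
```

With these illustrative inputs the summed failure rate is 0.1 per year, so the system MTTF is about 10 years, which is shorter than the MTTF of any single structure.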
Another, more recent, class of reliability evaluation approaches is based on Spice simulations. Failure rate based Spice (FaRBS) [12] and Maryland circuit reliability oriented (MaCRO) [13] are circuit-level reliability simulation methods. The main advantage of this class of methods is its device-level granularity, which enables reliability analysis at the transistor level to identify the most vulnerable transistors. However, Spice-based reliability simulation has several drawbacks. These approaches do not consider the layout of the system, and simulations are done under worst-case temperature scenarios, which is not realistic. Moreover, Spice circuit simulations tend to take a long time, especially for large circuits. In addition, both methods (FaRBS and MaCRO) are developed under the assumption that the failure rate is constant. As discussed above, this assumption is inaccurate.
II. LIFETIME FAILURE MODELS
Many proposed lifetime reliability models assume a uniform device density on the chip and an identical vulnerability of devices to failure mechanisms. The lifetime distributions of failure mechanisms are usually assumed to be exponential [8]-[10], [15], [16]. As discussed in the previous section, this allows system-level reliability to be calculated by applying the sum-of-failure-rates (SOFR) model. However, this approach is not realistic because the failure rates of units increase with time due to aging. To address this issue and to develop an accurate reliability model, more general lifetime distributions (e.g., Weibull and lognormal) must be utilized. On the other hand, when Weibull or lognormal distributions are utilized, the prediction of reliability becomes more difficult, and therefore Monte Carlo simulations must be employed. In this paper, we adopt Weibull distribution modeling for TDDB, NBTI, TC, and SM and lognormal distribution modeling for EM because these distributions have been found to best fit the corresponding wearout mechanisms [5].
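As an illustration of how failure time samples can be drawn from these distributions once an MTTF is known, the sketch below converts a target MTTF into distribution parameters using the standard mean formulas (Weibull mean = scale x Gamma(1 + 1/shape); lognormal mean = exp(mu + sigma^2/2)). The helper names, the Weibull shape value, and the lognormal sigma value are assumptions made for illustration; they are not parameters taken from this paper.

```python
import math
import random


def weibull_scale_from_mttf(mttf, shape):
    """Scale parameter such that the Weibull mean equals the given MTTF."""
    return mttf / math.gamma(1.0 + 1.0 / shape)


def lognormal_mu_from_mttf(mttf, sigma):
    """Log-mean parameter such that the lognormal mean equals the given MTTF."""
    return math.log(mttf) - 0.5 * sigma ** 2


def sample_weibull_lifetime(mttf, shape, rng):
    """One failure time sample, e.g. for TDDB, NBTI, TC, or SM."""
    # random.weibullvariate takes (scale, shape) in that order.
    return rng.weibullvariate(weibull_scale_from_mttf(mttf, shape), shape)


def sample_lognormal_lifetime(mttf, sigma, rng):
    """One failure time sample, e.g. for EM."""
    return rng.lognormvariate(lognormal_mu_from_mttf(mttf, sigma), sigma)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
# Hypothetical values: MTTF of 10 years, shape 2.0, sigma 0.5.
print(sample_weibull_lifetime(10.0, 2.0, rng))
print(sample_lognormal_lifetime(10.0, 0.5, rng))
```

Parameterizing by the MTTF keeps the sampling step decoupled from the mechanism-specific MTTF expressions, which only need to supply a mean.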
Due to space limitations, below we only list the expressions of the mean time to failure that characterize the models of the aging mechanisms considered in this paper. For elaborate discussions of these failure mechanisms and their models, the reader is referred to [5]-[7], [13], [14].
978-1-4673-2658-2/12/$31.00 ©2012 IEEE
The definitions of the variables and parameters involved in these expressions are well known and described in previous literature. In this paper, we utilize typical values of these parameters from [5], [14].
III. NEW RELIABILITY EVALUATION METHODOLOGY
The proposed reliability evaluation methodology is an attempt to combine the benefits of both RAMP-like and Spice-simulation-based reliability evaluation approaches. The block diagram that illustrates the main steps of the proposed methodology is shown in Fig. 1. The salient features of our methodology are as follows. First, in order to deal with the complexity due to circuit size, we adopt a divide-and-conquer approach. The hierarchy of the design structure is exploited by zooming in to lower levels, where the analysis is tractable within reasonable computational time. Second, similar to the MaCRO method [13], we employ subblock-level Spice simulations to derive transistor operating parameters. However, we conduct Spice simulations at realistic temperatures (different subblocks have different temperatures) rather than at worst-case temperatures, as is done pessimistically in [13]. Third, we model failure times using Weibull and lognormal distributions, which have been found to better fit experimental data. Fourth, the block-level reliability (as MTTF) is estimated by aggregating the failure times of the individual subblocks. This process is implemented such that the design hierarchy is zoomed out to upper levels. Finally, the proposed method has the ability to identify the most vulnerable subblocks from a reliability point of view.
Following the flow chart from Fig. 1, the main steps of the proposed reliability evaluation methodology are as follows:
Step 1: We start from a given hierarchical description of the design under consideration. This description can be in any hardware description language such as VHDL or Verilog. In addition, transistor and technology parameters are assumed to be given based on the technology node in which the design is to be fabricated.
Step 2: The design is synthesized, placed, and routed using Cadence tools [19], but any other CAD tool can be utilized. The resulting layout represents the block-level floorplan, which is divided into individual structures or subblocks based on the initial structural description of the design. In this way, we obtain for each subblock its layout, location, and aspect ratio. In addition, power consumption estimates are also generated using Cadence tools.
Step 3: The floorplan and power estimates are then fed into HotSpot [20]. HotSpot is an accurate and fast thermal model based on an equivalent circuit of thermal resistances and capacitances that correspond to microarchitecture blocks. The output of the HotSpot simulation is a list with the temperature of each subblock. Our approach addresses one of the limitations of MaCRO-like methods [13]. As mentioned earlier, instead of doing worst-case temperature simulations, we work with the actual operating temperature of each subblock. Therefore, the reliability of each subblock can be evaluated more accurately.
Step 4: These temperatures are utilized together with circuit netlists generated from within Cadence tools to perform subblock level Spice simulations. These simulations provide us with the transistor operating parameters necessary to be plugged into the equations modeling the wearout mechanisms described in Section II.
Step 5: At this stage we have everything that is needed by the lifetime failure models described by equations (1)-(5). At the core of the proposed methodology we employ a Monte Carlo (MC) simulation algorithm (see Fig. 1) implemented and run in Matlab [21]. Our technique is inspired by the RAMP method but is executed at the subblock level, where the elementary unit is the transistor.
The MC algorithm proceeds with the following main steps: (1) for each failure mechanism, run N = 10^7 simulations: (a) for each transistor, generate failure time samples from the corresponding distribution, and (b) apply a MIN analysis to these times, treating the subblock as a series system, to calculate the time to failure t_f; (3) calculate the overall subblock's time to failure as the minimum among the failure times due to each failure mechanism.
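The steps above can be sketched as follows. This is a simplified illustration and not the authors' Matlab implementation. It assumes Weibull lifetimes for the transistor-level mechanisms, a hypothetical shape parameter, far fewer trials than 10^7 so that it runs quickly, and it averages the per-trial series-system failure times to obtain a per-mechanism estimate (the averaging step is an assumption). All names and input values are hypothetical.

```python
import math
import random


def weibull_sample(mttf, shape, rng):
    """Draw one failure time from a Weibull whose mean equals mttf."""
    scale = mttf / math.gamma(1.0 + 1.0 / shape)
    return rng.weibullvariate(scale, shape)


def subblock_time_to_failure(transistor_mttfs, shape=2.0, trials=2000, seed=0):
    """Monte Carlo MIN analysis of one subblock treated as a series system.

    transistor_mttfs maps a failure mechanism name to a list holding the
    MTTF of every transistor in the subblock under that mechanism.
    Returns (overall time to failure, per-mechanism estimates).
    """
    rng = random.Random(seed)
    per_mechanism = {}
    for mechanism, mttfs in transistor_mttfs.items():
        total = 0.0
        for _ in range(trials):
            # Series system: the subblock fails when its first transistor fails.
            total += min(weibull_sample(m, shape, rng) for m in mttfs)
        per_mechanism[mechanism] = total / trials
    # The overall value is the minimum across failure mechanisms.
    return min(per_mechanism.values()), per_mechanism


# Hypothetical per-transistor MTTFs in years for a three-transistor subblock.
overall, detail = subblock_time_to_failure(
    {"TDDB": [10.0, 12.0, 15.0], "NBTI": [20.0, 25.0, 30.0]}
)
print(overall, detail)
```

Because the subblock is a series system, its estimated time to failure under each mechanism is shorter than the MTTF of its weakest transistor.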
In our experiments, we found that in order to better differentiate between subblocks one only needs to focus on the most vulnerable transistors in a given subblock. Hence, we introduce a threshold that helps to identify transistors whose lifetime samples are smaller than this threshold. As an indicator of how vulnerable a subblock is, we calculate the percentage of transistors whose lifetime sample is smaller than the selected threshold. This is illustrated in the pseudocode description of the algorithm presented in Fig. 2. The threshold value is selected during the reliability qualification process as a function of the desired expected lifetime.
[Fig. 2: pseudocode of the algorithm. Surviving fragment: for l ← 1 to F do // F: number of failure types; 13: Calculate MTTF_l using equations (1)-(5) from Section II; 14: for j ← 1 to N do]
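The vulnerability indicator described above can be sketched in a few lines. The function names, the subblock lifetimes, and the threshold value below are hypothetical; the sketch only shows the counting step, not how the lifetimes are generated.

```python
def vulnerability_percent(transistor_lifetimes, threshold):
    """Percentage of transistors whose lifetime is below the threshold."""
    if not transistor_lifetimes:
        raise ValueError("subblock has no transistors")
    below = sum(1 for t in transistor_lifetimes if t < threshold)
    return 100.0 * below / len(transistor_lifetimes)


def rank_subblocks(lifetimes_by_subblock, threshold):
    """Subblocks sorted from most to least vulnerable."""
    scores = {
        name: vulnerability_percent(lifetimes, threshold)
        for name, lifetimes in lifetimes_by_subblock.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical transistor lifetimes (years) and a 10-year threshold.
ranking = rank_subblocks(
    {"RC": [4.0, 12.0, 7.0, 30.0], "crossbar": [15.0, 9.0, 22.0, 40.0]},
    threshold=10.0,
)
print(ranking)
```

A percentage-based indicator is independent of subblock size, so a small subblock with many short-lived transistors can rank above a large one.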
It is important to note that the pseudocode on lines 14-29 in Fig. 2 is applicable to the TDDB and NBTI cases only. In these cases, the proposed reliability methodology performs MC analysis at the transistor level. For the EM, TC, and SM wearout mechanisms, the MC analysis is done at the subblock level, similar to RAMP 2.0 [9]. The pseudocode description in these cases is simpler, and we omit it here due to space limitations.
The output of the proposed reliability evaluation methodology consists of the actual estimate of the time to failure or MTTF of the design (line number 33 in Fig. 2) and the vulnerability of each individual subblock, expressed as the percentage of transistors with average failure time shorter than the threshold (line number 29 in Fig. 2). The overall MTTF of the design is estimated using a MIN MAX analysis similar to [9] in order to be able to handle redundant subblocks that might be introduced for improving reliability. This information can be useful to circuit and system designers in developing fault-tolerant or robust circuits and systems. Armed with information about which subblocks and transistors are reliability critical, designers can concentrate their design efforts [22], using wearout-mechanism-specific techniques, only on those, thereby saving area and power.
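A MIN MAX analysis of this kind can be sketched as follows: a group of redundant subblocks fails only when all of its copies have failed (MAX), while the design, as a series of groups, fails when its first group fails (MIN). This is a minimal illustration under stated assumptions: Weibull lifetimes with a hypothetical shape parameter are used for every subblock, the trial count is small, and all names and MTTF values are invented for the example.

```python
import math
import random


def design_failure_time(groups):
    """MIN over series groups of the MAX over redundant copies in each group."""
    return min(max(group) for group in groups)


def estimate_design_mttf(group_mttfs, shape=2.0, trials=5000, seed=0):
    """Monte Carlo estimate of the design MTTF.

    group_mttfs is a list of groups; each group lists the MTTFs of its
    redundant subblocks (a single-entry group has no redundancy).
    """
    rng = random.Random(seed)
    gamma_factor = math.gamma(1.0 + 1.0 / shape)
    total = 0.0
    for _ in range(trials):
        sampled = [
            [rng.weibullvariate(mttf / gamma_factor, shape) for mttf in group]
            for group in group_mttfs
        ]
        total += design_failure_time(sampled)
    return total / trials


# Hypothetical design: two subblocks in series, then the same design with
# the second subblock duplicated for redundancy.
plain = estimate_design_mttf([[10.0], [10.0]])
redundant = estimate_design_mttf([[10.0], [10.0, 10.0]])
print(plain, redundant)
```

In this example, duplicating one subblock raises the estimated design MTTF, which is the effect the MIN MAX analysis is meant to capture.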
IV. CASE STUDY AND SIMULATION RESULTS
While Network-on-Chip (NoC) has become the dominant communication architecture to cope with the ever-increasing complexity of multicore systems, its reliability has been studied significantly less than that of cores. Therefore, in this section, we demonstrate the proposed reliability evaluation methodology on an NoC router. Our goal is to analyze the microarchitecture of a typical router to identify its most vulnerable components. We focus our attention on one of the most popular pipelined router architectures, described in [23]. The main components of this architecture are: routing computation (RC), virtual channel allocation (VA), switch allocation (SA), input ports, output ports, and crossbar switch.
We first code the router's specification in Verilog. Specifics of this description include: 5 input and 5 output ports, 2 virtual channels per port, 4 sets of registers for each virtual channel of each port, and 16-bit-wide links. Then, we take the router design through the reliability evaluation methodology described in Fig. 1. We utilize the Nangate 45nm Open Cell Library [24] within Cadence tools to synthesize the router and generate its floorplan. Fig. 3 and Fig. 4 show the results for the TDDB and NBTI cases as subblock vulnerabilities. We observe that the RC and VA components contain the highest percentages of transistors with lifetime shorter than the selected threshold, despite the fact that their area is smaller than, for example, the area of the input registers. This can be explained by the fact that the RC and VA components experience higher switching activities than the other router components, which in turn leads to higher temperatures. Note that this information could not be obtained with RAMP-like reliability evaluation approaches.
Because for EM, TC, and SM the Monte Carlo algorithm is applied at the subblock level (see discussion in Section III), we report the individual subblock MTTF in Fig. 5, Fig. 6, and Fig. 7. We observe that input and output ports are the components most vulnerable to these three failure mechanisms.
V. CONCLUSION
We proposed a new circuit-level divide-and-conquer reliability evaluation methodology based on an MC algorithm. The proposed methodology can be very useful to circuit and system designers to predict circuit and system MTTF and to focus on cost-effective design techniques targeted at specific parts of the design to improve its lifetime.
ACKNOWLEDGMENT
This work was supported by the National Science Foundation (NSF) under grant CCF-1116022. Any opinions, findings, and conclusions or recommendations expressed herein are those of the authors and do not necessarily reflect the views of the NSF.
