Abstract-Negative Bias Temperature Instability (NBTI) is a prominent physical failure mechanism which severely degrades the performance of PMOS transistors whenever the voltage at the gate is negatively biased. It leads to catastrophic timing violations in critical circuits and a severe shortening of the overall operational lifetime of the entire system. To alleviate such damaging effects due to NBTI, we present PRITEXT, a novel technique which generates a minimal set of deterministic exercise vectors based on test generation techniques which inherently near-optimizes the bit patterns across each of the generated vectors; the end target being to exercise the critical paths of a device when dormant so as to achieve near-ideal NBTI stress reduction. We explore the design space of our generated vectors and apply them to our test processor platform under differing sequences, where our evaluation under realistic benchmarks shows that PRITEXT leads to an average 4.99× and a maximum of 13.91× lifetime improvement using 9 generated vectors. In an attempt to reduce hardware overheads even further, we next propose a heuristic to further reduce the number of exercise vectors with minimum loss in lifetime improvement.
I. INTRODUCTION
Wear in CMOS circuits has recently become a subject of heightened concern in modern VLSI as a consequence of transistor miniaturization with Moore's Law [2] . As transistors become smaller, current and temperature densities are elevated leading to accelerated device degradation and eventual timing violations and run-time faults. With chip sizes remaining constant, the number of such vulnerable devices skyrockets with scaling and thus failures become more frequent. Current designs, in which complex circuits must attain consistent high operational performance in a given lifetime envelope set at design time, are particularly susceptible to this phenomenon.
While many forms of CMOS wear exist, due to higher temperature and electric field densities in current and future CMOS technology nodes, Negative Bias Temperature Instability (NBTI) [13] has become a prominent age-inducing transistor wear mechanism. Though NBTI does not necessarily cause hard faults (opens or shorts) in CMOS circuits, the mere reduction in operating frequency may render the entire system as unfit for operation or "dead."
In modern CMOS designs, the conventional solution for managing NBTI effects, albeit statically set at design time, is to utilize guard-bands in both the cycle time [2] and threshold voltage Vth [9] so as to guarantee seamless device operation within an expected time frame. Under such a scenario, guard-bands of 10%-20% are typical to ensure a credible safety net in establishing a reasonable lifetime, but once the guard-bands are violated by a single critical path, the entire device becomes non-operational. However, NBTI does not merely depend on circuit parameters such as transistor geometry, but critically also on bit patterns, which can cause a given transistor to degrade at varying rates depending on its usage; heavily biased, i.e., mostly dormant, critical paths are prone to elevated NBTI effects.
Several researchers have proposed various circuit-level and microarchitectural techniques to manage device degradation due to NBTI-induced transistor aging. Several use a well-known technique called Input Vector Control (IVC), where the idea is to precisely control the state of the internal nodes in a circuit using input vectors [5].
Here we propose and evaluate a novel microarchitectural technique, called PRITEXT, which significantly prolongs the lifetime of the device under consideration by exercising the internal nodes along the timing-critical paths, using a minimal set of deterministic input vectors generated to achieve balanced utilization (duty cycle). We then consider a test-case superscalar processor [4] to test and evaluate our proposed methodology. PRITEXT achieves an average lifetime improvement of 4.99× compared to a processor system with no NBTI mitigation in place, while restricting both power and area overheads to under 0.5%.
This paper makes the following contributions:
• We propose a novel vector generation technique based on delay testing models that generates a small number of exercise vectors for NBTI rejuvenation.
• We propose a lifetime-extending microarchitectural technique, PRITEXT, to improve the reliability of semiconductor devices with minimum hardware overhead. The technique is then verified by performing an extensive analysis using a complex superscalar processor running a variety of benchmarks.
• We present a set of heuristics and optimization techniques to reduce hardware overheads using vector compaction applied to both width and length. Our experimental results show a significant decrease in hardware overheads while incurring minimal loss of lifetime acceleration.
II. BACKGROUND ON NBTI AND IVC
A. NBTI degradation and recovery
NBTI affects the gate of a PMOS transistor (NMOS transistors exhibit a similar, but lesser, effect, i.e., PBTI) when reverse biased, i.e., pulled to logic "0" (Vgs = -Vdd), by a corresponding internal net, and under continuous stress [3]. This activity leads to the generation of interface traps due to the dissociation of Si-H bonds at the Si/SiO2 interface, leading to an increase in the transistor's threshold voltage (Vth) and a simultaneous reduction in the drive current due to charge carrier mobility degradation (stress phase). A continuous increase in Vth leads to accelerated transistor aging and a steady decrease in switching speed. Hence NBTI does not cause actual PMOS transistor failure, but slows its switching.
NBTI has an interesting recovery phenomenon when the PMOS transistor's gate is not reverse biased (Vgs = 0); most of the Hydrogen atoms diffuse back and bond with Silicon, leading to Vth readjustment [3] (recovery phase). However, Vth is only partially recovered, depending on the ratio of the stress mode versus the recovery mode, where the net increase in operating threshold voltage due to dynamic NBTI stress is sensitive to the fraction of the time the transistor is under negative bias; this is defined as the duty cycle β. Hence, to further reduce NBTI effects, as self-recovery alone is not sufficient, the transistor duty cycle should ideally be balanced at 50%. Note that a duty cycle of 0% is more favorable for NBTI rejuvenation but can lead to accelerated PBTI stress.
B. Lifetime Improvement Mathematical Model
NBTI stress shifts both the threshold voltage (Vth) and the switching delay over time, causing wearout. We utilize the mathematical model proposed in our previous work [10], which provides an approximation for switching delay degradation in multi-gate logic under NBTI stress. This models NBTI degradation under AC stress and also proposes a metric for quantifying the lifetime improvement achieved by alleviating NBTI stress. The lifetime of a device with combinational and sequential logic is limited by the multi-gate critical circuit path(s) delay exceeding a timing guard-band set at design time. The cumulative increase in delay Δd(t) along such a path with N gates at time t can be computed as:
where βi is the duty cycle of the i-th gate (i.e., the fraction of time the gate's voltage is under logic "0"), n is a time exponent, k is Boltzmann's constant, A is a fitting constant, T stands for temperature, and Ea_NBTI is the activation energy.
Lifetime, T_lifetime, is defined such that Δd(T_lifetime) is less than the guard-band, as a device is said to be functionally reliable as long as the increase in the switching delay of the critical timing path does not exceed this timing guard. The lifetime improvement is quantified by the Acceleration Factor (AF), defined as follows [10]:
where T_lifetime(x) is the operational lifetime of a device enhanced with NBTI mitigation using the PRITEXT technique, and T_lifetime(ref) is the operational lifetime of the reference device with no mitigation. Last, βi and βj respectively represent the duty cycle of the i-th gate on the critical path of the enhanced device, and of the j-th gate on the critical path of the reference device, assuming equivalent critical path gate counts of N and M.
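As a rough illustration of how the Acceleration Factor follows from the delay model, the sketch below assumes that each gate contributes a term of the form A · βi^n · t^n · exp(-Ea_NBTI / kT) to Δd(t). This functional form, the function names, and the default values (n = 1/6, Ea = 0.1 eV, T = 358 K) are assumptions chosen to match the symbols defined above; they are not necessarily the exact expressions of [10]. Under this assumption the constants A, T, and Ea_NBTI cancel in the ratio, so AF depends only on the duty cycles.

```python
import math

BOLTZMANN_EV = 8.617e-5  # Boltzmann's constant in eV/K


def delay_shift(betas, t, n=1 / 6, A=1.0, Ea=0.1, T=358.0):
    """Assumed model: sum_i A * beta_i**n * t**n * exp(-Ea / (k*T))."""
    thermal = math.exp(-Ea / (BOLTZMANN_EV * T))
    return sum(A * (b ** n) * (t ** n) * thermal for b in betas)


def lifetime(betas, guard_band, n=1 / 6, A=1.0, Ea=0.1, T=358.0):
    """Time at which the assumed delay shift reaches the guard band."""
    shift_at_unit_time = delay_shift(betas, 1.0, n, A, Ea, T)
    return (guard_band / shift_at_unit_time) ** (1.0 / n)


def acceleration_factor(betas_enhanced, betas_ref, n=1 / 6):
    """AF = T_lifetime(x) / T_lifetime(ref); constants cancel in the ratio."""
    s_x = sum(b ** n for b in betas_enhanced)
    s_ref = sum(b ** n for b in betas_ref)
    return (s_ref / s_x) ** (1.0 / n)
```

For example, under this assumed form, moving four gates from β = 1.0 to β = 0.5 gives `acceleration_factor([0.5] * 4, [1.0] * 4)` of approximately 2.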
C. Input Vector Control
Any internal net of a combinational circuit, which drives the gate of a PMOS transistor, can be deliberately switched to logic "1" by applying a relevant input vector. This logic state-controlling technique, termed Input Vector Control (IVC) [5] , is used to activate (i.e., pull-up) the internal nets so as to eliminate the negative bias of the PMOS transistors during standby mode or during device inactivity. Hence, by balancing the ratio of stress time to the total operational time (equivalent to β = 50%), NBTI degradation can be greatly alleviated. Balancing the duty cycle (β) of an internal net using IVC forms the basis of our technique in building our PRITEXT microarchitecture. We note that a single input vector cannot guarantee the activation of all the internal nets on a single timing path. Hence, our aim with PRITEXT is to derive a set of deterministic input vectors so as to achieve balanced duty cycles across all the internal critical nets. Pulling up critical internal nets (i.e., recovery phase) during standby time using IVC forms the so called "exercise technique" [10] , while the set of derived input vectors are dubbed "exercise vectors".
III. PRITEXT MICROARCHITECTURE AND MOTIVE IN REDUCING VECTOR SET OVERHEAD

A. PRITEXT: Proactive Reliability Improvement Through EXercise Technique
In an attempt to exercise all the critical nets such that their β can be balanced at an ideal 50%, it is vital to capture the activation cone of all the critical nets, which begins at the fan-in of the critical path as shown conceptually in Fig. 1. With the relevant circuit netlist in hand, deterministic vectors that exercise all these critical nets are first generated and then applied. This two-step process defines our PRITEXT technique, with the exercise vector generation algorithm described in Section IV. As seen in Fig. 1, PRITEXT is backed by additional hardware which augments the base critical path logic. A Read-Only Memory (ROM) block and multiple lightweight multiplexers (MUXes) respectively store and apply the generated exercise vectors to the critical circuit path. Note that each row in the ROM corresponds to a unique vector. The exercise vectors are applied non-invasively to the extracted critical path logic cone when the system's exercise mode signal is enabled in the standby phase, hence not affecting the architectural state of the system; this is ensured by disabling the output flip-flops during exercise mode. The exercise vectors are applied in rotation with a pre-defined periodicity on the order of hundreds of cycles (typically 1024) so as to account for the entire critical path tree.
B. Overhead Reduction of Exercise Vector Set
Exercise vectors generated for NBTI rejuvenation, where their aggregation comprises a vector set, may overload an underlying architecture with considerable CMOS area and power consumption overheads if not optimized. The hardware overhead of a vector set is proportional to the product of the length (i.e., number of vectors) times the width (i.e., size of individual vector) of the vector set.
Reducing the number of exercise vectors (length) offers a number of benefits. First, on-chip ROM real-estate occupancy is reduced (Fig. 1), while the dynamic power dissipated as a consequence of critical path nets switching during exercise is also kept to a minimum. Last, the periodicity of exercise application can be enhanced, directly leading to improved circuit rejuvenation during system standby. The latter is especially critical as no single exercise vector can activate all critical nets within a logic cone at once; instead, a series of vectors is applied periodically to cover all exercise cases so that the critical circuit cone is rejuvenated holistically. Intuitively, keeping the number of vectors to a minimum gives each internal net a higher fraction of the exercise (standby) period, thereby helping to balance duty cycles across all the critical nets.
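The intuition that a shorter vector set gives each net a larger share of the standby period can be sketched as a time-weighted average. The function below is illustrative only: its name, the equal time share per vector, the default 50% standby fraction, and the pessimistic assumption that a net sits at logic "0" whenever the applied vector does not activate it are all assumptions rather than part of the PRITEXT description.

```python
def effective_duty_cycle(beta_active, activating_vectors, total_vectors,
                         standby_fraction=0.5, beta_idle=1.0):
    """Time-weighted duty cycle (fraction of time at logic "0") of one net.

    beta_active: duty cycle observed while the workload runs.
    activating_vectors: number of vectors in the set that pull the net to "1".
    total_vectors: size of the rotating exercise vector set.
    beta_idle: assumed duty cycle in standby slots whose vector does not
               activate the net (pessimistically 1.0, i.e., always at "0").
    """
    if total_vectors <= 0 or not 0 <= activating_vectors <= total_vectors:
        raise ValueError("invalid vector counts")
    exercised_share = activating_vectors / total_vectors
    # While an activating vector is applied, the net is at "1" (no stress).
    beta_standby = (1.0 - exercised_share) * beta_idle
    return (1.0 - standby_fraction) * beta_active + standby_fraction * beta_standby
```

Under these assumptions, a net with β = 0.95 during normal operation that is activated by a single vector ends up at 0.85 with a four-vector set, versus roughly 0.92 with a nine-vector set.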
On top of length optimization, reducing the size (bit-width) of each individual vector (termed vector compaction) is crucial as it directly leads to a reduced width of the vector set; this has a positive net effect, requiring a smaller ROM to house it and fewer MUXes to propagate it to the critical logic cone. In light of these benefits, various heuristics are developed for maximizing the NBTI rejuvenation impact.
C. 2-Dimensional Vector Set Optimization Heuristics
Length optimization: The gates lying on a critical path do not all influence the total switching delay degradation equally, due to the presence of the βi-dependent factor, as Equation 1 shows. This factor can be considered as the contribution of the internal gate (net) towards the total switching delay degradation along the timing path. This inspired us to develop a heuristic such that the vectors which exercise only low-criticality nets can be removed from the vector set, significantly reducing hardware overhead while incurring a minor loss in lifetime improvement.
Width optimization: In the general case, a new MUX is needed for each input bit of the exercise vector to enable its injection into the combinational logic cone. However, some inputs always have "don't care" values on all generated vectors. This can occur when powerful vector generation methods are employed to derive vectors with minimally specified bits. As a result, the corresponding ROM columns can be removed and there is no need for a MUX since the values of those inputs do not affect the state of the circuit. Moreover, other inputs may always assert either the "0" or "1" logic value in all the exercise vectors. In such cases, the corresponding ROM columns can also be removed, but a MUX per such input is still required and is set to the corresponding constant value (either "0" or "1").
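The column classification behind the width heuristic can be sketched as follows. The function name, the 0/1/x string encoding, and the returned dictionary are hypothetical. As an additional assumption, a column that mixes a single specified value with don't-cares is treated as constant, since the don't-cares could be filled with that value.

```python
def compact_width(vectors):
    """Classify the input columns of a vector set over the alphabet 0/1/x."""
    if not vectors or len({len(v) for v in vectors}) != 1:
        raise ValueError("vectors must be non-empty and equally wide")
    dont_care, const0, const1, stored = [], [], [], []
    for col in range(len(vectors[0])):
        specified = {v[col] for v in vectors} - {"x"}
        if not specified:
            dont_care.append(col)   # no ROM column, no MUX
        elif specified == {"0"}:
            const0.append(col)      # no ROM column, MUX tied to "0"
        elif specified == {"1"}:
            const1.append(col)      # no ROM column, MUX tied to "1"
        else:
            stored.append(col)      # ROM column plus MUX
    return {
        "rom_width": len(stored),
        "rom_bits": len(vectors) * len(stored),
        "mux_count": len(stored) + len(const0) + len(const1),
        "dont_care": dont_care,
        "const0": const0,
        "const1": const1,
        "stored": stored,
    }
```

For the toy set `["x01x0", "x01x1", "x0110"]`, only the last column needs to be stored in the ROM, while four of the five inputs need a MUX.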
IV. DETERMINISTIC VECTOR GENERATION
A. Deterministic vector generation derived from testing techniques
As discussed in the previous section, vectors are injected during the exercise mode of the circuit in order to balance the duty cycle of the critical nets, which are nets on critical paths with a high duty cycle (i.e., a high fraction of time spent at logic "0"). The injected vectors are generated offline using a deterministic vector generation algorithm, optimized for the specific problem at hand. In the manufacturing test domain, vector generation is an NP-complete process and its goal is to generate test vectors capable of detecting defects in manufactured circuits. Fault models are used to model the possible defects.
1) Vector generation based on stuck-at fault modeling
Under the well-known stuck-at fault model, the ATPG (Automatic Test Pattern Generation) process comprises the fault activation and the fault propagation phases. Fault activation requires the injection of the opposite to the fault value at the fault location by justifying the inputs of the circuit. Fault propagation extends the input justification process in order to propagate the fault effect to an observable output.
Critical net activation can be seen as a restricted version of the ATPG problem (since only fault activation is required but not propagation), where critical nets comprise the faults that need to be activated at the stuck-at-0 value. Stuck-at-0 activation involves justifying the logic value "1" at the net by setting the necessary values at the inputs of the circuit, which comprise the generated vector. The goal is to generate as few vectors as possible that are collectively capable of exercising all critical nets. In the best case, a single vector may exist that is capable of activating all stuck-at-0 faults on the targeted critical nets. Often, as this is a very strict condition to satisfy, several vectors are needed, where each one activates multiple critical nets.
2) Vector generation based on path delay fault modeling
Consider a set of critical nets, all lying on some circuit path, for which no single vector can activate all nets based on the stuck-at fault model discussed above. Then, the problem of finding the minimum number of vectors to exercise all these critical nets can be reduced to the problem of finding a robust test for a path delay fault on the particular path. In general, a delay fault is assumed to cause a defect in the manufactured circuit when the cumulative delay of a combinational logic path exceeds the clock period. The path delay fault model is the most accurate among delay fault models [11], as it can model both lumped and accumulated delays along paths. Under the path delay fault model, every fault is represented as a sequence of falling (1→0) or rising (0→1) transitions along a path. The transition initiated at an input is propagated through the path to an output. A path delay test consists of a pair of vectors (v1, v2), where v1 initializes the path and v2 launches and propagates the transition through the path to an observable output.
Path delay tests can be categorized as robust or non-robust. A robust test guarantees the detection of the path delay fault in consideration irrespective of the delays on other off-path nets, as it does not allow any hazards to propagate along the path. In this manner, any possible masking due to different timing arrival of transitions is avoided. Hence, a robust test (v1, v2) is guaranteed to exercise all critical nets on a path at the logic "1" value for 50% of the time needed to apply vectors v1 and v2. It is often the case that a robust test does not exist. This is mainly attributed to the complex structure of circuits with multiple re-convergent physical paths. In such cases, a relaxed test can be generated (called a non-robust test) which allows the propagation of static or dynamic hazards on some of the nets. In the domain of manufacturing test this translates to the possibility of additional circuit delays masking the transition and, hence, invalidating the test. In the context of the work in this paper, any critical nets asserting transitions with hazards cannot always be considered as exercised. The solution in this case is to isolate such nets and re-target them with additional vectors.
Table I summarizes the sensitization conditions on a path's off-inputs for robust and non-robust path delay fault tests. Off-inputs are nets which are inputs to gates on the path but are not nets of the path themselves. For the relaxed non-robust tests, it suffices for v2 to settle the off-inputs to a steady non-controlling value ("0" for OR/NOR gates and "1" for AND/NAND gates) and allow any value for v1 (x). For a robust test, if the on-path net settles to a controlling value ("1" for OR/NOR gates and "0" for AND/NAND gates) then the off-input nets must assert a stable non-controlling value during both v1 and v2; otherwise, the same conditions as with a non-robust test hold. The reader is referred to [11] for a complete discussion on path delay fault test generation.
Fig. 2 illustrates a short example of how path delay fault tests can be utilized to exercise critical nets. The first objective is to find a robust test for a path which covers all the critical nets (for simplicity, at this point let us assume that one such physical path exists; this condition will be relaxed later in subsection IV-B2). The targeted path is highlighted in red and starts from input b, passes through gates 1, 2, 4, and ends at output f. We assume, without any loss of generality, that all nets in the path are critical. As a robust test (v1, v2) guarantees the propagation of a transition from an input net to an output net through the targeted path, the activation of all critical on-path nets during either the first vector v1 or the second vector v2 is also guaranteed.
Test generation begins by asserting a transition at the origin of the path (let this be a rising transition here) and propagates it to the path output based on the types of the gates on the path (see Fig. 2.a). This phase is followed by an attempt to justify all off-path nets for all gates on the path to the necessary values, as shown in Fig. 2.b. For a robust test, input c is set to a stable 0 value (0, 0), and input a is set to (x, 1), according to the sensitization rules for robust tests given in Table I. However, this does not allow the off-path input of gate 4 to be set to (0, 0); thus, a robust test is not possible for the particular path and transition type at the origin. Even though a robust test might not be possible, a non-robust test might be feasible for the particular path. The solution in this case, shown in Fig. 2.c, is a non-robust test where the necessary conditions for inputs c and a are (x, 0) and (x, 1), respectively. Moreover, the propagated falling transition at the output of gate 3 can be justified by (x, 0), which is a valid off-input sensitization condition for gate 4. The generated pair of vectors, as shown in Fig. 2.d, is (v1, v2) = (x0x, 110). In this case, as net f will either assume a static-1 hazard value or a stable 1 value, it will be exercised. A reverse situation (static-0 hazard or stable 0 value) would require that net f is re-targeted by a different vector to ensure that it will be exercised as desired.
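The off-input sensitization rules stated in the text can be written as a small lookup. This is a sketch of those stated rules only: the function name and value encoding are hypothetical, and the gate types used in the usage note are assumptions inferred from the conditions quoted for inputs c and a, since the figure itself is not reproduced here.

```python
# Non-controlling values as stated in the text:
# "0" for OR/NOR gates and "1" for AND/NAND gates.
NON_CONTROLLING = {"AND": "1", "NAND": "1", "OR": "0", "NOR": "0"}


def off_input_condition(gate_type, on_path_final_value, robust):
    """Return the required (v1, v2) values on an off-input of an on-path gate.

    on_path_final_value: value ("0" or "1") the on-path input settles to
    under v2. "x" means the value under v1 is unconstrained.
    """
    nc = NON_CONTROLLING[gate_type]
    settles_to_controlling = on_path_final_value != nc
    if robust and settles_to_controlling:
        return (nc, nc)   # stable non-controlling value during v1 and v2
    return ("x", nc)      # only v2 must settle to the non-controlling value
```

Assuming an OR-type gate whose on-path input rises to "1", `off_input_condition("OR", "1", True)` yields `("0", "0")`, matching the stable (0, 0) required on input c, while the non-robust variant relaxes this to `("x", "0")`.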
B. Proposed Exercise Vector Generation Technique
1) Activation of critical nets on a single physical path
Assume a circuit netlist C and a set of critical nets Nc. The critical nets are derived based on the NBTI stress model discussed in section II (additional details are also provided in section V where the complete evaluation framework is discussed). Algorithm 1 outlines the major steps in the proposed exercise vector generation methodology. The primary goal is to derive a small number of vectors (V) capable of activating all nets in Nc. A secondary goal is the optimization of each generated vector in terms of unspecified bits, as this allows for further compaction of the vector set in both dimensions (number and size of vectors) by applying the heuristics discussed in section III. This latter goal is achieved by the powerful test generation routines utilized for this work; their particular details are beyond the scope of this paper. The reader is referred to [17], [18], [19] for a discussion of test generation techniques. Algorithm 1 begins by targeting the activation of all critical nets with a single vector based on the relaxed stuck-at model (step 01). If one such vector exists then V will contain a single vector; otherwise, more than one vector is required to exercise all nets in Nc, and the approach considers the path delay fault model by first finding a physical path p which contains all critical nets in Nc (step 06). It is assumed at this point that such a path p always exists; however, this condition can be removed as discussed in the next subsection. Consequently, a robust test is targeted for p (step 07). If such a test exists, then the approach is guaranteed to return a set of 2 vectors V = (v1, v2) able to activate all nets in Nc. As discussed in the previous subsection, for cases where a robust pair of vectors (test) does not exist, the conditions of test generation are relaxed to derive a non-robust pair of vectors (step 12). It is assumed that a non-robust test can always be found for path p. Otherwise, p cannot be singly sensitized, which means that it cannot affect the timing of the circuit unless delays on other nets, not on p, exist in the circuit. However, as all critical nets are on p, this situation cannot occur.
Under the non-robust criterion, some nets of p may assert values with hazards (either static or dynamic) and, therefore, may not be able to guarantee the activation of a critical net. In steps 14-16, the list of critical nets is updated to only contain such problematic nets and the entire approach is repeated but only for these nets. The algorithm terminates when no more critical nets exist in Nc.
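The control flow described above can be sketched as follows. This is a minimal sketch based only on the prose description, not a transcription of Algorithm 1. The four callables are hypothetical stand-ins for the underlying ATPG routines, and the no-progress guard is an addition made here so that the sketch always terminates.

```python
def generate_exercise_vectors(critical_nets, single_vector, find_path,
                              robust_test, non_robust_test):
    """Iteratively derive exercise vectors for a set of critical nets.

    Hypothetical callables standing in for test generation routines:
      single_vector(nets)   -> vector or None (relaxed stuck-at-0 activation)
      find_path(nets)       -> a path containing all given nets
      robust_test(path)     -> (v1, v2) or None
      non_robust_test(path) -> ((v1, v2), nets_with_hazards)
    """
    remaining = set(critical_nets)
    vectors = []
    while remaining:
        vector = single_vector(remaining)
        if vector is not None:            # one vector activates everything left
            vectors.append(vector)
            break
        path = find_path(remaining)
        pair = robust_test(path)
        if pair is not None:              # robust pair exercises all nets on path
            vectors.extend(pair)
            break
        pair, hazard_nets = non_robust_test(path)
        vectors.extend(pair)
        unexercised = remaining & set(hazard_nets)
        if unexercised == remaining:      # guard added for this sketch only
            raise RuntimeError("non-robust test exercised no critical net")
        remaining = unexercised           # repeat only for problematic nets
    return vectors
```

Each pass either finishes with a single vector or a robust pair, or shrinks the set of critical nets to those left with hazards, which mirrors the termination condition described above.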
2) Extending to the general case
The proposed exercise vector generation technique can be extended to the general case where critical nets are dispersed over multiple paths in the circuit and do not lie on a single physical path. A heuristic procedure can be used to find the minimum number of paths that can cover all critical nets and then utilize the proposed vector generation procedure to generate vectors for each path. Such a heuristic can be implemented using known graph-theoretic approaches, where the circuit logic is represented as a Directed Acyclic Graph (DAG), such as a modified version of the problem of determining a minimum number of edge-disjoint paths to cover edges in a DAG. Vertices of the DAG represent the nodes of the circuit, while edges represent the nets of the circuit. The heuristic solves the problem of finding the minimum number of paths in the DAG that can cover a specific subset of edges representing the critical nets. Each path should cover as many of the critical nets as possible in order to result in a small number of paths covering all critical nets. In-depth details are omitted at this point due to space limitations.
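One simple way to approximate such a path cover is a greedy loop that repeatedly extracts the path covering the most still-uncovered critical edges. The sketch below is an assumption made for illustration and is not the heuristic used in this work: it does not guarantee a minimum number of paths, the returned paths may share edges, and the recursive scoring is only suitable for small graphs.

```python
def greedy_path_cover(edges, critical):
    """Greedily cover critical edges of a DAG with paths.

    edges: iterable of (u, v) pairs, one per net between circuit nodes.
    critical: set of (u, v) edges that must be covered.
    Returns a list of paths, each a list of consecutive edges.
    """
    edges = list(edges)
    uncovered = set(critical)
    if not uncovered <= set(edges):
        raise ValueError("critical edges must belong to the graph")
    successors, nodes = {}, set()
    for u, v in edges:
        successors.setdefault(u, []).append(v)
        nodes.update((u, v))

    paths = []
    while uncovered:
        best = {}  # node -> (uncovered critical edges reachable, next node)

        def score(u):
            if u not in best:
                top = (0, None)
                for v in successors.get(u, []):
                    gain = score(v)[0] + ((u, v) in uncovered)
                    if gain > top[0]:
                        top = (gain, v)
                best[u] = top
            return best[u]

        # Start from the node whose best path covers the most uncovered edges.
        u = max(sorted(nodes), key=lambda node: score(node)[0])
        path = []
        while score(u)[1] is not None:
            v = score(u)[1]
            path.append((u, v))
            u = v
        uncovered -= set(path)
        paths.append(path)
    return paths
```

Each extracted path could then be passed to the single-path vector generation procedure described in the previous subsection.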
Algorithm 1 Proposed exercise vectors generation technique
Inputs: Circuit netlist C, list of critical nets
A. NBTI Critical Timing Paths
NBTI degradation is not uniform across all the paths in a device because of different timing delays and duty cycles along the gates. In short, workload stresses each net and path differently leading to highly skewed duty cycles across some of the timing paths. Hence only the paths which do not have enough slack to overcome the degraded switching delay are highly prone to timing failures due to NBTI stress. All the paths having less than 10% slack are considered to be NBTI critical. We synthesized the processor core for a clock frequency of 500 MHz using Synopsys Design Compiler with a 45 nm TSMC standard cell library. Using the static timing analysis tool from Synopsys, we obtained 83 critical paths which can be broadly classified into three groups based on the corresponding pipeline stages as listed below:
• Group A consists of 70 paths. All these belong to the Load/Store Unit and are responsible for memory disambiguation logic.
• Group B consists of 8 paths. All these belong to the Simple ALU unit, which performs basic arithmetic instructions.
• Group C consists of 5 paths. All these belong to the Instruction Decode unit.
B. Workload Characterization of Critical Paths
Since NBTI is highly sensitive to the duty cycle of the transistors that lie along a timing path, we performed gate-level simulations of the synthesized netlist using six workloads (bzip2, gap, gzip, mcf, vortex and parser) to obtain their duty cycles. Due to the lack of commercial tool chain support for the PISA architecture used by FabScalar, we are constrained to the SPEC CPU2000 suite, consisting of these 6 benchmarks, which are representative of real-time workloads. Using the waveform dumps, we calculated the duty cycle of all the internal nets along each of the three groups of critical timing paths. Since the total delay degradation of each path is dependent upon the duty cycle of each of the internal nets along the critical path, we computed the cumulative delay degradation (see Equation 1) for each of the 83 paths. We found that the paths of Group A (Load/Store Unit) degrade 100× faster than the paths of Groups B and C because they have relatively highly skewed duty cycles, thereby making Group A paths the most NBTI-critical and lifetime-shortening. Hence we next focus only upon Group A paths, which implement the logic to check whether a speculatively executed load has a true data dependency with any of the outstanding stores. A detailed analysis of the microarchitecture of the corresponding memory disambiguation logic is omitted due to space limitations. Fig. 3 illustrates the distribution of duty cycles of internal nets across all critical paths belonging to Group A under each tested workload. The horizontal axis in Fig. 3 shows duty cycle bins ranging from B0 to B9, while the vertical axis shows the fraction of internal nets that fall into each corresponding bin. The bin width is 0.1; hence B0 represents duty cycles in the range [0.0,0.1), B5 represents duty cycles in the range [0.5,0.6), and so on. As illustrated in Fig. 3, the duty cycles of these critical paths are highly skewed towards the boundaries. On average, 38% of critical nets have duty cycles in the range of [0.9, 1.0), highlighting the underutilization of several logic paths in the processor, which are highly prone to harmful NBTI wearout effects.
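The binning used for such a distribution can be sketched as below. The function names and the sampled-waveform representation are hypothetical, and placing a duty cycle of exactly 1.0 into B9 is an assumption, since the stated bin ranges are half-open.

```python
def duty_cycle(samples):
    """Fraction of sampled cycles in which a net is at logic 0."""
    if not samples:
        raise ValueError("no samples")
    return sum(1 for s in samples if s == 0) / len(samples)


def duty_cycle_histogram(duty_cycles, bins=10):
    """Fraction of nets per bin B0..B9, each bin covering a width of 1/bins."""
    if not duty_cycles:
        raise ValueError("no duty cycles")
    counts = [0] * bins
    for beta in duty_cycles:
        if not 0.0 <= beta <= 1.0:
            raise ValueError("duty cycle out of range")
        # Assumption: beta == 1.0 is folded into the last bin.
        index = min(int(beta * bins), bins - 1)
        counts[index] += 1
    return {f"B{i}": c / len(duty_cycles) for i, c in enumerate(counts)}
```

Applying `duty_cycle` to each net's sampled waveform and passing the results to `duty_cycle_histogram` gives one bar group per workload.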
Modern processors are often quiescent for a major fraction of their operation while waiting for events such as input/output access or off-chip DRAM access to complete. This provides opportunities to enable our exercise mode to combat NBTI effects whenever the processor is idle. Based on the real-time statistics collected for various processors in several server clusters, we have considered a conservative ratio of 1:1 for active to standby processor state durations in our experimental evaluation. A higher standby time offers improved opportunities to balance duty cycles along critical logic paths, since each exercise vector receives a higher fraction of operational time.
VI. EXPERIMENTAL EVALUATION AND RESULTS
A. Deterministic Algorithm Vector Generation
The extracted combinational logic cone of the critical nets in the Load/Store unit of our experimental platform [4] consists of 1,426 primary inputs, 15,194 internal nodes, and 63 unique critical nets which all lie on a single path. Using Algorithm 1 we have obtained 9 exercise vectors (designated V1-V9, with each vector having 1,426 bits) which together are guaranteed to exercise all of the 63 critical nets at least once during application of the exercise phase. Fig. 4 additional nets, and 34 critical nets in total. Vectors V1, V2, V3 and V4 collectively exercise 50 (35+9+2+4) out of the total 63 critical nets, covering approximately 80% of them. This observation, together with the heuristic described in section III-C, motivated us to compute lifetime improvement under two distinct vector sets: a complete set comprising all 9 vectors (dubbed "Set9"), and a second set consisting only of the first 4 vectors (dubbed "Set4"). Both of these vector sets are considerably smaller and more efficient (in terms of overhead) than the set derived from the algorithm in [10], which gives a set of 14 (10) exercise vectors for 100% (80%) coverage of the critical nets for this design.
In order to minimize the hardware overheads in our PRITEXT technique, we performed vector compaction to minimize the newly introduced MUXes and the on-chip ROM that stores our vectors. The storage capacity of this ROM equals the product of the vector count and the width of each exercise vector, which in turn corresponds to the 1,426 primary inputs of the extracted logic. Out of 1,426 possible MUX locations corresponding to the inputs under vector set Set4, 1,224 contain "don't care" logical values across all 4 generated vectors. This high number of unspecified bit values is attributed to the ability of the proposed vector generation technique to produce vectors that exercise a large number of critical nets per vector, while setting only the necessary input bits at a time. Moreover, 81 of the inputs always remain at logic value "0", while 83 inputs always remain at logic value "1" in each vector. Hence, the final ROM width is only 38 (1426-1224-81-83) bits, leading to a total ROM size of 4 × 38 bits (i.e., 4 vectors affect 38 inputs). Only 202 (38+81+83) MUXes are necessary, with 164 (81+83) of them presenting a constant value as input during exercise mode. A similar analysis for Set9 leads to a ROM size of 9 × 53 bits, totaling 357 MUXes (53+171+133). Hence, the width optimization heuristic achieves ROM size and MUX overhead reductions of 97.3% (96.3%) and 85.8% (74.9%) for Set4 (Set9), respectively, compared to unoptimized vector sets of the same length with a full width of 1426 bits (e.g., 9 × 1426 bits for Set9) and 1426 MUXes. The overall hardware overhead due to the PRITEXT technique for Set9 was well within 0.5% of the total area of the reference processor. Note that the impact of the aging of the PRITEXT hardware itself on the lifetime of the device is not significant. The overall lifetime acceleration achieved in the long run will outweigh this impact since the hardware overhead is quite small.
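The overhead arithmetic above can be re-derived from the column counts with a few lines. The helper name is hypothetical. The baseline is assumed to be an uncompacted set with the same number of vectors and the full 1,426-bit width, which reproduces the 97.3% and 96.3% ROM figures. The Set9 don't-care count of 1,069 is not stated in the text and is derived here as 1426-53-171-133.

```python
def width_overhead(num_vectors, total_inputs, dont_care, const0, const1):
    """ROM and MUX overhead after width compaction of an exercise vector set."""
    rom_width = total_inputs - dont_care - const0 - const1
    muxes = rom_width + const0 + const1
    rom_bits = num_vectors * rom_width
    baseline_bits = num_vectors * total_inputs  # assumed uncompacted baseline
    return {
        "rom_width": rom_width,
        "rom_bits": rom_bits,
        "muxes": muxes,
        "rom_reduction": 1.0 - rom_bits / baseline_bits,
        "mux_reduction": 1.0 - muxes / total_inputs,
    }


set4 = width_overhead(4, 1426, dont_care=1224, const0=81, const1=83)
set9 = width_overhead(9, 1426, dont_care=1069, const0=171, const1=133)
```

From these counts, the MUX reduction evaluates to roughly 85.8% for Set4 and 75.0% for Set9.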
B. Balanced Duty Cycles
Fig. 5 illustrates the duty cycle distribution of critical nets when the PRITEXT technique is applied to the superscalar processor under consideration. The fraction of critical nets having skewed duty cycles (i.e., within bin 0.9) has dropped from 38% to 13% on average, as compared to the reference system with no lifetime-extending support, highlighting PRITEXT's efficacy.
C. Lifetime Improvement
Fig. 6 gives the lifetime improvements obtained with PRITEXT under our two vector sets. Set4 achieves nearly the same improvement as Set9, yet demands smaller hardware overheads. This tradeoff between the size of the vector set (with repercussions such as hardware overhead and dynamic power consumption) and lifetime improvement can be leveraged by designers to balance design constraints. PRITEXT achieves an average 4.99× lifetime improvement over a reference system using deterministic vectors, while a maximum improvement of 13.91× is observed for the highly memory-intensive mcf workload.
VII. RELATED WORK
Most research to date into NBTI-induced aging focuses on modeling and methodologies to combat it. Aging effects are studied under stress conditions to derive relevant micro-architectural models, with the latter being more realistic for high-frequency long-term CMOS operation [9], [13]. We next concentrate on representative NBTI alleviation techniques from the microprocessor domain.
Among such mechanisms, Abella et al. [1] introduced the Penelope NBTI-aware processor architecture, in which they discuss a number of techniques to combat NBTI in various components, including a mechanism that writes special values into memory cells in order to keep the duty cycle at an ideal 50%. Gunadi et al. [6] suggested the Colt duty cycle equalizer, which balances the duty cycle by alternating true and one's complement data representations, while Gupta et al. [7] proposed generating idle periods for BTI recovery by power gating most of the components in a single-core processor system. Next, Oboril and Tahoori [12] reduced aging in microprocessor pipelines by replacing the traditional design-time time-balancing pipeline scheme with MTTF-balanced pipelines, also at design time. The same authors proposed a technique to alleviate NBTI [14] in which instructions are classified based upon their execution criticality and each is directed to a specialized functional unit (FU), so as to balance the duty cycle in each FU by leveraging one at the expense of the other. At the middleware level, the same method leverages specialized NOP instructions to achieve maximum NBTI relaxation in processors [5].
Further works focus on architectural-level reliability models. The research in [16] proposed such a model of a processor core, which considers a set of failure mechanisms and assumes uniform failure rates across specific components; this assumption, however, restricts the accuracy of the model when it is extended to the entire chip. The work in [15] further develops this concept and introduces effective defect density and effective stress condition coefficients that weigh the failure impact across the chip area and run-time, respectively. Last, Jenihhin et al. [8] propose an NBTI mitigation method that rejuvenates the logic along NBTI-critical paths, which are first identified hierarchically. Using SPICE simulations, a detailed model for computing NBTI-induced delays is first captured at the gate level and then fed as input to an evolutionary algorithm to extract critical circuit usage patterns and thereafter create periodic rejuvenating stimuli. However, the effectiveness of the bit patterns in the generated vectors in mitigating NBTI was not evaluated, and no discussion of the periodicity of their application was presented. In contrast to their efforts in optimizing the convergence of the evolutionary algorithm, we have proposed a deterministic vector generation algorithm based on the principles of the path delay test model, and presented optimization techniques to reduce the hardware overhead. Kim et al. [10] proposed and evaluated the vector exercising technique considered in this work for NoC routers. However, their vector generation algorithm is limited to stuck-at fault activation concepts only, leading to larger vector sets and considerably higher hardware overheads for general designs.
VIII. CONCLUSIONS
We have presented a novel, non-invasive micro-architectural technique, dubbed PRITEXT, to improve the lifetime of a CMOS design under NBTI stress using exercise vectors. We leveraged path delay test principles to derive near-ideal vectors, while also providing a deterministic algorithm to generate exercise vectors in circumstances where such tests do not exist. Next, we demonstrated the efficacy of our technique on a reference superscalar processor. Our evaluation under realistic benchmarks showed that PRITEXT leads to an average 4.99× and a maximum of 13.91× lifetime improvement using 9 generated vectors, requiring negligible hardware overheads to store them and apply them to the underlying test-bench architecture. Finally, we applied a proposed heuristic to reduce the number of exercise vectors, and hence the associated hardware, even further with minimal loss in lifetime improvement.
