Abstract-The continuous scaling of transistor sizes and the increased parametric variations render nanometer circuits more prone to timing failures. To protect circuits from such failures, designers typically adopt pessimistic timing margins, which are estimated under rare worst-case conditions. In this paper, we present a technique that mitigates such pessimistic margins by minimizing the number of timing failures. In particular, we propose a method that minimizes the number of long latency paths within each processor pipeline stage and constrains them in as few stages as possible. Such a method allows us not only to reduce the timing failures but also to limit the potential error-prone locations to only a few pipeline stages. To further reduce these failures, we exploit the path excitation dependence on data patterns and truncate the bitwidth of the operands in the few remaining long latency paths by setting a number of less significant bits (LSBs) to a constant value of zero. Such a truncation may incur quality loss, but this is limited since it is applied only to the LSBs of the few operands that may activate the confined error-prone long latency paths. To evaluate the efficiency of our method, we perform post-place and route dynamic timing analysis based on real operands extracted from a variety of applications. This helps to estimate the bit-error rate, while considering the data-dependent path excitation. When applied to an IEEE-754 compatible double precision floating point unit (FPU), the proposed approach reduces the timing failures by 216.25× on average compared to the reference FPU design under an assumed 8.1% variation-induced worst-case path delay increase in a 45-nm process. Our results show that the path shaping alone introduces negligible overheads of 0.25% in area and 5.7% in power, with no performance cost.
Finally, we demonstrate that by combining the path shaping with aggressive operand bitwidth truncation, we enable power savings up to 44.7% due to the substantially reduced switching activity at minimal quality loss.
I. INTRODUCTION
THE AGGRESSIVE shrinking of transistor sizes has worsened process variations, which has led to a 25% delay increase [1] and 20× higher leakage variation [2] in advanced nanometer technologies. These trends make circuits more prone to timing failures, thus threatening circuit functionality, and hinder designs from meeting the targeted specifications [3]. Such delay variations worsen further under scaled voltages, which are considered necessary for saving power [4]. Manufacturers tend to adopt timing guardbands that force the circuit to operate at a lower frequency or a higher voltage, providing sufficient timing margins to mitigate any failures triggered by delay variations [2], [5]. However, such timing margins are considered overly pessimistic, since they are estimated based on a few worst-case critical paths and on assumed rare operating conditions (e.g., temperature), thus incurring large overheads [1], [3], [5], especially for the inherently fast paths.
In an attempt to trim down the introduced overheads, statistical static timing analysis (SSTA) tools have been introduced [6], but such tools still focus on improving the analysis and margin estimations rather than the design itself. Design-centric techniques focus on integrating extra circuits to detect any errors and either try to correct them in-situ using special flip-flops [7] or stall the pipeline and replay the failed instructions [4], [8]. Other design-centric schemes try to predict the instructions and operands that may activate the failure-prone long latency paths (LLPs) [9] and provide extra clock cycle(s) for the completion of these paths. However, in all these cases the timing constraints enforced for timing-error detection and the overheads incurred by the applied recovery methods, especially when the activation probability of critical paths is high, may negate the gains achieved by removing the static or dynamic safety margins. An approach proposed in [10] helps to limit the overheads of the above methods by reducing the overall number of failure-prone critical paths. Although effective, such a method has never been applied to a fully pipelined design, nor has it been considered jointly with the above design-centric schemes or other methods to reduce the dynamic excitation of critical paths.
Recently, approximate computing has emerged as an alternative approach for addressing potential timing failures with lower overheads than the ones incurred by the conventional guardband-based techniques [5], [11]. Existing studies have showcased the inherent resiliency of various signal/image processing [12], [13], machine learning [12], [14] and scientific computation algorithms [13] to faults or inaccurate operations. Most of the existing studies have indicated that any approximation should be applied only to error-resilient code or data regions in applications, since uniform approximation of all data may result in significant quality degradation [15], [16]. For instance, approximating a few less significant bits (LSBs) in the mantissa part of floating point (FP) operands leads to insignificant quality loss, as opposed to approximating any bit of the exponent part of FP operands [17], [18]. The majority of approximation-based schemes have been applied to standalone digital signal processing (DSP) accelerators or integer arithmetic units rather than to complex pipelined designs. Furthermore, such schemes have been evaluated on simulators using error-injection models that neglect the impact of any approximation, e.g., reduced precision, on the dynamic data-dependent path excitation.
A few research studies have tried to exploit the data-dependent path excitation for limiting the static or dynamic margins by dynamically adjusting the clock period [19], [20]. Such schemes have shown that conventional static timing analysis (STA) estimates timing paths overly pessimistically compared to the dynamic timing analysis (DTA) approach that we propose here. However, the proposed per-instruction adjustment of the clock may be very challenging to apply in practice, and has never been used in conjunction with approximation schemes.
Contributions: The primary objective of this paper is to minimize the potential timing failures in pipelined designs. In particular, we propose a method to redesign the target circuit for limiting the failure-prone LLPs and reduce the path excitation probability by truncating the operand bitwidth [21] . Our approach can effectively reduce the overheads of traditional variation-aware schemes, while complementing existing design-centric and approximation-based schemes. The contributions of this work can be summarized as follows.
• We present a new framework to redesign a pipelined circuit by carefully shaping the path distribution such that the LLPs are significantly reduced and isolated to as few pipeline stages as possible. The advantages of such an approach are two-fold. First, it reduces the critical LLPs and thus significantly reduces the failure probability. Second, it limits the number of registers or stages, where any variation-aware or approximation scheme would need to be used, thus limiting any resulting overheads.
• We exploit the data dependent excitation of timing paths by truncating the bitwidth of the operands that may activate the few remaining LLPs. This is realized by setting a number of LSBs of these operands to a constant value of zero. By doing so, we reduce the computational delay of the LLPs along with the excitation probability of these paths. In addition, we also reduce the power consumption due to a lower switching activity. The operand bitwidth truncation may lead to a deterministic quality loss. However, we limit such loss by carefully selecting the number of truncated bits of the operands that may excite the isolated LLPs. Moreover, any loss incurred by truncation is expected to be less than the loss incurred by random timing failures, which may affect significant parts of the computation.
• We apply the proposed approach to the implementation of a variation-aware IEEE-754 compatible [22] Floating Point Unit (FPU) in a 45 nm process technology.
Note that floating point variation-aware designs have not received much attention apart from a few works [23], [24], despite the importance of FPUs in today's high-end processors. In addition, FPUs are excellent representatives of complex pipelined designs, and any approximation scheme has a measurable effect on accuracy. By combining a set of constraints at the synthesis and place and route phases with micro-architectural changes, we show that the LLPs can be significantly reduced compared to an original FPU designed with conventional performance-centric optimizations.
• We present a post-layout DTA tool to estimate the dynamic excitation of timing paths and potential timing failures, considering the operands of executed program traces and the clock period.
• We develop a profiling tool to extract FP instruction traces and operands from various applications executed on a RISC (Reduced Instruction Set Computer) processor. We use the profiling results as input to the developed DTA tool for realistic evaluation of the efficacy of our approach.
• We estimate the timing failures under various assumed clock reduction (CR) levels that represent potential variations for the proposed and original designs under different operand bitwidth truncation ranges. Based on these estimates, we evaluate the quality loss, quantified in terms of the Relative Error (RE), for four popular applications: Heartwall, Raytrace, CFD and K-means. These applications originate from the Medical Imaging, Computer Graphics, Fluid Dynamics and Data Mining domains, respectively. Finally, we estimate the bit error rate (BER) and show how it varies across the bit positions and across applications in the conventional and proposed FPU designs.
The rest of the paper is organized as follows. Section II presents the proposed approach. Section III describes the implemented flow using state-of-the-art tools. In Section IV, we apply the proposed framework to the design of an IEEE-754 compatible FPU of a RISC processor. Section V presents our experimental results. Section VI discusses related work. Finally, we draw our conclusions in Section VII.
II. PROPOSED APPROACH
Let us consider a pipelined design which consists of a set of N unique combinational paths $P = \{p_1, p_2, \ldots, p_N\}$. In this design, each path completes with a delay $D(p_i)$ for $i = 1, 2, \ldots, N$. As in any synchronous design, the longest path across all S pipeline stages determines the clock period:
$$CP_{STA} = \max_{s = 1, \ldots, S} \max\{D(P_s)\} \qquad (1)$$
where $P_s$ is the set of paths that belongs to pipeline stage s (s = 1, 2, . . . , S). Figure 1a depicts a typical distribution of all the path delays D(P), which is obtained by applying conventional design flows and STA. As can be seen, such a distribution is characterized by a so-called "timing wall", with many LLPs close to CP STA . Such a wall is a consequence of how modern designs are optimized for power and area, subject to a global frequency constraint. In particular, current design flows minimize the delay of LLPs by (area/power hungry) gate up-sizing, while the inherently short latency paths (SLPs) are allowed to become near-critical for recovering any area or power costs [10]. This "timing wall" does not have any negative impact on the adopted CP STA of the design; however, it critically affects the probability of timing failures, since under any (even small) delay variation many paths may fail [25].
Consider a set of K SLPs, $P_{SLP} = \{p^{SLP}_1, \ldots, p^{SLP}_K\} \subset P$, and a set of operands $Op_{SLP}$ that excite such paths, and assume that all these paths can complete their computations within a time $T_{SLP} = \max\{D(P_{SLP})\}$, as shown in Figure 1a. Figure 1a also shows that, when only SLPs are excited at a stage, a timing slack of $(CP_{STA} - \max\{D(P_{SLP})\})$ can be observed. Such a slack can be used as a safety margin against potential delay variations, minimizing the probability of a timing failure at that stage. Intuitively, by ensuring that only SLPs are triggered across all stages, the overall probability of a timing failure will be low.
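As a rough illustration of the definitions above, the following Python sketch computes the clock period as the longest path delay over all stages, splits the paths into SLPs and LLPs for a chosen T SLP, and reports the slack that is available when only SLPs are excited. The function names and the per-stage path delays are hypothetical and chosen only for illustration; delays are expressed in picoseconds to keep the arithmetic exact.

```python
from typing import Dict, List, Tuple

Path = Tuple[int, int]  # (pipeline stage, path delay in ps)


def clock_period(stage_delays: Dict[int, List[int]]) -> int:
    """Longest path delay across all pipeline stages (the STA clock period)."""
    return max(max(delays) for delays in stage_delays.values())


def split_paths(stage_delays: Dict[int, List[int]],
                t_slp: int) -> Tuple[List[Path], List[Path]]:
    """Assign each path to the SLP group (delay <= t_slp) or to the LLP group."""
    slp: List[Path] = []
    llp: List[Path] = []
    for stage, delays in stage_delays.items():
        for delay in delays:
            (slp if delay <= t_slp else llp).append((stage, delay))
    return slp, llp


if __name__ == "__main__":
    # Hypothetical path delays per stage, NOT taken from the evaluated design.
    delays_ps = {1: [900, 1200, 1500], 2: [1100, 1680], 3: [1400, 1790, 1850]}
    cp_sta = clock_period(delays_ps)
    slp, llp = split_paths(delays_ps, t_slp=1680)
    print("CP_STA [ps]:", cp_sta)
    print("|P_SLP|, |P_LLP|:", len(slp), len(llp))
    print("stages containing LLPs:", sorted({stage for stage, _ in llp}))
    print("slack when only SLPs are excited [ps]:", cp_sta - max(d for _, d in slp))
```

In this toy distribution the LLPs end up in a single stage, which is the situation the path shaping described next aims for.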
A. Path Shaping and Critical Stage Constraining
The first step of our approach for reducing the timing failures is to move away from a path distribution with many LLPs close to the CP STA , which is typical in conventional designs (see Figure 1a) . The goal of our work is to minimize |P LLP | subject to the target clock period CP STA and power/area constraints.
To achieve such a goal, we define appropriate timing constraints for different path groups and impose them on the design during synthesis. By introducing such path constraints, we ensure that the inherently fast paths (i.e., SLPs) are not made slower, as opposed to the conventional approach. The end goal is to obtain a path distribution similar to the one depicted in Figure 1b, where |P SLP | >> |P LLP |. Note that to facilitate the isolation of the discerned path groups, we also make modifications at the micro-architectural or register-transfer level (RTL) by trying to constrain the LLPs in as few stages as possible. This not only helps to better control the path groups, but also enables the isolation of P LLP to stages that are accessed by only a few specific instructions. This allows us to apply a failure mitigation or correction technique to a few stages rather than to the whole design, which is far more complicated and costly [26]. In addition, it facilitates the development of failure mitigation mechanisms tailored for the specific instruction(s) that activate the few remaining LLPs. It is also important to note that our approach does not change the CP STA , since any path shaping is made subject to maintaining the conventional speed.
B. Significance-Driven Operand Truncation
The activation of each combinational path within each stage in a pipeline design depends on the executed instruction and on the input operands of each operation that takes place in each stage. Hence, it is possible to further reduce the activation probability of the LLPs by adjusting the operands of the specific instructions that excite these paths. A straightforward solution that we propose is to set to "0" some LSBs of Op LLP and truncate the bitwidth, thus reducing the computational delay. To elucidate the impact of bitwidth truncation on computational delay, let us consider a simple 4-bit ripple carry adder (RCA), as shown in Figure 2 . The depicted adder consists of four full adders (FA). FA is a logic circuit that adds two input operand bits (Ai, Bi) plus a Carry in bit (Ci,i) and generates a Carry out bit (Co, i) and a sum bit (S, i). In such a design, the most timing critical path LLP1 (emphasized in red dotted line) will be activated when the carry propagates all the way from Ci,0 to Co, 3. If we define the gate delay as T, then LLP1 requires a delay equal to 8T to be completed, since LLP1 will have to travel from the AND gate (emphasized in red) in FA0 down to XOR gate (emphasized in red) in FA3. According to Eq. (1), the minimum CP STA of this datapath is 8T. Note that such a delay will be activated only in case of a suitable combination of operands belonging to Op LLP . For instance, when A = 1111 and B = 0001, the carry generated in the first bit position (i.e., LSB) is propagated all the way to the final bit position, exciting the error-prone LLP1. Under an assumed variation induced delay increase this path will fail as D(LLP1) > 8T and thus D(LLP1) > CP STA . However, by modifying the inputs and inserting 0s in the last 2 bits, such as: A = 1100 and B = 0000, there is no carry propagation and thus only SLPs (e.g., SLP1 and SLP2 highlighted in green dashed lines) will be excited. 
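The effect of zeroing LSBs on carry propagation in the ripple carry adder example can be reproduced with a small behavioral model. The sketch below is not a gate-level timing model: it only counts how many consecutive full adders a single carry travels through, using the usual generate/propagate view of an adder. The function names are hypothetical.

```python
def zero_lsbs(value: int, n: int) -> int:
    """Set the n least significant bits of value to zero."""
    return value & ~((1 << n) - 1)


def longest_carry_chain(a: int, b: int, width: int) -> int:
    """Number of consecutive full adders traversed by the longest-lived carry.

    A carry is generated where both operand bits are 1, it is propagated
    where exactly one operand bit is 1, and it is killed where both are 0.
    Carry-in of the adder is assumed to be 0.
    """
    longest = 0
    run = 0  # full adders the currently rippling carry has passed through
    for i in range(width):
        ai = (a >> i) & 1
        bi = (b >> i) & 1
        if ai & bi:        # generate: a new carry starts at this full adder
            run = 1
        elif ai ^ bi:      # propagate: an incoming carry (if any) ripples on
            run = run + 1 if run else 0
        else:              # kill: any incoming carry stops here
            run = 0
        longest = max(longest, run)
    return longest


if __name__ == "__main__":
    a, b = 0b1111, 0b0001
    print(longest_carry_chain(a, b, 4))                              # full-width operands
    print(longest_carry_chain(zero_lsbs(a, 2), zero_lsbs(b, 2), 4))  # 2 LSBs zeroed
```

For A = 1111 and B = 0001 the carry ripples through all four full adders, whereas for the truncated operands A = 1100 and B = 0000 no carry is generated at all.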
By truncating the last 2 bits of the input operands, the delay of the critical paths that are excited is reduced to 2T, thus providing enough timing slack to address any potential delay increase. The truncation of a number of LSBs from Op LLP may provide a slack and reduce the LLP excitation as discussed above, but this comes at a quality loss. However, such a loss can be controlled by appropriately selecting the number of truncated LSBs, ensuring that it is not as catastrophic as the loss incurred by random timing failures when these affect the most significant bits (MSBs). For instance, let us consider the addition of two floating point operands A and B, which results in an output C. These operands follow the IEEE-754 [22] double precision format, in which the first bit from the left represents the sign, the next 11 bits represent the exponent, and the remaining 52 bits represent the mantissa. As illustrated in Table I, a random error in the exponent part, e.g., in the 10th bit of the output C (highlighted in red), induced by a timing failure leads to a completely different number than the one expected, resulting in a high Relative Error of ∼ 0.9375 (Relative Error is defined in Eq. (2) in Section V-C2). Conversely, when 32 LSBs are truncated in the mantissa part of each operand, the resulting output value is very close to the reference value, with a very low RE of ∼ 3.6556 · 10^−7. Such a low RE is attributed to the fact that we truncate LSBs in the mantissa part that are not critical for determining the output value in FP operations. On the other hand, the exponent plays a significant role in determining the range of the output, and any error in that part, either due to a random bit flip or due to truncation, may lead to catastrophic results [17].
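The contrast between an exponent bit flip and mantissa truncation can be checked with a few lines of Python. This is a minimal sketch under stated assumptions: the relative error is taken as |reference − approximate| / |reference| (the paper's own definition is Eq. (2), which is not reproduced here), the operands are arbitrary values and not those of Table I, and all function names are hypothetical.

```python
import struct

MASK64 = 0xFFFFFFFFFFFFFFFF


def to_bits(x: float) -> int:
    """Reinterpret an IEEE-754 double as a 64-bit unsigned integer."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]


def from_bits(bits: int) -> float:
    """Reinterpret a 64-bit unsigned integer as an IEEE-754 double."""
    return struct.unpack(">d", struct.pack(">Q", bits & MASK64))[0]


def truncate_mantissa(x: float, n_lsbs: int) -> float:
    """Set the n_lsbs least significant mantissa bits of x to zero."""
    if not 0 <= n_lsbs <= 52:
        raise ValueError("n_lsbs must be between 0 and 52")
    return from_bits(to_bits(x) & ~((1 << n_lsbs) - 1))


def flip_bit(x: float, bit_from_left: int) -> float:
    """Flip one bit of x, counted from the left (1 = sign bit, 2..12 = exponent)."""
    return from_bits(to_bits(x) ^ (1 << (64 - bit_from_left)))


def relative_error(reference: float, approx: float) -> float:
    """Assumed definition: |reference - approx| / |reference|."""
    return abs(reference - approx) / abs(reference)


if __name__ == "__main__":
    a, b = 0.7853981633974483, 0.6931471805599453  # arbitrary operands
    c_ref = a + b
    c_trunc = truncate_mantissa(a, 32) + truncate_mantissa(b, 32)
    c_flip = flip_bit(c_ref, 10)  # 10th bit from the left lies in the exponent
    print("RE, 32 LSBs truncated :", relative_error(c_ref, c_trunc))
    print("RE, exponent bit flip :", relative_error(c_ref, c_flip))
```

The 10th bit from the left carries a weight of 4 in the exponent field, so clearing it scales the value by 2^−4, which is consistent with the relative error of about 0.9375 quoted above.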
C. Exploiting the Dynamic Path Excitation
The fundamental difference between our approach and approaches that rely on STA is that we exploit the dynamic excitation of paths by operands. By making the excitation of the LLPs by Op LLP rare by design, we minimize the need for adopting any conservative timing margin. In the case of Op SLP there is enough positive timing slack to avoid failures under any potential worst-case path delay increase up to a magnitude of:
$$T_{maxvar} = CP_{STA} - T_{SLP}$$
To estimate the efficacy of our approach and evaluate the dynamic data dependent excitation of paths, we develop a tool to perform DTA. Such a tool allows us to explore the unused timing margins of the processor that are available at runtime. These cannot be accurately characterized by STA due to the missing notion of path activation probabilities. Additionally, by applying this analysis phase, we estimate how often the LLPs are excited and the quality degradation incurred by operand truncation. Finally, such a DTA tool helps us extract instruction aware timing failures, which also depend on the dynamic excitation of critical paths by operands. The total number of timing failures and BERs can be extremely useful for more accurate error injection during the evaluation of application resiliency [27] .
III. DESIGN FLOW
The steps of the proposed approach are implemented using state-of-the-art electronic design automation (EDA) tools. The workflow of this approach is shown in Figure 3 , where our modifications are highlighted in orange. The flow consists of a design phase and an analysis phase.
A. Design Phase
To reduce |P LLP |, we impose constraints on different path-groups (in the Synopsys Design Constraints or SDC file) based on the minimum delay required for the completion of each path-group. By introducing multiple path-group constraints, we force the synthesis tool to avoid optimizations that make naturally fast paths slower for saving area and power. Even though these constraints may affect area and power consumption, the total overheads, which depend on the targeted path distribution, can be kept small. Initially, we apply strict constraints to shift the "timing wall" away from the target CP STA . If there are timing violations after synthesis, we relax the design constraints and re-run our iterative method until the timing target is met. The applied design constraints, which are implemented in a fully automated way, not only reduce the number of timing critical paths, but also isolate them to as few pipeline stages as possible. However, the design can be further improved by introducing custom changes in the micro-architecture/RTL description, which helps the automated process of the path redistribution technique. After performing these changes, we managed to restrict P LLP to one stage, while ensuring that the target clock frequency of the design is still met. The amount and type of these modifications depend on the desired path distribution and the original design. Nevertheless, these modifications may not be needed at all if the design facilitates a desired path distribution after imposing the group path constraints. In this work, we apply strict timing constraints, under iso-frequency, and the preferred path shaping can be achieved only after performing micro-architectural/RTL changes. Note that the same path distribution can be obtained without applying such micro-architectural changes, but this incurs considerable timing penalties. The synthesis step is followed by place and route using the Innovus tool from Cadence. Sign-off STA follows with Synopsys PrimeTime to verify that the design has achieved timing closure.
B. Analysis Phase
1) DTA: To enable characterization of the data-dependent path activation, we use the post-place gate-level simulation supported by ModelSim, which monitors the inputs and the outputs of all flip-flops in the design and generates a corresponding event log. To obtain this information and perform a fully back-annotated simulation, ModelSim requires, apart from the RTL netlist and testbench, a standard delay format (SDF) file which describes the cell and interconnect delays. The RTL netlist and the SDF file are obtained at the place and route step.
2) Profiling Tool: To feed ModelSim with real FP operands, we extended a sampling-based profiling tool [28] to collect statistics about the operands of FP instructions of different applications running on real hardware. The tool consists of an online module, which runs in parallel with the profiled application, and an offline module used to process the profiling results after the application has finished. The online module interrupts the execution of the running program with a defined period, retrieves the current instruction pointer and collects the values held in registers. The offline module disassembles the profiled program to identify the types of instructions executed by the application. It uses this information to assign the sampled registers to specific instructions using the collected instruction pointers. At the final stage of profiling, we filter all sampled instructions to find all FP instructions and the values held in the registers used by these instructions. To implement our tool, we use ptrace interfaces which are supported by the latest Linux kernel (since Linux 2.6.34).
Provided that every set of the extracted operands under nominal conditions produces an error-free output, we define this error-free gate-level simulation output as D gold . To measure the number of manifested timing failures under any potential delay increase, we execute the simulation and compare D gold to the event log obtained from ModelSim. Finally, this tool extracts a value change dump (VCD) file that contains information about the switching activity and value changes that occurred during the simulation for nets and registers of the design. This file is essential for performing dynamic power analysis.
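The comparison against D gold can be pictured as follows. The sketch assumes that the golden and the observed outputs have already been extracted from the simulation logs as lists of integers, one word per operation; this representation and the function names are assumptions made for illustration and do not describe the actual tool. A timing failure is counted whenever an observed output differs from the golden one, and the per-bit error rate is the fraction of operations in which that bit differs.

```python
from typing import List


def count_failures(golden: List[int], observed: List[int]) -> int:
    """Number of operations whose observed output differs from the golden one."""
    if len(golden) != len(observed):
        raise ValueError("golden and observed traces must have the same length")
    return sum(g != o for g, o in zip(golden, observed))


def bit_error_rate(golden: List[int], observed: List[int],
                   width: int = 64) -> List[float]:
    """Per-bit error rate; index 0 is the least significant output bit."""
    if len(golden) != len(observed) or not golden:
        raise ValueError("traces must be non-empty and of equal length")
    errors = [0] * width
    for g, o in zip(golden, observed):
        diff = g ^ o  # bits that differ between golden and observed output
        for bit in range(width):
            errors[bit] += (diff >> bit) & 1
    return [e / len(golden) for e in errors]


if __name__ == "__main__":
    # Tiny hypothetical 4-bit traces, used only to exercise the functions.
    d_gold = [0b1010, 0b1111, 0b0000, 0b0101]
    d_sim = [0b1010, 0b0111, 0b0001, 0b0101]
    print("failures:", count_failures(d_gold, d_sim))
    print("BER per bit (LSB first):", bit_error_rate(d_gold, d_sim, width=4))
```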
3) Power Analysis: We estimate the power consumed by all benchmarks and designs using the Voltus tool from Cadence. To perform dynamic power analysis, we use the following input data in Voltus: the post placed and routed netlist, the VCD file, a design exchange format (DEF) file that represents the physical layout and a standard parasitic exchange format (SPEF) file which corresponds to the parasitic data of wires in a chip. Sign-off quality power analysis obtains VCD files from ModelSim, while the other inputs are produced by the already placed and routed design.
Fig. 3. Workflow of the proposed approach. Our modifications are emphasized in orange.
IV. CASE STUDY: APPLICATION TO FPU
We apply the proposed approach to a multi-cycle, IEEE-754 compatible double precision FPU. According to the IEEE-754 Standard, an FP number follows the representation (−1)^S × M × 2^E, where S is the sign, E the exponent and M the mantissa. In a double precision FP number the MSB indicates the sign, the next 11 bits represent the exponent and the mantissa consists of the 52 LSBs. This FPU is a part of the latest Out-of-Order mor1kx MAROCCHINO pipeline, a 5-stage pipeline microprocessor based on the OpenRISC 1000 Instruction Set Architecture [29]. In this paper, we implement the following FP instructions: addition/subtraction, integer-to-FP and FP-to-integer conversions, and comparison of FP operands. Figure 4a illustrates the micro-architecture of the targeted FPU, highlighting the FP addition/subtraction. At Stage 1, an Order Control Buffer and a Pre-Normalize block are implemented, which permit the detection of data dependencies and the adjustment of the exponent and mantissa, respectively. Stage 2 is responsible for the pre-addition/subtraction alignment, while Stage 3 performs the necessary multiplexing and shifting of the operands. Mantissa addition and exponent update are performed at Stage 4; rounding occurs in the last two stages.
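For readers less familiar with the format, the field layout described above can be inspected with a short Python sketch. It relies on two standard IEEE-754 double precision facts that the text does not spell out: the stored exponent is biased by 1023, and normalized numbers carry an implicit leading 1 in front of the 52 stored mantissa bits. The function names are hypothetical, and `rebuild` handles normalized numbers only.

```python
import struct
from typing import Tuple


def decode_double(x: float) -> Tuple[int, int, int]:
    """Split a double into (sign bit, 11-bit biased exponent, 52-bit mantissa)."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF
    mantissa = bits & ((1 << 52) - 1)
    return sign, exponent, mantissa


def rebuild(sign: int, exponent: int, mantissa: int) -> float:
    """Evaluate (-1)^S x M x 2^E for a normalized number (0 < exponent < 2047)."""
    significand = 1 + mantissa / 2 ** 52      # implicit leading 1
    return (-1) ** sign * significand * 2.0 ** (exponent - 1023)


if __name__ == "__main__":
    s, e, m = decode_double(-6.5)
    print(s, e, hex(m))
    print(rebuild(s, e, m))
```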
A. Redesigned FPU
We start by applying the typical EDA flow (see Figure 3) to the conventional unmodified design. After following the synthesis and place and route steps, as well as performing STA using PrimeTime, we built the path distribution that is shown in Figure 4b . The obtained distribution implies that the conventional performance-centric flow results in a large |P LLP |, in which many paths are close to the worst case delay, or, in other words, to the clock delay. Such a path distribution creates the "timing wall". Figure 5a depicts the path distribution within each pipeline stage, revealing that the "timing wall" exists in 4 out of the 6 stages. These findings indicate that there is an increased probability of timing failures in the stages where LLPs exist. To circumvent this, we apply the steps of our design flow with the following modifications.
1) Micro-Architectural Changes:
The way the RTL is written has a direct impact on the physical layout and thus on the path distribution. To change the distribution of paths in the way discussed in Section II, we apply an extra optimization step that focuses on modifying the micro-architecture/RTL description of some pipeline stages. In particular, after performing STA, we noticed that Stage 4 contains the most timing critical paths. To reduce the number of LLPs in this stage, we moved parts of the combinational logic to the previous stage, exploiting the slack margins observed at Stage 3. Additionally, rounding, which occurs at Stages 5 and 6, poses a bottleneck because it is applied to the result of all the FP operations, and thus many paths in these stages have a long latency. To this end, we changed the RTL code implementing the logic in Stages 5 and 6, making the synthesis more efficient. Specifically, we optimized some unnecessarily large conversions to negative numbers.
2) Group Path Constraints: As discussed in Section II-A1, during the synthesis step, we apply various constraints by grouping paths into two different sets, P SLP and P LLP , based on their computational delays. We define the delay target of the SLPs as T SLP . If the path delay is less than T SLP , then the path is assigned to P SLP ; otherwise it is assigned to P LLP . Initially, we synthesize the design for a small T SLP in order to move paths away from CP STA . If the timing target is not met after synthesis, we increase the value of T SLP until the design achieves timing closure. After many iterations, we set T SLP for this particular FPU to 1.68ns. These iterations are implemented using tool command language (tcl) scripts. As a result, this automated procedure reduces |P LLP | (see Figure 4b), while ensuring that all other paths are fast enough (at least 9.1% faster than the CP STA ) to tolerate variation-induced timing failures. Note that the timing constraints imposed by our design do not exceed the CP STA , which is determined by the original unmodified design. We also constrain the LLPs to as few stages and instructions as possible. It is also worth mentioning that the remaining LLPs are isolated in such a way that they can be triggered only by FP addition/subtraction instructions at Stage 4, which is the main goal of our design strategy.
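The iterative relaxation of T SLP can be summarized by the loop below. This is only a sketch of the control flow: the actual iterations are driven by tcl scripts around the synthesis tool, whereas here the synthesis and STA run is replaced by a hypothetical callback `meets_timing` that simply reports whether timing closes for a given T SLP. The starting value, the step size and the toy callback are assumptions made for illustration; times are in picoseconds.

```python
from typing import Callable, Optional


def find_t_slp(cp_sta_ps: int, start_ps: int, step_ps: int,
               meets_timing: Callable[[int], bool]) -> Optional[int]:
    """Smallest tried T_SLP (start, start+step, ...) for which timing closes.

    meets_timing stands in for one synthesis + STA run with the SLP
    path group constrained to the given delay target.
    """
    t_slp = start_ps
    while t_slp <= cp_sta_ps:
        if meets_timing(t_slp):
            return t_slp
        t_slp += step_ps  # relax the SLP constraint and try again
    return None           # no feasible target at or below the clock period


def margin_percent(cp_sta_ps: int, t_slp_ps: int) -> float:
    """How much faster than the clock period the SLP group is, in percent."""
    return 100.0 * (cp_sta_ps - t_slp_ps) / cp_sta_ps


if __name__ == "__main__":
    # Toy stand-in for the tool flow: pretend timing closes from 1680 ps upwards.
    toy_flow = lambda t_slp_ps: t_slp_ps >= 1680
    t_slp = find_t_slp(cp_sta_ps=1850, start_ps=1500, step_ps=10,
                       meets_timing=toy_flow)
    print("T_SLP [ps]:", t_slp)
    print("margin [%]:", margin_percent(1850, t_slp))
```

With a 1.85ns clock period and T SLP = 1.68ns, the margin evaluates to roughly 9.19%, in line with the "at least 9.1% faster" figure stated above.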
B. Application of Operand Truncation
After redesigning the FPU, we apply bitwidth truncation to the input operands of the specific instructions that activate the few remaining LLPs in order to reduce their excitation probability and thus timing failures. Since all the LLPs are restricted to Stage 4 of the FP addition/subtraction instructions, we deploy the truncation only to the LSBs of the input operands of these instructions. Given that Stage 4 implements the 52-bit mantissa addition, we set constant "0" values to the LSBs to avoid delay failures in the MSBs.
As discussed in Section II, truncation may lead to quality loss, so the number of truncated bits needs to be carefully selected. In our experiments, we evaluated the truncation of the 32, 44 and 48 LSBs of the mantissa part of the FP addition/subtraction operands. This means that the sign and the exponent part of the IEEE-compliant operands (12 bits in total) were unaffected, along with the 20, 8 and 4 MSBs of the mantissa part in the considered scenarios.
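The number of retained mantissa bits translates into a simple bound on the error that truncation introduces in a single operand. The derivation below is a sketch that assumes a normalized, nonzero double precision operand with stored mantissa m and unbiased exponent e; it bounds the per-operand error only, and not the error of the final result, which can be amplified by cancellation in a subtraction.

```latex
x = (-1)^{S}\left(1 + \frac{m}{2^{52}}\right)2^{e}, \qquad 0 \le m < 2^{52}

\tilde{x} = (-1)^{S}\left(1 + \frac{m - (m \bmod 2^{n})}{2^{52}}\right)2^{e}
\quad \text{(the } n \text{ LSBs of } m \text{ set to zero)}

\frac{|x - \tilde{x}|}{|x|}
  = \frac{m \bmod 2^{n}}{2^{52} + m}
  < \frac{2^{n}}{2^{52}} = 2^{-(52-n)}

n = 32:\; 2^{-20} \approx 9.5 \cdot 10^{-7}, \qquad
n = 44:\; 2^{-8} \approx 3.9 \cdot 10^{-3}, \qquad
n = 48:\; 2^{-4} = 6.25 \cdot 10^{-2}
```

The bound grows by a factor of two for every additional truncated bit, which is why the choice between 32, 44 and 48 truncated LSBs is a trade-off between failure reduction and output quality.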
V. EVALUATION RESULTS
In this section, we evaluate the efficacy of our approach in limiting the timing failures under various CR levels, which are shown in Table II. We reduce the clock period from 1.85ns to 1.65ns in steps of 0.05ns, representing potential degrees of worst-case path delay increase that may be caused by variations [30]. We compare our redesigned FPU (see Section III-A) with the original (reference) design. For a fair comparison, we applied the truncation of 32, 44 and 48 LSBs of specific FP operands both to the original (Orig), unmodified FPU and to the proposed (Prop) one. The considered FPUs are implemented using the design flow described in Section II-A in the NanGate 45 nm Composite Current Source Cell Library [31].
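The clock reduction sweep can be tabulated with a few lines of Python. The sketch assumes that the levels are labeled CR0 (nominal, 1.85ns) to CR4 (1.65ns) in 0.05ns steps, which is consistent with the statements that no failures occur at CR0 and that CR3 corresponds to an 8.1% delay increase; Table II itself is not reproduced here. The function name is hypothetical and periods are in picoseconds.

```python
from typing import List, Tuple


def cr_levels(nominal_ps: int = 1850, lowest_ps: int = 1650,
              step_ps: int = 50) -> List[Tuple[str, int, float]]:
    """Return (label, clock period in ps, reduction in % of nominal) per level."""
    levels = []
    period = nominal_ps
    index = 0
    while period >= lowest_ps:
        reduction = 100.0 * (nominal_ps - period) / nominal_ps
        levels.append((f"CR{index}", period, reduction))
        period -= step_ps
        index += 1
    return levels


if __name__ == "__main__":
    for label, period_ps, reduction in cr_levels():
        print(f"{label}: {period_ps / 1000:.2f} ns ({reduction:.1f}% reduction)")
```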
A. Application Profiling
To estimate the efficacy of our approach using real FP operands (Section III-B2), we developed a profiling tool to extract program traces from various compute-intensive applications. In our analysis, we use the K-means, CFD and Heartwall benchmarks from the Rodinia benchmark suite and the Raytrace benchmark from the PARSEC suite. This set of benchmarks represents a variety of algorithms that have many FP operations and covers a wide range of domains, i.e., Data Mining, Fluid Dynamics, Medical Imaging and Computer Graphics. To obtain the program traces, we profile all benchmarks on an ARM A7 based system, the Odroid-XU3. Figure 6 depicts the percentage of time spent on the execution of different types of instructions, averaged over all profiled benchmarks. We observe that the benchmarks spend 31.2% of the total time on the execution of FP instructions on average, which indicates their importance. Using this tool, we extract 10000 operands from the most frequently executed FP instructions of each application, which we feed to the DTA tool to estimate the BER and the consumed power.
B. Characterization of Timing Failures
Timing failures are a function of input operands, clock period and the number of truncated bits. Figure 7 demonstrates how the number of timing failures changes under different bitwidth truncation levels when we scale down the clock period in the original and the proposed designs across the 4 benchmarks. Note that the failures are reported only for the cases when the simulated output for specific operands differs from the recorded error-free output D gold . As shown in Figure 7 , under the nominal clock period (CR0) no failures are manifested. Moreover, we observe that the timing failures are substantially lower for any degree of bitwidth truncation across all applications. In particular, by setting to zero the 32, 44 and 48 LSBs of the mantissa in the original FPU, the total number of failures across the 4 benchmarks is reduced by 1.69×, 2.83× and 12.13× on average.
In the same figure, we observe that the number of timing failures incurred in the original design is significantly higher than the number of failures incurred in the proposed FPU under CR1, CR2 and CR3. Beyond the CR3 level, which corresponds to T_maxvar (Section II-C), we notice an increase in timing failures, especially in the proposed design. This can be attributed to the fact that the applied path shaping has shifted the paths, as shown in Figure 4b. This results in a high number of paths being activated and failing under CR4. In particular, Figure 4b shows that there are many paths with delay more than 1.65 ns in the original and proposed designs; this outcome implies that the probability of timing violations in case of activation of these paths under CR4 is expected to be very high. Note that this is a choice for the specific implementation and can be altered at design time.
Nonetheless, the proposed design results in a significantly reduced number of timing failures for most of the CR levels. Specifically, our approach under CR1 allows us to eliminate all the incurred errors, while Table III shows that the original FPU exhibits from 86 up to 107505 failures for the same CR. In addition, under CR3, which corresponds to an 8.1% worst-case path delay increase, the proposed design with the preferred path shaping reduces the number of timing failures by ∼4× on average compared to the original design. The combination of path shaping and operand bitwidth truncation in the 32, 44 and 48 LSBs reduces these failures by ∼13×, ∼79× and ∼769× on average, respectively, when compared to the original design.
C. Evaluation of BER and Quality
1) BER: Figure 8 depicts the average BER across the 4 benchmarks at CR3, where several interesting observations can be made. Note that in this figure the letters S, E and M on the X axis correspond to the sign bit, the exponent bits and the mantissa bits, respectively. To begin with, we observe that distinct bit positions incur different BERs. This happens because different input operands activate different paths contributing to the calculation of different output bits. We also observe that the original design with the full range precision (0 bits truncation) exhibits high BERs, ranging from 1.15% up to 66.9% in the mantissa and up to 37.6% in the exponent part. To reduce BERs, we set different LSBs of the mantissa to "0", which is expected to reduce the number of LLP excitations and consequently the overall BERs. However, upon further investigation we observe that in the original FPU, the BER of the MSBs of the mantissa still remains considerably high under different truncation levels (∼30%, ∼20% and ∼15% on average after truncating 32, 44 and 48 bits, respectively), while the BER of the exponent bits ranges from 0% up to 20%.
In contrast to the original FPU, the proposed FPU accompanied with path shaping and operand truncation results in significantly lower BERs across the vast majority of the bits and especially in the exponent (0% to 0.3%) and the MSBs of the mantissa (0% to 9%). The fact that the failing bits are neither in the exponent bits nor in the MSBs of the mantissa helps limit the incurred quality loss.
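The per-bit BER reported for Figure 8 can be computed by comparing each simulated 64-bit output with its error-free counterpart. The sketch below is one possible formulation, assuming the outputs are available as raw 64-bit integers; the function names `per_bit_ber`, `count_failures` and `field_of` are hypothetical and do not come from the DTA tool.

```python
def per_bit_ber(gold, sim, width=64):
    """Fraction of outputs in which each bit position differs from gold.

    gold, sim: equal-length sequences of integers holding raw output
    bit patterns (assumed layout: IEEE-754 double precision).
    Returns a list indexed by bit position (0 = least significant bit).
    """
    if len(gold) != len(sim) or len(gold) == 0:
        raise ValueError("gold and sim must be non-empty and equal in length")
    counts = [0] * width
    for g, s in zip(gold, sim):
        diff = g ^ s  # a set bit marks a mismatching output bit
        for b in range(width):
            counts[b] += (diff >> b) & 1
    return [c / len(gold) for c in counts]


def count_failures(gold, sim):
    """Number of outputs that differ from the error-free value in any bit."""
    return sum(1 for g, s in zip(gold, sim) if g != s)


def field_of(bit):
    """Map a bit index to the S / E / M labels of a double."""
    if bit == 63:
        return "S"
    return "E" if bit >= 52 else "M"
```

Grouping the returned list with `field_of` gives the sign, exponent and mantissa regions separately, which is the granularity at which the BER is discussed here.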
2) Quality: To evaluate the quality loss incurred by the random timing failures and the operand truncation, we estimate the average Relative Error (RE) achieved by our design and compare it with the RE of the original design for all considered applications. The average Relative Error, a common metric for estimating the output quality [17], [32], is defined as:
$$RE = \frac{1}{K} \sum_{i=1}^{K} \frac{\left| D_{gold}(i) - O_{sim}(i) \right|}{\left| D_{gold}(i) \right|}$$
where D_gold(i) denotes the exact error-free output value obtained from the reference FPU design and O_sim(i) represents the output obtained by simulation with our DTA tool for the i-th FP instruction and its associated operands. The O_sim(i) value is extracted from the output register of the considered FPU after simulating both designs (original and proposed) under a specific CR level and bitwidth truncation range. For these experiments we extracted 10000 FP instructions for each benchmark and thus i varies from 1 up to K = 10000. The effect of different CR and operand bitwidth truncation levels on the RE before and after applying our approach is depicted in Figure 9. The original design without any truncation under the nominal clock period (CR0) introduces no quality degradation since no failures have been manifested (see Figure 7). In the case of CFD, we observe that the original FPU reaches unacceptable RE levels (> 1) even under a small worst-case delay increase (i.e., CR1). Truncation helps to reduce the RE under CR1, CR2 and CR3, but not for every CR and truncation level. On the other hand, the proposed FPU under the same CR levels renders the quality loss controllable and deterministic, since it depends only on the number of LSBs that are truncated. In the case of Raytrace, we obtain similar results, as the proposed design under CR3 achieves an RE of up to 0.003, while the RE incurred by the original design may reach unacceptable values. In the case of Kmeans and Heartwall, we observe that the original FPU under CR3 and all the considered truncation levels incurs a significant quality degradation. Conversely, the proposed FPU under CR3 exhibits RE ranging from ∼2 · 10^−9 up to ∼0.04. Overall, the proposed FPU minimizes the catastrophic quality loss incurred by the original double precision FPU.
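As an illustration of the metric, the sketch below computes an average relative error over K outputs, assuming the usual form in which each term is |D_gold(i) - O_sim(i)| / |D_gold(i)|. The function name `average_relative_error` and the choice to reject zero-valued golden outputs are assumptions of this sketch, not details of the evaluation flow.

```python
def average_relative_error(d_gold, o_sim):
    """Mean of |D_gold(i) - O_sim(i)| / |D_gold(i)| over all K outputs.

    d_gold: error-free output values; o_sim: simulated output values.
    Assumption: golden values are nonzero, since the relative error is
    undefined otherwise.
    """
    if len(d_gold) != len(o_sim) or len(d_gold) == 0:
        raise ValueError("inputs must be non-empty and equal in length")
    total = 0.0
    for g, s in zip(d_gold, o_sim):
        if g == 0:
            raise ValueError("relative error is undefined for a zero golden value")
        total += abs(g - s) / abs(g)
    return total / len(d_gold)
```

With this form, an RE above 1 means that the outputs deviate on average by more than their own magnitude, which is consistent with treating such levels as unacceptable.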
When compared to the original FPU with various truncation levels enabled, the combination of path shaping and operand truncation provides equal or much lower REs for the first four CR levels across different benchmarks. Beyond the CR3 level, we notice a significant quality degradation in the original and the proposed designs. This can be explained by a high likelihood of massive timing failures under CR4, as discussed in Section V-B. Finally, we observe that setting the last 32 bits of mantissa to "0" provides a judicious choice for operand truncation, considering all the evaluated applications.
D. Estimation of Power and Area
Table IV reports the power consumed by the placed and routed original and proposed designs, measured with the EDA tools as explained in Section II-A2. Our approach introduces a 5.7% power overhead when path shaping is applied. However, the proposed FPU leads to up to 44.7% power savings when path shaping and 48 LSBs truncation are combined, due to a significantly reduced switching activity. Overall, the area overhead incurred by path shaping is 0.25%, while our approach 
E. Discussion on Potential Power Savings
We propose a variation-aware approach that leverages the positive timing slacks of the SLPs and exploits the rare activation of the LLPs through significance-driven bitwidth truncation. Conversely, such properties could also be utilized to allow operation at a reduced voltage when the manufactured chip is unaffected by variations. In this case, our design can support operation at 0.95V, leading to an extra ∼28% power gain on average compared to operation at the nominal supply voltage of 1.1V. This means that the proposed FPU operating at 0.95V enables us to save 72.7% of power on average in the case of 48 LSBs truncation compared to the original FPU operating at 1.1V. In other words, our approach can be used not only for mitigating timing failures at a low cost, but also for enabling operation at a reduced voltage.
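As a rough plausibility check on the voltage-scaling figure, consider the first-order dynamic power model $P_{dyn} \propto C V^2 f$ with unchanged clock frequency and switched capacitance. This model is an assumption of the sketch and is not stated in the evaluation. Under it, the ratio of dynamic power at the two supply voltages is:

$$\frac{P_{dyn}(0.95\,\mathrm{V})}{P_{dyn}(1.1\,\mathrm{V})} \approx \left(\frac{0.95}{1.1}\right)^{2} = \frac{0.9025}{1.21} \approx 0.746$$

This corresponds to roughly a 25% reduction in dynamic power, which is of the same order as the reported ∼28% average gain. The remaining difference may reflect leakage and other effects outside this simple model.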
We now compare our design with the guardband-based approach. In particular, according to the conventional paradigm, the original FPU adopts enough timing guardbands by scaling up the voltage to avoid failures and obtain an error-free output. Using the available fast corner cell library (@1.25V) and the extracted VCD files, which allow us to consider the switching activity and dynamic path activation, we estimate the power consumption across all applications. Operation at such a voltage provides the necessary margins to avoid all the dynamic timing failures. However, as shown in Table V, it comes at a cost of up to 43.1% power overhead when compared to the proposed design (@1.1V) without enabling the operand truncation. If we combine the path shaping technique with the 48 LSBs operand truncation, which also leads to an error-free output (see Figure 8), our approach can lead to 84.21% power savings on average.
VI. RELATED WORK
Our work aims at preventing timing failures by truncating the bitwidth of operands, while avoiding the use of conservative guardband-based schemes. The majority of the works in the area of approximate computing exploit the inherent error resilience of applications to tolerate failures [33] and/or limit the overheads incurred by guardbands in at least some parts of the application [34]. However, such approaches have been evaluated on simulators due to the lack of suitable approximate hardware. Some of the existing approaches exploit precision scaling to reduce the power consumption, as in [35]. Reference [36] accommodates voltage scaling in arithmetic units for saving energy by disabling some of the input bits. Another scheme, proposed in [37], prunes the bitwidth of all the input operands in simple data-paths, enabling power savings due to the reduced switching activity. Although very interesting, such works overlook the impact of precision scaling on the reduction of the path delay and do not consider 'when and where' it needs to be applied in pipelined datapaths, as we do.
Some works attempted to exploit the unequal contribution of algorithmic computations to output quality. In particular, several works proposed to redesign the circuit by giving priority to the execution of the most critical parts of each datapath [16] for avoiding failures. These works may have exploited the varying importance of computations, but they did so only for a few application-specific DSP architectures rather than for general purpose FPUs that may run any application.
Some recent works used approximate computing to make circuits resilient to variation-induced delay failures. An approach in [38] uses precision scaling to limit timing failures in the context of transistor aging and thus mitigate the estimated delay increase over a few years. A post-silicon technique in [18] truncates the bitwidth of all the input operands to prevent timing failures in typical DSP hardware modules.
Overall, the majority of the proposed solutions are applied to simple data-paths or arithmetic units rather than complex pipelined designs. Moreover, existing schemes reduce the computation precision neglecting the fact that there are few operands and instructions that may activate the error-prone LLPs. The novelty of the proposed significance-driven operand truncation technique lies in applying bitwidth truncation only to the error-prone operands of specific instructions that have been isolated by design.
VII. CONCLUSION
This paper presents a framework for minimizing the timing failures in pipelined designs by i) redesigning the target circuit in a way that eliminates the excitation of the LLPs, and ii) opportunistically exploiting the dynamic activation of such paths by a few operands Op_LLP and making such activations rare by setting a fixed number of LSBs in the relevant operands to a constant value of "0". The evaluation of the proposed redesigned placed and routed FPU with the developed DTA tool using extracted program traces shows a significant reduction of timing failures under potential delay variations with a negligible 0.25% area overhead and no performance loss. An essential attribute that led to low overheads in our design is the applied path shaping technique. Without this scheme, the truncation would have to be applied statically to every operation and operand. Our results also show that the proposed approach effectively reduces the BER in all the exponent and mantissa bits of the redesigned FPU, as opposed to the reference design. Truncation of 32 or 44 LSBs of Op_LLP helps to maintain low RE levels in all the evaluated applications up to an assumed 8.1% variation-induced worst-case delay increase. Finally, we observe that path shaping may introduce a 5.7% power overhead, but when combined with operand bitwidth truncation it can lead to up to 44.7% power savings. Even though we demonstrated the efficacy of the proposed approach by applying it to a specific FPU, the presented steps can be applied to redesign the stages of any other pipelined core.
