Abstract-With ever-increasing on-chip current density, technology scaling is pushing the electromigration (EM) robustness of silicon chips' controlled collapse chip connection (C4) bump arrays to its limit. Since the density of C4 bumps is projected to remain constant in the future, it is becoming increasingly challenging to guarantee EM-failure-free operation for all power-supply bumps without increasing chip packaging cost or encroaching on bump sites needed for I/O. In this paper, we develop a statistical simulation framework to analyze the mechanism and consequences of the wearout of multiple power bumps. Our analysis shows that the penalty of a moderate number of EM-induced power-bump failures is fairly small. A mild increase in the on-chip supply voltage noise guardband can tolerate these bump failures and significantly increase the system mean-time-to-failure (MTTF). As a result, the targeted system MTTF can be achieved with a significantly reduced power-bump count (e.g., 43% less) and a small extra noise margin (e.g., 0.5% Vdd IR drop).
I. INTRODUCTION
THE ROAD toward building reliable silicon chips with higher performance and lower energy consumption is full of challenges from various physical constraints. Electromigration (EM) has long been recognized as an important lifetime reliability issue. Among the various circuit components of a silicon chip, controlled collapse chip connection (C4) bumps are particularly vulnerable to EM damage because the materials they are made of (SnPb or SnAgCu) tolerate orders of magnitude lower current density than the material used in on-chip wires (e.g., copper) [1]. Moreover, C4 bumps suffer from Joule heating and current crowding [2], both of which accelerate EM damage.
Due to the sheer volume of power-supply bumps that today's high-performance microprocessors usually have (e.g., a few thousand [3]), the whole system's expected EM-failure-free time is much shorter than a single bump's mean-time-to-failure (MTTF), because any bump could fail even after a relatively short period of time [4]. As a result, guaranteeing EM-failure-free operation during the targeted system lifetime requires allocating a large number of C4 bumps to the power delivery network (PDN) to lower per-bump current and improve the single-bump MTTF. As nonideal technology scaling brings exponentially increasing transistor density without being able to reduce the supply voltage accordingly, power and current density grow rapidly. Consequently, near-future silicon chips will require more power-supply C4 bumps to maintain EM lifetime [5].
Unfortunately, a higher power-bump count raises the cost of the chip package, with increased design and fabrication complexity. For example, the difficulty of silicon chip assembly, especially getting sufficient underfill material to flow smoothly through the C4 bump array, is directly related to the density of populated C4 bumps [6]. Furthermore, more power bumps create more footprints (i.e., contact areas for electrical connection) in the chip package, which increases the complexity of I/O routing. To make matters worse, power delivery and chip I/O contend for C4 bumps. This is because I/O links and power-supply channels must share the bump sites, with each site used exclusively by one of them, and the density of total available bump sites is projected to be constant in the near future [3]. Consequently, higher power-supply bump requirements will encroach on the availability of bumps for I/O and reduce the available bandwidth for off-chip communication, which eventually will degrade processor performance. It is, therefore, becoming critically important to improve whole-chip robustness against EM-induced bump wearout in order to reduce the requirement for power-supply bumps.
Once a power-supply bump fails, the PDN's effective impedance as observed by nearby circuits will increase, because the supply current now has to come from neighboring power bumps through the lateral on-chip PDN wires. Fortunately, the on-chip wires are designed to have very low impedance; therefore, the voltage noise increase caused by a moderate number of bump failures is usually very small [5]. This observation suggests that allocating a small extra on-chip voltage noise margin (i.e., overvolting and/or underclocking) could help the system to tolerate bump failures, improve the whole chip's EM robustness, and reduce the power-bump requirement. However, as power bumps start to fail, the current they used to carry will be redistributed to their neighboring bumps and accelerate the EM wearout of those neighbors. This may create an avalanche effect where a cluster of bumps breaks shortly after the first bump failure happens. The effectiveness of tolerating multiple bump failures with extra voltage noise margin heavily depends on the severity of this avalanche effect. Our simulation results show that this failure-acceleration effect does exist and designers should carefully consider its impact.
In this paper, we design and implement a Monte Carlo simulation (MCS) framework to analyze the mechanism and consequences of multiple, EM-induced, random power-C4 bump failures. By integrating this framework with a pre-register-transfer-level (pre-RTL) PDN model, VoltSpot [5], we are able to capture the impact of current redistribution and study the whole system's MTTF under a given power-bump configuration and voltage noise margin. Our framework helps designers navigate the complex tradeoff involving C4 bump allocations, EM-induced MTTF, voltage noise margin, and multiple on-chip power domains. This paper's major contributions are as follows.
1) We develop an MCS framework for multiple EM-induced random bump failure simulation. It helps designers provision bumps under different design constraints.
2) We observe that the EM-induced bump-wearout power delivery quality degradation is a relatively slow process. Therefore, a small extra noise margin can tolerate the consequences of multiple bump failures and extend system lifetime.
3) By comparing results against a simplified model that ignores the impact of current redistribution, we prove the existence of a bump failures' avalanche effect and show that ignoring its impact will overestimate the system MTTF by up to 80% and underestimate the required noise margin for EM wearout by over 50%.
4) We evaluate the impact of having multiple on-chip voltage domains and observe that splitting the power grid into multiple islands will degrade the bump array's robustness against EM wearout. Under a fixed MTTF target and voltage noise limit, finer-grained power domains require up to 15% more power bumps.
5) We design and evaluate two run-time techniques that effectively extend system's lifetime beyond the designed MTTF with graceful performance degradation.
II. BACKGROUND AND RELATED WORK

A. Single-Bump EM Failure
EM refers to the phenomenon of gradual mass transport in metal conductors induced by momentum transfer from electrons to atoms. It creates an open or short circuit in the PDN and permanently degrades the quality of power delivery. The EM-induced failure mechanism for a single C4 bump has been extensively studied in the past. The cumulative distribution function (CDF) of a single-bump's failure probability over time can be well described by a lognormal distribution [7] 
where σ is the lognormal distribution's standard deviation (experimentally determined to be 0.5 [7] ). The MTTF is determined by Black's equation. For C4 bumps, the MTTF equation must be adjusted to consider the impact of current crowding and Joule heating [2] 
where J is the current density, n and Q are material-specific constants (for C4 solder bumps, n = 1.8 and
, A is an empirical constant, and T is the temperature in Kelvin. With (1) and (2), we can calculate the probability of failure for a single bump after any time t. We note that although EM wearout can also cause a short circuit in the on-chip metal stack, C4 bumps are less likely to suffer from such a phenomenon, since the spacing between C4 bumps is usually much larger than that between on-chip metal wires. Thus, in this paper, we assume that all EM-induced bump failures cause an open circuit. In addition, because I/O bumps carry bidirectional current, they are characterized by longer times to EM-induced failure [9]. Furthermore, I/O bump failures are independent of each other and do not affect power delivery quality. For these reasons, we only study the failure of power bumps in this paper.
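For reference, the relations referred to as (1) and (2) above commonly take the following textbook forms. This is a sketch under stated assumptions rather than a reproduction of the exact expressions: Φ (the standard normal CDF) and k (Boltzmann's constant) are notation introduced here, the MTTF is used as the lognormal scale parameter, and the current-crowding and Joule-heating adjustments of [2] are not shown.

```latex
% Lognormal failure-time CDF (assumed form of (1))
F(t) = \Phi\!\left(\frac{\ln t - \ln \mathrm{MTTF}}{\sigma}\right)

% Black's equation without the adjustments of [2] (assumed form of (2))
\mathrm{MTTF} = A \, J^{-n} \exp\!\left(\frac{Q}{kT}\right)
```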
B. Supply Voltage Noise and Design Margin
Power-supply C4 bumps belong to the PDN that delivers current from power sources to switching transistors. The design goal of the PDN is to be able to provide sufficient current at a stable voltage level and over a desired lifetime. Unfortunately, because of the PDN's impedance, the actual supply voltage at transistors' power rails is noisy (i.e., it drops or fluctuates over time). Since transistor delay is proportional to the amplitude of supply voltage noise (defined as the actual supply voltage's deviation from the nominal value), large voltage noise will result in timing errors [10] .
In order to avoid voltage-noise-induced errors, designers usually allocate a timing margin (as a guardband) in the circuit's critical paths by reducing clock frequency. For example, a processor running at a supply voltage of 1 V and a clock frequency of 1 GHz will need to reduce its frequency to 950 MHz (a 5% decrease) to avoid timing errors when the actual supply voltage drops by 50 mV to 0.95 V (5% Vdd noise). In other words, this margin protects against large voltage noise at the cost of reduced clock speed and/or increased energy consumption.
C. Consequences of Power-Supply Bump Failures
When a power-supply bump fails, the current it used to carry has to come from neighboring power bumps through the lateral on-chip PDN wires. This phenomenon has two consequences. First, the increased current density in those neighboring bumps [the two remaining ones in Fig. 1(b)] will exacerbate EM wearout, causing an avalanche effect where clustered bumps fail more quickly after the first failure within that cluster. Second, the effective PDN impedance seen by the transistors near the failed bump [e.g., the current source in the middle of Fig. 1(b)] will increase, which intensifies the supply voltage noise in those regions. The increased noise will result in frequent timing errors and, consequently, system failure. We note that the severity of the avalanche effect directly affects the rate of power delivery quality degradation (i.e., the increase in on-chip voltage noise over time) and the expected chip lifetime. To be more specific, if bump failures are scattered across the entire chip, the per-failure-induced increase in voltage noise will be smaller and, therefore, the chip will survive longer before on-chip voltage noise exceeds the design threshold. If all failed bumps are clustered together, on-chip voltage noise can rapidly exceed the designated margin after a few bumps fail.
D. Prior Work on Statistical EM Analysis
In a recent publication, Zhang et al. [5] found that tolerating multiple EM-induced bump failures with extra noise guardband could significantly extend system MTTF. Although the results in [5] indicate that the targeted system MTTF could be achieved with a reduced number of power-supply bumps, the analysis made several simplifying assumptions and, therefore, left many important questions unanswered. For example, because Zhang et al. [5] ignore the impact of bump-failure-induced current redistribution, the potential damage of the avalanche effect could not be captured at all. We show that without analyzing this wearout acceleration phenomenon, the results in [5] are too optimistic. In addition, Zhang et al. [5] assume that the bumps with the largest current always fail first, which neglects the stochastic nature of bump failures. Without considering the possibility that any bump may fail, and the consequences of such failures, the approach in [5] is incapable of estimating the system's expected lifetime. For these reasons, a statistical analysis approach with more accurate assumptions is required.
Li et al. [11] proposed a statistical model for multivia EM lifetime estimation. Fawaz [12] designed an MCS framework to estimate the on-chip power grid's expected lifetime under a given IR-drop target. Although the failure times of a single via or copper line follow the same probability distribution (i.e., lognormal) as a C4 bump, the prior work is not adequate for the study of multiple bump failures, because it either ignores failure-induced current redistribution [12] , or is only capable of modeling a small number of vulnerable elements (e.g., four vias in [11] ). To the best of our knowledge, a whole-system analysis of multiple C4 bump failures that considers the impact of the avalanche effect is missing from the literature.
III. STATISTICAL SIMULATION OF BUMP FAILURES
A. Monte Carlo Simulation and Results Confidence Level
In this paper, our primary interest is the system's robustness against EM wearout. This is usually measured with system MTTF, where the time-of-failure (TOF) is defined as when the maximum on-chip voltage noise exceeds a predetermined threshold. Because the failure of bumps is a sequence of probabilistic events, multiple bump failures are a stochastic process. Considering the total number of power-supply bumps, the permutation space of possible bump failure sequences is enormous even if we only allow a small portion of bumps to fail (we use permutation instead of combination because the order of bump failures matters due to the current redistribution effect). For example, for a silicon chip with 1500 total C4 bumps, there are more than 10^126 ways to fail 40 bumps. For this reason, an analytical approach would be infeasible for the study of multiple bump failures.
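The size of this permutation space can be checked directly. The snippet below is a small illustration using Python's standard `math.perm`; the function name `failure_sequences` is ours and is not part of the framework.

```python
import math

def failure_sequences(total_bumps: int, failed_bumps: int) -> int:
    """Number of ordered ways to fail `failed_bumps` out of `total_bumps`."""
    return math.perm(total_bumps, failed_bumps)

count = failure_sequences(1500, 40)
# The digit count gives the order of magnitude of the permutation space.
print(f"order of magnitude: 10^{len(str(count)) - 1}")
print("exceeds 10^126:", count > 10**126)
```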
Fortunately, since we are interested in the mean value of a distribution (system lifetime), MCS can be utilized to get reliable results within reasonable time. The idea of MCS is to take random samples from the population-of-interest and use the sample arithmetic mean to estimate the true mean. The quality of this estimation heavily depends on the number of samples taken. Johnson et al. [13] suggest that when sampling from an unknown distribution, the following equation can be used as the stopping criteria (for N ≥ 30):
where N is the minimum number of samples required, x̄_N is the sample mean, ε is the estimation's relative deviation from the population's true mean μ, z_{α/2} is the 100(1 − α/2)th percentile of a random variable Z that has the standard normal distribution, and s_N is the sample standard deviation. In this paper, we set ε = 0.005 and z_{α/2} = 2.32 (α = 0.02); therefore, our MTTF results have a confidence interval of ±0.5% around the true mean at a confidence level of 98%. In this paper, this confidence level can be achieved with N between 800 and 1500.
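A minimal sketch of how such a stopping rule can drive a sampling loop. It assumes that criterion (3) has the common form N ≥ (z_{α/2} · s_N / (ε · x̄_N))², which is consistent with the symbol definitions above but is our reading of it. The helper names and the lognormal test distribution are hypothetical.

```python
import math
import random

class RunningStats:
    """Welford running mean and sample standard deviation."""
    def __init__(self):
        self.n, self.mean, self._m2 = 0, 0.0, 0.0

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    def stdev(self):
        return math.sqrt(self._m2 / (self.n - 1)) if self.n > 1 else 0.0

def required_samples(mean, stdev, eps=0.005, z=2.32):
    """Assumed form of the criterion: N >= (z * s_N / (eps * mean))^2."""
    return math.ceil((z * stdev / (eps * mean)) ** 2)

def monte_carlo_mean(draw, eps=0.005, z=2.32, min_n=30, max_n=1_000_000):
    """Draw samples until the criterion holds (or max_n is reached)."""
    stats = RunningStats()
    while stats.n < max_n:
        stats.add(draw())
        if stats.n >= min_n and stats.n >= required_samples(
                stats.mean, stats.stdev(), eps, z):
            break
    return stats.mean, stats.n

rng = random.Random(0)
estimate, n_used = monte_carlo_mean(lambda: rng.lognormvariate(0.0, 0.1))
print(f"mean = {estimate:.4f} after {n_used} samples")
```

In this form, the required number of samples grows with the square of the sample coefficient of variation.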
B. Capturing the Impact of Current Redistribution
Algorithm 1 Our MCS Framework for Failure Study
As we discussed in Section II-C, the severity of the avalanche effect caused by failure-induced current redistribution directly affects the system's robustness against EM. To model the EM wearout acceleration phenomenon caused by failure-induced current redistribution, we need to adjust the bumps' failure probability distribution functions whenever their current changes. This CDF adjustment has been discussed in [11]. It involves two major steps: 1) stressing current translation and 2) conditional probability calculation. The rule of current translation is given by
where n is the same as in (2), I_prev and I_new are the previous and new current, respectively, t_prev is the stressed time under the previous current, and t'_prev is the translated stress time. Under this equation, the probability of failure under current I_prev over time t_prev would be the same as if the bump had been stressed under I_new over time t'_prev. With the translated stressed time, we can calculate the new CDF with conditional probability
For any time period t after the moment of current change, this equation calculates the probability of bump failure under current I_new, given the condition that the bump had not failed after the first time period of t_prev. F_new is the unmodified CDF under stressing current I_new.
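The two steps can be sketched as follows. Since the exact expressions are not reproduced here, the forms used are assumptions consistent with the description: the translation t'_prev = t_prev · (I_prev / I_new)^n, which follows from an MTTF proportional to I^(-n), and the conditional CDF [F_new(t + t'_prev) − F_new(t'_prev)] / [1 − F_new(t'_prev)]. The function names and the numeric example are hypothetical.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()

def lognormal_cdf(t, t50, sigma=0.5):
    """Lognormal failure CDF with scale parameter t50."""
    if t <= 0.0:
        return 0.0
    return _STD_NORMAL.cdf((math.log(t) - math.log(t50)) / sigma)

def translate_stress_time(t_prev, i_prev, i_new, n=1.8):
    """Equivalent stress time under i_new for t_prev spent under i_prev."""
    return t_prev * (i_prev / i_new) ** n

def conditional_cdf(t, t_prev_translated, t50_new, sigma=0.5):
    """P(fail within t after the current change | survived so far)."""
    f_age = lognormal_cdf(t_prev_translated, t50_new, sigma)
    if f_age >= 1.0:          # numerically certain to have failed already
        return 1.0
    f_total = lognormal_cdf(t + t_prev_translated, t50_new, sigma)
    return (f_total - f_age) / (1.0 - f_age)

# Hypothetical example: 2 time units at nominal current, then a 20% increase.
t50_old, i_prev, i_new = 10.0, 1.0, 1.2
t50_new = t50_old * (i_prev / i_new) ** 1.8
t_translated = translate_stress_time(2.0, i_prev, i_new)
print("failure probability before change:", lognormal_cdf(2.0, t50_old))
print("same probability after translation:", lognormal_cdf(t_translated, t50_new))
print("P(fail within 3 more units):", conditional_cdf(3.0, t_translated, t50_new))
```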
C. MCS Framework for Bump Failure Study
Algorithm 1 shows the details of the failure model used in our MCS framework. Within a bump failure simulation trial, we first calculate the current density of each bump and generate the per-bump failure time CDF (line 4) using (1) and (2). Then, we randomly generate the TOF of each bump using the inverse transform sampling method [14]. To be more specific, we first generate a random number p for each bump from a uniform distribution between 0 and 1. The bump's TOF then equals F^(-1)(p), where F^(-1) is the inverse function of the bump's CDF. By selecting the bump with the earliest TOF and removing it from the PDN (lines 8-10), we can simulate one random bump failure event. Every time a bump fails, all remaining bumps' CDFs will be adjusted (lines 13 and 14) and inverse-transform-sampled (line 15) to capture the impact of current redistribution.
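Inverse transform sampling for a lognormal failure-time distribution can be sketched as below. The per-bump scale values and bump names are hypothetical, and `NormalDist.inv_cdf` from the Python standard library stands in for the inverse of the standard normal CDF.

```python
import math
import random
from statistics import NormalDist

_STD_NORMAL = NormalDist()

def sample_tof(t50, sigma, rng):
    """Draw one time-of-failure as F^(-1)(p) with p uniform on (0, 1)."""
    p = rng.random()
    p = min(max(p, 1e-12), 1.0 - 1e-12)   # inv_cdf is undefined at 0 and 1
    return t50 * math.exp(sigma * _STD_NORMAL.inv_cdf(p))

rng = random.Random(1)
# Hypothetical per-bump scale parameters (higher current gives a smaller value).
bump_t50 = {"bump_a": 10.0, "bump_b": 8.0, "bump_c": 12.0}
tofs = {name: sample_tof(t50, 0.5, rng) for name, t50 in bump_t50.items()}
first_to_fail = min(tofs, key=tofs.get)
print(first_to_fail, tofs[first_to_fail])
```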
Algorithm 2 Simplified Bump Failure Model
To monitor the degradation of power delivery quality over time, we reevaluate on-chip noise whenever a failed bump is removed (line 11). As soon as the noise level exceeds the design threshold, we terminate the trial and record the last failed bump's TOF as the TOF of the whole system (line 18). Each iteration of the outer while loop (lines 2-19) simulates a trial of bump failures and generates one system TOF sample. The MCS will stop as soon as the stopping criteria (3) are satisfied; the system's MTTF is the arithmetic average of all samples (line 20).
To evaluate the severity of the avalanche effect, we also implement a simplified baseline model that only calculates bump current, CDF, and TOF once in every trial and does not adjust bump current or CDF after bump failures (Algorithm 2). This way, the simplified model ignores the impact of current redistribution and the differences between the results of these two models directly measure the severity of the avalanche effect. We note that the model used in [5] would not be a proper baseline for this paper because it assumes the bump with the highest current will always fail first, which completely neglects the stochastic nature of bump failures and, therefore, cannot be used to estimate the system MTTF.
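To illustrate the difference between the two failure models, here is a deliberately simplified toy version of both. It is not the framework itself. It assumes a fixed total current shared equally by the surviving bumps, and it declares the system failed after a fixed number of bump failures, as a stand-in for the noise-threshold check that the real framework performs with a PDN model. All names and parameter values other than σ = 0.5 and n = 1.8 are hypothetical.

```python
import math
import random
from statistics import NormalDist, fmean

STD_NORMAL = NormalDist()
SIGMA, N_EXP = 0.5, 1.8   # lognormal sigma and current exponent n from the text

def cdf(t, t50):
    return STD_NORMAL.cdf(math.log(t / t50) / SIGMA) if t > 0.0 else 0.0

def inv_cdf(q, t50):
    q = min(max(q, 1e-12), 1.0 - 1e-12)   # keep q strictly inside (0, 1)
    return t50 * math.exp(SIGMA * STD_NORMAL.inv_cdf(q))

def remaining_life(rng, t50, age):
    """Sample the time left for a bump that has survived `age` at this current."""
    f_age = cdf(age, t50)
    return max(inv_cdf(f_age + rng.random() * (1.0 - f_age), t50) - age, 0.0)

def detailed_trial(rng, n_bumps, max_failures, t50_nominal):
    """Resample surviving bumps after every failure (current redistribution)."""
    alive, clock, age = n_bumps, 0.0, 0.0
    for _ in range(max_failures):
        # Fewer surviving bumps means more current per bump and a shorter t50.
        t50 = t50_nominal * (alive / n_bumps) ** N_EXP
        dt = min(remaining_life(rng, t50, age) for _ in range(alive))
        clock += dt
        # Translate the accumulated stress time to the new, higher current.
        age = (age + dt) * ((alive - 1) / alive) ** N_EXP
        alive -= 1
    return clock

def simplified_trial(rng, n_bumps, max_failures, t50_nominal):
    """Sample every TOF once under the initial current; never adjust."""
    tofs = sorted(inv_cdf(rng.random(), t50_nominal) for _ in range(n_bumps))
    return tofs[max_failures - 1]

def mttf(trial, trials=300, seed=0, n_bumps=20, max_failures=10, t50_nominal=10.0):
    rng = random.Random(seed)
    return fmean(trial(rng, n_bumps, max_failures, t50_nominal) for _ in range(trials))

detailed = mttf(detailed_trial)
simplified = mttf(simplified_trial)
print(f"detailed MTTF   = {detailed:.2f}")
print(f"simplified MTTF = {simplified:.2f}")
```

Because every surviving bump in this toy setup carries more current after each failure, the detailed variant reaches the failure count sooner, which mirrors qualitatively why a model without redistribution is optimistic.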
D. Supporting Other Random Failures
Although Algorithms 1 and 2 are designed to simulate multiple EM-induced C4 bump failures, they can be easily extended to study other vulnerable components under EM stress or other reliability threats. For example, by replacing C4 bumps with on-chip metal wires, we can study the relationship between the consequences of multiple EM-induced on-chip wire failures and the whole-chip's lifetime under a given noise margin. On the other hand, we can also adapt our MCS to use the failure probability CDFs of other failure mechanisms (e.g., thermal migration or thermal stress) to evaluate the consequence of those reliability threats and their impact on system MTTF. If the modeled failure events are not independent, the system reevaluation and the CDF recalculation mechanism in Algorithm 1 (lines 11-16) can be used to capture the interaction between random failures and their system-level effects. Moreover, our framework has built-in interfaces with architecture-level power, thermal and voltage noise models, and, therefore, it directly supports the exploration of different 
IV. SIMULATION SETUP
A. PDN Modeling
The MCS framework described in Section III uses per-bump current to calculate/adjust failure probability CDF. It also depends on the evaluation of on-chip voltage noise to determine the whole-system's TOF. To get the PDN's current and voltage profile, we integrate our MCS framework with a pre-RTL PDN model, VoltSpot [5] .
VoltSpot is an open-source tool. It takes a processor floorplan, bump locations, and a per-unit power trace as inputs, and calculates the current and voltage at all PDN branches and nodes. VoltSpot uses separate regular 2-D circuit meshes to model the on-chip Vdd and ground nets. C4 bumps are modeled as individual resistor-inductor branches attached to on-chip grid nodes, and on-chip decoupling capacitors as distributed capacitors connecting the Vdd and ground grids. Ideal current sources model the load (i.e., the power of the switching transistors and associated leakage), and the current values are calculated as I = Power/Supply Voltage. Off-chip components, such as the package, are modeled with lumped RLC elements. We assume the printed circuit board provides an ideal power supply. The parameters used in this paper are listed in Table I. We note that although existing circuit-level PDN models (e.g., the SPICE model in [15]) are also capable of simulating PDN current and noise, they usually involve millions of nodes, which significantly increases simulation time. As a result, these models cannot provide enough MCS trials to derive high-quality results within a reasonable time.
In order to avoid pessimistic results caused by suboptimal power-bump placement, we optimized the location of all C4 bumps in all of our test cases. The optimization algorithm is adopted from [16] .
B. Multicore Processor Power, Area, and Floorplan Modeling
To study the severity of EM in near-future high-performance processors, we build a multicore processor based on a 45-nm Intel Penryn processor [17] and scale it down to 16 nm. It has 16 32-bit four-way out-of-order cores. Each core contains a 32-kB L1 instruction cache and a 32-kB L1 data cache. Unified L2 caches private to each core are each 3 MB. The chip area was calculated with McPAT [18], an architecture-level power model. Application-specific power consumption was derived by integrating McPAT with a performance simulator, Gem5 [19]. We use ArchFP [20] to generate our floorplans. Fig. 2 shows the floorplan of our 16-nm, 16-core processor. It has an area of 159.4 mm^2 and a total of 1914 C4 bump sites. With a supply voltage of 0.7 V, the 3.7-GHz processor's peak power consumption is 151.7 W.
Fig. 3. Normalized whole-system MTTF under different noise margin settings. The differences between the two sets of results are caused by the avalanche effect.
In the past, Tao et al. [21] discovered that the lifetime of metal under high-frequency ac current stress is determined by the dc component of the stressing current alone. For this reason, we analyze bump EM stress in steady-state only. To be more specific, we first simulate the Parsec 2.0 benchmark suite [22] with our power and performance simulators and then extract the average power consumption of the entire suite. With this power map as an input to steady-state VoltSpot simulation, we can capture the whole system's EM wearout under the processor's average behavior.
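As a quick sense of scale for the chip described in this section, the following arithmetic derives the peak supply current, power density, and bump-site density from the stated area, bump-site count, supply voltage, and peak power. The variable names are ours. Note that these are peak figures, whereas the EM analysis itself uses average power.

```python
AREA_MM2 = 159.4       # chip area
BUMP_SITES = 1914      # total C4 bump sites
VDD_V = 0.7            # supply voltage
PEAK_POWER_W = 151.7   # peak power consumption

peak_current_a = PEAK_POWER_W / VDD_V          # I = Power / Supply Voltage
power_density_w_mm2 = PEAK_POWER_W / AREA_MM2
site_density_per_mm2 = BUMP_SITES / AREA_MM2

print(f"peak supply current : {peak_current_a:.1f} A")
print(f"peak power density  : {power_density_w_mm2:.3f} W/mm^2")
print(f"bump-site density   : {site_density_per_mm2:.2f} sites/mm^2")
```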
V. RESULTS
A. Avalanche Effect: How Bad Is It?
To analyze the severity of the avalanche effect, we performed MCS on our 16-core processor with both detailed and simplified failure models described in Section III. Fig. 3 shows the whole chip's MTTF with different extra noise margins reserved for EM wearout. The baseline case, which is a conservative design with no extra margin and 40% of total bumps assigned to power supply, has a maximum on-chip IR drop of 1.24% Vdd. All results are normalized to the MTTF of this baseline design. As we tolerate larger IR drop (the x-axis represents the delta increase of noise margin), we significantly extend system MTTF by allowing more bumps to fail. Since the only difference between the two models is whether they consider the impact of current redistribution, the differences between their results are directly caused by the avalanche effect. Therefore, Fig. 3 shows that the avalanche effect does exist and in extreme cases (e.g., with 5% Vdd extra noise margin), ignoring it will overestimate system MTTF by 80%.
We also evaluated the computational complexity of the two models. In general, the detailed model requires more simulation time, because failure time CDFs need to be updated for every bump after each bump failure. However, the simplified model requires more trials for each MC experiment, because bump failures are not correlated in space and more bump failures are needed to reach chip failure, while the detailed model requires fewer trials as extra noise margin increases, because there are fewer paths to failure. As a result, the simulation overhead of the detailed model ranges from 80% (at 0.5% extra noise margin) to break even (at 4.5% extra margin).
An interesting observation is that the severity of the avalanche effect will increase as we further relax noise margin to allow more bumps to fail. This is because as EM stress gradually fails a cluster of power bumps, the amount of current redistributed to the remaining bumps will accumulate and exacerbate the acceleration of EM wearout. If we only allow a small portion of bumps to fail (i.e., the left-most data points of Fig. 3) , the impact of current redistribution is fairly small and a mild increase in noise margin can tolerate those bump failures and significantly increase the system's MTTF.
B. Achieving Target MTTF With Reduced Bump Count
The number of C4 bumps allocated for power supply is an important design parameter because it directly affects both the complexity of chip package design and the available physical width of off-chip I/O communication channels. Although both design considerations favor fewer power-supply bumps, reducing the power-bump count will increase bump current and, therefore, degrade the system's EM robustness. Table II shows our baseline chip's expected lifetime without any EM-induced bump failures. We evaluated different bump allocation schemes and observe that reducing power-bump count significantly shortens EM-failure-free time.
Fortunately, the observations in Section V-A indicate that by assigning a small extra noise margin to tolerate bump failures, the MTTF of a chip with fewer power bumps could be extended to match the MTTF of a chip with more bumps. In other words, at the cost of increased noise margin, the power-bump count can be reduced without degrading the system's MTTF.
With our MCS framework, we analyzed the tradeoff between the chip power-bump count and the required on-chip voltage noise margin under a fixed system MTTF target. Using a target MTTF that equals the EM-failure-free time of a baseline chip with 70% of total bumps allocated as power/ground, Fig. 4(a) shows the required noise margin for different bump allocation schemes (in this section, we only discuss the single-domain case, which is the blue line with diamond-shaped markers at the bottom). We note that the noise margin values here represent absolute IR drop as a percentage of Vdd, and they include the guardband to tolerate both bump failures and the power delivery quality degradation due to reduced bump count (e.g., if we reduce the power-bump ratio from 70% to 30%, the worst on-chip IR drop will increase from 0.91% to 1.59% even without any bump failures). We observe that a small increase in noise margin is sufficient even if we significantly reduce the power-bump count. For example, with an extra noise margin of 0.53% Vdd (from 0.91% to 1.44%), designers can reduce the power-bump count by 43% (from 70% to 40%) without shortening the whole system's MTTF. This increases the physical bandwidth for off-chip I/O communication by up to 2× (I/O bump count from 30% to 60%), while incurring less than a 1% slowdown (given the proportional relationship between the noise margin and the clock frequency, as discussed in Section II-B).
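The arithmetic behind this example is easy to verify. The short check below uses only the percentages quoted above; the function and variable names are ours.

```python
def relative_reduction(before, after):
    """Fractional reduction going from `before` to `after`."""
    return (before - after) / before

power_before_pct, power_after_pct = 70, 40      # share of bump sites used for power
io_before_pct = 100 - power_before_pct          # 30% left for I/O
io_after_pct = 100 - power_after_pct            # 60% left for I/O

bump_cut = relative_reduction(power_before_pct, power_after_pct)
io_gain = io_after_pct / io_before_pct
extra_margin_pct = 1.44 - 0.91                  # extra IR-drop margin, % of Vdd

print(f"power-bump reduction: {bump_cut:.1%}")
print(f"I/O bump-site gain  : {io_gain:.1f}x")
print(f"extra noise margin  : {extra_margin_pct:.2f}% Vdd")
```

Under the proportional relationship between noise margin and clock frequency, the extra margin of about 0.53% corresponds to a slowdown of about the same percentage, which is below the 1% figure quoted above.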
C. Impact of Multiple On-Chip Power Domains
Contemporary processors usually have multiple on-chip power islands instead of supporting the entire silicon chip with a single power domain. Splitting the on-chip power delivery grid into multiple isolated domains provides finer-grained spatial support for dynamic voltage and frequency scaling, which improves the energy efficiency of multicore processors. In addition, isolating high-frequency digital blocks (e.g., CPU cores) from mixed-signal blocks (e.g., PHY unit in memory controllers) helps to improve the integrity of analog signals. However, separating the power grid into mutually insulated domains inevitably cuts off a portion of lateral current flow in the on-chip PDN. This not only degrades power delivery quality due to increased PDN impedance, but also undermines C4 bumps' robustness against EM wearout. This is because for the bumps near domain boundaries, the amount of redistributed current due to bump failures within one domain will be higher, since the bumps in neighboring domains can no longer help to share the current load.
To study the impact of power domains on system EM robustness, we extended VoltSpot by splitting the virtual on-chip Vdd grid into different islands according to the chip floorplan and the power domain specification for each block. The lateral circuit branches in the virtual grid that cross domain boundaries are removed so that different domains are mutually insulated from each other. With this extension, we performed statistical analysis with our MCS framework, and Fig. 4(a) shows the required noise margin to achieve the target system MTTF under different power domain settings. Using the same MTTF target and bump allocations as Section V-B, we tested three cases with 4, 2, and 1 core(s) per domain, which gave us 4, 8, and 16 different domains, respectively. Simulation results indicate that having multiple power domains does exacerbate EM wearout, and the required noise margin to guarantee the targeted MTTF will significantly increase as we split the power grid into more domains or reduce the power-bump count. If the chip design has strict limits for both MTTF and on-chip noise margin, supporting more power domains will require more power bumps. For example, with a fixed noise margin of 2% Vdd, a single-domain chip needs only ∼35% of total bumps to supply power, while a one-core-per-domain chip has to increase this ratio to ∼55%. These results indicate that although more on-chip power islands improve system energy efficiency with finer-grained support for voltage/frequency tuning, they also come at the price of a higher demand for power bumps. This also suggests that the insulation boundaries (i.e., boundaries between adjacent voltage domains) in the on-chip PDN directly affect the EM robustness of the whole system. The relationship between the amount of insulation boundary and the number of total bumps is an interesting topic for future work.
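The grid-splitting step can be pictured with a small sketch. It is not VoltSpot's implementation. It only shows, on a hypothetical 4 × 4 node mesh, how removing the lateral branches whose two endpoints belong to different domains isolates the islands. All names are ours.

```python
def lateral_edges(width, height):
    """Yield every horizontal and vertical branch of a regular 2-D mesh."""
    for y in range(height):
        for x in range(width):
            if x + 1 < width:
                yield (x, y), (x + 1, y)
            if y + 1 < height:
                yield (x, y), (x, y + 1)

def split_into_domains(width, height, domain_of):
    """Keep only the branches whose two endpoints share a power domain."""
    return [(a, b) for a, b in lateral_edges(width, height)
            if domain_of(a) == domain_of(b)]

def quadrant(node):
    """Hypothetical domain map: four 2 x 2 islands on a 4 x 4 mesh."""
    x, y = node
    return (x // 2, y // 2)

all_branches = list(lateral_edges(4, 4))
kept = split_into_domains(4, 4, quadrant)
print(f"{len(all_branches)} branches in a single domain, "
      f"{len(kept)} after splitting into four domains")
```

Every removed branch is a path that redistributed current can no longer take, which is why bumps near domain boundaries see a larger share of the current after a neighboring failure.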
To further explore the impact of the avalanche effect, we performed the same set of analyses with the simplified model [Fig. 4(b)]. By ignoring the current redistribution phenomenon, the simplified failure model produces optimistic estimations of EM wearout and, therefore, ends up with a much smaller noise margin requirement. In the worst case scenario with one core per domain and 30% of bumps as power supply, the simplified model underestimates the required noise margin by over 50%, which leads to a design that will fail long before the target MTTF. We, therefore, conclude that it is critically important to evaluate the impact of the avalanche effect in the design tradeoff study that involves power-bump count, noise margin, EM-induced system MTTF, and multiple on-chip power domains. Failure to do so can lead to incorrect pre- and post-RTL design decisions.
D. Using Graceful Performance-Degradation Schemes to Extend Chip Lifetime
The EM wearout of a silicon chip's C4 bump array is a gradual process that slowly increases on-chip voltage noise.
Due to the nonuniform current distribution between different bumps and the presence of the avalanche effect, different on-chip regions will suffer from different levels of EM-induced power delivery quality degradation. These observations suggest the possibilities of using graceful performance-degradation techniques to extend silicon chips' lifetime beyond the original MTTF and thus get more work done with slightly reduced performance. For example, with the ability to further relax timing margin (i.e., slow down clock frequency) after the worst on-chip IR drop reaches the design target, the silicon chip can tolerate more bump failures and operate longer. Another possible scheme is to abandon a core as soon as its IR drop exceeds the design target. This way, the processor loses throughput but the remaining cores could operate longer with unchanged voltage and frequency.
We implemented both the margin-adaptation and the core-desertion techniques in our MCS framework. For the margin-adaptation scheme, we assume a fixed voltage level and only adjust global clock frequency to accommodate extra voltage noise. Since there is an approximate linear dependence between the delay of transistors and the supply voltage noise level at typical supply voltages, we assume that an x% extra IR drop reduces the maximum clock frequency by x%. After the on-chip noise exceeds the original design target, the margin-adaptation controller will gradually slow down the entire processor according to the actual voltage noise level. It is worth mentioning that in practical designs, designers could set a hard limit on noise margin beyond which the processor or core will be considered as failed. This helps to maintain a lower performance bound and guarantee silicon chips' functionality under worst case voltage noise. We include this hard limit as a design parameter in our framework. For the core-desertion technique, the controller will power gate a core as soon as its noise level exceeds the design target and the abandoned core will never be used again. We note that both schemes assume ideal voltage sensing that captures the worst per-core IR drop. The discussion of noise detection's role exceeds the scope of this paper but is an interesting direction for future work. In general, any margin-adaptation technique will need to add margin to account for the imprecision of the voltage sensors.
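The two controllers described above can be sketched as follows. This is a minimal illustration rather than the framework's implementation: the function names are hypothetical, IR drop is expressed in percent of V dd, and "x% extra IR drop" is interpreted here as x percentage points of V dd beyond the design target, which is an assumption.

```python
def adapted_frequency(ir_drop_pct, design_target_pct, hard_limit_pct, f_nominal=1.0):
    """Margin adaptation: an x% extra IR drop lowers the clock frequency by x%.

    Returns 0.0 once the hard noise limit is exceeded, i.e., the chip is
    considered failed.
    """
    if ir_drop_pct > hard_limit_pct:
        return 0.0
    extra = max(0.0, ir_drop_pct - design_target_pct)
    return f_nominal * (1.0 - extra / 100.0)

def functional_cores(per_core_ir_drop_pct, design_target_pct, deserted):
    """Core desertion: power gate every core whose IR drop exceeds the target.

    `deserted` is a set of core indices that persists across calls, so an
    abandoned core is never used again even if its noise later decreases.
    """
    for core, drop in enumerate(per_core_ir_drop_pct):
        if drop > design_target_pct:
            deserted.add(core)
    return len(per_core_ir_drop_pct) - len(deserted)

# Example with a 1.44% V dd design target and a 2% V dd hard limit.
print(adapted_frequency(2.0, 1.44, 2.0))
deserted = set()
print(functional_cores([1.0, 1.5, 1.2, 1.6], 1.44, deserted))
```

Both functions assume ideal sensing of the worst per-core IR drop, as stated above; a practical controller would add margin for sensor imprecision.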
To study the effectiveness of these two schemes, we pick a design point from Fig. 4(a) and apply the graceful degradation techniques with different settings. The baseline design has 40% of bumps allocated to power supply and uses a single on-chip power domain. It requires an IR-drop margin of 1.44% V dd to meet the MTTF target. Fig. 5 shows the evaluation results with the metric of aggregated work done, which is calculated as clock_frequency * number_of_functional_cores * time_duration. The unit of this metric is instruction count, and it shows the number of instructions a processor can execute during its entire lifetime. All results are normalized to the work done by our baseline design without any graceful performance-degradation scheme. By testing the core-desertion technique that abandons up to 14 cores (out of 16) and the margin-adaptation technique with different hard noise limits, we found that the aggregated work done could be significantly increased (up to 2.5×) with graceful degradation schemes. An interesting observation is that if we only apply one technique, the benefit of slightly increasing noise margin would be equivalent to abandoning many cores. For example, by gradually relaxing noise margin from 1.44% to 2% V dd , the chip could get 51% more work done. To complete the same amount of extra work, the core-desertion technique must abandon at least seven cores. The reason behind the margin-adaptation technique's high efficiency is similar to what we have observed in Section V-A, where a slight noise margin increase can tolerate more bump failures and significantly improve chip lifetime. In contrast, the core-desertion technique is not as effective because the silicon chip usually suffers from multiple EM-damaged regions; therefore, shutting down only a few cores could not extend the lifetime of the entire chip.
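As a minimal sketch of the aggregated-work-done metric, the lifetime can be treated as a sequence of phases, each with its own frequency, number of functional cores, and duration. The phase values below are made up for illustration and are not taken from Fig. 5; frequency and time are in normalized units.

```python
def aggregated_work(phases):
    """Sum clock_frequency * number_of_functional_cores * time_duration over phases.

    Each phase is a (clock_frequency, functional_cores, duration) tuple.
    """
    return sum(f * n * t for f, n, t in phases)

# Baseline: 16 cores at nominal frequency until the original MTTF (normalized to 1).
baseline = aggregated_work([(1.0, 16, 1.0)])

# Hypothetical degraded lifetime: the same first phase, followed by a period
# half as long at 0.5% lower frequency with all 16 cores still functional.
with_scheme = aggregated_work([(1.0, 16, 1.0), (0.995, 16, 0.5)])

normalized = with_scheme / baseline
print(normalized)
```

Normalizing to the baseline, as in the text, removes the dependence on the absolute frequency and time units.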
In conclusion, graceful performance-degradation techniques could keep a processor running after the designed MTTF and thus significantly increase the amount of work done by the processor. We note that, because these techniques only shut down cores or reduce frequency after bumps start to fail, they will not degrade the processor's performance during the targeted lifetime. In fact, the above-discussed techniques act like a warning light that not only gives users the ability to continue to do work, but also alerts them that a replacement is due soon.
E. Impact of On-Chip Temperature Locality
In Sections V-A through V-D, we assumed that the on-chip temperature is uniformly distributed. Although this assumption is helpful for worst-case analysis when we set the temperature value to its upper bound (e.g., 100°C), it clearly leads to pessimistic results in MTTF estimation. This is because the EM-induced MTTF is highly sensitive to temperature, and in reality, on-chip temperature exhibits a nonuniform distribution, where blocks with lower power consumption will experience lower temperature. In order to explore the impact of on-chip temperature's locality, we integrate our MCS framework with a pre-RTL thermal model, HotSpot [23] . With HotSpot's fine-grained on-chip thermal modeling capability, the improved MCS framework first derives a detailed on-chip temperature map [...]
Fig. 6 presents the whole-system MTTF with different extra noise margins reserved for EM wearout. All results are normalized to the MTTF of the baseline design, which has 30% of the total bumps assigned to power supply. By comparing the results generated using both distribution assumptions, we observe that adopting uniformly distributed temperature underestimates system MTTF by up to 12%. We note that according to our thermal simulation, the maximum on-chip temperature (which is also used as the temperature value in the uniform scenario) is roughly 10.9°C higher than the lowest on-chip temperature.
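To give a feel for the temperature sensitivity of a single bump, the sketch below assumes the Arrhenius term of Black's equation, MTTF ∝ exp(E_a / kT), at equal current density. Both the use of this model here and the activation energy of 0.8 eV are illustrative assumptions; the bump lifetime model and parameters used in our framework may differ.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333e-5  # Boltzmann constant in eV/K

def mttf_ratio(t_cool_c, t_hot_c, activation_energy_ev):
    """Return MTTF(cool) / MTTF(hot) for one bump at equal current density.

    Only the Arrhenius factor exp(E_a / (k * T)) is considered; temperatures
    are given in degrees Celsius and converted to kelvin.
    """
    t_cool = t_cool_c + 273.15
    t_hot = t_hot_c + 273.15
    exponent = activation_energy_ev / BOLTZMANN_EV_PER_K * (1.0 / t_cool - 1.0 / t_hot)
    return math.exp(exponent)

# Hottest spot at 100 C versus a spot 10.9 C cooler; E_a = 0.8 eV is assumed.
ratio = mttf_ratio(100.0 - 10.9, 100.0, 0.8)
print(round(ratio, 2))
```

Under these assumptions, a bump in the coolest region would last roughly twice as long as an identical bump at the hottest spot. This per-bump ratio is not the system-level effect: the system MTTF is determined by the interplay of all bumps, including their differing currents, which is why the statistical framework is needed.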
Besides evaluating the effect of temperature distribution on MTTF, we also analyzed how it affects noise margin provisioning. Using a target MTTF that equals the EM-failure-free time of the chip with 70% of bumps as power/ground, Table III compares the estimated noise margin requirement under both temperature-distribution assumptions. We observe that ignoring the on-chip temperature distribution only overestimates noise margin by a trivial amount (e.g., 0.05% V dd ), unless the ratio of power/ground bumps is reduced to a low level (e.g., to 30%).
F. Transient Noise Evaluation
The majority of this paper uses steady-state IR drop as the metric to evaluate power delivery quality. Although this methodology has been widely used by researchers and engineers [4] , it is also important to understand the impact of bump failures on transient noise. Unfortunately, even though the PDN model we use in this paper supports transient noise simulation, evaluating PDN transient behavior is many orders of magnitude slower than steady-state simulation. For example, simulating 2M cycles (which is the minimum requirement to determine the average noise behavior of one application [5] ) would take over 200 h, while one steady-state simulation takes less than 0.1 s. Considering that each trial in our MCS has to evaluate on-chip noise once after every bump failure and that it requires up to 1000 trials to get high-quality MTTF results, it becomes infeasible to rely on whole-application transient noise evaluation in our analysis.
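The cost gap can be made concrete with a back-of-the-envelope calculation. The per-evaluation times and the trial count come from the text above; the number of bump failures evaluated per trial is a hypothetical value chosen only for illustration.

```python
TRANSIENT_HOURS_PER_EVAL = 200.0  # lower bound from the text ("over 200 h")
STEADY_SECONDS_PER_EVAL = 0.1     # upper bound from the text ("less than 0.1 s")
TRIALS = 1000                     # upper bound on MCS trials from the text
FAILURES_PER_TRIAL = 10           # hypothetical; one noise evaluation per failure

evals = TRIALS * FAILURES_PER_TRIAL

# Per-evaluation slowdown of whole-application transient simulation.
slowdown = TRANSIENT_HOURS_PER_EVAL * 3600.0 / STEADY_SECONDS_PER_EVAL

# Total cost of the full MCS under each evaluation method.
steady_total_seconds = evals * STEADY_SECONDS_PER_EVAL
transient_total_years = evals * TRANSIENT_HOURS_PER_EVAL / (24 * 365)

print(f"slowdown per evaluation: {slowdown:.1e}x")
print(f"steady-state MCS: {steady_total_seconds:.0f} s")
print(f"transient MCS: {transient_total_years:.0f} years")
```

Because the quoted times are bounds, the slowdown of about seven million times is itself a lower bound, and the total transient cost scales linearly with the assumed number of failures per trial.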
In order to validate whether our IR-drop-based observations are consistent with transient noise evaluation, we significantly reduce simulation time by only evaluating the worst-case transient noise. Using a 1k-cycle stressmark that triggers the largest transient voltage drop among the entire Parsec 2.0 benchmark suite, we are able to perform our MCS within an acceptable time period (e.g., two weeks). To be more specific, we replace the steady-state noise evaluation in Algorithm 1 (line 11) with transient simulation of the stressmark and use the maximum on-chip transient voltage drop during the entire stressmark to determine the failure of the chip (line 7). Fig. 7 shows the MTTF results under transient noise evaluation. Similar to the steady-state results, a small extra margin reserved for EM wearout (e.g., 0.5% V dd ) allows the system to tolerate bump failures and significantly extend MTTF (e.g., 2×). In addition, the fact that larger noise margins produce diminishing returns in MTTF indicates that the avalanche effect also applies to transient behavior. These results give us confidence that our observations derived from IR-drop evaluation also apply to cases where transient noise is a major concern. A more detailed transient study is left for future work.
VI. CONCLUSION
In this paper, we develop an MCS framework for the study of multiple EM-induced power-supply C4 bump failures. Our results indicate that the degradation of power delivery quality, in terms of maximum on-chip voltage noise, is a relatively slow process. As a result, a small extra noise guardband can tolerate the consequences of multiple EM-induced bump failures, which significantly extends the system's expected lifetime. The target lifetime can therefore be achieved with a significantly reduced power-supply bump count (e.g., by 43%) and a slightly increased on-chip noise margin (e.g., 0.5% V dd IR drop). In addition, we show that an avalanche-like wearout acceleration effect exists and that ignoring it will overestimate system MTTF by up to 80% and underestimate the required noise margin for EM wearout by over 50%. Moreover, we explored the impact of splitting the on-chip power grid into multiple domains and observed that the lack of lateral PDN current flow across domain boundaries degrades the bump array's EM robustness. Consequently, chips with finer-grained domains will require more power-supply bumps to maintain the target MTTF.
The design of C4 bumps under an EM-induced reliability constraint involves a multidimensional design space that consists of bump allocations, system MTTF, voltage noise margin, multiple on-chip power domains, and so on. The MCS framework we develop here can guide designers through this complex tradeoff space and help them to: 1) better provision bump allocation in different design scenarios to reduce packaging cost and support more off-chip I/O channels in near-future technology nodes; 2) better reserve timing margin for EM wearout to achieve the targeted MTTF; and 3) better design on-chip power islands to balance system energy efficiency, I/O bandwidth, and cost. Our framework also supports the design and evaluation of run-time wear leveling and/or graceful performance-degradation techniques.
