Advancements in technology enable integration of a large number of cores on a single silicon die. At the same time, aggressive technology scaling has an ever-increasing adverse impact on the lifetime reliability of such large integrated circuits. In this work, we model the lifetime reliability of homogeneous manycore systems using a load-sharing nonrepairable k-out-of-n:G system with general failure distributions for embedded cores. In manycore systems, an embedded core can be in operational, cold standby, or warm standby state depending on system redundancy schemes and their workloads. We then use the proposed model to analyze the impact of different redundant schemes and configurations on the lifetime reliability of manycore systems.
INTRODUCTION
While the relentless scaling of CMOS technology has brought with it enhanced functionality and improved performance in every new generation, the associated ever-increasing on-chip power and temperature densities make the lifetime reliability of high-performance integrated circuits (ICs) one of the major concerns for the industry [3, 19] .
The failure mechanisms that contribute to IC's permanent failures (e.g., time dependent dielectric breakdown (TDDB) and electromigration) have been extensively studied at the circuit level in the past, and they are shown to be strongly related to the temperature and voltage applied to the circuit. Recently these failure mechanisms have been revisited at the processor microarchitecture level due to their increasing impact with technology scaling [18, 19] .
The above models mainly target unicore processor chips. State-of-the-art computing systems (e.g., multi-digital signal processor (DSP) systems [11], general-purpose processors), however, have started to employ multiple cores on a single silicon die to improve performance through parallel execution instead of frequency increase, which has the benefits of power efficiency and short time-to-market [7]. A 128-core GPU [15] and a 64-core general-purpose multiprocessor [22] have already been released to the market. Various research teams have projected that thousand-core processor chips will become commercially available in the foreseeable future [1, 4]. For such large-scale manycore systems fabricated with the latest technology, how to model their lifetime reliability is an interesting and relevant problem.
Assuming the failure distribution of an embedded core is known a priori, we analyze the lifetime reliability of manycore systems in this work. We make the following observations during the modeling process:
• embedded cores will age in operation. That is, we expect an increasing failure rate (IFR) when a core gets older.
• manycore systems are k-out-of-n:G systems, in which n is the total number of processor cores fabricated on-chip and k is the minimum number of good cores required for the system to function correctly. Generally speaking, the value of n is larger than the value of k to provide fault tolerance [24].
• manycore systems are load-sharing systems, i.e., each embedded core is designed to carry only part of the load assigned by the operating system (OS). In fact, a core's failure rate and the associated lifetime depend significantly on its workload, which determines the temperature and voltage applied to the circuit.
• manycore systems are nonrepairable systems. That is, unlike traditional board-level multiprocessor systems that can be easily repaired by replacing a defective processor chip, embedded cores are integrated on a silicon die in manycore systems, and it is extremely difficult, if not impossible, to repair or replace a faulty core.
Based on the above observations, we model the lifetime reliability of manycore systems using a load-sharing nonrepairable k-out-of-n:G system with general lifetime distributions for embedded cores. To the best of our knowledge, this is the first comprehensive reliability model for such complex systems.
Manycore systems can be configured in two ways to achieve reliability: (i) gracefully degrading systems, which use all failure-free cores to execute tasks; when a core failure is detected, these systems attempt to reconfigure to a system with one fewer module; and (ii) standby redundant systems, which execute tasks on active cores; upon detection of the failure of an active core, these systems attempt to replace the faulty unit with a spare unit. Depending on the above configurations and the current workload, cores can be in normal functional mode, warm standby, or cold standby state, which has direct implications for the aging of the manycore system. In this paper, we use the proposed model to analyze the impact of different configurations and redundancy schemes on manycore systems' lifetime reliability. This will help designers make architecture decisions that achieve their design objectives.
The remainder of this paper is organized as follows. In Section 2, we present preliminaries and motivation for this work. The proposed lifetime reliability model for manycore systems is then discussed in detail in Section 3. Experimental results for different manycore system configurations are presented in Section 4. Finally, Section 5 concludes this paper.
PRELIMINARIES
In this paper, we consider homogeneous manycore systems that have n identical embedded cores fabricated on-chip. In order to function correctly, at least k (k ≤ n) cores need to be good. These cores share the workload designated by the operating system. Clearly, this is a k-out-of-n:G load-sharing system. Before discussing the technical details of the proposed lifetime reliability model, we present some preliminaries in this section.
IC Lifetime Reliability
Integrated circuit errors can be broadly classified into two categories: soft errors and hard errors. As soft errors caused by radiation effects do not fundamentally damage the circuit, they are not viewed as lifetime reliability threats. In this paper we mainly consider those hard errors that are permanent once they manifest, such as TDDB in the gate oxides, electromigration (EM) and stress migration (SM) in the interconnects, and thermal cycling (TC).
The above failure mechanisms have an increasingly adverse effect with technology scaling, and therefore have re-attracted research interests recently. Srinivasan et al. [19] described a so-called RAMP model that is able to dynamically track lifetime reliability of a processor according to changes in application behavior. Their model, however, is inherently inaccurate because it assumes a uniform device density over the chip and an identical vulnerability of devices to failure mechanisms. To address this problem, Shin et al. [18] introduced a structure-aware model that takes the vulnerability of basic structures of the microarchitecture (e.g., register files, latches and logic) to different types of failure mechanisms into account. Coskun et al. proposed a cycle-accurate lifetime reliability simulation methodology as well as a statistical one in [5] and used them to optimize the processor power management policy. In [20] , Srinivasan et al. studied the vulnerability of FPGAs to TDDB and EM effects.
Modeling Processor Core Behavior
We assume embedded cores execute tasks independently (an application, however, may consist of a series of tasks [14]) and one core can perform at most one task at a time. In addition, the tasks assigned to a certain core are assumed to be stored in a first-in-first-out (FIFO) buffer with infinite capacity when the core is busy. Once the core becomes available, it promptly starts to process the next task in the FIFO. As shown in Fig. 1, a core can be in active mode or spare mode in the manycore system (depending on redundancy configurations). For spare processor cores, the power supply can be reduced significantly or turned off completely; we therefore treat them as cold standby components with zero failure rate. Active cores, depending on the current workload, can be in one of two states, processing or wait, in which the cores are performing tasks or waiting for task allocation, respectively. Generally speaking, cores operate at higher temperature in processing state and hence wear out more quickly than in wait state. We therefore regard cores in wait state as warm standby components in this work, and we use $R_p(t)$ and $R_w(t)$ to denote the reliability functions of cores in processing state and wait state, respectively, where they have the same shape but different scale parameters. For example, under a Weibull distribution with shape parameter $\beta$, $R_p(t) = e^{-(t/\theta_p)^{\beta}}$ and $R_w(t) = e^{-(t/\theta_w)^{\beta}}$, wherein $\theta_p$ and $\theta_w$ are scale parameters.
According to the above discussion, if manycore systems are configured as a gracefully degrading system, embedded cores cannot be in spare mode and hence they are in either processing or wait state. The number of cores in either state at a particular moment depends on the current workload and hence is uncertain. If, however, manycore systems are configured as a standby redundant system, an embedded core can serve as a cold standby, a warm standby, or a processing core. As k cores are active, we know exactly how many cores are cold standbys, but the number of cores in processing or wait state at a specific time is again uncertain.
Related Work on Modeling k-out-of-n:G Systems
While there has been a large amount of research work on modeling the lifetime reliability of multi-component systems, most of them focused on parallel systems that are designed to carry full load, as shown in [10, 23] .
In the literature on load-sharing k-out-of-n:G systems, for the sake of simplicity, many studies (e.g., [16, 12]) assume an exponential lifetime distribution for every component, which implies a constant failure rate during a component's entire life cycle. With this assumption, the system can be represented by a discrete-state, continuous-time homogeneous Markov chain and analyzed using mature techniques [16]. The above assumption, however, implies that there is no difference between a brand-new unit and a 10-year-old one in terms of failure rates, which is obviously not true. In fact, the popularity of this assumption is mainly due to its mathematical tractability rather than its accuracy.
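The memoryless property criticized above can be checked numerically. The following is a minimal sketch, assuming a Weibull reliability function $R(t) = e^{-(t/\theta)^{\beta}}$ (which reduces to the exponential case for $\beta = 1$); the function names and the values $\theta = 7$ and $\beta = 4$ are illustrative choices, not results taken from the experiments.

```python
import math

def weibull_reliability(t, theta, beta):
    """R(t) = exp(-(t/theta)**beta); beta = 1 is the exponential case."""
    return math.exp(-((t / theta) ** beta))

def conditional_survival(age, extra, theta, beta):
    """Probability of surviving `extra` more years given survival up to `age`."""
    return (weibull_reliability(age + extra, theta, beta)
            / weibull_reliability(age, theta, beta))

theta = 7.0  # illustrative scale parameter, in years

# Exponential lifetime: a brand-new unit and a 10-year-old unit look identical.
new_exp = conditional_survival(0.0, 1.0, theta, beta=1.0)
old_exp = conditional_survival(10.0, 1.0, theta, beta=1.0)

# Weibull with beta = 4 (increasing failure rate): the old unit is far less
# likely to survive one more year than the new one.
new_wbl = conditional_survival(0.0, 1.0, theta, beta=4.0)
old_wbl = conditional_survival(10.0, 1.0, theta, beta=4.0)

print(f"exponential: new={new_exp:.4f}, 10-year-old={old_exp:.4f}")
print(f"Weibull(4):  new={new_wbl:.4f}, 10-year-old={old_wbl:.4f}")
```

Under the exponential assumption the two conditional survival probabilities coincide, which is exactly why a Markov chain suffices there and fails once the failure rate depends on age.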
In real-life systems, we expect components to experience an increasing failure rate over their life cycle, i.e., the exponential lifetime distribution does not apply. For systems with general lifetime distributions for the internal components, a Markov model cannot be used to analyze their lifetime reliability because whether a component is good or not depends on its past usage, and hence the memoryless property required for Markov modeling does not hold. This makes the mathematical analysis for systems with general lifetime distributions much more complicated. [9] studied a 1-out-of-2:G system with time-varying failure rates in a general polynomial expression format. Later, [13] presented an analytical model for components with various general lifetime distributions.
An idle component in computing systems can serve as a cold, hot, or warm standby unit, which has a zero failure rate, the same failure rate as active components, or a failure rate between cold and hot, respectively. In the above models (e.g., [9, 13]), every component in the system is assumed to conform to a single failure distribution, and hence they can only be applied to analyze systems with hot standby components. In manycore systems, as discussed earlier, an active embedded core in wait state should be treated as a warm standby component. Consequently, the above models are not applicable. [17] provided a closed-form expression for k-out-of-n:G systems with warm standby components. However, they assume the failure rates of both active and standby units are constant. Recently, [25] presented an analytical model for k-out-of-(M + N):G repairable warm standby systems that consist of two different types of components. Most prior work (e.g., [16, 9, 13]) focuses on analyzing the lifetime reliability of gracefully degrading systems, in which all embedded cores are active. For standby redundant systems, an embedded core can be a "spare" unit, and such systems involve both warm standby (wait state) and cold standby (spare state) units as well as processing cores. [8] presented a mixture model for a 1-out-of-3 standby system that contains a dedicated warm standby unit and a dedicated cold standby unit. [21] analyzed a system in which a module can alternate between cold and warm standby state. However, the system investigated in that work contains only two components. As with much prior work on this topic, both [8] and [21] are difficult to extend to general k-out-of-n:G systems.
Notations
The most widely-used concepts in reliability engineering and their representations are listed as follows and will be used throughout this paper without further explanations.
R(t): reliability function (see footnote 2)
h(t): failure rate function (hazard function), $h(t) = -\frac{dR(t)/dt}{R(t)}$
MTTF: mean time to failure, $MTTF = \int_0^{\infty} R(t)\,dt$
In addition, in the rest of the paper, we use superscript sys to distinguish the functions for the entire manycore system from that for a single core. We also use de and st to represent the system in degrading configuration and standby configuration, respectively. For manycore systems in standby configuration, subscripts i and j are used to indicate the number of active and spare cores.
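As a quick illustration of the MTTF notation, the sketch below integrates a reliability function numerically and compares the result with the well-known closed form for a Weibull lifetime, $\theta\,\Gamma(1 + 1/\beta)$. The helper names, the trapezoidal rule, and the parameter values are assumptions made for this example only.

```python
import math

def weibull_reliability(t, theta, beta):
    """R(t) = exp(-(t/theta)**beta)."""
    return math.exp(-((t / theta) ** beta))

def mttf_numeric(reliability, upper, steps=20_000):
    """Trapezoidal approximation of the integral of R(t) over [0, upper]."""
    h = upper / steps
    total = 0.5 * (reliability(0.0) + reliability(upper))
    for i in range(1, steps):
        total += reliability(i * h)
    return total * h

theta, beta = 3.0, 4.0  # illustrative scale (years) and shape parameters

# R(t) is negligible beyond a few multiples of theta, so a finite upper
# limit stands in for the infinite integration range.
numeric = mttf_numeric(lambda t: weibull_reliability(t, theta, beta),
                       upper=10.0 * theta)
closed_form = theta * math.gamma(1.0 + 1.0 / beta)

print(f"numeric MTTF     = {numeric:.6f} years")
print(f"closed-form MTTF = {closed_form:.6f} years")
```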
PROPOSED LIFETIME RELIABILITY MODEL FOR MANYCORE SYSTEMS

Queueing Model for Task Allocation
Consider a manycore system composed of a set S = {1, 2, ..., n} of identical embedded cores. Among these cores, the set of active cores is $S_1$, the set of spare cores is $S_2$, the set of faulty cores is $S_3$, and $S_1 \cup S_2 \cup S_3 = S$.
To capture the key features, we model a general-purpose parallel processing system with a central queue as a bulk-arrival $M^X/M/|S_1|$ queueing model, as shown in Fig. 2. More specifically, the application arrival to the manycore system is assumed to be a Poisson process with rate $\lambda_a$. An application may consist of a series of tasks that can be processed independently of each other [14]. We denote the number of tasks in an application as X. Let $\gamma_i$ be the probability that an application consists of i tasks (i.e., $\Pr\{X = i\}$), where $i = 1, 2, \cdots$ and $\sum_{i=1}^{\infty} \gamma_i = 1$. By using the z-transform, the probability generating function of X is $G_X(z) = \sum_{i=1}^{\infty} \gamma_i z^i$. Each task is executed by an individual active core and the service time is exponentially distributed with mean $1/\mu$. Consequently, the entire manycore system is modeled as an $M^X/M/|S_1|$ queueing system. The probability that a certain active core is occupied by tasks, i.e., the traffic intensity (also called utilization), is $\rho = \frac{\lambda_a E[X]}{|S_1|\,\mu}$ (written as $\frac{\lambda}{|S_1|\,\mu}$ in the remainder of the paper, with $\lambda = \lambda_a E[X]$ the aggregate task arrival rate), where $E[X]$ is the expected value of X and can be computed as $E[X] = G_X'(1) = \sum_{i=1}^{\infty} i\,\gamma_i$. Our approach could be easily extended to represent other kinds of manycore systems and/or task allocation mechanisms (e.g., modeling the entire manycore system as a set of M/M/1 queues).

Footnote 2: The reliability function R(t), also known as the survival function, gives the probability that a component does not fail up to time t.
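The utilization of an active core can be computed directly from the task-count distribution. The sketch below uses the standard utilization of a bulk-arrival multi-server queue, $\rho = \lambda_a E[X] / (|S_1|\,\mu)$; the function names and the numerical values (arrival rate, batch-size probabilities, service rate) are hypothetical and serve only to illustrate the calculation.

```python
def expected_batch_size(gamma):
    """E[X] = sum_i i * gamma_i, where gamma maps task count i to Pr{X = i}."""
    if abs(sum(gamma.values()) - 1.0) > 1e-9:
        raise ValueError("batch-size probabilities must sum to 1")
    return sum(i * p for i, p in gamma.items())

def utilization(lambda_a, gamma, mu, active_cores):
    """Per-core utilization rho = lambda_a * E[X] / (active_cores * mu)."""
    rho = lambda_a * expected_batch_size(gamma) / (active_cores * mu)
    if rho >= 1.0:
        raise ValueError("queue is unstable: utilization must be below 1")
    return rho

# Hypothetical workload: applications arrive at 100 per unit time and contain
# 1, 2, or 4 tasks; each core completes 10 tasks per unit time.
gamma = {1: 0.5, 2: 0.3, 4: 0.2}
mean_tasks = expected_batch_size(gamma)
rho = utilization(lambda_a=100.0, gamma=gamma, mu=10.0, active_cores=32)

print(f"E[X] = {mean_tasks:.3f}, rho = {rho:.5f}")
```

Because the utilization is inversely proportional to the number of active cores, every core failure in a gracefully degrading system raises the load on the survivors, which is the effect the lifetime model has to capture.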
Lifetime Reliability of A Single Core
To obtain the lifetime reliability of the manycore system, we need to first calculate the lifetime reliability of an individual core. In this section, we examine the manycore system with two different redundant schemes: gracefully degrading and standby redundant.
Gracefully Degrading System
In gracefully degrading manycore systems, all good cores in the system are active and they alternate between wait state (as warm standby) and processing state in their lifetimes with different ageing effects. To accommodate this issue, we define a core's accumulated time in a certain mode at time t as how long it has spent in such a state up to time t. For example, suppose a core has its first task executed from time zero to time T 1 , and its second task arrives at time T 2 (T 2 ≥ T 1 ), and suppose at time t this task has not finished yet. In this case, this core's accumulated time in processing state is (T 1 − 0) + (t − T 2 ) and its accumulated time in wait state is (T 2 − T 1 ). We thus have the following theorem.
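Theorem 1 Suppose a manycore system with gracefully degrading scheme has experienced $\ell$ core failures, in the order of occurrence time at $t_1, t_2, \cdots, t_\ell$, respectively. For any core that has survived until time $t$ ($t > t_\ell$),
(a) its accumulated time in the processing state up to time t is
$$\psi_p^{de}(t) = \sum_{j=0}^{\ell-1} \frac{\lambda}{(n-j)\mu}\,(t_{j+1} - t_j) + \frac{\lambda}{(n-\ell)\mu}\,(t - t_\ell), \quad t_0 \equiv 0 \qquad (1)$$
(b) its accumulated time as warm standby up to time t is
$$\psi_w^{de}(t) = t - \psi_p^{de}(t)$$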
Proof Before the first core failure, there are n active cores in the system. In such an $M^X/M/n$ queueing system, the utilization of each core is $\rho_n = \frac{\lambda}{n\mu}$. Since the time scale of lifetime reliability (in years) is usually much larger than that of task processing, the accumulated time in the processing state from time zero to $t_1$ can be approximated as $\rho_n \cdot t_1 = \frac{\lambda}{n\mu} t_1$. After each failure, the system is reconfigured to a system with one fewer module. Thus from $t_j$ to $t_{j+1}$, there are $(n-j)$ active cores in the system ($0 < j \le \ell - 1$). The system can therefore be modeled as an $M^X/M/(n-j)$ queueing system. Similarly, the utilization of any surviving core is $\rho_{n-j} = \frac{\lambda}{(n-j)\mu}$. Its accumulated time in processing state in this period is therefore $\frac{\lambda}{(n-j)\mu}(t_{j+1} - t_j)$, and that from $t_\ell$ to $t$ is $\frac{\lambda}{(n-\ell)\mu}(t - t_\ell)$.
The summation of these $(\ell + 1)$ terms is Equation (1).
Since a surviving core can be in either processing or wait state from time zero to t, its accumulated time in wait state is t minus its accumulated time in processing state, which gives part (b).
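The piecewise accumulation used in the proof can be sketched in a few lines. The function below follows the argument above: between consecutive failures the per-core utilization is $\lambda / ((n-j)\mu)$, and the wait time is whatever remains of the elapsed time. The function name, the `load` parameter (standing for $\lambda/\mu$), and the numbers in the example are illustrative assumptions.

```python
def accumulated_times_degrading(t, failure_times, n, load):
    """Accumulated (processing, wait) time of a core that has survived to t.

    failure_times: increasing failure instants t_1 < ... < t_l, all below t.
    n: number of cores active at time zero.
    load: lambda / mu, so utilization with m active cores is load / m.
    """
    if any(b <= a for a, b in zip(failure_times, failure_times[1:])):
        raise ValueError("failure times must be strictly increasing")
    if failure_times and not (0.0 < failure_times[0] and failure_times[-1] < t):
        raise ValueError("failure times must lie strictly between 0 and t")
    if load > n - len(failure_times):
        raise ValueError("utilization would exceed 1 after the last failure")

    boundaries = [0.0, *failure_times, t]
    processing = 0.0
    for j in range(len(boundaries) - 1):
        rho = load / (n - j)  # n - j cores share the load in interval j
        processing += rho * (boundaries[j + 1] - boundaries[j])
    return processing, t - processing

# Example: 4 cores, lambda/mu = 1.5, failures at years 1 and 2, observed at year 3.
psi_p, psi_w = accumulated_times_degrading(3.0, [1.0, 2.0], n=4, load=1.5)
print(f"processing = {psi_p} years, wait = {psi_w} years")
```

In this example the utilization rises from 0.375 to 0.5 to 0.75 as cores fail, so the surviving cores spend an increasing share of each year in the more stressful processing state.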
With different aging effects in processing state and wait state for a particular core, we need to combine the accumulated time in these two modes in a unified manner to calculate the lifetime reliability of embedded cores. Recall that we assume the reliability functions in wait state and processing state have the same shape but different scale parameters (see Section 2). The scale parameter is a value by which t is divided and we use θ w and θ p to denote it in wait state and processing state, respectively. Given a general reliability function defined as R(t, θ) (abbreviated as R (t)), the reliability functions of processing state and wait state are R(t, θ p ) (denoted as R p (t)) and R(t, θ w ) (denoted as R w (t)) respectively. After unification, we have
The following theorem provides the relationship between a single core's reliability and its accumulated times in different states, and enables their integration into one analytical model.
Theorem 2
Given a gracefully degrading manycore system that has experienced $\ell$ core failures, which occur at $t_1, t_2, \ldots, t_\ell$, respectively, the probability that a certain core survives at time $t$ ($t > t_\ell$), provided that it has survived until time $t_\ell$, is given by R
where
Proof The proof of this theorem is presented in the Appendix.
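We are also interested in the probability density function (p.d.f.) of core failures. By the definition of reliability and the corresponding p.d.f., given that a manycore system has experienced $\ell$ core failures at $t_1, t_2, \cdots, t_\ell$, respectively, the probability that a certain core fails in an infinitesimal interval $dt_{\ell+1}$ at time $t_{\ell+1}$ is given by $f^{de}(t_{\ell+1})\,dt_{\ell+1}$, where $f^{de}$ is the p.d.f. obtained as the negative time derivative of the corresponding reliability function.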
Standby Redundant System
In a standby redundant manycore system, spare cores are put aside in the beginning and are activated only when failures occur. In other words, different from that in gracefully degrading system, the active cores at time t in standby redundant system may be initially configured as spare ones. Once a spare core converts into active mode, it starts its aging process and will never return to spare mode (see Fig. 2 ). To capture this feature, we define a core's birth time t b as the time point when it is configured as an active one. Before birth time t b , a processor core is in the cold standby mode and has zero failure rate. After that, it alternates between processing state and wait state.
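Theorem 3 In a standby redundant manycore system, for any core with birth time $t_b$ that has survived until time $t$ ($t > t_b$),
(a) its accumulated time in the processing state up to time t is
$$\psi_p^{st}(t, t_b) = \frac{\lambda}{k\mu}\,(t - t_b)$$
(b) its accumulated time as warm standby up to time t is
$$\psi_w^{st}(t, t_b) = \left(1 - \frac{\lambda}{k\mu}\right)(t - t_b)$$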
Proof A functioning manycore system with standby redundant scheme always keeps k active cores and leaves the remaining good cores as spares. Therefore, the utilization of an active core remains $\rho_k = \frac{\lambda}{k\mu}$. Similar to the analysis of the system with gracefully degrading scheme, the accumulated time of a core in processing state from $t_b$ to $t$ can be approximated as $\rho_k \cdot (t - t_b) = \frac{\lambda}{k\mu}(t - t_b)$. As can be observed, a processor core's accumulated time in either state depends only on its birth time $t_b$ and is independent of the past core failures. As for the accumulated time in wait state, since the core is in either processing or wait state from $t_b$ to $t$, $\psi_w^{st}(t, t_b) = (t - t_b) - \frac{\lambda}{k\mu}(t - t_b) = \left(1 - \frac{\lambda}{k\mu}\right)(t - t_b)$. Thus, we get Equation (7).
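For comparison with the gracefully degrading case, the accumulated times in a standby redundant system need only the birth time. The following minimal sketch follows the proof above (constant utilization $\lambda/(k\mu)$ after activation, no aging before it); the function name, the `load` parameter for $\lambda/\mu$, and the example values are assumptions for illustration.

```python
def accumulated_times_standby(t, birth_time, k, load):
    """Accumulated (processing, wait) time of a core activated at birth_time.

    k: number of cores kept active; load: lambda / mu.
    Before birth_time the core is a cold standby and accumulates no time
    in either state.
    """
    if t < birth_time:
        raise ValueError("t must not precede the birth time")
    if load > k:
        raise ValueError("utilization load / k must not exceed 1")
    rho = load / k  # unchanged by failures, since k cores stay active
    processing = rho * (t - birth_time)
    wait = (t - birth_time) - processing
    return processing, wait

# Example: 32 active cores, lambda/mu = 10, a spare activated at year 1,
# observed at year 3.
psi_p, psi_w = accumulated_times_standby(3.0, birth_time=1.0, k=32, load=10.0)
print(f"processing = {psi_p} years, wait = {psi_w} years")
```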
Theorem 4 In a manycore system with standby redundant scheme, the probability that a certain core with birth time t b survives at time t (t > t b ) is given by
Proof Because the failure rate of cold standbys is considered to be negligible, $\forall \tau < t_b$: $R^{st}(\tau) = R(0)$. After becoming active, a core alternates between wait state and processing state. Apart from this, the proof of Theorem 4 is the same as the proof of Theorem 2.
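Similar to the analysis in Section 3.2.1, the probability that a certain core with birth time $t_b$ fails in an infinitesimal interval $dt_{\ell+1}$ at time $t_{\ell+1}$ is given by $f^{st}(t_{\ell+1}, t_b)\,dt_{\ell+1}$, where $f^{st}$ is the p.d.f. corresponding to the reliability function in Theorem 4.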
Lifetime Reliability of the Entire Manycore System
After obtaining the lifetime reliability of a single core, we move to study the lifetime reliability of the entire manycore system in this section. Again, we first introduce the reliability of gracefully degrading systems and then investigate that of standby redundant systems.
Gracefully Degrading System
As the manycore system is functioning when it contains no fewer than k good cores, it may contain $k, k+1, \cdots, n$ good cores, all in active mode. Let $P^{sys,de}_{n-\ell}(t)$ be the probability that the manycore system has $(n-\ell)$ active cores at time t. The system reliability $P^{sys,de}(t)$ can therefore be expressed as
$$P^{sys,de}(t) = \sum_{\ell=0}^{n-k} P^{sys,de}_{n-\ell}(t)$$
Hence, the mean time to failure of the system can be written as
$$MTTF^{sys,de} = \int_0^{\infty} P^{sys,de}(t)\,dt = \sum_{\ell=0}^{n-k} \int_0^{\infty} P^{sys,de}_{n-\ell}(t)\,dt$$
Clearly, it is necessary to determine $P^{sys,de}_{n-\ell}(t)$ to compute $MTTF^{sys,de}$. Since all components are in active mode before the first core failure, $P^{sys,de}_n(t)$ is simply the probability that all n cores survive at time t, i.e., $P^{sys,de}_n(t) = [R^{de}(t)]^n$.
The event that the system experiences exactly one failure before t is a union of a set of continuous elementary events in which a failure occurs in an infinitesimal interval $dt_1$ at time $t_1$; the probability that a certain core fails during $dt_1$ is $f^{de}(t_1)\,dt_1$. After this failure, the system is reconfigured as a system with $(n-1)$ cores within a very short reconfiguration time. Because of this failure, the load on each surviving core increases, and the load strongly influences the aging of the remaining cores. We denote by $P^{sys,de}_{n-1}(t \mid t_1)$ the probability that all other $(n-1)$ cores survive up to time t given this failure.
This elementary event consists of two independent events. Therefore, the probability of this elementary event is $P^{sys,de}_{n-1}(t \mid t_1) \cdot f^{de}(t_1)\,dt_1$. By the theorem of total probability, the probability of the union event $P^{sys,de}_{n-1}(t)$ is obtained by integration over $t_1$. Besides, since there are n good cores before the first failure, any of which may be the first to fail, we obtain
$$P^{sys,de}_{n-1}(t) = n \int_0^t P^{sys,de}_{n-1}(t \mid t_1)\, f^{de}(t_1)\,dt_1$$
Using the same argument, by extending the event of only one failure to include $\ell$ failures at time $t_1, t_2, \cdots, t_\ell$, the general term can be written as Equation (16).
where P sys,de
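The nested integrals for a gracefully degrading system are tedious to evaluate by hand, so a Monte Carlo estimate is a convenient sanity check. The sketch below is not the analytical model of this paper: it assumes a cumulative-exposure rule, in which a core fails once its scaled age $\psi_p/\theta_p + \psi_w/\theta_w$ reaches a random threshold drawn from a unit-scale Weibull distribution. That rule, all function names, and the parameter values are assumptions introduced for illustration.

```python
import math
import random

def sample_base_weibull(rng, beta):
    """Draw a unit-scale Weibull threshold by inverse-transform sampling."""
    return (-math.log(1.0 - rng.random())) ** (1.0 / beta)

def degrading_lifetime(thresholds, k, load, theta_p, theta_w):
    """System lifetime for one sample of per-core wear thresholds.

    ASSUMPTION (not from the source): a core fails when its scaled exposure
    psi_p/theta_p + psi_w/theta_w reaches its threshold. All cores are active
    from time zero, so they age identically and fail in threshold order.
    """
    n = len(thresholds)
    if not 0 < k <= n or load > k:
        raise ValueError("need 0 < k <= n and load <= k")
    ordered = sorted(thresholds)
    time, exposure = 0.0, 0.0
    for j in range(n - k + 1):  # the (n-k+1)-th failure brings the system down
        rho = load / (n - j)    # utilization with n - j active cores
        rate = rho / theta_p + (1.0 - rho) / theta_w  # exposure per unit time
        time += (ordered[j] - exposure) / rate
        exposure = ordered[j]
    return time

def estimate_mttf(n, k, load, theta_p, theta_w, beta, runs=2000, seed=1):
    """Average simulated lifetime over `runs` independent systems."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(runs):
        thresholds = [sample_base_weibull(rng, beta) for _ in range(n)]
        total += degrading_lifetime(thresholds, k, load, theta_p, theta_w)
    return total / runs

if __name__ == "__main__":
    for m in range(0, 5):
        mttf = estimate_mttf(32 + m, 32, load=10.0, theta_p=3.0,
                             theta_w=10.0, beta=4.0)
        print(f"m = {m}: estimated MTTF = {mttf:.3f} years")
```

The estimates produced this way depend on the assumed exposure rule and should not be expected to reproduce the tabulated results exactly; they are mainly useful for checking trends such as the diminishing return of each additional redundant core.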
Standby Redundant System
A standby redundant manycore system is functioning if it contains at least k good cores, i.e., $|S_1| + |S_2| \ge k$. Among these good cores, k are configured as active ones. The number of spare cores in the system thus can be $0, 1, \cdots, (n-k)$. Therefore, similar to the analysis of the gracefully degrading system, we have
$$P^{sys,st}(t) = \sum_{\ell=0}^{n-k} P^{sys,st}_{k,n-k-\ell}(t)$$
where $P^{sys,st}_{k,n-k-\ell}(t)$ is the probability that a manycore system has exactly k active cores and $(n-k-\ell)$ spare cores at time t.
The probability that no failure occurs in such a system up to time t equals the probability that all k active cores with birth time zero survive at t, that is, $P^{sys,st}_{k,n-k}(t) = [R^{st}(t, 0)]^k$.
To compute the probability that exactly one core failure occurs up to t, we first analyze the event that a certain core with birth time zero fails in a small interval $dt_1$ at $t_1$; the probability for this is $f^{st}(t_1, t_0)\,dt_1$ (for ease of discussion, let $t_0 \equiv 0$). In addition, the probability that the remaining $(k-1)$ active cores with birth time zero and the active core with birth time $t_1$ survive at t can be expressed as
$$P^{sys,st}_{k,n-k-1}(t \mid t_1) = [R^{st}(t, t_0)]^{k-1} \cdot R^{st}(t, t_1)$$
Again, the unconditional probability is obtained by multiplying the probabilities of the two independent events and integrating over $t_1$. Also, because there are k cores with birth time zero in the system before the first failure, we have
$$P^{sys,st}_{k,n-k-1}(t) = k \int_0^t P^{sys,st}_{k,n-k-1}(t \mid t_1)\, f^{st}(t_1, t_0)\,dt_1$$
The reliability of a core depends on its birth time and is independent of the occurrence times of past failures of the manycore system. According to their birth times, active cores in a manycore system may belong to more than one type. Although the first failure of the manycore system must occur on a core with birth time zero, the following ones may occur on other types of cores. For example, after the first failure, there are two types of cores in the system: $(k-1)$ cores with birth time zero and one core with birth time $t_1$. The second failure may happen on either type. Consequently, to compute the reliability of a manycore system having two failures, we should consider two cases: the birth times of the two failed cores are $t_0, t_0$; or $t_0, t_1$. Each one corresponds to a different failure rate. Therefore, when $\ell \ge 2$ the expression of $P^{sys,st}_{k,n-k-\ell}(t)$ is more complex. Different from the analysis of the gracefully degrading system, the i-th failure of the manycore system with standby redundant scheme should be described by two parameters: occurrence time $t_i$ and index of birth time $x_i$, representing that the i-th failure occurs on a core with birth time $t_{x_i}$ at time $t_i$. Then, the key features of the past $\ell$ failures can be captured by two $1 \times \ell$ vectors: $\mathbf{t}_{1\times\ell} = (t_1, t_2, \cdots, t_\ell)$ and $\mathbf{x}_{1\times\ell} = (x_1, x_2, \cdots, x_\ell)$.
To preserve the order of failures, vector $\mathbf{t}_{1\times\ell}$ satisfies $t_1 < t_2 < \cdots < t_\ell$. In addition, since the i-th failure must occur on a core that is initially configured as an active one or activated because of the past $(i-1)$ failures, the birth time $t_{x_i} \in \{t_0, t_1, \cdots, t_{i-1}\}$. In other words, vector $\mathbf{x}_{1\times\ell}$ satisfies $\forall\, 1 \le i \le \ell: x_i \in \{0, 1, \cdots, i-1\}$. Also, since there is at most one core with birth time $t_i$ ($1 \le i \le \ell$) in the manycore system having $\ell$ core failures, the birth time index vector $\mathbf{x}_{1\times\ell}$ also satisfies $\forall\, i \ne j$ with $x_i, x_j \ne 0$: $x_i \ne x_j$. Let $\pi_{i,j}$ be the number of i's in the first j elements of vector $\mathbf{x}_{1\times\ell}$. It is also important to note that the number of cores with birth time zero in a manycore system is no more than k. Thus, vector $\mathbf{x}_{1\times\ell}$ should also meet the constraint $\pi_{0,\ell} \le k$.
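The constraints on the birth-time index vector are easy to enumerate for small numbers of failures, which helps when checking how many failure cases the standby analysis has to cover. The sketch below is illustrative only; the function name is hypothetical, and it simply filters all candidate vectors by the three constraints stated above.

```python
from itertools import product

def valid_birth_index_vectors(num_failures, k):
    """Enumerate all vectors x = (x_1, ..., x_l) allowed by the constraints.

    1. x_i is one of 0, 1, ..., i-1 (a core must exist before it can fail).
    2. Nonzero entries are pairwise distinct (one core per birth time t_i).
    3. At most k entries are zero (only k cores have birth time zero).
    """
    ranges = [range(i) for i in range(1, num_failures + 1)]
    vectors = []
    for x in product(*ranges):
        nonzero = [v for v in x if v != 0]
        if len(nonzero) != len(set(nonzero)):
            continue  # two failures attributed to the same activated spare
        if x.count(0) > k:
            continue  # more failures of original cores than exist
        vectors.append(x)
    return vectors

# With two failures, the failed cores have birth times (t0, t0) or (t0, t1).
print(valid_birth_index_vectors(2, k=32))
# With three failures the number of cases grows quickly.
print(valid_birth_index_vectors(3, k=32))
```

For two failures this reproduces the two cases discussed in the text, $(t_0, t_0)$ and $(t_0, t_1)$.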
Initially, there are k active cores with birth time zero in the manycore system. After $\ell$ failures, $\pi_{0,\ell}$ of them have failed and the remaining $(k - \pi_{0,\ell})$ are still functioning. After the i-th failure, a core is activated and its birth time is $t_i$. At time t, it may be either functioning or failed, which is described by $\mathbf{x}_{1\times\ell}$. That is, the number of active cores with birth time $t_i$ ($0 < i \le \ell$) is $(1 - \pi_{i,\ell})$. The conditional probability that, given the past $\ell$ failures described by $\mathbf{t}_{1\times\ell}$ and $\mathbf{x}_{1\times\ell}$, no more failures occur up to time t ($t > t_\ell$) is therefore expressed as
$$[R^{st}(t, t_0)]^{\,k - \pi_{0,\ell}} \cdot \prod_{i=1}^{\ell} [R^{st}(t, t_i)]^{\,1 - \pi_{i,\ell}}$$
Next, consider the event that the r-th failure occurs in a small interval $dt_r$ at $t_r$ on a certain core with birth time $t_{x_r}$, given the description of the past $(r-1)$ failures; the probability for this is $f^{st}(t_r, t_{x_r})\,dt_r$. Now, we are able to compute the probability that the past $\ell$ failures of the manycore system can be described by vector $\mathbf{x}_{1\times\ell}$, as Equation (23). To determine the number of cores on which the r-th failure can occur, denoted $N(x_r \mid \mathbf{x}_{1\times(r-1)})$, two cases are considered: the r-th failure occurs on a core initially configured as an active one (i.e., $x_r = 0$) or as a spare one (i.e., $x_r \ne 0$). For the first case, there are $(k - \pi_{0,r-1})$ cores in the system belonging to this type before the r-th failure; for the second case, only one core has birth time $t_{x_r}$. Therefore, we have
$$N(x_r \mid \mathbf{x}_{1\times(r-1)}) = \begin{cases} k - \pi_{0,r-1}, & x_r = 0 \\ 1, & x_r \ne 0 \end{cases}$$
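The two-case count of candidate cores for the r-th failure can be written as a small helper. This is a sketch of the case analysis above; the function name and argument layout are hypothetical, and the extra validation (rejecting an index that refers to a core that does not exist yet or has already failed) is an addition for safety rather than part of the source.

```python
def candidate_cores(x_prefix, x_r, k):
    """Number of cores on which the r-th failure can occur.

    x_prefix: birth-time indices (x_1, ..., x_{r-1}) of the earlier failures.
    x_r: birth-time index of the r-th failure.
    k: number of cores that are active (birth time zero) at the start.
    """
    if not 0 <= x_r <= len(x_prefix):
        raise ValueError("x_r must refer to a core activated before failure r")
    if x_r == 0:
        # Original cores: k minus those that have already failed.
        return k - x_prefix.count(0)
    if x_r in x_prefix:
        raise ValueError("the core with this birth time has already failed")
    # An activated spare: exactly one core has birth time t_{x_r}.
    return 1

# After failures with birth-time indices (0, 1), a third failure can hit one
# of the 31 remaining original cores, or the single core born at t_2.
print(candidate_cores((0, 1), 0, k=32))
print(candidate_cores((0, 1), 2, k=32))
```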
To cover all possible failure cases, let X 1× be the set of all possible
NUMERICAL RESULTS
Experimental Setup
In this section, we present results for the lifetime reliability of manycore systems with different redundancy schemes and various workloads. Two widely used non-exponential lifetime distributions are assumed in the experiments: Weibull and linear failure rate. For the Weibull distribution, the reliability function can be written as $R(t, \theta) = e^{-(t/\theta)^{\beta}}$. In both distributions $\theta$ is the scale parameter, which differs between processing state ($\theta_p$) and wait state ($\theta_w$) and is typically expressed in years or hours. Clearly, $\theta_p < \theta_w$.
The behavior of the Weibull distribution, whose failure rate function is $h(t) = \frac{\beta}{\theta} \cdot \left(\frac{t}{\theta}\right)^{\beta-1}$, depends strongly on its shape parameter $\beta$. When $\beta = 1$, a Weibull distribution reduces to an exponential one, i.e., the failure rate is constant. For $\beta > 1$, it has an increasing failure rate. For $0 < \beta < 1$, the failure rate decreases with time. We set $\beta = 4$ in our experiments. The linear failure rate distribution has a hazard function that is linear in t. Also, we set the number of embedded cores in the manycore system to be $(32 + m)$. Therefore, if it is configured as a gracefully degrading system, there are initially $(32 + m)$ active cores; if the standby redundancy configuration is used, it consists of 32 active cores and m spare cores at time zero.
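The three regimes of the Weibull shape parameter can be seen by evaluating the failure rate function at a few ages. The sketch below implements $h(t) = \frac{\beta}{\theta}(t/\theta)^{\beta-1}$ as stated above; the function name and the sample values of $\theta$ and t are illustrative assumptions.

```python
def weibull_hazard(t, theta, beta):
    """Weibull failure rate h(t) = (beta / theta) * (t / theta)**(beta - 1), t > 0."""
    if t <= 0.0:
        raise ValueError("t must be positive")
    return (beta / theta) * (t / theta) ** (beta - 1.0)

theta = 3.0  # illustrative scale parameter, in years
for beta, label in [(0.5, "decreasing"), (1.0, "constant"), (4.0, "increasing")]:
    rates = [weibull_hazard(t, theta, beta) for t in (1.0, 2.0, 3.0)]
    formatted = ", ".join(f"{r:.4f}" for r in rates)
    print(f"beta = {beta}: h(1), h(2), h(3) = {formatted}  ({label})")
```

With $\beta = 4$, the setting used in the experiments, the failure rate grows with the cube of the core's age, so most failures are concentrated late in the lifetime.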
Experimental Results and Discussion
First of all, we discuss the issue that attracts the most attention: how much benefit can be expected from adding core-level redundancy to a manycore system? As shown in Table 1, if we assume an exponential lifetime distribution, the sojourn time depends only on the number of active cores in the system and is independent of the aging effect. We therefore observe great lifetime enhancement (around m times extension), as shown in the last column of Table 1, where we set $\theta_w = \theta_p = 7$. This result does not conform to common sense. In practice, IC products experience increasing failure rates. Therefore, if we use a Weibull or linear failure rate distribution to approximate such wearout effects, we obtain more reasonable results. Fig. 3 shows the lifetime enhancement achieved by core-level redundancy with Weibull and linear failure rate distributions. The larger m is, the longer the lifetime of the manycore system, at the cost of a larger area overhead. As m increases, the lifetime improvement gradually slows down. For example, see the curve for the $\theta_p = 3$, $\theta_w = 25$ case in Fig. 3(b). The addition of the first redundant core results in a 60.8% lifetime extension; the second, third, and fourth lead to 45.2%, 42.2%, and 23.3% extensions, respectively. Consequently, designers need to set m to an appropriate value to trade off area overhead against lifetime extension, rather than setting m as large as possible under the area overhead constraints.
When m is fixed, from Fig. 3 , we can observe that the lifetime enhancement also depends on the scale parameters (i.e., θ p and θ w ) in the reliability functions. There are two extreme cases for the relationship between these two parameters. When θ p = θ w , there is no difference between wait state and processing state in terms of reliability function. Essentially, this case is the so-called hot standby scheme. Another extreme case is θ w → ∞, which means an embedded core in wait state is essentially a cold standby component and cannot fail. In other cases (e.g., θ p = 3 and θ w = 10), embedded cores in wait state serve as the warm standbys. Given the same θ p , since a core's accumulated time in either processing or wait state is independent of scale parameters, it is reasonable to expect that the lifetime increases as θ w increases.
A closer observation for both redundant schemes is shown in Table 2 and Table 3. As an example, we set $\theta_p = 3$, $\theta_w = 10$, and $\lambda/\mu = 10$. Due to the increasing failure rate, the manycore system contains no faulty cores for most of its lifetime, especially for systems experiencing more severe wearout effects. For example, as shown in Line 7 of Table 2, the sojourn time of a gracefully degrading system with 4 redundant cores in the 0-failure state is 2.2452 years, while the expected value of its whole lifetime is 3.2864 years. In such a case, one core's failure may imply that the entire system is old and we cannot expect much residual useful lifetime.
Next, we show the impact of workload, namely $\lambda/\mu$, on the lifetime reliability, as depicted in Fig. 4. We set $\theta_p = 3$; results are presented for four different cases. When the workload increases, the system's lifetime significantly decreases (e.g., see the curves for the $\theta_w = 10$, Weibull distribution case). But the scale of the decrease in lifetime is smaller than that of the increase in workload. We attribute this phenomenon to the wearout effects of warm standbys. Nevertheless, the workload has a significant influence on the lifetime reliability of manycore systems and deserves close attention from designers.
Table 4 compares the manycore system's lifetime reliability under the gracefully degrading scheme with that under the standby redundant scheme. We set $\theta_p = 3$ and $\lambda/\mu = 20$. It can be observed that the difference between these two schemes depends strongly on the core failure function in wait state. Most prior work based on the hot standby assumption (e.g., [2]) claims that the standby redundant system has a longer lifetime but worse performance when compared with the gracefully degrading system. When we assume the embedded cores in wait state have the same failure functions as those in processing state (i.e., hot standby), our model leads to the same conclusion. However, if we consider the realistic warm standby situation, the difference is smaller than that based on the hot standby assumption. Moreover, when the aging effect in wait state is much slower than that in processing state, the lifetime of the standby redundant system may be even shorter than that of the gracefully degrading system (e.g., the eighth column). In the extreme case in which cores in wait state are assumed to be cold standbys, the gracefully degrading system is better (the last column). In this sense, a reasonable assumption is important because it prevents designers from drawing misleading conclusions. In addition, this difference also depends on m. It is very small when m is small, and it becomes slightly larger as m increases. Both Weibull and linear failure rate distributions follow the above observations.
CONCLUSION AND FUTURE WORK
State-of-the-art technology enables the integration of a large number of embedded cores in a single computing system. The lifetime reliability of such large circuits is a major concern because IC failure mechanisms have an increasingly adverse impact with technology scaling. In this paper, we propose a comprehensive analytical model to estimate the lifetime reliability of manycore systems, which helps designers make better decisions at the architecture level.
The model presented in this paper could be extended to take some other aspects into account. For example, in this model we assume the wearout of active cores can be in two discrete states with different reliability functions. In practice, as the wearout effects highly depend on temperature and voltage, cores in the same active state can experience different reliability functions. The proposed model can be extended to deal with continuous states by expressing the scale parameter as a function of temperature and voltage. Also, the assumption that the failure rate in spare state is negligible can be abandoned.
Appendix A. Proof of Theorem 2
Any core can start with either wait state or processing state at time zero. Suppose a core is in wait state from 0 to T 1 , and then converts into the processing state and stays for (T 2 − T 1 ), and so on. Assuming it is surviving at t, we obtain a subdivision of the time [0,t]: 0 = T 0 < T 1 < T 2 < ··· < T d = t. We know the initial reliability of this core is given by R
