ABSTRACT
INTRODUCTION
As technology advances, industry has started to place multiple processor cores on a single silicon die to improve performance through parallel execution. Such chip multiprocessors, also known as multicore or manycore processors (depending on the number of cores on the die), are much more power-efficient than unicore processors running at extremely high frequencies and have become increasingly popular [4, 6].
Due to device wearout, integrated circuits (ICs) suffer from various types of intrinsic failures, which manifest themselves after some time of operation and determine the circuits' service life. With relentless technology scaling, the lifetime reliability of high-performance ICs has become a serious concern for the industry [10, 16, 18, 13]. Major intrinsic failure mechanisms include time-dependent dielectric breakdown (TDDB) in the gate oxides, electromigration (EM) in the interconnects, negative bias temperature instability (NBTI) stress that shifts PMOS transistor threshold voltages, and thermal cycling. Many widely accepted reliability models for these failure mechanisms have been proposed and empirically validated by academia and industry [3, 1, 2, 22]. These models, however, are not readily applicable to characterizing the lifetime reliability of manycore processors, because they assume constant temperature and voltage, whereas these values vary significantly at runtime.
One method to obtain defect-tolerance capabilities is to incorporate redundant circuits in a system and use them as replacements when some units are faulty. This strategy can also be used to extend the service life of IC products. In particular, for manycore processors, employing core-level redundancy is a more attractive solution than introducing complex microarchitecture-level redundancy and has been practiced in the industry. There are, however, many ways to make use of the redundant cores. We can configure some cores as standbys and use them only when some of the active cores fail. We can also activate all cores from the beginning and remove the faulty cores during the system's lifetime. Moreover, we have the freedom to dynamically configure which cores serve as active cores and which serve as spares at a specific time. As ICs' wearout-related failure rates depend strongly on operational conditions such as temperature and voltage, these strategies result in different aging stress on processors. How to characterize the lifetime reliability of manycore processors under different usage strategies is therefore an important and relevant problem.
To address the above problem, in this paper we explicitly consider the temperature variations caused by workloads in our analytical model to estimate the lifetime reliability of manycore processors with various redundancy schemes. Specifically, we introduce a parameter, the wearout rate, to reflect a core's aging effect in its different operational states; it is computed from the temperature distribution of the core. We then model the lifetime reliability of manycore processors using wearout rates. Finally, extensive experiments are conducted to compare manycore processors in terms of both lifetime reliability and performance, under various workloads, service time distributions, and redundancy configurations.
The remainder of this paper is organized as follows. Section 2 reviews related prior work and motivates this paper. Our proposed analytical model for the lifetime reliability of a processor core is then detailed in Section 3 and we use this model to investigate the impact of various redundancy schemes on the service life of manycore processors in Section 4. Next, Section 5 and Section 6 present our experimental methodology and experimental results, respectively. Finally, Section 7 concludes this work.
RELATED WORK AND MOTIVATION
Processor lifetime reliability is significantly affected by its operating conditions, which vary with the applications running on the processor. In [16, 17], Srinivasan et al. proposed an application-aware architecture-level model, namely the RAMP model, which is able to dynamically track the lifetime reliability of a processor according to changes in application behavior. Later, Shin et al. [14] defined reference circuits and introduced a structure-aware model that takes into account the vulnerability of basic microarchitecture structures (e.g., register files, latches, and logic) to different types of failure mechanisms. Both Srinivasan's and Shin's models target unicore architectures. Coskun et al. [5] introduced two analytical frameworks for the lifetime reliability of multicore systems: a cycle-accurate simulation methodology and a statistical one, assuming uniform device density.
Some of the above models (e.g., [16, 5]) assumed exponential failure distributions (i.e., a constant failure rate) and thus cannot capture the processors' accumulated aging effects. In practice, we expect increasing failure rates as systems grow older, and it is suggested to use non-exponential distributions, such as the Weibull and/or lognormal distribution, to describe the influence of hard errors [17, 1, 20]. Consider NBTI as an example: the increase of threshold voltage $\Delta V_{th}$ at time $t$ depends strongly on the usage history of the transistors. In [17], the authors modeled processors with microarchitecture-level redundancy as series-parallel failure systems with lognormal failure distributions and used a simple MIN-MAX analysis to determine the system lifetime. This model, however, is not applicable to analyzing the lifetime reliability of manycore processors with core-level redundancy, because it cannot reflect the load-sharing feature of manycore processors. More importantly, the series-parallel model is not applicable to many often-used configurations, such as standby redundant systems, wherein some cores start their service life only when permanent core failures occur in the system.
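The difference between a constant and an increasing failure rate can be illustrated numerically. The following is a minimal Python sketch, not part of the cited models, that contrasts the hazard of an exponential lifetime with that of a Weibull lifetime. The parameter values in the usage note are illustrative assumptions.

```python
import math


def exponential_hazard(t, rate):
    """Hazard of an exponential lifetime: constant, independent of the age t."""
    return rate


def weibull_hazard(t, theta, beta):
    """Hazard h(t) = (beta / theta) * (t / theta)**(beta - 1) of a Weibull lifetime."""
    return (beta / theta) * (t / theta) ** (beta - 1)


def weibull_reliability(t, theta, beta):
    """Reliability R(t) = exp(-(t / theta)**beta) of a Weibull lifetime."""
    return math.exp(-((t / theta) ** beta))
```

For instance, with an assumed scale of 10 years and a shape parameter of 2.5, `weibull_hazard(8, 10, 2.5)` is larger than `weibull_hazard(2, 10, 2.5)`, whereas `exponential_hazard` returns the same value at both ages, so the exponential model cannot express accumulated aging.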
Recently, Huang and Xu [9, 11] developed a high-level analytical model for the lifetime reliability of manycore processors, which takes arbitrary failure distributions and the load-sharing feature into account. In this work, a processor core is assumed to be in one of three possible states: processing state, wait state, and spare state, each corresponding to a unique failure distribution. This assumption, however, oversimplifies the problem, because the lifetime reliability of a processor core depends strongly on its operational temperature, which varies with the applications running on it. That is, even if two cores are in the same state and have the same usage history, they do not necessarily have the same failure rate. We therefore need to extend the discrete states into a series of continuous states for a more accurate estimation of the system's lifetime reliability, by taking into account the temperature and structural information that affect it.
The above observations have motivated the work studied in this paper.
PROPOSED ANALYTICAL MODEL FOR THE LIFETIME RELIABILITY OF PROCESSOR CORES
As discussed above, the circuit wearout effects are related to its operational status such as temperature, voltage and frequency, which are not explicitly considered in the analytical model in [9, 11] . In this section, we first examine the impact of these factors on processor cores' lifetime reliability. Next, we consider the impact of workloads by mapping them into the different temperature distributions of processor cores.
Impact of Temperature, Voltage, and Frequency
To examine the impact of temperature, supply voltage, and clock frequency on the wearout effect of a single processor core, we start with the simplest case: no failures occur in the system up to time t. Under such circumstances, the task interarrival time distribution of a core is fixed up to time t.
We use the notation $R(t, \theta)$ to denote a general reliability function, where $\theta$ represents the general scale parameter by which time $t$ is divided; it depends on the temperature and the processor's execution mode. For example, the commonly used Weibull failure distribution has the form $R(t, \theta) = e^{-(t/\theta)^{\beta}}$, where $\beta$ is the shape parameter. Without ambiguity, we hereafter drop $\theta$ from the notation because of its generality and refer to the general reliability function as $R(t)$. Note that $R(t)$ is not necessarily an exponential distribution.
Without loss of generality, we consider a core can be in any state s of set S. An example of set S is defined in [9] , namely, {processing, wait, spare}. 
Next, substituting $\tau = t_1$ into Eq. (1) yields the core's reliability at the end of the first sub-interval. Because of the continuity of the reliability function, the reliability at the beginning of the second sub-interval has the same value as Eq. (2). With this condition, we express the reliability in the second sub-interval
2) By the same argument and the limiting process, at time t we have
In this expression, the state parameter s j is in the set S for any j.
To simplify Eq. (3), we introduce a filter function over the time horizon, s(t, V, f), such that it equals 1 if the core is in state s with voltage V and frequency f at time t and 0 otherwise. With this notation, Eq. (3) comes down to
In this equation, we integrate
over $t$. To integrate over $dT$ ($T$ is a function of $t$), we denote by $\psi_s(T, V, f)\,dT$ the accumulated time in state $s$ with voltage $V$ and frequency $f$ in an infinitesimal temperature interval $dT$ around $T$. We use it to substitute $dt$ and change the lower and upper limits of integration accordingly, yielding
Further, we use $\nu_s(T, V, f)$ to represent the probability density function (p.d.f.) of a core's temperature $T$, given that the core is in state $s$. Also, $\pi_s$ is defined as the probability of a core being in state $s$. Thus, the fraction of accumulated time within which the core falls in an infinitesimal interval $dT$ at $T$ and is in state $s$ can be approximated by $\pi_s \cdot \nu_s(T, V, f) \cdot dT$. Hence, Eq. (5) can be rewritten as
Because π s and t are independent of V , f and T , they can be moved outside of the corresponding integral and summation signs to obtain
Now, we are ready to introduce the formal definition of the wearout rate in state $s$, a quantity that describes the rate at which a core suffers from wearout effects, namely,
Using Ω s , Eq. (7) can be rewritten as
Clearly, ν s follows a constraint that
Here, ν s (T,V, f ) is the conditional p.d.f. of temperature T , voltage V , and frequency f with given state s. According to the theorem of total probability, it is possible for us to drop s from notation Ω s and express wearout rate in a concise form. As both θ and t are independent of wearout rate, from Eq. (9) we define
Thus, Eq. (9) can be written as
In particular, if the core has the same frequency and voltage in all states other than the cold standby mode, we can redefine the scale parameter $\theta(T)$ according to these parameters. Since a core in the cold standby state is switched off, its lifetime is close to infinity, i.e., $\theta_{spare}(T) \to \infty$. In other words, the wearout rate contributed by this state is approximately zero. Therefore, we are only interested in the temperature distribution given that the core is not in cold standby, denoted as $\nu(T)$. For a core that is not set into the cold standby state within a time interval, from Eq. (11), we have
where the temperature distribution $\nu(T)$, being a p.d.f., satisfies $\int \nu(T)\,dT = 1$.
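To make the role of the temperature distribution concrete, the following Python sketch estimates a wearout rate from sampled core temperatures. It assumes that the wearout rate of a core that is never in cold standby takes the form $\Omega = \int \nu(T)/\theta(T)\,dT$, which is our reading of the surrounding definitions. The Arrhenius-style `theta`, the activation energy, and the reference temperature are hypothetical choices made only for illustration.

```python
import math
from collections import Counter

BOLTZMANN_EV = 8.617333262e-5  # Boltzmann constant in eV/K


def theta(temp_k, theta_ref=10.0, temp_ref_k=345.0, ea_ev=0.9):
    """Hypothetical Arrhenius-style scale parameter.

    Equals theta_ref at temp_ref_k and shrinks as the core gets hotter.
    All default values are illustrative assumptions.
    """
    return theta_ref * math.exp(ea_ev / BOLTZMANN_EV * (1.0 / temp_k - 1.0 / temp_ref_k))


def wearout_rate(temps_k, bin_width=0.5):
    """Approximate Omega = integral of nu(T) / theta(T) dT from temperature samples.

    The samples are binned into a histogram; count / total plays the role
    of nu(T) dT for the bin centred at T.
    """
    if not temps_k:
        raise ValueError("need at least one temperature sample")
    bins = Counter(round(t / bin_width) for t in temps_k)
    total = len(temps_k)
    omega = 0.0
    for index, count in bins.items():
        centre = index * bin_width  # bin centre temperature in kelvin
        omega += (count / total) / theta(centre)
    return omega
```

Under these assumptions, a core that always sits at the reference temperature has a wearout rate of `1 / theta_ref`, and any time spent at higher temperatures increases the rate.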
Impact of Workloads
In many systems, the workload distribution of a core is not fixed. For instance, in a gracefully degrading manycore processor with redundant cores, all cores share the workload initially. We therefore examine how workloads affect the wearout rate in this section.
Recall that the mathematical derivation of the unified reliability function is independent of the workload distribution. Suppose a set $M = \{1, 2, \cdots, m\}$ of cores equally share the workload; then the probability for core $i$ ($1 \le i \le m$) to process any task is $1/m$. Thus, given the workload distribution of the entire system, every core's workload is easily obtained. In our model, it is reflected in the temperature distributions $\nu_s(T, V, f)$ and hence in the wearout rate $\Omega$. Fig. 1 shows typical temperature distributions of a core under various workloads $\rho_{sys}$ (the formal definition is introduced in Section 5.1). The data is collected from HotSpot [15] for an application flow composed of 10,000 tasks. Without loss of generality, we assume every state $s$ corresponds to a single supply voltage value $V$ and clock frequency value $f$ in this numerical experiment. We add a subscript $m$ to indicate the number of cores that process workloads in the system. For example, $\Omega_{36}$ represents the wearout rate in a system that contains 36 active units.
We then present how to extend the definition of $\Omega$, which is drawn from the expression of $R(t)$ assuming the distribution is fixed from time zero up to $t$, to compute the wearout rate $\Omega_m$ for any number of active cores $m$. Under the same usage strategy, for the same system, the difference in wearout rate caused by different workloads is reflected in the temperature distributions $\nu_{s,m}(T, V, f)$ over $T$ and the probabilities $\pi_{s,m}$ of a core being in the various states. Consequently, from Eq. (11) we have
Even if a core has experienced a different workload distribution before the current one, Eq. (15) is able to capture the aging effect in this time interval. As an example, suppose all $n$ cores in a system equally share the workload at the beginning, and then one of them fails at time $t_1$, resulting in a heavier load on every surviving core. In this case, we can use $\Omega_n$ and $\Omega_{n-1}$ to represent the wearout rates in the two states, respectively. From Eq. (12), the reliability of a surviving core at time $t$ ($t \le t_1$) is $R(\Omega_n t)$. Then, we analyze the second state. Since this core enters its second state at time $t_1$, its initial reliability in the second state is $R(\Omega_n t_1)$. Consequently, by an argument similar to that in Section 3.1, we can express the reliability of a surviving core at time $t$ ($t > t_1$) as $R\big(\Omega_n t_1 + \Omega_{n-1}(t - t_1)\big)$.
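The two-state example can be generalized to any number of workload changes. The following Python sketch assumes that the reliability depends on time only through the accumulated quantity $\sum_j \Omega_j \Delta t_j$ and that $R$ has the Weibull form with shape parameter $\beta$. The function names and the segment representation are ours, introduced only for illustration.

```python
import math


def weibull_of_effective_age(age, beta=2.5):
    """Weibull reliability as a function of the accumulated quantity Omega * t."""
    return math.exp(-(age ** beta))


def reliability_piecewise(t, segments, beta=2.5):
    """Reliability at time t of a core whose wearout rate changes over time.

    segments is a list of (start_time, omega) pairs sorted by start_time,
    with the first start_time equal to 0. Each omega applies from its
    start_time until the next segment begins.
    """
    age = 0.0
    for i, (start, omega) in enumerate(segments):
        if t <= start:
            break
        end = segments[i + 1][0] if i + 1 < len(segments) else float("inf")
        age += omega * (min(t, end) - start)
    return weibull_of_effective_age(age, beta)
```

For example, `reliability_piecewise(6.0, [(0.0, 0.10), (4.0, 0.12)])` accumulates 0.10 per year for the first four years and 0.12 per year afterwards; the rates are hypothetical. The function is continuous at the switching time, in line with the continuity argument of Section 3.1.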
LIFETIME RELIABILITY ANALYSIS FOR MANYCORE PROCESSORS WITH VARIOUS REDUNDANCY SCHEMES
Modeling the lifetime reliability of manycore processors with redundancy is more complicated, because the status, workload, and corresponding failure rate of each core in a system can be time-varying, depending on the redundancy configuration and the occurrence of wearout-related failures. In this section, we focus on three redundancy schemes and discuss their lifetime reliability models in detail. Then, we present how to extend the proposed model to heterogeneous manycore systems.
Gracefully Degrading System (GDS)
In GDS, all $n$ cores are initially configured as active units. When a core fails, the system is reconfigured in a gracefully degrading manner, that is, the remaining $(n - 1)$ good cores share the system workload. This process continues until there are only $k$ good cores left. In that situation, if one more core fails, the entire system is considered faulty. The number of cores sharing workloads can be $n, n - 1, \cdots, k$, and the corresponding wearout rates are $\Omega_n, \Omega_{n-1}, \cdots, \Omega_k$, respectively. By extending the deduction procedure presented in Section 3, for any surviving core at time $t$, given that the system contains $(n - \ell)$ good cores at $t$ and the $i$-th permanent component failure in the system occurs at time
The event that all surviving components after $\ell$ core failures are still functioning at time $t$, where $t > t_\ell$, can be modeled as a series failure system. We use $\big(R_{GDS}(t|\ell)\big)^{n-\ell}$ to represent its probability. The next step is to uncondition this probability, because it assumes that the occurrence times of past failures are given. Similar to the system lifetime analysis in [9], as the reliability of a core given $i$ past failures is $R_{GDS}(t|i)$, the event that the $(i+1)$-th failure occurs at $t_{i+1}$ has a probability referred to as $f_{GDS}(t_{i+1}|i)$ hereafter. Therefore, denoting by $R_{GDS,sys}(t, \ell)$ the probability of the event that the system has experienced exactly $\ell$ core failures before time $t$, and by the theorem of total probability, we have Eq. (17), where the domain is
Then, since the event that a GDS system is functioning can be expressed as the union of a set of mutually exclusive events, the system reliability R GDS,sys (t) is therefore given by
Consequently, the system mean time to failure is Eq. (19).
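As a cross-check of the GDS model, the system lifetime can also be sampled directly. The following Python sketch is a Monte Carlo simulation under stated assumptions, not the integral evaluation described in this paper: it assumes a Weibull reliability function of the accumulated quantity $\Omega t$, so that each core fails once its accumulated value reaches an independent Weibull-distributed threshold. The callable `omega_of`, which maps the number of active cores to a wearout rate, is a hypothetical input.

```python
import random


def sample_gds_lifetime(n, k, omega_of, beta=2.5, rng=random):
    """Sample one lifetime of a gracefully degrading k-out-of-n system.

    All good cores are active and share the same wearout rate
    omega_of(active), so they accumulate the same effective age. The
    system fails at the (n - k + 1)-th core failure.
    """
    # Effective age at which each core fails, sorted from earliest to latest.
    thresholds = sorted(rng.weibullvariate(1.0, beta) for _ in range(n))
    time, age = 0.0, 0.0
    for failures in range(n - k + 1):
        active = n - failures
        # Time needed to move the common effective age to the next threshold.
        time += (thresholds[failures] - age) / omega_of(active)
        age = thresholds[failures]
    return time


def estimate_gds_mttf(n, k, omega_of, runs=10000, seed=0):
    """Average sampled lifetime over a number of independent runs."""
    rng = random.Random(seed)
    return sum(sample_gds_lifetime(n, k, omega_of, rng=rng) for _ in range(runs)) / runs
```

A call such as `estimate_gds_mttf(36, 32, lambda m: 0.1 * 36 / m)` uses a purely illustrative wearout rate that grows as fewer cores share the load; in practice the rates would come from the extracted temperature distributions.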
Processor Rotation System (PRS)
Processor cores can be used in rotation to balance their aging effects. That is, they operate alternately in active mode and spare mode, and the period spent in each mode is long compared with the execution time of a single task. At the same time, this period is quite short compared with the lifetime of the system. In [13], the authors showed an example for caches enabled in a round-robin manner.
For modeling lifetime reliability, we consider a more general case in which, in any configuration, $k$ out of $n$ cores serve as active ones while the remaining $(n - k)$ have no power supply. The reconfiguration is conducted every time interval $T_r$, which is much shorter than a core's service life (typically a few years) but much longer than a task's execution time. At every reconfiguration, the $(n - k)$ oldest cores (that is, the cores with the highest age) are shut down, and all spare ones convert to active mode. From a core's point of view, before the first failure in the system, its accumulated time up to time $t$ as an active core can be approximated as $\frac{k}{n} \cdot t$.
Recall that the wearout rate in the remaining time intervals, during which the core is a spare, is approximately zero. Therefore, its reliability before the first component failure is given by
Then, we generalize this expression to the case in which the number of failed cores in the system can be $0, 1, \cdots, (n - k)$. From $t_i$ to $t_{i+1}$, the system is composed of $(n - i)$ good components, of which $k$ are active at any time. Since a surviving core's accumulated active time in this interval depends on the number of both active and redundant cores, it can be approximated as $\frac{k}{n-i} \cdot (t_{i+1} - t_i)$. Its wearout rate, on the other hand, remains $\Omega_k$. We therefore compute its reliability by
The sequential analysis is very similar to that of gracefully degrading systems (Section 4.1) and hence omitted here.
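The omitted sequential analysis can be mirrored by a sampling sketch. The Python code below is a Monte Carlo illustration under stated assumptions rather than the analytical procedure: it assumes a Weibull reliability function of the accumulated quantity $\Omega t$ and that rotation keeps all surviving cores at the same effective age, each being active for a fraction $k/(n-i)$ of the time after $i$ failures. The function name is ours.

```python
import random


def sample_prs_lifetime(n, k, omega_k, beta=2.5, rng=random):
    """Sample one lifetime of a k-out-of-n processor rotation system.

    A surviving core is active for a fraction k / (n - failures) of the
    time and wears out at rate omega_k while active. The system fails at
    the (n - k + 1)-th core failure.
    """
    # Effective age at which each core fails, sorted from earliest to latest.
    thresholds = sorted(rng.weibullvariate(1.0, beta) for _ in range(n))
    time, age = 0.0, 0.0
    for failures in range(n - k + 1):
        duty = k / (n - failures)  # fraction of time a surviving core is active
        time += (thresholds[failures] - age) / (omega_k * duty)
        age = thresholds[failures]
    return time
```

Averaging `sample_prs_lifetime(36, 32, 0.1)` over many runs gives a rough mean time to failure for a 32+4 configuration; the value 0.1 for the wearout rate is illustrative.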
Standby Redundant System (SRS)
In SRS, $k$ out of $n$ cores are initially configured as active units, while the remaining $(n - k)$ cores are in spare mode. Upon detection of a permanent component failure, the system attempts to wake up a spare core and configure it as an active one. Note that, unlike PRS, which aims to balance the age of all cores, SRS converts cold standbys into active mode only when some active cores fail, which leads to significant age differences between cores. For example, suppose the first core failure occurs when the system has been used for 4 years. After reconfiguration, the system will then be composed of $(k - 1)$ active 4-year-old cores and a brand-new one.
Consider a core that starts its service life at time $t_s$. From $t_s$ to its failure or the entire system's failure, its wearout rate is a constant $\Omega_k$, because the number of active cores in the system is always $k$ and this core is always one of them. As a result, its reliability depends only on its service time up to $t$ and is independent of the failures in the system. It is given by
To evaluate $MTTF_{SRS,sys}$, it is necessary to compute the probability that all surviving cores after $\ell$ failures are still operational at time $t$. It can be expressed by considering $u_i$ (the number of surviving cores starting their service life from time $t_i$), i.e.,
u i is function of past failure history h. As this event has a condition that h occurs, we can express P(h) the probability of history h according to [9] . Let H be the set of all possible histories. According to the theorem of total probability, the unconditional reliability is therefore
$MTTF_{GDS,sys} = E[\text{service life of gracefully degrading system}] = \int_0^{\infty} R_{GDS,sys}(t)\,dt$ (19)
whose domain $D$ is the same as that of Eq. (17). Similar to the analysis of the gracefully degrading system, the expected service life is given by
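The behavior of a standby redundant system can also be sketched as a small event-driven simulation. The Python code below is an illustrative Monte Carlo sketch, not the analytical evaluation over failure histories: it assumes a Weibull reliability function, a constant wearout rate $\Omega_k$ for every active core, and cold spares that do not age. The function name is ours.

```python
import heapq
import random


def sample_srs_lifetime(n, k, omega_k, beta=2.5, rng=random):
    """Sample one lifetime of a k-out-of-n standby redundant system.

    k cores start at time 0; each of the (n - k) cold spares starts its
    service life at the moment it replaces a failed core. The system
    fails when an active core fails and no spare is left.
    """
    def service_life():
        # Time from activation to failure under the constant rate omega_k.
        return rng.weibullvariate(1.0, beta) / omega_k

    failure_times = [service_life() for _ in range(k)]
    heapq.heapify(failure_times)
    spares = n - k
    while True:
        now = heapq.heappop(failure_times)  # earliest failure among active cores
        if spares == 0:
            return now
        spares -= 1
        heapq.heappush(failure_times, now + service_life())
```

Because a replacement core starts with zero accumulated age, the active set mixes old and new cores after the first few failures, which is the age imbalance described above.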
Extension to Heterogeneous System
Up to now, we have shown how to model the lifetime reliability of manycore systems with various redundancy configurations, wherein we regard the entire manycore processor as a k-out-of-n: G system. In practice, some designs may consist of more than one type of processor core [8], for example main processors and co-processors. This case can be modeled by simply extending the model presented in the previous sections. Assuming the failures within the two subsystems are independent, the lifetime reliability of the entire system comes down to the probability that both subsystems are operational.
Generally, consider a system that can be divided into $q$ subsystems, in which subsystem $i$ ($1 \le i \le q$) initially contains $n_i$ identical components and functions if no fewer than $k_i$ are operational. Each subsystem can have its own redundancy configuration scheme, referred to as $CFG_i$ in the subscript. Because of their different functionalities, we assume cores from different subsystems do not share workloads. The lifetime reliability of subsystem $i$ at time $t$ can then be obtained by substituting its parameters $n_i$ and $k_i$ into the models presented in Section 4, denoted as $R_{CFG_i,sys}(t)$. Recall that the functioning of all subsystems is essential for the entire system to operate properly. The expected service life of the entire system is hence given by
$MTTF_{HS,sys} = \int_0^{\infty} \prod_{i=1}^{q} R_{CFG_i,sys}(t)\,dt$ (25)
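The integral of the product of subsystem reliabilities can be evaluated numerically once each subsystem reliability is available as a function of time. The Python sketch below uses the trapezoidal rule on a truncated horizon; the function name, the truncation, and the step count are our assumptions, and the subsystem reliability functions are supplied by the caller.

```python
import math


def mttf_of_series(subsystem_reliabilities, horizon, steps=20000):
    """Trapezoidal estimate of the integral from 0 to horizon of prod_i R_i(t) dt.

    subsystem_reliabilities is a list of callables R_i(t). The horizon must
    be large enough for the product to have decayed to almost zero.
    """
    dt = horizon / steps

    def product(t):
        result = 1.0
        for reliability in subsystem_reliabilities:
            result *= reliability(t)
        return result

    total = 0.5 * (product(0.0) + product(horizon))
    for j in range(1, steps):
        total += product(j * dt)
    return total * dt
```

As a sanity check, two independent subsystems with exponential reliabilities of rates 0.1 and 0.3 behave like a single exponential with rate 0.4, so `mttf_of_series([lambda t: math.exp(-0.1 * t), lambda t: math.exp(-0.3 * t)], horizon=100.0)` should be close to 2.5.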
EXPERIMENTAL METHODOLOGY
To compare manycore systems with different redundancy configurations, we conduct extensive experiments on a 36-core processor (i.e., $n = 36$) with various workloads, with the number of active cores $k$ ranging from 32 to 36. We implement a discrete-event simulator to perform task allocation and scheduling for an application flow composed of 60,000 tasks in every experiment, and we generate the associated power trace files for the entire system accordingly. Next, we take these files as the input of the HotSpot tool [15] to acquire the temperature trace files. All temperature samples are collected to extract the temperature distribution (as in Fig. 1). We then compute the wearout rate $\Omega$ according to its definition, and finally obtain the lifetime reliability of manycore systems with various redundancy configurations by evaluating the multidimensional integrals with Monte Carlo simulation.
Workload Description
As discussed earlier, workloads determine the temperature distribution of the processor. In our experiments, we generate a task flow for each workload, which is characterized by the task interarrival time distribution and the task service time distribution.
We assume the task interarrival time follows an exponential distribution with rate $\lambda$. Assuming that all $m$ active cores equally share the workload, tasks arrive at each core with rate $\lambda/m$. The task service time is modeled with either an exponential distribution or a bimodal hyperexponential distribution. The exponential distribution is widely used in the literature, while the bimodal hyperexponential distribution is regarded as the most probable distribution for modeling processor service time [19]. The exponential distribution has the expected service rate $\mu$. The bimodal hyperexponential distribution is composed of two exponential distributions with means $1/\mu_1$ and $1/\mu_2$, respectively, where
. We set α = 0.95, C x = 3.0 [21] . For both distributions, the system load is defined as ρ sys = 
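A task flow of this kind can be generated with standard random variates. The following Python sketch assumes that $\alpha$ is the probability of drawing from the first exponential branch and that tasks are assigned to cores uniformly at random, so that each core sees a Poisson stream of rate $\lambda/m$. The rates `mu1` and `mu2` are inputs here and are not derived from $\alpha$ and $C_x$; the function names are ours.

```python
import random


def sample_interarrival(rate, m, rng=random):
    """Per-core interarrival time when m cores equally share a Poisson stream."""
    return rng.expovariate(rate / m)


def sample_exponential_service(mu, rng=random):
    """Service time drawn from an exponential distribution with service rate mu."""
    return rng.expovariate(mu)


def sample_hyperexp_service(alpha, mu1, mu2, rng=random):
    """Bimodal hyperexponential service time.

    With probability alpha the time is exponential with rate mu1,
    otherwise exponential with rate mu2.
    """
    mu = mu1 if rng.random() < alpha else mu2
    return rng.expovariate(mu)
```

The mean of the hyperexponential service time is `alpha / mu1 + (1 - alpha) / mu2`, which gives a quick check on any chosen pair of rates.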
Temperature Distribution Extraction
In our experiments, the die size of each core is set to 5.76 mm² (2.4 mm × 2.4 mm). Depending on its current workload, an active core can be in one of two states: Run and Idle. To represent the uneven power densities in the processing unit, a core contains a small block (e.g., the Execution Unit) with higher power density, whose size is 0.5 mm × 0.5 mm. The power density values of this hotspot block and the other parts in the Run state are 5.0 W/mm² and 0.5 W/mm², respectively, while those in the Idle state are both 0.16 W/mm². These system parameters are set according to state-of-the-art processors (e.g., IBM PowerPC 750CL [12]). The standbys are assumed to be in the Shut Down state, whose power consumption is approximately 0 W.
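The per-core power implied by these densities follows from simple arithmetic. The Python sketch below assumes that the "other parts" cover the entire core area outside the hotspot block; the constant and function names are ours.

```python
CORE_AREA_MM2 = 2.4 * 2.4      # total core area
HOTSPOT_AREA_MM2 = 0.5 * 0.5   # area of the high-power-density block

RUN_HOTSPOT_W_PER_MM2 = 5.0
RUN_OTHER_W_PER_MM2 = 0.5
IDLE_W_PER_MM2 = 0.16


def core_power_watts(state):
    """Total power of one core in the given state: 'run', 'idle', or 'shutdown'."""
    if state == "run":
        other_area = CORE_AREA_MM2 - HOTSPOT_AREA_MM2
        return HOTSPOT_AREA_MM2 * RUN_HOTSPOT_W_PER_MM2 + other_area * RUN_OTHER_W_PER_MM2
    if state == "idle":
        return CORE_AREA_MM2 * IDLE_W_PER_MM2
    if state == "shutdown":
        return 0.0
    raise ValueError(f"unknown state: {state}")
```

Under this assumption, a core draws about 4.0 W in the Run state (1.25 W in the hotspot block plus about 2.76 W elsewhere) and about 0.92 W in the Idle state.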
Reliability Factors
We use the Weibull distribution, a well-accepted lifetime distribution for modeling hard errors of IC products [2], as the reliability function in our system, $R(t) = e^{-(t/\theta)^{\beta}}$, with shape parameter $\beta = 2.5$. Although the proposed approach is applicable to any failure mechanism or combination of mechanisms, due to the lack of public empirical data on the relative weights of different failure mechanisms in real circuits, we analyze the electromigration failure mechanism in our experiments, whose model is presented in [7].
To compare the systems' lifetimes in various configurations, we normalize them to a certain scenario, wherein all cores of a 36-core system without redundancy are in active mode and the system workload is 5.0 and its Ω is set to be 0.1. In other words, a core in such a system has expected service life of 10 years.
RESULTS AND DISCUSSIONS
Wearout Rate Computation
We first demonstrate experimentally the effectiveness of one of the key concepts in this work, the computation of the wearout rate $\Omega$. On the one hand, we trace the temperature variation of a core for 15,000 steps after the system has been warmed up, each step corresponding to 3.33 µs, from which the temperature distribution $\nu(T)$ is extracted. The wearout rate $\Omega$ is then computed according to Eq. (13). From Eq. (12), the component reliability can be expressed as a function of $t \cdot \Omega$ (referred to as $T_\Omega$ hereafter). We compute $T_\Omega$ for 30 seconds. On the other hand, the temperature variation of the same core is traced for 30 seconds (around $9 \times 10^6$ steps) for comparison. We use $T_{trace}$ to represent the summation of $\Delta t / \theta_s(T, V, f)$ up to time $t$, where $\Delta t$ = 3.33 µs. The difference between $T_\Omega$ and $T_{trace}$ versus time $t$ is shown in Fig. 2.
As can be seen from this figure, the difference between the approximated value and the actual value is larger than 0.5% only in the first 4.832 s. After that, $T_\Omega$ becomes very close to $T_{trace}$. Since the service life of processors is typically in the range of years, the estimation error of the proposed approach is negligible.
Comparison on Lifetime Reliability
Fig. 3 shows one of the most important metrics reflecting system lifetime, the mean time to failure, under various redundancy configurations and workloads. System lifetime is expected to grow with the number of redundant cores, and all the subfigures show this trend. A closer look at these figures also shows that the lifetime growth rate becomes smaller as more cores are configured as redundancy, i.e., the sojourn time of the i-Failure state is larger than that of the (i + 1)-Failure state (see Fig. 4). This is mainly due to the increasing failure rate of IC products. Consider three PRS systems with 0, 2, and 4 redundant cores as an example (the three middle bars in Fig. 4). From 36+0 to 34+2 systems, .35 extra service life, respectively. As a result, the overall lifetime extension is 16.41. If we further increase the number of redundant cores by two, the lifetime extension is 12.25, less than 16.41. Hence, the benefit of improving the lifetime reliability of manycore processors by adding redundant cores gradually diminishes, and it may not be very beneficial to configure a very large number of them.
From Fig. 3 and Fig. 4, we can also see that PRS provides a longer service life than the other two configurations under the same workloads. On the one hand, when compared to standby systems, processor cores in PRS have a more balanced workload. In SRS, some cores are set as cold standbys initially and convert to active mode only when some active cores fail. Thus, even if the workload is evenly distributed among all active cores, after some replacements the system is composed of many old components and a few new ones. Since the aged cores already have a high failure rate, the lifetime of the entire system cannot be extended much even though there are some new cores. From this aspect, a lot of the potential computation capability of the standbys is wasted. This problem can be avoided by using the PRS configuration, which aims to balance the aging effect among all cores. On the other hand, when compared to gracefully degrading systems, processor cores in PRS alternate between active and standby modes, while all cores in GDS keep aging throughout the system's lifetime. Although PRS can result in a slightly heavier workload on every single core than GDS, the extra aging effect caused by this is quite small when the number of redundant cores $(n - k)$ is much smaller than $n$.
Comparison on Performance
The various redundancy configurations also result in different performance for the manycore processors. Two widely used metrics are employed to evaluate the performance of the manycore systems: the mean response time, defined as the expected time from a task's arrival until its completion, and the system utilization, defined as the average percentage of cores in use over time. The results are obtained with the same discrete-event simulator.
Fig. 5 shows the mean task response time versus the number of active cores in the system under various workloads. Consider exponential service time first (Fig. 5(a)). When the workload is not high ($\rho_{sys} \le 20.0$), the mean response time increases slightly as the number of active cores declines. In addition, this value roughly doubles when the workload doubles. For instance, when all 36 cores serve as active ones, the mean response times corresponding to $\rho_{sys}$ = 5.0, 10.0, and 20.0 are 4.96, 9.79, and 19.73, respectively. When the workload is high (i.e., $\rho_{sys}$ = 30.0), the mean response time is still roughly proportional to the workload, but having a few fewer active cores leads to a noticeable increase in response time. For bimodal hyperexponential service time (Fig. 5(b)), while the systems with $\rho_{sys}$ = 30.0 have lifetimes similar to those with $\rho_{sys}$ = 20.0 (see Fig. 3), their mean response time is significantly larger (by a factor ranging from 6.5× to 33.4×). We attribute this phenomenon to the close-to-saturated system workload under such circumstances. In other words, with the parameter setup in our experiments, when the workload of systems with the bimodal hyperexponential distribution becomes larger than 20.0, almost all active cores always have tasks to perform, leading to a dramatic increase in the mean response time of tasks.
When it comes to the performance of the various redundancy configurations, a GDS system sequentially has 36, 35, ⋅⋅⋅ active cores in its lifetime and therefore exhibits gracefully degrading performance from the users' point of view. The other configurations, by contrast, provide the same performance throughout their lifetime. For example, a PRS/SRS system that has 32 active cores at any time always provides a mean task response time of 38.60, given exponential service time and $\rho_{sys}$ = 30.0. A 32+4 GDS system, however, is able to provide better performance in the first several years (its mean response time is 30.75), after which its mean response time gradually increases to 30.89, 32.90, 35.89, and finally 38.60.
Fig. 6 shows the system utilization in various cases. For exponential service time, the system utilization is almost proportional to the system workload for a given number of active cores. Under a fixed workload, we can also observe slightly higher system utilization for systems with fewer active cores. When the workload becomes sufficiently heavy with hyperexponentially distributed service time ($\rho_{sys} \ge 20.0$), the system utilization increases very little as the workload grows. This explains the mean response time shown in Fig. 5(b).
Comparison on Expected Computation Amount
In this subsection, we combine performance and lifetime reliability into a unified metric, namely the expected computation amount, which reflects the amount of computation performed by a system before its failure. The results for 32+4 and 34+2 systems are shown in Fig. 7. An interesting phenomenon can be observed from these figures: as the system workload becomes heavier, in spite of a significant decline in the system lifetime (see Fig. 3), in most cases the total computation amount of the system increases. This is mainly because, although a system with a light workload has a relatively lower temperature than one with a heavy load, the induced difference in lifetime is smaller than the difference in system utilization. In particular, consider the GDS shown in Fig. 7(a) as an example. The sojourn times in the 0-Failure state for $\rho_{sys}$ = 5.0 and 10.0 are 21.16 and 18.42, respectively, while the system utilizations in the two cases are 13.80% and 27.76%. Therefore, the expected computation amount ratio in this state is around 1.75, and we can observe similar trends in other states. From this aspect, a longer service life does not mean more effective usage of the manycore processor. Moreover, we notice that when $\rho_{sys}$ of the system with the bimodal hyperexponential service time distribution increases from 20.0 to 30.0, the expected computation amount decreases (see Fig. 7(c)-(d)). The main reason is that both cases nearly make full use of their resources, and thus their computation amounts are mainly bounded by their service lives.
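The ratio quoted for the 0-Failure state can be reproduced with a short calculation. The Python sketch below assumes that the computation amount in a state is proportional to the sojourn time multiplied by the system utilization, with the same number of active cores in both cases; the function name is ours.

```python
def computation_amount(sojourn_time, utilization):
    """Computation performed in one state, up to a constant factor.

    Assumes the amount is proportional to the time spent in the state
    multiplied by the fraction of cores in use.
    """
    return sojourn_time * utilization


# 0-Failure state of the GDS example: rho_sys = 10.0 versus rho_sys = 5.0.
ratio = computation_amount(18.42, 0.2776) / computation_amount(21.16, 0.1380)
```

The heavier workload shortens the sojourn time by roughly 13% but about doubles the utilization, which is why the ratio stays well above one.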
In some cases, computer systems are not used until the end of their lifetimes. Hence, we are also interested in the computation amount of systems in such situations. In the following experiments, we set the minimum expected service life among GDS, PRS, and SRS computed by the proposed model as the actual system service life, and calculate the expected computation amount until that time point for the three redundancy configurations. The results for 32+4 systems are shown in Fig. 8. When the systems are not fully used (i.e., exponential service time with $\rho_{sys}$ = 5.0, 10.0, 20.0, and 30.0, and bimodal hyperexponential service time with $\rho_{sys}$ = 5.0 and 10.0), the total computation amounts with different configurations are the same. This is because a system that is not fully utilized is always able to complete tasks within a very short time period (compared to the system's lifetime). Therefore, the computation amount equals the number of tasks fed to the system. If the system has sufficiently high utilization (i.e., hyperexponentially distributed service time with $\rho_{sys}$ = 20.0 and 30.0 in Fig. 8(b)), GDS systems finish more jobs than the other two configurations. This is expected, because GDS systems contain more active units and keep them busy, leading to a greater computation amount. It is also worth noting that, with the same workload distribution, PRS and SRS always have the same computation amount when considering the same service time. This is because in both configurations the number of active cores at any time remains the same (32, in our experiments).
CONCLUSION
In this paper, we propose a novel analytical model to characterize the lifetime reliability of manycore processors with various redundancy configurations. Our proposed model is able to capture the impact of temperature variations of processor cores and workloads. Our experiments compare the lifetimes and performances of gracefully degrading systems, processor rotation systems and standby redundant systems, under various workloads.
