ABSTRACT
INTRODUCTION
Because of aggressive technology scaling, today's embedded computing systems are able to integrate multiple microprocessors and various hardware accelerators on a single silicon die, known as a multiprocessor system-on-chip (MPSoC). When building embedded applications, designers first partition them into hardware and software tasks. Then, task allocation and scheduling are carefully performed to effectively utilize the available processing elements (PEs) while satisfying various design constraints (e.g., timing constraints for real-time tasks).
At the same time, the relentless scaling of CMOS technology has also brought ever-increasing variability in transistor parameters such as channel length, gate-oxide thickness, and threshold voltage [4, 18]. While there has been extensive work in the literature to mitigate process variation effects at the logic level (e.g., statistical timing analysis and optimization [2]) and at the algorithmic level (e.g., variation-aware high-level synthesis [21]), we are no longer able to hide the variability at the system level. For embedded computing systems, designers face MPSoCs whose processor frequencies vary across identically designed cores and across identically designed chips.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. DAC '10
In view of the above, variation-aware performance analysis needs to be integrated into the system-level task allocation and scheduling process for efficiently designing MPSoCs. Wang et al. [20] first addressed this problem by introducing the concept of performance yield for MPSoC designs, defined as the probability of the assigned schedule meeting timing constraints of the system. Statistical task graph analysis is then utilized in their task allocation and scheduling algorithm to maximize performance yield. Later, Singhal and Bozorgzadeh [17] proposed a non-probabilistic approach to reduce computational complexity in [20] . Recently, Chon and Kim [6] presented a new task allocation and scheduling method that takes the impact of resource sharing into consideration.
All the above works assume that task execution time follows a Gaussian distribution and that the execution times of tasks performed on different processor cores are statistically independent of each other (denoted as s-independent). However, a notable feature of within-die process variations is that they often manifest as spatially correlated systematic variations, where devices close to each other have a higher probability of observing a similar variation level than devices that are far apart [14]. Because of this effect, identically designed processor cores that are nearby tend to have similar characteristics, while the variation is more severe among distant cores. Without considering these systematic variation effects, the performance yield calculated in prior work is not accurate. Moreover, without the s-independence assumption between task execution times, most properties of the Gaussian distribution cannot be applied directly to analyze the multivariate normal distribution. Consequently, closed-form statistical analysis becomes extremely difficult, if not impossible. To tackle this problem, we rely on Monte Carlo simulation to estimate the performance yield for various task allocation and scheduling solutions, and we show that a reasonable number of test cases is sufficient to achieve high confidence in the accuracy of the estimate.
In addition, prior work develops a unified schedule for all chips at the design stage to maximize performance yield. Clearly, the effectiveness of such a static solution decreases with the ever-increasing variation effects. We present a novel quasi-static scheduling strategy, wherein a set of variation-aware schedules is synthesized off-line and, at run time, the scheduler selects the right one based on the actual variation of each chip, such that the timing constraint can be satisfied whenever possible. Experimental results on various task graphs mapped onto hypothetical MPSoCs show that the proposed solution is able to dramatically improve the performance yield.
The remainder of this paper is organized as follows. Section 2 reviews related prior work and formulates the problem tackled in this paper. Section 3 presents the proposed quasi-static performance yield-driven task allocation and scheduling algorithm. Experimental results on various hypothetical platforms are presented in Section 4. Finally, Section 5 concludes this paper.
PRIOR WORK AND MOTIVATION

Related Work
While there is a rich literature on task allocation and scheduling solutions for embedded system designs [5, 12], only a few works explicitly take process variation effects into consideration. Wang et al. [20] introduced a new metric called performance yield to indicate the probability of MPSoC products meeting timing constraints given a certain task schedule, and proposed to enhance it by using statistical task graph analysis. The computation of two atomic functions, sum and max, is presented for the statistical timing analysis. The former relies on the properties of the Gaussian distribution, while the latter uses a moment matching technique. Later, to avoid the time-consuming computations on distribution functions, Singhal and Bozorgzadeh [17] presented a non-probabilistic approach. Its problem formulation attempts to handle the case in which the latency of processors follows an arbitrary distribution, and it reaches theoretical conclusions for some simplistic scenarios. However, the proposed task allocation and scheduling technique was again based on the assumption of a Gaussian delay distribution. Recently, Chon and Kim [6] examined the impact of resource sharing on statistical static timing analysis and introduced an analytical framework to take this issue into account.
All the above works assume that the execution times of tasks follow a Gaussian distribution. What has been demonstrated, however, is that modeling the threshold voltage $V_{th}$ and the effective channel length $L_{eff}$ with a Gaussian distribution fits the empirical data well [15]. Then, according to the following processor frequency model [9], the execution time of a task under process variation can at best be approximated by a Gaussian distribution, and only in some instances.
where $V_{dd}$ is the supply voltage and $\alpha$ is a material-dependent constant.
More importantly, prior works assume that the execution times of multiple tasks (on different processor cores, at least) are s-independent of each other. This assumption, however, ignores the spatial correlation characteristic of systematic variation, which is modeled by the following spherical function for its good agreement with empirical data [15] .
where $\phi$ is the distance past which the correlation becomes zero, and $r$ is the distance between two elements on the chip. $\rho \in [0, 1]$ therefore measures their spatial correlation: the stronger the correlation, the closer $\rho$ is to one. When two processors are correlated in threshold voltage $V_{th}$ and effective channel length $L_{eff}$ because of systematic process variation, their frequency values tend to have similar offsets [9, 15]. The joint probability density function of two frequency values given $\rho = 0.8$ is illustrated in Fig. 1(a). For comparison, Fig. 1(b) plots the joint probability density function of two processors whose frequency values are totally uncorrelated (i.e., s-independent, or $\rho = 0$). To make the dependence clearer, we use the Kendall tau rank correlation coefficient, defined as the probability of concordance of MPSoC pairs minus that of discordance (see footnote 1), to demonstrate the co-dependence of two processors fabricated on the same silicon die, as shown in Fig. 2.
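The relation between distance, correlation, and rank concordance can be illustrated with a small simulation. This is a minimal sketch under stated assumptions: it uses the standard spherical correlation function, $\rho(r) = 1 - \frac{3r}{2\phi} + \frac{r^3}{2\phi^3}$ for $r < \phi$ and $0$ otherwise, which matches the definitions of $\phi$ and $r$ given here but may differ in detail from the form used in [15]. The function names, the sample size, and the use of standardized parameter offsets instead of real frequencies are all illustrative choices.

```python
import numpy as np

def spherical_rho(r, phi):
    """Assumed spherical correlation: 1 at r = 0, falling to 0 for r >= phi."""
    r = np.asarray(r, dtype=float)
    rho = 1.0 - 1.5 * (r / phi) + 0.5 * (r / phi) ** 3
    return np.where(r < phi, rho, 0.0)

def kendall_tau(x, y):
    """P(concordant) minus P(discordant), counted over all pairs of observations."""
    dx = np.sign(x[:, None] - x[None, :])
    dy = np.sign(y[:, None] - y[None, :])
    n = len(x)
    return (dx * dy).sum() / (n * (n - 1))

def sample_offsets(rho, n_chips, rng):
    """Standardized parameter offsets of two cores with correlation rho."""
    rho = float(rho)
    cov = np.array([[1.0, rho], [rho, 1.0]])
    return rng.multivariate_normal(np.zeros(2), cov, size=n_chips)

rng = np.random.default_rng(0)
near = sample_offsets(spherical_rho(0.1, 0.5), 500, rng)  # cores 0.1 apart
far = sample_offsets(spherical_rho(0.6, 0.5), 500, rng)   # cores farther apart than phi
tau_near = kendall_tau(near[:, 0], near[:, 1])
tau_far = kendall_tau(far[:, 0], far[:, 1])
```

In this sketch nearby cores give a clearly positive `tau_near`, while cores separated by more than $\phi$ give a `tau_far` close to zero, which is the qualitative behavior the text attributes to Fig. 2.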
With correlation, the statistical properties of s-independent Gaussian distributions serving as the basis of statistical task graph analysis […] independent assumption. The appreciable difference in the summation distribution is depicted in Fig. 3.

Footnote 1: Consider two observations of MPSoCs. If the differences between the two observations in the operational frequency of each of the two processors have the same sign, the pair is called concordant; otherwise it is called discordant.
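A short worked example may help show why correlation changes the result of the sum operation. It is a sketch only: the symbols $X_1$, $X_2$ (two jointly Gaussian execution times), their standard deviations $\sigma_1$, $\sigma_2$, and their correlation coefficient $\rho$ are introduced here for illustration.

```latex
\mathrm{Var}(X_1 + X_2) = \sigma_1^2 + \sigma_2^2 + 2\rho\,\sigma_1\sigma_2
% s-independent case (rho = 0), with sigma_1 = sigma_2 = sigma:
\mathrm{Var}(X_1 + X_2) = 2\sigma^2
% correlated case (rho = 0.8), with sigma_1 = sigma_2 = sigma:
\mathrm{Var}(X_1 + X_2) = 3.6\,\sigma^2
```

Under these assumptions the standard deviation of the sum is $\sqrt{1.8} \approx 1.34$ times larger with $\rho = 0.8$ than under s-independence, so a yield computed with the independent formula underestimates the spread of the completion time.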
From the above, we can conclude that the performance yield calculation in prior work is not accurate enough, especially considering that spatially correlated systematic variation effects become increasingly significant with technology scaling [13, 16]. In addition, prior work tried to construct a unified task schedule for all chips at the design stage. Such a static solution ignores the unique processor frequency map of each MPSoC chip, which is available after manufacturing test. This inflexibility significantly constrains the maximum performance yield that designers can achieve.
The above limitations of prior works motivate the proposed performance yield-driven task allocation and scheduling solution in this paper.
Problem Formulation
The problem studied in this work is formulated as follows.

Given
• A task graph $G$, in which each node in $\{\tau_i : i = 1, \cdots, n\}$ represents a task in $G$, and $E$ is the set of directed arcs which represent precedence constraints. Each task $\tau_i$ has a deadline $d_i$;
• A platform-based MPSoC embedded system and its floorplan. The platform consists of a set of processors $P = \{P_j : j = 1, \cdots, m\}$, belonging to $V$ categories;

To determine periodic task schedules such that the percentage of MPSoC products meeting performance constraints is maximized.
QUASI-STATIC TASK ALLOCATION AND SCHEDULING ALGORITHM
As discussed earlier, deriving a unified schedule for all MPSoC chips restricts the maximum performance yield that we can achieve, and its effectiveness decreases with the ever-increasing variation effects. A straightforward way to maximize performance yield is then to generate a specific task schedule for each individual chip. This extreme solution, however, is not practical because of the overhead of running task allocation and scheduling algorithms on-chip and the additional design effort needed to prepare the code and data of tasks for all kinds of processors so that flexible task allocation is possible. We therefore propose a quasi-static solution, wherein we prepare a set of task schedules at the design stage and, at run time, the scheduler selects the right one based on the actual variation of each chip. To be specific, we first generate an initial task schedule with a simulated annealing-based technique, taking the stochastic properties of the MPSoC frequency map into account (Section 3.1). Then, we use a data clustering technique to derive additional task schedules that further improve MPSoC performance yield (Section 3.2).
Initial Task Scheduling
Simulated Annealing Procedure
In this section, we resort to a modified simulated annealing technique to find a task schedule that maximizes the performance yield. Compared with the classic simulated annealing procedure, the differences lie mainly in the functions YIELD ESTIMATION and MEET ACCEPTANCE CONDITION in the flow shown in Fig. 4. The task schedule obtained with this method, referred to as the initial task schedule (or $S_{init}$), plays an important role in the quasi-static scheduling process.
The task schedule $S$ is described with two sequences: (scheduling order sequence, resource binding sequence) [10]. For example, when the task graph shown in Fig. 5 is allocated onto an MPSoC with two processors, a feasible schedule is $(\tau_1, \tau_3, \tau_2, \tau_4, \tau_5; P_1, P_2, P_1, P_1, P_2)$, meaning that task $\tau_1$ is scheduled first, followed by tasks $\tau_3$, $\tau_2$, $\tau_4$, and $\tau_5$. The resource binding sequence is indexed by task number: tasks $\tau_1$, $\tau_3$, and $\tau_4$ are allocated onto processor $P_1$, while $\tau_2$ and $\tau_5$ are allocated onto processor $P_2$. To generate a new schedule, we reuse the two types of random moves defined in [10], which are capable of searching the complete solution space: (i) swap two adjacent elements in the scheduling order sequence if they do not have any precedence constraint between them; (ii) change an element in the resource binding sequence.
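The two-sequence representation and the two random moves can be sketched as follows. This is a minimal illustration, not the implementation of [10]: the precedence sets in `preds` are a hypothetical task graph chosen so that the example order is feasible (they are not the graph of Fig. 5), processors are numbered from 0, heterogeneous processor categories are ignored, and all function names are illustrative.

```python
import random

def is_valid(order, preds):
    """True if every task appears after all of its direct predecessors."""
    pos = {task: k for k, task in enumerate(order)}
    return all(pos[p] < pos[task] for task in order for p in preds[task])

def swap_adjacent(order, preds, rng):
    """Move (i): swap two adjacent tasks with no precedence constraint between them.

    For adjacent tasks in a valid order it is enough to check direct
    predecessors, since an indirect dependency would need a task in between.
    """
    candidates = [k for k in range(len(order) - 1)
                  if order[k] not in preds[order[k + 1]]]
    if not candidates:
        return list(order)
    k = rng.choice(candidates)
    new_order = list(order)
    new_order[k], new_order[k + 1] = new_order[k + 1], new_order[k]
    return new_order

def rebind(binding, n_procs, rng):
    """Move (ii): assign one randomly chosen task to a different processor."""
    new_binding = list(binding)
    task = rng.randrange(len(binding))
    new_binding[task] = rng.choice([p for p in range(n_procs) if p != binding[task]])
    return new_binding

# Hypothetical precedence constraints: task -> set of direct predecessors.
preds = {1: set(), 2: {1}, 3: {1}, 4: {2}, 5: {3, 4}}
order = [1, 3, 2, 4, 5]    # scheduling order sequence
binding = [0, 1, 0, 0, 1]  # binding[i] is the processor of task i + 1

rng = random.Random(0)
new_order = swap_adjacent(order, preds, rng)
new_binding = rebind(binding, n_procs=2, rng=rng)
```

Because move (i) only swaps tasks that are unordered with respect to each other, every schedule it produces remains a valid topological order, so no repair step is needed after a move.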
[Fig. 4: Flow of the modified simulated annealing procedure]
Yield Estimation and Error Analysis
To find the task schedule with maximum performance yield, every time a new schedule is constructed we need to estimate the performance yield obtained by applying this schedule to MPSoC chips. Clearly, given the process variation model and the MPSoC floorplan, the performance yield $Y$ is a function of the task schedule $S$. However, since the s-independence assumption between the execution times of tasks does not hold (due to the spatial correlation among processor cores), it is extremely difficult, if not impossible, to derive a closed-form expression for $Y$, because most properties of the Gaussian distribution cannot be applied directly to analyze the multivariate normal distribution. We therefore estimate the performance yield with Monte Carlo simulation, wherein the variation effects of the sample chips are generated according to the process variation model. The objective function $Y(S)$ is then approximated by the sample average

$$\hat{Y}(S) = \frac{1}{N} \sum_{i=1}^{N} x(\omega_i, S) \qquad (3)$$

where $\omega_1, \cdots, \omega_N$ are independent, identically distributed samples of the MPSoC frequency map, called test chips, and $x(\omega_i, S)$ is a binary value indicating whether the test chip with frequency map $\omega_i$ meets the performance constraints under task schedule $S$. For ease of discussion, we hereafter use $x_i$ to represent $x(\omega_i, S)$, $Y$ to represent $Y(S)$, and $\hat{Y}$ to represent $\hat{Y}(S)$.
For effective performance yield estimation, we are interested in the efficiency of Monte Carlo simulation, that is, how many test chips do we need to consider in order to achieve a certain accuracy? To answer this question, we perform theoretical analysis in the following.
Consider the set of $N$ test chips introduced above. Without loss of generality, we assume that $M$ of them meet the performance constraints, where $M \le N$. By Eq. (3), the performance yield $Y$ is approximated by

$$\hat{Y} = \frac{M}{N} \qquad (4)$$

Although we cannot say that $Y$ must lie within a certain range around $\hat{Y}$, we know that, with a desired confidence level, it stays within a confidence interval. To be specific, provided the variance $\sigma^2$ of the sampled outcomes $x_i$ is finite, we apply the central limit theorem and have

$$\Pr\left( |\hat{Y} - Y| < \frac{\beta\sigma}{\sqrt{N}} \right) \approx \Phi(\beta) - \Phi(-\beta) \qquad (5)$$

where $\beta\sigma/\sqrt{N}$ is the half length of the confidence interval, denoted by $\xi$ in the rest of this paper, $\Phi(\beta) - \Phi(-\beta)$ is the confidence level of the event that the approximated performance yield $\hat{Y}$ lies in $(Y - \xi, Y + \xi)$, and $\Phi(x)$ is the cumulative distribution function of the standard normal distribution.
The variance $\sigma^2$ in Eq. (5) cannot be derived from the multivariate normal distribution directly, but it can be approximated from the sample data [11], namely

$$\sigma^2 \approx \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \hat{Y})^2 = \frac{M(N-M)}{N(N-1)} \qquad (6)$$

Thus, we have

$$\xi = \beta \sqrt{\frac{M(N-M)}{N^2(N-1)}} \qquad (7)$$

which implies that, for a given confidence level, the length of the confidence interval depends on both the number of test chips $N$ and $M$. For instance, if 1,000 test chips are generated for performance yield estimation and the confidence level is set to 95%, the half length of the confidence interval lies between 0 and 0.031. The confidence interval around the approximated value is depicted in Fig. 6.
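The estimate and its confidence interval are simple to compute from the pass/fail outcomes of the test chips. The sketch below assumes the unbiased sample variance (divisor $N - 1$) for the binary outcomes, which reproduces the 0 to 0.031 range quoted for $N = 1{,}000$ at 95% confidence; the function names are illustrative.

```python
import math

def yield_estimate(outcomes):
    """Sample-average yield: fraction of test chips that meet the constraints."""
    return sum(outcomes) / len(outcomes)

def half_length(m, n, beta=1.96):
    """Half length of the confidence interval for m passing chips out of n.

    Assumes the unbiased sample variance of binary outcomes,
    m * (n - m) / (n * (n - 1)), and a normal quantile beta.
    """
    variance = m * (n - m) / (n * (n - 1))
    return beta * math.sqrt(variance / n)

# Worst case for 1,000 test chips: half of them pass.
outcomes = [1] * 500 + [0] * 500
y_hat = yield_estimate(outcomes)
xi = half_length(sum(outcomes), len(outcomes))  # about 0.031
```

The half length is largest when half of the chips pass and shrinks to zero when all or none of them pass, so the worst case gives a conservative bound when choosing $N$.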
Acceptance Condition
In the simulated annealing process, we need to compare the newfound task schedule with the original one in terms of performance yield to decide whether the newfound schedule should be accepted. Since we cannot obtain the exact value of the performance yield but only an approximation, we need to use the approximated values for the comparison (line 7 in Fig. 4).
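where $t$ is the value of the t-distribution with $(N - 1)$ degrees of freedom at the given confidence level, and $\sigma_{1,2}$ is the standard deviation of $(\hat{Y}_1 - \hat{Y}_2)$, which can be expressed in terms of the standard deviations $\sigma_1$, $\sigma_2$ and the correlation $\hat{\rho}_{1,2}$ between $\hat{Y}_1$ and $\hat{Y}_2$, that is,

$$\sigma_{1,2}^2 = \sigma_1^2 + \sigma_2^2 - 2\hat{\rho}_{1,2}\,\sigma_1\sigma_2$$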
We therefore conclude that, at the given confidence level, $Y_1 > Y_2$ if and only if […] $> 0$. With these arguments, the acceptance condition of a newfound solution is set to
where the latter is to accept a "bad" schedule with certain probability during the annealing process to jump out of a local optimal solution.
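A possible shape of such an acceptance test is sketched below. It rests on several assumptions that are not taken from the paper: both schedules are evaluated on the same $N$ test chips, so the spread of the difference is computed from paired per-chip outcomes; the comparison uses the lower confidence bound $(\hat{Y}_{new} - \hat{Y}_{old}) - t\,\sigma_{1,2}/\sqrt{N}$; and a schedule that fails this test is accepted with a Metropolis-style probability $\exp(\text{bound}/T)$. The default $t = 1.962$ approximates the two-sided 95% quantile for 999 degrees of freedom, and all names are illustrative.

```python
import math
import random

def diff_std(x_new, x_old):
    """Sample standard deviation of the paired differences x_new[i] - x_old[i]."""
    n = len(x_new)
    d = [a - b for a, b in zip(x_new, x_old)]
    mean = sum(d) / n
    return math.sqrt(sum((v - mean) ** 2 for v in d) / (n - 1))

def lower_bound(x_new, x_old, t=1.962):
    """Lower confidence bound of the yield gain of the new schedule."""
    n = len(x_new)
    gain = (sum(x_new) - sum(x_old)) / n
    return gain - t * diff_std(x_new, x_old) / math.sqrt(n)

def accept(x_new, x_old, temperature, rng, t=1.962):
    """Accept if significantly better, else with an assumed Metropolis probability."""
    bound = lower_bound(x_new, x_old, t)
    if bound > 0:
        return True
    return rng.random() < math.exp(bound / temperature)

# Per-chip pass/fail outcomes of two schedules on the same 1,000 test chips.
x_old = [1] * 500 + [0] * 500
x_new = [1] * 600 + [0] * 400
rng = random.Random(0)
better = accept(x_new, x_old, temperature=1e-3, rng=rng)        # significantly better
worse_cold = accept(x_old, x_new, temperature=1e-3, rng=rng)    # rejected at low T
worse_hot = accept(x_old, x_new, temperature=100.0, rng=rng)    # likely accepted at high T
```

Evaluating both schedules on the same test chips makes their estimates positively correlated, which reduces $\sigma_{1,2}$ and lets smaller yield differences be resolved with the same $N$.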
Clustering-Based Performance Yield Enhancement
With the initial task schedule derived earlier, some test chips might not meet the performance constraints. Consequently, we need to generate more task schedules to enhance the performance yield. This requires collecting the test chips that cannot meet their deadlines (referred to as residual test chips or $W_{bias}$ hereafter) and extracting the characteristics of their frequency maps. In this subsection, we use the k-means algorithm [19], a data clustering technique, to classify these test chips into a few clusters and generate additional task schedules to further improve performance yield.
As illustrated in Fig. 7, at the beginning of our algorithm we only have the initial task schedule, and we initialize the task schedule set $S$ to be $\{S_{init}\}$. At this point, the total performance yield $Y_{total}$ is simply the performance yield achieved by using this single task schedule. Next, in each iteration a new task schedule $S_{best}$ is generated according to the characteristics of the residual test chips, and the performance yield […]

To generate a new task schedule, we first classify the residual test chips into $k$ clusters according to their frequency maps by using the k-means algorithm [19]. This algorithm first makes an initial guess for the centroid of every cluster. Next, it assigns every point to the cluster whose centroid is the nearest and then recomputes the centroids of the clusters in an iterative manner. This procedure repeats until the centroids no longer change. For example, suppose a task graph with 31 tasks is assigned onto an MPSoC with two processor cores, and 79.5% of the 1,000 test chips are able to meet the performance constraints by using the initial task schedule (i.e., Cluster 0 in Fig. 8). The residual test chips are categorized into three clusters and plotted in Fig. 8. Here, the number of clusters $k$ is initialized to a user-defined value $k_{init}$ before task schedule generation.
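The clustering step can be sketched with a plain k-means loop over frequency maps. This is an illustration under stated assumptions rather than the setup of [19]: the residual test chips are synthetic two-core frequency maps drawn around three hypothetical operating points, initial centroids are drawn at random from the data, and Euclidean distance is used.

```python
import numpy as np

def assign(points, centroids):
    """Index of the nearest centroid (Euclidean distance) for every point."""
    dist = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return dist.argmin(axis=1)

def kmeans(points, k, rng, max_iter=100):
    """Plain k-means; initial centroids are k distinct points drawn at random."""
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        labels = assign(points, centroids)
        # Recompute each centroid as the mean of its members (keep it if empty).
        new_centroids = np.array([
            points[labels == c].mean(axis=0) if np.any(labels == c) else centroids[c]
            for c in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, assign(points, centroids)

# Hypothetical residual test chips: frequency maps (GHz) of a two-core MPSoC.
rng = np.random.default_rng(1)
residual = np.vstack([
    rng.normal([1.8, 2.2], 0.03, size=(70, 2)),
    rng.normal([2.2, 1.8], 0.03, size=(70, 2)),
    rng.normal([1.8, 1.8], 0.03, size=(65, 2)),
])
centroids, labels = kmeans(residual, k=3, rng=rng)
```

Each row of `centroids` is a representative frequency map for one cluster, which is the quantity a per-cluster schedule would be optimized against.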
Next, a task schedule is generated for each cluster (lines 4-5). To reduce runtime, we start from $S_{init}$ and change the resource binding and/or the scheduling order a few times, using the frequency map of the centroid for makespan calculation. To make most of the test chips in a cluster meet the performance constraints, the task schedule that results in the shortest makespan during this procedure is accepted and denoted by $S_i$. Note that we assume the raw data of each task is prepared for one category of processors only, where the category is determined by $S_{init}$. That is, for heterogeneous MPSoCs, if task $\tau_i$ is allocated onto a processor belonging to category $v_j$, then in the clustering-based performance yield enhancement it may be assigned to processors in category $v_j$ only.

Then, the task schedule with the minimum makespan is selected and marked as $S_{best}$ (line 6). It is possible that the resulting $S_{best}$ provides no benefit in terms of $Y_{total}$, because we use the centroid of each cluster for task scheduling. In this case, we increment $k$ by one to shrink the cluster size and perform the task schedule generation again (line 14). Otherwise, $S_{best}$ is included in the schedule set $S$ (line 9) and $W_{bias}$ is recomputed (line 11).

The last step (line 15) is to generate the task schedule selection criteria by using a multilayer perceptron [8], a machine learning technique. At run time, the scheduler selects a task schedule according to these criteria. For this purpose, we build a multilayer sigmoid network with one input layer, one hidden layer, and one output layer, as shown in Fig. 9. This network is trained off-line with the well-studied backpropagation algorithm [8], taking test chips and their corresponding feasible task schedules as training samples.
That is, we train the weight parameters of the network ($u$ and $w$) such that, given the input vector $f$ (i.e., the frequency map of an MPSoC), the network provides an output vector $s$ indicating which task schedules are able to meet the performance constraints. For instance, if the second task schedule is able to meet the performance constraints for a particular MPSoC chip, the inputs are the frequency values of the processors on this MPSoC and the outputs should be $s = (0, 1, 0, \cdots, 0)$. Note that both the hidden layer and the output layer employ the sigmoid function, because the value of output node $i$ naturally implies the probability that the $i$-th task schedule meets the performance constraints. It is also worth noting that the storage overhead for the network is acceptable, since the number of task schedules is bounded by $U_{max}$.
At run time, given an MPSoC product, its exact frequency map is available through mature speed binning techniques. The scheduler takes this information as the input of the selection criteria network and computes the output vector with forward propagation. If $s_i$ has the highest value among all the elements of $s$, the $i$-th task schedule is loaded and executed by the scheduler. Since this selection is conducted only when the system is initialized before usage or when the frequency map changes due to aging effects, the performance overhead of schedule selection is negligible.
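The run-time selection amounts to one forward pass through the network followed by an argmax. The sketch below is a minimal illustration: the weight matrices play the role of the trained parameters $u$ and $w$ but are hand-picked here, the bias terms are an added assumption, and the input frequency map is hypothetical.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

def select_schedule(f, u, b_hidden, w, b_out):
    """Forward propagation through one hidden layer, then pick the best schedule.

    f: frequency map of the chip (one value per processor)
    u, b_hidden: hidden-layer weights and (assumed) biases
    w, b_out: output-layer weights and (assumed) biases
    Returns the 0-based index of the selected schedule and the output vector s.
    """
    h = sigmoid(u @ f + b_hidden)
    s = sigmoid(w @ h + b_out)
    return int(np.argmax(s)), s

# Hand-picked illustrative parameters: 2 processors, 2 hidden nodes, 3 schedules.
u = np.eye(2)
b_hidden = np.zeros(2)
w = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
b_out = np.zeros(3)

f = np.array([0.0, 5.0])  # hypothetical (normalized) frequency map
index, s = select_schedule(f, u, b_hidden, w, b_out)
```

With these parameters the second output node is the largest, so the schedule with 0-based index 1 would be loaded. The cost is two small matrix-vector products, which is consistent with the claim that selection overhead is negligible.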
EXPERIMENTAL RESULTS
To evaluate the effectiveness and efficiency of the proposed algorithm, we conduct a set of experiments on hypothetical MPSoCs. The task graphs are generated by TGFF [7], and their attributes are described in Table 1. These task graphs are allocated onto hypothetical MPSoCs whose floorplans are given in Table 2. The process variation model is assumed to follow a multivariate normal distribution with spatial correlation [15]. The variation $\sigma_{V_{th}}$ is set to 3.2%, and the distance past which the correlation becomes zero is set to $\phi = \{0.1, 0.5\}$ [15] unless noted otherwise.
The simulated annealing parameters are set to $I = 10^3$, $T_{init} = 10^2$, $T_{end} = 10^{-3}$, and $R_{cooling} = 0.9$. For performance yield estimation, 1,000 test chips are generated with the process variation model (i.e., $N = 10^3$), and the confidence level is set to 95% ($\beta = 1.96$). In the performance yield enhancement, at most 10 task schedules are generated, that is, $U_{max} = 10$. In addition, the number of clusters is initialized to $k_{init} = 5$.

Existing variation-aware task allocation and scheduling techniques (e.g., [20, 17]) calculate MPSoC performance yield under the assumption that task execution times follow a Gaussian distribution and that tasks executed on different processor cores are s-independent. We use one example schedule to show the difference between the performance yield calculated in this way and our estimate based on Monte Carlo simulation. The task schedule is generated by using the algorithm proposed in [3], an extension of the classic Highest Levels First with Estimated Times (HLFET) algorithm [1] for heterogeneous systems, for the case with $G_b$, $P_{homo}$, $d = 350$. On the one hand, we generate a set of test chips with the process variation model and estimate the performance yield with a 95% confidence level, for $\phi$ from 0 to 0.5. On the other hand, we use a curve fitting technique to approximate the execution time of each task with a Gaussian distribution and calculate the performance yield according to [20] under the s-independence assumption. The significant difference between the results obtained with these two methods is illustrated in Fig. 10. Clearly, the performance yield calculated in prior work is rather inaccurate.
Because the performance yields are obtained differently, the proposed method and prior work actually have different optimization objectives, and it would not be fair to compare the effectiveness of the task schedules obtained from them. For this reason, we take [3] instead of the existing process variation-aware scheduling results as the baseline solution in our experiments. Since this baseline solution does not consider process variation effects, it should be treated as a reference only. Table 3 compares the performance yield achieved by the various approaches, where QS labels the results obtained with the proposed quasi-static scheduling technique. The difference between QS and the baseline (namely, $\Delta_2$ in the table) is in the range of 42.3-99.9%. When spatial correlation is strong ($\phi = 0.5$), the improvement is mainly credited to the initial task schedule $S_{init}$, while the effectiveness of the clustering-based enhancement is limited (e.g., the case with $G_b$, $P_{homo}$, $\phi = 0.5$, $d = 300$). This is because the processors on the same MPSoC tend to have similar variations, and a task schedule suited to this characteristic is able to bring the performance yield to a high level. Given weak spatial correlation ($\phi = 0.1$), on the other hand, the performance yield enhancement tends to rely on clustering-based task schedule generation (e.g., the case with $G_b$, $P_{hete}$, $\phi = 0.1$, $d = 275$). In this case, the deviation among frequency maps is significant. As a result, the simulated annealing-based technique, which considers the overall characteristics of all chips, is not as effective as in the strong correlation case. Under such circumstances, the test chips form separate clusters, and hence generating an individual task schedule for each cluster is more effective.
In Fig. 11, we show the tradeoff between the number of task schedules and the corresponding performance yield. For comparison, we also run the proposed simulated annealing-based task scheduling for the same cases while keeping more schedules (at most $U_{max}$) with the highest performance yield during the search process; this variant is referred to as SuperSA. In most cases, the initial task schedule is the most effective one. An example is shown in Fig. 11(a), where $S_{init}$ provides a 36.9% performance yield improvement over the baseline solution (0.2% in this case). The remaining schedules in set $S$ further enhance the result to 59.3%. SuperSA, however, yields only a slight improvement with more task schedules, around 3.9%. The observations in Fig. 11(b) are similar: the proposed QS results in higher performance yield than SuperSA. These results demonstrate the effectiveness of our clustering-based performance yield enhancement technique.
CONCLUSION
In this paper, we present a novel quasi-static variation-aware task allocation and scheduling technique for MPSoC designs. Based on a more accurate performance yield estimation method, the proposed simulated annealing-based scheduling algorithm, together with the novel clustering-based performance yield enhancement technique, can significantly improve the performance yield of MPSoCs.
ACKNOWLEDGEMENT
