Abstract-Summary and Conclusions: A novel methodology is proposed for designing fault-tolerant real-time multi-processor systems-on-a-chip to achieve optimal productivity. The methodology employs heterogeneous built-in self-repair (BISR), based on graceful degradation and yield enhancement techniques, as an embedded optimization engine. The technique exploits the flexibility provided in the task-level scheduling and algorithm selection steps. A hardware fault model is developed for modern superscalar processors and multi-processors, which enables an efficient treatment of the synthesis and compilation goals. For the first time, heterogeneous BISR is used at the task level. The key idea is to adapt scheduling and algorithm selection to the available nonfaulty resources. If there is a fault in memory, the algorithms that use less memory are selected, and the scheduler exploits the other abundant resource, viz, the processors, more vigorously to compensate for the loss of part of the memory. Similarly, a fault in a processor is backed up by memory. The synthesis approach minimizes the degradation in performance for single or multiple faults using simulated annealing-based algorithm selection, scheduling, and assignment algorithms. On a large set of examples, this adaptive algorithm selection and scheduling technique achieved a significant improvement in throughput compared to conventional nonadaptive schemes. The experimental results also indicate that a significant improvement in productivity can be achieved by using the extra throughput gained from the technique.
I. INTRODUCTION

A. Motivation
SINCE the size of average application-specific systems and integrated systems has been doubling every year, the focus of behavioral and system-level synthesis has been shifting from single-task applications to multiple-task applications. The increasing levels of integration and aggressive deep-submicron technologies imply a greater need for fault-tolerance. While classical fault-tolerance techniques such as duplication and triplication provide fault tolerance in a straightforward manner, the overhead is usually too high for many cost-sensitive application domains such as consumer electronics and personal communication devices.
Recently, [5] proposed low-overhead heterogeneous BISR for single-application ASICs. This paper shows that the application domain of heterogeneous BISR can be appreciably enlarged when intertask relationships are explored. In some sense, the technique is also a step forward in the level of heterogeneity with BISR. While the technique due to [5] was able to back up only different types of execution units using a given type of execution unit, the new approach uses synthesis flexibility to back up memory with execution units or vice versa.
To the best of our knowledge, this is the first heterogeneous system-level synthesis approach which explores task-level scheduling and algorithm selection flexibility for design of fault-tolerant systems. The experimental results clearly show the effectiveness of the new technique.
B. Motivational Example
This subsection provides a motivational example for each of 2 topics: i) the adaptive fault-tolerant algorithm selection and scheduling problem, and ii) the design of a fault-tolerant system to optimize productivity [5], which is defined to be the ratio of the relative change in yield [20] to the relative change in area. First, the adaptive fault-tolerant algorithm selection and scheduling problem is introduced using the motivational example in Table I. Three homogeneous processors and 330 units of shared memory are allocated. Five periodic tasks are given.
Assumption (To Simplify This Example):
All the tasks have the same period.
Each task has two different algorithmic choices. Each algorithm requires a certain amount of shared memory to execute, so contention for shared memory can cause some tasks to wait even if processors are available for their execution.
TABLE I. AN INSTANCE OF THE ADAPTIVE FAULT-TOLERANT ALGORITHM SELECTION AND SCHEDULING PROBLEM; NUMBER OF ALLOCATED PROCESSORS = 3; AMOUNT OF ALLOCATED SHARED MEMORY = 330
Fig. 1. Execution-time and memory-usage of the algorithm selection and schedule S3.
To switch from one task to another, a context switching-time overhead is incurred. The context switching-time includes the overhead of loading the task's code/data from permanent storage to shared memory, the state update in the processor, and other operating system overheads. Assumption: As soon as the context switch for the new task begins, the peak memory requirement for the task must be reserved. If the available memory is less than the memory requirement, then the context switch has to wait until the required memory is available.
Therefore, a task can wait to save memory usage if there is idle time available on a processor. There is an idle period between tasks E and A on Processor 1 in Fig. 1 due to this policy.
Assumption: A task to be scheduled next is assigned to the first available processor. If there are many processors available, then the processor with the minimum amount of context switching overhead is selected.
Typical real-time operating systems have context switching times of tens of microseconds, which is usually large compared to typical communication or execution times of one iteration of a periodic hard real-time task.
An optimal algorithm selection and schedule takes 95 units of time per period. Fig. 1 illustrates how the execution time of a schedule is computed. When 1 of the 3 processors is down due to a permanent physical fault, this fault-free optimal schedule takes 151 units of time, while another algorithm selection and schedule takes 118 units of time. When 110 units of memory become unavailable, the fault-free optimal schedule takes 153 units of time, while another algorithm selection and schedule takes 114 units of time. Instead of always using the fault-free optimal algorithm selection and schedule, consider using the following adaptive algorithm selection and schedule: i) the 95-unit schedule for the fault-free situation, ii) the 118-unit schedule when a processor is faulty, iii) the 114-unit schedule when a part of the memory is faulty. Using this adaptive algorithm selection and schedule, the system throughput can be improved by 21.9% when a processor is faulty, and by 25.5% when there is a fault in the memory subsystem. If the required schedule length is in the range [118, 151] for a faulty processor situation and [114, 153] for a faulty memory situation, the graceful degradation of throughput by adapting the algorithm selection and task schedule to the currently available resources renders the system operational. On the other hand, in order to tolerate a fault in a processor or memory in a similar situation, the classical schemes require duplication of memory and processors.
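The quoted improvements can be reproduced with a short calculation. The sketch below assumes that the improvement is measured as the reduction in schedule length relative to the nonadaptive schedule; this interpretation is not stated explicitly in the text, but it matches both reported figures. The function name `throughput_gain` is hypothetical.

```python
def throughput_gain(nonadaptive_len: float, adaptive_len: float) -> float:
    """Percentage reduction in schedule length relative to the nonadaptive schedule.

    Assumption: this is the measure behind the percentages quoted in the text.
    """
    return 100.0 * (nonadaptive_len - adaptive_len) / nonadaptive_len


# Schedule lengths taken from the motivational example.
processor_fault_gain = throughput_gain(151, 118)  # one processor faulty
memory_fault_gain = throughput_gain(153, 114)     # 110 memory units faulty

print(f"{processor_fault_gain:.1f}% {memory_fault_gain:.1f}%")
```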
Next, consider a motivational example for the design of a fault-tolerant system to optimize productivity.
Assumption: A nonredundant system with 8 processors and 8 memory units is the base-system to compare with redundant systems.
Assumption: (For simplicity) The probability of a good processor and the probability of a good memory unit are both 0.9; and a processor and a memory unit take the same area in a chip. As an initial step, all nonredundant configurations that satisfy the throughput constraint are found, using the proposed BISR-approach as an optimization engine; they are provided in Table II. "# Memory" in column 4 of Table II is the number of memory units, where a single memory unit becomes unavailable by a single fault. Consider 2 redundant systems which employ 9 processors and 8 memory units, and 8 processors and 9 memory units, respectively. The yield for the first redundant system becomes 0.334, and the yield for the second one becomes 0.482, while both systems incur the same amount of area penalty. To show how the yield is calculated, use the first redundant system as an example. The yield for this first redundant system is the probability that the system works after manufacturing. There are two such cases: i) 9 processors and 8 memory units work, or ii) exactly 8 of the 9 processors and 8 memory units work. The probability of case i is $0.9^{9} \cdot 0.9^{8} \approx 0.167$; the probability of case ii is $\binom{9}{8}\, 0.9^{8} \cdot 0.1 \cdot 0.9^{8} \approx 0.167$. Similarly, the yield for the second redundant system is the probability that: i) 7 or 8 processors, and 9 memory units work, or ii) 8 processors and 8 memory units work. The productivity of the second redundant system is greater by a factor of 1.44 than that of the first. The details of this methodology to find an optimal design for productivity are in Section V.
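The two yield values can be checked numerically. The following is a minimal sketch, assuming independent component failures and assuming that the feasible nonredundant configurations relevant to this example are (8 processors, 8 memory units) and (7 processors, 9 memory units), as implied by the cases listed above; Table II itself is not reproduced here, so `FEASIBLE` is a stand-in and all function names are hypothetical.

```python
from math import comb


def exactly(n: int, k: int, p: float) -> float:
    """Probability that exactly k of n independent components are good."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def system_yield(n_proc, n_mem, feasible, p_proc=0.9, p_mem=0.9):
    """Probability that the working components cover some feasible configuration."""
    total = 0.0
    for good_proc in range(n_proc + 1):
        for good_mem in range(n_mem + 1):
            # The chip is usable if the good parts include a feasible configuration.
            if any(good_proc >= p and good_mem >= m for p, m in feasible):
                total += exactly(n_proc, good_proc, p_proc) * exactly(n_mem, good_mem, p_mem)
    return total


# Assumption: feasible (processors, memory units) pairs inferred from the text.
FEASIBLE = [(8, 8), (7, 9)]

yield_9p_8m = system_yield(9, 8, FEASIBLE)  # first redundant system
yield_8p_9m = system_yield(8, 9, FEASIBLE)  # second redundant system
print(round(yield_9p_8m, 3), round(yield_8p_9m, 3))
```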
C. Paper Organization
Section II reviews related work on scheduling, algorithm selection, and behavioral and architectural level fault-tolerance techniques. Section III summarizes the selected computation, hardware, and fault models, and describes a global synthesis approach for designing a fault-tolerant system to optimize productivity. Sections IV and V are the technical core of the paper; Section IV formally defines the problem of heterogeneous BISR using adaptive algorithm selection and scheduling, presents a synthesis method for the problem, and provides comprehensive experimental results. Section V defines the problem of designing a fault-tolerant system with optimal productivity, provides a synthesis method for the problem, and presents experimental results.
II. PREVIOUS WORK
Scheduling has been widely studied in many areas, such as behavioral synthesis [13] , parallel processing [4] , and hard real-time systems [12] . Recently, algorithm selection has been recognized as an important system-level synthesis topic, and several approaches have been proposed [15] .
Behavioral synthesis provides the mechanism for design-space exploration so that a variety of design goals can be optimized [2] , [13] . Much of the behavioral synthesis research has targeted the optimization of area, speed (throughput), and (more recently) power and testability.
Relatively little work has been reported on behavioral-level synthesis techniques for fault-tolerant design. Reference [18] concentrates on designs with self-recovery from transient faults using micro roll-back and checkpoint insertion. References [7], [8], [14] develop a spectrum of behavioral techniques to minimize the fault-tolerance overhead in application-specific designs. Recently, [6] developed a high-level synthesis approach to optimize manufacturability using redundant interconnects. Reference [5] develops a heterogeneous BISR behavioral synthesis system, which explores flexibility in scheduling a single task during high-level synthesis, so that the resulting design is operational when there are no more than a given number of faulty units.
For a review of known fault-tolerance techniques, [19] provides comprehensive lists of relevant references.
III. PRELIMINARIES
This section outlines the relevant preliminaries. In particular, it describes the selected hardware, fault, and computation models, and outlines a global design flow of the fault-tolerant system.
A. Hardware and Fault Model
The hardware model being considered is shown in Fig. 2. The processors, interconnect, and memory shown in the figure are placed in a single-chip multi-processor platform. It is assumed that the area of a chip is the sum of the areas of the processors and memory units, which account for a major portion of the overall area of the chip [1].
Assumption: The ratio of the area between processor and memory is provided.
Assumption: The increase in the area of a chip by the addition of a processor or memory is just the area of the added component, regardless of how many processors and memory units are in the current configuration.
It is virtually impossible to replace faulty processors, memory, and interconnects in a single-chip system. Faults can occur in a processor, a memory system, or an interconnect. Only permanent physical faults are considered here. A fault in an interconnect can be regarded as a fault in a processor: a faulty interconnect prevents its corresponding processor from receiving data from the memory system. A fault in a memory system renders part or all of the memory unusable. A fault in a processor takes a single processor down. The identification of the faulty parts can be done by either offline testing before packaging or online testing. The controller of the system is reconfigured upon detection of a fault.
B. Computational Model
The syntax of a targeted computation is defined as a hierarchical control-data flow graph (CDFG) [17]. The CDFG represents the computation as a flow graph, with nodes, data edges, and control edges. The semantics underlying the syntax of the CDFG format is homogeneous synchronous data flow (HSDF) [11], which assumes a semi-infinite or very long input stream of data arriving and being processed at periodic intervals, imposed by the nature of the specified applications. The HSDF model is well suited for the specification of single-task computations in numerous application domains such as digital signal processing, video and image processing, broadband and wireless communications, control, information and coding theory, and multimedia.
Assumption: There are no data, control, or timing dependences among tasks. If there are dependences, the tasks are merged in a new composite single task.
Assumption: All tasks are periodic with the same period. This assumption implies that all the tasks must be completed before the next period starts. Again, note that with no loss of generality, using the least common multiple theorem [10], a set of tasks with arbitrary periods can be transformed into an equivalent set of tasks with the same period in polynomial time. Since context switching overhead is usually high, we assume no task preemption. To execute a task, there must exist an unassigned processor and the required amount of unassigned operating memory for loading the task's code and data from the permanent secondary storage. Partly due to the loading of code and data, a context switch time is incurred before a new task starts its execution. The context switch times between tasks are described by a matrix.
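The same-period assumption can be illustrated with a small sketch of the least-common-multiple transformation. It is a simplified illustration rather than the construction in [10]: each task is replicated once per original period inside the common period, and the release-time constraints of the individual instances are ignored. The function name and the instance naming scheme are hypothetical.

```python
from math import lcm  # available in Python 3.9+


def unroll_to_common_period(periods: dict[str, int]) -> tuple[int, list[str]]:
    """Replace tasks with arbitrary periods by task instances sharing one period.

    The common period is the least common multiple of all periods; a task with
    period p contributes (common_period // p) instances per common period.
    """
    common_period = lcm(*periods.values())
    instances = []
    for name, period in periods.items():
        for k in range(common_period // period):
            instances.append(f"{name}#{k}")
    return common_period, instances


# Hypothetical task set: A every 2 time units, B every 3, C every 6.
common_period, jobs = unroll_to_common_period({"A": 2, "B": 3, "C": 6})
print(common_period, jobs)
```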
IV. HETEROGENEOUS BISR USING ADAPTIVE ALGORITHM SELECTION AND SCHEDULING
This section formulates the adaptive algorithm selection and scheduling problem and establishes the computational complexity of the problem. Next, it proposes a synthesis method and presents experimental results.
A. Problem Formulation and Complexity
The goal is to achieve the graceful degradation of throughput in the presence of faults in the system using the adaptive algorithm selection and scheduling technique. The adaptive algorithm selection and scheduling problem can be formally stated as follows:
Problem: Throughput Optimization Using Adaptive Algorithm Selection and Scheduling
Instance: A set of tasks with algorithmic choices that have processor and memory requirements, a number of homogeneous processors, a number of memory units, a context switch time matrix CST, and a period.
Question: Is there a selection of algorithms and a schedule of tasks such that, in the presence of any subset of at most K1 faulty processors and any subset of at most K2 faulty memory units, the resulting schedule length is at most the period?
Theorem 1: The Throughput Optimization Using Adaptive Algorithm Selection and Scheduling problem is NP-complete.
Proof: The problem is proved to be NP-complete by using Karp's polynomial transformation technique from the traveling salesman problem, which is well known to be NP-complete [3]. Consider a special case of the original problem formulation. Suppose there is a single processor with unlimited memory units, and all tasks have zero execution time. Time is consumed only when the processor switches from one task to another, as described by the context switch time matrix CST. Assume that the CST matrix is symmetric. Let each task denote a city and the CST matrix denote a distance matrix between the cities. The reduced problem is the traveling salesman problem. Because a special case of the problem is NP-complete, the original problem is NP-complete.
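The special case used in the proof can be made concrete with a brute-force sketch. It assumes that, because the tasks repeat every period, the schedule length of a task order on the single processor is the cost of the corresponding closed tour (including the switch from the last task back to the first); this reading, the example matrix, and the function names are illustrative assumptions.

```python
from itertools import permutations


def schedule_length(order, cst):
    """Total context switch time of a cyclic task order on one processor."""
    n = len(order)
    return sum(cst[order[i]][order[(i + 1) % n]] for i in range(n))


def best_order(cst):
    """Exhaustively search all task orders (exponential, as expected for TSP)."""
    n = len(cst)
    # Task 0 is fixed in first position: rotations of a cycle have equal cost.
    candidates = ((0,) + rest for rest in permutations(range(1, n)))
    return min(candidates, key=lambda order: schedule_length(order, cst))


# Hypothetical symmetric context switch time matrix for 4 zero-length tasks.
CST = [
    [0, 2, 9, 10],
    [2, 0, 6, 4],
    [9, 6, 0, 3],
    [10, 4, 3, 0],
]

order = best_order(CST)
length = schedule_length(order, CST)
print(order, length)
```

Each task plays the role of a city and each context switch time the role of a distance, so minimizing the schedule length here is exactly a tour-length minimization.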
B. Synthesis Approach
Because the computational complexity of the algorithm selection and scheduling problem forbids an exact or optimal solution, a general combinatorial optimization technique known as simulated annealing (SA) [9] has been used to obtain nearly optimal throughput, given the faults in processors and memory units. The technique can be applied to all the scenarios considered, with some straightforward changes. The SA algorithm is in Fig. 3 .
The actual implementation details are presented for each of the following areas; the cost function, the neighbor solution generation, the temperature update function, the equilibrium criterion, and the termination criterion.
1) The actual schedule length for the given schedule is used as the cost function.
2) A neighbor solution is generated by a random choice between 2 perturbations: i) the interchange of 2 tasks in the schedule, and ii) a change of the algorithm of a task in the schedule.
3) The temperature is updated by multiplying it by a cooling factor, which is a function of the temperature. In the high-temperature region, where every new state has a very high chance of acceptance, the factor is chosen to be 0.1 so that the temperature reduction occurs very rapidly. In the intermediate region, the factor is set to 0.95 so that the optimization process explores this promising region more slowly. In the low-temperature region, the factor is set to 0.8 so that the temperature is reduced relatively quickly to converge firmly to a local minimum. The initial temperature is set to 4 000 000.
4) The equilibrium criterion is specified by the number of iterations of the inner loop. The number of iterations of the inner loop is set to 20 times the number of tasks.
5) The termination criterion is given by the temperature. If the temperature falls below 0.1, the simulated annealing algorithm stops.
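The five ingredients above can be sketched as follows. This is an illustrative sketch, not the algorithm of Fig. 3: the region boundaries `hi` and `lo` of the cooling schedule, the simplified list scheduler used as the cost function (earliest-free processor, nondecreasing start times, no switch cost for the first task on a processor), and the task data are all assumptions introduced here.

```python
import math
import random


def schedule_length(order, choice, tasks, cst, n_proc, mem_total):
    """Cost function: length of a simplified list schedule (assumed model)."""
    proc_free, proc_prev = [0.0] * n_proc, [None] * n_proc
    running, last_start = [], 0.0  # running: (finish_time, memory) pairs
    for t in order:
        exec_time, mem = tasks[t][choice[t]]
        if mem > mem_total:
            return math.inf  # this algorithm can never be loaded
        p = min(range(n_proc), key=lambda i: proc_free[i])
        start = max(proc_free[p], last_start)
        # Memory is reserved when the context switch begins; wait if needed.
        while sum(m for f, m in running if f > start) + mem > mem_total:
            start = min(f for f, m in running if f > start)
        switch = 0 if proc_prev[p] is None else cst[proc_prev[p]][t]
        finish = start + switch + exec_time
        running.append((finish, mem))
        proc_free[p], proc_prev[p], last_start = finish, t, start
    return max(proc_free)


def cooling_factor(temp, hi=1000.0, lo=10.0):
    """Factors 0.1 / 0.95 / 0.8 are from the text; hi and lo are assumed."""
    return 0.1 if temp > hi else 0.95 if temp > lo else 0.8


def anneal(tasks, cst, n_proc, mem_total, t0=4_000_000.0, t_min=0.1, seed=0):
    rng, n = random.Random(seed), len(tasks)
    order, choice = list(range(n)), [0] * n
    cost = schedule_length(order, choice, tasks, cst, n_proc, mem_total)
    best, temp = (cost, order[:], choice[:]), t0
    while temp >= t_min:                      # termination criterion
        for _ in range(20 * n):               # equilibrium criterion
            new_order, new_choice = order[:], choice[:]
            if rng.random() < 0.5:            # perturbation i: swap 2 tasks
                i, j = rng.sample(range(n), 2)
                new_order[i], new_order[j] = new_order[j], new_order[i]
            else:                             # perturbation ii: change algorithm
                t = rng.randrange(n)
                new_choice[t] = rng.randrange(len(tasks[t]))
            new_cost = schedule_length(new_order, new_choice, tasks, cst, n_proc, mem_total)
            delta = new_cost - cost
            if delta <= 0 or rng.random() < math.exp(-delta / temp):
                order, choice, cost = new_order, new_choice, new_cost
                if cost < best[0]:
                    best = (cost, order[:], choice[:])
        temp *= cooling_factor(temp)          # temperature update
    return best


# Hypothetical instance: 5 tasks, each with 2 (execution time, memory) choices.
TASKS = [[(30, 120), (45, 60)], [(25, 100), (40, 50)], [(35, 150), (50, 70)],
         [(20, 90), (28, 45)], [(40, 130), (55, 65)]]
CST = [[0 if i == j else 5 + 2 * abs(i - j) for j in range(5)] for i in range(5)]

best_cost, best_order, best_choice = anneal(TASKS, CST, n_proc=3, mem_total=330)
print(best_cost, best_order, best_choice)
```

To model a fault, the same routine can be rerun with `n_proc` or `mem_total` reduced, which yields one schedule per fault scenario.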
C. Experimental Results
A set of random examples was generated by varying the number of tasks to show the effect of the problem size on the performance. Examples with 10, 20, 30, 40, and 50 tasks were tried. The number of different algorithmic choices for each task is randomly chosen between 2 and 5. The number of processors allocated is 5, and the amount of memory allocated is 1000 units. The processing times for tasks are chosen randomly between 10 and 100, while the memory requirements are randomly distributed between 10 and 350. The context switching times take values between 5 and 30. Each fault in processors is assumed to cause a processor to be unavailable. When a fault in memory occurs, 200 units of memory are assumed to become unavailable. Table III illustrates the effectiveness of the adaptive scheduling approach.
• The first number in each item represents the schedule length for the nonadaptive schedule which is obtained under the no-fault assumption.
• The second number represents the schedule length for the adaptive schedule, which is obtained under the appropriate fault situation.
• The third number provides the percentage improvement for the adaptive schedule from the nonadaptive one.
The CPU times reported are the running times on SUN Sparc 4. The adaptive scheduling approach achieves an average improvement of 13.1% compared to the nonadaptive approach. Four different sets of 12 tasks were constructed from the 16 real-life tasks in [16] . The tasks are:
• 2 GE controllers,
For each task, the number of operations and the area of implementation were used as the running time and the memory requirement, respectively. The context switching times are approximated such that they are proportional to the memory requirement: switching between similar tasks costs little overhead, and switching between different tasks requires a significant amount of overhead. The context switching costs are distributed between 10 and 173; 4 processors and 160 memory units are allocated. Each fault in memory renders 40 units of memory unusable. Algorithm selection is not considered because it is assumed that only 1 algorithmic choice is available for each task. Table IV presents the experimental results and clearly shows the effectiveness of this technique. The adaptive scheduling approach achieves an average improvement of 10.6% compared to the nonadaptive approach.
V. DESIGN OF FAULT-TOLERANT SYSTEM FOR OPTIMIZING PRODUCTIVITY
A. Global Design Flow
The goal is to design a fault-tolerant system with optimal productivity using the heterogeneous BISR-approach (see Section IV). Stage #1 of the design flow designs a nonredundant system that satisfies the throughput requirement, which is assumed to be given. Stage #2 finds all nonredundant system configurations that can satisfy the throughput requirement using the heterogeneous BISR-approach, which uses the task-level scheduling and algorithm selection flexibility. Stage #3 finds an optimal redundant configuration that maximizes productivity by considering both yield-enhancement and area-penalty, which are described in Section V-B.
B. Synthesis Method
The yield and area for a given system configuration must be computed to get the productivity measure. Let the given system use processors and memory units. The probability of a good processor is and the probability of a good memory unit is . All nonredundant system configurations are ; and are the number of processors and memory units, respectively;
. Observe that:
• if , then ;
• if ( , ) is a nonredundant system configuration, then ( , ) is a redundant system configuration when and .
The yield is computed using (1), which basically calculates the sum of the probabilities for all the possible functional configurations after manufacturing for a given number of processors and memory units. Each product term represents the probability that there are at least functional processors and memory units for . If , then the product term is 0. The area for a configuration is computed based on the area model in Section III.
Table legend: IY = initial yield; FY = final yield; PIF = improvement factor in productivity; n = number of processors in the final design; n = number of memory units in the final design. Initially, 8 processors and 8 memory units are used.
The algorithm to maximize productivity is based on enumeration, where, from every configuration, new configurations are generated by adding a processor or memory unit, and the further enumeration from the new configuration stops when the productivity of the new configuration is smaller than that of the previous configuration. The algorithm finds an optimal solution because the relative improvement in yield by adding a processor or memory decreases as the number of processors or memory units increases, and the relative increase in area by addition of a processor or memory stays constant regardless of how many processors and memory units are used in the current configuration.
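The enumeration with pruning can be sketched as below. The sketch assumes independent component failures, unit areas for processors and memory units, a productivity of 0 for the base configuration itself, and the feasible set {(8, 8), (7, 9)} inferred from the introductory example; none of these values are taken from the paper's tables, and all function names are hypothetical.

```python
from math import comb


def exactly(n, k, p):
    """Probability that exactly k of n independent components are good."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def system_yield(n_proc, n_mem, feasible, p_proc, p_mem):
    """Probability that the good components cover some feasible configuration."""
    return sum(
        exactly(n_proc, a, p_proc) * exactly(n_mem, b, p_mem)
        for a in range(n_proc + 1)
        for b in range(n_mem + 1)
        if any(a >= p and b >= m for p, m in feasible)
    )


def maximize_productivity(base, feasible, p_proc, p_mem, area_proc=1.0, area_mem=1.0):
    base_yield = system_yield(*base, feasible, p_proc, p_mem)
    base_area = base[0] * area_proc + base[1] * area_mem

    def pif(cfg):
        """Relative yield increase divided by relative area increase."""
        if cfg == base:
            return 0.0  # assumption: no improvement for the base system itself
        rel_yield = (system_yield(*cfg, feasible, p_proc, p_mem) - base_yield) / base_yield
        rel_area = (cfg[0] * area_proc + cfg[1] * area_mem - base_area) / base_area
        return rel_yield / rel_area

    best_cfg, best_pif = base, 0.0
    stack, expanded = [base], {base}
    while stack:
        cfg = stack.pop()
        current = pif(cfg)
        # Generate new configurations by adding one processor or one memory unit.
        for nxt in ((cfg[0] + 1, cfg[1]), (cfg[0], cfg[1] + 1)):
            value = pif(nxt)
            if nxt in expanded or value < current:
                continue  # stop enumerating once productivity drops
            expanded.add(nxt)
            stack.append(nxt)
            if value > best_pif:
                best_cfg, best_pif = nxt, value
    return best_cfg, best_pif


best_cfg, best_pif = maximize_productivity((8, 8), [(8, 8), (7, 9)], 0.9, 0.9)
print(best_cfg, round(best_pif, 2))
```

Under these assumptions the search prefers adding a ninth memory unit over a ninth processor, in line with the comparison of the two redundant systems in the introductory example.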
C. Experimental Results
The same random examples are used here as used in the previous section, with a change that 8 processors and 800 memory units are used, where 100 memory units are affected by a single fault in memory. The number of memory units reported in the paper is in terms of 100 memory units. Using the adaptive algorithm selection and scheduling, all the nonredundant system configurations have been identified and provided in Table V . The design of fault-tolerant systems with optimal productivity has been performed under various design parameters such as the probability of a good processor, the probability of a good memory unit, the area ratio of processor and memory, as suggested in [20] .
The results of redundant system with optimal productivity are provided in Tables VI and VII under ranges of values for the probabilities of a good processor and memory unit, where the first, second, third, fourth, and fifth numbers in each entry represent the initial yield, the final yield, PIF, the number of processors and memory units of the optimal design, respectively. PIF is defined as the relative yield increase divided by the relative area increase.
The relative yield increase is defined as: 'final yield minus the original yield' divided by the 'original yield'. The relative area increase is defined as the 'new area minus the old area' divided by the 'old area'. The running times for the entire design process ranged between several minutes and about an hour on Sun Sparc 4, depending on the number of tasks. The experimental results 
