We propose a novel methodology for designing fault-tolerant real-time systems that achieve optimal productivity on a single-chip multiprocessor platform. The methodology uses a heterogeneous built-in self-repair (BISR) based graceful degradation and yield enhancement technique as an embedded optimization engine, which exploits task-level scheduling and algorithm selection flexibility. We also develop a hardware fault model for modern superscalar processors and multiprocessors which enables an efficient treatment of the synthesis and compilation goals.
I. INTRODUCTION
Since the size of the average application-specific system has been doubling every year and the size of integrated systems has been growing at just a slightly lower rate, the focus of behavioral and system-level synthesis has been shifting from single-task applications to multiple-task applications. The increasing levels of integration and aggressive deep-submicron technologies imply an increased need for fault tolerance. While classical fault-tolerance techniques provide fault tolerance in a straightforward manner, their overhead is usually too high for many cost-sensitive application domains such as consumer electronics and personal communication devices.
Recently, Guerra et al. [2] proposed low-overhead heterogeneous BISR for single-task ASICs. In this paper, we show that the application domain of heterogeneous BISR can be significantly broadened when intertask relationships are explored. The technique is also a step forward in the level of heterogeneity of BISR. While the technique of Guerra et al. could only back up different types of execution units using a given type of execution unit, the new approach uses synthesis flexibility to back up memory with execution units or vice versa.
To the best of our knowledge, this is the first heterogeneous system-level synthesis approach which explores task-level scheduling and algorithm selection flexibility.
We consider motivational examples for the following two topics: (i) the adaptive fault-tolerant algorithm selection and scheduling problem and (ii) the design of fault-tolerant systems to optimize productivity [2], which is defined as the ratio of the relative change in yield [12] to the relative change in area.
Ramesh Karri, Department of Electrical and Computer Engineering, University of Massachusetts, Amherst, MA 01003 USA
First, we introduce the adaptive fault-tolerant algorithm selection and scheduling problem. There are 3 homogeneous processors and 330 units of allocated shared memory. Given are 5 periodic tasks. Each task has two different algorithmic choices. Each algorithm requires a certain amount of shared memory to execute, so contention for shared memory may cause some tasks to wait even if there are processors available for their execution. Switching from one task to another incurs a context switching overhead. The context switching time includes the overhead of loading the task's code and data from permanent storage to shared memory, the state update in the processor and other operating system overheads. Typical real-time operating systems have context switching times of several to tens of microseconds, which is usually large compared to typical communication or execution times of one iteration of a periodic hard real-time task.
An optimal algorithm selection and schedule $S^* = (A1, C2, D2, B1, E1)$ takes 95 units of time per period. Figure 1 illustrates how the execution time for $S^*$ is computed. When one of the 3 processors is down due to a permanent physical fault, $S^*$ takes 151 units of time while another algorithm selection and schedule $S_p = (B1, C2, D2, A1, E1)$ takes 126. When 110 units of memory become unavailable, $S^*$ takes 153 time units while another algorithm selection and schedule $S_m = (D2, B2, E2, A1, C2)$ takes 117 units. Instead of always using $S^*$, consider using the following adaptive algorithm selection and schedule: (i) $S^*$ for the fault-free situation, (ii) $S_p$ when a processor is faulty and (iii) $S_m$ when a part of the memory is faulty.
Using this adaptive algorithm selection and schedule, the system throughput can be improved by 16.6% when a processor is faulty and by 23.5% when there is a fault in the memory subsystem. If the required schedule length is in the range [126, 151] for a faulty processor situation and [117, 153] for a faulty memory situation, the graceful degradation of throughput obtained by adapting the algorithm selection and task schedule to the currently available resources keeps the system operational. In contrast, to tolerate a fault in a processor or in memory in a similar situation, the classical schemes require duplication of memory and processors.
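The arithmetic behind these percentages can be checked with a short sketch. It assumes that the quoted improvements are relative reductions in schedule length, which is the reading that reproduces 16.6% and 23.5%. The labels `S_p` and `S_m` and the helper names are chosen here for illustration only.

```python
# Schedule lengths quoted in the motivational example (time units per period).
# "S_p" and "S_m" are illustrative labels for the processor-fault and
# memory-fault schedules.
SCHEDULE_LENGTH = {
    "fault_free": {"S*": 95},
    "processor_faulty": {"S*": 151, "S_p": 126},
    "memory_faulty": {"S*": 153, "S_m": 117},
}


def best_schedule(condition):
    """Return (name, length) of the shortest known schedule for a condition."""
    options = SCHEDULE_LENGTH[condition]
    name = min(options, key=options.get)
    return name, options[name]


def improvement_percent(fixed_length, adaptive_length):
    """Relative reduction in schedule length, in percent."""
    return 100.0 * (fixed_length - adaptive_length) / fixed_length


for condition in ("processor_faulty", "memory_faulty"):
    name, length = best_schedule(condition)
    gain = improvement_percent(SCHEDULE_LENGTH[condition]["S*"], length)
    print(f"{condition}: use {name} ({length} units), {gain:.1f}% shorter than S*")
```

The lookup table plays the role of the reconfigured controller: once the fault condition is known, the precomputed schedule for that condition is selected.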
Next, we consider a motivational example for the design of fault-tolerant systems to optimize productivity. We assume that a nonredundant system with 8 processors and 8 memory units is the base system against which redundant systems are compared. We assume, for the sake of simplicity, that the probability of a good processor and the probability of a good memory unit are both 0.9, and that a processor and a memory unit take the same area on a chip. As an initial step, all nonredundant configurations that satisfy the throughput constraint are found using the proposed BISR approach as an optimization engine; they are listed in Table II. The 4th column in Table II is the number of memory units, where a single memory unit becomes unavailable by a single fault. Consider two redundant systems which employ 9 processors and 8 memory units, and 8 processors and 9 memory units, respectively. The yield of the first redundant system becomes 0.334 and the yield of the second becomes 0.482, while both systems incur the same area penalty. The productivity of the second redundant system is therefore greater than that of the first by a factor of 1.44. The details of our methodology for finding an optimal design for productivity are provided in Section V.
The rest of the paper is organized as follows. In the next section, we review the related work. In Section III, we summarize the selected computation, hardware and fault models. Sections IV and V are the technical core of the paper. Section IV formally defines the problem of heterogeneous BISR using adaptive algorithm selection and scheduling, presents a synthesis method for the problem and provides comprehensive experimental results. Section V defines the problem of designing a fault-tolerant system with optimal productivity, provides a synthesis method for the problem, and presents extensive experimental results. Finally, we conclude in Section VI.
II. PREVIOUS WORK
Recently, algorithm selection has been recognized as an important system-level synthesis topic, and several approaches have been proposed [9]. Karri and Orailoglu [4], [8] developed a spectrum of behavioral techniques which minimize the fault-tolerance overhead in application-specific designs. Iyer et al. [3] developed a high-level synthesis approach to optimize manufacturability using redundant interconnects. Guerra et al. [2] developed a heterogeneous BISR behavioral synthesis system, which explores flexibility in scheduling a single task during high-level synthesis so that the resulting design is operational when there are no more than k faulty units.
III. PRELIMINARIES
The hardware model being considered is shown in Figure 2. The processors, interconnect and memory shown in the figure are placed on a single-chip multiprocessor platform. We assume that the area of a chip is the sum of the areas of the processors and memory units, because the area of the interconnect is insignificant. The ratio of the area of a processor to that of a memory unit is assumed to be given. The increase in the area of a chip by the addition of a
It is hard or impossible to replace faulty processors, memory, and interconnects in a single-chip system. Faults can occur in a processor, a memory system, or an interconnect. We consider only permanent physical faults. A fault in an interconnect can be regarded as a fault in a processor: a faulty interconnect prevents its corresponding processor from receiving data from the memory system, as shown for P2. A fault in a memory system renders part or all of the memory unusable. A fault in a processor takes a single processor down, as shown for P1. Faulty parts are identified by testing before packaging. The controller of the system is reconfigured upon detection of a fault.
The syntax of a targeted computation is defined as a hierarchical control-data flow graph (CDFG) [11]. The semantics underlying the syntax of the CDFG format is homogeneous synchronous data flow (HSDF) [7].
We assume that there are no data, control, or timing dependences among tasks. If there are dependences, the dependent tasks are merged into a new composite single task. We also assume that all tasks are periodic with the same period. This assumption entails no loss of generality: using the least common multiple theorem [6], a set of tasks with arbitrary periods can be transformed into an equivalent set of tasks with the same period in polynomial time.
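The transformation to a common period can be sketched as follows, assuming integer periods and assuming that each task is simply replicated once per original period inside the common period. The task names and periods are hypothetical.

```python
from functools import reduce
from math import gcd


def hyperperiod(periods):
    """Least common multiple of all task periods."""
    return reduce(lambda a, b: a * b // gcd(a, b), periods)


def unroll_to_common_period(periods):
    """Return the common period and, for each task, the number of
    instances it contributes to one common period."""
    common = hyperperiod(periods.values())
    return common, {task: common // p for task, p in periods.items()}


# Hypothetical task set with periods 4, 6 and 10 time units.
common, copies = unroll_to_common_period({"T1": 4, "T2": 6, "T3": 10})
print(common, copies)
```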
Since the context switching overhead is usually high, we assume no task preemption. To execute a task, there must exist an unassigned processor and the required amount of unassigned operating memory for loading the task's code and data from permanent secondary storage. Partly due to the loading of code and data, a context switch time is required before a new task starts its execution. The context switch times are described by a matrix indexed by pairs of tasks.
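The execution model above can be illustrated with a small simulator. This is a sketch under stated assumptions rather than the evaluation procedure of Figure 1: tasks start strictly in the given order, each task takes the processor that becomes free first, memory is held from the start of the context switch until the task finishes, and the first task on a processor pays no context switch. The function name and the three-task data set are hypothetical.

```python
import heapq


def schedule_length(order, duration, memory, cst, num_procs, total_mem):
    """Finish time of the last task when tasks start strictly in `order`.

    duration[t], memory[t]: time and shared memory needed by task t.
    cst[a][b]: context switch time when b follows a on the same processor
    (assumed to be 0 for the first task on a processor).
    """
    # Heap of (time the processor becomes free, processor id, last task run).
    procs = [(0, p, None) for p in range(num_procs)]
    heapq.heapify(procs)
    running = []  # heap of (finish time, memory held)
    mem_free = total_mem
    now = 0
    makespan = 0
    for task in order:
        if memory[task] > total_mem:
            raise ValueError(f"task {task} can never fit in memory")
        free_at, proc, prev = heapq.heappop(procs)
        now = max(now, free_at)
        while True:
            # Reclaim memory from every task that has finished by `now`.
            while running and running[0][0] <= now:
                mem_free += heapq.heappop(running)[1]
            if mem_free >= memory[task]:
                break
            now = running[0][0]  # wait for the next task to release memory
        switch = cst[prev][task] if prev is not None else 0
        end = now + switch + duration[task]
        mem_free -= memory[task]
        heapq.heappush(running, (end, memory[task]))
        heapq.heappush(procs, (end, proc, task))
        makespan = max(makespan, end)
    return makespan


# Hypothetical instance: 2 processors, 10 units of shared memory.
duration = {"A": 5, "B": 3, "C": 2}
memory = {"A": 6, "B": 6, "C": 4}
cst = {a: {b: 0 if a == b else 1 for b in "ABC"} for a in "ABC"}
for order in (["A", "B", "C"], ["A", "C", "B"]):
    print(order, schedule_length(order, duration, memory, cst, 2, 10))
```

In this instance task B has to wait for memory even though a processor is idle, which is the contention effect described in the introduction.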
IV. HETEROGENEOUS BISR USING ADAPTIVE ALGORITHM SELECTION AND SCHEDULING
Our goal is to achieve the graceful degradation of throughput in the presence of faults in the system using the adaptive algorithm selection and scheduling technique. The adaptive algorithm selection and scheduling problem can be formally stated as follows:
Problem: Throughput Optimization Using Adaptive Algorithm Selection and Scheduling
Instance: Given are N tasks with A algorithmic choices that have processor and memory requirements, P homogeneous processors, M memory units, context switch time matrix CST and a period D.
Question: Is there a selection of algorithms and a schedule of tasks such that, in the presence of K1 faulty processors and K2 faulty memory units, the resulting schedule length is at most D?
We proved that the problem is NP-complete by using Karp's polynomial transformation technique from the traveling salesman problem, which is well known to be NP-complete.
Since the computational complexity of the algorithm selection and scheduling problem makes an exact or optimal solution impractical, a general combinatorial optimization technique known as simulated annealing (SA) [5] with standard geometric cooling has been used to optimize the throughput given the faults in processors and memory units.
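A minimal sketch of simulated annealing with geometric cooling over algorithm selections and task orders is given below. The move set, the cooling parameters, the toy instance and the simplified cost function are all assumptions made for illustration. In particular, the cost ignores context switches and treats memory as a static budget, which is simpler than the model of Section III.

```python
import math
import random


def anneal(initial, neighbor, cost, t_start=100.0, t_end=0.1, alpha=0.95,
           moves_per_temp=50, seed=0):
    """Generic simulated annealing; the temperature is multiplied by
    `alpha` after every batch of moves (geometric cooling)."""
    rng = random.Random(seed)
    current, current_cost = initial, cost(initial)
    best, best_cost = current, current_cost
    t = t_start
    while t > t_end:
        for _ in range(moves_per_temp):
            cand = neighbor(current, rng)
            delta = cost(cand) - current_cost
            # Always accept improvements; accept uphill moves with
            # probability exp(-delta / t).
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                current, current_cost = cand, current_cost + delta
                if current_cost < best_cost:
                    best, best_cost = current, current_cost
        t *= alpha
    return best, best_cost


# Toy instance (hypothetical numbers): per task, a list of (time, memory) choices.
CHOICES = {
    "A": [(30, 80), (45, 40)],
    "B": [(20, 90), (35, 30)],
    "C": [(50, 60), (70, 20)],
    "D": [(25, 70), (40, 35)],
}
NUM_PROCS, TOTAL_MEM = 2, 200


def cost(state):
    """Greedy list-schedule length plus a penalty for exceeding memory."""
    order, pick = state
    loads = [0] * NUM_PROCS
    for task in order:
        i = loads.index(min(loads))  # least-loaded processor takes the task
        loads[i] += CHOICES[task][pick[task]][0]
    mem = sum(CHOICES[t][pick[t]][1] for t in order)
    return max(loads) + 10 * max(0, mem - TOTAL_MEM)


def neighbor(state, rng):
    """Either swap two tasks in the order or change one algorithm choice."""
    order, pick = list(state[0]), dict(state[1])
    if rng.random() < 0.5:
        i, j = rng.sample(range(len(order)), 2)
        order[i], order[j] = order[j], order[i]
    else:
        task = rng.choice(order)
        pick[task] = rng.randrange(len(CHOICES[task]))
    return order, pick


initial = (list(CHOICES), {t: 0 for t in CHOICES})
best, best_cost = anneal(initial, neighbor, cost)
print(best, best_cost)
```

To adapt the schedule to a fault condition, the same search would be rerun with the reduced processor count or memory budget.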
We generated a set of random examples by varying the number of tasks to show the effect of the problem size on performance. We have tried 10, 20, 30, 40 and 50 tasks. The number of different algorithmic choices for each task is randomly chosen between 2 and 5. The number of processors and the amount of memory allocated are 5 and 1000 units, respectively. The processing times for tasks are chosen randomly between 10 and 100, while the memory requirements are randomly distributed between 10 and 350. The context switching times take values between 5 and 30.
Each fault in a processor is assumed to make one processor unavailable. When a fault in memory occurs, we assume that 200 units of memory become unavailable. Table III illustrates the effectiveness of the adaptive scheduling approach. The first number in each entry is the schedule length for the nonadaptive schedule, which is obtained under the no-fault assumption, while the second number is the schedule length for the adaptive schedule, which is obtained for the corresponding fault situation. The third number is the percentage improvement of the adaptive schedule over the nonadaptive one. The CPU times reported are the running times on a SUN Sparc 4. The adaptive scheduling approach achieved an average improvement of 13.1% over the nonadaptive approach.
We have constructed 4 different sets of 12 tasks from the 16 real-life tasks described in [10]. The tasks are the following: two GE controllers, two Honda controllers, a wavelet filter, four audio filters, an 8 × 8 discrete cosine transform, an NEC digital-to-analog converter, two components of modems, an LMS audio formatter, an echo canceller, and a linear controller for automotive motion control. For each task, we have used the number of operations and the area of the implementation as the running time and the memory requirement, respectively. The context switching times are approximated such that they are proportional to the memory requirement, switching between similar tasks costs little overhead, and switching between different tasks requires a significant amount of overhead. The context switching costs are distributed between 10 and 173. The number of processors and the amount of memory allocated are 4 and 160 units, respectively. Each fault in memory makes 40 units of memory unusable. Algorithm selection is not considered, since we assume that only one algorithmic choice is available for each task. Table IV presents the experimental results and shows the effectiveness of our technique. The adaptive scheduling approach achieved an average improvement of 10.6% over the nonadaptive approach.
V. DESIGN OF FAULT-TOLERANT SYSTEM FOR OPTIMIZING PRODUCTIVITY
Our goal is to design a fault-tolerant system with optimal productivity using the heterogeneous BISR approach described in the previous section. The first stage of the design flow is to design a nonredundant system that satisfies the throughput requirement, which is assumed to be given. The second stage is to find all nonredundant system configurations that can satisfy the throughput requirement using the heterogeneous BISR approach, which uses the task-level scheduling and algorithm selection flexibility. The third stage is to find an optimal redundant configuration that maximizes productivity by considering both yield enhancement and area penalty.
The yield and area for a given system configuration must be computed to obtain the productivity measure. Suppose the given system employs $n_p$ processors and $n_m$ memory units. The probability of a good processor is $P_p$ and the probability of a good memory unit is $P_m$. The $n$ nonredundant system configurations are $(p_1, m_1), (p_2, m_2), \ldots, (p_n, m_n)$, where $p_i$ and $m_i$ are the numbers of processors and memory units, respectively, and $p_i \le p_{i+1}$. We observe that if $p_i \le p_{i+1}$, then $m_i \ge m_{i+1}$. We also observe that if $(p_i, m_i)$ is a nonredundant system configuration, then $(p_j, m_j)$ is a redundant system configuration when $p_j \ge p_i$ and $m_j \ge m_i$. The yield $Y(n_p, n_m)$ is computed using the equation provided in Figure 3. The area is computed based on the area model in Section III.
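One way to evaluate such a yield is sketched below. The formula is an assumption, not a transcription of Figure 3: faults are taken to be independent, and a chip counts as good when its fault-free processors and memory units cover at least one nonredundant configuration. The configuration set `CONFIGS` is hypothetical. It was chosen because, with both probabilities set to 0.9, it reproduces the yields 0.334 and 0.482 quoted in the introduction.

```python
from math import comb


def binom_pmf(n, k, p):
    """Probability that exactly k of n independent units are good."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def system_yield(n_p, n_m, configs, p_proc, p_mem):
    """Probability that the good processors and good memory units of an
    (n_p, n_m) system cover at least one nonredundant configuration."""
    total = 0.0
    for g_p in range(n_p + 1):
        for g_m in range(n_m + 1):
            if any(g_p >= p and g_m >= m for p, m in configs):
                total += binom_pmf(n_p, g_p, p_proc) * binom_pmf(n_m, g_m, p_mem)
    return total


# Hypothetical nonredundant configurations (processors, memory units).
CONFIGS = [(7, 9), (8, 8)]

y_first = system_yield(9, 8, CONFIGS, 0.9, 0.9)   # 9 processors, 8 memory units
y_second = system_yield(8, 9, CONFIGS, 0.9, 0.9)  # 8 processors, 9 memory units
print(round(y_first, 3), round(y_second, 3), round(y_second / y_first, 2))
```

Because both redundant systems add the same area, the ratio of their yields equals the ratio of their productivities, which matches the factor of 1.44 from the introduction.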
The algorithm to maximize productivity is based on enumeration. From every configuration, new configurations are generated by adding a processor or a memory unit, and further enumeration from a new configuration stops when its productivity is smaller than that of the configuration it was generated from. The algorithm finds an optimal solution because the relative improvement in yield from adding a processor or memory unit decreases as the number of processors or memory units increases, while the relative increase in area from adding a processor or memory unit stays constant regardless of how many processors and memory units are used in the current configuration.
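The enumeration can be sketched as follows. Several pieces are assumptions: the yield model (independent faults, a chip is good when its fault-free units cover some nonredundant configuration), the reading of productivity as the yield gain factor divided by the area growth factor relative to the base system, equal unit areas, and the hypothetical configuration set. The yield helper is repeated so that the block runs on its own.

```python
from functools import lru_cache
from math import comb


def system_yield(n_p, n_m, configs, p_proc, p_mem):
    """P(good processors and memory units cover some nonredundant config)."""
    def pmf(n, k, p):
        return comb(n, k) * p**k * (1 - p) ** (n - k)
    return sum(
        pmf(n_p, g_p, p_proc) * pmf(n_m, g_m, p_mem)
        for g_p in range(n_p + 1)
        for g_m in range(n_m + 1)
        if any(g_p >= p and g_m >= m for p, m in configs)
    )


def best_redundant_config(base, configs, p_proc, p_mem,
                          area_proc=1.0, area_mem=1.0):
    """Enumerate configurations reachable from `base` by adding one unit at
    a time, pruning a branch as soon as productivity drops."""
    def area(cfg):
        return cfg[0] * area_proc + cfg[1] * area_mem

    base_yield = system_yield(*base, configs, p_proc, p_mem)

    @lru_cache(maxsize=None)
    def productivity(cfg):
        y = system_yield(*cfg, configs, p_proc, p_mem)
        return (y / base_yield) / (area(cfg) / area(base))

    best = base
    expanded = {base}
    frontier = [base]
    while frontier:
        cfg = frontier.pop()
        for nxt in ((cfg[0] + 1, cfg[1]), (cfg[0], cfg[1] + 1)):
            # Stop enumerating past a configuration that is less productive
            # than the one it was generated from.
            if nxt in expanded or productivity(nxt) < productivity(cfg):
                continue
            expanded.add(nxt)
            frontier.append(nxt)
            if productivity(nxt) > productivity(best):
                best = nxt
    return best, productivity(best)


# Hypothetical inputs: base system (8, 8), both unit probabilities 0.9.
CONFIGS = [(7, 9), (8, 8)]
best, best_prod = best_redundant_config((8, 8), CONFIGS, 0.9, 0.9)
print(best, round(best_prod, 3))
```

The search terminates because the yield is bounded by 1 while the area grows with every added unit, so productivity must eventually fall along every path.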
We used the same random examples as in the previous section, assuming that 8 processors and 800 units of memory are used, where 100 units of memory are affected by a single fault in memory. The number of memory units reported in the paper is in multiples of 100 units of memory. Using the adaptive algorithm selection and scheduling, all the nonredundant system configurations have been identified; they are provided in Table V. The design of a fault-tolerant system with optimal productivity has been performed under various design parameters such as the probability of
VI. CONCLUSION
We have proposed, for the first time, a design methodology for fault-tolerant real-time systems that achieves optimal productivity. It provides a novel approach to heterogeneous BISR-based graceful degradation and yield enhancement that uses task-level algorithm selection and scheduling flexibility and builds on our hardware fault model for modern superscalar processors and multiprocessors. On a large set of examples, the approach resulted in significant improvements in productivity and throughput. In the presence of faults, adaptive schedules were on average 13.1% shorter than nonadaptive ones on the random examples and 10.6% shorter on the real-life examples.
