Abstract: Embedded systems employed in critical applications demand high reliability and availability in addition to high performance. Hardware-software co-synthesis of an embedded system is the process of partitioning, mapping, and scheduling its specification into hardware and software modules to meet performance, cost, reliability, and availability goals. In this paper, we address the problem of hardware-software co-synthesis of fault-tolerant real-time heterogeneous distributed embedded systems. Fault detection capability is imparted to the embedded system by adding assertion and duplicate-and-compare tasks to the task graph specification prior to co-synthesis. The dependability (reliability and availability) of the architecture is evaluated during co-synthesis. Our algorithm, called COFTA (Co-synthesis Of Fault-Tolerant Architectures), allows the user to specify multiple types of assertions for each task. It uses the assertion or combination of assertions which achieves the required fault coverage without incurring too much overhead. We propose new methods to: 1) perform fault tolerance based task clustering, which determines the best placement of assertion and duplicate-and-compare tasks, 2) derive the best error recovery topology using a small number of extra processing elements, 3) exploit multidimensional assertions, and 4) share assertions to reduce the fault tolerance overhead. Our algorithm can tackle multirate systems commonly found in multimedia applications. Application of the proposed algorithm to a large number of real-life telecom transport system examples (the largest example consisting of 2,172 tasks) shows its efficacy. For fault-secure architectures, which just have fault detection capabilities, COFTA is able to achieve up to 48.8 percent and 25.6 percent savings in embedded system cost over architectures employing duplication and task-based fault tolerance techniques, respectively. The average cost overhead of COFTA fault-secure architectures over simplex architectures is only 7.3 percent. In the case of fault-tolerant architectures, which can not only detect but also tolerate faults, COFTA is able to achieve up to 63.1 percent and 23.8 percent savings in embedded system cost over architectures employing triple-modular redundancy and task-based fault tolerance techniques, respectively. The average cost overhead of COFTA fault-tolerant architectures over simplex architectures is only 55.4 percent.
INTRODUCTION
EMBEDDED systems have begun to play a significant role in our day-to-day lives. Fault-tolerant distributed embedded systems can offer high performance as well as reliability and availability to meet the needs of critical real-time applications. Many embedded systems concurrently perform a multitude of complex tasks. Heterogeneous distributed architectures are commonly used to meet the performance needs of such systems. These architectures contain several general-purpose processors and application-specific integrated circuits (ASICs) of different types which are interconnected by various types of communication links. Each task to be performed on the system can be executed on a variety of software and hardware modules which have different dollar costs, reliability, area, delay, and power requirements. For example, a task can be performed on a general-purpose processor (software) or an ASIC (hardware). Similarly, a message can be communicated via a serial link, local area network (LAN), or a bus. Parameters such as area, delay, reliability, and power are usually estimated by simulation/synthesis or laboratory measurement from previous designs.
The derivation of an optimal hardware-software architecture [...] There are two major approaches to solving the distributed system co-synthesis problem: optimal and heuristic. Mixed integer linear programming (MILP) and exhaustive enumeration are two distinct optimal approaches. Prakash and Parker have proposed MILP-based co-synthesis [10], which has the following limitations: 1) it allows only one task graph, 2) it does not allow preemptive scheduling, 3) it requires specification of the interconnection topology up front, and 4) it does not consider fault tolerance. Due to its computational complexity, it is only suitable for small task graphs consisting of around 10 tasks. D'Ambrosio and Hu have proposed a configuration-level hardware-software partitioning algorithm [11] which is based on an exhaustive enumeration of all possible solutions. Limitations of this approach are: 1) it allows an architecture with at most one CPU and a few ASICs, 2) it ignores communication overheads, 3) it does not consider fault tolerance, and 4) it uses simulation for performance evaluation, which is very time-consuming.
Iterative [12], [13], [14] and constructive [15], [16] are two distinct approaches in the heuristic domain. In the iterative approach, an initial solution is iteratively improved through various architecture moves. In the constructive approach, the architecture is built step-by-step and the complete architecture is not available before completion of the algorithm. The iterative procedures given in [12], [13] do not address fault tolerance and consider only one type of communication link. They do not allow mapping of successive instances of a periodic task to different processing elements (PEs), which may be important in deriving cost-effective architectures. The algorithm in [14] employs power dissipation as a cost function for allocation. It ignores intertask communication scheduling. A constructive co-synthesis algorithm for fault-tolerant distributed embedded systems has been proposed in [15]. The method in [15] has the following limitations: 1) it employs task-based fault tolerance (TBFT) [17], but does not exploit the error transparency property (explained later), which can significantly reduce the fault tolerance overhead, 2) it does not support communication topologies such as bus, LAN, etc., 3) it employs a pessimistic finish time estimation technique which may increase the architecture cost, 4) it does not address availability of systems, and 5) it is not suitable for multirate systems.
The primary focus in [16] is on general and low-power co-synthesis of distributed embedded systems. The methods in [18], [19], [20], [21] consider fault tolerance during task allocation, but not during co-synthesis. Direct optimization of dependability (reliability and availability) or determination of an efficient error recovery topology of the architecture has not been attempted before during co-synthesis. Also, the concepts of multidimensional assertions and assertion sharing have not been exploited before.
We have developed a heuristic-based constructive co-synthesis algorithm, COFTA (Co-synthesis Of Fault-Tolerant Architectures), which produces an optimized distributed embedded system architecture for fault tolerance. Fault detection is accomplished through the addition of assertion and duplicate-and-compare tasks. A new task clustering technique exploits the transparency of some tasks to errors to reduce the fault tolerance overhead and determines the best placement of assertion and/or duplicate-and-compare tasks. Concepts of multidimensional assertions and assertion sharing are introduced to further reduce the fault tolerance overhead. The best error recovery topology is automatically extracted during co-synthesis. Error recovery is accomplished through a few spare PEs. Markov models are used to evaluate the availability of the architecture. COFTA is the first algorithm to optimize dependability during co-synthesis. In order to establish its effectiveness, COFTA has been successfully applied to a large number of real-life telecom transport system examples.
The rest of this paper is organized as follows. Section 2 provides the definitions and basic concepts behind the co-synthesis framework. Section 3 describes various schemes to reduce the fault tolerance overhead and increase architecture dependability. Section 4 describes how techniques introduced in Section 3 are used during the different steps of our co-synthesis algorithm. Section 5 gives experimental results. Section 6 gives the conclusions.
THE CO-SYNTHESIS FRAMEWORK
Each application-specific function of an embedded system is made up of several sequential and/or concurrent jobs. Each job is made up of several tasks. Tasks are atomic units performed by embedded systems. Tasks contain both data and control flow information. The embedded system functionality is usually described through a set of task graphs. Nodes of a task graph represent tasks. Tasks communicate data to each other, indicated by a directed edge between communicating tasks. Task graphs can be periodic or aperiodic. Although, in this paper, we focus primarily on periodic task graphs, our co-synthesis algorithm can be easily extended to cover aperiodic tasks as well, using the concepts in [22]. Each periodic task graph has an earliest start time (EST), period, and deadlines, as shown for an example in Fig. 1a. Each task of a periodic task graph inherits the graph's period. Each task in a task graph can have a different deadline. The task graph in Fig. 1a will be used as a running example to illustrate various steps of co-synthesis.
The PE (link) library is a collection of all available PEs (communication links). The PE and link libraries together form the resource library. The resource library and its costs for two general-purpose processors, P1 and P2, two ASICs, ASIC1 and ASIC2, and two links, L1 and L2, are shown in Fig. 1b. The following definitions form the basis of the co-synthesis framework. Some of these definitions have been taken from [16].
Clustering of tasks in a task graph reduces the communication times and significantly speeds up the co-synthesis process. The preference vector of a cluster indicates which PEs the cluster cannot be allocated to. The exclusion vector of a cluster is used to determine compatibility of tasks in the cluster with tasks outside the cluster.
TABLE 1. LIST OF SYMBOLS
DEFINITION 7. Task $t_i$ is said to be preference-compatible with cluster $C_k$ if the bit-wise logical AND of the preference vectors of cluster $C_k$ and task $t_i$ does not result in the zero-vector (a vector with all elements zero).
If all elements of the preference vector of cluster $C_k$ are zero, the cluster cannot be allocated to any PE.
DEFINITION 8. Task $t_i$ is said to be exclusion-compatible with cluster $C_k$ if the $i$th entry of the exclusion vector of $C_k$ is zero.
This indicates that tasks in cluster $C_k$ can be co-allocated with task $t_i$. If $t_i$ is both preference- and exclusion-compatible with $C_k$, it is simply said to be compatible with $C_k$.
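The two compatibility tests reduce to simple bit-vector checks. A minimal Python sketch is given below; the function names and the encoding of vectors as lists of 0/1 entries are assumptions made for illustration only, and task indices are zero-based here.

```python
from typing import Sequence


def preference_compatible(cluster_pref: Sequence[int],
                          task_pref: Sequence[int]) -> bool:
    """Definition 7: the bit-wise AND of the two preference vectors
    must not be the zero-vector."""
    return any(c & t for c, t in zip(cluster_pref, task_pref))


def exclusion_compatible(cluster_excl: Sequence[int], task_index: int) -> bool:
    """Definition 8: the entry of the cluster's exclusion vector that
    corresponds to the task must be zero."""
    return cluster_excl[task_index] == 0


def compatible(cluster_pref: Sequence[int], cluster_excl: Sequence[int],
               task_pref: Sequence[int], task_index: int) -> bool:
    """A task is compatible with a cluster if both tests pass."""
    return (preference_compatible(cluster_pref, task_pref)
            and exclusion_compatible(cluster_excl, task_index))
```

For instance, a cluster with preference vector `[1, 0, 1]` and a task with preference vector `[0, 0, 1]` still share the third PE, so they are preference-compatible.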
DEFINITION 9. Task $t_i$ is said to be error-transparent if an error at its input always propagates to its outputs.
Traditionally, for fault detection purposes, either an assertion task is added to check the output of each task or the task is duplicated and a comparison task is added to compare the outputs of the duplicated tasks. An assertion task checks some inherent property/characteristic of the output data from the original task [17]. If task $t_i$ feeds an error-transparent task $t_h$, whose output can be checked with an assertion task, then we do not need to have an assertion task or duplicate-and-compare tasks to check the output of $t_i$. This reduces the fault tolerance overhead. Many tasks in real-life task graphs that we have encountered do have the error transparency property. For example, a task graph for telecom input interface processing consists of the following tasks in a chain: preamplification, timing recovery, bipolar decoding, framing, and payload processing. All these tasks are error-transparent, and one assertion task at the output of the chain suffices for fault detection purposes if the fault detection latency requirement (explained later) is satisfied.
As mentioned before, a communication link can take different forms such as point-to-point, bus, LAN, etc. We take this into consideration through the communication vector. The communication vector for each edge is computed a priori for various types of links as follows. Let $\rho_j$ be the number of bytes that need to be communicated on edge $e_j$, and $\alpha_l$ be the number of bytes per packet that link $l$ can support, excluding the packet overhead. Suppose the link under consideration, $l$, has $s$ ports. Let $\tau_l$ be the communication time of a packet on link $l$. Some communication links may incur a per-packet access overhead, denoted $\Delta_l$ for link $l$. Then, $\psi_{jl}$ is given by:
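The expression itself is not reproduced in this text. One reading that is consistent with the definitions above is that a message is split into as many packets as are needed to carry its bytes, and that each packet costs the per-packet communication time plus the per-packet access overhead. The Python sketch below encodes that assumption; the function names are hypothetical, and the dependence of the packet time on the number of ports is assumed to be folded into the packet time already.

```python
from typing import List, Tuple


def communication_time(num_bytes: int, bytes_per_packet: int,
                       packet_time: float,
                       access_overhead: float = 0.0) -> float:
    """Assumed model: (number of packets) x (packet time + access overhead)."""
    if num_bytes <= 0:
        return 0.0
    packets = -(-num_bytes // bytes_per_packet)  # ceiling division
    return packets * (packet_time + access_overhead)


def communication_vector(num_bytes: int,
                         links: List[Tuple[int, float, float]]) -> List[float]:
    """One entry per link type; each link is described by the tuple
    (bytes_per_packet, packet_time, access_overhead)."""
    return [communication_time(num_bytes, a, t, d) for (a, t, d) in links]
```

With these assumptions, a 1,000-byte message on a link carrying 256 bytes per packet needs four packets, so `communication_time(1000, 256, 2.0, 0.5)` evaluates to `10.0`.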
The link access overhead per packet can be reduced in case of large messages requiring multiple packets. At the beginning of co-synthesis, since the actual number of communication ports on the links is not known, we initially use an average number of communication ports (specified a priori) to determine the communication vector. This vector is recomputed after each allocation, considering the actual number of ports on the link.
When an assertion is shared among multiple tasks, the implementation of the assertion task is augmented with additional input/output lines and control signals, which in turn increase its execution time. The assertion_excess_overhead_vector allows us to factor in the overhead resulting from assertion sharing.
In order to provide flexibility for the communication mechanism, we support two modes of communication: [...]
In general, tasks are reused across multiple embedded system functions. To exploit this fact, the concept of architectural hints is used. Architectural hints are created during task graph generation. They are based on the type of task, type of resource library, and previous experience of the designers. These hints are used to indicate possibilities of reuse, preemption, error recovery topology, etc. These hints are not necessary for the success of our algorithm. However, the algorithm can exploit them when they are available.
In critical embedded system applications, the dependability of the system is of utmost concern. The measures of dependability are reliability and availability. Reliability is the ability of a system to perform the required functions for a specified period under stated mechanical, thermal, electrical, and other environmental conditions. The specified period is generally referred to as service life. In general, systems in operation allow repair scenarios for failed components. For example, most telecom embedded systems are designed for critical applications requiring continuous operation where repair operation is allowed. Availability is a measure of the fraction of time the system is available to perform the required functions. Generally, the maximum allowed unavailability (1 - availability) of the system is specified in units of minutes per year. Different embedded functions have different availability requirements. For example, in telecom systems, the availability of the control module may affect the availability of provisioning and communication to the system, but may not impact the transmission functions performed by the system. The term provisioning is generally used to describe embedded system functions such as configuration, addition of new services, etc. On the other hand, the failure of the transmission module may not impact the availability of the control function, but it may impact service and generally has a more stringent availability requirement than that of a control module. Thus, we assume that each task graph $T_i$ has an allowed unavailability specified a priori in terms of $U_i$ minutes/year.
An embedded system architecture generally has several interconnected modules. A module is defined as an interconnection of several elements from the resource library to perform a specific set of functions. Elements of the resource library are also sometimes referred to as components. In order to meet availability requirements for various task graphs of the system, we form a failure group which is a collection of service and protection modules. In the event of failure of any service module, a switch to the protection module is required for efficient error recovery.
The failure-in-time (FIT) rate $\lambda$ of a module or component is its expected number of failures in $10^9$ hours of operation. In order to facilitate unavailability analysis of the architecture, the FIT rate for each hardware and software module and the mean time to repair (MTTR) of a faulty module are assumed to be specified a priori, in addition to the system's availability requirements. For each failure group, background diagnostics are run on the protection (also known as stand-by) module to increase its availability in the event of a protection switch. Background diagnostics consist either of a separate set of tasks specified a priori or of the tasks allocated to the failure group.
An assertion task flags an error when the task it checks outputs erroneous data. Some common examples of assertion tasks used in telecom transport systems are: 1) parity error detection, 2) address range check, 3) protection switch control error detection, 4) bipolar-violation detection, 5) checksum error detection, 6) frame error detection, 7) loss-of-synchronization detection, 8) software code checksum error detection, 9) software input and output data constraints check, etc.
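To make the units concrete, the Python sketch below converts a FIT rate and an MTTR into an unavailability figure in minutes per year. It assumes the simplest possible model, a single repairable module with constant failure and repair rates (a two-state Markov chain), and ignores protection modules and failure groups; the function names are hypothetical and the model is not claimed to be the one used by COFTA.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def steady_state_unavailability(fit: float, mttr_hours: float) -> float:
    """Two-state Markov model: unavailability = lambda / (lambda + mu),
    with lambda = FIT / 1e9 failures per hour and mu = 1 / MTTR."""
    failure_rate = fit / 1e9
    repair_rate = 1.0 / mttr_hours
    return failure_rate / (failure_rate + repair_rate)


def unavailability_minutes_per_year(fit: float, mttr_hours: float) -> float:
    """Expected downtime per year for a single unprotected module."""
    return steady_state_unavailability(fit, mttr_hours) * MINUTES_PER_YEAR


def meets_requirement(fit: float, mttr_hours: float,
                      allowed_minutes_per_year: float) -> bool:
    """Compare the estimate with the allowed unavailability of a task graph."""
    return unavailability_minutes_per_year(fit, mttr_hours) <= allowed_minutes_per_year
```

Under these assumptions, a module with a FIT rate of 1,000 and an MTTR of two hours is unavailable for roughly one minute per year.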
For each task, it is specified whether an assertion task(s) for it is available or not. For each assertion, an associated fault coverage is specified. A combination of assertions may sometimes be required to achieve the desired fault coverage. For each such task, a group of assertions and the location of each assertion is specified. For each check (assertion or compare) task, the execution vector, and the communication vector of the edge between the checked and check tasks are specified.
For each available processor, its cost, FIT rate, supply voltage, average quiescent power dissipation, peak power constraint, and associated peripheral attributes, such as memory architecture, processor-link communication characteristics, and cache characteristics, are assumed to be specified. In addition, the preemption overhead for each processor is specified a priori in terms of its execution time as well as average and peak power consumption. For each ASIC, its cost, supply voltage, average quiescent power dissipation, FIT rate, and package attributes, such as available pins and available gates, are assumed to be specified. Similarly, for each FPGA, its cost, supply voltage, average quiescent power dissipation, FIT rate, as well as package attributes, such as available pins and maximum number of programmable functional units (PFUs), are assumed to be specified. Generally, not all PFUs are usable due to routing restrictions. A very high utilization of PFUs and pins may force the router to route the nets such that the resulting delay exceeds the execution time (delay constraint) defined by the execution vector. We take this into account through a term called the effective usage factor (EUF). Based on our previous design experience, we have assumed an EUF of 70 percent for our experimental results to determine the percentage of the logic blocks that are actually usable for allocation purposes. We also allow the user to specify an EUF based on his/her own experience. The user can also specify the effective pin usage factor (EPUF) to indicate what percentage of package pins can be used for allocation (the default is 80 percent, to reserve pins for power and ground and to account for routing restrictions). The default percentages used for EUF and EPUF were derived from existing designs and experimentally verified to guarantee the satisfaction of delay constraints.
LOW OVERHEAD FAULT TOLERANCE SCHEMES AND ARCHITECTURE DEPENDABILITY
The embedded system architecture is made fault-secure using the concept of task-based fault tolerance (TBFT) [17] against at least single PE faults. The link faults are addressed by traditional techniques such as data encoding, loss-of-signal detection, loss-of-clock detection, etc. A system is said to be fault-secure if, in the presence of a fault, either transient or permanent, the system either detects it or always gives the correct output [23]. In [24], task redundancy (such as duplicate-and-compare or triplicate-and-vote) is used for fault detection and tolerance. However, relying on task duplication alone for fault security results in large overheads in cost and power consumption. The use of assertion tasks can substantially lower these overheads. In this section, we show how this can be accomplished. For error recovery, once the fault is detected through a check task and determined to be permanent (explained later), the service module on which the checked task resides is marked faulty, and tasks allocated to that module are run on a standby protection module. In many large distributed embedded systems, there are several modules of the same type which perform identical functions on different input data. For example, in a transport system for processing N OC-192 (9.95 Gb/s synchronous optical network (SONET)) signals, there would typically be N service modules, each processing one OC-192 signal. In this case, each module is designed to execute the same set of tasks. We propose to use one protection module for every failure group consisting of N service modules in order to minimize the fault tolerance overhead, whenever possible.
Cluster-Based Fault Tolerance
In order to exploit the error transparency concept properly, we propose the concept of cluster-based fault tolerance (CBFT). To illustrate its advantages, consider the task graph shown in Fig. 2a. Assume that an assertion task is available for all tasks except task t2. Application of the TBFT concept [15], [17] results in the augmented task graph shown in Fig. 2b. Since task t2 does not have an assertion, its duplicate task t2d and compare task t2c are added. For each of the remaining tasks, an assertion task is added, e.g., t1c for t1, and so on. Application of the clustering procedure (given later) results in the clusters shown in Fig. 2b. This means, for example, that tasks t1, t2, and t4 belonging to cluster C1 will be allocated to the same PE. Any transient or permanent fault in the PE may affect any one or more of these tasks. Suppose tasks t2, t3, and t4 are error-transparent. Then, t1c, t2d, t2c, and t3c can be dropped from the augmented task graph, obtaining the graph shown in Fig. 2c. Suppose the fault in the PE that cluster C1 is allocated to affects task t1. Then the corresponding error will be propagated to t4c and detected by it. A fault affecting task t2 or task t4 will similarly be detected by t4c. We make sure that a checked task and its check task are allocated to different PEs using the exclusion vector concept so that a single PE fault does not affect both. Similarly, a task and its duplicate, if one exists, are also allocated to different PEs.
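The pruning of check tasks described above can be sketched as a single backward pass over the task graph. The Python code below is a simplified illustration, not the clustering procedure itself: it ignores fault detection latency, cluster boundaries, and PE exclusion, and the function name, the labels it returns, and the example edges (Fig. 2a is not reproduced here) are all assumptions. A task is left unchecked when at least one of its fan-out tasks is error-transparent and is itself covered by an assertion further downstream.

```python
from typing import Dict, List, Set


def place_check_tasks(successors: Dict[str, List[str]],
                      has_assertion: Set[str],
                      error_transparent: Set[str]) -> Dict[str, str]:
    """Return 'none', 'assertion', or 'duplicate' for every task."""
    order: List[str] = []
    visited: Set[str] = set()

    def visit(task: str) -> None:
        # Depth-first post-order: every successor is listed before the task.
        if task in visited:
            return
        visited.add(task)
        for succ in successors.get(task, []):
            visit(succ)
        order.append(task)

    for task in successors:
        visit(task)

    decision: Dict[str, str] = {}
    covered_by_assertion: Dict[str, bool] = {}
    for task in order:
        downstream = any(s in error_transparent and covered_by_assertion[s]
                         for s in successors.get(task, []))
        if downstream:
            decision[task], covered_by_assertion[task] = "none", True
        elif task in has_assertion:
            decision[task], covered_by_assertion[task] = "assertion", True
        else:
            # Duplicate-and-compare cannot catch errors arriving at the
            # inputs, so upstream tasks are not covered through this task.
            decision[task], covered_by_assertion[task] = "duplicate", False
    return decision


# Hypothetical edges for a four-task graph in the spirit of Fig. 2.
graph = {"t1": ["t2", "t3"], "t2": ["t4"], "t3": ["t4"], "t4": []}
result = place_check_tasks(graph, has_assertion={"t1", "t3", "t4"},
                           error_transparent={"t2", "t3", "t4"})
```

On this hypothetical graph only t4 keeps a check task, which matches the outcome described for Fig. 2c.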
Fault Detection Latency
In real-time systems, the fault detection latency (the time it takes to detect a fault) can significantly impact the protection switch time. The protection switch time includes the fault detection latency of the system and the error recovery time. Therefore, even when a task is error-transparent, it may be necessary to add a check task to its input to improve the fault detection latency. We take care of this concern as follows. Suppose the maximum allowable system fault detection latency is $\tau_d$. We first compute the fault detection latency for each check task, as illustrated by the following example. Consider the task graph in Fig. 3a. Its augmented task graphs with the addition of an assertion task and of duplicate-and-compare tasks for task $t_j$ are shown in Figs. 3b and [...]. With duplicate-and-compare, an erroneous output of $t_i$ is fed to both $t_j$ and $t_{jd}$, which will, in turn, result in the same erroneous values at the outputs of $t_j$ and $t_{jd}$, and the error will not be detected. In this case, $t_i$'s output will need to be checked directly even though $t_j$ is error-transparent.
To illustrate the fault detection latency estimation procedure, consider the task graph shown in Fig. 4, where task $t_k$ has $m$ input paths, e.g., from $t_{11}$ to $t_k$. Task $t_{kc}$ is an assertion task for task $t_k$. Let $\tau_{kc}$ be the fault detection latency of task $t_{kc}$, where $t_k$ has a set $M$ of $m$ input paths. Let $t_j$ be a task and $e_l$ be an edge on the $j$th path. The fault detection time, $\Phi_k$, at task $t_k$, is estimated using the equation given below. We sum up the execution and communication times on each path after the last checked task on that path. The communication time on an intertask edge between two tasks belonging to the same cluster is made zero (this is a traditional assumption in distributed computing).
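The estimation rule stated in words above (sum the execution and communication times along each input path after the last checked task, and take the maximum over the paths) can be sketched in Python as follows. This is an interpretation rather than the exact equation: it assumes that accumulation restarts after a checked predecessor, that the edge leaving a checked predecessor still counts, and that the communication and execution times of the check task itself are added at the end. All names are hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple


def accumulated_time(task: str,
                     preds: Dict[str, List[str]],
                     exec_time: Dict[str, float],
                     comm_time: Dict[Tuple[str, str], float],
                     checked: Set[str],
                     cluster_of: Dict[str, str],
                     memo: Optional[Dict[str, float]] = None) -> float:
    """Longest execution-plus-communication time since the last checked task."""
    if memo is None:
        memo = {}
    if task in memo:
        return memo[task]
    longest_upstream = 0.0
    for p in preds.get(task, []):
        same_cluster = p in cluster_of and cluster_of[p] == cluster_of.get(task)
        edge = 0.0 if same_cluster else comm_time[(p, task)]  # intra-cluster edges cost 0
        upstream = 0.0 if p in checked else accumulated_time(
            p, preds, exec_time, comm_time, checked, cluster_of, memo)
        longest_upstream = max(longest_upstream, upstream + edge)
    memo[task] = longest_upstream + exec_time[task]
    return memo[task]


def detection_time(task: str, check_exec: float, check_comm: float,
                   preds: Dict[str, List[str]],
                   exec_time: Dict[str, float],
                   comm_time: Dict[Tuple[str, str], float],
                   checked: Set[str],
                   cluster_of: Dict[str, str]) -> float:
    """Time until the check task attached to `task` has finished."""
    return (accumulated_time(task, preds, exec_time, comm_time, checked, cluster_of)
            + check_comm + check_exec)
```

If the value returned by `detection_time` exceeds the maximum allowable fault detection latency, an additional check task is needed somewhere upstream on the offending path.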
Application of Error Transparency and Fault Detection Latency Properties to Task Clustering
We use a new task clustering technique to take advantage of error transparency, whenever possible, and to find the best placement of the assertion and duplicate-and-compare tasks as follows. Task clustering involves grouping of tasks to reduce the complexity of allocation. Our clustering technique addresses the fact that different paths may become the longest path through the task graph at different points in the clustering process since the length of the longest path changes after partial clustering. The longest path is defined as the path for which the summation of execution and communication times of all the tasks and edges on the path is the maximum. Even though task clustering reduces the allocation complexity and leads to reduced run-time of the co-synthesis algorithm, it can increase the embedded system cost since all tasks in a cluster are allocated to the same PE, which may not be optimal. Our experience from COSYN [16] shows that task clustering results in up to a five-fold reduction in co-synthesis CPU time for medium-sized task graphs (with the number of tasks in the hundreds) with less than 1 percent increase in embedded system cost. Since meeting the real-time constraints is the most important objective, we first assign a deadline-based priority level to each task and edge in each task graph in order to determine the ordering of tasks for clustering [15], [16]. The priority level of a task (edge) is an indication of the longest path from the task (edge) to a task with a specified deadline and includes the computation and communication times along the path, as well as the deadline. It can be either positive or negative. A nonsink task $t_j$ may either have a deadline or not. We define $\beta(t_j)$ to be equal to the deadline of $t_j$ if the deadline is specified, and $\infty$ otherwise. Then, the priority levels of a task and an edge are determined as follows:
2) Priority level of an edge $e_k$ = priority level of destination node($e_k$) + $\psi_{max}(e_k)$.
3) Priority level of nonsink task $t_j$ = max (priority level [...]
Application of the above deadline-based priority level assignment procedure to the task graph in Fig. 1a results in the initial priority levels indicated by numbers next to nodes and edges in Fig. 5. In the beginning, the maximum communication time, $\psi_{max}(e_k)$, is used to compute the priority levels. However, as clustering progresses and the communication times get fixed to zero, the priority levels are recomputed at each step.
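Because the rules for sink and nonsink tasks are only partly legible above, the Python sketch below fills them in with an assumed recurrence: the priority level of a task is its maximum execution time plus the larger of the highest priority level among its fan-out edges and the negated deadline (negative infinity when no deadline is given), which for a sink task reduces to execution time minus deadline. The edge rule is the one stated in the text. Treat the task rule, and all names, as assumptions.

```python
import math
from typing import Dict, List, Tuple


def priority_levels(successors: Dict[str, List[str]],
                    exec_max: Dict[str, float],
                    comm_max: Dict[Tuple[str, str], float],
                    deadline: Dict[str, float]):
    """Return (task_level, edge_level) under the assumed recurrence."""
    task_level: Dict[str, float] = {}
    edge_level: Dict[Tuple[str, str], float] = {}

    def level(task: str) -> float:
        if task in task_level:
            return task_level[task]
        best = -deadline.get(task, math.inf)  # no deadline: minus infinity
        for succ in successors.get(task, []):
            # Stated rule: edge level = level of destination + max comm time.
            edge = level(succ) + comm_max[(task, succ)]
            edge_level[(task, succ)] = edge
            best = max(best, edge)
        task_level[task] = exec_max[task] + best
        return task_level[task]

    for task in exec_max:
        level(task)
    return task_level, edge_level


# Two-task chain t1 -> t2 with a deadline of 20 on the sink t2.
tasks, edges = priority_levels({"t1": ["t2"]},
                               exec_max={"t1": 10, "t2": 5},
                               comm_max={("t1", "t2"): 3},
                               deadline={"t2": 20})
```

Setting an entry of `comm_max` to zero and recomputing mimics the effect of placing both endpoints of that edge in the same cluster.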
At the beginning of the clustering procedure, we also assign an assertion overhead and fault tolerance (FT) level to each task using the procedure given in Fig. 6. The FT level indicates the longest path from the task to a sink task considering the assertion overhead and communication. Like the priority levels, FT levels are also recomputed as clustering progresses. The clustering method for CBFT is given in Fig. 7. We use priority levels to determine the order of tasks for clustering. We pick the task with the highest priority level since such a task indicates the longest or critical path from the execution time standpoint. However, we use FT levels to derive the most appropriate cluster as well as to identify the fan-out task along which to expand the cluster. This approach allows clustering of tasks while minimizing fault tolerance overhead. During cluster formation, we use error transparency, as well as the allowable system fault detection latency, to define the best placement of the assertion and duplicate-and-compare tasks.
For each unclustered task $t_i$, we first form a fan-in set, which is the set of compatible fan-in tasks. We identify the cluster $C_j$ of the task from the fan-in set with which $t_i$ can be clustered. If the fan-in set is empty, we form a new cluster. Once task $t_i$ is clustered with cluster $C_j$, we use the EXPAND_CLUSTER procedure given in Fig. 8 to expand the cluster. In order to ensure load balancing among various PEs of the architecture, the cluster size should be limited. If the cluster size is too big, it may be prevented from being allocated to any PE. If it is too small, it would increase the total number of clusters and increase the computational complexity of the co-synthesis algorithm. We use a parameter called the cluster size threshold, $C_{th}$, to limit the size of the cluster. $C_{th}$ is set equal to the hyperperiod, which is the least common multiple of the periods of all task graphs. If $period_i$ is the period of task graph $i$, then (hyperperiod ÷ $period_i$) copies of it need to be explicitly or implicitly tackled [25]. Let there be $f$ PEs in the PE library to which some cluster $C_k$ is allocatable. Recall that preference_vector($C_k$) will have 1s corresponding to these $f$ PEs. At any point in the clustering procedure, if cluster $C_k$ contains $d$ tasks $\{t_1, t_2, \ldots, t_d\}$, its size, denoted as $\theta_k$, is estimated by the following equation. $\pi_{im}$ denotes the execution time of task $t_i$ on PE $m$ to which $C_k$ is allocatable. $p$ denotes the period of the tasks in cluster $C_k$ and $\Gamma$ denotes the hyperperiod. Then,
$$\theta_k = \max_{1 \le m \le f} \left\{ \sum_{i=1}^{d} \pi_{im} \cdot \frac{\Gamma}{p} \right\}$$
In order to take into consideration the worst-case allocation, we obtain $\theta_k$ as the maximum over all relevant PEs of the summation of the execution times of all copies of all tasks in cluster $C_k$.
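The size estimate described above (the worst case, over the allocatable PEs, of the total execution time of all copies of the cluster's tasks within one hyperperiod) can be sketched in Python as follows. The function names and the data layout, one dictionary of per-PE execution times per task, are assumptions made for illustration.

```python
from functools import reduce
from math import gcd
from typing import Dict, Iterable, List


def hyperperiod(periods: Iterable[int]) -> int:
    """Least common multiple of the periods of all task graphs."""
    return reduce(lambda a, b: a * b // gcd(a, b), periods)


def cluster_size(exec_times: List[Dict[str, float]],
                 allocatable_pes: Iterable[str],
                 period: int, hyper: int) -> float:
    """Maximum over the allocatable PEs of the summed execution times of
    all copies of all tasks in the cluster."""
    copies = hyper // period  # instances of each task in one hyperperiod
    return max(sum(task[pe] for task in exec_times) * copies
               for pe in allocatable_pes)


# Hypothetical numbers: three task graphs and a two-task cluster.
hyper = hyperperiod([20, 30, 60])
size = cluster_size([{"P1": 4, "P2": 6}, {"P1": 5, "P2": 2}],
                    allocatable_pes=["P1", "P2"], period=20, hyper=hyper)
fits = size <= hyper  # threshold set equal to the hyperperiod
```

Here each task has three copies per hyperperiod, the worst case is PE P1, and the cluster stays below the threshold.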
To illustrate the clustering technique, consider the task graph in Fig. 9a. Suppose there is only one PE and one link in the resource library. The numbers adjacent to nodes (edges) indicate their execution (communication) times and dl indicates the deadline. The initial priority levels are shown in Fig. 9b. Suppose all tasks except task t12 are error-transparent and only tasks t3, t8, and t9 do not have assertions. The execution and communication times for assertion, as well as duplicate-and-compare tasks, and assertion overheads are given in Fig. 9c. The numbers adjacent to nodes in Fig. 9d indicate the associated initial FT levels. The application of the clustering procedure for CBFT to the task graph of Fig. 9a results in the clusters shown by enclosures in the augmented task graph shown in Fig. 9e. The allowable system fault detection latency, $\tau_d$, is assumed to be 75. The fault detection time from t1 to t6c is 78 (summation of execution and communication times up to t6c; note that the communication time between t2 and t6 is 0 since they belong to the same cluster), which exceeds $\tau_d$. Therefore, an assertion task t2c is added at task t2. On the other hand, the duplicate-and-compare tasks for t3 are eliminated since tasks t7 and t10 are error-transparent and $\tau_d$ is not exceeded. An assertion task is required at task t4, even though task t8 is error-transparent, since t8 does not have an assertion. Also, an assertion task is required at task t5 since: 1) t9 is error-transparent, but does not have an assertion, and 2) task t12 is not error-transparent. We recalculate the priority and FT levels of tasks after the clustering of each task to address the fact that there may be more than one critical path in the cluster and the critical path may change as tasks get clustered. As mentioned before, the accumulated fault detection time at any node is the maximum of fault detection times on all paths leading to it from the last checked task on the path.
The fault detection times can be more accurately estimated during co-synthesis by considering actual start and finish times of associated tasks and communication edges. If the difference between the finish time of a check task and the start time of the last checked task in that path is more than t d , extra check tasks would need to be added. On the other hand, this accurate estimate may also indicate that some check tasks are redundant. They can be deleted.
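The latency test described in this paragraph can be sketched as a simple comparison against t d . The function name `paths_exceeding_threshold` and the tuple layout are assumptions made for illustration. The first path reuses the 78-versus-75 figures from the Fig. 9 example, while the second path is hypothetical.

```python
def paths_exceeding_threshold(paths, t_d):
    """Return the names of paths whose fault detection latency exceeds t_d.

    Each entry of `paths` is (name, start time of the last checked task,
    finish time of the check task); the layout is a hypothetical one.
    """
    return [
        name
        for name, checked_start, check_finish in paths
        if check_finish - checked_start > t_d
    ]


paths = [
    ("t1->t6c", 0, 78),  # latency 78, as in the Fig. 9 example
    ("t1->t7c", 0, 60),  # hypothetical path within the threshold
]
violations = paths_exceeding_threshold(paths, t_d=75)
```

A path reported here would need an extra check task. Conversely, a check task on a path that stays well within t d is a candidate for deletion.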
Application of the clustering method for CBFT to the task graph of Fig. 1a results in the five clusters shown in Fig. 10a, assuming in this case that tasks t2, t3, and t4 are error-transparent and t d is sufficiently large (e.g., greater than the maximum deadline of 88). The execution vectors of the check tasks and the communication vectors of the edges leading to those tasks are given in Fig. 10b.
Multidimensional Assertions
Each fault can affect the system in multiple ways and dimensions. A separate assertion check, whenever available, can be used to monitor each of these dimensions. For example, a single fault in the input interface of the OC-192 signal in a telecom system can be detected through checks based on a loss-of-clock, loss-of-frame, transmission bit error detection, loss-of-signal, loss-of-synchronization, excessive pointer adjustments, etc. Our algorithm allows the user to specify multiple types of assertions for each task and the algorithm uses the assertion or combination of assertions which achieves the required fault coverage without incurring too much overhead.
Assertion Sharing
In embedded systems, there may be several tasks which require the same type of assertion check. Such an assertion task can be time-shared, if the associated checked tasks do not overlap in time, to further reduce the fault tolerance overhead. The pseudocode of the assertion sharing procedure is given in Fig. 11 . A similar technique can be used to time-share compare tasks as well. Once the architecture is available, the actual start and finish times of each task and edge are stored after scheduling. For each PE, we first sort the assertion tasks in the order of decreasing execution cost. For each unshared assertion from the sorted assertion list, we form an assertion sharing group which is a collection of tasks that share the same assertion. While evaluating each assertion for sharing, we pick the assertion with the largest execution time first since such an assertion generally requires a large number of resources. Sharing an assertion with a large resource requirement among multiple tasks leads to larger savings in architecture cost. We use the EXPAND_ASSERTION_GROUP procedure to expand such an assertion sharing group based on architectural hints, if provided. If such hints are not provided (when system_assertion_sharing = FREE), a list of assertions of the same type is created. We pick an assertion from the list and create an assertion sharing group which is expanded using the EXPAND_ASSERTION_GROUP procedure. Suppose two tasks t i and t j , allocated to the same PE, require the same type of assertion, say t c . In order to evaluate the possibility of assertion sharing for these two tasks, we first check whether there is any overlap of execution times of these assertion tasks during the hyperperiod. An overlap may be possible when these assertion tasks are mapped to a PE which allows concurrent execution, e.g., an ASIC. 
If there is no overlap, in order to reduce the resource requirement (total number of gates) or power dissipation or both, we modify the execution time of task t c based on the assertion excess overhead vector. We use this modified execution time to schedule the shared assertion tasks. The execution time of an assertion task may increase due to the additional circuitry, such as extra fan-in, fan-out, and control lines. If deadlines are still met, we allow the assertion sharing and consider the next assertion from the list for possible assertion sharing with t c .
To illustrate the application of assertion sharing, consider the task graphs in Fig. 12a. Assertion task t3c is used to check tasks t3A and t3B. The associated architecture is given in Fig. 12b. The schedule without assertion sharing is given in Fig. 12c. Since the execution times of tasks t3cA and t3cB do not overlap, we consider assertion sharing. We use the assertion excess overhead vector of this assertion to compute the modified execution time for shared assertion task t3cs to be 4 × (1 + 0.5) = 6. We use this modified time to reschedule shared task t3cs. The resultant schedule is given in Fig. 12d. Since all deadlines are still met, this assertion sharing is allowed. As shown in Fig. 12d, though the length of the execution slot for t3cs is increased, the resource requirement in terms of the number of gates in ASIC1 is decreased due to the fact that the functional module for t3cs is now time-shared between two assertion tasks. This in turn supports our ultimate goal of reducing the embedded system cost while meeting real-time constraints. We have observed that in certain examples, assertion sharing makes room for accommodating additional functions in a PE such that one or more PEs (ASIC, FPGA, etc.) can be eliminated.
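The two checks used in this example, the overlap test and the modified execution time, can be sketched as below. This is an illustrative sketch only: the function names are hypothetical, the assertion excess overhead is treated as a single fractional value rather than a full vector, and the slot times are invented. Only the 4 × (1 + 0.5) = 6 computation comes from the example above.

```python
from itertools import combinations


def slots_overlap(a, b):
    """True if half-open time slots a = (start, end) and b overlap."""
    return a[0] < b[1] and b[0] < a[1]


def any_overlap(slots):
    """True if any pair of assertion slots in the hyperperiod overlaps."""
    return any(slots_overlap(a, b) for a, b in combinations(slots, 2))


def shared_execution_time(base_time, excess_overhead):
    """Execution time of the shared assertion, inflated by the overhead."""
    return base_time * (1 + excess_overhead)


# Hypothetical slots for t3cA and t3cB; each assertion takes 4 time units.
slots = [(10, 14), (30, 34)]
if not any_overlap(slots):
    t3cs_time = shared_execution_time(4, 0.5)  # 4 x (1 + 0.5)
```

After computing the modified time, the shared task still has to be rescheduled, and the sharing is accepted only if all deadlines remain met. That step is outside this sketch.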
Architecture Dependability
Critical applications of embedded systems demand highly dependable architectures. Architecture dependability largely relies on how efficiently a fault is detected and how fast the system recovers from a fault. Therefore, efficient fault diagnosis and error recovery procedures are important in achieving required dependability objectives of an embedded system. Next, we describe the procedure to impart fault diagnosis and error recovery and follow up with a method to evaluate embedded system architecture availability. 
Fault Diagnosis and Error Recovery
A module may have more than one check task. Each check task indicates fault-free/faulty status which is stored. Once a fault is detected by a check task, we use a concept called hit timing to classify whether the detected fault is transient or permanent. To do this, we employ a counter to keep track of the number of times faults have been detected by the check task. This counter is cleared after a specific time interval. It has a programmable threshold. When the threshold is exceeded, an interrupt is generated for the diagnostic controller. The fault-isolation software running on the diagnostic controller monitors the interrupt status of various check tasks and declares that a permanent fault is located on the module. In case of systems requiring continuous operation, a switch to the protection or stand-by module needs to occur for permanent faults. The protection tasks are preassigned to the protection module to reduce the protection switch overhead. Our scheduler takes into account the overhead associated with the protection switch time so that deadlines are always met in the presence of a single fault. For efficient error recovery, an m-to-N topology is used, where m protection modules are used for N service modules. Recall that an embedded system architecture generally has several interconnected modules. A module is defined as an interconnection of several elements from the resource library to perform a specific set of functions. Fig. 13 illustrates the 1-to-N protection philosophy for error recovery. The service and protection modules together form a failure group (FG). All service PEs and links, to which a group of task graphs (for which a hint is provided) is allocated, form the pilot group (PG) for such an FG. There can be more than one task graph being executed on a PG. PG is a subset of FG. The group of service PEs and links in a PG are switched out together to a set of protection PEs and links in the event of failure in any one of these PEs and links. 
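The hit timing counter described above can be sketched as follows. The class name `HitCounter`, the fixed clearing windows aligned to multiples of the interval, and the return strings are assumptions made for illustration. The text specifies only a counter with a programmable threshold that is cleared after a specific time interval.

```python
class HitCounter:
    """Counts fault detections by one check task within a clearing interval."""

    def __init__(self, threshold, interval):
        self.threshold = threshold  # programmable threshold
        self.interval = interval    # counter is cleared every `interval` time units
        self.count = 0
        self.window = 0             # index of the current clearing window

    def record_fault(self, now):
        """Register a detected fault at time `now` and classify it."""
        window = int(now // self.interval)
        if window != self.window:
            # The clearing interval has elapsed since the last hit: reset.
            self.window = window
            self.count = 0
        self.count += 1
        # Exceeding the threshold would raise an interrupt for the
        # diagnostic controller; here we simply report the classification.
        return "permanent" if self.count > self.threshold else "transient"


counter = HitCounter(threshold=2, interval=100)
results = [counter.record_fault(t) for t in (10, 20, 30, 150)]
```

With a threshold of two, the third hit inside the same window is classified as permanent, while the isolated hit at time 150 falls in a new window and is treated as transient.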
Next, we duplicate the PG to provide 1-to-1 protection.
In order to derive an efficient error recovery topology in the co-synthesis setting, we need to identify the FGs and their interconnections such that the unavailability constraints (which are specified a priori as some fixed number of minutes per year for each task graph) of various task graphs are met. We formulate this problem as a restricted version of the graph isomorphism problem [26], [27]. We start with an architecture graph, where nodes represent PEs and edges represent links. The FG size is defined based on architectural hints for the task graph, if specified, and the task graph unavailability constraint. It indicates the number of service modules in the FG. If more than one task graph is executed on a given FG, its unavailability constraint is set to the minimum of the unavailability constraints of all associated task graphs. The error recovery topology definition procedure is given in Fig. 14. We use the architecture, architectural hints for FGs, and task graphs to derive the error recovery topology. Architectural hints can indicate the PEs and links required to implement specific task graphs, which can form part of an FG. For example, a set of PEs and links that execute a set of task graphs for the control function can form part of an FG. Existence of such hints is not necessary. However, if hints are available, we form the FGs based on them.
We use the EXPAND_FAILURE_GROUP procedure to expand the FG in two dimensions: horizontal and vertical. In the horizontal dimension, we expand the PG, while, in the vertical dimension, we expand the FG to increase the number of service modules (which have an architecture graph isomorphic to that of PG) to reduce the fault tolerance overhead. Expansion of the PG is limited by the PG limit (PG_limit). If the pilot group is too large and executes a large number of task graphs, then the FIT rate of the pilot group will be high. Also, in the event of PG failure, a large number of task graphs will be affected, which may not be acceptable from the quality of service standpoint, even though the system unavailability criteria may be met. In order to address this aspect, FG's horizontal expansion threshold, known as PG_limit, is specified a priori in terms of the number of concurrent task graphs executed in the PG. During expansion of the FG in both horizontal and vertical dimensions, we make sure that the unavailability constraints are met using Markov models of the FGs derived for each task graph. We use the CREATE_FAILURE_GROUP procedure given in Fig. 15 to create FGs for those PEs which are not covered by architectural hints, and follow up with the EXPAND_FAILURE_GROUP procedure to expand such FGs, as explained above.
To illustrate the application of the error recovery topology definition procedure, consider the architecture graph shown in Fig. 16a. It has four PEs of the same type, P1 (P1a is an instance of P1, and so on), executing the same type of task graphs concurrently. We form the pilot group containing P1a, as shown in Fig. 16b. Then, we duplicate the pilot group by adding a protection PE of the same type, P1Pa. Next, we need to consider expansion of this pilot group if the unavailability criteria are met. We create a list of PEs of the same type, which have connections isomorphic to the pilot group, in order to expand the pilot group in the horizontal dimension. First, we determine PG_limit to be the number of concurrent task graphs executed in the pilot group. For example, suppose that a PG is executing a task graph for the DS-3 signal (a 44.736 Mb/s transmission signal comprising 672 telephone channels in a telecom system). In the event of failure, a protection switch occurs and the service related to one DS-3 signal is affected. If the PG_limit is set to two, then the PG is expanded to include PEs and links which perform functions for two DS-3 signals. Now, even if failure occurs in a PE/link which supports one DS-3 signal, the entire PG is switched to a stand-by module. In this scenario, the second DS-3 signal is interrupted even if the failure is not associated with any PE/link servicing the second signal. Therefore, on the one hand, we would like to decrease the PG_limit to minimize the adverse impact on the system during protection switch. On the other hand, we would like to increase the PG_limit such that the protection switch overhead is minimized. In Fig. 16, the PG_limit is assumed to be two. Therefore, we add P1b to the pilot group and follow up by expanding the protection group, as shown in Fig. 16c. Recall that the pilot group and protection group together form the FG.
Next, we consider expansion of the FG in the vertical dimension by identifying a set of PEs for which the architecture graph is isomorphic to that of the pilot group. In this case, we identify the set consisting of P1c and P1d. We add P1c and P1d to the FG, as shown in Fig. 16d . The resulting FG interconnection graph is shown in Fig. 16e , which is used to estimate the unavailability of this FG. The above process is repeated for the remaining PEs: P2 and P3.
In order to increase the availability of the protection modules, we schedule a set of background diagnostic tasks on them. The frequency f of the execution of the background diagnostics is determined a priori. However, it may be increased if deemed necessary to meet the unavailability requirements during FG unavailability analysis.
Architecture Dependability Analysis
We characterize the system as an interconnection of several FGs, where each FG can have either 1-to-1, 1-to-N, or m-to-N protection or no spare modules. Even though we assume a single PE fault model, m-to-N protection may still be needed to protect against subsequent faults which occur before the first one is repaired (in other words, to decrease the unavailability). More faults in an FG than it can handle lead to system failure. In order to determine the architecture availability, we use Markov models [28], [29], [30] to determine the unavailability of each FG. There are two major contributors to system unavailability or downtime: hardware faults and software faults. In order to characterize hardware faults, we use the FIT rate of each hardware component, which is specified a priori. For all software tasks allocated to the hardware component, the composite FIT rate is estimated using the execution-time model [30]. To facilitate unavailability analysis, in addition to the FIT rate of each component, we also assume that the MTTR is specified a priori. To estimate the FIT rate for a general-purpose processor (software), we sum up the associated hardware FIT rate and the composite FIT rate of the allocated software tasks.
Once the FGs are formed, we use the procedure given in Fig. 17 to perform the dependability analysis. If the calculated unavailability fails to meet the system unavailability constraints, we reject the architecture and continue with the next possible allocation.
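To give a feel for how FIT rates and MTTR translate into an unavailability figure in minutes per year, the sketch below evaluates a textbook birth-death Markov chain for an m-to-N failure group. It is not the model of Fig. 17. The function names, the identical FIT rate for all modules, the single repair facility, and the omission of protection switch time and diagnostic coverage are all simplifying assumptions. The only definitions relied upon are that one FIT is one failure per 10^9 hours and that the FG is down once more than m modules have failed.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def fg_unavailability(fit, mttr_hours, n_service, m_protection):
    """Steady-state unavailability of a failure group (birth-death chain).

    State k = number of failed modules. From state k the failure rate is
    (total - k) * lam and the repair rate is mu (one repair at a time).
    """
    lam = fit / 1e9          # failures per hour for one module
    mu = 1.0 / mttr_hours    # repairs per hour
    total = n_service + m_protection
    weights = [1.0]          # unnormalized steady-state probabilities
    for k in range(total):
        weights.append(weights[-1] * (total - k) * lam / mu)
    down = sum(weights[m_protection + 1:])  # more failures than spares
    return down / sum(weights)


def minutes_per_year(unavailability):
    return unavailability * MINUTES_PER_YEAR


# Hypothetical module with 1,000 FITs and an MTTR of two hours.
simplex = fg_unavailability(1000, 2.0, n_service=1, m_protection=0)
one_to_one = fg_unavailability(1000, 2.0, n_service=1, m_protection=1)
```

For the unprotected module this reduces to lam / (lam + mu), roughly one minute of downtime per year for the assumed numbers. Adding one protection module lowers the estimate by several orders of magnitude under these idealized assumptions.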
THE COFTA ALGORITHM
We first provide an overview of our co-synthesis algorithm, COFTA, and then follow up with details of each step. Fig. 18 presents the co-synthesis process flow which we follow in our work. The task graphs, system/task constraints, and resource library are parsed and appropriate data structures are created during the parsing step. The task clustering technique for CBFT is used to form clusters. During cluster formation, we use the concept of error transparency and fault detection latency for the placement of assertion and duplicate-and-compare tasks. The hyperperiod of the system is computed and we form the association array which stores the various attributes of each copy of a task graph in the hyperperiod. In traditional real-time computing theory, if period i is the period of task graph i, then (hyperperiod ÷ period i ) copies are obtained for it [25]. However, this is impractical from the standpoint of both co-synthesis CPU time and memory requirements, especially for multirate task graphs for which this ratio may be very large. We use the concept of an association array [16] to tackle this problem.
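The replication count that makes the traditional approach impractical can be illustrated with a short sketch. It uses the standard definition of the hyperperiod as the least common multiple of the task graph periods. The function names and the two sample periods (25 microseconds and one minute, both expressed in microseconds) are chosen for illustration only.

```python
from functools import reduce
from math import gcd


def hyperperiod(periods):
    """Least common multiple of all task graph periods (integer time units)."""
    return reduce(lambda a, b: a * b // gcd(a, b), periods)


def copies_in_hyperperiod(periods):
    """Number of copies of each task graph if it were fully replicated."""
    h = hyperperiod(periods)
    return {p: h // p for p in periods}


periods_us = [25, 60_000_000]  # 25 microseconds and 1 minute
copies = copies_in_hyperperiod(periods_us)
```

With these two periods the faster task graph would be replicated 2,400,000 times, which is the kind of blow-up the association array is meant to avoid.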
Clusters are ordered based on their priority level. We define the priority level of a cluster as the maximum of the priority levels of the constituent tasks and incoming edges. The mapping of tasks (edges) to PEs (communication links) is determined during the allocation step. COFTA has two loops in the co-synthesis process flow: 1) an outer loop for selecting clusters, and 2) an inner loop for evaluating various allocations for each cluster. For each cluster, an allocation array consisting of the possible allocations is created. The size of this array is kept at manageable levels by limiting the number of extra PEs and links added at each step. While allocating a cluster to an ASIC or FPGA, it is made sure that the PE's capacity related to pinout, gate count, and power dissipation is not exceeded. We use the power dissipation estimation procedure from [16] to estimate the power dissipation of each PE and link, and check whether the constraints are exceeded. Also, while allocating a cluster to a general-purpose processor, it is made sure that the memory capacity of the PE is not exceeded. Inter-cluster edges are allocated to resources from the link library.
In the scheduling step, the relative ordering of task/communication execution and the start and finish times for each task and edge are determined. We employ a combination of preemptive and nonpreemptive static scheduling. We also take into consideration the operating system overheads such as interrupt overhead, context switch, remote procedure call, etc., through a parameter called preemption overhead. The preemption overhead is determined experimentally and given to the co-synthesis algorithm beforehand. Incorporating scheduling into the inner loop facilitates accurate performance evaluation. Performance evaluation of an allocation is extremely important in picking the best allocation. An important part of performance evaluation is finish-time estimation. This estimation process determines the start and finish times of each task employing the longest path algorithm [16] to check whether a task with a specified deadline meets it. In addition to the finish-time estimation, we also calculate the overall FIT rate of the architecture and ascertain whether it meets the system unavailability constraints. The allocation evaluation step compares the current allocation against previous ones based on total dollar cost. If more than one allocation has the same dollar cost, then we pick the allocation for which the summation of the unavailability of all FGs is minimum. Fig. 19 gives the pseudocode for the COFTA procedure. Next, we describe each step of COFTA in detail.
Task Clustering for Fault Tolerance
Our task clustering technique was presented earlier in Section 3.1.2. We impart fault tolerance by adding assertion tasks, when available, else duplicate-and-compare tasks to some tasks. The duplicate-and-compare tasks inherit the preference and exclusion vectors of the original task. In addition, the exclusion vector of the duplicate/compare or assertion task is formed such that they are prevented from being allocated to the same PE as the checked task. This is done to prevent a fault from affecting both the checked and added task(s). However, duplicate and compare tasks can both be mapped to the same PE, since a fault in that PE will not affect the validity of the checked task's output. This technique exploits the fact that certain tasks may be error-transparent and duplicate-and-compare and assertion tasks can be eliminated if the fault detection latency requirements are met.
The Association Array
It was shown in [25] that there exists a feasible schedule for a job if and only if there exists a feasible schedule for the hyperperiod. Therefore, traditionally, as mentioned before, each task graph is replicated the requisite number of times in the hyperperiod. This is the approach used in [15]. The advantage of this approach is that it allows different instances of a task to be allocated to different PEs. However, this flexibility comes at a severe price in terms of co-synthesis CPU time and memory requirement when the hyperperiod is large compared to the periods. A large hyperperiod can result either when the task graph periods are co-prime or when there is a wide variation among task graph periods. Though a large hyperperiod does not necessarily increase the cost of the embedded system architecture, it requires the co-synthesis algorithm to take into account a large number of copies of various tasks, which in turn significantly increases the run-time of the co-synthesis algorithm. In order to address this concern, the concept of association array was proposed in [16]. We use this concept to eliminate the need for replication of task graphs. Our experience from COSYN [16] shows that up to 13-fold reduction in co-synthesis CPU time is possible using this concept for medium-sized task graphs with less than 1 percent increase in embedded system cost. The slight increase in system cost is due to a slight decrease in the efficiency of PE and link schedules.
An association array has an entry for each copy of each task and contains information such as: 1) the PE to which it is allocated, 2) its priority level, 3) its deadline, 4) its best-case finish time, and 5) its worst-case finish time.
The deadline of the nth instance of a task is offset by (n - 1) multiplied by its period from the deadline in the original task. The association array not only eliminates the need to replicate the task graphs, but it also allows allocation of different task graph instances to different PEs, if desirable, to derive an efficient architecture. This array is created after cluster formation and is updated after scheduling. It also supports pipelining of task graphs, when necessary, to derive an efficient architecture [16].
There are two types of periodic task graphs: 1) those with a deadline less than or equal to the period, and 2) those with a deadline greater than the period. In order to address this fact, an association array can have two dimensions, as explained next. If a task graph has a deadline less than or equal to its period, it implies that there will be only one instance of the task graph in execution at any instant. Such a task graph needs only one dimension in the association array, called the horizontal dimension. If a task graph has a deadline greater than its period, it implies that there can be more than one instance of this task graph in execution at some instant. For such tasks, we create a two-dimensional association array, where the vertical dimension corresponds to concurrent execution of different instances of the task graph. Details of the association array are provided in [16] .
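The two rules governing association array entries, the deadline offset of the nth instance and the test for whether a vertical dimension is needed, can be written down directly. The function names and sample values below are hypothetical. The full association array structure is described in [16] and is not reproduced here.

```python
def instance_deadline(deadline, period, n):
    """Deadline of the nth instance (n >= 1) of a task."""
    return deadline + (n - 1) * period


def needs_vertical_dimension(deadline, period):
    """True if instances of the task graph can overlap in execution."""
    return deadline > period


# Hypothetical task graph with deadline 88 and period 100.
third_deadline = instance_deadline(88, 100, 3)
single_row = not needs_vertical_dimension(88, 100)
```

A task graph whose deadline exceeds its period (for example, deadline 150 with period 100) would instead need the two-dimensional form.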
Cluster Allocation
Once the clusters are formed, we need to allocate them. Clusters are ordered based on decreasing priority levels. After the allocation of each cluster, we recalculate the priority level of each task and cluster. We pick the cluster with the highest priority level and create an allocation array of the possible allocations for the given cluster at that point in co-synthesis. Once the allocation array is formed, we use the inner loop of co-synthesis to evaluate the allocations from this array.
The Outer Loop of Co-Synthesis
The allocation array considers the following: 1) architectural hints, 2) preference vector, 3) allocation of the cluster to existing resources in the partial architecture, 4) upgrade of links, 5) upgrade of PEs, 6) addition of PEs, and 7) addition of links.
Architectural hints are used to prestore allocation templates (these templates correspond to the mapping of subtaskgraphs to part of the architecture being built). We exclude those allocations for which the pin count, gate count, memory limits, and power constraints are exceeded. During allocation array formation, addition of up to two new PEs and links of the same type is allowed to keep the size of the allocation array at manageable levels. However, our algorithm does allow the user to specify the limit on the number of new PEs and links of the same type that can be used at any step for allocation purposes.
The Inner Loop of Co-Synthesis
Once the allocation array is formed, we mark all allocations as unvisited. We order the allocations in the allocation array in the order of increasing dollar cost. We pick the unvisited allocation with the lowest dollar cost, mark it visited, and go through the scheduling and performance estimation steps described next.
Scheduling
We employ a priority-level based static scheduler, which combines preemptive and nonpreemptive scheduling, to schedule tasks and edges on all PEs and links in the allocation. We usually need to schedule the first copy of the task only. The start and finish times of the remaining copies are updated in the association array. However, we do sometimes need to schedule the remaining copies. To determine the order of scheduling, we order tasks and edges based on the decreasing order of their priority levels. If two tasks (edges) have equal priority levels, then we schedule the task (edge) with the shorter execution (communication) time first. While scheduling communication edges, the scheduler considers the mode of communication (sequential or concurrent) supported by the link and the processor. Though preemptive scheduling is sometimes not desirable, due to the overhead associated with it, it may be necessary to obtain an efficient architecture. We take preemption overhead into consideration during scheduling. The preemption overhead, x, is determined experimentally considering the operating system overhead. It includes context switching and any other processor-specific overheads. Preemption of a higher priority task by a lower priority task is allowed only in the case when the higher priority task is a sink task which will not miss its deadline, in order to minimize the scheduling complexity.
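The ordering rule used by the scheduler, decreasing priority level with ties broken by shorter execution (communication) time, amounts to a two-part sort key. The record layout and sample values below are hypothetical.

```python
def scheduling_order(items):
    """Order tasks/edges by decreasing priority, then by shorter time."""
    return sorted(items, key=lambda item: (-item["priority"], item["time"]))


# Hypothetical tasks; t1 and t3 share a priority level.
tasks = [
    {"name": "t1", "priority": 10, "time": 5},
    {"name": "t2", "priority": 12, "time": 7},
    {"name": "t3", "priority": 10, "time": 3},
]
order = [item["name"] for item in scheduling_order(tasks)]
```

Here t2 is scheduled first because of its higher priority level, and t3 precedes t1 because it has the shorter execution time.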
Performance Estimation
We estimate the finish times of all tasks with specified deadlines and check whether their deadlines are met. For fault tolerance overhead optimization, in addition to the finish time, we identify the FGs for efficient error recovery and evaluate the unavailability of various FGs, as well as the architecture using Markov models. We store the best- and worst-case start as well as finish times of each task and edge. When a task (edge) gets allocated, its minimum and maximum execution (communication) times become equal and correspond to the execution (communication) time on the PE (link) to which it is allocated, as shown in the finish time estimation graph in Fig. 20 (cluster C1 is mapped to P1 and no other mapping is assumed to be performed yet). The numbers in the braces, e.g., {104, 96} adjacent to t4c, indicate maximum and minimum finish times, and the numbers in the parentheses, e.g., (21, 11) adjacent to t3, represent its maximum and minimum execution times, respectively.
Following finish-time estimation, we also use actual start and stop times of the task and communication edges to calculate the fault detection latencies of each check (assertion or compare) task. If necessary, additional assertion and/or duplicate-and-compare tasks are added to meet system fault detection latency requirements. In addition, the unavailability of each FG is estimated to assess the overall unavailability of various system functions, as shown in Fig. 17. 
Allocation Evaluation
Each allocation is evaluated based on the total dollar cost. We pick the allocation which at least meets the deadlines in the best case. If no such allocation exists, we pick an allocation for which the summation of the best-case finish times of all task graphs is maximum. The best-case finish time of a task graph is the maximum of the best-case finish times of the constituent tasks with specified deadlines. This generally leads to a less expensive architecture. Note that we use "maximum" instead of "minimum" to be frugal with respect to the embedded system architecture cost at the intermediate steps. If deadlines are not met, then we have the option of upgrading the architecture at a later step anyway.
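The selection rule in this step can be sketched as below. The record fields and sample numbers are hypothetical. The tie-break on the summed FG unavailability follows the rule stated in the overview of the algorithm, and the fallback picks the maximum summation of best-case finish times, as described above.

```python
def pick_allocation(allocations):
    """Choose an allocation from already scheduled and evaluated candidates."""
    feasible = [a for a in allocations if a["meets_deadlines_best_case"]]
    if feasible:
        # Cheapest feasible allocation; equal cost is resolved by the
        # smaller summation of FG unavailabilities.
        return min(feasible, key=lambda a: (a["cost"], a["unavailability_sum"]))
    # No allocation meets the deadlines in the best case: be frugal and
    # take the one with the largest summation of best-case finish times.
    return max(allocations, key=lambda a: a["best_case_finish_sum"])


candidates = [
    {"name": "A", "cost": 100, "meets_deadlines_best_case": False,
     "best_case_finish_sum": 120, "unavailability_sum": 5.0},
    {"name": "B", "cost": 150, "meets_deadlines_best_case": True,
     "best_case_finish_sum": 90, "unavailability_sum": 8.0},
    {"name": "C", "cost": 150, "meets_deadlines_best_case": True,
     "best_case_finish_sum": 95, "unavailability_sum": 6.0},
]
chosen = pick_allocation(candidates)
```

In this invented example, B and C both meet the deadlines at equal cost, so C is preferred for its lower total unavailability.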
Application of the Co-Synthesis Algorithm
We next apply COFTA to the augmented task graph of Fig. 10 . The five clusters are ordered based on the decreasing value of their priority levels. Fig. 21 illustrates the allocation of various clusters during the outer and inner loops of co-synthesis. Since cluster C1 has the highest priority level, it is allocated first to the cheaper processor P1, as shown in Fig. 21a . The scheduler is run and the finish time is estimated, as shown in Fig. 20 . Since t4's deadline is not met in the best case, the allocation is upgraded, as shown in Fig. 21b . Now since deadlines are met, cluster C2 is considered for allocation. First, an attempt is made to allocate cluster C2 to the current PE, as shown in Fig. 21c . After scheduling, since finish-time estimation indicates that deadlines cannot be met in the best case, the allocation needs to be upgraded, as shown in Fig. 21d . Similarly, the allocation is continuously upgraded until the allocation configuration shown in Fig. 21f is reached, where deadlines are met in the best case. During allocation evaluation, we do not change the allocation of previously allocated clusters. For example, while evaluating various allocations for cluster C2, we do not change the allocation for cluster C1. However, we may downgrade one or more components of the allocation, as done between the allocations shown in Figs. 21e and 21f. In this case, though the link is downgraded from L2 to L1, the overall cost of the allocation is increased by adding a more powerful processor P2. Next, cluster C3 is considered for allocation. Since C3 is excluded from being allocated to P1 or P2 (see Fig. 10b ), it is allocated to ASIC1, as shown in Fig. 21g . Now, since deadlines are met, cluster C4 is considered for allocation, as shown in Fig. 21h . Since deadlines are again met, cluster C5 is considered for allocation. An attempt is made to allocate cluster C5 to the existing PE, ASIC1, as shown in Fig. 21i . 
Since the deadlines are met and all clusters are allocated, the distributed heterogeneous architecture given in Fig. 21i is the final solution.
For simplicity, the error recovery topology extraction and dependability analysis aspects are not illustrated for this example. These concepts were illustrated with earlier examples. However, if the system had three independent task graphs, such as the one shown in Fig. 1a , it would result in the overall system architecture shown in Fig. 21j . Here, the architecture shown in Fig. 21i forms one service module and is augmented with necessary selectors at the inputs and outputs. There are three service modules and one protection module in this system. In the event of failure in one of the service modules, a switch to the protection module occurs. This switch is accomplished by a system-level controller which monitors the health of the service and protection modules.
EXPERIMENTAL RESULTS
COFTA is implemented in C++. It was run on various Bell Laboratories telecom transport system task graphs. These are large task graphs representing real-life field applications. They contain tasks for synchronous optical network (SONET) interface processing, asynchronous transfer mode (ATM) cell processing, digital signal processing, provisioning, transmission interfaces, performance monitoring, protection switching, etc. These task graphs have wide variations in their periods ranging from 25 microseconds to 1 minute. The real-time constraints vary from 75 microseconds to 1 minute. The execution vectors for the tasks in these task graphs were either derived from experimental measurements or estimation from existing designs. The Table 2 . They were either based on the existing designs or estimated using Bellcore guidelines [31] . MTTR was assumed to be two hours since transport systems are part of the central office and are considered as attended equipment. The unavailability constraints for task graphs providing provisioning and transmission functions were assumed to be 12 minutes/year and 4 minutes/year, respectively. Tables 3 and 4 show that COFTA was able to handle these task graphs efficiently. Cost of the architecture is the summation of the cost of PEs and links in it. When two architectures derived by two different algorithms have an equal number of PEs and links, but different dollar costs, it implies that they employ PEs/links of different types. CPU times for co-synthesis were measured on a Sparcstation 20 with 256MB RAM. Table 3 shows the efficacy of COFTA in deriving fault-secure architectures. There are five major columns in Table 3 . The first column shows the name of the example and the number of tasks in it. The second column represents co-synthesis of architectures without any fault security. In the third column, fault security was imparted using TBFT. 
The fourth column indicates the cost of the double-modular redundant (DMR) architecture, where outputs of two simplex architectures are compared. In this case, we have simply doubled the cost of the simplex architecture (second column), ignoring the cost of the comparison elements. In the fifth column, fault security was imparted using CBFT. For fault-secure architectures, COFTA (fifth column) resulted in an average (average of individual cost reductions; averages are derived similarly for other columns) architecture cost reduction of 46.3 percent over DMR and 13.7 percent over TBFT (third column). Another important observation is that the average cost overhead of COFTA fault-secure architectures over simplex architectures is only 7.3 percent. Note that the cost overhead of a DMR architecture over a simplex architecture is at least 100 percent.
TABLE 3 FAULT-SECURE TELECOM TRANSPORT SYSTEM ARCHITECTURES
Table 4 shows the efficacy of COFTA in deriving fault-tolerant architectures. There are five major columns in Table 4. The second column represents simplex architectures without fault security or fault tolerance. In the third column, TBFT was used to impart fault detection, followed by error recovery. The fourth column indicates the cost of the triple-modular redundant (TMR) architectures. In this case, we have simply tripled the cost of the simplex architecture (second column), ignoring the cost of the voting elements. In the fifth column, COFTA was used with CBFT, assertion sharing, and error recovery to impart fault tolerance. For fault-tolerant architectures, COFTA is able to achieve an average architecture cost reduction of 48.2 percent over TMR and 14.7 percent over TBFT. Also, the average cost overhead of the COFTA fault-tolerant architectures over simplex architectures is only 55.4 percent. Note that TMR architectures have a cost overhead of at least 200 percent over simplex architectures.
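The reported averages can be cross-checked against each other. Because the DMR and TMR costs are taken as exactly two and three times the simplex cost, an architecture with fractional overhead o over simplex saves 1 - (1 + o)/k relative to k replicated copies. This expression is linear in o, so averaging the per-example reductions gives the same value as applying it to the average overhead, provided both averages are taken over the same set of examples. The short Python sketch below performs this check; the function name `reduction_vs_replication` is hypothetical and introduced only for illustration.

```python
def reduction_vs_replication(overhead: float, copies: int) -> float:
    """Fractional cost saving of an architecture costing (1 + overhead) times
    the simplex cost, relative to `copies` replicated simplex architectures.
    Comparator and voter costs are ignored, as in the text."""
    return 1.0 - (1.0 + overhead) / copies


# Average overheads over simplex quoted in the text.
dmr_saving = reduction_vs_replication(0.073, copies=2)  # fault-secure vs. DMR
tmr_saving = reduction_vs_replication(0.554, copies=3)  # fault-tolerant vs. TMR

print(f"saving over DMR: {100 * dmr_saving:.2f} percent")
print(f"saving over TMR: {100 * tmr_saving:.2f} percent")
```

The computed savings are about 46.35 percent over DMR and 48.2 percent over TMR, which agree with the reported 46.3 and 48.2 percent to within rounding of the quoted overheads.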
COFTA did not result in large cost savings over TBFT for the OAS1/2/3/4 examples because the optical receiver/transmitter modules dominated the cost. If these modules were excluded from consideration, the overall cost reduction for these four examples was 16.2 percent for fault security and 21.7 percent for fault tolerance, compared to TBFT.
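To give a sense of scale for the unavailability constraints quoted earlier in this section (12 and 4 minutes/year with an MTTR of two hours), the following Python sketch converts between downtime per year and mean time to failure. It assumes the simplest possible model, a single repairable module with steady-state unavailability MTTR / (MTTF + MTTR); this is not the dependability analysis performed by COFTA, which evaluates architectures with protection modules. The function names and the 365-day year are assumptions of the example.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def unavailability_min_per_year(mttf_hours: float, mttr_hours: float) -> float:
    """Expected downtime (minutes/year) of one repairable module,
    using steady-state unavailability U = MTTR / (MTTF + MTTR)."""
    u = mttr_hours / (mttf_hours + mttr_hours)
    return u * MINUTES_PER_YEAR


def min_mttf_hours(budget_min_per_year: float, mttr_hours: float) -> float:
    """Smallest MTTF (hours) that keeps a single unprotected module
    within the given downtime budget."""
    u = budget_min_per_year / MINUTES_PER_YEAR
    return mttr_hours * (1.0 - u) / u


provisioning_mttf = min_mttf_hours(12.0, mttr_hours=2.0)
transmission_mttf = min_mttf_hours(4.0, mttr_hours=2.0)
print(f"provisioning: MTTF of at least {provisioning_mttf:.0f} hours")
print(f"transmission: MTTF of at least {transmission_mttf:.0f} hours")
```

Under this single-module assumption, the 12 minutes/year budget corresponds to an MTTF of roughly 10 years and the 4 minutes/year budget to roughly 30 years.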
CONCLUSIONS
We presented an efficient co-synthesis algorithm for fault-tolerant heterogeneous distributed embedded system architectures. Experimental results on various large real-life telecom transport system examples are very encouraging. This is the first hardware-software co-synthesis algorithm to optimize dependability. We proposed the CBFT technique to take advantage of task error transparency to reduce the fault tolerance overhead. The error transparency property is common in telecom system task graphs. We proposed a new technique to identify the failure groups for efficient error recovery. We also provided methods to exploit multidimensional assertions as well as assertion sharing to further reduce the overhead.
TABLE 4 FAULT-TOLERANT TELECOM TRANSPORT SYSTEM ARCHITECTURES
