Introduction
CAD-related design methodologies are nowadays the standard means of designing embedded systems, in particular those adopted in mission-critical environments (e.g., telecom, automotive electronics). They allow the adoption of ad hoc solutions to ensure properties that are not purely functional (e.g., testability, self-checking (SC)).
Typical application-specific design approaches take full advantage of high-level synthesis techniques. Conforming the design to the requirements of a self-checking system is usually left to a lower abstraction level (logic or transistor level), by suitably modifying the architecture via data encoding and constrained, specific logic synthesis. Proposed approaches operate through a suitable encoding in the state assignment of FSMs, as well as by encoding the data flowing into the data path, supported by a subsequent specific logic synthesis [1] [2] [3]. However, these methodologies suffer from some drawbacks; furthermore, they are introduced when most of the synthesis steps have already been carried out, so that the resulting system is not optimized with respect to the overall functionality of the circuit.
The aim of this work is to shift the handling of the self-checking properties towards the upper levels of the design process (high level), to enhance resource exploitation while granting autonomous error detection capability (not requiring the modification of the functional circuit).
Section 2 introduces the classes of faults we consider, the target architecture for high-level synthesis, and the modifications necessary to obtain a self-checking system. Section 3 presents the proposed approach, detailing the design methodology. Two different classes of approaches can be envisioned. The first aims at fulfilling the design constraints on time, performance, and checking latency by exploiting the data dependencies. The second attempts to overcome a suboptimal exploitation of the resources by interleaving the computation and the checking activities. Other proposals concerning similar design methodologies can be found in the literature [4] [5]. The proposed work focuses on the optimization of the hardware resources by acting on the global scheduling and resource allocation. The approach presented in [4] optimizes the scheduling itself by using a multidirectional force-directed scheduling on a region-based partitioning of the data-flow graph. The proposal presented in [5], on the other hand, aims at providing a semi-concurrent error detection ability. Section 4 reports some considerations concerning the implementation cost, together with experimental data on a small example outlining the potential benefits of the proposed approach. Section 5 draws some conclusions and outlines future investigations.
Preliminaries

Architectural Model
The target is the multiplexer-based architecture. Multiplexer-based architectures are composed of functional units (FUs) and registers; in this organization, each data transfer between FUs and registers occurs through multiplexers, and each value produced by a functional unit is stored into a register through multiplexers [6].
Self-checking Architecture
The self-checking architecture is mainly composed of two elements: the functional architecture and the checking architecture. A single control unit is used. The functional part is designed without any self-checking capability; this capability is provided by the checking section.
Fault Model and Faults Equivalence
The fault model refers to the stuck-at fault (s-a-0, s-a-1), widely adopted at the register transfer level, and, with respect to the present architecture, covers the following classes:
• faults affecting any component of the data path. Note that multiple faults inside each component can be dealt with, provided they generate an observable error;
• input and output lines of any component;
• the control unit, thus generating erroneous control configurations for the data path.
Moreover, multiple faults affecting different components of the data path can be detected if they cause differently observable errors.
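The notion of an observable stuck-at fault can be illustrated with a minimal sketch. The helper names `stuck_at` and `is_observable` are hypothetical and not part of the original work; the sketch only models a single line of a data word being forced to a constant and checks whether the resulting value differs from the fault-free one.

```python
def stuck_at(value: int, bit: int, stuck: int) -> int:
    """Return `value` with line `bit` forced to the constant `stuck` (0 or 1)."""
    if stuck:
        return value | (1 << bit)      # s-a-1: the line always reads 1
    return value & ~(1 << bit)         # s-a-0: the line always reads 0


def is_observable(good: int, bit: int, stuck: int) -> bool:
    """A fault produces an observable error only if it changes the value."""
    return stuck_at(good, bit, stuck) != good


good = 0b1010
print(is_observable(good, 0, 1))  # bit 0 is 0 in the good value, s-a-1 flips it
print(is_observable(good, 1, 1))  # bit 1 is already 1, s-a-1 leaves it unchanged
```

A fault that leaves the value unchanged for the current data produces no error at that time, which is why the fault classes above are qualified by observability.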
Checking Architecture Design
As depicted in Fig. 2, the correct behavior of the functional architecture can be verified by means of a checking unit that compares intermediate or final results at some checkpoints. Checkpoint positioning depends on the specific characteristics of the application and on the design goals, such as the level of observability of the results, cost, and state restoring in the case of fault-tolerant systems. In the following we deal mainly with datapath synthesis, since it constitutes the most relevant factor influencing gate count for our target applications. Three proposals are presented; the starting point for all of them is a datapath optimized for cost or speed (allocation and binding have already been performed). The first solution guarantees a null latency, and its goal is area minimization, while the other approaches are tailored for applications where silicon cost is the key factor.
Data Dependency Driven Methods
This first class of methods allows the realization of a checking architecture maintaining the same dependencies among data. Two approaches are available and differ in terms of objectives although both aim at improving resource exploitation with scheduling and allocation strategies using the entire set of available resources in a given set of control steps (Csteps). The initial system has a checking architecture equal to the functional one, after normal scheduling and allocation.
a) Time Constrained - Force Directed Based Scheduling for Concurrent Error Detection.
When time is the most critical constraint, the checking architecture initially duplicates the hardware resources of the functional one (null latency). The adopted optimization algorithm is derived from force-directed scheduling (FDS), where the mobility inside each Cstep of the operators pertaining to the functional part is zero, and the mobility of the resources of the checking part is derived from the ASAP and ALAP scheduling algorithms [7]. Since the presence of at least two units of the same type T is mandatory to verify the correctness of the results, only the functional units (FUs) whose cardinality is greater than two can be considered for hardware minimization. As far as the remaining FUs are concerned, for each type a distribution graph is determined, together with an estimation of the lower bound on their number [8]. If the number of allocated FUs is greater than the lower bound, it is possible to modify the scheduling of the checking architecture to balance the density of the distribution graph through the minimization of the force. The result is a structure that minimizes the amount of resources while maintaining a null latency. A simple example of the methodology is reported in Fig. 3.
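The mobility windows and distribution graph mentioned above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes unit-latency operations, control steps numbered from 1, and the uniform-probability distribution graph commonly used in force-directed scheduling. The function names, the `preds` dictionary format, and the example DFG are all hypothetical.

```python
from collections import defaultdict


def asap(preds):
    """Earliest control step of each operation (unit latency, steps start at 1)."""
    step = {}

    def visit(op):
        if op not in step:
            step[op] = 1 + max((visit(p) for p in preds[op]), default=0)
        return step[op]

    for op in preds:
        visit(op)
    return step


def alap(preds, last_step):
    """Latest control step of each operation for a schedule of `last_step` steps."""
    succs = defaultdict(list)
    for op, ps in preds.items():
        for p in ps:
            succs[p].append(op)
    step = {}

    def visit(op):
        if op not in step:
            step[op] = min((visit(s) for s in succs[op]), default=last_step + 1) - 1
        return step[op]

    for op in preds:
        visit(op)
    return step


def distribution_graph(windows, op_type, last_step):
    """Per FU type and control step, sum the probability that each operation
    occupies that step, assuming a uniform spread over its [early, late] window."""
    dg = defaultdict(lambda: [0.0] * (last_step + 1))  # index 0 is unused
    for op, (early, late) in windows.items():
        share = 1.0 / (late - early + 1)
        for c in range(early, late + 1):
            dg[op_type[op]][c] += share
    return dg


# Hypothetical checking-side DFG: three multiplications and one addition.
preds = {"m1": [], "m2": [], "a1": ["m1", "m2"], "m3": []}
op_type = {"m1": "mul", "m2": "mul", "m3": "mul", "a1": "add"}
early, late = asap(preds), alap(preds, last_step=2)
windows = {op: (early[op], late[op]) for op in preds}
dg = distribution_graph(windows, op_type, last_step=2)
print(dg["mul"][1:], dg["add"][1:])
```

In this sketch an operator of the functional part, whose mobility is zero, would be entered with a window `(s, s)` for its fixed step `s`, so that it contributes a full unit to that step. Only the checking-side windows are then free to move when the density is balanced.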
b) Area Constrained Scheduling for Concurrent Error Detection.
This second scheduling criterion imposes constraints on the resources. The starting point for the optimization is a checking architecture equal to the functional one. Constraints on resources are introduced by considering that at least two FUs of the same type T are necessary to verify the correctness of a result. The algorithm processes, one Cstep at a time, the number of allocated operations: if the constraint for a given resource is violated, the operation is shifted to the following Cstep. To improve the checking latency, among all the possible candidate operations the one with the smallest mobility is selected. The algorithm ends when the scheduling of all the operations of the checking architecture is completed. A small example is reported in Fig. 4; this approach improves area at the expense of latency.
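The step-by-step procedure described above resembles a resource-constrained list scheduler and can be sketched as follows. This is a simplified illustration under stated assumptions: operations have unit latency, `limit` gives the number of FUs of each type available to the checking operations in every Cstep, and `mobility` is supplied by the caller. The function name, the data layout, and the example DFG are hypothetical.

```python
def area_constrained_schedule(preds, op_type, limit, mobility):
    """Schedule one Cstep at a time; operations that would exceed the FU
    limit of their type are shifted to the following Cstep."""
    scheduled = {}
    step = 0
    while len(scheduled) < len(preds):
        step += 1
        # An operation is a candidate once all its predecessors have been
        # scheduled in an earlier Cstep (precedence is preserved).
        ready = [op for op in preds if op not in scheduled
                 and all(scheduled.get(p, step) < step for p in preds[op])]
        # Smallest mobility first, to limit the growth of checking latency.
        ready.sort(key=lambda op: mobility[op])
        used = {}
        for op in ready:
            t = op_type[op]
            if used.get(t, 0) < limit[t]:
                scheduled[op] = step
                used[t] = used.get(t, 0) + 1
            # otherwise the operation waits for the next Cstep
    return scheduled


# Hypothetical DFG: (m1 * ...), (m2 * ...), (m3 * ...), a1 = m1 + m2, a2 = a1 + m3
preds = {"m1": [], "m2": [], "m3": [], "a1": ["m1", "m2"], "a2": ["a1", "m3"]}
op_type = {"m1": "mul", "m2": "mul", "m3": "mul", "a1": "add", "a2": "add"}
mobility = {"m1": 0, "m2": 0, "m3": 1, "a1": 0, "a2": 0}
print(area_constrained_schedule(preds, op_type, {"mul": 2, "add": 1}, mobility))
```

With two multipliers, `m3` (the only operation with nonzero mobility) is the one deferred to the second Cstep, and the schedule still completes in three Csteps.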
Method relaxing Data Dependency: Data Interleaving
The previously described methods do not efficiently exploit the available resources because operations need to be carried out in a determined sequence. In fact, the existing precedence relation between operations and the lack of a sufficient number of FUs often force a later scheduling of the operations, even if the necessary resources are available. By relaxing the precedence constraint, a twofold effect is achieved: functional units are more extensively and efficiently employed, and the checking latency is smaller. The initial situation is the complete architecture, composed of the functional part and an identical checking part. The algorithm takes into account constraints on resources having at least two functional units of the same type, and processes, one Cstep at a time, the number of allocated operations. If, in a Cstep, a resource-use constraint is violated, the operations associated with the resource concerned are delayed. Among all the possible candidate operations to be carried out in the Cstep, the ones with the smallest mobility are bound to that Cstep. If, at a given control step, it is possible to carry out an operation by violating the precedence constraint, this solution is adopted, taking as inputs the corresponding values produced by the functional architecture. Additional checkpoints will be introduced. The algorithm ends when the scheduling of all the operations of the checking architecture is completed. A simple example is shown in Fig. 5 (for simplicity, the multiplication is assumed to be a single-cycle operation).
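One possible reading of the interleaving procedure is sketched below. It is an interpretation under stated assumptions, not the authors' algorithm: operations have unit latency, a checking operation may start as soon as each input is available either from the checking part or from the functional part (`func_step` gives the Cstep in which the functional architecture produces each value), and every value borrowed from the functional part is recorded as an additional checkpoint. All names and the example data are hypothetical.

```python
def interleaved_schedule(preds, op_type, limit, mobility, func_step):
    """Resource-constrained scheduling with relaxed precedence: an input may
    be taken from the functional architecture instead of the checking one."""
    scheduled, checkpoints = {}, set()
    step = 0
    while len(scheduled) < len(preds):
        step += 1
        # An input is available if its checking copy finished in an earlier
        # Cstep, or the functional part produced it in an earlier Cstep.
        ready = [op for op in preds if op not in scheduled
                 and all(scheduled.get(p, step) < step or func_step[p] < step
                         for p in preds[op])]
        ready.sort(key=lambda op: mobility[op])  # smallest mobility first
        used = {}
        for op in ready:
            t = op_type[op]
            if used.get(t, 0) < limit[t]:
                scheduled[op] = step
                used[t] = used.get(t, 0) + 1
                # Inputs whose checking copy is not yet complete come from
                # the functional part and need an additional checkpoint.
                checkpoints.update(p for p in preds[op]
                                   if scheduled.get(p, step) >= step)
    return scheduled, checkpoints


preds = {"m1": [], "m2": [], "m3": [], "a1": ["m1", "m2"], "a2": ["a1", "m3"]}
op_type = {"m1": "mul", "m2": "mul", "m3": "mul", "a1": "add", "a2": "add"}
mobility = {"m1": 0, "m2": 0, "m3": 1, "a1": 0, "a2": 0}
func_step = {"m1": 1, "m2": 1, "m3": 1, "a1": 2, "a2": 3}  # functional schedule
sched, extra = interleaved_schedule(
    preds, op_type, {"mul": 1, "add": 1}, mobility, func_step)
print(sched, sorted(extra))
```

In this small example, with a single multiplier and a single adder reserved for checking, the checking schedule completes in three Csteps, at the price of two additional checkpoints on the values borrowed from the functional part (`m2` and `m3`). A scheduler that preserves precedence would need four Csteps under the same resource limits.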
Design Space Exploration
Both approaches, Area Constrained Scheduling and Data Interleaving, are criteria for designing totally self-checking (TSC) devices that allow the adoption of simple metrics for the analysis of both cost and performance. These metrics allow the identification of a possible solution and the comparison of alternative solutions different from the trivial ones (minimal cost and no checking latency). System performance is evaluated in terms of its most significant element, checking latency. Cost evaluation requires a more complex analysis; a first rough estimation can be based on the following assumptions.
• The cost of the control unit does not depend on the checking unit implementation (the evaluation in [6] shows that doubling the number of micro-states produces a cost increase lower than 10%).
• The estimated number of registers, NumReg, is equal to the maximal width of the DFG. Register cost depends on the size of the data to be stored (DimReg); a bit of information has an equivalent cost of 6 gates. The resulting cost for all registers is 6 × NumReg × DimReg equivalent gates.
• Functional units have an impact on cost proportional to their number and type, providing a final cost of
• The number of concurrent checkpoints determines the most significant part of the cost associated with the checking architecture. A TRC checker has a cost of 6 equivalent gates, a Controllable-TRC of 8. The global cost of the TRC checker trees and Controllable-TRC checker trees is given by
• The estimated number of multiplexers and their size constitute the hardest parameter to be evaluated, since it depends on optimizations. We adopt the average estimation proposed in [6] . The multiplexers area (from registers to FUs, and from FUs to registers) can be evaluated as
where w_i is the number of writing operations on operators of type i, n_i is the total number of operators of type i, and |FU| is the number of allocated functional units.
The reported results support the validity of the proposed approach for achieving a self-checking device. Furthermore, two aspects are worth mentioning: the final architecture is self-checking with respect to single and multiple faults, provided they are observable in different ways. This is a significant issue, since it allows the adoption of a more general functional stuck-at fault model.
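As a rough illustration of the per-element costs stated in the assumptions of this section (6 equivalent gates per register bit, 6 per TRC checker, 8 per Controllable-TRC checker), the two simplest contributions can be computed as below. The function names are hypothetical, and the numbers of checker cells are taken as inputs rather than derived from the checker tree structure. The functional unit and multiplexer terms are not reproduced here.

```python
GATES_PER_REGISTER_BIT = 6   # equivalent gates per stored bit
GATES_PER_TRC = 6            # equivalent gates per TRC checker
GATES_PER_CTRC = 8           # equivalent gates per Controllable-TRC checker


def register_cost(num_reg: int, dim_reg: int) -> int:
    """Equivalent-gate cost of num_reg registers of dim_reg bits each."""
    return GATES_PER_REGISTER_BIT * num_reg * dim_reg


def checker_cost(num_trc: int, num_ctrc: int) -> int:
    """Equivalent-gate cost of the given numbers of checker cells."""
    return GATES_PER_TRC * num_trc + GATES_PER_CTRC * num_ctrc


# Hypothetical figures: 4 registers of 16 bits, 3 TRC and 2 Controllable-TRC cells.
print(register_cost(4, 16), checker_cost(3, 2))
```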
Concluding Remarks
The paper presents a proposal for designing TSC devices by working at a high level of abstraction. The goal is to limit area overhead, possibly at the cost of increased checking latency, and to provide an approach based on standard design libraries that does not require the ad hoc design of each component of the device.
The approach aims at building a checking architecture alongside the functional one (the original device). The checking architecture partially reuses the existing resources, partially duplicates them, and monitors some points of the data computation flow to detect the occurrence of faults.
Experimental results have shown a final area cost at least 25% lower than that of the duplication scheme, without increasing checking latency. The TSC design methodologies are most effective for data-path-intensive designs, where the control unit does not dominate area costs.
