Abstract-Two algorithms that combine the operations of scheduling and recovery-point insertion for high-level synthesis of recoverable microarchitectures are presented. The first uses a prioritized cost function in which functional unit (FU) cost is minimized first and register cost second. The second algorithm minimizes a weighted sum of FU and register costs. Both algorithms are optimal according to their respective cost functions and require less than 10 min of central processing unit (CPU) time on widely used high-level synthesis benchmarks. The best previous result reported several hours of CPU time for some of the same benchmarks on a computer of similar computational power.
I. INTRODUCTION

IN THIS paper, we address the high-level synthesis of fault-tolerant digital systems. Fault tolerance is required in many applications, including: 1) those where repair is difficult or impossible, e.g., systems used in space, underground, or underwater; 2) safety-critical applications such as fly-by-wire aircraft and nuclear reactor control systems; and 3) applications where down time is extremely costly, e.g., in stock markets and banking systems. As virtually every aspect of society becomes reliant on digital systems, more and more applications fall into this third category.
Studies have shown that transient and intermittent faults are the dominant types of hardware faults in many environments. These faults become even more prevalent as the density of devices on a chip continues to increase. A common technique for recovering from transient and intermittent faults is rollback recovery. A large volume of work on rollback recovery exists, covering almost every type of system and method of implementation [1], [4], [6], [8], [16]. Many applications, e.g., those with real-time constraints, require low overhead and rapid recovery from faults. These applications cannot afford the high cost and slow recovery associated with software-based checkpointing and recovery techniques. For these applications, micro-rollback techniques, which roll back only a few instructions upon occurrence of a fault [15], are an ideal alternative. In this paper, we consider a micro-rollback technique that has been proposed for high-level synthesis [2], [3], [7], [13], [14]. In this technique, fault-free states are maintained in a computation (specified by a control and data flowgraph (CDFG), scheduled in [2], [3], and [14] and unscheduled in [7]) by artificially extending the lifetimes of certain variables and using a hardened register file. The fault-free states provide points to which execution can be rolled back upon occurrence of a fault and, hence, they are referred to as recovery points. A fundamental problem to be solved, called recovery-point insertion, is to select a "good" set of recovery points for a given CDFG. The work of [14] attempts to insert recovery points to minimize execution time with given hardware resources, while the approach of [7] and [13] attempts to minimize hardware overhead with a given execution time. Our previous work [2], [3] provides provably optimal solutions to both of these versions of the problem for a given schedule.
Although our previous algorithms [2] , [3] are optimal for a fixed schedule, the choice of schedule can have a significant impact on overhead. Hence, in this paper, we present combined algorithms that do scheduling and recovery-point insertion while minimizing hardware overhead. We present two algorithms, both of which use branch-and-bound search techniques, but with different cost functions. The first algorithm employs a prioritized cost function in which functional unit (FU) cost is minimized first and then, subject to the minimum FU constraint, register cost is minimized. The second algorithm employs a combined cost function that is a weighted sum of FU and register costs. The algorithms make use of a novel technique to obtain the lower bound of the number of registers needed by any feasible schedule containing a suitable set of recovery points.
Our algorithms guarantee solutions that minimize their respective cost functions, and they have been tested on a number of well-known high-level synthesis benchmarks. The experimental results verify that both algorithms produce minimum-cost solutions and use, at most, several minutes of central processing unit (CPU) time on these benchmarks. This contrasts with the only known previous approach, which required several hours of CPU time for some of the benchmarks on a computer of similar computational power [13].
II. BACKGROUND
A. Motivation
In this paper, we consider how to synthesize efficient designs with micro-rollback capability. We assume a computation is specified by a CDFG that contains the operations to be performed and the data and control dependencies between these operations. The first step in implementing a design is scheduling, which assigns each operation of the CDFG to a control step (or a clock cycle) of the execution. This scheduling may be subject to constraints on hardware (the set of FU's), or on execution time (the number of clock cycles). After scheduling, CDFG operations are bound to specific instances of the FU's, multiplexers are added to allow the sharing of resources, and registers are added to store temporary variables, input values, and output results. This results in a register-transfer (RT)-level netlist, which is then translated into a layout through a sequence of design tasks such as logic synthesis and layout synthesis.
Resource-constrained scheduling has been extensively studied by researchers in the very large scale integration/computer-aided design (VLSI/CAD) community and a large number of scheduling techniques have been proposed. These techniques are surveyed in [5] and [9]. Resource-constrained scheduling is well known to be NP-hard even when certain simplifying assumptions are made. Nevertheless, it is also well recognized that scheduling has a significant impact on the quality of the final design since it determines the configuration, resource requirements (such as FU's, registers, buses, and multiplexers), and execution time of the design. This is true for designs with micro-rollback capability as well.
Standard scheduling techniques do not take into consideration micro-rollback and its impact on the final resource count and design performance. One possible approach is to do scheduling first and then add micro-rollback capability to the resulting RT-level design. Such an approach has been presented in [2] and [3] , and has resulted in good high-level designs. In many (but not all) cases, the designs produced in [2] were optimal in the sense of minimum hardware overhead or minimum number of cycles. However, in some cases, it may be possible to produce improved designs if the micro-rollback process is considered from the beginning, i.e., in conjunction with scheduling. Accordingly, a scheduling technique should take into account the impact of a particular scheduling decision on micro-rollback, and should attempt to optimize the overall design cost and/or performance including any additional cost/delay introduced by the rollback mechanism. This paper thoroughly investigates this enhanced scheduling approach.
B. Problem Definition
A recovery point is defined as the state of a computation at a particular instant of time. A computation state consists of the values of all live variables. Recovery points are maintained during system execution. When a fault is detected, a previous recovery point is loaded into the machine and execution is restarted from that state. If the recovery point is uncorrupted and the fault is no longer in the system, the new execution will proceed correctly. Since transient faults have nonzero duration, the recovery procedure may need to be attempted several times before it is successful.
As in previous work on this problem [2] , [3] , [7] , [13] , [14] , we assume that the register file always performs correctly. 1 Since the register file does not fail, it is not necessary to store recovery points in a separate stable storage. It is sufficient to ensure that the values of all variables that are live at one recovery point are maintained in the register file until the next recovery point. Then, if rollback recovery is necessary, the values comprising the previous recovery point are available to restart the computation. Hence, the process of inserting recovery points in a scheduled CDFG modifies its live variable graph by extending the lifetimes of certain variables.
For example, consider the scheduled CDFG of Fig. 1(a) , which represents a second-order differential-equation solver. If a recovery point is placed at the end of control step 2 and the next recovery point is placed at the end of control step 4, then variables and must have their lifetimes extended through control step 4 so that they will be maintained in the register file in case of rollback recovery. We refer to any registers that are used for storing lifetime-extended variables as recovery registers. In most systems, each input or output value of the CDFG is stored in a dedicated register. These dedicated registers are a fixed cost and, hence, they are not considered in overhead calculations in the remainder of this paper. We, therefore, are concerned only with the temporary registers used to store intermediate results of the computation. 2 These intermediate results are indicated with bold lines in Fig. 1 .
A recovery-point set is a subset of , where is the number of control steps in the CDFG. The elements of the recovery-point set represent the locations of all recovery points inserted in a CDFG. Since recovery points are assumed to be uncorrupted, fault detection must be performed prior to a recovery point. Selection of a suitable fault-detection technique is application-dependent and, thus, we do not assume a particular method for fault detection in this paper. There must be a recovery point prior to the start of execution because if a fault occurs before the first point at which fault detection is done, then the computation must roll back to the beginning. There must also be a recovery point at the final control step in order to verify the outputs of the CDFG. Fig. 1 will be used to illustrate the basic operation of the recovery process. If recovery points are inserted at the beginning of the CDFG and at the ends of control steps 2 and 4, then {0, 2, 4} is the recovery-point set. This situation is illustrated in both of the scheduled CDFG's in Fig. 1. As indicated in the figure, execution proceeds until the end of control step 2. At that point, fault detection is done on the intermediate values of the CDFG. If a fault is detected, the computation rolls back by restarting from the beginning of the CDFG with the same input values. If no fault is detected, execution proceeds to control step 3. A similar operation is performed at the end of control step 4. If a fault is detected in this case, execution restarts from control step 3. If no fault is detected in this case, execution is completed.
Since the number of registers is dependent on the lifetime table, scheduling has a direct influence on recovery-point overhead. To illustrate this more clearly, consider the two different schedules of the same CDFG shown in Fig. 1 . The schedule in Fig. 1 (a) results in a lifetime table, which requires three registers before recovery-point insertion and four registers afterwards. On the other hand, the schedule in Fig. 1 (b) results in a lifetime table that requires three registers before recovery-point insertion and five registers afterwards. While the schedule in Fig. 1 (a) requires fewer registers than the one in Fig. 1(b) , it uses an additional multiplier (three multipliers instead of two). Hence, to optimize the total hardware cost, scheduling and recovery-point insertion must be performed simultaneously.
Use of recovery points also increases execution time; each recovery point introduces a delay in the execution due to fault detection. The delay introduced by a set of recovery points is referred to as the recovery-point delay. In this paper, we assume that fault detection takes unit time and that the maximum recovery-point delay is given as a constraint, denoted by . In addition to the maximum recovery-point delay, there is another constraint that must be satisfied. This constraint is referred to as the stride, denoted by , which is the maximum number of control steps between recovery points. A recovery-point set satisfies the stride constraint if , for . As detailed earlier, there must be recovery points at the beginning of the CDFG and at the final control step. A recovery-point set that contains these values (0 and ) and satisfies the stride constraint is referred to as a valid recovery-point set. If we assume in Fig. 1(b) , then the only valid recovery-point sets are , , , , and . The problem we are concerned with in this paper is to find a schedule and a recovery-point set that minimize hardware resources subject to the maximum recovery-point delay and stride constraints. We consider two versions of this problem. In one version, the goal is first to minimize FU cost and then, subject to the minimum FU constraint, to minimize register cost. In the second version of the problem, the goal is to minimize a weighted sum of FU and register costs.
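The validity conditions above can be made concrete with a short sketch. The function names are hypothetical, and two points are assumptions rather than statements from the text: the stride is measured as the difference between consecutive recovery-point locations, and the recovery-point delay is counted as one unit per recovery point after location 0 (which matches the path-length characterization given in Section III). The example call uses a four-step schedule with a stride of two, one setting that yields exactly five valid sets.

```python
from itertools import combinations

def is_valid(rp_set, n_steps, stride):
    """True if rp_set contains 0 and n_steps and no gap exceeds the stride."""
    pts = sorted(rp_set)
    if not pts or pts[0] != 0 or pts[-1] != n_steps:
        return False
    return all(b - a <= stride for a, b in zip(pts, pts[1:]))

def recovery_point_delay(rp_set):
    """Assumed delay model: one unit of fault detection per point after 0."""
    return len(set(rp_set)) - 1

def valid_sets(n_steps, stride, max_delay=None):
    """Enumerate valid recovery-point sets, optionally bounded in delay."""
    interior = range(1, n_steps)
    found = []
    for size in range(n_steps):
        for chosen in combinations(interior, size):
            candidate = (0, *chosen, n_steps)
            if not is_valid(candidate, n_steps, stride):
                continue
            if max_delay is None or recovery_point_delay(candidate) <= max_delay:
                found.append(candidate)
    return found

# Assumed setting: four control steps, stride of two.
for rp in valid_sets(4, 2):
    print(rp, "delay =", recovery_point_delay(rp))
```

Exhaustive enumeration is exponential in the number of control steps and is shown only to illustrate the definitions; it is not how the problem is solved in this paper.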
As with previous work in this area, our results do not consider wiring cost or cost of the control unit. It is extremely difficult to account for layout-dependent effects, such as wiring, during the scheduling process. In order to account for these effects, practical high-level synthesis implementations typically carry out an iterative process wherein the steps from scheduling to layout generation are repeated several times. Layout characteristics provide feedback to the scheduler in this iterative process so that the design can be truly optimized. Hence, the scheduling techniques described in this paper should not be thought of as stand-alone methods, but rather as part of this overall iterative process. We assume that optimization of a control unit for a design with micro-rollback is performed separately. For a given schedule and recovery-point set, controller design and optimization follow standard well-known principles. While controller cost may vary somewhat for different schedules and recovery-point sets, this effect is expected to be small relative to the overall design cost.
III. OPTIMAL RECOVERY POINT INSERTION FOR A FIXED SCHEDULE
Given a scheduled CDFG, the minimal hardware problem is to find a valid recovery-point set that minimizes the number of registers used subject to a constraint on the maximum recovery-point delay that can be tolerated [2] , [3] . In [2] , we presented an algorithm, referred to as "Algorithm MHP-OPT," which solves the minimal hardware problem optimally. Since this algorithm is used by our scheduling algorithms described in Section IV, we give a brief overview of it here. The interested reader is referred to [2] for a more complete description.
Several data structures are used by Algorithm MHP-OPT. The most important to our discussion is the initial recovery-point overhead graph (IRPOG). The vertex set of the IRPOG consists of a vertex corresponding to each control step of the CDFG and a vertex labeled "0" that corresponds to the recovery point at the beginning of the CDFG. The IRPOG is a directed graph that contains an arc from one vertex to another vertex if it is possible that control steps and are the locations of consecutive recovery points. In other words, there are arcs from each vertex to the vertices following it, where denotes the stride. Associated with each vertex and each arc of the graph is a weight. The weight of a vertex , denoted by , is equal to the number of nonrecovery registers used at the end of control step . 3 The weight of an arc , denoted by , represents the number of recovery registers that are needed at control step if the last recovery point was placed at control step and the next recovery point is placed at control step or later. An IRPOG can be constructed from a scheduled CDFG in a straightforward manner.
Given an IRPOG, we define a new graph, referred to as the final recovery-point overhead graph, or just recovery-point overhead graph (RPOG). The vertex set and arc set of the RPOG are identical to the IRPOG. However, the RPOG contains arc weights only. Given an IRPOG with vertex weight function and arc weight function , the weight of arc in the associated RPOG, denoted by , is defined by An RPOG is a simple graph and, hence, a path can be specified by a sequence of vertices. The cost of a path in an RPOG is defined as . The length of a path is equal to the number of arcs it contains, i.e., the length of path is . Let denote the number of control steps in a given scheduled CDFG. A valid recovery-point set , defines a path in the associated RPOG from vertex 0 to vertex . Such a path is called a recovery-point path. It is shown in [2] that: 1) the length of a recovery-point path is equal to the delay, in control steps, incurred by the corresponding recovery-point set and 2) the cost of a recovery-point path is equal to the register overhead of the corresponding recovery-point set. From these two facts, it can be proven that the minimal hardware problem is equivalent to finding, in an RPOG, a lowest cost recovery-point path of length no greater than , where is the maximum recovery-point delay.
As an example, consider the CDFG of Fig. 1(b). Assuming , the initial and final RPOG's are shown in Fig. 2. If the recovery-point set is {0, 2, 4}, then the corresponding recovery-point path in the RPOG is 0, 2, 4. The number of registers used by the recovery-point set is equal to the cost of the corresponding path, which is the maximum of and . Hence, the number of registers used is five. Note that this agrees with the register count derived from the lifetime table of Fig. 1(b).
The RPOG and IRPOG can be constructed in time. Furthermore, once the graphs are constructed, we can find an optimal recovery-point set from the RPOG in polynomial time. We use a dynamic programming algorithm to solve this constrained shortest path problem. The result is Algorithm MHP-OPT, which solves the minimal hardware problem in time. Details of the algorithm and evaluation of its time complexity are given in [2] .
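The constrained shortest-path formulation can be illustrated with a small dynamic program. This is a sketch under stated assumptions and not the published Algorithm MHP-OPT, whose details are given in [2]: the function name, the table layout, and the arc weights in the example are all hypothetical. The sketch takes RPOG arc weights as given, treats the cost of a path as the maximum arc weight along it (as in the example of Section III), and bounds the number of arcs by the maximum recovery-point delay.

```python
def min_register_path(n_steps, stride, max_delay, weight):
    """Lowest-cost path from vertex 0 to vertex n_steps using at most
    max_delay arcs, where path cost is the maximum arc weight on the path.
    Returns (cost, path) or None if no such path exists."""
    inf = float("inf")
    # best[k][j]: lowest cost of a path from 0 to j with exactly k arcs
    best = [[inf] * (n_steps + 1) for _ in range(max_delay + 1)]
    prev = [[None] * (n_steps + 1) for _ in range(max_delay + 1)]
    best[0][0] = 0
    for k in range(1, max_delay + 1):
        for j in range(1, n_steps + 1):
            # Arcs exist only from the `stride` vertices preceding j.
            for i in range(max(0, j - stride), j):
                if best[k - 1][i] == inf:
                    continue
                cand = max(best[k - 1][i], weight[(i, j)])
                if cand < best[k][j]:
                    best[k][j], prev[k][j] = cand, i
    k_best = min(range(max_delay + 1), key=lambda k: best[k][n_steps])
    if best[k_best][n_steps] == inf:
        return None
    path, j = [n_steps], n_steps
    for k in range(k_best, 0, -1):
        j = prev[k][j]
        path.append(j)
    return best[k_best][n_steps], path[::-1]

# Hypothetical arc weights for a four-step schedule with stride two.
W = {(0, 1): 2, (0, 2): 4, (1, 2): 3, (1, 3): 5,
     (2, 3): 3, (2, 4): 5, (3, 4): 3}
print(min_register_path(4, 2, 2, W))
print(min_register_path(4, 2, 3, W))
```

With these made-up weights, allowing one more unit of delay lowers the register cost, which is the tradeoff the delay constraint controls.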
IV. OUR APPROACH
Our goal in this paper is to find, for a given CDFG, a schedule and a recovery-point set that are optimal in terms of FU and register cost. We first prove that the associated decision problem is NP-complete. Following that, we describe our optimal approach to solving the problem that is efficient for a standard set of high-level synthesis benchmarks. The overall flow of our approach is shown in Fig. 3 . To achieve our goal, we first derive lower bounds on FU cost and register cost. Using these lower bounds, we then find a minimum-cost schedule and extract the optimal recovery-point set from it. We describe two possible solution techniques: scheduling with prioritized cost function and scheduling with combined cost function.
A. Problem Complexity
Although the scheduling problem we address here is an optimization problem, i.e., we are interested in minimizing the cost of the solution found, it is well known that this is at least as hard as the associated decision problem. The associated decision problem of interest is referred to as the precedence constrained scheduling with recovery-point insertion (PCSRPI) and is stated below. In this statement, represents the set of operations to be scheduled and represents the precedence constraints imposed on the operations due to data dependencies. In general, there can be FU types and we must not exceed the resource constraints, given by and on the number of FU's of each type, and on the number of registers. We also have the recovery constraints and defined earlier. Finally, we are given the deadline , i.e., the maximum number of control steps that our final schedule can contain. The question of interest is then, "Do there exist a schedule and valid recovery-point set that satisfy all of the given constraints?" as defined above.
PCSRPI
Instance: Set of operations, each having delay , partial order on the operations, integer number of FU types , integer FU and register upper bounds and , integer recovery-point stride , integer recovery-point delay upper bound , and integer deadline .
Question: Is there a schedule for that meets the deadline , obeys the precedence constraints specified by , satisfies the FU upper bounds , and a valid recovery-point set that satisfies the stride constraint, has recovery-point delay no greater than , and, together with the schedule, satisfies the register upper bound ?
Theorem 1: PCSRPI is NP-complete. Proof: PCSRPI is in NP because if we are given a schedule and a recovery-point set, we can check in polynomial time whether they form a solution. This can be done in the following four steps.
1) Checking that the precedence constraints are satisfied: this can be accomplished in polynomial time by checking, for each operation, that the time step of the operation is greater than the time steps of all of its predecessors.
2) Checking that the FU upper bounds are satisfied: this can be done in polynomial time by finding, for each type of operation, the number of such operations scheduled at each time step and taking the maximum over all time steps.
3) Checking that the recovery-point set is valid and satisfies the stride and recovery-point delay constraints: this can be done in polynomial time by verifying that 0 and are in the set, where is the schedule length; checking that the maximum separation between recovery points is no greater than ; and ensuring that the size of the set is no greater than .
4) Checking that the register upper bound is satisfied: this can be accomplished in polynomial time by building the RPOG for the given schedule and recovery-point set and verifying that the cost of the path associated with the recovery-point set is no greater than .
We now restrict PCSRPI to "precedence constrained scheduling," which is known to be NP-complete [17]. This is done by setting , , and . With , there is only one FU type, e.g., the processors of the precedence constrained scheduling problem. With , each operation can have a private register for its output, meaning that any schedule and recovery-point set will satisfy the register upper bound. Finally, with , any schedule that satisfies the deadline has as a valid recovery-point set that satisfies the stride constraint, where is the length of the schedule. The key aspects of this restricted problem are: 1) there is only one FU type and 2) recovery-point insertion and register usage are not issues. Therefore, the simplified question becomes "Is there a schedule for that meets the deadline , obeys the precedence constraints specified by , and satisfies the FU upper bound ?" This is equivalent to precedence constrained scheduling.
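The first two verification steps of the membership argument can be sketched as follows. The function names and data layout are hypothetical. The sketch assumes each operation occupies its FU for the whole of its delay (no pipelining) and that a successor may start only after its predecessor finishes; with unit delays this reduces to the "greater than" test described in step 1.

```python
from collections import Counter

def check_precedence(step, preds, delay):
    """Step 1: every operation starts after all of its predecessors finish."""
    return all(step[op] >= step[p] + delay[p]
               for op in preds for p in preds[op])

def check_fu_bounds(step, op_type, delay, fu_bound):
    """Step 2: per FU type, the number of operations active in any control
    step does not exceed the upper bound for that type."""
    usage = Counter()
    for op, start in step.items():
        for t in range(start, start + delay[op]):
            usage[(op_type[op], t)] += 1
    return all(count <= fu_bound[kind] for (kind, _), count in usage.items())

# Hypothetical three-operation instance with unit delays.
preds = {"a": [], "b": [], "c": ["a", "b"]}
delay = {"a": 1, "b": 1, "c": 1}
op_type = {"a": "mul", "b": "mul", "c": "alu"}
step = {"a": 1, "b": 1, "c": 2}
print(check_precedence(step, preds, delay),
      check_fu_bounds(step, op_type, delay, {"mul": 2, "alu": 1}))
```

Both checks visit each operation, dependency, and occupied control step once, so their running time is polynomial in the size of the instance.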
B. Lower Bound Derivation
Given that the PCSRPI problem is intractable, we propose optimal solution techniques which are based on a branch-and-bound approach. Although such techniques usually have an exponential worst-case time complexity, they can be made runtime-efficient for most problem instances if proper bounding is accomplished. To that end, we briefly describe how to derive tight bounds on the cost of a solution for a given unscheduled CDFG. Later, we show how to use the bounds in our solution technique.
1) FU Lower Bounds: We consider every control-step interval , where is the number of control steps, to get a global lower bound for FU type . For each interval , we find the number of operations of type guaranteed to be scheduled during that interval, and apply the pigeonhole principle to derive a lower bound on the number of FU's of type for that interval. Note that an operation is guaranteed to be scheduled in range only if its earliest and latest available control steps are both included in . The earliest and the latest available control steps of an operation can be obtained by the as-soon-as-possible (ASAP) and as-late-as-possible (ALAP) scheduling algorithms, respectively. The maximum of these bounds over all intervals yields a global lower bound on the cost of FU type .
For example, the four multiplications , , , and in Fig. 1 are guaranteed to be executed in the control step interval regardless of scheduling when since their time frames are fully included in this interval. Thus, is a lower bound on the number of multipliers for this interval. To get a tighter lower bound, calculation is repeated over all intervals and the maximum lower bound, which is two for this example, is selected. In a similar way, a lower bound on arithmetic logic unit (ALU) count is derived.
This basic idea can also be generalized to support multicycling, chaining, and functional pipelining of operations. For more details on FU lower bounds, the reader is referred to [10] and [12] .
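The interval-based bound can be sketched for the simplest case of unit-delay operations without chaining or pipelining. The function names are hypothetical, and the dependency graph below is an invented example loosely patterned on a differential-equation solver; it is not claimed to reproduce the CDFG of Fig. 1.

```python
from math import ceil

def asap_alap(preds, n_steps):
    """ASAP and ALAP control steps for unit-delay operations."""
    succs = {op: [] for op in preds}
    for op, ps in preds.items():
        for p in ps:
            succs[p].append(op)
    asap, alap = {}, {}

    def early(op):
        if op not in asap:
            asap[op] = 1 + max((early(p) for p in preds[op]), default=0)
        return asap[op]

    def late(op):
        if op not in alap:
            alap[op] = min((late(s) for s in succs[op]),
                           default=n_steps + 1) - 1
        return alap[op]

    for op in preds:
        early(op)
        late(op)
    return asap, alap

def fu_lower_bound(preds, op_type, n_steps, fu_type):
    """Pigeonhole bound: max over intervals of
    ceil(operations forced into the interval / interval length)."""
    asap, alap = asap_alap(preds, n_steps)
    best = 0
    for a in range(1, n_steps + 1):
        for b in range(a, n_steps + 1):
            forced = sum(1 for op in preds
                         if op_type[op] == fu_type
                         and a <= asap[op] and alap[op] <= b)
            best = max(best, ceil(forced / (b - a + 1)))
    return best

# Hypothetical CDFG: m* are multiplications, all other names are ALU operations.
PREDS = {"m1": [], "m2": [], "m3": ["m1", "m2"], "m4": [], "m5": ["m4"],
         "m6": [], "s1": ["m3"], "s2": ["s1", "m5"], "a1": ["m6"],
         "x1": [], "c1": ["x1"]}
OP_TYPE = {op: ("mul" if op.startswith("m") else "alu") for op in PREDS}
print(fu_lower_bound(PREDS, OP_TYPE, 4, "mul"),
      fu_lower_bound(PREDS, OP_TYPE, 4, "alu"))
```

In this invented graph, four multiplications have time frames inside the first two control steps, so that interval alone forces at least two multipliers.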
2) Register Lower Bounds: In Section III, we sketched how an optimal recovery-point set, which requires the minimum register cost, can be obtained in polynomial time from the IRPOG constructed for a given schedule. To find register lower bounds, we first construct a "lower bounded IRPOG" in which the weight of each arc (vertex) is a lower bound on the weight of the same arc (vertex) of any IRPOG that can be constructed for any feasible schedule. This involves calculating lower bounds on both recovery registers and nonrecovery registers, as discussed below. Applying Algorithm MHP-OPT to this lower bounded IRPOG provides the desired register lower bounds.
a) Nonrecovery register lower bound: A data value is guaranteed to be active over control step only if the ALAP control step of its source operation is not larger than and the maximum of the ASAP control steps of its destinations is greater than . In this way, for each control step, we find data values that are guaranteed to be active over that control step, regardless of scheduling. We choose the number of such data values as the lower bound on the weight of the vertex corresponding to that control step. For example, data value in Fig. 1 is the only value guaranteed to be active over control step 2 regardless of schedule. The lower bounded weight of vertex 2 is, therefore, calculated to be one.
b) Recovery register lower bound: Once the weight of each vertex is determined, we calculate a lower bound on the weight of each arc of the IRPOG. All data values that are active at one recovery point and consumed before the next recovery point must be stored in recovery registers. Thus, for each feasible pair of consecutive recovery-point locations, we find data values that are guaranteed to be active at the first point and guaranteed to be consumed before the second. The number of such data values is a lower bound on the weight of the arc representing the recovery-point pair. For example, data value in Fig. 1 is the only one that is guaranteed to be active over control step 2 and consumed before control step 4, regardless of scheduling. Thus, the lower bounded weight of arc (2, 4) is one. The lower bounded IRPOG obtained in this manner from the CDFG of Fig. 1 is shown in Fig. 4(a).
c) Lower bound on total register count: Once the lower bounded IRPOG is constructed, a lower bound on the total register cost is obtained from the graph using Algorithm MHP-OPT, described in Section III.
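The nonrecovery (vertex) bound follows directly from the ASAP and ALAP time frames, as the sketch below shows. The function name and the example data are hypothetical: each data value is identified by the operation that produces it, values with no consumers are treated as outputs held in dedicated registers and are skipped, and the time frames belong to an invented unit-delay CDFG rather than to Fig. 1. The recovery (arc) bound is not sketched because it depends on the exact lifetime convention used in [2].

```python
def vertex_lower_bounds(n_steps, asap, alap, consumers):
    """For each control step k, count the values guaranteed to be active
    over k: the producer's ALAP step is at most k and the latest ASAP step
    among its consumers is greater than k."""
    bounds = {}
    for k in range(1, n_steps + 1):
        bounds[k] = sum(
            1 for src, dests in consumers.items()
            if dests and alap[src] <= k and max(asap[d] for d in dests) > k
        )
    return bounds

# Hypothetical time frames for a four-step CDFG with unit-delay operations.
ASAP = {"m1": 1, "m2": 1, "m3": 2, "m4": 1, "m5": 2, "m6": 1,
        "s1": 3, "s2": 4, "a1": 2, "x1": 1, "c1": 2}
ALAP = {"m1": 1, "m2": 1, "m3": 2, "m4": 2, "m5": 3, "m6": 3,
        "s1": 3, "s2": 4, "a1": 4, "x1": 3, "c1": 4}
# Producer -> operations that read its result (intermediate values only).
CONSUMERS = {"m1": ["m3"], "m2": ["m3"], "m3": ["s1"], "m4": ["m5"],
             "m5": ["s2"], "m6": ["a1"], "s1": ["s2"], "x1": ["c1"]}
print(vertex_lower_bounds(4, ASAP, ALAP, CONSUMERS))
```

In this example only one value is forced to be live across control step 2, because every other value either may be produced later or may already be consumed by then.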
C. Scheduling with Prioritized Cost Function
Once lower bounds on hardware cost are known, scheduling must be performed. Ideally, we would like to minimize a cost function representing a weighted sum of the resource counts. This, however, is a highly complex problem where the solution space can be quite large. An alternative solution can be found by prioritizing the cost function. If we assume, for example, that FU cost is much higher than register cost, we can find all the schedules that minimize FU cost and then select, out of those schedules, one that minimizes register cost. If these two phases are done separately, the number of schedules selected in the first phase may be very large and may contain many solutions that would turn out to be optimal in the second phase. Clearly, there is no need to keep all these candidates since only one optimal solution is needed. Thus, a more sensible approach is to find a tight lower bound on FU cost and use that bound as a constraint in searching for a schedule that minimizes register cost. This approach is referred to as "scheduling with prioritized cost function." Our scheduling technique employs branch and bound search and is loosely based on the work reported in [11] and [12] .
1) Time-Frame Adjustment: Once lower bounds on FU's of each type are derived, we assume that the numbers of FU's used are equal to the lower bounds. In other words, the lower bounds are set as an FU constraint. We then use the FU constraint to adjust the time frames of operations in the CDFG. If a schedule satisfies this FU constraint, then each operation must be assigned to some control step in its adjusted time frames [10] , [12] . This helps reduce the number of solutions to consider and also makes the register lower bounds tighter, thus improving the execution time. In this case, our register lower bounds are conditional on the FU constraint, thus, we refer to them as conditional lower bounds. In contrast, the lower bounds derived from the initial time frames are called absolute lower bounds.
For example, in Fig. 1 , multiplication can be scheduled between control steps 1 and 2. However, if we assume only two multipliers are available, it can be scheduled only in control step 2 since other multiplications and are already fixed to control step 1. Consequently, it also adjusts the possible lifetime of its output . Once time frames are adjusted, we can derive tighter lower bounds on register count, although they are subject to the FU constraint. Fig. 4(b) shows the lower bounded IRPOG for the CDFG of Fig. 1 using adjusted time frames generated by assuming that two multipliers and two ALU's are used. Note that this IRPOG is identical to the one shown in Fig. 2 . Hence, the scheduled CDFG of Fig. 1(b) and its optimal recovery-point set will be optimal over all schedules that use two multipliers and two ALU's.
2) Search Space: Any feasible scheduling result can be derived from the ASAP schedule by simply rescheduling some operations into later control steps. In our approach, we start with the ASAP schedule as the initial solution, and postpone each operation by one control step at a time, generating a search tree where each node represents a feasible schedule and each branch means the postponement of an operation by a control step. In our approach, the postponement of an operation may incur the successive delays of other successor operations depending on it in order to preserve their precedence relations. Note that this search space is guaranteed to include all the feasible solutions. We will describe how to further reduce the search efforts later in the paper.
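The claim that postponement from the ASAP schedule reaches every feasible schedule can be checked on a small instance. The sketch below is an illustration with hypothetical function names, assuming unit-delay operations and a fixed number of control steps; it ignores resource constraints and simply enumerates the schedules reachable by repeatedly postponing one operation by one step and pushing its successors as needed.

```python
def postpone(schedule, op, succs, n_steps):
    """Move op one step later, pushing dependent successors as needed.
    Returns the new schedule, or None if it would exceed n_steps."""
    new = dict(schedule)
    new[op] += 1
    stack = [op]
    while stack:
        cur = stack.pop()
        if new[cur] > n_steps:
            return None
        for s in succs[cur]:
            if new[s] <= new[cur]:
                new[s] = new[cur] + 1
                stack.append(s)
    return new

def reachable_schedules(asap, succs, n_steps):
    """All schedules reachable from the ASAP schedule by postponement."""
    seen = {tuple(sorted(asap.items()))}
    frontier = [dict(asap)]
    while frontier:
        sched = frontier.pop()
        for op in sched:
            child = postpone(sched, op, succs, n_steps)
            if child is None:
                continue
            key = tuple(sorted(child.items()))
            if key not in seen:
                seen.add(key)
                frontier.append(child)
    return seen

# Hypothetical CDFG: a feeds c, b is independent; three control steps.
succs = {"a": ["c"], "b": [], "c": []}
asap = {"a": 1, "b": 1, "c": 2}
print(len(reachable_schedules(asap, succs, 3)))
```

For this instance there are three legal placements of the pair (a, c) and three of b, and the postponement search reaches all nine combinations. The set of visited schedules is kept here only to avoid revisiting a schedule reached along two different branches.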
3) Lower Bound Derivation in the Search Space: In our traversal of the search space, operations are moved only to later time frames, but never to earlier ones. Therefore, if operation is scheduled into control step in solution , it is guaranteed to be scheduled into some control step greater than or equal to in any solutions that are in the subtree of the search space rooted by . For these solutions, is the earliest control step where operation can be scheduled, while its latest control step (obtained by ALAP scheduling) is unchanged.
For each schedule in our search space, we derive lower bounds on the cost of the schedule and any other solutions produced from that schedule as well. We get these lower bounds from the procedure described in Section IV-B simply by replacing the earliest control step of each operation with the control step where the operation is scheduled in the current schedule. Moreover, the lower bounds derived for the ASAP schedule are lower bounds on the cost of any solution, including an optimal one. It is a very important feature of the proposed approach that the lower bounds on the FU and register costs of a set of schedules represented by a subtree in the search space can be obtained without having to traverse the subtree.
4) Overall Structure of the Algorithm: Fig. 5 shows the pseudocode of our scheduling algorithm, which is guaranteed to find a schedule that is optimal in its use of registers while meeting the FU constraint. In this algorithm, Best_Schedule represents the schedule with the minimum register cost found so far that satisfies the FU constraint, and UB_REG_COST and UB_COST are its register cost and total cost, respectively. FU_COST and REG_COST are the FU cost and register cost of the current schedule, while LB_FU_COST and LB_REG_COST are the lower bounds on the FU cost and register cost of the current schedule and of any schedule derived from it. We compute the actual cost, as well as lower bounds on the FU and register cost, of each solution found during the search starting with the ASAP schedule, and we use these values to reduce the search space.
a) Updating Best_Schedule: In the algorithm, whenever the new schedule meets the FU constraint and requires fewer registers than the current Best_Schedule, we assign it to Best_Schedule and update UB_REG_COST and UB_COST.
b) Pruning the search space: During the search, a set of schedules can be cut off without loss of optimality only if one of the following holds: 1) their FU cost does not satisfy the constraint; 2) their register cost is no lower than that of Best_Schedule; or 3) their total cost is no lower than that of Best_Schedule. These conditions are detected using the lower bounds previously described.
c) Termination condition: If a schedule satisfies the FU constraint and it also has register cost equal to the lower bound on register cost, it means that the schedule is optimal. Therefore, in that case, the algorithm prints out Best_Schedule and its optimal recovery-point set and then halts. The algorithm also halts if there are no more candidates to search.
d) New candidate: If neither pruning condition nor the termination condition is met, we consider other schedules one by one. From the current schedule, we first move an operation and, if needed, its successors into their next available control steps, thereby constructing a new candidate schedule. The procedure is repeated with the new schedule recursively.
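Steps a) through d) can be summarized in a short Python sketch. This is not the pseudocode of Fig. 5: the cost and bound procedures are passed in as callables (all parameter names are hypothetical), the total-cost pruning test is omitted for brevity, and a visited set is added as an assumption so that a schedule reachable by several postponement orders is expanded only once.

```python
def prioritized_search(root, children, fu_cost, reg_cost, lb_fu, lb_reg, fu_limit):
    """Branch-and-bound for minimum register cost under an FU constraint.

    `root` is the ASAP schedule (any hashable value); `children(s)` yields the
    schedules obtained from `s` by one postponement; `lb_fu(s)` and `lb_reg(s)`
    must be valid lower bounds for `s` and every schedule derived from it.
    """
    best, ub_reg = None, float("inf")
    floor_reg = lb_reg(root)  # the bound at the ASAP schedule holds for all schedules
    stack, seen = [root], {root}
    while stack:
        s = stack.pop()
        # a) Update the best schedule found so far.
        if fu_cost(s) <= fu_limit and reg_cost(s) < ub_reg:
            best, ub_reg = s, reg_cost(s)
            # c) Terminate as soon as the global register bound is reached.
            if ub_reg == floor_reg:
                return best, ub_reg
        # b) Prune subtrees that cannot meet the constraint or improve on best.
        if lb_fu(s) > fu_limit or lb_reg(s) >= ub_reg:
            continue
        # d) Generate new candidates by postponing operations.
        for c in children(s):
            if c not in seen:
                seen.add(c)
                stack.append(c)
    return best, ub_reg
```

The function returns `(None, inf)` when no schedule in the search space satisfies the FU constraint.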
D. Scheduling with Combined Cost Function
The final result from the algorithm described in Section IV-C is optimal in the sense that it requires the minimum number of registers subject to the FU constraint. However, relaxing the FU constraint may cause further reduction in register cost and, thus, produce a better result with higher FU cost, but lower total cost. For example, a schedule requiring four adders, two multipliers, and ten registers is better than one requiring three adders, two multipliers, and 11 registers, if the cost of an adder is less than that of a register. Choosing a schedule to minimize the total hardware cost is referred to as "scheduling with combined cost function."
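The comparison in this example amounts to a weighted sum, as the following Python sketch shows. The unit costs used here are hypothetical placeholders chosen only so that an adder is cheaper than a register; they are not the values of Table I.

```python
def total_cost(counts, unit_cost):
    """Weighted sum of component counts, e.g. {'adder': 4, 'register': 10}."""
    return sum(unit_cost[kind] * n for kind, n in counts.items())


# Hypothetical unit costs (NOT the Table I values): adder < register.
UNIT_COST = {"adder": 10, "multiplier": 80, "register": 12}

relaxed = {"adder": 4, "multiplier": 2, "register": 10}
tight = {"adder": 3, "multiplier": 2, "register": 11}

print(total_cost(relaxed, UNIT_COST))  # 40 + 160 + 120 = 320
print(total_cost(tight, UNIT_COST))    # 30 + 160 + 132 = 322
```

Under these assumed costs the schedule with the extra adder is cheaper overall, because the register it saves costs more than the adder it adds.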
In our approach (described in Fig. 6), we start with the lower bound on FU cost as the initial FU constraint and select each subsequent FU constraint by incrementing the number of FU components one at a time. For each FU constraint, the scheduling procedure described in the previous section is applied. This is repeated until it is guaranteed that relaxing the FU constraint further will not produce a solution better than the best solution found thus far. Note that if the sum of the FU cost implied by the FU constraint and the register cost implied by the absolute lower bound on the number of registers is greater than the cost of the best solution found thus far, then any schedule requiring as many FU's as the FU constraint costs more than the best schedule found thus far. Thus, in that case, the procedure is terminated. Fig. 7 shows the pseudocode of our optimal scheduling algorithm. In Fig. 7, "Best_Schedule" now represents an optimal schedule with respect to the combined cost function. Once the optimal schedule is selected, its corresponding optimal recovery-point set can be extracted using the method described in Section III.
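The outer loop and its stopping rule can be sketched in Python as follows. This is a simplification, not the pseudocode of Fig. 7: it assumes a single FU type, treats the prioritized scheduler as a callable `solve(fu_limit)` that returns the numbers of FUs and registers used (or `None` if no schedule exists), and adds a cap `fu_max` so that the loop always ends. All of these names are hypothetical.

```python
def combined_search(solve, fu_lb, reg_lb, fu_unit_cost, reg_unit_cost, fu_max):
    """Relax the FU constraint step by step and keep the cheapest result.

    `fu_lb` and `reg_lb` are lower bounds on the FU and register counts.
    """
    best, best_cost = None, float("inf")
    for fu_limit in range(fu_lb, fu_max + 1):
        # Cheapest conceivable design that uses this many FUs.
        floor = fu_limit * fu_unit_cost + reg_lb * reg_unit_cost
        if floor > best_cost:
            break  # further relaxation cannot beat the best solution so far
        result = solve(fu_limit)
        if result is None:
            continue
        fus, regs = result
        cost = fus * fu_unit_cost + regs * reg_unit_cost
        if cost < best_cost:
            best, best_cost = (fus, regs), cost
    return best, best_cost
```

With hypothetical unit costs of 10 per FU and 12 per register, a register lower bound of 10, and a scheduler that needs 11 registers with three FUs but only ten with four, the loop evaluates constraints three and four, keeps the four-FU design at cost 160, and stops before trying five FUs because 50 + 120 already exceeds 160.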
V. EXPERIMENTAL RESULTS
A. Benchmarks
The benchmarks that we consider are: 1) a fifth-order elliptic filter (EWF); 2) an AR filter; 3) a 16-point filter; and 4) a back solver [7]. In these experiments, we use the VTI 0.8-μm component library, shown in Table I. However, our results should scale down with feature size as long as the relative cost of the three types of units remains roughly the same. One important thing to note is that the costs of an adder and register are comparable in this library.
B. Prioritized Cost-Function Experiments
The first set of experiments is for scheduling with a prioritized cost function. We derive lower bounds on the register count and use them to find a schedule that requires the minimum register cost, while satisfying the given FU constraint. Table II shows the results of the scheduling with a prioritized cost function. To verify our approach, we applied our algorithm with varying stride values ranging from 2 to 6. As shown in Table II, our lower bounds on FU and register count are very accurate. The lower bound on the FU count of each type is equal to the actual count in all cases tested. Furthermore, the register lower bound is equal to or one less than the actual register count in the selected schedule. Using these lower bounds, the scheduling results are optimal in the sense that they satisfy the given FU constraint and also require the minimum number of registers. Note that the execution time is very short, as shown in Table II.
TABLE II PRIORITIZED COST-FUNCTION SCHEDULING RESULTS
TABLE III COMBINED COST-FUNCTION SCHEDULING RESULTS: EWF
C. Combined Cost-Function Experiments
The next set of experiments is for scheduling with a combined cost function. In this case, we consider different FU constraints and then choose the best among the generated schedules. Tables III and IV show the results of our algorithm for this case. For the EWF, we find two cases where relaxing the FU constraint reduces register cost. In this experiment, we first choose a schedule that requires three adders, two multipliers, and 11 registers when the FU constraints are three adders and two multipliers and the stride is five (or six), as shown in Table III. However, we find another schedule with four adders, two multipliers, and ten registers when the FU constraints are four adders and two multipliers. Since the cost of a register is higher than that of an adder in our library, the latter solution is chosen as the optimal one.
In contrast, for the AR and 16-point filters, the schedules obtained for the first FU constraint require as many registers as their lower bounds. Thus, we do not need to relax the FU constraint further for these examples. Since the results are essentially identical to those of Table II for these examples, we do not repeat them here. For the back solver, additional FU constraints beyond the initial ones are considered, but no better solution is found. This is shown in Table IV .
Tables II-IV also show the CPU times used by our algorithm on a Sun SparcStation 10 for the various benchmarks. For all but the EWF with combined cost function, the CPU times are only a few seconds. For the EWF, the maximum CPU time was about 8 min. This is in contrast to the only previous result on this problem, which reported CPU times of several hours on a SparcStation 2 for at least one of these benchmarks [13] .
VI. COMPARISON WITH RELATED WORK
The recovery-point insertion problem has been considered previously in the high-level synthesis context in [2] , [3] , [7] , [13] , and [14] . In [2] , [3] , and [14] , recovery-point insertion is done for a fixed schedule, i.e., the impact of scheduling on the quality of the design was not considered. Karri and Orailoglu were the first to consider a combined approach to scheduling and recovery-point insertion [7] , [13] .
In [7], a heuristic, nonoptimal approach to scheduling and recovery-point insertion is presented. The results of our experiments can be directly compared to those of [7] because their cost function is simply a weighted sum of FU cost and register cost. Since their approach is nonoptimal and ours is optimal, the designs produced by our algorithms are at least as efficient. Over the benchmarks used in both papers, the design costs produced by our algorithms are 10%-30% lower than those in [7]. CPU times were not reported in [7], but it is likely that, due to its heuristic nonoptimal nature, their algorithm is substantially faster than ours.
In [13], the scheduling problem is stated in integer linear programming (ILP) terms. This allows optimal scheduling to be done although, due to the very large number of variables in the problem specification, the ILP solution required several hours of CPU time on a Sun SparcStation 2 for at least one benchmark. The maximum CPU time required by our algorithm was about 8 min on a SparcStation 10. A comparison of design costs is difficult since the cost function used in the experiments of [13] differs significantly from ours. However, since both approaches are optimal, the resulting design costs should be identical, given the same objective function. In [13], voter cost is included, but register cost is ignored. Including register cost in the objective function would add even more variables to the problem specification and would, therefore, increase solution time even more. Our experimental results include register cost, but not fault detector (voter) cost. However, in [2], we explicitly show how detector cost can be included in Algorithm MHP-OPT without affecting the time complexity of the algorithm. This modified MHP-OPT can then be used within the framework described in this paper to include voter cost in the objective function with no significant increase in execution time.
VII. CONCLUSION
In this paper, we presented techniques for combined scheduling and recovery-point insertion in high-level synthesis of digital systems. Our algorithms are optimal and ran efficiently on standard benchmarks. Currently, CDFG's with loops and conditional branches can be decomposed into basic blocks, which can be scheduled using our techniques. Loops can be unrolled either partially or fully. Other extensions, such as chaining and multicycling, have already been accomplished. Future work will consider the scheduling of these more general CDFG's to achieve more efficient resource utilization.
