Experimental results show that the proposed variable-latency-unit-based synthesis techniques achieve a performance improvement of up to 1.6× (average of 1.4×) over a state-of-the-art HLS tool, with minimal area overheads (average of 5.3%). The use of reduced variable-latency units leads to a performance improvement of up to 1.6× (average of 1.3×), with a simultaneous area reduction of up to 17.9% (10.6% on average).
Index Terms: High-level synthesis, high-performance design, variable-latency units.
I. INTRODUCTION
CUSTOM integrated circuits (ICs) like high-end microprocessors often use information derived from data statistics to speed up execution. Branch prediction and control and data speculation are common examples [1]. The success of these techniques has fueled interest in exploring the use of data statistics in ASIC design [2]-[4].
The technique presented in [2] performs power optimization through the use of input data statistics. It optimizes an embedded combinational circuit block by adding significantly simpler circuits (called predictor circuits), which compute the output and disable the original circuit for a subset of input conditions. Precomputation was also used in [3] to enable retiming to obtain clock period reductions over and above those obtainable through conventional retiming. A technique to identify commonly occurring threads of computation and implement them through specialized circuitry was proposed in [4]. This work was based on the observation that much of the computation time is spent in short segments of code, which can be aggressively optimized if their occurrence is detected in advance.
S. Ravi is with the Department of Electrical Engineering, Princeton University, Princeton, NJ 08544 USA.
G. Lakshminarayana is with the C&C Research Labs, NEC USA, Princeton, NJ 08540 USA.
Publisher Item Identifier S 0278-0070(00)09140-5.
Performance optimization techniques used in high-level synthesis (HLS) use chaining, multicycling, and pipelining of components to obtain performance improvements. However, they typically assume that the latency of each component is fixed [5]-[7]. Scheduling techniques such as relative scheduling [6] do consider the possibility that a behavior can consist of parts (e.g., a while loop) that have unbounded delay. However, in all cases, the atomic components used for constructing the register transfer level (RTL) circuits are assumed to have a fixed latency.
Telescopic circuits [8] are based on the observation that the time taken by a circuit to complete execution depends upon the past and present values of the circuit inputs. For example, if the inputs to a 16-bit ripple-carry adder change from to , the delay of the adder is equal to the delay of a 1-bit full adder. If the adder were implemented using 0.35-µm technology, the delay of the full adder would be 1.5 ns, as compared to 20 ns, the worst-case delay of the 16-bit adder. A telescopic circuit can be constructed from the original circuit by adding a signal for detecting completion. The output of the circuit is then available as soon as it is ready, as opposed to being constrained by the worst-case delay of the circuit. Unlike the original circuit, which has a fixed latency, the output of a telescopic unit can arrive in a range of times, depending upon its inputs. The term variable-latency units is used here to describe such circuits. Note that, unlike asynchronous and self-timed circuits, variable-latency units are synchronous.
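To make the data dependence of adder delay concrete, the following minimal sketch counts the longest carry chain that a pair of operands excites in a ripple-carry adder. It is a static simplification: it looks only at the present operands, whereas a telescopic circuit also depends on the previous input values. The function name `longest_carry_chain` is hypothetical and does not come from [8].

```python
def longest_carry_chain(a: int, b: int, width: int = 16) -> int:
    """Return the longest run of adder stages that one carry ripples through."""
    longest = 0
    run = 0  # length of the carry chain currently being extended
    for i in range(width):
        x = (a >> i) & 1
        y = (b >> i) & 1
        if x & y:               # stage generates a carry: a new chain starts
            run = 1
        elif (x ^ y) and run:   # stage propagates an incoming carry
            run += 1
        else:                   # carry is absorbed, or there is none
            run = 0
        longest = max(longest, run)
    return longest


# Worst case: a carry generated in bit 0 ripples through all 16 stages.
worst = longest_carry_chain(0xFFFF, 0x0001)
# Benign case: no stage generates a carry, so nothing ripples.
benign = longest_carry_chain(0x00FF, 0x0100)
```

Under this simplified view, operand pairs with short carry chains are the ones for which a completion signal could be raised early.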
Clearly, if data statistics favor fast execution, variable-latency units are significantly faster than their fixed-latency counterparts. However, in order to fully realize their potential, they should be efficiently integrated into a synthesis environment/design methodology. In this paper, we present techniques to effectively exploit the power of variable-latency units to synthesize compact, fast circuits. The techniques we propose are applicable to register-transfer level designs, with a clearly specified controller-data path partition, or to designs generated by a generic HLS tool.
In this paper, we discuss some issues involved in integrating the use of variable-latency units into a synthesis environment. We demonstrate that naive replacement of fixed-latency units with variable-latency units does not fully exploit the power of these units, and can sometimes even result in larger, lower-performance designs. Therefore, we need a comprehensive methodology to accurately evaluate the performance and area impact of variable-latency units, and integrate them seamlessly into the flow of a generic HLS system. Variable-latency units impact several aspects of HLS, including module selection, scheduling, and resource sharing. Thus, the process of selecting computations for variable-latency implementation needs to take these effects into account. We present the modifications that need to be made to a generic HLS flow for it to utilize variable-latency components. Further, we present algorithms for judicious selection of computations for variable-latency implementation. Our technique accepts as inputs a scheduled behavioral description and resource allocation constraints (constraints on the numbers and types of functional units, including variable-latency components, available for synthesis). We output a modified schedule, where some operations are mapped to variable-latency units. Our algorithm uses a novel procedure to select the "best" operations in the behavior to be performed by variable-latency functional units. The procedure takes into account both the topology of the behavior, e.g., the existence of critical paths passing through the operation under consideration, and its data statistics, which determine the fraction of time an operation can execute in the variable-latency mode. At the end of the selection process, the design is rescheduled to reflect the changes introduced by the inclusion of the variable-latency functional units.
Variable-latency functional units are typically larger than their fixed-latency equivalents. Hence, they need to be used judiciously to keep circuit sizes small. In our work, we introduce techniques that reduce the overhead associated with using variable-latency units. This is done by considering variable-latency units as a composition of three parts: a high-latency part, a low-latency part (which returns the correct result for a subset of the input space), and a detection mechanism (called the low-latency condition) that detects the input subset for which the low-latency part computes correctly. If the low-latency part computes correctly frequently enough, we modify the variable-latency unit by removing circuitry specific to the high-latency part. When a high-latency case is encountered, it is performed using one of the remaining (fixed-latency) resources. This transformation reduces the area of the variable-latency unit significantly, and if performed carefully, does not significantly degrade the overall circuit performance. We describe a procedure to introduce this transformation into the scope of an HLS system.
Our performance-enhancement and area-reduction methodologies push the area-delay tradeoff curve for conventional HLS, and result in solutions that are sometimes more compact, as well as faster than the original designs.
The rest of this paper is organized as follows. Section II covers some background material and introduces some issues related to the use of variable-latency functional units. Section III details our algorithm. Section IV presents experimental results and Section V concludes.
II. CONSTRUCTING RTL CIRCUITS USING VARIABLE-LATENCY COMPONENTS: ISSUES AND ILLUSTRATIONS
Variable-latency components exhibit the property that the time (number of clock cycles) taken to compute the outputs depends on the input values. This is unlike components used in conventional HLS (functional units, multiplexers, registers), which have a fixed latency that is independent of input values. An example variable-latency unit embedded in an RTL controller/data path circuit is shown in Fig. 1(a). In addition to the output representing the value computed, the variable-latency unit provides an additional "Ready" signal that indicates when the computed output values are available. The controller senses the Ready signal, and configures the rest of the data path appropriately. For example, if the Ready signal assumes a logic value one, the load signal for the register that needs to store the output of the variable-latency unit is asserted (note that the high-latency case can be automatically inferred when the low-latency case does not occur; hence, the Ready signal need not be asserted for the high-latency case). The symbolic waveforms in Fig. 1(b) illustrate the operation of a generic variable-latency unit with two latencies in the "low-latency case" as well as the "high-latency case." Clearly, the time required by the variable-latency unit can transitively influence the timing of the subsequent computations in the circuit. In order to reflect the data-dependent completion time of a variable-latency unit, an operation assigned to a variable-latency unit can be viewed as an equivalent control data flow graph (CDFG) that contains three distinct operations, as shown in Fig. 1(c).
• An operation that represents the low-latency case.
• An operation that represents the high-latency case.
• A low-latency condition that is used to decide whether the low-latency or the high-latency operation will be executed.
Note that the substitution of variable-latency operations in a behavior by their equivalent CDFGs converts a DFG with no conditionals into a CDFG.
In practice, arithmetic operations (addition, subtraction, multiplication, division) lend themselves easily to variable-latency implementation. For such operations, one possible low-latency condition is that the most significant bits of all the inputs are equal to 0. This effectively reduces an -bit arithmetic operation into an bit arithmetic operation. For comparators, an efficient strategy is as follows. Consider the operation . The condition xor may be used as the low-latency condition. If the low-latency condition is true, all the lower order input bits can be ignored and the result of the comparison can be replaced by , resulting in a significant reduction in delay.
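The zero-upper-bits condition for arithmetic units can be sketched as follows for an unsigned multiplier. The bit width, the number of upper bits tested (`k`), and the cycle counts are illustrative assumptions rather than values from the component library, and the function names are hypothetical.

```python
def low_latency_condition(a: int, b: int, width: int = 16, k: int = 8) -> bool:
    """True when the k most significant bits of both width-bit inputs are 0."""
    limit = 1 << (width - k)
    return a < limit and b < limit


def vl_multiply(a: int, b: int, width: int = 16, k: int = 8,
                low_cycles: int = 1, high_cycles: int = 2):
    """Return (product, cycles) for an unsigned variable-latency multiply.

    The cycle counts are assumed values chosen only for illustration.
    """
    mask = (1 << width) - 1
    a &= mask
    b &= mask
    if low_latency_condition(a, b, width, k):
        # Only the lower (width - k) bits matter, so a narrower, faster
        # multiplier would suffice in hardware.
        return a * b, low_cycles
    return a * b, high_cycles


fast = vl_multiply(200, 100)   # both operands fit in the lower 8 bits
slow = vl_multiply(300, 2)     # 300 needs more than 8 bits
```

The result is identical in both branches; only the number of cycles reported differs, which mirrors the role of the Ready signal.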
Though variable-latency units bear a promise of improved performance, several issues must be kept in mind. These include the following.
• In order to maximize performance improvements, it is critical to perform a judicious selection of operations that are to be implemented by variable-latency units. In addition to traditional metrics like how the operations affect the critical path in the behavior, the statistical properties of variables in the behavior also significantly affect the decision of which operations should be assigned to variable-latency units.
• The use of variable-latency units may lead to significant area overheads. In addition to containing additional circuitry compared to their fixed-latency counterparts, the use of variable-latency units may inhibit resource sharing, since operations assigned to fixed-latency units and variable-latency units cannot be shared. Hence, techniques to reduce the area overhead due to variable-latency units need to be studied.
The issues involved in HLS with variable-latency units are illustrated through examples in the remainder of this section.
The first example deals with the selection of operations to assign to variable-latency functional units during HLS.
Example 1: Consider the behavior fragment shown in Fig. 2(a) (for now, please ignore the shaded ellipses). The behavior represents the body of a loop that computes two expressions ( , and ), and writes the results into two arrays in memory. The library of components that can be used to implement the behavior is shown in Fig. 2(b). The library includes fixed-latency components (e.g., ADD_SUB1) as well as variable-latency components (e.g., VL_MUL1). Note that it is possible to transform any arbitrary combinational circuit or pipelined sequential circuit into a variable-latency implementation [8]. In this example, however, for ease of illustration, we consider only multiplications for variable-latency implementation. Further, suppose that the designer needs to implement the behavior of Fig. 2(a) under the resource constraint: 1 LT_COMP1, 2 ADD_SUB1, 1 VL_MUL1, 1 MUL1. Since there are four multiplication operations, the natural question that arises is which multiplication operations should be performed using the variable-latency multiplier (VL_MUL1). There are various factors that may affect this decision.
• The critical paths in the behavior are likely to be , and . In order to speed up the loop body, it is necessary to speed up both critical paths.
• The mapping of some operations to variable-latency units creates bottlenecks for resource sharing that were not hitherto present (e.g. an operation mapped to MUL1 cannot be shared with an operation mapped to VL_MUL1). In a resource-constrained scheduling paradigm, this means that the assignment of operations to variable-latency units will affect the parallelism available during the scheduling process. For example, in Fig. 2 (a), if operations and are both selected to be performed by a unit of type VL_MUL1, and there is only one resource of such type available, the operations can no longer be performed in parallel.
• The input variables of different operations differ (in a statistical sense) in the values they assume. Since the latency of a variable-latency unit is determined dynamically based on the input values it encounters, different input distributions will lead to different average completion times for a variable-latency unit.
In order to investigate the effect of the above factors and their interplay on the performance and area of the synthesized RTL circuits, we examined several subsets of multiplication operations from the behavior. For each subset, we selected operations in the subset to be performed by a variable-latency unit (VL_MUL1), while the remaining multiplications were mapped to a fixed-latency unit (MUL1). For each configuration, we performed scheduling using the Wavesched scheduler [9], and calculated the expected number of clock cycles (ENC)/iteration of the loop as a performance metric. The results of these experiments are summarized in Table I. Each row in the table represents a distinct RTL implementation. The first row corresponds to a design where no variable-latency units are used, i.e., two fixed-latency multipliers are used; we refer to this as the original design. The operations that are selected to be performed by variable-latency units are indicated in the second column, while the ENC metric is reported in the third column.
The results in Table I illustrate several important issues that are encountered when using variable-latency units in an HLS-based design methodology. As expected, the original design has an ENC of 12 cycles/loop iteration.
• The performance of designs with variable-latency units varies from 10.22 cycles/loop iteration to 14.34 cycles/loop iteration. While the best design (case 5) is better (faster) than the original design, the worst design (case 7) is actually worse (slower) than an implementation without variable-latency units. Thus, it is critical to perform a judicious selection of which operations to implement using variable-latency units. In addition, the decisions made during this step significantly impact the other steps in the HLS process.
• The importance of considering "critical paths" in the behavior while selecting operations for variable-latency implementation is illustrated by the fact that selecting operations and (case 4) does not lead to any improvement in the performance. This is because while one critical path in the loop body ( ) is shortened, the other critical path ( ) remains intact. A similar argument applies for case 6, when operations and are selected for variable-latency implementation.
• The scheduling bottlenecks that may be introduced by an improper selection of operations for variable-latency implementation are illustrated by case 2 and case 7. In case 2, operations and are selected for variable-latency implementation. Recall that the resource constraint specifies one variable-latency multiplier VL_MUL1, and one fixed-latency multiplier MUL1. In the behavior, operations and have no dependency, and can potentially be executed in parallel (indeed, that is the case in the schedule for the original design). However, in the variable-latency implementation for case 2, the resource constraint forced the scheduler to serialize operations and (even though the total number of multipliers is the same as the original design, there is only one variable-latency multiplier). This new dependency increases the worst case critical path in the schedule to 15 clock cycles, and the ENC increases to 14.22 cycles. A similar argument holds for case 7. On the other hand, in case 3 the additional dependency introduced between operations and is not a scheduling bottleneck, since these operations are not executed in parallel (due to the structure of the CDFG) even in the absence of such a dependency.
• The importance of considering data value statistics of the variables in the behavior while synthesizing a variable-latency implementation is shown by comparing the designs corresponding to case 3 and case 5. From the point of view of the first two criteria described above, the two cases are equivalent: in case 3 as well as case 5, both the critical paths in the loop body are shortened, and no scheduling bottlenecks are introduced. However, case 5 leads to an RTL implementation with a better ENC than case 3. That difference can be attributed to the fact that the values assumed by the input variables of the multiplication operations have different data statistics. Specifically, the input statistics of operations and are such that they favor the low-latency case in the multiplier implementation (as shown in the figure, the low-latency probabilities for operations and are 0.89 and 0.83, respectively, while those for operations and are 0.5 and 0.1, respectively). In general, the low-latency probability for an operation (with respect to a specific variable-latency component) is determined by the data statistics of the operation's input variables, and the exact input conditions that trigger the low-latency case in the variable-latency component that implements it. Section III formalizes this concept, and presents techniques to compute the latency probabilities for operations and use them during the synthesis of variable-latency RTL implementations.
The previous example demonstrates the need for automatic techniques that consider the various factors involved in design with variable-latency units in a quantitative manner, and operate in synergy with the conventional HLS tasks. Section III presents procedures that can be used as plug-ins into a generic HLS tool to generate high-performance RTL designs through the use of variable-latency units.
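The effect of the low-latency probability on a single operation can be illustrated with a short calculation. This is a minimal sketch: the four probabilities are the ones quoted in Example 1, but the low and high cycle counts (2 and 4) are assumed purely for illustration, since the latencies of the library components in Fig. 2(b) are not reproduced here.

```python
def expected_cycles(p_low: float, low_cycles: int, high_cycles: int) -> float:
    """Expected latency of one operation on a two-latency unit."""
    return p_low * low_cycles + (1.0 - p_low) * high_cycles


LOW_CYCLES, HIGH_CYCLES = 2, 4   # assumed latencies, not from the paper

# Low-latency probabilities quoted in Example 1.
for p in (0.89, 0.83, 0.5, 0.1):
    print(f"p_low={p:.2f}: expected {expected_cycles(p, LOW_CYCLES, HIGH_CYCLES):.2f} cycles")
```

With these assumed latencies, an operation with a low-latency probability of 0.1 gains almost nothing over the fixed high latency, which is consistent with the observation that data statistics must inform the selection.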
While consideration of the factors mentioned in the previous example can lead to RTL designs with significantly improved performance characteristics, the hardware (area) overhead associated with using variable-latency units may be of concern in area-constrained designs. As mentioned earlier in this section, variable-latency units contain additional circuitry compared to fixed-latency implementations. The resulting area overhead can sometimes be significant [8] . One straightforward way to limit the area overhead is to trade off the number of variable-latency units used with the performance improvements. Unfortunately, this may significantly reduce the performance improvements achievable. In this paper, we propose a novel technique to reduce the area overhead due to variable-latency units, with minimal impact on performance, which we introduce next.
Variable-latency units perform better than fixed-latency units when data statistics imply that low-latency cases will be encountered frequently. When the input values correspond to a high-latency case, little or no performance improvement results. However, for the sake of completeness, a variable-latency unit contains circuitry to implement both the low-latency and high-latency cases, in addition to control circuitry to generate the Ready signal. In other words, a significant amount of circuitry in the variable-latency unit may be used only in the high-latency cases. Our technique is based on the use of reduced variable-latency units, which contain circuitry that computes the desired function only in the low-latency cases. Reduced variable-latency units contain significantly less circuitry than complete variable-latency units. In order to maintain correctness, the computations of the unit for the high-latency cases are detected and performed by one of the regular (fixed-latency) components in the data path. The use of reduced variable-latency units is illustrated next through an example.
Example 2: Consider the scheduled behavior fragment shown in Fig. 3(a) and the RTL component library shown in Fig. 2(b) . The resource constraint used for deriving this schedule is: 2 ADD_SUB1, 2 MUL1. Each operation in the schedule is also annotated with the functional unit to which it is mapped. The schedule requires 12 clock cycles, and the area required by the RTL implementation when synthesized is 58 832 units. The variable-latency schedule that results when the resource constraint is changed to 2 ADD_SUB1, 1 MUL1, VL_MUL1 is shown in Fig. 3(b) . In order to accurately reflect the timing behavior, each of the operations that have been selected for variable-latency implementation has been split into three distinct operations-an operation that represents the low-latency case, a low-latency completion condition, and an operation that represents the high-latency case. For example, multiply operation *1 has been split into (low-latency case), , and . Note that this representation is used purely for the purposes of our analysis-in the implementation, all three operations ( , , ) are performed by a single variable-latency multiplier VL_MUL1. The performance (ENC) for the schedule shown in Fig. 3(b) is 8.4 cycles. The RTL implementation of this schedule when synthesized requires an area of 59 470 units. Thus, the variable-latency schedule of Fig. 3(b) achieves a performance improvement of at an area overhead of 1.1% over the original design of Fig. 3(a) .
The area overhead incurred by the use of variable-latency units can be alleviated as follows. A reduced variable-latency multiplier is one which only computes the output in the low-latency case, and computes the low-latency condition. When the low-latency condition evaluates to False, the output of the unit is invalid. In effect, all input values that do not lead to the low-latency condition can be considered as don't-cares. These don't-cares can be used to eliminate and optimize parts of the circuitry that implements the variable-latency unit. For example, consider a 16-bit variable-latency multiplier with inputs and and outputs and Ready, for which the variable-latency condition is AND . In a reduced variable-latency unit, any logic that feeds only outputs can be eliminated. Note that given the low-latency condition, the conversion of a variable-latency unit into a reduced variable-latency unit can be performed using conventional logic synthesis and redundancy removal techniques [11], [6]. For example, in the case of the variable-latency multiplier described above, the reduced variable-latency unit requires 43% less area than a complete variable-latency unit. Fig. 4 shows an implementation of the example of Fig. 3 using a reduced variable-latency multiplier. The key difference between the schedule of Fig. 3(b) and the schedule of Fig. 4 is that in the latter case, the high-latency cases of variable-latency operations need to be performed using the available regular (fixed-latency) multipliers. For example, the operations and have to be performed by the fixed-latency multiplier that also performs operation . As a result, these operations can no longer be executed in parallel. However, since the high-latency case occurs relatively infrequently, the performance (ENC) of the schedule shown in Fig. 4 is 8.6 cycles. Thus, there is only a slight degradation in the performance compared to the variable-latency implementation of Fig. 3(b).
However, the area of the design actually improves, and is 17.9% less than that of the original design. The use of variable-latency units, without the transformation described above, creates a design with an area overhead of 1.1%.
As the above example illustrates, reduced variable-latency units can result in a significant reduction in area overhead, with a marginal impact on performance improvements. Thus, in some sense, reduced variable-latency units present an intermediate point in the area-delay tradeoff space (i.e., between a fixed-latency implementation and a variable-latency implementation with full-fledged variable-latency units). However, the area savings obtained by reduced variable-latency units can be traded off for better performance by providing more resources (functional units, registers, buses) for the scheduling process. Section III presents techniques for automatically incorporating reduced variable-latency units into the HLS process to provide a better exploration of area versus delay tradeoffs.
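The figures quoted in Example 2 can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers stated in the example (12 cycles and 58 832 area units for the original design, 8.4 cycles and 59 470 units with a complete variable-latency multiplier, 8.6 cycles and a 17.9% area reduction with the reduced unit). The ratios and the absolute reduced-unit area it prints are derived here and are not figures reported in the text.

```python
ORIG_CYCLES, ORIG_AREA = 12.0, 58832   # fixed-latency design, Fig. 3(a)
VL_ENC, VL_AREA = 8.4, 59470           # complete variable-latency unit, Fig. 3(b)
RVL_ENC, RVL_AREA_REDUCTION = 8.6, 0.179  # reduced variable-latency unit, Fig. 4

# Area overhead of the complete variable-latency design, in percent.
area_overhead_pct = 100.0 * (VL_AREA - ORIG_AREA) / ORIG_AREA

# Ratios of cycle counts (derived, not quoted from the paper).
vl_cycle_ratio = ORIG_CYCLES / VL_ENC
rvl_cycle_ratio = ORIG_CYCLES / RVL_ENC

# Absolute area implied by the stated 17.9% reduction (derived).
rvl_area = ORIG_AREA * (1.0 - RVL_AREA_REDUCTION)

print(f"area overhead with complete VL unit: {area_overhead_pct:.1f}%")
print(f"cycle ratio, complete VL unit: {vl_cycle_ratio:.2f}")
print(f"cycle ratio, reduced VL unit:  {rvl_cycle_ratio:.2f}")
print(f"implied area with reduced VL unit: {rvl_area:.0f} units")
```

The computed overhead rounds to the 1.1% stated in the example, which confirms that the two area figures and the percentage are mutually consistent.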
III. ALGORITHMS
In this section, we describe our algorithm for synthesizing a circuit that incorporates variable-latency units to realize the performance improvements seen earlier. Section III-A presents an overview of this framework, while Section III-B details the constituent steps.
A. Overview
Conventional design methodologies use an HLS tool to synthesize an RTL controller/datapath (without any variable-latency units) from a given behavior, design constraints, and optimization objectives. The design flow typically involves scheduling and resource sharing in addition to many optimizations targeting design metrics like area, performance, etc. Fig. 5 presents a generic HLS methodology that incorporates our algorithm (the block shaded in grey) as a plug-in for enhancing the quality of the synthesized RTL design. The variable-latency optimization step is applied between the scheduling and resource sharing steps of synthesis. It takes as input a functional RTL description that represents a cycle-accurate behavior (i.e., a schedule) for a circuit. The final output is an optimized schedule with a subset of operations assigned to variable-latency units. This description is then processed by the remaining HLS stages to generate a structural RTL circuit, wherein the datapath is augmented with variable-latency units.
Fig. 5 outlines the different steps of our algorithm. The algorithm starts with an input schedule Curr_Sch that does not account for any variable-latency computations and evolves a schedule with operations assigned to variable-latency functional units. It also accepts as input a parameter that determines the aggressiveness of the optimization procedure. A large value of this parameter results in high-quality solutions, at the cost of increased CPU time. The algorithm involves an iterative construction of the final schedule through the selection of a set of operations at every stage of our analysis (Steps 1-4).
Step 1 first selects a set of operations as candidates for variable-latency implementation, based on the following criteria.
• The sensitivity of the ENC of the schedule to the operation: if an increase in the delay of the operation increases the total schedule length, the operation is deemed critical, and is a good candidate for implementation by a variable-latency unit.
• The probability that the operation will finish with a low latency, when implemented by a variable-latency unit.
Step 1 (see Section III-B1) finds the operations best satisfying these two characteristics.
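The two criteria can be combined into a single ranking score. The following is a minimal sketch, assuming the score is the product of an operation's low-latency probability and the expected length of the longest input-to-output path through it. The operation names, the numbers, and the function `rank_candidates` are hypothetical and serve only to illustrate the ranking.

```python
from typing import Dict, List, Tuple


def rank_candidates(path_length: Dict[str, float],
                    p_low: Dict[str, float],
                    count: int) -> List[Tuple[str, float]]:
    """Return the `count` operations with the highest criticality x probability."""
    scores = {op: p_low[op] * path_length[op] for op in path_length}
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:count]


# Hypothetical data: expected longest path (in cycles) through each multiply,
# and the probability that it completes with low latency.
path_length = {"m1": 12.0, "m2": 12.0, "m3": 9.0, "m4": 12.0}
p_low = {"m1": 0.5, "m2": 0.9, "m3": 0.8, "m4": 0.1}

candidates = rank_candidates(path_length, p_low, count=2)
```

In this toy data, m4 lies on a long path but rarely finishes early, so it ranks last, while m3 lies on a shorter path yet outranks m1 because of its higher low-latency probability.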
Steps 2 and 3 then choose the "best" schedule that can be obtained by implementing of these operations using variable-latency units. Note that the best schedule is not necessarily obtained by choosing the operations with best individual impact. This is because the performance impact of implementing multiple operations by variable-latency units is not a simple function of the performance impacts of individual operations. The overall performance of the design depends, in addition to the delay of individual operations, on the resource constraints, schedule, and the topology of the behavior.
In order to consider the cumulative impact of implementing multiple operations by variable-latency units, Step 2 generates schedules for all possible subsets of cardinality .
Step 3 estimates the actual performance for each case in terms of expected number of cycles. The subset corresponding to the best speedup is chosen for variable-latency implementation, and the schedule is updated accordingly. Steps 1-3 are then repeated until operations have been selected.
While the use of variable-latency units leads to significant performance gains, it also causes additional area overheads. Therefore, in Step 4, we perform area recovery using a novel technique, as described below. This technique exploits the possibility that an operation mapped to a variable-latency unit can also be mapped as follows: the low-latency computation and the low-latency condition are mapped to a reduced variable-latency unit, which is usually much smaller than a complete variable-latency functional unit, and the high-latency part is mapped to an existing resource. Therefore, the extra hardware required is only a reduced variable-latency unit, as opposed to a full variable-latency unit. This optimization does not cause significant overheads in terms of performance when the low-latency condition is satisfied with a high probability.
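The mapping used in Step 4 can be pictured as a dispatch between two resources. The sketch below is a behavioral illustration only: the zero-upper-bits condition, the bit widths, and all function names are assumptions, and the fixed-latency multiplier is modeled as an ordinary Python callable.

```python
from typing import Callable, Optional, Tuple


def reduced_vl_multiply(a: int, b: int, width: int = 16,
                        k: int = 8) -> Tuple[bool, Optional[int]]:
    """Reduced unit: returns (ready, product); product is valid only if ready."""
    limit = 1 << (width - k)
    ready = a < limit and b < limit   # assumed low-latency condition
    return ready, (a * b if ready else None)


def multiply_with_fallback(a: int, b: int,
                           fixed_multiplier: Callable[[int, int], int]
                           ) -> Tuple[int, str]:
    """Use the reduced unit when it is ready, else an existing fixed unit."""
    ready, product = reduced_vl_multiply(a, b)
    if ready:
        return product, "reduced"
    # High-latency case: the reduced unit's output is invalid, so the
    # operation is re-issued on a shared fixed-latency resource.
    return fixed_multiplier(a, b), "fixed"


def shared_mul(x: int, y: int) -> int:
    """Stand-in for the existing fixed-latency multiplier."""
    return x * y


low_case = multiply_with_fallback(12, 10, shared_mul)
high_case = multiply_with_fallback(1000, 3, shared_mul)
```

Because the fallback occupies a resource that other operations also use, the scheduler must account for the resulting contention, which is why the transformation pays off only when the fallback is rare.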
B. Details
In this section, we detail different aspects of our algorithm. Section III-B1 describes the initial selection phase (Step 1 of Fig. 5). Section III-C details the subsequent rescheduling and selection phases (Steps 2 and 3), while Section III-C1 outlines the final area recovery phase (Step 4).
1) Selection: In this section, we present a simple technique for determining an initial set of candidate operations for possible variable-latency implementation. We base our formulation on the following general observations.
• An operation that completes with a low latency frequently is a good candidate for variable-latency implementation. As mentioned earlier, the frequency of low-latency completion depends on the statistics of the input values.
• An operation whose speedup has a higher impact on the ENC of the schedule must have a greater chance of selection (i.e., operations on "critical paths" should be given higher priority).
Given an operation $op$, the expected length of the longest path from any primary input to any primary output that passes through $op$, denoted here by $\lambda(op)$, forms a good measure of operation criticality. This is because operations that lie on longer input-to-output paths, when rescheduled, are more likely to influence the computation times of the primary outputs (and, hence, the ENC of the entire schedule). The probability of low-latency completion of an operation, denoted here by $P_{LL}(op)$, is computed through a simulation of the schedule by monitoring the input values.
We multiply $\lambda(op)$ by $P_{LL}(op)$ to obtain the measure
$$\mathrm{potential}(op) = P_{LL}(op) \times \lambda(op). \tag{1}$$
The operations with the highest potential values are selected as candidates for variable-latency implementation.
We now discuss our technique for assigning multiple operations to variable-latency units. One possible way to achieve this would be to select the required number of top-ranked operations from the list of candidates output by the initial selection phase (Section III-B1). However, there are some clear disadvantages of this approach, as illustrated by the following example.
Example 3: Fig. 6(a) shows a behavior fragment with three operations relevant to this analysis. Fig. 6(b) shows the ENC improvements obtained by selecting individual operations and two-operation subsets for variable-latency implementation. In Fig. 6(b), the X axis represents different operation subsets, and the length of each bar represents the improvement in ENC obtained.
From Fig. 6(b), it is clear that two of the operations have the maximum individual impact on the ENC. However, the best overall impact is obtained by choosing an operation pair that includes an operation whose individual impact is zero. This is because the choice of one operation for variable-latency implementation affects the critical paths, making new operations critical, as illustrated in Fig. 6(c). The left-hand side of this figure shows the original schedule for the behavioral fragment of Fig. 6(a). As we can see, the critical path passes through the two operations that consequently have the highest individual impact. When one of them is chosen and rescheduling is performed, the critical paths through the behavior change. The new critical path passes through a different operation, as shown in the figure. Hence, the pair with the best combined impact is a better candidate set than the pair of individually top-ranked operations.
The issue illustrated above is addressed by actually deriving schedules corresponding to all possible candidate subsets of the chosen size. We then select the schedule with the best speedup as the starting point for the next iteration.
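The effect described in Example 3 can be reproduced on a small toy model. The sketch below is only an illustration under stated assumptions: the three operations, their latencies, and the dependency graph are hypothetical (they are not the contents of Fig. 6), and the ENC is approximated by the longest path computed from expected latencies, which ignores resource constraints and control flow.

```python
from itertools import combinations

# Hypothetical toy behavior (not Fig. 6): A and C both feed B.
PREDS = {"A": [], "C": [], "B": ["A", "C"]}
FIXED = {"A": 4.0, "B": 2.0, "C": 3.0}        # cycles on fixed-latency units
EXPECTED_VL = {"A": 1.0, "B": 1.5, "C": 1.0}  # assumed expected cycles on variable-latency units


def schedule_length(vl_ops):
    """Longest-path length when the operations in vl_ops use variable-latency units."""
    finish = {}
    for op in ("A", "C", "B"):  # a topological order of the toy graph
        lat = EXPECTED_VL[op] if op in vl_ops else FIXED[op]
        start = max((finish[p] for p in PREDS[op]), default=0.0)
        finish[op] = start + lat
    return max(finish.values())


def top_ranked(candidates, k):
    """Pick the k operations with the largest individual improvement."""
    base = schedule_length(set())
    ranked = sorted(candidates, key=lambda op: base - schedule_length({op}), reverse=True)
    return tuple(sorted(ranked[:k]))


def best_subset(candidates, k):
    """Evaluate every k-subset and keep the one with the shortest schedule."""
    return min(combinations(sorted(candidates), k),
               key=lambda subset: schedule_length(set(subset)))


print(top_ranked(PREDS, 2), schedule_length({"A", "B"}))
print(best_subset(PREDS, 2), schedule_length({"A", "C"}))
```

In this toy model, speeding up C alone brings no gain because A dominates the path into B, yet the pair containing C gives the shortest schedule. Ranking by individual gain would miss that pair, while exhaustive subset evaluation finds it.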
C. Rescheduling
When an operation in the behavior is implemented by a variable-latency unit, its timing behavior is altered. The schedule, therefore, needs to be modified to reflect this change. Note that the effect of the change in the operation's timing can propagate to other operations in its transitive fanout. Also note that this rescheduling is performed in the inner loop of our algorithm, hence, it can be invoked a large number of times.
In order to realize performance gains while being computationally efficient, it is important that the scheduling procedure be incremental, i.e., it should start with the existing schedule information and modify the schedule in a series of incremental steps. The rescheduling problem defined above is similar to the problem of performing a set of simple rearrangements across dependencies for a piece of code, with the aim of extracting parallelism without affecting the execution semantics.
The rescheduling problem maps naturally to incremental transformation-based scheduling algorithms like Percolation Scheduling [12] .
Our rescheduling procedure operates as a set of transformations to the existing (initial) schedule. These transformations include moving of operations across states, deletion of empty states, etc., as outlined in [12] . Each transformation is guaranteed to respect the control and data dependencies between operations in the behavior and, hence, preserves functional equivalence with the behavior. We next illustrate our rescheduling procedure through an example.
Example 4: Consider the original schedule shown in Fig. 7(a) derived for the following input specification.
• An initial allocation constraint that does not include any variable-latency units.
• In any given iteration, the dependencies are as follows.
2 is schedulable if and only if +1 has been scheduled. Likewise, is schedulable if and only if 1 has been scheduled. +1 and 1 are schedulable if and only if of the previous iteration terminates. Assume that the operation is deemed a suitable candidate for variable-latency implementation such that its low-latency case terminates in 2 cycles. Then, a scheduler has to perform the basic actions enumerated below to incorporate this solution. 
1) Making the selected operation a variable-latency operation creates an alternative path (shown in Fig. 8) that the schedule can take. Since this path represents the low-latency case, the operation finishes when an earlier state is reached and is consequently absent from the states that follow on this path.
2) A state that becomes empty can be deleted, since the functionality of the schedule is still preserved.
3) Since the operation finishes earlier on the low-latency path, operations dependent on its output become immediately schedulable, subject to the input allocation constraint. For example, the operations of a later state can be advanced to an earlier state. The composite state shown in Fig. 8(b) captures this scenario. This move also adds the additional edges shown in Fig. 8.
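The basic actions listed above can be illustrated with a small sketch. It is not the Percolation Scheduling implementation of [12]; it assumes a schedule stored as start cycles per operation, hypothetical operation names, and no resource limits, so that an operation may move up as soon as its predecessors have finished.

```python
def reschedule_after_speedup(start, latency, preds, op, new_latency):
    """Move dependents of a shortened operation to earlier cycles.

    Assumption for this sketch: no allocation constraint is modeled, so an
    operation may start as soon as all of its predecessors have finished.
    """
    latency = {**latency, op: new_latency}
    start = dict(start)
    # A predecessor always starts before its dependents, so visiting the
    # operations by original start time settles each predecessor first.
    for node in sorted(start, key=start.get):
        if preds[node]:
            ready = max(start[p] + latency[p] for p in preds[node])
            start[node] = min(start[node], ready)  # operations only move up
    length = max(start[n] + latency[n] for n in start)
    return start, length


# Hypothetical fragment: mul -> add -> cmp, plus an independent inc.
start = {"mul": 0, "add": 3, "cmp": 4, "inc": 0}
latency = {"mul": 3, "add": 1, "cmp": 1, "inc": 1}
preds = {"mul": [], "add": ["mul"], "cmp": ["add"], "inc": []}
new_start, new_length = reschedule_after_speedup(start, latency, preds, "mul", 1)
print(new_start, new_length)
```

Only the low-latency path is modeled here. A complete scheduler would keep the original path for the high-latency case and would check the allocation constraint before each move.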
As mentioned above, these tasks map directly to the transformations employed in scheduling algorithms like Percolation Scheduling [12]. When our rescheduling procedure is called, these transformations are performed on the original schedule to derive intermediate representations leading to the final output shown in Fig. 7(b).
1) Area Recovery:
As mentioned in earlier sections, variable-latency units might be significantly larger than normal functional units. This is because, in addition to the high-latency part and the low-latency part, a variable-latency unit also has a low-latency completion check. If an operation frequently executes with a low latency, the high-latency part of the variable-latency unit is a resource that is infrequently used. Therefore, it can be implemented on a regular functional unit, and the variable-latency unit can be redefined to consist only of the low-latency part and the low-latency detection circuit.
This step is illustrated in Fig. 9. Fig. 9(a) shows a variable-latency operation, and Fig. 9(b) shows a schedule for a behavioral fragment containing it, scheduled with a constraint of one multiplier and one variable-latency multiplier (note that the operation is a multiplication). The probability that the operation evaluates with a low latency is 0.95. Therefore, we replace the variable-latency unit implementing it by a reduced variable-latency unit. The schedule is modified as shown in Fig. 9(c). As we can see, the schedule is elongated in the high-latency branch because the regular multiplier implements both the high-latency case and another operation of the fragment. However, since the high-latency branch occurs with a very low probability, the increase in expected execution time of the schedule is not significant. In general, we perform this optimization when it can be shown that the expected execution time does not increase by more than a user-defined factor.
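The acceptance test for area recovery can be sketched as a comparison of expected execution times. The sketch assumes a schedule with a single low-latency branch and a single high-latency branch, and it interprets the user-defined factor as a multiplicative bound on the expected time. The function names, the parameter `max_factor`, and the cycle counts are hypothetical.

```python
def expected_cycles(p_low, low_len, high_len):
    """Expected schedule length for a two-branch schedule."""
    return p_low * low_len + (1.0 - p_low) * high_len


def accept_reduced_unit(p_low, low_len, high_len, high_len_reduced, max_factor):
    """Accept the reduced unit if the expected time grows by at most max_factor.

    high_len_reduced is the (longer) high-latency branch obtained when the
    high-latency case shares a regular functional unit with other operations.
    """
    before = expected_cycles(p_low, low_len, high_len)
    after = expected_cycles(p_low, low_len, high_len_reduced)
    return after <= max_factor * before


# With a low-latency probability of 0.95, a longer high-latency branch costs little.
print(accept_reduced_unit(0.95, 4, 6, 8, max_factor=1.05))
# With a probability of 0.5, the same elongation is rejected.
print(accept_reduced_unit(0.50, 4, 6, 8, max_factor=1.05))
```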
D. Algorithm Complexity
In this section, we analyze the complexity of our algorithm, and present some insights on the CPU time taken by the various steps in practice. As shown in Fig. 5 , our algorithm needs to perform the following major tasks.
• Simulate the input behavior with the given input traces to compute low-latency and high-latency completion probabilities for individual operations: the complexity of this procedure depends on the number of operations in the input behavior and on the number of inputs simulated.
• Evaluate potentials for individual operations and select the operations with the highest potentials: the complexity of this procedure is dominated by the computation of the expected lengths of the longest paths passing through individual operations [13]. The complexity of computing the expected length of the longest path passing through a single operation [13] is linear in the number of nodes and edges in the CDFG representation, which in the worst case translates to quadratic in the number of operations in the behavior. Repeating the above computation for each operation in the behavior makes the complexity of this procedure cubic in the number of operations in the worst case. Note that the computation of potentials and the sorting of operations do not need to be performed within the inner loop of our algorithm, as shown in Fig. 5.
• The following steps need to be performed for each candidate subset of the chosen size, i.e., within the inner loop of the algorithm.
1) Rescheduling the modified behavior: since rescheduling is performed incrementally, the complexity of this procedure is linear in the number of operations in the behavior. However, in practice, the number of moves performed while rescheduling is significantly smaller.
2) Computing the ENC: the complexity of this procedure is a function of the number of states in the schedule [10].
Putting these terms together yields the overall complexity of the algorithm, given by (2), which also involves the number of iterations of the outer loop. Note that Step 4 of the algorithm (area recovery) does not figure in the final equation because its complexity is eclipsed by that of the loop involving Steps 1-3. Also, note that the computation time required by our algorithm is added to the time required by the HLS tool to perform other tasks, including the initial scheduling, the final resource sharing, and hardware generation.
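The potential evaluation step analyzed above can be sketched as follows. This is a simplification and not the method of [13]: it assumes a plain data-flow DAG with fixed per-operation latencies in place of expected path lengths over a CDFG, and the names `preds`, `latency`, and `p_low` are hypothetical. One forward and one backward pass give the longest path through every operation in time linear in the number of nodes and edges.

```python
from graphlib import TopologicalSorter


def potentials(preds, latency, p_low):
    """Return p_low[v] * (longest input-to-output path through v) for every v.

    preds maps every operation to the list of its predecessors.
    """
    order = list(TopologicalSorter(preds).static_order())
    succs = {v: [] for v in preds}
    for v, ps in preds.items():
        for p in ps:
            succs[p].append(v)
    fwd, bwd = {}, {}
    for v in order:            # longest path ending at v (inclusive)
        fwd[v] = latency[v] + max((fwd[p] for p in preds[v]), default=0)
    for v in reversed(order):  # longest path starting at v (inclusive)
        bwd[v] = latency[v] + max((bwd[s] for s in succs[v]), default=0)
    return {v: p_low[v] * (fwd[v] + bwd[v] - latency[v]) for v in preds}


preds = {"a": [], "b": [], "c": ["a", "b"], "d": ["c"]}
latency = {"a": 3, "b": 1, "c": 2, "d": 1}
p_low = {"a": 0.5, "b": 0.75, "c": 0.25, "d": 0.0}
ranking = sorted(potentials(preds, latency, p_low).items(),
                 key=lambda item: item[1], reverse=True)
print(ranking)
```

Because both passes are shared by all operations, this simplified version avoids repeating the path computation per operation. The per-operation cost quoted in the text applies to the expected-length computation of [13].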
IV. EXPERIMENTAL RESULTS
The techniques described in this paper were implemented within the framework of an existing HLS tool. The RTL library of functional units originally consisted of the following fixed-latency components: an adder, a subtracter, a multiplier, various comparators, an incrementer, and a decrementer. Variable-latency and reduced variable-latency implementations of these functional units were designed and added to our design library. For the arithmetic functional units, the low-latency condition used to derive the variable-latency implementation was that a number of the higher order bits of each input, determined by the bitwidth of the functional unit, should equal 0. For comparators, a different low-latency condition, as explained in Section II, was used. For the scheduler, unlimited numbers of single-bit logic components (OR, AND, and NOT) were assumed to be available. We evaluated the techniques using five example benchmarks. We generated structural RTL implementations with and without the use of variable-latency units, as well as with the use of reduced variable-latency units. We refer to these as the original, variable-latency, and reduced variable-latency RTL implementations, respectively.
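The low-latency condition for arithmetic units can be modeled in a few lines. In this sketch the number of higher order bits that must be zero is a hypothetical parameter, `zero_bits`, since its exact value is not given here. The second function estimates the low-latency completion probability from a trace of input pairs, in the spirit of the simulation-based estimation used during selection.

```python
def completes_with_low_latency(a, b, width, zero_bits):
    """True when the top zero_bits bits of both width-bit unsigned inputs are 0."""
    limit = 1 << (width - zero_bits)
    return 0 <= a < limit and 0 <= b < limit


def low_latency_probability(trace, width, zero_bits):
    """Fraction of (a, b) input pairs in the trace that meet the condition."""
    hits = sum(completes_with_low_latency(a, b, width, zero_bits) for a, b in trace)
    return hits / len(trace)


# Hypothetical 16-bit unit whose upper 8 bits must be zero for early completion.
trace = [(3, 200), (256, 1), (0, 0), (255, 255)]
print(low_latency_probability(trace, width=16, zero_bits=8))
```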
Simulation traces obtained from the designers (along with the designs) were used to simulate the behavior for computing low-latency completion probabilities and branch probabilities. Where designer-given testbenches were not available, we used Gaussian streams, which were filtered to introduce desired levels of spatial and temporal correlations, to stimulate each input of the behavioral description. The two parameters used in our algorithm (see Section III) were set to ten and three for our experiments.
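One common way to obtain a temporally correlated Gaussian stream is a first-order autoregressive filter. The sketch below uses that filter as an assumption, since the filtering used in the experiments is not specified, and it does not model spatial correlation between different inputs. The function name and the parameter `rho` (lag-one correlation, between 0 and 1) are hypothetical.

```python
import math
import random


def correlated_gaussian_stream(n, mean, std, rho, seed=0):
    """Generate n Gaussian samples with lag-one correlation rho (AR(1) filter)."""
    rng = random.Random(seed)
    x = rng.gauss(0.0, 1.0)  # unit-variance state of the filter
    samples = []
    for _ in range(n):
        samples.append(mean + std * x)
        # The scaling keeps the state at unit variance for any rho in [0, 1].
        x = rho * x + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
    return samples


stream = correlated_gaussian_stream(1000, mean=128.0, std=32.0, rho=0.9)
print(len(stream), min(stream), max(stream))
```

Before being applied to the inputs of a behavioral description, such samples would still need to be rounded and clipped to the bitwidth of each input.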
The ENC metric, computed as described in [10], was used to compare the performance of the RTL implementations. The logic synthesis tool SIS [11] was used to perform logic optimizations and to technology map the circuits to the MSU stdcell2_2.genlib library. Area estimates were reported by SIS using the cell area (grid count) information from the technology library. Table I summarizes the results of our experiments. Major columns ENC and Area report the ENC and the area reported by SIS, respectively. Minor columns Orig, VL_opt, and Reduced_VL represent the original, variable-latency, and reduced variable-latency RTL implementations for each example. The speedup ratio and percentage overheads in area are provided in brackets for the VL_opt and Reduced_VL columns.
Examples MemInit and Poly were discussed earlier in Section II (Figs. 2 and 3, respectively). PPsum is a parallel prefix sum routine used in address calculations. SeqDiv is a sequential implementation of the standard integer division algorithm, while Findmin computes the minimum of a set of given values. The results show that the proposed variable-latency-unit-based synthesis techniques achieve a performance improvement of up to about 1.6X (average of 1.4X) over a state-of-the-art HLS tool, with minimal area overheads (average of 5.3%). The use of reduced variable-latency units leads to a performance improvement of up to 1.6X (average of 1.3X), with a simultaneous area reduction of up to 17.9% (10.6% on average).
The additional CPU times required by the HLS tool, as measured on a SPARCstation 10 with 128 MB of RAM, ranged from 32 s to 411 s for all the examples reported. This includes only the time required for performing variable-latency unit related optimizations, i.e., it does not include the time taken to perform other behavioral synthesis tasks. The CPU time was observed to vary as the two parameters of the algorithm were varied. However, for the examples used in our experiments, we did not observe any significant improvement in the solution quality (area, performance) when these parameters were increased above the values used to generate the results reported in Table II.
In order to study the performance of a variable-latency RTL implementation under varying input statistics (different from the one used during its synthesis), we simulated the variable-latency optimized circuit obtained for the SpHarm example with a range of input statistics. The input statistics were varied by controlling the parameters (mean, standard deviation) of the Gaussian streams used to generate the stimuli, and by varying the spatial and temporal correlations introduced thereafter. Note that deterministically generating input sequences that exercise the worst case path is a computationally difficult problem. Our experiments revealed that the speedup of the variable-latency implementation compared to the original implementation ranged from 1.04X to 1.2X for the different inputs. This implies that for inputs which triggered high-latency cases the performance improvement was as low as 1.04X.
We performed an additional experiment with examples MemInit and SpHarm, where we generated the area-performance tradeoff curves obtained using HLS with and without variable-latency units. The area-delay tradeoff was explored by varying the resource constraints given to the HLS tool, and optimizing for performance for each resource constraint. Each RTL circuit was synthesized and technology mapped, and from the area/performance estimates of each synthesized netlist we obtained one point in the graph (i.e., an (ENC, area) tuple). The results of this experiment for the MemInit example are plotted in Fig. 10. The plots obtained clearly suggest the following.
• Variable-latency implementations can result in higher performance than can be achieved without the use of variable-latency units (even with unlimited resource constraints).
• A given area can be utilized better (to achieve higher performance) by using variable-latency units.
• Through the use of reduced variable-latency units, a desired performance can be attained with a circuit that requires less area. At one design point, for example, we achieve an ENC of 6828 cycles with 22% less area.
The results for the SpHarm example are plotted in Fig. 11. They indicate similar trends, further supporting the conclusions drawn above. In each of the two examples, note that the lowest area point among all the design points (on the two curves collectively) is a conventional (nonvariable-latency) implementation. This corroborates the fact that variable-latency units do incur some overheads due to the control circuitry and reduced resource sharing. The strength of regular variable-latency units lies in achieving improved performance (or improved performance for a given area). However, with the use of reduced variable-latency units, we observed that even if the aim is purely area optimization without any concern for performance, reduced variable-latency implementations often come out better (as indicated by the numbers in Table II).
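When comparing tradeoff curves of this kind, it can help to reduce each set of design points to its non-dominated points. The sketch below is a generic Pareto filter over (ENC, area) tuples, where lower is better in both dimensions. The design points are hypothetical and are not taken from Fig. 10 or Fig. 11.

```python
def pareto_front(points):
    """Keep the (enc, area) points that no other point beats in both dimensions."""
    front = []
    # After sorting by ENC, a point survives only if it has strictly less
    # area than every point with a smaller or equal ENC.
    for enc, area in sorted(set(points)):
        if not front or area < front[-1][1]:
            front.append((enc, area))
    return front


# Hypothetical design points obtained under different resource constraints.
points = [(100, 50), (80, 70), (90, 80), (120, 40), (100, 60)]
print(pareto_front(points))
```

Merging the points of two implementations before filtering shows which implementation contributes the non-dominated points in each region of the area-delay space.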
Thus, we believe that an appropriate use of variable-latency units can lead an HLS tool to explore portions of the area-delay space that were not previously possible to reach.
V. CONCLUSION
We have presented a technique to integrate variable-latency components into the scope of a general HLS system. Variable-latency components are, on average, much faster than fixed-latency components because they use input statistics to speed up execution. Our techniques help translate these module-level savings into maximal performance improvements with minimal area overheads for the entire design. We have also presented a novel technique based on the use of reduced variable-latency units to further reduce area overheads, and often improve the area of the synthesized design. These improvements are accompanied by simultaneous improvements in performance, thus changing the nature of the area-delay tradeoff curve. Experimental results have indicated up to 1.6-fold improvements in performance (average 1.4-fold), with area overheads of under 12% (average 5.26%). When reduced variable-latency units were employed, average performance improvements of 1.3-fold were achieved, with a simultaneous area reduction of up to 17.94% (average area reduction of 10.6%).
