Abstract: The excess delay that each component of a design can tolerate under a given timing constraint is referred to as its delay budget. Delay budgeting has been widely exploited to improve design quality in very large scale integrated computer-aided design flows. The objective of the delay-budgeting problem investigated in this paper is to maximize the total delay budget assigned to the nodes of a directed acyclic graph under a given timing constraint. Because the timing of library components is discrete during the design-optimization flow, a discrete solution for delay budgeting is essential. We present an optimal integer delay-budgeting algorithm and prove that the problem can be solved optimally in polynomial time. In addition, we look at two extensions of the delay-budgeting problem: maximization of a weighted sum of the delay budgets assigned to the nodes, and constraints on the lower and upper bounds of the delay budget allocated to each node. We prove that for both extensions, our algorithm can produce an optimal integer solution in polynomial time. Our algorithm is generic and can be applied to different design tasks at different levels of abstraction. We applied our optimal delay-budgeting algorithm to library mapping during datapath synthesis on a field programmable gate array (FPGA) platform, using preoptimized cores of FPGA libraries. For each application, we go through the synthesis and place-and-route stages in order to obtain accurate results. Our optimal algorithm outperforms the zero-slack algorithm (Nair et al. 1989) in terms of area by 10% on average over all applications. In some applications, optimal delay budgeting can speed up the runtime of place and route by up to two times.
I. INTRODUCTION

Due to the complexity of system design and the high uncertainty of timing issues and quality metrics, it is not effective to optimize performance intensively in earlier stages of very large scale integrated (VLSI) computer-aided design (CAD) flow. Instead, the optimization should aim at ensuring correctness and convergence of the design. In order to abstract away the design complexity, each design is decomposed into a set of subdesigns.
The essential constraint during the design-optimization flow is the timing constraint. Along with timing, there are other constraints, such as size, power dissipation, etc. In order to manage the timing constraint, a percentage of the total delay in a complex design is dedicated to each subdesign. The subdesigns along the critical paths are the most constrained components during the optimization process in CAD flow. However, timing constraint is loose on the other subdesigns. Hence, the delay allocated to each subdesign can be greater than actual/intrinsic delay of the subdesign. This excess delay is referred to as delay budget (or timing budget). In other words, delay-budget assignment is the problem of assigning the upper bound on the delay (or latency) of all the subdesigns under the given timing constraint. Delay assignment is applied at all levels of abstraction in VLSI CAD flow.
There is usually a tradeoff between timing and some other design metrics, such as size, power consumption, throughput, etc. Hence, delay budgeting can be exploited through the whole CAD design flow to improve the other design metrics, such as area, power consumption, etc. The more delay budget assigned to the design, the more flexibility would be given for further optimization on other design constraints. From another point of view, larger upper bound on delay relaxes the optimization, thus resulting in faster compilation and design time.
Each design is represented by a directed acyclic graph (DAG). There is a delay associated with each node. Let the timing constraint be the maximum latency (or delay) at the output nodes. Delay or latency at the output is computed as the longest path delay in the graph from input to output. The delay along each path is the total delay associated with the nodes and/or edges along the path. Under a given timing constraint, the delay budget at each node is the extra delay the component can tolerate such that no timing constraint is violated. A similar definition can be applied for the budget of an edge. The budget of each node/edge is related to the timing slack of the node/edge. If there is any node or edge with negative slack, the timing constraint is violated. However, due to the dependency between the nodes, the total timing slack of the nodes/edges is not the total budget that the nodes/edges can tolerate. In Fig. 1, two different methods of delay budgeting (A and B) are applied on a DAG. Columns "Budget A" and "Budget B" of the table correspond to the excess delay (delay budget) assigned to each node under the timing constraint (13 ns) in approach A and approach B. After applying either budgeting A or B on the graph, no other node can tolerate any excess delay. The total delay budget after budgeting A is 17, while the total delay budget after budgeting B is 12.
In this paper, we study the problem of the assignment of maximum total budget in a graph. The budgeting problem in a graph is well studied in theory and practice and is widely used in today's industry and research. Delay budgeting has several applications in design optimization as follows:
• Timing-driven placement and floor planning: Delay budgeting during placement and floor planning has been extensively studied by several researchers [7]-[10]. In timing-driven placement, the goal is to optimize the path delays with fewer iterations. Delay budget is assigned to edges in the graph. Per-net delay bounds are considered in order to have a better distribution of delay budgets in the graph. In [6] and [7], placement and rebudgeting are combined. The optimization problem of budgeting on the edges in a graph is formulated as a piecewise linear objective function and solved using a modified graph-based Simplex algorithm [1].
• Gate/wire sizing and power optimization: Under a timing constraint, the gate-sizing problem is to find a set of nodes/edges in the graph such that their physical size can be reduced by mapping to smaller cell instances with larger delays from a target library [17], [18]. In general, delay budgeting can be applied during the library mapping stage. The delay budget at each node can be exploited to map the node to a smaller cell (or one with lower power consumption) with a larger delay [10].
• VLSI layout compaction: The main objective is to minimize the physical area of the layout. In addition, minimizing wirelength cannot be ignored during the optimization. The concept of budget in such problems is exploited to reduce wirelength [14]. An important constraint in analog IC design is the symmetry constraint in layout. With multiple symmetry constraints, layout compaction is solved using an LP solver [15]. In [16], a graph-based simplex method is applied to improve the runtime of the linear programming algorithm. The LP formulation of compaction is similar to the formulation of the delay-budgeting problem. The space budget is assigned along the $x$ or $y$ axes to leave sufficient space for wiring.
• Exploiting slack in high-level synthesis: There exist several related works in the area of high-level synthesis where the timing slack of the nodes in the data-flow graphs is considered for better optimization in area and power. Examples are the algorithms and techniques developed for area minimization in pipelined datapaths [21], power minimization under a timing constraint [19], [20], etc. In [21], the design entry is a pipelined datapath. In the problem formulation, there is a set of constraints regarding the number of registers and the depth of pipeline stages, which are not considered in budgeting on DAGs. All the proposed algorithms are heuristic suboptimal algorithms.
There are heuristic algorithms in the literature and industry to solve the delay-budgeting problem, such as MISA [3] and the zero-slack algorithm (ZSA) [4]. In maximum delay budgeting, the objective is to maximize the value of an expression, which is a function of the budgets associated with the nodes/edges in a graph. The most popular and efficient algorithm for delay budgeting is ZSA [4], [5]. Its solution is not optimal and can be far away from the optimal result. The MISA algorithm proposed in [3] finds the total budget in the graph with a more sophisticated and intuitive technique using the maximum independent set in the graph. The MISA algorithm finds a potential slack, which strongly correlates with the total budget in the graph. However, neither the ZSA nor the MISA algorithm can solve the budgeting problem optimally.
In this paper, we focus on the theoretical study of the integer delay-budgeting problem on the nodes in a DAG. The objective of our delay-budgeting problem is to maximize the total delay budget of the nodes under a given timing constraint. The general problem can be formulated as a linear programming (LP) problem. However, the solution can have fractional values and needs to be normalized. An optimal integer solution is preferred for the following reasons. First, the space/timing budget is mostly a discrete value, especially at higher levels of abstraction. For example, delay on interconnect is discrete in grid-based global routing. At the datapath level, the latency of each component is given in terms of the number of clock cycles under a given frequency. The delay of gates can be scaled to integer values. In VLSI compaction, grid constraints require an integer solution [12]. Second, the budget at each node is mostly used to map the subdesign to another component in a target library, which inherently is discrete rather than continuous. Hence, in the formulation of the delay-budgeting problem, we assume that the variables associated with the budgets are all integers. The ZSA and MISA algorithms can be modified to generate integer budgets, but with no guarantee on the optimality of the solutions.
The complexity of the integer delay-budgeting problem on DAGs has been an open problem since the budgeting problem was first formulated by Wong et al. in [11]. Applying rounding techniques to the LP optimal solution of the budgeting problem cannot preserve the optimality of the integer solution. In this paper, we propose a novel, efficient graph-based transformation technique to produce an optimal integer solution from the optimal LP solution. We prove that the integer delay-budgeting problem can be solved optimally in polynomial time by transforming the LP relaxation solution into an integer solution. The preliminary version of this work was published at DAC'03. In this paper, we describe the detailed analysis of our delay-budgeting algorithm. In addition, we look at two extensions of the delay-budgeting problem: maximization of a weighted sum of the delay budgets assigned to the nodes, and additional constraints on the lower and upper bounds of the delay budget allocated to each node. We prove that in both extensions, our algorithm can produce an optimal integer solution in polynomial time.
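The remark on rounding can be illustrated with a small sketch. The four-node instance below (two drivers `p1`, `p2`, each feeding both sinks `c1`, `c2`, with unit delays and a required time of 3) and the half-unit fractional budgeting are hypothetical and chosen only for illustration. In this instance every driver/sink pair can share at most one unit of budget, so the total budget cannot exceed 2.

```python
import math

# Hypothetical instance: every parent drives every child; all delays are 1.
PARENTS, CHILDREN = ["p1", "p2"], ["c1", "c2"]
DELAY, T = 1, 3

def is_feasible(budget):
    """Check that the longest input-to-output path meets the required time T."""
    latest_parent = max(DELAY + budget[p] for p in PARENTS)
    return all(latest_parent + DELAY + budget[c] <= T for c in CHILDREN)

def total(budget):
    return sum(budget.values())

# A fractional budgeting that attains the upper bound of 2 for this instance.
fractional = {"p1": 0.5, "p2": 0.5, "c1": 0.5, "c2": 0.5}
rounded_down = {v: math.floor(b) for v, b in fractional.items()}
rounded_up = {v: math.ceil(b) for v, b in fractional.items()}
integer_opt = {"p1": 0, "p2": 0, "c1": 1, "c2": 1}

print(total(fractional), is_feasible(fractional))      # 2.0 True
print(total(rounded_down), is_feasible(rounded_down))  # 0 True
print(total(rounded_up), is_feasible(rounded_up))      # 4 False
print(total(integer_opt), is_feasible(integer_opt))    # 2 True
```

Rounding every budget down stays feasible but loses the whole budget, and rounding up violates the timing constraint, whereas an integer budgeting with the same total as the fractional one does exist.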
We apply the delay-budgeting technique in the library-mapping stage. Mapping subdesigns to pre-existing preoptimized and synthesized components is an unavoidable scheme to abstract away the complexity of the given design to be optimized. For faster compilation and exploiting the architectural features of FPGAs, FPGA vendors provide a relatively rich intellectual property (IP) library of arithmetic functions and application-specific operations such as MAC, FFT, and DCT in the DSP domain. Along with this growth, design automation and synthesis flows need to be able to exploit the existing libraries with a better design planning. We apply our methodology and technique for timing-budget management during library mapping at datapath level. The datapath of a given application at behavioral level is mapped to the components in a library customized for the target programmable architecture (e.g., FPGA libraries). We applied the timing-budgeting algorithm in selecting the components of the library and mapping to different components of the application such that the design complexity is reduced without violation of timing constraints. Using IP library of FPGAs, we show that delay budgeting resolves the tradeoff between latency of a datapath and area of hardware resources. Our empirical results show that delay budgeting yields a solution with smaller area and faster design time compared to the case in which no delay budgeting is applied. We compare our proposed optimal delay-budgeting algorithm with ZSA, the well-known suboptimal heuristic algorithm. The decrease in complexity of datapath improves the runtime of place and route stage, which is the most time-consuming stage in mapping an application on FPGA platforms. Our experimental results show the effectiveness of budgeting in library mapping.
The rest of the paper is organized as follows. In Section II, the problem is formally defined. In Section III, the delay-budget reassignment is proposed. Applying delay-budget reassignment on the LP solution of the budgeting problem is described in Section IV, and it is proven that the final solution is integer and optimal. In Section V, two different extensions to the formulation of integer delay-budgeting problem are presented on which our algorithm can be applied to produce an optimal integer solution. In Section VI, the experimental results on tradeoff between latency and area by budgeting technique in FPGA platform are presented. In Section VII, the conclusion and some possible future directions are outlined.
II. LP FORMULATION OF DELAY-BUDGETING PROBLEM
In a directed graph $G = (V, E)$, edge $e_{ij}$ is incident to node $v_j$ and incident from node $v_i$. $E_i^{in}$ is the set of incoming edges to node $v_i$. $E_i^{out}$ is the set of outgoing edges from node $v_i$. Primary inputs (PIs) are the nodes with no incoming edges. Primary outputs (POs) are the nodes with no outgoing edges. Associated with each node $v_i$, there is a delay variable $d_i$. Assume node $v_i$ drives node $v_j$, i.e., there is an edge $e_{ij}$ in graph $G$. If the data or signal at the output of node $v_i$ is ready at time $t$, the output of node $v_j$ is ready at time $t + d_j$ at the earliest. Let $b_j$ be the extra potential delay assigned to node $v_j$. Hence, the output at $v_j$ will not be ready before $t + d_j + b_j$.
Arrival time: The arrival time at node $v_i$ is defined as the maximum total-path delay among all the paths from PI nodes to node $v_i$. If the input to the PIs of graph $G$ is ready at time 0, the output of node $v_i$ is ready at $a_i$. $a_i$ is computed as
$a_i = \max_{e_{ji} \in E_i^{in}} (a_j) + d_i + b_i$ (1)
where the maximum is taken as zero when $v_i$ is a PI. The arrival time at a PO is the maximum summation of the delay budget and the intrinsic delay associated with each node along the path from the PI to the PO. The arrival time at each PO cannot exceed a fixed value $T$. This is referred to as the required time at the POs. Although the required times at the POs and the arrival times at the PIs can be different, for simplicity, we assume that the arrival time at each PI is zero and the required time at the POs is $T$.
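The arrival-time recurrence in (1) can be evaluated in a single pass over a topological order of the DAG. The following is a minimal sketch under that reading; the function name `arrival_times`, the dictionary-based graph encoding, and the four-node example are assumptions made for illustration.

```python
from collections import defaultdict, deque

def arrival_times(nodes, edges, delay, budget):
    """Arrival time of each node: latest predecessor arrival + own delay + own budget."""
    preds, succs = defaultdict(list), defaultdict(list)
    indegree = {v: 0 for v in nodes}
    for u, v in edges:
        preds[v].append(u)
        succs[u].append(v)
        indegree[v] += 1
    # Kahn's algorithm visits the nodes in topological order.
    queue = deque(v for v in nodes if indegree[v] == 0)
    arrival = {}
    while queue:
        v = queue.popleft()
        latest_input = max((arrival[u] for u in preds[v]), default=0)  # 0 at a PI
        arrival[v] = latest_input + delay[v] + budget[v]
        for w in succs[v]:
            indegree[w] -= 1
            if indegree[w] == 0:
                queue.append(w)
    return arrival

# Hypothetical diamond-shaped DAG: a -> b -> d and a -> c -> d.
nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
delay = {"a": 2, "b": 3, "c": 1, "d": 2}
zero = {v: 0 for v in nodes}
print(arrival_times(nodes, edges, delay, zero))  # {'a': 2, 'b': 5, 'c': 3, 'd': 7}
```

In this example node `c` can absorb a budget of 2 without changing the arrival time at the output `d`, because the path through `b` is the longer one.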
Delay [2]. By total unimodularity (TU) of the coefficient matrix, every extreme point of the LP relaxation is integral, regardless of the objective function.
Theorem 1: The linear programming relaxation of the integer delay-budgeting problem gives an optimal integer solution if the input graph is a directed path.
The aforementioned sufficient condition does not necessarily hold for the general DAG other than a directed path. In the following sections, we prove that the integral delay-budgeting problem can be solved optimally in polynomial time, using the solution of the linear programming relaxation problem.
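One way to write the relaxation so that it agrees with the arrival-time definition of this section is sketched below. The symbols ($a_i$ for arrival time, $b_i$ for budget, $d_i$ for intrinsic delay, $T$ for the required time at the POs) and the arrangement of the constraints are assumptions made for illustration and may differ from the numbered formulation referred to in the text.

```latex
\begin{align*}
\text{maximize}\quad   & \sum_{v_i \in V} b_i \\
\text{subject to}\quad & a_j \ge a_i + d_j + b_j, && \forall e_{ij} \in E, \\
                       & a_i \ge d_i + b_i,       && \forall v_i \in \mathrm{PI}, \\
                       & a_i \le T,               && \forall v_i \in \mathrm{PO}, \\
                       & b_i \ge 0,               && \forall v_i \in V.
\end{align*}
```

The integer problem adds the requirement $b_i \in \mathbb{Z}$ for every node; the relaxation drops it.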
III. DELAY-BUDGET REASSIGNMENT
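In this section, we first define the maximal budgeting on a given directed graph $G$ with required time $T$ at the POs. The arrival time of any node cannot exceed $T$. Otherwise, the dependency constraints in (6) are not satisfied. Some basic definitions used in this section are as follows.
Definitions: The required time at node $v_i$, $r_i$, is computed as $r_i = \min_{e_{ij} \in E_i^{out}} (r_j - d_j - b_j)$ for $v_i \notin \mathrm{PO}$, and $r_i = T$ for $v_i \in \mathrm{PO}$. $T$ is the required time at the POs in graph $G$. The slack at node $v_i$ is $s_i = r_i - a_i$. The value of the a-slack for edge $e_{ij}$ is computed as $a_j - d_j - b_j - a_i$, $\forall e_{ij} \in E$. Similarly, the r-slack of $e_{ij}$ is computed as $r_j - d_j - b_j - r_i$, $\forall e_{ij} \in E$. Edge $e_{ij}$ is said to be critical if the a-slack value and r-slack value associated with edge $e_{ij}$ are zero. A path in a graph that includes only critical edges is called a critical path. In a directed graph $G$, if the slacks of the nodes at the two ends of edge $e_{ij}$ are equal, i.e., $s_i = s_j$, then the a-slack and the r-slack of $e_{ij}$ are equal. $\delta_{ij}$ is used to refer to this value as the slack of edge $e_{ij}$.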
In any budgeting on graph , the slack of each node/edge must be nonnegative. This is referred to as feasibility in the graph. A graph with budgeting is not feasible if the timing slack of any node/edge is negative.
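These quantities can be computed with one forward and one backward pass over the graph. The sketch below assumes the usual forms of the definitions: the required time of a node is the minimum over its successors of the successor's required time minus the successor's delay and budget, node slack is required time minus arrival time, and the a-slack and r-slack of an edge compare the arrival (respectively required) times at its two ends. The function name `timing` and the example graph are hypothetical.

```python
def timing(nodes, edges, delay, budget, T):
    """Return (slack, a_slack, r_slack); `nodes` must be in topological order."""
    preds = {v: [] for v in nodes}
    succs = {v: [] for v in nodes}
    for u, v in edges:
        preds[v].append(u)
        succs[u].append(v)
    arrival, required = {}, {}
    for v in nodes:  # forward pass: arrival times
        arrival[v] = max((arrival[u] for u in preds[v]), default=0) + delay[v] + budget[v]
    for v in reversed(nodes):  # backward pass: required times (T at the POs)
        required[v] = min((required[w] - delay[w] - budget[w] for w in succs[v]), default=T)
    slack = {v: required[v] - arrival[v] for v in nodes}
    a_slack = {(u, v): arrival[v] - delay[v] - budget[v] - arrival[u] for u, v in edges}
    r_slack = {(u, v): required[v] - delay[v] - budget[v] - required[u] for u, v in edges}
    return slack, a_slack, r_slack

# Hypothetical diamond-shaped DAG with required time T = 7 at the output.
nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
delay = {"a": 2, "b": 3, "c": 1, "d": 2}

slack0, a0, r0 = timing(nodes, edges, delay, {v: 0 for v in nodes}, T=7)
slack1, a1, r1 = timing(nodes, edges, delay, {"a": 0, "b": 0, "c": 2, "d": 0}, T=7)
print(slack0)  # {'a': 0, 'b': 0, 'c': 2, 'd': 0}
print(slack1)  # {'a': 0, 'b': 0, 'c': 0, 'd': 0}
```

A budgeting is feasible when no value in `slack`, `a_slack`, or `r_slack` is negative. With zero budgets, node `c` has slack 2, and the a-slack and r-slack of edge `(c, d)` differ because the slacks at its two ends differ; once `c` receives a budget of 2, every node slack is zero and every edge is critical.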
Maximal Budgeting on Graph $G$: Let $B$ be the set of delay budgets assigned to the nodes in graph $G$, where $B$ is a feasible solution to the budgeting problem on DAG $G$. A feasible solution $B$ is called a maximal budgeting if no more budget can be given to any node without decreasing the budget of some other node.
In graph $G$, if the slack of each node is zero, the corresponding budgeting is a maximal budgeting. In a maximal budgeting, each noncritical edge has equal a-slack and r-slack values. The term $\delta_{ij}$ is used to refer to the slack of noncritical edge $e_{ij}$. The maximum solution is also a maximal solution. A maximal budgeting solution can be obtained by applying different algorithms, such as the MISA [3] and ZSA [4] algorithms.
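The difference between a maximal and a maximum budgeting can be checked by brute force on a tiny instance. The sketch below is illustrative only: the three-node fan-out graph, the helper names, and the unit-step test for maximality (which assumes integer budgets) are assumptions, and exhaustive enumeration is used purely as a reference, not as the algorithm of this paper.

```python
from itertools import product

def longest_path(nodes, edges, delay, budget):
    """Longest path delay; `nodes` must be listed in topological order."""
    arrival = {}
    for v in nodes:
        inputs = [arrival[u] for (u, w) in edges if w == v]
        arrival[v] = max(inputs, default=0) + delay[v] + budget[v]
    return max(arrival.values())

def is_feasible(nodes, edges, delay, budget, T):
    return longest_path(nodes, edges, delay, budget) <= T

def is_maximal(nodes, edges, delay, budget, T):
    """Feasible, and adding one unit to any single node breaks feasibility."""
    if not is_feasible(nodes, edges, delay, budget, T):
        return False
    for v in nodes:
        bumped = dict(budget)
        bumped[v] += 1
        if is_feasible(nodes, edges, delay, bumped, T):
            return False
    return True

def best_integer_budget(nodes, edges, delay, T):
    """Exhaustive search for the feasible integer budgeting with the largest total."""
    best = None
    for combo in product(range(T + 1), repeat=len(nodes)):
        budget = dict(zip(nodes, combo))
        if is_feasible(nodes, edges, delay, budget, T):
            if best is None or sum(combo) > sum(best.values()):
                best = budget
    return best

# Hypothetical fan-out: x drives y1 and y2; unit delays; required time 3.
nodes = ["x", "y1", "y2"]
edges = [("x", "y1"), ("x", "y2")]
delay = {v: 1 for v in nodes}
budget_a = {"x": 1, "y1": 0, "y2": 0}  # maximal, total 1
budget_b = {"x": 0, "y1": 1, "y2": 1}  # maximal, total 2
print(is_maximal(nodes, edges, delay, budget_a, 3))  # True
print(is_maximal(nodes, edges, delay, budget_b, 3))  # True
print(best_integer_budget(nodes, edges, delay, 3))   # {'x': 0, 'y1': 1, 'y2': 1}
```

Both budgetings are maximal, but only the second one is maximum: giving the slack to the shared driver `x` uses it once, while giving it to the two sinks uses it twice.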
Lemma 3: In a maximal budgeting $B$, each node (except PIs and POs) has at least one critical incoming edge and at least one critical outgoing edge.
Proof: By way of contradiction, assume that there is no critical outgoing edge from node $v_i$, where $v_i$ is not a PO. There has to exist at least one outgoing edge from $v_i$. If all the outgoing edges from node $v_i$ are noncritical, then their slacks are not zero. The slack of each node is zero. Therefore, the smallest of these edge slacks can be added to the budget of node $v_i$, while all the arrival-time and required-time constraints for the whole graph are met. Hence, more delay budget is obtained, and this contradicts the definition of maximal budgeting. A similar argument can be applied to prove that at least one incoming edge to each node must be critical.
Now, we propose a budget-reassignment method on a given maximal budgeting.
Feasible Delay-Budget Reassignment on $G$: In a graph $G$ with a maximal budgeting solution $B$, the budgets of the nodes are changed such that the new budgeting $B'$ is still a maximal budgeting. Delay-budget reassignment on graph $G$ transforms the budgeting from solution $B$ to $B'$. A feasible $\epsilon$-delay-budget reassignment on $G$ is a feasible delay-budget reassignment in which the change of budget in each node is either $\pm\epsilon$ or 0. In Fig. 2, an example of feasible delay-budget reassignment on a DAG is shown. After feasible delay-budget reassignment, the budgeting is maximal and feasible.
Assume that in a reassignment of a budget of $\epsilon$ at each node in graph $G$, the total amount of change in the budget of the nodes along each critical path is zero. In this case, the arrival time at each node $v_i$ is changed by $\Delta_i$. Since the budget of each node changes by either $+\epsilon$ or $-\epsilon$, the change of the budget along each critical path from a PI to node $v_i$ is a multiple of $\epsilon$, say $\Delta_i = k_i \epsilon$, $k_i \in \mathbb{Z}$. Theorem 2 presents two sufficient conditions for feasible $\epsilon$-delay-budget reassignment.
Theorem 2: The reassignment of a budget of $\epsilon$ at each node in graph $G$ is a feasible $\epsilon$-delay-budget reassignment if
• the total amount of change in the budget of the nodes along each critical path from PI to PO is zero;
• for each noncritical $\delta$-edge $e_{ij}$ with slack $\delta_{ij}$, $\Delta_i - \Delta_k \le \delta_{ij}$, where edge $e_{kj}$ is critical. $\Delta_i$ and $\Delta_k$ are the amounts of change in total budget along any critical path from PI to node $v_i$ and $v_k$, respectively.
Proof: We prove that after the delay-budget reassignment of $B$, $B'$ is a feasible maximal budgeting. $a_i$ is the arrival time at node $v_i$ before delay-budget reassignment. By induction, it can be shown that the arrival time at $v_i$ is $a_i + \Delta_i$ after delay-budget reassignment. $\Delta_i$ is the total delay-budget reassignment along the critical paths from PI to node $v_i$. As shown in Fig. 3(a), edges $e_{ij}$ and $e_{kj}$ are both critical in graph $G$. After delay-budget reassignment, the arrival time at node $v_i$ is $a_i + \Delta_i$. The arrival time at node $v_k$ is $a_k + \Delta_k$. Since the total change in budget along the critical path from PI to PO through edge $e_{ij}$ is zero, the amount of change in budget along any critical path from node $v_j$ to the PO is $-\Delta_i$. Since the path from PI through nodes $v_k$ and $v_j$ to the PO is also critical, the total change in budget along this path is zero, that is, $\Delta_k - \Delta_i = 0$. Hence, $\Delta_i = \Delta_k$.
Noncritical edges in the graph need to be considered as well. There is a slack of $\delta_{ij}$ associated with each noncritical edge $e_{ij}$. In Fig. 3(b), edge $e_{ij}$ is a noncritical edge and edge $e_{kj}$ is critical. After delay-budget reassignment, edge $e_{kj}$ has to remain critical. Based on the second condition in Theorem 2, the arrival time $a_i + \Delta_i$ cannot become greater than the arrival time $a_k + \Delta_k$. Hence, edge $e_{kj}$ remains critical.
Now, assume node $v_j$ is a PO. The arrival time at node $v_j$ is $a_j + \Delta_j$, where $\Delta_j$ is the total change of budget along the critical paths from PIs to the PO $v_j$. Due to the first condition, $\Delta_j$ is zero. Hence, the arrival time at node $v_j$ does not change after delay-budget reassignment, i.e., feasibility is satisfied. The arrival time at node $v_i$ is $a_i + \Delta_i$. A similar argument can be applied to show that the amount of change in the required time at node $v_i$ is $-\Delta'_i$, where $\Delta'_i$ is the amount of change in the total budget along the critical paths from the POs to node $v_i$. The slack of node $v_i$ after the budget change is $(r_i - \Delta'_i) - (a_i + \Delta_i)$. We have $r_i = a_i$, since $B$ is a maximal budgeting. $\Delta_i + \Delta'_i$ is equivalent to the total change of the budget along the critical path through node $v_i$, which is zero. Hence, the slack of node $v_i$ after the budget change is still zero. Therefore, we have a feasible maximal budgeting.
According to Lemma 2, if a budget of $\epsilon$ is reassigned among the nodes under the aforementioned conditions, another maximal solution on graph $G$ is obtained. We show that the budget exchange between two subgraphs under the child-parent relation satisfies the sufficient conditions and is, hence, a feasible $\epsilon$-delay-budget reassignment in graph $G$.
Parent/Child Relation: In a directed graph $G$, let edge $e_{ij} \in E$ be critical. Node $v_j$ is a child of node $v_i$. $v_j \in \mathrm{child}(v_i)$ is used to refer to $v_j$ as a child of node $v_i$. Node $v_i$ is said to be the parent of node $v_j$. $v_i \in \mathrm{parent}(v_j)$ is used to refer to $v_i$ as a parent of node $v_j$.
Since $v_i$ is in set $P$, node $v_i$ has at least a child in set $C$, say node $v_j$. Therefore, there is an edge from node $v_i$ to node $v_j$, both belonging to $C$. This contradicts Lemma 7.
Let the $\epsilon$-budget exchange in the parent-child set $(P, C)$ be decreasing the budget of the nodes in $P$ by $\epsilon$ and increasing the budget of the nodes in $C$ by $\epsilon$, $\epsilon > 0$.
Lemma 9: In a given parent-child set $(P, C)$ in $G$, if $\epsilon \le \delta_{ij}$, where $\delta_{ij}$ is the slack of the noncritical $\delta$-edge $e_{ij}$ with $e_{ij} \in E$ and $v_j \in C$ (incoming $\delta$-edges to $C$), the $\epsilon$-budget exchange is a feasible $\epsilon$-delay-budget reassignment in $G$.
Proof: We show that the sufficient conditions in Theorem 2 are satisfied during the budget exchange in a parent-child set. Since there is no critical path between any two nodes in $P$ or in $C$, the critical paths in $(P, C)$ consist of two nodes, one in the parent set and the other in the child set. At each parent node $v_i \in P$, the amount of change in the arrival time is $-\epsilon$. At each child node $v_j \in C$, the amount of change in the arrival time is zero. Hence, along each critical path in this subgraph, the total amount of change in the budget is zero. In addition, there is no change in budget or arrival time at any other node outside the parent-child set in graph $G$. Therefore, the first sufficient condition in Theorem 2 is satisfied. The $\delta$-edges can be categorized based on where the two ends of the edges are located. Fig. 4 shows all such possible edges with respect to a given parent-child set. At each child node $v_j \in C$, the amount of change in the arrival time is zero. Therefore, the arrival time at a child node and, hence, the criticality of the edges connecting the child nodes to the rest of the graph remain unchanged. Similarly, the criticality of the incoming edges to the parent nodes is unchanged after the budget exchange, i.e., $\delta$-edges 3 and 2 remain noncritical. The inequality of Theorem 2 is satisfied as well. For $\delta$-edge 4, the inequality holds since $\epsilon \le \delta_{ij}$ for all incoming $\delta$-edges to the child set. There cannot exist any $\delta$-edges between two parent nodes, two child nodes, or between a child and a parent node. Hence, the second sufficient condition in Theorem 2 is satisfied.
Similarly, the budget can be increased by $\epsilon$ in the parent set and reduced by $\epsilon$ in the child set; this is the reverse budget exchange in $(P, C)$. Lemma 9 can be adjusted to apply to the reverse budget exchange on the parent-child set as well. In this paper, we apply the $\epsilon$-budget exchange from the parent set to the child set on a given parent-child set. In the next section, we apply $\epsilon$-delay-budget reassignment on the LP solution, which is a maximal budgeting on $G$, in order to obtain the integer solution.
IV. INTEGER SOLUTION TO DELAY-BUDGETING PROBLEM
$B^*$ is the optimal solution to the linear programming relaxation of the integer delay-budgeting problem. $B^*$ is also a maximal budgeting. Hence, delay-budget reassignment is applicable to $B^*$. In addition, since $B^*$ is the optimal solution, $|B^*| \ge |B|$ for any maximal budgeting $B$, where $|B|$ denotes the total budget $\sum_i b_i$. We define $\epsilon$ in the $\epsilon$-delay-budget reassignment on graph $G$ such that the budgets of all the nodes become integers. We show that during this transformation from the optimal solution $B^*$ to the integer solution $B^I$, the objective value of the new solution is equal to $|B^*|$.
Integral sequence: A sequence of nodes along a critical path in $G$ is called an integral sequence if the arrival times at the two ends of the sequence are integers.
Lemma 10: The total budget of the nodes along any integral sequence in $G$ is an integer if $d_i \in \mathbb{Z}$, $\forall v_i \in V$.
Proof: Since the arrival times of the nodes at the two ends of an integral sequence are integers, $\sum_i (d_i + b_i) \in \mathbb{Z}$ over the nodes of the sequence. Since each $d_i$ is an integer, $\sum_i b_i \in \mathbb{Z}$.
Corollary 1: The total budgeting on any critical path from PI to PO is integral.
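The argument of Lemma 10 can be written out as a short derivation. The notation is an assumption made for illustration: the sequence is $v_1, \ldots, v_k$ along a critical path, $a_0$ is the integer arrival time at the point where the sequence starts (zero if it starts at a PI), and $a_k$ is the integer arrival time at its last node.

```latex
\begin{align*}
a_i &= a_{i-1} + d_i + b_i, \quad i = 1, \ldots, k
      && \text{(consecutive nodes on a critical path)} \\
a_k - a_0 &= \sum_{i=1}^{k} (d_i + b_i)
      && \text{(telescoping sum)} \\
\sum_{i=1}^{k} b_i &= (a_k - a_0) - \sum_{i=1}^{k} d_i \in \mathbb{Z}
      && \text{(since } a_0,\, a_k,\, d_i \in \mathbb{Z}\text{)}
\end{align*}
```

Corollary 1 is the special case in which the sequence runs from a PI, where the starting time is zero, to a PO, where the arrival time on a critical path equals the required time, assuming an integer required time.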
Based on Lemma 10, each node with a fractional budget belongs to an integral sequence. Hence, it is sufficient to reassign the fractional budgets only among the nodes within an integral sequence. On the other hand, in graph $G$, there are several integral sequences connected to each other. In reassigning the budget between the nodes, the required conditions in Theorem 2 have to be satisfied in all those sequences. Hence, the goal is to apply delay-budget reassignment of the fractional budgets on the nodes in graph $G$ in order to obtain an integer solution. Since the delay-budget reassignment needs to be applied between the nodes with fractional budgets, we reduce graph $G$ to graph $G_f$, the fractional adjacency graph defined as follows.
Fractional Adjacency Graph: Graph $G_f$ is the fractional adjacency graph corresponding to a given graph $G$. The nodes in graph $G_f$ are the subset of nodes in graph $G$ that have noninteger (fractional) budgets. A critical edge between two nodes in graph $G_f$ represents the existence of a directed critical path between the two nodes in graph $G$, such that there is no fractional budget along the path and the arrival time of each node along the path is not an integer. There is a noncritical $\delta$-edge between two nodes $v_i$ and $v_j$ if there is no critical path between the two nodes, but there is at least one path with $\delta$-edges along it. Among all the different paths between the two nodes, the minimum of the total value of the $\delta$-edges along each path is the value of the $\delta$-edge in graph $G_f$. Two adjacent nodes $v_i$ and $v_j$ in graph $G_f$ represent two consecutive nodes with fractional budgets on a directed critical path in graph $G$, both belonging to the same integral sequence. Fig. 5 demonstrates a budgeted DAG and the corresponding fractional adjacency graph.
The $\epsilon$-delay-budget reassignment is applied on graph $G_f$ such that the budgets of all the nodes become integers. Only the fractional values of the budgets need to be reassigned in order to obtain the integer solution. Hence, $\epsilon$ is a fractional value less than one. As described in the previous section, feasible budget reassignment can be applied on a parent-child set on graph $G$. A similar argument can be applied to graph $G_f$ as follows.
Lemma 11: In graph $G_f$, if nodes $v_i, v_j \in P$, the fractional values of the arrival times at both nodes are equal, i.e., $f(a_i) = f(a_j)$, where $f(x) = x - \lfloor x \rfloor$.
Proof: Assume $v_i, v_j \in P$. Let $v_k$ be the child node of both nodes $v_i$ and $v_j$. The fractional value of the arrival time at node $v_k$ is equal to the fractional value of the sum of the fractional value of the arrival time at node $v_i$ and the fractional value of the budget at node $v_k$. The budgets of the nodes along the critical path from $v_i$ to $v_k$ in graph $G$ are all integers. Similarly, the fractional value of the arrival time at node $v_k$ is equal to the fractional value of the sum of the fractional value of the arrival time at node $v_j$ and the fractional value of the budget at node $v_k$. Hence, $f(a_i) = f(a_j)$. If $v_i$ and $v_j$ do not share a child node, due to transitivity in the parent relation, we still have $f(a_i) = f(a_j)$.
Lemma 12: If nodes $v_i, v_j \in P$ in graph $G_f$ and there is a directed critical path between nodes $v_i$ and $v_j$ in graph $G$, there has to exist at least one node on the path between the nodes $v_i$ and $v_j$ in graph $G_f$.
Proof: Assume the path between nodes $v_i$ and $v_j$ in graph $G_f$ is a direct edge. Let node $v_k$ be the child node of nodes $v_i$ and $v_j$. There are two paths from node $v_i$ to $v_k$: one is the direct edge $e_{ik}$ and the other is the path $v_i \to v_j \to v_k$. See Fig. 6. The fractional value at node $v_k$ from the first path is $f(f(a_i) + f(b_k))$ and from the other path is $f(f(a_i) + f(b_j) + f(b_k))$. According to Lemma 11, these two values need to be equal. This is possible if and only if $f(b_j) = 0$, which contradicts that $v_j$ is a node of $G_f$. Therefore, there has to exist at least one node, say $v_l$, on the path from $v_i$ to $v_j$ such that the total fractional value of the budgets on the path, including $v_j$, is integral. Similarly, if the two nodes $v_i$ and $v_j$ do not have the same child, we can prove that the total fractional value on the path from $v_i$ to $v_j$, including $v_j$, needs to be integral, i.e., there has to exist at least one node on the path between $v_i$ and $v_j$. Fig. 7 shows such a case.
The set $P$ is the set of nodes in graph $G_f$ such that each node shares at least one common child with another node in $P$. The set $C$ is the set of nodes in which each node shares a parent with at least one other node in the set. In Fig. 8, a parent-child set in $G_f$ is shown.
Lemma 13: Sets $P$ and $C$ do not intersect.
Proof: Assume that the two sets intersect, i.e., there is a node $v_i$ belonging to both sets. Since node $v_i$ is in set $C$, it has at least one parent, say $v_j$. Therefore, there is an edge from node $v_j$ to node $v_i$. On the other hand, $v_i \in P$. That is, there is a direct edge between two nodes $v_j, v_i \in P$. This contradicts Lemma 12.
On a given parent-child set in graph , we apply -budget exchange. If fractional budget in graph are reassigned by the delay-budget reassignment on the parent-child set, the fractional budget is removed from each parent node and reassigned to one of its successors in the graph. Hence, the fractional budgets are reassigned from PIs to POs, in one direction within an integral sequence. There are -edges in a given graph . In order to have a feasible delay-budget reassignment on the parent-child set, we show that the sufficient conditions outlined in Theorem 2 are satisfied in a given graph as well. Lemma 14: -budget exchange on a parent-child set in graph is a feasible -delay-budget reassignment if , where is -edge. is an incoming edge to a child set. is the fractional value at parent nodes.
Proof: In Fig. 8 , a set of the parent-child set is shown in a given graph . In a budget exchange on the set, there is an alternative budget exchange along each critical edge in graph .
corresponds to a total change of budget along the critical paths from to node . At each child node , the corresponding is zero. At each parent node , the corresponding when budget in parent set is decreased by . Hence the first sufficient condition in Theorem 2 is satisfied. We prove that as long as , the budget exchange is a feasible -delay-budget reassignment on a given graph . There are eight possible type of -edges with respect to . At each edge, we check the inequality defined in Theorem 2 after budget exchange. -edges 2 and 3 will not change since the arrival time at the child set and incoming edges to parent set are not affected by budget exchange. At -edge 1, the inequality is True for any . At -edge 5, the inequality is True for any . At -edge 6, the slack will not change since the arrival time at the child node does not change. At -edge 8, the inequality is True for any . At -edges 7 and 4, the arrival time at the nodes incident from -edges do not change.
is satisfied as well. Therefore, both sufficient conditions in Theorem 2 are satisfied.
If $\epsilon$ is less than the fractional value of the budget in the parent nodes, after delay-budget reassignment, the arrival time at each parent node is reduced by $\epsilon$. Hence, if $\epsilon$ is equal to the fractional value of the arrival time, the arrival time at each parent node is an integer value after delay-budget reassignment. On the other hand, $\epsilon$ needs to be at most as large as the minimum available budget in the parent nodes.
In Fig. 9 , a -edge incident to a child node is shown. Let and be the fractional value of the arrival time at nodes and , respectively. In -delay-budget reassignment, if , for , is True. Assume . The value of is computed as follows:
When , . Since , . Hence, the inequality of Theorem 2 holds. Therefore, the value in -delay-budget reassignment can be computed independent of -edges incident to the child set as follows.
Lemma 15: Let be a parent-child set with , the fractional value at the arrival time at the parent nodes. Assume that is the smallest fractional value of the arrival time at all the nodes in graph . The -budget exchange of from parent nodes to child nodes is a feasible delay-budget reassignment.
Proof: In order to be able to reassign the budget of from the parent nodes, each parent node must have at least a budget of , i.e., , . Assume that there is a node such that , hence . In this case, the arrival time at the parent of node is and this contradicts the fact that the fractional value of the arrival time at no other nodes other than the parent nodes can be as small as . Hence, each . Next, consider -edges connected to . According to Lemma 14, only two types of -edges, -edges 4 and 7 as shown in Fig. 4 , are under the condition that the value of such edges have to be larger than for edges with . According to (7) , since the parent set has the smallest fractional value , . This ends the proof that the delay-budget reassignment is feasible.
After delay-budget reassignment on the parent-child set , the arrival time at each parent node becomes an integer with
. If the budget of any node in a parent or child set becomes an integer, the node is removed from . In this delay-budget reassignment, an integer budget of any node in graph never becomes fractional. Hence, no node is added to graph after delay-budget reassignment. Since the arrival time at a parent node becomes an integer, all the edges connecting the parent nodes to the child nodes are removed from graph . Similarly, no edge is added to graph after delay-budget reassignment.
An important fact is that after delay-budget reassignment, the parent nodes do not have any outgoing edges in the updated graph . Hence, the corresponding nodes cannot become parent nodes anymore. Therefore, we have the following lemma.
Lemma 16: Each node in graph can only appear once in a parent set during the sequential parent-child delay-budget reassignment.
Note that after each -budget exchange, the outgoing edges of the parent nodes are removed. No more outgoing edges are added to parent nodes in since the arrival times at the parent nodes are integers. On the other hand, the integer budget of a node never becomes fractional after any -budget exchange. Since each node can only appear once in a parent set, the number of parent-child sets which can be generated, each followed by delay-budget reassignment on the set, is , where is the set of nodes in graph .
Theorem 3: Sequentially generating parent-child set followed by -delay-budget reassignment in the order of increasing fractional value of the arrival time at the parent nodes of the parent sets with , in iterations. If graph , the budget of all the nodes in graph are integer. Hence, Theorem 3 shows that a maximal integer solution can be obtained from LP solution using -budget exchange on graph . The following lemma proves that during delay-budget reassignment, optimality is preserved.
Lemma 17: In graph corresponding to , if .
Proof: Assume that there are more nodes in one of the sets, say . After delay-budget reassignment of the minimum budget, say , the total budget changes by . This contradicts the optimality of budget in .
Theorem 4: In any feasible -delay-budget reassignment on the parent-child set in graph , the total budget does not change.
Hence, after applying the delay-budget reassignment on , the solution is still optimum. The pseudocode of delay-budget reassignment to produce an optimal integer solution is shown in Fig. 10 .
Each parent-child set construction takes , delay-budget reassignment takes . Updating graph takes . This repeats times. However, by amortized analysis we see that the complexity of during the process applies to a set of edges during the current iteration and then those edges are removed from graph before the next delay-budget reassignment. Hence, the total complexity is . The result is a transformation from solution to a new solution in which integer budget is assigned to each node while objective value does not change, i.e., .
Theorem 5: The solution to linear programming relaxation problem of the integer delay-budgeting problem on graph can be transformed to equivalent integer solution in polynomial time with same objective value.
V. EXTENSION OF AN INTEGER DELAY-BUDGETING PROBLEM
We proved that the maximum integer delay-budgeting problem, as formulated in Section II, is polynomially solvable.
In this formulation, there are two major simplifications. First, the objective function is simply a summation of the delay budgets on the nodes. This means that the objective is independent of the type of the operation on each node. Depending on the type and complexity of the operation at each node, the extra budget can have a different impact. We extend the problem to maximization of the weighted summation of delay budgets assigned to the nodes. We assume that, based on the complexity and type of operation, a nonnegative weight is given for each node. This value determines the rate of relaxation on the structure of the component and/or synthesis effort on the component for each extra unit of delay budget assigned to the node.
The second important simplification in the original problem is that the delay budget assigned to a node can be unlimited. However, in reality, the delay budget can be exploited only within a certain range, and beyond that range it is more beneficial to assign the remaining budget to other nodes in the graph. Both extensions are still integer linear programming problems. The formulation of the extended integer delay-budgeting problem is given by (8)-(13).
The following propositions prove that in both aforementioned extensions of the integer delay-budgeting problem, the optimal integer solution can be obtained using our algorithm.
Lemma 18: For each parent-child set in the fractional adjacency graph corresponding to , the condition is satisfied.
Proof: Assume that the condition does not hold in . Assume . After -delay-budget reassignment of minimum budget, say , the total budget changes by . Hence, the total value of the objective increases and this contradicts the optimality of budget in . Based on this lemma, applying delay-budget reassignment iteratively on does not change the value of objective. Hence, the integer solution is optimal.
Lemma 19: In the budget-reassignment algorithm on , the budget of each node never exceeds the corresponding upper bound on the budget at each node.
Proof: The order in which the parent-child sets are constructed and delay-budget reassignment is applied depends on the fractional value of arrival time at the nodes. Due to this ordering, when budget of a node is increased by in the child set, it is guaranteed that the total budget at the node cannot exceed the upper bound on this node. When delay-budget reassignment is applied on a parent-child set, the fractional value of arrival time at each child node is either zero or greater than the fractional value of arrival time at parent set . After delay-budget reassignment of , the summation of fractional value of the budget at each node and the increase in budget is at most 1. Hence, the total budget at each child node never exceeds its upper bound.
The lower bound on the budget can be added to the original delay of each node and, if the initial solution remains feasible under a given timing constraint, we apply the integer delay-budgeting algorithm to assign the extra delay budget to the nodes.
In this paper, we formulated the delay-budgeting problem as an integer linear programming problem. The main assumption in this problem and its aforementioned extensions is that the objective value increases linearly with any unit of delay budget assigned to each node. However, in several applications, the gain is obtained only if the budget is a certain value. We refer to this problem as the discrete-budgeting problem that can be formulated as a mixed 0-1 integer linear programming problem as follows:
In [22] , it is proved that this problem is NP-hard, and an approximation algorithm on a rooted tree has been proposed. This is out of the scope of this paper, and our experiments in this paper are based only on regular integer delay-budgeting. In the next section, we show that, in the application to which our technique is applied, a unit change in the budget has an almost linear correlation with the objective function. Hence, delay budgeting as formulated and solved in this paper can be applied to solve the delay-budget assignment.
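The weighted, bounded extension discussed in this section can be illustrated on a tiny instance. The following is a minimal sketch, not the paper's algorithm: it enumerates all integer budgets by brute force (exponential time), which is only practical for a handful of nodes but makes the objective and the constraints concrete. The three-node graph, the delays, the weights, the bounds, and the names `longest_path` and `best_weighted_budget` are all hypothetical and chosen for illustration.

```python
from itertools import product

def longest_path(order, preds, delay):
    """Latest finish time over all nodes; `order` must be a topological order."""
    arrival = {}
    for v in order:
        start = max((arrival[u] for u in preds.get(v, ())), default=0)
        arrival[v] = start + delay[v]
    return max(arrival.values())

def best_weighted_budget(order, preds, delay, weight, lower, upper, T):
    """Brute-force search: maximize sum(weight[v] * budget[v]) subject to
    lower[v] <= budget[v] <= upper[v] and longest path <= T."""
    best_value, best_budget = None, None
    ranges = [range(lower[v], upper[v] + 1) for v in order]
    for combo in product(*ranges):
        budget = dict(zip(order, combo))
        padded = {v: delay[v] + budget[v] for v in order}
        if longest_path(order, preds, padded) > T:
            continue  # violates the timing constraint
        value = sum(weight[v] * budget[v] for v in order)
        if best_value is None or value > best_value:
            best_value, best_budget = value, budget
    return best_value, best_budget

# Hypothetical instance: a -> c and b -> c, timing constraint T = 5.
order = ["a", "b", "c"]
preds = {"c": ["a", "b"]}
delay = {"a": 2, "b": 1, "c": 1}
weight = {"a": 1, "b": 3, "c": 2}
lower = {"a": 0, "b": 0, "c": 0}
upper = {"a": 2, "b": 2, "c": 2}
T = 5

value, budget = best_weighted_budget(order, preds, delay, weight, lower, upper, T)
print(value, budget)
```

On this instance the search selects budgets of 1, 2, and 1 for nodes a, b, and c, for a weighted total of 9; the heavier weight on node b pulls budget toward it, and the upper bound of 2 stops it from taking more.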
VI. APPLICATION
The delay-budgeting algorithm is generic and can be applied in different design tasks at different stages of the CAD flow, such as gate sizing in logic synthesis, timing optimization in placement, and library mapping at the datapath level. In this section, we apply integer delay-budgeting in mapping the datapath of an application on an FPGA platform. We describe the experimental setup and then present experimental results on a set of DSP benchmarks. The results show that early management of timing budget on IPs can lead to a faster compilation at the physical implementation level.
A. Delay Budgeting and Core-Based Compilation Flow
The IP components are predesigned and preverified blocks realizing a particular functionality. Designers cannot spend a lot of time regenerating most of the standard functions for future designs. Instead, they try to leverage the existing component designs and reuse them in the development of new applications. Mapping subdesigns to existing preoptimized and synthesized components is an unavoidable scheme in the design automation flow of today's complex designs. Especially in programmable systems such as FPGAs, the design and market of soft IPs are growing rapidly, hence providing a rich library of various functional components. In a programmable system, realizations of IPs are basically the predefined program bits for a subset of the chip that corresponds to the functionality of the IPs. Since there is no fabrication cost, IPs are more cost effective to generate. There is considerable effort and research both in academia and industry to come up with customized functional units for FPGAs. The CoreGen [23] tool provided by Xilinx [24] is a library of parameterizable functional cores for Xilinx FPGA devices. Also, due to the highly constrained and finely grained architecture of FPGAs, efficient implementation of functional units is more challenging and complicated compared to ASIC IP designs. The new generations of FPGAs are getting more and more irregular. There are special architectural features integrated into the device, such as carry chains. The synthesis tool cannot exploit such features efficiently during gate-level logic optimization.
In Fig. 11 , the CAD flow of an IP-based (or core-based) mapping of an application on an FPGA is illustrated. The Xilinx CoreGen tool generates and delivers parameterizable cores optimized for the target architecture. The parameters include data width, registered output, number of pipeline stages, etc. Core layout is specified up front. Cores are delivered with optimally floorplanned layouts. Since CoreGen cores are preoptimized, they are considered as black boxes during the synthesis. Hence, synthesis is ignored in a core-based design. In a rich core library, there can exist several cores realizing the same functionality with different implementation and latency (in terms of clock cycles). Fig. 12 demonstrates a tradeoff between the latency and the area of a CoreGen 16-bit multiplier mapped on a Xilinx VirtexE FPGA. Slices are the logic blocks in the VirtexE FPGA series, which consist of registers, lookup tables (LUTs), and other specific features.
The tradeoff between delay and area is one of the most common relations observed in many library cores. Area and size of a design are additive design metrics. The area of a design is roughly computed by summation over the size of its components. However, during synthesis and optimization, if the design undergoes a lot of optimization across the boundaries between the components, the area of the whole design cannot be estimated as the summation over the size of individual components. Such boundary merging often occurs during logic-level synthesis. At the data-path level, a design is defined as a control-flow graph. A data-flow graph is a DAG. Latency at the output is defined as the delay of the longest path in the graph. Each basic element of a design is an operation which can take multiple cycles to execute. Hence, the operations are mapped to registered cores in the library. Since there are registers between the cores, not much boundary merging and logic-like optimization can be applied. Hence, we can roughly estimate the area of the design as the summation over the size of the individual cores to which the operations are mapped. Therefore, the area can be defined as a linear function of the size of each component.
Due to the dependency between the operations in a data-flow graph under the given latency (as a timing constraint), it is inefficient and almost impractical to choose the cores from the library manually (arbitrarily or by an ad hoc method). Instead, we need to develop a systematic and algorithmic technique for library mapping. This is where our delay budgeting can provide good guidance in library mapping.
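The two estimates described above, latency as the longest path in the data-flow DAG and area as a sum over the mapped cores, can be sketched in a few lines. This is an illustrative sketch only: the operation names, cycle counts, and slice counts below are hypothetical and are not taken from Fig. 12 or from any CoreGen data.

```python
def design_latency(order, preds, cycles):
    """Longest-path latency (in clock cycles) of a data-flow DAG.
    `order` is a topological order; `preds` maps an op to its predecessors."""
    finish = {}
    for op in order:
        ready = max((finish[p] for p in preds.get(op, ())), default=0)
        finish[op] = ready + cycles[op]
    return max(finish.values())

def design_area(slices):
    """Additive area estimate: sum of the sizes of the mapped cores."""
    return sum(slices.values())

# Hypothetical data path: two multipliers feeding one adder.
order = ["m1", "m2", "add"]
preds = {"add": ["m1", "m2"]}
cycles = {"m1": 3, "m2": 2, "add": 1}     # latency of the chosen core per op
slices = {"m1": 120, "m2": 150, "add": 10}  # size of the chosen core per op

print(design_latency(order, preds, cycles))  # longest path m1 -> add
print(design_area(slices))
```

Here the latency is set by the path through `m1` (3 + 1 = 4 cycles), so `m2` could be mapped to a slower, smaller core by one cycle without changing the output latency. That unused cycle is exactly the kind of slack that delay budgeting distributes.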
B. Experimental Setup
We start from a DAG representation of an application. Each node corresponds to a computation in the data path. The benchmarks in our experiments are a set of standard DSP benchmarks. The types of the computations are multiplier, adder, subtracter, divider, and shifter. We assume all the data paths are 16 bits wide. As shown in Fig. 12 , there is almost a linear relation between the latency of the cores in CoreGen and their corresponding size in terms of the number of CLBs and LUTs. Hence, based on the delay assigned to each component in the data-flow graph, it can be mapped to the core which gives a smaller size.
The delay of each computation is defined by a delay-budgeting algorithm under the given latency at the output. Each computation is assigned to a resource generated by the CoreGen tool based on the delay budget allocated to the node. After library mapping and synthesis, the whole circuit is placed and routed on an FPGA device. We used the ISE 4.1 place_and_route tool provided by Xilinx. We conducted two sets of experiments. In the first set, we applied no delay budgeting and mapped the components to the cores with the best latency. In the second set, we applied delay-budgeting algorithms, once our optimal delay budgeting and once a heuristic budgeting (ZSA-like), to distribute the delay budget in the graph.
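The mapping step described above, in which each computation is assigned the smallest core whose latency fits within the delay allocated to it, can be sketched as follows. This is a simplified illustration under stated assumptions, not the flow used in the experiments: the `LIBRARY` table, its latency and slice values, and the function `map_to_core` are hypothetical, and real CoreGen cores have further parameters.

```python
# Hypothetical core library: operation type -> list of (latency_cycles, slices).
# The values are made up for illustration; slower cores are assumed smaller.
LIBRARY = {
    "mul": [(1, 200), (2, 160), (4, 110)],
    "add": [(1, 12), (2, 9)],
}

def map_to_core(op_type, budget):
    """Pick the smallest core whose latency does not exceed the latency of
    the fastest core for this operation plus the assigned delay budget."""
    cores = LIBRARY[op_type]
    fastest = min(latency for latency, _ in cores)
    allowed = fastest + budget
    feasible = [core for core in cores if core[0] <= allowed]
    return min(feasible, key=lambda core: core[1])

# Budgets as they might come out of a delay-budgeting step (hypothetical).
ops = {"m1": ("mul", 0), "m2": ("mul", 1), "a1": ("add", 1)}
mapping = {name: map_to_core(kind, b) for name, (kind, b) in ops.items()}
total_slices = sum(slices for _, slices in mapping.values())
print(mapping, total_slices)
```

With zero budget an operation stays on the fastest core, which corresponds to the no-budgeting baseline. A budget that falls between two library latencies (for example, 2 extra cycles for `mul` here) cannot be fully exploited, because no core with exactly that latency exists in the library.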
C. Experimental Results
The original latency and other characteristics of the benchmarks are given in Table I . The latency is the minimum latency of the data-flow graphs with the fastest core in the library plus one more clock cycle in order to have sufficient delay budget in the applications. The benchmarks are typical of those used in high-level synthesis experiments and research. Table II compares the implemented designs in terms of area when different delay assignments are used before library mapping. In this table, the first set of results corresponds to the original designs with no delay budgeting applied to them. Hence, each operation is mapped to the core in the library with the best delay. Gate count is one of the area metrics reported by the Xilinx mapping tool and corresponds to equivalent gate area. Gate count reflects the logic area of the design. On the other hand, the number of slices and the number of LUTs are the other area metrics, which reflect the real physical area of the design on FPGA devices. Due to optimization techniques during mapping and merging of the subdesigns, the design can get more compact. Hence, we look at both metrics for area to understand the correlation between delay budgeting and area. In this table, it is observed that in all the benchmarks, the area resulting from the optimal delay-budgeting algorithm is smaller than the area resulting from the ZSA algorithm. Also, comparing with the results when no budgeting is applied, we observe that delay budgeting is a useful technique to reduce the redundant complexity in the designs. In this table, the amount of total delay budget inserted into the graphs is also reported. The results show that the solution by ZSA can be far from the optimal solution in some of the benchmarks, such as FDCT. However, in some cases, such as DIFFEQ, the solution is close to optimal.
In Table III , we compare the implemented designs in terms of other design metrics. The physical size of an implemented design on an FPGA device is defined based on the number of LUTs and slices. In some of the benchmarks, we observe similar behavior in terms of the number of LUTs and slices as for the gate count reported in Table II . However, in ARF, the number of slices does not vary much when ZSA is replaced by the optimal algorithm for budgeting. In all benchmarks, the number of slices decreases when delay budgeting is applied, and optimal delay-budgeting reduces the number of slices further than the ZSA algorithm. Two other design metrics evaluated in this experiment are the total compilation runtime from design entry until the end of place_and_route and the maximum clock frequency reported at the end of the design flow. Timing is analyzed after place_and_route. It is interesting to see that the compilation flow can speed up due to a decrease in the complexity of the computations in the application with delay budgeting. In small applications, this speedup is not very significant. In benchmarks ARF and DCT, the speedup is quite significant (up to two times). On average, the optimal algorithm outperforms ZSA by 15% and 4% in terms of total delay budget and gate count, respectively. On average, applying the proposed optimal delay-budgeting during library mapping can reduce the gate count by 12% compared to the case where no budgeting is applied.
The maximum clock frequency is another design metric reflecting the timing characteristic of the implemented design. Note that the latency of each design in all the three sets of experiments is the same in terms of the number of clock cycles. Hence, the faster the clock, the sooner the result is available at the output. By reducing the complexity of the design, the optimization during place_and_route can get more relaxed and better performance can be obtained. However, if timing is affected significantly by the most critical components in the design, the delay budget on noncritical paths may not be helpful for timing optimization. Although the computational components with longer latency may operate with a faster clock, they require more registers, and during place_and_route, register allocation can lead to a reduction in the clock frequency. Timing analysis depends on many other factors. However, in benchmark DCT, the clock frequency increases under the algorithm that assigns the larger delay budget. The topology and connectivity in the applications affect the distribution of the delay budget in the graph. If most of the paths in the graph are critical paths, there is not much timing slack in the graph with which to compare different delay-budget distributions and their effect on component selection and library mapping.
On average, the clock frequency with no delay budgeting is 77.35 MHz. After applying optimal delay-budgeting, the average clock frequency decreases to 75.84 MHz. However, in benchmarks DCT and FDCT, the clock frequency increases. By applying the ZSA delay-budgeting technique, the resulting clock frequency is 77.65 MHz, which is close to the original implementation. Applying optimal budget management, the compilation runtime is 23 s on average, while it is 26.8 s with ZSA delay budgeting and 31.8 s when no delay-budget management is applied. The runtime of delay-budget management added to the compilation flow is negligible compared to the runtime of the place_and_route stage. Applying optimal delay-budget management, the area of the implemented design in terms of the number of LUTs and slices improves by 16% and 18%, respectively, compared to when no delay budgeting is applied.
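For reference, the average runtimes and clock frequencies quoted above translate into the following relative changes. This is only arithmetic on the reported averages; the percentages below are derived here and are not separately reported in the tables.

```latex
\[
\frac{31.8 - 23}{31.8} \approx 27.7\% \quad \text{(runtime reduction, optimal vs. no budgeting)},
\]
\[
\frac{26.8 - 23}{26.8} \approx 14.2\% \quad \text{(runtime reduction, optimal vs. ZSA)},
\]
\[
\frac{77.35 - 75.84}{77.35} \approx 2.0\% \quad \text{(clock-frequency decrease, optimal vs. no budgeting)}.
\]
```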
Based on the reported results, the optimal delay-budgeting algorithm decreases the size of the design, on average, in all three design metrics.
In the second set of experiments, we assume that the timing constraint at the output of each application is the original latency reported in Table I , plus the excess latency applied to the design. Therefore, depending on this excess latency, more timing slack is injected into the graph before applying different delay-budgeting algorithms. Fig. 13 demonstrates the size of the implemented designs in terms of the number of gates when the ZSA and optimal delay-budgeting algorithms are used for delay assignment to the computations in the applications. The horizontal axis is the increase in the latency of the output, shown as delay budget in the figure. The vertical axis is the area of the design after synthesis in terms of equivalent gate count. In all the benchmarks, as the excess latency increases, the optimal algorithm outperforms ZSA more significantly. However, the tradeoff is in the latency of the output, which increases by the excess latency. If the excess latency is very large, there may not exist cores in the library that can use the large delay budget assigned to the components in the design. Hence, the area and other design metrics cannot improve in parallel with the increase in total budget. Fig. 14 shows the change in clock frequency as the excess latency increases and compares the clock frequency when the ZSA and optimal delay-budgeting algorithms are applied. As described before, clock frequency is not an additive function of the components, as the area and size of the design are. Hence, increasing the delay budget can have a different impact on the clock frequency. For example, for the benchmarks DCT and FDCT, the clock frequency resulting from optimal delay-budgeting is greater than the clock frequency resulting from ZSA delay-budget management.
However, in benchmark EWF, the clock frequency resulting from optimal budgeting (82 MHz) is slightly greater than the one resulting from ZSA budgeting (80 MHz). For large excess latency, the difference between the clock frequencies resulting from the ZSA and optimal delay-budgeting algorithms grows. Hence, the impact of optimal delay-budgeting is more visible. Table IV shows the area in terms of the number of LUTs and slices in FPGAs and the compilation runtime for excess delays of 2, 4, and 6 clock cycles. The larger the excess delay, the more delay budget is distributed. Although the budget increases significantly with the excess delay, the improvement in area is not as significant as the increase in budget. In FDCT, there are some multipliers on noncritical paths with a large delay budget, which is not exploited in library mapping. Although the area of applications with optimal delay budgeting is always smaller than the area resulting from the heuristic method, by 10% on average, the runtime of place_and_route in some applications does not speed up. In other benchmarks, such as ARF, place_and_route gets almost two times faster. On average, for an excess-delay budget of six cycles, place_and_route gets faster by a factor of 1.7. Although the speedup in place_and_route runtime was not significant in smaller applications, due to their lower complexity and smaller structure, the effect on place_and_route runtime is more pronounced for larger applications.
As a result, delay budgeting gives the flexibility of mapping the applications to components in the target library with simpler structure and smaller area. Developing complete libraries helps the CAD tool exploit the existing delay budget to improve design quality.
VII. CONCLUSION
General delay budgeting can be solved using a linear programming solver. However, due to numerical instability and the discrete behavior of libraries of components, an integer solution is required. In this paper, using an optimal solution to the LP relaxation of the budgeting problem, we transform the solution to an optimal integer solution. For this purpose, we introduce delay-budget reassignment in a DAG. We reassign the fractional value of the budget associated with the nodes in the graph such that the budget of each node becomes an integer. We prove that during this transformation, the objective value from the optimal LP solution does not change. Hence, an optimal integer solution is obtained in polynomial time. In this paper, we describe the detailed analysis of our algorithm. In addition, we look at different extensions of the delay-budgeting problem, such as maximization of the weighted summation of delay budgets assigned to the nodes and additional constraints on lower and upper bounds on the delay budget allocated to each node. We prove that in both aforementioned extensions, our algorithm can produce an optimal integer solution in polynomial time.
We applied our budgeting technique in the mapping of applications on an FPGA device. We applied the timing-budgeting algorithm in selecting library components and mapping them to the different computations of the application such that the design complexity is reduced without violating timing constraints. Using the IP library of different computations, delay budget is exploited to improve the area and, hence, to speed up the runtime of place-and-route. Our experimental results show the effectiveness of budgeting on IP-based application mapping. Our optimal algorithm significantly outperforms the ZSA algorithm [4] in terms of area and compilation runtime.
Our polynomial algorithm is applied to a general optimal LP solution. Developing a polynomial time graph-based algorithm for integer delay-budgeting is the current problem we are working on. Discrete budgeting is another challenging problem that needs to be studied and intuitive heuristic algorithms need to be developed for variations of this problem. Other future directions are delay budgeting problem in pipelined data paths and resource-shared data paths in IP-based design implementation.
