Abstract: Data driven architectures have significant potential in the design of high performance ASICs. By exploiting the inherent parallelism in the application, these architectures can maximize pipelining. The key consideration in the design of a data driven ASIC is ensuring that throughput is maximized while a relatively low area is maintained. Optimal throughput can be realized by ensuring that all operands arrive simultaneously at their corresponding operator node. If this condition is achieved, the underlying data flow graph is said to be balanced. If the initial data flow graph is unbalanced, buffers must be inserted to prevent the clogging of the pipeline along the shorter paths. A novel algorithm for the assignment of buffers in a data flow graph is proposed. The method can also be applied to achieve wave-pipelining in digital systems under certain restrictions. The algorithm uses a new application of the retiming technique; the number of buffers it produces is shown to be equal to the minimum number of buffers achieved by integer programming techniques. We also discuss an extension of this algorithm which can further reduce the number of buffers by altering the DFG without affecting functionality or performance. The time complexities of the proposed algorithms are $O(V \times E)$ and $O(V^2 \times \log V)$, respectively, a considerable improvement over the existing strategies. Also proposed is a novel buffer distribution algorithm that exploits a unique feature of data driven operation. This procedure maximizes throughput by inserting substantially fewer buffers than other techniques. Experimental results show that the proposed algorithms outperform the existing methods.
• Data driven architectures attempt to exploit the parallelism inherent in the application, thereby providing efficient pipelining capabilities and high throughput [22], [11], [16], [10]. They are asynchronous by nature: operations are computed subject to the availability of data and need not be performed in synchrony with a global clocking signal.
• High throughput DSP algorithms can be implemented effectively by fast prototyping, ensuring low design cycle time [9].
• Data driven architectures with minimum sharing of resources result in much lower energy consumption and are preferred for low-power design [27], [8].
Architecture implementations such as the "Data-Wave" chip [10], PADDI-2 [8], etc., have shown that it is possible to design cost-effective data driven designs with considerable performance improvement. However, in order to maintain their success, we must ensure that the promised performance gains are guaranteed with minimal area.
The behavior of data driven architectures can be captured by using a data flow graph, henceforth DFG. The nodes of the DFG represent the operators and the edges represent the data dependencies between them. Data driven execution is governed by a set of firing rules, where an operator node is said to fire when all of its operands are available and all of its outputs have been consumed. Nodes and buffers communicate with each other by means of a uniform handshaking protocol. An operation node, therefore, is comprised of an execution unit, a handshake controller, and an output buffer to facilitate pipelining. The output buffers allow us to treat each node in the DFG as a stage in a pipeline. Hence, the minimum achievable pipeline period is determined by the slowest node.
A high level synthesis framework for the synthesis of data driven ASICs can be found in [16] , [11] , [26] , [8] . The desired behavior is specified using a DFG. To maximize performance, a one-to-one mapping scheme is adopted, where each node in the DFG is mapped onto its own functional unit in the VLSI implementation. Hence, if there are n nodes in the DFG, there will be n corresponding functional units in the actual VLSI implementation.
When designing a data driven ASIC, it is desirable to achieve high performance while still maintaining low area. To maximize throughput, all operands must arrive simultaneously at their corresponding operator node. If this condition is not satisfied, then a clogging of the pipeline along the shorter path may occur, thereby resulting in a suboptimal pipeline period. When this condition is satisfied, the underlying DFG is said to be balanced. Area can be minimized by proper selection of the node implementations so that the desired performance constraints are not violated. A novel node selection procedure for area minimization has been presented in [18] . This paper proposes two buffer distribution strategies to maximize pipelining in data driven ASICs.
Once the nodes are selected based on area minimization, the underlying DFG may still be unbalanced due to the presence of nonuniform paths. As a result, the design may not be optimally pipelineable. As mentioned earlier, optimal throughput requires that the DFG has no accumulation of data at its nodes, implying a simultaneous arrival of all input data to a multi-input node. Buffers are typically inserted along the shorter paths in order to balance the DFG. There are two methods for assignment of buffers. The most common way to balance a DFG is by selectively placing unit delay buffers in the DFG [4] , [5] , [17] . Alternatively, we can take full advantage of the data driven execution model by adding a handshake controller to each buffer. It would then be possible for a buffer to exhibit more than one unit of delay, thereby drastically reducing the number of buffers needed. In the latter case, the architecture is asynchronous and is composed of execution nodes, consisting of an execution unit and a handshake controller, and buffers, consisting of a register and a handshake controller (Fig. 1) . In both methods, it is desirable to use as few buffers as possible since the buffers are associated with area and performance overheads of their own.
Another problem similar to buffer distribution is wave-pipelining [30], [29] in digital systems at the logic design level. In traditional digital system design, when a new set of values is clocked into a set of registers, the values are allowed to propagate to the next set of registers before the first set is clocked again. The clock cycle, here, is determined by the longest combinational path in the circuit. Wave-pipelining, however, uses multiple coherent waves of data between the registers (Fig. 2), facilitating a higher clock rate. Designing wave-pipelined circuits involves balancing path delays by introducing active delay elements (buffers) between gates. Our methods of buffer distribution are applicable to this problem under some restrictions.
Several researchers have proposed methods using unit delay buffers to balance acyclic DFGs. Leiserson and Saxe [13] demonstrated how synchronous pipelines can be optimized using retiming. The synchronous circuit, modeled as a network of functional elements and globally clocked buffers, is represented by a finite edge-weighted directed multigraph, $G = (V, E)$. The nodes correspond to the functional elements and the edges correspond to the connections between the functional elements. The weight $w(u, v)$ on an edge $(u, v)$ corresponds to the number of buffers between the two functional elements, denoted by $u$ and $v$. The polynomial algorithm proposed in [13] to distribute the buffers in the communication graph and to minimize the clock period of the circuit is based on retiming transformations. A retiming can be viewed as an assignment of a delay to each node in a circuit. It is a process of inserting and deleting delay buffers without affecting the circuit's overall behavior. A retiming of a circuit $G = \langle V, E, w \rangle$ is an integer-valued node labeling $r: V \to \mathbb{Z}$. The retiming transforms the original graph $G$ into a new graph $G_r = \langle V, E, w_r \rangle$, where the weight of an edge, $w_r(u, v)$, is given by $w_r(u, v) = w(u, v) + r(v) - r(u)$. A properly chosen "legal" retiming of a circuit $G$ can result in an optimal clock period, hence maximizing the performance. Leiserson and Saxe [13], [14] proposed an $O(|V| |E|)$ algorithm to determine whether a given clock period is feasible by retiming and an $O(|V| |E| \log |V|)$ algorithm to determine the retiming that gives the minimum clock period. Leiserson and Saxe [14] further proposed techniques for minimizing the total number of buffers or registers in the retimed circuit.
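The retiming transformation described above can be illustrated with a minimal Python sketch. The function names `retime` and `is_legal`, the dictionary representation of the graph, and the three-edge example circuit are assumptions made for illustration; they are not taken from [13]. The sketch applies the edge weight rule of a retiming and treats a retiming as legal when no edge weight becomes negative.

```python
def retime(weights, r):
    """Apply a retiming r to the edge weights of a circuit graph.

    weights: dict mapping an edge (u, v) to its buffer count w(u, v).
    r: dict mapping a node to its integer label; missing nodes default to 0.
    Assumption: a dict keyed by (u, v) cannot represent parallel edges.
    """
    return {(u, v): w + r.get(v, 0) - r.get(u, 0)
            for (u, v), w in weights.items()}


def is_legal(weights):
    """Treat a retiming as legal when no edge has a negative buffer count."""
    return all(w >= 0 for w in weights.values())


# Hypothetical circuit: node b has one fanin edge and two fanout edges.
original = {("a", "b"): 0, ("b", "c"): 2, ("b", "d"): 1}
retimed = retime(original, {"b": 1})  # r(b) = 1, all other labels are 0
```

In this example, one buffer is removed from each of the two edges leaving b and one buffer is added to the single edge entering b, so the total drops from three buffers to two.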
For asynchronous architectures, Chang and Lee [4] formulated the problem as an integer programming (IP) problem, which is NP-hard. They proposed a technique to decompose this large-scale optimization problem into several smaller IP problems. The procedure is still nonpolynomial and is applicable only to DFGs with one input node and one output node. In [5], Boros et al. proposed a polynomial time solution for a rooted graph by using graph duality and network flow theory. They proposed an $O(|V| |E| \log |V|)$ algorithm to balance the rooted network with $|V|$ nodes and $|E|$ edges.
The second method for maximizing throughput entails the use of buffers with handshake controllers. Although these buffers will have the extra overhead of the handshake controllers associated with them, considerably fewer of these buffers will be required to balance the DFG. This is because these buffers are capable of exhibiting more than a single unit of delay as a result of data driven execution. This technique was first described in [26], [15], where the problem was formulated as a quadratic integer programming problem and a nonpolynomial time solution was proposed for it. This paper proposes an $O(|V| \times |E|)$ algorithm [17] for balancing a general acyclic network so that the DFG can operate like a systolic architecture to allow better pipelining. In addition, techniques and heuristics are proposed to further reduce the number of buffers by redistributing the buffers and altering the DFG, while keeping the functionality intact. An efficient algorithm for distributing buffers in large DFGs using extensive handshaking protocols is also proposed. Our results demonstrate the superiority of the proposed schemes. These methods are able to derive implementations with fewer buffers than other methods while requiring lower computational complexity.
The buffer assignment algorithm using unit delay buffers is presented in the following section. The section first gives the essential preliminaries for the problem, followed by the description of the algorithm. A detailed analysis of the proposed algorithm, along with some novel techniques to further reduce the number of buffers, is also presented in that section. Performance results of the proposed algorithms are also summarized. Section 3 proposes a strategy to maximize pipelining in DFGs using buffers with handshaking units. This section further presents performance results, giving a comparison between the two buffer assignment strategies. We conclude in Section 4.
BALANCING DFGs USING UNIT DELAY BUFFERS
In this section, we derive a technique for balancing DFGs using buffers which can provide delay of a single time unit only. In the next section, we take a more generalized approach, where buffers can provide a delay of more than one time unit.
Problem Formulation and Definitions
We start with an unbalanced directed acyclic graph, $G = (V, E)$, having $|V|$ nodes and $|E|$ edges. The nodes correspond to the nodes of the DFG and the edges correspond to the arcs of the DFG. Each edge $(u, v)$ in the graph is associated with a nonnegative integer $w_{u,v}$, which is equal to the sum of the execution time of node $u$ and the time taken for the data to reach node $v$ from node $u$ after node $u$ has finished its task. To balance this graph, we must assign a nonnegative integer $\beta_{u,v}$ to each edge $(u, v)$, which is the number of unit delay buffers that must be inserted between node $u$ and node $v$. The values are chosen such that any two paths, $P_l$ and $P_k$, leading to a multi-input node are of equal path weight, buffers included, and the sum of all the buffers in the graph is minimum. Hence, the IP formulation of the problem can be expressed as
$$\text{minimize} \sum_{(u,v) \in E} \beta_{u,v} \quad \text{s.t.} \quad \sum_{(u,v) \in P_l} (w_{u,v} + \beta_{u,v}) = \sum_{(u,v) \in P_k} (w_{u,v} + \beta_{u,v}) \ \text{for all paths } P_l, P_k \text{ leading to the same node}, \quad \beta_{u,v} \in \mathbb{Z}, \ \beta_{u,v} \ge 0. \quad (1)$$
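The balance condition behind formulation (1) can be checked mechanically. The following Python sketch is only an illustration under stated assumptions: the function name `is_balanced`, the dictionary-based graph representation, and the diamond-shaped example are hypothetical, and all source nodes are assumed to start at time zero. It propagates arrival times in topological order and reports whether every operand of every node arrives at the same time.

```python
from graphlib import TopologicalSorter


def is_balanced(weights, buffers):
    """Return True if all operands of every node arrive simultaneously.

    weights: dict mapping an edge (u, v) to its weight.
    buffers: dict mapping an edge (u, v) to its unit delay buffer count
             (edges missing from this dict carry no buffers).
    Assumption: every source node produces its data at time 0.
    """
    preds = {}
    for (u, v) in weights:
        preds.setdefault(u, set())
        preds.setdefault(v, set()).add(u)
    arrival = {}
    for v in TopologicalSorter(preds).static_order():
        # Arrival time of each operand of v along its incoming edge.
        times = {arrival[u] + weights[(u, v)] + buffers.get((u, v), 0)
                 for u in preds[v]}
        if len(times) > 1:
            return False  # two operands of v arrive at different times
        arrival[v] = times.pop() if times else 0
    return True


# Hypothetical diamond-shaped DFG with two paths of unequal weight to node d.
weights = {("a", "b"): 1, ("a", "c"): 3, ("b", "d"): 1, ("c", "d"): 2}
```

For this example, the path through b has weight 2 and the path through c has weight 5, so three unit delay buffers on the shorter path balance the graph.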
Let $S$ denote the set of source nodes, i.e., the nodes with zero inputs. Similarly, $T$ represents the set of terminal nodes, i.e., the nodes with zero outputs. Each node $v$ with $\text{indegree}(v)$ inputs in the DFG will lead to $\text{indegree}(v) - 1$ linear equations. Hence, the total number of linear equations in the IP formulation is $\sum_{v \in V - S} (\text{indegree}(v) - 1) = |E| - |V| + |S|$, and the total number of variables to be determined is $|E|$. The buffer assignment formulation of (1) therefore has $|E| - |V| + |S|$ equations in $|E|$ variables. It should be noted that the basic solution for a set of linear equations of this nature is not unique and, hence, there can be more than one such solution [20].
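The count of equations can be verified on a small graph. The sketch below is a hypothetical illustration (the function name `count_equations` and the example graph are assumptions): it sums the indegree of every non-source node minus one and compares the result with the closed form in terms of the numbers of edges, nodes, and source nodes.

```python
def count_equations(nodes, edges):
    """Count the balance equations and variables of the IP formulation.

    nodes: iterable of node names; edges: iterable of (u, v) pairs.
    Returns (equations, variables, sources).
    """
    edges = list(edges)
    indegree = {v: 0 for v in nodes}
    for _u, v in edges:
        indegree[v] += 1
    sources = [v for v, deg in indegree.items() if deg == 0]
    # Each non-source node v contributes indegree(v) - 1 equations.
    equations = sum(deg - 1 for deg in indegree.values() if deg > 0)
    return equations, len(edges), len(sources)


# Hypothetical diamond-shaped DFG: a feeds b and c, which both feed d.
nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
equations, variables, sources = count_equations(nodes, edges)
```

Here only node d has more than one input, so a single equation is obtained, in agreement with 4 edges minus 4 nodes plus 1 source node.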
Definition 2.2. An additional variable $d_v$ for every node $v$ is defined which corresponds to the time taken for the last data operand to flow from the source nodes to the node $v$, i.e., $d_v = \max_{(u,v) \in E} (d_u + w_{u,v})$.
Proof. The proof of this theorem can be found in [5]. □
Application of the above theorem transforms the IP formulation to:
The critical path is a path from one of the source nodes to one of the terminal nodes which has maximum path weight, i.e., the path, say $P_k$, is the critical path from node $s_1$ to node $t_1$ if $\sum_{(u,v) \in P_k} w_{u,v}$ is the maximum over all paths from $s_1$ to $t_1$.
Lemma 2.1. The optimal buffer assignment solution will contain no buffers in any of the critical paths, i.e., for any edge $(u, v)$ that belongs to a critical path, the number of buffers $\beta_{u,v} = 0$.
is the data flow graph) and each edge $(u, v)$ in $E'$ is assigned a value $\beta'_{u,v}$ which corresponds to the buffers in the balanced DFG.
Comparison with Wave-Pipelining:
The IP formulation (1) is similar to the wave-pipelining design problem for digital system design. The wave-pipelining problem as derived from [29] is given as follows:
$A_v$ is the latest arrival time at the node $v$.
Similarly, $a_v$ is the earliest arrival time at the node $v$. An upper time bound, $T_v$, is provided on a terminal node. At each terminal node $v$, the latest arrival time $A_v$ can be at most equal to $T_v$. $\Delta_{\text{diff}}$ is the maximum difference between the longest and the shortest delay to any node from the inputs. $\Delta_{\text{diff}}$ is the component of the clock period which can be controlled by buffer insertion. The problem (3) is to be solved subject to the given values of $\Delta_{\text{diff}}$ and $T_v$. All the variables involved here, including $w_{u,v}$, $\beta_{u,v}$, $A_v$, and $\Delta_{\text{diff}}$, are real, thus realizing a Linear Programming [20] problem. The problem (3) can be restricted to our IP formulation (1) under the following conditions:
$\forall (u, v) \in E$. (4)
$T_v$ for the terminal nodes can be used as inputs to problem (3). Another variation of the wave-pipelining problem, using discrete delays instead of continuous delays, can also be compared to our buffer distribution formulation. This problem is NP-complete and the proposed methods can be applied to give a good approximate solution.
Algorithm for Buffer Assignment
The algorithm proposed to determine an optimal distribution of buffers to balance a DFG can be compared to the Simplex Algorithm [20] , which is normally applied for linear programming. Our algorithm is applied to DFGs and the solution is reached in polynomial time for this particular problem. The algorithm consists of two main steps. First, we assign buffers to the DFG to balance the graph. In the second step, we minimize the number of buffers while maintaining the balanced structure of the DFG.
Initial Buffer Assignment
Here, we determine the buffer distribution that balances the DFG. We also ensure that there are no buffers in any of the critical paths. We use the Initial Buffer Assignment procedure to accomplish this. The key steps are summarized in Fig. 3. At first, the nodes are topologically sorted¹ (Step 1) and the nodes are traversed in that order. For each node $v$, we determine $d_v$ using Definition 2.2 (Steps 3 and 4) and, then, buffers are assigned to all edges incident on $v$ (Step 5). It follows from Theorem 2.1 that the DFG will be balanced and that no buffers have been placed along the critical paths. The second condition holds because an edge $(u, v)$ lying in the critical path will satisfy the relation $d_v = d_u + w_{u,v}$.
¹ When nodes are topologically sorted in acyclic directed graphs, there will be no edge going from a node $v$ to another node $u$, where $v$ comes after $u$ in the sorted order.
An example illustrating the assignment of buffers in the DFG is shown in Fig. 4. Topologically sorting nodes in acyclic directed graphs takes $O(|V| + |E|)$ iterations. Steps 2-5 also require $O(|V| + |E|)$ iterations.
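The initial buffer assignment can be sketched in a few lines of Python. This is a minimal sketch rather than the procedure of Fig. 3 itself: the function name `initial_buffer_assignment` and the example graph are hypothetical, and the rule used to place buffers on an edge (the arrival time at the head node, minus the arrival time at the tail node, minus the edge weight) is an assumption inferred from the balance condition, since the statement of Theorem 2.1 is not reproduced here.

```python
from graphlib import TopologicalSorter


def initial_buffer_assignment(weights):
    """Assign unit delay buffers so that all operands arrive together.

    weights: dict mapping an edge (u, v) to its weight.
    Returns (d, buffers): d maps each node to the arrival time of its last
    operand, and buffers maps each edge to its buffer count.
    """
    preds = {}
    for (u, v) in weights:
        preds.setdefault(u, set())
        preds.setdefault(v, set()).add(u)
    d, buffers = {}, {}
    # static_order() yields every node after all of its predecessors.
    for v in TopologicalSorter(preds).static_order():
        d[v] = max((d[u] + weights[(u, v)] for u in preds[v]), default=0)
        for u in preds[v]:
            # Assumed rule: delay each early operand until the last one arrives.
            buffers[(u, v)] = d[v] - d[u] - weights[(u, v)]
    return d, buffers


# Hypothetical diamond-shaped DFG; the critical path runs through node c.
weights = {("a", "b"): 1, ("a", "c"): 3, ("b", "d"): 1, ("c", "d"): 2}
d, buffers = initial_buffer_assignment(weights)
```

In this example, the edges on the critical path through c receive no buffers, while the edge from b to d receives three. Each edge is visited once after the topological sort, which matches the linear iteration count stated above.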
Buffer Minimization
After the initial assignment of buffers, it is possible to further reduce the number of buffers without degrading performance. We use the buffer minimization algorithm to achieve the optimum distribution. The key steps of the algorithm are summarized in Fig. 5. The first step is similar to the retiming technique applied by Leiserson and Saxe [13] for synchronous systems. Here, the graph is traversed from the terminal nodes to the source nodes, i.e., in reverse topologically sorted order. At every iteration, for a node $v$, if $\text{outdegree}(v) \ge \text{indegree}(v)$ and all the outgoing edges of $v$ have at least one buffer, then a number of buffers equal to the minimum over all the edges leaving $v$ is pushed backward (Step 2). An example of pushing back buffers behind a node is shown in Fig. 6. Pushing back $k$ buffers behind a node $v$ is equivalent to a retiming transformation assigning $r(v) = k$ and $r(u) = 0$ for all other nodes $u \in V$. The algorithm proposed by Leiserson and Saxe [13] determines one particular retiming to optimize the clock period in the synchronous circuit. Our approach is to iteratively push buffers behind nodes to reach an optimal distribution.
Buffers are pushed back about single nodes until the source nodes are reached at the end. The order in which the graph is traversed ensures that, once the buffers attached to node v are pushed back, all nodes ahead of v in the topologically sorted order have been checked for pushing back buffers and will no longer satisfy the conditions for pushing back buffers. At the end of Step 2, there will be no more cases in which buffers can be minimized by pushing back buffers about a single node.
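A single reverse topological pass of pushing buffers back about individual nodes can be sketched as follows. This is only an illustration under stated assumptions: the function name `push_back_single_nodes` and the example graph are hypothetical, and source nodes are skipped here as a conservative choice so that the relative timing between different sources is left unchanged.

```python
from graphlib import TopologicalSorter


def push_back_single_nodes(buffers):
    """Push buffers back about single nodes in reverse topological order.

    buffers: dict mapping an edge (u, v) to its buffer count.
    Returns a new dict; the input is not modified.
    """
    buffers = dict(buffers)
    preds, succs = {}, {}
    for (u, v) in buffers:
        preds.setdefault(u, [])
        succs.setdefault(v, [])
        preds.setdefault(v, []).append(u)
        succs.setdefault(u, []).append(v)
    order = list(TopologicalSorter(preds).static_order())
    for v in reversed(order):
        out_edges = [(v, x) for x in succs[v]]
        in_edges = [(u, v) for u in preds[v]]
        # Skip terminal nodes, source nodes (assumption), and nodes with
        # fewer outgoing than incoming edges.
        if not out_edges or not in_edges or len(out_edges) < len(in_edges):
            continue
        k = min(buffers[e] for e in out_edges)
        if k == 0:
            continue  # some outgoing edge has no buffer to give up
        for e in out_edges:
            buffers[e] -= k
        for e in in_edges:
            buffers[e] += k
    return buffers


# Hypothetical fragment: node v has one input edge and two buffered outputs.
before = {("u", "v"): 0, ("v", "x"): 1, ("v", "y"): 2}
after = push_back_single_nodes(before)
```

In this example, one buffer is removed from each of the two edges leaving v and one buffer is added to the single edge entering v, reducing the total from three buffers to two.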
In the remainder of the algorithm, we try to reduce the number of buffers by pushing back buffers about a cluster of nodes. To this end, the BDG, $G' = (V', E')$, is constructed from the DFG and the present buffer distribution based on Definition 2.4 (Step 3). We then combine nodes in the BDG that are connected by edges containing no buffers, in an ordered fashion, and check whether pushing back buffers about these merged nodes reduces the total number of buffers in the BDG. Pushing back buffers about a merged node in the BDG is equivalent to pushing back buffers through a series of nodes in the DFG. The BDG is traversed from the terminal nodes in reverse topological order. When the merging of two nodes results in a buffer pushback condition (Step 7), the algorithm stops merging nodes in that traversal (Step 10) and tries to push back buffers about the rest of the nodes in the traversal. Pushing back buffers about a node in the BDG is followed by the pushing back of buffers about the corresponding nodes in the DFG (Step 9). The BDG is traversed iteratively (Steps 5-11) until a complete traversal through the BDG yields no additional merged nodes. The corresponding buffer distribution in the DFG is now an optimal one.
The above algorithm for buffer minimization is illustrated by the example shown in Fig. 7. We observe that, whenever buffers are pushed back about a node, at least one edge with no buffers is created. So, the final traversal across the BDG for merging nodes does not have any buffers being pushed back.
Theorem 2.2. In the buffer minimization algorithm, the maximum number of times the BDG is traversed in the repeat-until loop is less than $|V|$.
Proof. Each time the BDG is traversed within the repeat-until loop, there is at least one merging operation (or the loop will terminate). As the maximum number of nodes that can be merged is less than $|V|$, the theorem follows. □
In the above algorithm, Steps 1, 2, and 3 will take $O(|V| + |E|)$ cycles. The repeat-until loop will need a maximum of $|V| \times O(|V| + |E|)$ cycles. The complexity of the buffer minimization algorithm is therefore $O(|V| \times (|V| + |E|))$, which is equal to $O(|V| \times |E|)$ because $|V| \le |E|$ in this case.
Analysis of the Algorithm
Here, we analyze and justify the steps that have been taken to determine the optimal buffer distribution. We also compare the results at various stages of the algorithm with those in the IP problem. IP methods such as the Primal Cutting Plane Algorithm and the Lexicographic Dual Simplex Algorithm [20] first move away from the integral bounds of the solution and find a basic solution. They then explore adjacent solutions to reach an optimal integral solution, if it exists. Determining an integral basic solution and moving around the solution space forming integral solution vectors is not always possible in a generalized IP problem. However, in our data flow problem, the proposed algorithm determines the integral basic solution and modifies the solution vector by minimizing the number of buffers to converge to an optimal solution.
We define three notations to be used to identify buffer distributions in the analysis:
$D$: Buffer distribution after initial buffer assignment.
$D_m$: Buffer distribution at the end of the buffer minimization algorithm.
$D_o$: Optimal buffer distribution satisfying the IP formulation (1).
Properties of $D$:
Property 2.1. $D$ is a solution to the IP formulation (1).
Property 2.2. $D$ is equivalent to a basic solution of the IP formulation (1).
Proof. At the initial buffer assignment stage, while traversing through each node $v$, at least one of the incoming edges to $v$ (the one with the maximum weight) will have no buffers assigned to it. Thus, there will be at least $(|V| - |S|)$ edges with no buffers, where $S$ is the set of source nodes. The solution set of the IP formulation (1), at the end of initial buffer assignment, will contain at least $(|V| - |S|)$ variables set to zero. The proof follows from Definition 2.1 and the IP formulation. □
Property 2.3. No buffers can be pushed forward about any node in $D$.
Proof. In $D$, at each node $v$, at least one of the incoming edges to $v$ has zero buffers. Hence, no buffers can be pushed forward about $v$. □
Theorem 2.3. The hyperplane that corresponds to the variables of the IP problem given by (1) contains all the basic solutions, as well as the optimal solutions.
Proof. See [20]. □
To each solution of the IP problem, there is a corresponding buffer distribution in the DFG. One can start from a buffer distribution corresponding to the basic solution, $D$, and reach an optimal solution, $D_o$, by shifting buffers about nodes. We have shown this specifically in Theorem 2.5.
Proof. We take a generalized example where two nodes are connected by an edge with zero buffers to prove the theorem. In the diagram shown in Fig. 8, the BDG is broken up into two components. Part B contains all the nodes that have been traversed starting from the terminal nodes and the nodes in part A are yet to be covered. Earlier traversals ensure that all combinations of nodes in B do not satisfy the criteria for pushing back buffers about them. Here, we try to see whether pushing back the buffers outgoing from v has to be zero in order to get a distribution with minimum number of buffers. This ensures that the buffer distribution of the nodes merged to form the single node in the BDG has the minimum number of buffers. This condition can be relaxed where the total number of buffers need not be minimum.
[Figure caption, partial: (d) BDG of (c). (e) The final DFG after buffer minimization. (f) BDG of (e). The first number in each edge denotes its weight, whereas the second number is the number of buffers inserted. The lightly shaded nodes are those about which buffers will be pushed back.]
Theorem 2.5. $D_o$ can be derived from $D$ by successive steps of pushing buffers about nodes.
Proof. To derive a contradiction, we assume that $D_o$ cannot be determined from $D$ by pushing buffers about nodes in the graph. In such a case, there will always be a series of buffer distribution transformations from $D$ to $D_o$, the whole series denoted by $D \to D_0 \to D_1 \to \cdots \to D_o$, where all the transformations, except one, involve pushing buffers about nodes in the graph. Each intermediate buffer distribution is a solution to the IP formulation (1).
(Property 2.3). In the distribution $D_m$, there will be several nodes about which buffers have been pushed back and about which buffers can now be pushed forward. But, pushing forward buffers about these nodes will only increase the number of buffers (Steps 2 and 7). Pushing buffers forward about multiple nodes is equivalent to pushing buffers about single nodes in the BDG. Hence, pushing forward buffers in the distribution $D_m$ will never reduce the number of buffers.
Case 2: Pushing Buffers Back. As the other possibility to derive $D_o$ from $D_m$ is by pushing back buffers, there has to be at least one node, say $v'$ (in the BDG), or a cluster of nodes (in the corresponding DFG) about which buffers can be pushed back to reduce the total number of buffers. The other pushing back operations done before will not prevent pushing back buffers about $v'$ (Theorem 2.6). This implies that the DFG or the BDG should have a node which satisfies the condition for pushing back buffers. The algorithm ensures that this possibility cannot occur.
Proof. The proof follows from Theorems 2.3, 2.4, and 2.7. □
Further Reduction of Buffers by Buffer-Sharing
Once we minimize the number of buffers required for balancing the DFG by satisfying the IP formulation, the number of buffers can be reduced further by sharing output buffers between edges leaving the same node. The method is similar to register minimization adopted in synchronous pipelines [13]. We propose an improved version of that method to maximize the benefits of pushing back buffers and sharing buffers. The buffer sharing algorithm is shown in Fig. 9. Two heuristics applied during buffer sharing are described as follows:
1. Heuristic A: Sometimes, pushing back buffers about nodes whose indegree is greater than their outdegree may allow more buffers to be shared, finally reducing the number of buffers. We define a variable Maxbuffer of a node $v$ as the maximum number of buffers present among all the edges leaving $v$. It gives the maximum number of buffers that its outgoing edges can share. Steps 1-6 of the buffer sharing algorithm, while traversing through the graph in the reverse topological order, determine how many buffers can be pushed behind each node to maximize buffer sharing.
2. Heuristic B: During buffer sharing (Steps 7-11 in the algorithm), we start sharing buffers about nodes in the topological order. Before sharing buffers of the edges leaving a node, we try to pushfront² all the extra buffers of the edges entering the node.
Figs. 10 and 11 illustrate the application of Heuristics A and B, respectively, comparing the number of buffers. A number on an edge denotes the number of buffers on that edge.
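The Maxbuffer quantity used by Heuristic A can be computed directly from a buffer distribution. The sketch below is a hypothetical illustration (the function name `maxbuffer` and the example distribution are assumptions). It also compares the unshared buffer count with the count obtained if all edges leaving a node share one chain of buffers, which is an idealized assumption about sharing and not the procedure of Fig. 9.

```python
def maxbuffer(buffers):
    """Return, for each node, the largest buffer count among its out-edges.

    buffers: dict mapping an edge (u, v) to its buffer count.
    Nodes without outgoing edges do not appear in the result.
    """
    result = {}
    for (u, _v), count in buffers.items():
        result[u] = max(result.get(u, 0), count)
    return result


# Hypothetical distribution: node v drives two edges with 1 and 3 buffers.
buffers = {("u", "v"): 2, ("v", "x"): 1, ("v", "y"): 3}
per_node = maxbuffer(buffers)
unshared_total = sum(buffers.values())
# Idealized assumption: edges leaving a node tap one shared buffer chain.
shared_total = sum(per_node.values())
```

In this example, the two edges leaving v need four buffers when they are kept separate but only three when the shorter edge taps into the chain of the longer one.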
In addition to the heuristics, the algorithm also uses an efficient method to share buffers about a node. For a node having $e$ outgoing edges, sharing the buffers is started by first sorting the edges in ascending order of the number of buffers on them and then using a divide and conquer strategy to create extra nodes and share buffers. A recursive procedure for sharing buffers, Share-Buffer, is presented in Fig. 12. Fig. 13 gives an illustration of procedure Share-Buffer. The maximum number of edges leaving a node is $|V| - 1$. Here, sharing buffers among edges is equivalent to shifting $\lceil (e - 1)/2 \rceil$ edges to a new node and sharing buffers for two nodes, each with about $e/2$ edges. The complexity of sharing buffers in a node with $e$ outgoing edges, $T(e)$, can be expressed as
$$T(e) = \lceil (e - 1)/2 \rceil + 2T(e/2). \quad (5)$$
The solution of the above equation using the Master Theorem [25] is $O(e \log_2 e)$, and the time complexity of sharing buffers of edges emanating from any node in the DFG is $O(|V| \log_2 |V|)$. Fig. 14 shows the final distribution of the buffers in the DFG from Fig. 7 after sharing the buffers. In the buffer sharing algorithm, Steps 3-5 and Step 6 take $O(|V|)$ comparisons. As sorting the edges and sharing the buffers will take $O(|V| \log_2 |V|)$ cycles, the time complexity of reducing the number of buffers by sharing them is $O(|V|^2 \log_2 |V|)$.
² pushfront does the opposite of what pushback does.
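The growth rate of the divide and conquer cost can be checked numerically. The sketch below evaluates a recurrence of the form described for sharing buffers among $e$ edges, namely the cost of shifting about half of the edges plus the cost of two subproblems of half the size. The function name `share_cost`, the base case of zero cost for a single edge, and the use of integer halving are assumptions made for illustration.

```python
import math
from functools import lru_cache


@lru_cache(maxsize=None)
def share_cost(e):
    """Evaluate T(e) = ceil((e - 1) / 2) + 2 * T(e // 2), with T(1) = 0.

    The base case and the integer halving are assumptions of this sketch.
    """
    if e <= 1:
        return 0
    return math.ceil((e - 1) / 2) + 2 * share_cost(e // 2)


# Compare the recurrence with the e * log2(e) bound for a range of sizes.
within_bound = all(share_cost(e) <= e * math.log2(e) for e in range(2, 1025))
```

For powers of two, the recurrence evaluates to exactly half of $e \log_2 e$ under these assumptions; for instance, eight edges give a cost of 12.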
Theorem 2.9. The time complexity for assigning buffers and minimizing them in a DFG using the proposed algorithms is $O(|V| \times |E| + |V|^2 \log_2 |V|)$.
Results
In order to fully appreciate the advantages of the proposed algorithms, we compared them to the existing methods for assigning unit delay buffers. The results of this comparison are summarized in Table 1, which gives each method, the time complexity associated with it, and the number of buffers derived. We observe that the time complexities of the proposed methods are lower than those of the existing methods. The proposed algorithms were applied to various DFGs available from published literature and benchmark sites to demonstrate their effectiveness. The numbers of buffers obtained after the buffer minimization and the buffer sharing stages have been tabulated in Table 2. We observe that, as the size of the DFG increases, the buffer sharing algorithm generally allows for higher reductions in terms of buffers, thereby introducing greater savings in area. It follows that, when the graph size increases, the number of edges leaving each node also increases, thereby making buffer sharing more effective.
DISTRIBUTION OF BUFFERS WITH HANDSHAKE UNITS
In this section, we derive a new method for buffer assignment which is distinct from methods using unit delay buffers. Our method exploits the fact that a node may provide a delay of more than its execution time (see Fig. 1). Similarly, we show that, by adding a handshake controller to the implementation of a buffer, it can provide a delay of more than a single time unit, thereby giving a much stronger tool to balance a DFG. The buffer implementation here utilizes the consequences of the firing rules of a node.
Terminology and Problem Formulation
Once again, we start with an unbalanced DFG. We define the pipeline period of the DFG, $\tau$, to be the largest delay of the nodes in the DFG. Some additional terminology is defined below. (6)
Based on the above definitions, the problem of assigning an optimum buffer distribution using buffers with handshake controllers in the DFG can also be formulated as:
minimize $\sum_{(u,v) \in E} B_{u,v}$ s.t. $t_i \le \tau$ for any buffer or node $i$, where $B_{u,v} \in \mathbb{Z}$, $B_{u,v} \ge 0$, and tasks arrive at every $\tau$. (7)
Note that, in our discussion, we use $\beta_{u,v}$ to represent the number of unit delay buffers, whereas $B_{u,v}$ denotes the number of buffers with handshake controllers.
Definition 3.1. As the maximum time that can be spent in a node or a buffer is $\tau$, another variable $l_v$ for every node $v$ is defined which is equal to the time taken for the last data operand to flow from the source nodes to the node $v$, provided the data operands spent $\tau$ time in each node in the path.
Proposed Strategy
First, we show how to balance two paths of unequal lengths and expand on this method to assign buffers in the DFG maximizing pipelining and throughput. Here, as in the earlier situation, it is assumed that the optimal solution will have no buffers in the critical paths, whereas all other paths will have buffers allocated in their edges. Suppose there is a node, v, with two inputs coming from u I and u P . Without loss of generality, let us assume that the path from a source node to node v is shorter via u I than via u P . This implies that d v d uP w uPYv and l v l u P (using Definitions 2.2 and 3.1). There are two possible cases:
1. $l_{u_1} \ge d_v$: Here, the data operand coming from $u_1$ can wait until the other data operands are available for $v$, without any of the successor nodes of $u_1$ waiting more than $\tau$ units of time, where $\tau$ is the pipeline period. Therefore, no buffers need to be placed between $u_1$ and $v$.
2. $l_{u_1} < d_v$: In this case, buffers have to be assigned between $u_1$ and $v$ to ensure maximum pipelining. The number of buffers required for the path via $u_1$ is given by (8).
Here, all the nodes in the path via $u_1$ keep their data items for the pipeline period $\tau$ (execution and waiting time combined). The remaining time difference can be divided into $(B-1)\tau$ and a time interval less than $\tau$. An extra buffer is needed to provide that extra delay; hence the ceiling function is used. Fig. 15 illustrates a buffer assignment of this nature. The propagation delay of a node is shown on its outgoing edges. Here, $\tau = 2$, $d_g = 5$, and $l_e = 0$. Hence, the optimum number of buffers required to maximize pipelining is $\lceil (5 - 2)/2 \rceil = 2$. Fig. 15a shows the behavior of the DFG without any buffers by its Gantt chart. Fig. 15b and Fig. 15c show the DFGs and their corresponding Gantt charts using one and two buffers. The average pipeline periods using no buffers, one buffer, and two buffers are 5, 2.5, and 2, respectively. Note that three unit delay buffers would be needed to balance the DFG shown in Fig. 15.
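The arithmetic of the Fig. 15 example can be checked with a minimal sketch. The function names and the `extra_delay` parameter are illustrative assumptions, not part of the algorithm in Fig. 16; `extra_delay` stands for the time difference that remains to be absorbed on the shorter path (3 time units in the example, from $5 - 2$), and `tau` is the pipeline period.

```python
import math


def handshake_buffers(extra_delay, tau):
    """Buffers with handshake controllers needed to absorb extra_delay.

    Each such buffer can hold a data item for up to tau time units, so
    the delay splits into (B - 1) full periods plus a remainder shorter
    than tau, which is why the ceiling is taken.
    """
    if extra_delay <= 0:
        return 0
    return math.ceil(extra_delay / tau)


def unit_delay_buffers(extra_delay):
    """Unit delay buffers needed: one per time unit of extra delay."""
    return max(0, extra_delay)


# Values taken from the Fig. 15 discussion: pipeline period 2, difference 5 - 2.
tau = 2
extra_delay = 5 - 2
print(handshake_buffers(extra_delay, tau))  # 2
print(unit_delay_buffers(extra_delay))      # 3
```

The two results agree with the text: two buffers with handshake controllers versus three unit delay buffers for the same DFG.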
The algorithm (Fig. 16 ) starts by first assigning zero buffers at all the edges in the graph. It is followed by assigning d and l values to the nodes in the topologically sorted order starting from the source nodes. The d values are assigned based on the weights on the edges as in the initial buffer assignment algorithm. These d values are used to calculate the number of buffers based on (8) . Once the buffers are placed in the fanin of a node, the l value is calculated in Step 11, based on Definition 3.1. At the end of Step 11, the buffers have been assigned to balance the DFG.
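The first phase described above (assigning $d$ values in topologically sorted order from the edge weights) can be sketched as follows. This is a simplified illustration, not the algorithm of Fig. 16: it assumes that $d_v$ is the largest value of $d_u + w_{u,v}$ over the fanin of $v$, that source nodes start at 0, and it omits the $l$ values and the buffer counts. The graph, node names, and the function `assign_d_values` are hypothetical.

```python
from collections import deque


def assign_d_values(nodes, edges):
    """Return (topological order, d values) for a DAG.

    edges maps (u, v) -> weight w(u, v). Assumption for this sketch:
    d[v] = max over fanin edges of d[u] + w(u, v), with d = 0 at sources.
    """
    succ = {n: [] for n in nodes}
    indeg = {n: 0 for n in nodes}
    for (u, v) in edges:
        succ[u].append(v)
        indeg[v] += 1

    d = {n: 0 for n in nodes}
    ready = deque(n for n in nodes if indeg[n] == 0)  # source nodes
    order = []
    while ready:
        u = ready.popleft()
        order.append(u)
        for v in succ[u]:
            d[v] = max(d[v], d[u] + edges[(u, v)])
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    if len(order) != len(nodes):
        raise ValueError("graph contains a cycle; a DFG must be acyclic here")
    return order, d


# Hypothetical DFG: two paths of unequal length from a to v.
nodes = ["a", "b", "c", "v"]
edges = {("a", "b"): 3, ("b", "v"): 2, ("a", "c"): 1, ("c", "v"): 1}
order, d = assign_d_values(nodes, edges)
print(order)  # ['a', 'b', 'c', 'v']
print(d)      # {'a': 0, 'b': 3, 'c': 1, 'v': 5}
```

Each node and edge is visited once, which is consistent with the linear cost stated for this phase.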
The number of buffers is minimized using the buffer minimization techniques presented in Section 2. These algorithms can be reused because they operate on the buffer distribution only. Fig. 17 gives an illustration of the above algorithm. The lightly shaded node at each stage indicates the node chosen in the topological order. We show snapshots of the DFG during buffer assignment, particularly those stages where a buffer is to be added (Fig. 17b and Fig. 17c). The $(d_u, l_u)$ values at each node $u$ and the weights at the edges are shown in the figure. Fig. 18 gives the final buffer distribution after minimization and sharing.
Topologically sorting the graph, assigning all the $d$ values, performing the initial buffer assignment, and assigning the $l$ values take $O(|E| + |V|)$ steps. Hence, the time complexity of the algorithm is equal to that of Buffer Minimization and Buffer Sharing, which is $O(|V||E| + |V|^2 \log |V|)$. Although the time complexity has an upper bound, it is difficult to determine the optimality of the solution.
It can be observed that the number of buffers obtained using this method is much smaller than that obtained using the earlier methods. First, buffers with handshake units provide a delay of more than one unit delay. Second, the algorithm fully utilizes the capability of the nodes in the shorter path to hold on to the data operands up to the pipeline period. For the example in Fig. 18, the number of unit delay buffers obtained by applying the Buffer Minimization and Buffer Sharing algorithms is 15, whereas the number obtained by applying this algorithm (taking the pipeline period $\tau = 5$) is 1.
Results

Performance of Proposed Buffer Distribution Algorithm
Here, we present experimental results to demonstrate the efficiency of the proposed strategy of distributing buffers with handshaking units. We compare this algorithm with an ad hoc strategy that derives a distribution of buffers with handshaking units in a DFG from the corresponding unit delay buffer distribution. The ad hoc scheme replaces a cluster of $n$ unit delay buffers in any edge of the DFG by $\lceil n/\tau \rceil$ buffers with handshaking units (Fig. 1), where $\tau$ is the pipeline period. For example, in a DFG with $\tau = 5$, a cluster of seven unit delay buffers in an edge will be replaced by $\lceil 7/5 \rceil = 2$ buffers. Although this approach gives fewer buffers than the unit delay approach, it does not give the optimal number of buffers. The algorithm shown in Fig. 16 is especially suited for buffers using the handshaking mechanism and, hence, gives the minimal number of buffers. Table 3 presents a comparison based on the number of these buffers required to maximize the throughput of the DFG when each of the two approaches is used. We observe that, when the execution time taken by the nodes varies considerably (example from [4], example from [3]), the proposed algorithm outperforms the ad hoc scheme with a considerable difference in the number of buffers. In DFGs like the example from [10], Parker (HLSynth91), and Diffeq (HLSynth91), the execution time of most of the nodes does not vary much and, hence, the number of buffers is the same.
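The ad hoc baseline is simple enough to state as code. The sketch below applies the ceiling rule edge by edge to a unit delay buffer distribution; the function name and the sample distribution are hypothetical, and only the seven-buffer cluster with a pipeline period of 5 comes from the text.

```python
def ad_hoc_handshake_distribution(unit_buffers, tau):
    """Replace each cluster of n unit delay buffers by ceil(n / tau) buffers.

    unit_buffers maps an edge (u, v) to its number of unit delay buffers;
    tau is the pipeline period. Integer arithmetic avoids rounding issues.
    """
    return {edge: (n + tau - 1) // tau for edge, n in unit_buffers.items()}


# Hypothetical distribution; the cluster of seven is the example from the text.
unit_buffers = {("a", "b"): 7, ("b", "c"): 0, ("c", "d"): 3}
handshake = ad_hoc_handshake_distribution(unit_buffers, tau=5)
print(handshake[("a", "b")])                              # 2
print(sum(unit_buffers.values()), sum(handshake.values()))  # 10 3
```

Because the rule treats every edge in isolation, it cannot use the waiting capacity of the nodes themselves, which is where the algorithm of Fig. 16 is reported to save further buffers.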
Comparison: Unit Delay Buffers vs. Buffers with Handshake Units
We have shown above that the number of buffers required to balance the DFG when using handshake controllers is substantially lower than the number of unit delay buffers. Although a buffer with a handshake controller is much larger than a unit delay buffer, the net savings in area can still be substantial. Preliminary experiments on benchmark DFGs have shown that buffers with handshake units indeed incur less area overhead. The results of these experiments are summarized in Table 4. For simplicity, we only consider the areas associated with the buffers when comparing area overheads. In this particular set of experiments, the number of bits for any node value in the DFG is equal to four. The area overhead, derived from the buffer implementations in [26], is expressed in $\lambda^2$, where $\lambda$ is the minimum feature size [32]. As can be observed from Table 4, the area saved by using buffers with handshake units varies from 32 percent to 57 percent in these examples. The results follow a similar trend for other bit widths. We are currently trying to identify those situations where unit delay buffers and buffers with handshake units can be used for maximizing pipelining in different DFGs.
CONCLUDING REMARKS
The data driven architecture has proven to be an attractive alternative in situations where the limitations associated with global synchronization cannot be tolerated. However, to guarantee high performance, certain conditions must be satisfied by the underlying DFG. Balancing the DFG using the minimum number of buffers is a prerequisite for high performance in data flow architectures. Extensive research in this direction has reduced this formerly nonpolynomial time problem to one that has polynomial time complexity. We have proposed two distinct methods for buffer assignment. Our algorithm using unit delay buffers not only solves the problem in less time than existing IP tools, but may also require fewer buffers. We have also shown that, by using handshake controllers and a unique feature of data driven architectures, we can effect an even greater reduction in the number of buffers needed. These methods may also be applicable to wave-pipelining in digital circuits. We are currently developing a CAD framework for the high level synthesis of data driven architectures. The buffer assignment procedures developed here can be included in any CAD framework for the high level synthesis of data driven ASICs.
