Abstract: Block-processing is a powerful and popular technique for increasing computation speed by simultaneously processing several samples of data. The effectiveness of block-processing is often reduced, however, due to suboptimal placement of delays in the dataflow graph of a computation.
Introduction
In many application domains, computations are defined on semi-infinite or very long streams of data. The rate of the incoming data is dictated by the nature of the application and often cannot be satisfied by a straightforward implementation of a system's specification. In order to meet the computational demands of several applications, multiple samples of the incoming data stream must be processed simultaneously. This approach, known as block-processing or vectorization, is widely used to satisfy throughput requirements through the use of parallelism and pipelining. Block-processing enhances both regularity and locality in computations, thus facilitating their efficient hardware implementation [1, 4]. Enhanced regularity reduces the effort in software switching and address calculation, and improved locality improves the effectiveness of code-size reduction methods [7]. Moreover, block-processing enables the efficient utilization of pipelines and efficient implementations of vector-based algorithms such as FFT-based filtering and error-correction codes. In general, block-processing is beneficial in all cases where the net cost of processing n samples individually is higher than the net cost of processing n samples simultaneously.
There are several ways to increase the block-processing factor of a computation, that is, the number of data samples that can be processed simultaneously. For example, one can unfold the basic iteration of a computation and schedule computational blocks from different iterations to execute successively. This technique, however, may not uniformly increase the block-processing factor for all computational blocks.
Another transformation that can increase the block-processing factor is retiming. Unlike other optimization techniques that have targeted high-level synthesis [3, 9], retiming has been used traditionally for clock period minimization [2, 5, 6]. Figure 1 illustrates the use of retiming to improve block-processing. The computation dataflow graph (CDFG) in this figure has three computation blocks A, B, and C and three delays. An input stream is coming into block A, and an output stream is generated by A. Assuming that the computation is implemented by a uniprocessor system, the pair (x + y) above each block gives the initiation time x and the computation time y per block input. The initiation time includes context-switching overhead for fetching data and instructions from the background memory and the cost for reconfiguring pipelines. A single iteration of the computation in Figure 1(a) completes in (7+5)+(7+6)+(6+3) = 34 cycles by executing the blocks in the order $A_1, B_1, C_1$. For three iterations, the computational blocks can be executed in the order $A_1, B_1, C_1, A_2, B_2, C_2, A_3, B_3, C_3$. In this case, a new input is consumed every 34 cycles, and the entire computation needs 3 × 34 = 102 cycles. The functionally equivalent CDFG in Figure 1( ... [8, 10].
Specifically, a technique for linear vectorization of DSP programs using retiming has been presented in [10]. This technique involves the redistribution of delays in the CDFG representation of a DSP program in a way that results in maximum concentration of delays on the edges. Fully regular vectorization, however, cannot be achieved using the linear vectorization approach in that paper. Moreover, the non-linear integer programming formulation of the retiming problem for computing linear vectorizations presented in that paper can be computationally very expensive.
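The cycle counts quoted for Figure 1(a) can be reproduced with a short calculation. The Python sketch below is illustrative only. The dictionary `BLOCKS` and both function names are ours, and `block_processed_cycles` rests on an assumption that the text does not state explicitly: a block that processes k consecutive samples pays its initiation time once and its computation time k times.

```python
# (initiation time, computation time) per block, as quoted for Figure 1(a).
BLOCKS = {"A": (7, 5), "B": (7, 6), "C": (6, 3)}


def sample_by_sample_cycles(blocks, iterations):
    """Every block pays its initiation time again for every sample."""
    per_iteration = sum(init + comp for init, comp in blocks.values())
    return iterations * per_iteration


def block_processed_cycles(blocks, k):
    """ASSUMPTION (ours): processing k samples back to back costs one
    initiation plus k computations per block."""
    return sum(init + k * comp for init, comp in blocks.values())


print(sample_by_sample_cycles(BLOCKS, 1))  # 34
print(sample_by_sample_cycles(BLOCKS, 3))  # 102
print(block_processed_cycles(BLOCKS, 3))   # 62 under the assumption above
```

Under this assumed cost model, the saving comes entirely from amortizing the initiation times, which is why the placement of delays (and hence the achievable k) matters.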
In this paper, we consider the problem of retiming computation dataflow graphs to achieve any given block-processing factor k. We call this the k-delay problem.
We first present a straightforward integer linear programming (ILP) formulation of the k-delay problem. We then give an $O(V^3 E + V^4 \lg V)$-time algorithm for the k-delay problem, where V is the number of computation blocks and E is the number of interconnections in the CDFG. This is the first polynomial-time algorithm ever presented for the k-delay problem. Given a CDFG and a positive integer k, our algorithm computes a retimed CDFG that achieves a block-processing factor of k or determines that such a retiming does not exist. An important feature of our approach is that all blocks in the retimed CDFG achieve the same block-processing factor k and the same execution order across iterations. As a result, our retimed CDFGs can operate faster and be less expensive to implement than generic block-processed CDFGs.
The remainder of this paper is organized as follows. In Section 2 we describe the representation of computations as dataflow graphs, and we give background material on block-processing and retiming. We also give a precise mathematical formulation of the k-delay problem. In Section 3, we present an integer linear programming formulation of the k-delay problem. In Section 4, we describe an asymptotically more efficient algorithm for the k-delay problem whose running time is polynomial in the size of the CDFG. We conclude in Section 5 with an empirical evaluation of our proposed optimization technique.
Preliminaries
In this section, we give background material on computation dataflow graphs, block-processing, and retiming. We also give a mathematical formulation of the k-delay problem.
Computation dataflow graphs
A computation dataflow graph is an edge-weighted directed graph $G = (V, E, w)$. The vertices $v \in V$ model the computation blocks of a computation (subroutines, arithmetic or boolean operations), and the directed edges $e \in E$ model interconnections (data and control dependencies) between the computation blocks. Each edge $e \in E$ is associated with a weight $w(e)$ that denotes the number of delays associated with that interconnection.
A CDFG is well-formed if for every edge $e \in E$, we have $w(e) \ge 0$, and every directed cycle contains at least one delay.
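The well-formedness test can be made concrete with a small sketch. The representation below (a list of vertex names and a list of `(u, v, w)` triples) and the function name `is_well_formed` are our own choices, not notation from the paper. The check relies on the observation that every directed cycle contains a delay exactly when the subgraph of zero-delay edges is acyclic, which is tested here with Kahn's topological sort.

```python
from collections import defaultdict, deque


def is_well_formed(vertices, edges):
    """edges: list of (u, v, w) triples; w is the delay count of edge u -> v."""
    if any(w < 0 for _, _, w in edges):
        return False

    # Build the zero-delay subgraph and count incoming zero-delay edges.
    succ = defaultdict(list)
    indeg = {v: 0 for v in vertices}
    for u, v, w in edges:
        if w == 0:
            succ[u].append(v)
            indeg[v] += 1

    # Kahn's algorithm: all vertices are removed iff the subgraph is acyclic.
    queue = deque(v for v in vertices if indeg[v] == 0)
    removed = 0
    while queue:
        u = queue.popleft()
        removed += 1
        for v in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    return removed == len(vertices)
```

For example, a three-vertex cycle with one delay per edge passes the test, whereas a two-vertex cycle whose edges both carry zero delays does not.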
Block-processing
Block-processing strives to maximize the throughput of a computation by simultaneously processing multiple samples of the incoming data. The maximum number of samples that can be processed simultaneously or immediately after each other by a block $v$ is called the block-processing factor $k_v$ of that block. A block-processing is linear if all blocks have the same block-processing factor $k$. Given a linear block-processing with factor $k$, the $k \cdot |V|$ computational block evaluations that generate $k$ iterations of the computation constitute a block iteration. A linear block-processing with factor $k$ is regular if the $k$ data samples processed simultaneously by every computational block are accessed during the same block iteration. The retimed CDFG in Figure 1 ...
The following lemma gives necessary and sufficient conditions for achieving effective block-processing.
Lemma 1 Let $G = (V, E, w)$ be a CDFG. We can achieve a linear and regular block-processing of $G$ with factor $k$ if and only if, for every edge $e \in E$, we have
$$w(e) = 0 \quad \text{or} \quad w(e) \ge k. \qquad (1)$$
Proof. ... If Relation (1) is not satisfied for some CDFG that can be block-processed linearly and regularly with a factor of $k$, then there exists an edge $u \xrightarrow{e} v$ such that $1 \le w(e) \le k - 1$. Vertex $v$ can process at most $w(e)$ samples per iteration. Since $w(e) < k$, the remaining $k - w(e)$ samples must be accessed from the previous block iteration, which contradicts regularity.
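The edge condition of Lemma 1 is easy to test mechanically. The following sketch uses our own edge representation (`(u, v, w)` triples) and our own function names; it assumes the graph is already well-formed, so all weights are non-negative.

```python
def deficient_edges(edges, k):
    """Edges whose delay count is positive but smaller than k."""
    return [(u, v, w) for u, v, w in edges if 0 < w < k]


def satisfies_lemma_1(edges, k):
    """True iff every edge carries either 0 or at least k delays."""
    return not deficient_edges(edges, k)


# Hypothetical three-block cycle with one delay per edge.
spread = [("A", "B", 1), ("B", "C", 1), ("C", "A", 1)]
# The same three delays concentrated on a single edge.
concentrated = [("A", "B", 0), ("B", "C", 0), ("C", "A", 3)]

print(satisfies_lemma_1(spread, 3))        # False
print(satisfies_lemma_1(concentrated, 3))  # True
```

The two example graphs carry the same total number of delays around the cycle, yet only the second admits a linear and regular block-processing with factor 3.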
Retiming
A retiming of a CDFG $G = (V, E, w)$ is an integer-valued vertex-labeling $r : V \to \mathbb{Z}$. This integer value denotes the assignment of a lag to each vertex, which transforms $G$ into $G_r = (V, E, w_r)$, where for each edge $u \xrightarrow{e} v \in E$, the retimed weight is
$$w_r(e) = w(e) + r(v) - r(u).$$
In order for the retimed CDFG $G_r$ to be well-formed, the retiming $r$ must satisfy the constraint $w_r(e) \ge 0$ for all edges $e \in E$.
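A retiming can be applied to a concrete graph as follows. This is a minimal sketch that assumes the standard convention $w_r(e) = w(e) + r(v) - r(u)$ for an edge from $u$ to $v$; the function names and the three-block example graph are ours and are not taken from Figure 1.

```python
def retime(edges, r):
    """Return the retimed edge list under the ASSUMED convention
    w_r(e) = w(e) + r(v) - r(u) for an edge e from u to v."""
    return [(u, v, w + r[v] - r[u]) for u, v, w in edges]


def is_legal(edges, r):
    """A retiming is legal if no retimed edge has a negative delay count."""
    return all(w >= 0 for _, _, w in retime(edges, r))


edges = [("A", "B", 1), ("B", "C", 1), ("C", "A", 1)]
r = {"A": 0, "B": -1, "C": -2}

print(retime(edges, r))    # [('A', 'B', 0), ('B', 'C', 0), ('C', 'A', 3)]
print(is_legal(edges, r))  # True
```

Because the lags cancel along any directed cycle, the total delay count of the cycle (three in this example) is unchanged; retiming only moves delays between edges.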
An important characteristic of the graph $G$ in the context of retiming is the parameter $W(u, v)$, which is defined for each pair of vertices $u$ and $v$ as
$$W(u, v) = \min\{\, w(p) : u \stackrel{p}{\leadsto} v \,\},$$
where $w(p) = \sum_{e \in p} w(e)$ denotes the delay count of a path $p$.
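The minimum delay count over all paths between each vertex pair can be computed with any all-pairs shortest-paths routine. The sketch below uses Floyd-Warshall for brevity; it is our illustration, not the procedure used in the paper. The name `min_delay_counts` and the edge representation are ours, and we assume that only non-empty paths are considered, so the entry for a pair `(u, u)` is the delay count of the lightest cycle through `u`.

```python
import math


def min_delay_counts(vertices, edges):
    """Minimum delay count over all non-empty paths for every vertex pair.

    edges: list of (u, v, w) triples with non-negative delay counts w.
    Unreachable pairs keep the value math.inf.
    """
    W = {(u, v): math.inf for u in vertices for v in vertices}
    for u, v, w in edges:
        W[u, v] = min(W[u, v], w)  # keep the lightest of parallel edges

    # Floyd-Warshall relaxation over intermediate vertices m.
    for m in vertices:
        for u in vertices:
            for v in vertices:
                through_m = W[u, m] + W[m, v]
                if through_m < W[u, v]:
                    W[u, v] = through_m
    return W


W = min_delay_counts(["A", "B", "C"],
                     [("A", "B", 1), ("B", "C", 1), ("C", "A", 1)])
print(W["A", "C"])  # 2
print(W["A", "A"])  # 3
```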
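Another useful parameter defined for a CDFG is ...
According to Lemma 1, a linear and regular block-processing with factor $k$ can be achieved only for CDFGs that have either 0 or at least $k$ delays on each edge. If a given CDFG does not satisfy Relation (1), we can redistribute its delays by retiming. We call the problem of computing such a retiming the k-delay problem:
Problem KDP Given a CDFG $G = (V, E, w)$ and a positive integer $k$, compute a retiming $r : V \to \mathbb{Z}$ such that for every edge $e \in E$, we have
$$w_r(e) = 0 \quad \text{or} \quad w_r(e) \ge k. \qquad (5)$$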
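To make the k-delay problem tangible on tiny instances, one can simply enumerate candidate retimings. The sketch below is an exhaustive search of our own and is not the ILP formulation or the polynomial-time algorithm developed in this paper. It assumes that the problem asks for an integer retiming after which every edge carries either 0 or at least k delays, as Lemma 1 suggests, and that retimed weights follow $w_r(e) = w(e) + r(v) - r(u)$. The function name and the `bound` parameter are hypothetical.

```python
from itertools import product


def brute_force_kdp(vertices, edges, k, bound):
    """Search all retimings with lags in [-bound, bound].

    The first vertex is pinned to lag 0, which loses no generality because
    adding a constant to every lag leaves all retimed weights unchanged.
    Returns a lag dictionary, or None if no retiming is found in range.
    """
    first, rest = vertices[0], vertices[1:]
    for lags in product(range(-bound, bound + 1), repeat=len(rest)):
        r = {first: 0, **dict(zip(rest, lags))}
        retimed = [w + r[v] - r[u] for u, v, w in edges]
        if all(w == 0 or w >= k for w in retimed):
            return r
    return None


cycle = [("A", "B", 1), ("B", "C", 1), ("C", "A", 1)]
print(brute_force_kdp(["A", "B", "C"], cycle, 3, 2))
print(brute_force_kdp(["A", "B", "C"], cycle, 4, 3))  # None
```

The second call returns `None` because the cycle holds only three delays in total, and retiming cannot change that total; no edge can therefore reach four delays. The running time of this search grows exponentially with the number of vertices, which is precisely what a polynomial-time algorithm avoids.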
ILP formulation
Problem KDP cannot be expressed directly in a linear programming form because of the disjunction in Relation (5). In this section we rely on the notion of the companion graph that was described in [5] 
... and for each edge $u \xrightarrow{e} v \in E$, we have $w'(e_1) = \min\{1, w(e)\}$, and ...
The following lemma gives necessary and sufficient conditions for any retiming that solves Problem KDP.
Lemma 2 Let $G = (V, E, w)$ be a CDFG, and let $G' = (V', E', w')$ be its companion graph. Then there exists a retiming function $r : V \to \mathbb{Z}$ that solves Problem KDP on $G$ if and only if there exists a retiming function $r' : V' \to \mathbb{Z}$ such that for every edge $u \xrightarrow{e} v \in E'$, we have
$$w'_{r'}(e) \ge 0, \qquad (6)$$
and for every edge $u \xrightarrow{e} v$ in $E$, we have
$$w'_{r'}(e_1) \le 1, \qquad (7)$$
$$w'_{r'}(e_2) \le F \cdot w'_{r'}(e_1), \text{ and} \qquad (8)$$
$$\ldots \; w'_{r'}(e_1). \qquad (9)$$
Proof. Inequality (6) ensures that the retimed CDFG is well-formed. Inequalities (7) and (8) ...
Polynomial-time algorithm
The constraints in the ILP formulation of Problem KDP do not appear to have any special structure. We thus need to resort to general integer linear programming solvers to compute a solution. In this section we show that Problem KDP can be solved efficiently by giving a polynomial-time algorithm for it. We first give a set of necessary conditions for the feasibility of Problem KDP on a given CDFG. Subsequently, we describe the construction of a transformed graph $G_T$, and we give a set of necessary and sufficient conditions for the feasibility of Problem KDP on $G_T$. We then present a polynomial-time algorithm that uses $G_T$ to solve Problem KDP.
Necessary conditions
The following lemma gives necessary conditions for the feasibility of Problem KDP. 
Proof. Omitted.
The challenge in solving Problem KDP is to determine which edges should have nonzero delay count. In the ILP formulation, we determine these edges explicitly. The asymptotically efficient technique we describe in this section determines these edges implicitly. It first identifies vertex pairs $u, v \in V$ such that for every path ... which violates Relation (5). Condition C2 must hold for path $u \stackrel{p}{\leadsto} v$ in $G_r$. ... must be at least $k$, otherwise Relation (5) will be violated for some edge along this path.
The following lemma casts the necessary conditions of Lemma 4 as a retiming problem on an appropriately constructed constraints graph. The following lemma expresses the necessary conditions ...
Run an all-pairs 2-shortest paths algorithm on $G$.
for every delay-essential vertex pair $u, v$ in $G$
Run an all-pairs shortest paths algorithm on $G_T$.
Sufficient conditions
In this section we show that the necessary conditions presented in Subsection 4.1 are also sufficient for Problem KDP. The following lemma proves an important property of $G_T$ which is used in Lemma 9 to prove the sufficiency of these conditions. Intuitively, this lemma shows that if an edge $u \xrightarrow{e} v$ has positive but fewer than $k$ delays in a graph $G_r$ that satisfies the conditions of Lemma 6, then its delay count is the minimum delay count for the vertex pair $u, v \in V$. Consequently, we can zero out the delay count of such an edge by retiming.
... Then for every deficient edge $u \xrightarrow{e} v$ in $E$, such that $0 < w_r(e) < k$, we have $w_r(e) = W_r(u, v)$.
We can transform any given $G$ into its supergraph $\mathrm{sup}(G)$ by collapsing all vertices connected by zero-weight edges into a single supervertex. This transformation is illustrated in Figure 4. Note that when we compute the supergraph $\mathrm{sup}(G_T)$ of a transformed graph, we only collapse those vertices connected by zero-weight edges that belong to the original graph $G$. In Figure 4, for example, even though the additional edge $C \to D$ has no delays, the vertices $C$ and $D$ are not collapsed.
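The supergraph construction can be sketched with a union-find structure. This is a minimal illustration under our own conventions: `edges` holds the edges of the original graph as `(u, v, w)` triples, the hypothetical parameter `extra_edges` holds additional edges (such as those added when building a transformed graph) that are carried over but never used for collapsing, and supervertices are represented as frozensets of the original vertex names.

```python
def supergraph(vertices, edges, extra_edges=()):
    """Collapse vertices joined by zero-weight edges of the original graph.

    Returns (set of supervertices, list of (super_u, super_v, w) edges).
    Zero-weight edges in extra_edges are kept and do NOT trigger a collapse.
    """
    parent = {v: v for v in vertices}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    # Merge the endpoints of every zero-weight original edge.
    for u, v, w in edges:
        if w == 0:
            parent[find(u)] = find(v)

    members = {}
    for v in vertices:
        members.setdefault(find(v), set()).add(v)
    sv = {v: frozenset(members[find(v)]) for v in vertices}

    # Only delay-carrying original edges survive; extra edges are kept as is.
    super_edges = [(sv[u], sv[v], w) for u, v, w in edges if w > 0]
    super_edges += [(sv[u], sv[v], w) for u, v, w in extra_edges]
    return set(sv.values()), super_edges


supervertices, super_edges = supergraph(
    ["A", "B", "C", "D"],
    [("A", "B", 0), ("B", "C", 2), ("C", "A", 1)],
    extra_edges=[("C", "D", 0)],
)
print(len(supervertices))  # 3
```

In this hypothetical example, A and B merge into one supervertex, while C and D stay separate even though the extra edge between them has no delays, mirroring the remark about Figure 4.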
In the following lemma, we show that a retiming that minimizes the number of edges with delays in $E$ subject to Inequality (11) is a solution to Problem KDP. We prove this result by contradiction. We consider a retiming that satisfies Inequality (11) and minimizes the number of edges in $E$ with delays without solving Problem KDP. We then show how to further reduce the number of edges with delays in $E$, thus contradicting our assumption that our initial retiming was optimal.
... We now show that the number of edges with delays can be reduced further. By construction, a supergraph has no zero-weight edges. Therefore, retiming supervertices ensures that no delays are introduced on zero-weight edges. Since all delays are removed from at least one edge, namely $u \xrightarrow{e} v \in E$, the resulting supergraph has a smaller number of edges in $E$ with nonzero delays than $G_r$, thus contradicting our assumption that $r$ minimizes the number of edges with delays in $E$. We conclude from this contradiction that $r$ is a solution to Problem KDP.
The algorithm
The proof of Lemma 9 captures our two-step strategy for solving Problem KDP. The first step generates the transformed graph $G_T$ and computes a retiming that satisfies Inequality (11), thus ensuring that all paths in $G_r$ have enough delays. The second step places delays onto individual edges using a greedy procedure that computes an incremental retiming for every edge with fewer than $k$ delays. In each iteration, the incremental retiming removes all the delays on the deficient edge and redistributes them on other edges with delays without violating any of the necessary conditions. This process is repeated until there are no more deficient edges left and all edges satisfy Relation (5). The final retiming is the sum of the retiming in the first step and of all the incremental retimings in the second step. Algorithm SOLVEKDP described in Figure 5 implements this greedy procedure.
... is possible to achieve performance improvements between 30% and 45% using our k-delay optimization.
Figure 5: Algorithm SOLVEKDP for solving Problem KDP.
