Abstract-Digital signal processing (DSP) applications involve processing long streams of input data. It is important to take into account this form of processing when implementing embedded software for DSP systems. Task-level vectorization, or block processing, is a useful dataflow graph transformation that can significantly improve execution performance by allowing subsequences of data items to be processed through individual task invocations. In this way, several benefits can be obtained, including reduced context switch overhead, increased memory locality, improved utilization of processor pipelines, and use of more efficient DSP-oriented addressing modes. On the other hand, block processing generally results in increased memory requirements since it effectively increases the sizes of the input and output values associated with processing tasks. In this paper, we investigate the memory-performance trade-off associated with block processing. We develop novel block processing algorithms that carefully take into account memory constraints to achieve efficient block processing configurations within given memory space limitations. Our experimental results indicate that these methods derive optimal memory-constrained block processing solutions most of the time. We demonstrate the advantages of our block processing techniques on practical kernel functions and applications in the DSP domain.
I. INTRODUCTION
Indefinite- or unbounded-length streams of input data characterize most applications in the digital signal processing (DSP) and communications domains. As the complexity of DSP applications grows rapidly, great demands are placed on embedded processors to perform more and more intensive computations on these data streams. The multi-dimensional requirements that are emerging in commercial DSP products, including requirements on cost, time-to-market, size, power [...] exploiting specialized hardware units or features, and using various other DSP-oriented compiler optimization techniques (e.g., see [9]).
Task-level vectorization or block processing is one general method for improving DSP software performance in a variety of ways. In this context, block processing refers to the ability of a task to process groups of input data, rather than individual scalar data items, on each task activation. Such a task is typically implemented in terms of a block processing parameter that indicates the size of each input block that is to be processed. This way, the task programmer can optimize the internal implementation of the task through awareness of its block processing capability, and a task-level design tool can optimize the way block processing is applied to each task and coordinated across tasks for more global optimization.
In this paper, we explore such global block processing optimization for dataflow-based design tools. Due to its intuitive match to signal flow graphs, and its capabilities for improving verification and optimization, dataflow is becoming increasingly popular as the semantic basis for DSP-oriented languages and tools. Commercial examples of tools that provide dataflow-based design capability for DSP include ADS by Agilent Technologies and Cocentric System Studio by Synopsys. Relevant research tools include DIF [6], Ptolemy [3], PeaCE [13], and StreamIt [14].
More specifically, in this paper, we examine the trade-off between block processing implementation and data memory requirements. Understanding this trade-off is useful in memory-constrained software and design space exploration. Theoretical analysis and algorithms are proposed to efficiently achieve streamlined block processing configurations given constraints on data memory requirements.
In [4] [...] Block processing also facilitates efficient utilization of pipelines for vector-based algorithms, which are common in DSP applications [2].
Task-level block processing optimization for DSP was first explored by Ritz et al. [12]. In this approach, a dataflow graph is hierarchically decomposed based on analysis of fundamental cycles. The decomposition is performed carefully to avoid deadlock and maximize the degree of block processing. While the work jointly optimizes block processing and code size, it does not consider data memory cost. Another limitation of this approach is its high complexity, which results from exhaustive search analysis of fundamental cycles.
Joint optimization of block processing, data memory minimization, and code size minimization is examined in [11].
In contrast to these related efforts, the optimization problem that we target in this paper is formulated to take into account a user-defined data memory bound. This corresponds to the common practical scenario where one is trying to fit the implementation within a given amount of memory (e.g., the on-chip memory of a programmable digital signal processor). Also, by iterating through different memory bounds, trade-off curves between performance and memory cost can be generated for system synthesis and design space exploration.
In this paper, in conjunction with block processing optimization, memory sizes of dataflow buffers are efficiently configured through novel algorithms that frequently achieve optimum solutions, while having low polynomial-time complexity.
Various other methods address the problem of minimizing context switching overhead when implementing dataflow graphs. For example, the retiming technique is often exercised on single-rate dataflow graphs. In the context of context switch optimization, retiming rearranges delays (initial values in the dataflow buffers) so they are better concentrated in isolated parts of the graph [7] [15]. As another example, Hong et al. [5] investigate throughput-constrained optimization given heterogeneous context switching costs between task pairs. The approach is flexible in that overall execution time or other objectives (such as power dissipation) are jointly optimized under a fixed schedule length through appropriate sequencing of task execution.
These efforts target different objectives and operate on single-rate dataflow graphs, which are graphs in which all tasks execute at the same average rate. In contrast, the methods targeted in this paper operate on multirate dataflow graphs, which are common in many signal processing applications, including wireless communications and multimedia processing. Our work is motivated by the importance of multirate signal processing, and the much heavier demands on memory requirements that are imposed by multirate applications.
III. [...]
[...] restriction that for each edge in the dataflow graph, the number of data values produced by each invocation of the source actor and the number of data values consumed by each invocation of the sink actor are constant values. Given an SDF edge e, these production and consumption values are represented by prd(e) and cns(e), respectively, and the source and sink actors of e are represented by src(e) and snk(e), respectively.
A schedule is a sequence of actor invocations (or firings). We compile an SDF graph by first constructing a valid schedule, which is a finite schedule that fires each actor at least once, and does not lead to unbounded buffer accumulation (if the schedule is repeated indefinitely) nor buffer underflow (deadlock) on any edge. To avoid buffer overflow and underflow problems, the total amount of data produced and consumed is required to be matched on all edges. In [8], efficient algorithms are presented to determine whether or not a valid schedule exists for an SDF graph, and to determine the minimum number of firings of each actor in a valid schedule. We refer to this minimum number of firings as the repetitions count of an actor, and we collect the repetitions counts for all actors in the repetitions vector. The repetitions vector is indexed by the actors in the SDF graph and it is denoted by q.
Given an SDF edge e and the repetitions vector q, the balance equation for e is written as $q(\mathrm{src}(e))\,\mathrm{prd}(e) = q(\mathrm{snk}(e))\,\mathrm{cns}(e)$.
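As a minimal sketch of how the balance equations determine the repetitions vector, the following Python fragment propagates rational firing ratios across the edges and scales them to the smallest positive integers. It is an illustration only, not the algorithm of [8]; the function name `repetitions_vector`, the `(src, snk, prd, cns)` edge tuples, and the three-actor example graph are all assumptions introduced here.

```python
from fractions import Fraction
from functools import reduce
from math import gcd


def repetitions_vector(actors, edges):
    """Solve the SDF balance equations for the minimal positive integer q.

    actors: iterable of actor names (hypothetical representation).
    edges:  list of (src, snk, prd, cns) tuples, one per SDF edge.
    Returns a dict actor -> repetitions count, or None when the rates
    are inconsistent, so that no valid schedule can exist.
    """
    actors = list(actors)
    # Each edge relates its endpoints: q(snk) = q(src) * prd / cns.
    neighbors = {a: [] for a in actors}
    for src, snk, prd, cns in edges:
        neighbors[src].append((snk, Fraction(prd, cns)))
        neighbors[snk].append((src, Fraction(cns, prd)))

    q = {}
    for start in actors:
        if start in q:
            continue
        # Propagate rational firing ratios through one connected component.
        ratio = {start: Fraction(1)}
        stack = [start]
        while stack:
            a = stack.pop()
            for b, r in neighbors[a]:
                expected = ratio[a] * r
                if b not in ratio:
                    ratio[b] = expected
                    stack.append(b)
                elif ratio[b] != expected:
                    return None  # some balance equation cannot be satisfied
        # Scale the ratios to the smallest positive integers.
        scale = reduce(lambda x, y: x * y // gcd(x, y),
                       (f.denominator for f in ratio.values()), 1)
        counts = {a: int(f * scale) for a, f in ratio.items()}
        common = reduce(gcd, counts.values())
        for a, c in counts.items():
            q[a] = c // common
    return q


# Hypothetical three-actor chain A -> B -> C (rates chosen for illustration).
edges = [("A", "B", 2, 3), ("B", "C", 1, 2)]
q = repetitions_vector("ABC", edges)
```

For this assumed chain, every edge satisfies q(src(e)) prd(e) = q(snk(e)) cns(e) with q(A) = 3, q(B) = 2, and q(C) = 1.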
[...] schedule of 5 [1]. The binary structure of an R-schedule can be represented efficiently as a binary tree, which is called a schedule tree or just a tree in our discussion [10]. In this tree [...] (right(a)). We define the association operator, denoted $\alpha(\cdot)$, as follows: $\alpha(A) = a$ maps the SDF actor A to its associated schedule tree leaf node a. The loop iteration count associated [...]
[...] Fig. 3(b) has less procedure call overhead, fast addressing through auto-increment modes, and better locality for pipelined execution compared to add_scalar() in Fig. 3(a).
By inspecting these charts, block processing is seen to achieve significant performance improvement, except when the actor invocation count (vectorization degree) is unity. In this case, one must pay for the overhead of block processing without being able to amortize the overhead over multiple successive actor invocations, so there is no improvement. Moreover, improvements are seen to saturate for sufficiently high vectorization degrees.
Charts of this form can provide application designers and synthesis tools with helpful quantitative data for applying block processing during design space exploration.
V. BLOCK PROCESSING IN SOFTWARE SYNTHESIS
To model block processing in SDF-based software synthesis, we convert successive actor invocations to inlined loops embedded within a procedure that represents an activation of the associated actor. Here, the number of loop iterations is equivalent to the number of successive actor invocations, that is, to the vectorization degree. Given [...]
If r is a leaf node and $\alpha(R) = r$, then vect(R) successive invocations of actor R are equivalent to a single activation, and act(r) = 1. If r is an internal node, then based on the structure of SASs, an activation is necessary when left(r) completes and is followed by right(r) at each of l(r) iterations. An activation also occurs when right(r) completes in one iteration and is followed by left(r) in the next iteration. Therefore, we have l(r)(act(left(r)) + act(right(r))) activations for tree(r).
Given a valid schedule S of an SDF graph, there is a unique positive integer J(S) such that S invokes each actor A exactly J(S) × q(A) times, where q is the repetitions vector, as defined in Section III. This positive integer is called the blocking factor of the schedule S. The blocking factor can be expressed as [...]
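The activation count defined above for schedule trees, act(r) = 1 for a leaf and l(r)(act(left(r)) + act(right(r))) for an internal node, can be sketched as a short recursion. The `Node` class, its field names, and the two example trees are hypothetical and are introduced here only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Schedule tree node (hypothetical representation)."""
    loop: int                        # loop iteration count l(r)
    actor: Optional[str] = None      # set for leaf nodes only
    left: Optional["Node"] = None    # left(r), internal nodes only
    right: Optional["Node"] = None   # right(r), internal nodes only

    def is_leaf(self) -> bool:
        return self.actor is not None


def act(r: Node) -> int:
    """Number of activations for tree(r)."""
    if r.is_leaf():
        # All successive invocations of a leaf form a single activation.
        return 1
    return r.loop * (act(r.left) + act(r.right))


# Assumed nested schedule (2 (3A)(2B)) C.
nested = Node(1,
              left=Node(2, left=Node(3, actor="A"), right=Node(2, actor="B")),
              right=Node(1, actor="C"))

# Flat schedule (6A)(4B) C with the same total invocation counts.
flat = Node(1,
            left=Node(1, left=Node(6, actor="A"), right=Node(4, actor="B")),
            right=Node(1, actor="C"))

nested_activations = act(nested)
flat_activations = act(flat)
```

In this assumed example the nested schedule activates A, B, A, B, C (five activations), while the flat schedule needs only three, which illustrates why activation minimization favors larger loop counts at the leaves.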
In this section we consider in detail the problem of minimiz[...]
[...] One restriction in these formulations is that they assume that the input SDF graph is acyclic. Acyclic SDF graphs represent a broad and important class of DSP applications (e.g., see [1]).
In this paper, we investigate the problem of activation rate minimization under the constraint that the schedule blocking factor is unity. We restrict ourselves to unit blocking factor here because we are interested in memory-efficient block processing configurations, and increases in blocking factor generally increase memory requirements [1]. However, our developments on the activation rate minimization problem can easily be extended to handle arbitrary blocking factors. The details are omitted from this paper due to space restrictions.
The problem is formally described as follows. Assume that we are given an acyclic SDF graph G and a valid schedule S (and associated schedule tree tree(r)) for G such that J(S) = 1. Block processing is to be applied to G by rearranging the loop counts of tree nodes in the schedule tree for S. The optimization variables are the set {l(x)} of loop counts in the leaf nodes of the rearranged schedule tree (recall that these loop counts are equivalent to the vectorization degrees of the associated actors). The objective is to minimize the number of activations:
$\min(\mathrm{act}(r))$. (1)
Changes to the loop counts of tree nodes must obey the constraint that the overall number of actor invocations in the schedule is unchanged. In other words, for each actor A,
$\prod_{x \in \mathrm{path}(a, r)} l(x) = q(A)$, (2)
where $\alpha(A) = a$, and path(a, r) is the set of nodes that are traversed by the path from the leaf node a to the root node r. Intuitively, the equation says that no matter how the loop counts are changed along the path, their product has to match the repetitions count of the associated actor.
In the activations minimization problem, we are also given a buffer cost constraint M (a positive integer), such that the total buffer cost in the rearranged schedule cannot exceed M. That is,
$\sum_{e} \mathrm{buf}(e) \le M$, (3)
where buf(e) denotes the buffer cost on SDF edge e.
The structure of R-schedules permits efficient computation of buffer costs. Given an acyclic SDF graph edge e, we have two leaf nodes a and b associated with the source and sink: $\alpha(\mathrm{src}(e)) = a$ and $\alpha(\mathrm{snk}(e)) = b$. Let [...] be the least common parent of a and b in the schedule tree. Then the buffer cost on e can be evaluated by the following expressions.
[...] (4)
In summary, the activations minimization problem can be set up by casting Equations (1) through (4) into a non-linear programming (NLP) formulation, where the objective is given by (1), the variables are the loop iteration counts of the schedule tree nodes, and the constraints are given in (2), (3), and (4). Due to the intractability of NLP, efficient heuristics are desired to tackle the problem for practical use.
To determine an initial schedule to work on, we must consider the potential optimization conflicts between buffer cost and activations. While looped schedules that make extensive use of nested loops are promising in generating low buffer costs, activations minimization favors flat schedules, that is, schedules that do not employ nested loops.
We employ nested-loop SASs that have been constructed for low buffering costs as the initial schedules in our optimization process because flat schedules can easily be derived from any such schedule by setting the loop counts of all internal nodes to one, while setting the loop counts of leaf nodes according to the repetitions counts of the corresponding actors. Furthermore, the construction of buffer-efficient nested loop schedules has been studied extensively, and the results of this previous work can be leveraged in our approach to memory-constrained block processing optimization. Specifically, the APGAN and GDPPO algorithms are employed in this work to compute buffer-efficient SASs as a starting point for our memory-constrained block processing optimization [1].
A. Loop Count Factor Propagation
As described earlier, activations values of leaf nodes are always equal to one and independent of their loop counts. Hence, one approach to optimizing activations values is to enlarge the loop counts of leaf nodes by absorbing the loop counts of internal nodes. A similar approach is proposed in [12] to deal with cyclic SDF graphs with delays. The strategy in [12] is to extract integer factors out of a loop's iteration count and carefully propagate the factors to inner loops. Propagations are validated by checking that they do not introduce deadlock. However, as described in Section II, the work of [12] does not consider buffer cost in the optimization process. For acyclic SDF graphs, as we will discuss later, factors of loop counts should be aggressively propagated straight to the innermost iterands to achieve effective block processing. Under memory constraints, such propagation should be balanced carefully against any increases in buffering costs.
Definition 1. FActor Propagation toward Leaf nodes (FAPL). Given tree(r), the FAPL operation, when it can be applied, is to extract an integer factor V > 1 out of l(r) and merge V into the loop iteration counts of all the leaf nodes in tree(r). Formally, the new loop count of r is l(r)/V, and for every leaf f in tree(r) the new loop count is $V \cdot l(f)$. Loop counts of the other internal nodes remain unchanged. For notational convenience, a FAPL operation is represented as $\phi(r, V)$ for tree(r) and factor V, or simply as $\phi$ when the context is known. We call r the FAPL target internal node, all leaf nodes in tree(r) the FAPL target leaf nodes, and V the FAPL factor.
An example of FAPL is illustrated in Fig. 5. FAPL reduces the number of activations and increases buffer costs.
Theorem 1. Given tree(r) with act(r), $\phi(r, V)$ reduces the activations of tree(r) by a factor of V.
Definition 2. Given an SDF edge e with $\alpha(\mathrm{src}(e)) = a$, and [...]
[...] graph and Q is the maximum number of factors in the prime factorization of the loop count of an internal node.
VII. EXPERIMENTS
To demonstrate the trade-off between buffer cost minimization and activation rate optimization, exhaustive search is applied to the CD to DAT sample rate conversion example of Figure 1. From the initial SAS of Fig. 1(c), factor combinations of all loop counts are exhaustively evaluated. The results are summarized in Fig. 7. Each dot in this chart is derived from a particular factor combination, and the activations are obtained by min(act(S)) (vertical axis) subject to buf(S) ≤ M, where M is the buffer cost bound (horizontal axis).
The following DSP applications are examined in our experimental evaluation: 16qam (a 16 QAM modem), 4pam (a pulse amplitude modulation system), aqmf (filterbank), and cd2dat (CD to DAT sample rate conversion). These applications are ported from the library of SDF-based designs that are available in the Ptolemy design environment [3]. For each application, a number of buffer cost upper bounds (values of M) are selected uniformly in the range between the cost of the initial schedule (obtained from APGAN/GDPPO) and the cost of a flat schedule [...]
Given a buffer bound M, the degree of suboptimality (DOS) of GreedyFAPL is evaluated as (act_sub - act_opt)/act_opt, where act_opt is the optimal number of activations observed and act_sub is the number of activations computed by GreedyFAPL. The degrees of suboptimality thus computed are averaged over the number of buffer bounds selected to obtain an average degree of suboptimality. The results are summarized in Fig. 8, and they demonstrate the ability of GreedyFAPL to achieve optimum solutions most of the time.
Application  16qam  4pam  aqmf1  aqmf2  aqmf3  aqmf4  cd2dat
DOS          0%     0%    1%     2%     0%     1%     3%
To further evaluate the efficiency of our algorithms, ran[...]
[...] rate is seen to be a good high level model for comparing different design points in terms of average latency.
VIII. CONCLUSION
In this paper, we have first demonstrated the advantages of block processing implementation of DSP kernel functions. Then we have examined the integrated optimization problem of block processing, code size minimization, and data space reduction. We have shown that this problem can be modeled through a nonlinear programming formulation. However, due to the intractability of nonlinear programming, we have developed a computationally efficient heuristic. We have evaluated the GreedyFAPL heuristic, and our results demonstrate that it consistently derives high quality results. This paper has presented a number of concrete examples and additional bodies of experimental results that provide further insight into the relationships among block processing, memory requirements, and performance optimization for DSP software.
REFERENCES
[...] 2654, May 1995.
[12] S. Ritz, M. Pankert, and H. Meyer, "Optimum vectorization of scalable synchronous dataflow graphs," ASAP, pp. 285-296, October 1993.
[13] W. Sung, M. Oh, C. Im, and S. Ha, "Demonstration of hardware software codesign workflow in PeaCE," International Conference on VLSI and CAD, October 1997.
