Shared Buffer Implementations of Signal Processing Systems Using Lifetime Analysis Techniques

Introduction

There has been a proliferation of block-diagram environments for specifying and prototyping DSP systems. These include tools from academia, such as Ptolemy, and commercial tools, such [...]

[...] languages can be based on models of computation that have strong formal properties, enabling easier and faster development of bug-free programs. Block-diagram specifications also have the desirable property of not over-specifying systems; this can enable a synthesis tool to exploit all of the concurrency and parallelism available at the system level.
In a block-diagram environment, the user connects various blocks drawn from a library to form the system of interest. For simulation, these blocks are typically written in a high-level language (HLL) like C++. For software synthesis, the technique typically used is inline code generation: a schedule is generated, and the code generator steps through this schedule and substitutes the code for each actor that it encounters. The code for an actor may be of two types. It may be the HLL code itself, obtained from the actor in the simulation library, in which case the overall code is then compiled for the appropriate target. Alternatively, it may be hand-optimized code for a particular target implementation. For programmable DSPs, this means that the actors implement their functionality through hand-optimized assembly language segments. After stitching together the code for the entire system, the code generator simply assembles it, and the resulting machine code can be run on the DSP. This latter technique is generally more efficient for programmable DSPs because of a lack of efficient HLL DSP compilers.
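The inline code generation step described above can be illustrated with a minimal sketch. The actor names, the code templates in `ACTOR_LIBRARY`, and the nested-list schedule representation are all hypothetical choices made for this illustration; they are not taken from any particular tool.

```python
# Hypothetical actor library: maps an actor name to the code block that
# implements it on the target (here, C-like text).
ACTOR_LIBRARY = {
    "src": "x = read_sample();",
    "fir": "y = fir_filter(x);",
    "sink": "write_sample(y);",
}


def generate_inline_code(schedule, library):
    """Walk a looped schedule and splice in each actor's code block.

    A schedule is a list whose items are either an actor name (str) or a
    pair (count, body) that represents a loop around a sub-schedule.
    """
    lines = []

    def emit(items, depth):
        indent = "    " * depth
        for item in items:
            if isinstance(item, str):
                lines.append(indent + library[item])
            else:
                count, body = item
                lines.append(
                    f"{indent}for (int i{depth} = 0; i{depth} < {count}; i{depth}++) {{"
                )
                emit(body, depth + 1)
                lines.append(indent + "}")

    emit(schedule, 0)
    return "\n".join(lines)


# Hypothetical schedule: fire src once, then loop twice over fir and sink.
example_schedule = ["src", (2, ["fir", "sink"])]
generated = generate_inline_code(example_schedule, ACTOR_LIBRARY)
```

Each actor's code appears exactly once in the generated text, with loops supplying the repetition; this is the property that single appearance schedules exploit.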
For hardware synthesis, a similar approach can be taken, with blocks implementing their functionality in a hardware description language, like behavioral VHDL [12] [34]. The generated VHDL description can then be used by a behavioral synthesis tool to generate an RTL description of the system, which can be further compiled into hardware using logic synthesis and layout tools.
High-level language compilers for DSPs have been woefully inadequate in the past [35]. This is due to the highly irregular architectures of many DSPs, specialized addressing modes such as modulo addressing and bit-reversed addressing, and the small number of special-purpose registers. Traditional compilers are unable to generate efficient code for such processors. This situation might change in the future if DSP architectures converge to general-purpose architectures; for example, the C6 DSP, the newest DSP architecture from TI, is a VLIW architecture and has a fairly good compiler. Even so, because of low power requirements and cost constraints, the fixed-point DSP with an irregular architecture is likely to dominate in embedded applications for the foreseeable future. Because of the shortcomings of existing compilers for such DSPs, a considerable research effort has been undertaken to design better compilers for fixed-point DSPs (e.g., see [19] [20] [21]).
Synthesis from block diagrams is useful and necessary when the block diagram becomes the abstract specification rather than C code. Block diagrams also enable coarse-grain optimizations based on knowledge of the restricted underlying models of computation; these optimizations are frequently difficult to perform for a traditional compiler. Since the first step in block diagram synthesis flows is the scheduling of the block diagram, we consider in this paper scheduling strategies for minimizing memory usage. Since the scheduling techniques we develop operate on the coarse-grain, system level description, these techniques are somewhat orthogonal to the optimizations that might be employed by tools lower in the flow.
For example, a behavioral synthesis tool has a limited view of the code, often confined to basic blocks within each block it is optimizing, and cannot make use of the global control and dataflow that our scheduler can exploit. Similarly, a compiler for a general-purpose HLL (such as C) typically does not have the global information about application structure that our scheduler has. The techniques we develop in this paper are thus complementary to the work that is being done on developing better HLL compilers for DSPs, such as that presented in [19] [20] [21] . In particular, the techniques we develop operate on the graphs at a high enough level that particular architectural features of the target processor are largely irrelevant. We assume that the actor library that the code generator has access to consists of either hand-optimized assembly code, or of specifications in a high-level language like C. If the latter, then we would have to invoke a C compiler after performing the dataflow optimizations and threading the code together. Even though this might seemingly defeat the purpose of producing efficient code, since we are using a C compiler for a DSP (the compiler might not be very good as mentioned), studies have shown that for larger systems, C code produced this way compiles better than hand-written C for the entire system [15] .
2 Problem statement and organization of the paper
In this paper, we describe a technique for reducing buffering requirements in synchronous dataflow (SDF) graphs based on lifetime analysis and memory allocation heuristics for single appearance looped schedules (SAS).
As already mentioned, the first step in compiling SDF graphs is determining a schedule. Once the schedule has been determined, memory has to be allocated for the buffers in the graph. Both of these steps present many algorithmic challenges; we address several of them in this paper. We concentrate on the class of single appearance schedules in our framework because non-single appearance schedules for SDF graphs can be exponentially long; this can lead to very large code size [4]. Within the class of SASs, there are two algorithmic challenges: to determine the order in which the actors should appear in the schedule, subject to the precedence constraints imposed by the graph (the topological ordering), and to determine how the loops should be organized once that order has been fixed. Solutions to both of these problems depend on the optimization metric of interest. In this paper, the metric is buffer memory; hence, these algorithms all try to minimize the amount of buffer memory needed. While previous techniques for buffer minimization allocate each buffer independently in memory (we will refer to this as the non-shared model), in this paper we try to share buffers efficiently by [...] determine the most efficient combination for each test example. In section 11, we discuss possibilities for future work and conclude.
Related Work
Lifetime analysis techniques for sharing memory are well known in a number of contexts. The first is register allocation in traditional compilers; given a scheduled dataflow graph, register allocation techniques determine whether the variables in the graph can share registers by looking at their lifetimes. In its simplest form, this problem can be formulated as an interval graph coloring problem that has an elegant polynomial-time solution. However, the problem of scheduling the graph so that the overall register requirement is minimized is NP-hard [30]. Register allocation problems are made somewhat simpler because the variables in question all have the same size. The allocation problem becomes NP-complete if variables are of differing sizes, as, for example, in allocating arrays of different sizes to memory.
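For equal-sized variables, the polynomial-time interval coloring mentioned above can be sketched as a greedy pass over lifetimes sorted by start time. This is a generic illustration, not the allocator of any particular compiler; the half-open lifetime convention `[start, end)` and all names are assumptions made here.

```python
import heapq


def color_intervals(lifetimes):
    """Assign a register to every variable lifetime.

    lifetimes maps a variable name to a half-open interval (start, end).
    Returns (assignment, number_of_registers_used).
    """
    assignment = {}
    free = []      # min-heap of register ids that are currently free
    active = []    # min-heap of (end_time, register) for live variables
    next_reg = 0
    for name, (start, end) in sorted(lifetimes.items(), key=lambda kv: kv[1][0]):
        # Release registers whose variable died at or before this start time.
        while active and active[0][0] <= start:
            _, reg = heapq.heappop(active)
            heapq.heappush(free, reg)
        if free:
            reg = heapq.heappop(free)
        else:
            reg = next_reg
            next_reg += 1
        assignment[name] = reg
        heapq.heappush(active, (end, reg))
    return assignment, next_reg


# Hypothetical lifetimes: at most two variables are live at any time.
example_lifetimes = {"a": (0, 3), "b": (1, 4), "c": (3, 6), "d": (4, 7)}
assignment, num_registers = color_intervals(example_lifetimes)
```

In this example, two registers suffice because `c` can reuse the register of `a`, and `d` that of `b`.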
Fabri [8] studies the more general problem of overlaying arrays and strings in imperative languages. Fabri models array lifetimes as weighted interval graphs and uses coloring heuristics for generating memory allocations. She also studies transformation techniques for lowering the overall memory cost; these techniques attempt to minimize the lower and upper bounds on the extended chromatic number of the weighted interval graph. Transformation techniques found to be effective for reducing overall storage include the renaming transformation, whereby judicious renaming of aggregate variables fragments lifetimes and allows greater opportunities for overlaying; recalculation, where some variables are recalculated when needed rather than held in storage; code motion techniques that reorder the program statements in a semantics-preserving manner; and loop splitting.
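When lifetimes carry different sizes, allocation is usually done heuristically. The sketch below shows a generic first-fit heuristic on weighted intervals; it is offered only as an illustration of the problem and is not claimed to be one of the specific heuristics studied in [8]. The tuple layout and example values are assumptions.

```python
def first_fit(arrays):
    """Place arrays in memory so that arrays with overlapping lifetimes
    never overlap in address space.

    arrays is a list of (name, start, end, size) with half-open lifetimes.
    Returns (offsets, total_memory).
    """
    placed = []    # (offset, size, start, end) of arrays already placed
    offsets = {}
    for name, start, end, size in sorted(arrays, key=lambda a: a[1]):
        # Address ranges held by arrays that are live at the same time.
        busy = sorted(
            (o, o + s) for o, s, st, en in placed if st < end and start < en
        )
        offset = 0
        for lo, hi in busy:
            if offset + size <= lo:
                break              # the gap before this range is big enough
            offset = max(offset, hi)
        placed.append((offset, size, start, end))
        offsets[name] = offset
    total = max((o + s for o, s, _, _ in placed), default=0)
    return offsets, total


# Hypothetical arrays: (name, start, end, size).
example_arrays = [("a", 0, 4, 10), ("b", 2, 6, 20), ("c", 4, 8, 10), ("d", 6, 9, 15)]
offsets, total_memory = first_fit(example_arrays)
```

Here the four arrays need 55 locations if each is stored separately, but only 30 when lifetimes are taken into account.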
There are important differences between Fabri's work and ours. Fabri considers general imperative language code, and hence has to solve allocation problems for a more general class of interval graphs. We apply our techniques to SDF graphs, and because the SDF model of computation is restricted, the interval graphs in our problem have a more restricted structure, enabling us to use simpler allocation heuristics more effectively. For instance, the liveness profile of an array in our framework is always periodic (in a certain technical sense), and these periods can be deduced from the SDF graph and the specific class of schedules that we use, whereas in a general setting, liveness profiles may not be periodic, and deducing these profiles can be expensive algorithmically. Also, the SDF model and SDF schedules present unique problems for deducing the liveness profiles, and thus the interval graphs, in an efficient manner; these techniques have not been presented or studied in any previous work. We show that for the important class of single appearance schedules, these deductions can be made in polynomial time in the size of the SDF graph. Also, previous work has not addressed the relationship between loop fusion and the extended chromatic number. Finally, even though certain subsets of the techniques we present in this paper have been studied in the compilers community, to date they have not been used in block-diagram compilers. An additional contribution of this paper is to show that many of the techniques used in traditional compilers can be specialized and applied fruitfully in block-diagram-based DSP programming environments.
Vanhoof, Bolsens and De Man have observed that in general, the full address space of an array does not always contain live data [33]. Thus, they define an "address reference window" as the maximum distance between any two live data elements throughout the lifetime of an array, and fold multiple array elements into a single window element using a modulo operation in the address calculation. This concept is similar to our use of the maximum number of live tokens as the size of each individual SDF buffer. The number of logically distinct memory elements in a buffer for an edge $e$ is equal to the total number of samples exchanged on $e$ in one schedule period, $\mathit{TNSE}(e)$, which can be much larger than the maximum number of live tokens that reside on $e$ simultaneously [4].
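The folding idea can be sketched as follows: if no two live elements are ever more than a fixed distance apart, a logically large array fits in a small physical window addressed modulo the window size. The class name, the window size, and the access pattern below are assumptions chosen for illustration, not details from [33].

```python
class WindowedArray:
    """Stores a logically large array in `window` physical cells by
    folding logical indices with a modulo operation."""

    def __init__(self, window):
        self.window = window
        self.cells = [None] * window

    def write(self, index, value):
        self.cells[index % self.window] = value

    def read(self, index):
        return self.cells[index % self.window]


# Hypothetical access pattern: element i is written at step i and read
# for the last time at step i + 3, so at most 4 elements are live at once.
buf = WindowedArray(window=4)
results = []
for i in range(12):
    buf.write(i, i * i)
    if i >= 3:
        results.append(buf.read(i - 3))
```

Twelve logical elements pass through four physical cells without any live value being overwritten.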
In a synthesis tool called ATOMIUM, De Greef, Catthoor, and De Man have developed lifetime analysis and memory allocation techniques for single-assignment, static control-flow specifications that involve explicit looping constructs, such as for loops [11]. This is in contrast to SDF, in which all iteration is specified implicitly, and the use of looping is left entirely up to the compiler. However, once a single appearance schedule is specified, we have a set of nested loops. Thus, some relationships can be observed between the lifetime analysis techniques we develop for single appearance schedules and those of ATOMIUM. In particular, the class of specifications addressed by ATOMIUM exhibits more general and less predictable array accessing behavior than the buffer access patterns that emerge from SDF-based single appearance schedules. We exploit the increased predictability of single appearance schedules in our work using novel lifetime analysis formulations that are derived from a tree-based schedule representation. This results in thorough optimization with significantly more efficient (lower complexity) algorithms. Furthermore, through our in-depth focus on the restricted, but useful, class of SDF-based single appearance schedules, we expose fundamental relationships between scheduling and buffer sharing in multirate signal processing systems.
Ritz et al. [29] give an approach to minimizing buffer memory that operates only on flat SASs, since buffer memory reduction is tertiary to their goal of reducing code size and context-switch overhead (defined roughly as the rate at which the schedule switches between various actors). We do not take context-switch overhead into account in our scheduling techniques because our primary concern is memory minimization; off-chip memory is often a bottleneck in embedded systems implementations and is better avoided.
Flat SASs have a smaller context-switch overhead than nested schedules do, especially if the code-generation strategy used is that of procedure calls. Ritz et al. formulate the problem of minimizing buffer memory on flat SASs as a non-linear integer programming problem that chooses the appropriate topological sort and proceeds to allocate based on that schedule. This formulation does not lead to any polynomial-time algorithms, and can lead to much more expensive memory allocations than those obtainable through nested schedules. For example, in Section 10, we show that on a satellite receiver example, Ritz's technique yields an allocation that is more than 100% larger than the allocation achieved by techniques developed in this paper. However, the techniques in this paper do not take context-switch overhead into account (since we assume inline code generation, the effect of context switches is arguably less significant), and are thus able to operate on a much larger class of SASs than the class of flat SASs. Also, the techniques in this paper are all provably polynomial-time algorithms.
Goddard and Jeffay use a dynamic scheduling strategy for reducing memory requirements of SDF graphs, and develop an earliest-deadline-first (EDF) type of dynamic scheduler [10] . However, experiments in the Ptolemy system have shown that dynamic scheduling can be more than twice as slow as static schedules [36] ; hence, for many embedded applications, this penalty on throughput might be intolerable.
Sung et al. consider expanding the SAS to allow two or more appearances of some actors if the buffering memory can be reduced [31]. They give heuristic techniques for performing this expansion and show that the buffering can be reduced significantly by allowing an actor to appear more than once. This technique is useful since it allows one to trade off buffering memory against code size in a systematic way.
SASs will give the least code size only if each actor in the schedule is distinct and has a distinct code block that implements its functionality. In reality, however, many actors in the graph will be different instantiations of the same basic actor, perhaps with different parameters. In this case, inline code generated from an SAS is not necessarily code-size optimal, since the different instantiations of a single actor could all share the same code [31]. Hence, it might be profitable to implement procedure calls instead of inline code for the various instantiations, so that code can be shared. The procedure call would pass the appropriate parameters. A study of this optimization is done in [31], where the authors formulate precise metrics that can be used to determine the gain or loss from implementing code sharing compared to the overhead of procedure calls.

Ade has developed lower bounds on memory requirements of SDF specifications, assuming that each buffer is assigned to separate storage [1]. Exploring the incorporation of buffer sharing opportunities into this analysis is a useful direction for further investigation.
Notation and background

As already mentioned, dataflow is a natural model of computation to use as the underlying model for a block-diagram language for designing DSP systems. The blocks in the language correspond to actors in a dataflow graph, and the connections correspond to directed edges between the actors. These edges not only represent communication channels, conceptually implemented as FIFO queues, but also establish precedence constraints. An actor fires in a dataflow graph by removing tokens from its input edges and producing tokens on its output edges. The stream of tokens produced this way corresponds naturally to a discrete-time signal in a DSP system. In this paper, we consider a subset of dataflow called synchronous dataflow (SDF) [17]. In SDF, each actor produces and consumes a fixed number of tokens, and these numbers are known at compile time. In addition, each edge has a fixed initial number of tokens, called delays. A schedule is a sequence of actor firings. We compile an SDF graph by first constructing a valid schedule, that is, a finite schedule that fires each actor at least once, does not deadlock, and produces no net change in the number of tokens queued on each edge. Corresponding to each actor in the schedule, we instantiate a code block or procedure call that is obtained from a library of predefined actors. The resulting [...]

We define $\mathit{TNSE}(e)$ to be the total number of samples exchanged on edge $e$ by actor $\mathit{src}(e)$; i.e., $\mathit{TNSE}(e) = q(\mathit{src}(e)) \cdot \mathit{prd}(e)$, where $q(\mathit{src}(e))$ is the number of times the source actor of $e$ fires in the schedule and $\mathit{prd}(e)$ is the number of tokens it produces on $e$ per firing.
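The "no net change in the number of tokens" condition on a valid schedule fixes the relative firing counts of the actors. The sketch below computes the smallest such counts for a connected, consistent SDF graph and, from them, the total number of samples exchanged on each edge. The function name, the edge tuple layout `(src, snk, prd, cns)`, and the example rates are assumptions made for this illustration.

```python
from fractions import Fraction
from functools import reduce
from math import gcd


def repetitions(actors, edges):
    """Smallest positive firing counts q with q[src] * prd == q[snk] * cns
    on every edge. Assumes the graph is connected and consistent.

    edges is a list of (src, snk, prd, cns).
    """
    rate = {actors[0]: Fraction(1)}
    changed = True
    while changed:                      # propagate relative firing rates
        changed = False
        for src, snk, prd, cns in edges:
            if src in rate and snk not in rate:
                rate[snk] = rate[src] * prd / cns
                changed = True
            elif snk in rate and src not in rate:
                rate[src] = rate[snk] * cns / prd
                changed = True
    # Scale the rational rates to the smallest integer vector.
    scale = reduce(lambda a, b: a * b // gcd(a, b),
                   (r.denominator for r in rate.values()), 1)
    q = {a: int(r * scale) for a, r in rate.items()}
    common = reduce(gcd, q.values())
    q = {a: n // common for a, n in q.items()}
    for src, snk, prd, cns in edges:
        assert q[src] * prd == q[snk] * cns, "inconsistent SDF graph"
    return q


# Hypothetical chain: A produces 2 / B consumes 3, B produces 1 / C consumes 2.
example_edges = [("A", "B", 2, 3), ("B", "C", 1, 2)]
q = repetitions(["A", "B", "C"], example_edges)
# Total samples exchanged per edge: source firings times tokens per firing.
tnse = {(src, snk): q[src] * prd for src, snk, prd, cns in example_edges}
```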
5 Constructing memory-efficient loop structures
In [4], single appearance schedules (SASs) are defined and motivated, and are shown to yield an optimally compact inline implementation of an SDF graph with regard to code size (neglecting the code size overhead associated with the loop control). An SAS is one where each actor appears only once when loop notation is used. If the SAS restriction is removed, a significant increase in code size can occur. This increase will manifest itself even if subroutine calls are used instead of inline code generation, because the length of a non-SAS can be exponential in the size of the graph, and there could be exponentially many subroutine calls. [...] There can be exponentially many ways of nesting loops in a flat SAS [4].
Scheduling can also have a significant impact on the amount of memory required to implement the buffers on the edges in an SDF graph. For example, in figure 2(b), the buffering requirements for the four schedules, assuming that one separate buffer is implemented for each edge, are 50, 90, 130, and 80 respectively. As can be seen, SASs can have significantly higher buffer requirements than a schedule optimized purely for buffer memory. For example, the non-SAS for the SDF graph of figure 2 has a buffer requirement of 50; the three possible SASs for the graph have requirements of 130, 90, and 100 respectively. We give priority to code-size minimization over buffer memory minimization; justification for this may be found in [4] [24]. Hence, the problem we tackle is one of finding buffer-memory-optimal SASs, since this will give us the best schedule in terms of buffer-memory consumption amongst the schedules that have minimum code size.
R-Schedules and the Schedule Tree
In order to extract buffer lifetimes efficiently, we develop a useful representation of the nested SAS, called the schedule tree. The lifetime extraction algorithms of section 8 can then be formulated as tree-traversing algorithms for determining the various required parameters.
As shown in [24], it is always possible to represent any SAS for an acyclic graph as

$(i_L S_L)(i_R S_R)$, (EQ 1)

where $S_L$ and $S_R$ are SASs for the subgraphs consisting of the actors in $X_L$ and in $X_R$, and $i_L$ and $i_R$ are iteration counts for iterating these schedules. In other words, the graph can be partitioned into a left subset $X_L$ and a right subset $X_R$ so that the schedule for the graph can be represented as in equation 1. SASs having this form (in conjunction with some additional technical restrictions on the loop iteration counts) at all levels of the loop hierarchy are called R-schedules [24].
Given an R-schedule, we can represent it naturally as a binary tree; we call this the schedule tree.
The internal nodes of this tree will contain the iteration count of the subschedule rooted at that node. The leaf nodes (nodes that have no children) will contain the actors, along with their residual iteration counts.
If a node $v$ has children, we refer to the left child and right child of $v$ by $\mathit{left}(v)$ and $\mathit{right}(v)$. For a node $v$, the parent is referred to as $\mathit{parent}(v)$. Figure 3 shows schedule trees for the SASs in figure 2(b).
Note that a schedule tree is not unique: if there are iteration counts of 1, then the split into left and right subgraphs can be made at multiple places. In figure 3, the schedule tree for the flat SAS in figure 2(b)(3) is based on one such split; we could equally have taken another. The choice of split does not affect any of the computations we perform using the tree.
If $v$ is a node of the schedule tree, then $\mathit{subtree}(v)$ is the (sub)tree rooted at node $v$. If $T$ is a subtree, define $\mathit{root}(T)$ to be the root node of $T$. The function $\mathit{loop}: V \to \mathbb{Z}^+$, where $V$ is the set of nodes in the tree and $\mathbb{Z}^+$ is the set of positive integers, returns the iteration count at that nesting level for a non-leaf node, and returns 1 for a leaf node.
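A schedule tree of this kind can be sketched as a small binary tree type. The field names (`loop`, `actor`, `reps`, `left`, `right`) and the example schedule are assumptions made for this illustration; in particular, the residual iteration count of a leaf is kept in a separate `reps` field here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    loop: int = 1                     # iteration count of an internal node
    actor: Optional[str] = None       # set only on leaf nodes
    reps: int = 1                     # residual iteration count of a leaf
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def firing_sequence(node):
    """Expand the (sub)schedule rooted at `node` into a list of firings."""
    if node.actor is not None:
        return [node.actor] * node.reps
    body = firing_sequence(node.left) + firing_sequence(node.right)
    return body * node.loop


# Hypothetical R-schedule (2 (3A) B)(4C): the root has iteration count 1,
# its left child is the loop (2 (3A) B), and its right child is the leaf (4C).
inner = Node(loop=2, left=Node(actor="A", reps=3), right=Node(actor="B"))
root = Node(loop=1, left=inner, right=Node(actor="C", reps=4))
sequence = "".join(firing_sequence(root))
```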
Generating single appearance schedules
We have shown [4] that for an arbitrary acyclic graph, an SAS can be derived from a topological sort of the graph. To be precise, the class of SASs for a delayless acyclic graph can be generated by enumerating the topological sorts of the graph. We use the lexical ordering given by each topological sort to derive a flat SAS (this is a schedule of the form $(q(x_1) x_1)(q(x_2) x_2) \cdots (q(x_n) x_n)$, where the $x_i$ are actors and the $q(x_i)$ are their repetitions; the lexical order is the order given by the topological sort of the graph). This lexical ordering then leads to a set of nesting hierarchies; the complete set of lexical orders, together with the set of nesting hierarchies for each lexical order, constitutes the entire set of SASs for the graph. Hence, we need a method for generating the topological sort. As we have shown [4], the general problem of constructing buffer-optimal SASs under both models of buffering, namely the coarse shared buffer model and the non-shared model, is NP-complete. Thus, the methods for generating topological sorts are necessarily heuristic, and not optimal in general.
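The correspondence between topological sorts and flat SASs can be sketched as follows. The enumeration below is exhaustive and therefore exponential in general, which is why heuristics are used in practice; the example graph and its repetition counts are hypothetical and simply assumed to be given.

```python
def topological_sorts(actors, edges):
    """Yield every topological sort of an acyclic graph.

    edges is a list of (src, snk) pairs.
    """
    preds = {a: set() for a in actors}
    for src, snk in edges:
        preds[snk].add(src)
    order = []

    def extend():
        if len(order) == len(actors):
            yield tuple(order)
            return
        for a in actors:
            # An actor may be appended once all its predecessors are placed.
            if a not in order and preds[a] <= set(order):
                order.append(a)
                yield from extend()
                order.pop()

    yield from extend()


def flat_sas(order, q):
    """Flat single appearance schedule for a lexical order and repetitions q."""
    return "".join(f"({q[a]}{a})" for a in order)


# Hypothetical diamond-shaped graph with assumed repetition counts.
example_actors = ["A", "B", "C", "D"]
example_graph = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")]
example_q = {"A": 1, "B": 2, "C": 3, "D": 6}
all_orders = list(topological_sorts(example_actors, example_graph))
flat_schedules = [flat_sas(order, example_q) for order in all_orders]
```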
We have developed two methods for generating SASs optimized for non-shared buffer memory for acyclic graphs [4]: a bottom-up method based on clustering called APGAN, and a top-down method based on graph partitioning called RPMC. The heuristic rule of thumb used in RPMC is to find a cut of the graph such that all edges cross in the same direction (enabling us to recursively schedule each half without introducing deadlock), and such that the size of the buffers crossing the cut is minimized. While this rule is intuitively attractive for the non-shared buffer model, it is also attractive for the shared model, as will be shown.
The APGAN technique is based on clustering together adjacent nodes that communicate heavily, so that these nodes will end up in the innermost loops of the loop hierarchy. For a broad subclass of SDF systems, APGAN has been shown to construct SASs that provably minimize the non-shared buffer memory metric over all SASs [4].
An arbitrary SDF graph may not necessarily have an SAS; Bhattacharyya et al. developed necessary and sufficient conditions for the existence of SASs for SDF graphs [5]. They developed an algorithm for generating single appearance schedules that hierarchically decomposes the SDF graph into strongly connected components (SCCs) and recursively schedules each SCC. At each stage, the SCC decomposition results in an acyclic component graph that has an SAS, as mentioned, and can be scheduled using any algorithm for generating SASs for acyclic graphs. Hence, the techniques we develop in this paper can all be incorporated into the framework of [5] and can handle arbitrary SDF graphs.
Efficient loop fusion for minimizing buffer memory
Once we have a topological order generated by APGAN or RPMC, we have a flat single appearance schedule corresponding to this topological order. The next step is to perform loop fusion on the flat SAS to reduce buffering memory. To do that, we first define the shared buffer model.
The shared buffer model
Since we are interested in sharing buffers, we have to first determine an appropriate model for buffer lifetimes and the manner in which they can be shared. First, we need a definition for describing token traffic on the edges: given an SDF graph $G$, a valid schedule $S$, and an edge $e$ in $G$, let $\mathit{max\_tokens}(e, S)$ denote the maximum number of tokens that are queued on $e$ during an execution of $S$.
For example, if for fig. 2 , and , then and .
Buffer sharing for looped schedules can be done at many different levels of "granularity." At the finest level of granularity, we can model the buffer on the edge as it grows over the execution of the loop, and then falls as the sink actor on that edge consumes the data. The maximum number of live tokens would give the lower bound on how much memory would be required. An alternative model would be at the coarsest level, where we assume that once the source actor for an edge starts writing tokens, all of the tokens the edge will hold immediately become live, and stay live until the number of tokens on the edge becomes zero. In other words, even if there is one live token on the edge, we assume that an array of the full buffer size has to be allocated and maintained until there are no live tokens. Figure 4 shows these two extremes pictorially for the buffer on a single edge. In the fine-grained case, each firing of the source actor results in the buffer expanding by 5, and each firing of the sink actor results in the buffer contracting by 2. In the coarse-grained case, the buffer expands to 30 immediately, as all 6 firings of the source actor are treated as one composite firing, and then shrinks to 0 after all 15 firings of the sink actor have occurred. Of course, there are a number of granularities within these extremes, based on how many levels of loop nests we consider; figure 4 shows the in-between alternative for this example, where only the outer loop of iteration count 2 is considered, meaning that the three firings of the source actor in the inner loop are treated as one composite firing. The buffer, in this case, expands by 15 tokens on each composite firing of the source actor, and contracts correspondingly on each composite firing of the sink actor.
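The two extremes can be sketched by tracing the token count on a single edge. The actor names `X` and `Y` and the flat firing order below are assumptions; only the rates (5 produced, 2 consumed) and firing counts (6 and 15) come from the example above.

```python
def token_trace(firings, source, sink, prd, cns):
    """Number of tokens on one edge after each firing (fine-grained profile)."""
    tokens, trace = 0, []
    for actor in firings:
        if actor == source:
            tokens += prd
        elif actor == sink:
            tokens -= cns
            assert tokens >= 0, "sink fired without enough tokens"
        trace.append(tokens)
    return trace


# Assumed flat order: all 6 source firings, then all 15 sink firings.
firings = ["X"] * 6 + ["Y"] * 15
fine = token_trace(firings, "X", "Y", prd=5, cns=2)
peak = max(fine)

# Coarse-grained profile: the whole array is live from the first write
# until the token count returns to zero.
coarse = [peak if count > 0 else 0 for count in fine]

# For comparison, a hypothetical interleaved order (3 (2X)(5Y)) keeps the
# fine-grained peak much lower.
interleaved = token_trace((["X"] * 2 + ["Y"] * 5) * 3, "X", "Y", prd=5, cns=2)
```

The fine-grained profile climbs in steps of 5 to 30 and falls in steps of 2, while the coarse profile is a rectangle of height 30 over the same interval.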
In this paper, we assume the coarsest level of buffer modeling. The finer levels, although requiring less memory theoretically, may be practically infeasible to achieve due to the increased complexity of the algorithms. To see that the complexity might significantly increase, notice that the finest level requires modeling to be done at the granularity of a single firing of an actor in the schedule. The number of firings in a periodic SDF schedule can be exponential in the size of the SDF graph, and can grow quickly. Of course, there may be more clever ways of representing the growth and shrinkage, but presently the only known ways are equivalent to stepping through a schedule of that size. In contrast, we will show that the coarsest level model can be generated in time polynomial in the number of nodes and edges in the SDF graph.
One weakness of the coarse buffer sharing model is the assumption that all output buffers of an actor are live when the actor begins execution, and all input buffers are live until the actor finishes execution. This means that an output buffer of an actor can never share memory with an input buffer of that actor under the model used in this paper. In reality, this may be an overly restrictive assumption; for instance, an addition actor that adds two quantities will always produce its output after it has consumed its inputs. Hence, the output result can occupy the space occupied by one of the inputs. We have formalized this idea, and have devised another technique called buffer merging [23] that merges input and output buffers by algebraically determining precisely how many output tokens are simultaneously live with input tokens (via a formalism called the consumed-before-produced (CBP) parameter [3]). The buffer merging technique is similar in spirit to the array merging technique presented by De Greef et al. [11]; however, it is more efficient in some ways, and also exploits distinguishing characteristics of SDF schedules in a novel way. We have shown that the buffer merging technique is highly complementary to the approach taken in this paper, and is, in effect, a dual of the lifetime analysis approach, because buffer merging works at the level of a single input/output edge pair, whereas the lifetime analysis approach of this paper works on a global level where the buffering efficiency results from the topology of the graph and the structure of the schedule [22, 23].
Initial tokens on edges can be handled very naturally in our coarse shared buffer model. An edge that has an initial token will have a buffer that is live right at the beginning of the schedule. It may be live for the entire duration of the schedule if the buffer never has zero tokens. If the buffer does have zero tokens at some point, then the buffer would not be live for the portion of the schedule where the buffer has zero tokens.
In order to reason about the "start time" and "stop time" of a buffer lifetime, we use the following abstract notion of time: each invocation of a leaf node in the schedule tree is considered to be one schedule step, and corresponds to one unit of time. For example, a looped schedule whose leaf nodes are invoked four times in total would be considered to take 4 time steps, even if one of its leaves is a schedule loop that fires its actor several times: since such a schedule loop is a leaf node in the schedule tree, it is considered to take one schedule step. The first leaf invocation would take place at time 0, and the last would begin at time 3 and end at time 4. Note that this notion of time is not used to judge run-time performance of the schedule in terms of throughput; it is simply used to define the lifetimes for purposes of lifetime analysis.

[...] equation, so that the start time of the $k$th occurrence of the buffer can be computed algebraically. Finally, the width of the buffer is defined to be the full array size assumed by the coarse-grained model, as mentioned already.
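This notion of time can be sketched by listing leaf invocations in order and using list positions as start times. The nested-tuple tree encoding and the example schedule `(2 A (3B))` are assumptions made for this illustration.

```python
def leaf_invocations(node):
    """Leaves of a schedule tree in invocation order.

    A node is either ("leaf", actor, reps) or ("loop", count, left, right).
    Each leaf invocation counts as one schedule step, whatever its `reps`.
    """
    if node[0] == "leaf":
        return [node[1]]
    _, count, left, right = node
    return (leaf_invocations(left) + leaf_invocations(right)) * count


# Hypothetical schedule (2 A (3B)): the inner loop (3B) is a single leaf.
tree = ("loop", 2, ("leaf", "A", 1), ("leaf", "B", 3))
timeline = leaf_invocations(tree)

# Start times of every invocation of each leaf; an invocation that starts
# at time t ends at time t + 1.
start_times = {}
for t, actor in enumerate(timeline):
    start_times.setdefault(actor, []).append(t)
```

In this example, the schedule takes 4 time steps although it contains 8 actor firings; the last invocation of `(3B)` begins at time 3 and ends at time 4.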
Loop fusion under the non-shared buffer model-DPPO
Once the topological ordering of the nodes in the SAS has been determined, we have to determine the order in which loops should be nested, since, as shown in section 5 and figure 2, the loop hierarchy has a significant impact on buffer memory usage. In [24] and [4], we developed a post-processing technique based on dynamic programming (called dynamic programming post optimization, or DPPO for short) that generates an optimal loop hierarchy for any given SAS. The cost metric used for this approach assumes that each edge is implemented as a separate buffer. We briefly review that technique here because the technique we describe for generating good loop hierarchies under the shared model is similar.
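For the non-shared model, we define the non-shared buffer memory requirement of a schedule $S$ by

$\mathit{buffer\_memory}(S) = \sum_{e \in E} \mathit{max\_tokens}(e, S)$, (EQ 2)

where the summation is over all edges in $G$, and $\mathit{max\_tokens}(e, S)$ is the maximum number of tokens queued on edge $e$ during an execution of $S$.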
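The non-shared metric can be evaluated by simulating a schedule and recording the peak token count on each edge. The sketch below does this for a hypothetical three-actor chain (not the graph of figure 2); the schedule encoding and the function names are assumptions.

```python
def expand(schedule):
    """Expand a looped schedule into a flat list of firings.

    Items are either an actor name (str) or a pair (count, body).
    """
    out = []
    for item in schedule:
        if isinstance(item, str):
            out.append(item)
        else:
            count, body = item
            out.extend(expand(body) * count)
    return out


def buffer_memory(schedule, edges):
    """Sum over edges of the peak number of queued tokens.

    edges is a list of (src, snk, prd, cns) for a delayless graph.
    """
    tokens = {(s, t): 0 for s, t, _, _ in edges}
    peak = dict(tokens)
    for actor in expand(schedule):
        for s, t, prd, cns in edges:          # consume inputs first
            if t == actor:
                tokens[(s, t)] -= cns
                assert tokens[(s, t)] >= 0, "schedule is not valid"
        for s, t, prd, cns in edges:          # then produce outputs
            if s == actor:
                tokens[(s, t)] += prd
                peak[(s, t)] = max(peak[(s, t)], tokens[(s, t)])
    return sum(peak.values()), peak


# Hypothetical chain A -> B -> C; each edge has 20 produced, 10 consumed.
chain = [("A", "B", 20, 10), ("B", "C", 20, 10)]
flat_cost, _ = buffer_memory(["A", (2, ["B"]), (4, ["C"])], chain)
nested_cost, _ = buffer_memory(["A", (2, ["B", (2, ["C"])])], chain)
```

For this chain, the flat schedule A(2B)(4C) needs 60 locations, while the nested schedule A(2B(2C)) needs 40, which illustrates why the loop hierarchy matters.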
The lexical ordering of an SAS $S$, denoted $\mathit{lexorder}(S)$, is the sequence of actors $(x_1, x_2, \ldots, x_n)$ such that each $x_i$ is preceded lexically by $x_1, x_2, \ldots, x_{i-1}$. Given an SDF graph, an order-optimal schedule is an SAS that has minimum non-shared buffer memory requirement from among the valid SASs that have a given lexical ordering.
One of the central observations that allows the development of an efficient algorithm for optimizing buffer memory under the non-shared model can be stated intuitively in the following way: fusing adjacent loops together by taking out their common factor (and creating an outer loop that has the common factor as its iteration count) not only gives us a valid schedule but also gives us a schedule whose non-shared buffer memory requirement is less than or equal to that of the schedule with the set of separate loops. Formally, we can state it as [24]:
Suppose that is a valid schedule for an SDF graph , and suppose that is a schedule loop in of any nesting depth such that ...

Informally, the DPPO algorithm uses a divide-and-conquer approach that looks at a chain of actors in the schedule, and examines each point in the chain to determine the buffer cost of breaking the chain there and fusing the "left" and "right" parts. It then picks the best point at which to split the chain, records it, and considers bigger chains. Because this problem can be shown to have the optimal substructure property, meaning that optimal solutions for the "left" and "right" parts lead to an optimal solution to the whole chain, DPPO is an optimal algorithm for the non-shared buffer model [24].
Formally, suppose that is a connected, delayless, acyclic SDF graph, is valid SAS for ,
, and is an order-optimal schedule for . If contains at least two actors, then it can be shown [24] that there exists a valid schedule of the form such that and for some , and .
Furthermore, from the order-optimality of , and are also order-optimal (otherwise, we can show that we could replace or by order-equivalent versions without affecting the split costs).
From this observation, we can efficiently compute an order-optimal schedule for if we are given an order-optimal schedule for the subgraph corresponding to each proper subsequence of such that (1) . and (2). or . Given these schedules, an order-optimal schedule for can be derived from a value of , that minimizes , where is the set of edges that "cross the split" if the schedule is split between and . This information is then used to determine an optimal split and the minimum buffer memory requirement for each three actor subsequence ; the minimum requirements for the two-and three-actor subsequences are used to determine the optimal split and minimum buffer memory requirement for each four actor subsequence; and so on, until an optimal split is derived for the original -actor sequence
. An order-optimal schedule can easily be constructed from a recursive, top-down traversal of the optimal splits [24] .
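The bottom-up computation over longer and longer subsequences can be sketched as follows. This is an illustration under stated assumptions, not the algorithm of [24] verbatim: actors are identified by their position in the lexical ordering, `q` holds their repetition counts, each edge is a hypothetical triple `(u, v, tnse)` with `u < v`, and the cost of an edge crossing a split of the subsequence `i..j` is taken to be its TNSE divided by the gcd of the repetition counts in `i..j`. The `combine` parameter is also an assumption; it defaults to a sum for the non-shared model, and passing `max` mimics the shared formulation discussed in the next section.

```python
from functools import reduce
from math import gcd

def dppo(q, edges, combine=lambda left, right: left + right):
    """Return (cost, split) for the actor sequence 0..len(q)-1.

    split[i][j] is the last index of the left part of the best split
    of the subsequence i..j (None for single actors).
    """
    n = len(q)
    best = [[0] * n for _ in range(n)]
    split = [[None] * n for _ in range(n)]
    for length in range(2, n + 1):            # two-actor chains first, then longer
        for i in range(n - length + 1):
            j = i + length - 1
            g = reduce(gcd, q[i:j + 1])        # common factor of the fused loop
            candidates = []
            for k in range(i, j):              # split between actor k and k+1
                cross = sum(t for (u, v, t) in edges if i <= u <= k < v <= j)
                cost = combine(best[i][k], best[k + 1][j]) + cross // g
                candidates.append((cost, k))
            best[i][j], split[i][j] = min(candidates)
    return best[0][n - 1], split

# Chain A -> B -> C with q = (3, 2, 6) and TNSE = 6 on both edges.
chain_q = [3, 2, 6]
chain_edges = [(0, 1, 6), (1, 2, 6)]
total, split = dppo(chain_q, chain_edges)
print(total, split[0][2])  # 9 0
```

In this example the best top-level split is after the first actor, which corresponds to the loop hierarchy (3A)(2(B 3C)) with buffers of size 6 and 3. A recursive, top-down walk over `split` recovers the full nesting. The inner sum over edges is written for clarity, not efficiency.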
Loop fusion for the shared buffer model-SDPPO
Now that we have a model for sharing buffers, we develop an algorithm for organizing loops efficiently in an SAS such that the shared buffer cost is minimized. This algorithm is similar to the DPPO algorithm already described. Consider the following dynamic programming formulation:

$$\mathrm{bufmem}(S_{1,n}) = \min_{1 \le x \le n} \left\{ \max\left\{ \mathrm{bufmem}(S_{1,x}),\ \mathrm{bufmem}(S_{x+1,n}) \right\} + \sum_{e \in E_s} \mathrm{TNSE}(e) \right\}$$
(EQ 3)

where $E_s$ is the set of edges that cross the split.

... internal edges (that is, edges whose terminal points are all actors that are being merged). We perform loop fusion if there are internal edges, even though this might sometimes be suboptimal (if the reduction of the buffer sizes on the internal edges is less than the increase due to the overlap of the input and output buffers). Of course, we could attempt to compute this increase or decrease, but this would increase the complexity of the algorithm. We choose the simpler heuristic for deciding whether to fuse, and leave the exploration of more complex approaches to future work. We define this formulation of DPPO (equation 3 together with the heuristic of deciding when to perform loop fusion) as shared-DPPO, or SDPPO.
Note that the best schedule under the non-shared buffering model is not necessarily the best schedule under the shared model, as shown in figure 9. Finally, we can also see that RPMC is an attractive heuristic for the shared buffer model since the cut-crossing buffers will not be disjoint, and cannot be shared. Hence, it makes intuitive sense to drive the partitioning process by minimizing the size of these buffers, and this is what RPMC attempts to do.
Creating the interval instances from a single appearance schedule
Once the schedule and the loop hierarchy have been determined, the next step in the compilation process is to perform memory allocation. Even though the SDPPO algorithm gives us a number for the overall memory requirement, it is only an estimate since the SDPPO algorithm cannot determine whether that estimate can actually be achieved. The main difficulty is that packing a number of arrays of different sizes optimally is an NP-complete problem; hence, the optimal amount of memory required after packing cannot be determined until the packing has actually been performed.
The two main steps for memory allocation are to extract the buffer lifetimes, and then to perform allocation using those lifetimes. Extracting the lifetimes efficiently requires several algorithms for determining the durations and the start and stop times. These lifetimes could also be periodic; it would be desirable to represent the periodicity implicitly, without having to physically create an interval for each occurrence.
Hence, the lifetime extraction algorithms also have to model this periodicity efficiently. Given the lifetimes, the allocation step (packing arrays in other words) determines the physical location in memory where the buffer will reside.
We compute these parameters for the buffers from the schedule tree. However, note that a parameter computed for a buffer is a function of a pair of actors (that constitute the edge that the buffer is on). The schedule tree does not represent these edges directly, only the actors and structure of the nested loops.
Hence, we first compute these parameters for nodes in the schedule tree; these parameters will represent the start time, stop time, and durations of the various nested loops. We can then deduce the buffer lifetimes from the computed quantities for the nested loops.
We will use a running example to show these various algorithms; the example SDF graph is depicted in figure 10(a), with an SAS.
Computing the duration times of loop nests
The function (defined in section 5.1) is used in the computation of the duration times for all nodes (i.e., loop nests) in the schedule tree by depth-first search on the tree:
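A depth-first duration computation might look like the following sketch. The `Node` class and the attribute name `dur` are hypothetical. The sketch assumes that a leaf invocation costs one schedule step regardless of its loop factor, following the abstract notion of time introduced earlier, and that the duration of an internal node counts all iterations of that node's own loop.

```python
class Node:
    """Hypothetical schedule-tree node: a leaf names an actor, an internal
    node is a loop nest with an iteration count and ordered children."""
    def __init__(self, loop=1, actor=None, children=()):
        self.loop = loop
        self.actor = actor
        self.children = list(children)
        self.dur = None

def compute_durations(node):
    """Post-order traversal that annotates every node with its duration."""
    if not node.children:
        node.dur = 1   # one invocation of a leaf is one schedule step
    else:
        node.dur = node.loop * sum(compute_durations(c) for c in node.children)
    return node.dur

# Example nest 2(3A 5(B 2C)): inner nest 5 * 2 = 10 steps, root 2 * (1 + 10) = 22.
inner = Node(loop=5, children=[Node(actor="B"), Node(loop=2, actor="C")])
root = Node(loop=2, children=[Node(loop=3, actor="A"), inner])
print(compute_durations(root))  # 22
```

Each node is visited once, so the traversal is linear in the size of the schedule tree.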
Computing the start and stop times of loop nests
The next task is to compute the start and stop times for each nested loop. These times are defined as:
, , .
These times can also be computed using depth-first search (taking time ), as shown by the pseudocode in figure 11. The SAS from figure 10(b) is shown with the computed start and stop times in figure 12. The start time of a nested loop represents the first time the loop nest starts execution. The stop time represents the first time the loop nest finishes execution. For example, consider the edge in figure 10(a). The stop time computed for leaf node in the schedule tree is . This corresponds to the first time finishes execution: five firings of correspond to 15 "steps" in the measuring scheme described in section 7.1 (hence the stop time computed for the node marked "aef" in figure 12 is 15), and after this there is a firing of , and , giving 17 as the stop time for the first execution of as shown in figure 12.

[SAS of the running example: (3A E 2F) 2(3C 5(B 3D G H))]

[Figure 11: Pseudocode for the depth-first-search algorithm for computing start and stop times for all the nested loops. Surviving fragment: Procedure computeStartStopTimes(scheduleTree) { doComp(root,0) }]
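The depth-first computation of start and stop times can be sketched as below. This is one reading of the surviving fragment of figure 11 (a top-level procedure that calls `doComp(root, 0)`), not the original pseudocode. The `Node` class is hypothetical, and the example tree is only loosely modeled on the running example: in particular the loop factor 5 on the first nest is an assumption, chosen so that the nest spans 15 steps as the text states.

```python
class Node:
    """Hypothetical schedule-tree node; leaves have no children."""
    def __init__(self, label, loop=1, children=()):
        self.label = label
        self.loop = loop
        self.children = list(children)
        self.start = self.stop = None

def do_comp(node, t):
    """Annotate `node`, whose first execution begins at time t, and
    return the time at which that first execution finishes."""
    node.start = t
    if not node.children:
        node.stop = t + 1                    # a leaf invocation is one step
    else:
        cur = t
        for child in node.children:          # children run back to back
            cur = do_comp(child, cur)
        node.stop = t + node.loop * (cur - t)  # all iterations of this nest
    return node.stop

def compute_start_stop_times(root):
    do_comp(root, 0)

a, e, f = Node("3A", 3), Node("E"), Node("2F", 2)
aef = Node("aef", 5, [a, e, f])              # loop factor 5 is assumed
c, b = Node("3C", 3), Node("B")
inner = Node("bdgh", 5, [b, Node("3D", 3), Node("G"), Node("H")])
rest = Node("c-bdgh", 2, [c, inner])
root = Node("root", 1, [aef, rest])
compute_start_stop_times(root)
print((a.start, a.stop), aef.stop, b.stop)   # (0, 1) 15 17
```

Under these assumptions the sketch reproduces the values quoted in the text: the first leaf occupies [0,1), the first nest stops at 15, and the first execution of B stops at 17 after one step for the leaf 3C.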
Computing the start, stop, and durations for buffer lifetimes
Now we have to compute the lifetimes for the buffers from the parameters we have computed for the loop nests. We introduce some notation for nodes in the schedule tree first:
Definition 1:
A common ancestor of a pair of nodes is any node that contains the nodes as leaf nodes in the subtree rooted at .
Definition 2:
The least common ancestor of a pair of nodes is the first node (measured from the leaf nodes) that contains nodes as leaf nodes in the subtree rooted at .
Definition 3:
The greatest common ancestor of a pair of nodes is the last node that contains nodes as leaf nodes in the subtree rooted at such that all ancestors of have a loop value of unity.
Definition 4:
The common ancestor set of a pair of nodes is the set of all common ancestors of on the path from the least common ancestor to the greatest common ancestor.
The least and greatest common ancestors of a pair of leaf nodes correspond to the innermost and outermost loops that contain the actors corresponding to the leaf nodes. In figure 12, the common ancestor set for the leaf node pair is . The start time of the lifetime of a buffer on an edge is clearly the start time computed for the leaf node in the schedule tree corresponding to actor , since the firing of actor makes the buffer on edge live. The stop time of the buffer interval is the first time it stops being live. Note that an interval can be periodic and become live again later on. We are interested in the first time it stops being live since that quantity, along with the periodicity parameters, will completely characterize the interval. This stop time, however, is not simply the stop time of the leaf node corresponding to the sink actor; this is because the stop time computed for the leaf node represents the first time the corresponding sink actor finishes execution. However, the first time the sink actor finishes execution is not necessarily the first time all tokens in the buffer would have been consumed. For instance, consider the edge in figure 10(a). The stop time computed for leaf node in the schedule tree is . However, notice that buffer will not stop being live until all 10 firings of have occurred.

[Figure 12: the SAS (3A E 2F) 2(3C 5(B 3D G H)) annotated with the computed start and stop times, for example [0,1).]
Hence, we really need to compute the time when the last execution of the sink actor of the buffer takes place in the loop nest of interest. The loop nest of interest is the smallest loop nest containing both and since the total number of tokens consumed by in this loop nest has to equal the number produced by in that loop nest. However, the stop time computed for the node representing the smallest loop nest (that is, the least common ancestor in the schedule tree) includes the execution of all loop nests contained within it. We want to exclude the contribution to the stop time from all nests that follow the sink actor in the last execution of the loop nest of interest. Again using the example of figure 12 , the loop nest of interest for buffer is the root node since that is the least common ancestor. 
Computing the periodicity parameters of buffer lifetimes
Given a schedule, the buffer on each edge in the SDF graph has a particular lifetime profile. This profile can be periodic as shown in figure 14. The periodicity arises due to the nested loops. By periodic, we mean that the lifetime is fragmented in a deterministic, predictable manner. More precisely, the times during which the buffer is live can be described much more succinctly than by simply enumerating all the occurrences of the live portions. It is useful to keep track of this periodicity in certain cases since two buffers could have disjoint lifetimes that can be shared, as shown in figure 14 for buffers on edges and .
For a buffer on an edge in the SDF graph, let the common ancestor set be denoted as nodes , where is the least common ancestor, is the next ancestor, and so on. For example, for the buffer on edge in figure 12, the least and greatest common ancestors are the nodes marked and for the leaf node pair . The node is also in the common ancestor set, and is on the ...

For example, for the buffer in figure 14, which we denote , we have that ...

During storage allocation, we will have to determine whether, at a time , a buffer is live or not.
In essence, we have to determine whether the equation
where the are variables that range over has a solution in . A solution, if it exists, can be found easily because of the following property on the . Let the be sorted in increasing order. Then the following must hold:
Intuitively, the lemma states that the start time of a buffer due to the th outer loop has to be after all occurrences of start times for the buffer have taken place due to the th inner loop and all loops contained in the th loop. This is intuitively true because, since the loops are nested, the outer loop count increments only after the inner loops have counted through their entire range.
Proof: (by induction on ). Since , and , for all nodes in the schedule tree, we have that for the parent of , ...

Proof: ( direction): Since , we have for the largest where the two tuples differ, .
Since the last components of and are the same, we have since and by using lemma 1.
Again, letting be the highest index where and differ, we need to show that . Suppose not; suppose but . We have by lemma 1 again, contradicting our assumption that . QED.
An algorithm for computing buffer liveness at a particular time
Given lemma 1, equation 5 can be solved by the algorithm in figure 15 . The algorithm first subtracts the start time of the buffer since all computations can be made relative to the start time. It then simply determines the maximum factor for each , to determine the closest starting point of an occurrence of the periodic interval to time .
Claim 1:
at every stage in the algorithm.
Proof:
Note that for all . Hence . QED.
Claim 2:
The solution computed by the algorithm gives the starting point of the interval closest to .
Proof: Let be the tuple consisting of the . This means that any tuple gives an interval of starting time greater than . Indeed, suppose not and there is a tuple where . Then, for the largest index where the two tuples differ. Until the th step, the value of computed by the algorithm would be identical to the computation using the values from . When , the algorithm computes ... Suppose . Clearly, anything larger than will mean that , giving a start time greater than . If , then we cannot have since is the largest value the th component is allowed to take by definition. Hence, we cannot have , contradicting the assumption that . QED.
The last step of the algorithm checks whether to determine whether the interval with closest starting time less than or equal to is still alive.
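The liveness test can be sketched as follows. Because equation 5 and figure 15 did not survive, the sketch rests on an assumed form: a buffer is live at time `t` if `t = start + sum(k_i * a_i) + d` for integers `0 <= k_i < n_i` and `0 <= d < dur`, where the `a_i` are the periods contributed by the enclosing loops and the `n_i` are their iteration counts. It also assumes the nesting property of lemma 1 in the form that each period exceeds the total span of all occurrences due to the inner loops. The names `is_live`, `periods`, and `counts` are illustrative.

```python
def is_live(t, start, dur, periods, counts):
    """Greedy test of whether a periodic buffer lifetime covers time t."""
    rem = t - start                    # work relative to the first start time
    if rem < 0:
        return False
    # Outermost (largest) period first: take the largest admissible factor,
    # which moves to the closest occurrence starting at or before t.
    for period, count in sorted(zip(periods, counts), reverse=True):
        k = min(rem // period, count - 1)
        rem -= k * period
    return rem < dur                   # is that occurrence still alive?

# Inner loop: 3 occurrences, 4 steps apart; outer loop: 2 iterations, 20 apart.
live_times = [t for t in range(45) if is_live(t, 2, 2, [4, 20], [3, 2])]
print(live_times)
# [2, 3, 6, 7, 10, 11, 22, 23, 26, 27, 30, 31]
```

The loop runs once per enclosing loop nest, so the cost of one query is proportional to the depth of the common ancestor set (after sorting the periods, which can be done once per buffer).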
If is not live at time , we will need to determine when the next instance of its periodic interval will occur. This computation is needed to determine whether some other interval of a particular duration is completely disjoint from the set of intervals corresponding to ; that is, to determine whether some other interval can be fitted into the same location that might be assigned to.
[Figure 16: buffers b1 and b2 on a time axis T. The closest b1 interval before b2 stops before b2 starts; the closest b1 interval after b2 starts after that b2 interval finishes.]
In other words, the two intervals do not intersect if and only if the closest interval of that starts before the start time of finishes before the start time of AND the closest interval of that starts after the start time of starts after that interval of finishes ( figure 16(b) ). Notice that the lemma says that we do not have to consider other occurrences of the periodic interval to determine overlap; only the first occurrence.
Proof:
The forward direction is trivially true. The reverse direction can be established via a case analysis.
Let the two edges, on which buffers reside, be given by and . Since , the ordering of these actors in the schedule must be one of , , or . Clearly, the condition of equation 6 cannot be satisfied for the third order since is live the entire time that is. For the other two orders, we have to consider the different ways in which the loops could be nested. For each order, there are five distinct ways of nesting the loops. These five are, for the first order, the following: , , ,
. Note that we only consider the part of the overall schedule that contains these four actors (the subtree of the schedule tree rooted at the least common ancestor of these four nodes), and we ignore any other actors that appear in the order or nesting as they do not affect the properties of the particular buffers we are interested in. Figure 17 shows the buffer profiles for these five cases. As can be verified, equation 6 holds if and only if the intervals do not intersect. We can similarly verify that the lemma is true for the five nestings for the other order. QED.
Given this method for testing whether a periodic buffer is live at a given time, we can easily test whether two periodic buffers are disjoint, or whether they intersect. The test would take time in the worst case, where is the set of actors in the SDF graph. The reason is that an SAS that has a schedule
...
... figure 15 , , and the test takes time . However, on average, it is more likely that the schedule tree will have logarithmic depth; in such cases, the running time of the testing procedure will be . The next step is to allocate the various buffers to memory.
Dynamic Storage Allocation
Once we have all of the lifetimes, we have to do the actual assignment of the buffers to memory locations. This assignment problem is called dynamic storage allocation (DSA), and the problem is to arrange the different sized arrays in memory so that the total memory required at any time is minimized. The assignment to memory is assumed to have the following properties: a) an array is assigned to a contiguous block of memory, b) once assigned, an array may not be moved around, c) all occurrences of an array with a periodic lifetime profile are assigned to the same location in memory. Figure 18(a) depicts these properties. Of course, we could relax any of these restrictions and perhaps get smaller memory requirements, but it might come at the expense of other overheads (like moving arrays around if (b) were relaxed). We leave the investigation of these other allocation models to future work. Formally, DSA is defined as:
Definition 5: Let be the set of buffers. Let , the number of elements in . For each , is the time at which it becomes live, is the time at which it dies, and is the size of buffer . Note that the duration of a buffer is . Given the values for each , and an integer , is there an allocation of these buffers that requires total storage of units or less? By an allocation, we mean a function such that for each , and if two intervals and intersect (using the intersection test for periodic buffer lifetimes as described earlier) then or .
The "dynamic" in DSA refers to the fact that many times, the problem is on-line in nature: the allocation has to be performed as the intervals come and go. For SDF scheduling, the problem is not really "dynamic" since the lifetimes and sizes of all the arrays that need to be allocated are known at compile time; thus, the problem should perhaps be called static storage allocation. But we will use the term DSA since this is consistent with the literature.
Theorem 1: [9] DSA is NP-complete, even if all the sizes are either 1 or 2.
Some notation
An instance is a set of buffers. An enumerated instance is an instance with some ordering of the buffers. For an instance, we have associated with it a weighted intersection graph (WIG) where is the set of buffers, and is the set of edges. There is an edge between two buffers iff their lifetimes overlap in time. The graph is node-weighted by the sizes of the buffers. For any subset of nodes , we define the weight of , to be the sum of the sizes for all .
A clique is a subset of nodes such that there is an edge between every pair of nodes. The clique weight 
Heuristic for DSA
First fit (FF) is the well-known algorithm that performs allocation for an enumerated instance by assigning the smallest feasible location to each interval in the order they appear in the enumerated instance [13] . It does not reallocate intervals that have been allocated already, and it does not consider intervals not yet assigned. The pseudocode for this algorithm is shown in figure 19 . We refer the reader to a technical report [25] and references therein for a more detailed treatment of this very interesting DSA problem.
Briefly, the algorithm takes as input an enumerated instance. We tested two types of orderings for generating enumerated instances [25]: ordering by start times, and ordering by durations. It then builds the WIG using the routine buildIntersectionGraph. The WIG is built using the general test developed for determining intersection of possibly periodic buffers. The firstFit algorithm then examines the WIG for each buffer in turn: first it examines all nodes adjacent to that buffer in the WIG (i.e., buffers that intersect it). It collects the memory allocations of all the adjacent nodes that appear before the buffer in the enumeration. After sorting these allocations, it determines where the buffer can be allocated; in the worst case, the buffer has to be allocated at the end of all of the allocations because there are no regions big enough in between to accommodate it. After an allocation is determined, the next buffer in the enumeration is examined, until all buffers have been allocated.
Our study shows that in practice, firstfit is a good heuristic, and we use it in our compiler framework here. Our empirical study on random WIGs shows that ordering the buffers by durations gives the better results [25] . But, in our experiments in section 10, we will apply firstfit on both ordering by start times (abbreviated ffstart), and ordering by durations (ffdur).
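A compact sketch of first fit on an enumerated instance is given below. It is not the pseudocode of figure 19: the `Buffer` record, the function names, and the simple overlap test for continuous half-open lifetimes are all assumptions made for illustration. For periodic lifetimes the `intersects` argument would be replaced by the periodic intersection test described earlier.

```python
from collections import namedtuple

# Hypothetical buffer record with a continuous lifetime [start, stop).
Buffer = namedtuple("Buffer", "name start stop size")

def overlaps(x, y):
    """Intersection test for continuous (non-periodic) lifetimes."""
    return x.start < y.stop and y.start < x.stop

def first_fit(buffers, intersects=overlaps):
    """Give each buffer, in enumeration order, the lowest feasible address."""
    address = {}
    for i, buf in enumerate(buffers):
        # Allocations of already-placed buffers whose lifetimes intersect buf.
        taken = sorted((address[o.name], o.size)
                       for o in buffers[:i] if intersects(buf, o))
        candidate = 0
        for addr, size in taken:
            if candidate + buf.size <= addr:
                break                              # the gap below addr fits
            candidate = max(candidate, addr + size)
        address[buf.name] = candidate
    total = max((address[b.name] + b.size for b in buffers), default=0)
    return address, total

demo = [Buffer("b1", 0, 4, 3), Buffer("b2", 2, 6, 2),
        Buffer("b3", 4, 8, 3), Buffer("b4", 6, 10, 2)]
by_start = sorted(demo, key=lambda b: b.start)     # ordering by start times
address, total = first_fit(by_start)
print(address, total)
# {'b1': 0, 'b2': 3, 'b3': 0, 'b4': 3} 5
```

Here `b3` reuses the location of `b1` and `b4` reuses the location of `b2`, so 5 units suffice instead of the 10 needed without sharing. Ordering by durations only changes the sort key passed to `sorted`.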
In order to analyze the running time, we observe that , and for sparse SDF 
Computing the maximum clique weight
It is clear that the maximum clique weight is a lower bound on the chromatic number of a weighted interval graph. It is known that the chromatic number can be as much as 1.25 times the maximum clique weight for particular instances; however, it is not known whether 1.25 is a tight upper bound. The maximum clique weight is thus a good lower bound to compare the performance of an allocation strategy on a particular set of lifetimes. Given that the experiments on random instances in [25] show that ffdur comes within 7% on average of the maximum clique weight, in practice, the chromatic number is not much bigger than the maximum clique weight, certainly not as much as 1.25 times as big. Hence, we use the maximum clique weight for comparison purposes in our experiments in the next section.
While the maximum clique weight can be computed easily and exactly for an instance without fragmented lifetimes, computing it for instances with fragmented (but periodic) buffer lifetimes is more difficult. Consider the case where all intervals are continuous (i.e., not fragmented). Let be the set of all times (i.e., schedule steps) where there is maximum overlap of the intervals; that is, where the overlap amount is equal to the maximum clique weight. It is easy to see that must contain the start time of some interval. Hence, the maximum clique weight can be computed easily by sorting the intervals by their starting times, and determining the overlap at each starting time.
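For continuous intervals, the observation that the maximum overlap is attained at some start time leads to the following sketch. The tuple layout `(start, stop, size)` with half-open lifetimes and the function name are assumptions. The quadratic scan is chosen for brevity; sorting by start time and sweeping would give the same answer faster.

```python
def max_clique_weight(intervals):
    """Largest total size of buffers that are simultaneously live.

    `intervals` holds (start, stop, size) triples for continuous
    lifetimes [start, stop); only start times need to be examined.
    """
    intervals = list(intervals)
    best = 0
    for s, _, _ in intervals:
        live = sum(size for (a, b, size) in intervals if a <= s < b)
        best = max(best, live)
    return best

lifetimes = [(0, 4, 3), (2, 6, 2), (4, 8, 3), (6, 10, 2)]
mcw = max_clique_weight(lifetimes)
print(mcw)  # 5
```

Any allocation of these four lifetimes needs at least 5 units, which gives a reference point against which an allocation heuristic can be judged.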
Now suppose that some of the intervals are periodic. It is still the case that will contain the start time of at least one interval; however, this need not be the earliest start time. It could be the start time of some periodic occurrence (greater than the earliest start time) of the interval (see figure 20). Hence, to compute the maximum clique weight in this scenario, we would have to consider start times of all occurrences of a periodic interval; this is no longer a polynomial-time algorithm and could potentially take a long time if there are many periodic occurrences. Hence, in our experiments, we use two heuristics to compute these values. The first heuristic gives an optimistic estimate; it only considers the earliest start time of each interval, and it determines whether there is any overlap with other intervals at that time by using the algorithm of figure 15. This is an optimistic estimate since the maximum clique weight could occur at a time that is not the earliest start time of any interval. The second heuristic gives a pessimistic estimate; it simply ignores the periodicity of periodic intervals, and assumes that a periodic interval is live the entire time between its earliest start time and its last stop time (that is, the stop time of the last occurrence of the interval).
Experimental Results
We have tested these algorithms on several practical benchmark examples, as well as on random graphs. As mentioned earlier, the crux of the experiment is to study the memory requirement as a result of using the best combination of the four possibilities:
That is, perform the scheduling by using one of RPMC or APGAN to generate the topological ordering, and perform loop fusion on that schedule using SDPPO. Then, perform memory allocation using one of ffstart (firstFit with buffers ordered by starting times) or ffdur (firstFit with buffers ordered by durations). We compare the best memory requirement obtained this way to the best memory requirement from ...

Figure 21 shows the percentage improvement on 16 systems we tested. As can be seen, there is, on average, more than a 50% improvement by using the compiler framework of this paper compared to previous techniques. On some examples, the improvement is as high as 83%. Details of the experiments are given below.
Practical multirate systems
The practical multirate examples are a number of one-sided filter bank structures [32] as shown in figure 22, two-sided filter banks [32] as shown in figure 23, and a satellite receiver example [29] as shown in figure 24. Another type of variation that occurs frequently in practical signal processing systems is variation in sample-rate change ratios. For example, figure 22 shows a filter bank with 1/3, 2/3 rate changes; these are the changes that occur across actors and for instance. Other ratios that could be used include . "satrec" is the satellite receiver example from [29]. The other examples included are "16qamModem," an implementation of a 16-QAM modem; "4pamxmitrec," a transmitter-receiver pair for a 4-PAM signal; "blockVox," an implementation of a vocoder (a system that modulates a synthesized music signal with vocal parameters); "overAddFFT," an implementation of an overlap-add FFT (where the FFT is applied on successive blocks of samples overlapped with each other); and "phasedArray," an implementation of a phased array system for detecting signals. These examples are all taken from the Ptolemy system demonstrations [36].
The second column contains the results of running RPMC and post-optimizing with DPPO on these systems, assuming the non-shared model of buffering. This column gives us a basis for determining the improvement with the shared model. In general, "(R)" refers to RPMC and "(A)" refers to APGAN.
The third column has the results of applying the new dynamic programming heuristic (sdppo) post-optimization for shared buffers on an RPMC-generated topological order. The fourth and fifth columns contain optimistic (mco) and pessimistic (mcp) estimates of the maximum clique weight for the schedule generated by sdppo (on the RPMC-generated topological order). The sixth and seventh columns contain the actual allocations achieved after applying the firstfit ordered by durations, and firstfit ordered by start times heuristics. The eighth column contains the BMLB [4] values for each system. Briefly, the buffer memory lower bound (BMLB) is a lower bound on the total buffering memory required over all valid SASs, assuming the non-shared model of buffering. The rest of the columns contain the results after applying these heuristics on APGAN-generated topological orders. In each row, two numbers are shown in bold: the better DPPO result (RPMC or APGAN) and the best shared implementation (among ffdur(R), ffstart(R), ffdur(A), ffstart(A)).
The last column has the percentage improvement over the nonshared implementation; this is computed as
As can be seen, the improvements average more than 50%, and are dramatic in some cases, with up to 83% improvement in the depth 5 filter bank of 1/2, 1/2 rate changes (the most common type of filter bank). It is interesting to note that the methods of Ritz et al. [29] for shared-buffer scheduling achieve an allocation of more than 2000 units for "satrec"; in contrast, the methods in this paper achieve 991, an improvement of more than 50%.
It is also interesting to note that of the four possible combinations ( ), the combination of gives the best results the most often. However, most of the best results are on the fairly regular qmf filterbanks; the more irregular systems are apparently better suited to the combination.
Another experiment was conducted to determine whether applying ffdur or ffstart on the sdppo schedule gives better results than applying ffdur or ffstart on the dppo schedule. The maximum improvement ...

In order to determine whether RPMC and APGAN are generating good topological sorts, we tested the results against the best allocation we could get by generating random topological sorts. We applied the sdppo technique and the firstfit heuristics on each random topological sort to determine the best allocation. For the small graphs like "satrec" and "blockVox" (both with about 25 nodes), we found that it took about 50 random trials to beat the best result generated by the better of the RPMC and APGAN-generated schedules. However, even after 1000 trials, the best random schedule resulted in an allocation of 980 for the "satrec" example, and an allocation of 193 for the "blockVox" example. The best RPMC/APGAN-based allocations are 991 and 199 respectively. So even though we can generate better results just by random search, we cannot improve upon RPMC/APGAN by much, and a lot of time has to be spent doing it.
The relative improvement over random schedules increases when larger graphs are examined, such as the "qmf12_5d" and "qmf235_5d" examples (these have about 200 nodes each). Here, after 100 trials, the best allocations were 79 (qmf12_5d) and 8011 (qmf235_5d), compared to 58 and 5690 for the RPMC/APGAN-based allocations respectively. Since the running time for 100 trials was already several minutes long on a Pentium II-based PC, we conclude that on bigger graphs, it will require large amounts of time and compute power to equal or beat the RPMC/APGAN schedules. Hence, we conclude that for compact, shared buffer implementation, APGAN and RPMC are generating topological sorts intelligently, and cannot be easily beaten by non-intelligent strategies such as generating random schedules.
Homogeneous graphs
Unlike previous loop scheduling techniques for buffer memory reduction, the techniques described in this paper are also effective for homogeneous SDF graphs. This is because of the allocation techniques; the sharing strategy can greatly reduce the buffer memory requirement in many cases. As an example, consider ...
Conclusion
We have developed a powerful SDF compiler framework that improves demonstrably upon our previous efforts. By incorporating lifetime analysis into all aspects of scheduling and allocation, the framework is able to generate schedules and allocations that reuse buffer memory, thereby reducing the overall memory usage dramatically. However, in order to produce code competitive with hand-coded implementations, there are many ways in which additional optimization problems can be formulated. One particular problem that has not been addressed is the issue of recognizing regularity that might occur in graphical specifications (for instance, a fine-grained description of an FIR filter). Regularity extraction has been applied in the past to high level synthesis [26] [27], and Keutzer [14] has applied pattern matching algorithms from compiler design to silicon compilers; perhaps these techniques can be applied in the context of SDF compilers. In addition, it would be useful to study techniques that can make use of the regularity implied by the use of hierarchy and graphical higher-order functions [18] in dataflow specifications.
