Memory cost is responsible for a large amount of the chip and/or board area of customized video and image processing system realizations. In this paper, we present a novel technique { founded on data-ow analysis { which allows to address the problem of background memory size evaluation for a given non-procedural algorithm speci cation, operating on multi-dimensional signals with a ne indices. Most of the target applications are characterized by a huge number of signals, so a new polyhedral data-ow model operating on groups of scalar signals is proposed. These groups are obtained by a novel analytical partitioning technique, allowing to select a desired granularity, depending on the application complexity. The method incorporates a way to trade-o memory size with computational and controller complexity.
Introduction
Speech, image and video processing applications involve a large amount of multi-dimensional signals which lead to large memory units. These result in signi cant area and power consumption cost in (application-speci c) architectures for real-time multi-dimensional signal processing (RMSP), in most cases dominating the data-path contribution for complete systems 34, 20] . A 128 kbit embedded SRAM takes about 115mm 2 in a 1.2 m CMOS technology, as compared to 5 to 15mm 2 for an entire (very) complex data-path. The di erence in importance would increase further if the data-paths in an architecture design would be optimized without considering the implications on memory. For these reasons, we have opted to optimize rst the This research was partly sponsored by the ESPRIT Basic Res. Action 3281 (NANA II) project of the E.C. y Professor at the Katholieke Universiteit Leuven 1 high-level storage organization for the multi-dimensional signals in our CATHEDRAL script 36] . Note that this high-level memory management stage is fully complementary to the traditional high-level synthesis step known as \register allocation/assignment" 31, 19, 15, 1, 29] which deals with individual storage places for scalars, after scheduling. Part of this e ort is also needed in the CATHEDRAL context, but this decision on scalar memory management is postponed to our low-level data-path mapping stage 15] .
Three main partly con icting objectives can be identi ed during high-level memory management, solvable within di erent steps: (1) optimizing the memory access by allocating a number of memory units of speci c types and by distributing and organizing the signals eciently over a set of background memories and ports (referred to in the sequel as background memory allocation), (2) reducing the memory size by in-place storage of signals, and (3) minimizing the area overhead of address generation circuitry by optimizing the memory access sequences. Before these steps, we have added a global loop hierarchy optimization technique 36], based on abstract cost measures, providing constraints for the subsequent steps.
The type of control and data ow constructs, handled during memory management, mainly include irregularly nested loops and indexed signals. Our model is intended for multi-dimensional (delayed) signals with complex a ne index expressions, and nests of loops having as boundaries a ne (linear of each variable) functions with integer coe cients of the surrounding loop iterators. The functional speci cations can contain condition expressions { data-dependent (i.e. dependent on produced/input signals) or not (dependent only on iterators). This covers the majority of the RMSP application domain 1 . Pointers, interrupts, data-dependent loopbounds and indices are not incorporated in the scope of this work.
The problem of memory size estimation, tackled in this paper, is formulated as follows: \Given a RMSP algorithm described in a high-level applicative language, containing a large amount of scalars, organized mainly in multi-dimensional signals and assigned to a single multiport memory unit, estimate the background memory (necessary to store the multi-dimensional signals) area, subject to given throughput and loop organization constraints." Di erent from past approaches, requiring that detailed operation scheduling has been previously performed, our formulation disregards this essential requirement, tackling the problem under more general conditions. The motivation for this is given in Section 2.
The memory size estimation kernel as developed here, is useful in two di erent contexts. This research was done as part of solving, under the same application domain assumptions, the more general problem of background memory allocation 4]. The estimation model is targeted to (multi-port) random access memories. Allocation aspects like memory hierarchy, bandwidth, latency, are beyond the scope of this paper. However, the method proposed here is also applicable in other system exploration contexts. In particular, if several descriptions of an algorithm, or several alternative algorithms for a given RMSP application, have to be traded o against one another, it is important to evaluate their memory cost in a very early system design stage. In summary, the ability to predict memory characteristics of designs without synthesizing them is vital to producing high quality designs in a reasonable time.
2 Background memory size estimation: context and state-of-the-art
To our knowledge, almost all techniques for dealing with the allocation of storage units are scalar-oriented and employ a scheduling-directed view (see e.g. 1, 2, 19, 29, 31] ) where the control steps of production/consumption for each individual signal are determined beforehand. This applies also for memory/register estimation techniques (see e.g. 18, 13] and their references). This strategy is mainly due to the fact that applications targeted in conventional high-level synthesis contain a relatively small number of signals (at most of the order of magnitude 10 3 ). The control/data-ow graphs addressed are mainly composed of potentially conditional updates of scalar signals. Therefore, as the major goal is typically the minimization of the number of registers for storing scalars, the scheduling-driven strategy is well-tted to solve a register allocation problem, rather than a background memory allocation problem. In that case, (binary) ILP formulations 1, 2] and graph colouring 29] or clique partitioning 31] techniques have provided satisfactory results for register allocation, signal-to-register and signal-to-port assignments, under the usually implicit assumptions mentioned above. For many of our targeted applications, where memory is the dominating resource, this general strategy { imposing a strict precedence between scheduling and storage management { can lead to costly global solutions, as the search space for memory optimization becomes heavily constrained after scheduling. Manual experiments carried out on several test-vehicles 34] { as autocorrelation computation, singular value decomposition, solving Toeplitz systems { have shown that memory management decisions, taken before scheduling, can reduce signicantly the background memory cost, while the freedom for data-path allocation and scheduling remains almost una ected 6]. Furthermore, within the scheduling-directed view, many examples are untractable because of the huge size of the ILP formulations. Exceptions to this are PHIDEO 20] { where streams are used during the memory allocation but which is still done after scheduling, and recently also MeSA 26] { where memory allocation cost is based both on a layout model and on the expected performance, but where the possibility of signals to share common storage locations when their life times are disjoint is neglected. 
/* m , n , M , N are predefined constants */
(5) Figure 1 : SILAGE code extracted from the motion detection algorithm A behavioural description like that of Fig. 1 is untractable by means of a classical scalaroriented technique. First, this class of examples usually contains large amounts of scalar signals. At the same time, scalar-oriented approaches entail a loss of the code regularity. According to our experience, this leads to an unacceptable growth of the controller size. In addition, a behavioural description language like SILAGE 17] is by de nition non-procedural: besides the natural production-consumption order imposed by the dependence relations existent in the code, there is much freedom left in the execution ordering. This can be exploited by a designer in order to meet his goals { e.g. to take pro t of the parallelism \hidden" in the code. The current in-place mapping and address generation tools of the CATHEDRAL system 33] cannot handle the example of Fig. 1 if e.g. lines (1) and (5) are interchanged, as they interpret the source code procedurally: the operation ordering taken into account is only that one from the code, where all signals must be produced before being consumed. Moreover, the allocation is done manually in CATHEDRAL-2/3.
The allocation tool MEDEA from the PHIDEO system 20] meets also problems in dealing with examples like that of Fig. 1 . It is partly capable of handling non-procedural applications 2 , and the stream model is very well suited and yields good results for applications where the dependence vectors 5] have constant elements, as in many front-end video applications. By means of a hierarchical stream model, it is possible to handle multi-dimensional signals with complex a ne indices, but only in nests of loops with constant boundaries. The PHIDEO stream model is not tuned to image, speech or medium-level video processing systems which typically do not exhibit xed periods and xed length streams. Hence, handling loop nests with non-constant boundaries, and non-uniform dependence relations between signals is not always possible.
Background memory size estimation has only recently gained attention in high-level synthesis. The rst estimation methods are based on symbolic evaluation { a scalar-oriented technique { which consists in enumerating all indexed signals for all index combinations 24]. More recently, novel results have been obtained for the case when the algorithm speci cation is non-procedural. Modi cations of the loop hierarchy and the sequence of execution as speci ed in the source code are used to optimize the storage requirements 36]. Retaining only the data ow constraints, a new control ow is provided by placing polytopes of signals in a common space (with an ILP technique), and searching for an ordering vector in that space. For solving the memory size estimation problem in the non-procedural case, data-ow analysis has been consistently employed by us as the main exploration strategy 3]. In a recent work, data dependence information provided by the Omega test 25] has also been applied in memory size estimation 35] . Although the method is very appealing in terms of speed, the assumptions regarding loop hierarchy and control ow { e.g. a nest of loops is executed in the sequence as speci ed in the source code, and a sequence of loops is executed in the order speci ed in the source code { only partly remove the procedural constraints.
A precise evaluation of the memory area implies two distinct aspects:
1. assessment of the number of storage locations, which determines the total number of 2 In terms of stream ordering. However, internally, the elements of each stream are procedurally ordered. 2. assessment of the number and type (read, write, or read=write) of memory ports, which heavily in uences the area cost of a single cell. Finding the minimum number of memory locations/registers when the operation ordering is that one provided by the order of statements, and the nesting of the loops in the algorithm code (or when the scheduling of the operations was previously accomplished), has already been solved. Knowing the detailed sequence of operations entails precise information concerning the life times of variables (signals). Consequently, a \left-edge" type algorithm 15, 19] is su cient to nd out in O(n log n) time, both the minimum number of storage requirements and the assignment of individual signals to memory locations. Extensions to deal with cyclic graphs have been proposed also in 15, 29] . Unfortunately, the same problem becomes signi cantly harder when the operation ordering is still not xed, belonging to the NP class 14] even in the absence of conditionals. A straightforward way of estimating the number of memory locations when the operation scheduling is still not decided is by means of an accurate data-ow analysis.
In parallel compiler theory, data dependence analysis has been a topic of research for a long time. Transforming programs so as to make e cient use of massively parallel machines is a very demanding task. Determining whether a dependence exists between two array references (see 5] for a good overview) is a crucial issue for doing code transformations. Methods based on parametrized integer programming 12], and Fourier-Motzkin algorithm 9] are extensively used 25]. Data dependence has been employed to maximize the ne-or coarse-grain parallelism in loop nests, or to improve data locality 38].
As RMSP algorithms contain usually a huge amount of scalars, a data-ow analysis operating with groups of signals (rather than a attened one operating with individual signals) is compulsory. On the other hand, doing a su ciently accurate memory size estimation irrespective of an operation ordering, and when attening is impractical, requires more data-ow information 3 than the existence of dependences between array references 25] (as in code restructuring). In the SILAGE code of Fig. 1 , the decision regarding the memory requirement e.g. for signals Delta cannot be taken only by determining the existence of data dependences between the 4 array references of Delta in lines (2),(3), (4) . It can be proven, employing the number of dependences between di erent parts of the arrays, that (M ? m + 1)(N ? n + 1) 3 The structure and number of dependences between di erent parts of the array references. 6 is the best memory size upper-bound 4 . Basing the decisions only on the existence of data dependences can lead to an important over-estimation/allocation of memory. These aspects led to the development of an accurate analysis of the dependence structure, operating on a polyhedral dependence graph model. This paper presents a non-scalar method for estimating the area of the background memory, before operation scheduling, for RMSP algorithms with non-procedural speci cations. This technique can then be used to guide the distributed memory allocation 4] or early systemlevel exploration tasks. In order to address applications with large amounts of scalars, an analytical partitioning of the indexed signals will be presented in Section 3. A polyhedral dependence graph model { built around the concept of basic set of signals { will be presented in Section 4. The assessment of the number of storage locations for algorithms described in a non-procedural language is presented in Section 5. Results obtained so far are substantiated in Section 6, followed by the conclusions and our future directions of research in Section 7. In the sequel, matrices and vectors are denoted with bold characters, matrices being in capitals.
Analytical partitioning of multi-dimensional signals
The set of appearances of an indexed signal in the RMSP algorithm is characterized by the collection of de nition (left-hand side) and operand (right-hand side occurrences) domains for that signal 36] . Each domain has an index space which is a linearly bounded lattice (LBL) 30] { the image of an a ne mapping over a set of linear inequalities representing a polytope 5 : f x = T i + u j A i b g (1) where x2 Z m is the coordinate vector of an m-dimensional signal, and i 2 Z n is the vector of loop iterators. The a ne function is characterized by T2 Z m n and u2 Z m , while the integral polytope { de ning the set of iterator vectors { is characterized by A2 Z 2n n and b2 Z 2n .
For instance, the index space of the operand domain Delta i] j] (2 m + 1) (2 n + 1)] in 4 In the sense that there is an operation ordering requiring (M ?m+1)(N ?n+1) locations, but no ordering requiring more. Of course, there are orderings needing less memory. 5 In general, the index space may be a collection of linearly bounded lattices. E.g. a conditional instruction if ( i 6 = j ) determines two LBLs for the domains of signals within the scope of the condition { one corresponding to i j + 1 , and another corresponding to i j ? 1 . Without any decrease in generality, we can assume in the sequel that each index space is represented by a single linearly bounded lattice.
line (4) .
Once the collections of de nition/operand domains are extracted from the signal processing algorithm, a partitioning process into non-overlapping \pieces" { called basic sets in the sequel { is performed for each collection. The motivation is the following: in order to derive dependence relations between groups of signals, it is necessary to know which part of an operand is produced by which de nition domain, and which part of a de nition is consumed by which operand domain. This information is needed for memory estimation (see Section 5) .
The signals common to a given set of domains and only to them will constitute a basic set. E.g. a collection of 2 domains results in at most 3 basic sets; 3 overlapping domains result in maximally 7 sets. More general, if the number of de nition/operand domains of a multi-dimensional signal is denoted by s , the number of basic sets is upper-bounded by C 1 s + C 2 s + + C s s = 2 s ? 1 . However, this evaluation proved to be extremely pessimistic in practical cases. In the illustrative example in Fig. 2a , for instance, the number of domains for signal A is s = 9, and the resulting number of basic sets is only 10 (see subsection 3.2).
The partitioning
The aim of the partitioning process is to determine what parts of an operand domain are not needed any more after the computation of a de nition domain, assuming a given loop hierarchy and a certain data-ow. In other words, which are the scalar signals consumed for the last time, and what is their number, when the signals within a de nition domain are produced. The last question is related directly to the evaluation of the storage requirements, as it allows to compute exactly how many memory locations are needed and how many can be freed when a certain group of signals is produced, assuming a certain data-ow. Furthermore, the subsequent partitioning of the indexed signals, together with dependence information, can o er solutions for partially rearranging the loop organization provided in the initial code, in order to obtain a decrease of the storage requirements (see Section 5).
The partitioning process { described in the sequel { yields a separate inclusion graph for each of the collections of linearly bounded lattices. The direct inclusion relation is denoted by \ " , while its transitive closure (the existence of a path in the graph) is denoted by \ " . An inclusion graph is constructed as follows: The procedure AddInclusion creates new inclusion relations (arcs) between groups of signals, but deletes the resulting transitive arcs: keeping a strict hierarchy for the LBLs is essential for the partitioning phase. The intersection \\" of two linearly bounded lattices is described in subsection 3.3. An example will be provided in subsection 3.2.
Afterwards, the basic sets of a signal are derived from the inclusion graph with a simple bottom-up technique: If a node has no components, a new basic set { equal to the corresponding linearly bounded lattice { is introduced; otherwise, if all its components have already been partitioned, the union of the component partitions and (potentially) a new basic set will constitute the current node partitioning. In the latter case, the new basic set appears only if there is a di erence between the size of the node and the total size of its components.
The computation of LBL sizes is described in subsection 3.4. The e ciency of this operation and of the intersection procedure are crucial for the practical time complexity of the whole algorithm. The partitioning process is exempli ed in subsection 3.2.
The importance of this analytical partitioning into groups of signals { called basic sets { is manifold: (1) It allows a quick veri cation (without descending at the scalar level) whether the program complies with the single assignment conjecture: it is su cient to check that each basic set belongs to one single de nition domain. 
An illustrative example
The partitioning algorithm is exempli ed for the simple SILAGE code in Fig. 2a .
The nal inclusion graph is shown in Fig. 2b . The basic sets are derived bottom-up, after the computation of all node sizes in the inclusion graph, with the technique described in Algorithm 2 (subsection 3.1). The LBLs in the inclusion graph are partitioned in the order: a; j; k; m; p; n; o; r; c; d; e; f; i; l; q; b; g; h. The nal partitioning for our illustrative example is shown in Remark Basic sets cannot be always represented as linearly bounded lattices (as it is the case in this illustrative example). In general, they can be decomposed, or they can be expressed as di erences of a ne mappings of polytopes. E.g. signal A from the SILAGE code in Fig. 1 has two basic sets; one of them { A1 { cannot be represented as a linearly bounded lattice (see Fig. 4b ).
Intersection of linearly bounded lattices
Let fx = T 1 i 1 + u 1 j A 1 i 1 b 1 g ; fx = T 2 i 2 + u 2 j A 2 i 2 b 2 g be two LBLs derived from the same indexed signal, where T 1 and T 2 have obviously the same number of rows { . Therefore, several algorithms with provable polynomial worst-case complexity have been proposed (e.g. 16, 27] ). The size of Diophantine systems in most of our practical cases does not justify the overhead (in programming and practical computation e ort) implied by the use of one of those more sophisticated algorithms with provable polynomial worst-case complexity. Therefore, we have searched for a simpler technique which works well for smaller problem sizes.
In our current implementation, each unimodular transformation reduces the coe cients of an equation (except the smallest one) to at most 1/2 of the smallest coe cient, therefore employing usually a shorter sequence of unimodular transformations than 22] { where the reduction factor is 1 , or 25] { where the reduction factor is 2/3 . The basic idea is the following: consider the Diophantine equation a 11 x 1 + + a 1n x n = a 1 2 Z , where a 1i 2 Z?f0g , and suppose, without loss of generality, that a 11 is one of the coe cients such that ja 11 j def = minfja 11 j; : : :; ja 1n jg . In the unimodular substitution x 1 = X 1 + c 2 x 2 + + c n x n , the coe cients c k can be chosen to be integers satisfying ja 11 An undesirable side-e ect of intersection is the rapid size increase of the polytope description for the resulting linearly bounded lattice, due to the coalescing of the two constraint sets. Therefore, minimizing the set of constraints proved to be a necessity in order to restrict the computational e ort of e.g. counting the lattice points of the resulting LBLs (see subsection 3.4). The technique employed for minimizing the set of linear constraints is based on Chernikova's algorithm 8] { a nonpivoting method for nding all vertices of convex polytopes. The implementation in use stems from the polyhedral library developed at IRISA 37].
Computation of the a ne image size of a polytope
When the linear function t : Z n ! Z m , de ned by t(i) = Ti is injective, the number of lattice points 7 of the image of a polytope is equal to the number of points of integer coordinates inside the polytope. The latter problem was tackled long ago,but only recently it was proven 10] the existence of a polynomial-time solution up to the 4-D case.
As a more general solution { able to handle signals of any dimension { was needed, a novel technique based on the Fourier-Motzkin elimination 9] has been developed. The routine for counting the lattice points inside a given polytope is described below. In the sequel, the columns of a matrix are denoted by subscripted vectors: e.g. A = a 1 a 2 ] ; the number of columns is denoted by A:nCol . The main idea of the algorithm is the following: the number of lattice points in the given polytope, having as rst coordinate z 0 , is equal to the number of lattice points in the polytope a 1 z 1 + a 2 z 2 + b ? a 0 z 0 (which has one dimension less). The required result is obtained by accumulating this number over the entire range of z 0 (determined by FourierMotzkin elimination). The worst-case time complexity of the algorithm is exponential. The in uence of this negative aspect is attenuated by minimizing rst the number of inequalities with Chernikova's algorithm. In addition, the method does not need to be applied on the constraint matrix (A in (1)) directly. Indeed, the linear constraints of a basic set are derived from the constraints on loop boundaries and control-ow conditions. As loop boundaries are frequently constant, the Fourier-Motzkin elimination is usually applied only on a reduced constraint matrix. Moreover, the initial number of columns of matrix A is the number of the surrounding loops, which is usually small. Two questions arise: how can it be decided whether function t is injective or not, and what is to be done if the injectivity condition is not ful lled. In this context, it has to be emphasized that generating all lattice points in the polytope and collecting their images in a set is very ine cient for \large" polytopes.
For any matrix T 2 Z m n there exists a unimodular matrix S 2 Z n n such that T S (2) r < n . Then x 1 = x 2 implies that only the rst r components of j 1 and j 2 are resp. equal. This means that all vectors j satisfying AS j b , and having the same pre x j 1 : : : j r , contribute to the image set with one single distinct value because they are mapped to the same point. Consequently, the a ne image size of the polytope is the number of all valid pre xes j 1 : : : j r { for which there exist vectors j having that pre x. If for each of the distinct pre xes j 1 : : : j r there is at most one single vector j in the polytope AS j b having that pre x, the a ne mapping is injective. Therefore, the complete algorithm for counting the image size the pre xes of all vectors j satisfying AS j b , that is f 7 j 1 ? j 3 0 ; 7 j 2 ? j 3 0 ; 7 j 3 0 g , have to be checked for validity. These pre xes (j 1 j 2 ) result to have the ranges j 1 2 0; 14] and j 2 2 maxf0; j 1 ? 7g; minf14; j 1 + 7g] . All 169 of them prove to be valid, hence the domain size is 169. The a ne mapping is not injective, as it can be easily seen noticing that the triplets (i; j; k) equal to (0,1,1) and (1,2,0) are mapped to the same point. 4 The polyhedral data-ow analysis After the partitioning process described in Section 3, a data-ow graph (DFG) with exact dependence relations is produced (Fig. 3a) . However, unlike the classic case of data-ow analysis, the nodes in the graph do not correspond to individual variables/signals, but to groups of signals (covered by the basic sets derived in Section 3), and the arcs correspond to the dependence relations between these groups. The nodes in the data-ow graph are weighted with the size of their corresponding basic sets, and the arcs between nodes are weighted with the exact number of dependences between the basic sets corresponding to the nodes. The outcome of such a strategy is the successful memory estimation for RMSP algorithms with high amount of scalar signals (see Section 6). Subsection 4.1 presents the computation of dependences between the basic sets of signals. Subsection 4.2 explains the modi cations of a data-ow graph in order to deal with delayed signals, while subsection 4.3 deals with the data-ow analysis at di erent granularity levels.
Computation of dependence relations
As the nodes in the DFG and their weights { the basic sets of signals and their sizes { are known from Section 3, the construction of the graph is completed by determining the arcs between the nodes { the dependence relations between the basic sets { and their weights { the number of dependences.
Suppose, without decrease in generality, that two basic sets can be represented as LBLs: The expression of the iterator vector corresponding to the basic set S 2 is obtained similarly:
There is a dependence relation between S 1 and S 2 if there is at least one iterator vector corresponding to both of them. The number of iterator vectors yields, in this case, the number of dependences. The problem is solved by intersecting the linearly bounded lattices (2) and (3) . If the intersection is empty, there is no dependence relation between S 1 and S 2 . Otherwise, the size of the intersection (see subsection 3.4) represents the number of dependences.
Example In the SILAGE code of Fig. 2a , basic set A8 belongs to the index space D8 of the de nition domain A j] n+ i + 1] (j i) (node g in Fig. 2b) , and basic set A9 belongs to the index space D9 of the operand domain A j] n + i] (j i) (node h in Fig. 2b) . Employing a non-matricial notation, the linearly bounded lattices of the basic sets and of the index spaces are (see Fig. 2c ): A8 = f x = 1 ; y = 2 + n j n ? 1 1 ; 2 1 ; 1 ? 2 1 g A9 = f x = ; y = n j n ? 1 1 g D8 = f x = j ; y = i + n + 1 j n ? 1 j 0 ; n ? 1 i 0 ; j i g D9 = f x = j ; y = i + n j n ? 1 j 0 ; n ? 1 i 0 ; j i g
The set of iterators corresponding to A8 in the index space D8 is described by f i 1 = 0 ; j 1 = " j n ? 1 " ; 0 0 ; " ? 0 2 g . The set of iterators corresponding to A9 in the index space D9 is: f i 2 = 0; j 2 = j n ? 1 1 g . The intersection of the two LBLs is represented as: f i = 0 ; j = j n ? 1 2 g .
Hence, the number of dependences between A9 and A8 is n?2 , i.e. the size of the intersection.
The data-ow graph of the example in Fig. 2a is shown in Fig. 3a . The nodes are labelled with the signal name, basic set number, and size; the arcs are labelled with the number of dependences. OUT is a dummy node, necessary for handling delayed signals.
Handling the delayed signals
RMSP algorithms describe the processing of streams of data samples. The source code of these algorithms can be imagined as surrounded by an imaginary loop having time as iterator. Consequently, each signal in the algorithm has an implicit extra dimension corresponding to the time axis. RMSP algorithms contain usually delayed signals, i.e. signals produced or inputs in previous data-sample processings, which are consumed during the current sample processing. The delay operator \@" refers relatively to the past processings. The delayed signals must be kept \alive" during several time iterations, i.e. they must be stored in the background memory during several data-sample processings.
In order to simulate the e ect of the delayed signals in terms of memory requirements, a simple but e ective method for handling delays is introduced in the sequel.
First, the delayed operand domains take part in the partitioning process (Section 3.1) as any other signal domain. Afterwards, the construction of the data-ow graph needs the following preprocessing step: This modi cation of the data-ow graph allows to take into account the e ect of delays, \translating" the basic sets from previous data-sample processings into the current one. When the delay values are constant, the extension of all domains with one extra dimension { corresponding to the implicit time loop { is avoided, hence reducing the computational e ort 11 . Fig. 4a shows the data-ow graph for the SILAGE code in Fig. 1 . The basic sets corresponding to delayed signals are labelled with \@delay value".
Data-ow analysis at di erent granularity levels
The method described so far operates on groups of signals obtained from a coarse-grain partitioning { using the index spaces of operand and de nition domains directly derived from the loop organization provided in the initial code. Due to the (possibly large) size of the basic sets, the number of memory locations necessary to compute the data-ow graph may result to be much higher than the absolute minimal storage requirement of the signal processing algorithm, when only scalar instances would be considered. To overcome this e ect, the index spaces can be decomposed before starting the partitioning process, removing gradually the restrictions imposed by the loop organization provided in the initial code. This domain decomposition is carried out by slicing the n-dimensional polytopes A i b into sets of (n ? 1)-dimensional polytopes. We name this an increase of granularity level, and the outcome is a ner-grain partitioning. The slicing process can be eventually continued until the domains are decomposed into scalars. The attened data-ow analysis can therefore be obtained as a particular case. The linearly bounded lattices are then newly derived and partitioned corresponding to that granularity level. The slicing operation is carried out with a Fourier-Motzkin technique, but equivalent results can be obtained, for instance, by unrolling gradually the loops. It must be also remarked that starting with a certain granularity level (dependent on the application) the DFG becomes a directed acyclic graph. Fig. 3b shows the same data-ow graph as in Fig. 3a but expanded one loop level. There are two contrasting objectives when selecting the granularity level. On one hand, the number of basic sets is rapidly increasing, which is an undesirable e ect due to the growth of the computational e ort. Moreover, also the complexity of the controller in the architecture to be realized will grow exponentially. On the other hand, the memory size will gradually get closer to an absolute lower bound. In practice, descending to the scalar level ought to be avoided. Instead, for each application, the granularity should be gradually increased until a good trade-o is obtained.
Evaluation of Memory Size
After partitioning the signals from the given RMSP algorithm, and after accumulating all the possible dependence information at the level of basic sets, the subsequent step is to obtain an accurate evaluation of the minimal memory size (locations) compatible with the resulting data-ow graph.
Even the simpler problem of nding the minimum number of memory locations necessary to compute a directed acyclic graph has been proven to be an NP-complete problem 28]. Structurally, the DFGs determined as in Section 4 can be more complex: e.g. they may contain cycles, as 2 groups of signals may contain 2 subsets with opposite dependences to the other group 12 . Consequently, the assessment of the minimal memory size necessary to compute a data-ow graph is achieved basically by means of a heuristic traversal. It must be emphasized that the goal of this approach is to introduce only a partial operation ordering { necessary to reduce the storage requirements { while a proper scheduling implies a total ordering { which is unnecessary for our problem. The DFG traversal provides a data-ow which is equivalent to a certain reorganization of the code, without a ecting the loop hierarchy. The procedural execution of this functionally equivalent code entails a low (eventually minimum) number of storage locations for computing the respective data-ow graph.
Applying the same approach to the DFG for a higher granularity level (see subsection 4.3), the resulting data-ow corresponds to a certain code reorganization when the nests of loops are (implicitly!) unrolled until a depth equal to the granularity level. As the granularity is increasing, the storage requirement to compute the corresponding graph is decreasing, but the number of supplementary scheduling constraints are also increasing.
Subsection 5.1 presents the basic features of the DFG traversal. Subsection 5.2 introduces the methodology of in-place mapping, incorporated in the traversal approach. The main ideas in this section will be highlighted in subsection 5.3, while processing the example in Fig. 1 .
Data-ow graph traversal
The DFG traversal attempts to nd the best possible ordering in which the basic sets of signals should be produced 13 such that the memory size is kept as low as possible. It is assumed that a basic set can be produced only after the production of all basic sets having dependences to it.
While producing a basic set, an increase of memory size occurs (see Fig. 5a ). This increase is followed by an eventual decrease, representing the total size of basic sets consumed for the last time (having no other dependence relations towards basic sets still not produced). The production of a basic set may not a ect the current value of the maximum memory size { registered after starting the traversal (Fig. 5a ), or may lead to exceeding this value (MaxMemorySize) which has to be therefore updated (Fig. 5b) . The amount of the memory is yielded by the worst-case in-place mapping (see subsection 5.2).
Taking into account the remarks and assumptions mentioned above, the data-ow graph traversal works according to the following general scheme: This is a relatively greedy approach, but it is based on a thorough global analysis for deriving the cost factors (see also subsection 5.2) which decide the order of the basic set production. The results in Section 6 will demonstrate the e ectiveness of the traversal scheme.
In-place mapping
The problem of in-place mapping 34] refers to the possibility that signals in a RMSP algorithm share the same memory locations during algorithm execution. This problem may be approached from two viewpoints: (1) high-level in-place mapping { referring to the memory sharing due to the data-ow, hence independent of any operation scheduling; (2) low-level in-place mapping { which refers to the same issue but dependent on the scheduling.
The latter problem can be easily tackled with a left-edge algorithm (e.g. 15, 19] ), but the former is more di cult as it requires the computation of tight upper-bounds on the (usually huge) set of valid operation orderings.
Example The SILAGE code in Fig. 6a yields the basic sets in Fig. 6b and the data-ow graph in Fig. 6c . Assuming that operations are executed sequentially, and that signals A are not consumed elsewhere in the code, the problem is to nd an exact upper-bound of the minimum number of memory locations.
A thorough analysis shows that 110 locations are su cient, for any sequential computation of signals B i] j]. There are orderings yielding lower memory sizes 14 but our model requires upper-bounds, as the script assumes that scheduling has not been decided yet (see Section 2). The formal methodology for the computation of exact memory upper-bounds { based on the dependence information embedded in the data-ow graph { will be described elsewhere.This approach has been incorporated in our memory estimation tool, in order to determine the term MemoryIncrease from the DFG traversal scheme (see subsection 5.1).
Example of memory size evaluation
The memory size evaluation methodology is exempli ed for the SILAGE code in Fig. 1 .
The number of memory locations required for the algorithm execution is approximated with the number of locations necessary to compute its data-ow graph for a chosen granularity level. For instance, choosing the level equal to 0, we have to compute the number of locations required for the computation of the DFG in Fig. 4 .
According to the traversal scheme (subsection 5.1), the basic sets are produced/loaded in the order: A0@1, A1@1, A0, Delta0, Delta2, Delta1, optDelta0, optDelta2, optDelta1, opt0, A1 . This traversal yields (taking into account the in-place mapping) MemorySize = Fig. 8 . Only two partial constraints are essential to guide the subsequent scheduling stage: the basic set optDelta0 must be produced after Delta1, and the input basic set A1 must be loaded after the production of opt0 . The increase of the granularity level entails a data-ow analysis with smaller groups of signals. This provides lower values for the memory size at the expense of the gradual expansion of the loop organization. At the same time, more constraints are conveyed to the scheduler. The memory size evaluation for the data-ow graph corresponding to granularity level 1 yields a storage requirement of 660 locations. The variance of the memory during the graph traversals is displayed in Fig. 7 for the DFGs of granularity levels 0 and 1. Similar evaluations for granularity levels 2, 3, and 4 { the scalar level { yields the same memory size, i.e. 627 locations. The analysis of the code in Fig. 1 shows that (M + m + 1)(N + n + 1) + 2 = 627 locations represents indeed the absolute lower-bound on storage.
This example also shows that it is not necessary to descend to the scalar level in order to obtain a signi cant reduction for the memory size. Furthermore, even more crude evaluations for lower granularity levels (as that one corresponding to level 1) may be considered good Figure 8 : The SILAGE code compatible with the DFG traversal enough for this stage of the synthesis. As the amount of constraints is relatively reduced, there is more freedom for scheduling and data-path allocation. Further memory adjustments can be done afterwards, taking into account the detailed scheduling decisions.
Overview of main results
The implementation of the presented approach was done in C++, under the CATHEDRAL framework. It was tested on an HP 9000/735 workstation. The novel estimation method has been evaluated on three RMSP applications: (1) a singular value decomposition (SVD) updating algorithm 21] { an important algebraic kernel used e.g. in beamforming and Kalman ltering; (2) a motion detection algorithm 7] from a video coding application (see Fig. 1 for part of the code); (3) the kernel of a complex voice coding application { essential component of a mobile radio terminal 32]. Table 1 shows the results obtained by the presented approach concerning the storage size, in comparison with those yielded by s2p=agora { the memory tool currently employed by the CATHEDRAL system { which uses a scalar-oriented technique for the in-place mapping. The results are provided for the di erent levels of granularity (column 2), up to the maximal loop depth { for which the number of signal instances (scalars) in the algorithm (column 3) equals the number of basic sets (column 4). The CPU times and the estimated storage requirements are listed in columns 5 and 6. The minimum number of memory locations (N) yielded by s2p=agora on the same algorithm representations are displayed in column 7. The last two columns show the absolute lower bounds on memory locations, and the relative di erence between our estimations and these values.
Di erent from our approach, the memory tool s2p=agora interprets procedurally the source code of the signal processing algorithm. Employing a \left-edge" algorithm, s2p=agora nds the minimum number of memory locations only for one single operation ordering { provided by the code of the application. As our tool interprets the source code non-procedurally, it can nd better execution orderings (in terms of memory), as it happens for the SVD updating. This explains why our estimations from column 6 may be lower than the exact memory values obtained by s2p=agora (column 7) for the non-optimized procedural speci cation.
It can be noticed from Table 1 that the relative di erence between our estimations and the memory lower-bounds is decreasing towards 0 as the data-ow analysis becomes ner 15 . However, the CPU time is growing exponentially as getting closer to the scalar level. The results have been presented for all the granularity levels only to highlight the hierarchical capabilities of our model, and not for practical reasons. As it can be seen from the results, descending towards the scalar level is unnecessary and time consuming. Furthermore, this is impossible for most of the real-life RMSP applications, as shown in the sequel. Table 2 exempli es the use of our approach when the amount of scalars in the application is huge. s2p=agora { the memory tool in the CATHEDRAL system { cannot handle the motion detection algorithm for the sets of parameters shown in Table 2 . On the contrary, our approach can still handle these cases, although descending towards the scalar level is not possible any more. For the rst set of parameters, the highest reachable granularity level was 2 (rather than 4). For the second set of parameters, the motion detection algorithm contains almost 14 million scalars. Only the rst granularity level could be reached, but the relative di erence between our estimation and the computed memory lower-bound is very acceptable. The evaluation of the port requirements (number and type) is achieved by a partial ordering of the read=write operations within a given cycle budget, taking into account also the constraints derived from the DFG structure and traversal. All the signals are assumed to share the same global memory, having the wordlength equal to the maximum signal bit size 16 . The characteristics of the resulting memory are the wordlength, number of locations, and port distribution. The corresponding area is expressed in mm 2 , assuming a CMOS technology with 1.2 m minimum geometry. The rst area column in Table 3 displays the area values obtained from the layout model 23] embedded in our tool. The second area column displays the values obtained building the memory oorplan with a module generator 17 . The results obtained by our method t within a range of 10% the experimental results, con rming the validity of the layout model we use.
Conclusions
In the context of background memory allocation for multi-dimensional signal processing circuits, we have addressed the problem of memory area evaluation for high-level non-procedural speci cations of RMSP algorithms. In the paper, we have presented a novel technique founded on data-ow analysis, which allows to address this problem while operation scheduling is still unknown. In most multi-dimensional signal processing examples, the number of signal instances is huge, so a attened data-ow analysis is impossible. Even if the size of the data-ow graph had allowed the attening, the resulting controller would likely be too complex. Consequently, a new data-ow model grouping scalar signals in so-called basic sets is proposed. Based on this, a novel approach for assessing the memory area has been presented. Actually, the method incorporates a way to trade-o the memory size with the complexity of both the computational e ort and the controller to be achieved. These claims are substantiated with results for realistic applications extracted from video coding and mobile radio systems.
