In real-time multimedia processing systems a very large part of the power consumption is due to the data storage and data transfer. Moreover, the area cost is often largely dominated by the memory modules. In deriving an optimized (for area and/or power) memory architecture, memory size computation is an important step in the exploration of the possible algorithmic specifications of multimedia applications. This paper presents a novel non-scalar approach for computing exactly the memory size in real-time multimedia algorithms. This methodology uses both algebraic techniques specific to the data-flow analysis used in modern compilers and, also, more recent advances in the theory of polyhedra. In contrast with all the previous works which are only estimation methods, this approach performs exact memory computations even for applications significantly large in terms of the code size, number of scalars, and number of array references.
Introduction
In advanced telecom and real-time multimedia processing systems (including video and image processing, medical imaging, artificial vision, real-time 3D rendering, advanced audio and speech coding), a very large part of the power consumption is due to the data storage and data transfer. A typical system architecture includes custom hardware (application-specific accelerator datapaths and logic), programmable hardware (DSP core and controller), and a distributed memory organization which is usually expensive in terms of power and area cost. Data transfer and memory access operations typically consume more power than a datapath operation. For instance, fetching an operand from an off-chip memory for an addition consumes 33 times more power than the actual computation; even a transfer from an on-chip memory consumes about 4 to 10 times more power than the addition itself [5]. Moreover, the area cost is often largely dominated by memories. Hence, the optimization of the memory architecture is a crucial step in the design methodology for this type of application.
In deriving an optimized memory architecture, memory size estimation/computation is an important step in the early phase of the design, the system-level exploration. This problem has been tackled in the past both in register-transfer level (RTL) programs at scalar level (e.g., [10]), and in behavioral specifications at non-scalar level (see below).
* An abbreviated version of this paper was presented at ASP-DAC 2006 [18]. This material is based upon work supported by the National Science Foundation under Grant Nr. 0133318.
a) E-mail: fbalasa@cs.uic.edu DOI: 10.1093/ietfec/e89-a.12.3378
Common to all scalar-based storage estimation techniques is that the number of scalars is drastically limited. When multi-dimensional arrays are present in the algorithmic specification of the targeted applications, the computation times of these techniques increase dramatically if the arrays are flattened and each array element is treated like a separate scalar.
To overcome the shortcomings of the scalar estimation techniques for high-level algorithmic specifications where the code is organized in loop nests and multidimensional arrays are present, several research teams proposed techniques exploiting the fact that, due to the loop structure of the code, large parts of an array can be produced or consumed within the same array reference. These estimation approaches can be basically split into two categories: those requiring a fully-fixed execution ordering [13], [16], [17], and those assuming a non-procedural specification where the execution ordering is not (completely) fixed [2], [9]. This paper presents a non-scalar method for computing exactly the memory size in multi-dimensional signal processing algorithms where the code is procedural, that is, where the loop structure and sequence of instructions induce the (fixed) execution ordering. This approach uses both algebraic techniques specific to the data-flow analysis used in modern compilers [11], and also more recent advances in the theory of polyhedra. In contrast with previous works, which utilize only approximate methods due to the size of the problems (in terms of number of scalars, number and complexity of array references), this approach obtains exact determinations even for significantly large applications. Since the mathematical model is very general, this novel approach is able to handle the entire class of "affine" specifications (see Sect. 2), and therefore practically the entire class of real-time multi-dimensional signal processing applications.
The paper is organized as follows. Section 2 explains the problem of memory size computation. The core of the paper (Sects. 3 and 4) presents technical aspects of this novel approach. Section 5 will briefly discuss implementation aspects and present several experimental results. Section 6 will summarize the main conclusions of this work.
The Memory Size Computation Problem
The (real-time) multimedia processing algorithms are typically specified in a high-level programming language, where the code is organized in sequences of loop nests having as boundaries linear functions of the outer loop iterators, conditional instructions where the conditions may be either data-dependent or data-independent (relational and/or logical operators of linear functions of loop iterators), and multidimensional signals whose array references have (complex) linear indices. This class of specifications is often referred to as affine [5].
Real-time multimedia algorithms describe the processing of streams of data samples. The source code of these algorithms can be imagined as surrounded by an implicit loop having a discrete time as iterator. Consequently, each signal in the algorithm has an implicit extra dimension corresponding to the time axis. These algorithms often contain delayed signals, i.e., signals produced (or inputs) in previous data-sample processings, which are consumed during the current sample processing. The delay operator "@" indicates such delayed signals, the following argument signifying the number of previous samples. The delayed signals must be kept "alive" during several time iterations, i.e., they must be stored in the background memory during one or several data-sample processings.
An illustrative example, derived from a motion detection algorithm [5] , is given below:
The problem is to determine the minimum number of memory locations necessary to store the signals of a given multimedia algorithm during its execution, or, equivalently, the maximum storage occupancy assuming any scalar signal must be stored only during its lifetime. The total number of scalars in the algorithm above is 3,749,063. But because scalars having disjoint lifetimes can share the same memory location, the amount of storage can be much smaller than the total number of scalar signals. Actually, only 33,284 memory locations are necessary for this example, as computed by hand and confirmed by our tool presented in Sect. 5.
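The problem statement can be illustrated at scalar level with a minimal sketch. It assumes that each scalar is described by a hypothetical pair (production step, last consumption step) and that a location is released right after the last read; neither the function name nor the toy lifetimes come from the paper. This is exactly the kind of scalar enumeration that becomes too expensive for large specifications, so it only serves to make the notion of maximum storage occupancy concrete.

```python
def max_occupancy(lifetimes):
    """Peak number of simultaneously alive scalars.

    lifetimes: iterable of (produced_at, last_consumed_at) integer steps,
    with produced_at <= last_consumed_at (an assumed representation).
    """
    events = []
    for start, end in lifetimes:
        events.append((start, +1))    # a location becomes occupied
        events.append((end + 1, -1))  # the location is free after the last read
    events.sort()                     # at equal steps, releases sort before allocations
    alive = peak = 0
    for _, delta in events:
        alive += delta
        peak = max(peak, alive)
    return peak


# Five toy scalars with partly disjoint lifetimes (illustrative data only).
toy = [(0, 3), (1, 2), (4, 6), (5, 5), (3, 4)]
peak = max_occupancy(toy)
print(len(toy), "scalars need", peak, "locations")
```

Scalars with disjoint lifetimes share locations, so the peak is smaller than the number of scalars.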
It must be emphasized that image and video processing applications contain deeper loop nests with iterators typically having large ranges, resulting in extremely large numbers of scalar signals. The simulated execution of the code or RTL approaches based on the left edge algorithm [10], although appealing in their simplicity, are too computationally expensive in such cases, often prohibitively so.
Fundamentally different from the previous works [13], [16], [17], which perform only a memory size estimation, the algorithm presented in this paper is able to compute exactly the storage requirements for multimedia applications, even when the number of scalar signals is large (typically, 2-3 orders of magnitude larger than in previous works [2], [9], [13], [16], [17]). The basic reasons for its efficiency are: (a) the use of a relatively recent mathematical advance, the polynomial-time decomposition of an n-dimensional polyhedron into unimodular cones [3], (b) the efficient decomposition of the array references of the multi-dimensional signals into disjoint linearly bounded lattices [15], and (c) an efficient mechanism for pruning the code of the algorithmic specification.
Computation of Array Reference Size
Definitions A polyhedron is a set of points $P \subset \mathbb{R}^n$ satisfying a finite set of linear inequalities: $P = \{x \in \mathbb{R}^n \mid A \cdot x \ge b\}$, where $A \in \mathbb{R}^{m \times n}$ and $b \in \mathbb{R}^m$. If P is a bounded set, then P is called a polytope.
An array reference of an m-dimensional signal, in the scope of a nest of n loops having the iterators $i_1, \ldots, i_n$, is characterized by an iterator space and an index space. The iterator space signifies the set of all iterator vectors $i = (i_1, \ldots, i_n) \in Z^n$ in the scope of the array reference. The index space is the set of all index vectors $x = (x_1, \ldots, x_m) \in Z^m$ of the array reference. When the indices of an array reference are linear mappings with integer coefficients of the loop iterators, the index space consists of one or several linearly bounded lattices (LBLs) [15], that is, the image of an affine vector function over the iterator polytope† $A \cdot i \ge b$:
$\{\, x = T \cdot i + u \in Z^m \mid A \cdot i \ge b,\ i \in Z^n \,\}$   (1)
where $x \in Z^m$ is the index vector of the m-dimensional signal and $i \in Z^n$ is an n-dimensional iterator vector (see the example below). In our context, the elements of the matrices T and A, as well as those of the vectors u and b, are considered integers.
In order to address the computation of the memory size necessary for the execution of a multi-dimensional signal processing algorithm, a simpler problem must be addressed first: the computation of the number of distinct scalars in an array reference, that is, how many locations are needed to store one array reference. If the rank of matrix T is equal to its number of columns, then the vector function x = T · i + u between the iterator and index spaces is proven to be a one-to-one mapping [2], and the computation of the number of distinct signal indices (i.e., the amount of memory necessary to store the scalars covered by the array reference) is hence reduced to counting the number of iterator vectors or, equivalently, the number of lattice points (i.e., points having integer coordinates) in the iterator polytope A · i ≥ b in (1). In such a situation, a computation technique based on the decomposition of a simplicial cone into unimodular cones [3] is used†. An example is given below, illustrating both the concepts and the technique. Due to space limitations, several details of the computation had to be skipped, along with part of the theoretical justifications. However, this example succeeds in illustrating the main steps of the computation flow. Moreover, it will clearly show the scalability of the technique, which makes it adequate for multimedia applications, where the array references may cover large sets of scalars.
† The iterator space of an array reference is not always a convex polytope; it can be a non-convex polyhedron, or even a union of convex and nonconvex polyhedra. Nevertheless, it can be decomposed into a finite set of disjoint (convex) polytopes, as will be explained in Sect. 4. This is why, without loss of generality, the iterator space can be considered one (convex) polytope.
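Example 1 How many memory locations are necessary to store the array reference A[2i + 3j][5i + j] in the scope of a loop nest whose iterator space is {0 ≤ i ≤ 4, 0 ≤ j ≤ 2i, j ≤ −i + 6}?
The linearly bounded lattice (LBL) corresponding to this array reference is
$\left\{ x = \begin{bmatrix} 2 & 3 \\ 5 & 1 \end{bmatrix} \begin{bmatrix} i \\ j \end{bmatrix} \;\middle|\; \begin{bmatrix} 1 & 0 \\ -1 & 0 \\ 0 & 1 \\ 2 & -1 \\ -1 & -1 \end{bmatrix} \begin{bmatrix} i \\ j \end{bmatrix} \ge \begin{bmatrix} 0 \\ -4 \\ 0 \\ 0 \\ -6 \end{bmatrix} \right\}$
(u = 0 here) and the problem is equivalent to computing the size of this set. Since the rank of matrix T is 2, equal to its number of columns, the vector function x = T·i + u is a one-to-one mapping [2]. The computation of the number of scalars covered by A[2i + 3j][5i + j] is therefore equivalent to counting the number of lattice points in the iterator polytope Ai ≥ b shown in Fig. 1. This latter operation is done as explained below. But, first, a few definitions are necessary.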
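The equivalence between counting distinct indices and counting iterator lattice points can be checked on this example with a small brute-force sketch. It assumes the iterator space {0 ≤ i ≤ 4, 0 ≤ j ≤ 2i, j ≤ −i + 6} stated in Step 1 of the text; the helper name is hypothetical. The enumeration is only a sanity check and is not the cone-decomposition technique of [3].

```python
def iterator_points():
    """Lattice points of {0 <= i <= 4, 0 <= j <= 2i, j <= -i + 6}."""
    return [(i, j)
            for i in range(0, 5)
            for j in range(0, min(2 * i, -i + 6) + 1)]


points = iterator_points()

# Index vectors x = T.i with T = [[2, 3], [5, 1]] and u = 0.
indices = {(2 * i + 3 * j, 5 * i + j) for i, j in points}

# A nonzero determinant means rank(T) = 2, so the mapping is one-to-one.
det_T = 2 * 1 - 3 * 5

print(len(points), len(indices), det_T)
```

Because T has full column rank, the number of distinct index vectors equals the number of iterator vectors.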
Definitions Let $r_1, \ldots, r_d \in Z^d$ be linearly independent integer vectors. The (rational polyhedral) cone, pointed at the origin, generated by the rays $r_1, \ldots, r_d$ is the set $C = \{\alpha_1 r_1 + \cdots + \alpha_d r_d \mid \alpha_1, \ldots, \alpha_d \ge 0\}$.
For instance, the set of points inside the angle $iV_1z$ is a 2-dimensional cone generated by the rays $r_1 = [1\ 2]^T$ and $r_2 = [1\ 0]^T$ (see Fig. 1). To each vertex of a polyhedron corresponds a supporting cone. The supporting cone of the vertex $V_1$ (Fig. 1), denoted $C(V_1)$, is the one generated by the rays $r_1$ and $r_2$. A cone is called unimodular if the matrix of the rays $[r_1 \cdots r_d]$ is unimodular (i.e., its determinant is ±1).
Step 1 Find the vertices of the iterator polytope Ai ≥ b and their supporting cones.
Given the inequalities {0 ≤ i ≤ 4, 0 ≤ j ≤ 2i, j ≤ −i + 6} defining the iterator space, the vertices and the rays are computed using the reverse search algorithm [1] . The supporting cones corresponding to the vertices V 1 , . . . , V 4 of the iterator polytope, as well as their generating rays shown below as column vectors, are:
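As a cross-check of Step 1, the vertices of this small polytope can also be recovered by intersecting pairs of boundary lines and keeping the feasible intersection points. This is a naive sketch for the 2-dimensional example only, not the reverse search algorithm of [1]; the constraint encoding and the function names are assumptions made for the illustration. The last function tests unimodularity of a 2-dimensional cone from its two rays.

```python
from fractions import Fraction
from itertools import combinations

# Each constraint is (a, b, c), meaning a*i + b*j >= c.
# Encodes {0 <= i <= 4, 0 <= j <= 2i, j <= -i + 6}.
CONSTRAINTS = [(1, 0, 0), (-1, 0, -4), (0, 1, 0), (2, -1, 0), (-1, -1, -6)]


def feasible(p):
    """True if the point p satisfies every constraint."""
    return all(a * p[0] + b * p[1] >= c for a, b, c in CONSTRAINTS)


def vertices():
    """Feasible pairwise intersections of the boundary lines."""
    found = set()
    for (a1, b1, c1), (a2, b2, c2) in combinations(CONSTRAINTS, 2):
        det = a1 * b2 - a2 * b1
        if det == 0:
            continue  # parallel boundary lines never meet
        # Cramer's rule on a1*x + b1*y = c1, a2*x + b2*y = c2.
        x = Fraction(c1 * b2 - c2 * b1, det)
        y = Fraction(a1 * c2 - a2 * c1, det)
        if feasible((x, y)):
            found.add((x, y))
    return found


def is_unimodular(r1, r2):
    """A 2-D cone is unimodular if det [r1 r2] is +1 or -1."""
    return abs(r1[0] * r2[1] - r1[1] * r2[0]) == 1


print(sorted(vertices()))
print(is_unimodular((1, 2), (1, 0)))  # rays of C(V1) from the text
```

The rays [1 2]^T and [1 0]^T give a determinant of −2, which agrees with the statement in Step 2 that the supporting cone of V1 is not unimodular.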
Step 2 Apply Barvinok's algorithm [3] to decompose the supporting cones into unimodular cones. The first two cones in our example are not unimodular. Their decomposition is given below, without any additional explanation due to lack of space:
Step 3 Find the generating function $F(V_i)$ of each supporting cone and the generating function $F = \sum_i F(V_i)$ of the whole iterator polytope P.
Definition The multivariate polynomial $F(z) = \sum_{\alpha \in P \cap Z^d} z^{\alpha}$, where $z^{\alpha} = z_1^{\alpha_1} \cdots z_d^{\alpha_d}$, is defined as the generating function of the polytope P. For example, the generating function of the triangle with the vertices (0, 0), (1, 2), (2, 0) is $F(x, y) = 1 + x + x^2 + xy + xy^2$, each monomial term corresponding to one of the lattice points inside or on the border of the triangle: e.g., $x^1 y^2$ corresponds to the point (1,2), and so on. By evaluating F at z = 1, we get the number of lattice points in P [3]. For instance, if x = 1, y = 1 then F = 5, which is the number of lattice points in the triangle.
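The triangle example can be reproduced with a short sketch that builds the generating function as an explicit sum of monomials, one per lattice point, and evaluates it at (1, 1). The half-plane description of the triangle and the function names are assumptions of this illustration; writing F out monomial by monomial is precisely what becomes impractical for large polytopes.

```python
def triangle_points():
    """Lattice points of the triangle with vertices (0,0), (1,2), (2,0).

    The triangle is described by y >= 0, y <= 2x and y <= -2x + 4.
    """
    return [(x, y)
            for x in range(0, 3)
            for y in range(0, 3)
            if y <= 2 * x and y <= -2 * x + 4]


def F(x, y):
    """Generating function: one monomial x^a * y^b per lattice point (a, b)."""
    return sum(x ** a * y ** b for a, b in triangle_points())


print(triangle_points())
print(F(1, 1))  # number of lattice points in the triangle
```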
Writing F as a sum of monomial terms would be impractical for large polytopes. Fortunately, F can be compactly written as an algebraic sum of rational functions, each term corresponding to one of the unimodular cones in the decomposition of the supporting cones $C(V_i)$ (i = 1, ..., 4) of the polytope vertices (Steps 1 and 2). According to [3], each supporting cone C(V) has an associated generating function F(V) of the form
$F(V) = \sum \pm \dfrac{x^a y^b}{\prod_i (1 - z^{r_i})}$
(the vertex V(a, b) has two coordinates a, b in this case), where the sum is over all the unimodular component cones, and the product in the denominator is over all the generating rays $r_i$.
† There are also other methods, for instance, based on Ehrhart polynomials [6], or even simpler, by adapting the Fourier-Motzkin elimination [7]. The current technique was chosen because of its scalability, as explained later.
For instance, the generating function F(V 1 ) has two terms, one for each unimodular cones in the decomposition (2) T of the first unimodular cone (2) yield the first denominator:
T and [0 − 1] T of the second unimodular cone (2) yield the denominator:
. Therefore, the generating function of the cone C(V 1 ) is
With similar computations, the generating functions for the other supporting cones are:
The sum of these rational functions yields the generating function F of the whole quadrilateral in Fig. 1 .
Step 4 Compute the number of lattice points from the generating function $F = \sum_i F(V_i)$ of the polytope.
In order to evaluate $F = \sum_i F(V_i)$ at z = 1, F must be transformed into a one-variable function. This is done with a substitution $z \to t^{\lambda}$, where $\lambda = [\lambda_1\ \lambda_2]^T$ is an integer vector chosen such that the substitution $x = t^{\lambda_1}$, $y = t^{\lambda_2}$ would not make any denominator equal to zero [8]. In our example, choosing $\lambda_1 = 0$ is not possible since the denominator factor 1−x of $F(V_1)$ would be zero. Similarly, we must have $\lambda_2 \ne 0$ and $\lambda_1 + 2\lambda_2 \ne 0$. We choose, for instance, $\lambda = [1\ {-1}]^T$. With the substitution $x = t$, $y = t^{-1}$, the generating function of the iterator polytope becomes:
After eliminating the negative exponents in the denominators, we factor out $t^{-2}$ in order to eliminate all the negative exponents in F; this makes the further computations simpler, without changing the final result, the evaluation of F at t = 1. Computationally, this is accomplished by substituting t = s + 1 and doing a Taylor expansion about s = 0 via a polynomial division. After the substitution t = s + 1, we obtain rational terms of the form $\dfrac{P(s)}{s^d\, Q(s)}$, where P(s) and Q(s) are polynomials, and d = 2 is the dimension of the iterator space:
After polynomial divisions in all the terms of F, the algebraic sum of the coefficients $c_2$ (since the space dimension d = 2) is the evaluation of F at s = 0, that is, the number of lattice points [3]. In this example, the 6 coefficients $c_2$ (one for each term of F) are {−3, −3, 0,
Assume now that the range of the first iterator in Example 1 is 0 to 400 (rather than 0 to 4) and in the second loop the condition j ≤ −i + 6 is replaced by j ≤ −i + 600. The iterator polytope is a quadrilateral similar to the one in Fig. 1, but much larger, the similarity ratio being 100. The computation effort necessary to find the number of memory locations for the array reference A[2i + 3j][5i + j] is not affected by the very significant increase in size of the iterator space. Indeed, the 4 supporting cones are generated by the same rays, the decompositions are the same, and the generating functions are almost the same. The only difference appears in the numerators of $F(V_2)$, $F(V_3)$, and $F(V_4)$ due to the modifications of the coordinates of these vertices. For instance, the numerator of $F(V_3)$ becomes $x^{400} y^{200}$ since the new coordinates of $V_3$ are (400,200). The storage requirement for this case is 100,501 locations. Note that the number of lattice points does not scale up with the square of the similarity ratio like, for instance, the area of the quadrilateral.
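The scaled figure can be cross-checked with a simple row-by-row count over the first iterator. This sketch is only a verification aid under the stated iterator bounds; the function name and its parameters are hypothetical. Unlike the generating-function technique, its running time grows with the range of i, which is the behavior the cone-based method avoids.

```python
def count_points(i_max, c):
    """Lattice points of {0 <= i <= i_max, 0 <= j <= 2i, j <= -i + c}."""
    total = 0
    for i in range(i_max + 1):
        top = min(2 * i, c - i)  # largest admissible j for this i
        if top >= 0:
            total += top + 1     # j ranges over 0..top
    return total


small = count_points(4, 6)       # iterator space of Example 1
large = count_points(400, 600)   # same shape, similarity ratio 100
print(small, large, large / small)
```

The ratio of the two counts is well below 100 squared, in line with the remark that the number of lattice points does not scale like the area.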
Moreover, the technique sketched above, although illustrated for a 2-dimensional signal in the scope of an iterator space of dimension 2, works for arbitrary numbers of dimensions of both the index and iterator spaces. Therefore, it is well-suited to address the size of array references typical to multimedia applications.
The example above illustrated the case when there is a one-to-one mapping between the iterator and index spaces. But this is not always true. When the rank r of matrix T is smaller than n, the number of columns of T, the memory occupied by the array reference is upper bounded by the number of lattice points in the r-dimensional polytope $pr_r(Ai \ge b)$, the real projection of Ai ≥ b on $R^r$ along the first r coordinates. $pr_r(Ai \ge b)$ can be easily computed by eliminating the last n − r iterators in Ai ≥ b with the Fourier-Motzkin technique [7]. It must be noticed that not necessarily all the lattice points in $pr_r(Ai \ge b)$ represent projections of lattice points from Ai ≥ b [12]. These invalid projections are detected by substituting the r coordinates of the projection point in question into the polytope Ai ≥ b and checking if the resulting (n − r)-dimensional Z-polytope is empty.
The general idea is to bring, first, the matrix T of the mapping to the Hermite Normal Form (HNF) [14]. For instance, post-multiplying the matrix of the vector mapping $T = [3\ \ 1]$ by the unimodular matrix $S = \begin{bmatrix} 0 & 1 \\ 1 & -3 \end{bmatrix}$ yields the HNF: $T \cdot S = [1\ \ 0]$. Note that while HNF is a canonical form, the factorizing matrix S is not necessarily unique when T is not square and non-singular. Most of the algorithms computing HNF also propose techniques to build the factorizing (unimodular) matrix. In our context, any such approach will do, provided it has polynomial complexity for reasons of efficiency†. The rank of matrix T is r = 1, hence less than the number of columns n = 2 of T; in this case, the vector function x = Ti + u may not be a one-to-one mapping. Indeed, the iterator vectors $[2\ 3]^T$ and $[3\ 0]^T$ from the iterator space in Fig. 1 are mapped to the same index 3i + j = 9. The transformation S modifies the initial iterator space (which is the same as in Example 1) into {0 ≤ l ≤ 4, 3l ≤ k ≤ 5l, k ≤ 2l + 6} ($[k\ l]^T$ is the new iterator vector after the transformation S), whose exact 1D projection (since r = 1) is {0 ≤ k ≤ 14}, obtained by eliminating l in the inequalities above [7]. The points in the "dark shadow" [12] 4 ≤ k ≤ 14 correspond to lattice points (k,l) in the modified iterator space. The other values of k are individually checked. k=1 and 2 turn out to be invalid projections: substituting these values in the modified iterator space, no integer solution for l can be found. Conversely, k=0 and 3 are valid projections, since (k,l)=(0,0) and (3,1) belong to the modified iterator space. Therefore, storing A[3i + j] requires 15 − 2 = 13 locations. Indeed, the index 3i + j can take all the values between 0 and 14, except 1 and 2.
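The figures quoted for A[3i + j] can be verified by direct enumeration over the iterator space of Fig. 1. The sketch below assumes the bounds {0 ≤ i ≤ 4, 0 ≤ j ≤ 2i, j ≤ −i + 6}; variable names are hypothetical. It checks the product T·S, lists the iterator vectors that collide on index 9, and finds the values of the 1D projection that have no integer preimage. It replaces the Fourier-Motzkin and dark-shadow reasoning by brute force, so it is a check rather than the method itself.

```python
# Iterator space of Fig. 1, enumerated explicitly.
points = [(i, j)
          for i in range(0, 5)
          for j in range(0, min(2 * i, -i + 6) + 1)]

# T = [3 1] post-multiplied by the unimodular S = [[0, 1], [1, -3]].
T = [3, 1]
S = [[0, 1], [1, -3]]
TS = [T[0] * S[0][0] + T[1] * S[1][0],
      T[0] * S[0][1] + T[1] * S[1][1]]

# Distinct values of the index 3i + j.
values = sorted({3 * i + j for i, j in points})

# Iterator vectors mapped to the same index 9 (mapping is not one-to-one).
collisions = [(i, j) for i, j in points if 3 * i + j == 9]

# Values of the real projection 0..14 without an integer preimage.
invalid = [k for k in range(0, 15) if k not in values]

print(TS, len(values), collisions, invalid)
```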
Memory Size Computation Algorithm Based on Data-Dependence Analysis
The main steps of the memory size computation algorithm will be discussed below.
Step 1 Extract the array references from the given algorithmic specification of the multimedia application and decompose the array references for every indexed signal into disjoint linearly bounded lattices. The analytical partitioning of the array references of every signal into disjoint LBLs can be performed by a recursive intersection, starting from the array references in the code. Let
$\{x = T_1 \cdot i_1 + u_1 \mid A_1 \cdot i_1 \ge b_1\}$ and $\{x = T_2 \cdot i_2 + u_2 \mid A_2 \cdot i_2 \ge b_2\}$
be two LBLs of an indexed signal in the algorithmic specification, where $T_1$ and $T_2$ have the same number of rows (the signal dimension). Intersecting the two linearly bounded lattices means, first of all, solving a linear Diophantine system†† $T_1 \cdot i_1 - T_2 \cdot i_2 = u_2 - u_1$ having the elements of $i_1$ and $i_2$ as unknowns. If the system has no solution, the intersection is empty. Otherwise, let
be the solution of the Diophantine system [14] . If the set of coalesced constraints of the two LBLs
has integer solutions, then the intersection is a new LBL:
The disjoint LBLs of signal Dlt from the illustrative example in Sect. 2 are (in non-matrix format):
Figure 2 shows a polyhedral dependence graph built from the illustrative example in Sect. 2, where the nodes are the disjoint LBLs determined at this step and the arcs are the dependence relations between them derived from the code. The nodes are labeled with the number of scalar signals they cover and the arcs are labeled with the number of dependencies (both computed using the algorithm from Sect. 3).
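The intersection of two LBLs described in Step 1 can be sketched for the simplest case: two one-dimensional references A[t1·i1 + u1] and A[t2·i2 + u2] with positive coefficients, each iterator ranging over an interval. The sketch below solves the single Diophantine equation t1·i1 − t2·i2 = u2 − u1 with the extended Euclidean algorithm, used here as a stand-in for the HNF-based methods of [14], and then applies the iterator bounds to the one-parameter solution. All function names, parameters, and the sample references are hypothetical.

```python
def ext_gcd(a, b):
    """Return (g, x, y) with a*x + b*y == g == gcd(a, b), for a, b >= 0."""
    if b == 0:
        return a, 1, 0
    g, x, y = ext_gcd(b, a % b)
    return g, y, x - (a // b) * y


def intersect_1d(t1, u1, lo1, hi1, t2, u2, lo2, hi2):
    """Indices common to {t1*i1 + u1, lo1 <= i1 <= hi1} and
    {t2*i2 + u2, lo2 <= i2 <= hi2}, assuming t1 > 0 and t2 > 0."""
    rhs = u2 - u1
    g, x, y = ext_gcd(t1, t2)
    if rhs % g != 0:
        return []                       # no integer solution: empty intersection
    k = rhs // g
    i1_0, i2_0 = x * k, -y * k          # particular solution
    step1, step2 = t2 // g, t1 // g     # i1 = i1_0 + step1*n, i2 = i2_0 + step2*n

    def ceil_div(a, b):
        return -((-a) // b)

    # Coalesced bounds of both iterators, expressed on the parameter n.
    n_lo = max(ceil_div(lo1 - i1_0, step1), ceil_div(lo2 - i2_0, step2))
    n_hi = min((hi1 - i1_0) // step1, (hi2 - i2_0) // step2)
    return [t1 * (i1_0 + step1 * n) + u1 for n in range(n_lo, n_hi + 1)]


# A[2i], 0 <= i <= 10, intersected with A[3j + 1], 0 <= j <= 6.
common = intersect_1d(2, 0, 0, 10, 3, 1, 0, 6)
brute = sorted({2 * i for i in range(11)} & {3 * j + 1 for j in range(7)})
print(common, brute)
```

The intersection is again a one-dimensional LBL: an affine image of the parameter n over an interval, which mirrors the structure of the general result.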
If the iterator space of an array reference is not a (convex) polytope, it can be partitioned into disjoint polytopes with the same decomposition algorithm presented above. Note that the convex polytope is a particular case of LBL when in the vector mapping x = T · i + u the matrix T is $I_n$
† These algorithms are designed to prevent the so-called "coefficient swell" [14].
†† Finding the integer solutions of the system. Solving a linear Diophantine system was proven to be of polynomial complexity, all the known methods being based on bringing the system matrix to the Hermite Normal Form [14].
T at assignment (1). There are, in total, 242 assignments executed in this loop nest. The first consumed element is A[10] in the iteration (10,0), at assignment (2). Hence, after the 222nd assignment one memory location (the one storing A[10]) becomes free. Then, after the 223rd assignment the location storing A[11] becomes free, followed by the one storing A[9] after the 224th assignment, etc.
Actually, part of the LBLs produced or consumed in the block can be conveniently skipped if their effect on the memory variation can be taken into account without generating the scalars they cover. For instance, in the illustrative example from Sect. 2, each iterator vector $[i\ j\ k\ l]^T$ corresponds to a unique produced scalar
The effect of the two array references on the memory variation is +1 − 1 = 0 in each iteration and, therefore, these operands can be skipped from further analysis, a pruning that significantly increases the computation speed.
Experimental Results
A memory size computation tool (named K2 after the famous peak, whose climbing adversity is meant to suggest the difficulty of its implementation) has been implemented in C++, incorporating the ideas and algorithms described in this paper. For the syntax of the algorithmic specifications, we adopted a subset of the C language (see, e.g., the illustrative example in Sect. 2) "enriched" with the delay operator @. In addition to the computation of the minimum memory size requirements and different statistical data on the memory usage by the multi-dimensional signals in the multimedia specification, the tool can optionally generate dependence graphs (like the one in Fig. 2) at different granularity levels, which provide information about the relations between different groups of signals, and also the trace of the memory occupancy during the execution of the input specification. Such a memory trace is shown in Fig. 3. It must be stressed that the tool does not simulate the execution of the specification code; the tool finds the points where the memory occupancy changes value, this number of points being 14,806 for the illustrative example, which represents only 0.4% of the number of datapath instructions when the illustrative code is executed. Table 1 summarizes the experimental results.
The benchmarks used are: (1) a real-time regularity detection algorithm used in robot vision; (2) the kernel of a voice coding application, an essential component of a mobile radio terminal; (3) a singular value decomposition (SVD) updating algorithm used in spatial division multiplex access (SDMA) modulation in mobile communication receivers, in beamforming, and Kalman filtering; (4) a 2D Gaussian blur filter from a medical image processing application which extracts contours from tomograph images in order to detect brain tumors; (5) a motion detection algorithm used in the transmission of real-time video signals on data networks [5]; (6) a dynamic programming algorithm used in several multimedia applications.
Columns 2 and 3 display the numbers of array references in the code and, respectively, of the disjoint LBLs derived in the partitioning process (Sect. 4). Column 4 displays the numbers of scalar signals, column 5 shows the storage requirements (number of memory locations), and column 6 gives a measure of the memory sharing due to the disjoint lifetimes of the scalar signals. The last column displays the running times of our experiments carried out on a PC with a 1.85 GHz Athlon XP processor and 512 MB memory.
Since there is no other similar tool able to validate the memory size results, we adopted indirect means of validation. We built a large number of artificial test programs, complex enough to cover all the relevant situations, but not so large as to prevent us from verifying the memory size variation by simulated execution of the code†. Subsequently, we compared these memory traces with the results provided by the tool. A second validation approach was to study thoroughly the code of the applications and try to find theoretically the minimum memory size. This second strategy had, obviously, a limited success due to the inherent complexity of most of the applications. However, we succeeded in a few cases in finding the minimum memory size by hand computation (e.g., for the motion detection kernel).
This tool can process large specifications in terms of number of loop nests, lines of code, and number of array references. For instance, the voice coding application contains 232 array references organized in 40 loop nests. In one of our experiments, the tool processed in less than 8 minutes a code with 113 loop nests 3 levels deep, containing 906 array references, many having very complex indices.
A comparative evaluation with the previous non-scalar works reveals the following notable differences:
1. Part of the previous works impose important constraints on the properties of the algorithmic specifications they can process.
For instance, Ramanujam et al. address only specifications with perfectly nested loops (i.e., in which every statement is inside the innermost loop) [13]. Verbauwhede et al. consider loops with non-constant boundaries, but in such cases the varying boundaries are internally replaced with upper and lower bounds in order to fit their computational model [16].
In comparison, our model handles the entire class of "affine" specifications -as described in Sect. 2 -without any constraints.
2. Part of the previous works do not report running times [9] , [13] .
Although [9] is, to the best of our knowledge, the only previous work reporting results on a complex application (the MPEG-4 motion estimation kernel) with a significant number of scalars (262,400), the absence of running time information is a shortcoming.
3. All the previous works except [9] use benchmarks with a relatively small number (up to tens of thousands) of scalars. These works do not offer any information on the scalability of their techniques.
From our past experience, even if a memory size estimation technique behaves reasonably well when dealing with examples containing thousands of scalars, the computation time can increase sharply until the technique becomes ineffectual for examples where the number of scalars is larger by 1-2 orders of magnitude.
In comparison, our approach can handle applications with millions of scalars in acceptable times. The running times of our approach also increase significantly with the size of the problem, but our tool is still effective when processing examples with at least one (but often 2-3) order(s) of magnitude more scalars.
4. The previous memory estimation techniques yield storage results that may be highly inaccurate.
Verbauwhede et al. state that their determinations are exact when the loop boundaries are constant, but overestimates occur when they are not [16] . However, they do not report any concrete results on the amount of the overestimates they experienced. Ramanujam et al. report exact determinations for all their benchmarks, except one exhibiting an estimation error of 13% [13] . Zhao and Malik obtained an estimation of 1372 memory locations for the motion detection kernel, when the set of parameters is M = N = 32, m = n = 4 [17] . This is a rather poor estimation since the correct result (computed by our tool, but also theoretically confirmed) for the same set of parameters is 2740 storage locations.
In contrast, our tool performs only exact determinations. One could argue that it may be better to obtain fast estimations, even if not very accurate, rather than exact determinations with a significantly higher computation effort. This argument is fair enough, but it does not apply to the present situation. The aforementioned estimation result [17] was obtained in 21 seconds on a Sun Enterprise 4000 machine with 4 (336 MHz UltraSparc) processors and 4 GB memory, whereas our computation time was only 2 seconds on the Athlon XP PC and 7 seconds on a Sun Ultra 20. Therefore, not only did our approach find the exact result, but it did so faster than the estimation technique [17].
Conclusions
This paper has presented a non-scalar approach for computing the memory requirements for real-time multimedia applications, where the storage of large multi-dimensional signals causes a significant cost in terms of both area and power consumption. This method uses modern elements of the theory of polyhedra and algebraic techniques specific to the data-flow analysis used nowadays in compilers. Different from past works, which performed only a memory size estimation, our approach does exact computations.
† Such tools exhibit, in general, a poor scalability, being ineffectual for large programs, with numerous scalars and deep loop nests.
