Many signal processing systems, particularly in the multimedia and telecommunication domains, are synthesized to execute data-intensive applications: their cost-related aspects (namely power consumption and chip area) are heavily influenced, if not dominated, by the data access and storage aspects. This paper presents an energy-aware memory allocation methodology. Starting from the high-level behavioral specification of a given application, this framework performs the assignment of the multidimensional signals to the memory layers (the on-chip scratch-pad memory and the off-chip main memory), the goal being the reduction of the dynamic energy consumption in the memory subsystem. Based on the assignment results, the framework subsequently performs the mapping of signals into both memory layers such that the overall amount of data storage is reduced. This software system yields a complete allocation solution: the exact storage amount on each memory layer, the mapping functions that determine the exact locations for any array element (scalar signal) in the specification, and an estimation of the dynamic energy consumption in the memory subsystem.
key words: multidimensional signal processing, memory management, memory allocation, dynamic energy consumption, signal-to-memory assignment
Introduction
Many multidimensional signal processing systems, particularly in the areas of multimedia and telecommunications, are synthesized to execute data-intensive applications, the data transfer and storage having a significant impact on both the system performance and the major cost parameters -power and area.
In particular, the memory subsystem is typically a major contributor to the overall energy budget of the entire system. The dynamic energy consumption is caused by memory accesses, whereas the static energy consumption is due to leakage currents. Savings of dynamic energy can potentially be obtained by accessing frequently used data from smaller on-chip memories rather than from the large off-chip main memory, the problem being how to optimally assign the data to the memory layers. As on-chip storage, the scratch-pad memories (SPMs), which are compiler-controlled static random-access memories that are more energy-efficient than the hardware-managed caches, are widely used in embedded systems, where caches incur a significant penalty in aspects like area cost, energy consumption, hit latency, and real-time guarantees**. Different from caches, the SPM occupies a distinct part of the virtual address space, with the rest of the address space occupied by the main memory. The consequence is that there is no need to check for the availability of the data in the SPM. Hence, the SPM does not possess a comparator and the miss/hit acknowledging circuitry [4]. This contributes to a significant energy (as well as area) reduction. Another consequence is that in cache memory systems the mapping of data to the cache is done during the code execution, whereas in SPM-based systems this can be done at compilation time, using a suitable algorithm, as this paper will show.
The energy-efficient assignment of signals to the on- and off-chip memories has been studied since the late nineties. These previous works focused on partitioning the signals from the application code into so-called copy candidates (since the on-chip memories were usually caches), and on the optimal selection and assignment of these to different layers in the memory hierarchy. For instance, Kandemir and Choudhary analyze and exploit the temporal locality by inserting local copies [12]. Their layer assignment builds a separate hierarchy per loop nest and then combines them into a single hierarchy. However, the approach lacks a global view on the lifetimes of array elements in applications having imperfectly nested loops. Brockmeyer et al. use the steering heuristic of assigning the arrays having the lowest access number over size ratio to the lowest memory layer first, followed by incremental reassignments [5]. Hu et al. can use parts of arrays as copies, but they typically use cuts along the array dimensions [10] (like rows and columns of matrices). Udayakumaran and Barua propose a dynamic allocation model for SPM-based embedded systems [17], but the focus is on global and stack data, rather than on multidimensional signals. Issenin et al. perform a data reuse analysis in a multi-layer memory organization [11], but the mapping of the signals into the hierarchical data storage is not considered. The energy-aware partitioning of an on-chip memory into multiple banks has been studied by several research groups as well. Techniques of an exploratory nature analyze possible partitions, matching them against the access patterns of the application [7]. Other approaches exploit the properties of the dynamic energy cost and the resulting structure of the partitioning space to come up with algorithms able to derive the optimal partition for a given access pattern [1].
** A detailed study [4] comparing the tradeoffs of caches and SPMs found experimentally that the latter exhibit 34% smaller area and 40% lower power consumption than a cache of the same capacity. Even more surprisingly, the runtime measured in cycles was 18% better with an SPM using a simple static knapsack-based allocation algorithm. As a general conclusion, the authors of the study found absolutely no advantage in using caches, even in high-end embedded systems in which performance is important. (Caches have been a big success for desktops though, where the usual approach to adding SRAM is to configure it as a cache.)
Copyright © 2009 The Institute of Electronics, Information and Communication Engineers
Despite many advances in memory design techniques over the past two decades, existing computer-aided design (CAD) methodologies are still ineffective in many aspects. In the previous works, the reduction of the dynamic energy consumption in hierarchical memory subsystems is mainly based on heuristic explorations of the solution space rather than on a formal methodology. Also, several models of mapping the multidimensional signals into the physical memory were proposed in the past (see [13] for an overview). However, they all failed (a) to provide efficient implementations, (b) to prove their effectiveness in hierarchical memory organizations, and (c) to provide quantitative measures of quality for the mapping solutions. Moreover, the reduction of power consumption and the mapping of signals in hierarchical memory subsystems were treated in the past as separate problems.
This paper presents a memory allocation methodology for embedded data-intensive signal processing applications. Starting from the high-level behavioral specification of a given application, where the code is organized in sequences of loop nests and the main data structures are multidimensional arrays, this framework performs the assignment of the multidimensional signals to the memory layers (the on-chip scratch-pad memory and the off-chip main memory), the goal being the reduction of the dynamic energy consumption in the memory subsystem. The previous works do not take into account the non-uniform access patterns within the same array; they distinguish only between the access intensity of entire arrays [5] (e.g., array A is more accessed than array B), or they try to heuristically identify the more accessed parts by imposing constraints on their shape and/or size [10] (e.g., the rows 1-2 of array A are more accessed than its rows 3-4). In contrast to the previous works, our analysis is more refined, allowing the formal identification of the intensely accessed areas of the array space, independent of their shape, size, or array dimensions. Assigning the most heavily accessed parts of the arrays to the scratch-pad layer (which requires less energy per access) entails important savings of the dynamic energy consumption in the memory subsystem. This is why our framework is energy-aware.
Based on the assignment results, the framework subsequently performs the mapping of signals into both memory layers such that the overall amount of data storage be significantly reduced. Different from the previous works, this mapping technique is designed to work in hierarchical memory organizations, operating with parts of the arrays that can be assigned to different physical memories. The polyhedral framework, common to both design phases (the signal assignment to the memory layers and the signal mapping into the data memories), entails a high computation efficiency since both phases rely on similar polyhedral operations. This software system yields a complete allocation solution: the exact storage amount on each memory layer, the mapping functions that determine the exact locations for any array element (scalar signal) in the specification, metrics of quality for the allocation solution, and an estimation of the dynamic energy consumption in the memory subsystem using the CACTI power model [19] .
The rest of the paper is organized as follows. Section 2 presents the algorithm that assigns the signals to the memory layers, aiming to minimize the dynamic energy consumption in the hierarchical memory subsystem subject to SPM size constraints. Section 3 describes the global flow of the memory allocation approach, focusing on the mapping aspects. Section 4 discusses the implementation and presents experimental results. Finally, Sect. 5 summarizes the main conclusions of this research.
Energy-Aware Signal Assignment to the Memory Layers
The algorithms describing the functionality of real-time multimedia and telecom applications are typically specified in a high-level programming language, where the code is organized in sequences of loop nests having as boundaries linear functions of the outer loop iterators. Conditional instructions are very common as well, and the multidimensional array references have (possibly complex) linear indexes (the variables being the loop iterators). Figure 1 shows an illustrative example whose structure is similar to the kernel of a motion detection algorithm [6]. The problem is to automatically identify those parts of arrays from the given application code that are more intensely accessed, in order to steer their assignment to the energy-efficient data storage layer (the on-chip scratch-pad memory).
[...] The drawbacks of such a simulation-based approach are twofold. First, the simulated execution may be computationally ineffective when the number of array elements is very large, or when the application code contains deep loop nests. Second, even if the simulated execution were feasible, such a scalar-oriented technique would not be helpful since the addressing hardware of the data memories would become very complex. An address generation unit (AGU) is typically implemented to compute arithmetic expressions in order to generate sequences of addresses [14]; a set of array elements is not a good input for the design of an efficient AGU.
Our proposed computation methodology for energy-aware signal assignment to the memory layers is described below, after defining a few basic concepts.
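Let $M[x_1(i_1, \ldots, i_n)] \cdots [x_m(i_1, \ldots, i_n)]$ be an array reference of an m-dimensional signal M, in the scope of a nest of n loops having the iterators $i_1, \ldots, i_n$. The array reference is characterized by an iterator space and an index (or array) space. The iterator space is the set of all iterator vectors $\mathbf{i} = (i_1, \ldots, i_n) \in \mathbb{Z}^n$ in the scope of the array reference, and it can typically be represented by a so-called Z-polytope (a bounded and closed polyhedron, restricted to the set $\mathbb{Z}^n$). The index space is the set of index vectors of the array reference; since the indexes are linear functions of the iterators, it can be represented by a linearly bounded lattice [15], that is, the image $\mathbf{x} = T \cdot \mathbf{i} + \mathbf{u}$ of the Z-polytope under an affine mapping with integer matrix T and integer vector u.
Step 1 Let M be an indexed signal from the algorithmic specification. Decompose the array references of M into disjoint linearly bounded lattices.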
The motivation of the decomposition of the array references relies on the following intuitive idea: the disjoint lattices belonging to many array references are actually those parts of the array space of M more heavily accessed during the code execution. This decomposition into disjoint lattices -used also in [2] , where it is explained in detail -can be performed analytically, by recursively intersecting the array references of signal M.
Step 2 Compute the average number of memory accesses for each disjoint lattice of signal M.
The total number of memory accesses to a given linearly bounded lattice of M is computed as follows:
Step 2.1 Select an array reference of M and intersect the given lattice with it. If the intersection result is not an empty set, it follows that the selected array reference and the given lattice have M-elements in common. The intersection is done in order to determine the expressions of the loop iterators for these common M-elements. (An example will be given below.)
Step 2.2 Compute the number of points in the (non-empty) intersection -a linearly bounded lattice as well [15] : this yields the number of memory accesses to the given lattice, as part of the selected array reference.
Step 2.3 Repeat steps 2.1 and 2.2 for all the signal's array references in the code, cumulating the numbers of accesses to the given lattice.
Step 2 is executed for each disjoint lattice obtained at Step 1. The overall result is a map of the memory accesses to the array space of signal M.
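Steps 2.1 to 2.3 can be sketched in Python as follows. This is only an illustration under stated assumptions: the paper counts lattice points analytically with polyhedral operations, whereas this sketch counts them by brute-force enumeration of the iterator space. The toy references `A[i][j]` and `A[i+1][j]`, the function names, and the chosen lattice (rows 1 to 3 of a 5 x 4 array) are hypothetical and do not come from Fig. 1.

```python
from itertools import product

def count_accesses(iter_ranges, index_fn, in_lattice):
    """Steps 2.1-2.2 by enumeration: count the iterations of one array
    reference whose index vector falls inside the given lattice."""
    count = 0
    for it in product(*(range(lo, hi + 1) for lo, hi in iter_ranges)):
        if in_lattice(index_fn(*it)):
            count += 1
    return count

def average_accesses(references, in_lattice, lattice_size):
    """Step 2.3: cumulate the counts over all references, then average."""
    total = sum(count_accesses(r, f, in_lattice) for r, f in references)
    return total, total / lattice_size

# Hypothetical toy code: two references inside a nest with 0 <= i, j <= 3.
references = [
    ([(0, 3), (0, 3)], lambda i, j: (i, j)),      # A[i][j]
    ([(0, 3), (0, 3)], lambda i, j: (i + 1, j)),  # A[i+1][j]
]

def in_lattice(x):
    # Rows 1..3, columns 0..3: the part of A covered by both references.
    return 1 <= x[0] <= 3 and 0 <= x[1] <= 3

total, average = average_accesses(references, in_lattice, lattice_size=12)
```

In this toy case each reference touches every element of the lattice once, so the lattice receives twice as many accesses as it has elements.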
For example, let us consider one of signal A's lattices†, {64 ≥ x, y ≥ 16}. Intersecting it with the array reference A[k][l] (see the code in Fig. 1), we obtain a lattice whose size is 1,809,025, which is the number of memory accesses to the given lattice as part of the array reference A[k][l]. Since the given lattice is also included in a second array reference of A, which accounts for 2,614,689 accesses, the total amount of memory accesses to the given lattice is 2,614,689 + 1,809,025 = 4,423,714. Since the number of A-elements covered by this lattice is 2,401, the average number of accesses for this lattice is 1,842.45. Figure 3 displays a computed map of memory accesses for the signal A, where A's index space is in the horizontal plane xOy and the numbers of memory accesses are on the vertical axis Oz. This computed map is an approximation of the exact map in Fig. 2, since the access distribution within each lattice is considered uniform (equal to the average value of accesses). Computing such approximate maps of accesses has an important advantage: the (usually very time-expensive) simulation is not needed any more, being replaced by algebraic computations. Note that a finer granularity in the decomposition of the index space of a signal into disjoint lattices entails a computed map of accesses closer to the exact map.
† When the lattice has T = I (the identity matrix) and u = 0, the lattice is actually a Z-polytope, as in this example.
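The arithmetic of this example can be re-checked with a few lines of Python. The access counts and the lattice bounds are the ones quoted in the text; the variable names are illustrative only.

```python
# Access counts to the lattice {64 >= x, y >= 16} quoted in the text,
# one entry per array reference that covers the lattice.
accesses_per_reference = [2_614_689, 1_809_025]

# The lattice is a square Z-polytope: x and y both range over 16..64.
lo, hi = 16, 64
lattice_size = (hi - lo + 1) ** 2

total_accesses = sum(accesses_per_reference)
average = total_accesses / lattice_size

print(lattice_size, total_accesses, round(average, 2))
```

The average is the single height assigned to the whole lattice in the computed map of accesses, which is why that map only approximates the exact, element-by-element map.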
Step 3 Select the lattices having the highest access numbers, whose total size does not exceed the maximum SPM size (assumed to be a design constraint), and assign them to the SPM layer. The other lattices will be assigned to the main memory.
Storing all the signals on-chip is, obviously, the most desirable scenario from the point of view of dynamic energy consumption. This is usually not possible since the SPM size is typically limited to smaller values than the overall data storage requirement of the algorithmic specification. In our tests (Sect. 4), we compute the ratio between the expected dynamic energy reduction and the SPM size after mapping (see Sect. 3); the value of the SPM size maximizing this ratio is selected, the goal being to obtain the maximum benefit in terms of energy for the smallest SPM size.
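Step 3 can be sketched as a greedy selection. The sketch below is one possible reading, not necessarily the exact procedure of the framework: lattices are visited in decreasing order of their average number of accesses, and a lattice that no longer fits is skipped so that smaller ones may still be placed. The lattice names, sizes, and access averages are made up for illustration.

```python
def assign_to_spm(lattices, spm_capacity):
    """Greedy sketch of Step 3.

    lattices: list of (name, size_in_elements, average_accesses).
    Returns the names assigned to the SPM, the names left in main
    memory, and the SPM storage used (before any mapping).
    """
    spm, main, used = [], [], 0
    by_intensity = sorted(lattices, key=lambda lat: lat[2], reverse=True)
    for name, size, _avg in by_intensity:
        if used + size <= spm_capacity:
            spm.append(name)
            used += size
        else:
            main.append(name)  # does not fit: stays off-chip
    return spm, main, used

# Hypothetical lattices of one or several arrays.
lattices = [("L1", 100, 50.0), ("L2", 300, 40.0),
            ("L3", 150, 10.0), ("L4", 50, 5.0)]
spm, main, used = assign_to_spm(lattices, spm_capacity=450)
```

Here `L3` is too large for the remaining space while the less accessed `L4` still fits, which shows why the skip-and-continue choice matters for the result.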
Mapping Signals within Memory Layers
This design phase has the following goals: (a) to map the signals (already assigned to the memory layers) into amounts of data storage as small as possible, both for the SPM and the main memory; (b) to compute these amounts of storage after mapping on both memory layers (allocation solution) and be able to determine the memory location of each array element from the specification (assignment solution); (c) to use mapping functions simple enough in order to ensure an address generation hardware of a reasonable complexity; (d) to ascertain that any scalar signals (array elements) simultaneously alive are mapped to distinct storage locations.
Since the mapping models [8] and [16] play an important part in this section, they will be explained and illustrated below.
To reduce the size of a multidimensional array mapped to memory, the model [8] considers all the possible canonical† linearizations of the array; for any linearization, the largest distance at any time between two live elements is computed. This distance plus 1 is then the storage "window" required for the mapping of the array into the data memory. More formally, $|W_A| = \min \max \{ \mathrm{dist}(A_i, A_j) \} + 1$, where $|W_A|$ is the size of the storage window of a signal A, the minimum is taken over all the canonical linearizations, while the maximum is taken over all the pairs of A-elements $(A_i, A_j)$ simultaneously alive.
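The window formula of the linearization model can be sketched in Python by enumerating the canonical linearizations (every dimension order, each dimension traversed forwards or backwards). The function names are hypothetical. The array extents (19 by 10) and the pair A[9][0], A[9][9] are borrowed from the Fig. 4(a) discussion in this section; treating that pair as the only pair of simultaneously alive elements is a simplifying assumption, so the resulting window is not the one reported for Fig. 4(a).

```python
from itertools import permutations, product

def address(idx, shape, order, reverse):
    """Address of element idx in one canonical linearization.

    order lists the dimensions from most to least significant;
    reverse[d] is True when dimension d is traversed backwards.
    """
    addr = 0
    for d in order:
        x = shape[d] - 1 - idx[d] if reverse[d] else idx[d]
        addr = addr * shape[d] + x
    return addr

def window_size(live_pairs, shape):
    """min over canonical linearizations of (max distance between
    simultaneously alive elements) + 1."""
    m = len(shape)
    best = None
    for order in permutations(range(m)):
        for reverse in product([False, True], repeat=m):
            far = max(abs(address(a, shape, order, reverse)
                          - address(b, shape, order, reverse))
                      for a, b in live_pairs)
            best = far + 1 if best is None else min(best, far + 1)
    return best

shape = (19, 10)
pair = ((9, 0), (9, 9))
# Column concatenation: the second index is the most significant one.
col_dist = (address(pair[1], shape, (1, 0), (False, False))
            - address(pair[0], shape, (1, 0), (False, False)))
window = window_size([pair], shape)
```

For this single pair the column concatenation gives the distance 9 × 19 = 171 quoted in the text, while the row concatenation places the two elements only 9 addresses apart.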
This mapping model will be illustrated for the loop nest from Fig. 4(a). If we consider the array linearization by column concatenation in the increasing order of the columns ((A[index_1][index_2], index_1 = 0,18), index_2 = 0,9), two elements simultaneously alive, placed the farthest apart from each other, are A[9][0] and A[9][9]. The distance between them is 9 × 19 = 171. Now, if we consider the array linearization by row concatenation in the increasing order of the rows ((A[index_1][index_2], index_2 = 0,9), index_1 = 0,18), [...]
In order to avoid the inconvenience of analyzing different linearization schemes, the model [16] uses a bounding window $W_A = (w_1, w_2)$ of the elements simultaneously alive [...] In the illustrative example shown in Fig. 4(a), the bounding window of the signal A is $W_A = (11, 10)$. It follows that the storage allocation for signal A is 100 locations if the linearization model is used, and $w_1 \times w_2 = 110$ locations when the bounding window model is applied. However, in the example shown in Fig. 4(b), where the code has a similar structure, the bounding window model yields a better allocation result, 30 storage locations, since the 2-D window of A is $W_A = (5, 6)$, whereas the linearization model yields 32 locations (the best canonical linearization being the row concatenation in the increasing order of rows).
† For instance, a 2-D array can typically be linearized by concatenating the rows or by concatenating the columns. In addition, the elements in a given dimension can be mapped in the increasing or decreasing order of the respective index.
Our software system incorporates both mapping models, their implementation being based on the same polyhedral framework operating with lattices, used also in Sect. 2. This is advantageous both from the point of view of computational efficiency and relative to the amount of allocated data storage -since the mapping window for each signal is the smallest one of the two models. Moreover, this methodology can be applied independently to the memory layers, providing a complete storage allocation/assignment solution for distributed memory organizations.
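The choice between the two models can be illustrated with a short Python sketch. The window sizes passed to `choose_window` are the ones quoted for Fig. 4(a) and Fig. 4(b); the helper names and the four sample points are hypothetical, and `bounding_window` only handles a single snapshot of live elements rather than the whole execution.

```python
from math import prod

def bounding_window(live_points):
    """Sides of the smallest axis-aligned box enclosing the given
    (simultaneously alive) index vectors."""
    dims = range(len(live_points[0]))
    return tuple(max(p[d] for p in live_points)
                 - min(p[d] for p in live_points) + 1 for d in dims)

def choose_window(linearization_size, window_sides):
    """Keep the smaller storage of the two mapping models."""
    box = prod(window_sides)
    if linearization_size <= box:
        return "linearization", linearization_size
    return "bounding window", box

fig_4a = choose_window(100, (11, 10))  # sizes quoted for Fig. 4(a)
fig_4b = choose_window(32, (5, 6))     # sizes quoted for Fig. 4(b)

# Hypothetical snapshot of live elements spanning x = 2..12, y = 4..10.
sides = bounding_window([(2, 7), (12, 7), (4, 4), (9, 10)])
```

Neither model dominates the other: the linearization wins for the first example and the bounding window for the second, which is the reason for evaluating both per signal.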
Before explaining the global flow of the algorithm, let us examine the simple case of a code with only one array reference in it: take, for instance, the two nested loops from Fig. 4(b), but without the second conditional statement that consumes the A-elements. In the bounding window model, $W_A = (11, 7)$ can be determined by computing the integer projections on the two axes of the lattice of the array reference, represented graphically by all the points inside the quadrilateral from Fig. 5(a). It can be observed from the figure that this can be reduced to the computation of the integer projections of a polytope (even when there are holes in the index space), which is a well-studied problem [18].
In the linearization model, the distance between two A-elements $A_1 = A[x_1][y_1]$ and $A_2 = A[x_2][y_2]$, assuming row concatenation in the increasing order of the rows, is: $\mathrm{dist}(A_1, A_2) = (x_2 - x_1)\Delta y + (y_2 - y_1)$, where $\Delta y$ is the range of the second index (here, equal to 7) in the array space†. Then, the A-elements at a maximum distance have the minimum and, respectively, the maximum index vectors relative to the lexicographic order. These array A-elements are represented by the points M = A[2][7] and N = A[12][7] in Fig. 5(a), and dist(M, N) = (12 − 2) × 7 + (0 − 0) = 70. Similarly, in the linearization by column concatenation, the array elements at the maximum distance from each other are still the elements with (lexicographically) minimum and maximum index vectors, provided an interchange of the indexes is applied first. These are the points M = A[4][9] and N = A[10][4] in Fig. 5(b), and the distance between them is (10 − 4) × 11 + (4 − 9) = 61.
More generally, the maximum distance between the points of a live lattice in a canonical linearization is the distance between the (lexicographically) minimum and maximum index vectors, provided an index permutation is applied first. The distance between the array elements $A_i = A[x_1^i] \cdots [x_m^i]$ and $A_j = A[x_1^j] \cdots [x_m^j]$ is
$$\mathrm{dist}(A_i, A_j) = \sum_{k=1}^{m} (x_k^j - x_k^i) \prod_{l=k+1}^{m} \Delta x_l$$
where the index vector of $A_j$ is lexicographically larger than that of $A_i$ ($\Delta x_i$ being the range of $x_i$). Algorithms for the computation of both the integer projections of a lattice and the (lexicographically) minimum/maximum index vectors in a lattice were proposed in [13].
† To ensure that the distance is a nonnegative number, we shall assume that [...]
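The distance in a canonical linearization can be evaluated with Horner's scheme. The sketch below is a minimal Python illustration (the function name is hypothetical); it assumes the index permutation has already been applied, so that dimension 0 is the most significant one, and it reproduces the two distances computed in the text for Fig. 5(a) and Fig. 5(b).

```python
def lin_distance(a, b, ranges):
    """Distance between elements a and b in a linearization where
    dimension 0 is the most significant; ranges[k] is the range of
    index k in the array space."""
    dist = 0
    for k in range(len(ranges)):
        # Horner step: shift by the range of index k, add its difference.
        dist = dist * ranges[k] + (b[k] - a[k])
    return dist

# Row concatenation, ranges (11, 7): M = A[2][7], N = A[12][7].
row_dist = lin_distance((2, 7), (12, 7), (11, 7))

# Column concatenation after interchanging the indexes, ranges (7, 11):
# M = A[4][9], N = A[10][4] in the interchanged coordinates.
col_dist = lin_distance((4, 9), (10, 4), (7, 11))
```

The range of the most significant index never enters the result, which matches the 2-D formula where only the range of the second index appears.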
If in the linearization some dimension is traversed backwards, then a simple transformation reversing the index variation must also be applied, as shown in Fig. 5(c).
Let A be an m-dimensional signal and $P_A$ be the set of disjoint lattices of A assigned to the current memory layer by the algorithm described in Sect. 2. The notations are explained in the pseudo-code.
[...] $x_{min}(L)$ and $x_{max}(L)$, the lexicographically minimum and maximum index vectors of L relative to C;
for ( each boundary n between the loop nests n and n + 1 ) {
    let $P_A(n) \subseteq P_A$ be the set of lattices alive at the boundary n;
    // (lattices produced/consumed before/after the boundary)
    [...]
for ( each boundary n between the loop nests n and n + 1 ) {
    let $P_A(n) \subseteq P_A$ be the set of lattices alive at the boundary n;
    [...]
Step 1 finds the exact values of the window sizes for both mapping models only when every loop nest either produces or consumes (but not both!) the signal's elements. Otherwise, when in a certain loop nest elements of the signal are both produced and consumed (see the illustrative example from Fig. 4(a)), the window sizes obtained at the end of Step 1 may be only underestimates, since an increase of the storage requirement can happen inside the loop nest. Then, an additional step is required to find the exact values of the window sizes in both mapping models.
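The boundary-by-boundary computation of Step 1 can be sketched in Python under strong simplifications. This is not the pseudo-code of the paper: lattices are given as explicit point sets rather than polyhedral descriptions, only the row concatenation in increasing order is examined instead of every canonical linearization, and each side of the bounding window is taken as the maximum over all boundaries. All names and the toy lattices are hypothetical.

```python
def row_major_distance(a, b, shape):
    """Distance between a and b when rows are concatenated in order."""
    dist = 0
    for k in range(len(shape)):
        dist = dist * shape[k] + (b[k] - a[k])
    return dist

def step1_windows(lattices, num_nests, shape):
    """lattices: list of (points, produced_in_nest, last_consumed_in_nest).

    Returns the bounding-window sides and the window size of the
    row-major linearization, both taken over all nest boundaries.
    """
    sides = [0] * len(shape)
    w_lin = 0
    for n in range(1, num_nests):  # boundary between nests n and n + 1
        live = [p for pts, produced, consumed in lattices
                if produced <= n < consumed for p in pts]
        if not live:
            continue
        for d in range(len(shape)):
            vals = [p[d] for p in live]
            sides[d] = max(sides[d], max(vals) - min(vals) + 1)
        # Tuple comparison in Python is lexicographic.
        lo, hi = min(live), max(live)
        w_lin = max(w_lin, row_major_distance(lo, hi, shape) + 1)
    return tuple(sides), w_lin

# Hypothetical 4 x 4 array split into two lattices over three loop nests.
lat_a = [(i, j) for i in range(0, 2) for j in range(0, 4)]
lat_b = [(i, j) for i in range(2, 4) for j in range(0, 2)]
sides, w_lin = step1_windows([(lat_a, 1, 2), (lat_b, 1, 3)],
                             num_nests=3, shape=(4, 4))
```

Both lattices are alive at the first boundary, which determines the windows; at the second boundary only `lat_b` survives and the requirement shrinks.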
Step 2 Update the mapping windows for each indexed signal in every loop nest producing and consuming elements of the signal.
The guiding idea is that local or global maxima of the bounding window size $|W_2|$ are reached immediately before the consumption of an A-element, which may entail a shrinkage of some side of the bounding window encompassing the live elements. Similarly, the local or global maxima of $|W_C|$ are reached immediately before the consumption of an A-element, which may entail a decrease of the maximum distance in the linearization C between live elements. Consequently, for each A-element consumed in a loop nest which also produces A-elements, we construct the disjoint lattices partially produced and those partially consumed until the iteration when the A-element is consumed. Afterwards, we do a similar computation as in Step 1, which may result in increased values for $|W_1|$ and/or $|W_2|$.
Finally, the amount of data memory allocated for signal A on the current memory layer is the smallest data storage provided by the bounding window and the linearization models: $|W_A| = \min \{ |W_1|, |W_2| \}$. In principle, the overall amount of data memory after mapping is $\sum_A |W_A|$, the sum of the mapping window sizes of all the signals having lattices assigned to the current memory layer. In addition, a post-processing step attempts to further enhance the allocation solution: our polyhedral framework makes it possible to efficiently check whether two multidimensional signals have disjoint lifetimes, in which case the signals can share the largest of the two windows. More generally, an incompatibility graph [9] is used to optimize the memory sharing among all the signals at the level of the whole code.
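The window sharing between signals with disjoint lifetimes can be sketched as a grouping problem on the incompatibility graph. The greedy heuristic below is only one simple possibility and is not claimed to be the optimization used in [9] or in the framework; lifetimes are modeled as closed intervals of loop-nest numbers, and the signals, window sizes, and lifetimes are invented for illustration.

```python
def shared_storage(windows, lifetimes):
    """Greedy sharing sketch.

    windows:   {signal: window size after mapping}
    lifetimes: {signal: (first_nest, last_nest)}, closed intervals
    Signals with overlapping lifetimes are incompatible and must not
    share storage; each group of compatible signals shares the largest
    window of the group.
    """
    def overlap(a, b):
        return (lifetimes[a][0] <= lifetimes[b][1]
                and lifetimes[b][0] <= lifetimes[a][1])

    groups = []
    for s in sorted(windows, key=windows.get, reverse=True):
        for g in groups:
            if not any(overlap(s, t) for t in g):
                g.append(s)   # compatible with every member of g
                break
        else:
            groups.append([s])
    total = sum(max(windows[s] for s in g) for g in groups)
    return groups, total

windows = {"A": 100, "B": 30, "C": 60}
lifetimes = {"A": (1, 3), "B": (4, 6), "C": (2, 5)}
groups, total = shared_storage(windows, lifetimes)
```

Without sharing, the three windows would need 190 locations; here `B` reuses the window of `A` because their lifetimes do not overlap.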
Experimental Results
The polyhedral framework for the memory management of multidimensional signal processing applications has been implemented in C++, incorporating the algorithms described in this paper. The behavioral specifications of the applications are expressed in a subset of the C language, illustrated in the examples used in the paper. Tables 1 and 2 summarize the results of our experiments, carried out on a PC with an Intel Core 2 Duo 1.8 GHz processor and 512 MB RAM running Ubuntu 6.06. The benchmarks used (column 1) are algebraic kernels (like Durbin's algorithm for solving Toeplitz systems) and algorithms used in multimedia applications (like, for instance, an MPEG4 motion estimation algorithm). Table 1 focuses on our signal mapping methodology (Sect. 3). Columns 2 and 3 display the numbers of array references and array elements (scalar signals) in the specification code. The next column shows the memory allocation results, that is, the overall data storage after mapping. The 5th column displays two metrics of quality for the memory allocation solutions: (a) the sum of the minimum array windows (that is, the optimum memory sharing between elements of the same arrays), and (b) the minimum storage requirement for the execution of the application code (that is, the optimum memory sharing between all the scalar signals or array elements in the code). The results from this column are obtained with the algorithm [2], also part of this framework.
The data memory size after mapping is typically lower-bounded by the first value in the 5th column, since the memory window of an array cannot be smaller than the minimum storage requirement of the array. The exception which occurs for the motion detection kernel is due to the existence of a scalar signal whose lifetime is disjoint from the lifetime of another array in the code. As explained at the end of Sect. 3, signals having disjoint lifetimes can share the same window. The importance of the second value in the 5th column is mainly theoretical, serving for the evaluation of the allocation results (column 4). A minimum data storage could be obtained in principle, but the price to be paid is a very complex address generation hardware. By means of the signal-to-memory mapping model, an excess of storage in memory allocation is traded off against a less complex address generation hardware (practically all the mapping models use modulo computations [14]). This framework has the unique feature of yielding a measure of this compromise: the amount of additional storage over the absolute lower bound.
The last group of columns displays the CPU times when running the mapping algorithms [8], [16], and the current one in Sect. 3, respectively. Note that our algorithm takes into consideration both models, selecting for each array the smaller memory window. When the nested loops either produce or consume (but not both) elements of the same arrays (as in the case of the Durbin benchmark), our algorithm is significantly faster since only Step 1 needs to be executed. Even if this is not the case, the current algorithm is regularly faster due to our polyhedral framework, which operates efficiently with polytopes and lattices. Table 2 displays in columns 2 and 3 the total memory accesses and the dynamic energy consumption in the case of a single (off-chip) memory layer. Using the CACTI power model, we chose as parameters a technology of 65 nm, 4 memory banks, and a line size of 16 bytes. Assuming a two-layer memory organization, columns 4-7 display the SPM size and the savings of dynamic energy† applying, respectively, a previous model steered by the total number of accesses for whole arrays [5], another previous model steered by the most accessed array rows/columns [10], and the current assignment model (Sect. 2), versus the single-layer memory scenario (column 3). Finally, column 8 shows the CPU times.
The SPM sizes (column 4) are computed as follows: the lattices of all the arrays in the application are ordered decreasingly based on the average number of accesses per array element; in this order (such that the most accessed lattices come first), the lattices are gradually assigned to the SPM, increasing the SPM size with discrete amounts; for each new SPM size, the CACTI model computes the energy per access; afterwards, we determine the reduction of dynamic energy versus the scenario when all the signals are stored off chip. We choose the SPM size maximizing the ratio between the dynamic energy reduction and the size.
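The sweep over SPM sizes can be sketched in Python as follows. Several elements are assumptions made only for illustration: the energy per SPM access is supplied by a made-up function `energy_spm(size)` standing in for the CACTI model, the main-memory energy per access is a constant, and the SPM size is taken as the plain sum of the lattice sizes rather than the size after mapping.

```python
def choose_spm_size(lattices, energy_main, energy_spm):
    """lattices: list of (size, average accesses per element).

    Assign lattices to the SPM in decreasing order of access intensity
    and return (spm_size, ratio) for the size that maximizes the ratio
    between the dynamic energy reduction and the SPM size.
    """
    ordered = sorted(lattices, key=lambda lat: lat[1], reverse=True)
    total_accesses = sum(size * avg for size, avg in ordered)
    baseline = total_accesses * energy_main  # everything off-chip
    best = None
    spm_size = 0
    spm_accesses = 0
    for size, avg in ordered:
        spm_size += size
        spm_accesses += size * avg
        energy = (spm_accesses * energy_spm(spm_size)
                  + (total_accesses - spm_accesses) * energy_main)
        ratio = (baseline - energy) / spm_size
        if best is None or ratio > best[1]:
            best = (spm_size, ratio)
    return best

# Hypothetical energy model: the SPM energy per access grows with size.
def toy_energy_spm(size):
    return 0.1 + 0.0005 * size

best_size, best_ratio = choose_spm_size(
    [(100, 50.0), (200, 5.0), (400, 1.0)],
    energy_main=1.0, energy_spm=toy_energy_spm)
```

With these toy numbers the smallest SPM wins: adding the less accessed lattices increases the energy per SPM access faster than it removes off-chip accesses.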
Our results are regularly better than the other models since, building the map of memory accesses for each array (see Fig. 3 ), our framework identifies with accuracy those parts of arrays intensely accessed, whose assignment to the SPM layer yields the highest benefit in terms of dynamic energy consumption. For instance, the energy consumptions for the motion estimation benchmark were, respectively, 1894, 1832, and 1522 µJ; the saved energies relative to the energy in column 3 are displayed as percentages in columns 5-7. Besides the memory allocation solution, the signal mapping algorithm computes the mapping functions for all the arrays, so we can determine the exact locations for any array element in the specification. This provides the necessary information for the automated design of the address generation unit, which is one of our future development directions.
Conclusions
This paper has presented an integrated CAD methodology for power-aware memory allocation, targeting embedded data-intensive signal processing applications. The memory management tasks (the signal assignment to the memory layers and their mapping to the physical memories) are efficiently addressed within a common polyhedral framework.
† The use of CACTI to estimate the dynamic energy consumption of SPMs is explained in an appendix of [4].
