Many signal processing systems, particularly in the multimedia and telecommunication domains, are synthesized to execute data-intensive applications: their cost-related aspects, namely power consumption and chip area, are heavily influenced, if not dominated, by data access and storage. This chapter presents a power-aware memory allocation methodology. Starting from the high-level behavioral specification of a given application, this framework assigns the multidimensional signals to the memory layers (the on-chip scratch-pad memory and the off-chip main memory), the goal being the reduction of the dynamic energy consumption in the memory subsystem. Based on the assignment results, the framework subsequently maps the signals into the memory layers such that the overall amount of data storage is reduced. This software system yields a complete allocation solution: the exact storage amount on each memory layer, the mapping functions that determine the exact location of any array element (scalar signal) in the specification, and, in addition, an estimate of the dynamic energy consumption in the memory subsystem.
Introduction
Many multidimensional signal processing systems, particularly in the areas of multimedia and telecommunications, are synthesized to execute data-intensive applications, the data transfer and storage having a significant impact on both the system performance and the major cost parameters: power and area. In particular, the memory subsystem is typically a major contributor to the overall energy budget of the entire system (8). The dynamic energy consumption is caused by memory accesses, whereas the static energy consumption is due to leakage currents. Savings of dynamic energy can potentially be obtained by accessing frequently used data from smaller on-chip memories rather than from the large off-chip main memory, the problem being how to optimally assign the data to the memory layers. Note that this problem is fundamentally different from caching for performance (15), (22), where the question is how to fill the cache so that the needed data is loaded in advance from the main memory.
As on-chip storage, the scratch-pad memories (SPMs), which are compiler-controlled static random-access memories, more energy-efficient than the hardware-managed caches, are widely used in embedded systems, where caches incur a significant penalty in aspects like area cost, energy consumption, hit latency, and real-time guarantees. A detailed study (4) of the tradeoffs between caches and SPMs found experimentally that the latter exhibit 34% smaller area and 40% lower power consumption than a cache of the same capacity. Even more surprisingly, the runtime measured in cycles was 18% better with an SPM using a simple static knapsack-based allocation algorithm. As a general conclusion, the authors of the study found no advantage in using caches, even in high-end embedded systems in which performance is important.
Unlike caches, the SPM occupies a distinct part of the virtual address space, with the rest of the address space occupied by the main memory. The consequence is that there is no need to check for the availability of the data in the SPM. Hence, the SPM does not possess a comparator and the miss/hit acknowledging circuitry (4). This contributes to a significant energy (as well as area) reduction. Another consequence is that in cache memory systems the mapping of data to the cache is done during the code execution, whereas in SPM-based systems this can be done at compilation time, using a suitable algorithm, as this chapter will show.
The energy-efficient assignment of signals to the on-chip and off-chip memories has been studied since the late nineties. These previous works focused on partitioning the signals from the application code into so-called copy candidates (since the on-chip memories were usually caches), and on the optimal selection and assignment of these to different layers of the memory hierarchy (32), (7), (18). For instance, Kandemir and Choudhary analyze and exploit the temporal locality by inserting local copies (21). Their layer assignment builds a separate hierarchy per loop nest and then combines them into a single hierarchy. However, the approach lacks a global view of the lifetimes of array elements in applications having imperfectly nested loops. Brockmeyer et al. use the steering heuristic of assigning the arrays having the lowest ratio of access count to size to the lowest memory layer first, followed by incremental reassignments (7). Hu et al. can use parts of arrays as copies, but they typically use cuts along the array dimensions (18) (like rows and columns of matrices). Udayakumaran and Barua propose a dynamic allocation model for SPM-based embedded systems (29), but the focus is on global and stack data, rather than multidimensional signals. Issenin et al.
perform a data reuse analysis in a multi-layer memory organization (19), but the mapping of the signals into the hierarchical data storage is not considered.
The energy-aware partitioning of an on-chip memory into multiple banks has been studied by several research groups as well. Techniques of an exploratory nature analyze possible partitions, matching them against the access patterns of the application (25), (11). Other approaches exploit the properties of the dynamic energy cost and the resulting structure of the partitioning space to come up with algorithms able to derive the optimal partition for a given access pattern (6), (1).
Despite many advances in memory design techniques over the past two decades, existing computer-aided design methodologies are still ineffective in many respects. In several previous works, the reduction of the dynamic energy consumption in hierarchical memory subsystems is addressed partly through enumerative approaches, simulations, profiling, and heuristic explorations of the solution space, rather than through a formal methodology. Also, several models of mapping the multidimensional signals into the physical memory were proposed in the past (see (12) for a good overview).
However, they all failed (a) to provide efficient implementations, (b) to prove their effectiveness in hierarchical memory organizations, and (c) to provide quantitative measures of quality for the mapping solutions. Moreover, the reduction of power consumption and the mapping of signals in hierarchical memory subsystems were treated in the past as completely separate problems.
This chapter presents a power-aware memory allocation methodology. Starting from the high-level behavioral specification of a given application, where the code is organized in sequences of loop nests and the main data structures are multidimensional arrays, this framework assigns the multidimensional signals to the memory layers (the on-chip scratch-pad memory and the off-chip main memory), the goal being the reduction of the dynamic energy consumption in the memory subsystem. Based on the assignment results, the framework subsequently maps the signals into the memory layers such that the overall amount of data storage is reduced. This software system yields a complete allocation solution: the exact storage amount on each memory layer, the mapping functions that determine the exact location of any array element (scalar signal) in the specification, metrics of quality for the allocation solution, and also an estimate of the dynamic energy consumption in the memory subsystem using the CACTI power model (31). Extensions of the framework to support dynamic allocation are currently under development.
The rest of the chapter is organized as follows. Section 2 presents the algorithm that assigns the signals to the memory layers, aiming to minimize the dynamic energy consumption in the hierarchical memory subsystem subject to SPM size constraints. Section 3 describes the global flow of the memory allocation approach, focusing on the mapping aspects. Section 4 discusses the implementation and presents experimental results.
Finally, Section 5 summarizes the main conclusions of this research.
Power-aware signal assignment to the memory layers
The algorithms describing the functionality of real-time multimedia and telecommunication applications are typically specified in a high-level programming language, where the code is organized in sequences of loop nests whose boundaries are linear functions of the outer loop iterators. Conditional instructions are very common as well, and the multidimensional array references have linear indexes (the variables being the loop iterators). Figure 1 shows an illustrative example whose structure is similar to the kernel of a motion detection algorithm (9) (the actual code also contains a delay operator, which is not relevant in this context).
The problem is to automatically identify those parts of arrays from the given application code that are more intensely accessed, in order to steer their assignment to the energy-efficient data storage layer (the on-chip scratch-pad memory) such that the dynamic energy consumption in the hierarchical memory subsystem is reduced. The number of storage accesses for each array element can certainly be computed by the simulated execution of the code. For instance, the number of accesses was counted for every pair of possible indexes (between 0 and 80) of signal A (see Fig. 1). The array elements near the …
The drawbacks of such an approach are twofold. First, the simulated execution may be computationally expensive when the number of array elements is very large, or when the application code contains deep loop nests. Second, even if the simulated execution were feasible, such a scalar-oriented technique would not be helpful, since the addressing hardware of the data memories would become very complex. An address generation unit (AGU) is typically implemented to compute arithmetic expressions in order to generate sequences of addresses (26); a set of array elements is not a good input for the design of an efficient AGU.
Our proposed computation methodology for power-aware signal assignment to the memory layers is described below, after defining a few basic concepts.
Each array reference $M[x_1(i_1, \ldots, i_n)] \cdots [x_m(i_1, \ldots, i_n)]$ of an $m$-dimensional signal $M$, in the scope of a nest of $n$ loops having the iterators $i_1, \ldots, i_n$, is characterized by an iterator space and an index (or array) space. The iterator space is the set of all iterator vectors $\mathbf{i} = (i_1, \ldots, i_n) \in \mathbb{Z}^n$ in the scope of the array reference, and it can typically be represented by a so-called Z-polytope (a bounded and closed polyhedron, restricted to the set $\mathbb{Z}^n$):
The index space is the set of all index vectors $\mathbf{x} = (x_1, \ldots, x_m) \in \mathbb{Z}^m$ of the array reference. When the indexes of an array reference are linear mappings with integer coefficients of the loop iterators, the index space consists of one or several linearly bounded lattices (27):
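As a notational sketch, the two sets defined above are commonly written in the following form. The names of the matrices and vectors ($\mathbf{A}$, $\mathbf{b}$, $\mathbf{T}$, $\mathbf{u}$) are assumptions introduced here for illustration; only the name $P$ for the iterator space is taken from the text.

```latex
% Iterator space: a Z-polytope, i.e. the integer points of a bounded polyhedron
P = \{\, \mathbf{i} \in \mathbb{Z}^n \mid \mathbf{A}\,\mathbf{i} \ge \mathbf{b} \,\}

% Linearly bounded lattice: the image of such a Z-polytope under an affine
% mapping with integer coefficients (T is an m x n matrix, u an m-vector)
\{\, \mathbf{x} = \mathbf{T}\,\mathbf{i} + \mathbf{u} \in \mathbb{Z}^m \mid
     \mathbf{A}\,\mathbf{i} \ge \mathbf{b},\ \mathbf{i} \in \mathbb{Z}^n \,\}
```

Under this reading, the rows of $\mathbf{A}$ and $\mathbf{b}$ collect the loop boundaries and the conditions in the scope of the array reference, while $\mathbf{T}$ and $\mathbf{u}$ collect the coefficients of the linear index expressions.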
The A-elements of the array reference have the indices x, y:
whose boundary is the image of the boundary of the iterator space $P$ (see Fig. 2). However, it can be shown that only those points $(x, y)$ that also satisfy the inequalities $-6x + 8y \ge 19k - 30$, $x - 2y \ge -4k + 3$, and $y \ge 2k \ge 0$, for some positive integer $k$, belong to the index space; these are the black points in the right quadrilateral from Fig. 2. In this example, each point in the iterator space is mapped to a distinct point of the index space; this is not always the case, though.
Algorithm 1: Power-aware signal assignment to the SPM and off-chip memory layers
Step 1 Extract the array references from the given algorithmic specification and decompose the array references for every indexed signal into disjoint lattices.
The motivation for the decomposition of the array references relies on the following intuitive idea: the disjoint lattices belonging to many array references are actually those parts of arrays that are more heavily accessed during the code execution. This decomposition can be performed analytically, using intersections and differences of lattices. These operations are quite complex (3): they involve computing Hermite Normal Forms and solving Diophantine linear systems (24), computing the vertices of Z-polytopes (2) and their supporting polyhedral cones, counting the integral points in Z-polyhedra (5), (10), and computing integer projections of polytopes (30).
Figure 3 shows the result of such a decomposition for the three array references of signal M. The resulting lattices have the following expressions (in non-matrix format):
Step 2 Compute the number of memory accesses for each disjoint lattice. The total number of memory accesses to a given linearly bounded lattice of a signal is computed as follows:
Step 2.1 Select an array reference of the signal and intersect the given lattice with it. If the intersection is not empty, then the intersection is a linearly bounded lattice as well (27).
Step 2.2 Compute the number of points in the (non-empty) intersection: this is the number of memory accesses to the given lattice (as part of the selected array reference).
Step 2. … Fig. 1), we obtain the lattice …. Hence, the total number of memory accesses to the given lattice is 2,614,689 + 1,809,025 = 4,423,714.
Figure 4 displays a computed map of memory accesses for the signal A, where A's index space is in the horizontal plane xOy and the numbers of memory accesses are on the vertical axis Oz. This computed map is an approximation of the exact map in Fig. 1, since the accesses within each lattice are considered uniform, equal to the average values obtained above. The advantage of this map construction is that the (usually time-expensive) simulation is no longer needed, being replaced by algebraic computations. Note that a finer granularity in the decomposition of the index space of a signal into disjoint lattices entails a computed map of accesses closer to the exact map.
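The counting in Steps 2.1 and 2.2 can be illustrated with a minimal sketch. It assumes that lattices are small enough to be held as explicit Python sets of index vectors, which is only a stand-in for the algebraic lattice operations of the framework; the two array references and the lattice below are hypothetical and are not taken from Fig. 1.

```python
from itertools import product

def index_space(iterators, index_map):
    """List the index vectors touched by one array reference, one entry per access."""
    return [index_map(*i) for i in iterators]

def accesses_to_lattice(lattice, references):
    """Intersect the lattice with each reference and sum the point counts."""
    total = 0
    for accessed in references:
        # Each access whose index vector falls inside the lattice counts once.
        total += sum(1 for x in accessed if x in lattice)
    return total

# Hypothetical example: two references to a 2-D signal A in the same loop nest.
iterators = list(product(range(4), range(4)))            # i, j in 0..3
ref1 = index_space(iterators, lambda i, j: (i, j))       # A[i][j]
ref2 = index_space(iterators, lambda i, j: (i + 1, j))   # A[i+1][j]

# One disjoint lattice of A: rows 1..3, columns 0..3 (12 elements).
lattice = {(x, y) for x in range(1, 4) for y in range(4)}

total = accesses_to_lattice(lattice, [ref1, ref2])
average = total / len(lattice)   # uniform value used in the computed access map
```

In this example each reference contributes 12 accesses to the lattice, so the total is 24 and the average per element is 2. The enumeration is for illustration only: the point of the analytical approach is precisely to obtain such counts without visiting every element.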
Step 3 Select the lattices having the highest access numbers, whose total size does not exceed the maximum SPM size (assumed to be a design constraint), and assign them to the SPM layer. The other lattices will be assigned to the main memory.
Fig. 4. Computed 3D map of memory read accesses for the signal A from the illustrative code in Figure 1.
Storing all the signals on-chip is, obviously, the most desirable scenario from the point of view of dynamic energy consumption, but this is typically impossible. We assume here that the SPM size is constrained to values smaller than the overall storage requirement. In our tests, we computed the ratio between the dynamic energy reduction and the SPM size after mapping; the SPM size maximizing this ratio was selected, the idea being to obtain the maximum energy benefit for the smallest SPM size.
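A minimal sketch of the selection in Step 3 follows. Two points are assumptions of this sketch rather than statements about the tool: the lattices are ranked by their average number of accesses per element (the quantity shown in the computed access map), and the per-access energies are arbitrary placeholder values, not CACTI results. The lattice names, sizes, and access counts are hypothetical.

```python
def assign_to_spm(lattices, spm_size):
    """Greedy assignment. lattices: list of (name, size, accesses) tuples."""
    # Rank by average accesses per element, most intensely accessed first.
    ranked = sorted(lattices, key=lambda l: l[2] / l[1], reverse=True)
    spm, main, used = [], [], 0
    for name, size, accesses in ranked:
        if used + size <= spm_size:
            spm.append(name)
            used += size
        else:
            main.append(name)
    return spm, main

def dynamic_energy(lattices, spm, e_spm, e_main):
    """Accesses on each layer times the (assumed) energy per access."""
    return sum(acc * (e_spm if name in spm else e_main)
               for name, _, acc in lattices)

# Hypothetical lattices: (name, size in locations, number of accesses).
lattices = [("L1", 100, 5000), ("L2", 400, 8000),
            ("L3", 50, 4000), ("L4", 300, 900)]

spm, main = assign_to_spm(lattices, spm_size=200)
energy = dynamic_energy(lattices, spm, e_spm=1, e_main=10)   # arbitrary units
flat = dynamic_energy(lattices, [], e_spm=1, e_main=10)      # single off-chip layer
saving = 1 - energy / flat
```

With these placeholder numbers, L3 and L1 are placed in the SPM and the estimated energy drops from 179000 to 98000 units. Repeating the computation for several values of `spm_size` gives the energy reduction to SPM size ratio discussed above.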
Mapping signals within memory layers
This design phase has the following goals: (a) to map the signals (already assigned to the memory layers) into amounts of data storage as small as possible, both for the SPM and the main memory; (b) to compute these amounts of storage after mapping on both memory layers (allocation solution) and be able to determine the memory location of each array element from the specification (assignment solution); (c) to use mapping functions simple enough in order to ensure an address generation hardware of a reasonable complexity; (d) to ascertain that any scalar signals (array elements) simultaneously alive are mapped to distinct storage locations. Since the mapping models (13) and (28) play an important part in this section, they will be explained and illustrated below.
To reduce the size of a multidimensional array mapped to memory, the model (13) considers all the possible canonical linearizations of the array; for any linearization, the largest distance at any time between two live elements is computed. This distance plus 1 is then the storage "window" required for the mapping of the array into the data memory. More formally, … operands in the next iterations), while the circles representing A-elements already 'dead' (i.e., not needed as operands any more). The light grey points to the right of the dashed line are A-elements still unborn (to be produced in the next iterations). If we consider the array linearization by column concatenation in the increasing order of the columns, …
In the illustrative example shown in Fig. 5(a), the bounding window of the signal A is $W_A = (11, 10)$. It follows that the storage allocation for signal A is 100 locations if the linearization model is used, and $w_1 \times w_2 = 110$ locations when the bounding window model is applied. However, in the example shown in Fig. 5(b), where the code has a similar structure, the bounding window model yields a better allocation result: 30 storage locations, since the 2-D window of A is $W_A = (5, 6)$, whereas the linearization model yields 32 locations (the best canonical linearization being the row concatenation in the increasing order of rows).
Our software system incorporates both mapping models, their implementation being based on the same polyhedral framework operating with lattices that is used in Section 2. This is advantageous both from the point of view of computational efficiency and relative to the amount of allocated data storage, since the mapping window for each signal is the smaller one of the two models. Moreover, this methodology can be applied independently to the memory layers, providing a complete storage allocation/assignment solution for distributed memory organizations.
Before explaining the global flow of the algorithm, let us examine the simple case of a code with only one array reference in it: take, for instance, the two nested loops from Fig. 5(b), but without the second conditional statement that consumes the A-elements. In the bounding window model, $W_A$ can be determined by computing the integer projections on the two axes of the lattice of A[i][j], represented graphically by all the points inside the quadrilateral from Fig. 5(b). It can be directly observed that the integer projections of this polygon have the sizes $w_1 = 11$ and $w_2 = 7$. In the linearization model, denoting the two indexes by $x$ and $y$, the distance between two A-elements $A_1(x_1, y_1)$ and $A_2(x_2, y_2)$, assuming row concatenation in the increasing order of the rows, is $\mathrm{dist}(A_1, A_2) = (x_2 - x_1)\,\Delta y + (y_2 - y_1)$, where $\Delta y$ is the range of the second index (here, equal to 7) in the array space. Then, the A-elements at a maximum distance have the minimum and, respectively, the maximum index vectors relative to the lexicographic order. These A-elements are represented by the points $M = A[2][7]$ and $N = A[12][7]$ in Fig. 5(b), and $\mathrm{dist}(M, N) = (12 - 2) \times 7 + (7 - 7) = 70$.
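The distance computation above can be checked with a short sketch. The function name and its generalization to any number of dimensions are assumptions made for illustration; for two indexes it reduces to the row-concatenation formula given in the text.

```python
from math import prod

def linearized_distance(a, b, ranges):
    """Distance between index vectors a and b when the indexes are
    concatenated in the given order; ranges[k] is the range of index k."""
    return sum((b[k] - a[k]) * prod(ranges[k + 1:])   # prod of an empty slice is 1
               for k in range(len(a)))

# Points M = A[2][7] and N = A[12][7]; index ranges w1 = 11 and w2 = 7.
M, N = (2, 7), (12, 7)
d = linearized_distance(M, N, ranges=(11, 7))
window = d + 1   # the storage window is the maximum distance plus 1
```

Here `d` evaluates to (12 - 2) * 7 + (7 - 7) = 70, so the window for this linearization is 71 locations, compared with 11 * 7 = 77 locations for the bounding window of the same lattice.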
Similarly, in the linearization by column concatenation, the array elements at the maximum distance from each other are still the elements with the (lexicographically) minimum and maximum index vectors, provided an interchange of the indexes is applied first. These are the points $M' = A[9][4]$ and $N' = A[4][10]$ in Fig. 5(b). More generally, the maximum distance between the points of a live lattice in a canonical linearization is the distance between the (lexicographically) minimum and maximum index vectors, provided an index permutation is applied first. The distance between the array elements $A_i(x^i_1, x^i_2, \ldots, x^i_m)$ and $A_j(x^j_1, x^j_2, \ldots, x^j_m)$, with the indexes concatenated in the order $x_1, \ldots, x_m$, is
$\mathrm{dist}(A_i, A_j) = (x^j_1 - x^i_1)\,\Delta x_2 \cdots \Delta x_m + (x^j_2 - x^i_2)\,\Delta x_3 \cdots \Delta x_m + \cdots + (x^j_{m-1} - x^i_{m-1})\,\Delta x_m + (x^j_m - x^i_m)$,
where the index vector of $A_j$ is lexicographically larger than that of $A_i$ ($\Delta x_k$ is the range of the index $x_k$).
Algorithm 2: For each memory layer (SPM and main memory), compute the mapping windows for every indexed signal having lattices assigned to that layer.
Step 1 Compute underestimations of the window sizes on the current memory layer for each indexed signal, taking into account only the live signals at the boundaries between the loop nests. Let A be an $m$-dimensional signal in the algorithmic specification, and let $P_A$ be the set of disjoint lattices partitioning the index space of A. A high-level pseudo-code of the computation of A's preliminary windows is given below. Preliminary window sizes for each canonical linearization according to De Greef's model (13) are computed first, followed by the computation of the window size underestimate according to Tronçon's model (28) in the same framework operating with lattices. The meaning of the variables is explained in the comments.
for ( each canonical linearization C ) {
    for ( each disjoint lattice L ∈ P_A )
        compute x_min(L) and x_max(L) ;   // the (lexicographically) minimum and maximum index vectors of L relative to C
    for ( each boundary n between the loop nests n and n+1 ) {   // the start of the code is boundary 0
        let P_A(n) be the collection of disjoint lattices of A which are alive at the boundary n ;
                                          // these are disjoint lattices produced before the boundary and consumed after it
        let X_min(n) = min over L ∈ P_A(n) of x_min(L) and X_max(n) = max over L ∈ P_A(n) of x_max(L) ;
        |W_C(n)| = dist( X_min(n), X_max(n) ) + 1 ;   // the distance is computed in the canonical linearization C
    }
    |W_C| = max over n of |W_C(n)| ;      // the window size according to (13) for the canonical linearization C
}                                         // (possibly, an underestimate)
for ( each disjoint lattice L ∈ P_A )
    for ( each dimension k of signal A )
        compute x_min,k(L) and x_max,k(L) ;   // the extremes of the integer projection of L on the k-th axis
for ( each boundary n between the loop nests n and n+1 ) {   // the start of the code is boundary 0
    let P_A(n) be the collection of disjoint lattices of A which are alive at the boundary n ;
    for ( each dimension k of signal A ) {
        let X_min,k = min over L ∈ P_A(n) of x_min,k(L) and X_max,k = max over L ∈ P_A(n) of x_max,k(L) ;
        w_k(n) = X_max,k − X_min,k + 1 ;  // the k-th side of A's bounding window at boundary n
    }
}
for ( each dimension k of signal A )
    w_k = max over n of w_k(n) ;          // the k-th side of A's window over all boundaries
|W| = w_1 × w_2 × · · · × w_m ;           // the window size according to (28) (possibly, an underestimate)
Step 1 finds the exact values of the window sizes for both mapping models only when every loop nest either produces or consumes (but not both) the signal's elements. Otherwise, when elements of the signal are both produced and consumed in a certain loop nest (see the illustrative example from Fig. 5(a)), the window sizes obtained at the end of Step 1 may be only underestimates, since an increase of the storage requirement can happen inside the loop nest. An additional step is then required to find the exact values of the window sizes in both mapping models.
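The Step 1 computation can be sketched in a few lines, under several simplifying assumptions that are not part of the original framework: each disjoint lattice is stored as an explicit list of index vectors, together with the number of the loop nest producing it (`produced`) and of the last loop nest consuming it (`last_use`); only the canonical linearizations obtained by permuting the indexes in increasing order are handled; and all the example data are hypothetical.

```python
from math import prod

def live_points(lattices, n):
    """Points of the lattices alive at boundary n (between nests n and n+1)."""
    return [p for L in lattices
            if L["produced"] <= n < L["last_use"] for p in L["points"]]

def bounding_window(lattices, num_nests):
    """Sides and size of the bounding window over all boundaries."""
    m = len(lattices[0]["points"][0])
    sides = [0] * m
    for n in range(num_nests):
        live = live_points(lattices, n)
        for k in range(m if live else 0):
            coords = [p[k] for p in live]
            sides[k] = max(sides[k], max(coords) - min(coords) + 1)
    return sides, prod(sides)

def linearization_window(lattices, num_nests, order):
    """Window size for the linearization concatenating the indexes in `order`."""
    pts = [p for L in lattices for p in L["points"]]
    ranges = [max(p[k] for p in pts) - min(p[k] for p in pts) + 1 for k in order]
    best = 0
    for n in range(num_nests):
        live = [tuple(p[k] for k in order) for p in live_points(lattices, n)]
        if live:
            lo, hi = min(live), max(live)          # lexicographic extremes
            d = sum((hi[k] - lo[k]) * prod(ranges[k + 1:]) for k in range(len(order)))
            best = max(best, d + 1)
    return best

def block(rows, cols):
    return [(x, y) for x in rows for y in cols]

# Hypothetical signal A partitioned into three disjoint lattices, three loop nests.
lattices = [
    {"points": block(range(0, 2), range(0, 4)), "produced": 1, "last_use": 2},
    {"points": block(range(2, 4), range(0, 4)), "produced": 1, "last_use": 3},
    {"points": block(range(4, 5), range(0, 2)), "produced": 2, "last_use": 3},
]
sides, bw = bounding_window(lattices, num_nests=3)
row_w = linearization_window(lattices, 3, order=(0, 1))   # row concatenation
col_w = linearization_window(lattices, 3, order=(1, 0))   # column concatenation
```

For this example the bounding window is 4 × 4 = 16 locations, row concatenation also gives 16, and column concatenation gives 19. As in the pseudo-code, these values consider only the boundaries between loop nests, so they may underestimate the storage needed inside a nest that both produces and consumes the signal.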
Step 2 Update the mapping windows for each indexed signal in every loop nest producing and consuming elements of the signal.
The guiding idea is that local or global maxima of the bounding window size $|W|$ are reached immediately before the consumption of an A-element, which may entail a shrinkage of some side of the bounding window encompassing the live elements. Similarly, the local or global maxima of $|W_C|$ are reached immediately before the consumption of an A-element, which may entail a decrease of the maximum distance between live elements. Consequently, for each A-element consumed in a loop nest which also produces A-elements, we construct the disjoint lattices partially produced and those partially consumed until the iteration when the A-element is consumed. Afterwards, we perform a computation similar to that of Step 1, which may result in increased values for $|W_C|$ and/or $|W|$.
Finally, the amount of data memory allocated for signal A on the current memory layer is $|W_A| = \min\{\, |W|, \min_C |W_C| \,\}$, that is, the smallest data storage provided by the bounding window and the linearization mapping models. In principle, the overall amount of data memory after mapping is $\sum_A |W_A|$, the sum of the mapping window sizes of all the signals having lattices assigned to the current memory layer. In addition, a post-processing step attempts to further enhance the allocation solution: our polyhedral framework makes it possible to check efficiently whether two multidimensional signals have disjoint lifetimes, in which case the signals can share the larger of the two windows. More generally, an incompatibility graph (14) is used to optimize the memory sharing among all the signals at the level of the whole code.
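The final window selection and the sharing of windows between signals with disjoint lifetimes can be sketched as follows. This is only an illustration under stated assumptions: lifetimes are modeled as closed intervals of loop nest numbers, and the grouping is a simple greedy pass over the incompatibility relation, which is not necessarily the optimization performed in (14). The signal names and values are hypothetical.

```python
def signal_window(bounding, linearizations):
    """|W_A| = min{ |W|, min_C |W_C| }."""
    return min(bounding, min(linearizations))

def incompatible(a, b):
    """Two signals are incompatible when their lifetimes overlap."""
    return a[0] <= b[1] and b[0] <= a[1]

def shared_storage(windows, lifetimes):
    """Greedy grouping: signals that are pairwise compatible share one block,
    sized as the largest window in the group."""
    groups = []
    for name in sorted(windows, key=windows.get, reverse=True):
        for g in groups:
            if not any(incompatible(lifetimes[name], lifetimes[o]) for o in g):
                g.append(name)
                break
        else:
            groups.append([name])
    # Each group is led by its largest window because of the sorting order.
    return groups, sum(windows[g[0]] for g in groups)

# Hypothetical signals: window sizes and lifetimes (first nest, last nest).
windows = {"A": signal_window(16, [16, 19]), "B": 10, "C": 12}
lifetimes = {"A": (1, 3), "B": (4, 6), "C": (2, 5)}

groups, total = shared_storage(windows, lifetimes)
plain_sum = sum(windows.values())   # storage without any sharing
```

In this example A and B have disjoint lifetimes and share a block of 16 locations, while C overlaps both and keeps its own 12 locations, so the total is 28 instead of 38.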
Experimental results
A hierarchical memory allocation tool has been implemented in C++, incorporating the algorithms described in this chapter. For the time being, the tool supports only a two-level memory hierarchy, where an SPM is used between the main memory and the processor core. The dynamic energy is computed based on the number of accesses to each memory layer. In computing the dynamic energy consumption for the SPM and the main (off-chip) memory, the CACTI v5.3 power model (31) was used.
Table 1 summarizes the results of our experiments, carried out on a PC with an Intel Core 2 Duo 1.8 GHz processor and 512 MB RAM. The benchmarks used are: (1) a motion detection algorithm used in the transmission of real-time video signals on data networks; (2) the kernel of a motion estimation algorithm for moving objects (MPEG-4); (3) Durbin's algorithm for solving Toeplitz systems with N unknowns; (4) a singular value decomposition (SVD) updating algorithm (23) used in space-division multiple access (SDMA) modulation in mobile communication receivers, in beamforming, and in Kalman filtering; (5) the kernel of a voice coding application, an essential component of a mobile radio terminal.
The table displays the total number of memory accesses, the data memory size (in storage locations/bytes), and the dynamic energy consumption assuming only one (off-chip) memory layer. In addition, it displays the SPM size, the CPU times, and the savings of dynamic energy, versus the single-layer memory scenario, obtained by applying, respectively, a previous model steered by the total number of accesses for whole arrays (7), another previous model steered by the most accessed array rows/columns (18), and the current model. The energy consumptions for the motion estimation benchmark were, respectively, 1894, 1832, and 1522 µJ; the saved energies relative to the energy in column 4 are displayed as percentages in columns 6-8.
Our experiments show that the savings of dynamic energy consumption range from 40% to over 70% relative to the energy used in the case of a flat memory design. Although previous models produce energy savings as well, our model yielded savings 20%-33% better than theirs.
Unlike the previous works on power-aware assignment to the memory layers, our framework also provides the mapping functions that determine the exact location of any array element in the specification. This provides the necessary information for the automated design of the address generation unit, which is one of our future development directions. Unlike the previous works on signal-to-memory mapping, our framework offers a hierarchical strategy and, also, two metrics of quality for the memory allocation solutions: (a) the sum of the minimum array windows (that is, the optimum memory sharing between elements of the same arrays), and (b) the minimum storage requirement for the execution of the application code (that is, the optimum memory sharing between all the scalar signals or array elements in the code) (3).
Conclusions
This chapter has presented an integrated computer-aided design methodology for power-aware memory allocation, targeting embedded data-intensive signal processing applications. The memory management tasks, namely the assignment of the signals to the memory layers and their mapping to the physical memories, are efficiently addressed within a common polyhedral framework.
