Introduction and motivation
The PC industry has successfully completed several evolutionary memory transitions: from Fast Page Mode memory to EDO, to PC66 SDRAM, to PC100 SDRAM. While memory banking has been a widely employed technique in the past for increasing performance, its use in saving energy is relatively new. One important advantage of a banked memory system from the energy consumption viewpoint is that the banks that are not used by the current computation need not be powered up, thereby reducing overall energy consumption.
Our focus is on a banked memory architecture where each bank can be power-controlled independently. More specifically, each bank can be placed into a low-power operating mode when it is not used by the current computation. In an RDRAM-like architecture, one may have multiple low-power modes (states) to choose from when a memory bank is detected to be idle. The major tradeoff between these modes is between the energy saving (while in the low-power mode) and the resynchronization cost (i.e., the time it takes to bring a memory bank back from the low-power state to the fully operational active state). Figure 1 shows typical low-power operating modes and the transitions between them. The values associated with the nodes correspond to the per-cycle energy consumption for the bank, whereas the values associated with the edges indicate the resynchronization costs. These values clearly illustrate the tradeoff between energy and performance. Specifically, a more energy-saving operating mode also incurs a higher resynchronization cost.
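The tradeoff between energy saving and resynchronization cost can be illustrated with a small break-even calculation. The following Python sketch is illustrative only: the mode names, per-cycle energies, and resynchronization costs are hypothetical placeholders (they are not the values of Figure 1), and the cost model, in which a bank spends its resynchronization cycles at active-mode energy inside the idle interval, is an assumption.

```python
# Hypothetical modes: (name, energy per cycle, resynchronization cycles).
# These numbers are placeholders, not the values shown in Figure 1.
MODES = [
    ("active", 1.00, 0),
    ("standby", 0.30, 2),
    ("nap", 0.05, 30),
    ("power-down", 0.01, 9000),
]

def idle_energy(mode, idle_cycles, active_energy=1.00):
    """Energy spent over an idle interval if the bank uses `mode`.

    Assumed model: the bank leaves the low-power mode early enough to be
    ready at the end of the interval, so `resync` cycles are charged at
    active-mode energy. Returns None if the interval is too short.
    """
    _, energy, resync = mode
    if resync > idle_cycles:
        return None
    return energy * (idle_cycles - resync) + active_energy * resync

def best_mode(idle_cycles):
    """Name of the feasible mode with the lowest energy for this interval."""
    candidates = []
    for mode in MODES:
        cost = idle_energy(mode, idle_cycles)
        if cost is not None:
            candidates.append((cost, mode[0]))
    return min(candidates)[1]

for idle in (1, 10, 1000, 1_000_000):
    print(idle, best_mode(idle))
```

Under these assumed numbers, longer idle intervals justify deeper modes, which is the reason for trying to lengthen bank idleness.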
It should be observed that the benefits from low-power operating modes can increase if the idleness of a bank can be increased. This is because, in this case, the resynchronization cost can be compensated by significant savings in energy. In comparison, a short idle period either will not allow us to place the bank into the most energy-saving mode (i.e., the one that consumes the smallest amount of energy per cycle), or will incur a large performance penalty. Consequently, an important goal is to increase the idleness of memory banks as much as possible. This can be achieved by smart data allocation to memory banks and/or by re-ordering computations. In this study, we make a case for compiler-oriented data layout transformations for array data, since they can increase the effectiveness of the low-power modes available in the memory system. We also need to mention that in this work we assume that no virtual memory support exists in the system under consideration. Consequently, the compiler can directly work with physical addresses; that is, it can lay out data in physical memory and place banks into low-power modes based on the information it collects during program analysis. Note that there exist many embedded systems that work without virtual memory support [5]. Work is in progress to extend the techniques discussed in this paper to environments with virtual memory (by enlisting help from the operating system (OS)).
Previous research shows that compiler-based (e.g., [2]), OS-based (e.g., [8]), and pure hardware-based schemes (e.g., [2]) can be used to decide the most suitable low-power mode when a memory bank is detected to be idle. Since in this work we focus on array-intensive applications, we opted to use a compiler-based approach, where an optimizing compiler (taking into account loop access patterns and array-to-bank mappings, that is, the layout of data in banked memory) decides which operating mode to use. Note that (where applicable) such a compiler-based strategy has an important advantage over pure OS-based and hardware-based techniques. Specifically, the compiler-based strategy (unlike pure OS-based or pure hardware-based strategies) does not rely on the history of data access patterns; that is, the compiler can predict (quite accurately for array-intensive codes) future data access patterns (and also future idle times), and select the most appropriate mode to switch to when idleness is predicted. In addition, the compiler can also predict when an idle bank will be requested in the future, and can pre-activate it in an attempt to eliminate re-synchronization latency. Details of the compiler-based low-power mode detection strategy are beyond the scope of this paper. It should be noted, however, that the compiler-based [11, 9], in this paper, we use them for energy optimization in banked memory architectures.
[3] studies OS-based DRAM power control policies. [7] evaluates the impact of classical loop optimizations on energy consumption of banked memories. [6] presents an iteration space reordering technique for banked memories. In contrast, the work presented in this paper is oriented toward increasing the benefits of low-power modes by data distributions across memory banks. [4] shows how a sleep mode can be exploited for memory partitions. [15] and [14] discuss techniques for exploiting dual banks for ASIPs and DSPs, respectively. [12] addresses the topic of incorporating the application-specific customization of memory bank configuration into behavioral synthesis. In comparison, we study how compiler-directed data optimization can improve energy behavior of a multi-banked system.
The rest of this paper is organized as follows. Section 2 presents background on iteration space and data space representations for array-intensive applications. Section 3 explains array-to-virtual bank mappings and Section 4 discusses virtual bank-to-physical bank mapping. Section 5 shows how data transformations can be useful in increasing the effectiveness of low-power operating modes. Section 6 introduces our experimental platform, and presents data that show the effectiveness of our strategy. Section 7 gives our concluding remarks.
Preliminaries
We can define the iterations of a loop nest as a set, each element of which corresponds to an iteration vector. Each execution of the loop body uses a vector from this set. Given this, an array access within a loop nest can be expressed as RI + r, where R is the access matrix, I is the iteration vector, and r is an offset vector [16, 10]. As an example, for an array reference such as A(i-1,j+2) that occurs within a loop nest where i is the outer loop and j is the inner, R is the identity matrix, I = (i j)T, and r = (-1 2)T. Note that each iteration vector I accesses one array element via this array reference. The array element accessed by an iteration vector is identified by an index vector. In the example above, this vector is a = (i-1 j+2)T. It can be observed that if the nest in question has n loops and the reference in question belongs to a k-dimensional array, R is a k-by-n matrix, I is an n-entry vector, and r is a k-entry vector.
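As a minimal illustration of the RI + r representation, the following Python sketch evaluates an access function for a given iteration vector. The helper names (mat_vec, access) are ours and purely illustrative; matrices are plain nested lists.

```python
def mat_vec(matrix, vector):
    """Multiply a k-by-n matrix (list of rows) by an n-entry vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

def access(R, r, I):
    """Index vector touched by iteration I through the reference RI + r."""
    return [x + y for x, y in zip(mat_vec(R, I), r)]

# Reference A(i-1, j+2): R is the identity matrix and r = (-1 2)T.
R_ident = [[1, 0], [0, 1]]
r_ident = [-1, 2]
print(access(R_ident, r_ident, [3, 5]))  # iteration (i, j) = (3, 5)

# Reference A(i+j+1, j-1): R = [[1, 1], [0, 1]] and r = (1 -1)T.
R_skew = [[1, 1], [0, 1]]
r_skew = [1, -1]
print(access(R_skew, r_skew, [3, 5]))
```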
Informally, a data transformation indicates a mapping of the index space. In mathematical terms, a data transformation can be represented using a pair (M,m), and it transforms an original index vector RI + r to MRI + Mr + m. If we restrict ourselves to dimension-preserving data transformations, M is a k-by-k matrix and m is a k-entry vector (for a k-dimensional array). For instance, assuming that $M = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$ and $m = (0\ 0)^T$, the index vector a = (i-1 j+2)T is mapped to (j+2 i-1)T.
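The effect of a data transformation (M,m) on a reference can be sketched as follows. The example assumes the dimension-swapping matrix M with a zero offset m (an assumption chosen to match the interchange of the two index components described above); the function names are illustrative.

```python
def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

def mat_mul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transform_index(M, m, a):
    """Map an index vector a to Ma + m."""
    return [x + y for x, y in zip(mat_vec(M, a), m)]

def transform_reference(M, m, R, r):
    """Rewrite the reference RI + r as (MR)I + (Mr + m)."""
    return mat_mul(M, R), transform_index(M, m, r)

M = [[0, 1], [1, 0]]   # assumed: swaps the two array dimensions
m = [0, 0]             # assumed: no shift

# Index vector of A(i-1, j+2) at (i, j) = (3, 5) is (2, 7).
print(transform_index(M, m, [2, 7]))

# The same result is obtained by transforming the reference itself.
R_new, r_new = transform_reference(M, m, [[1, 0], [0, 1]], [-1, 2])
print(R_new, r_new)
```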
Array-to-virtual bank mapping
In our framework, array elements are mapped to (physical) memory banks using a two-level mapping. In the first level, an index vector (of an array) is mapped to a virtual bank, and in the second level, each virtual bank is mapped to a physical bank. This two-level process is depicted in Figure 2.
The compiler operates under the assumption of a virtual bank space (VBS), which can be multi-dimensional. Given an array index vector a, we find the virtual bank to which it is mapped using an affine mapping θa + ϕ. Therefore, two different elements (of the same array) represented by index vectors a and b map onto the same virtual bank if and only if:
$$\theta a + \phi = \theta b + \phi \;\Rightarrow\; \theta (a - b) = 0.$$
In other words, (a-b) should be in the kernel set of θ. In this paper, the pair (θ,ϕ) is called the bank mapping. Note that different criteria can be used in determining suitable bank mappings, and each array can have a different bank mapping than the other arrays in a given application.
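The kernel condition above can be checked mechanically. The sketch below assumes a one-dimensional virtual bank space with θ = (1 0) and ϕ = 0 (so the bank is selected by the first array index); these values and the function names are illustrative assumptions, and the sketch ignores any bound on the number of virtual banks.

```python
def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

def virtual_bank(theta, phi, a):
    """Virtual bank coordinates of index vector a under (theta, phi)."""
    return tuple(x + p for x, p in zip(mat_vec(theta, a), phi))

def same_virtual_bank(theta, a, b):
    """True iff (a - b) lies in the kernel of theta."""
    diff = [x - y for x, y in zip(a, b)]
    return all(v == 0 for v in mat_vec(theta, diff))

theta = [[1, 0]]   # assumed bank mapping: bank chosen by the first index
phi = [0]

print(virtual_bank(theta, phi, [3, 5]))
print(same_virtual_bank(theta, [3, 5], [3, 9]))  # same first index
print(same_virtual_bank(theta, [3, 5], [4, 5]))  # different first index
```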
Virtual bank-to-physical bank mapping
A virtual bank-to-physical bank mapping (or a physical mapping for short) determines how virtual banks are mapped to the physical banks in the architecture. Let v be a virtual bank. The corresponding physical bank can be determined using a mapping such as: ξv + σ. Now, in order for two different array elements a and b to be mapped to the same physical bank, one should have:
$$\xi(\theta a + \phi) + \sigma = \xi(\theta b + \phi) + \sigma \;\Rightarrow\; \xi\theta (a - b) = 0.$$
There is a good reason to adopt such a two-level mapping (instead of using a more intuitive one-level mapping that maps arrays directly to the physical banks). In many cases, we want a compiler optimization to be easily portable to another platform. Because of this, it makes sense to work with the VBS rather than the PBS (physical bank space). In other words, we can write our compiler-based optimization strategy only once (using the VBS as the reference point), and when we want to port it to other architectures (with different physical bank structures), we only need to change the virtual bank-to-physical bank mapping. Note that, in general, the virtual bank-to-physical bank mapping can reduce the dimensionality and/or extents (dimension sizes) of the VBS. Informally, a mapping is specified by giving a decomposition style for each dimension of the virtual bank space along with the physical bank size in each dimension (of the physical bank space). For example, a mapping such as (b1,b2,b3) → (b1/p,*,*) indicates that a three-dimensional virtual bank space is mapped to a one-dimensional physical bank space. The * notation indicates that the corresponding virtual bank dimension is not distributed across the physical banks. Similarly, a mapping such as (b1,b2,b3) → (b1/p,b2/q,*) indicates that a three-dimensional virtual bank space is mapped to a two-dimensional physical bank space that contains p×q banks (this might be useful, for example, for some SDRAMs where memory banks actually form a two-dimensional grid). It should also be noted that when multiple virtual banks are mapped to the same physical bank, the loop iterations that access those virtual banks are localized (that is, they exhibit bank locality as they now, after folding, access the same physical bank). Therefore, selection of a suitable mapping function can be important. A virtual bank-to-physical bank mapping tries to take advantage of spatial locality between neighboring banks.
However, from the compiler's perspective, it should be sufficient to work with the VBS instead of the PBS. This is because whenever we achieve access locality for a virtual bank, it is guaranteed that that locality will extend to the PBS as well since a virtual bank is mapped to only a single physical bank. Therefore, optimizing bank locality in the VBS is sufficient for our purposes. The rest of this paper discusses our approach to optimizing bank locality in the VBS.
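The folding notation used above, such as (b1,b2,b3) → (b1/p,*,*), can be sketched in Python as follows. The interpretation is an assumption on our part: a dimension marked b/p is block-distributed over p physical banks (which requires knowing the extent of that virtual dimension), and a dimension marked * is collapsed. The function name and the (extent, banks) encoding are hypothetical.

```python
def physical_bank(v, spec):
    """Fold virtual bank coordinates v into physical bank coordinates.

    spec has one entry per virtual dimension:
      (extent, banks) -> block-distribute that dimension over `banks`
                         physical banks (assumed reading of b/p),
      None            -> the '*' case: the dimension is collapsed.
    """
    coords = []
    for value, entry in zip(v, spec):
        if entry is None:
            continue
        extent, banks = entry
        coords.append((value * banks) // extent)
    return tuple(coords)

# (b1, b2, b3) -> (b1/p, *, *) with extent 8 and p = 4 physical banks.
one_dim = [(8, 4), None, None]
print(physical_bank((4, 2, 7), one_dim))
print(physical_bank((5, 9, 1), one_dim))  # folded onto the same bank

# (b1, b2, b3) -> (b1/p, b2/q, *) with p = 4 and q = 2: a 4x2 bank grid.
two_dim = [(8, 4), (4, 2), None]
print(physical_bank((5, 3, 9), two_dim))
```

Virtual banks 4 and 5 along the first dimension land on the same physical bank in this sketch, which is the folding effect described above.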
Role of data transformations in bank locality
As discussed earlier, our focus in this paper is on studying the cases where a data transformation might be of use in exploiting bank locality for array-intensive applications. Let us start by defining formally what we mean by bank locality.
Definition: If two iteration vectors, say I and J, are very close to each other (that is, I-J is a lexicographically small value, preferably (0 0 0 … 0 1)T), they are said to have temporal affinity.
Definition: If two iteration vectors with temporal affinity access array elements in the same virtual bank, they are said to exhibit bank locality.
Now, let us determine the condition for bank locality. Let $I^{+} = I + (0\ 0\ 0\ \dots\ 0\ 1)^T$, where + denotes vector addition. In order to have bank locality, the array elements accessed by I and $I^{+}$ (via the same array reference in the code) should be on the same virtual bank. In mathematical terms, we need to have: $\theta(RI + r) + \phi = \theta(RI^{+} + r) + \phi \Rightarrow \theta R I = \theta R I^{+} \Rightarrow \theta h = 0$, where h is the last column of R. This type of bank locality can be termed intra-reference bank locality, i.e., the bank locality that originates from a single array reference in the application code.
It is important to note here that for a given θ matrix, the vector h may or may not be in its kernel set. Therefore, it is not guaranteed that we can achieve intra-reference bank locality. Now, let us assume that we apply a data transformation represented by (M,m) to the array in question. In this case, rewriting the condition for intra-reference bank locality, we obtain:
$\theta(MRI + Mr + m) + \phi = \theta(MRI^{+} + Mr + m) + \phi \Rightarrow \theta M R I = \theta M R I^{+} \Rightarrow \theta M h = 0$. Observe that we now have the flexibility of selecting a suitable M such that h is in the kernel set of θM. Therefore, we can conclude that data transformations can be useful for achieving intra-reference bank locality.
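Example: Let us assume an array reference A(i+j+1,j-1) within a loop nest with two loops: i (outer) and j (inner). It is easy to see that $R = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}$ and $r = (1\ \ {-1})^T$. Assuming that θ = (1 0), we can see that
$$\theta h = (1\ 0) \begin{pmatrix} 1 \\ 1 \end{pmatrix} = 1 \neq 0.$$
Therefore, we can conclude that it is not possible to exploit intra-reference bank locality under this distribution (bank mapping). However, if we use a data transformation matrix M, from
$$\theta M h = (1\ 0) \begin{pmatrix} m_{11} & m_{12} \\ m_{21} & m_{22} \end{pmatrix} \begin{pmatrix} 1 \\ 1 \end{pmatrix} = m_{11} + m_{12} = 0$$
we can find that m11 + m12 = 0. A solution to this last equation is m11 = 1 and m12 = -1, which can subsequently be completed to a full data transformation matrix by choosing a second row that keeps M nonsingular, for example $M = \begin{pmatrix} 1 & -1 \\ 0 & 1 \end{pmatrix}$.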
In other words, it is possible to find an M matrix to satisfy intra-reference bank locality. This small example illustrates how useful a data transformation can be in optimizing bank locality.
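The search for a suitable M in this example can be sketched as a small brute-force enumeration. This is not the selection procedure used by our compiler; it is an illustrative sketch that assumes 2-by-2 matrices with entries in {-1, 0, 1} and requires M to be nonsingular. All function names are hypothetical.

```python
from itertools import product

def mat_mul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def last_column(R):
    """The vector h: last column of the access matrix, as a column."""
    return [[row[-1]] for row in R]

def det2(M):
    """Determinant of a 2-by-2 matrix."""
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]

def has_intra_locality(theta, M, R):
    """True iff theta * M * h = 0."""
    result = mat_mul(mat_mul(theta, M), last_column(R))
    return all(x == 0 for row in result for x in row)

def candidate_transforms(theta, R, values=(-1, 0, 1)):
    """All nonsingular 2x2 matrices over `values` with theta*M*h = 0."""
    found = []
    for m11, m12, m21, m22 in product(values, repeat=4):
        M = [[m11, m12], [m21, m22]]
        if det2(M) != 0 and has_intra_locality(theta, M, R):
            found.append(M)
    return found

IDENTITY = [[1, 0], [0, 1]]
R = [[1, 1], [0, 1]]     # access matrix of A(i+j+1, j-1)
theta = [[1, 0]]

print(has_intra_locality(theta, IDENTITY, R))   # no transformation
solutions = candidate_transforms(theta, R)
print(len(solutions), solutions[:3])
```

Every matrix returned satisfies m11 + m12 = 0, in agreement with the derivation above.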
We next focus on inter-reference bank locality. Let RI + r and R'I + r' be two different references to the same array. In order to have inter-reference bank locality, we should have:
θ(RI + r) + ϕ = θ(R'I + r') + ϕ ⇒ θ(R-R')I = θ(r'-r). Let us consider two cases:
Case I. R = R'. This represents a very common case in array-intensive embedded image/video applications. In this case, the relation above reduces to θ(r'-r) = 0. Consequently, if r'-r is not in the kernel set of θ, we cannot have inter-reference bank locality. On the other hand, if we employ a data transformation represented by (M,m), we have θ(MRI + Mr + m) + ϕ = θ(MR'I + Mr' + m) + ϕ ⇒ θM(R-R')I = θM(r'-r). Since we have R = R', this last equation reduces to θM(r'-r) = 0. Now, it may be possible to select a suitable M such that r'-r is in the kernel set of θM. That is, a data transformation increases the chances for inter-reference bank locality. To illustrate how this works in practice, we consider the following example.
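Example: Let us assume two array references, A(i+j+1,j-1) and A(i,j), within a loop nest with two loops: i (outer) and j (inner). Assuming, as before, that we use θ = (1 0) as our bank mapping, we can find that
$$\theta(r' - r) = (1\ 0) \left\{ \begin{pmatrix} 1 \\ -1 \end{pmatrix} - \begin{pmatrix} 0 \\ 0 \end{pmatrix} \right\} = (1\ 0) \begin{pmatrix} 1 \\ -1 \end{pmatrix} = 1.$$
Since θ(r'-r) ≠ 0, it is not possible to satisfy inter-reference bank locality. On the other hand, if we are allowed to use a data transformation matrix M, from
$$\theta M (r' - r) = (1\ 0) \begin{pmatrix} m_{11} & m_{12} \\ m_{21} & m_{22} \end{pmatrix} \begin{pmatrix} 1 \\ -1 \end{pmatrix} = m_{11} - m_{12} = 0,$$
we have m11 - m12 = 0. A possible solution is m11 = 1 and m12 = 1, which can subsequently be completed to a full data transformation matrix by choosing a second row that keeps M nonsingular, for example $M = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}$.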
So, it is possible to obtain inter-reference bank locality using this data transformation. This example clearly illustrates that data transformations can be very useful in exploiting inter-reference bank locality.
Case II. R ≠ R'. In this case, if we do not use any data transformation, we have θ(R-R')I = θ(r'-r), as determined above. In general, this equality cannot be satisfied for all iteration vectors, since the right side is constant while the left side can take different values for different iteration vectors I. However, if we use a data transformation (M,m), we need to satisfy θM(R-R')I = θM(r'-r). This can be achieved by satisfying the following two equalities:
θM(R-R') = 0, and θM(r'-r) = 0. That is, even in this case, it might be possible to find an M matrix to satisfy these two constraints at the same time, and thus obtain inter-reference bank locality.
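Both cases can be covered by one check of the two equalities θM(R-R') = 0 and θM(r'-r) = 0 (for R = R' the first holds trivially). The Python sketch below is illustrative; the two reference pairs and the bank mappings used in it are hypothetical examples chosen by us, not taken from our benchmarks.

```python
def mat_mul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def mat_sub(A, B):
    """Element-wise difference of two equally sized matrices."""
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(A, B)]

def column(v):
    """Turn a flat vector into a column matrix."""
    return [[x] for x in v]

def is_zero(A):
    return all(x == 0 for row in A for x in row)

def inter_reference_locality(theta, M, R, R2, r, r2):
    """True iff theta*M*(R - R2) = 0 and theta*M*(r2 - r) = 0."""
    tm = mat_mul(theta, M)
    linear = mat_mul(tm, mat_sub(R, R2))
    constant = mat_mul(tm, mat_sub(column(r2), column(r)))
    return is_zero(linear) and is_zero(constant)

I2 = [[1, 0], [0, 1]]

# Case I (hypothetical pair): A(i, j) and A(i+1, j-1), theta = (1 0).
case1 = dict(theta=[[1, 0]], R=I2, R2=I2, r=[0, 0], r2=[1, -1])
print(inter_reference_locality(M=I2, **case1))
print(inter_reference_locality(M=[[1, 1], [0, 1]], **case1))

# Case II (hypothetical pair): A(i, j) and A(i, 2j+3), theta = (0 1).
case2 = dict(theta=[[0, 1]], R=I2, R2=[[1, 0], [0, 2]],
             r=[0, 0], r2=[0, 3])
print(inter_reference_locality(M=I2, **case2))
print(inter_reference_locality(M=[[0, 1], [1, 0]], **case2))
```

In the second pair, swapping the two array dimensions makes the bank depend only on i, which both references share.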
So far, we have only considered the bank locality problem from the perspective of a single array (that is, intra-array bank locality, whether it is intra-reference or inter-reference). It is also possible to exploit inter-array bank locality. Let us assume that RI + r and R'I + r' are references to two different arrays. For inter-array bank locality, one should have: $\theta_1(RI + r) + \phi_1 = \theta_2(R'I + r') + \phi_2 \Rightarrow (\theta_1 R - \theta_2 R')I = \theta_2 r' - \theta_1 r + \phi_2 - \phi_1$. In this formulation, $(\theta_1, \phi_1)$ and $(\theta_2, \phi_2)$ represent the array-to-virtual bank mappings for the two arrays under consideration. If $\theta_1 R - \theta_2 R' = 0$, then the above equation reduces to: $\theta_2 r' - \theta_1 r + \phi_2 - \phi_1 = 0$. However, in the general case $\theta_1 R - \theta_2 R' \neq 0$, and as a result, it is not possible to satisfy this equation for all iteration vectors I.
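On the other hand, if we assume data transformations $(M_1, m_1)$ and $(M_2, m_2)$ for these two arrays, we need to satisfy:
$$\theta_1(M_1 R I + M_1 r + m_1) + \phi_1 = \theta_2(M_2 R' I + M_2 r' + m_2) + \phi_2 \;\Rightarrow\; (\theta_1 M_1 R - \theta_2 M_2 R')I = \theta_2 M_2 r' - \theta_1 M_1 r + \theta_2 m_2 - \theta_1 m_1 + \phi_2 - \phi_1.$$
This last equality can be satisfied if one can satisfy the following two equations: $\theta_1 M_1 R - \theta_2 M_2 R' = 0$, and $\theta_2 M_2 r' - \theta_1 M_1 r + \theta_2 m_2 - \theta_1 m_1 + \phi_2 - \phi_1 = 0$. In fact, if we can find the $M_1$ and $M_2$ matrices using the first equality, we can substitute them into the second equality and solve it for $m_1$ and $m_2$.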
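The two inter-array conditions can be checked in the same style. The following Python sketch is illustrative only: the reference pair A(i,j) and B(j,i), the bank mappings, and all function names are hypothetical choices made for the example.

```python
def mat_mul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def mat_vec(A, v):
    """Multiply a matrix (list of rows) by a flat vector."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def inter_array_locality(theta1, phi1, M1, m1, R, r,
                         theta2, phi2, M2, m2, R2, r2):
    """Check theta1*M1*R - theta2*M2*R2 = 0 and the constant condition."""
    t1 = mat_mul(theta1, M1)
    t2 = mat_mul(theta2, M2)
    left, right = mat_mul(t1, R), mat_mul(t2, R2)
    linear_ok = left == right
    constant = [a - b + c - d + e - f for a, b, c, d, e, f in zip(
        mat_vec(t2, r2), mat_vec(t1, r),
        mat_vec(theta2, m2), mat_vec(theta1, m1), phi2, phi1)]
    return linear_ok and all(x == 0 for x in constant)

I2 = [[1, 0], [0, 1]]
SWAP = [[0, 1], [1, 0]]
theta = [[1, 0]]

# Hypothetical references A(i, j) and B(j, i), both banked by first index.
base = dict(theta1=theta, phi1=[0], M1=I2, R=I2, r=[0, 0],
            theta2=theta, R2=SWAP, r2=[0, 0])

print(inter_array_locality(m1=[0, 0], phi2=[0], M2=I2, m2=[0, 0], **base))
print(inter_array_locality(m1=[0, 0], phi2=[0], M2=SWAP, m2=[0, 0], **base))
# With phi2 - phi1 = 1, an offset m1 = (1 0)T restores the equality.
print(inter_array_locality(m1=[1, 0], phi2=[1], M2=SWAP, m2=[0, 0], **base))
```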
Experiments

Setup
All energy numbers presented in this paper have been obtained using a custom memory energy simulator. This simulator takes as input a C program and a banked memory description (i.e., the number and sizes of memory banks as well as the available low-power operating modes with their energy saving factors and re-synchronization costs). As output, it gives the energy consumption in memory banks along with detailed bank inter-access time profiles. By giving the original and optimized programs to this simulator as input, we measure the impact of our data transformation strategy on memory system energy.
The data transformation framework presented in this paper has been fully implemented using the SUIF infrastructure from Stanford University [1]. SUIF has independently developed compilation passes that work together by using a common intermediate format (IF) to represent programs. A typical compilation framework based on SUIF includes the following components: front end, data dependence analysis, and several optimization modules. Our framework is implemented as a separate optimization module within SUIF. We also use a powerful back-end compiler (when converting the C code to an executable) that performs instruction scheduling and graph coloring-based global register allocation. Unless stated otherwise, 8x8MB (that is, 8 memory banks, each with a capacity of 8MB) is our default bank configuration. Note that if a bank is not accessed during the execution of an application, it is never activated in either the original or the optimized code version. In other words, even our base case takes advantage of low-power operating modes, and the approach proposed in this paper tries to improve over it. That is, all the energy benefits reported in this work come from our data layout optimization strategy. Also, we use a default array-to-bank mapping in most of our experiments. In this default mapping, each array is laid out in memory in a row-major fashion, and the next array (in the program declaration part) is stored starting from the location next to the one where the previous array ends (that is, the arrays are stored in memory one after another). However, we also report some results with smarter array mappings (distributions). While in this work we use the energy consumption and resynchronization values shown in Figure 1, our framework is general enough to work with a different set of low-power modes as well.
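The default array-to-bank mapping described above can be sketched as follows. The sketch assumes the default 8x8MB configuration, zero-based indices, and a 4-byte element size; the element size, the array shapes, and the function names are hypothetical choices made for illustration.

```python
MB = 1 << 20
BANK_SIZE = 8 * MB      # default configuration: 8 banks of 8MB each
NUM_BANKS = 8
ELEMENT_SIZE = 4        # assumed element size in bytes

def layout(arrays):
    """Place 2-D arrays back to back in declaration order.

    arrays: list of (name, rows, cols).
    Returns a dict mapping name -> (base_address, rows, cols).
    """
    table = {}
    next_free = 0
    for name, rows, cols in arrays:
        table[name] = (next_free, rows, cols)
        next_free += rows * cols * ELEMENT_SIZE
    return table

def bank_of(table, name, i, j):
    """Bank holding element (i, j) of a row-major array."""
    base, _, cols = table[name]
    address = base + (i * cols + j) * ELEMENT_SIZE
    return address // BANK_SIZE

# A occupies 6MB and B occupies 4MB, so B straddles banks 0 and 1.
table = layout([("A", 1536, 1024), ("B", 1024, 1024)])
print(bank_of(table, "A", 1535, 1023))
print(bank_of(table, "B", 0, 0))
print(bank_of(table, "B", 1023, 1023))
```

Arrays that straddle bank boundaries in this way are one reason why the default mapping leaves room for improvement.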
The energy values shown in Figure 1 have been obtained from the measured current values associated with memory banks documented in memory data sheets (for 3.3V, 2.5 ns cycle time, 8MB memory) [13]. The re-synchronization latencies have also been obtained from the same data sheets. Based on the trends gleaned from the data sheets, the energy values are increased by 30% when the bank size is doubled. Unless stated otherwise, our architecture does not have a data cache (since we want to isolate the energy benefits in the banked memory system). However, later in this section we also report experimental data with different data cache sizes. In fact, our results indicate that the proposed strategy is successful with both cacheless and cache-based systems.
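The 30% scaling rule mentioned above can be written as a simple multiplicative factor relative to the 8MB reference bank. The Python sketch below is illustrative; the function name is hypothetical, and applying the same rule to non-power-of-two sizes or to banks smaller than 8MB would be an extrapolation on our part.

```python
import math

def energy_scale(bank_mb, reference_mb=8, growth=1.30):
    """Per-cycle energy factor relative to the 8MB reference bank.

    Each doubling of the bank size multiplies the energy values by
    `growth` (30% increase per doubling).
    """
    doublings = math.log2(bank_mb / reference_mb)
    return growth ** doublings

for size in (8, 16, 32, 64):
    print(size, round(energy_scale(size), 3))
```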
Results
To evaluate our strategy quantitatively, we performed experiments with nine array-intensive benchmarks. Important characteristics of these codes are listed in Table 1. The third column gives the execution cycles, while the fourth column shows the memory energy consumption without our optimization.
As mentioned earlier, even this baseline version makes full use of low-power operating modes available in the architecture.
The first bar for each benchmark in Figure 3 (called Base) gives the energy consumption due to our strategy, normalized with respect to the default array distribution without any program transformation (the last column in Table 1). We see that our data optimization brings a 29.6% improvement on average. We also note that the relative energy savings depend on the benchmark used. For example, the savings with some benchmarks such as hier are not as good as those with the others, mainly due to the fact that the compiler was not able to select good data transformations for the arrays in these codes. The main reason for this is that the references to the same array create conflicts that prevent the compiler from using the ideal data transformation matrix (M) for the array in question. Still, even with such benchmarks, our approach achieves energy savings of around 15%. In comparison, in benchmarks such as compress, there are few references per array; hence, there are fewer chances for conflict in selecting the most appropriate data transformation.
While our energy savings are significant, one might argue that data distribution (across the memory banks) also plays a key role in shaping energy consumption behavior. So, we also measured the energy savings with respect to a distribution-optimized code. The specific data distribution algorithm (that is, the algorithm that decides which arrays should be mapped to which banks) is from [2]. The second bar for each benchmark in the figure (named Base+Distr) shows the normalized energy consumption of our strategy. The average energy reduction is around 21.2%, indicating that our approach is still very successful in optimizing memory energy behavior. The third bar for each code (named Base+Loop) gives the normalized energy consumption of our strategy with respect to a version that uses the default array distribution and data locality oriented loop transformation. The rationale behind this version is that locality oriented loop transformations in general improve spatial access patterns, and this can also improve bank locality. The specific loop transformation strategy that is used here is from [10]. We see that the average energy savings brought by our approach with respect to this version is around 19.1%. What this result says is that the data transformation strategy complements the loop transformation based optimization. Finally, the last bar in the figure (referred to as Base+Distr+Loop) shows the normalized energy consumption of our scheme with respect to a strategy that uses both optimized data distribution and loop transformation. Even against this highly optimized version we achieve 14.7% energy savings on average when considering all codes in our benchmark suite. Overall, these results clearly show that our data transformation based approach is very effective in increasing the effectiveness of low-power operating modes.
It should be emphasized that the energy benefits shown in Figure 3 have been obtained by trying to satisfy both intra-reference constraints and inter-reference constraints. To illustrate the individual contributions coming from these different types of locality, we show in Figure 4 how the energy benefits are broken down (the results are normalized with respect to the Base+Distr+Loop version). One can observe from the trends shown in this graph that most of the energy savings come from optimizing intra-reference bank locality. The main reason for this behavior is that satisfying intra-reference locality brings more benefits since it captures access behavior across loop iterations. This is in contrast to inter-reference locality, whose impact is limited by the number of references to the same array in the loop body. Nevertheless, we still observe that the contribution of satisfying inter-reference constraints to overall memory energy savings is around 20.8% on average, which indicates that it is important to take care of them as well.
An important parameter that influences the magnitude of energy savings is the number of memory banks. This is because a larger number of banks gives the compiler finer-grained control in placing memory regions into low-power operating modes. In order to quantify the impact of our approach with different bank configurations, we performed another set of experiments. More specifically, keeping the total memory size fixed at 64MB, we conducted experiments with 2, 4, 8, 16, and 32 banks. The results are given in Figure 5 (again as values normalized with respect to the Base+Distr+Loop version). We can observe from these results that working with a larger number of banks (i.e., with smaller bank sizes) in general increases the energy benefits coming from the proposed data layout optimization strategy. This is because, as indicated above, smaller bank sizes give our strategy more opportunities for energy-managing even smaller portions of main memory. Such finer-grained management, in turn, increases the energy benefits. However, we also note that in some applications, increasing the number of banks beyond a specific number does not increase savings. This occurs because of the data access pattern of such applications. Specifically, the access pattern of those applications spans more banks when the number of banks is increased beyond a specific value.
In our experiments so far we have focused on a memory architecture without a data cache. Including a cache in the hierarchy can filter some requests, thereby increasing the idleness of memory banks. However, since even unoptimized codes benefit from such filtering, we can expect some reduction in energy savings. The result shown in Figure 6 corroborates this expectation. Still, even with a 16KB data cache, we obtain an average energy saving of roughly 7% over the Base+Distr+Loop version (which itself is highly optimized). Thus, our data optimization is beneficial even with data caches. It should also be emphasized that the Base+Distr+Loop version is already a highly optimized version, and it is really difficult to further improve its energy behavior.
While the experimental data presented so far clearly demonstrate the energy benefits of our strategy, to be fair, one needs to consider the performance impact as well. Therefore, in Figure 7, we give the percentage increase in original execution cycles (i.e., the cycles when no power control is present) when the proposed data transformation is used. Overall, one can see that the increase in execution cycles varies from 0.79% to 2.21% depending on the benchmark used, averaging 1.35%. The reason that we do not incur much performance penalty is that the compiler pre-activates a memory bank before it is actually needed. Note that this is possible in our application domain since (considering the array-to-bank mappings) the compiler can accurately predict the next access to a given bank. This bank pre-activation strategy in turn limits the potential degradation in performance.
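The effect of pre-activation on stall cycles can be illustrated with a simple timing model. The sketch below is an assumption-laden illustration, not the scheduling logic of our compiler: it assumes that a bank becomes usable exactly `resync` cycles after the activation is issued, and the function names and numbers are hypothetical.

```python
def preactivation_cycle(next_access, resync):
    """Latest cycle at which activation can be issued without stalling."""
    return max(0, next_access - resync)

def stall_cycles(issue_cycle, next_access, resync):
    """Cycles the access waits if activation is issued at issue_cycle."""
    ready = issue_cycle + resync
    return max(0, ready - next_access)

next_access = 1000   # predicted cycle of the next access to the bank
resync = 30          # hypothetical resynchronization cost in cycles

issue = preactivation_cycle(next_access, resync)
print(issue, stall_cycles(issue, next_access, resync))
# Activating late, or only on demand, exposes part or all of the latency.
print(stall_cycles(990, next_access, resync))
print(stall_cycles(next_access, next_access, resync))
```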
Concluding remarks
Energy consumption is becoming a first-order design parameter as processor-based systems continue to become more and more complex. Off-chip memory energy consumption in particular can be a limiting factor in many system designs. In this work, we focus on executing array-intensive benchmarks in banked memory architectures, and propose a compiler-directed strategy that modifies data layouts in memory to place more memory banks into low-power mode and/or keep memory banks in low-power operating modes longer. The experimental data obtained using nine array-intensive benchmarks and a simulation environment show the potential of our approach in saving memory energy. 
