The memory system usually consumes a significant amount of energy in many battery-operated devices. In this paper, we provide a quantitative comparison and evaluation of the interaction of two hardware cache optimization mechanisms and three widely used compiler optimization techniques used to reduce the memory system energy. Our presentation is in two parts. First, we focus on a set of memory-intensive benchmark codes and investigate their memory system energy behavior due to data accesses under hardware and compiler optimizations. Then, using four motion estimation codes, we look at the influence of compiler optimizations on the memory system energy considering the overall impact of instruction and data accesses.
INTRODUCTION
Hardware and software techniques to reduce energy consumption have become an essential part of current system designs [27]. Such techniques have particularly targeted the memory system due to the prevalent use of data-dominated signal and video applications in mobile devices. Various low-power circuit techniques [4], energy-efficient memory and cache architectures [1, 9, 12, 7, 19], and power-aware compilation techniques [16, 3] have been proposed. However, there is still much work to be done in understanding the interaction of hardware and software optimizations and in evaluating their relative energy gains.
This paper explores the interaction of hardware and software optimizations by considering the memory system energy savings obtained when using energy-efficient cache architectures and compiler optimization techniques. Specifically, we set out to answer the following two questions.
As far as data caches are concerned, what is the influence of compiler optimizations and hardware enhancements on on-chip cache/off-chip memory energy? Apart from presenting a comprehensive energy evaluation of hardware and compiler techniques, such a study will also reveal how these optimizations interact with each other.
How do the variations in energy due to data accesses and energy due to instruction accesses compare when performance-oriented compiler optimizations are applied?
Consequently, our presentation is in two parts.
In the first part of our presentation (Section 2), we first apply the hardware and software optimization techniques individually using the original codes and base cache configurations, respectively, in order to evaluate their relative energy savings in the memory system. This evaluation is performed using a suite of data-intensive codes that operate on multi-dimensional arrays which are typical in many embedded video and signal processing applications. Further, energy consumption is estimated using an energy model for a memory system that consists of an on-chip data cache and an off-chip main memory.
Next, we evaluate various energy-efficient on-chip cache configurations with different combinations of cache sub-banking and block buffering optimizations [19, 11]. These techniques help to reduce the cache energy by limiting the data access to just one bank of a cache or to just the previously accessed cache line. We find that these optimizations result in an energy saving of 44-89% on average in the data cache. After that, we evaluate the energy consumption of the memory system (considering the data cache and off-chip memory) after applying three widely used code optimization techniques, namely, linear loop transformations (e.g., loop permutation, loop skewing, etc.), loop tiling and loop unrolling [14, 21, 22]. In this case, we find that the energy consumption in the data cache increases on average by 51%. This is due to the increase in data accesses as a result of applying these optimizations in conjunction with scalar replacement [14]. However, considering just the cache energy consumption is misleading, as the energy of the entire memory system decreases by 42-79% after applying these transformations. This effect is due to the decreased number of accesses to off-chip memory resulting from improved locality after applying the optimizations. These results also corroborate a popular belief that the most significant energy gains can be obtained at the higher levels of system design [8].
We find that the compiler optimizations reduce the overall memory system energy consumption by up to 75% (79%) as compared to the maximum of 4% (15%) reduction achieved when using 4K (16K) energy-efficient cache architectures. This result shows the importance of energy-aware software in system design. It must also be emphasized that the hardware optimizations are still important for reducing the on-chip energy consumption. As a final set of experiments in this part, we apply the hardware and software optimizations concurrently and find that the combination of the considered hardware and software optimizations results in up to 88% savings in energy consumption of the memory system due to data accesses.
In the second part of our presentation (Section 3), we focus on the energy consumption considering the entire memory system including both instruction and data accesses. This is important as many of the high-level compiler optimizations are oriented more towards improving data locality than instruction locality and code size. Consequently, aggressive data locality optimizations may be detrimental from the perspectives of code size and energy consumption for instruction accesses. These [4]. In this paper, we focus on two cache optimizations, namely, block buffering [19, 9, 12] and cache sub-banking [19, 9, 12], as the cache is one of the major energy-consuming components in current processors. We choose these cache optimizations since the effectiveness of block buffering is influenced by software optimizations while that of sub-banking is not. Also, neither technique affects performance in a noticeable way.
In the block buffering scheme, the previously accessed cache line is buffered for subsequent accesses. If data within the same cache line is accessed on the next data request, only the buffer needs to be accessed. This avoids the unnecessary and more energy-consuming access to the entire cache data array. Thus, increasing the temporal locality of the cache line through software techniques can save more energy. In the cache sub-banking optimization, the data array of the cache is divided into several sub-banks and only the sub-bank where the desired data is located is accessed. This optimization reduces

Figure 1d (t denotes the tile size). As long as the tiles from the three arrays fit in the cache, we can expect good performance [22]. We also expect a decrease in the power consumed in memory, due to better data reuse after applying tiling. It must be noted that, in comparison to linear loop transformations, tiling exploits temporal locality across multiple loop levels.

computations. More specifically, the same [3] value needs to be loaded once for each of the computations (a total of b times). Thus, the outer loop unrolling can reduce the number of times that the elements of the array w need to be loaded by a factor of b. From the energy perspective, fewer accesses to the memory mean less energy consumption.
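The effect of outer loop unrolling on the number of loads of w can be illustrated with a small counting sketch. The kernel below (a matrix-vector product) is an assumed stand-in, since the original loop nest is not reproduced here; the function names and the load counter are hypothetical and serve only to make the factor-of-b reduction visible.

```python
def matvec_baseline(a, w):
    """y[i] = sum_j a[i][j] * w[j]; w[j] is loaded again for every row i."""
    n, m = len(a), len(w)
    y = [0] * n
    w_loads = 0
    for i in range(n):
        for j in range(m):
            wj = w[j]          # one load of w per (i, j) pair
            w_loads += 1
            y[i] += a[i][j] * wj
    return y, w_loads


def matvec_unrolled(a, w, b):
    """Outer loop unrolled by b and jammed: one load of w[j] serves b rows."""
    n, m = len(a), len(w)
    assert n % b == 0, "sketch assumes b divides the row count"
    y = [0] * n
    w_loads = 0
    for i0 in range(0, n, b):
        for j in range(m):
            wj = w[j]          # loaded once, then kept in a register
            w_loads += 1
            for k in range(b):  # stands for the b copies of the unrolled body
                y[i0 + k] += a[i0 + k][j] * wj
    return y, w_loads


# Assumed example sizes: 8 rows, 6 columns, unrolling factor b = 4.
a = [[i + j for j in range(6)] for i in range(8)]
w = list(range(6))
y_base, loads_base = matvec_baseline(a, w)
y_unr, loads_unr = matvec_unrolled(a, w, 4)
```

In this example both versions compute the same result, while the unrolled version issues 12 loads of w instead of 48.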
Experimental Strategy
In order to evaluate the effectiveness and interaction of the hardware and software optimizations, the C versions of the benchmarks shown in Table I were used. All these codes are representative of the multi-dimensional array domain, the domain that many signal and video processing applications belong to. In this study, we zoom in on the mxm benchmark results when varying the different parameters and then summarize the behavior across all benchmarks at the end. To determine the energy consumed by these codes, we obtained memory reference traces while executing the benchmarks using the SimplePower cycle-accurate simulator [24]. These traces were then analyzed using a configurable memory system simulator that was built in-house. The memory system simulator allows the configuration of cache sizes, block sizes, associativities, write and replacement policies, and the number of cache sub-banks and cache block buffers used. In particular, we used the various cache configurations shown in Table II. Henceforth, all the reported results will use the configuration numbers shown in this table. Also incorporated in our memory system simulator are the on-chip cache energy model proposed in [11] using 0.8 µm technology parameters [20], the off-chip main memory energy per access cost of the Cypress CY7C1326-133 chip, and the I/O pad energy costs [18].
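The trace-driven flow described above can be sketched as follows. This is a minimal illustration under strong simplifying assumptions, not the in-house simulator: the cache is direct-mapped, there is a single block buffer, sub-banking is modeled by dividing the data array energy by the number of sub-banks, and the per-access energy constants (e_array, e_buffer, e_offchip) are hypothetical placeholders in arbitrary units rather than values from the model in [11].

```python
def simulate(trace, cache_bytes=4096, line_bytes=32, sub_banks=1,
             block_buffer=False, e_array=1.0, e_buffer=0.1, e_offchip=50.0):
    """Replay a list of byte addresses and accumulate hit/miss and energy counts."""
    n_lines = cache_bytes // line_bytes
    tags = [None] * n_lines      # line number stored in each cache entry
    last_line = None             # line currently held in the block buffer
    stats = {"buffer_hits": 0, "cache_hits": 0, "misses": 0, "energy": 0.0}
    for addr in trace:
        line = addr // line_bytes
        if block_buffer and line == last_line:
            # Same line as the previous access: only the buffer is read.
            stats["buffer_hits"] += 1
            stats["energy"] += e_buffer
            continue
        # Full lookup: only one sub-bank of the data array is activated.
        stats["energy"] += e_array / sub_banks
        idx = line % n_lines
        if tags[idx] == line:
            stats["cache_hits"] += 1
        else:
            stats["misses"] += 1
            stats["energy"] += e_offchip   # off-chip main memory access
            tags[idx] = line
        last_line = line
    return stats


# Assumed example trace: 64 consecutive 4-byte words (8 cache lines).
trace = [4 * i for i in range(64)]
base = simulate(trace)
buffered = simulate(trace, block_buffer=True)
banked = simulate(trace, sub_banks=4)
```

With this sequential trace, the block buffer absorbs the 56 accesses that fall in the previously accessed line, so only the 8 line fills touch the cache data array. The miss count, and hence the off-chip energy, is the same in all three runs.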
To evaluate the impact of compiler optimizations on the overall energy consumption, we used a high-level compilation framework based on loop (iteration space) and data (array layout) transformations. For this study, the framework proposed in [13] was enhanced with iteration space tiling and loop unrolling. Our enhanced framework takes as input a code written in C and applies the enabled optimizations to generate the optimized high-level C code. The tiling technique employed is similar to the one explained in [22] and selects a suitable tile size for a given code, input size, and cache configuration. The loop unrolling algorithm carefully weighs the advantages of increasing data reuse against the disadvantages of a larger loop body in selecting an optimal degree of unrolling (the parameter b in Fig. e) and is similar in spirit to the technique discussed in [22]. We generated the optimized codes after applying the loop transformation, tiling and unrolling optimizations individually and also in combination.

Figure 8 shows the memory system energy consumption when a combination of different software and hardware optimizations is applied. The corresponding variation in on-chip data cache energy consumption for these optimizations is given in Figure 9. It can be observed from Figure 8

matches, according to a certain criterion, a given block in the current image frame [25]. The displacement between the coordinates of the block in the current frame and the matched block in the reference frame is called a motion vector. Brief descriptions of these four codes are as follows:
Full Search
In full search block-matching motion estimation, each reference macro-block is compared to all candidate macro-blocks in the search area to determine the best match [6]. It is able to find the best matched block, but requires a significant amount of computation. This is the most data-intensive version of the codes we used.
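The exhaustive comparison described above can be sketched as follows. The matching criterion is assumed here to be the sum of absolute differences (SAD), which the text does not specify, and the function names, frame sizes, and search range are hypothetical choices made for illustration.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between an n x n block and a displaced candidate."""
    total = 0
    for y in range(n):
        for x in range(n):
            total += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
    return total


def full_search(cur, ref, bx, by, n, search_range):
    """Try every displacement in the search area; return (dx, dy, cost) of the best."""
    h, w = len(ref), len(ref[0])
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            ry, rx = by + dy, bx + dx
            if ry < 0 or rx < 0 or ry + n > h or rx + n > w:
                continue  # candidate block falls outside the reference frame
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best[1], best[2], best[0]


# Assumed test frames: the current frame is the reference frame shifted
# by 2 pixels horizontally and 1 pixel vertically (with wrap-around).
ref = [[x * x + 3 * y * y + x * y for x in range(12)] for y in range(12)]
cur = [[ref[(y + 1) % 12][(x + 2) % 12] for x in range(12)] for y in range(12)]
mv = full_search(cur, ref, bx=4, by=4, n=4, search_range=3)
```

For these frames the search recovers the motion vector (2, 1) with a cost of 0. Every candidate reads a full block of the reference frame, so the number of data accesses grows with the square of the search range.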
Hierarchical Motion Estimation
The hierarchical block matching algorithm is based on the image pyramid [17]. The

Figure 12 shows the change in energy consumption due to data accesses after applying the high-level optimizations. It is observed that the energy reduction is most significant for the full search algorithm, which is the most data-intensive. This reduction is due to the significant decrease in the number of data accesses as a result of improved locality. For example, scalar replacement converts memory references to register accesses. However, this also leads to an increase in dynamic instruction count.
We can also see from Figure 12 that, except for one case, high-level compiler optimizations improve the data energy consumption for all motion estimation codes in all configurations. The average data energy reduction over all studied cache sizes is 30.9% for direct-mapped caches, 39.4% for 2-way caches and 39.8% for 4-way caches. Our experiments also show that in hier and parallel hier, after the optimizations, there is an increase in the number of conflict misses (as we do not use array padding [15]). In particular, with parallel hier, when the cache size is very small and the cache is direct-mapped, these conflict misses offset any benefits that would otherwise be obtained from improved data locality, thereby degrading the performance from the energy perspective. Increasing the associativity eliminates this conflict miss problem.
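The conflict miss behavior discussed above can be reproduced with a small miss counter. This is an illustrative sketch only: the LRU replacement policy, the cache and line sizes, and the array base addresses are assumptions chosen to provoke conflicts, and they do not correspond to the configurations or codes used in the experiments.

```python
def count_misses(trace, cache_bytes, line_bytes, ways):
    """Count misses of a set-associative cache with LRU replacement."""
    n_sets = cache_bytes // (line_bytes * ways)
    sets = [[] for _ in range(n_sets)]   # per set: resident lines, most recent last
    misses = 0
    for addr in trace:
        line = addr // line_bytes
        s = sets[line % n_sets]
        if line in s:
            s.remove(line)               # hit: refresh the LRU position
        else:
            misses += 1
            if len(s) == ways:
                s.pop(0)                 # evict the least recently used line
        s.append(line)
    return misses


def interleaved(base_a, base_b, n_elems, elem_bytes=4):
    """Addresses for the access pattern a[0], b[0], a[1], b[1], ..."""
    trace = []
    for i in range(n_elems):
        trace.append(base_a + i * elem_bytes)
        trace.append(base_b + i * elem_bytes)
    return trace


# Assumed layout: a 1 KB cache with 32-byte lines; array b starts exactly
# one cache size after array a, so a[i] and b[i] map to the same set.
conflicting = interleaved(0, 1024, 64)
padded = interleaved(0, 1024 + 32, 64)   # b shifted by one line of padding

dm_misses = count_misses(conflicting, 1024, 32, ways=1)
two_way_misses = count_misses(conflicting, 1024, 32, ways=2)
padded_misses = count_misses(padded, 1024, 32, ways=1)
```

In this example every one of the 128 accesses misses in the direct-mapped cache, whereas the 2-way cache and the padded layout incur only the 16 compulsory line fills.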
It can be observed from Figure 14 
