Abstract-With the emergence of a plethora of embedded and portable applications, energy dissipation has joined throughput, VLSI layout area, and accuracy/precision as a major design constraint. Thus, designers must be concerned with both estimating and optimizing the energy consumption of circuits, architectures, and software. Most of the research in energy optimization and/or estimation has focused on single components of the system and has not looked across the interacting spectrum of the hardware and software. The novelty of our energy estimation framework, SimplePower, is that it evaluates the energy considering the system as a whole rather than just as a sum of parts, and that it concurrently supports both compiler and architectural experimentation. We present the design and use of the SimplePower framework, which includes a transition-sensitive, cycle-accurate datapath energy model that interfaces with analytical energy models for the memory and clock subsystems and a transition-sensitive energy model for the bus subsystem. Such an architectural-level energy estimation framework is invaluable in making good energy-conscious decisions early in the design cycle. We analyzed the energy consumption of 10 codes from the multidimensional array domain, a domain that is important for embedded video and signal processing systems. Our study shows that the pipeline registers and the register file are the datapath energy hotspots, consuming 58-70 percent of overall datapath energy, and that the clocking of the on-chip memory structures is the major source of the on-chip clock network's energy consumption. Further, we find that the off-chip main memory is the overall energy bottleneck of the entire system. However, we found that the application of high-level compiler optimizations reduces the main memory energy significantly, causing the contribution of the data cache, on-chip clock network, instruction cache, and datapath to become more important.
We found that the improved locality of the optimized codes is useful in not only reducing the accesses to the main memory but also in exploiting the more energy-efficient cache architectures much better than unoptimized codes. Optimized codes saved 21 percent more energy using the most recently used way-prediction cache scheme as compared to executing unoptimized codes from the multidimensional array domain. We also observed that emerging technologies such as embedded DRAM coupled with a combination of energy-efficient circuit, architectural and compiler optimizations can potentially shift the energy hotspot. Thus, we have demonstrated that early estimates from the powerful SimplePower energy estimation framework can help one to identify the system energy hot spots and enable architects and compiler designers to focus their efforts on these areas.
INTRODUCTION
WITH more than 95 percent of current microprocessors going into embedded systems [2], the need for low power design has become vital. Even in environments not limited by battery life, power has become a major constraint due to concerns about circuit reliability and packaging costs. The increasing need for low power systems has motivated a large body of research on low power processors. Most of this research, however, focuses on reducing the energy in isolated subsystems (e.g., the processor core, the on-chip memory, etc.) rather than the system as a whole [3]. The focus of our research is to provide an insight into the energy hot spots in the system and to evaluate the implications of applying a combination of architectural and software optimizations on the overall energy consumption across the different parts of the system. In order to perform this research, architectural-level power estimation tools that provide a fast evaluation of the energy impact of various optimizations early in the design cycle are essential [4]. However, only prototype research tools and methodologies exist to support such high-level estimation. In this paper, we present the design of an architectural-level energy estimation framework, SimplePower [1]. This framework is, perhaps, the first with a capability to evaluate the integrated impact of hardware and software optimizations on the overall system energy. In contrast to the coarse-grained current-measurement-based techniques [5], [6], our new tool is cycle-accurate and provides a fine-grained energy estimate of the processor core (currently a five-stage pipelined instruction set architecture (ISA)) while also accounting for the energy consumed by the memory and bus subsystems. SimplePower also builds on the SimpleScalar toolset [7], as it executes the integer subset of the SimpleScalar ISA.
The memory subsystem is the dominant source of power dissipation in various video and signal processing embedded systems [8], [9]. Existing low power work has focused on addressing this problem through the design of energy-efficient memory architectures and power-aware software [10], [11], [12]. However, most of these efforts do not study the influence on the energy consumption of the other system components, and even fewer consider the integrated impact of the hardware and software optimizations. It is important to evaluate the influence of optimizations on the overall system energy savings and the power distribution across different components of the system. Such a study can help identify the changes in system energy hot spots and enable the architects and compiler designers to focus their efforts on addressing these areas.
This study embarks on this ambitious goal, specifically trying to answer the following questions:
• What is the energy consumed across the different parts of the system? Is it possible to evaluate this energy distribution in a fast and accurate fashion for different applications?
• What is the effect of the state-of-the-art performance-oriented compiler optimizations on the overall system energy consumption and on each individual system component? Does the application of these optimizations cause a change in the energy hotspot of the system?
• What is the impact of power- and performance-oriented memory system modifications on the energy consumption? How do compiler optimizations influence the effectiveness of these modifications?
• What is the impact of advances in process technology on the energy breakdown of the system? Can emerging new technologies (e.g., embedded DRAMs [13], [14]) result in major paradigm shifts in the focus of architects and compiler writers?
This work studies these issues in a unified framework for the entire system. This paper sets out to answer some of the above questions using codes drawn from the multidimensional array domain, a domain that is important for signal and video processing embedded systems. While studies in energy-performance tradeoffs are also important, this issue is orthogonal to our goal, which focuses on the estimation and optimization of system energy.
Our work is complementary to recent studies that have focused on estimating and optimizing power consumption using simulation-based frameworks. Brooks et al. have presented Wattch [15], a framework for analyzing and improving power dissipation of microprocessors. They show that Wattch is much faster than current circuit-level energy estimation tools at the expense of some accuracy. Simunic et al. [16] have focused on embedded systems relying on current-measurement-based models and presented a cycle-accurate energy simulator. The results obtained from simulating the energy consumption of a StrongARM-1100-based system indicate that their simulation accuracy is within 5 percent of hardware measurement. Our framework is also primarily intended for embedded systems and is unique in its support for studying the influence of compiler optimizations on energy consumption.
The rest of this paper is organized as follows: The next section presents the design of our energy estimation framework, SimplePower. Section 3 presents the distribution of energy across the different system components using a set of benchmark codes. The influence of performance-oriented compiler optimizations on system energy is examined in Section 4. Section 5 investigates the influence of energy-efficient cache architectures on system power. Section 6 studies the implications of emerging memory technologies for system energy. Finally, Section 7 summarizes the contributions of this work and outlines directions for future research.
SIMPLEPOWER: AN ENERGY ESTIMATION FRAMEWORK
Answering the questions posed in Section 1 requires tools that allow the architect and compiler writer to estimate the energy consumed by the system. The energy estimation framework that we have developed for this purpose, SimplePower, is depicted in Fig. 1 . For the purposes of this work, we are using the system defined in Fig. 2a consisting of the processor core (datapath), on-chip instruction and data caches, off-chip memory, and the interconnect buses between the core and the caches and between the caches and the off-chip memory. What we need in our framework are tools that allow us to estimate the energy consumed by each of the modules in the system.
Modeling Caches
Analytical models for memory components have been used successfully by several researchers [17] , [11] to study the power tradeoffs of different cache/memory configurations. These models attempt to capture analytically the energy consumed by the memory address decoder(s), the memory core, the read/write circuitry, sense amplifiers, and cache match support hardware (e.g., tag match logic). Some of these models can also accommodate low power cache and memory optimizations such as cache block buffering [17] , cache subbanking [11] , [17] , bit-line segmentation [10] , etc. These analytical models estimate the energy consumed per access, but do not accommodate the energy differences found in sequences of accesses. For example, since energy consumption is impacted by switching activity, two sequential memory accesses may exhibit different address decoder energy consumption. For memories, the energy consumed by the memory core and sense amplifiers dominates these transition-related differences. Thus, simple analytical energy models for memories have proven to be quite reliable. This is the approach used in SimplePower to estimate the energy consumed in the memories.
Modeling the Buses
The energy consumption of the buses depends on the switching activity on the bus lines and the interconnect capacitance of the bus lines (with off-chip buses having much larger capacitive loads than on-chip buses). When the switching activity is captured by the energy model, we refer to the technique as a transition-sensitive approach (in contrast to, for example, the analytical model used for the memory subsystem). The energy model used by SimplePower for system buses is transition-sensitive. A wide variety of techniques have been proposed to reduce system level interconnect energy ranging from circuit level optimization, such as using low-swing or charge recovery buses, to architectural level optimizations such as using segmented buses, to algorithmic level optimizations, such as using signal encoding (encoding the data in such a way as to reduce the switching activity on the buses) [18] . As technology scales into the deep submicron, chip sizes grow, and multiprocessor chip architectures become the norm, system level interconnect structures will account for a larger and larger portion of the chip energy and delay. In this paper, we include the energy consumed in the off-chip buses with the main memory energy consumption and the on-chip buses with the cache energy consumption, unless specified otherwise.
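The transition-sensitive bus model described above can be illustrated with a minimal sketch. It assumes that each toggled line costs 0.5 * C * Vdd^2, a common CMOS approximation that the text does not spell out, and the function names are hypothetical. The 0.5 pF on-chip load and the 3.3 V supply used in the usage note are the values quoted elsewhere in this paper.

```python
def count_transitions(prev_word: int, curr_word: int, width: int = 32) -> int:
    """Number of bus lines that toggle between two consecutive words."""
    mask = (1 << width) - 1
    return bin((prev_word ^ curr_word) & mask).count("1")


def bus_energy(words, cap_per_line_farads, vdd_volts, width=32):
    """Sum the switching energy over a sequence of words driven on a bus.

    Assumption (not from the paper): every line toggle costs
    0.5 * C * Vdd^2, and the bus initially holds the first word.
    """
    toggles = sum(
        count_transitions(a, b, width) for a, b in zip(words, words[1:])
    )
    return 0.5 * cap_per_line_farads * vdd_volts ** 2 * toggles
```

For instance, `bus_energy([0x0, 0xF, 0xF, 0x0], 0.5e-12, 3.3)` counts eight line toggles; repeating the same word costs nothing, which is exactly what a per-access analytical model cannot capture.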
Modeling the Clock Network
The SimplePower framework also uses an analytical energy model to account for the energy consumed by the clock generation and clock distribution network. Fig. 3 shows a schematic of the modeled clock network. The clock subsystem of the target architecture is implemented using a first-level H-tree and a distributed driver approach that supplies clocking to four main units: the data cache, the instruction cache, the register file, and the datapath (pipeline registers). The simulated architecture assumes static CMOS gates and single-phase clocking for all sequential logic. We also model the impact of clock gating, at different levels and for different units. The key to the clock power model is to estimate the capacitive load due to the different parts of the clock network shown in Fig. 3. First, we consider the capacitive load on the clock network due to the precharge circuitry in the on-chip memory structures. Second, we consider the power consumption of the clock generator, namely, the Phase Locked Loop (PLL) in a locked condition. The pipeline registers form the next component of interest in our clock network model. Next, we have to consider the additional load due to the drivers and buffers in the clock network. The clock buffers at the terminal points of the distribution network are assumed to be built as a chain of variable-size inverters optimized for speed. Finally, the clock capacitance due to interconnect wiring is calculated for an H-tree distribution network using a model proposed by the authors in [19]. For validation purposes, the predicted values from the clock energy model were compared against the energy measurements given by HSPICE simulations of clock distribution layouts. The average error for the analytical clock model was found to be within 10 percent. More details of our validation are presented in [20].
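The load-summing idea behind the clock model can be sketched as follows. This is only an illustration under stated assumptions: the clock node is taken to charge and discharge once per cycle, so each ungated load C contributes C * Vdd^2 per cycle; the PLL is left out; and every capacitance below is a placeholder rather than a value from the paper. The function and dictionary names are hypothetical.

```python
def clock_energy_per_cycle(loads_farads, vdd_volts, gated=()):
    """Energy drawn by the clock distribution network in one cycle.

    Assumption: one full charge/discharge of the clock node per cycle,
    so each load C that is not clock-gated contributes C * Vdd^2.
    """
    active = sum(c for name, c in loads_farads.items() if name not in gated)
    return active * vdd_volts ** 2


# Placeholder capacitances for illustration only (not from the paper).
loads = {
    "dcache_precharge": 40e-12,
    "icache_precharge": 40e-12,
    "register_file": 5e-12,
    "pipeline_registers": 10e-12,
    "buffers_and_wiring": 15e-12,
}
```

Calling `clock_energy_per_cycle(loads, 3.3, gated=("pipeline_registers",))` models a cycle in which the pipeline registers are clock-gated, for example during a stall.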
Modeling the Processor Core
The final system module to be considered is the processor core. To support the architecture and compiler optimization research posed in Section 1, the energy estimation of the core must be transition-sensitive. At this point in the design process, in order to support "what-if" experimentation, the processor core is specified only at the architectural level (RTL level). However, without the structural capacitance information that is part of a gate-level design description (obtained via time-consuming logic synthesis) and the interconnect capacitance information that is part of a physical-level design description (obtained via very time-consuming VLSI layout), it is difficult to obtain the capacitance values needed to estimate energy consumption. SimplePower solves this dilemma by using predefined, transition-sensitive models for each functional unit to estimate the energy consumption of the datapath. This approach was first proposed by Mehta et al. [21]. These transition-sensitive models contain switch capacitances for a functional unit for each input transition, obtained from VLSI layouts and extensive HSPICE [22] simulation. Once the functional unit models have been built, they can be reused for many different architectural configurations.
SimplePower is, at this time, only capturing the energy consumed by the datapath. Developing transition-sensitive models for the control path would be extremely difficult. One way to model control path power would be analytically. In any case, for the SimplePower processor core, the energy consumed by the datapath is much larger than the energy consumed by the control logic due to the relatively simple control logic. The architecture simulated by SimplePower in this paper is the integer ISA of SimpleScalar, a five-stage RISC pipeline. Functional unit energy models have been developed for various units including (for 2.0 μm, 0.8 μm, and 0.35 μm technologies) flip-flops, adders, register files, multipliers, ALUs, barrel shifters, and decoders.
SimplePower outputs the energy consumed from one execution cycle to the next. It mines the transition-sensitive energy models provided for each functional unit and sums them to estimate the energy consumed by each instruction cycle. The energy model is a table containing the switch capacitance for each input transition of the functional unit exercised. Table 1 shows the structure of such a table for an n-input functional unit.
Developing Transition-Sensitive Energy Models
The construction of these tables is based on the structure of the functional units. Each functional unit falls into one of two classes: bit-independent functional units or bit-dependent functional units. In a bit-independent functional unit, the operation for each bit slice does not depend on the values of other bit slices. In this case, the only switch capacitance table we need is a small table for a one-bit slice. The total energy consumed by the module can be calculated by summing the energy consumed by each bit transition. Examples of bit-independent functional units include pipeline registers, the logic unit in the ALU, latches, and buses.
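For a bit-independent unit, the one-bit-slice lookup can be sketched as below. The table entries are hypothetical placeholders (real entries would come from layout and circuit simulation), and converting switch capacitance to energy as 0.5 * C * Vdd^2 is an assumption made for this illustration, not a convention stated in the text.

```python
# Hypothetical one-bit-slice table: switch capacitance in farads,
# indexed by (previous bit, current bit). Placeholder values.
SLICE_CAP = {(0, 0): 0.0, (0, 1): 30e-15, (1, 0): 25e-15, (1, 1): 0.0}


def switched_capacitance(prev_word, curr_word, width, table=SLICE_CAP):
    """Sum the per-slice switch capacitance over all bit positions."""
    total = 0.0
    for i in range(width):
        prev_bit = (prev_word >> i) & 1
        curr_bit = (curr_word >> i) & 1
        total += table[(prev_bit, curr_bit)]
    return total


def energy_from_capacitance(cap_farads, vdd_volts):
    """Assumed conversion: E = 0.5 * C_switch * Vdd^2."""
    return 0.5 * cap_farads * vdd_volts ** 2
```

Because every slice shares the same four-entry table, the model for a 32-bit pipeline register is no larger than the model for a single flip-flop.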
In a bit-dependent functional unit, the operation in one bit slice depends on the values of other bit slices (for example, a 32-bit adder). Their energy characterization is based on a table lookup consisting of a full energy transition matrix where the row address is the previous input vector, the column address is the present input vector, and the matrix value is the switch capacitance (as shown in Table 1). Unfortunately, the size of this table grows exponentially with the size of the inputs. A clustering algorithm helps with this problem [21] while bounding the loss in accuracy. Although this algorithm can compress the table for a 16-bit ripple carry adder (with 2^32 entries) to a table with only 97 entries, it is very difficult to compress the table for a 32-bit adder with acceptable accuracy. Lin et al. proposed a power modeling and characterization method for functional units (called macrocells in their paper) using structure information [23]. However, their technique was more suitable for circuits with small input size. For instance, it took 29.3 hours to simulate a 4-bit fast adder (nine inputs) with a reduced number of entries (26,244).
To address the table size problem, we partition the functional units into smaller submodules. For example, a register file (shown in Fig. 2b) is partitioned into five major submodules: five 5:32 decoders, word-line drivers, write data drivers, read sense-amplifiers, and a 32 × 32 cell array. Energy tables were constructed for each submodule. For example, a 1,024-entry table indexed by the pair of five current and five previous register select bits was developed for the register file decoder component. This table is then shared by all five decoders in the register file. Since the write data drivers, read sense amplifiers, word-line drivers, and array cells are all bit-independent submodules, their energy tables are quite small. Summing the submodule energies was found to underestimate the register file energy as compared to circuit-level HSPICE simulation of the entire register file, because the interconnect capacitance between the submodules was not accounted for. A constant multiplicative factor based on the technology (between 1.1 and 1.2 for 0.8-0.35 μm) was used to account for the interconnect capacitance between the submodules. Our final model was then found to be within 2 percent of the HSPICE-simulated energy values for randomly chosen input transitions. For the 32 × 32 5-port register file, our power estimation approach took less than 0.1 seconds for each input transition as opposed to the 556.42 seconds required for circuit-level simulation using HSPICE. The machine running the HSPICE simulation and our simulator is a SUN ULTRA-10 with 640 MBytes of memory.
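The decoder table and the interconnect correction described above can be sketched as follows. The bit-packing order of the index and the function names are assumptions made for illustration; only the table size (1,024 entries) and the range of the multiplicative factor (1.1 to 1.2) come from the text.

```python
SELECT_BITS = 5
TABLE_SIZE = 1 << (2 * SELECT_BITS)  # 1,024 entries, matching the text


def decoder_table_index(prev_select: int, curr_select: int) -> int:
    """Pack the previous and current 5-bit register selects into one index.

    Assumed layout: previous select in the high bits, current in the low bits.
    """
    mask = (1 << SELECT_BITS) - 1
    return ((prev_select & mask) << SELECT_BITS) | (curr_select & mask)


def register_file_capacitance(submodule_caps, interconnect_factor=1.15):
    """Sum submodule switch capacitances and scale for inter-submodule wiring.

    1.15 is an arbitrary midpoint of the 1.1-1.2 range quoted in the text.
    """
    return interconnect_factor * sum(submodule_caps)
```

A single table of `TABLE_SIZE` entries serves all five decoders, so the cost of characterizing the decoder is paid once per technology.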
Combining Memory System Components
As mentioned earlier, SimplePower currently uses a combination of analytical and transition-sensitive energy models for the memory system. The overall energy of the memory system is given by

$$E_{mem} = E_{Icache} + E_{Dcache} + E_{Buses} + E_{Pads} + E_{MM}$$

The energy consumed by the instruction cache (Icache), $E_{Icache}$, and by the data cache (Dcache), $E_{Dcache}$, are evaluated using an analytical model that has been validated to be accurate (within 2.4 percent error) for conventional cache systems [17], [25]. We extended this model to consider the energy consumed during writes as well and have also parameterized the cache models to capture different architectural optimizations. $E_{Buses}$ includes the energy consumed in the address and data buses between the Icache/Dcache and the datapath. It is evaluated by monitoring the switching activity on each of the bus lines assuming a capacitive load of 0.5pF per line. The energy consumed by the I/O pads and the external buses to the main memory from the caches, $E_{Pads}$, is evaluated similarly for a capacitive load of 20pF per line. The main memory energy, $E_{MM}$, is based on the model in [25] and assumes a 
Overall Framework
Finally, we would like to address two important issues regarding the overall framework: flexibility and overall accuracy. First, we address the ease of evaluating other architectural configurations using this framework. The SimplePower framework provides flexibility in experimenting with modifications in the cache hierarchy at the command prompt level. On the other hand, any modifications or additions to the datapath will require modification of the C routines that model the corresponding components in the cycle-accurate simulator. The simulator is written in a modular fashion to facilitate such changes. Further, the energy interface routines that mine the transition-sensitive energy models will need to be modified accordingly. In particular, new components need to be characterized using submodules that have energy models. In some cases, additional energy models will need to be developed through circuit-level simulation to characterize a new component. Finally, any additions or modifications to the instruction set architecture would also require modifications to the compilation framework. Second, we highlight issues concerning the overall accuracy of the entire framework. While we have shown validation using circuit-level simulation for the individual blocks, validating the entire aggregated framework, including the datapath and the memory system, is difficult because circuit-level simulation of the entire framework is very time-consuming. Further, the flexibility provided by the architectural tool in evaluating different configurations makes even the task of choosing a configuration to validate challenging. As a step toward addressing this challenging problem, validation of the entire datapath of a commercial merged RISC/DSP architecture, modeled using smaller validated energy models of datapath submodules, was performed in [24].
Finally, it must be stressed that the purpose of the architectural-level simulator is not to provide highly accurate estimates but to provide fast and relatively accurate estimates for evaluating different high-level alternatives.
ENERGY DISTRIBUTION
With the emergence of energy consumption as a critical constraint in system design, it is essential to identify the energy hotspots of the system early in the design cycle. There has been significant work on estimating and optimizing the system power [3]. However, many have focused on estimating/optimizing only specific components of the system and most do not capture the integrated impact of circuit, architectural, and software optimizations. Further, most existing high-level RTL energy estimation techniques provide a coarse grain of measurement, resulting in 20-40 percent error relative to that of a transistor-level estimator [4]. SimplePower provides an integrated, cycle-accurate energy estimation mechanism that captures the energy consumed in the different components of the system.
In this section, we present the energy characteristics of 10 benchmark codes written in the C language (see footnote 2; the codes are shown in Table 2) from the multidimensional array domain. An important characteristic of these codes is that they access large arrays using nested loops. Applications that run on energy-constrained signal and video embedded processing systems exhibit similar characteristics. Since SimplePower currently works only with integer data types, floating-point data accessed by these codes were converted to integer data. In order to limit the simulation times, we scaled down the input sizes; however, all the benchmarks were run to completion. The experimental cache sizes (1K-16K) used in our study are relatively small as our focus is on resource-constrained embedded systems.
The energy consumed by the system is divided into three parts: datapath energy, memory system energy, and on-chip clock energy. The major energy consuming components of the datapath are the register file, pipeline registers, the functional units (e.g., ALU, multiplier, divider), and datapath multiplexers. The memory system energy includes the energy consumed by the Icache and Dcache, the address and data buses, the address and data pads, and the off-chip main memory. The clock energy is consumed in both the clock generation circuit and the distribution network. Table 3 provides the energy consumption (in mJ) of our benchmarks for the datapath and memory system for various Dcache configurations. For all the cases in this paper, an 8K direct-mapped Icache, line sizes of 32 bytes (for both Dcache and Icache), writeback cache policy, a 2Mbits main memory unit, and a core based on 0.8 μm, 3.3V technology were used. We also present only a single datapath energy value for the different configurations due to the efficient stall power reduction techniques (e.g., clock gating on the pipeline registers) employed in the datapath. Using these techniques, we observed that energy consumed during stall cycles is insignificant for our simulations. For example, tomcatv expends a maximum of 1 percent of the total datapath energy on stalls for all cache configurations studied. Fig. 4 provides the clock energy breakdown due to various clock loads when 8K direct-mapped Icaches and Dcaches are used. The absolute clock energy for the different benchmarks for this configuration is provided in Table 4. It must be observed that the clock energy, unlike datapath energy, is influenced by variation in cache sizes. More details of the influence of different system parameters on clock energy are provided in [26].
2. Original codes are in Fortran and were converted into C by paying particular attention to the original data access patterns.
VIJAYKRISHNAN ET AL.: EVALUATING INTEGRATED HARDWARE-SOFTWARE OPTIMIZATIONS USING A UNIFIED ENERGY ESTIMATION...
TABLE 3 Energy Consumption in the Benchmarks for Various Dcache Configurations
For all the cases, an 8K direct-mapped Icache, 32 byte line sizes, writeback policy, 2Mbits main memory unit, and a core based on 0.8 μm, 3.3V technology are used.
Datapath Energy
We observe that the datapath energy consumption ranges from 1.577mJ to 59.776mJ for the various codes depending on the dynamic instruction length and the switching activity in the datapath. Compared to the memory system energy, the datapath energy is one to two orders of magnitude smaller. This result corroborates the need for extensive research on optimizing the memory system power [11], [8], [17], [27], [3]. Next, we zoom in on the major energy consuming components of the datapath. It is observed from Fig. 5a that the pipeline registers and register file form the energy hotspots in the datapath, contributing 58-70 percent of the overall datapath energy. The extensive use of pipelining in DSP datapaths to improve performance [28] and facilitate other circuit optimizations, such as voltage scaling, will exacerbate the pipeline register energy consumption. Also, larger and multiple-port register files required to support multiple-issue machines will increase the register file energy consumption further. The core energy distribution is also found to be relatively independent of the codes being analyzed. The energy consumed by each stage of the pipeline is calculated by SimplePower and is shown in Fig. 5b. The Icache and Dcache energy consumption are not included in this figure. Also, the decode stage energy does not include control logic energy consumption, which is not modeled by SimplePower. While the pipeline register in the memory stage is the main contributor to the energy consumed in the memory stage, the execution stage of the pipeline that contains the arithmetic units is the major energy consumer in the entire datapath. It must be noted that the register file energy consumption, though larger than arithmetic unit energy consumption, occurs during both the decode and writeback stages.
Memory System Energy
The memory system energy consumption generally decreases as capacity and conflict misses fall when Dcache size or associativity is increased (see Table 3). Yet, in 37 out of the 50 cases, when we move from a 4-way to 8-way Dcache, the memory system energy consumption increases. A similar trend is observed in 15 out of 40 cases when we move from an 8K to 16K Dcache. Moving to a larger cache size or higher associativity increases the energy consumption per access. However, for many cases, this per-access cost is amortized by the energy reduction due to fewer accesses to the main memory. Of course, if the numbers of misses/hits are equal, using a less sophisticated cache leads to lower energy consumption. Fig. 6a shows the energy distribution in the memory system components for a 1K 4-way Dcache configuration where the main memory energy consumption dominates due to the large number of Dcache misses. For btrix and amhtm, the data accesses per instruction are the smallest. In amhtm, the majority of instruction accesses are satisfied from the Icache, resulting in a more significant Icache energy consumption, whereas btrix exhibits relatively poor Icache locality (the number of Icache misses is 100 times more than the next significant benchmark), resulting in increased energy consumption in main memory. When we increase the data cache size, the majority of data accesses are satisfied from the data cache. Hence, the overall contribution of the Icache and Dcache becomes more significant, as observed from Fig. 6b. SimplePower provides a comprehensive framework for identifying the energy hotspots in the system and helps the hardware and software designers to focus on addressing these bottlenecks. The rest of this paper evaluates software and architectural optimizations targeted at addressing the energy hotspot of the system, namely, the energy consumed in data accesses.
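The amortization argument above can be made concrete with a small sketch. It assumes a deliberately simplified two-term model in which every access pays the cache cost and every miss additionally pays a main memory cost; all numeric values in the usage note are placeholders in arbitrary energy units, not measurements from Table 3, and the function name is hypothetical.

```python
def memory_energy(accesses, misses, e_cache_access, e_main_access):
    """Simplified model: cache energy on every access plus main memory energy on misses."""
    return accesses * e_cache_access + misses * e_main_access


# Placeholder scenario (arbitrary energy units, not data from the paper).
small_cache = memory_energy(100_000, 10_000, e_cache_access=1.0, e_main_access=50.0)
large_cache = memory_energy(100_000, 2_000, e_cache_access=1.5, e_main_access=50.0)
same_misses = memory_energy(100_000, 10_000, e_cache_access=1.5, e_main_access=50.0)
```

In this scenario the larger cache wins because the saved misses outweigh its higher per-access cost, while `same_misses` exceeds `small_cache`, matching the observation that a more sophisticated cache only costs energy when it does not remove misses.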
IMPACT OF COMPILER OPTIMIZATIONS
To evaluate the impact of compiler optimizations on the overall energy consumption, we used a high-level compilation framework based on loop (iteration space) and data (array layout) transformations. For this study, the framework proposed in [29] was enhanced with iteration space tiling, loop fusion, loop distribution, loop unrolling, and scalar replacement. Thus, our compiler is able to apply a suitable combination of loop and data transformations for a given input code, with an optimization selection criterion similar to that presented in [29]. Our enhanced framework takes as input a code written in C and applies these optimizations (primarily) to improve temporal and spatial data locality. The tiling technique employed is similar to the one explained in [30] and selects a suitable tile size for a given code, input size, and cache configuration. The loop unrolling algorithm carefully weighs the advantages of increasing data reuse against the disadvantages of larger loop nests in selecting an optimal degree of unrolling and is similar in spirit to the technique discussed in [31].
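As an illustration of two of the transformations named above, iteration space tiling and scalar replacement, the sketch below applies them by hand to a matrix multiplication. It is not output of the compiler framework described here: the paper's compiler operates on C code and selects the tile size automatically, whereas this example is written in Python for brevity and takes the tile size as a parameter. The function name is hypothetical.

```python
def matmul_tiled(a, b, n, tile):
    """Multiply two n x n matrices (lists of lists) using tiled loops."""
    c = [[0] * n for _ in range(n)]
    # Outer loops walk over tiles so each block of data is reused while cached.
    for ii in range(0, n, tile):
        for kk in range(0, n, tile):
            for jj in range(0, n, tile):
                # Inner loops stay within one tile; min() handles partial tiles.
                for i in range(ii, min(ii + tile, n)):
                    for k in range(kk, min(kk + tile, n)):
                        a_ik = a[i][k]  # scalar replacement of a[i][k]
                        for j in range(jj, min(jj + tile, n)):
                            c[i][j] += a_ik * b[k][j]
    return c
```

The three extra loop levels and the `min()` bound computations are an example of the more complex loop structure that such transformations introduce in exchange for better data locality.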
There have been numerous studies showing the effectiveness of these optimizations on performance (e.g., [32] , [33] ); their impact on energy consumption of different parts of a computing system, however, remains largely unstudied. This study is important because these optimizations are becoming popular in power-aware systems, keeping pace with the increased use of high-level languages and compilation techniques on these systems [34] . Through a detailed analysis of the energy variations brought by these techniques, architects can see which components are energy hot spots and develop suitable architectural solutions to account for the influence of these optimizations.
Our expectation is that most compiler optimizations (in particular, those targeted at improving data locality) will reduce the overall energy consumed in the memory subsystem. This is a side effect of reducing the number of off-chip data accesses and satisfying the majority of the references from the cache. Their impact on the energy consumed in the datapath and clock system, on the other hand, is not as clear. Also, as observed in Section 3, the energy consumed in the memory subsystem is much higher than that consumed in the datapath. While this might be true for unoptimized codes (due to the large number of off-chip accesses), it would be interesting to see whether this still holds after the compiler optimizations. Table 3 also shows the datapath and memory system energy consumption that results from applying our compiler transformations. The most interesting observation is that the optimizations increase the datapath energy for all codes except btrix. This increase is due to more complex loop structures and array subscript expressions as a result of the optimizations. Since, in optimizing btrix, the compiler used only linear loop transformations (i.e., the transformations that contain only loop permutation, loop reversal, and loop skewing [33] ), the datapath energy did not increase. Next, we observe that the reduction in the memory system energy makes the datapath energy more significant. For example, after the optimizations, in the mxm benchmark, the datapath energy constitutes 23 percent of the overall system energy for an 8K 8-way Dcache configuration (as compared to 9.2 percent before the optimizations). In fact, the datapath energy becomes larger than that consumed in the memory if we do not consider the energy expended in instruction accesses. This is significant as our optimizations were targeted only at improving the data cache performance.
Thus, it is important for architects to continue to look at optimizing the datapath energy consumption rather than focus only on memory system optimizations.
We also found that the clock energy consumption increased by 22 percent, on the average, after applying the compiler transformations. The transformations have two different influences on the clock energy. First, clock energy consumption increases due to the additional cycles resulting from the more complex operations. Second, the number of stall cycles decreases due to the improved locality. However, the relative clock energy consumed during stall cycles itself is small as only the clock generation circuit and the main drivers are active in these cycles. Thus, the first influence dominates the second influence of the transformations.
The compiler optimizations had little effect on the energy distribution across the datapath components and pipeline stages, as shown in Fig. 7a and Fig. 7b . However, the energy distribution in the memory system (shown in Fig. 8 ) shows distinct differences from the unoptimized (original) versions (see Fig. 6 ). In the optimized case, the relative contribution of the main memory is significantly reduced due to more Dcache hits. Hence, we observe that the contribution of the Icache and Dcache energy consumption becomes more significant for all the optimized codes. Energy-efficient Icache and Dcache architectures therefore become more important when executing the compiler-optimized codes, and we evaluate the effectiveness of architectural and circuit techniques for designing energy-efficient caches later in this paper.
As mentioned earlier, normally, our compiler automatically selects a suitable set of optimizations for a given code and cache topology. Since, in doing so, it uses heuristics, there is no guarantee that it will arrive at an optimal solution. In addition to this automatic optimization selection, we have also implemented a directive-based optimization scheme which relies on user-provided directives and, depending on them, applies the necessary loop and data transformations. Next, we forced the compiler using these compiler directives to apply all eight combinations of three mainstream loop optimizations, namely, loop unrolling, tiling, and linear loop transformations to the mxm benchmark. The results presented in Fig. 9 reveal that the best compiler transformation from the energy perspective varies based on the cache configuration. This observation presents a new challenge for the compiler writers for the power-conscious systems, as the most aggressive optimizations (although they may lead to minimum execution times) do not necessarily result in the best code from the energy point of view. 
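The directive-driven sweep over all eight combinations of the three loop optimizations can be sketched as follows. This Python fragment is a hypothetical illustration: the function names, the optimization labels, and the energy values are assumptions and do not correspond to the measurements in Fig. 9.

```python
from itertools import product

# Labels for the three mainstream loop optimizations (names are illustrative).
OPTIMIZATIONS = ("unrolling", "tiling", "linear")


def all_combinations(opts=OPTIMIZATIONS):
    """Return every on/off combination of the given optimizations (2**n tuples)."""
    return [
        tuple(name for name, enabled in zip(opts, flags) if enabled)
        for flags in product((False, True), repeat=len(opts))
    ]


def best_combination(energy_of):
    """Pick the combination with the lowest energy according to energy_of(combo)."""
    return min(all_combinations(), key=energy_of)


# Hypothetical energies for one cache configuration (made-up numbers):
# here applying tiling alone happens to beat the most aggressive combination.
hypothetical_energy = {combo: 10.0 - len(combo) for combo in all_combinations()}
hypothetical_energy[("tiling",)] = 5.0
best = best_combination(hypothetical_energy.__getitem__)

if __name__ == "__main__":
    print(len(all_combinations()), "combinations; best:", best)
```

In practice the energy function would be replaced by a SimplePower run for each combination and cache configuration, so the selected combination can differ from one configuration to the next.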
ENERGY EFFICIENT CACHE ARCHITECTURES
The study of cache energy consumption is relatively new and the optimization techniques can be broadly classified as circuit and architectural. The main circuit optimizations include activating only a portion of the cells on the bit (DBL) and word lines, reducing the bit line swings using pulsed word lines (PWL) and isolated sense amplifiers (IBL), and charge recycling in the I/O buffer [10] , [35] . The application of these optimizations is independent of the code sequences themselves. Many architectural techniques have been proposed as optimizations for the memory system [11] , [17] , [36] , [37] . Many of these techniques introduce a new level of memory hierarchy between the cache and the processor datapath. For instance, the work by Kin et al. [36] proposed accessing a small filter cache before accessing the first level cache. The idea is to reduce the energy consumption by avoiding access to a larger cache. While such a technique can have a negative impact on performance, it can also result in significant energy savings. The block-buffering (BB) mechanism [17] uses a similar idea by accessing the last accessed, buffered cache line before accessing the cache. Unlike circuit optimizations, the effectiveness of these architectural techniques is influenced by the application characteristics and the compiler optimizations used. For instance, software techniques can be used to improve the locality in a cache line by grouping successively accessed data. Then, a cache buffering scheme can exploit this improved locality. Thus, increasing spatial locality within a cache line through software techniques can save more energy. A detailed study of such interactions between software optimizations and the effectiveness of energy-efficient cache architectures will be useful to both compiler writers and hardware designers. 
SimplePower provides a powerful framework for studying the combined influence of the circuit, architectural and compiler optimizations on the memory system energy.
Influence of Circuit Optimizations
To capture the impact of circuit optimization in the energy estimation framework, we measured the influence of applying different combinations of circuit optimizations using four different layouts of a 0.5Kbits SRAM using HSPICE simulations. It was observed that the energy consumed can be reduced on an average by 29 percent and 52 percent as compared to an unoptimized SRAM when applying the (PWL+IBL) and (PWL+IBL+DBL) optimizations. We conservatively utilize the 29 percent reduction achieved by the (PWL+IBL) scheme to capture the efficiency of the circuit optimizations. We refer to the (PWL+IBL) scheme as IBL in the rest of this paper for convenience.
Interaction between Compiler and Architectural Optimizations
First, we studied the interaction between the compiler optimizations and the effectiveness of the BB mechanism. In order to study this interaction, the Dcache was enhanced to include a buffer for the last accessed set of cache blocks (one block buffer for each way). A code that exhibits increased spatial and temporal locality can effectively exploit the buffer. We define the relative energy savings ratio of an optimized code (opt) over an unoptimized code (orig) for a given hardware optimization hopt as:
where $E_{optcode}$ and $E_{orig}$ are the energies consumed due to the execution of the optimized and unoptimized code, respectively, without hopt, and $E_{optcode}^{hopt}$ and $E_{orig}^{hopt}$ are the corresponding values with hopt. This measure enables us to evaluate the effectiveness of compiler optimizations in exploiting the hardware optimization technique. Fig. 10 shows the relative energy savings ratio for BB. It can be observed that the block buffer mechanism was more effective in reducing energy for the optimized codes (except for eflux). This is due to the better spatial and temporal locality exhibited by the compiler optimized codes. This improved locality results in more hits in the block buffer.
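The dependence of block buffer hits on locality can be illustrated with a small trace-driven sketch. The Python code below models a single block buffer that holds the last accessed cache block (the direct-mapped case); the function names, the 32-byte block size, and the 8 x 8 array of 4-byte elements are assumptions chosen for illustration and are not the configuration simulated in Fig. 10.

```python
def block_buffer_hits(addresses, block_size=32):
    """Count accesses served by a single block buffer that holds the
    most recently accessed cache block."""
    last_block = None
    hits = 0
    for addr in addresses:
        block = addr // block_size
        if block == last_block:
            hits += 1
        last_block = block
    return hits


def array_trace(n, elem_size, row_major_walk):
    """Byte addresses for visiting an n x n row-major array either
    row by row or column by column."""
    if row_major_walk:
        return [(i * n + j) * elem_size for i in range(n) for j in range(n)]
    return [(i * n + j) * elem_size for j in range(n) for i in range(n)]


row_hits = block_buffer_hits(array_trace(8, 4, row_major_walk=True))
col_hits = block_buffer_hits(array_trace(8, 4, row_major_walk=False))

if __name__ == "__main__":
    print("row-by-row walk hits:      ", row_hits)
    print("column-by-column walk hits:", col_hits)
```

With these assumed parameters each array row fills exactly one block, so the row-by-row walk reuses the buffered block for seven of every eight accesses, while the column-by-column walk changes block on every access and never hits the buffer.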
On average, the optimized codes achieve 19 percent (18 percent) more energy savings relative to the original codes using a direct-mapped (4-way) cache with BB. The reason that the optimized eflux code does not take better advantage of BB than the unoptimized code is that the accesses with temporal locality in the unoptimized code were better clustered, leading to increased data reuse in the block buffer. Next, we applied a combination of IBL and BB and executed the optimized codes to find the combined effect of circuit, architectural, and software optimizations on the overall memory system energy. It can be observed from Fig. 11 that the Dcache energy consumption can be reduced by 58.8 percent (58.7 percent) for the direct-mapped (4-way) cache configuration. Thus, architectural and circuit techniques working together can reduce the energy consumption of even highly optimized codes significantly. While the BB and IBL optimizations are very effective for reducing the energy consumed in the Dcache, it is important to investigate their impact on the overall memory energy reduction. It was found that the memory system energy reduces by 6.7 percent (11 percent) using the direct-mapped (4-way) cache configuration (see Fig. 12 ). We also investigated the influence of the BB + IBL optimization for the Dcache when the energy per main memory access ($E_m$) is reduced as a result of emerging technologies such as embedded DRAM (eDRAM) [13] , [14] . Fig. 13 shows that the combined BB and IBL technique reduces memory system energy by 27.7 percent with new (future) technologies ($E_m = 4.95 \times 10^{-10}$ J) as compared to the 16.3 percent reduction in current technology ($E_m = 4.95 \times 10^{-9}$ J). These small cache buffers, if placed closer to the processing units, can also be beneficial in limiting the interconnect energy consumption that is becoming important in deep-submicron technologies.
SimplePower can similarly be used to evaluate the influence of other new technologies and energy-efficient techniques such as BB on the energy consumed by the system as a whole and an individual component in particular.
Interaction between Different Architectural Optimizations
Next, we evaluated the combination of a most recently used way-prediction cache and the BB mechanism. Way-prediction caches have been used to address the longer cycle time of associative caches as compared to direct-mapped caches [38] . While most prior effort has focused on way-prediction caches for addressing the performance problem, the energy efficiency of these cache architectures was evaluated recently by Inoue et al. [39] . In their work, an MRU (Most Recently Used) algorithm that predicts and probes only a single way first was used. If the prediction fails, all remaining ways are accessed at the same time in the next cycle. We refer to this technique as the MRU scheme and the caches that use it as MRU caches. It must be noted that MRU caches could increase the cache access cycle time [38] , [40] . However, our focus is on energy estimation and optimization rather than on investigating energy-performance tradeoffs. Here, we study the effectiveness of combining two different architectural techniques to optimize system energy and also evaluate the impact of software optimizations enabled by the SimplePower optimizing compiler on the MRU prediction and consequent energy savings. We studied the energy savings that can be obtained using MRU caches for 4-way associative cache configurations. It can be observed from Fig. 14 that the optimized codes benefit more from the MRU scheme and can obtain 21 percent more savings than the original code on average. The increased locality in the optimized codes increases the number of successful probes in the predicted way of the MRU cache. We also find that using the MRU scheme reduces Dcache energy by 70.2 percent on average for optimized codes as compared to using a conventional 8K, 4-way associative cache (see Fig. 15a ). The incremental addition of BB and IBL provided an additional 10.5 percent and 5.5 percent energy reduction, respectively. Fig. 15b shows the energy savings in the entire memory system.
From the study in this section, we find that the optimized codes are not only efficient in reducing the number of costly (in terms of energy) accesses to main memory, but are also more effective in exploiting energy-efficient architectural mechanisms such as MRU caches and BB. We also find that the incremental benefit of applying the BB scheme over an MRU cache is significantly smaller than the benefit of using these techniques individually. A designer can use similar early energy estimates provided by SimplePower to perform energy-cost-performance tradeoffs for new energy-efficient techniques.
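The MRU probing policy described in this section can be sketched as a small simulation that counts how many ways are probed per access. The Python code below is a simplified illustration, not the SimplePower cache model: the function name, the 32-byte block size, and the LRU replacement policy are assumptions, and each set is represented as a list of tags ordered from most to least recently used, so the predicted way is the one holding the first tag.

```python
def mru_probe_count(addresses, num_sets=64, ways=4, block_size=32):
    """Count way probes for an MRU way-prediction cache.

    The predicted (most recently used) way is probed first. If it does not
    hold the block, the remaining ways are probed, so such an access costs
    `ways` probes in total. A conventional cache probes all ways every time.
    """
    sets = [[] for _ in range(num_sets)]  # tags per set, MRU first
    probes = 0
    for addr in addresses:
        block = addr // block_size
        index = block % num_sets
        tag = block // num_sets
        lines = sets[index]
        if lines and lines[0] == tag:
            probes += 1  # prediction succeeded: a single way is probed
            continue
        probes += ways  # predicted way plus the remaining ways
        if tag in lines:
            lines.remove(tag)  # hit in a non-predicted way
        elif len(lines) == ways:
            lines.pop()  # miss in a full set: evict the LRU tag
        lines.insert(0, tag)  # the accessed block becomes the MRU entry
    return probes


# 64 sequential 4-byte accesses (8 blocks) versus two blocks that
# alternate within the same set of an 8K, 4-way cache.
sequential = mru_probe_count(range(0, 256, 4))
alternating = mru_probe_count([0, 64 * 32] * 5)

if __name__ == "__main__":
    print("sequential trace probes: ", sequential, "of", 4 * 64, "conventional")
    print("alternating trace probes:", alternating, "of", 4 * 10, "conventional")
```

In this simplified model the sequential trace needs far fewer probes than a conventional 4-way lookup, while the alternating trace defeats the prediction on every access, which mirrors why codes with better locality benefit more from the MRU scheme.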
IMPLICATIONS OF ENERGY-EFFICIENT MEMORY SYSTEMS
Emerging new technologies combined with the energy-efficient circuit, architectural, and compiler techniques for reducing memory system energy can potentially create a paradigm shift in the importance of energy optimizations from the memory system to the datapath and other units.
Here, we consider the influence of changes in the energy consumed per main memory access, $E_m$. Such changes are imminent due to new process technologies [14] and the reduction in physical distance between the main memory and the datapath. Table 5 shows the memory system energy for different values of $E_m$ for four different cache organizations using two optimized codes. Note that $E_m = 4.95 \times 10^{-9}$ J is the value that we have used so far in this paper. The lowest $E_m$ value that we experiment with in this section ($4.95 \times 10^{-11}$ J) corresponds to the magnitude of the energy per first-level on-chip cache access with current technology.
Recall that the datapath energy consumption for the optimized mxm and psmoo codes was 83.7mJ and 16.1mJ, respectively (see Table 3 ). Considering the fact that large amounts of main memory storage capacity are coming closer to the CPU [14] , we expect to see $E_m$ values lower than $4.95 \times 10^{-9}$ J in the future. Such a change could make the energy consumed in the datapath larger than the energy consumed in memory. For example, with $E_m = 4.95 \times 10^{-10}$ J and a 1K, 4-way cache, the datapath energy becomes larger than the memory system energy for mxm. It must be noted that, while code optimizations such as tiling were effective in reducing the memory system energy, they increase the datapath energy. Since the compiler optimizations used in our SimplePower framework are very popular and used extensively by commercial compilers, we predict that research (in hardware and software) on reducing the datapath energy will become more important in the future.
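The crossover argument can be checked with simple arithmetic. The Python sketch below computes, for each value of $E_m$ considered in this section, the number of main memory accesses whose energy alone would equal the 83.7 mJ datapath energy of the optimized mxm code. The function name is an assumption, and the calculation deliberately ignores cache and bus energy, so the result is only an upper bound on the number of main memory accesses at which the datapath can dominate the memory system; it does not reproduce the entries of Table 5.

```python
E_DATAPATH_MXM = 83.7e-3  # J, datapath energy of the optimized mxm code (Table 3)
E_M_VALUES = (4.95e-9, 4.95e-10, 4.95e-11)  # J per main memory access


def break_even_accesses(e_datapath, e_m):
    """Number of main memory accesses whose energy equals the datapath energy.

    Cache energy is ignored (an assumption), so the memory system can only
    stay below the datapath energy if its access count is under this bound.
    """
    return e_datapath / e_m


bounds = {e_m: break_even_accesses(E_DATAPATH_MXM, e_m) for e_m in E_M_VALUES}

if __name__ == "__main__":
    for e_m, n in bounds.items():
        print(f"E_m = {e_m:.2e} J -> break-even at about {n:.2e} accesses")
```

Each tenfold reduction in $E_m$ raises the bound tenfold, from roughly 1.7e7 accesses at the current value to roughly 1.7e8 at $4.95 \times 10^{-10}$ J, which is why a fixed miss count can move from the memory-dominated to the datapath-dominated regime.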
The energy reduction in the pipeline registers can be a potential target for datapath optimizations as this component was observed to be the major consumer of datapath energy (see Section 3). There have been various efforts to reduce the power consumption of these units by gating the clock of the entire pipeline stage during pipeline stalls or flushes [41] , [42] . In addition to gating the entire pipeline stage, we investigate a simple technique to reduce the pipeline register switching activity by using the control signals of the datapath for selectively gating subsets of the pipeline registers. Pipeline registers latch their data inputs to their outputs unconditionally when the evaluating clock edge arrives. Usually, each pipeline register contains two types of inputs: control and data. The behavior of the functional units following a pipeline register is controlled by the corresponding control signals. If the control signals are not active for a functional unit, the latchings of the data inputs of that functional unit are not necessary since the data will not be used. Thus, we can gate the data portion of the pipeline register by using the corresponding control signals. The advantage is that no extra logic is needed to generate the control signals that are used to gate the clock signal and only the clock gating logic needs to be added. Since all the data bits can share the same gated clock, the control logic overhead is small. For pipeline register MEM/WB, shown in Fig. 16 , fields MemData (32 bits), AluOut (32 bits), RtData (32 bits), and Writereg0 (5 bits) can be gated by the control signal Regwrite0 (1 bit) contained in the control field wb_cntl. These data fields can be gated because Regwrite0 is set based on whether the executed instruction writes back into the register file. Since Regwrite0 is active high, the gated clock for these fields can be implemented as shown in Fig. 16a . 
Because the clock gating logic is shared by all the D flip-flops in those fields, its power overhead is insignificant. Similar opportunities for selectively gating data fields occur in the other pipeline registers. The selective pipeline gating optimization was observed to reduce the datapath energy by 22 percent on average for the codes shown in Fig. 16b . In addition to the datapath energy reduction, the clock energy consumption is reduced by 16 percent on average due to the reduced clock load. More such datapath energy optimizations will become essential as the memory system continues to become more efficient, sometimes at the expense of datapath energy.
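The effect of gating the MEM/WB data fields with Regwrite0 can be illustrated by counting how many flip-flops are clocked over an instruction trace. The Python sketch below uses the field widths given above; the function name and the example trace are assumptions, and the count of clocked bits is only a proxy for switching activity, not an energy estimate.

```python
# Data fields of the MEM/WB pipeline register gated by Regwrite0 (widths in bits).
GATED_FIELDS = {"MemData": 32, "AluOut": 32, "RtData": 32, "Writereg0": 5}


def clocked_data_bits(regwrite0_trace, gated=True):
    """Count data flip-flops that receive a clock edge over a trace.

    regwrite0_trace holds one Regwrite0 value (active high) per cycle.
    Without gating every data bit is clocked each cycle; with gating the
    data fields are clocked only when the instruction writes a register.
    """
    bits_per_cycle = sum(GATED_FIELDS.values())
    if not gated:
        return bits_per_cycle * len(regwrite0_trace)
    return bits_per_cycle * sum(1 for regwrite0 in regwrite0_trace if regwrite0)


# Hypothetical trace: three of five instructions write the register file.
trace = [1, 0, 1, 1, 0]
ungated_bits = clocked_data_bits(trace, gated=False)
gated_bits = clocked_data_bits(trace, gated=True)

if __name__ == "__main__":
    print("clocked bits without gating:", ungated_bits)
    print("clocked bits with gating:   ", gated_bits)
```

The saving in this model is proportional to the fraction of instructions that do not write back to the register file, such as stores and branches.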
CONCLUSIONS
The need for energy efficient architectures has become more critical than ever with the proliferation of embedded devices. Also, the increasing complexity of the emerging systems on a chip paradigm makes it essential to make good energy-conscious decisions early in the design cycle to help define design parameters and eliminate incorrect design paths. This study has introduced a comprehensive framework that can provide such early energy estimates at the architectural level. The uniqueness of this framework is that it captures the integrated impact of both hardware and software optimizations and provides the ability to study the system as a whole and each individual component in isolation. This work has tried to answer some of the questions raised in Section 1 using this framework. The major findings of our research are the following:
• A transition-sensitive, cycle-accurate, architectural-level approach can be used to provide a fast (as compared to circuit-level simulators) and relatively accurate estimate of the energy consumption of the datapath. For example, the register file energy estimates from our simulator are within 2 percent of circuit-level simulation.
• The energy hotspots in the datapath were identified to be the pipeline registers and the register file. They consume 58-70 percent of the overall datapath energy for executing (original) unoptimized codes. However, the datapath energy is found to be one to two orders of magnitude less than the memory system energy for these multidimensional array codes.
• The main memory energy consumption accounts for almost all the system energy for small cache configurations when executing unoptimized codes. The application of high-level compiler optimizations significantly reduces the main memory energy, causing the Dcache, Icache, clock network, and datapath energy contributions to become more significant. For example, the contribution of datapath energy to overall system energy, with an 8K, 8-way Dcache, increases from 9.2 percent to 23 percent when benchmark mxm is optimized. We also found that the clock energy consumption increased by 22 percent, on average, after applying the compiler transformations.
• The improved spatial and temporal locality of the optimized codes is useful not only in reducing the accesses to the main memory but also in exploiting energy-efficient cache architectures better than unoptimized codes do. Optimized codes saved 21 percent more energy using the most recently used way-prediction cache scheme as compared to executing unoptimized codes. They also save 19 percent more energy when using block buffering.
• Emerging technologies coupled with a combination of energy-efficient circuit, architectural, and compiler optimizations can shift the energy hot spot. We found that, with an order of magnitude reduction in the energy per main memory access made possible with eDRAM technology, the datapath energy consumption becomes larger than the memory system energy when executing an optimized mxm code with a 1K 4-way Dcache.
In this work, we observed that the compiler optimizations provided the most significant energy savings over the entire system. The SimplePower framework can also be used for evaluating the effect of high-level algorithmic, architectural, and compilation tradeoffs on energy. With SimplePower, we could, perhaps, corroborate the popular belief that the most significant gains in energy can be obtained at the algorithmic selection level. Such selections may be imperative in battery-operated devices. For example, energy-aware algorithmic selection can be performed among three well-known sorting algorithms: bubble sort, quick sort, and heap sort. When these algorithms were used to sort 100 integers on SimplePower, choosing quick sort over bubble sort and heap sort reduced the datapath energy consumption by 83 percent and 30 percent, respectively (for 0.35 µm technology). Also, we observed that energy-efficient architectures can reduce the energy consumed by even highly optimized code significantly and, in fact, by much more than for unoptimized codes. An understanding of the interaction of hardware and software optimizations on system energy gained from this work can help both architects and compiler writers to develop more energy-efficient systems.
Hyun Suk Kim received the BS degree from Kyungbook National University, Korea, in 1994 and the MS degree from Pohang University of Science and Technology, Korea, in 1996. She worked for Samsung Electronics from 1996 to 1998. She has been pursuing the PhD degree at Penn State University since 1998. Her major research interests include low power computer architecture, embedded systems, and compiler optimizations for low power.
Wu Ye received the BS degree in information engineering from Beijing University of Posts and Telecommunications in 1992 and, the PhD degree in computer science and engineering from the Pennsylvania State University in 2000. His research interests include architectural level power estimation and optimization, low power system design, high performance ASIC design, and verification. Past work includes designing and implementing the first version of SimplePower tool suite. He is currently working for Applied Micro Circuits Corporation.
David Duarte received the BS degree in electronics engineering from the Pontificia Universidad Javeriana, Bogota, Colombia, in October 1996. He was a research engineer for TELECOM (the largest Colombian telecommunications company) starting in April 1996. Later, he initiated graduate studies in electrical engineering, receiving the MSEE in May 1999 from the Pennsylvania State University. Since then, he has been a PhD candidate in the Department of Electrical Engineering, the Pennsylvania State University. He spent the summer of 2000 working at Intel's Circuit Research Lab as a graduate technical intern. His research interests are in the areas of VLSI analog and digital design, with interest in both architectural-level tools and circuit-level optimizations for low-power. He is a student member of IEEE.
