One of the major challenges of post-PC computing is the need to reduce energy consumption, thereby extending the lifetime of the batteries that power these mobile devices. Memory is a particularly important target for efforts to improve energy efficiency. Memory technology is becoming available that offers power management features such as the ability to put individual chips in any one of several different power modes. In this paper we explore the interaction of page placement with static and dynamic hardware policies to exploit these emerging hardware features. In particular, we consider page allocation policies that can be employed by an informed operating system to complement the hardware power management strategies. We perform experiments using two complementary simulation environments: a trace-driven simulator with workload traces that are representative of mobile computing and an execution-driven simulator with a detailed processor/memory model and a more memory-intensive set of benchmarks (SPEC2000). Our results make a compelling case for a cooperative hardware/software approach for exploiting power-aware memory, reducing Energy Delay to as little as 45% of that of the best static policy and to 1% to 20% of that of a traditional full-power memory.
INTRODUCTION
One of the major challenges of the post-PC environment (encompassing ubiquitous mobile, embedded, and wireless devices) is the need to reduce the energy consumed in their operation, thereby extending the lifetime of the batteries that power them. Power consumption is an issue that extends well beyond the realm of battery-powered mobile devices to any computing platform in which the production of heat or fan noise is a consideration (e.g., medical applications). Energy efficiency of computers is also desirable from the economic and environmental points of view.
Sustained exponential growth in processor performance and memory density means that embedded processors and handheld devices can soon have performance characteristics comparable to today's workstations. This increased performance is usually accompanied by increased power consumption. Memory is a particularly important target for efforts to address the energy efficiency issue. Instructions invoking memory operations have a relatively high power cost, both within the processor and in the memory system [43]. Intel's guidelines for mobile power [19] indicate that the target for main memory should be approximately 4% of the power budget (e.g., an average 1.3W for 96MB) for year 2000 laptops. This percentage can dramatically increase in systems with low power processors (e.g., Transmeta Crusoe [15]) or displays [35], or without hard disks. Recent studies [10] show that memory system behavior can produce variations in energy consumption of 100% for a pocket computer with the above characteristics versus 9% for a conventional laptop with a high power processor.
Since many small devices have no secondary storage and rely on memory to retain data, there are power costs for memory even in otherwise idle systems. The amount of memory available in mobile devices is expanding with each new model to support more demanding applications (e.g., multimedia) while the demand for longer battery life also continues to grow significantly.
Hardware components, such as memory chips, are becoming available that offer power management features. In particular, we consider power-aware DRAM chips that support several different power modes: active, standby, nap, and powerdown, in order of decreasing power consumption but increasing access time. Our goal in this work is to determine how to exploit these emerging hardware features for the most effective main memory power management.
Specifically, we ask two basic questions:
1. How can the various power modes available in state-of-the-art DRAM devices be utilized? We consider both static and dynamic hardware policies for determining the power state. Our dynamic scheme uses the time between DRAM chip accesses to determine power state transitions.
2. What is the effect of code and data placement within such power-aware memory chips? We therefore consider page allocation strategies that complement the ability of the hardware to adjust power modes.
Our work is based on the premise that a cooperative hardware/software approach will offer expanded opportunities for energy efficiency. A primary contribution of this paper is a quantitative study that explores the interaction of virtual memory page allocation with dynamic hardware policies to orchestrate the use of power modes provided in emerging DRAM devices.
We measure the energy savings within the memory system and any additional delay in execution time resulting from these power management strategies, expressed in terms of an Energy Delay metric. Using trace-driven simulation with a simplified processor and memory system model, we evaluate our ideas for a set of productivity applications as a workload representative of mobile laptop and handheld devices. We also use an execution-driven simulator with a more detailed processor and memory model to evaluate a set of programs from the integer SPEC2000 suite that place higher demands on the memory system than the available traces.
Our results show the following: Among static policies in which every power-aware DRAM chip in the system resides in the same base power mode between accesses, choosing the nap mode as the base achieves the lowest Energy Delay product for our workload (15% to 40% of staying in active mode). Power-aware page allocation by an informed operating system coupled with dynamic hardware policies can dramatically improve the energy efficiency of memory. Power-aware allocation allows a 6% to 55% improvement in Energy Delay over the best static hardware policy. Power-aware page allocation when used with static hardware policies can improve Energy Delay by up to 30%. Dynamic hardware policies without informed OS support (i.e., using random page allocation) do not improve energy efficiency as measured by Energy Delay.
In the next section, we describe the power-managed memory technology upon which this study is based and present related work. Section 3 describes the policies that determine which power mode each chip should be in. Then, in Section 4, we discuss simple page allocation strategies that exploit the power management features of the hardware. Section 5 presents our experimental methods, and Section 6 presents our results. Finally, we conclude in Section 7 and describe future work.
BACKGROUND AND RELATED WORK

Rambus RDRAM
Memory technology has developed to respond to the needs of mobile computer designers to limit power consumption in the face of increasing demand for performance. One concrete example is Direct Rambus DRAM (RDRAM) [40]. The Direct Rambus technology delivers high bandwidth (1.6GB/sec per device), using a narrow bus topology operating at a high clock rate. As a result, each RDRAM chip can be activated independently. RDRAM offers four power modes: active, standby, nap, and powerdown. Because of the narrow topology, each chip can be independently set to an appropriate power state. Conventional DRAMs generally require multiple active chips to achieve high bandwidth.
Although we could apply the ideas presented in this paper to these conventional memory systems, doing so would sacrifice bandwidth. By adopting the RDRAM model, we can concentrate on the interactions of page allocation with the power modes without the concern for a tradeoff between bandwidth and energy consumption.
An RDRAM device must be in active mode while performing a read or write transaction. Active mode consumes the most power. A chip that is not servicing a memory request can be in any of the lower power states. However, these states incur additional delay for clock resynchronization. Resynchronization from standby is fast, and standby uses 60% of the power of active mode. Greater power savings can be achieved by using nap mode (10% of the power of active), at the cost of an additional resynchronization time to transition to the active state in order to service a memory request. Powerdown mode has the minimal power consumption (1% of active), but a significant delay for clock synchronization (100 times that needed by nap mode) to enter the active state. Figure 1 shows the power states and their relative power costs as well as the possible transitions and relative transition times into active mode. The challenge for the laptop or handheld designer is to utilize these modes effectively. It is not only the availability of these power states but the ability to transition between them dynamically on a per-chip basis that gives the RDRAM its potential for power management.
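The relative costs quoted above can be collected into a small table, as in the following minimal sketch. The `PowerMode` class and the `idle_energy` helper are illustrative names, not part of the paper. The power fractions come from the text; the resynchronization times are expressed in units of the nap resynchronization time, and the standby value is an assumed placeholder because the text only says that standby is fast.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PowerMode:
    name: str
    relative_power: float   # fraction of active-mode power (from the text)
    relative_resync: float  # delay to reach active, in units of nap resync time


# Ordered from highest to lowest power consumption.
MODES = [
    PowerMode("active", 1.00, 0.0),
    PowerMode("standby", 0.60, 0.05),    # 0.05 is an assumed placeholder
    PowerMode("nap", 0.10, 1.0),
    PowerMode("powerdown", 0.01, 100.0),  # 100 times the nap resync time
]


def idle_energy(mode: PowerMode, idle_time: float) -> float:
    """Energy spent idling in `mode`, in units of (active power x time unit)."""
    return mode.relative_power * idle_time


# Idling for 1000 time units in each mode, relative to active power.
for m in MODES:
    print(m.name, idle_energy(m, 1000.0), "resync:", m.relative_resync)
```

The table makes the tradeoff explicit: each step down saves idle energy but raises the delay paid on the next access.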
Power-Aware DRAM Model
Rather than trying to model the full complexity of the Direct RDRAM specifications, we incorporate the essential features (i.e., multiple, independently controlled memory chips with multiple power states) into an abstract model of a Power-Aware DRAM (PADRAM). We can choose parameters that are consistent with the power modes, the relative power costs, and relative resynchronization times given in Figure 1, so our results are relevant to RDRAM; however, we do not claim to have precise numbers for the power consumed in each state and state transition of any particular RDRAM implementation. We make simplifying assumptions about the power consumption during state transitions and we concentrate only on the transition times that impose additional latency on a memory request.
We focus on improving the energy consumption of main memory, ignoring the energy used by all other system components (including processor and cache). In our studies, the processor and cache affect memory energy efficiency by influencing execution time and miss rate (the number of DRAM accesses).
Related Work
Architectural studies have examined the impact of software structure on power consumption [6, 34, 44] [40]. However, to our knowledge, ours is the first quantitative study to explore the interaction of page allocation with dynamic hardware policies to orchestrate the use of power modes being provided in emerging PADRAM-class memory devices.
A novel aspect of our work is the cooperative hardware/OS approach to exploit PADRAM features. Previous OS-level studies focusing on power management include work on scheduling for low power processor modes [31, 32, 46], spindown policies for disks and alternatives [1, 7, 8, 9, 16, 25, 30, 47], and managing wireless communication [18, 24, 42]. A consortium of companies has developed a specification [20] that addresses the lower-level OS/device interface, providing one model for gross system-wide power states and per-component power states as a basis for the development of OS-directed power management. Recent work with Odyssey [37, 11] demonstrates how system support for application-aware adaptation can benefit energy efficiency. Common themes that appear in these power management strategies are the identification/prediction of idleness in the activity patterns of a component and techniques that attempt to change those activity patterns. A particularly valuable approach is based on the "ski rent-to-buy" problem formulation for competitive algorithms [22, 25].
Another related area involves operating system page placement policies. Virtual memory page research originally concentrated on techniques for improving program execution time, focusing on replacement algorithms. Recent studies examined page coloring policies for selecting appropriate physical page frames to minimize cache misses [2, 23]. Other recent work has studied page placement aimed at improving TLB performance [41] or NUMA multiprocessor memory access [27, 28, 45, 12, 3]. Each of these problems bears some resemblance to the issues we face since they all attempt to exploit the flexibility available in mapping virtual to physical pages.
HARDWARE POWER MANAGEMENT
This section explains various hardware policies for controlling PADRAM power states. Since each chip is controlled independently, the memory controller can implement a variety of power management policies. In this paper we investigate two types of policies: static and dynamic.
Static Policies
The static schemes we investigate correspond to placing all PADRAM chips in a single power state. We note that for an access to occur, the PADRAM chip must first transition to the active state. Only when there are no outstanding requests for the device does it return to the specified static power state. Our first static policy assumes that all PADRAM devices are in the active state. This corresponds to a conventional performance-oriented design, targeted at reducing execution time.
The next three static schemes place all PADRAM chips in the standby, nap, and powerdown state, respectively, when there are no accesses to service. These policies correspond to implementations targeting energy efficiency by sacrificing performance, since the memory access time increases as the power consumption is reduced. Ideally, we want to maximize performance while minimizing energy consumption. The remainder of this section describes policies with this goal.
Dynamic Policies
To obtain higher performance and energy efficiency we must relax the constraint that each PADRAM chip return to the same base power state when there are no pending accesses. This allows the possibility of exploiting locality in the program's memory access pattern to reduce energy consumption. To accomplish this, we need to dynamically determine the power state of each chip. Clearly, a chip needs to be in the active state to perform an access. The more difficult decision is to determine when the chip should transition to a lower power state.
Our approach uses the time between accesses to a chip as a metric for transitioning to lower power states. If a chip is not accessed for a threshold amount of time, it transitions to the next lower power state. This allows individual chips to reside in different power states, based on their individual access patterns.
The threshold values are an important parameter in this approach. If the threshold is too large, the chip spends too much time in the higher power state, increasing energy consumption. In contrast, if the threshold is too small, the chip transitions prematurely into a slower, lower power state, increasing execution time.
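The threshold mechanism can be sketched as a function of how long a chip has been idle. This is a minimal illustration, not the simulator's implementation: the function name is hypothetical, transition latencies are ignored, and it assumes each threshold counts the idle time spent in the current state before dropping to the next lower one (the text does not say whether thresholds are per-state or cumulative).

```python
MODES = ["active", "standby", "nap", "powerdown"]


def mode_after_idle(idle_ns, thresholds_ns):
    """Return the mode a chip occupies after `idle_ns` without an access.

    thresholds_ns[i] is the idle time (ns) the chip waits in MODES[i]
    before transitioning to MODES[i + 1]. Assumption: time is counted
    from entry into each state, and transitions take no time.
    """
    level = 0
    remaining = idle_ns
    for threshold in thresholds_ns:
        if remaining < threshold:
            break
        remaining -= threshold
        level += 1
    return MODES[level]


# Example thresholds: leave active immediately, then wait 2,000 ns in
# standby and 50,000 ns in nap before each further step down.
thresholds = (0, 2_000, 50_000)
for idle in (0, 1_000, 2_000, 51_999, 52_000):
    print(idle, mode_after_idle(idle, thresholds))
```

A threshold of zero makes the chip leave that state as soon as it has no pending access, while a very large threshold effectively disables the transition.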
Dynamic power state management exploits locality of reference to individual PADRAM chips. Reference locality is determined in part by the algorithm/data structures and in part by the mapping of program virtual addresses to physical addresses. The next section discusses how the operating system can influence energy efficiency through physical page allocation. Source code and data structure transformations for improving energy efficiency are an important and interesting topic, but are beyond the scope of this paper.
PAGE ALLOCATION
An important contribution of this paper is to re-examine virtual memory page allocation policies in light of new PADRAM technology. Previous page allocation studies ignored which actual DRAM chips contained the allocated page frame. In contrast, our work focuses specifically on this parameter in an effort to maximize energy efficiency. Given hardware mechanisms, as described above, that can determine when to transition between power states, the operating system may further improve energy efficiency by allocating physical pages in a manner that fully exploits the hardware. As a first step, the page allocation should cluster an application's pages into the minimum number of PADRAM chips.
To determine the benefits of power-aware page allocation (see Section 6), we compare random placement with the well-known sequential first-touch placement policy. Our first policy randomly chooses a PADRAM chip for the physical page. We believe that the allocation policies in conventional operating systems would appear to be essentially a random assignment with respect to chip selection.
The sequential first-touch policy allocates pages in the order they are accessed, filling an entire PADRAM chip before moving on to the next. This scheme minimizes the number of PADRAM chips utilized for a given application. Therefore, the hardware can automatically place unused PADRAMs in the powerdown state, and hence potentially reduce energy consumption.
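The two placement policies can be contrasted with a short sketch. The function names and the dictionary representation (virtual page to chip index) are illustrative assumptions, and both functions assume that enough free frames exist for every page touched.

```python
import random


def sequential_first_touch(page_refs, pages_per_chip):
    """Assign pages to chips in order of first access, filling each chip
    completely before moving on to the next."""
    placement = {}
    for page in page_refs:
        if page not in placement:
            placement[page] = len(placement) // pages_per_chip
    return placement


def random_allocation(page_refs, num_chips, pages_per_chip, seed=0):
    """Assign each newly touched page to a randomly chosen chip that
    still has a free frame."""
    rng = random.Random(seed)
    free = {chip: pages_per_chip for chip in range(num_chips)}
    placement = {}
    for page in page_refs:
        if page not in placement:
            chip = rng.choice([c for c, n in free.items() if n > 0])
            free[chip] -= 1
            placement[page] = chip
    return placement


refs = [5, 3, 5, 9, 1, 3, 7]          # hypothetical page reference stream
seq = sequential_first_touch(refs, pages_per_chip=2)
rnd = random_allocation(refs, num_chips=4, pages_per_chip=2)
print("sequential uses chips:", sorted(set(seq.values())))
print("random uses chips:    ", sorted(set(rnd.values())))
```

With five distinct pages and two frames per chip, sequential first-touch occupies exactly three chips, the minimum possible, whereas random placement may spread the same pages over more chips.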
This new form of page coloring targets reducing power consumption rather than improving performance. However, we note that conventional page coloring for improved cache performance can still be utilized when selecting pages from within a PADRAM chip. We also assume that physical addresses are not interleaved across PADRAM chips. We can interleave at the word, cache line, or page granularity within the PADRAM chip, since each chip will likely contain multiple independent banks.
Finally, experience has shown that first-touch is often not representative of subsequent locality since it may capture only an initialization phase of the program. Thus, we also consider the potential for limited reassignment intended to cluster pages with similar access patterns within PADRAM chips. The Frequency policy attempts to improve upon an initial allocation of frequently accessed pages at some point into the execution. Identification of candidates for reassignment is done with small per-page hardware counters, recording frequency of accesses to each page, outside of the L1 and L2 caches, over a window of time. A limited number of the most frequently accessed pages are then moved into a common chip. In our formulation of this scheme, a block of free page frames is reserved in one chip during initial placement to serve as a destination during this later one-time reallocation. Of course, this could be repeated, but we leave multiple "corrections" as future work.
In Section 6, an offline version (counting over the entire trace and then placing pages accordingly) is first considered in order to ascertain that "better" placements are possible using such frequency information. Then the online policy, described above, is simulated, including the costs of page migration.
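The one-time reallocation step of the online Frequency policy can be sketched as follows. The function name, the dictionary placement representation, and the use of a software counter in place of the per-page hardware counters are all assumptions made for illustration; migration cost is not modeled.

```python
from collections import Counter


def reassign_hot_pages(placement, window_refs, hot_chip, reserved_frames):
    """Move the most frequently accessed pages in the window to `hot_chip`.

    placement:       dict mapping page -> chip from the initial allocation
    window_refs:     page references (misses reaching DRAM) in the window
    reserved_frames: free frames set aside on `hot_chip` for this step
    Returns the new placement and the list of pages that were moved.
    """
    counts = Counter(window_refs)
    new_placement = dict(placement)
    moved = []
    for page, _count in counts.most_common():
        if len(moved) == reserved_frames:
            break
        # Pages already on the hot chip do not consume a reserved frame.
        if page in new_placement and new_placement[page] != hot_chip:
            new_placement[page] = hot_chip
            moved.append(page)
    return new_placement, moved


initial = {1: 0, 2: 1, 3: 2, 4: 1}
window = [3, 3, 3, 2, 2, 1, 4, 4, 4, 4]
after, moved = reassign_hot_pages(initial, window, hot_chip=0, reserved_frames=2)
print(moved)   # pages 4 and 3 are the two hottest pages off the hot chip
print(after)
```

In this hypothetical window, pages 4 and 3 join page 1 on chip 0, so the most active pages share a single chip and the remaining chips see fewer accesses.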
METHODOLOGY
To evaluate energy efficiency, we use the Energy Delay product [13]. This metric captures our goal of achieving high performance (seconds) while minimizing energy consumption (joules). Although total system energy consumption is important, it is highly dependent on specific design choices (e.g., processor, display type, wireless network interface, etc.). Therefore, we concentrate only on PADRAM energy consumption, and ignore the energy consumed by all other system components.
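As a minimal sketch of the metric, the Energy Delay product is the energy consumed multiplied by the execution time, and results are reported relative to a baseline policy. The function names and all numeric values below are hypothetical and are not taken from the paper's tables.

```python
def energy_delay(energy_joules: float, delay_seconds: float) -> float:
    """Energy Delay product in joule-seconds; lower is better."""
    return energy_joules * delay_seconds


def normalized(policy_ed: float, baseline_ed: float) -> float:
    """Energy Delay of a policy as a fraction of a baseline policy."""
    return policy_ed / baseline_ed


# Hypothetical measurements for two policies on the same program.
active_ed = energy_delay(energy_joules=2.0, delay_seconds=1.0)
nap_ed = energy_delay(energy_joules=0.5, delay_seconds=1.2)
ratio = normalized(nap_ed, active_ed)
print(f"nap achieves {ratio:.0%} of the active Energy Delay")
```

In this made-up case the nap policy runs 20% longer but uses a quarter of the energy, so its Energy Delay is 30% of the active baseline; the metric rewards a policy only when its energy savings outweigh its slowdown.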
To compute energy efficiency, we developed two simulators: a trace-driven simulator and a detailed execution-driven out-of-order processor simulator. One of the primary considerations that went into our experimental design was the choice of a workload that would seem appropriate to mobile/wireless devices. The availability of traces from a set of popular applications used on laptops motivated the development of our trace-driven simulator. While these traces satisfied the need for a representative workload for the target environment, they had disadvantages for memory research: low miss rates and the constraints of trace-driven simulation (e.g., no detailed processor timing). Thus, the execution-driven simulator was developed to address the need for a more detailed processor/memory model and more memory-intensive benchmarks.
While the values for a particular power state are taken from the literature, we approximate the power consumption associated with a transition between two power states as the average of the power consumed in the two states. The total energy consumption depends on the time for the transition to complete, also shown in Table 1.
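The transition approximation amounts to multiplying the mean of the two state powers by the transition time. The sketch below illustrates this with a hypothetical function name. The 30mW nap power is the figure quoted in the results discussion; the 300mW active power is inferred from nap being 10% of active; the 60ns transition time is a made-up placeholder, since the actual values are in Table 1.

```python
def transition_energy(power_from_w, power_to_w, transition_time_s):
    """Energy (joules) of one transition: the average of the two state
    powers multiplied by the time the transition takes."""
    return 0.5 * (power_from_w + power_to_w) * transition_time_s


NAP_W = 0.030      # 30 mW, quoted in the results discussion
ACTIVE_W = 0.300   # inferred: nap is 10% of active power
RESYNC_S = 60e-9   # hypothetical transition time (see Table 1 for real values)

nap_to_active_j = transition_energy(NAP_W, ACTIVE_W, RESYNC_S)
print(f"{nap_to_active_j * 1e9:.2f} nJ per nap-to-active transition")
```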
Trace-Driven Simulation
The trace-driven simulator processes instruction and data address traces and uses a simplified PADRAM model. This simulator models a two-level cache hierarchy with a 16KB direct-mapped level one cache and a 256KB direct-mapped second-level cache; both caches have 32B cache blocks. Results for a 4-way associative L2 were qualitatively similar to the direct-mapped cache, so we omit them. We also model the individual PADRAM chips and their associated power state. Each cache is lockup-free and can have up to eight outstanding misses. In this simulator, we do not model memory bus contention or the internal DRAM banks. Instead we optimistically assume all requests to a single PADRAM can be overlapped (i.e., no bank conflicts). In these studies we only model the transition from the lower power state to active. The transitions from active to lower power states are assumed to incur no delay or energy consumption. These assumptions are removed in our execution-driven simulator.
For timing considerations (necessary to compute energy consumption), we use a simplified processor model that executes one instruction per cycle, and never stalls due to long latency operations (i.e., execution only stalls when the maximum number of outstanding misses is reached). We assume a 500MHz processor clock; the level one cache takes 2 cycles to access, while the level two cache incurs an additional 10 cycles. We simulate a non-interleaved main memory system with eight 32Mb PADRAM chips, for a total main memory capacity of 32MB.
For our trace-driven studies we use instruction traces from personal productivity applications executing on an Intel processor with Microsoft Windows NT. These traces, provided by the University of Washington Etch project [29], include instruction and data accesses for several popular applications typical of those used on laptops today. Table 2 provides information on the applications we use. The first six benchmarks are from the NT traces.
Execution-Driven Simulation
To overcome the limitations of trace-driven simulation, we augmented the SimpleScalar execution-driven simulator [...] page and close page policies, and various interleaving strategies for mapping physical addresses to specific chips and banks within chips. This simulator provides a more accurate model of timing at all levels of the memory hierarchy, including contention at each level and within each PADRAM device and transitions from higher to lower power states. In particular, active to either nap or powerdown takes 8 cycles, standby to nap takes 12 cycles, and nap to powerdown takes 61 cycles because we must first enter the active state. Active to standby takes either 1 cycle or 73 cycles, depending on the DRAM page mode (see Section 6.4.1). We simulate a non-interleaved main memory system with eight 256Mb chips for a total capacity of 256MB.
Due to excessive simulation time, we fast-forward the simulator over the first 4 billion instructions, and then simulate in detail the next 100 million committed instructions. This allows us to skip over program initialization; however, page placement is based on accesses from the beginning of program execution (during the fast-forwarding). In addition to the two SPEC95 benchmarks for which NT traces also exist (compress and go, above), we use three integer programs from the SPEC2000 suite (bzip, gcc, and vpr) for our execution-driven analysis (described at the bottom of Table 2). These three were chosen because they exhibited the highest data cache miss ratios. For all benchmarks, we use the reference input data set.
EXPERIMENTAL RESULTS
This section presents our results on power management for PADRAM. We begin with analysis of static hardware power state policies and their interaction with page allocation (Section 6.1). This is followed in Section 6.2 by analysis of the effects of page allocation on dynamic hardware power management policies described in Section 3. We then investigate the effects of open/close DRAM page policies and interleaving on energy efficiency.
The main results from this study are:
1. Cooperative hardware and software for power-aware page allocation can improve main memory energy efficiency, measured in terms of Energy Delay, by 6% to 55% over the best static policy.
2. Nap mode is the most energy efficient static policy for our applications.
3. Power-aware page allocation without dynamic hardware support can improve energy efficiency by up to 30% over static nap, depending on application characteristics.
4. Dynamic hardware schemes do not improve energy efficiency for random page allocation.
Static Power State Policies
In this section we evaluate the static policies that uniformly place all PADRAM chips in the same power state. We begin by evaluating PADRAM power management techniques in the context of random physical page allocation. In other words, the operating system is oblivious to the power management capabilities of the underlying hardware. Figure 2 shows the Energy Delay product for the four static policies (active, standby, nap, and powerdown) normalized to the active policy for each program. Table 3 shows the absolute values for runtime, energy, and Energy Delay product. From Figure 2 we see that placing all PADRAM chips in the nap state provides the lowest Energy Delay product for all applications in both simulations. Nap achieves approximately 15% of the Energy Delay of active for the trace-driven simulations, while it achieves 20% to 40% of active for the execution-driven results. Powerdown is generally the poorest performing, followed by active.
These results match our expectations, since powerdown incurs a significant increase in access delay, while active consumes too much energy when it is not servicing requests. The notable exception is acrord32, where powerdown is better than active. This is due to the low rate at which acrord32 generates DRAM accesses: from Tables 2 and 3 we see that acrord32 has the lowest rate of DRAM accesses. Therefore, it still achieves energy savings even though its delay increases. We also note that the Energy Delay of powerdown is directly related to the rate at which benchmarks generate DRAM accesses.
Standby is the next best mode after nap, achieving 60% of active in the trace-driven simulations, and 60% to 70% of active in the execution-driven simulations. Standby is worse than nap because the additional time penalty of nap causes only a slight increase in total run time, while the power reduction is very large (30mW for nap vs. 180mW for standby).
We note that the relative Energy Delay values for active, standby, and nap follow the relative ratios of power consumption. This is particularly true for the trace-driven simulations, and is a direct result of the low L2 miss rates exhibited by those programs (< 1%). The extremely high time penalty of powerdown is too much for even these low miss rates, and Energy Delay increases dramatically.
Impact of Page Allocation
We now examine the benefits of sequential first-touch page allocation over random page allocation for the static hardware power management schemes. Figure 3 shows the Energy Delay of sequential allocation normalized to the Energy Delay of random allocation. From Figure 3a we see that page allocation has very little effect on energy efficiency for active, standby, or nap, using the trace-driven simulations, producing at most a 6% reduction for nap (go). For these policies with random allocation, each chip consumes near its minimum energy because the programs have very low miss ratios. Packing all the program's pages into the minimum number of chips reduces the unused chips' energy by very little, which is offset by the increase in energy consumption for the more utilized chips. We note that sequential page allocation dramatically improves the energy efficiency for the powerdown static policy, achieving 30% to 70% of the random allocation. This is because the delay to transition out of powerdown is extremely long, and consumes a significant amount of energy. When program text and data are packed into the minimum number of chips, each chip is likely to satisfy more requests when it reaches the active state than when pages are randomly spread across chips. This observation is supported by our data, which show an increase in the number of references that occur when the target chip is already in the active state.
In contrast to the trace-driven results, our execution-driven results (see Figure 3b) show that power-aware page allocation does improve energy efficiency for the nap policy by 12% to 30%. In particular, we note that compress and go, the SPEC95 benchmarks, show larger improvements than those observed in the trace-driven experiments. This is due in part to first-touch producing lower L2 cache miss ratios and in part to the more detailed processor model used by SimpleScalar. Recall that our trace-driven simulator does not model data dependencies or finite processor resources, which minimizes the effects of long latency operations. Our execution-driven simulator accurately models these constraints and the corresponding additional delays when long latency operations cause resources (e.g., instruction buffers) to be overcommitted. Finally, as before, we see very little improvement for active and standby, while powerdown benefits the most from sequential page allocation.
Dynamic Power State Management
We now examine more sophisticated hardware support for PADRAM power management. By dynamically determining each chip's power state based on recent references, we hope to improve overall energy efficiency. Figure 4 shows the Energy Delay of various dynamic policies normalized to the static nap policy for our trace-driven simulations using sequential first-touch allocation. Each bar in the graph represents a different set of thresholds (in nanoseconds) for transitioning from active to nap (x) and from nap to powerdown (y), represented as x/y. Table 4 shows the raw data for the static nap and powerdown schemes along with the best dynamic scheme.
We determined a loose lower bound on the time that must be spent in a lower power state in order to overcome the transition costs by analytically computing the penalty vs. reward for transitioning to the lower power state. We use that bound to guide the choice of threshold values to explore. Appendix A provides details on our threshold computation. Our analysis determined that there was very little benefit for remaining in standby, and that the active to nap threshold should be on the order of 100s of nanoseconds, while the nap to powerdown threshold should be on the order of 10,000ns. Our trace-driven simulation results show that thresholds of 100ns/5,000ns produce the best overall Energy Delay.
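One simple way to reason about such a bound is an energy-only break-even estimate: a lower power state pays off once the energy saved while idling exceeds the energy spent returning to the higher state. The sketch below is an assumption-laden illustration and is not the derivation in Appendix A. The function name is hypothetical, the added delay is ignored, the transition energy uses the average-power approximation, and the 60ns resynchronization time is a made-up value.

```python
def break_even_idle_time(p_high_w, p_low_w, resync_time_s):
    """Idle time (s) in the lower state beyond which energy is saved.

    Solves (p_high - p_low) * t = E_transition for t, where E_transition
    is approximated as the average of the two powers times resync time.
    """
    if p_low_w >= p_high_w:
        raise ValueError("the lower state must consume less power")
    e_transition = 0.5 * (p_high_w + p_low_w) * resync_time_s
    return e_transition / (p_high_w - p_low_w)


ACTIVE_W = 0.300   # inferred from nap being 10% of active
NAP_W = 0.030      # 30 mW, quoted in the results discussion
t = break_even_idle_time(ACTIVE_W, NAP_W, resync_time_s=60e-9)
print(f"nap pays off after about {t * 1e9:.1f} ns idle (energy only)")
```

Because the estimate ignores the performance penalty of resynchronization, it is a loose lower bound in the same spirit as the text: practical thresholds should be at least this large.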
From Figure 4 we see that the combination of power-aware page allocation and dynamic hardware policies can produce Energy Delay values that are 50% to 94% of the static nap policy. Five of the six benchmarks achieve 80% or lower.
Our execution-driven results show that dynamic hardware policies improve energy efficiency of sequential page allocation by 42% for bzip, 43% for compress, 50% for go, 55% for gcc, and 30% for vpr over static nap, the best static policy. Furthermore, this is an overall improvement of 50% to 60% compared to static nap using random page allocation. Due to excessive simulation time we did not perform as exhaustive an evaluation as with the trace-driven studies. The best results in the limited experiments we did perform are produced by threshold values of 0ns between active and standby, 2,000ns between standby and nap, and 50,000ns between nap and powerdown. Section 6.4.1 discusses aspects of our RDRAM model that produce the 0ns threshold. While further improvements might be achieved by fine tuning the thresholds, these results are sufficient to show that cooperative hardware/software techniques can improve energy efficiency.
The energy efficiency of sequential first-touch page allocation and dynamic hardware power state management is significantly better than using a traditional full-power memory system, e.g., static active. Our cooperative hardware/software schemes achieve from 7% to 20% of the Energy Delay for the static active policy for the execution-driven simulations and from 1% to 10% of the Energy Delay for the trace-driven experiments.
An important observation from our simulation results is that dynamic power state management does not improve e nergy e ciency for random page allocation over the static nap Table 3 : Raw Data for Static Policies with Random Allocation, Energy Delay is de ned in terms of joules seconds policy. In particular, for the execution-driven experiments above, the dynamic policies with random placement are over an order of magnitude worse than the static nap policy for two o f t h e b e n c hmarks. This poor performance is a result of moving to the powerdown state too soon, and incurring the large delay and corresponding energy consumption to transition out of powerdown. This overhead can be reduced by increasing the nap to powerdown threshold, and thus preventing any c hip from entering powerdown. We v eri ed this behavior through simulation, and achieved energy e ciency comparable to the static nap policy. Further tuning of the other thresholds produced only minor bene ts. We also note that for sequential page allocation, the higher powerdown thresholds do not signi cantly change the results from those presented above. This is important since we w ant the dynamic polices to produce comparable results to the static schemes in cases where the operating system is unable to successfully perform power aware page allocation.
Frequency-based Page Placement
The primary goal of page placement thus far has been to cluster all pages into the minimum number of PADRAM chips. In this section, we present preliminary results from an alternative placement technique that further refines page allocation based on access frequency. To achieve this, we first construct a histogram of page accesses offline. The results of this profile run are then used to determine initial page placement, starting with the most frequently accessed page and continuing to the least accessed. Figure 5 shows the Energy Delay for both the frequency and sequential first-touch allocation policies and the dynamic hardware policy with thresholds of 100ns/5,000ns normalized to sequential first-touch static nap. These results clearly show that first-touch is not the best placement policy. Compress and go do not show any benefit since they both fit entirely on a single chip. Acrord32, netscape, and powerpoint all reduce the Energy Delay by approximately 20% beyond the values achieved by first-touch. Winword exhibits the largest benefit of frequency-based placement, achieving 60% of the static nap value, whereas first-touch did not improve energy efficiency at all.
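As a rough illustration of the offline, profile-driven placement just described, the following sketch builds a histogram from a page-access trace and fills chips in order of decreasing access count. The function name `frequency_placement`, the trace representation (a list of page numbers), and the tie-breaking rule are assumptions made for this example only.

```python
from collections import Counter


def frequency_placement(access_trace, pages_per_chip):
    """Map each profiled page to a chip index, hottest pages first.

    access_trace   -- iterable of page numbers from a profile run
    pages_per_chip -- capacity of one chip, in pages (assumed uniform)
    """
    histogram = Counter(access_trace)
    # Rank by descending access count; break ties by page number so the
    # result is deterministic (an arbitrary choice for this sketch).
    ranked = sorted(histogram, key=lambda page: (-histogram[page], page))
    # Fill chip 0 completely, then chip 1, and so on.
    return {page: rank // pages_per_chip for rank, page in enumerate(ranked)}


if __name__ == "__main__":
    trace = [7, 3, 7, 9, 3, 7, 5]
    print(frequency_placement(trace, pages_per_chip=2))
```

Packing the most frequently accessed pages together concentrates accesses on the first chips, which lets the remaining chips stay in low power states for longer.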
We are currently investigating online techniques to reassign pages based on reference frequency. Our initial implementation reserves 128 physical pages in chip 0, reallocates the 128 most frequently accessed pages from the other chips to chip 0, and then packs the remaining pages into the smallest number of chips. We execute the program for a 100ms warmup period to skip initialization, and then sample page accesses for 2ms. We associate a 10-bit saturating counter with each physical page, and increment the appropriate counter for each page accessed during the sample period. At the end of the sample period, the OS sorts the counters and performs the movement and repacking operations, and resumes program execution. We include the cost of page moves as 0.011ms and 0.008mJ, obtained by measuring the energy and delay of a bcopy using our execution-driven simulator.
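The sampling and repacking steps above might be sketched as follows. This is a simplified illustration, not our implementation: it places the hottest pages in chip 0 regardless of where they currently reside, assumes the hot set fits within one chip, and ignores the page-move costs. The names `sample_counts` and `repack` are hypothetical.

```python
COUNTER_MAX = (1 << 10) - 1  # largest value of a 10-bit saturating counter
HOT_PAGES = 128              # pages reserved in chip 0 for the hot set


def sample_counts(sampled_accesses):
    """Count accesses per page during the sample period, saturating at 1023."""
    counts = {}
    for page in sampled_accesses:
        counts[page] = min(counts.get(page, 0) + 1, COUNTER_MAX)
    return counts


def repack(counts, resident_pages, pages_per_chip, hot_pages=HOT_PAGES):
    """Return a page -> chip map with the hottest pages in chip 0.

    Assumes hot_pages <= pages_per_chip. Remaining pages are packed densely
    into the lowest-numbered chips, starting with the space left in chip 0.
    """
    ranked = sorted(resident_pages, key=lambda p: (-counts.get(p, 0), p))
    hot, cold = ranked[:hot_pages], ranked[hot_pages:]
    placement = {page: 0 for page in hot}
    slot = len(hot)  # next free page slot, counted across all chips
    for page in cold:
        placement[page] = slot // pages_per_chip
        slot += 1
    return placement


if __name__ == "__main__":
    counts = sample_counts([2] * 9 + [1] * 5 + [3])
    print(repack(counts, [1, 2, 3, 4], pages_per_chip=2, hot_pages=2))
```

A full implementation would also charge each page move its delay and energy cost before resuming execution.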
The above implementation produces a 10% reduction in Energy Delay for winword, the program with the most opportunity, over static nap. This is because winword is a long running program that accesses a large amount of memory. We did not see any improvement for the other programs. The other programs either do not run very long or do not stress the memory system much. Furthermore, the other programs achieve significant gains from first-touch, while winword does not. We are currently investigating other applications and other, less hardware intensive, techniques for obtaining page reference frequency. However, we note that conventional page reference counting may not directly apply since large L2 caches can filter many accesses, whereas it is L2 misses that dictate DRAM access frequency.
Alternative DRAM Architectures
In this section we examine the effects of two important DRAM architectural alternatives: DRAM page policy and interleaving.
Open vs. Close Page Policy
The previous execution-driven results use a 0ns threshold for transitioning from active to standby. This is a result of the detailed DRAM model used in those simulations. Most current DRAM devices support two operating modes: open page and close page. These modes indicate what occurs after the DRAM services a request. In open page mode, data from a DRAM page (footnote 1) remains on the sense amplifiers in anticipation of future accesses to nearby data. However, subsequent accesses to a different DRAM page incur an additional precharge delay before fetching the appropriate DRAM page. In contrast, close page mode immediately precharges the DRAM bank after an access in an attempt to avoid the precharge delay. If the same DRAM page is accessed, it incurs higher delay than the open page technique, since the data must be fetched again.
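The latency trade-off between the two modes can be made concrete with a toy model of a single bank. The latency parameters `t_access`, `t_activate`, and `t_precharge` are hypothetical unit costs chosen only for illustration, not RDRAM timings, and the sketch assumes that the precharge issued in close page mode is fully hidden between requests.

```python
def total_latency(row_sequence, policy, t_access=1, t_activate=1, t_precharge=1):
    """Sum the latency of a sequence of row (DRAM page) accesses to one bank.

    policy is "open" (keep the row on the sense amplifiers after an access)
    or "close" (precharge immediately after every access).
    """
    if policy not in ("open", "close"):
        raise ValueError("policy must be 'open' or 'close'")
    open_row = None
    total = 0
    for row in row_sequence:
        if policy == "close":
            # Bank is already precharged: activate the row, then access it.
            total += t_activate + t_access
        elif row == open_row:
            # Open page hit: the data is already on the sense amplifiers.
            total += t_access
        elif open_row is None:
            # First access to an idle, precharged bank.
            total += t_activate + t_access
        else:
            # Open page conflict: precharge the old row before activating.
            total += t_precharge + t_activate + t_access
        if policy == "open":
            open_row = row
    return total


if __name__ == "__main__":
    for rows in ([1, 1, 1], [1, 2, 3]):
        print(rows, total_latency(rows, "open"), total_latency(rows, "close"))
```

In this model, open page mode wins when consecutive accesses fall in the same DRAM page, and close page mode wins when they do not.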
The DRAM page policy relates to power management in that an important difference between the active and standby power states is whether there is data on the sense amplifiers. To enter standby all pages must be closed (i.e., precharge all sense amplifiers). Furthermore, the resynchronization delay of standby applies only to the column address and data bus and can be completely overlapped with the row activate command. Therefore, in close page mode, when there are no requests to any banks of a chip, it enters standby (0ns threshold from active), since this will not introduce any additional delay, but can reduce energy consumption. This is the policy used to obtain the previous results. In open page mode, a device can remain in the active state while it retains data on the sense amplifiers. The threshold for transitioning to standby determines when all DRAM pages on a device should be closed. For our configuration, there is also an additional 73 cycle delay incurred to issue appropriate commands to close the open DRAM pages.
We use the execution-driven simulator to evaluate the impact of open vs. close page modes on energy efficiency. The trace-driven simulator does not model the PADRAM devices in sufficient detail to perform this study. We use the dynamic hardware policies with sequential first-touch page allocation and non-interleaved main memory. Our simulations show that close page mode produces Energy Delay values 20% lower than open page mode.
Interleaving
The results thus far do not use any interleaving; physical addresses are mapped sequentially to each chip, so chip 0 contains physical pages 0 to N-1, chip 1 contains N to 2N-1, etc. However, we do interleave cache blocks across internal DRAM banks (footnote 2). This allows sequential cache block accesses within a page to overlap much of their DRAM latency.
Alternatively, we may get higher performance if we can spread pages across DRAM chips, potentially exposing more parallelism by reducing DRAM bank conflicts. For example, we could interleave at the page granularity, such that physical pages are allocated in a round-robin manner across chips (e.g., page 0 to chip 0, page 1 to chip 1, etc.). While this may reduce execution time, it forces many chips to be active, similar to random page allocation. The operating system could still pack pages into the minimum number of DRAM chips, but that produces the same DRAM access pattern as no interleaving. It also has the additional disadvantage of potentially using only a subset of large physically indexed caches.
Footnote 1: A DRAM page is one row of an internal DRAM bank.
Footnote 2: RDRAM has 32 internal banks; a maximum of 16 can be accessed in parallel.
Execution-driven simulation results reveal that page-grain interleaving produces Energy Delay values close to random page allocation, as expected. Further experiments that vary the cache block interleaving within a DRAM chip reveal no significant differences among alternatives.
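The difference between the two page-to-chip mappings discussed in this section can be sketched as follows. The function names and the example sizes (16 pages per chip, 4 chips) are illustrative assumptions and do not correspond to our simulated configuration.

```python
def chip_sequential(page, pages_per_chip):
    """No interleaving: chip 0 holds pages 0..N-1, chip 1 holds N..2N-1, etc."""
    return page // pages_per_chip


def chip_page_interleaved(page, num_chips):
    """Page-grain interleaving: pages are assigned to chips round-robin."""
    return page % num_chips


def chips_touched(pages, mapping):
    """Return the set of chips that must be powered to serve these pages."""
    return {mapping(page) for page in pages}


if __name__ == "__main__":
    working_set = range(8)  # eight consecutive physical pages
    print(chips_touched(working_set, lambda p: chip_sequential(p, 16)))
    print(chips_touched(working_set, lambda p: chip_page_interleaved(p, 4)))
```

For a small contiguous working set, the sequential mapping keeps every page on one chip, while page-grain interleaving spreads the same pages over all chips, which is why its Energy Delay resembles that of random allocation.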
CONCLUSION
In this paper, we have built a compelling case for cooperative hardware/software policies that can exploit the power management features offered by new PADRAM memory devices, such as the Rambus RDRAM, to dramatically improve the Energy Delay of main memory. We use trace-driven simulations of a set of personal productivity applications and execution-driven simulation of integer SPEC2000 benchmarks to evaluate static and dynamic hardware policies that determine the power states of each memory chip. We show that statically assigning the nap mode as the base power mode for all memory chips in a system is a successful strategy, achieving an Energy Delay of 15% to 40% of static active mode. We show that power-aware page allocation can improve energy efficiency by up to an additional 30%.
Using power-aware page allocation in conjunction with hardware policies that dynamically adjust the power mode of each individual memory chip based on thresholds of inactivity can provide a 6% to 55% improvement in Energy Delay over the best static hardware policy and an 80% to 99% improvement over a traditional full-power memory system with random page placement.
There are many opportunities left for future work with the PADRAM model of memory devices and especially with the interaction between hardware and software management. Following our belief that energy conservation should become a "first class" design goal for higher levels of system design, many of our plans explore ways to give the OS more explicit control over PADRAM power modes. This may eventually extend into API features that allow some degree of application-level direction of memory power states. Sequential first-touch is a simple page allocation scheme. We may consider other "page coloring" techniques and further explore the movement of pages between chips to improve initial placements based on observed access patterns.
We note that our clustered page allocation has other power-related side effects. It can also be used to reduce DRAM refresh rates. By compacting physical pages into the minimum number of internal memory banks, we can potentially eliminate refresh for entire DRAM banks in which there are no active pages.
The threshold values in our dynamic policy are an important parameter. Unfortunately, using the same threshold value for all programs and all PADRAM chips may not produce the best results. Thus, another possible direction we are exploring is a dynamic policy that attempts to adaptively determine the best threshold values for each chip.
Our dynamic policies have concentrated on the transition into lower power states. Policies that support pretransitioning into higher power states, in anticipation of imminent access in a manner analogous to prefetching, may also have a role to play in improving the Energy Delay metric of some applications.
