The increasing importance of energy efficiency has produced a multitude of hardware devices with various power management features. This paper investigates memory controller policies for manipulating DRAM power states in cache-based systems. We develop an analytic model that approximates the idle time of DRAM chips using an exponential distribution, and validate our model against trace-driven simulations. Our results show that, for our benchmarks, the simple policy of immediately transitioning a DRAM chip to a lower power state when it becomes idle is superior to more sophisticated policies that try to predict DRAM chip idle time.
INTRODUCTION
Energy efficiency is becoming an increasingly important target for optimization in many system designs. Mobile computing devices require techniques to extend battery lifetime, while others must reduce power to meet heat or fan noise limitations (e.g., medical applications). Even desktop and server systems should be energy efficient for economical and environmental considerations. Main memory is consuming an increasing proportion of the power budget and thus motivates efforts to improve DRAM energy efficiency.
DRAM manufacturers are meeting this demand by developing DRAM chips with multiple power states such as active, standby, nap, and powerdown. The chip must be in the active state to service a request. The remaining states are in order of decreasing power consumption but increasing time to transition back to active. Energy efficiency can be improved by placing the chips in a lower power state when not used. The challenge for the system designer is to utilize these modes most effectively.
In our previous work [5], we investigated memory controller policies for making DRAM chip power state transitions in conjunction with software page placement policies. The power-aware page allocation policies exploit working set locality to increase the opportunity for the memory controller to make effective transition decisions.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. ISLPED'01, August 6-7, 2001, Huntington Beach, California, USA. Copyright 2001 ACM 1-58113-371-5/01/0008 ..$5.00
The goal of this work is to understand the characteristics of memory access patterns in a cache-based memory architecture and how those patterns affect the design of controller policies that transition among power states. For a memory system without caches, there is work showing potential benefits of an adaptive policy that attempts to predict the time between consecutive accesses as a basis for deciding when to make transitions [1]. By contrast, we consider the behavior of policies for memory requests generated by representative productivity applications and filtered through a 2-level cache. We consider access patterns produced by random page allocation as well as the sequential first-touch policy previously shown to be effective [5] when used in conjunction with simple power-aware controller policies. The basic question is whether simple policies are adequate to capture the relevant features of cache-filtered accesses.
To characterize memory access patterns, we define the notion of gap as the interval between clustered accesses. We find that most memory traces from our workload, filtered by a 2-level cache, have gap distributions that can be approximated by an exponential distribution. They also have large average gap values (greater than 200ns). A critical parameter in the design of memory controller policies is the length of time spent in the current power state before a transition to a lower state is made. We refer to this as the threshold. We analyze the relationship between threshold values and a model of exponentially distributed memory access gaps. The analytical result shows that, for our benchmarks, the simple instant transition policy (threshold = 0) produces maximum benefit. Finally, we experimentally validate this theoretical conclusion through trace-driven simulation.
The remainder of this paper is organized as follows. In the next section, we provide background on power-aware memory design. We identify the primary factors that characterize memory access patterns and affect the behavior of power control. Then, we introduce our evaluation metric and our method of generating cache-filtered memory traces. Section 4 examines gap distributions and analyzes the relationship between gaps and thresholds. We present simulation results and show how close they are to the theoretical analysis. Section 5 concludes.
BACKGROUND
This section reviews modern DRAM power management features and appropriate memory controller policies for exploiting these features. We also identify the important characteristics of DRAM access patterns and how they interact with memory controller power management policies.
Rambus DRAM
Memory technology has developed to respond to the needs of mobile computer designers to limit power consumption in the face of increasing demand for performance. One concrete example is Direct Rambus DRAM (RDRAM) [7]. The Direct Rambus technology delivers high bandwidth (1.6GB/sec per device), using a narrow bus topology operating at a high clock rate. As a result, each RDRAM chip can be activated independently. RDRAM offers four power modes: active, standby, nap, and powerdown. Because of the narrow topology, each chip can be independently set to an appropriate power state.
An RDRAM device must be in the active state to perform a read or write transaction, which takes 60ns and consumes 300mW. A chip that is not servicing a memory request can be in any of the lower power states. However, these states incur additional delay for clock resynchronization. Standby is fast and uses 60% of the power of active mode. Greater power savings can be achieved by using nap mode (10% of the power of active) with an additional resynchronization time required to transition to the active state in order to service a memory request. Powerdown mode has the minimal power consumption (1% of active), but a significant delay for clock synchronization (100 times that needed by nap mode) to enter the active state. Table 1 shows the power states with the power cost values used in this study as well as the possible transitions and additional transition times into active mode [7, 4].
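As a quick illustration of the figures quoted above, the following Python sketch derives the absolute power of each state from the 300mW active power and the stated percentages. The dictionary and function names are illustrative only, and resynchronization times are left out because the text gives them only in relative terms.

```python
# Power of each RDRAM state, derived from the percentages quoted in the text.
ACTIVE_MW = 300.0  # active-state power from the text

# Fraction of active power drawn in each state (from the text).
RELATIVE_POWER = {
    "active": 1.00,
    "standby": 0.60,
    "nap": 0.10,
    "powerdown": 0.01,
}


def state_power_mw(state: str) -> float:
    """Return the power in mW drawn by one chip in the given state."""
    return ACTIVE_MW * RELATIVE_POWER[state]


if __name__ == "__main__":
    for name in RELATIVE_POWER:
        print(f"{name:10s} {state_power_mw(name):6.1f} mW")
```

Keeping the states in one table makes it easy to swap in other device parameters when exploring a different DRAM part.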
DRAM Power Management
The challenge for the memory controller designer is to utilize these modes effectively. It is not only the availability of these power states but the ability to transition between them dynamically on a per-chip basis that gives the RDRAM its potential for power management. The key for the memory controller policy is to determine when the benefit of transitioning to a low power state is greater than the penalty for transitioning back to the active state.
The time between DRAM accesses is the important characteristic that influences the memory controller policy design. Furthermore, we note that any DRAM chip access that arrives during the service time of the previous access can immediately be serviced and will increase the time the chip is in the active state. We call a sequence of such DRAM chip accesses clustered accesses since the DRAM chip cannot transition to a lower power state. Therefore, it is actually the gap between clustered accesses that matters. The policy we consider is threshold-based (see Figure 1): a given DRAM chip remains in the active state until the gap exceeds a threshold amount of time; it then transitions to the low power state until the next access. The key to this policy is to determine the appropriate threshold value to maximize energy efficiency. However, this depends on the DRAM access characteristics in terms of gap. The following section outlines our methodology for exploring the relationship between gap and threshold.
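The notions of clustered accesses, gap, and threshold can be made concrete with a small Python sketch. It is only an illustration: the function names are hypothetical, and it assumes that an access arriving while the chip is still busy is serviced immediately, so the busy period ends one service time (60ns, from the text) after that arrival.

```python
SERVICE_NS = 60.0  # service time of one access, from the text


def extract_gaps(arrivals_ns, service_ns=SERVICE_NS):
    """Return the idle gaps (ns) between clusters of accesses to one chip.

    Assumption: an access that arrives before the chip finishes its current
    access joins the cluster and keeps the chip busy until one service time
    after its own arrival.
    """
    gaps = []
    busy_until = None
    for t in sorted(arrivals_ns):
        if busy_until is not None and t > busy_until:
            gaps.append(t - busy_until)  # chip was idle: record the gap
        busy_until = t + service_ns      # start or extend the busy period
    return gaps


def split_gap(gap_ns, threshold_ns):
    """Return (active_ns, low_power_ns) for one gap under the threshold policy."""
    if gap_ns <= threshold_ns:
        return gap_ns, 0.0               # next access arrives before the threshold
    return threshold_ns, gap_ns - threshold_ns


if __name__ == "__main__":
    gaps = extract_gaps([0, 30, 500, 1000])
    print(gaps)                          # gaps between the three clusters
    print([split_gap(g, 100.0) for g in gaps])
```

With a threshold of zero, `split_gap` assigns the entire gap to the low power state, which is the instant transition policy discussed in the text.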
METHODOLOGY
To evaluate energy efficiency, we use the Energy Delay product ($E \cdot D$) [2]. This metric captures our goal of achieving high performance (seconds) while minimizing energy consumption (Joules). Although total system energy consumption is important, it is highly dependent on specific design choices (e.g., processor, display type, wireless network interface, etc.). Therefore, we concentrate only on DRAM energy consumption, and ignore the energy consumed by all other system components.
To fully explore the relationship between DRAM access gaps and the memory controller threshold values, we use a combination of trace-driven simulation (described below) and analytic evaluation (see Section 4). The trace-driven simulator is used to both characterize the DRAM access patterns and to validate our analytic model.
The trace-driven simulator processes instruction and data address traces of personal productivity applications [6] and uses a simplified RDRAM model. This simulator models a two-level cache hierarchy with a 16KB L1 and a 256KB L2 cache; both caches are direct-mapped with 32B blocks and can support 8 outstanding misses. Higher associative caches do not qualitatively change our results. We model the individual RDRAM chips and their associated power state, but do not model memory bus contention or the internal DRAM banks. In these studies we only model the transition from the lower power state to active. The transitions from active to lower power states do not incur any delay or energy consumption.
For timing considerations (necessary to compute energy consumption), we use a simplified processor model that executes one instruction per cycle, and never stalls due to long latency operations (i.e., execution only stalls when the maximum number of outstanding misses is reached). We assume a 500MHz processor clock; the level one cache takes 2 cycles to access, while the level two cache incurs an additional 10 cycles. We simulate a non-interleaved main memory system with eight 32Mb RDRAM chips, for a total main memory capacity of 32MB.
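The timing and capacity figures above reduce to simple arithmetic, sketched here in Python. The constant and function names are illustrative; the sketch reads "an additional 10 cycles" as an L2 hit costing 12 cycles in total.

```python
CLOCK_HZ = 500e6        # processor clock from the text
L1_CYCLES = 2           # level one cache access time
L2_EXTRA_CYCLES = 10    # additional cycles for a level two access
NUM_CHIPS = 8
CHIP_MEGABITS = 32      # capacity of one RDRAM chip in megabits


def cycles_to_ns(cycles: int, clock_hz: float = CLOCK_HZ) -> float:
    """Convert processor cycles to nanoseconds."""
    return cycles * 1e9 / clock_hz


l1_hit_ns = cycles_to_ns(L1_CYCLES)                    # L1 hit latency
l2_hit_ns = cycles_to_ns(L1_CYCLES + L2_EXTRA_CYCLES)  # L1 miss, L2 hit
total_megabytes = NUM_CHIPS * CHIP_MEGABITS / 8        # 8 bits per byte

if __name__ == "__main__":
    print(l1_hit_ns, l2_hit_ns, total_megabytes)
```

At a 2ns cycle time, the 60ns RDRAM access quoted earlier corresponds to 30 processor cycles.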
EVALUATION
Recognizing that gap and threshold are two factors that may affect our power control effectiveness, we study the relationship between $E \cdot D$ and these two factors.
DRAM Access Characteristics
The first step is to capture the distribution of access gaps in the execution of different benchmarks within our cache-based architecture. We observe cache misses from each individual memory chip and measure the time between clustered misses. Figure 2 shows the resulting gap distributions and also plots the exponential distribution with the same mean gap size. From Figure 2 we observe that the gap distributions for compress95 and winword match the general shape of the exponential. Table 2 shows the results of applying the Chi-Square test of our observed data to the exponential distribution of gap sizes [3]. The Chi-Square test reveals that three of our applications pass the test at significance level 0.05, whereas three fail this test. Nonetheless, using the exponential as an approximation of the real gap distribution is sufficient as it produces results consistent with simulation (see Section 4.3). Modeling the gap distribution with the exponential allows us to perform analysis and more extensively explore the design space. We assert that the errors inherent in this approximation result in a pessimistic bias in the results.
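One way such a goodness-of-fit check could be carried out is sketched below in Python. This is not necessarily the procedure used for Table 2: the function name, the use of ten equiprobable bins, and the synthetic sample are assumptions made for illustration.

```python
import math
import random


def chi_square_vs_exponential(gaps, bins=10):
    """Chi-square statistic of `gaps` against an exponential with the sample mean.

    Uses `bins` equiprobable bins, so every bin expects len(gaps) / bins
    samples. Returns (statistic, degrees_of_freedom); one extra degree of
    freedom is lost because the mean is estimated from the data.
    """
    n = len(gaps)
    mean = sum(gaps) / n
    observed = [0] * bins
    for g in gaps:
        cdf = 1.0 - math.exp(-g / mean)         # exponential CDF at g
        index = min(int(cdf * bins), bins - 1)  # equiprobable bin index
        observed[index] += 1
    expected = n / bins
    statistic = sum((o - expected) ** 2 / expected for o in observed)
    return statistic, bins - 2


# Standard chi-square table value for 8 degrees of freedom at the 0.05 level.
CRITICAL_0_05_DF8 = 15.507

if __name__ == "__main__":
    random.seed(1)
    sample = [random.expovariate(1 / 400.0) for _ in range(1000)]  # synthetic gaps
    stat, dof = chi_square_vs_exponential(sample)
    print(stat, dof, "pass" if stat < CRITICAL_0_05_DF8 else "fail")
```

A sample passes at the 0.05 level when its statistic is below the critical value. Equiprobable bins keep the expected count per bin constant, which avoids sparsely populated bins in the long tail.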
Analysis
We use the always-active policy as a baseline for comparison. For simplicity, we choose nap as the low power state in our 2-state threshold waiting control policy. Let $g$ denote the gap between clustered memory accesses and $T_h$ be the waiting threshold before transitioning to the low power state. Assuming the memory access gap follows an exponential distribution with mean $\mu$, its density function is:

$$f(g) = \frac{1}{\mu} e^{-g/\mu} \tag{1}$$

Under the threshold policy, each gap changes the mean delay by $\delta d$ and the mean energy by $\delta e$ relative to the always-active policy (Equations 2 and 3). With these per-gap mean delay and energy changes, we compute the change of total $E \cdot D$ product in one run. From Figure 1, the per-gap mean energy consumption with the always-active policy is:

$$e_0 = P_a (t_{acc} + \mu) \tag{4}$$

and the mean delay is:

$$d_0 = t_{acc} + \mu \tag{5}$$

where $P_a$ is the active-state power and $t_{acc}$ is the access time. With the power transition policy, the per-gap mean energy consumption $e$ and mean delay $d$ are:

$$e = e_0 + \delta e, \qquad d = d_0 + \delta d$$

Let $D_0$ and $E_0$ denote the original runtime and energy consumption with the always-active policy, and let $D$ and $E$ be those with power state transitions. Let $n$ be the number of gaps in the same run. Since $E = ne$, $D = nd$, $E_0 = ne_0$ and $D_0 = nd_0$, the total change of $E \cdot D$ is calculated by:

$$\delta(E \cdot D) = E \cdot D - E_0 \cdot D_0 = n^2 (d_0\,\delta e + \delta d\,e_0 + \delta d\,\delta e) \tag{7}$$

We define the per-gap mean change in energy delay product as $\delta(e \cdot d)$:

$$\delta(e \cdot d) = d_0\,\delta e + \delta d\,e_0 + \delta d\,\delta e \tag{8}$$

From Equation 7 we see that the change in total energy delay product $\delta(E \cdot D)$ is linear in the per-gap mean change $\delta(e \cdot d)$. So we use $\delta(e \cdot d)$ as the metric to evaluate the control policy. If it is positive, the policy is worse than the always-active policy; if it is negative, the policy is better. As $\delta(e \cdot d)$ decreases, the benefit increases.
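For readers who want to see how the per-gap mean changes can arise from the exponential model, the following LaTeX sketch derives one plausible form. It is a reconstruction under stated assumptions, not necessarily the paper's Equations 2 and 3. Here $\mu$ is the mean gap, $\delta d$ and $\delta e$ are the per-gap mean changes in delay and energy, and $P_a$ and $P_n$ are the active and nap powers. The symbols $t_r$ (resynchronization delay) and $P_r$ (power drawn while resynchronizing) are introduced only for this illustration. The sketch assumes that the transition into nap is free and that every gap longer than $T_h$ pays one resynchronization on the next access.

```latex
% Probability that a gap outlasts the threshold (exponential with mean \mu):
P(g > T_h) = \int_{T_h}^{\infty} \frac{1}{\mu} e^{-g/\mu}\, dg = e^{-T_h/\mu}

% Mean extra delay per gap: one resynchronization whenever g > T_h.
\delta d = t_r \, e^{-T_h/\mu}

% Mean energy change per gap: nap instead of active for (g - T_h),
% plus the resynchronization energy.
\delta e = \int_{T_h}^{\infty} \bigl[(P_n - P_a)(g - T_h) + P_r t_r\bigr]
           \frac{1}{\mu} e^{-g/\mu}\, dg
         = e^{-T_h/\mu} \bigl[(P_n - P_a)\,\mu + P_r t_r\bigr]
```

Under these assumptions, setting $T_h = 0$ gives $\delta d = t_r$ and $\delta e = (P_n - P_a)\mu + P_r t_r$, which depend on the gap distribution only through its mean.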
From Equations 2-5 and Equation 8 we have:

$$\delta(e \cdot d) = f(\mu, T_h) \tag{9}$$

Now, we can use this analytic result to explore the parameter space. We use the parameter values in Table 1 to model a single RDRAM chip. By substituting these parameter values into our formulas, we obtain Figure 3 and Figure 4. Figure 3 shows $\delta(e \cdot d)$ as a function of $\mu$ (mean gap) with different fixed threshold values. Figure 4 shows $\delta(e \cdot d)$ as a function of threshold with different mean gap values. Since our empirical gap distributions have relatively large average gap values, we first focus on the case where $\mu$ is large. As we can see from the two graphs, when threshold is fixed and $\mu$ is large enough, $\delta(e \cdot d)$ is a monotonically decreasing function of $\mu$, while with fixed $\mu$, $\delta(e \cdot d)$ increases as threshold increases. Threshold 0 produces maximum benefit on $\delta(e \cdot d)$. When $\mu$ is large, the energy savings can overcome the extra transition cost. Because of the memoryless property of the exponential distribution, waiting for a threshold amount of time does not provide any knowledge about the future access. Therefore the instant transition policy is the best policy for the distributions with large mean gaps.
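The memoryless argument can be written out in a few lines of LaTeX. This is the standard property of the exponential distribution, stated here with $\mu$ as the mean gap, $g$ as the gap, and $T_h$ as the threshold; $s$ and $t$ are arbitrary non-negative times used only for the illustration.

```latex
% Survival function of an exponential gap with mean \mu:
P(g > t) = e^{-t/\mu}

% Having already waited s without an access changes nothing:
P(g > s + t \mid g > s) = \frac{e^{-(s+t)/\mu}}{e^{-s/\mu}} = e^{-t/\mu} = P(g > t)

% In particular, the expected remaining idle time after waiting T_h is still \mu:
E[\, g - T_h \mid g > T_h \,] = \mu
```

Time spent waiting in the active state therefore costs energy without improving the estimate of how long the chip will remain idle.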
When $\mu$ is small (the part left of the crossover point in Figure 3, and the line $\delta(e \cdot d)(\mu = 50, T_h)$ in Figure 4), a larger threshold performs better than a smaller threshold but worse than the original always-active policy. This is because with a small mean value most gaps cannot bring a potential energy saving large enough to cover the transition cost. The larger threshold causes fewer power state transitions and thus avoids some resynchronization costs. The always-active (no transition) policy is the best for this case.
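Both regimes can be explored numerically with a short Python sketch. It evaluates the per-gap change in energy delay product as (mean always-active delay) x (energy change) + (delay change) x (mean always-active energy) + (delay change) x (energy change), using an assumed form for the per-gap changes: each gap longer than the threshold spends its remainder in nap and pays one resynchronization. The active power, nap power, and access time come from the text. The resynchronization delay and power are hypothetical placeholder values, so the numbers are illustrative and are not the paper's Table 1 results.

```python
import math

# Values quoted in the text.
P_ACTIVE_MW = 300.0
P_NAP_MW = 30.0       # 10% of active
T_ACC_NS = 60.0

# Assumptions for illustration only (not given in the text).
T_RESYNC_NS = 60.0    # hypothetical nap-to-active resynchronization delay
P_RESYNC_MW = 300.0   # hypothetical power drawn while resynchronizing


def delta_ed(mu_ns: float, threshold_ns: float) -> float:
    """Per-gap mean change in energy-delay product, in mW * ns^2."""
    q = math.exp(-threshold_ns / mu_ns)   # P(gap > threshold)
    delta_d = q * T_RESYNC_NS             # mean extra delay per gap
    delta_e = q * ((P_NAP_MW - P_ACTIVE_MW) * mu_ns + P_RESYNC_MW * T_RESYNC_NS)
    e0 = P_ACTIVE_MW * (T_ACC_NS + mu_ns) # always-active mean energy per gap
    d0 = T_ACC_NS + mu_ns                 # always-active mean delay per gap
    return d0 * delta_e + delta_d * e0 + delta_d * delta_e


if __name__ == "__main__":
    for mu in (50.0, 1000.0):
        for th in (0.0, 100.0, 500.0):
            print(f"mu={mu:6.0f} ns  Th={th:5.0f} ns  delta(e*d)={delta_ed(mu, th):14.0f}")
```

With these placeholder values, a mean gap of 1000ns yields negative results that are most negative at threshold 0, while a mean gap of 50ns yields positive results that shrink toward zero as the threshold grows, matching the qualitative behavior described in the text.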
Validation
To validate our analytic model we use trace-driven simulation of both random and sequential first-touch page allocation policies [5]. By comparing $\delta(e \cdot d)$ obtained from the simulation to that obtained from the model, we can gauge the accuracy of the model. For this analysis we focus only on the transitions from active to nap. (Given our average gap sizes, our analytic results suggest no viable role for the intermediate standby state.) Therefore, with sequential first-touch page allocation, only one DRAM chip is used. In an actual implementation, the unused chips would transition to the powerdown state, independent of the active to nap threshold. For random allocation, although all eight DRAM chips are active, for brevity we consider only the four chips with the smallest average gap. We present results for only the benchmarks compress95 and winword. We note that winword produces the largest error between the simulation and the model. The other benchmarks produce results similar to these two benchmarks.

Figure 5 shows the $\delta(e \cdot d)$ values obtained from both simulation and the model. Table 3 shows the raw simulation data and the relative difference between the model and the simulation for compress95. Each row of the table corresponds to one DRAM chip for the given threshold, hence only one chip per threshold with sequential first-touch and four chips per threshold with random page allocation. From these results we see that for compress95, in most cases, the results from the model are within 5% of the simulation results. For winword, the error is larger, approaching 50% in some cases. Although this error seems large, the qualitative result is the same for both simulation and the model: zero threshold performs best, and increasing the threshold decreases the energy benefits.
We note that the relative difference increases as the threshold increases, and is generally larger for sequential first-touch page allocation than for random page allocation. For sequential first-touch allocation, the absolute $\delta(e \cdot d)$ values are small, so small changes produce large relative errors. This page allocation policy benefits mostly from the unused chips entering powerdown. Nonetheless, a zero threshold is the best solution for both the simulation results and the model. The increasing relative difference as the threshold increases is due to the approximated gap distribution differing from the real distribution. For larger thresholds this difference is magnified, while for the zero threshold $\delta(e \cdot d)$ is distribution independent (see Equations 2-3).
CONCLUSION
Modern DRAM chips provide power management features to help meet the increasing demand for energy efficient computing. The challenge is to develop memory controller policies that best exploit these features. This paper explores DRAM power management policies for cache-based systems using analytic modeling validated with trace-driven simulation. Our results reveal that, for most workloads on cache-based systems, DRAM chips should immediately transition to a lower power state when they become idle and will not benefit from sophisticated power management policies.
