Location Cache Design and Performance Analysis for Chip Multiprocessors
Jason Nemeth, Rui Min, Wen-Ben Jone, Senior Member, IEEE, and Yiming Hu, Senior Member, IEEE
Abstract-Recent research at Intel suggests that chips with hundreds of processor cores are possible in the not-so-distant future. As the number of cores grows, so does the size of the cache systems required to allow them to operate efficiently. Caches have grown to consume a significant percentage of the power utilized by a processor. In this research, we extend the concept of location cache to support chip multiprocessor (CMP) systems in combination with low-power L2 caches based upon the gated-ground technique. The combination of these two techniques allows for reductions in both dynamic and leakage power consumption. In this paper, we present an analysis of the power savings provided by utilizing location caches in a CMP system. The performance of the cache system is evaluated by extending the capability of CACTI and Simics using the SPLASH-2 and ALPBench benchmark suites. These simulation results demonstrate that the utilization of location caches in CMP systems is capable of saving a significant amount of power over equivalent CMP systems that lack location caches.
Index Terms-Cache architecture, dynamic and leakage power dissipation, gated ground technique, location cache, low-power design, power analysis.
I. INTRODUCTION

IN RECENT years, microprocessor companies have had difficulty in increasing the performance of CPUs by simply increasing their clock frequencies. Research has moved to parallelism in an effort to maintain performance increases [1], [2]. It is now increasingly common for multiple processor cores to be included on a single silicon die, creating what is called a chip multiprocessor (CMP). A CMP typically contains multiple cores operating at the same clock frequency, and those cores tend to share at least part of their cache system on the chip. For example, an Intel Xeon MP processor contains a pair of processor cores, and includes a cache system consisting of private L1 and L2 caches in addition to a large L3 cache that is shared between the cores [3].
As cache systems have grown in size to satisfy the additional needs of these CMP systems, so has the amount of power they consume. Several techniques, such as subbanking [4], [5], bitline segmentation [4], and phased cache [5]-[7], are commonly used to reduce the amount of dynamic power used by a cache.
While the dynamic power used for read and write accesses still plays a large part in overall power usage, leakage has grown to dominate the power consumed by the cache system [8]. While continually reducing the process size increases speed and reduces area, it also increases the sub-threshold leakage power consumed by the chip. In addition, the increased power consumption may also lead to thermal issues on the chip, and design must proceed carefully in order to eliminate potentially damaging hot spots. Several different techniques, such as gated-Vdd [9], drowsy caches [10], and DRG-caches [11], have been presented to reduce sub-threshold leakage power.
A location cache is a small direct-mapped cache that stores information relating an address to its location in the target cache [12] . This capability can save dynamic power upon cache reads and writes. More importantly, this behavior is capable of being exploited when used in combination with low-leakage techniques to save a significant amount of leakage power. In this work we propose a method of extending the concept of location cache to support CMP systems. We utilize CACTI 5.0 [13] to provide both dynamic and leakage power measurements for all the caches in this work. CACTI does not natively support location caches or caches utilizing low leakage techniques, so it is extended to support these cache architectures. Simics [14] is extended to provide the capability for cycle-by-cycle power estimation for traditional caches, location caches, and gated-ground caches. These extensions provide a framework for testing a variety of cache systems by running benchmarks on actual operating systems. Models of both the location caches and the gated-ground low leakage L2 caches are created and simulated using CACTI and Simics. The power utilization of the cache system is presented for a number of possible configurations using the SPLASH-2 and ALPBench multithreaded benchmark suites, and a discussion of the results is provided.
This work contains three major contributions: 1) the concept of location cache proposed for a single-core processor is extended to CMPs with coherency mechanisms well considered; 2) design parameters such as private versus shared location caches and the number of location cache entries are fully explored and evaluated to further understand the behavior and design tradeoff of a location cache system; 3) potential power savings of using location caches on a CMP system are simulated for the design space explored.
Section II presents the location cache design and working principle for single-core processors. The extension of location cache to multicore systems using private or shared location caches is presented in Section III. Low-leakage (L2) cache design and its location cache support are explained in Section IV. Power simulation for a cache system supported by location cache is discussed in Section V. Section VI gives the experimental results. Finally, Section VII concludes this paper.
II. BACKGROUND
This section discusses the working principle and design of the location cache concept [12], [15] for a single-core processor. In this work, we assume that L2 is the highest level cache.
A. Structure of Location Cache
The location cache shown in Fig. 1 is a small direct-mapped cache, using address affinity information to provide the accurate location information for L2 cache [12] . The proposed location cache technique reduces the L2 cache power consumption, when compared with a conventional set-associative L2 cache. Depending on the L2 cache architecture, a location cache can be physically addressed or virtually addressed. Fig. 1 illustrates a revised L2 cache system architecture with a location cache, which is physically addressed.
In this physically addressed cache system, the location cache is physically addressed as well. It caches the access way location information of the L2 cache (the way number in one set where a memory reference falls). This cache works in parallel with the L1 cache. As a location cache tries to cache the L2 location information, the block address (composed of the index address and the tag address) of the location cache should be of the same length as that of the L2 cache. For instance, in Intel Itanium 2 the physical address is 50 bits and the L2 cache block size is 128 bytes (instead of 64 bytes of block size for L1), so the block address of the location cache has 43 (50-7) bits. If the location cache has 512 entries (i.e., the index contains 9 bits), then each tag array entry will have 34 (43-9) bits.
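The field widths in the Itanium 2 example follow directly from the address width, the L2 block size, and the number of location cache entries. The sketch below reproduces that arithmetic; the function name and the dictionary keys are illustrative only and are not part of the original design, and both sizes are assumed to be powers of two.

```python
def location_cache_fields(phys_addr_bits, l2_block_bytes, lc_entries):
    """Return the bit widths of the location cache address fields.

    Hypothetical helper: both sizes are assumed to be powers of two.
    """
    offset_bits = l2_block_bytes.bit_length() - 1   # log2(block size)
    index_bits = lc_entries.bit_length() - 1        # log2(number of entries)
    block_addr_bits = phys_addr_bits - offset_bits  # index + tag
    tag_bits = block_addr_bits - index_bits
    return {
        "offset": offset_bits,
        "block_address": block_addr_bits,
        "index": index_bits,
        "tag": tag_bits,
    }


# Itanium 2 example from the text: 50-bit address, 128-byte L2 blocks, 512 entries.
print(location_cache_fields(50, 128, 512))
# → {'offset': 7, 'block_address': 43, 'index': 9, 'tag': 34}
```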
B. Working Principle of Location Cache
The proposed cache system works in the following way. The location cache is accessed in parallel with the L1 cache. If the L1 cache sees a hit, then the result obtained from the location cache is discarded. If there is a miss in the L1 cache and a hit in the location cache, the L2 cache is accessed as a direct-mapped cache and the access power of the L2 cache will be greatly reduced. When both the L1 cache and the location cache see a miss, the L2 cache is accessed as a conventional set-associative cache and the content (i.e., the new way information) of the location cache is updated. When the location cache stores the location (way) information of the L2 cache, it uses the same block address as the L2 cache, instead of the L1 cache. As opposed to the way-prediction methods [16] - [18] , the cached location is not a prediction. Even if there is a location cache miss, we do not see any extra delay penalty as seen in way-prediction caches.
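The access flow above can be summarized as a small behavioral model. This is only a sketch under simplifying assumptions: the class and function names are hypothetical, the location cache is modeled as a list of (tag, way) pairs, and timing and power are not modeled.

```python
class LocationCache:
    """Direct-mapped table mapping an L2 block address to its L2 way."""

    def __init__(self, entries, l2_block_bytes):
        self.entries = entries
        self.l2_block_bytes = l2_block_bytes
        self.lines = [None] * entries  # each line holds (tag, way) or None

    def _split(self, addr):
        block = addr // self.l2_block_bytes  # same block address as L2
        return block % self.entries, block // self.entries  # (index, tag)

    def lookup(self, addr):
        """Return the cached way on a hit, or None on a miss."""
        index, tag = self._split(addr)
        line = self.lines[index]
        if line is not None and line[0] == tag:
            return line[1]
        return None

    def update(self, addr, way):
        """Record the way found by a conventional set-associative L2 access."""
        index, tag = self._split(addr)
        self.lines[index] = (tag, way)


def l2_access_mode(l1_hit, lc_way):
    """Decide how the access proceeds after the parallel L1 / location cache lookup."""
    if l1_hit:
        return "l1_hit"               # location cache result is discarded
    if lc_way is not None:
        return "l2_direct_mapped"     # way is known, only one way is probed
    return "l2_set_associative"       # conventional access, then update the table
```

For example, after `lc.update(0x1000, 3)` a later L1 miss to any byte of that 128-byte L2 block yields `lc.lookup(...) == 3`, so the L2 access can proceed in direct-mapped fashion.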
Normally, the block size of the L2 cache is larger than that of the L1 cache; for instance, in Intel Itanium 2, the L1 block size is 64 bytes while the L2 block size is 128 bytes. Due to this difference in L1 and L2 block sizes, the location cache can catch many references which are L1 misses but location cache hits. For example, in Intel Itanium 2 the physical address is 50 bits and the L2 (L1) cache block size is 128 (64) bytes, as discussed above. Given an address, the L1 (L2) cache interprets it as 38 (35) bits of tag, 6 (8) bits of index, and 6 (7) bits of block offset. Assume one byte is to be accessed whose address ends in the seven bits "0111111" in binary form. The entire L1 block (64 bytes) containing this byte is then brought from the L2 cache into the L1 cache, and the corresponding way location of this access is stored in the location cache. In the next memory access, assume the next byte is accessed by the CPU, so that the last seven bits of the address are "1000000". For the L1 cache, the index has changed by one bit and an L1 cache miss might occur. For the location cache, however, the index is unchanged, so the location cache hits and supplies the way for the L2 access. The index field of the location cache does not change for the new address because the location cache uses the same block address as the L2 cache.
Even when the L2 block size is the same as the L1 block size, the location cache can still hit on many memory accesses that miss in L1. This is because the number of location cache entries (e.g., 128) can be made larger than that of the L1 cache (e.g., 64) without exceeding the access time of the L1 cache: the location cache data array (which contains the way information only) and tag array are both smaller than the L1 data array (which contains data or instructions) and tag array. As a result, the location cache can still catch many L1 misses due to its larger number of entries when the block size in the L1 cache is the same as that in the L2 cache.
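The two-address example can be checked numerically. In this sketch the two addresses are taken to be exactly the seven-bit patterns from the text (all higher address bits assumed zero), and `block_number` is an illustrative helper rather than part of the design.

```python
L1_BLOCK_BYTES = 64    # Itanium 2 L1 block size from the text
L2_BLOCK_BYTES = 128   # Itanium 2 L2 block size from the text


def block_number(addr, block_bytes):
    """Block address seen by a cache with the given block size."""
    return addr // block_bytes


first = 0b0111111      # address ending in "0111111"
second = first + 1     # next byte, address ending in "1000000"

l1_blocks = (block_number(first, L1_BLOCK_BYTES), block_number(second, L1_BLOCK_BYTES))
l2_blocks = (block_number(first, L2_BLOCK_BYTES), block_number(second, L2_BLOCK_BYTES))

print(l1_blocks)  # → (0, 1): different L1 blocks, so the second access may miss in L1
print(l2_blocks)  # → (0, 0): same L2 block, so the location cache entry still matches
```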
III. LOCATION CACHES ON CMP SYSTEMS
Previous works utilizing location caches have been limited to single processor systems [12] . With multicore chips becoming increasingly prevalent, the concept of location cache needs to be adapted to these new types of systems. The following discussions describe two approaches to creating location caches capable of functioning within CMP systems.
A. Shared Location Caches in CMP Systems
The most straightforward approach to adding a location cache to a CMP system involves sharing. Multicore processors commonly share an L2 or L3 cache among all of the cores. Similarly, it is possible to create a single location cache capable of being accessed by each of the cores in the system if those cores share a cache at some level. For example, if all four cores share a single L2 cache, these cores can be served by a single location cache. If those four cores instead share a pair of L2 caches, a shared location cache approach can be implemented in two different ways: 1) cores sharing the same L2 cache also share the same location cache as shown in Fig. 2 and 2) all cores share only a single location cache. In the following discussions, by the term a location cache, we mean a location cache design which might have more than one location cache depending on the detailed implementation.
In the case where the highest-level cache is L2, the location cache operates on every access initiated by every processor core it serves. The source of the access is completely disregarded, and only the transaction's address is taken into consideration. We focus only on the cache system shown in Fig. 2, with four cores sharing two location caches, since this architecture has been applied in commercial multicore processors. Though this architecture is in fact a semi-shared location cache system, for each L2 cache it is a purely shared location cache system. Having all cores share a single location cache is not practical, because this architecture suffers from location cache line replacement problems, as will be discussed later. That is, too many cores compete for a limited number of location cache lines, which makes the single location cache almost useless. Further, the single location cache would have to store extra information, such as which L2 cache holds a given line.
Assume Core 0 initiates a memory access. Its L1 cache and shared location cache LCache0 parse the tag and index information and check for matches. If the L1 cache hits, the result of the location cache access is ignored and L1 returns the requested data to the core. If the L1 cache misses, the result of the location cache access determines how to proceed. If the location cache also hits, the way information stored in the location cache is used to access the L2 cache as if it were direct-mapped. If the location cache hits, a hit in L2 is guaranteed. If the location cache misses, L2 is accessed in its normal set-associative manner and the new way information is provided to the location cache for future use. Note, however, that a miss in the location cache would not necessarily indicate a miss in the L2 cache. Now assume Core 1 attempts to access the same memory address, and it is not found in its L1 cache. The way information for this address was previously stored in the location cache by Core 0, and can now be used to access L2 as a direct-mapped cache.
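The Core 0 / Core 1 scenario can be traced with a toy model. This is a sketch only: the class name, the use of block addresses instead of full addresses, and the choice of way 0 as the fill way on an L2 miss are all assumptions made for illustration.

```python
class SharedCacheSystem:
    """Toy model: private L1 per core, one shared location cache, one L2."""

    def __init__(self, num_cores, lc_entries, l2_contents):
        self.l1 = [set() for _ in range(num_cores)]  # block addresses held per core
        self.lc_entries = lc_entries
        self.lc = {}                                 # index -> (tag, way)
        self.l2 = dict(l2_contents)                  # block address -> way in L2

    def access(self, core, block):
        index, tag = block % self.lc_entries, block // self.lc_entries
        if block in self.l1[core]:
            return "L1 hit"                          # location cache result ignored
        if self.lc.get(index, (None, None))[0] == tag:
            mode = "L2 direct-mapped"                # way already known
        else:
            mode = "L2 set-associative"
            # Hypothetical fill way 0 if the block was not yet in L2.
            way = self.l2.setdefault(block, 0)
            self.lc[index] = (tag, way)              # remember the way for later
        self.l1[core].add(block)
        return mode


system = SharedCacheSystem(num_cores=2, lc_entries=8, l2_contents={42: 5})
print(system.access(0, 42))  # → L2 set-associative (Core 0 populates the entry)
print(system.access(1, 42))  # → L2 direct-mapped (Core 1 reuses Core 0's entry)
```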
Let us consider another example. Assume that the L1 caches of both Core 0 and Core 1 have cached the same line of data. Core 0 now performs a write to this address, changing the data. At this point the MESI [19] protocol triggers the L1 of Core 1 to change the corresponding line's state from Shared to Invalid, and the newly updated line in Core 0's L1 moves from Shared to Modified. Note that the corresponding line in the location cache does not need to be removed or modified, as it still points to the correct location in L2. If Core 1 now tries to access this address again, it will find that its own copy in L1 is marked Invalid. It can now use the shared location cache, which still knows the location of the line in L2, to access L2 as if it were a direct-mapped cache. In this case a location cache can be very useful for programs that require multiple cores accessing the same memory location.
Besides the simplicity of implementation, the other advantage of this configuration is that it lacks any coherency issues. Since all of the memory transactions sharing the same L2 cache pass through the same location cache, no additional implementation changes are required to keep the location cache and its target cache coherent. In fact, the location cache associated with each L2 cache just records whether a specific line is in L2 and, if so, its way number. The coherency mechanisms implemented in L1 and L2 still serve their purposes. Thus, the function of the location cache and the coherency protocols implemented in L1 and L2 are orthogonal. Assume again that the L1 caches of both Core 0 and Core 1 have cached the same line of data, and that its tag and way (in L2) are stored in one entry of the shared location cache. Now, assume Core 0 performs a read operation which causes the eviction of this data line from L2. The MESI protocol triggers the L1 of Core 1 to change the line's state from Shared to Invalid; further, the new data will be brought into L2 and then to L1 of Core 0. The new tag and way (for L1 of Core 0) will be written into the same location cache entry. Later, when Core 1 tries to access the evicted data line, both the L1 of Core 1 and the location cache miss due to tag mismatches. From the above example, it is clear that the function of the coherency protocol mechanisms implemented in L1 and L2 and the function provided by the location cache are orthogonal, and will not interfere with each other.
Similarly, when a cache system utilizes a pair of L2 caches, as in the case mentioned above, as long as the L2 caches remain coherent with each other, the location caches will also remain coherent. If the L2 coherency is handled by the MESI protocol, no modifications to the location caches are required to allow them to support such a configuration. That is, the location caches do not need to be equipped with MESI protocol bits to remain coherent with both L2 caches. Further, the coherency issue for L1 caches sharing the same location cache is completely taken care of by the MESI protocol implemented for the L1 caches.
While this setup is easy to implement, it does have drawbacks. An access initiated by Core 0 is likely to overwrite way information in the location cache that will be used again by Core 1 in the near future. In addition, with multiple cores expecting complete accesses to a location cache, the location cache will need a read and write port for every core supported. Simultaneous accesses to the same line in the location cache will increase latency, and reduce the efficiency of the location cache itself. Resolving these issues will increase both the complexity and power consumption of the location cache.
Another concern is that of replacement. Four cores running four different programs could create a great degree of churn in the limited number of location cache lines. The hit rate of a location cache would suffer as the unrelated accesses from cores sharing the location cache are constantly replacing each other. While this behavior can be relieved somewhat by increasing the size of the location cache, this unwanted behavior will always exist to some degree.
B. Private Location Caches in CMP Systems
A cache system utilizing a private location cache for each core is shown in Fig. 3. When used in a CMP system, the simplistic location cache design is prone to incoherency when multiple location caches assist a single target cache (e.g., L2). For example, if Core 0 writes a value into its L2 cache, the way information is then stored in Core 0's location cache (LCache0). Later, if Core 1 writes a value into the L2 cache it shares with Core 0 and evicts this previous entry from L2, Core 0's location cache now points to data that is no longer present. Due to this possibility, it is necessary to extend the concept of location cache presented above for a CMP system. In the following discussions, by the term a location cache, we mean a location cache design which might have more than one location cache depending on the number of cores.
We propose extending the location cache design with a simplified version of the MESI protocol. This modification can be performed by adding only a single additional bit to each line in the location cache. This bit will determine whether the location cache line is in a Shared or Invalid state as shown in Fig. 4 . A line in a location cache is marked Invalid if the line it references is no longer present in the target cache or if the line has not yet been written to. In all other cases, the line is marked Shared. When a location cache lookup is performed, in order for a location cache hit to occur, the following two conditions must be satisfied:
1) requested line must be present in the location cache;
2) requested line must be marked Shared.
This alteration does not come without a cost. In order to perform this operation, additional care needs to be taken when lines are evicted from the target cache. When such an eviction occurs in the target cache, the tag and index portion of this reference is passed to each of the connected location caches. The location caches then check if they contain lines matching the newly-evicted entry. If the evicted transaction address is not present in other location caches, no further operation is performed. However, if the evicted transaction address is present in the other location caches, the lines in the location caches are marked Invalid by transition "LOC_INV" (i.e., location cache invalid) as shown in Fig. 4. This will prevent future use of these location cache lines, which will be overwritten at the next opportunity.
The operation of location caches here is a little different from that of the shared case. When the way information is stored for a transaction initiated by Core 0, for example, it is stored in Core 0's private location cache. This location cache, LCache0, cannot be read from or written to by Core 1. This alleviates the problem where the cores constantly overwrite each other's cache information, and results in increased location cache hit rates. However, suppose Core 0 has way information for a transaction stored in its private location cache. Now Core 1 performs an access that ultimately results in the line pointed to by Core 0's location cache being evicted from L2. Core 0's location cache now points to an address that no longer exists in L2. Here our coherency protocol requires L2 to transmit the address of the evicted L2 line to each of its connected location caches. If that address is present in any line of a location cache, the location cache line is marked Invalid. This ensures that our private location caches remain coherent. It should be noted that the location cache line is updated and changed to the Shared state only when a message from the L2 cache is received back at the L1 cache. The way information of the particular block in the L2 cache is sent along with the message that is sent from the L2 cache to the L1 cache. The update occurs when any data message is received by the L1 cache from the L2 cache. Now assume the L1 caches of Core 0 and Core 1 contain the same line of data, and their private location caches have stored the way information for the line's location in L2. Then, Core 0 writes new data into the line, causing the status of the line to move from Shared to Modified in Core 0's L1 cache, and from Shared to Invalid in Core 1's L1 cache. However, since the line is not evicted from L2, its way information retains the Shared designation in both Core 0's and Core 1's location caches.
When Core 1 tries to access the line again, it will find that it is marked Invalid in its L1 cache, and Shared in its location cache. Thus, it can access L2 using the known way information that remained in the location cache.
This extension is powerful in that it allows several location caches to be utilized against a single target cache. Since two states are added, Shared and Invalid, only a single additional bit of storage for each line is required in a location cache. Combined with the fact that a location cache can be efficient with a very small number of lines [12] , very little overhead is created by using this modification. Similar to the case of shared location caches in a CMP system, if the L2 coherency is handled by the MESI protocol, no protocol bits are required in the location caches for L2 data coherency. In summary, for private location caches, the protocol bit added in each location cache line is used to indicate the (non-)existence of data in L2; the MESI protocol bits used for L1 and L2 will take care of L1 and L2 data coherency issues automatically.
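The private-cache protocol described in this section can be sketched as follows. The class and function names are hypothetical, block addresses stand in for the tag and index forwarded by L2, and only the single Shared/Invalid bit and the LOC_INV transition are modeled.

```python
class PrivateLocationCache:
    """Direct-mapped location cache with one Shared/Invalid bit per line."""

    def __init__(self, entries):
        self.entries = entries
        self.tags = [None] * entries
        self.ways = [None] * entries
        self.shared = [False] * entries  # False = Invalid (also the reset state)

    def lookup(self, block):
        """Hit only if the tag matches AND the line is marked Shared."""
        i = block % self.entries
        if self.shared[i] and self.tags[i] == block // self.entries:
            return self.ways[i]
        return None

    def fill(self, block, way):
        """Called when a data message from L2 carries the way information."""
        i = block % self.entries
        self.tags[i], self.ways[i], self.shared[i] = block // self.entries, way, True

    def loc_inv(self, block):
        """LOC_INV: invalidate the line if it refers to the evicted L2 block."""
        i = block % self.entries
        if self.tags[i] == block // self.entries:
            self.shared[i] = False


def on_l2_eviction(block, location_caches):
    """L2 forwards the evicted block address to every connected location cache."""
    for lc in location_caches:
        lc.loc_inv(block)
```

An L1 write that moves a line from Shared to Modified or Invalid never calls `on_l2_eviction`, which mirrors the observation above that L1 state changes leave the location cache entries untouched.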
IV. LOW LEAKAGE CACHE DESIGN AND LOCATION CACHE SUPPORT
The gated-ground technique was developed to reduce the amount of leakage power consumed in large cache systems [11] . Currently, neither gated-ground caches nor location caches are supported by the widely-used CACTI cache power estimation tool. In this section we will discuss the utilization of location caches in combination with a gated-ground cache, as well as the modifications made to CACTI to support such an architecture.
A. Support for Gated-Ground Caches
In its default form, CACTI calculates the leakage power in normal mode. Many modern processors utilize the gated-ground technique to reduce leakage power consumed by these increasingly large caches. The following discussions explain the changes made to allow CACTI to compute power estimates for these types of caches.
1) Calculation of Leakage Power: CACTI was modified to allow the computation of gated-ground leakage by utilizing a new scaling factor, $R_{leak}$. Computation of this ratio between gated-ground leakage and normal mode leakage was accomplished using HSpice. A 65-nm process was utilized to construct and measure the leakage through a single cell operating in the normal mode. In addition, a row of 16 cells was constructed to share a single gated-ground transistor sized as a multiple of $\lambda$, where $\lambda$ is 32.5 nm for our 65-nm process, and the leakage through one of these cells was computed for comparison to the normal mode

$$R_{leak} = \frac{P_{leak}^{gated}}{P_{leak}^{normal}} \qquad (1)$$

The ratio of the leakage between gated-ground and normal modes, shown in (1), was found to be 0.114227. Thus, a cache in gated-ground mode consumes just 11% of the leakage power consumed by the same cache in normal mode. This value is multiplied by the leakage power obtained directly from CACTI to arrive at an approximation of leakage power consumed by the cache in gated-ground mode. Note that the virtual ground voltage under this experimental setup is 0.25 V, while the normal power supply is 1.1 V.
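As a minimal sketch of how this scaling is applied, the measured ratio can simply be multiplied by whatever normal-mode leakage figure CACTI reports. The constant and function names below are illustrative and do not correspond to identifiers in the CACTI source.

```python
# Ratio of gated-ground to normal-mode leakage reported in the text (HSpice, 65 nm).
GATED_GROUND_LEAKAGE_RATIO = 0.114227


def gated_ground_leakage(normal_leakage_power):
    """Approximate gated-ground leakage from a normal-mode leakage value.

    The unit of the result is the unit of the input (e.g., watts).
    """
    return GATED_GROUND_LEAKAGE_RATIO * normal_leakage_power


def leakage_saving_fraction():
    """Fraction of normal-mode leakage avoided while a cell is asleep."""
    return 1.0 - GATED_GROUND_LEAKAGE_RATIO
```

With this ratio, a sleeping cell avoids roughly 89% of its normal-mode leakage, which matches the "just 11%" figure quoted above.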
2) Calculation of Activation Energy: In order to provide an accurate measure of the power used in a gated-ground cache, it is necessary to compute the energy consumed when a cache line is activated. CACTI provides no built-in capability for calculating the activation energy of a gated-ground cache, so a modification had to be devised. In order to ensure that the computed activation energy directly correlated with the rest of our results, we could not simply use the measurement from HSpice. Instead, a method was developed to directly relate the activation energy to a known value that CACTI could already compute.
Activation energy is closely related to leakage energy. In Fig. 5(a), the leakage current through an SRAM cell in gated-ground (sleeping) mode is shown. Fig. 5(b) shows the same cell as it is being activated. These two figures illustrate that the leakage energy and the activation energy consumed by the cell during the wake-up phase are closely related. The activation energy is nothing more than an increased amount of leakage energy, and it is for this reason that we have chosen to base our calculation of the activation energy on a memory cell's leakage energy.

As in the previous section, a single normal mode cell and a set of 16 cells sharing a gated-ground transistor are examined using HSpice. The wake-up time of the gated-ground array was determined to be approximately 0.3 ns. Then the amount of leakage energy that is consumed during this period was computed. These values were used to compute a ratio between the activation energy and the leakage energy during the wake-up period, as shown in (2)

$$R_{act} = \frac{E_{act}}{E_{leak}^{normal}} \qquad (2)$$

The ratio between the activation energy and the normal mode leakage was determined to be approximately 2.828. Using this value, we were able to modify CACTI such that it could compute the activation energy of a cache line in the gated-ground mode. This was accomplished by multiplying the normal mode leakage calculation by our ratio, $R_{act}$, resulting in (3)

$$E_{act} = R_{act} \cdot E_{leak}^{normal} \qquad (3)$$

The result of this equation is the activation energy of a single line of the cache that is directly correlated to the rest of the CACTI power output.
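The activation-energy scaling can be sketched in the same way. Two assumptions are made here that the text does not state explicitly: the normal-mode leakage energy is taken as leakage power multiplied by the 0.3-ns wake-up time, and all names are illustrative rather than CACTI identifiers.

```python
# Values reported in the text.
ACTIVATION_TO_LEAKAGE_RATIO = 2.828   # activation energy / normal-mode leakage
WAKE_UP_TIME_S = 0.3e-9               # approximate wake-up time of the gated array


def leakage_energy_during_wakeup(normal_leakage_power_w, wake_up_time_s=WAKE_UP_TIME_S):
    """Assumed model: normal-mode leakage energy over the wake-up period (joules)."""
    return normal_leakage_power_w * wake_up_time_s


def activation_energy(normal_leakage_energy_j):
    """Activation energy of a cache line, scaled from normal-mode leakage energy."""
    return ACTIVATION_TO_LEAKAGE_RATIO * normal_leakage_energy_j
```

Because both factors multiply a quantity that CACTI already computes, the resulting activation energy stays consistent with the rest of the tool's power output, which is the motivation given above for not using the raw HSpice measurement.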
B. Location Cache Support
CACTI provides no built-in support for location caches, so CACTI 5.0 was revised to support such a model. The cache configuration generated by CACTI lends itself well to the inclusion of a location cache. CACTI was modified to allow activation of the cache on a subarray by subarray basis, allowing a greater degree of control over the portions of the cache which will be activated.
CACTI places data in a given line by evenly dispersing the data bits among an entire row of subarrays, as shown in Fig. 6 where there are 12 subarrays or 3 mats equivalently. The shaded portions of the subarrays represent the cells that make up a single cache line access in an L2 cache, for example. The H-Tree configuration of the cache allows signals to propagate through all cells simultaneously, resulting in an architecture that is resistant to hot spots. However, this configuration is not conducive to power savings when a location cache is connected. Since the data is dispersed through the row of subarrays, each subarray must be activated in all cases.
We propose a change that involves packing the entire cache line into a single subarray, as shown in Fig. 7. Here an access that used to require bits from six subarrays now involves data from only a single subarray. While such an implementation in an L1 cache may lead to hot spots in certain areas of the chip, L2 is accessed with such relative infrequency that hot spots can safely be avoided [20]. In addition, when CACTI is not using bitline multiplexing, such a modification will not increase chip area due to the fact that CACTI places a sense amplifier for each bit line [13]. This modification allows for significant power savings in the subarrays, as well as minor savings in the predecoder blocks. The major power saving results from the following two aspects.
1) Instead of activating all subarrays in the first row as shown in Fig. 6, only one subarray as shown in Fig. 7 needs to be activated due to the guidance of a location cache. Therefore, switch (dynamic) power can be saved.
2) Since only one subarray is activated, all other subarrays can be maintained in sleeping mode to save leakage power.
Thus, both dynamic power and leakage power can be greatly saved, when the cache access is hit by the location cache. Note that the number of subarrays that need to be activated when the location cache hits depends on the actual cache configuration as will be discussed later.
In the CACTI code, many power calculations rely on the computation of the number of active mats, where a mat in Figs. 6 and 7 is a group of four subarrays sharing a single predecoder. By default, CACTI calculates the number of active mats as in (4)

$$N_{mats} = \frac{N_{dwl}}{2} \qquad (4)$$

If we assume a cache with an Ndwl of 6 as shown in Fig. 6, three mats will be activated. Note that Ndwl is the number of divisions of the wordlines, which stipulates the number of subarrays arranged along the entire width of the array. Further, CACTI has the following design parameters: Ndbl is the number of divisions in the bitlines, which is in fact double the number of subbanks present in the array; Nspd defines the number of sets (where each set contains a number of ways equal to the associativity of the cache) per cache line (see Fig. 8), where fractional numbers indicate that a set is spread out across multiple cache lines [13].
Our modification allows us to activate mats or subarrays individually. Ndwl, the number of divisions in the wordline, also indicates how many subarrays are configured across a row of the L2 cache. Given that a mat is two subarrays wide, the number of mats present along a row of the cache is calculated using (4). CACTI assumes that this entire row of mats must be activated during an access. Recall that Nspd is the number of sets present on a given line in the L2 cache. Therefore, the number of subarrays required to store a single set of data is shown in (5)

$$N_{subarrays} = \frac{N_{dwl}}{N_{spd}} \qquad (5)$$

Given that there are four subarrays per mat, the number of mats required to store a single set is

$$N_{mats} = \frac{N_{dwl}}{4 \cdot N_{spd}} \qquad (6)$$

For example, CACTI determines the optimal configuration of the 16-way set-associative 2 MB L2 cache used in this work to be the Ndwl, Ndbl, and Nspd values shown in Fig. 8. Applying this configuration to the default CACTI calculation shown in (4) results in 4 mats utilized during an access. In Fig. 8, because Nspd is fractional, the entire set of cache data is distributed into two rows of subarrays. According to (5), the set of data is distributed into 16 subarrays, which is 4 mats by (6). In the case of a location cache hit, only a single way from the set needs to be accessed because the way number for this address in L2 was stored during a prior transaction. Due to the packing structure described above, we can divide the total number of subarrays used during a location cache hit by the associativity $A$ of the cache to determine the number of subarrays required to access a single way of data. Incorporating this into (6) results in (7), the new equation utilized by CACTI to determine the number of mats to access upon a location cache hit

$$N_{mats}^{hit} = \frac{N_{dwl}}{4 \cdot N_{spd} \cdot A} \qquad (7)$$

For a location cache hit, (7) results in a value of 0.25, implying that only a single subarray must be accessed (0.25 mats). The power overhead to enable a cache access at the subarray level is limited and can thus be ignored. The bulk of the savings can be attributed to the reduction of cells being accessed, which is included in the power computation for the bitlines in CACTI.
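The mat and subarray counts quoted above can be reproduced with a few lines of arithmetic. In this sketch the formulas are our reading of the relations described in the text (a row is Ndwl/2 mats wide, a set occupies Ndwl/Nspd subarrays, a mat holds four subarrays, and a location cache hit divides by the associativity). The values Ndwl = 8 and Nspd = 0.5 are assumptions chosen because they are consistent with the stated results of 4 mats, 16 subarrays, and 0.25 mats; the function names are illustrative.

```python
def default_active_mats(ndwl):
    """Mats activated by default: one full row, each mat two subarrays wide."""
    return ndwl / 2


def subarrays_per_set(ndwl, nspd):
    """Subarrays needed to hold one set of data."""
    return ndwl / nspd


def mats_per_set(ndwl, nspd):
    """Mats needed to hold one set, with four subarrays per mat."""
    return subarrays_per_set(ndwl, nspd) / 4


def mats_on_location_cache_hit(ndwl, nspd, associativity):
    """Mats activated when the way is already known from the location cache."""
    return mats_per_set(ndwl, nspd) / associativity


# Assumed configuration for the 16-way, 2 MB L2 cache (inferred, not quoted).
NDWL, NSPD, ASSOC = 8, 0.5, 16

print(default_active_mats(NDWL))                      # → 4.0
print(subarrays_per_set(NDWL, NSPD))                  # → 16.0
print(mats_per_set(NDWL, NSPD))                       # → 4.0
print(mats_on_location_cache_hit(NDWL, NSPD, ASSOC))  # → 0.25
```

A result of 0.25 mats corresponds to one of the four subarrays in a mat, i.e., a single subarray per location cache hit.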
V. LOCATION CACHE POWER SIMULATION
In this section we will discuss our use of the Simics system simulator to simulate a location cache implementation for CMP machines. Section V-A discusses the activation policy used throughout the work. Section V-B details the extensions made to the simulator to allow for cycle-by-cycle power estimation.
A. Sleep Cache Activation Policy
Lines in the L2 cache are put into gated-ground (sleep) mode according to the drowsy cache policy set forth in [10]. To utilize this activation policy on our gated-ground cache, a Sleep Controller module was created in Simics. This module gives Simics the ability to simulate the process of putting lines to sleep and activating them.
At the start of a simulation run, the Sleep Controller puts the entire cache into sleep mode so that it is consuming the least amount of leakage power possible. When cache operations begin, the module activates a single counter to keep track of the number of CPU cycles elapsed since the entire cache was last driven into sleep mode. During L2 cache access operations, cache lines are activated on a subarray basis, discussed in detail in Section IV. These lines remain active until the next time the cache is globally driven into the sleep mode.
A cache's window size is the number of CPU cycles between these global sleep events. Once the Sleep Controller's counter exceeds the window size, the entire cache is returned to sleep mode as soon as the cache becomes idle, and the counter is reset. Waiting for the cache to become idle is important, as the cache must not be driven into gated-ground mode in the middle of an access, or the data written or read by the transaction would be corrupted. An 8000-cycle window size was chosen according to the optimum in-order execution window size determined in [8]. In addition, it has been shown that this simple windowing sleep policy can be as effective as more complicated policies in most cases [8].
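The windowed policy described above can be sketched as a small state machine. This is a minimal illustration under our own assumptions, not the Simics Sleep Controller module itself: the class name, method names, and the per-subarray boolean state are hypothetical.

```python
class SleepController:
    """Hypothetical model of the windowed gated-ground policy."""

    def __init__(self, num_subarrays: int, window_size: int = 8000):
        self.window_size = window_size
        self.awake = [False] * num_subarrays  # the whole cache starts asleep
        self.counter = 0                      # cycles since the last global sleep

    def access(self, subarrays):
        """Wake the subarrays touched by an L2 access; they stay awake."""
        for index in subarrays:
            self.awake[index] = True

    def tick(self, cache_idle: bool) -> bool:
        """Advance one CPU cycle; return True if a global sleep event fired."""
        self.counter += 1
        # Sleep only once the window has expired AND no access is in flight.
        if self.counter > self.window_size and cache_idle:
            self.awake = [False] * len(self.awake)
            self.counter = 0
            return True
        return False


# Tiny demonstration with a 3-cycle window; the cache is busy on the 4th cycle.
ctrl = SleepController(num_subarrays=4, window_size=3)
ctrl.access([1])
events = [ctrl.tick(cache_idle=(cycle != 3)) for cycle in range(6)]
print(events)  # [False, False, False, False, True, False]
```

In the demonstration the window expires on the fourth cycle, but the global sleep event is deferred by one cycle because the cache is not idle, mirroring the rule that a cache is never put to sleep in the middle of an access.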
B. Cycle-by-Cycle Power Estimation
One of the difficulties in utilizing Simics for this work was that Simics includes no provisions for gathering statistics on power usage. Due to the complexity of our modifications, it would be prohibitive or impossible to derive the proper power equations by hand. Instead, an extension to Simics was created to provide support for basic power statistics. The following discussion describes the creation and use of this extension.
The Simics g-cache module was further modified to allow for the calculation of dynamic read and write energies, as well as leakage power. The energy values supported by this modification, which are provided by CACTI, are shown in Table I with example values from the L2 cache with 64-byte lines used in this work. From these values it is possible to build up a working estimation of power consumption within Simics. All values are provided to Simics in joules. Values that are usually represented in watts, such as leakage power, are converted to joules by multiplying by the known CPU cycle time prior to passing them to Simics.
Leakage power is computed on a cycle-by-cycle basis. Due to our modifications, Simics is aware of the number of active ways (or subarrays) in each cache at all times. The leakage power consumed during every cycle is then computed by (8). In addition, all dynamic energies are included in the power statistics. In the case of a location cache miss, or in the absence of a location cache, the dynamic read and write energies for a location cache miss in Table I are used. When a transaction results in a hit in the location cache, the dynamic read and write energies for a location cache hit in Table I are used. If a cache is implemented using the gated-ground technology, activation energy is also accounted for through the utilization of (9).
By utilizing (9) on a CPU cycle-by-cycle basis in each g-cache module, it is possible for Simics to account for the total energy consumption of the system while benchmarking. Note that the dynamic energy term in (9) can be one of the four dynamic read/write energy attributes shown in Table I, depending on whether the access is a read or write, and a location cache hit or miss. Similarly, the terms in (9) for the entire L2 cache and for the currently requested access also depend on whether the access is a location cache hit or miss.
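The cycle-by-cycle accounting can be illustrated with a minimal sketch. It is not the g-cache extension and does not reproduce the exact form of (8) or (9); it assumes, for illustration only, that leakage is a linear sum over active and sleeping subarrays and that activation energy is charged per subarray woken. All names and numeric values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class EnergyModel:
    """Hypothetical per-cycle energy accumulator (all energies in joules)."""
    cycle_time_s: float    # CPU cycle time in seconds
    leak_active_w: float   # assumed leakage of one active subarray, in watts
    leak_sleep_w: float    # assumed leakage of one sleeping subarray, in watts
    activation_j: float    # assumed energy to wake one subarray
    total_j: float = 0.0

    def watts_to_joules_per_cycle(self, watts: float) -> float:
        # Power is converted to energy by multiplying by the cycle time.
        return watts * self.cycle_time_s

    def account_cycle(self, active: int, total: int,
                      dynamic_j: float = 0.0, woken: int = 0) -> None:
        # Leakage is charged every cycle, based on how many subarrays are awake.
        leak_w = active * self.leak_active_w + (total - active) * self.leak_sleep_w
        self.total_j += self.watts_to_joules_per_cycle(leak_w)
        # Dynamic and activation energy are nonzero only when an access starts.
        self.total_j += dynamic_j + woken * self.activation_j


# Placeholder values chosen only to exercise the code, not taken from Table I.
model = EnergyModel(cycle_time_s=1e-9, leak_active_w=2e-3,
                    leak_sleep_w=2e-4, activation_j=1e-12)
model.account_cycle(active=4, total=64, dynamic_j=5e-10, woken=1)  # access begins
model.account_cycle(active=4, total=64)                            # later cycle
print(model.total_j)
```

Keeping every quantity in joules per cycle means the running total can simply be summed across cycles and across cache modules.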
Dynamic energy for a single access is computed and added to the total energy during the cycle in which the access begins, regardless of how many cycles the access will actually take to complete. Therefore, in any cycle in which no cache access is first initiated, the per-access energy terms are set to zero, even though earlier accesses may still be in progress.
This scheme was chosen for its simplicity and ease of implementation, but it introduces a small amount of error upon ending the simulation. If the simulator is halted after an access has begun, but before it has completed, an error no greater than the energy of that single access is introduced into the system. Due to the small amount of energy consumed during a single access, and given the sheer number of total accesses over the course of a simulation, the error introduced is negligible. The energy values determined by CACTI for a location cache with 32 entries are shown in Table II. These values are much smaller than those in Table I.
VI. EXPERIMENTAL RESULTS
In this section, results of our experiments are presented. Simulation data such as location cache hit ratios and power savings rates are provided for both shared and private location caches in a CMP processor like those currently in use. The SPLASH-2 [21] benchmarking suite was run using the configurations described briefly in Table III. ALPBench [22] was run in the four configurations created by fixing the location cache type to private and the number of location cache entries to 128, as shown in Table IV. In the following discussions, we assume that the L2 access latency is 18 CPU cycles and the L2 sleep window size is 8000 cycles [10].
A. Location Cache Efficiency
Figs. 9 and 10 show the hit rates of the private location caches for the SPLASH-2 and ALPBench benchmarks, respectively. The L2 line size was fixed at 64 bytes, and the private location cache configuration was utilized. Fig. 9 covers all possible numbers of location cache entries; for Fig. 10, the number of location cache entries is fixed at 128. The location cache hit ratio for our configuration was calculated using (12). Note that the miss count of a location cache is accumulated only when its corresponding L1 cache misses (which causes an L2 cache access). Further, the number of transactions for a location cache equals the total number of misses at its L1 cache. The location cache hit rates varied depending upon the number of location cache entries, where generally a higher number of entries resulted in a better hit rate. These figures indicate that location caches are capable of operating efficiently in a CMP environment. In fact, 11 of the 17 benchmarks achieve location cache hit rates of over 95%.
These excellent hit rates translate directly to keeping the L2 cache in the gated-ground mode for greater periods of time, as shown in Fig. 11. The quantities involved in (13) include the total number of subarrays in the L2 cache and the total simulation time of the benchmark in CPU cycles.
When a location cache is not present or misses, the L2 must be accessed set-associatively, and thus requires waking up all related lines from gated-ground mode. In the case of a location cache hit, fewer lines need to be activated, allowing a greater portion of the L2 cache to remain in the more efficient gated-ground mode. This advantage is clearly seen in Fig. 11, where many of the benchmarks spend significantly more time in the gated-ground state. As we will see in the following discussion, this behavior translates directly to power savings.
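The two efficiency metrics used in this subsection can be sketched as follows. The exact forms of (12) and (13) are not reproduced here; the functions below assume the conventional definitions suggested by the text, namely a hit ratio over location cache transactions (one per L1 miss) and a sleep fraction averaged over all subarrays and all simulated cycles. Function names and sample inputs are hypothetical.

```python
def location_cache_hit_ratio(l1_misses: int, location_cache_misses: int) -> float:
    """Hit ratio, where every L1 miss counts as one location cache transaction."""
    if l1_misses == 0:
        return 0.0  # no transactions, so no meaningful ratio
    return (l1_misses - location_cache_misses) / l1_misses


def gated_ground_fraction(sleep_cycles_per_subarray, total_cycles: int) -> float:
    """Average fraction of time the L2 subarrays spend in gated-ground mode."""
    num_subarrays = len(sleep_cycles_per_subarray)
    return sum(sleep_cycles_per_subarray) / (num_subarrays * total_cycles)


# Made-up counts, used only to exercise the functions.
print(location_cache_hit_ratio(l1_misses=1000, location_cache_misses=50))  # 0.95
print(gated_ground_fraction([800, 600, 1000, 1000], total_cycles=1000))    # 0.85
```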
B. Power Savings: Private Versus Shared Location Caches
When calculating the power savings realized by adding location caches to an existing cache configuration, it is important to include the additional power utilized by the location caches themselves in the overall power statistics. Due to the small number of entries and small line size, even the frequent accesses to location caches result in them consuming less than 0.5% of the total power consumed by the cache system. This is significant: even in the worst case, where a location cache provides no power savings whatsoever, the additional power wasted through its use is minimal. Therefore, to achieve the most accurate results, the following statistics include not only the power consumed by the L2 caches, but also that of the location caches and L1 caches. However, we emphasize that the power saving comes from the L2 caches, whose accesses are aided by the proposed location cache architecture. Take, for example, the Radiosity benchmark from SPLASH-2 with a cache configuration consisting of private location caches with 128 entries each and an L2 line size of 64 bytes. In this case the L2 caches consume approximately 2.09 W, which is 67.6% of the total power consumed by the cache system. The L1 caches consume about 0.99 W and the four location caches combined use about 0.01 W, which contribute 32.0% and 0.4% of the total power consumption of the cache system, respectively.
While location caches are capable of saving both dynamic and leakage power of L2, it was found that the majority of power saved by introducing location caches into a cache system is leakage power of L2. Equation (14) represents the total power consumed by a cache system without any location cache connected. Equation (15) represents the total power used by a configuration created by adding location caches to the cache system used in (14).
The Power Savings Rate, or the percentage of power saved by adding location caches to any given configuration, can therefore be calculated using (16) .
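As a worked check of this definition, the sketch below computes the savings rate as the relative reduction in total power. The function name is ours; the inputs are the MPEG figures reported for Architecture 1 in Table V (2.63 W without and 2.17 W with location caches), and the result agrees with the 17.49% stated there.

```python
def power_savings_rate(power_without_w: float, power_with_w: float) -> float:
    """Percentage of total cache-system power saved by adding location caches."""
    return (power_without_w - power_with_w) / power_without_w * 100.0


# MPEG on Architecture 1: 2.63 W without and 2.17 W with location caches.
mpeg_savings = power_savings_rate(2.63, 2.17)
print(f"{mpeg_savings:.2f}%")  # 17.49%
```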
Using an L2 line size of 64 bytes and private location caches featuring 128 entries each, Fig. 12 shows that, as a percentage, moving to a private location cache system saves more leakage power than dynamic power. The power saving rates are calculated by comparing the power reduction caused by adding the private location caches to the system. For example, in the case of radix, about 45% of leakage power was saved, while about 8% of dynamic power was saved. This is a particularly interesting finding, as the amount of chip area dedicated to the cache system is increasing rapidly, along with the leakage power attributable to this increase. Any structure that can significantly reduce leakage power could become valuable for producing future low-power microprocessors.
In Fig. 13 we can see that the addition of even a shared location cache can save between 2.5% and 40% of the overall power usage, depending on the benchmark and number of location cache entries. Moving to a private location cache system, as shown in Fig. 14, increases the maximum savings to 42.5% with significantly improved average case performance. It is apparent that even with a very small number of entries, the location cache design is able to save a significant amount of power through a variety of benchmarks. It is interesting to find that some of the benchmarks, such as lu-cont, are sensitive to the number of location cache entries, while benchmarks like ocean-cont are not sensitive at all. We emphasize again that the majority of power saved is contributed by leakage power. This is the reason why most curves in Fig. 14 are quite consistent with the leakage curve in Fig. 12 .
The concept of a shared location cache is promising because it is easy to implement and provides some energy savings over a cache configuration that lacks a location cache. Unfortunately, these benchmarks showed that the configuration with shared location caches exhibited a significant amount of replacement as the multiple cores fought over the small number of available location cache lines. Dedicating a location cache to each processing core fared better, as shown in Fig. 15. Here the shared location caches were provided 64 lines each and the private location caches 32 lines each. This was done to keep the total number of location cache lines in each configuration constant at 128 lines. About half of the benchmarks proved very sensitive to being assigned a private location cache for each core, and two performed slightly better using shared location caches. This solidifies our position that in a CMP system, each core must have its own location cache in order to operate at peak efficiency. Note that this will not result in too much hardware overhead, since each location cache is a small device. For example, using a 70 nm technology, the area of a 128-entry location cache is only 0.0050179 mm² (0.0103518 mm²) for its data (tag) array. In contrast, the area for the L1 cache described in our system architecture is 0.549098 mm² (0.0312233 mm²) for its data (tag) array. The location cache is thus about 38 times smaller than the L1 cache. A power comparison of the location cache with other devices can be found in Tables I and II and in Section VI-C.
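The area comparison can be reproduced with a few lines of arithmetic. The sketch assumes that the quoted figures are areas in mm² and that the "38 times" comparison is taken over the sum of the data and tag arrays; both are our reading of the numbers rather than something stated explicitly.

```python
# Areas quoted in the text (assumed to be in mm^2): data array + tag array.
location_cache_area = 0.0050179 + 0.0103518
l1_cache_area = 0.549098 + 0.0312233

area_ratio = l1_cache_area / location_cache_area
print(f"L1 is about {area_ratio:.1f}x the area of a 128-entry location cache")
```

Under these assumptions the ratio rounds to 38, consistent with the figure quoted above.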
C. Design Space Exploration
We have discussed the relationship between hit rates and power savings for different combinations of location cache type (private or shared) and number of location cache entries (16, 32, 64, 128). This subsection further explores the design space by considering parameters such as L2 line size, L2 access latency, normal mode to gated-ground mode leakage ratio, size of L1 and L2 caches, and number of L2 caches (two L2 caches or one single L2 cache).
Moving from a 64-byte to a 128-byte L2 line size decreases the power savings realized by adding a private location cache. Simply increasing the L2 line size from 64 bytes to 128 bytes drastically reduced the overall power consumption of the cache system. This increase in L2 line size results in a more efficient cache configuration, leaving less room for the location cache to save additional power. The major reason for better efficiency with a larger L2 line size is that one L2 access can serve two L1 misses. Though the same number of memory cells will be accessed, it saves power in two ways: 1) dynamic power consumed in peripheral devices can be saved because one access is performed instead of two; and 2) it allows the memory cells in the same subarray as the accessed line to sleep more. Despite these findings, the location cache is still quite effective in the cases where a 128-byte L2 line size is chosen. This is illustrated in Fig. 16 (SPLASH-2) and Fig. 17 (ALPBench), which show the percentage of power saved by using a private location cache with 64-byte and 128-byte L2 lines. While the overall savings provided by adding location caches to the L2 with a 64-byte line size was greater, savings were still provided when the L2 was moved to 128-byte lines.
To better illustrate this point, we have plotted the use of both techniques in Fig. 18. The baseline of Fig. 18 represents a cache with 64 bytes in each L2 line and an 18-cycle L2 latency. The following three configurations are plotted against the baseline: 1) baseline with the addition of 128-entry private location caches; 2) baseline, modified to support a 128-byte L2 line size; 3) baseline, modified to support a 128-byte L2 line size along with the addition of 128-entry private location caches. While each optimization performs well on its own, when combined they yield an average savings of about 35% over all benchmarks with a maximum savings of over 50%. This shows that, when paired with an increased L2 line size, private location caches can be a powerful tool in decreasing the power consumption of the increasingly large cache systems of modern processors.
One major concern about the location cache design involved the latency of the target cache. If the sleep window size remained constant, it was thought that a significant decrease in the target cache's latency could negate any power savings provided by the location cache design. This was found not to be the case, as is shown in Fig. 19 . The figure shows that while altering the latency of the target cache does have some effect on the performance of the connected location caches, this effect is minimal for even large fluctuations in latency. This behavior will allow a cache designer some flexibility, and proves the location cache design can be a beneficial addition even when paired with a fast target cache.
Another concern was that the value of the normal mode to gated-ground mode leakage ratio would have a significant impact on the power savings provided by location caches. As discussed previously in Section IV-A1, we calculated the value of this ratio to be about 0.114. While this falls in line with previous calculations at other institutions [23], further testing was performed by varying the ratio between our calculated 0.114 and 0.5 to determine its effect on power savings. The results of this experiment are presented in Fig. 20. The value of the ratio does have a significant, although linear, effect on the power savings provided by location caches. While this was expected, we also found that even when the ratio was set to 0.5, the location cache design was still able to save some amount of power in all benchmarks.
It is also of interest to examine the impact of L1 and L2 cache sizes on the performance of the location cache. Table V gives six different configurations with private location caches implemented, where each location cache contains 128 entries. Architectures 1 to 3 vary the L1 and L2 cache sizes, while Architectures 4 and 5 are unicore designs. Architecture 6 is the reference architecture (Xeon E7320) described in the beginning of Section VI. Four benchmark programs selected from SPLASH-2 and ALPBench are simulated for quick comparison.
Instead of using two 2 MB L2 caches as in the reference architecture, all four cores share a single 2 MB L2 cache in Architecture 1. The simulation results are shown in the first row of Table V where, under each benchmark program, the first (second) item gives the power consumption without (with) location caches, while the third item gives the percentage of power saved by the use of location caches. For example, in the case of MPEG, the power consumption of Architecture 1 without (with) location caches is 2.63 W (2.17 W), and the power saving by location caches is about 17.49%. When compared with Architecture 6, which is the reference architecture, the power consumption of Architecture 1 is much smaller. This comes from the fact that Architecture 1 has only a single L2 cache, and the leakage power of L2 dominates the total power consumption (thus, location caches can help less). The power saving achieved by location caches is universally smaller than in the reference architecture presented in Architecture 6.
Architecture 2 doubles the size of each L2 cache in the reference architecture, and the results are shown in the second row of Table V. It can be observed that, when compared with Architecture 6, the power consumption of Architecture 2 is much larger. Again, the reason is that the total L2 cache size in Architecture 2 is twice that in Architecture 6, and the leakage power of the L2 cache dominates. It is also notable that the power saving achieved by location caches in Architecture 2 is universally larger than in the reference architecture presented in Architecture 6. By comparing Architectures 1, 2, and 6, we find that location caches tend to perform better with large L2 caches, which is well matched with the technology trend.
Architecture 3 doubles the size of each L1 cache in the reference architecture, and the results are shown in the third row of Table V . It can be found that, when compared with Architecture 6, the power consumption of Architecture 3 is slightly larger due to the larger L1 cache size. Further, the power saving ratio achieved by location caches in Architecture 3 is not too much different from that by Architecture 6, as expected.
Another dimension for design space exploration is to replace a location cache by the L2 tag array. That is, we try to have the L1 cache and the L2 tag array accessed simultaneously for each L1 access. According to our simulation using CACTI 5.0 and the system architecture presented in the beginning of Section VI, the L1 cache access latency is 0.945 ns while the L2 tag array access latency is 1.180 ns. This will increase the L1 access time by 24.87%. However, simultaneous access of the L1 and the location cache will not increase the access latency of L1, because the location cache has a much smaller size than L1 and uses the direct-mapped access mechanism. In fact, the access latency of a location cache with 128 entries is only 0.530 ns. In terms of energy consumption, a location cache with 128 entries consumes 0.0025 nJ for each read access, while the L2 tag array consumes 0.018 nJ for each read access, a difference of roughly seven times.
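The latency and energy figures above can be cross-checked with a short sketch. It assumes a simple model in which a structure accessed in parallel with L1 stretches the L1 access time to the slower of the two latencies; this model and the names used are ours, while the numeric values are those quoted from the CACTI simulation.

```python
L1_LATENCY_NS = 0.945
L2_TAG_LATENCY_NS = 1.180
LOCATION_CACHE_LATENCY_NS = 0.530


def l1_latency_increase_pct(parallel_latency_ns: float) -> float:
    """Percent increase in L1 access time when another array is read in parallel.

    Assumption: the combined access finishes when the slower structure does.
    """
    effective_ns = max(L1_LATENCY_NS, parallel_latency_ns)
    return (effective_ns / L1_LATENCY_NS - 1.0) * 100.0


tag_overhead = l1_latency_increase_pct(L2_TAG_LATENCY_NS)
location_cache_overhead = l1_latency_increase_pct(LOCATION_CACHE_LATENCY_NS)
read_energy_ratio = 0.018 / 0.0025  # L2 tag array vs. location cache, nJ per read

print(f"{tag_overhead:.2f}% {location_cache_overhead:.2f}% {read_energy_ratio:.1f}x")
# 24.87% 0.00% 7.2x
```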
Another common approach to implementing a low-power, highly associative cache is to use sequential tag/data array access. That is, if an L1 miss is detected, the tag array of L2 is accessed, and on a hit, the corresponding way in the data array is then accessed. Based on our CACTI simulation using the system architecture mentioned above, the tag array access latency is 1.180 ns as described before, while the data array access latency is 2.80 ns. This is a significant increase in L2 access latency. However, by using the location cache, the access time can be hidden by the L1 cache access time. In terms of power dissipation, again, a significant amount of power consumed by the L2 tag array can be saved if the location cache hits (0.018 nJ versus 0.0025 nJ).
D. Location Caches for CMP and Uniprocessor
Architecture 4 in Table V has only a single core with one L1 cache and one L2 cache, each the same as those in Architecture 6. The simulation results are shown in the fourth row of Table V. Due to the single-core architecture (much smaller dynamic power) and smaller L2 cache (much smaller leakage power), the power consumption of Architecture 4 is greatly reduced. However, the execution time of the benchmark program will be much longer, though we did not collect this data. It is interesting that the power saving rate achieved by the location cache design in Architecture 4 is much larger than that in Architecture 6 for all benchmarks. A reasonable explanation is that the dynamic power contributed by L1 is greatly reduced in Architecture 4, so the power saving achieved by the location cache for L2 is amplified.
Architecture 5 has the same configuration as Architecture 4 except that its L1 cache is four times the original size. When compared with Architecture 4, as expected, the power consumption of Architecture 5 is increased in every benchmark simulation due to its much larger L1 size. However, the power saving ratio contributed by the location cache design in Architecture 5 is smaller than that in Architecture 4. Again, the reason is that the dynamic power increase caused by the much larger L1 cache in Architecture 5 dilutes the power saving achieved in L2 by the location cache. Further, the four times larger L1 size in Architecture 5 greatly reduces the miss rate of L1, and thus the contribution of the location cache.
VII. CONCLUSION AND FUTURE WORK
In this research we have analyzed the power savings realized by utilizing location caches in a CMP system. The working principle of the location cache for a single-core processor is reviewed, and extensions to this principle are proposed to allow location caches to support a CMP system. We found that the amount of power saved by adding location caches varies quite significantly depending upon the setup of the tested parameters. The tested location caches were able to save power over all tested configurations and benchmarks, though they were far more effective at reducing the amount of leakage power than dynamic power. The number of entries in each location cache displayed a surprisingly small effect on the cache's overall power savings, with only about six of the benchmarks showing sensitivity to this parameter. On the other hand, assigning private location caches is a far more effective use of resources. Our simulations show that adding private location caches to the cache system can save between 2% and 43%, depending on the benchmark, of the power utilized by an equivalent cache system without location caches. In addition, the power savings did not appear to be dependent on the latency of the L2 caches, which indicates that the location caches could be effective in systems with high-speed L2 caches. While location caches themselves provided less power savings when moving to a 128-byte L2 line size, this combination of techniques provided the most significant power reduction: as much as about 50% savings in some of the benchmarks. This shows that location caches can be a valuable tool for reducing leakage power consumed by a cache system. Many design parameters have been fully explored and discussed.
Looking towards possible future work, we suggest the following: 1) extension of this work to a cache system connected by network-on-chip (NoC); 2) learning the effect of OS techniques on the power savings provided by location caches in CMP systems; 3) extension of the current model to efficiently serve CMPs with dozens or even hundreds of cores; 4) extension of location caches to deal with exclusive caches [24]; and 5) identifying the reasons why some benchmarks, such as Radix (Ocean), seem to benefit the most (least) from location caches.
