Abstract-This paper proposes a technique for reducing the energy consumed by hybrid caches that combine SRAM and STT-RAM (Spin-Transfer Torque RAM) in multi-core architectures. It is based on dynamic partitioning of the SRAM cache as well as the STT-RAM cache. It assigns cache blocks to a specific region of a cache based on an existing technique called read-write aware region-based hybrid cache architecture. Thus, when a store operation from a core causes a write miss, the block is assigned to the SRAM cache. When a load operation from a core causes a read miss and thus a block fill, the block is assigned to the STT-RAM cache. However, if the core is already using the maximum number of cache ways allocated to it in the SRAM, the block fill is done into the SRAM. The partitioning is updated periodically. Simulation results show that the proposed technique improves the performance of the multi-core architecture and significantly reduces energy consumption in the hybrid caches compared to the state-of-the-art migration-based hybrid cache management.
I. INTRODUCTION
Non-volatile memories such as Spin-Transfer Torque RAM (STT-RAM) have been researched as alternatives to SRAMs due to their low static power consumption and high density. Among such memories, STT-RAM has a relatively high endurance compared to other memories and thus it is regarded as the best candidate for substituting SRAM used in last-level shared caches in modern chip multi-processors [1].
However, STT-RAM has asymmetric read and write characteristics. Writing into STT-RAM consumes significantly more energy and takes more time than reading. Such characteristics can significantly increase energy consumption and degrade the performance of the system if the programs running on the processor cores need frequent writes into the shared cache. To mitigate the adverse effect of these characteristics, there has been much research [2, 3, 6, 8, 16, 17] on hybrid caches, where a small SRAM cache is combined with a large STT-RAM cache. In a hybrid cache, a block that is expected to be written frequently is placed in the SRAM cache, which has much lower write overhead than STT-RAM. So, in the hybrid cache approaches, where to place a block between SRAM and STT-RAM is one of the most important issues in reducing energy consumption and enhancing performance.
Fig. 1 shows that a block fill on a read miss is one major source of writes into a cache. So reducing the number of block fills is another method of lowering the dynamic energy consumption of hybrid caches. Cache partitioning was developed to increase the performance of chip multi-processors by dynamically partitioning the ways of the cache such that the number of misses is minimized and consequently the number of block fills is reduced. So the application of cache partitioning has great potential for reducing the dynamic energy consumption of hybrid caches. However, conventional cache way partitioning techniques cannot be applied to hybrid caches directly, because ways in the hybrid caches are separated into SRAM and STT-RAM and thus the ways assigned to a core as a result of partitioning also should be divided into SRAM and STT-RAM. Properly dividing the ways of a partition between the two caches and placing a block into the more efficient cache are new challenging issues when applying a partitioning technique to hybrid caches.
In this paper, we propose a technique that adopts the cache partitioning scheme in a hybrid cache to reduce energy consumption. We assume that the hybrid cache is a last-level shared cache in a multi-core architecture. A conventional partitioning technique called utility-based partitioning is used to determine the sizes of the partitions, one for each core, such that the number of misses is minimized. To incorporate the technique into hybrid caches, the replacement policy must be redesigned. When a store operation of a core causes a miss in the shared cache, the corresponding new block is placed in SRAM. If all the allocated ways in the SRAM are already in use, a victim is selected among them. In the case of a load miss, the block is placed in STT-RAM if there is an unused way among those allocated to the core. If all the allocated ways in the STT-RAM are full, a victim block is selected. However, if all the ways allocated to the core have already been assigned to SRAM, a victim is selected among these blocks in SRAM. So, within a partition, the ratio between SRAM ways and STT-RAM ways can be adjusted, and the ratio can be adjusted differently for different sets. Simulation results show that our technique improves the performance of a quad-core system by 3.6%, reduces the energy consumption of hybrid caches by 11.0%, and decreases the DRAM energy by 4.8% compared to the state-of-the-art migration-based hybrid cache management technique.
This paper is organized as follows. Section II explains the background of STT-RAM, hybrid caches, and partitioning techniques. Section III describes the details of our proposed technique that exploits the partitioning scheme for hybrid caches. Section IV shows the evaluation methodology and Section V discusses the results of the evaluation. Section VI surveys the related work and Section VII concludes this paper.
II. BACKGROUND

A. STT-RAM Technology
Spin-Transfer Torque RAM (STT-RAM) is an emerging memory technology, which uses a Magnetic Tunnel Junction (MTJ) as an information carrier. The MTJ consists of two ferromagnetic layers and one tunnel barrier between them as shown in Fig. 2. The reference layer has a fixed magnetic direction and the free layer can change its magnetic direction when a spin-polarized current with intensity above a threshold flows through the MTJ. If the directions of the two layers are parallel, the MTJ has a lower resistance than in the anti-parallel case. Thus, by measuring the voltage difference across the MTJ with a small current flow, the state of the MTJ can be detected. Because of its non-volatility, STT-RAM has a very low leakage current. For read operations, it requires energy and latency comparable to SRAM, but for write operations, it consumes much more energy and takes much longer than SRAM. Other noticeable properties of STT-RAM are its high endurance compared to other non-volatile memories such as Phase Change RAM (PRAM) or Resistive RAM (ReRAM) and its high density compared to SRAM.
B. Hybrid Approach for Last-Level Caches
Unlike other emerging non-volatile memories targeting off-chip storage, STT-RAM is expected to replace the SRAM used for current last-level caches (LLCs) in chip multiprocessors, mainly due to its fast read latency close to that of SRAM and its high endurance. But the high overhead of its write operations obstructs the use of pure STT-RAM LLCs. To mitigate this shortcoming, there has been research on hybrid caches that combine a small SRAM cache with a large STT-RAM cache, which lowers the write overhead of STT-RAM significantly. The two caches are usually combined in a region-based manner. The two memories are placed at the same level of the cache hierarchy and share the tags of the caches. Ways are divided into an SRAM region and an STT-RAM region, and consequently block placement becomes an important issue; it is more efficient to place blocks that will be frequently written in SRAM and place the others in STT-RAM.
One of the state-of-the-art techniques is the read-write aware hybrid cache architecture [17], in which a block fill caused by a store miss is made to the write-efficient SRAM region and a block fill caused by a load miss is made to the read-efficient STT-RAM region, based on the assumption that a block filled by a store miss is prone to frequent writes and a block filled by a load miss is prone to frequent reads. If there are consecutive hits on a block of the kind opposite to the initial assumption, the block is migrated to the other region.
C. Cache Partitioning Technique
The utility-based cache partitioning technique was proposed to improve the performance of chip multiprocessors by minimizing the number of misses on a shared cache [11]. It has one Utility MONitor (UMON) per CPU core. A UMON consists of an Auxiliary Tag Directory (ATD) and counters that measure the number of hits at each position of the LRU stack by observing the accesses to sampled sets in a cache. The algorithm periodically calculates the partition size of each core to maximize the number of hits in the cache. Bits are added to the cache tags to identify the core that owns each block. The replacement policy gradually adjusts the size (number of ways) of each partition to the calculated size. The partitioning technique is especially efficient when cache-insensitive (e.g., low hit rate) applications and cache-sensitive ones are running at the same time on a multi-core processor. It tends to restrict the cache usage of the low hit-rate applications, while increasing the performance of the cache-sensitive ones.
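To illustrate how per-position hit counters can drive a partitioning decision, the following is a minimal sketch in Python. It is not the algorithm of [11]: it uses a simple greedy rule (give the next way to the core with the largest marginal hit gain) as a stand-in, and the function name `partition_ways`, the `min_ways` parameter, and the sample counter values are all hypothetical.

```python
def partition_ways(hit_counters, total_ways, min_ways=1):
    """Greedy way allocation from UMON-style hit counters (illustrative only).

    hit_counters[c][p] is the number of hits core c observed at LRU stack
    position p (0 = MRU). With an LRU stack, giving core c its (p+1)-th way
    turns exactly those hits from misses into hits, so hit_counters[c][p]
    is the marginal utility of that way.
    """
    n_cores = len(hit_counters)
    if any(len(h) != total_ways for h in hit_counters):
        raise ValueError("each core needs one counter per cache way")
    if min_ways * n_cores > total_ways:
        raise ValueError("not enough ways for the requested minimum")

    alloc = [min_ways] * n_cores
    for _ in range(total_ways - min_ways * n_cores):
        # Pick the core that gains the most hits from one additional way.
        best = max(range(n_cores), key=lambda c: hit_counters[c][alloc[c]])
        alloc[best] += 1
    return alloc


# Hypothetical counters: core 0 is cache-sensitive, core 1 is not.
example = partition_ways([[100, 50, 10, 1], [5, 4, 3, 2]], total_ways=4)
```

In this hypothetical example the cache-sensitive core receives three of the four ways, which mirrors the tendency described above to restrict the cache usage of low hit-rate applications.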
III. PARTITIONING TECHNIQUE FOR HYBRID CACHES
This section explains the details of our technique that exploits the partitioning technique to reduce the energy consumption of hybrid caches in a multi-core processor.
A. Motivation
Read access of the last-level shared cache comes from a load or store miss of the upper level cache. In the case of store miss, a block written into the upper level cache becomes dirty, and so it eventually causes a write-back to the shared cache. Write access of the last-level cache can be classified into a write-back from the upper-level cache and a block-fill on a read miss. Read-write aware hybrid cache [17] utilizes the property of read access; if a read miss on the shared cache is caused by a store miss of the upper level cache, the new block is placed in SRAM so that a write-back to that block occurs in the write-efficient region. In case of a read miss caused by a load miss of the upper level cache, the block is placed in STT-RAM where read operations are efficient. If the decision is wrong, migration from one cache to the other occurs to remedy the situation. However, this technique does not consider the block-fill caused by a read miss in the shared cache and a write-back to a block (in the shared cache) that was originally loaded by a load miss of the upper level cache.
We propose a technique for hybrid caches that handles these cases by exploiting the advantage of the dynamic cache partitioning technique. The dynamic partitioning scheme adjusts the sizes of partitions such that the total number of misses in the shared cache is minimized. Thus it helps reduce the number of block-fills caused by read misses. If the number of SRAM ways used by a core is already greater than or equal to that allocated to the core, a new block loaded into the LLC due to a miss is placed in the SRAM even if the miss is a load miss (see footnote 1). The rationale is that utilizing the already allocated SRAM ways helps reduce energy and latency, since subsequent write-backs to the block can be done in SRAM.
B. Architecture
Fig. 3 shows a structure to implement our technique. Basically, it combines a cache partitioning architecture with hybrid caches. For the dynamic cache partitioning technique, it has a set of UMONs that sample cache accesses to designated sets of the cache and count the number of hits at each position of an LRU stack. It periodically calculates the partition size of the cache for each of the multiple cores such that the total number of misses in the cache is minimized. This is done by using the values of the hit counters collected by the UMONs during the previous period. In the cache tags, bits are added per block to identify the core that owns the block.
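The monitoring step can be sketched as follows. This is a simplified software model under stated assumptions, not the hardware design: the class name `UMON`, the choice of sampling every set whose index is a multiple of `sample_interval`, and the use of a Python list as the LRU stack are illustrative. One such monitor would exist per core.

```python
class UMON:
    """Per-core utility monitor: an auxiliary tag directory (ATD) over
    sampled sets plus one hit counter per LRU stack position."""

    def __init__(self, num_ways, sample_interval=32):
        self.num_ways = num_ways
        self.sample_interval = sample_interval
        self.atd = {}                 # sampled set index -> tags, MRU first
        self.hits = [0] * num_ways    # hits[p] = hits at LRU stack position p

    def access(self, set_index, tag):
        # Only designated (sampled) sets are monitored.
        if set_index % self.sample_interval != 0:
            return
        stack = self.atd.setdefault(set_index, [])
        if tag in stack:
            pos = stack.index(tag)
            self.hits[pos] += 1       # record the stack position of the hit
            stack.pop(pos)
        elif len(stack) == self.num_ways:
            stack.pop()               # ATD miss on a full set: drop the LRU tag
        stack.insert(0, tag)          # the accessed tag becomes MRU

    def reset(self):
        # Called at the start of each monitoring period.
        self.hits = [0] * self.num_ways


umon = UMON(num_ways=4, sample_interval=2)
for set_index, tag in [(0, "A"), (0, "B"), (0, "A"), (1, "A"), (0, "A")]:
    umon.access(set_index, tag)
```

Because the ATD behaves as if the core owned all ways of the sampled set, `hits[p]` estimates how many additional hits the core would obtain from its (p+1)-th way, which is the input the partitioning calculation needs.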
The last-level shared cache consists of SRAM and STT-RAM. Thus the data array is a hybrid of the two memories, but the tags are made of SRAM only. Ways of a set are divided unevenly; SRAM contains a smaller number of ways and STT-RAM covers the rest. When a new block needs to be placed in the hybrid cache, the result of the partitioning technique is used to decide the type of cache for allocation and to select a victim block in the resulting type of cache, as explained in the following section.
Footnote 1: Since we perform dynamic partitioning, the number of cache ways allocated to a core can be reduced after repartitioning. Thus NS(i, j) can temporarily exceed NUT(i) set by the new partitioning. The excess ways can be claimed later by other cores and thus NS(i, j) will be reduced to the new value of NUT(i).
C. Replacement Policy
To maximize the effect of applying a partitioning technique to hybrid caches, where to insert a new block should be carefully decided. If a core performs a store operation and eventually causes a miss in the shared cache, SRAM in the shared cache is selected as a location to place a new block because the block allocated in the upper level cache will become dirty, and thus there will be a write-back into the LLC. The victim selection policy within the SRAM is changed from the traditional LRU policy to a more complicated one that involves partitioning decision.
Let us define several notations as follows.
NUT(i): number of ways allocated to core i by the partitioning algorithm.
NS(i, j): number of SRAM ways used by core i in set j.
NM(i, j): number of STT-RAM ways used by core i in set j.
NUM(i, j) = NUT(i) - NS(i, j): maximum number of ways that can be allocated to core i in set j of STT-RAM.
If NS(i, j) ≥ NUT(i), the LRU among the SRAM blocks owned by core i in set j is chosen as a victim. If NS(i, j) < NUT(i), the LRU among all blocks in SRAM is selected as a victim. This modification is to better utilize the SRAM ways, which are fewer than the STT-RAM ways.
If a miss in the shared cache is caused by a load miss in the upper level cache, the new block can be placed either in SRAM or in STT-RAM. If NS(i, j) ≥ NUT(i), the new block is placed in SRAM and the LRU among the SRAM blocks owned by core i is chosen as a victim. This decision procedure is exactly the same as that of the store miss case. It allows using already allocated SRAM ways and avoids the unnecessary overhead of allocating a new way in STT-RAM. And by placing the block into the SRAM, the block-fill due to the read miss is done in the write-efficient region, reducing the dynamic energy of the caches without degrading the performance of multi-core processors.
If NS(i, j) < NUT(i), the new block is placed in STT-RAM. The total number of blocks of core i in a set across both caches is maintained not to exceed NUT(i), and thus NM(i, j) is maintained not to exceed NUM(i, j). Therefore, NM(i, j) is adjusted dynamically according to changes of NS(i, j) and/or NUT(i).
For example, if NM(i, j) becomes larger than NUM(i, j) due to a new partitioning, the LRU among the STT-RAM blocks of core i is selected to be replaced by a new block of another core. If NM(i, j) < NUM(i, j), the LRU of the blocks owned by other cores is chosen as a victim. Contrary to utility-based partitioning, our approach allows a zero value for NUM(i, j), and thus we allow the LRU of the blocks allocated by other cores in the STT-RAM to be chosen as a victim when there is no existing block in the STT-RAM owned by the core that requests the new block.
In our replacement policy, new blocks introduced by store misses can be placed only in SRAM, but new blocks introduced by load misses can be placed either in SRAM or in STT-RAM so that the utilization of the SRAM region can be kept relatively high. The ratio between NS(i, j) and NM(i, j) can be adjusted freely under the constraint NS(i, j) + NM(i, j) ≤ NUT(i), which results in a decrease of total misses in hybrid caches.
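The placement and victim-selection rules of this section can be summarized in a short sketch. It is a software model under stated assumptions rather than the hardware implementation: each way is represented as a hypothetical `(owner, last_use)` tuple, a smaller `last_use` means less recently used, an empty way is modeled as owner `None` with timestamp -1, and the function name `select_victim` is illustrative.

```python
from typing import List, Optional, Tuple

Block = Tuple[Optional[int], int]  # (owning core or None if empty, last-use time)


def lru_index(blocks: List[Block], candidates) -> int:
    """Return the candidate way whose block was used least recently."""
    return min(candidates, key=lambda way: blocks[way][1])


def select_victim(sram: List[Block], stt: List[Block], core: int,
                  is_store_miss: bool, nut: int) -> Tuple[str, int]:
    """Choose the region and way for a block fill in one set.

    nut plays the role of NUT(i); ns and nm below play NS(i, j) and NM(i, j).
    """
    ns = sum(1 for owner, _ in sram if owner == core)
    nm = sum(1 for owner, _ in stt if owner == core)

    if is_store_miss or ns >= nut:
        own = [w for w, (owner, _) in enumerate(sram) if owner == core]
        if ns >= nut and own:
            # Partition is full in SRAM: replace one of the core's own blocks.
            return "SRAM", lru_index(sram, own)
        # Store miss with room left: LRU among all SRAM blocks.
        return "SRAM", lru_index(sram, range(len(sram)))

    # Load miss with NS(i, j) < NUT(i): fill into STT-RAM.
    num = nut - ns  # NUM(i, j)
    own = [w for w, (owner, _) in enumerate(stt) if owner == core]
    others = [w for w, (owner, _) in enumerate(stt) if owner != core]
    if nm < num and others:
        return "STT-RAM", lru_index(stt, others)  # grow into other cores' ways
    return "STT-RAM", lru_index(stt, own if own else others)


sram_set = [(0, 5), (1, 0), (0, 1), (1, 7)]
stt_set = [(1, 3), (0, 4), (1, 0)]
```

With these hypothetical contents, core 0 holds two SRAM blocks, so with `nut=2` both a store miss and a load miss replace its own LRU block in SRAM, while with `nut=4` a load miss is filled into STT-RAM over the LRU block of another core.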
IV. EVALUATION METHODOLOGY
This section describes the experimental setup and methodology used to evaluate our hybrid cache partitioning technique.
A. Simulator
We evaluate our cache partitioning technique using the cycle-accurate simulator MARSSx86 [10]. For the off-chip memory model, we use the DRAMSim2 simulator [12], which is integrated into MARSSx86. The details of our system configuration are listed in Table I. The system has a 3.0 GHz, quad-core out-of-order processor based on the x86 ISA. The cache hierarchy is configured with 32KB, 4-way set-associative L1 instruction/data caches and a 4MB, 16-way set-associative shared L2 cache. We implement the shared L2 hybrid cache with a bank contention model. The L2 cache is a 16-way set-associative cache consisting of 4 ways of SRAM and 12 ways of STT-RAM with asymmetric read/write latency. The off-chip DRAM is configured as DDR3-1333, in which the CL, tRCD, and tRP timings are 10, 10, and 10, respectively.
The parameters of the hybrid cache model are calculated using NVSim [4] and CACTI 6.5 [9] under 45 nm technology. The tag array of a 16-way 4MB SRAM cache is used for the hybrid cache. A 4-way 1MB SRAM bank and 4-way 1MB STT-RAM banks are configured for the data array. By combining one SRAM bank and three STT-RAM banks, a 16-way 4-bank data array is designed for the hybrid cache. Table II lists the energy consumption of the L2 cache.
B. Workloads
We use SPEC 2006 [5] with reference inputs as workloads for the evaluation. Among them, 18 benchmarks are selected by filtering out benchmarks that have few accesses to the L2 cache. For multi-core simulation, we assemble 15 multi-programmed workloads from the benchmarks. To cover a wide variety of characteristics, the 18 benchmarks are sorted by L2 misses per kilo instructions (MPKI) as shown in Table III. From the highest one in the sorted list, nine benchmarks are categorized into the high-MPKI class, and the remaining ones are classified into the low-MPKI class. Five multi-programmed workloads of the high group are generated by randomly selecting 4 benchmarks from the high-MPKI class, five workloads of the middle group are generated by mixing 2 benchmarks from the high-MPKI class and 2 benchmarks from the low-MPKI class, and five workloads of the low group are generated by combining 4 benchmarks from the low-MPKI class.
Before simulation, 10 billion instructions are fast-forwarded per core to skip the initialization phase of the codes. After a cache warm-up of 5 million cycles, 2 billion instructions are simulated for the multi-programmed workloads on the quad-core processor. For the partitioning technique, we use a 5 million cycle period for monitoring and partitioning decisions, as in the previous work [11].
V. RESULTS
This section discusses the results of simulation. We compared our technique to the state-of-the-art migration-based hybrid cache management technique, called a read-write aware hybrid cache architecture (RWHCA) [17] .
A. Performance
A weighted speedup is used as the performance metric, which is the sum of per-application speedups in IPC compared to the baseline (RWHCA running a single application is used as the baseline). Fig. 4 shows the weighted speedups normalized to those of RWHCA (the weighted speedups of RWHCA are also obtained by using the same baseline). The performance improvement is 3.5% in geometric mean over all 15 workloads. For the workloads in the middle group, the performance improvement is 7.2% on average; the useless preemption of cache ways by benchmarks with high miss rates can degrade the performance of the cache-sensitive applications, and thus the partitioning technique can be very effective in the middle group. In the case of the low group, the partitioning scheme can improve cache utilization by assigning an optimal number of ways to each application, but the improvement is not as significant as in the middle group. The workloads in the high group do not benefit from the partitioning scheme. The performance of the high1 workload is even decreased by our technique because its SRAM ways are too crowded.
B. Miss Rates
Fig. 5 reports the difference in L2 miss rates between RWHCA and our technique. On average, it shows a 3.8% reduction of miss rate for the 15 workloads. Some workloads, including high2, low2, and low3, show increases in miss rate, but the differences are negligible. Among the three groups, the middle group shows the highest reduction of miss rates (7% on average), which is similar to the trend of performance.
C. Cache Energy Consumption
Fig. 6 compares the L2 energy consumption of our technique against the reference. The reduction of energy consumption is 11% on average for all 15 workloads and 8.1%, 13.7%, and 11.2% for the high, middle, and low groups, respectively. The middle group shows the highest L2 energy saving due to partitioning effects. In the case of the high group, the dynamic energy consumption is significantly reduced even though the miss rates differ little from those of RWHCA. This is the result of assigning more blocks to SRAM so that some write operations to STT-RAM are moved to SRAM, reducing the write energy of STT-RAM. Among the workloads, mid4 shows the highest energy saving of 29%. This result shows the impact of our technique on energy efficiency in hybrid caches.
D. DRAM Energy Consumption
Fig. 7 shows the DRAM energy consumption during the simulation. The results of our technique are normalized to those of RWHCA. The decrease of energy consumption is 4.8% for the 15 workloads. The middle group shows the highest energy saving of 9.1% on average and the low group shows 4.5%. The high group shows little difference in energy consumption. Among the workloads, mid4 shows the highest energy saving of 16%. The low2 workload is a special case in which DRAM energy consumption is increased by 4.3%. Because the benchmarks in the low2 workload are all cache-sensitive applications, a small growth in miss rates causes a large increase of DRAM energy consumption.
E. Area Overhead
The area overhead of our technique comes from the partitioning scheme. For cache usage monitoring, every 32nd cache set is sampled, so one UMON has 16 4B counters and an ATD of 128 sets for a 4MB 16-way set-associative cache. Each ATD set consists of 16 entries, each of which has one valid bit, a 24-bit tag, and 4 bits of LRU information. So the total overhead of one UMON is 7.3KB and the total overhead of cache usage monitoring for a quad-core processor is 29.25KB. In each tag of the cache, 2 bits are added to identify the core that owns the block; the sum of these overheads for the cache is 16KB. So the total overhead of the partitioning scheme for a 4MB 16-way set-associative cache is 45.25KB, which is negligible compared to the size of the last-level cache.
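The arithmetic behind these figures can be reproduced with a short calculation. The 64-byte block size is an assumption that is not stated in this section; it is the value under which a 4MB 16-way cache has 4096 sets and therefore 128 sampled sets. The function name `area_overhead_kb` is illustrative.

```python
def area_overhead_kb(cache_bytes=4 * 1024 * 1024, ways=16, block_bytes=64,
                     sample_interval=32, cores=4, counter_bytes=4,
                     valid_bits=1, tag_bits=24, lru_bits=4, owner_bits=2):
    """Storage overhead of the partitioning scheme, in KB.

    block_bytes=64 is an assumed value; all other defaults follow the text.
    """
    num_sets = cache_bytes // (ways * block_bytes)        # 4096 sets
    atd_sets = num_sets // sample_interval                # 128 sampled sets
    entry_bits = valid_bits + tag_bits + lru_bits         # 29 bits per ATD entry
    atd_bytes = atd_sets * ways * entry_bits / 8          # 7424 B
    umon_bytes = atd_bytes + ways * counter_bytes         # + 64 B of hit counters
    monitoring_bytes = cores * umon_bytes                 # one UMON per core
    owner_bytes = num_sets * ways * owner_bits / 8        # owner bits in every tag
    return {
        "umon": umon_bytes / 1024,
        "monitoring": monitoring_bytes / 1024,
        "owner_bits": owner_bytes / 1024,
        "total": (monitoring_bytes + owner_bytes) / 1024,
    }


overhead = area_overhead_kb()
```

Under this assumption one UMON needs 7.3125KB (reported as 7.3KB), and the total of 45.25KB is about 1.1% of the 4MB data array.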
VI. RELATED WORK
A. Reducing Write Overhead of STT-RAM
A write intensity predictor [2] was proposed to find write-intensive blocks and allocate them to SRAM. This approach achieves a high energy reduction in hybrid caches, but it does not consider cache partitioning, and thus it may increase the miss rates of the cache for some workloads, worsening the DRAM energy efficiency. The obstruction-aware cache management technique (OAP) [15] increases the efficiency of a last-level STT-RAM cache by bypassing applications that do not benefit from using the last-level cache. The technique collects information on the latency, number of accesses, and miss rates of the applications in a period and uses the data to detect applications to bypass. But the target of this technique is a pure STT-RAM cache, so it cannot be applied directly to hybrid caches.
Many studies [3, 7, 8, 13, 16, 17] utilize a migration technique for adapting block placements, but in our proposed partitioning technique, a conventional migration scheme [17] can reduce the energy efficiency of hybrid caches by breaking the partitioning decision. Still, there is room for adopting a migration approach to increase energy efficiency, because our proposed partitioning technique cannot place a block brought in by a store miss in an upper level cache into the STT-RAM region.
B. Cache Partitioning for Energy Saving
The cooperative partitioning technique [14] was proposed to reduce the energy consumption of a shared cache. In this technique, a partition is aligned physically and unused ways are disabled to reduce static power consumption. But this technique does not consider hybrid caches, and thus the block placement problem must be solved before it can be applied to hybrid caches. Writeback-aware partitioning [18] assumes an off-chip memory of phase change RAM (PRAM) and reduces the number of write operations to PRAM by dynamic partitioning of a shared cache, decreasing the energy consumption of the write-inefficient memory.
VII. CONCLUSION
We address the potential of the dynamic cache partitioning technique for reducing the energy consumption of hybrid caches. We propose an energy-efficient partitioning technique for hybrid caches in which the number of blocks installed by a core is adaptively balanced between SRAM and STT-RAM while satisfying the partitioning decision. If a miss is caused by a store miss in the upper-level cache, the corresponding block is placed in SRAM. If the miss originates from a load miss in the upper-level cache, the new block can be placed in either SRAM or STT-RAM according to the number of blocks installed by the core in SRAM. The simulation results show that our partitioning technique improves the performance of the multi-core system by 3.6% on average, reduces the energy consumption of the hybrid cache by 11.0%, and reduces the DRAM energy by 4.8% compared to the state-of-the-art migration-based hybrid cache management technique.
