In this paper, we present a flat address space organization called SILC-FM that allows subblocks from two pages to coexist in an interleaved fashion in die-stacked DRAM. Data movement at subblock granularity consumes less bandwidth than migrating an entire large block and avoids fetching useless subblocks that may never be accessed. SILC-FM obtains more spatial locality hits than CAMEO and PoM due to page-level operation and interleaving blocks, respectively. The interleaved subblock placement improves performance by 55% on average over a static placement scheme without data migration. We also selectively lock hot blocks to prevent them from being involved in hardware swapping operations. Additional features such as locking, associativity, and bandwidth balancing improve performance by 11%, 8%, and 8%, respectively, resulting in a total performance improvement of 82% over a static placement scheme without migration. Compared to the best state-of-the-art scheme, SILC-FM improves performance by 36%.
INTRODUCTION
Die-stacked DRAM is an emerging technology that offers much higher bandwidth than conventional DDR technology [8, 11, 12, 21]. Over the years, the capacity of available die-stacked DRAM modules has increased steadily from a few hundred megabytes to a few gigabytes, so exposing the die-stacked DRAM capacity for application usage can make an impact for capacity-constrained applications [3]. Researchers focus on ways to manage tags (metadata) efficiently by managing data at different cacheline sizes to reduce the metadata storage overheads, while at the same time trying to perform selective cacheline fetching from the off-chip DRAM to reduce off-chip bandwidth usage [4, 5, 6, 7, 13, 14, 15, 17, 19, 22, 24, 25].
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s). PACT '16, September 11-15, 2016, Haifa, Israel.
In this work, we present SILC-FM, Subblocked InterLeaved Cache-Like Flat Memory Organization, which fully utilizes the die-stacked DRAM capacity while intelligently placing hot pages in die-stacked DRAM. Subblock-level data movement reduces bandwidth usage compared to PoM [23] and avoids fetching data that may see little use. At the same time, SILC-FM fetches a larger number of small blocks at a time, resulting in more spatial locality hits than CAMEO [3]. In effect, SILC-FM obtains the benefits of both CAMEO and PoM. Furthermore, SILC-FM allows interleaving between die-stacked and off-chip DRAM hot blocks, so subblocks from both die-stacked and off-chip DRAM can coexist in a die-stacked DRAM page. Our scheme is a hardware data management scheme that is robust to common problems associated with hardware data management, such as thrashing and conflicts, since SILC-FM can lock hot pages and exclude them from data migration operations. Our associative structure also protects pages that are not locked, and that actively participate in hardware data migrations, from being frequently swapped out of die-stacked DRAM. In addition, we maximize memory bandwidth by utilizing both die-stacked and off-chip memory bandwidth. We achieve this by moving a certain subset of memory traffic to off-chip memory rather than servicing all memory requests from die-stacked DRAM. Thus, we further improve performance by utilizing the sum of die-stacked and off-chip DRAM bandwidth.
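To illustrate why diverting a subset of traffic to off-chip memory can help, consider an idealized model in which the two memories serve requests in parallel and each is limited only by its peak bandwidth. This model, the function names, and the 3:1 bandwidth ratio are illustrative assumptions and are not taken from the SILC-FM evaluation.

```python
def effective_bandwidth(nm_bw: float, fm_bw: float, nm_fraction: float) -> float:
    """Sustained total bandwidth when nm_fraction of all traffic goes to die-stacked DRAM.

    Idealized assumption: both memories run in parallel, and the total rate T
    must satisfy nm_fraction * T <= nm_bw and (1 - nm_fraction) * T <= fm_bw.
    """
    if not 0.0 <= nm_fraction <= 1.0:
        raise ValueError("nm_fraction must be in [0, 1]")
    limits = []
    if nm_fraction > 0.0:
        limits.append(nm_bw / nm_fraction)
    if nm_fraction < 1.0:
        limits.append(fm_bw / (1.0 - nm_fraction))
    return min(limits)


def balanced_fraction(nm_bw: float, fm_bw: float) -> float:
    """Traffic share for die-stacked DRAM that saturates both memories at once."""
    return nm_bw / (nm_bw + fm_bw)


# Hypothetical 3:1 bandwidth ratio, in arbitrary units.
all_nm = effective_bandwidth(3.0, 1.0, 1.0)                             # 3.0
balanced = effective_bandwidth(3.0, 1.0, balanced_fraction(3.0, 1.0))   # 4.0
```

Under these assumptions, servicing every request from die-stacked DRAM caps throughput at its own bandwidth, while the balanced split reaches the sum of the two.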
SILC-FM MEMORY ARCHITECTURE
In this section, we explain the details of the SILC-FM architecture. The SILC-FM scheme exposes Near Memory (NM) as OS-visible space while internally operating with subblock-based mechanisms. Each block contains four subblocks and the mapping is direct-mapped. This implies that, at any point in time, multiple subblocks from only one large block (from the same set) in Far Memory (FM) can be swapped into the corresponding subblocks in NM. The migration between two pages within the same set occurs at subblock granularity. This is bandwidth efficient as only 64B worth of data is migrated, yet managing the remap This exploits spatial locality as previously used subblocks are swapped at the same time, so any subsequent request to either of the subblocks is serviced from NM. Since this scheme does not swap any undesired subblocks, it is more bandwidth efficient than large-block-based schemes, which have to swap every subblock within a large block.
To access the remap entry and data in NM, an index is calculated from the page address using the modulo operator. Along with the LLC access, the PC and the request address are used to access the predictor. The request is sent to NM using the calculated index and the predicted way. If the lock bit is set and the remap field matches, the NM data is fetched. In case of a remap mismatch, the request address is checked to see whether it falls within the NM address space. If so, the bit vector is checked to determine the location of the requested subblock (if it is resident in NM, the bit has to be 0). A correct prediction can skip the metadata fetching steps described above. If the block address does not fall within NM space, then the remap entry update and the subblock swap from FM are initiated. The subblock is swapped to available ways within the set. A similar operation occurs for a remap mismatch.
If the request is made to one of the locked blocks, the remap entry is checked. If it matches, the corresponding subblock is fetched from NM. If not, the subblock is swapped from FM into NM blocks other than this locked block. For every swap-from-FM operation, the access rate is checked. If it is beyond a threshold, the swap bypasses NM and becomes a fetch from FM without any metadata update. On a way misprediction, the remap entry check takes longer because four remap entries are checked serially.
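As an illustration only, the direct-mapped part of this lookup (modulo index, remap match, and per-subblock bit vector) can be sketched in Python. The class and field names, the layout in which NM owns the lowest block numbers, and the handling of a remap mismatch by clearing the bit vector are assumptions made for the sketch, not details of the SILC-FM hardware. Locking, way prediction, and bypassing are left out.

```python
from dataclasses import dataclass
from typing import Optional

SUBBLOCKS_PER_BLOCK = 4   # from the text: each block contains four subblocks
SUBBLOCK_BYTES = 64       # from the text: 64B is migrated at a time


@dataclass
class RemapEntry:
    fm_block: Optional[int] = None  # FM block interleaved into this NM block, if any
    swapped: int = 0                # bit i set: subblock i currently holds FM data


class DirectMappedNM:
    """Toy model. Assumed layout: NM owns block numbers [0, nm_blocks), FM the rest."""

    def __init__(self, nm_blocks: int) -> None:
        self.nm_blocks = nm_blocks
        self.entries = [RemapEntry() for _ in range(nm_blocks)]

    def access(self, addr: int) -> str:
        """Return "NM" or "FM": the memory that services the requested subblock."""
        block, sub = divmod(addr // SUBBLOCK_BYTES, SUBBLOCKS_PER_BLOCK)
        entry = self.entries[block % self.nm_blocks]  # index via the modulo operator
        bit = 1 << sub

        if block < self.nm_blocks:
            # NM address: a 0 bit means the native subblock is still resident in NM.
            return "FM" if entry.swapped & bit else "NM"

        if entry.fm_block != block:
            # Remap mismatch: update the remap entry for the new FM block.
            # Assumption: the old block's subblocks are swapped back, which is
            # modeled here by clearing the bit vector.
            entry.fm_block = block
            entry.swapped = 0

        if entry.swapped & bit:
            return "NM"              # subblock was swapped in earlier
        entry.swapped |= bit         # swap in only the requested 64B subblock
        return "FM"                  # this first access is serviced from FM


nm = DirectMappedNM(nm_blocks=4)
fm_addr = (5 * SUBBLOCKS_PER_BLOCK + 2) * SUBBLOCK_BYTES  # FM block 5, subblock 2
first = nm.access(fm_addr)    # "FM": first touch triggers the swap
second = nm.access(fm_addr)   # "NM": the subblock now resides in NM
```

Only the touched subblock changes place, so the other three subblocks of the NM block keep serving NM addresses directly.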
EXPERIMENTAL SETUP
To evaluate the SILC-FM scheme, we use the Pin-based Sniper simulator [2] together with a detailed memory simulator, Ramulator [16]. For NM, we use HBM generation 2 technology and derive timing parameters from 235A [11, 12] [10, 20]. Simulation parameters are shown in Table 1. We compare our scheme against three other die-stacked DRAM designs: Random Static Placement (Rand), HMA [18], and CAMEO (CAM). Rand uses the entire NM and FM as OS-visible address space and maps pages randomly. Thus, this scheme does not consider the different bandwidth and latency characteristics of NM and FM, and instead treats them the same.
EVALUATION
Figure 1 shows the performance improvement of our proposed scheme against the other schemes. Unless otherwise specified, the baseline is the system without NM. First, the Random scheme does not see a significant performance improvement, because pages are allocated randomly and statically without considering NM and FM characteristics. The SILC-FM scheme effectively removes conflicts by offering associativity and locked blocks. Unlike in HMA, blocks are locked in NM as soon as the access count reaches the threshold. This lets our scheme respond much more quickly to changes in the hot working set. The associativity reduces conflicts among pages that are not yet locked. This additional performance gain makes SILC-FM achieve higher performance than the state-of-the-art scheme, CAMEO. Furthermore, the bypassing feature lets workloads with a high access rate, such as milc, gain performance by using FM bandwidth that would have been idle in other schemes.
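The threshold-based locking mentioned in this paper, where a block is locked as soon as its access count reaches a threshold, can be sketched as follows. The class name, method names, and the threshold value of 3 are hypothetical and chosen only for illustration. The sketch does not model how the hardware stores counters or releases locks.

```python
from collections import Counter


class HotBlockLocker:
    """Locks a block once its access count reaches lock_threshold (assumed value)."""

    def __init__(self, lock_threshold: int) -> None:
        self.lock_threshold = lock_threshold
        self.counts = Counter()   # per-block access counts
        self.locked = set()       # blocks excluded from swapping

    def record_access(self, block: int) -> bool:
        """Count one access and return True if the block is locked afterwards."""
        if block not in self.locked:
            self.counts[block] += 1
            if self.counts[block] >= self.lock_threshold:
                self.locked.add(block)   # lock immediately, no epoch boundary
        return block in self.locked

    def may_swap_out(self, block: int) -> bool:
        """Locked blocks do not take part in hardware swapping."""
        return block not in self.locked


locker = HotBlockLocker(lock_threshold=3)
history = [locker.record_access(7) for _ in range(3)]   # [False, False, True]
```

Because the lock is taken on the access that crosses the threshold, the decision does not wait for a periodic interval to end.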

CONCLUSION
Prior approaches focused on block-based or epoch-based schemes, but adopting either one benefits only a subset of workloads. In this paper, we have presented an associative locking memory architecture called SILC-FM that locks hot pages in NM and intelligently remaps FM subblocks into NM. SILC-FM achieves, on average, a 36% performance improvement over the state-of-the-art die-stacked DRAM architecture. In conclusion, SILC-FM is a novel memory architecture that takes advantage of the large NM capacity by holding as much hot data as possible in NM.
ACKNOWLEDGEMENT
This work was supported in part by National Science Foundation grants 1337393 and 1117895 and AMD. Any opinions, findings, conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of sponsors.
