Integrated Heterogeneous System (IHS) processors pack throughput-oriented General-Purpose Graphics Processing Units (GPGPUs) alongside latency-oriented Central Processing Units (CPUs) on the same die, sharing certain resources, e.g., the last-level cache, the Network-on-Chip (NoC), and the main memory. The demands for memory accesses and other shared resources from GPU cores can exceed those of CPU cores by two to three orders of magnitude. This disparity poses significant problems in exploiting the full potential of these architectures.
Processors that pack multiple CPU and GPU cores have higher memory bandwidth demands, and the inter-die vias and interconnects in 3D chips can supply these data-hungry processors. Stacked DRAMs are not constrained by the bandwidth wall problem, as the TSV interface between the DRAM stacks and the logic chip scales with the surface area of the chip rather than with the number of pins at its perimeter. This allows more direct, parallel access to the DRAM modules than accessing memory external to the chip.
Stacked DRAM can also be used in conjunction with the on-chip SRAM/STT-RAM caches to augment the memory with higher capacities and bandwidth. We expect the key observations of the heterogeneous nature of the requests and the inferences thereof to be similar even with the shared on-chip caches in IHS architectures.
Several recent proposals advocate the use of on-chip DRAM capacity as a hardware-managed last-level cache for improving the performance of multi-core Chip Multi-Processors (CMPs) [21, 25, 34, 39] . In the context of IHS architecture, the stacked DRAM can cater to the large bandwidth requirements of throughput-oriented GPUs while the latency-sensitive CPU applications can benefit from reduced latency of data access.
Managing contention for the shared DRAMCache in the memory hierarchy of two heterogeneous processor architectures with asymmetric sensitivities and demands, however, introduces novel challenges. When a GPU kernel is launched, it creates a large number of concurrent threads that run in a lock-step Single Instruction Multiple Data (SIMD) execution model and send a large number of requests into the memory hierarchy, causing congestion. This creates bottlenecks in the request queues at the DRAMCache, severely hampering CPU performance. Further, GPUs are designed to tolerate longer memory latencies, while the memory hierarchy of a typical CPU is designed to minimize memory access latency. Thus, latency-sensitive CPU applications would benefit from a DRAMCache design that offers a lower hit time and hence a lower miss penalty for the previous-level caches (L2 cache). On the other hand, the GPU cores would benefit from higher bandwidth and higher hit rates in the DRAMCache, even at the cost of a higher hit time. This necessitates a careful design of the memory system for IHS processors that can handle the bursty behavior of GPGPUs as well as service CPU requests with a consistent latency, without idling resources, while delivering improved system throughput at lower energy.
To improve the performance of IHS systems, we introduce a large-capacity stacked DRAMCache, used as a hardware-managed cache, as the first-level shared resource in the memory hierarchies of the CPU and GPU cores. Our organization, Heterogeneity-Aware Shared DRAMCache (HAShCache), is aware of the asymmetric and contrasting requirements of the heterogeneous processors. HAShCache attempts to meet the CPU's requirement of reduced hit times and lower miss penalties while at the same time improving hit rates for the GPU, allowing it to make better use of the DRAMCache bandwidth. In the common case when the CPU is running alone, HAShCache is an aggressive direct-mapped DRAMCache optimized for hit times and lower miss penalty through hit/miss prediction as in [39]. However, when the GPU is active, we propose (i) PrIS, a DRAMCache controller scheduling algorithm that prioritizes CPU requests in the queues to reduce the long waiting times caused by bursts of GPU requests; (ii) ByE, a mechanism that allows some CPU requests (those that access clean data in the DRAMCache or data that is currently not in the DRAMCache) to be temporally bypassed to utilize the under-utilized off-chip memory bandwidth; and (iii) Chaining, a technique that enforces spatial occupancy control by providing pseudo-associativity for GPUs to improve hit rates while still guaranteeing a certain occupancy for CPU lines. HAShCache achieves these goals using a lightweight and dynamic scheme that does not impose hard partitions on the shared DRAMCache. Using detailed simulations, we show that the proposed optimizations improve overall system performance by 41% on average over a carefully designed, but heterogeneity-unaware, DRAMCache.
51:4 A. Patil and R. Govindarajan
Although the techniques proposed in the article are simple and proposed in other contexts, deploying them in the DRAMCache helps to effectively address the disparate demands of IHS processors and improve the performance of both CPU and GPU cores by 211% and 20%, respectively, over an IHS processor with no DRAMCache.
Although there have been studies and proposals to share the on-chip last-level SRAM caches [31, 35] and non-volatile STT-RAM caches [47] in IHS architectures previously, to the best of our knowledge this is the first study on the interactions of such an architecture with a stacked DRAMCache focusing on shared cache management issues. The rest of the article is organized as follows: Section 2 demonstrates the performance that can be gained by adding a DRAMCache to an IHS processor chip and motivates the need for architecting a heterogeneity-aware DRAMCache organization. We present the design principles, organization, and the working of HAShCache in Section 3. Next, in Section 4, we describe the experimental setup and methodology. This is followed by evaluation and results in Section 5. Section 6 presents related work in this area and Section 7 concludes the article.
MOTIVATION
As mentioned in Section 1, the large capacity provided by the stacked DRAM is in line with the working set requirements of IHS processors. Using this capacity as a hardware-managed cache could provide performance gains without any application modifications. The bandwidth benefits and modestly improved latency provided by the DRAMCache help improve the performance of IHS processors over on-chip SRAM caches of reasonable sizes [46] .
In this article, we assume a cache organization similar to Alloy cache [39] with a block size of 128 bytes and study the problems and challenges in designing the DRAMCache for IHS architectures. We justify the design decisions in the following section. Each CPU core has a private L1 cache and a shared split L2 cache across the CPU cores. The GPU cores have private L1 and shared L2 cache among themselves. The stacked DRAMCache (of size 64MB) is the first-level shared cache across CPU cores and GPU cores. In our experiments, we consider a multi-programmed workload on the CPU cores and a GPGPU application on a single CPU core and multiple CUs. 1 We refer to this workload as Heterogeneous Application (HeA), as the CPU and GPU applications are co-run. We use the terms Homogeneous Application (HoA-CPU and HoA-GPU) to refer to the cases when (multiple) CPU applications are run alone and GPU application is run alone. Additional details relating to the experimental methodology and workloads are described in Section 4.
First, we evaluate the benefits of having a large DRAMCache for the CPU in an IHS architecture. Figure 2 (a) presents the performance improvement for CPU cores due to the addition of a stacked DRAMCache (in terms of harmonic mean of IPCs, normalized to HeA without a DRAMCache) when co-run with a GPU and when run alone. As can be seen from the figure, the addition of a stacked DRAMCache improves the performance of CPU applications in an IHS architecture, on average by just 42%. However, when the CPU runs alone with a stacked DRAMCache it achieves a 3.72× better performance than that achieved in IHS without a DRAMCache. We observe that this huge performance gap is primarily due to the interference of GPU applications. Although this HoA-CPU performance cannot be achieved in IHS, there still exists a significant opportunity for improvement. This gap in performance can be attributed to the unmanaged heterogeneity and interference in the DRAMCache organization.
We further investigate the cause of this performance gap. Figure 3 presents the DRAMCache access latency experienced by a CPU request (in terms of CPU cycles) on the primary y-axis and the CPU hit rates on the secondary y-axis, while running alone and co-run with the GPU application. We find that the presence of GPU application increases the average access latency of the DRAMCache by 213% while hit rates of the CPU are marginally impacted (only by about 4%) when co-running. This increase in average access latency is primarily attributed to the large number of GPU requests flooding the DRAMCache controller when the GPU kernels are co-running with the CPU applications. It should be noted here that, even though we use the terms CPU and GPU requests separately, they may refer to the same data. This terminology only indicates the source of the request.
Next, we study the impact of co-running on the GPU applications. Figure 2(b) presents the GPU performance obtained by the addition of a stacked DRAMCache when co-run with the CPU and when run alone. We observe that with the introduction of a DRAMCache in an IHS, the GPU performance improves on average by 24%. This performance is within 10% of its HoA-GPU performance with a DRAMCache. Thus, co-running with CPU applications has only a minor impact on GPU performance. Although the above discussion indicates that the design decisions of a heterogeneity-aware DRAMCache are influenced by the latency-sensitive CPU applications, we emphasize the need for GPU applications to effectively utilize the large capacity and the higher bandwidth of the DRAMCache. Toward this goal, we experimented with a DRAMCache of 128MB and two-way associativity. We observe that the hit rate for GPU applications improves, on average, by 5%. While this improvement in hit rate is small, the number of GPU requests is very large (two or three orders of magnitude higher than the number of CPU requests), so even such a small increase in the hit rate leads to large improvements in the utilization of the DRAMCache bandwidth by the GPU application. This makes a case for improving the associativity for GPU requests by introducing some pseudo-associativity in the DRAMCache, without impacting the hit time offered by a direct-mapped Tag-and-Data (TAD) cache.
Based on the above motivation study, we conclude that there are significant performance benefits that could be obtained with the introduction of the stacked DRAMCache for the latency-sensitive CPU applications. However, in a naive implementation of the DRAMCache, these benefits can be lost due to interference from the co-running GPU application. Hence, it is important to carefully architect the DRAMCache organization to ensure CPU applications are not hampered by this interference. This requires the design to be aware of the heterogeneity of the applications (CPU vs. GPU) and their demands on the memory hierarchy.
To quantify the effect of interference between CPU and GPU, we further experimented with a hard-partitioned DRAMCache. We partition the request queues and allocate a different set of banks for CPU and GPU data. We divide the 32 banks in the DRAMCache into 20 banks (62.5% of the cache size) and 12 banks (37.5% of the cache size) to alleviate bank conflicts. We use a data management and coherence policy such that the data requested by the core is found in its corresponding partition of the DRAMCache. Cache misses within the partition incur memory access latency. Although such a partitioning scheme is not realistic, it gives us the ceiling for the performance that can be expected with a partitioning approach. With this setup, we experiment with allocating a larger cache partition for the CPU first. The second bar in Figure 4(a) and (b) shows the performance of the CPU and GPU with such a partitioning scheme (in terms of harmonic mean of IPCs, normalized to HeA with an un-partitioned DRAMCache). We observe that the performance of the CPU drops by a mean of 5.2% while the performance of the GPU also drops by a mean of 2.6%. Next, we allocate the larger DRAMCache partition to the GPU cores. The third bar in Figure 4(a) and (b) shows the performance of the CPU and GPU for such a partitioning scheme, respectively. We observe that the performance of the CPU and GPU cores drops by 7.7% and less than 0.1%, respectively. Some of the workloads, e.g., Qg1, Qg2, and Qg5, experience a considerable fall in the H-mean of CPU cores (by more than 30%), and in none of the workloads does the performance of CPU or GPU cores improve significantly. Thus, hard partitioning of the DRAMCache is not an effective design: it leads to under-utilization of the large-capacity stacked DRAM, as the effects of interference from the co-running GPU application are felt only sporadically, when a kernel is running on the GPU.
Further, effectively utilizing the under-utilized main memory bandwidth [15, 18, 20] is also important to achieve higher performance. Lastly, the design should ensure that the CPU applications and the GPU application are able to utilize the large capacity of the stacked DRAM effectively to meet the working set requirements of the respective applications in the best possible way. In the next section, we present our HAShCache design, which addresses these issues.
HASHCACHE FOR IHS
HAShCache uses HMC-like [10] stacked DRAM memory. 2 It adopts an aggressive direct-mapped cache design with tags stored in DRAM as TAD units [39] and a cache line size of 128 bytes. In this section, we first rationalize these design decisions.
HAShCache Design
Metadata overhead: The metadata requirement for DRAMCaches, even for caches of size 64MB, is large: on the order of a few MBs. This large storage requirement, along with the associated cost if it has to be stored in SRAM, has driven DRAMCache designs to either use larger block sizes (of size 2KB or 4KB [26, 28]) to reduce metadata overhead or co-locate metadata alongside data in the DRAMCache [25, 34, 39].
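The scaling of metadata with block size can be illustrated with a short back-of-the-envelope sketch. The function name and the figure of 8 bytes of tag and state per block are assumptions made purely for illustration; the article does not state a per-block metadata size.

```python
MB = 1 << 20
KB = 1 << 10

def metadata_bytes(cache_bytes, block_bytes, meta_bytes_per_block):
    """Total metadata needed to tag every block of a cache."""
    num_blocks = cache_bytes // block_bytes
    return num_blocks * meta_bytes_per_block

# ASSUMPTION: 8 bytes of tag/state per block (illustrative only).
ASSUMED_META_BYTES = 8

if __name__ == "__main__":
    for block in (128, 512, 2 * KB, 4 * KB):
        size = metadata_bytes(64 * MB, block, ASSUMED_META_BYTES)
        print(f"{block:5d} B blocks -> {size / KB:8.0f} KB of metadata")
```

Under this assumption, a 64MB cache with 128 byte blocks needs 4MB of metadata, while 4KB blocks need only 128KB, which is the pressure behind the two design directions described above.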
To understand the spatial locality characteristics of large blocks in an IHS configuration, we experimented with a 512 byte block organization for the DRAMCache. The 512 byte block size provides a good tradeoff between increased hit rates and reduced metadata for multi-core CPU workloads [21]. While we increase the cache line size of the DRAMCache, the higher caches (CPU and GPU L2) still operate at a 128 byte cache line size, i.e., requests to and responses from the DRAMCache are for 128 byte cache lines. Thus, one 512 byte cache block in the DRAMCache has four 128 byte sub-blocks that can be requested by higher-level caches. When a requested cache line is not found in the DRAMCache, a request for the corresponding 512 byte aligned cache block is sent to DRAM memory. Bringing in a larger block (512 bytes) can improve the cache hit ratio (due to spatial locality); however, the miss penalty of the DRAMCache also increases due to the increased fetch and data transfer.
We evaluate how much of the data brought by the DRAMCache is actually used by the higher-level caches in this setup. We find that on average, for 65% of the 512B blocks that are brought into the cache, at most two sub-blocks of 128 bytes are requested/used. This implies wasted capacity in the DRAMCache and wasted off-chip bandwidth. Figure 5 shows, on the primary y-axis, the performance of the CPU and GPU with a 512 byte block size DRAMCache, in terms of harmonic mean of IPCs normalized to that of a 128 byte block size DRAMCache. Although with the increased cache line size of 512 bytes we observe an 11% and 8% improvement in hit rate in the DRAMCache (not shown in Figure 5) for the CPU and GPU, respectively, the performance degrades by 12.2% and 10.6%, respectively, for the CPU and GPU cores compared to a 128 byte block organization. The poor performance of a large cache block size is attributed to the increased average access latency of the DRAMCache (due to the larger block size fetch) and the wasted off-chip bandwidth. The secondary y-axis in Figure 5 shows the percentage increase in memory access latency for the 512 byte DRAMCache organization over the 128 byte DRAMCache organization. We observe that memory access latency increases by an average of 75% due to the unused data fetched.
2 There are two commercial implementations of stacked DRAM available on the market today. Hybrid Memory Cube (HMC) is a joint Intel-Micron standard that is not yet a JEDEC standard. HMC has a logic die (ASIC) at the base of the stacked DRAM that houses the control logic for the DRAM stacks (as implemented in the Xeon Phi Knights Landing processor [7]). On the other hand, the High Bandwidth Memory (HBM) proposed by AMD, NVIDIA, and Hynix has been ratified by JEDEC. HBM tightly couples the host compute die with an interface that is divided into independent channels. HBM is slated to be used in the NVIDIA Pascal GPU and the next generation AMD APU/GPU. Both these technologies promise high bandwidth of up to 250GBps using TSV interconnects. The only difference in the technology would be the signaling interface. While HMC uses a SERDES interface to its stacked DRAM's logic die, HBM uses a traditional DDR signaling interface to the stacked DRAM chip. Our evaluation in this work used an HMC-like model and timing parameters. However, HAShCache is flexible enough to be adapted to either technology. HAShCache addressing schemes, mechanisms, and associated metadata are stored in the DRAMCache controller. In the case of HMC, this would be implemented at the logic die, while the same would be implemented at the host-side stacked DRAM controller in the case of HBM.
Tags-in-DRAM designs have further focused on improving access latencies by removing the tag-serialization overhead using overlapped tag lookups [34] or by storing TAD units [39]. These designs come close to tags-in-SRAM-like access latency without concerns about the spatial locality characteristics of large block sizes. Hence, HAShCache organizes data at a 128 byte block size and stores data in the DRAMCache as a cohesive TAD unit.
Associativity: Providing set associativity is known to improve cache hit rates by reducing conflict misses. In DRAMCaches where the tag is stored in DRAM, associativity comes at a cost: hit latencies increase because the tags must be burst out of the stacked DRAM. Hence, there is an implicit tradeoff between providing better hit rates and reduced access latency. A higher associativity design is suitable for a GPGPU processor, which can trade increased access latency for higher hit rates to make better use of the larger bandwidth of the DRAMCache. On the other hand, the CPU would suffer when using such a design due to the increased hit latency and would instead prefer a latency-optimized direct-mapped cache [39]. Further, such a direct-mapped organization simplifies the design and eschews the need for tag-caches [25] and way locators [21] to improve hit times. HAShCache's organization is also in line with the commercially adopted stacked MCDRAM on the Knights Landing generation of the Intel Xeon Phi processor [7], where the DRAMCache is organized as a direct-mapped cache.
Miss Penalty: The access latency provided by a stacked DRAM is only slightly better (say, 0.7× 3 ) compared to off-chip DRAMs. Thus, a miss in the DRAMCache would experience a delay of 1.7× as the DRAM (memory) access takes place after the miss is detected (serially). To overcome this, researchers have proposed cache line hit predictors [34, 39] which are critical to extract performance from DRAMCaches. These predictors start an early access to memory if they predict that the block will miss in the DRAMCache. Intuitively, we apply the MAP-I prediction [39] to CPU requests to start early memory access when an access is predicted to be a miss. GPU requests always proceed serially through the cache after verifying via tag match. This helps to (a) reduce the wastage of off-chip bandwidth for mis-predictions for GPU and (b) avoid large structures that will be required for making reliable predictions for GPUs which might require correlating warp, thread, and CU IDs.
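The arithmetic behind the 1.7× figure, and the effect of starting the memory access early on a predicted miss, can be sketched as follows. The function name is hypothetical, and modeling the overlapped case as the maximum of the two latencies (with no queuing delay) is a simplifying assumption, not the article's timing model.

```python
def miss_latency(dram_cache=0.7, memory=1.0, predicted_miss=False):
    """Latency of a DRAMCache miss, in units of one off-chip DRAM access.

    dram_cache and memory default to the relative latencies quoted in
    the text (0.7x and 1.0x).
    """
    if predicted_miss:
        # ASSUMPTION: the memory access is issued in parallel with the
        # tag check, so the slower of the two determines the latency.
        return max(dram_cache, memory)
    # Serial case: memory is accessed only after the miss is detected.
    return dram_cache + memory

if __name__ == "__main__":
    print("serial miss:   ", miss_latency())
    print("predicted miss:", miss_latency(predicted_miss=True))
```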
Row Buffer Hits (RBHs) vs. Bank-Level Parallelism (BLP): Stacked DRAMs are organized as vaults (channels), layers (ranks), and banks within each layer, as shown in Figure 1(b). Each vault has several TSVs which constitute lanes in a channel. Stacked DRAMs provide large bandwidth by organizing DRAMs as several smaller banks within layers. Given this abundant BLP, should DRAMCaches exploit this parallelism over improved RBH? In other words, should the addressing scheme of a DRAMCache be organized as RoCoRaBaCh (Row, Column, Rank, Bank, Channel), referred to as the BLP scheme, which distributes the cache blocks across banks of different ranks, or as RoRaBaCoCh, referred to as the RBH scheme, which stores cache blocks consecutively in the row of a bank? We experimented with both addressing schemes for an IHS processor with a DRAMCache and observe that the CPU and GPU perform on average 3% and 1% worse, respectively, when using the BLP scheme. Consequently, HAShCache uses the RBH-friendly RoRaBaCoCh addressing scheme.
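The difference between the two addressing schemes can be made concrete with a small decoder sketch. It reads each scheme name from most significant to least significant field, which is an assumption about the naming convention. The field widths are hypothetical and chosen only to keep the example small; they are not the geometry used in the evaluation.

```python
# Field order, most significant first (ASSUMED reading of the scheme names).
RBH_SCHEME = ["row", "rank", "bank", "col", "chan"]   # RoRaBaCoCh
BLP_SCHEME = ["row", "col", "rank", "bank", "chan"]   # RoCoRaBaCh

# HYPOTHETICAL widths in bits: 4 channels, 8 columns, 4 banks, 2 ranks.
WIDTHS = {"chan": 2, "col": 3, "bank": 2, "rank": 1}

def decode(block_addr, scheme, widths=WIDTHS):
    """Split a cache-block address into DRAM coordinates."""
    fields = {}
    # Peel fields off from the least significant end.
    for name in reversed(scheme[1:]):
        width = widths[name]
        fields[name] = block_addr & ((1 << width) - 1)
        block_addr >>= width
    # Whatever remains is the most significant field (the row).
    fields[scheme[0]] = block_addr
    return fields

if __name__ == "__main__":
    # Blocks 0, 4, 8 all map to channel 0 with these widths.
    for addr in (0, 4, 8):
        print(addr, "RBH:", decode(addr, RBH_SCHEME))
        print(addr, "BLP:", decode(addr, BLP_SCHEME))
```

With these widths, successive blocks on one channel walk through the columns of a single open row under the RBH scheme, whereas the BLP scheme sends them to different banks.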
HAShCache Mechanism
As shown in Figure 3, despite the addition of such a carefully designed DRAMCache, the CPU still suffers significant performance losses while the GPU is relatively unaffected in an IHS architecture, compared to when they run alone. The GPU's ability to context switch between warps makes it more latency tolerant. This suggests that the DRAMCache should be optimized to regain the lost CPU performance without compromising the GPU performance. In this subsection, we propose three schemes for achieving this.
Heterogeneity-Aware DRAMCache Scheduling. DRAM devices operate at a much lower clock rate compared to the processor. Moreover, DRAM cells need to be periodically refreshed to preserve the data, which further reduces the available time for accessing the device. The imbalance in request arrival rate and service times creates a queuing effect. Hence, DRAM devices have traditionally had limited-size queues to hold requests until they can be serviced by the device. However, we observe that in an IHS processor the large burst of requests from the GPU quickly exhausts the available queue positions (buffer locations) at the DRAMCache controller, leading to requests being rejected and causing the DRAMCache to be blocked. The CPU requests, which are interleaved with the GPU requests, are few and far spaced and thus suffer large waiting times due to retries. This is compounded by the fact that the GPU exploits good row buffer locality and is preferentially scheduled by the DRAMCache controller (under First-Ready First-Come-First-Serve (FR-FCFS) scheduling [9]), causing increased queue latencies for CPU requests. Increasing queue lengths beyond a certain measure increases scheduling overheads, as DRAM schedulers search the queues for the most suitable request to schedule based on certain heuristics.
HAShCache reduces waiting time of CPU requests by prioritizing them at the DRAMCache controller without starving GPU requests. For this, HAShCache applies a CPU Prioritized FR-FCFS algorithm over each of the Read, Write, and Fill queues to schedule a request at each bank. The scheduler is cognizant of the request heterogeneity and searches the short queues for either a first CPU row buffer hit request or a first CPU row activation request to schedule before scheduling a GPU request in a FR-FCFS manner. For GPU requests, starvation is avoided firstly, by allowing GPU requests to be scheduled to a prepped (open) row after the CPU has completed access to that row and secondly, by allowing GPU to schedule its requests to a bank immediately, when the queue has no more CPU requests to that bank. These scheduling decisions are made subject to the device timing constraints, similar to an FR-FCFS scheduler.
Prioritization of CPU requests alone may not help, as the flood of memory requests from GPU can quickly fill the precious buffer at the DRAMCache controller, resulting in CPU requests not even entering the buffer. HAShCache overcomes this problem by guaranteeing certain minimum occupancy for CPU requests in the buffer at the DRAMCache Controller. This is accomplished using a selective reject-retry mechanism 4 for GPU requests when the queues reach a certain critical level. Together, these two mechanisms attempt to reduce the DRAMCache latency experienced by CPU requests. We refer to these mechanisms collectively as PrIS (CPU Prioritized FR-FCFS with IHS-aware Scheduling).
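The occupancy guarantee described above can be sketched as a simple admission check at the controller queue. The class name, the queue capacity, and the number of slots held back for CPU requests are all hypothetical; the article does not give the critical level it uses.

```python
class ControllerQueue:
    """Request queue that keeps a minimum number of slots free for CPU requests."""

    def __init__(self, capacity=32, cpu_reserved=4):
        # ASSUMPTION: capacity and cpu_reserved are illustrative values.
        self.capacity = capacity
        self.cpu_reserved = cpu_reserved
        self.entries = []

    def try_enqueue(self, source, request):
        """Return True if accepted, False if rejected (the sender retries later)."""
        free = self.capacity - len(self.entries)
        if free == 0:
            return False  # queue full: reject regardless of source
        if source == "GPU" and free <= self.cpu_reserved:
            return False  # critical level reached: selectively reject GPU requests
        self.entries.append((source, request))
        return True

if __name__ == "__main__":
    q = ControllerQueue(capacity=4, cpu_reserved=1)
    print([q.try_enqueue("GPU", i) for i in range(4)])  # last GPU request is rejected
    print(q.try_enqueue("CPU", "load"))                 # CPU still gets in
```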
PrIS differentiates requests broadly as CPU or GPU requests and not within individual CPU cores or GPU CUs for scheduling. 5 We find that in an IHS architecture the interference between CPU and GPU applications vastly overwhelms the interference between homogeneous application workloads. Hence, PrIS only has to make a binary selection for scheduling, which greatly simplifies the scheduling algorithm. PrIS is a simple yet effective, single-stage modified FR-FCFS algorithm that does not incur any hardware overhead in terms of multiple or large request queues or batching stages as in [9].
PrIS picks requests to be scheduled from the input queue. The exact selection of input queue (Read, Write, or Fill Queue) is done external to this algorithm based on certain heuristics and constraints which is beyond the scope of this scheduler algorithm. Broadly, the algorithm picks one of the three types of requests: (a) seamless row buffer hit, (b) hidden bank prep, or (c) prepped row, in that strict order of priority.
A seamless row buffer hit request refers to a request that can issue a column access to an already activated row in the bank, without any further delay. A hidden bank prep request is a request that can overlap the current operation in other banks (in the same rank) and issue a request to activate or precharge a row in the requested bank. Among the hidden bank prep requests an FCFS policy is followed. A prepped row request refers to a request that needs to wait for the current column access to complete to the currently active row in a bank. Thus, choosing a prepped row leads to a bubble in the pipeline of the scheduler where the request has to wait for the row to become available for a column access command.
Additionally, the PrIS algorithm picks a request in the following priority order: CPU seamless row buffer hit > CPU hidden bank prep > CPU prepped row > GPU seamless row buffer hit > GPU hidden bank prep > GPU prepped row.
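This priority order amounts to a lexicographic comparison on (source, request type, arrival order), which the following sketch illustrates. It assumes each queued request has already been classified against the current bank state and that ties within a class are broken in arrival (FCFS) order; the `Request` type and its field names are hypothetical.

```python
from dataclasses import dataclass

# Lower rank means higher priority.
SOURCE_RANK = {"CPU": 0, "GPU": 1}
KIND_RANK = {"row_hit": 0, "hidden_bank_prep": 1, "prepped_row": 2}

@dataclass
class Request:
    arrival: int   # position in arrival order
    source: str    # "CPU" or "GPU"
    kind: str      # ASSUMED to be pre-classified from the bank state

def pris_pick(queue):
    """Return the request PrIS would schedule next, or None if the queue is empty."""
    if not queue:
        return None
    # CPU before GPU, then request type, then FCFS (ASSUMED tie-break).
    return min(queue, key=lambda r: (SOURCE_RANK[r.source],
                                     KIND_RANK[r.kind],
                                     r.arrival))

if __name__ == "__main__":
    queue = [
        Request(0, "GPU", "row_hit"),
        Request(1, "CPU", "prepped_row"),
        Request(2, "CPU", "hidden_bank_prep"),
    ]
    print(pris_pick(queue))
```

In this example the CPU hidden bank prep request is chosen ahead of the older GPU row buffer hit, which a plain FR-FCFS scheduler would have preferred.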
We experimented with several combinations of this priority order and find that prioritizing CPU requests at all levels provides the best performance for CPU requests, while the reduction in GPU performance due to de-prioritization between the schemes was not significant. Although PrIS may introduce delays in scheduling GPU requests, due to the relatively small number of memory requests coming from the CPU (in comparison to the number of requests from the GPU), it is often the case that the GPU requests get serviced without much delay or starvation. In fact, our simulator framework shows that in all our experiments a GPU request is never delayed beyond 500 cycles for any of the benchmarks. 6 This suggests that the applied heuristic in PrIS allows such request service reordering without causing starvation or deadlocks.
4 Interactions between the memory components are modeled as a master-slave port architecture in gem5-gpu. This is common practice in most architectures that involve requests and responses. A master port sends requests, while the slave port receives requests and performs the appropriate action on them. For example, when the directory controller's memory master port sends a request to the slave port of the DRAMCache controller, it enqueues the request in the appropriate queue (read or write queue). Similarly, a slave port sends out responses and the master port receives these responses and performs the appropriate action. For example, when the DRAM controller responds to a read request, the DRAMCache controller clears the corresponding Miss Status Handling Register (MSHR) and sends a response to the requester. However, a cache can be in a blocked state when MSHRs, targets within an MSHR, or WriteBuffers are unavailable. In such situations, a request is received at the DRAMCache slave port, but the cache is not able to act on it because it is blocked or the read/write queue is full. In such cases, the request is rejected and the reason for the blocking is remembered (i.e., the resource causing the block). When the corresponding resource becomes available, a retry message is sent to the master port from the slave port. Once the master port receives this message, the request is resent. This is called the reject-retry mechanism. The reject-retry mechanism for requests is already part of gem5-gpu [38]. This mechanism is independent of the NoC that handles routing, channel partitioning, and so forth.
Heterogeneity-Aware Temporal Bypass. The large size of the stacked DRAMCache ensures that cache lines have a fairly long residency time before being evicted. Hence, the DRAMCache has fairly high hit rates, which leads to idling of off-chip DRAM bandwidth. Moreover, the stacked DRAM and off-chip DRAM utilize the same underlying technology and hence incur comparable latency (0.7× for DRAMCache vs. 1.0× for DRAM). In an IHS architecture, the increased access latency incurred by a CPU request at the DRAMCache (see Figure 3) when the GPU is running makes the off-chip DRAM an attractive target to which some of the CPU requests can be directed. This leads to improved resource utilization without incurring any increased latencies for CPU requests.
To hide the latency of a miss, our aggressive baseline design already incorporates a hit/miss predictor (similar to the MAP-I predictor [39]) for CPUs to initiate an early access to off-chip DRAM when a miss is predicted in the DRAMCache. These requests are then enqueued in the DRAMCache queues for verification of a miss 7 by a tag match.
When the tag is matched in the DRAMCache, in the case of a hit, data from the DRAMCache is forwarded to the requester and the DRAM memory access is squashed or its response is ignored. In the case of a miss, data from the memory is forwarded and inserted into the cache. Normally, it is expected that the access to the DRAMCache completes earlier (due to its relatively lower access latency) than the DRAM response. However, when the GPU is running, the parallel request to off-chip DRAM memory often returns earlier and waits in the Miss Status Handling Registers (MSHRs). This is due to the increased queuing delay at the DRAMCache compared to the access latency at the DRAM. HAShCache exploits this observation to bypass CPU read requests for both misses and clean lines.
For this, HAShCache uses a Bypass Enabler (ByE). ByE uses a counting Bloom filter [12, 17] that tracks the dirty lines in the DRAMCache and provides a space-efficient way to determine if a given request can be bypassed. The property of a Bloom filter to answer "definitely not in set" allows us to bypass requests correctly, i.e., without verifying tags in the set of the DRAMCache. On a write request, when a cache line becomes dirty in the DRAMCache, the address is hashed into ByE and the corresponding counters are incremented. When a dirty line is evicted from the DRAMCache, ByE attempts to remove the entry from the Bloom filter by decrementing the corresponding locations. 8 ByE bypasses CPU requests only when the GPU cores are executing the kernel. For this, all CPU read requests look up ByE as shown in Figure 6. If the Bloom filter search returns a negative result, the address is guaranteed to be not dirty in the DRAMCache. Thus, the request can safely be bypassed to utilize the off-chip DRAM bandwidth. Further, when the bypassed CPU requests return from the off-chip DRAM access, these requests are directly forwarded to the requester and are not inserted into the cache. Firstly, this allows ByE to ensure that future write requests for the line do not hit in the DRAMCache, as an increase in dirty lines would lead to reduced bypass efficiency. Secondly, this allows ByE to reduce some of the bloat caused by a Miss Fill [15] into the DRAMCache.
6 It is also possible to modify this scheme to reset priority/switch back to a baseline FR-FCFS scheme after servicing some number of CPU requests to avoid possible GPU starvation.
7 This is required to ensure that misprediction does not result in using data from the DRAM for lines modified (dirty) in DRAMCache.
8 Counting Bloom filters use saturating counters. If a counter saturates, decrementing it can lead to false negatives (dirty lines predicted as clean lines and wrongly bypassed). Hence, saturated counters are never decremented. Although this may increase the false positives (clean lines being predicted as dirty), which only reduces the benefits obtained by ByE, it does not affect functional correctness. In our implementation, we observe that on average just 2% of 2-bit counters in the Bloom filter saturate out of the 512K counters.
All write requests and GPU read requests proceed serially after looking into the cache. ByE does not bypass any write requests as it would otherwise require a back-invalidation of the cache line, if present in the DRAMCache, which would need a full DRAMCache access.
We find that a small 2-bit counting Bloom filter implemented with two H3 hash functions [40] and 512K entries per controller is sufficient to produce reasonable bypass efficiency with a tolerable misprediction rate. The total overhead for ByE is 256KB for a 64MB DRAMCache, which is less than 0.4% of the cache size. 9
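The ByE bookkeeping described above can be sketched as a counting Bloom filter with sticky saturating counters. This is a minimal illustration, not the authors' implementation: the class and method names are hypothetical, the two multiplicative hashes are placeholders standing in for the H3 hash functions, and treating a counter as saturated as soon as it reaches its maximum value is an assumption.

```python
class CountingBloomFilter:
    """Illustrative dirty-line tracker; names and hashes are assumptions."""

    # Placeholder odd multipliers. These are NOT the H3 hash functions [40].
    _MULTIPLIERS = (0x9E3779B1, 0x85EBCA6B)

    def __init__(self, num_counters=512 * 1024, counter_bits=2):
        self.num_counters = num_counters
        self.counter_bits = counter_bits
        self.max_count = (1 << counter_bits) - 1
        # One byte per counter keeps the sketch simple; hardware would pack bits.
        self.counters = bytearray(num_counters)

    def _indexes(self, line_addr):
        return [(((line_addr * m) & 0xFFFFFFFF) >> 8) % self.num_counters
                for m in self._MULTIPLIERS]

    def mark_dirty(self, line_addr):
        """Called when a write makes a DRAMCache line dirty."""
        for i in self._indexes(line_addr):
            if self.counters[i] < self.max_count:
                self.counters[i] += 1

    def clear_dirty(self, line_addr):
        """Called when a dirty line is evicted from the DRAMCache."""
        for i in self._indexes(line_addr):
            # Saturated counters are never decremented (avoids false negatives).
            if 0 < self.counters[i] < self.max_count:
                self.counters[i] -= 1

    def may_be_dirty(self, line_addr):
        return all(self.counters[i] > 0 for i in self._indexes(line_addr))

    def can_bypass(self, line_addr, gpu_active):
        """CPU reads bypass only while a GPU kernel runs and the line is surely clean."""
        return gpu_active and not self.may_be_dirty(line_addr)

    def storage_bytes(self):
        """Storage of the packed hardware array, not of this Python object."""
        return self.num_counters * self.counter_bits // 8
```

With the default parameters, `storage_bytes()` gives 128KB per controller. Assuming two controllers (one per 32MB vault of the 64MB DRAMCache), this matches the 256KB total quoted above, i.e., 256KB / 64MB ≈ 0.39% of the cache size.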
Heterogeneity-Aware Spatial Occupancy Control. The schemes proposed in the previous two subsections, PrIS and ByE, attempt to improve the latency of CPU requests. The mechanism described in this section details HAShCache's approach to improve the utilization of the DRAMCache for GPU requests, in order to exploit the higher bandwidth provided by it. We make the following observations and inferences: (i) For the GPU to be able to better utilize the DRAMCache bandwidth, the hit rates for GPU workloads should be large enough that the GPU does not have to frequently use the relatively constricted off-chip DRAM buses. (ii) As noted in Section 2, GPUs can trade access latencies for higher hit rates. Further, providing associativity for GPU requests improves the hit rate. (iii) The working sets of CPU applications tend to be limited to a few tens of MBs due to the limited amount of MLP that can be exploited by the CPUs. Thus, providing more than a certain share of the cache leads to no further improvements in hit rates and IPC for the CPU. Nevertheless, the CPU can still gain from some share of the DRAMCache due to reduced latency of access. (iv) Given that the GPU can exploit much higher MLP using several thousands of threads, the relatively small GPU L2 cache provides limited filtering of traffic and has significantly high miss rates; CPUs, on the other hand, have L2 caches large enough to retain blocks longer before re-requesting them.
The above observations lead to the following conflicting requirements. It is important to ensure that the CPU requests have a certain share (minimum occupancy) in the DRAMCache to ensure the benefits of lower latency, while to effectively use the larger share of DRAMCache for GPU requests, it may be required to increase associativity of the DRAMCache. However, such an associativity should not unduly increase the hit latency for CPU requests.
To accomplish the above goals, HAShCache uses the Chaining scheme, which introduces (pseudo) associativity mainly for GPU requests while ensuring a certain minimum occupancy for CPU lines. The Chaining scheme uses a linear probing-like technique inspired by the collision resolution mechanism of a hash map. To ensure minimum occupancy for CPU requests, Chaining maintains a low-threshold value (l_cpu), and when the occupancy of CPU lines 10 reaches this threshold, Chaining ensures GPU data does not replace data brought in by the CPU.
In the other situations, HAShCache modifies the replacement policy in the DRAMCache depending on the requesting core type. For a GPU request that is evicting another GPU line, HAShCache looks for a line belonging to a CPU to replace within the same row in the next three consecutive locations, i.e., if B is the original cache block, then the blocks considered for insertion are (B + 1) % N_s, (B + 2) % N_s, and (B + 3) % N_s, where N_s is the number of blocks in a DRAMCache row (page). Hence, the inserted block always lies in the same row as the original cache block. Note that for every set, there could be at most one chained set, providing a pseudo-associativity of at most 1. We refer to this inserted location as the chained block and the actual cache block the request mapped to as the original block. The location of the chained block is then represented as a 2-bit offset and is stored along with the metadata for the original set (see Figure 7(a)). When a cache block is evicted, if it was a chained block, the offset stored in the reverse chain bits field in the metadata for the chained block is used to unchain it from the original block. The chain dirty bit field (Figure 7(a)) in the metadata indicates whether the chained location, if any, holds modified data. This is used to optimize the access path for the CPU and reduce the adverse effect of a double set lookup for latency-sensitive CPU requests, as shown in Figure 7(b). Chaining relies on the hit/miss predictor to have started an early access to memory. This avoids the second set lookup for the CPU if the parallel memory (PAM) access has returned and the chained block is clean. Additionally, each tag also stores 1 bit of information about the owner of the block (CPU or GPU). This bit is used to make quick replacement decisions locally. The additional metadata required for Chaining is only 6 bits, which can easily be accommodated in the existing 8-byte metadata.
Lastly, the unused 8 bytes at the end of each row (page) are used to store ownership information of each block in the row (15 bits). This information is used to make the Chaining replacement decision.
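To make the 6-bit metadata budget concrete, the fields named above can be packed as in the following sketch. The bit order, the use of offset 0 to mean "no chain", and the 2-bit width of the reverse chain field (inferred from the 6-bit total) are assumptions for illustration; the function names are hypothetical.

```python
def pack_chain_meta(chain_off, reverse_off, chain_dirty, owner_is_gpu):
    """Pack Chaining metadata into 6 bits (assumed layout, low bits first):
    [1:0] chain offset, [3:2] reverse chain offset, [4] chain dirty, [5] owner."""
    if not (0 <= chain_off <= 3 and 0 <= reverse_off <= 3):
        raise ValueError("offsets must fit in 2 bits")
    return (chain_off
            | (reverse_off << 2)
            | (int(bool(chain_dirty)) << 4)
            | (int(bool(owner_is_gpu)) << 5))


def unpack_chain_meta(bits):
    """Inverse of pack_chain_meta."""
    return (bits & 0x3, (bits >> 2) & 0x3, bool((bits >> 4) & 1), bool((bits >> 5) & 1))


def chained_index(b, chain_off, n_s):
    """Block index of the chained block within the same row, or None if unchained.
    Offset 0 meaning 'no chain' is an assumption of this sketch."""
    return (b + chain_off) % n_s if chain_off else None
```

For example, with a row of 15 blocks, `chained_index(14, 2, 15)` wraps around to block 1, so the chained block stays in the same row as the original block.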
As explained earlier, when the l_cpu threshold is reached, GPU lines are not allowed to evict CPU blocks. GPU requests contending to evict a CPU line are forced to chain to another block belonging to a GPU and evict that instead, thus maintaining the l_cpu occupancy for the CPU. In the very rare case that a GPU block is not found within the three consecutive locations, the request is not inserted into the cache. Thus, even with the Chaining scheme, HAShCache follows the simple static replacement policy, i.e., evict the conflicting cache line according to Table 1 for GPU lines and direct mapped for CPU lines. HAShCache does not require one to maintain LRU/MRU recency information for replacement.
We now summarize the insertion policy used by the Chaining mechanism in HAShCache. For all CPU fill (insertion) requests, the data is always inserted in the original block, and the victim block is evicted, removing chaining, if any, using the reverse chain bits. For a GPU fill request, if the low threshold mark for CPU occupancy is not reached, then the Chaining scheme replaces a CPU location, either from the original location or from the chained location, as indicated in row 1 of Table 1. For a GPU fill request, if the original location belongs to the GPU and does not have a chained location, then the block is inserted in one of the nearest CPU blocks ((B + 1) % N_s, (B + 2) % N_s, or (B + 3) % N_s). If such a nearest CPU block is not found, the request is inserted into the original block itself. If the original location is chained, then the scheme replaces the chained location if that belongs to the CPU, or else the original location itself, as indicated in row 2 of Table 1. When the CPU occupancy has hit the low threshold, a GPU fill request replaces the original location if it belongs to the GPU (row 4 in Table 1). If the original location belongs to the CPU and does not have a chained block, then the GPU request is chained to the next nearest GPU location. If the original location is chained and the chained location belongs to the GPU, then the fill request replaces that. Otherwise, the fill request is not inserted in the DRAMCache (see row 3 in Table 1). Thus, the Chaining scheme ensures, as far as possible, that the CPU requests can find the data in the original location, while the GPU requests attempt to exploit pseudo-associativity for increased cache occupancy. In all cases (for both GPU and CPU requests), the access is satisfied with at most two tag matches, either in the original location or in the chained location (identified by the chain bits).
10 As mentioned earlier, we classify data as CPU data or GPU data based on whose request last accessed the data in the DRAMCache. An alternate, equally feasible design point is to classify the data as CPU or GPU data based on which request brought the data into the DRAMCache, although the CPU and GPU may subsequently access it. In our setup, only the CPU core executing the GPGPU application possibly shares data with the GPU, and hence we do not expect to see a significant difference in the results.
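The insertion policy summarized above can be read as a small decision function. The sketch below is one interpretation of the prose only (Table 1 is not reproduced here), and it omits the chain metadata updates, dirty handling, and write-backs. The function names, the list-of-owners representation of a row, and the `chained` argument (index of the original block's chained block, or `None`) are hypothetical.

```python
CPU, GPU = "C", "G"


def probe(owners, b, wanted_owner):
    """Return the first of (b+1), (b+2), (b+3) mod N_s owned by wanted_owner, else None."""
    n_s = len(owners)
    for off in (1, 2, 3):
        cand = (b + off) % n_s
        if owners[cand] == wanted_owner:
            return cand
    return None


def cpu_fill_target(b):
    """CPU fills always go to the original block."""
    return b


def gpu_fill_target(owners, b, chained, cpu_at_threshold):
    """Block index a GPU fill should overwrite, or None if the fill is dropped."""
    if not cpu_at_threshold:
        # CPU occupancy is above l_cpu: GPU fills prefer to replace CPU lines.
        if owners[b] == CPU:
            return b
        if chained is not None:
            return chained if owners[chained] == CPU else b
        cand = probe(owners, b, CPU)
        return cand if cand is not None else b
    # CPU occupancy is at l_cpu: CPU lines are protected from GPU fills.
    if owners[b] == GPU:
        return b
    if chained is None:
        return probe(owners, b, GPU)  # None means the fill is not inserted
    return chained if owners[chained] == GPU else None
```

For instance, `gpu_fill_target([CPU, CPU, GPU, CPU], 0, None, True)` returns 2: the original block holds protected CPU data, so the request chains to the nearest GPU-owned block in the row.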
In essence, HAShCache uses this Chaining mechanism to enforce occupancy control in the DRAMCache. Chaining is able to (i) ensure a minimum occupancy for the CPU lines while effectively allowing the GPU to occupy the rest of the DRAMCache by providing pseudo-associativity; (ii) remain a direct-mapped cache for the majority of the CPU requests; and (iii) avoid forcing eviction of hot GPU lines while also avoiding storing of dead lines in the cache. Lastly, this scheme is dynamic and allows one to adaptively set the CPU occupancy threshold l_cpu based on the workload's requirements. This occupancy control mechanism does not incur any additional storage and uses the unused space in the DRAMCache rows. Once the GPU finishes kernel execution, HAShCache returns to a direct-mapped cache, as the CPU lines inserted into the DRAMCache occupy chained blocks, thereby unlinking chains.
EVALUATION METHODOLOGY
In this section, we describe the experimental methodology used in our evaluation.
HAShCache Configuration: The DRAMCache we evaluate here is a memory-side, first-level shared cache between the two split cache hierarchies of CPUs and GPUs. Access to memory is interleaved across multiple memory controllers. To avoid a biased interleaving due to uniformly striped address patterns, a basic XOR-based address hashing mechanism is employed. Each memory controller manages a 4GB DDR3 memory device and a 32MB stacked DRAM vault. The stacked DRAM vault caches data from the corresponding 4GB memory device that the controller is responsible for. This setup ensures that there are no cross-bus requests between controllers. Since our simulation setup is limited to a one-to-one channel mapping between a stacked DRAM vault and an off-chip DRAM channel, our model does not provide the channel-level parallelism that is essential in a stacked DRAM device. To circumvent this limitation, we increase the number of layers (ranks) to provide a higher amount of parallelism in our stacked DRAM. However, the bank capacity is retained as it would be in a standard stacked DRAM. The DRAMCache is non-inclusive [24] of the L2 caches above it and uses a write no-allocate policy. The NoC is modeled as a hierarchical crossbar topology. One crossbar connects the CPU and caches to each other and another crossbar connects the GPU to the caches. A third crossbar connects the LLCs of the CPU and GPU to the DRAMCache memory controller. The DRAMCache controller does the request scheduling and dispatches commands to the stacked DRAM devices. The DRAMCache controller and off-chip DRAM controller are connected by a point-to-point link. The DRAMCache controller forwards the request to the off-chip DRAM controller if necessary.
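As a rough illustration of why XOR-based hashing helps with uniformly strided addresses, the sketch below XOR-folds the block address to pick a controller. The exact hash used in the evaluated setup is not specified here, so the folding scheme, the function name, and the defaults (two controllers, 128B blocks) are assumptions.

```python
def xor_channel(addr, num_channels=2, block_bytes=128):
    """Pick a memory controller by XOR-folding the block address (illustrative)."""
    if num_channels < 2 or num_channels & (num_channels - 1):
        raise ValueError("num_channels must be a power of two >= 2")
    bits = num_channels.bit_length() - 1   # index bits consumed per fold step
    block = addr // block_bytes
    channel = 0
    while block:
        channel ^= block & (num_channels - 1)
        block >>= bits
    return channel


# A 256B stride touches only even-numbered blocks: plain modulo interleaving
# sends every access to controller 0, while the XOR fold spreads them out.
stride_addrs = [i * 256 for i in range(16)]
plain = [(a // 128) % 2 for a in stride_addrs]
hashed = [xor_channel(a) for a in stride_addrs]
```

In this example `plain` contains only controller 0, whereas `hashed` splits the sixteen accesses evenly across the two controllers.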
Simulator: We evaluate the performance of the DRAMCache over an IHS processor using a modified version of the cycle-accurate simulator gem5-gpu [38], which is configured to simulate cache-coherent unified CPUs and GPUs. Our IHS consists of five CPU cores (four-wide OoO cores, operating at 2.5GHz) with a 32KB private L1 cache (split I/D), and a 1MB shared L2 cache (shared across all CPU cores). The IHS also consists of eight Fermi-like compute units, operating at 700MHz with a private 64KB L1 and 512KB shared L2 cache (shared across all CUs). The details of the IHS configuration and simulator setup are given in Table 2. The private L1 caches of the CUs are non-inclusive of the shared GPU L2 cache and can hold stale data. However, the GPU L2 cache is kept coherent with all levels of the CPU hierarchy in the simulator. Our simulator also respects all significant timing and functional parameters for the stacked DRAMCache (including refresh, data bus, request queues, scheduling algorithms, command signaling, and clock frequencies) using the DRAMCtrl [22] model. We have also faithfully modeled a Fill Queue [14] for fill requests to insert data into the cache on the return path from main memory. We also model MSHRs and write buffers with their associated latencies to realize the precise working of caches. Further, our baseline setup with a naive DRAMCache is equivalent to the on-chip shared caches in [47] without the NoC-related improvements, which are orthogonal to the ideas presented in this work.
Workloads: Applications with high and medium memory intensity from the SPEC CPU2006 suite [23] were chosen to form a multi-programmed workload. We use the Rodinia benchmark suite [13] to represent GPU applications that offload kernel computation to GPU CUs. These Rodinia applications are modified to elide the memcpy API calls so as to run with unified virtual and physical address spaces. Combinations of quad-core multi-programmed workloads were coupled with a full Rodinia benchmark to create a representative mix of applications that would run in an IHS system. These composite workloads embody different levels of total memory intensity produced by the CPU and GPU cores. Our simulation of the simultaneous activity on both CPU and GPU cores using a co-running multi-programmed workload along with Rodinia-nocopy benchmarks allows us to demonstrate the interleaving of memory traffic at the stacked DRAM and off-chip DRAM. We also measure the footprint of these workloads in terms of the number of unique 128B cache blocks accessed at the DRAMCache. The memory footprints range from 70MB to 650MB for quad-core CPU workloads and from 5.5MB to 135MB for the GPU application. The smaller footprints obligate us to use a smaller stacked DRAMCache capacity of 64MB for all workloads and configurations to make pertinent observations.
Simulation Methodology: We fast-forward the initialization phase of each workload up until just before the launch of the first kernel of the GPU program. We ensure that each core is fast-forwarded by at least 2 billion instructions and in total each quad-core workload runs 20 billion instructions on average in the fast-forward phase. This is accomplished by adding a pause phase to the Rodinia benchmarks for the duration until the initialization quota of the SPEC programs is complete. We then warm the cache until the fastest core completes 250 million instructions. During the warm-up phase the GPU program is not executed. Timing simulations were then run for at least 250 million instructions for each CPU core, resulting in a total of more than 1 billion instructions across all the CPU cores. As is the norm, when a core finishes its quota of 250 million instructions, it continues to execute until all the cores have completed.
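The footprint measurement mentioned under Workloads (unique 128B blocks accessed at the DRAMCache) amounts to counting distinct block numbers in an address trace. A minimal sketch, assuming a hypothetical trace given as an iterable of byte addresses:

```python
def footprint_bytes(addresses, block_bytes=128):
    """Footprint as (number of unique blocks touched) x (block size)."""
    unique_blocks = {addr // block_bytes for addr in addresses}
    return len(unique_blocks) * block_bytes


# Made-up trace: addresses 0, 4, and 127 share block 0; 128 is block 1; 300 is block 2.
example = footprint_bytes([0, 4, 127, 128, 300])
```

Here `example` evaluates to 384 bytes, i.e., three unique 128B blocks.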
The Rodinia application uses an extra fifth CPU core and offloads to the integrated GPU. These applications are modified to execute in a conditioned loop such that there is no corruption of data structures in the program. The conditioned loop is run infinitely and represents the Region of Interest (ROI) of the Rodinia benchmark. This region includes sections of CPU activity and GPU offloads in the execution, as is typical of an HSA program, which will exhibit an interleaving of offloaded regions and serial CPU sections. However, only the performance statistics for the first execution of the conditioned loop in the program are considered. This is done as the first loop represents the true run of the GPU kernel. The number of conditioned loops executed depends on the length of the simulation and the IPC achieved by the GPU. Also, subsequent loops can achieve better hit rates in the cache due to the data fetched by the earlier loops, and thus would not reflect the true performance achieved. In cases where the ROI is longer than the complete run of the CPU workloads, the statistics for the last completed kernel are used.
Coherence and Memory Consistency: The DRAMCache evaluated here is a memory-side cache [1, 18, 44] . These caches are outside the coherence domain. These caches do not require one to add additional states to the coherence protocol and do not need to be snooped separately. They are logically just in front of the memory and serve to reduce the average latency of memory accesses and increase the memory's effective bandwidth.
The consistency model followed for each type of core (CPU cores and GPU cores) is different and is dictated by the corresponding processor architecture, i.e., strong memory consistency for the CPU and weak/release consistency for the GPU. Cache line evictions and explicit fence operations flush data from the L2 SRAM cache to memory, and this data could be cached in the DRAMCache. Subsequent requests to memory for that cache line first look up the DRAMCache (or bypass in the case of ByE) before sending the request to the off-chip DRAM. Further discussion about cache coherence and memory consistency models for IHS architecture is beyond the scope of this work.
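Performance Metric: We report the performance of each processor using the intuitive metric of harmonic mean of IPC for the CPU cores and GPU CUs, which is defined as

$$\text{H-MEAN}_{CPU} = \frac{N_{CPU}}{\sum_{i=1}^{N_{CPU}} \frac{1}{IPC^{CPU}_i}}, \qquad \text{H-MEAN}_{GPU} = \frac{N_{GPU}}{\sum_{i=1}^{N_{GPU}} \frac{1}{IPC^{GPU}_i}},$$

where $IPC^{CPU}_i$ and $IPC^{GPU}_i$ are the instructions per cycle achieved by the ith CPU core and the ith GPU CU, respectively, and $N_{CPU}$ and $N_{GPU}$ are the number of CPU cores and GPU CUs. In this work, we report CPU and GPU H-MEAN (and the improvement in them) independently. This is done to identify the impact of CPU and GPU applications on each other, as well as to understand the impact of the proposed modifications on each of the application types. We report the combined system performance using the combined H-MEAN of all CPU and GPU cores, as in

$$\text{H-MEAN}_{IHS} = \frac{N_{CPU} + N_{GPU}}{\sum_{i=1}^{N_{CPU}} \frac{1}{IPC^{CPU}_i} + \sum_{i=1}^{N_{GPU}} \frac{1}{IPC^{GPU}_i}}.$$

Lastly, we also report combined system performance using the weighted speedup metric [43], which is defined as

$$\text{WS} = \sum_{i=1}^{N_{CPU}} \frac{IPC^{CPU_{IHS}}_i}{IPC^{CPU_{SP}}_i} + \sum_{i=1}^{N_{GPU}} \frac{IPC^{GPU_{IHS}}_i}{IPC^{GPU_{SP}}_i},$$

where $IPC^{CPU_{IHS}}_i$ and $IPC^{GPU_{IHS}}_i$ denote the IPC achieved by the ith CPU core and the ith GPU CU when running in an IHS setup, respectively, whereas $IPC^{CPU_{SP}}_i$ and $IPC^{GPU_{SP}}_i$ denote the same when the ith CPU core and the ith GPU CU are running standalone, respectively.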
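For readers who want to reproduce these metrics from per-core IPC values, the following sketch computes the harmonic mean of IPC and the weighted speedup using their standard definitions. The function names and the IPC values are made up for illustration and are not results from this work.

```python
def harmonic_mean(ipcs):
    """Harmonic mean of per-core IPCs: N / sum(1 / IPC_i)."""
    ipcs = list(ipcs)
    return len(ipcs) / sum(1.0 / x for x in ipcs)


def weighted_speedup(shared_ipcs, standalone_ipcs):
    """Sum over cores of (IPC when co-running) / (IPC when running standalone)."""
    return sum(s / a for s, a in zip(shared_ipcs, standalone_ipcs))


# Illustrative (made-up) IPC values for two CPU cores and two GPU CUs.
cpu_ipcs = [1.0, 0.5]
gpu_ipcs = [2.0, 2.0]

h_cpu = harmonic_mean(cpu_ipcs)
h_gpu = harmonic_mean(gpu_ipcs)
h_combined = harmonic_mean(cpu_ipcs + gpu_ipcs)  # all CPU cores and GPU CUs together
ws = weighted_speedup(cpu_ipcs + gpu_ipcs, [2.0, 1.0, 4.0, 4.0])
```

Because the harmonic mean is dominated by the slowest cores, a configuration that starves even one core scores poorly, which is why it is a natural metric for studying interference.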
RESULTS
In this section, we evaluate the results of the proposed HAShCache mechanism and the optimizations discussed in Section 3. Figure 8(a) and (b) present the performance, in terms of harmonic mean of IPC, for CPU and GPU, respectively, normalized to the baseline without a DRAMCache. We report the performance improvement due to the introduction of a DRAMCache (naive), and the different HAShCache mechanisms (PrIS, ByE, and Chaining, and some combinations of them) in this section.
PrIS DRAMCache Scheduling
Prioritizing CPU requests with our PrIS scheduler at the DRAMCache controller leads to considerable performance benefits for the CPU. With PrIS, the access latency of the DRAMCache for CPU requests reduces by an average of 15.3% and up to 48.9% over a naive DRAMCache (presentation of detailed data is omitted due to space constraints). Hence, PrIS is able to improve the performance of the CPU by an average of 35% over a naive DRAMCache. However, on the flip side, giving aggressive priority to all CPU requests reduces the performance of the GPU by 10% despite the GPU being able to tolerate larger memory access latencies. For some of the benchmarks like Qg3, Qg10, Qg11, and Qg12, the high priority given to CPU requests by PrIS impacts the GPU, causing the GPU performance to reduce marginally below the baseline IHS architecture without a DRAMCache. Note, however, that in these workloads the introduction of a DRAMCache (naive) itself improves performance only marginally. Our mechanisms (ByE and Chaining) further aim to reduce this performance drop for the GPU by ensuring that (a) there are fewer CPU requests in the DRAMCache queues and (b) GPU requests, despite being deprioritized at the DRAMCache, have a better hit rate in the DRAMCache and avoid accessing the off-chip DRAM.
ByE for Temporal Bypass
ByE attempts to achieve improved performance by directing some requests to be served from the off-chip DRAM, thus achieving improved resource utilization and bandwidth balance in the process. ByE alone achieves 12% improvement in CPU performance and a 3% improvement in GPU performance.
The CPU performance improvements are primarily due to bypassed requests facing reduced queuing delays at DRAM. Figure 9 shows the percent reduction in total memory access latency for CPU read requests achieved by ByE over an already aggressive naive DRAMCache which employs a hit/miss predictor for CPU requests (primary y-axis). The total memory access latency for CPU read requests reduces by an average of 28%. The already high hit rates for the GPU in the DRAMCache, coupled with the no-fill policy for bypassed CPU requests, ensure fewer GPU requests at the off-chip DRAM, which leads to less congestion. Figure 9 also shows the percentage of incoming read requests bypassed by ByE, on the secondary y-axis. ByE is able to bypass on average about 37% of incoming read requests. On average, 23% of read requests are to dirty lines in the cache, which cannot be bypassed. The remaining 40% are false positives in our Bloom filter implementation, which could have bypassed the DRAMCache. We discuss a sensitivity study of the Bloom filter later in this section. Further, the reduced set contention and smaller number of CPU requests in DRAMCache queues reduce congestion for the GPU, which in turn leads to small performance benefits for the GPU as well, improving the harmonic mean of the GPU IPC by 3% over a naive DRAMCache.
Combining PrIS with ByE allows the non-bypassed CPU requests at the DRAMCache to be served with a higher priority, hence reducing the queuing delays. ByE+PrIS performs better than just PrIS by 10% for CPU and 7% for GPU. Overall, ByE+PrIS achieves 48.5% improvement in CPU performance over a naive DRAMCache while degrading GPU performance by just 3%.
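As a quick consistency check on the CPU figures above, relative improvements compose multiplicatively: the 35% gain of PrIS over a naive DRAMCache and the further 10% gain of ByE+PrIS over PrIS combine as

$$(1 + 0.35) \times (1 + 0.10) = 1.485,$$

which corresponds to the reported 48.5% CPU improvement of ByE+PrIS over a naive DRAMCache.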
The somewhat high false-positive ratio in our bypass mechanism is due to the large number of dirty blocks in the DRAMCache and the relatively small size of the Bloom filter. We also experimented with three hash functions while also optimally increasing the array capacity to 312KB (20% larger) to reduce aliasing and increase the efficacy of bypass. However, the CPU performance improves by only 2.1% while the GPU performance remains largely unaffected.
Chaining for Spatial Occupancy Control
As discussed in Section 3.2.3, the Chaining mechanism improves hit rates for the GPU while guaranteeing some occupancy for the CPU in the DRAMCache. We empirically determine a suitable low occupancy threshold of the CPU (l_cpu) to be 20% for all our workloads. Chaining alone performs no better than a naive cache as the queuing latencies overwhelm any improvements in hit rates. However, when Chaining is coupled with PrIS, the increased hit rates reduce the performance drop caused by PrIS for the GPU from 10% to 5.2% (i.e., 4.8% performance improvement over PrIS). For the CPU, guaranteed occupancy in the DRAMCache and the secondary effect of lower congestion at off-chip DRAM allow CPU requests to be serviced with lower delays. This improves the performance of the CPU by 7% over only PrIS.
Overall, Chaining+PrIS improves CPU performance by 44.7% while degrading GPU performance by merely 7% over a naive DRAMCache.
System Performance with HAShCache
We now holistically examine the performance improvements due to the introduction of our HAShCache mechanisms, in CPU and GPU cores together. From Figure 8 we observe that adding a naive DRAMCache can achieve an average of 42% and 24% improvement in CPU and GPU cores, respectively, whereas HAShCache achieves significant speedups of (205%, 17.5%) and (211%, 20.4%) for (CPU, GPU) using the heterogeneity-aware mechanisms of ByE and Chaining, respectively. This comes within 16% and 13% of the ideal no-interference performance for the GPU (final gray bar in Figure 8(b)) and within 81% and 76% of the ideal no-interference performance for the CPU (final gray bar in Figure 8(a)) for each of the schemes. Further, for memory-intensive combinations of CPU and GPU workloads like Qg7 and Qg8, which see significant degradation in the performance of both processors due to interference, adding a DRAMCache can improve performance up to 430% for CPU and 48% for GPU over a baseline system with no stacked DRAMCache. In comparison, the naive DRAMCache implementation only brings 55% and 56% improvement in the CPU and GPU performance, respectively. 11 Figure 10(a) plots the performance improvement as a harmonic mean of IPCs of the IHS (combined CPU cores and GPU CUs), normalized to our baseline IHS architecture without a DRAMCache. Overall, with simple heterogeneity-aware management of the stacked DRAMCache, IHS systems can achieve on average 200% (up to 400%) performance improvement while a naive DRAMCache is able to achieve just 41.8% improvement.
As a comprehensive system metric, Figure 10(b) plots the weighted speedup normalized to the baseline of an IHS without a DRAMCache. Adding a DRAMCache to IHS processors naively achieves an improvement of merely 3%. In fact, for some workloads like Qg3 and Qg11, adding a DRAMCache without careful consideration can lead to a negative performance impact. However, our heterogeneity-aware mechanisms are able to achieve an average improvement of 16% and 15% in performance over an IHS architecture without a DRAMCache. This improvement corresponds to a 12.9% and 11.5% improvement for each of the schemes over a carefully designed but heterogeneity-unaware DRAMCache for the IHS processors.
We also observe that in the case of a naive DRAMCache, on average 58.22% and 23.93% of the peak DRAMCache and peak off-chip DRAM bus bandwidth are utilized, respectively. With PrIS+ByE, the off-chip DRAM bus utilization improves by 23.8% with negligible reduction (less than 1%) of DRAMCache bus utilization. Further, with PrIS+Chaining, the DRAMCache bus utilization improves by 35.12% while off-chip DRAM bus utilization reduces by 2.64%.
Comparison with Related Work
The objectives of ByE and PrIS are similar to the Mostly Clean DRAMCache (MCC) [42] and Staged Memory Scheduling (SMS) [9], respectively. Hence, we compare our work with MCC (adapted for IHS architecture) and SMS in Figure 11, which presents the performance of (a) CPU and (b) GPU normalized to an IHS with no DRAMCache. Overall, ByE performs 7.8% better than MCC in terms of CPU IPC. This is because when a page (row) is marked as write-through in MCC, all requests to that page are bypassed oblivious of the source of the request. The large number of GPU requests to the cache lines in the write-through (clean) pages quickly exhausts the available MSHR entries, which leads to the DRAMCache being blocked for subsequent requests. For workloads Qg9, Qg10, and Qg12 the GPU workloads are relatively less intensive and MCC tends to perform slightly better than ByE. The GPU performance is comparable for both approaches.
Fig. 11. Comparison of HAShCache against MCC [42] and SMS [9]. (a) CPU and (b) GPU.
The striped bars in Figure 11 show the performance of PrIS and SMS for (a) CPU and (b) GPU. For the SMS implementation, we proportionally scale down the total hardware requirements to the same size as that of PrIS (128 entries). We observe that PrIS performs on par with SMS for almost all CPU workloads. For workloads like Qg2 and Qg12, SMS performs marginally better than PrIS as it is able to better manage CPU inter-application interference using Shortest-Job-First (SJF) scheduling. However, for the GPU, SMS performs 4% better than PrIS due to its batching algorithm that is able to admit row buffer hit requests into the queues, which leads to an improved row buffer hit rate (9% better). Our combined mechanism of PrIS+ByE proposed in this work performs better than each of the prior compared schemes for both CPU and GPU.
Sensitivity Study
5.6.1 Larger Capacity DRAMCache. We conduct a sensitivity study for HAShCache mechanisms with a larger 128MB stacked DRAMCache. Figure 12 presents the performance of a naive DRAMCache and our mechanisms ByE+PrIS and Chaining+PrIS for the (a) CPU and (b) GPU in terms of H-MEAN of IPC normalized to an IHS with no DRAMCache. We observe that the larger naive DRAMCache is able to achieve 54.5% and 29.8% improvement for CPU and GPU processors, respectively. HAShCache mechanisms continue to provide an average improvement of 46% and 39% for the CPU above a heterogeneity-unaware DRAMCache, and 226% and 215% over the baseline IHS. 12
Off-chip DRAM with Same Latency as DRAMCache.
Although we expect the smaller sizes of stacked DRAMs to have lower latency than off-chip DRAMs, commercial stacked DRAMs like those in Intel's Knights Landing [7] have access latencies similar to off-chip DRAMs. Hence, we scale up the off-chip DRAM to a DDR3-2133-like device with latencies (13.09ns-13.09ns-13.09ns-33ns) for t_CL-t_RCD-t_RP-t_RAS. As before, Figure 13 presents the performance normalized to an IHS with no DRAMCache. For PrIS+ByE, CPU performance improves by 48% while GPU performance reduces by merely 3%. For PrIS+Chaining, the CPU performs 47% better than a naive DRAMCache while GPU performance decreases by 4.4%.
Higher Bandwidth DRAMCache.
We carry out experiments for HAShCache with a stacked DRAMCache of 80GB/s. We achieve this by doubling the burst length of the stacked DRAMCache devices. Again, Figure 14 presents the performance of a naive DRAMCache and HAShCache for (a) CPU and (b) GPU. We observe that our mechanisms scale with increased DRAMCache bandwidth: ByE+PrIS performs 51% better and Chaining+PrIS performs 47.3% better than a naive DRAMCache for the CPU. GPU performance also improves by 1.9% and 2.4% over a naive DRAMCache. Notably, for GPU benchmarks like streamcluster and gaussian, Chaining+PrIS performs better than ByE+PrIS.
RELATED WORK
State-of-the-art stacked DRAM device designs have primarily focused on improving the performance of multi-core architectures. In [16, 41], the stacked DRAMs are organized as part of memory in view of the large capacities provided by these devices. The designs propose hardware management schemes for swapping hot pages into and out of the stacked DRAM devices. On the other hand, several designs propose to use the stacked DRAM as a transparent hardware-managed cache. Sectored caches such as [26, 28] allocate large blocks and avoid wastage of bandwidth by intelligently fetching useful blocks only. In Section 3.1, we have broadly alluded to some of the works on DRAMCache designs and organizations [21, 25, 34, 39] that organize the DRAMCache at smaller system-sized blocks (64B or 128B). These designs intelligently manage metadata and tag lookup serialization by using approximate hardware structures. HAShCache extends and adapts the simplicity and effectiveness of the Alloy Cache organization for IHS architecture after carefully considering the implications of each design decision on performance. While these works have focused on improving multi-core CPU performance, our work shows that there exists a large potential for performance improvements by using a stacked DRAMCache in an IHS architecture.
Orthogonal to these efforts, there have also been proposals such as [37] to expose the stacked DRAM to the applications and/or system software by providing special allocation calls or using intelligent page management algorithms in system software to place hot data in the high bandwidth memory. Besides these designs incurring the obvious overheads of software modification to improve performance, they also require a good understanding of program behavior and IHS architecture knowledge.
Recently, researchers have pointed out that, because on-chip and off-chip DRAM devices have comparable access latencies, the available bandwidth of both should be utilized to extract the best performance and improve resource utilization [15, 18, 20, 30, 42]. HAShCache, in contrast, uses the inherent disparity in the request rates of the CPU and GPU cores, and its implication for the performance of each, to balance bandwidth in a heterogeneity-aware manner.
Complementary to our work, there have been efforts to improve the performance of IHS systems by throttling the GPU cores using intelligent warp scheduling [29] and by avoiding congestion in the NoC [32]. These techniques are orthogonal to ours, which explores managing the interference in the memory subsystem of IHS architectures. To manage shared on-chip SRAM caches, Lee and Kim [31] propose heterogeneity-aware schemes that are built on top of the UCP and RRIP schemes for managing shared resources. While our chaining mechanism is in line with this in ensuring a minimum occupancy for CPU requests, it goes further by introducing pseudo-associativity and improving hit rates (specifically for GPU requests). Mekkat et al. [35] further propose shared SRAM cache management for IHS workloads that uses runtime metrics, such as the cache sensitivity of each workload, to allocate cache capacity. Despite their larger capacities, DRAMCaches have higher latencies and hence will not be able to adapt quickly to the SRAM occupancy management schemes proposed in these works.
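As a rough illustration of what pseudo-associativity means for a direct-mapped organization, the following Python sketch probes one alternate set when the primary set is occupied. It is a generic, textbook-style scheme and not the chaining mechanism of HAShCache; the class name `PseudoAssociativeCache`, the alternate-index function (flipping the top index bit), and the eviction choice are all assumptions made purely for illustration.

```python
class PseudoAssociativeCache:
    """Direct-mapped cache that probes one alternate set on a primary miss.

    Hypothetical illustration only: the alternate-index function and the
    eviction policy below are assumptions, not taken from the paper.
    """

    def __init__(self, num_sets):
        # num_sets must be a power of two >= 2 so the top index bit can be flipped.
        if num_sets < 2 or num_sets & (num_sets - 1) != 0:
            raise ValueError("num_sets must be a power of two >= 2")
        self.num_sets = num_sets
        self.lines = [None] * num_sets  # each slot holds one block address or None

    def _primary(self, block_addr):
        return block_addr % self.num_sets

    def _alternate(self, block_addr):
        # Flip the most significant index bit to obtain the second candidate set.
        return self._primary(block_addr) ^ (self.num_sets >> 1)

    def lookup(self, block_addr):
        """Return the number of probes needed on a hit (1 or 2), or 0 on a miss."""
        if self.lines[self._primary(block_addr)] == block_addr:
            return 1
        if self.lines[self._alternate(block_addr)] == block_addr:
            return 2
        return 0

    def insert(self, block_addr):
        """Place a block, preferring a free primary set, then a free alternate set."""
        if self.lookup(block_addr):
            return  # already cached, nothing to do
        primary = self._primary(block_addr)
        alternate = self._alternate(block_addr)
        if self.lines[primary] is None:
            self.lines[primary] = block_addr
        elif self.lines[alternate] is None:
            self.lines[alternate] = block_addr
        else:
            # Both candidate sets are occupied: evict the primary occupant.
            self.lines[primary] = block_addr


# Blocks 0 and 4 conflict in a 4-set direct-mapped cache (both map to set 0).
cache = PseudoAssociativeCache(4)
cache.insert(0)
cache.insert(4)
print(cache.lookup(0), cache.lookup(4))
```

In a plain direct-mapped cache the second insertion would have evicted the first block; here both remain resident, at the cost of a second probe for the block placed in its alternate set. This trade of a longer worst-case hit time for a higher hit rate is the general property the term refers to.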
Zhan et al. [47] propose improving the performance of IHS architectures by replacing on-chip SRAM caches with slightly larger STT-RAM caches that are non-volatile but have asymmetric read/write energy and latencies. They focus on NoC-related optimizations through NoC reordering/batching schemes and differential CPU/GPU and read/write prioritization. These NoC optimizations are orthogonal to, and can supplement, the ideas proposed in this work. The performance improvement due to the introduction of STT-RAM in IHS architectures is equivalent to that observed with our naive DRAMCache.
The authors in [27] propose a QoS-aware memory scheduler to prevent the GPU from missing a frame rendering deadline. However, in our IHS architecture the GPU is used to accelerate general-purpose code, and hence PrIS does not consider such deadlines at the memory controller.
In [48], the authors propose to use the GPU-side stacked DRAM as another level of cache in the GPU memory hierarchy before going to the host memory. They use additional structures to maintain memory coherence information on both the CPU side and the discrete GPU side. Their approach is complementary to ours: this stacked DRAMCache on the discrete GPU is primarily used for caching data from the GPU cores, whereas HAShCache is shared by both the CPU and the integrated GPU throughout the execution. Several of our design decisions involve heterogeneity-aware tradeoffs that provide improved access latency for CPU cores as well as high bandwidth for GPU cores.
Workload suites such as Chai [19] and Hetero-Mark [45] were designed to engage both CPU and GPU cores collaboratively and take full advantage of the IHS architecture. The Chai benchmark suite was developed independently of, and concurrently with, this work.
FUTURE WORK AND CONCLUSION
A detailed study of the energy efficiency of IHS systems with a stacked DRAM as cache is deferred to future work, as it requires additional enhancements to the simulator owing to the heterogeneous nature of the cores. Intuitively, we expect the static energy dissipation of the system to be lower as a result of the significant speedups obtained using HAShCache. The design and optimization of the NoC and interconnect topology for IHS architectures is outside the purview of this work and is also deferred to future work. We would also like to explore the design of the virtual memory subsystem shared by CPU and GPU cores in an IHS architecture.
In this work, we presented a case for improving the performance of an IHS processor by adding a stacked DRAMCache. We quantify the effects of interference due to co-running on each processor and show that the heterogeneity affects CPU performance more adversely than GPU performance. We carefully design an effective DRAMCache organization for IHS processors and improve IHS performance using three simple heterogeneity-aware techniques: a DRAMCache scheduler, a spatial bypass, and a temporal occupancy control. On average, HAShCache achieves a significant improvement of 200% in overall system performance over a baseline system with no DRAMCache and 41% over a heterogeneity-unaware DRAMCache. This work thus shows that there are significant benefits to using a stacked DRAMCache for IHS processors, far exceeding the usefulness of such devices in homogeneous GPU and multi-core CPU systems.
