Abstract-Modern heterogeneous multiprocessors integrate a CPU and a GPU to boost computational performance. Data sharing and communication between CPU and GPU have been critical to the final speedup. Tighter integration of CPU and GPU brings the opportunity to share and move data more efficiently, in order to leverage the computational power that a GPU can provide. Initially, DMA or PCIe devices were used to transfer data between CPU and GPU, with low efficiency and little flexibility. Recently, a single address space and coherent cache hierarchies have been adopted in heterogeneous architectures to share data more efficiently. This poses new challenges: to understand the communication overheads in this new context and to improve communication efficiency for these architectures.
I. INTRODUCTION
The trend in hardware is toward incorporating domain-specific accelerators into chip multiprocessors to improve performance and power efficiency. One such heterogeneous architecture integrates CPU and GPU into a single chip; examples are AMD's Fusion architecture (now known as the Heterogeneous System Architecture, or HSA), including the APU series such as Llano and Trinity. Even though a GPU can provide an enormous speedup for data-parallel applications, the communication cost between CPU and GPU can cause significant performance degradation. To effectively leverage the performance of a tightly integrated CPU and GPU, it is critically important that data is shared and moved efficiently, so as to achieve good performance at minimal power.
The process of communication involves two major steps in hardware: first, data is evicted from the owner's private caches and written back to the next-level memory (extra memory copies may occur here when data is copied from the owner's memory space to the requestor's); second, the requestor fetches the data from that memory into its own cache. An example is the transfer of data from CPU to GPU, where the GPU has to wait until data is transferred all the way from the CPU's cache to memory, then from memory to the GPU's cache. The communication latency is part of the execution critical path and can significantly degrade performance, especially when communication happens frequently during execution. In the past few years, communication-induced latency and memory traffic have proven to be important factors affecting performance [1].
In earlier architectures, DMA was used to copy data from one address space to another; this mechanism usually involved interrupting the CPU to complete the task. PCIe TPH (TLP Processing Hints) allows PCIe devices to communicate cache injection hints to the host CPU, but these hints are specific to PCIe communication and do not necessarily generalize to shared-memory-based heterogeneous architectures. Some heterogeneous designs adopt a single address space, enabling the GPU to access the same virtual address space as the CPU (such as AMD Trinity) and thus avoiding data copies during communication.
Furthermore, heterogeneous cache coherence protocols have been designed for the CPU and GPU to share data through a shared last-level cache, as in AMD Kaveri and Intel Ivy Bridge. The significant bottleneck in applying traditional CPU coherence protocols is the vast demand for coherence traffic and hardware resources, due to the GPU's high memory bandwidth requirements and unique memory access patterns. Several previous works [2], [3], [4] proposed managing coherence at a much coarser region level in both snooping-based and directory-based systems. For example, Heterogeneous System Coherence (HSC) [4] implements directory coherence at region granularity between an integrated CPU and GPU. Given the high spatial locality of GPU data, obtaining coherence permissions at a coarse granularity (compared to the traditional cacheline level) removes the above bottleneck and makes coherence across the full memory system practical. However, coherence latency and traffic can still degrade performance due to the complexity of coherence itself.
This paper proposes the iCHAT technique, which stands for inter-Cache Hardware-Assistant data Transfer, to reduce communication latency and related cache traffic. iCHAT can detect and learn when communication happens and store information about which data blocks have been transferred. Based on this learned knowledge, the hardware can predict when the communication will happen again and start the data transfer before it is requested. We implement the eager eviction technique of iCHAT in a simulator based on AMD's APU architecture. Experimental results show that the communication-related eviction traffic is reduced by an average of 40% and the total directory traffic by 8% on average.
The main contributions of this paper are: (1) This paper characterizes the inter CPU-GPU communication patterns of GPGPU applications and categorizes communication data into two classes. (2) Based on these characteristics, this paper proposes the iCHAT technique, in the context of an APU, to reduce the data transfer latency and total traffic for heterogeneous chip multiprocessors. (3) We design a bounding experiment that provides a quantitative evaluation of all the on-chip communication data and the potential performance boost. In the following sections, we first introduce the GPGPU data transfer characteristics. In Section III, we describe the proposed technique and discuss some design alternatives and future improvements. In Section IV, we first discuss selected preliminary results of iCHAT, then present the quantitative evaluation of communication data and the bound on the potential speedup that iCHAT can achieve. Finally, we discuss related work and conclude.
II. BACKGROUND AND MOTIVATION

A. GPGPU Data Transfer Pattern
We target GPGPU applications in this paper and choose the Rodinia benchmark suite [5] as the workloads in the evaluation. A typical GPGPU computing pattern is shown in Figure 1; it usually includes the following three phases.
• Phase 1: CPU prepares/initializes the data for GPU.
• Phase 2: GPU performs a set of computations within a loop with the CPU checking the results at the end of each iteration.
• Phase 3: CPU reads and post-processes the data generated by the GPU.

Data transfer occurs when switching between CPU and GPU computations. In Phase 1, the CPU usually prepares the initialization data for the GPU.

    1 CPU initializes the data
    2 do {
    3   GPU kernel executes and computes the data
    4   CPU re-processes data
    5 } while (condition)
    6 CPU post-processes data

Fig. 1. A typical GPGPU computing pattern.

The total amount of data transferred between CPU and GPU can be very large. Table I shows the amount of transferred data, in number of pages, between CPU and GPU for each benchmark. The same hot page transferred n times is counted as n transferred pages in the table. The amounts of initialization data and hot data vary across applications. For example, dynproc, cell, hotspot and nw have more initialization data, while streamcluster, kmeans and particle have more hot data. Even for Rodinia benchmarks with a medium problem size, the total transferred data can be up to thousands of pages. For GPGPU applications with large problem sizes, or for normal graphics applications, the data transfer will place an even greater burden on the cache controllers, directory and memory system. Even though the GPU can achieve significant speedups for the kernel computation itself, the communication overheads hold the final performance back from the promised potential.

B. Baseline System

Figure 2 shows a high-level description of the baseline heterogeneous system with region-based hardware coherence protocols. CPU and GPU are integrated on the same chip. Both CPU and GPU have a private L2 cache, while the L3 cache is shared. The L3 cache acts as an on-chip memory-side write buffer that only holds write-back data from the L2 caches. Previous work [6] proposed Heterogeneous System Coherence (HSC) as a region-level coherence design for heterogeneous processors. HSC extends region coherence to a directory-based coherence protocol. HSC adds a Region Buffer to both the CPU L2 and GPU L2 caches to track access permissions at region granularity. All L2 misses first query the Region Buffer. If a valid permission for that region (Shared for reads, Private for writes) is found in the Region Buffer, data requests are sent directly to memory. These direct memory accesses have much higher bandwidth, especially for the GPU. If the permission is not found, requests are forwarded to the Region Directory to acquire permission for the region. The Region Directory connects the Region Buffers, and tracks and arbitrates the permissions of all on-chip regions. By obtaining permission at region granularity, HSC significantly reduces the bandwidth to the directory compared to traditional directory-based coherence protocols. The region grain size used in this paper is set to the page size.
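To make the Region Buffer lookup path concrete, the following C++ sketch models it under our assumptions (4 KB regions, a map standing in for a small SRAM structure); the names are ours for illustration, not AMD's implementation.

    #include <cstdint>
    #include <unordered_map>

    enum class RegionPerm { None, Shared, Private };

    // Region Buffer: per-L2 structure tracking permissions at region
    // (here: page) granularity. A real design would be a small SRAM;
    // a hash map keeps the sketch simple.
    struct RegionBuffer {
        std::unordered_map<uint64_t, RegionPerm> perms;  // region ID -> permission

        RegionPerm lookup(uint64_t addr) const {
            auto it = perms.find(addr >> 12);            // 4 KB regions assumed
            return it == perms.end() ? RegionPerm::None : it->second;
        }
    };

    // On an L2 miss: with valid region permission (Shared for reads,
    // Private for writes), the request goes straight to memory over the
    // high-bandwidth path; otherwise it is forwarded to the Region Directory.
    bool canGoDirectToMemory(const RegionBuffer& rb, uint64_t addr, bool isWrite) {
        RegionPerm p = rb.lookup(addr);
        return isWrite ? (p == RegionPerm::Private)
                       : (p == RegionPerm::Shared || p == RegionPerm::Private);
    }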
Data transfers between CPU and GPU normally happen in a passive way, meaning the data is only transferred on the requestor's demand. In our baseline system, when the requestor (either CPU or GPU) needs some data from the owner (GPU or CPU), it has to go through a four-hop process: first, the requestor sends a data request to the Region Directory; second, the directory sends an invalidation request to the owner's Region Buffer; third, the owner evicts the region from its Region Buffer and writes the data from its private cache back to the L3 cache; and fourth, the requestor gains the region permission and fetches the data from the L3 cache into its own cache. This four-hop data transfer results in high latency and significant cache and directory traffic. When the transferred data block is large, the directory becomes the bottleneck. These are the communication overheads in the new context of coherent heterogeneous processors, and they lead to performance degradation.
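As a rough illustration of how the hops add up on the critical path, the following sketch uses hypothetical per-hop cycle counts (these numbers are ours, not taken from our simulator) and shows how eager eviction (Section III-B) removes hops 2 and 3.

    // Hypothetical per-hop latencies in cycles; values are illustrative only.
    struct HopLatencies {
        int request    = 20;  // hop 1: requestor -> Region Directory
        int invalidate = 20;  // hop 2: directory -> owner's Region Buffer
        int writeback  = 40;  // hop 3: owner's L2 -> shared L3
        int fetch      = 40;  // hop 4: L3 -> requestor's cache
    };

    // Baseline demand transfer: all four hops are on the critical path.
    int fourHopLatency(const HopLatencies& h) {
        return h.request + h.invalidate + h.writeback + h.fetch;
    }

    // After eager eviction (Section III-B) hops 2 and 3 happen ahead of
    // time, so only the request and the L3 fetch remain on demand.
    int twoHopLatency(const HopLatencies& h) {
        return h.request + h.fetch;
    }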
III. INTER-CACHE HARDWARE COMMUNICATOR
Based on the GPGPU computing pattern and communication characteristics, we propose the inter-cache hardware communicator iCHAT, which moves the communication data between CPU and GPU in parallel with computation to hide the communication latency. The communicator detects when the GPU requests data blocks that are owned by the CPU, or when the CPU asks for the GPU's data. It therefore needs to be integrated with the centralized memory hierarchy, where it can watch both CPU and GPU data accesses and identify the owner of a given address. We thus connect the communicator to the directory. Figure 3 shows the integration of the iCHAT communicator with the baseline system. It sits between the L2 caches and the shared on-chip L3 cache, connecting to the directory. Thus the communicator can interact with the cache hierarchies through the directory and watch the data transfers between CPU and GPU.

Fig. 2. The baseline heterogeneous system with HSC coherence protocols. The darker line indicates high-bandwidth direct memory access when region permission is granted.

The iCHAT communicator includes three components: the communication detector detects communication data and predicts when communication happens; the last evicted page buffer records the latest evicted page during transfers of the initialization data; and the hot block table stores the addresses of the data blocks that are frequently transferred. The communicator transfers the communication data ahead of time through the following two steps. In the following sections, we describe the mechanism of each step in detail.
• Step 1: Capture the communication data and pattern.
• Step 2: Evict communication data from owner's cache to L3 cache.
A. Communication Detection
In order to speed up the data transfer, it is critical for the communicator to be able to detect the communication data and its patterns. The communication detector captures this information by watching the traffic between CPU and GPU. The detection mechanisms for initialization data and hot data are different due to their specific characteristics. The captured information is stored in the last evicted page buffer and the hot block table, respectively.

1) Initialization Data: As discussed in the previous section, GPGPU computing can consume large amounts of data, and normally this data is initialized by the CPU in Phase 1. When switching to Phase 2, as many as hundreds or even thousands of pages of initialization data will then be transferred from CPU to GPU. Experiments show that the initialization data has specific characteristics which can be used for detection. First, initialization data is usually transferred just once during the application's execution. Second, the CPU usually streams through each element of the data structure to initialize it. Thus the initialization data normally consists of a large number of contiguous pages, with all cachelines within each page touched by the CPU. We add a validation vector in the directory for each page, as seen in Figure 3. The vector is 64 bits long, and each bit represents one cache block of the page, indicating whether the block is in the cache. By checking the validation vector, the detector can tell whether all the cachelines in the page have been accessed. When the GPU starts to request the CPU's data, the detector checks whether the data meets the above features. If so, the data is identified as initialization data, and the current page ID is stored in the last evicted page buffer for eager eviction.
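A minimal sketch of this check, assuming one 64-bit validation vector per 4 KB page; the structure and function names below are ours, for illustration only.

    #include <cstdint>

    // One 64-bit validation vector per page: bit i set means cacheline i
    // of the page is currently resident in the owner's cache.
    struct PageEntry {
        uint64_t pageId;
        uint64_t validVector;
    };

    // Initialization-data signature: the CPU has streamed through the
    // whole page, so all 64 cachelines are resident (all bits set).
    bool looksLikeInitData(const PageEntry& e) {
        return e.validVector == ~UINT64_C(0);
    }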
2) Hot Data: The basic idea for speeding up hot data communication is to learn and capture the hot pages that are frequently transferred between CPU and GPU, then predict the communication and proceed with the data transfer before the communication data is requested. With the proposed communicator, the hot data can be moved into the requestor's (CPU or GPU) L2 cache before the requestor asks for it. So once the requestor starts to issue the fetch instruction for the hot data, the data is already in the L2 cache, and there is no need to wait for the data to travel all the way from the owner's side. The major benefit of this technique is that it hides the four-hop data transfer latency between CPU and GPU while reducing related directory and cache traffic.
When CPU and GPU request each other's data but the features of initialization data do not apply, the data blocks are identified as hot data. The hot data blocks can be recorded at fine-grain or coarse-grain granularity; page-level granularity, together with a valid bit for each cacheline, is used in this paper. The addresses of transferred data blocks are stored in the hot block table with a counter that increases each time the block is transferred. By sorting the counters, the hot pages' IDs are identified and kept in the hot block table. Since there are normally few hot pages, we currently set the hot block table to 10 entries, with each counter taking 3 bits.
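The hot block table can be sketched as a small table of page IDs with 3-bit saturating counters. The sizes match the text; the replacement policy shown (replace the coldest entry) is an assumption on our part.

    #include <array>
    #include <cstdint>

    struct HotEntry {
        uint64_t pageId  = 0;
        uint8_t  counter = 0;   // 3-bit saturating counter (max 7)
        bool     valid   = false;
    };

    class HotBlockTable {
        std::array<HotEntry, 10> entries;          // 10 entries as in the text
    public:
        void recordTransfer(uint64_t pageId) {
            for (auto& e : entries)                 // hit: bump the counter
                if (e.valid && e.pageId == pageId) {
                    if (e.counter < 7) ++e.counter; // saturate at 3 bits
                    return;
                }
            HotEntry* victim = &entries[0];         // miss: pick a victim
            for (auto& e : entries) {
                if (!e.valid) { victim = &e; break; }
                if (e.counter < victim->counter) victim = &e;
            }
            *victim = {pageId, 1, true};            // replacement policy is our assumption
        }
    };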
There are a few possible mechanisms to detect and predict the time of communication. One straightforward mechanism is interval-based prediction. As we found in GPGPU computing, each of the three phases shown in Figure 1 tends to have a uniform interval. To calculate the interval, the communication detector can record the time stamp of each data transfer between CPU and GPU; the recorded time can be measured in local network or cache cycles. After the communication has happened a few times, the detector can derive an average interval length for each phase, and a prediction can be made when switching between phases. However, given the interval, when the communicator should start to move the data is the critical design tradeoff: an inaccurate interval-based prediction may move data too early (before the data is ready on the owner's side) or too late (not until the requestor needs it). A more accurate detector should rely on hardware information. One design alternative is a write-activation mechanism, in which the communicator decides that data is ready when the owner modifies it; this ensures that data has been touched before it is moved to the other side. A further last-write technique could be used in case the owner writes the data several times before it is finally ready to be transferred. Another design alternative is requestor-initiated: the communicator only starts to move the data once it sees an access request fall into the hot block table (i.e., the requestor starts to use the communication data). This technique avoids moving the data so early that it pollutes the requestor's cache. We use the requestor-initiated mechanism in the initial implementation and evaluation.
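A sketch of the requestor-initiated trigger, under the assumption that the directory forwards every request address to the communicator; the helper functions are hypothetical hooks, not part of the actual design.

    #include <cstdint>

    bool hotTableContains(uint64_t pageId);       // lookup in the hot block table
    void scheduleEagerEviction(uint64_t pageId);  // hypothetical hook into Sec. III-B

    // Called by the directory for each request it sees. The first demand
    // access that falls into a tracked hot page starts eager eviction, so
    // data is never moved before the requestor actually wants it.
    void onDirectoryRequest(uint64_t addr) {
        uint64_t pageId = addr >> 12;             // 4 KB pages assumed
        if (hotTableContains(pageId))
            scheduleEagerEviction(pageId);
    }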
B. Eager Eviction
In the previous subsection, the communication detector identifies both the initialization and hot data, and stores the information separately in the last evicted page buffer and the hot block table. Eager eviction is a technique to evict the detected data from the owner's cache ahead of time. iCHAT conducts eager eviction by sending an invalidation request (through the directory) to the owner's cache, which evicts the data blocks and writes them back to the L3 cache. Eager eviction requests are sent when the directory is free, to avoid bandwidth conflicts with the processors' demand requests.
The eager eviction for initialization data is triggered when the GPU starts to request the very first pages of the initialization data. Eviction of the neighbouring pages begins when the directory is free. The iCHAT communicator checks the validation vector of the page whose ID follows the last evicted page. If all the cacheline blocks of that page are in the CPU cache, the page is also taken as initialization data and is eagerly evicted. The eager eviction stops when the features of initialization data no longer apply. Eager eviction for hot data is straightforward: when communication is predicted to start, only the data blocks recorded in the hot block table are evicted.
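A sketch of this initialization-data eviction walk, assuming the directory exposes an idle check and the per-page validation vectors described in Section III-A; the helper names are ours.

    #include <cstdint>

    bool directoryIdle();                  // hypothetical: directory has spare bandwidth
    bool allLinesCached(uint64_t pageId);  // validation vector all ones (Sec. III-A)
    void evictPageToL3(uint64_t pageId);   // invalidate through directory, write back to L3

    // Walk forward through consecutive page IDs, eagerly evicting each
    // page that still matches the initialization-data signature; stop at
    // the first page that does not.
    void eagerEvictInitData(uint64_t lastEvictedPage) {
        uint64_t next = lastEvictedPage + 1;
        while (directoryIdle() && allLinesCached(next)) {
            evictPageToL3(next);           // "next" becomes the new last evicted page
            ++next;
        }
    }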
With eager eviction, when the requestor later requests the communication data, it can fetch the data directly from the L3 cache without waiting for the owner to evict it. Thus the eager eviction technique transforms the original four-hop data transfer into a two-hop transfer. As seen in Figure 3, hop 2 (invalidate) and hop 3 (writeback) are hidden; only hop 1 (request) and hop 4 (fetch) remain. The traffic on the directory is reduced as well.
C. Further Optimization: Communication Injection
A further optimization for eagerly transferring the communication data is to inject it from the L3 cache into the requestor's private caches. Since the L2 cache is normally small compared to the L3, communication injection has to be designed to avoid polluting the requestor's cache. GPGPU applications usually have few hot pages (most have just one to three), and these pages are consumed soon after injection, so injecting them does not cause cache pollution by evicting much other useful data. Initialization data blocks, however, are usually very large. To avoid cache pollution, a threshold-based stepped injection mechanism will be used in our future work. Injection can be broken into three steps: first, dynamically determine an injection threshold (in pages) based on the total amount of communication data; then inject that many pages into the requestor's cache; later, when the requestor starts to consume the data, inject another threshold's worth of pages, repeating until all the communication data has been transferred.
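A sketch of one step of this planned stepped injection; the threshold heuristic (a quarter of the L2, capped by the data size) and the injection hook are assumptions for illustration.

    #include <algorithm>
    #include <cstdint>

    void injectPageIntoL2(uint64_t pageId);  // hypothetical L3 -> L2 injection hook

    void steppedInjection(uint64_t firstPage, uint64_t numPages,
                          uint64_t l2CapacityPages) {
        // Illustrative threshold: a fraction of the L2, capped by the data
        // size, so a large initialization region never floods the small L2.
        uint64_t threshold =
            std::min(numPages, std::max<uint64_t>(1, l2CapacityPages / 4));
        for (uint64_t i = 0; i < threshold; ++i)
            injectPageIntoL2(firstPage + i);
        // The remaining pages are injected in later steps, triggered as the
        // requestor starts consuming the already-injected data.
    }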
Communication injection can be applied to iCHAT as a third step to hide the communication latency even more eagerly, up to 100%. After the three steps of eager communication data transfer, the data is already in the requestor's private cache when it is needed. As seen in Figure 3, the latency of all four hops will be hidden and the data transfer traffic will be reduced. This paper has not implemented communication injection in the iCHAT design.
The control logic of iCHAT is straightforward and simple. The hardware complexity mainly lies in the storage overhead for the communication information: the page validation vectors, which use 64 bits per page; the address of the last evicted page; and the hot block table, which we currently set to 10 entries, each storing a page address and a 3-bit counter.
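As a worked example, assuming 48-bit physical addresses and 4 KB pages (giving 36-bit page IDs, an assumption on our part): the hot block table costs 10 x (36 + 3) = 390 bits, roughly 49 bytes; the last evicted page buffer costs 36 bits; and the validation vectors cost 64 bits per tracked page, e.g., about 8 KB for 1,024 tracked pages (4 MB of data).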
IV. EXPERIMENTAL RESULTS
For the CPU and memory system simulation, we use the gem5 [7] simulator system. For simulating the GPU, we use a proprietary simulator based on the AMD Graphics Core Next architecture [8] . The heterogeneous architecture is simulated by combining the memory systems of gem5 and the GPU simulator. The baseline architecture settings used in this paper are shown in Table II .
We choose widely used heterogeneous benchmarks, including the Rodinia benchmark suite and the AMD SDK APPs [9], as the heterogeneous workloads in the evaluation. We first show the preliminary results of iCHAT, which demonstrate the potential of applying the proposed technique to initialization data. To provide a thorough and quantitative study of communication overhead and potential performance gains, we design a bounding experiment based on an ideal iCHAT. The bounding experiment reports the percentage of inter CPU-GPU data transfers out of total data requests, the percentage of requests to CPU-GPU shared data out of total data requests, and the bounded performance boost. This bounding experiment provides a methodology that can be applied to heterogeneous processors and indicates, for an iCHAT design, what speedup is achievable and how many of the original CPU and GPU requests can benefit.
A. Preliminary iCHAT Results
In this paper, we evaluate the result of applying a basic iCHAT to a region-coherent heterogeneous processor. The basic iCHAT can detect the communication data and eagerly evict it from the owner's private cache back to memory. The preliminary results evaluate the proposed technique for initialization data. The evaluation of hot data transfer and the proposed further optimization technique will be implemented and shown in future work.
As the eager eviction technique evicts the data from the CPU L2 cache to the L3 cache in advance, the demand eviction requests are reduced. Successful eager eviction of a page enables the GPU to fetch that page directly from the L3, so the four-hop data transfer process is reduced to two hops, as explained in Section III-B. Figure 4(a) shows the total eviction requests triggered by GPU requests, compared to the baseline. Eager eviction reduces the eviction traffic significantly, by up to 74% for lud and 40% on average. The results clearly fall into two groups: for the first seven benchmarks on the left side, eager eviction reduces traffic by 60% on average, while for the remaining six benchmarks on the right side, traffic is reduced by less than 20%. The first reason is that some of those six benchmarks are scientific benchmarks without many transferred pages; for example, kmeans has just five pages of initialization data and particle just one, so the optimization potential for those benchmarks is small. Another reason is that our technique eagerly evicts pages in an incrementing sequence of page IDs starting from the last evicted page. If the GPU has a scattered data access footprint, it may request a page before the communicator has had a chance to evict it; in this situation, the GPU still goes through the original four-hop data transfer process.
As the directory does not need to send invalidation requests to the owner, the total traffic seen by the directory is also reduced. As shown in Figure 4(b), depending on how much of the total traffic the eviction traffic accounts for, the total traffic decreases by anywhere from a few percent up to 25% for nn, and by 8% on average. The results again separate into two groups, for reasons similar to Figure 4(a). The total traffic for kmeans increases because of mispredicted evictions, which result in more data blocks being evicted than necessary. Figure 4(c) shows the mis-eviction rate for each benchmark. The rate for kmeans is especially high, resulting in increased eviction traffic as well as extra requests to bring the mis-evicted data back from the L3 cache into the CPU L2 cache. However, since the initialization data is small (just five pages for kmeans), the absolute traffic increase is minor. For the other benchmarks with relatively high misprediction rates, such as bfs, the basic reason is that eager eviction evicts some non-initialization data that also meets the features of initialization data and occupies contiguous page IDs following the last evicted page. One way to fix this problem is to set a threshold on the number of pages that can be evicted each time, such as 5, to slow down the eviction and double-check whether the GPU actually consumes the data.
B. A Quantitative Bounding Experiment
To provide a thorough and quantitative study of communication overheads and potential performance gains, we designed a bounding experiment based on an ideal iCHAT. In the bounding experiment, we assume that iCHAT can detect the communication data and inject it into the CPU's and GPU's L1 caches. iCHAT can track data at both region and cacheline-block granularity, so this bounding experiment provides a more accurate evaluation than the results in Table I. In the bounding experiments, we use a strict definition of communication data: only data that actually travels between CPU and GPU, the real inter CPU-GPU transfers, counts as communication data. When it sees communication data, the ideal iCHAT moves the data immediately into the requestor's cache with no latency, so the requestor can get the data from its L1 cache at L1 cache latency. The communication data is also called CPU-GPU shared data in the experiment. Since all requests to the shared data are also affected by communication efficiency, they are evaluated as well. Read-shared data that exists in both the CPU's and GPU's caches is not counted as communication data, but requests to it are counted as shared data requests.
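A simplified sketch of the request classification this implies (the bookkeeping below is our approximation; among other simplifications, it does not model the exclusion of read-shared data resident in both caches from the transfer count):

    #include <cstdint>
    #include <unordered_map>
    #include <unordered_set>

    enum class Agent { CPU, GPU };

    struct BoundingClassifier {
        std::unordered_map<uint64_t, Agent> lastToucher;  // who touched the block last
        std::unordered_set<uint64_t> sharedBlocks;        // blocks that have crossed sides

        // Returns true if this request is a real inter CPU-GPU transfer
        // (communication data); the block is then marked so that later
        // accesses count as shared data requests.
        bool classify(Agent requestor, uint64_t block) {
            auto it = lastToucher.find(block);
            bool isTransfer = (it != lastToucher.end() && it->second != requestor);
            if (isTransfer) sharedBlocks.insert(block);
            lastToucher[block] = requestor;
            return isTransfer;
        }

        bool isSharedRequest(uint64_t block) const {
            return sharedBlocks.count(block) != 0;
        }
    };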
The bounding experiment can bound the potential speedup that iCHAT can achieve by setting the CPU and GPU communication data latency to their L1 cache latencies. It uses both the Rodinia benchmarks (problem size 2) and the AMD SDK (size 4) to cover more workloads at different sizes. Figure 5 presents the percentage of real inter CPU-GPU transfers out of total CPU+GPU data requests; Figure 5(a) presents the percentages for the Rodinia benchmarks. As seen, Rodinia has on average 8.2% communication data while the AMD SDK has 20.3%. The percentages differ considerably across workloads, due to differences in algorithm design. Notice that even when the percentage is small, it does not necessarily mean there are few transfers. The absolute numbers of transfers for hotspot, kmeans and nbody are quite small, hence the small percentages; for matrixmul the absolute number is significant, but the CPU and GPU issue far more total requests, so the percentage is still small.

Figures 6(a) and 6(b) present the percentage of requests to the communication (shared) data out of total requests. We evaluate all requests to the communication data because the subsequent accesses after the inter transfers can also benefit from fast communication. As seen in Figure 6(a), Rodinia has on average 60% shared data requests for the GPU and 22% for the CPU. As seen in Figure 6(b), the AMD SDK has on average 75% shared data requests for the GPU and 9% for the CPU. The GPU's shared data requests make up a large fraction, which verifies that the GPU usually operates on data transferred from the CPU. Notice that even though the real transfers in Figure 5 can be few, the GPU may later reuse the communication data heavily, which still leads to a large shared-data-request percentage; this is the case for most benchmarks, including bfs, kmeans, bitonic, nbody, matrixmul, etc. As stated in Section II, GPGPU workloads usually follow the pattern in which the CPU prepares the data and offloads it for the GPU to compute, while the CPU only shares data from the GPU when it needs to check the GPU's output and post-process the results. Therefore the CPU tends to have far fewer shared data requests.
In the quantitative experiment, we set the CPU and GPU communication data latencies close to their L1 cache latencies, which are 1 cycle and 25 cycles respectively. The bounding experiments give the bound on the potential speedup heterogeneous processors can obtain from an iCHAT design. Figures 7(a) and 7(b) show the speedups for Rodinia and the AMD SDK respectively: Rodinia achieves a 1.4x speedup on average and the SDK a 1.2x speedup. The speedup for a specific application depends on a mix of factors, including how much communication data is accelerated, how many requests go to the shared data, and whether the communication data is on the CPU's or GPU's critical execution path. Notice that the application with the most communication data does not necessarily achieve the highest speedup. For example, backprop has 38% inter CPU-GPU transfers but only achieves a 1.2x speedup, while bfs has only 5.5% inter transfers but achieves a 2.4x speedup; this is because bfs's communication data is fine-grained and scattered but on the critical path. matrixmul achieves a 1.5x speedup with only 0.5% inter transfers, because it has 31% shared data requests on both the CPU's and GPU's sides. The bounding experiment presents a quantitative evaluation of on-chip communication data and further shows that, by accelerating CPU-GPU communication data, heterogeneous processors can gain substantial speedup in the context of coherent caches. This provides insight for further research directions.
V. RELATED WORK
A. Heterogeneous Coherence
Historically, coherence between the CPU and GPU within a system has been managed by software, which has the drawback of potential programming-model complications or performance issues. However, now that the CPU and GPU are becoming more closely integrated, there have been proposals to provide hardware coherence for the shared memory. Many systems have been developed to reduce the bandwidth required in snooping systems. Moshovos et al. [2] proposed JETTY, which filters incoming snoops to the cache based on contiguous regions. RegionTracker [3] has been proposed as well, as have virtual tree coherence [10], subspace snooping [11], and in-network coherence filtering [12]. Since AMD and other HSA Foundation members have committed to providing hardware coherence, we focus only on hardware coherence in this paper. The significant bottleneck in applying traditional directory-based CPU coherence protocols is the vast demand for directory bandwidth and resources, due to the GPU's high memory bandwidth requirements and unique memory access patterns. The previous work Heterogeneous System Coherence (HSC) [4] implements directory coherence at region granularity between an integrated CPU and GPU. Given the high spatial locality of GPU data, obtaining coherence permissions at a coarse granularity (compared to the traditional cacheline level) enables the elision of the majority of directory accesses.
B. Eager Eviction
Several eager write-back schemes have been proposed to reduce shared-data coherence overhead or increase DRAM row hits for CPUs. However, none of them is designed for inter CPU-GPU data transfer characteristics.
Lebeck et al. [13] propose the first speculative invalidation technique, called Dynamic Self-Invalidation, for cache-coherent distributed shared memory systems. The scheme is based on the observation that data blocks that have recently had conflicting accesses, and hence would have needed invalidation, are candidates for self-invalidation. To predict when to self-invalidate a block, they use synchronization boundaries to trigger block self-invalidation. The technique can predict when a processor has finished accessing a shared block and speculatively invalidate the block in advance, so that subsequent accesses by other processors are accelerated. For a similar purpose, Lai et al. [14] propose Last-Touch Predictors, based on the observation that memory sharing and invalidation are triggered by program instructions. The technique maintains an instruction trace from a coherence miss until the last touch to a block before invalidation; since program behavior is repetitive, this trace can be used to predict block invalidation. Stuecheli et al. [15] propose the Virtual Write Queue scheme for eager write-back to increase row-level access locality. While write operations in the DRAM write queue are being scheduled, other dirty cache lines that map to the same row as the scheduled ones are searched for in the last-level cache and immediately transferred to DRAM.
Compared to the existing solutions, our proposed iCHAT communicator detects the hot regions transferred between CPU and GPU and eagerly evicts them from the owner's cache. The region-level invalidation better matches the large data needs of GPU applications and reduces the cache and directory traffic.
C. Cache Injection
Cache injection was proposed to speed up producer-consumer style communication on distributed parallel machines [16]: once the producer produced new data, the data would be sent and injected into the consumer's cache. One example of a CPU cache injection technique is [17]. Cache injection has not been applied to GPU-related platforms. Hardware and software prefetchers [18], [19], [20] serve a similar purpose to injection, reducing data access latency. However, prefetching techniques may not cover communication data, since communication data has different locality. Our injection technique can be used to supplement prefetching techniques to speed up communication data access.
VI. CONCLUSIONS
The frequent transfers of large data blocks between CPU and GPU result in high latency and heavy cache and memory traffic, which keep heterogeneous multiprocessors from reaching their potential speedups. This paper first characterizes GPGPU computing's communication patterns and categorizes communication data into two classes: initialization data and hot data. It then proposes a technique called iCHAT (inter-Cache Hardware-Assistant data Transfer) to watch for and detect the communication pattern, eagerly evict the communication data from the current owner's caches, and inject it into the requestor's caches ahead of time. iCHAT can reduce the cache traffic and the long latency involved in CPU-GPU communication, and thus enhance the performance gained by leveraging the GPU as a hardware accelerator.
As a first step, this paper completes the iCHAT implementation for initialization data; preliminary results show reductions of 40% in cache eviction traffic and 8% in total directory traffic. To provide a thorough understanding of iCHAT's advantages, we design a bounding experiment that quantitatively evaluates all the on-chip communication data and the potential performance boost. Specifically, a full iCHAT design could benefit 60% of the GPU's and 22% of the CPU's data requests for the Rodinia benchmarks, and 75% and 9% respectively for the AMD SDK APPs, achieving on average a 1.4x speedup for Rodinia and a 1.2x speedup for the AMD SDK APPs. Future work includes a full implementation of the iCHAT design and further optimization to explore the potential to completely hide communication latency.
