Most processors employ hardware data prefetching techniques to hide memory access latencies. However, the prefetching requests from different threads on a multicore processor can interfere severely with the prefetching and/or demand requests of other threads. As a result, data prefetching can lead to significant performance degradation due to shared resource contention on shared-memory multicore systems. This article proposes a thread-aware data prefetching mechanism based on low-overhead runtime information to tune prefetching modes and aggressiveness, mitigating the resource contention in the memory system. Our solution has three new components: (1) a self-tuning prefetcher that uses runtime feedback to dynamically adjust the data prefetching modes and parameters of each thread, (2) a filtering mechanism that informs the hardware about which prefetching requests can cause shared data invalidation and should be discarded, and (3) a limiter thread acceleration mechanism that estimates and accelerates the critical thread, which has the longest completion time in the parallel region of execution. On a set of multithreaded parallel benchmarks, our thread-aware data prefetching mechanism improves the overall performance of a 64-core system by 13% over a multimode prefetch baseline system with a two-level cache organization and a conventional modified, exclusive, shared, and invalid (MESI)-based directory coherence protocol. We compare our approach with the feedback directed prefetching technique and find that it provides a 9% performance improvement on multicore systems while reducing memory bandwidth consumption.
INTRODUCTION
Memory access latency has become one of the major bottlenecks to system performance. To hide memory access latency, researchers have proposed hardware data prefetching mechanisms. Prefetching refers to fetching data from memory into caches or a prefetching buffer before the data is actually used. In this way, the memory latency can be effectively hidden and the processor's performance can be improved. Data prefetching techniques in conventional single-core processors have proven useful. However, on a chip multiprocessor (CMP) system, all running threads usually share the last level cache (LLC) and off-chip memory. Memory requests from each core conflict with those from other cores, since prefetching requests, like normal data requests, need to traverse the on-chip interconnect networks and the memory bus to arrive at the memory banks. The presence of a prefetching engine therefore causes additional cache conflicts and memory bandwidth competition. These conflicts increase applications' execution time and limit the performance scaling of multicore processors.
Prefetching-caused contention of shared resources can be reduced if the shared resources are smartly managed. Much prior work has focused on memory controller-based request arbitration [Kim et al. 2011; Mutlu and Moscibroda 2008; Nesbit et al. 2006], operating system-based thread scheduling [Jaleel et al. 2012; Zhuravlev et al. 2010a, 2010b], feedback-directed prefetching aggressiveness throttling [Ebrahimi et al. 2009a, 2009b, 2011a; Srinath et al. 2007], and other methods for managing the shared LLC and main memory. However, most previous techniques are designed for multiprogrammed workloads, and prefetching on multithreaded parallel applications is confronted with new problems.
Generally, multiple threads on multicore systems share data among themselves. In order to maintain consistency, if data in a shared cache is replaced by prefetched data, then all its sharers must invalidate their local copies of the cache block. By analyzing the impact of prefetching on shared resource contention in multithreaded parallel applications, we found that prefetching requests cause interthread invalidations and lead to increased miss rates. Figure 1 shows the prefetching-caused interthread invalidations of multithreaded applications (Section 4 describes the experimental environment). The white column on the left side indicates the proportion of shared cache blocks among all cache blocks; this value is sampled and averaged during the runtime of each application. The gray column on the right side indicates the proportion of prefetching-caused invalidations among all L2 cache invalidations (caused by demand misses and prefetching requests). As shown in this figure, for some applications the percentage of shared cache blocks is high, and the large number of prefetching-caused interthread invalidations causes severe performance degradation.
In order to avoid such prefetching-caused interthread invalidations, we propose a filtering mechanism [Yu and Liu 2014] to filter the prefetching requests that cause interthread invalidations (attacking prefetches, as defined in Section 3). Meanwhile, in order to adapt to the runtime features of multithreaded parallel applications and reduce prefetching-caused contention on shared resources, we develop a thread-aware data prefetching mechanism based on low-overhead runtime information to tune prefetching aggressiveness, mitigating the resource contention in shared-memory multicore systems. Due to the properties of multithreaded applications, there is always a thread that is on the critical path of execution. We define the critical thread as the one with the longest completion time in the parallel region and propose a limiter thread acceleration mechanism to estimate and accelerate the critical thread.
Our contributions are summarized below.
(1) We first propose a thread classifying directed (TCD) approach to adjust prefetching parameters dynamically based on the runtime information for each thread to reduce shared resource contention.
(2) We employ an attacking prefetch filter (APF), which can be used to avoid prefetching-caused interthread invalidations.
(3) We develop a limiter thread acceleration (LTA) mechanism to estimate and accelerate the critical threads of a multithreaded application that are on the critical paths of execution.
(4) We extensively evaluate the proposed thread-aware adaptive data prefetcher on a 64-core system using multithreaded workloads from PARSEC [Bienia et al. 2008], SPLASH-2 [Woo et al. 1995], and some scientific applications.
The article is organized as follows. Section 2 describes the basic prefetching structure; Section 3 gives the proposed thread-aware prefetching mechanism in detail; Section 4 presents our evaluation methodology; Section 5 analyzes the experimental results; Section 6 outlines related work; and, finally, Section 7 concludes the article.
BASIC PREFETCHING STRUCTURE
The basic prefetching engine is a single-core, multimode prefetcher (MDP). It consists of a memory access pool (MAP) for recording history information and a pattern table (PTB) for the maintenance of predicted patterns. Both sequential and chained data address access patterns can be detected by the MDP. The structure of MAP is shown in Figure 2 (a). The Missaddr field records the address of each cache miss. The Stride field records the value of the stored miss address subtracted by the newly missed address (for sequential streams) or the newly returned value (for chained streams). The Status field is used for the training process of stream patterns. The Type field is used to distinguish between sequential patterns and chained patterns. The structure of PTB is shown in Figure 2 (b). The First and Last fields in PTB record the first and last addresses of previous prefetching requests. The Stride field records the stride of the sequential stream for sequential patterns or the offset of next field in the linked list for chained patterns.
During program execution, the first address of a cache miss is recorded in a MAP entry indexed by the missed address. We use the Address Prediction Heuristics, borrowed from the content-directed prefetching (CDP) approach [Cooksey et al. 2002], to determine whether a return value is a potential address. The heuristics rest on the assumption that most data addresses tend to share common high-order bits. The return value is compared with the missed address; if the two have the same high-order bits, then the return value is considered an address-like value. In that case the Type field is set, implying chained pattern training. Otherwise, the Type field is unset, implying sequential pattern training. The most important distinction between our approach and CDP for chained stream prefetching is that prefetching requests in our mechanism are not immediately issued when an address-like value is returned. Instead, the value is used to train a chained stream, and only if a chained stream has been found are the corresponding prefetching requests issued. This avoids indiscriminately issuing a prefetch for every address-like value, which has low accuracy. Whenever a new stream pattern is formed or a demand request hits on an existing stream, a prefetching request is ready for issuing. The prefetching engine checks whether the requested data is already in the cache or whether the Miss Status Holding Register (MSHR) has already recorded a request for it. If neither is the case, then the prefetching request is finally issued.
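The address-like test described above can be sketched as follows. This is a minimal illustration rather than the article's hardware logic: the function names, the 32-bit address width, and the choice of comparing the top 8 bits are all assumptions made for the example.

```python
def looks_like_address(miss_addr: int, returned_value: int,
                       compare_bits: int = 8, addr_bits: int = 32) -> bool:
    """Return True when both values share their top `compare_bits` bits.

    `compare_bits` and `addr_bits` are hypothetical parameters; the article
    does not state how many high-order bits are compared.
    """
    shift = addr_bits - compare_bits
    mask = (1 << addr_bits) - 1
    return ((miss_addr & mask) >> shift) == ((returned_value & mask) >> shift)


def training_type(miss_addr: int, returned_value: int) -> str:
    """Select the pattern that a MAP entry trains for (its Type field)."""
    if looks_like_address(miss_addr, returned_value):
        return "chained"      # Type set: the value may be a pointer
    return "sequential"       # Type unset: treat the miss as part of a stride
```

A small integer payload such as 42 shares no high-order bits with a heap address, so the entry trains for a sequential pattern; a value in the same region as the miss address trains for a chained pattern instead.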
We have evaluated the performance of MDP and a conventional stream prefetcher [Palacharla and Kessler 1994; Ganusov and Burtscher 2005]. The MDP engine performs 6% better than the stream prefetcher within a single-core system and 2% better within a multithreaded 64-core shared-memory system. Thus we choose MDP as a reasonable baseline in our work. Figure 3 shows the multicore system modeled in this article. Each node has private L1 instruction and data caches, and all nodes share a physically distributed LLC with an integrated directory and communicate with each other through routers connected by the on-chip interconnection fabric. Caches are kept coherent by a Modified, Exclusive, Shared, and Invalid (MESI)-based directory coherence protocol. The prefetching engine on each node is based on the multimode prefetcher described in Section 2. In order to exploit the performance of the multithreaded environment, we made the following improvements to the single-core MDP mechanism to build the thread-aware adaptive prefetcher (TAP): (1) a TCD prefetcher tuning unit, which adjusts prefetching engines dynamically according to the runtime information of multithreaded applications to reduce shared resource contention; (2) an APF to avoid prefetching-caused interthread invalidations; and (3) an LTA module to estimate and accelerate the critical threads.
THREAD AWARE PREFETCHING SYSTEM
In the following, we discuss the rationale for thread categories in Section 3.1, the approach to prefetching adjustment in Section 3.2, the principle of an attacking prefetch filter in Section 3.3, and the mechanism for accelerating the limiter threads in Section 3.4.
Prefetching-Aware Thread Classification
The resources which are shared among threads include on-chip cache, off-chip main memory, memory bus, and so on. In a multithreaded application, multiple threads are used for processing each subset of data or performing different tasks, and thus they have different data requirements, making resource contention complicated. To reduce resource contention among the complex combination of threads, we classify threads by their memory behavior and characteristics of prefetching and adjust each thread's prefetcher algorithms according to the classifying results.
As mentioned above, contention for shared resources leads to a larger number of memory accesses and longer memory access time, and the presence of a prefetching engine can exacerbate this situation. In order to evaluate shared resource contention among threads, we use the metric tMH (total time for miss handling), normalized to the result with no prefetching, to indicate the prefetching-caused impact on system performance.
Additionally, we use the following four metrics, which indicate different aspects of threads' memory access behavior, to evaluate the shared resource contention.
(1) MPKI (misses per thousand instructions). This metric indicates the contention on limited cache capacity. It directly reflects a thread's memory access intensity.
(2) PPKI (prefetch requests per thousand instructions). This metric indicates the aggressiveness of prefetching for shared caches.
(3) PA (prefetch accuracy). This metric reflects the effect of prefetch requests.
(4) PL (prefetch lateness). This metric can be used to evaluate the timeliness of prefetch issuing.
To classify threads into several types, we need to analyze their runtime behavior. During the phase of program feature analysis, we sampled all 24 applications (see Section 4 for details) for tMH, MPKI, PPKI, PA, and PL. A clustering algorithm (k-means [MacQueen 1967]) is used to analyze the data and decide the threshold values of the metrics that separate the thread types. The classification is also informed by our observation of the data and our experimental experience.
According to the results of normalized tMH, we divide applications into three collections, as shown in Table I .
(1) The applications in the first collection have a tMH increase of less than 10% (tMH < 1.1). This means that the influence of prefetching is slight, and these applications
These results indicate that the impact of prefetching on shared resource contention is closely related to a thread's memory behavior and prefetching performance. Therefore it is reasonable to classify threads and manage prefetchers according to these metrics. The thread classifying policy is shown in Table I. Classifying an application as a given type means that its threads are likely to behave as that type. It should be noted that the thread classification is not static; it changes during the runtime of the threads. The real-time feedback information is collected by a hardware module in every statistical interval. We use the number of cache replacements (1024 by default) to define the statistical interval for adjustment. Since the purpose of prefetching adjustment is to reduce resource conflicts, and the frequency of cache replacements essentially reflects the cache conflict intensity, this approach adapts the adjustment rate automatically to the runtime degree of resource contention.
Thread Classifying Directed Adjustment
Based on the thread classifying policy, we can adjust each thread's prefetcher algorithms according to the classifying results. As different types of threads have their own characteristics, we carry out the corresponding prefetching adjustment operations with the following considerations:
For LM: Since it is characterized by its sparse memory access requests, there is no need for such threads to worry about the problem of memory resource contention with other threads.
For MMLP: While it makes many memory accesses, the number of prefetch requests is small. That means the data address pattern of such threads does not fall into a pattern that can be predicted by the prefetcher. Thus we can temporarily shut down the threads' prefetching engine to reduce its contention for the shared resources. The prefetcher will be started again with default configuration (degree = 2, distance = 8) in the next statistical interval, and thus the metrics will be updated.
For MPHA: Prefetch requests are issued intensively with a high prefetch accuracy and low prefetch lateness, showing that the prefetcher is having the best impact on its performance. Thus, if there is no memory resource contention with other threads, then it should tune up the aggressiveness of its prefetcher as much as possible to make full use of prefetching engines to achieve performance improvement.
For MPHAL: Prefetch accuracy and lateness are both high. This means that although the prefetcher has been predicting accurate addresses, the issuing of prefetch requests is not timely enough. In this case, it is necessary to tune up the prefetch distance to achieve a better issuing timeliness.
For MPLA: It issues a large number of inaccurate prefetch requests, which can seriously waste shared resources. When there are other threads competing for memory resources, the MPLA threads' prefetching aggressiveness should be tuned down.
To recap, the TCD prefetching adjustment strategies are listed in Table II . We adjust prefetching aggressiveness by changing the distance and degree of prefetchers separately. The prefetch distance is the interval of data blocks between the data currently being prefetched and the data that the processor actually uses along the stream. The prefetch degree indicates how many continuous data blocks are fetched every time when a prefetch request is issued. Table III shows the aggressiveness at all levels from low to high corresponding to the prefetch distance and degree.
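As a rough illustration of how the classification and the adjustment rules fit together, the sketch below maps the four runtime metrics to a thread type and then to an action. The threshold constants are placeholders chosen for the example and are not the values of Table I; the action names, the `contention` flag, and the behavior in cases the text leaves open (MPHA under contention, MPLA without contention) are likewise assumptions.

```python
from dataclasses import dataclass

# Hypothetical thresholds; the article derives its own from k-means clustering.
MPKI_LOW = 1.0
PPKI_LOW = 1.0
PA_HIGH = 0.6
PL_HIGH = 0.2


@dataclass
class Metrics:
    mpki: float  # demand misses per thousand instructions
    ppki: float  # prefetch requests per thousand instructions
    pa: float    # prefetch accuracy in [0, 1]
    pl: float    # prefetch lateness in [0, 1]


def classify(m: Metrics) -> str:
    """Assign one of the five thread types from the interval's metrics."""
    if m.mpki < MPKI_LOW:
        return "LM"        # sparse memory accesses
    if m.ppki < PPKI_LOW:
        return "MMLP"      # many misses, few prefetches
    if m.pa < PA_HIGH:
        return "MPLA"      # many prefetches, low accuracy
    return "MPHAL" if m.pl >= PL_HIGH else "MPHA"


def adjust(thread_type: str, contention: bool) -> str:
    """Pick an adjustment for the next interval (action names are illustrative)."""
    if thread_type == "LM":
        return "keep"
    if thread_type == "MMLP":
        return "disable_until_next_interval"
    if thread_type == "MPHA":
        # Assumption: leave the prefetcher unchanged when others are contending.
        return "keep" if contention else "increase_aggressiveness"
    if thread_type == "MPHAL":
        return "increase_distance"
    if thread_type == "MPLA":
        # Assumption: leave the prefetcher unchanged when nobody is contending.
        return "decrease_aggressiveness" if contention else "keep"
    raise ValueError(f"unknown thread type: {thread_type}")
```

Because the classification is recomputed every statistical interval, a thread whose prefetcher was disabled as MMLP restarts with the default configuration and may be reclassified in the following interval.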
Reducing Prefetching-Caused Interthread Invalidation
The sharing of data caches among threads results in additional cache misses. This is partially caused by the replacement of other threads' useful cache blocks when demanded data or prefetched data are returned from memory system. This situation is more complicated under a MESI-based cache coherence protocol. Since L1 cache misses and prefetching requests will access LLC, they can cause the replacement of a data block which is shared by other threads. The result is that the data copies of that block in its sharer's L1 caches will be invalidated, thus causing demand misses when the data block is later accessed.
If a demand miss is caused by a prefetching request, then it is called a prefetching-caused invalidation. Such prefetching requests are called attacking prefetches. An attacking prefetch is identified when an L1 prefetch misses in the LLC and the prefetched data tries to replace a shared cache block.
To avoid degradation of performance, one solution is to abandon all attacking prefetches. An APF is used to filter such prefetches. The filter works when a prefetching data request returns from main memory and tries to replace a shared cache line in L2. As shown in Figure 4 , if the returned data is a prefetched data, and it is going to replace a shared L2 cache line, then the replacement of the shared cache line will be terminated to avoid interthread invalidations.
However, this approach will break a chained stream when prefetching a linked data structure, since a new node's address cannot be calculated until its predecessor is returned. In this article, we propose solutions for sequential and chained streams separately. For sequential streams, all attacking prefetches are abandoned and the address of an attacking prefetch is recorded in the pattern table for later use. For example, in Figure 5, one thread's L1 prefetcher detects a stream at address A and begins to issue prefetches. When an attacking prefetch is detected at address A + 3N, the prefetch is abandoned but still recorded in the pattern table. A later access to A + 3N will hit in the pattern table and allow further prefetches to be issued.
For chained patterns, L1 prefetching misses are not abandoned immediately if they are identified as attacking prefetches. Instead, they are issued, but the return value is used to maintain the linked pattern streams. As shown in Figure 6 , the prefetch for node C is identified as an attacking prefetch and is issued. When the value is returned, it is used to calculate the next node address in the linked stream and then discarded.
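The two handling paths can be summarized in a small decision function. This is a sketch under simplifying assumptions: the function names and the returned action strings are invented for illustration, and the real filter operates on MSHR and directory state rather than on Boolean flags.

```python
def is_attacking(is_prefetch: bool, victim_is_shared: bool) -> bool:
    """A returning fill is an attacking prefetch when it was generated by the
    prefetcher and would replace a cache line shared by other threads."""
    return is_prefetch and victim_is_shared


def handle_fill(is_prefetch: bool, victim_is_shared: bool, stream_type: str) -> str:
    """Decide what to do with data returning from memory.

    stream_type is "sequential" or "chained" (the Type of the owning stream).
    """
    if stream_type not in ("sequential", "chained"):
        raise ValueError(f"unknown stream type: {stream_type}")
    if not is_attacking(is_prefetch, victim_is_shared):
        return "install"                  # normal replacement proceeds
    if stream_type == "sequential":
        # Abandon the prefetch but keep its address in the pattern table so
        # that a later demand access can resume the stream.
        return "drop_and_record_address"
    # Chained stream: the returned value is needed to find the next node,
    # so it is consumed for training and only then discarded.
    return "use_value_then_discard"
```

In both attacking cases the shared line stays in the cache, so no interthread invalidation is triggered; the difference lies only in whether the returned value is still needed to keep the stream alive.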
When the prefetched data are returned, they are inserted into the Least Recently Used (LRU) stack of the corresponding cache set. Figure 7 is an example of cache replacement with a four-way associative structure. Normally, the prefetched data are treated like demand data and inserted into the most recently used (MRU) position of the LRU stack. This mechanism is efficient if the prefetch accuracy is high. However, if a prefetch request is mispredicted, which means the prefetched data will not be used, then the prefetched data occupy part of the cache until they are replaced. This wastes cache resources. We utilize an adaptive insertion policy [Jaleel et al. 2008] for our cache replacement mechanism to act in concert with the proposed attacking prefetch filter. Based on the prefetching-aware thread classification, the prefetch accuracy can be estimated when the prefetched data are returned. For high prefetch accuracy, the prefetched data are inserted into the MRU position as usual. Prefetched data with lower prefetch accuracy are inserted into the second or third position of the LRU stack. If the prefetch accuracy is very low, the prefetched data are inserted into the LRU position to reduce the waste of cache space. Table IV shows the thresholds used for our cache replacement policy.
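The accuracy-dependent insertion can be illustrated for a four-way set as below. The accuracy thresholds are placeholders and do not reproduce Table IV, and the list-based LRU stack (index 0 is MRU) is an assumption made to keep the example short.

```python
def insertion_position(pa: float, ways: int = 4) -> int:
    """Map estimated prefetch accuracy to a stack position (0 = MRU).

    The 0.75 / 0.50 / 0.25 cut-offs are hypothetical.
    """
    if pa >= 0.75:
        pos = 0            # high accuracy: insert at MRU as usual
    elif pa >= 0.50:
        pos = 1            # second position
    elif pa >= 0.25:
        pos = 2            # third position
    else:
        pos = ways - 1     # very low accuracy: insert at LRU
    return min(pos, ways - 1)


def insert_block(lru_stack, block, position, ways=4):
    """Insert `block` into a set ordered MRU-first.

    Returns the new stack and the evicted block (None if the set had room).
    """
    stack = list(lru_stack)
    victim = stack.pop() if len(stack) >= ways else None
    stack.insert(min(position, len(stack)), block)
    return stack, victim
```

For example, with a full set `["a", "b", "c", "d"]`, inserting a prefetched block at position 1 evicts `"d"` and leaves `"a"` in the MRU slot, so the new block is evicted sooner than a demand-fetched block would be if it turns out to be useless.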
Limiter Thread Acceleration
When multithreaded workloads are concurrently executed, the execution time is determined by the critical path. Figure 8 is an example of barrier synchronization; four threads are executed during the stage between Barrier0 and Barrier1. The execution time of this stage is exactly the time that Thread1 spends. This means that, during this stage, prefetching for Thread0, Thread2, and Thread3 will not bring any benefit to system performance [Ebrahimi et al. 2011b]; it may even cause additional memory bandwidth competition and slow down the execution of the limiter thread. Taking into account the interdependent nature of multithreaded workloads, we propose an LTA mechanism, which tunes up the prefetching aggressiveness of limiter threads (Thread1 in this example) to achieve shorter execution time. For the threads that are not on the critical path and have the longest waiting time for barrier synchronization (Thread3 in this example), the LTA mechanism turns down the prefetching aggressiveness to avoid unnecessary memory bandwidth consumption. We find that performance can be improved by exposing information about limiter threads to the prefetching engine, which uses this information to accelerate the limiter threads and save the memory bandwidth consumed by other threads. We utilize the method described in Ebrahimi et al. [2011b] to estimate the limiter threads. Algorithm 1 illustrates the method in detail. To implement this algorithm, the runtime system maintains one counter (TotalWaitTime) per lock to accumulate the total cycles that threads wait in that lock's queue, and keeps one counter (ThreadWaitTime) per thread to track the number of cycles that the thread waits for a lock. To keep track of each lock's waiting time, when a lock j is successfully acquired by thread i, the runtime system adds ThreadWaitTime_i to TotalWaitTime_j. The thread holding the lock with the longest TotalWaitTime is identified as the limiter thread.
A bit-vector LimiterThreadBit is added to identify which thread is the limiter thread. In every statistical interval, the runtime system compares the accumulated lock queue waiting time of all locks and finds the lock with the longest waiting time during the previous interval. The system saves this lock as LockLongest. Then the thread which is holding LockLongest is determined as OwnerLongest, and the corresponding bit in the LimiterThreadBit is set. Finally, when a thread acquires LockLongest, the bit corresponding to the previous OwnerLongest is reset, the new OwnerLongest is recorded, and the bit for the new OwnerLongest in LimiterThreadBit is set.
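The bookkeeping described above can be modeled in software as follows. This is a simplified sketch, not Algorithm 1 itself: the class and method names are invented, clearing ThreadWaitTime after an acquire and clearing TotalWaitTime at each interval boundary are assumptions, and lock release is not modeled.

```python
from collections import defaultdict


class LimiterEstimator:
    def __init__(self, num_threads: int):
        self.total_wait = defaultdict(int)      # TotalWaitTime, one per lock
        self.thread_wait = [0] * num_threads    # ThreadWaitTime, one per thread
        self.owner = {}                         # most recent holder of each lock
        self.limiter_bits = [False] * num_threads  # LimiterThreadBit
        self.lock_longest = None                # LockLongest

    def wait(self, thread: int, cycles: int) -> None:
        """Thread spent `cycles` waiting in a lock queue."""
        self.thread_wait[thread] += cycles

    def acquire(self, thread: int, lock) -> None:
        """Thread i acquires lock j: add ThreadWaitTime_i to TotalWaitTime_j."""
        self.total_wait[lock] += self.thread_wait[thread]
        self.thread_wait[thread] = 0            # assumption: restart the count
        self.owner[lock] = thread
        if lock == self.lock_longest:
            self._mark(thread)                  # new OwnerLongest

    def end_interval(self) -> None:
        """Pick the lock with the longest accumulated wait in this interval."""
        if not self.total_wait:
            return
        self.lock_longest = max(self.total_wait, key=self.total_wait.get)
        self._mark(self.owner[self.lock_longest])
        self.total_wait.clear()                 # assumption: per-interval totals

    def _mark(self, thread: int) -> None:
        self.limiter_bits = [False] * len(self.limiter_bits)
        self.limiter_bits[thread] = True
```

A usage example: after threads 0 and 2 wait 10 and 5 cycles for lock "A" and thread 1 waits 50 cycles for lock "B", `end_interval()` selects "B" and sets the bit for thread 1; if thread 2 later acquires "B", the bit moves to thread 2.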
Another bit-vector, FasterThreadBit, is added to identify the thread that has the longest waiting time for barrier synchronization. Since the faster threads are not on the critical path, tuning down their prefetching aggressiveness can avoid unnecessary memory bandwidth consumption without affecting the execution time. The faster thread can be estimated by an algorithm similar to Algorithm 1. For FasterThreadBit, the runtime system searches for the lock with the shortest waiting time during the previous interval and saves this lock as LockShortest. The thread which is holding LockShortest is determined as OwnerShortest, and the corresponding bit in the FasterThreadBit is set. The bit-vectors are updated and communicated to the prefetching engine in order to prioritize the limiter threads. For the limiter threads, the LTA mechanism tunes up prefetching aggressiveness in order to achieve the potential improvement in execution time. For the faster threads, however, the LTA mechanism turns down prefetching aggressiveness to avoid unnecessary memory bandwidth consumption. Note that prefetching aggressiveness is adjusted by both TCD and LTA, and thus the two decisions may conflict (e.g., TCD may turn up the prefetching aggressiveness of a faster thread). Since changing the prefetching aggressiveness of the faster threads will not affect the execution time, when a conflict occurs, LTA overrides TCD's decision and dominates the adjustment.
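The priority of LTA over TCD can be expressed compactly. In this sketch the aggressiveness change is modeled as a step of +1, 0, or -1 on a level index; the number of levels (`NUM_LEVELS`), the function names, and the tie-break when a thread is flagged as both limiter and faster are assumptions, since the article's Table III is not reproduced here.

```python
NUM_LEVELS = 5  # hypothetical number of aggressiveness levels


def combined_delta(tcd_delta: int, is_limiter: bool, is_faster: bool) -> int:
    """Resolve the TCD and LTA decisions for one thread.

    tcd_delta is the step proposed by TCD (+1 tune up, -1 tune down, 0 keep).
    LTA overrides it whenever the thread is flagged in either bit-vector.
    """
    if is_limiter:
        return +1          # accelerate the critical thread
    if is_faster:
        return -1          # save bandwidth on a thread that would only wait
    return tcd_delta       # no LTA opinion: follow TCD


def apply_delta(level: int, delta: int) -> int:
    """Move to the new aggressiveness level, clamped to the valid range."""
    return max(0, min(NUM_LEVELS - 1, level + delta))
```

For instance, if TCD proposes tuning up a thread that FasterThreadBit marks as a faster thread, the combined step is -1.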
Implementation
To implement the proposed method, we need hardware support for computing the metrics used in our scheme, classifying running threads, and tuning the parameters. Note that we use tMH to evaluate the contention of shared resources. This statistic is collected by the software simulator (see Section 4) during the phase of program feature analysis and is used to help design the thread classification strategy. However, it will not be collected during the runtime of the proposed mechanism. The hardware prefetching engine collects only four metrics (MPKI, PPKI, PA, and PL) to direct the runtime thread classifying.
MPKI: To collect the MPKI, two hardware counters are used. The first counter, inst-total, tracks the number of instructions that have been committed. The second counter, miss-total, tracks the number of demand misses. The metric MPKI is computed by taking the ratio of miss-total to inst-total, scaled to a per-thousand-instruction rate.
PPKI: A hardware counter (pref-total) is used to track the number of prefetching requests. The metric PPKI is computed by taking the ratio of pref-total to inst-total, scaled to a per-thousand-instruction rate.
PA: A hardware counter (pref-used) is used to count the number of useful prefetches. When a prefetched block is inserted into the cache, an additional bit (pref-bit) associated with that block is set. When an L1 cache block with pref-bit set is accessed by a demand request, the pref-bit is reset and pref-used is incremented. The metric PA is computed by taking the ratio of pref-used to pref-total.
PL: A prefetching request is late if a demand request for the prefetched address is generated before the prefetched data has arrived. A hardware counter pref-late is used to track the number of late prefetches. To track the history of prefetching requests, we add a hardware structure, Prefetch History Table (PHT) to each L1 cache. The PHT has 16 entries, each entry has an addr-tag to record the prefetch address, and a hit-bit to indicate that a demand request for this address was generated. If a prefetching request is issued, then its address is inserted to the addr-tag of PHT and the associated hit-bit is unset. If a demand request is generated, then its address will be compared with all addr-tags in PHT; if matched, then the associated hit-bit is set. When the prefetched data is returned and the hit-bit has been set, the pref-late counter is incremented. The metric PL is computed by taking the ratio of pref-late to pref-total.
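The counter arithmetic and the PHT lookup can be modeled in a few lines. This is a software sketch of the hardware described above: the class and method names are invented, the counters are written with underscores instead of hyphens, and FIFO replacement of PHT entries is an assumption because the article does not specify the replacement policy.

```python
class PrefetchStats:
    PHT_ENTRIES = 16  # size given in the text

    def __init__(self):
        self.inst_total = 0   # committed instructions
        self.miss_total = 0   # demand misses
        self.pref_total = 0   # prefetch requests issued
        self.pref_used = 0    # prefetched blocks later hit by a demand request
        self.pref_late = 0    # prefetches that arrived after the demand request
        self.pht = {}         # addr-tag -> hit-bit, kept in insertion order

    def issue_prefetch(self, addr: int) -> None:
        self.pref_total += 1
        if addr not in self.pht and len(self.pht) >= self.PHT_ENTRIES:
            self.pht.pop(next(iter(self.pht)))   # assumed FIFO replacement
        self.pht[addr] = False                   # hit-bit unset

    def demand_request(self, addr: int) -> None:
        if addr in self.pht:
            self.pht[addr] = True                # demand arrived before the data

    def prefetch_returned(self, addr: int) -> None:
        if self.pht.pop(addr, False):            # hit-bit was set: late prefetch
            self.pref_late += 1

    def metrics(self) -> dict:
        inst = self.inst_total or 1              # avoid division by zero
        pref = self.pref_total or 1
        return {
            "MPKI": 1000.0 * self.miss_total / inst,
            "PPKI": 1000.0 * self.pref_total / inst,
            "PA": self.pref_used / pref,
            "PL": self.pref_late / pref,
        }
```

With 2000 committed instructions, 10 demand misses, two prefetches of which one was used and one was late, `metrics()` yields MPKI = 5.0, PPKI = 1.0, PA = 0.5, and PL = 0.5.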
Attacking prefetch filter: To filter an attacking prefetch, there is an additional bit (pref-request) in each MSHR entry of the L2 cache. There are 32 MSHRs for each node. When a prefetch request is issued, an MSHR entry is acquired, and the pref-request bit is set to indicate that this memory request is generated by the prefetcher. If the prefetched data returns and tries to replace a shared cache line in L2, then this prefetch is identified as an attacking prefetch. The replacement of the shared cache line is terminated to avoid interthread invalidations.
METHODOLOGY
We implement the proposed mechanism on a cycle-level, execution-driven, in-house simulator. The processor microarchitecture, the cache coherence, and communication Inc. 2008] . The simulator models a 64-node MESI-based directory protocol with a detailed model of both stable and transient states and queuing of requests. All memory transactions are modeled using an event-driven framework accounting for latency, bandwidth constraints, bank queuing, and other contentions. Miss status holding registers and nonblocking memory controllers are modeled. Memory is address interleaved, which means that every memory controller serves the addresses mapped to one-eighth of the CMPs and uses a separate router to connect to the cores. Synchronization instructions (load-link and store-conditional) and tree barriers are also supported. The multimode prefetching engine on each processor is configured to have a 16-entry MAP and an 8-entry PTB. The initial prefetch distance and degree are set to 8 and 2, respectively. The details of the architectural parameters used for evaluation are shown in Table V.
PopNet [Li 2007] is used to model the packet-switched network. The network is a 10 × 8 mesh which connects the core tiles in the middle and the memory controller tiles at two sides. We use McPAT [Li et al. 2009] to evaluate the power consumption of the processor cores for a 65nm complementary metal-oxide-semiconductor (CMOS) process. The DSENT tool is used to calculate the power consumption of the networks-on-chip (NoC).
Our evaluation is performed with a suite of parallel applications from SPLASH-2 [Woo et al. 1995], PARSEC [Bienia et al. 2008], and other multithreaded workloads. These applications are compiled with "-O3" optimization by a cross-compiler to generate MIPS32 binaries. The limitation of the cross-compiler prevents us from running certain applications. Table VI lists the applications used, along with the input parameters. All results are collected by running a portion of the benchmarks for 3 billion instructions. The first 1 billion instructions are used to warm up the system and the statistics are collected for the next 2 billion instructions. We use the number of cache replacements to define the statistical interval; it is a configurable parameter and is set to 1024 by default. We compare our method with the feedback directed prefetching (FDP) [Srinath et al. 2007] technique. The FDP mechanism estimates prefetcher accuracy, prefetcher timeliness, and prefetcher-caused cache pollution to control the aggressiveness of the prefetcher dynamically, and it mitigates prefetcher-caused cache pollution by adjusting the insertion policy of prefetched blocks. If the degree of prefetcher-caused pollution is high, then FDP inserts all prefetches into the LRU position of the LRU stack, which means that some accurately prefetched blocks are also inserted into the LRU position. Inserting an accurate prefetch into the LRU position may lead to the cache block being evicted from the cache before the demand request that needs it arrives, resulting in one additional miss for that block. The TAP mechanism uses machine learning-directed runtime adjustments to relieve the attack interference between cores, which improves the prefetching performance of multithreaded applications.
As an application shows different characteristics in different phases of execution, we use the TCD mechanism to classify threads at runtime and tune the aggressiveness of prefetchers according to the classification. Furthermore, in our TAP mechanism, the estimated prefetching accuracy is used to determine the insert position of prefetched data.
We also implement and evaluate the hierarchical prefetcher aggressiveness control (HPAC) mechanism [Ebrahimi et al. 2009b] in combination with our mechanism. The HPAC mechanism consists of a hierarchy of prefetcher aggressiveness control structures that combines local and global interference feedback to maximize the benefits of prefetching on each core while optimizing overall system performance. In HPAC, Bandwidth Consumed by Core i (BWC_i) and Bandwidth Needed by Cores Other than Core i (BWNO_i) are used to keep track of prefetcher-caused intercore interference in the shared memory system, whereas in our mechanism, we use the attacking filter and thread classifying directed adjustment to reduce prefetching-caused interthread invalidations.
To evaluate the prefetching influence on system performance, we collect the performance results (execution time, energy consumption, memory bandwidth consumption, etc.) and normalize them to the results with no prefetching. The geometric mean of the results is calculated to find out the system performance speedup ratio on average.
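The normalization and averaging step can be sketched as follows. The application names and numbers in the usage part are made up for illustration only and are not results from the paper.

```python
import math
from typing import Dict, Iterable


def normalize(results: Dict[str, float], baseline: Dict[str, float]) -> Dict[str, float]:
    """Divide each application's metric by its no-prefetching baseline value."""
    return {app: results[app] / baseline[app] for app in results}


def geometric_mean(values: Iterable[float]) -> float:
    """Geometric mean of positive values, computed in log space."""
    vals = list(values)
    return math.exp(sum(math.log(v) for v in vals) / len(vals))


if __name__ == "__main__":
    # Illustrative execution times (arbitrary units), not measured data.
    no_prefetch = {"app_a": 100.0, "app_b": 200.0}
    with_prefetch = {"app_a": 80.0, "app_b": 180.0}
    norm = normalize(with_prefetch, no_prefetch)
    mean_time = geometric_mean(norm.values())
    print(f"normalized execution time (geomean): {mean_time:.3f}")
```

The geometric mean is the usual choice for averaging normalized ratios because it does not depend on which configuration is used as the reference.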
EXPERIMENTAL RESULTS

Thread Classification
During the runtime of each application, the characteristics of memory access behavior change with time. We use the thread classifying policy (as shown in Table I) to classify each thread dynamically at runtime. By adjusting the prefetching engines according to the classification result, the TCD approach changes the memory access behavior of each thread and thereby affects the distribution of thread classes. Figure 9 shows the distribution of the five types of threads with TCD optimization (right side) compared to the results without TCD (left side). Since the thread classification changes during the runtime of threads, it is sampled every statistical interval, and the columns in this figure represent the proportion of each type of thread observed over the statistical intervals. For MPLA threads, TCD reduces the degree of the prefetcher in order to improve prefetch accuracy, and for MPHAL threads, the prefetching engine increases the prefetch distance to improve timeliness. As a result, we can see from the figure that TCD optimization increases the proportion of MPHA threads, especially in the canneal, ferret, and ocean applications. In addition, TCD stops the prefetching engine of MMLP threads, and the proportion of MMLP threads decreases in applications such as em3d and shallow. TCD optimization improves the prefetch accuracy and timeliness and reduces cache pollution and competition for the LLC. Thus the cache miss rate (indicated by the metric MPKI) is reduced, and the proportion of LM threads is increased. The details of memory requests are discussed in Section 5.9.
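The per-class proportions plotted in a distribution like Figure 9 can be computed from the sampled labels as in the sketch below. It treats the five class names as opaque labels and assumes the samples are available as one list of per-thread labels per statistical interval; that data layout is an assumption made for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, List

# The five thread classes named in the text, used here only as labels.
THREAD_CLASSES = ["MPHA", "MPHAL", "MPLA", "MMLP", "LM"]


def class_distribution(samples: Iterable[List[str]]) -> Dict[str, float]:
    """Return the fraction of (thread, interval) samples that fall in each class.

    samples: one list per statistical interval, holding one class label per thread.
    """
    counts: Counter = Counter()
    for interval_labels in samples:
        counts.update(interval_labels)
    total = sum(counts.values())
    return {c: (counts[c] / total if total else 0.0) for c in THREAD_CLASSES}


if __name__ == "__main__":
    # Two intervals, two threads each (illustrative labels only).
    sampled = [["MPHA", "LM"], ["MPHA", "MPLA"]]
    for name, share in class_distribution(sampled).items():
        print(f"{name}: {share:.2f}")
```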
Reducing Prefetching-Caused Invalidation
As discussed in Section 3, we choose tMH (total time for miss handling) as one of the metrics for thread classifying. tMH accumulates the total time of handling cache miss events and reflects the intensity of competition for shared resources.
The total miss handling time for each application is shown in Figure 10. All results are normalized to the result of an ideal situation which eliminates shared resource contention. As can be seen from the figure, by using the APF, prefetching-caused interthread invalidations are avoided, and the total time waiting for miss handling has been shortened in many applications, especially in barnes, fft, lu, and raytrace. On average, the proposed APF reduces the miss handling time by 7% compared to the MDP.
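The filtering idea can be illustrated with a simplified sketch. It assumes a directory that records a MESI state and the set of holders for each block, and it treats a prefetch as attacking when another core holds the block in the Modified or Exclusive state. That detection rule, the `DirEntry` type, and the function names are assumptions for illustration; the hardware conditions used by the APF may differ.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class DirEntry:
    state: str                                  # one of "M", "E", "S", "I"
    holders: Set[int] = field(default_factory=set)


def is_attacking(directory: Dict[int, DirEntry], block: int, core: int) -> bool:
    """Assumed rule: the prefetch would disturb a block another core owns in M or E."""
    entry = directory.get(block)
    if entry is None or entry.state == "I":
        return False                            # uncached or invalid: nothing to disturb
    others = entry.holders - {core}
    return entry.state in ("M", "E") and bool(others)


def filter_stream(directory: Dict[int, DirEntry], core: int,
                  candidates: List[int]) -> List[int]:
    """Drop attacking prefetches but keep walking the stream, so later blocks are still issued."""
    return [b for b in candidates if not is_attacking(directory, b, core)]


if __name__ == "__main__":
    directory = {
        0x10: DirEntry("M", {1}),               # owned and modified by core 1
        0x11: DirEntry("S", {1, 2}),            # shared, read-only copies
    }
    issued = filter_stream(directory, core=0, candidates=[0x0F, 0x10, 0x11, 0x12])
    print([hex(b) for b in issued])
```

Dropping only the offending request, rather than stopping the stream at the first filtered block, mirrors the stated goal of avoiding invalidations without breaking prefetching streams.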
Limiter Thread Acceleration
When multiple threads concurrently execute in a CMP system, they are synchronized through locks or barriers. We propose the LTA mechanism to estimate the threads likely to be on the critical path of execution, and design the prefetching adjustment strategy to prioritize the critical threads in order to reduce the threads' synchronization waiting time and accelerate their execution. Figure 11 shows the average synchronization waiting time of all threads for each application; all the results are normalized to the results with no prefetching. Since the LTA mechanism estimates and accelerates the critical threads, the average synchronization waiting time is reduced by 6% compared to the MDP.
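A minimal sketch of the limiter-thread idea follows. It assumes the limiter is estimated as the thread that spent the fewest cycles waiting at synchronization points in the last interval (the thread the others wait for), and it uses hypothetical integer aggressiveness levels. The estimator, the level range, and the one-step adjustment are all illustrative assumptions rather than the hardware scheme used by LTA.

```python
from typing import Dict

MIN_LEVEL, MAX_LEVEL = 1, 5   # hypothetical prefetcher aggressiveness levels


def estimate_limiter(wait_cycles: Dict[int, int]) -> int:
    """Assumed heuristic: the thread with the least synchronization waiting time is the limiter."""
    return min(wait_cycles, key=lambda t: (wait_cycles[t], t))


def retune(levels: Dict[int, int], wait_cycles: Dict[int, int]) -> Dict[int, int]:
    """Raise the limiter's aggressiveness by one level and lower the other threads by one."""
    limiter = estimate_limiter(wait_cycles)
    new_levels = {}
    for thread, level in levels.items():
        step = 1 if thread == limiter else -1
        new_levels[thread] = max(MIN_LEVEL, min(MAX_LEVEL, level + step))
    return new_levels


if __name__ == "__main__":
    levels = {0: 3, 1: 3, 2: 1}
    waits = {0: 500, 1: 20, 2: 900}   # illustrative cycle counts
    print(estimate_limiter(waits))
    print(retune(levels, waits))
```

Lowering every non-limiter thread is a simplification; a less blunt policy would only throttle threads whose waiting time shows they have slack.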
System Performance
Figures 12 and 13 illustrate the influence of prefetching on multicore system performance by showing the execution time normalized to the result with no prefetching. We evaluate the system performance of the proposed techniques (described in Section 3, namely TCD, APF, and LTA) separately applied on the baseline MDP, as well as the performance of TAP, which simultaneously uses the three proposed techniques to get maximum improvement. Figure 12 compares the normalized execution time among TCD, APF, and LTA. Taking threads' runtime behavior into consideration, TCD reduces the conflicts among threads and improves the prefetch accuracy in applications such as canneal, dedup, and ferret. As we can see, the TCD method increases the proportion of MPHA threads and decreases the percentage of MPLA threads for these applications. With higher prefetching accuracy, system performance increases. APF, which benefits from the attacking prefetching filter and cache replacement mechanism, reduces the negative impact of prefetching-caused invalidations. The LTA mechanism accelerates the limiter threads and reduces the threads' waiting time for synchronization. On average, TCD, APF, and LTA improve performance by 13%, 10%, and 9%, respectively. Figure 13 compares the normalized execution time between MDP, TAP, and FDP. The results show that the use of the multimode prefetcher on the shared memory multicore system can improve system performance for some applications while it decreases system performance for others. The average speedup using the multimode prefetcher is 5%. TAP gathers the benefits from the proposed techniques (TCD, APF, and LTA) and improves performance by 18%, while FDP improves performance by 9% compared to the no-prefetching system.
Energy Consumption
The introduction of a prefetch engine increases the memory bandwidth consumption as well as prefetching-caused cache-related operations, thus inevitably increasing the energy consumption of the whole system. Figure 14 shows the system energy consumption; the results are normalized to that with no prefetching. On average, MDP, TAP, and FDP increase the energy consumption by 6%, 4%, and 5%, respectively. The energy consumption overhead is acceptable, since it brings notable improvement of system performance. When using the TAP approach, the number of useless prefetching requests is minimized, and this reduces unnecessary energy consumption.
Effect of TAP with HPAC Mechanism
As mentioned in Section 4, we have implemented and evaluated our mechanism with the HPAC mechanism. Figure 15 shows the performance of TAP and FDP when employed in a system with the HPAC mechanism. The FDP mechanism lacks a way to control prefetching-caused interference between cores; when it is combined with the HPAC mechanism, the system performance is improved by 10% ("FDP" vs. "FDP with HPAC"), as the HPAC mechanism takes advantage of hierarchical prefetcher aggressiveness control structures to maximize the benefits of prefetching on each core. The TAP mechanism mainly focuses on interthread interference, while the HPAC mechanism takes memory bandwidth consumption into consideration; thus the HPAC mechanism increases the performance of a system that uses TAP by 4% ("TAP" vs. "TAP with HPAC"). Although there is no global control structure in TAP, the attacking prefetch filter is proposed to avoid prefetching-caused interthread invalidations. As a result, HPAC using TAP locally improves system performance by 3% over HPAC using FDP locally ("FDP with HPAC" vs. "TAP with HPAC").
Impact on Memory Bandwidth Consumption
Prefetching can adversely affect the memory bandwidth consumption when prefetches are not used or when they cause cache pollution. Figure 16 shows the memory bandwidth impact of prefetching. The results are normalized to the system with no prefetching. Since the proposed TAP mechanism reduces the number of demand cache misses caused by interthread invalidations, it consumes 5% less memory bandwidth than FDP. Compared to a system only using TAP, a system using TAP with HPAC mechanism can decrease the memory bandwidth consumption by 1%, as HPAC has a global control mechanism which can allow or override the decision of a local prefetcher to optimize overall system performance.
Effect of Cache Partitioning Mechanism
The cache partitioning mechanism [Chang and Sohi 2007; Qureshi and Patt 2006] is an effective method to improve shared cache performance. For multiprogrammed workloads, cache conflicts are caused by the limited cache capacity. Since different programs derive different utility from cache resources, the cache partitioning mechanism can intelligently decide the number of cache ways allocated to each core and adjust the partitioning according to runtime information to improve system performance. However, for multithreaded workloads, multiple threads share data among themselves, and thus they cannot be treated independently.
We implement the cache partitioning mechanism [Qureshi and Patt 2006] in our multithreaded system and Figure 17 shows the system performance of the cache partitioning mechanism. As can be seen, there is no significant effect of this mechanism for multithreaded workloads. This is one of the reasons why we propose the TAP mechanism for shared-memory multicore systems running multithreaded workloads.
Distribution of Memory Requests
For a system with prefetch engines, its memory requests usually consist of two parts, demand requests and prefetch requests. A part of the demand requests is caused by the invalidation due to prefetches (D-prefetch), and the others are normally issued during the runtime (D-normal). Furthermore, it is natural to divide the prefetch requests into three categories: accurate prefetch requests that are on time (P-ontime), accurate but late prefetch requests (P-late), and inaccurate prefetch requests (P-miss). In order to get further insight into the effectiveness of the TAP mechanism, we analyze all kinds of memory requests and draw a breakdown figure, as shown in Figure 18. The FDP results are on the left bars, while the TAP results are on the right bars; all of the results are normalized to that with no prefetching. TAP utilizes the attacking prefetch filter mechanism to avoid prefetching-caused interthread invalidations; compared to FDP, the percentage of D-prefetch is reduced. On the other hand, taking advantage of the prefetching-aware thread classification mechanism, TAP can adaptively adjust the prefetch engines for multithreaded workloads at runtime; thus it enhances prefetching accuracy (P-ontime) and reduces useless prefetches.
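The five request categories can be expressed as a small classification routine. The `Request` record and its boolean fields are hypothetical stand-ins for whatever a simulator would log per request; only the category names come from the text.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class Request:
    is_prefetch: bool
    used: bool = False                    # prefetch: the block was later demanded
    arrived_before_demand: bool = False   # prefetch: data arrived before that demand
    caused_by_prefetch_invalidation: bool = False  # demand: miss follows a prefetch-caused invalidation


def categorize(req: Request) -> str:
    """Map one memory request to D-prefetch, D-normal, P-ontime, P-late, or P-miss."""
    if req.is_prefetch:
        if not req.used:
            return "P-miss"
        return "P-ontime" if req.arrived_before_demand else "P-late"
    return "D-prefetch" if req.caused_by_prefetch_invalidation else "D-normal"


def breakdown(requests: Iterable[Request]) -> Dict[str, float]:
    """Return the fraction of requests in each category that actually occurs."""
    counts = Counter(categorize(r) for r in requests)
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


if __name__ == "__main__":
    trace = [
        Request(is_prefetch=True, used=True, arrived_before_demand=True),
        Request(is_prefetch=True, used=True),
        Request(is_prefetch=True),
        Request(is_prefetch=False, caused_by_prefetch_invalidation=True),
        Request(is_prefetch=False),
    ]
    print(breakdown(trace))
```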
Sensitivity Study
We evaluate the sensitivity of TAP to different L2 cache sizes and memory latencies. Table VII shows the change in average IPC (instructions per cycle) and MBC (memory bandwidth consumption) provided by TAP over FDP for each configuration. By taking interthread conflicts and interactions into account, the TAP mechanism cuts down the invalidations caused by prefetching requests and reduces the execution time by accelerating the limiter threads. Thus, TAP provides better performance and consumes less memory bandwidth than FDP for all evaluated cache sizes. On the other hand, as memory latency increases, the MBC improvement of TAP also increases. This is because the bandwidth-saving effect of TAP becomes more important when memory becomes a larger performance bottleneck.
Furthermore, we study the sensitivity to the number of threads. Table VIII shows the change in average execution time when the number of threads changes. The results show that as the number of threads grows, the performance improvement of TAP over FDP increases. In a system with a smaller number of threads, there is less contention for shared resources, and thus the proposed TAP mechanism has less potential for improving performance. In a system with a larger number of threads, there are more interthread conflicts resulting in performance degradation, and thus the TAP mechanism can provide more performance improvement over the FDP mechanism.
Hardware Cost
Table IX shows the storage overhead of the proposed TAP mechanism in terms of the required state. The additional storage is 20.2KB for a 64-node system (one processor per node). This storage overhead of our mechanism is less than 0.13% of the data-store size of the baseline 16MB L2 cache. Meanwhile, combinational logic is required for the update of counters, the update of the flag bits in cache and MSHR, the update of the bit-vectors for limiter thread observation, calculation of classifying metrics at the end of each sampling interval, termination of the cache replacement of the attacking prefetch, insertion of prefetched blocks into appropriate locations in the LRU stack, and the tuning of prefetching aggressiveness. The complexity of the required combinational logic is not significant, and none of the required logic is on the critical path of the processor.
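The storage overhead quoted from Table IX can be checked with a short calculation, assuming binary units (1 KB = 1024 bytes, 1 MB = 1024 KB); the function name is illustrative.

```python
KB = 1024
MB = 1024 * KB


def storage_overhead_percent(extra_bytes: float, cache_bytes: float) -> float:
    """Additional storage as a percentage of the cache data-store size."""
    return 100.0 * extra_bytes / cache_bytes


# 20.2KB of additional state relative to the 16MB L2 data store.
overhead = storage_overhead_percent(20.2 * KB, 16 * MB)

if __name__ == "__main__":
    print(f"{overhead:.3f}%")
```

The result is about 0.123%, which agrees with the stated bound of less than 0.13%.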
RELATED WORK
The key idea of hardware data prefetchers is to use a special hardware unit to detect data address sequences and find patterns. Based on these patterns, future data addresses can be predicted. Many hardware data prefetching techniques for sequential address patterns have been proposed, such as sequential prefetching [Jouppi 1990], nonunit stride address sequence prefetching [Palacharla and Kessler 1994], and multistride prefetching [Fu et al. 1992; Iacobovici et al. 2004]. Techniques for prefetching chained address patterns have also been proposed, such as dependence-based prefetching [Roth et al. 1998], the CDP method [Cooksey et al. 2002], and so on. Moreover, prior works [Joseph and Grunwald 1997; ] have studied the prefetchers for irregular patterns. Irregular Stream Buffer (ISB) [Jain and Lin 2013] introduces an extra level of indirection to create a new structural address space in which correlated physical addresses are assigned consecutive structural addresses, and it produces benefits in terms of coverage, accuracy, and memory traffic overhead. Chen et al. [2012] proposed an algorithm-level feedback-controlled adaptive data prefetcher, which was based on a specifically designed hardware structure, to provide an algorithm-level adaptation to applications' access patterns. The above prefetching techniques can significantly improve performance for different types of applications on single-core processors. In multicore systems, however, the introduction of prefetching will increase conflicts and competition for shared resources, and prefetching in a multicore system without efficient management can lead to performance deterioration.
To relieve the impact of prefetching engines on shared resource contention, Dahlgren et al. [1993] proposed adaptive prefetching for shared-memory multiprocessors by dynamically configuring aggressiveness for each core according to prefetching accuracy. Jerger et al. [2006] observed prefetching-caused shared cache invalidations for multithreaded applications, and they presented a taxonomy that classifies the effects of multiprocessor prefetches as well as a characterization of the effects of different prefetching schemes. Srinath et al. [2007] considered the influence of a prefetching engine on memory bandwidth consumption and proposed an approach based on feedback information to direct the adjustment of prefetching dynamically, achieving the goal of reducing prefetching-caused bandwidth consumption. Huang et al. [2012] analyzed the effect of increasing prefetch distance on shared cache pollution and reduced the shared cache pollution by controlling prefetch distance. An adaptive prefetching scheme [Jiménez et al. 2012] in POWER7 is proposed to dynamically modify the prefetch settings according to the requirements of single-threaded and multiprogrammed workloads in order to improve performance and power consumption. In our article, we focus on how an uncached prefetch will invoke a costly interthread invalidation and propose the APF mechanism to address this problem. Ebrahimi et al. [2009b] proposed an HPAC mechanism as a prefetcher throttling solution to improve prefetching performance in CMP systems. By gathering global feedback information about the effect of each core's prefetcher, the HPAC mechanism can dynamically adjust the aggressiveness of each prefetcher in two ways: local and global. The global mechanism can override the local decision by taking into account effects and interactions of different cores' prefetchers when adjusting each one's aggressiveness. HPAC aimed to reduce the cache pollution caused by evictions of prefetches. 
In our TAP mechanism, by contrast, we focus on the invalidation of shared cache lines and use an attacking prefetch filter to reduce interthread influences. Furthermore, HPAC tracked memory bandwidth consumption and memory access delay time of each core to make global decisions, whereas in TAP, a limiter thread acceleration mechanism is used to tune up limiter threads' prefetching aggressiveness to achieve potential improvement and turn down faster threads' prefetching aggressiveness to avoid unnecessary memory bandwidth consumption. This mechanism is more suitable for multithreaded applications to optimize overall system performance.
By exploiting the idea of "prefetch-aware," Ebrahimi et al. [2011a] and Lee et al. [2008] took the existence of prefetching and its accuracy into consideration when dealing with data requests and improved the effectiveness of memory scheduling. The "thread-aware" idea was also explored by Lee et al. [2010]: They proposed GPGPU-specific software and hardware prefetching mechanisms and developed a prefetch throttling mechanism that dynamically adjusts the level of prefetching to avoid performance degradation. Somogyi et al. [2006] proposed Spatial Memory Streaming (SMS) to predict code-correlated spatial access patterns and stream predicted blocks to the primary cache ahead of demand misses. In their subsequent studies, they observed that the order of spatial accesses repeated both within and across regions and proposed Spatio-Temporal Memory Streaming (STeMS) [Somogyi et al. 2009] to exploit the synergy between spatial and temporal streaming. The combination of temporal and spatial predictions into a single total predicted sequence prevented the predictors from interfering with each other and enhanced lookahead for spatial accesses.
PACMan [Wu et al. 2011] aimed to mitigate the degree of prefetch-induced LLC pollution. Based on dynamic prediction of the rereference behavior of prefetch requests, PACMan treated demand and prefetch requests differently and employed a prefetch-aware cache insertion policy. B-Fetch [Kadjo et al. 2014] was a data prefetcher driven by branch prediction and effective address value speculation. Based on a speculative control flow path from a branch predictor, B-Fetch computes the effective address of the load instructions along that path based on a history of past register transformations and uses a per-load filtering mechanism to reduce cache pollution. Sandbox Prefetching [Pugsley et al. 2014] combined the ideas of global pattern confirmation and immediate prefetching action to achieve high performance. It tracked candidate prefetch patterns at runtime by adding the prefetch addresses to a Bloom filter rather than actually fetching the data into the cache. A candidate prefetcher may be globally activated to immediately perform a prefetch action when the accuracy of evaluated prefetchers exceeds a threshold. Seshadri et al. observed that over 95% of useful prefetches are not reused after the first demand hit and proposed the Informed Caching policies for Prefetched blocks (ICP) mechanism [Seshadri et al. 2015] based on this observation. The ICP mechanism consisted of two components: ICP-D demotes a prefetched block to the lowest priority on a demand hit, and ICP-AP predicts the accuracy of prefetch requests and only predicted-accurate prefetches are inserted into the cache with a high priority. In our article, we assume that the runtime behavior of each individual thread will not change dramatically in a short time. Thus our mechanism predicts the prefetch accuracy as the value counted by the statistical model in the most recent interval.
We focus on how to reduce prefetching-caused resource contention in multithreaded applications. The work in this article differs from the above in that we identified a new problem: prefetching can cause interthread invalidations in multithreaded applications when the LLC is shared under MESI-based cache coherence protocols. We provide the thread-aware adaptive data prefetcher as the solution to this problem. In addition, instead of directing prefetchers based on the characteristics of each core, our approach directs each prefetcher based on the classification of running threads. The proposed method can effectively adjust prefetchers dynamically according to the real-time status of each thread, and estimate and accelerate the critical threads according to the properties of multithreaded applications.
CONCLUSION
This article presents a thread-aware adaptive prefetching engine for multicore systems. Compared to existing multicore prefetchers, our prefetching engine can reduce interthread shared resource contention on the shared LLC and off-chip memory. Our approach combines three ideas: (1) It takes the threads' runtime memory requests and prefetching efficiency into consideration and makes decisions to relieve the interthread contention on the shared LLC and memory system, (2) it filters attacking prefetching requests if they would invalidate other threads' shared data while avoiding the break of prefetching streams, and (3) it uses a limiter thread acceleration mechanism to estimate and accelerate the threads that are on the critical paths of execution. Our approach reduces shared resource contention, brings energy efficiency, and improves system performance noticeably.
