Modern processor micro-architecture offers advanced prefetch mechanisms that are designed to hide memory latency and improve application performance. However, pointer-chasing applications that employ linked data structures expose a memory latency problem that is difficult for hardware prefetchers to address. Helper threaded prefetching on a chip multiprocessor is a promising method for reducing the memory latency of accesses to linked data structures. In this paper, we first describe two L2 prefetchers on a chip multiprocessor and two different helper threaded prefetching techniques for pointer-chasing applications. Then, we reveal the limitations of L2 prefetchers for pointer-intensive applications after applying the two threaded prefetching techniques. Finally, we optimize the deployment of L2 prefetchers with the two threaded prefetching techniques for pointer-chasing applications. The experimental results indicate that the effectiveness of L2 prefetchers on helper threads depends on the memory access pattern of the targeted applications, and that the optimized deployment of L2 prefetchers further improves the performance of pointer-intensive applications.
Introduction
The long latency of memory accesses is one of the major performance bottlenecks of modern computing platforms [1]. Cache behavior strongly affects system performance because the cache bridges the speed gap between the processor and main memory. To tolerate memory access latency, there have been a plethora of proposals for data prefetching [2] [3] [4] [5] [6] [7] [8] [9] [10]. Data prefetching techniques improve performance by predicting future memory accesses and fetching the data into the cache before they are accessed, so that memory access latency is hidden. More recently, processor manufacturers have integrated hardware prefetchers into their processors that bring data and instructions into the unified L2 cache before they are explicitly requested. Many research works [4, 7, 8, 10] have focused on hardware prefetchers and their performance impact on the memory system. However, pointer-based linked data structures abound in applications, and accesses to them are difficult for hardware prefetchers to predict.
On the other hand, with the advent of chip multiprocessor (CMP) architectures, thread-based prefetching and speculative execution techniques have received much attention in the research community [2, 6, 9]. One novel method, called helper threaded prefetching, uses a helper thread to boost the performance of the main thread by prefetching data into the cache. Unlike traditional multi-threading, where every thread must be executed and committed to guarantee the correctness of program execution, helper threads affect only the performance of the application. For pointer-chasing applications with linked data structures, in which data accesses cannot be predicted statically (or only with great difficulty), threaded prefetching on a multi-core system is an effective method for reducing memory access latency.
Helper threads have the potential to improve the performance of programs that use linked data structures (LDS); however, helper threaded prefetching techniques sometimes degrade performance for several reasons: 1) contention for the shared cache increases the number of accesses to off-chip main memory; 2) the helper thread contends for memory bandwidth, which introduces queuing delays for memory accesses; 3) when employed together with L2 cache prefetchers, helper threads are likely to cause cache pollution and more resource contention in the memory system. The accumulated contention in the cache and on the memory bus degrades bandwidth efficiency and lengthens execution time.
As hardware moves towards integrated memory controllers and higher bandwidth, selecting prefetchers properly will play a significant role in optimizing system performance. Recent works [11] [12] [13] [14] on performance degradation in multi-core systems focused on contention for cache space and bus bandwidth when applications share the last level cache (LLC). It is well known that many factors contribute to performance degradation when threads share an LLC. Shared system resources, such as the L2 cache, memory bus, memory controllers, and prefetching hardware, all play an important role. Therefore, a decision to enable or disable hardware prefetch mechanisms should be made to avoid resource contention.
In this paper, we focus on analyzing the performance effect of L2 hardware prefetchers on pointer-chasing applications before and after applying helper threaded prefetching. To summarize, this paper consists of the following contributions:
First, we illustrate the performance impact of L2 prefetchers on the original pointer-intensive applications by comparing their performance under different deployments of L2 prefetchers.
Second, we examine the effect of L2 prefetchers on two threaded prefetching techniques for pointer-chasing applications.
Third, we optimize the deployment of L2 prefetchers for the two threaded prefetching techniques and show experimentally that performance improves.
The rest of the paper is organized as follows: in Section 2, we state the related work. The analysis results of the original applications are presented in Section 3. In Section 4, we describe the experimental methodology. Experimental results for applications with helper threads are analyzed in Section 5. Finally, in Section 6, conclusions are drawn and future work is discussed.
Related Work
Hardware prefetching techniques and thread-based prefetching techniques are two effective methods to hide memory access latency. Hardware prefetching techniques capture and save the history of cache misses using added hardware components and then analyze this history to guide prefetches. Chen and Baer [3] presented a design for a hardware-based prefetching scheme to reduce the data access penalty, which was preliminary work for hardware prefetching. Gendler et al. [4] introduced a Prefetcher Assessment Buffer (PAB) feedback mechanism to filter out requests that are unlikely to be useful, so that applications that cannot benefit from aggressive prefetching do not suffer from its side effects. Lee et al. [6] used active threaded prefetching based on a new semaphore synchronization mechanism to hide memory access latency.
Herdrich et al. [5] proposed cache monitoring technology to enable monitoring of shared cache usage by different applications, which was implemented in software infrastructure for cache QoS needs. Wang and Martinez [10] indicated that efficiently allocating shared resources in computer systems is critical to optimizing execution and that hardware resource sharing is a key challenge for the upcoming manycore generation. Lee et al. [7] investigated the interactions between prefetchers and on-chip networks, exploiting the synergy of these two components in multi-cores to reduce the traffic generated by the prefetchers. Blanche and Lundqvist [11] evaluated the prediction accuracy and co-scheduling performance of four state-of-the-art characterization methods for memory aware (co-) scheduling. Zhang [14] proposed a load balancing task scheduling algorithm based on weighted random and feedback mechanisms to eliminate system bottlenecks and balance loads dynamically. Kaur [12] studied and explored several Directed Acyclic Graph (DAG) based task scheduling algorithms on a unified basis by using various scheduling parameters.
Revealing Potential Performance Improvements of L2 Prefetchers on Helper Thread for Pointer-Chasing Applications

L2 Cache Prefetchers in the Intel Multi-Core Micro-Architecture
Modern processor micro-architecture offers advanced prefetch mechanisms that are designed to effectively hide memory latency and improve application performance. For example, processors based on the Intel Core micro-architecture expose two different L2 prefetchers, the Data Prefetch Logic (DPL) and L2 Streaming Prefetch, which are described as follows.
Data Prefetch Logic (DPL):
The hardware prefetcher operates transparently to fetch streams of data and instructions from memory into the unified second-level cache. DPL is able to detect more complicated access patterns, even when the program skips access to a certain number of cache lines. The prefetcher is capable of handling multiple streams in either the forward or backward direction. It is triggered when successive cache misses occur in the last-level cache and a stride in the access pattern is detected, such as in the case of loop iterations that access array elements.
L2 Streaming Prefetch (ACL):
When the L2 streaming prefetcher (i.e., the Adjacent Cache-Line Prefetch mechanism) is enabled through the BIOS, two 64-byte cache lines are fetched into a 128-byte sector, regardless of whether the additional cache line has been requested or not. When it is disabled, an Intel Core 2 Quad processor-based system fetches only 64 bytes. The other 64 bytes of the sector in the last-level cache are not used unless the application explicitly issues a load to that address.
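The two mechanisms described above can be illustrated with a minimal Python sketch. It is a toy model and not the processor's actual logic: the function names `acl_buddy_line` and `detect_constant_stride`, the `confirmations` parameter, and the example addresses are all hypothetical. It only shows why a regular array walk exposes a stride that a DPL-like detector can pick up, whereas a pointer chase over scattered nodes does not, and how the second line of a 128-byte sector is located.

```python
LINE_BYTES = 64  # cache-line size stated in the text; two lines form a 128-byte sector


def acl_buddy_line(addr: int) -> int:
    """Base address of the other 64-byte line in the 128-byte sector holding addr."""
    line_base = addr & ~(LINE_BYTES - 1)  # align down to the line boundary
    return line_base ^ LINE_BYTES         # flip the bit selecting the sector half


def detect_constant_stride(miss_addrs, confirmations=2):
    """Toy stride detector (hypothetical, not the real DPL algorithm).

    Returns the stride in bytes once the same non-zero delta between
    successive misses has been seen `confirmations` times in a row,
    otherwise None.
    """
    last_delta = None
    streak = 0
    for prev, cur in zip(miss_addrs, miss_addrs[1:]):
        delta = cur - prev
        if delta != 0 and delta == last_delta:
            streak += 1
        else:
            streak = 1 if delta != 0 else 0
        last_delta = delta
        if streak >= confirmations:
            return delta
    return None


array_walk = [0, 64, 128, 192]                   # misses of a sequential array scan
pointer_chase = [0x1000, 0x9F40, 0x2380, 0x7C00]  # made-up scattered node addresses

print(detect_constant_stride(array_walk))     # a stride is found
print(detect_constant_stride(pointer_chase))  # no stride is found
print(hex(acl_buddy_line(0x1047)))            # buddy line of the line holding 0x1047
```

In this model the detector also accepts negative deltas, mirroring the statement that streams can be followed in either the forward or backward direction.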
The architecture of Intel Core 2 Quad processors with two L2 prefetchers is shown in Figure 1. The advantage of L2 hardware prefetchers is that the data for a prefetched cache line starts to be loaded while the previous cache line is still being processed. Applications are allowed to select which prefetcher they wish to use. In practice, the default prefetch setting increases the success rate of the cache subsystem for many workloads; unfortunately, it sometimes leads to the opposite result. Careful consideration should be given to memory-intensive applications before enabling or disabling these mechanisms.
Impact Analysis of L2 Prefetchers on Threaded Prefetching
To test the impact of the L2 prefetchers on helper threads, we implement two kinds of helper threaded prefetching techniques, SP (explained in our earlier work [15]) and PV (as described in [6]), on the Intel Core micro-architecture.
The result in Figure 2 provides an insight into the key factor that affects the applications' performance. When the two helper threaded prefetching techniques are applied, their bandwidth consumption (i.e., memory accesses that miss in L2, including partial hits, where the demanded data arrive in L2 before the memory request is serviced, and total misses) is an important indicator of their performance. Some applications gain less performance improvement from threaded prefetching because they waste memory bandwidth, which can reduce memory throughput and increase access latency as a result of bandwidth contention. Through further analysis, we found that bus bandwidth becomes a bottleneck when helper threads are applied to these memory-intensive applications.
L2 prefetchers on the Intel Core micro-architecture may have a negative impact on performance when combined with threaded prefetching. To verify this impact, we test the performance of applications with threaded prefetching under three conditions: PF_ON (both the DPL and ACL prefetchers enabled), ACL_OFF (the DPL prefetcher enabled and the ACL prefetcher switched off), and PF_OFF (both prefetchers switched off). For simplicity, we plot only the results of PF_ON and ACL_OFF normalized to that of PF_OFF. Comparing the results with PF_ON and ACL_OFF in Figure 3, we found that the execution times of these applications for both SP and PV are either reduced or unchanged when the DPL prefetcher is enabled and the ACL prefetcher is switched off. Specifically, for both SP and PV, the run times of em3d, mcf, and mst decrease noticeably when the L2 prefetcher deployment changes from PF_ON to ACL_OFF. This suggests that the ACL prefetcher has a negative performance effect on these three applications. In contrast, the run times of the other three applications change little, implying that the ACL prefetcher has a negligible performance effect on them. In addition, when running with PF_ON and ACL_OFF, the run times of em3d and mst increase, while those of the other four programs decrease relative to PF_OFF. This observation indicates that the DPL prefetcher has a negative impact on em3d and mst but a positive impact on the other four applications. Finally, we found that the two L2 prefetchers have the same performance effect (good or bad) on SP and PV as on the original execution. After applying helper threads, the performance effect of L2 prefetchers increases on em3d and mst but decreases on health.
From Figure 4, we can see that for both threaded prefetching techniques, the LLC misses of these applications (except em3d) with PF_ON and ACL_OFF are lower than those with PF_OFF. This illustrates that L2 prefetchers are still effective in reducing the LLC misses of some pointer-intensive applications after helper threads are applied.
However, when the L2 prefetcher deployment changes from PF_ON to ACL_OFF, SP increases the LLC misses of em3d, tsp, gcc, and health, while PV decreases those of em3d and mcf but increases those of tsp and gcc. This result indicates that the impact of L2 prefetchers on LLC misses varies across threaded prefetching techniques.
It can be seen in Figure 5 that the L2 prefetchers have a clear impact on memory accesses for these applications with both SP and PV. The normalized results show that both L2 prefetchers noticeably increase bus bandwidth consumption for all these applications except tsp. The growth of memory accesses is particularly evident for em3d, mst, and gcc. Moreover, with the DPL prefetcher enabled, memory accesses of these applications (except tsp) decrease when the ACL prefetcher is switched off, which shows that an enabled ACL prefetcher aggravates the pressure on the system bus. This explains the performance improvement of ACL_OFF relative to PF_ON in Figure 3. The exception is tsp, for which L2 prefetchers have no effect on memory access events. The reason is that memory accesses in tsp are triggered alternately by normal demands, L2 prefetchers, and the helper thread as the L2 prefetchers' configuration changes. Furthermore, we found that the two L2 prefetchers have a similar effect on memory accesses for SP and PV as for the original execution.
Besides their impact on the main thread of pointer-intensive applications, L2 prefetchers could influence the helper thread in two ways. First, L2 prefetchers may issue prefetches before the helper thread does, which affects the helper thread's ability to hide LLC misses of hot functions. Second, the extra prefetches from L2 prefetchers can overflow the capacity for outstanding misses. Although this influence depends on the presence of outstanding requests and the request handling capacity, it is already reflected in the performance change of the main thread above. Thus, we exclude this experimental result from this paper.
From the experimental results, we can draw several conclusions. First, although the two threaded prefetching techniques behave differently from L2 prefetchers, the performance impact of L2 prefetchers on them is similar. Second, in general, L2 prefetchers have a similar performance effect on applications with helper threads as on their original execution. Third, the DPL prefetcher exerts a greater effect than the ACL prefetcher on pointer-intensive applications because of the dynamically allocated memory of LDS. Finally, although L2 prefetchers considerably increase the memory accesses of most applications, they produce performance degradation only for some of them. Where degradation occurs, there are two major reasons. First, the ACL prefetcher is bound to launch useless prefetches, whose data are never accessed or are evicted before being accessed, since most pointer-intensive applications have poor spatial locality. These useless prefetches increase memory accesses, as in the result for em3d. Second, memory access requests from L2 prefetchers may be dropped on the way to memory because of heavy bus traffic. These discarded prefetches neither hide LLC misses nor improve performance. In fact, L2 prefetchers may have a negative effect even when they reduce LLC misses, because competition for shared resources increases the timing penalty of memory service, as in the result for mst.
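The link between poor spatial locality and useless adjacent-line prefetches can be made concrete with a small Python sketch. It is a deliberately simplified model under stated assumptions: the cache is infinite (so the "evicted before being accessed" case is not modeled), every demand miss also fetches the buddy line of its 128-byte sector, and the function name `acl_prefetch_usefulness` and both traces are hypothetical.

```python
LINE_BYTES = 64  # cache-line size stated in the text


def acl_prefetch_usefulness(access_addrs):
    """Return (useful, useless) buddy-line prefetch counts for a demand trace.

    Assumptions of this toy model: infinite cache, one buddy-line prefetch
    per demand miss, and a prefetch counts as useful only if its line is
    demanded later in the trace.
    """
    cached = set()    # line numbers currently in the (infinite) cache
    pending = set()   # prefetched lines that have not been demanded yet
    useful = 0
    issued = 0
    for addr in access_addrs:
        line = addr // LINE_BYTES
        if line in pending:          # first demand access to a prefetched line
            pending.remove(line)
            useful += 1
        if line not in cached:       # demand miss: fetch the line and its buddy
            cached.add(line)
            buddy = line ^ 1         # the other line of the same 128-byte sector
            if buddy not in cached:
                cached.add(buddy)
                pending.add(buddy)
                issued += 1
    return useful, issued - useful


sequential = [0, 64, 128, 192]    # array-like walk with good spatial locality
scattered = [0, 640, 1280, 1920]  # made-up node addresses in distant sectors

print(acl_prefetch_usefulness(sequential))  # every buddy line is later demanded
print(acl_prefetch_usefulness(scattered))   # every buddy line is wasted traffic
```

In the scattered trace each miss doubles the traffic on the bus without hiding any later miss, which is the behavior the paragraph above attributes to the ACL prefetcher on pointer-intensive code.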
Experiments
Experimental Setup
We evaluate the deployment of L2 prefetchers with threaded prefetching on a real physical system. We use a system with an Intel Core 2 Quad processor that contains two Core 2 Duo E6600 processors. Intel Core 2 Quad processors provide two L2 prefetchers, Data Prefetch Logic (DPL) and L2 Streaming Prefetch (ACL), as described above. In each dual-core processor, both prefetchers are shared dynamically between the two cores. In the bus queue, requests from L2 prefetchers have lower priority than normal cache-miss requests and may be ignored or canceled under heavy bus traffic. The default behavior of the L2 prefetchers can be modified by changing the contents of the model-specific register (MSR) IA32_MISC_ENABLE. The L1 cache prefetchers in both cores remain enabled, as in the default configuration, throughout our experiments.
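As a rough illustration of how the three deployments could be encoded in IA32_MISC_ENABLE, the Python sketch below only computes register values and does not touch hardware. The bit positions (bit 9 as the hardware prefetcher disable and bit 19 as the adjacent cache-line prefetch disable) are assumptions taken from common descriptions of Core-family MSRs and should be checked against the Intel manuals for the exact processor. The mapping of PF_ON, ACL_OFF, and PF_OFF to enabled prefetchers follows the way the text uses these names, and `apply_config` is a hypothetical helper.

```python
# Assumed bit positions in IA32_MISC_ENABLE; verify against the Intel SDM.
HW_PREFETCH_DISABLE_BIT = 9    # assumed to disable the DPL prefetcher when set
ACL_PREFETCH_DISABLE_BIT = 19  # assumed to disable adjacent-line prefetch when set

# Configuration name -> (DPL enabled, ACL enabled), as the names are used in the text.
CONFIGS = {
    "PF_ON": (True, True),
    "ACL_OFF": (True, False),
    "PF_OFF": (False, False),
}


def apply_config(msr_value: int, config: str) -> int:
    """Return msr_value with the two disable bits set or cleared for config.

    All other bits are preserved. Writing the result back to the register
    (which requires privileged access) is outside the scope of this sketch.
    """
    dpl_on, acl_on = CONFIGS[config]
    for bit, enabled in ((HW_PREFETCH_DISABLE_BIT, dpl_on),
                         (ACL_PREFETCH_DISABLE_BIT, acl_on)):
        mask = 1 << bit
        if enabled:
            msr_value &= ~mask   # clear the disable bit: prefetcher on
        else:
            msr_value |= mask    # set the disable bit: prefetcher off
    return msr_value


for name in CONFIGS:
    print(name, hex(apply_config(0, name)))
```

Keeping the configuration table separate from the bit manipulation makes it easy to correct the assumed bit positions without touching the rest of the code.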
For our evaluations, we use six pointer-intensive workloads from the SPEC CPU 2006 and Olden benchmark suites. We classify a benchmark as pointer-intensive if it has hot loops with LDS accesses and these hot loops contribute at least 20% of the CPU_CLK_UNHALTED consumption. All benchmarks were compiled using the GCC 4.1 compiler with the "-O2" option. Table 1 shows the descriptions of these applications, including the input set and the percentage of the exposed memory latency of the targeted loops over the entire CPU_CLK_UNHALTED (column CLK%).
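The selection rule for pointer-intensive benchmarks can be written as a one-line check. The Python sketch below is illustrative only: the function name `is_pointer_intensive` and the cycle counts in the usage lines are hypothetical, and the inputs stand for CPU_CLK_UNHALTED counts attributed to hot loops with LDS accesses and to the whole run.

```python
def is_pointer_intensive(hot_lds_loop_cycles, total_cycles, threshold=0.20):
    """True if hot loops with LDS accesses account for at least `threshold`
    of the total CPU_CLK_UNHALTED count (20% in the text)."""
    if total_cycles <= 0:
        raise ValueError("total_cycles must be positive")
    return sum(hot_lds_loop_cycles) / total_cycles >= threshold


# Hypothetical profiles: per-loop cycle counts and the total for the run.
print(is_pointer_intensive([150, 100], 1000))  # hot loops take 25% of the cycles
print(is_pointer_intensive([100, 50], 1000))   # hot loops take 15% of the cycles
```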
Evaluation Metrics
We use three metrics to evaluate the effect of L2 prefetchers on different threaded prefetching techniques: accuracy, coverage, and timeliness. Achieving the right balance among these three metrics is an important design consideration for all prefetching techniques. The accuracy, coverage, and timeliness of helper threaded prefetching were measured as described in [16]. In this section, the effectiveness of threaded prefetching is measured relative to the original applications with PF_ON, and the effect of L2 prefetchers on the effectiveness of threaded prefetching is examined under the three conditions described above.
Figure 6 and Figure 7 show the accuracy and coverage of prefetching in PV and SP. Comparing the results for SP and PV, we found that the accuracy and coverage of SP and PV follow similar trends when the L2 prefetchers are switched from PF_ON to ACL_OFF and PF_OFF. Specifically, the accuracy of threaded prefetching increases for EM3D, MST, and GCC, while it decreases for TSP. The accuracy of threaded prefetching for MCF increases when the ACL prefetcher is switched off and the DPL prefetcher is enabled; in contrast, the accuracy of threaded prefetching for HEALTH decreases when the DPL prefetcher is switched off and the ACL prefetcher is enabled. On the other hand, the coverage of threaded prefetching decreases for TSP and GCC and shows little change for EM3D, MCF, and MST. The coverage of threaded prefetching for HEALTH decreases when the DPL prefetcher is switched off and the ACL prefetcher is enabled. The outcomes in Figure 6 and Figure 7 indicate that the accuracy and coverage of prefetching change differently for different applications because of the interference of L2 prefetchers.
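For readers unfamiliar with these metrics, the Python sketch below computes accuracy and coverage using their commonly used definitions. These are assumptions for illustration: the exact formulas of [16] are not reproduced here and may differ, and the function names and counter values are hypothetical.

```python
def prefetch_accuracy(useful_prefetches: int, issued_prefetches: int) -> float:
    """Fraction of issued prefetches whose data is later used by the main thread."""
    return useful_prefetches / issued_prefetches if issued_prefetches else 0.0


def prefetch_coverage(useful_prefetches: int, remaining_misses: int) -> float:
    """Fraction of the misses that would occur without prefetching
    (useful prefetches plus misses that still remain) removed by prefetching."""
    baseline_misses = useful_prefetches + remaining_misses
    return useful_prefetches / baseline_misses if baseline_misses else 0.0


# Hypothetical counters for one hot function.
useful, issued, remaining = 60, 80, 40
print(prefetch_accuracy(useful, issued))     # share of prefetches that were used
print(prefetch_coverage(useful, remaining))  # share of baseline misses covered
```

Under these definitions a prefetcher can raise coverage by issuing more requests while lowering accuracy, which is one way the metrics can move in different directions.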
Results and Discussion
L2 Prefetchers' Effect on Effectiveness of Threaded Prefetching
Combined with the result in Figure 3, we can observe that the accuracy and coverage of prefetching have a close relationship with the performance of applications. For example, accuracy increases and coverage changes little for EM3D and MST when L2 prefetchers are disabled; thus, EM3D and MST with PF_OFF achieve the best performance, as shown in Figure 3. However, accuracy and coverage may change in different directions. For instance, accuracy increases but coverage decreases for GCC when L2 prefetchers are disabled. The result in Figure 3 shows that GCC with ACL_OFF obtains the best performance, implying that accuracy and coverage have a joint effect on application performance. Having more gains and fewer losses is better, because a more effective threaded prefetching method increases the hit rate of L2 accesses.
From Figure 8, it can be seen that when the L2 prefetcher deployment changes from PF_ON to ACL_OFF and PF_OFF, the total hits continue to increase but the partial hits steadily decrease in hot functions with SP or PV. This illustrates that some of the partial hits of L2 cache accesses for SP or PV become total hits when L2 prefetchers are switched off; in other words, the timeliness of threaded prefetching improves. This occurs because references originated by L2 prefetchers delay the service of prefetches from the helper thread, since they contend for the limited memory bandwidth.
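The shift from partial hits to total hits can be summarized as a single ratio. The Python sketch below uses one plausible definition of timeliness, the share of prefetch-covered accesses that are total hits; this definition is an assumption and is not necessarily the one used in [16], and the function name and counts are hypothetical.

```python
def prefetch_timeliness(total_hits: int, partial_hits: int) -> float:
    """Share of prefetch-covered L2 accesses whose data arrived in time.

    total_hits:   accesses that find the prefetched line already in L2
    partial_hits: accesses that arrive while the prefetch is still in flight
    """
    covered = total_hits + partial_hits
    return total_hits / covered if covered else 0.0


# Hypothetical counts: some partial hits become total hits
# once the L2 prefetchers stop competing for memory bandwidth.
print(prefetch_timeliness(30, 10))  # with L2 prefetchers enabled
print(prefetch_timeliness(36, 4))   # with L2 prefetchers switched off
```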
Although timeliness improves in SP and PV when L2 prefetchers are switched off, the results in Figure 3 show degraded performance for some applications. This indicates that timeliness is not the only crucial factor for the effectiveness of threaded prefetching. In fact, accuracy, coverage, and timeliness are all important factors when applying threaded prefetching. Although disabling shared L2 prefetchers eases the contention for the limited memory bandwidth and improves the timeliness of threaded prefetching, it may decrease the accuracy and coverage of threaded prefetching by removing the hardware prefetches that work alongside the helper thread.
We offer a few comments regarding the effect of L2 prefetchers on threaded prefetching. First, enabling L2 prefetchers may improve the accuracy and coverage of threaded prefetching for some applications. Second, disabling L2 prefetchers can improve the timeliness of threaded prefetching and relieve pressure on the bus and competition for shared resources. Third, in these real prefetching algorithms, accuracy, coverage, and timeliness can compete against one another. Finally, L2 prefetchers have a similar effect on threaded prefetching as on the original execution of pointer-chasing applications. It is worth mentioning that HEALTH is an exception in this experiment. Zilles [17] examined the Olden benchmark HEALTH and argued that, because it performs its task by inefficient means, its performance is of limited relevance. HEALTH is nevertheless included in this paper because it helps us better understand the effect of L2 prefetchers.
Deployment Optimization of L2 Prefetchers on Threaded Prefetching
We optimized the deployment of L2 prefetchers for SP and PV based on the effect analysis above. Figure 9 shows the speedup of SP and PV with the optimized deployment of L2 prefetchers, each relative to the same technique with PF_ON. To distinguish the data for SP and PV, the legends in Figure 9 are marked distinctly: SP_optimized and PV_optimized represent the speedup of SP and PV, respectively. It can be seen in Figure 9 that the performance gain is particularly evident for EM3D, MST, and GCC.
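One simple way to read "deployment optimization" is as a per-application choice of the L2 prefetcher configuration with the shortest run time, with speedup reported against PF_ON. The Python sketch below follows that reading. It is an assumption about the procedure rather than a description of the authors' method; the function name `best_deployment` is hypothetical and the run times are placeholders, not measured results.

```python
def best_deployment(run_times):
    """Pick the configuration with the shortest run time.

    run_times maps a configuration name (PF_ON, ACL_OFF, PF_OFF) to the
    run time in seconds of one application under one threaded prefetching
    technique. Returns (best configuration, speedup relative to PF_ON).
    """
    best = min(run_times, key=run_times.get)
    return best, run_times["PF_ON"] / run_times[best]


# Placeholder run times for a single application; not data from the paper.
example = {"PF_ON": 12.0, "ACL_OFF": 10.0, "PF_OFF": 8.0}
config, speedup = best_deployment(example)
print(config, speedup)
```

When PF_ON is already the fastest configuration, the function returns a speedup of 1.0, so an optimized deployment chosen this way never performs worse than the default.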
Conclusion
This paper has examined the impact of L2 prefetchers and optimized their deployment for pointer-intensive applications after applying helper threads. The results show that L2 prefetchers can waste bus bandwidth and aggravate shared resource contention. As a result, memory-intensive applications with heavy bus traffic can suffer performance degradation if L2 prefetchers are turned on. An optimized deployment of hardware prefetch mechanisms should therefore be chosen based on the nature of the memory access stream. In future work, we will focus on the shared resource contention among requests from normal software, hardware prefetchers, and helper threads in order to adjust the aggressiveness of prefetching for pointer-intensive applications.
