In this paper, we evaluate the benefits achievable from software data-prefetching 
Introduction
The memory wall challenge [19] has been tackled in recent systems by adding larger caches and building sophisticated memory hierarchies. However, cache size is limited by the die size and memory hierarchy is not a panacea for narrowing the processor-memory speed gap. Long latency memory accesses are still substantial for real world applications and are becoming more and more expensive as the processor-memory performance gap continues to widen. There are a number of techniques that have been intensively studied to address this performance issue. These techniques fall into two categories: latency tolerance and latency elimination [5, 11, 13] .
 Intel is a registered trademark of Intel Corporation or its subsidiaries in the United States and other countries. * Other brands and names may be claimed as the property of others.
Software data-prefetching is an effective latency tolerance technique that is aimed at hiding the memory access latency by overlapping it with time to compute results and time to access other memory locations [5, 11, 13] . The long latency elimination techniques, also called locality optimizations, are aimed at minimizing long memory latency [13] through maximizing data and instruction reuse at cache levels of the memory hierarchy. In this paper, our primary focus is to examine the impact of a set of software data-prefetching schemes on SPEC OMP application performance with the Intel® C++ and Fortran compilers [14, 15] on an SGI Altix 32-way shared-memory multiprocessor system built with Intel® Itanium® 2 processors. The SPEC OMPM2001 applications are industry standard benchmarks [1, 9, 20] that are used by hardware and compiler vendors to measure and examine the performance of OpenMP applications on shared-memory multiprocessor systems. These benchmarks are parallelized explicitly using pragmas and directives defined in the OpenMP C++/C and Fortran 2.0 standards [16, 17] .
The software data-prefetching framework implemented in the Intel® Itanium® compiler [5, 10] provides the required infrastructure for our performance study. Most of the compiler analyses and optimizations that are needed to enable effective software data-prefetching are done before data-prefetching utilizing the services of an advanced memory disambiguation module. The disambiguation phase uses information from the state-of-the-art pointer analysis, address-taken analysis, array dependence analysis, language semantics and other sources to answer queries on memory disambiguation.
The main contributions of this paper include:
• A comprehensive study of the effectiveness of a set of software data-prefetching techniques.
• An insight into which compiler-based data-prefetching techniques are the most beneficial for the performance of multithreaded programs.
The remainder of this paper is organized as follows. In Section 2, we give an overview of the Intel® compiler. Section 3 presents software data-prefetching implemented in the Intel® C++ and Fortran compilers. In Section 4, we describe the experimental methodology used to study the impact of software data-prefetching strategies on SPEC OMPM2001 benchmark performance, and present our experimental performance results. In Section 5, we briefly discuss related work. Section 6 concludes the paper with our key observations and insight.
Intel® Compiler Overview
The Intel® Itanium® 2 processor provides a number of new architectural and micro-architectural features [8] , which are used by the Intel Itanium compiler to enhance software performance significantly. The EPIC (Explicitly Parallel Instruction Computing) architecture provides the ability to exploit large amounts of instruction level parallelism (ILP) in an efficient manner. Control and data speculation increase the amount of ILP that the processor can exploit by allowing loads to be scheduled across branches or other memory operations that previously impeded scheduling. Predication similarly increases ILP by removing branches and eliminating the penalties of branch mis-prediction associated with them.
The Intel® compiler is a state-of-the-art product compiler for C/C++ and Fortran95 for Windows* and Linux*. It supports both automatic optimizations and programmer-controlled methods to achieve high-performance software. The compiler incorporates many advanced technologies, including profile-guided multi-file inter-procedural analysis and optimizations, memory disambiguation/optimizations, parallelization, data and loop transformations [10] , global code scheduling, predication, and many optimizations that make use of speculation and predication [2, 10] .
The Intel® compiler also offers programmers the ability to leverage multiprocessor computing power [15] by making small changes to their source code. One kind of source-level change that the user can make to take advantage of multiprocessors is the insertion of OpenMP directives. Our compiler optimizations can be summarized as follows:
Multi-Level Parallelism (MLP):
Intel compiler supports software pipelining and parallelization for Itanium family processors to exploit instruction-level parallelism (ILP) and thread-level parallelism (TLP) effectively. Exploiting MLP (ILP+TLP) ensures that the compiler fully utilizes the rich set of performance features of Intel architecture for achieving the highest application performance.
Inter-Procedural Optimization (IPO):
This component includes points-to analysis and mod/ref analysis required by many other optimizations. Points-to analysis expands the capabilities of memory disambiguation by keeping track of the set of memory locations that may be accessed by a memory reference.
High-Level Optimization (HLO):
Optimizations in HLO include loop transformations such as loop fusion, loop tiling, loop unroll-and-jam, loop distribution [18] , software data prefetching, scalar replacement and data transformations to improve data locality and reduce memory access latency.
Scalar Optimizations:
The Intel compiler implements an extensive set of scalar optimizations such as branch merging, strength reduction, constant propagation, dead code elimination, copy propagation, partial dead store elimination, and partial redundancy elimination (PRE) [3] .
Task Queuing Model:
The Intel compiler supports a task queuing model [6] that can be used to effectively exploit irregular parallelism inherent in applications. This model allows a programmer to parallelize control structures that are beyond the scope of those supported by the standard OpenMP programming model, while still fitting into the framework defined by the OpenMP specification.
Effectively extracting the full potential of Intel® Itanium® 2 processor requires a sophisticated compiler encompassing a full suite of optimizations. To achieve this goal, the Intel Itanium compiler incorporates a wide range of best-known compiler optimizations from both industry and research communities to enhance software performance.
Prefetching for Itanium® 2 Processor
Software data-prefetching is a complementary approach to classical locality optimizations such as linear loop transformations, loop distribution, fusion, blocking, and scalar replacement. Prefetching is an effective technique to hide memory access latency. It injects prefetch instructions for selected data references at carefully chosen points in the program, so that referenced data items are moved as close to the processor as possible before the data items are actually used. In addition, the data prefetch instructions do not normally block the instruction stream and do not raise any exception. The software data prefetching implemented in the Intel compiler makes use of architectural features such as predication, rotating registers, and data speculation that are available on the Intel Itanium 2 processor.
The overhead of data prefetching arises from the execution of prefetch instructions as well as other instructions that compute addresses of data to be prefetched. The prefetch instructions also occupy memory slots, thereby increasing resource usage. In principle:
• We should avoid prefetching data that has been already loaded into the cache, since such prefetches result in an overhead and provide no benefit during execution.
• Data prefetches should be issued at the right time: they should be early enough so that the prefetched data item is available in cache before its use; they should be sufficiently late so that the prefetched data item is not evicted from the cache before its use.
Prefetch distance denotes how far ahead a prefetch is issued for a memory reference. This distance should be estimated based on memory latency, resource requirements in the loop, and data-dependence information.
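As a rough illustration of how such an estimate can be formed, the sketch below uses the textbook rule of dividing the miss latency by the cycles of work per loop iteration and rounding up. This rule, the helper name prefetch_distance, and the clamp to the trip count are assumptions made for illustration. They are not the Intel compiler's actual heuristic, which also weighs resource requirements and data-dependence information.

```c
#include <assert.h>

/* Hypothetical helper (not from the paper): number of iterations ahead to
 * prefetch so that a miss costing `latency_cycles` is covered by loop work
 * of `cycles_per_iter` cycles per iteration. The result is rounded up and
 * clamped to the trip count, since prefetching past the last iteration
 * would be wasted work. */
int prefetch_distance(int latency_cycles, int cycles_per_iter, int trip_count)
{
    assert(latency_cycles >= 0 && cycles_per_iter > 0 && trip_count >= 0);
    int d = (latency_cycles + cycles_per_iter - 1) / cycles_per_iter; /* ceil */
    return d < trip_count ? d : trip_count;
}
```

With an illustrative 200-cycle latency and 25 cycles per iteration, the helper returns a distance of 8 iterations. A loop with only 4 iterations would cap the distance at 4.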
Software data-prefetching developed in the Intel compiler utilizes data-locality analysis to selectively prefetch only those data references that are likely to suffer cache misses. Essentially, three types of data-locality are identified by the compiler:
• Spatial locality exists if data references inside a loop access different memory locations that fall within the same cache line.
• Temporal locality exists if a data reference accesses the same memory location multiple times.
• Group locality occurs if different data references access the same cache line.
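The three categories above can be approximated with simple stride and offset tests. The following is a minimal sketch under stated assumptions: references are affine in the loop index, strides and offsets are known in bytes, and two references are treated as having group locality when their constant offsets differ by less than one cache line. The type and function names are hypothetical, and the real data-locality analysis in the compiler is more involved.

```c
#include <assert.h>
#include <stdlib.h>

enum locality { LOC_NONE, LOC_SPATIAL, LOC_TEMPORAL };

/* Classify a single reference by its per-iteration stride in bytes. */
enum locality classify_self(long stride_bytes, long line_bytes)
{
    assert(line_bytes > 0);
    if (stride_bytes == 0)
        return LOC_TEMPORAL;              /* same location every iteration */
    if (labs(stride_bytes) < line_bytes)
        return LOC_SPATIAL;               /* successive accesses share a line */
    return LOC_NONE;                      /* a new line on every iteration */
}

/* Approximate group-locality test for two references to the same array
 * that have equal strides and constant byte offsets offset_a and offset_b. */
int has_group_locality(long offset_a, long offset_b, long line_bytes)
{
    assert(line_bytes > 0);
    return labs(offset_a - offset_b) < line_bytes;
}
```

For 8-byte elements and a 128-byte line, a unit-stride reference is classified as spatial, and two references to the same array that are two elements apart (offsets -8 and +8 bytes) pass the group-locality test.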
In the example in Figure 1 , the compiler inserts prefetches for arrays x and y. The memory references to array x and y have spatial locality. The two references to array y also exhibit group locality with respect to the k loop iterations.
In this example, D is the prefetch distance computed by the compiler. With a cache line size of 128 bytes and an array element size of 8 bytes, the prefetches to x and y have to be issued only once every 16 iterations. This is because the prefetch instruction moves the entire cache line containing the address into the cache.
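For readers who want to experiment, the loop of Figure 1 can be approximated in C as sketched below. Several details are assumptions rather than the compiler's output: the GCC/Clang builtin __builtin_prefetch stands in for the Itanium prefetch instruction, the array y is shifted by one so that y[j] holds the paper's y(j-1), and bounds guards are added so that the address arithmetic stays inside the arrays. The function name and the returned prefetch count are for illustration only.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH(p) __builtin_prefetch((p))
#else
#define PREFETCH(p) ((void)(p))   /* no-op on compilers without the builtin */
#endif

/* 128-byte cache line / 8-byte double = 16 elements per line. */
enum { ELEMS_PER_LINE = 128 / sizeof(double) };

/* x has n elements; y has n + 2 elements, where y[j] stands for y(j-1) in
 * the paper's notation. Returns the number of prefetches issued. */
int update_with_prefetch(double *x, const double *y, int n, int dist)
{
    assert(x != NULL && y != NULL && n >= 0 && dist >= 0);
    int issued = 0;
    for (int k = 0; k < n; k++) {
        x[k] = x[k] + y[k] * y[k + 2];
        if (k % ELEMS_PER_LINE == 0) {
            /* One prefetch per cache line; guards keep addresses in bounds. */
            if (k + dist < n)     { PREFETCH(&x[k + dist]); issued++; }
            if (k + dist < n + 2) { PREFETCH(&y[k + dist]); issued++; }
        }
    }
    return issued;
}
```

With n = 100 and a distance of 8, the guarded prefetches fire at k = 0, 16, ..., 80, so twelve prefetches are issued across the two arrays instead of two hundred.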
double x(0:99), y(-1:100);
for (k=0; k<100; k++)
    x(k) = x(k) + y(k-1) * y(k+1)

// code with prefetches injected
for (k=0; k<100; k++) {
    x(k) = x(k) + y(k-1)*y(k+1)
    if (mod(k, 16) == 0) {
        /* D denotes the prefetch distance */
        prefetch(&x(k+D));
        prefetch(&y(k-1+D));
    }
}

Figure 1: An example of data-prefetching
Note that the conditional statements used to control data prefetching can be removed by loop unrolling, strip-mining, and peeling. However, this may result in code expansion, which can cause increased instruction cache misses. The predication support in the Intel® Itanium® 2 processor provides an efficient way of exploring the use of prefetch instructions. Here, the IF statements within the loop are if-converted to use predicates. Using predication, the compiler is able to change control dependences into data dependences, so the branch misprediction penalties can be avoided. However, extra compare and arithmetic instructions are required to compute the predicates.
Given the large number of registers available in the Intel Itanium 2 architecture, it is possible to keep prefetch addresses in registers, obviating the need for register spill and fill within loops. The Intel Itanium 2 architecture also provides support for memory access hints that enable compilers to orchestrate data movement between memory hierarchies efficiently [5] . For example, if a data reference does not exhibit any kind of reuse, then it can be prefetched using a special lfetch 'nta' hint to reduce cache pollution. These architectural features provide support for the compiler to perform better data reuse analysis on data movement across loop bodies so that unnecessary prefetches are avoided. Data prefetching is done as part of the high-level optimizations in the Intel compiler enabled at the O3 optimization level. The compiler calculates the prefetch distance for each memory reference using a parameterized scheme based on parameters such as the compiler's estimate of the latency of the memory access, resource requirements inside the loop, and the estimate of the loop trip-count.
Experimental Evaluation
This section evaluates the impact of a number of software data-prefetching strategies on the performance of the SPEC OMP benchmark. We first describe the methodology of our study. Then, we show the performance gain and loss with software data prefetching as a whole, and with individual software data-prefetching strategies.
Methodology
Our experimental evaluation employs the SPEC* OMPM 2001 benchmark suite that consists of a set of OpenMP* based application programs [1, 9] . The input reference data sets of the SPEC OMPM2001 benchmark suite (also referred to as the medium data-size suite) are derived from state-of-the-art scientific computations on shared-memory multiprocessor systems. This benchmark suite consists of 11 large application programs, which represent the type of software used in scientific technical computing. Table 1 provides an overview of the benchmark suite. Of the 11 applications, 8 applications are written in Fortran, and 3 applications are written in C. These benchmarks require a virtual address space of about 2GB to run. The data-sets are significantly larger than those of the SPEC* CPU2000 benchmarks, while still fitting in a 32-bit address space.
The shared-memory multiprocessor system we used for our experiments is the SGI* Altix3000 system built with Intel® Itanium®2 (1.5GHz) processors. The system consists of 32 CPUs, in which each CPU has 16KBI+16KBD L1 cache, 256KB on chip L2 cache, 6MB on chip L3 cache, and 256GB memory per 4-CPU module. The OS installed on the system is the SGI ProPack™ v3. The compilers are Intel® C++ and Fortran95 compiler 8.1 beta releases. For the SPEC OMPM2001 base performance run, we used the SPEC configuration file supplied with SGI's publication of the SPEC OMPM2001 performance results in Nov 2004 at the SPEC web page (www.spec.org). All experiments are performed with 32 threads mapped onto 32 processors, i.e., the maximum number of CPUs available in the system.
Impact of Software Data-Prefetching
In this section, we demonstrate the change in performance obtained by enabling data-prefetching in the Intel® compiler. We have tuned optimizations such as parallelization, privatization, loop transformations, inter-procedural optimization, scalar replacement, prefetching, and software pipelining to work well together inside the Intel compiler. The data-prefetching phase is invoked after most of the other optimizations mentioned here, so it can benefit from them, which makes the prefetching more effective. Note that interactions between these different compiler optimizations tend to be very complex. A discussion of the interaction among optimizations can be found in [21] . As we can see in Figure 2 , software data-prefetching supported in the Intel compiler yields a significant performance gain for SPEC OMPM2001. The largest gain is obtained for 314.mgrid_m, which improves by close to 100%. Overall, 6 out of the 11 benchmarks show gains larger than 10%, 4 others show benefits ranging from 3.83% to 6.96%, and only 332.ammp_m shows less than 1% gain on an SGI* Altix 32-way system. In the following subsections, we discuss the impact of several individual software prefetching strategies on the SPEC OMPM2001 performance.
Impact of Prefetching for Loads Only
Minimizing the overhead introduced by prefetching, and avoiding a surge in resource requirements, are important for effective prefetching. For applications that are memory bandwidth bound, loads/stores from/to memory dominate performance, and adding extra prefetches increases the pressure on memory channels, which may result in no additional performance gain or may even hurt performance. On a shared-memory multiprocessor system, selectively issuing prefetches without causing resource contention on the memory system is important to achieve better performance. We performed an experimental study by issuing prefetches only for memory references that are loads. This is compared with a baseline run that has full prefetching for loads and stores. As shown in Figure 3 , 5 out of 11 applications achieved a performance gain ranging from 0.45% to 9.25%, and 5 out of 11 applications showed a performance loss ranging from -0.02% to -5.22%. The geometric mean gain for SPEC OMPM2001 is 0.06% when prefetches are issued only for loads (that is, when prefetches for stores are suppressed). Throttling the total number of data prefetches issued within loops with many array references reduced the pressure on memory bandwidth and resulted in a 9.25% gain for 312.swim_m and a 2.73% gain for 314.mgrid_m. These two are memory bandwidth bound applications with a lot of streaming data accesses. However, for programs that are not memory bandwidth bound, such as 318.galgel_m, 326.gafort_m, and 310.wupwise_m, the performance loss is due to the memory latency of stores, since the compiler did not issue prefetches for them. Our observation is that prefetching for loads only has a small positive impact on the SPEC OMPM2001 benchmark performance in terms of the geometric mean. This prefetching scheme can help some memory-intensive applications that are bandwidth-bound, but it is not a generally applicable scheme for most applications.
Prefetching for Spatial Locality
Spatial locality occurs when a memory reference inside a loop accesses different memory locations in successive iterations that fall within the same cache line. Identifying spatial locality and issuing prefetches that exploit spatial locality generally results in a performance boost as reported in research papers [2, 4] . Typically, when a cache line is brought in from memory, it contains a number of elements of an array that is accessed in the loop. In the example shown in Figure 1 , memory accesses to arrays x and y exhibit spatial locality. When the array elements x(k+D) and y(k-1+D), where k=0 and D=8, are prefetched, array accesses to elements x(9:15) and y(8:14) will result in cache hits. Certainly, there are a number of factors such as strides of array references, array size and shape, cache line size, etc. that could have an impact on the prefetching effectiveness. The main point is that for a data reference with spatial locality, a prefetch instruction can be issued only once in several iterations and still ensure that there are no cache misses for this memory reference. This section reports the performance gain/loss attributed to prefetching references that exhibit spatial locality. The baseline run uses a compiler that does not issue prefetches for memory references that have spatial locality. As shown in Figure 4 , there is a 21.89% performance boost in terms of the geometric mean of the SPEC OMPM2001 benchmark suite when the Intel compiler issues prefetches for spatially local references. The majority of data accesses in loops tend to exhibit spatial locality, so it is very important for the compiler to have an efficient prefetching scheme to handle these references. In the next sub-section, we see how the Intel compiler makes use of IPF-specific features to reduce the overheads for spatial prefetches.
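The cache-hit ranges quoted above can be checked with a little index arithmetic. The sketch below assumes that the first element of each array sits on a cache-line boundary and that a line holds 16 elements (128 bytes of 8-byte values). Neither alignment assumption is stated in the paper, and the struct and function names are hypothetical.

```c
#include <assert.h>

struct index_range { int first; int last; };

/* Returns the indices of array(lower:upper) that share a cache line with
 * element idx, assuming array(lower) is aligned to a line boundary. */
struct index_range line_span(int lower, int upper, int idx, int elems_per_line)
{
    assert(elems_per_line > 0 && lower <= idx && idx <= upper);
    int line = (idx - lower) / elems_per_line;   /* line number within array */
    struct index_range r;
    r.first = lower + line * elems_per_line;
    r.last = r.first + elems_per_line - 1;
    if (r.last > upper)
        r.last = upper;                          /* final line may be partial */
    return r;
}
```

Under these assumptions, prefetching x(8) in x(0:99) brings in x(0:15), and prefetching y(7) in y(-1:100) brings in y(-1:14), which agrees with the hits on x(9:15) and y(8:14) described in the text.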
Prefetching using Rotating Registers
In the Itanium® 2 processor, the general registers r32-r127, floating-point registers f32-f127, and predicate registers p16-p63 can rotate. The remaining registers r0-r31, f0-f31, and p0-p15 do not rotate and are referred to as static registers. Register rotation provides a hardware renaming mechanism that helps the compiler control prefetching with minimum overhead, as described in [5] . Figure 5 gives a simple example of how the compiler uses rotating registers to reduce prefetching overheads in a software-pipelined loop.
Figure 5. An example of prefetching using rotating registers
To set up the prefetches, the addresses of the arrays x and y are loaded into rotating registers r33 and r34 outside the loop. Within the loop, a single prefetch instruction is injected that prefetches the address held in r34. After the prefetch is issued, r34 is incremented by 16 bytes and the result is placed in register r32 by an add instruction. This results in the creation of a small rotating region of two integer registers (r33 and r34). When the branch instruction is executed, r32 and r33 are renamed to r33 and r34 respectively. In the second iteration of the loop, r34 contains the value &(x(0+D)) and is prefetched. Assuming a cache line size of 128 bytes, in the first 16 iterations of the loop, 8 prefetches are issued to one unique cache line of the array x, and 8 prefetches are issued to one unique cache line of the array y.
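The register behavior described above can be mimicked in plain C to confirm the counts. The simulation below is only a model: three variables stand in for r32, r33 and r34, the rotation at the loop branch is modeled as a shift of values, and the starting addresses are assumed to be line-aligned and on different lines. It is not a reproduction of Figure 5, and the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { LINE_BYTES = 128, STEP_BYTES = 16 };

/* Simulates `iters` loop iterations and counts how many prefetches land on
 * the cache line containing x_addr and on the line containing y_addr. */
void simulate_rotation(uintptr_t x_addr, uintptr_t y_addr, int iters,
                       int *x_line_hits, int *y_line_hits)
{
    assert(iters >= 0 && x_line_hits != NULL && y_line_hits != NULL);
    uintptr_t r32 = 0, r33 = x_addr, r34 = y_addr;
    uintptr_t x_line = x_addr / LINE_BYTES;
    uintptr_t y_line = y_addr / LINE_BYTES;
    *x_line_hits = 0;
    *y_line_hits = 0;
    for (int i = 0; i < iters; i++) {
        uintptr_t target = r34;            /* prefetch the address in r34 */
        if (target / LINE_BYTES == x_line)
            (*x_line_hits)++;
        else if (target / LINE_BYTES == y_line)
            (*y_line_hits)++;
        r32 = r34 + STEP_BYTES;            /* r32 = r34 + 16 bytes */
        r34 = r33;                         /* rotation: old r33 becomes r34 */
        r33 = r32;                         /* rotation: old r32 becomes r33 */
    }
}
```

Running the model for 16 iterations alternates between the two arrays and records 8 prefetches on the first line of x and 8 on the first line of y, matching the counts given in the text. Later iterations move on to the next line of each array.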
This clever scheme of optimizing software data-prefetches for spatial references using rotating registers: (a) reduces the number of issue slots needed for prefetch instructions, thus helping to achieve a tight schedule for the loop with prefetches, (b) avoids branch mispredict penalties associated with using conditionals or predicate computation overheads associated with if-conversion, and (c) avoids the need for loop unrolling and associated code size overheads. Its implementation details can be found in [5] . The main disadvantage of using this scheme is that some of the prefetches issued will be redundant (accessing the same cache line).
Figure 6. Gain / Loss of Prefetching using Rotating Registers
Figure 6 reports the gain and loss of prefetching using rotating registers for the SPEC* OMPM2001 suite. The baseline performance is obtained without using the rotating register scheme, but using a conditional statement inside the loop to issue the prefetch as illustrated in Figure 1 . These conditions may later get predicated by the compiler depending on the predication heuristics employed during scheduling. As shown in Figure 6 , there are 6 out of 11 applications that show a performance gain with prefetching using rotating registers. The most notable ones are 15.90% gain for 314.mgrid_m and 7.52% gain for 310.wupwise_m. The geometric mean gained 2.71%. Note that 4 applications show a performance loss in Figure 6 . However, all of the losses are less than 1%. In summary, prefetching using the rotating registers scheme in the Intel® compiler brings a positive impact on the SPEC OMPM2001 benchmark performance.
Prefetching for Spatial References with No Predication
Predication refers to the guarded execution of an instruction based on a boolean source operand called the qualifying predicate. Almost all Itanium®2 processor instructions have a qualifying predicate. If the qualifying predicate is true, the instruction is executed. If the qualifying predicate is false, the instruction behaves like a no-op.
The Itanium 2 architecture provides 64 predicate registers (p0-p63) of which 48 predicate registers (p16-p63) rotate. The rotation of predicate registers serves two purposes. The first is to avoid overwriting a predicate value that is still needed. The second purpose is to control the filling and draining of a software pipeline. Predication helps to remove branching by converting control dependences into data dependences.
Thus, a straightforward approach for the compiler to prefetch spatially-local references is to minimize redundant prefetches and avoid branch mispredict penalties by injecting guarded prefetch instructions using predication.
Here we compare the performance of issuing prefetches with no predication against a scheme that uses a combination of predication and the rotating-register mechanism. The baseline compiler issues conditional prefetches for spatial references, some of which are converted to use the rotating register scheme as described in Section 4.5. Note that the rotating register technique works only for software-pipelined loops.
Our experiments show that 7 out of 11 applications in the SPEC OMPM2001 suite achieved a performance gain without using predication. This is because the Itanium 2 processor discards redundant prefetch instructions to the same cache line. By removing predication altogether, 4 out of 11 benchmarks recorded a small performance loss ranging from -0.03% to -0.89%. One interesting finding is that 324.apsi_m was a relatively big winner with a gain of 13.25%, in addition to 6 minor winners. The geometric mean of the SPEC OMPM2001 applications shows a 1.73% gain with unconditional prefetching compared to issuing prefetches with predicates for spatial references. 324.apsi_m gains from removing predication because it is a memory bandwidth-bound program: with multiple threads, the execution cycles achieved by static instruction scheduling are not a dominant factor, so the reduction in predication overhead has a positive impact.
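To make the trade-off concrete, the sketch below counts prefetch instructions issued and distinct cache lines touched for a unit-stride walk, with and without a guard. It is a simplified model under stated assumptions: element 0 is line-aligned, a guarded prefetch fires only on the first element of each line, and the names are hypothetical. It does not model timing or the hardware's handling of redundant requests.

```c
#include <assert.h>

struct prefetch_stats { int issued; int distinct_lines; };

/* Walks n contiguous elements. If `guarded` is nonzero, a prefetch is
 * issued only when k is the first element of a cache line; otherwise a
 * prefetch is issued on every iteration. */
struct prefetch_stats count_prefetches(int n, int elems_per_line, int guarded)
{
    assert(n >= 0 && elems_per_line > 0);
    struct prefetch_stats s = {0, 0};
    int last_line = -1;
    for (int k = 0; k < n; k++) {
        if (guarded && k % elems_per_line != 0)
            continue;                      /* predicate false: no prefetch */
        s.issued++;
        int line = k / elems_per_line;
        if (line != last_line) {           /* first touch of this line */
            s.distinct_lines++;
            last_line = line;
        }
    }
    return s;
}
```

For 100 elements at 16 per line, both variants touch the same 7 lines. The guarded form issues 7 prefetches and the unguarded form issues 100, so the unguarded form trades extra prefetch instructions for the removal of the predicate computation.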
Impact of Prefetching References with No Spatial Locality
The overheads of prefetching memory references in loops that exhibit no spatial locality are higher than those for spatial references. These references have to be prefetched in every iteration of the loop. They cannot be optimized using predication or using rotating registers. In this section, we measure the performance impact of prefetching for such data references. The baseline run uses a compiler that does not issue prefetches for memory references that have no spatial locality. As shown in Figure 8 , prefetching references with no spatial locality delivers a 4.51% performance gain on the SPEC OMPM2001 suite in terms of the geometric mean. There are 8 out of 11 applications that achieved a gain from 1.56% to 36.64%. 310.wupwise_m suffered a 4.29% performance loss, while 324.apsi_m and 326.gafort_m showed negligible performance losses (less than 1%). 320.equake_m showed a substantial performance boost (36.64%) with this aggressive prefetching scheme.
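A column walk over a row-major matrix is a common case of a reference with no spatial locality, because each iteration lands on a different cache line once a row is at least one line wide. The sketch below shows the per-iteration prefetch this implies. It assumes the GCC/Clang builtin __builtin_prefetch as a stand-in for the Itanium prefetch instruction, and the function name and bounds guard are illustrative choices, not the compiler's generated code.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH(p) __builtin_prefetch((p))
#else
#define PREFETCH(p) ((void)(p))   /* no-op on compilers without the builtin */
#endif

/* Sums column `col` of a row-major rows x cols matrix, prefetching the
 * element `dist` rows ahead on every iteration. */
double sum_column(const double *a, size_t rows, size_t cols,
                  size_t col, size_t dist)
{
    assert(a != NULL && col < cols);
    double sum = 0.0;
    for (size_t i = 0; i < rows; i++) {
        if (i + dist < rows)               /* keep the address in bounds */
            PREFETCH(&a[(i + dist) * cols + col]);
        sum += a[i * cols + col];
    }
    return sum;
}
```

With 32 doubles per row, consecutive accesses are 256 bytes apart, so no two iterations share a 128-byte line and a prefetch is needed for each one.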
Impact of Prefetching for Outer Loops
Generally, prefetching is a latency tolerance technique that is inner-loop centric, where most performance gains are achieved through prefetching for references inside innermost loops. However, a few papers have shown that some applications may benefit from issuing prefetches for references that appear in outer loops. Hence, we experimented with issuing prefetches in outer loops. The base run uses a compiler that does not issue prefetches for references in outer loops. Our results demonstrate that 7 out of 11 applications suffer a performance degradation ranging from -0.01% to -3.09%, and 4 out of 11 applications get a negligible gain ranging from 0.02% to 1.05%. The geometric mean shows a 0.48% degradation. Essentially, our observation is that enabling prefetching for outer loops has a minor negative impact on SPEC* OMPM2001 performance under our experimental framework and system configuration.
Prefetching Arrays with Indirect Indexing
For an indexed array reference (of the form a[b[i]]), the memory indirection through an index array to access the data array requires a sophisticated prefetching strategy. If both the index array and the data array accesses encounter cache misses, prefetches have to be issued for both references. Also, the distance used for prefetching the index array has to be larger than the distance used for the data array. This is to ensure that the index array loaded as part of the prefetch address computation for the data array does not encounter any cache misses. Data speculation support in Itanium®2 architecture is used to load the index array while computing this address for the data array. This ensures that any out-of-bound accesses of the index array are silently ignored (without generating any exceptions) during program execution. Generally, this kind of prefetching will be helpful for applications with irregular memory access patterns. As shown in Figure 10 , the impact of prefetching indexed array references is minor for SPEC OMPM2001 suite. The performance gain or loss varies from -1.63% to 4.04% for all applications, with a 0.44% gain on the geometric mean. Thus, prefetching for indirect array referencing is not the most valuable scheme for SPEC OMPM2001.
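The two-distance scheme for a[b[i]] can be sketched in C as follows. Several choices here are assumptions made for illustration: the index array is prefetched at twice the data distance, the GCC/Clang builtin __builtin_prefetch stands in for the Itanium prefetch instruction, and an explicit bounds check replaces the data-speculative load that the text describes. The function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH(p) __builtin_prefetch((p))
#else
#define PREFETCH(p) ((void)(p))   /* no-op on compilers without the builtin */
#endif

/* Sums a[b[i]] for i in [0, n). The index array b is prefetched 2 * dist
 * iterations ahead and the data array a is prefetched dist iterations
 * ahead, so b[i + dist] is expected to be cached when it is read. */
double gather_sum(const double *a, const int *b, size_t n, size_t dist)
{
    assert(a != NULL && b != NULL);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + 2 * dist < n)
            PREFETCH(&b[i + 2 * dist]);
        /* This check stands in for the speculative load on Itanium, which
         * would allow b[i + dist] to be read without a guard. */
        if (i + dist < n)
            PREFETCH(&a[b[i + dist]]);
        sum += a[b[i]];
    }
    return sum;
}
```

The extra load of b[i + dist] is the overhead that makes this scheme more expensive than prefetching a direct array reference.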
Related Work
Researchers and compiler developers have developed numerous software-based prefetching techniques. Todd C. Mowry et al. first presented a general compiler prefetching method for scientific programs [13] . Later, Guang R. Gao et al. [12] presented a compiler-based prefetching scheme for pointer-based programs. However, software prefetching is ineffective in pointer-based programs. To address this issue, Chi K. Luk et al. developed a profile-guided post-link stride data-prefetching technique that uses stride profiling information to enhance the support of software prefetching for pointer-based programs [11] . The Intel® compiler incorporates all well-known techniques in these areas and hence provides a unique test-bed for evaluating data-prefetching effectiveness.
In the domain of compiler support for OpenMP, there have been a few papers discussing OpenMP parallelization in compilers [4, 7] . Various OpenMP implementations are possible. An OpenMP preprocessor [4] can accept C++ OpenMP programs and translate them to C++ programs (without OpenMP pragmas) that are subsequently compiled with a native compiler (one that generates machine code). A more integrated approach is to have an internal OpenMP translation phase in the native compiler [14, 15] itself, eliminating the preprocessor. In most cases, an auxiliary runtime library for thread management is used, with the compiler-generated code making many calls to this library [15] . However, we have found no prior work that studies and evaluates the impact of compiler-based software data-prefetching schemes on the performance of the SPEC OMPM2001 application suite, or on the performance of OpenMP C/C++ and FORTRAN programs in general.
Concluding Remarks
In this paper, we studied and discussed the impact of a set of software prefetching strategies on the performance of the SPEC OMPM2001 application suite on a 32-way Intel® Itanium®2 processor based shared-memory multiprocessor system. To the best of our knowledge, this is the first paper that studies the impact of software data-prefetching on SPEC OMPM2001. From the results presented in this paper, it should be clear that software data-prefetching is an effective compiler technique for tolerating long memory latency without notably increasing the memory traffic through memory hierarchies. In addition, we measured both serial execution and single-thread execution performance (geomean) of the SPEC OMPM2001 benchmark suite; single-thread execution achieved ~96% of the performance of serial execution. Hence, we claim that the impact of prefetching on the threaded code generated by the compiler is at the same level as its impact on serial code performance. Our key observation is that most of the performance gain is obtained through three software data-prefetching strategies: (a) prefetching for memory accesses exhibiting spatial locality, (b) prefetching for array references with no spatial locality, and (c) prefetching using rotating registers. Software data-prefetching with assistance from other optimizations in the Intel® compiler delivers a 29.95% gain for SPEC OMPM2001, viz., 11.88% to 99.85% gain for 6 out of 11 applications, 3.83% to 6.96% gain for 4 out of 11 applications, with only one application obtaining less than 1% gain on an Intel® Itanium® 2 processor-based SGI* 32-way multiprocessor system.
Although this study clearly demonstrated the effectiveness of software data-prefetching, it remains to be seen whether more advanced tuning of data-prefetching strategies can further leverage new processor features to obtain better performance on Intel Itanium 2 processor based shared-memory multiprocessor systems.
Our future work will also include two more directions. First, we plan to investigate the performance impact of issuing prefetches with automatic distance-adjustment by adjusting the distance for memory banks while prefetching multiple arrays. Another important focus will be an in-depth look at the impact of compiler-based prefetching for programs parallelized with the task-queuing model [6] , such as "while" loops for parallel pointer-chasing and graph-searching programs.
