Abstract
Improving memory performance at the software level is more effective in reducing the rapidly expanding gap between processor and memory performance. Loop transformations (e.g. loop unrolling, loop tiling) and array restructuring optimizations improve memory performance by increasing the locality of memory accesses. To find the best optimization parameters at runtime, we need a fast and simple analytical model to predict the memory access cost. Most of the existing models are complex and impractical to integrate into runtime tuning systems. In this paper, we propose a simple, fast and reasonably accurate model that is capable of predicting the memory access cost based on a wide range of data access patterns that appear in many scientific applications.
Introduction
Immense research effort has been spent on reducing the performance gap between processor and memory. Processor speed continues to increase every year. CPU clock frequency is doubling every 18 months, in keeping with Moore's Law. On the other hand, main memory (DRAM) speeds have not increased enough to catch up with processor speed. This performance gap has been increasing for the last 20 years [13], and the trend appears set to continue in the near future.
Cache memories work on the principle of spatial and temporal locality [17]. However, there are many applications that lack locality in accessing the memory. These applications spend a major fraction of execution time waiting for data accesses. Cache memories are exploited better if the cached blocks of data are reused extensively before other cache blocks replace them.
Transforming and reordering memory accesses improves application performance [10]. Some advanced compilers utilize these optimization techniques. However, compilers alone are not sufficient to achieve the best possible optimization [1]. Superior manual optimizations require extensive knowledge of the hardware architecture and of the data access patterns. The developer needs to know which optimization techniques are efficient and where to apply them. Automating the optimization process is needed to obtain consistent performance.
Performance prediction of memory access cost is required to automate the optimizations. Currently there are a few automatic tuning software tools. One of the most popular is Automatically Tuned Linear Algebra Software (ATLAS) [19]. This tool runs subroutines multiple times to obtain the best optimization parameters by trial and error. A prediction model can remove these multiple runs and can be extended to optimize more than just linear algebra subroutines. Such a model should be simple and fast, so that the optimization can be performed dynamically, at runtime, based on the data access pattern and the available memory hierarchy.
Many researchers have worked towards developing accurate cache performance models. However, most of these models [2, 18, 6] lack generality. They are complex, and are bound to a few algorithms or data access patterns. Jacob [6] extracts address traces from the code, which requires execution of the program and consumes a lot of time if an optimization has to be applied. We base our prediction model on various access patterns, which are parameterized. This helps in predicting the memory cost with very small complexity and skips the costly process of tracing the references every time a loop parameter is changed. Chatterjee et al. [3] study the exact analysis of cache misses based on the polyhedral model, which is complex. The Cache Miss Equations model (CME) [5] is the least costly performance model to our knowledge. However, this model also requires tracing the references to create the reuse vectors and solve the cache miss equations. These models are accurate but expensive, and are better choices for static analysis of cache behavior. Our model is better suited than CMEs to choosing the optimization parameters dynamically at runtime.
0-7803-8694-9/04/$20.00 ©2004 IEEE
Our model also focuses on a wide range of data access patterns with multiple array variables. Most of the other cache analysis models give good results for a specific algorithm [2, 18] and fall short of generality.
The rest of this paper is organized as follows. Section 2 classifies various data access patterns that are used in most scientific applications. Section 3 discusses the parameters of the memory hierarchy. In Section 4, we propose the memory access cost analysis and prediction equations. Section 5 provides experimental verification, and Section 6 discusses an application of our model and current projects. Section 7 concludes with further objectives.
Data Access Patterns
Loops and arrays are fundamental structures of most numerical and scientific applications [14]. A major share of the execution time of these applications is spent in loops, accessing data from arrays. Analyzing these access patterns is needed to find the hotspots and to optimize performance by reorganizing these memory references.
Data access patterns are classified based on the stride between successive accesses. The modal model of memory [11] categorizes data accesses as constant, strided and non-monotonic modes. Yan et al. [15] classify memory access patterns into three types: migratory, group and unpredictable patterns.
We classify data access patterns in scientific applications as constant, contiguous and non-contiguous. Non-contiguous patterns are further divided into four patterns based on the size of the data blocks accessed with each reference and their successive strides. Stride is the distance between the previous reference and the current reference.
Constant accesses are those where the same data block is accessed repeatedly i.e., stride is equal to zero.
Contiguous access pattern is where the stride between successive accesses is equal to the size of the datatype.
These are divided further as fixed length block accesses and variable length block accesses. Fixed length block accesses refer to the same datatype in consecutive references.
Non-contiguous access pattern is where the stride of the next reference is greater than the size of the currently accessed datatype. These can be further divided as follows:
a) Fixed length block, with fixed stride: The stride is the same throughout the access pattern. As shown in Figure 1.a, a block of constant size block_size is copied into dest from src. The next block is copied from src+stride to dest+block_size, i.e. array src is accessed non-contiguously with a fixed stride and array dest is accessed contiguously.
b) Fixed length block, with varying stride: The stride varies between each access.
In Section 4, we predict the memory access cost for these data access patterns.
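The classification above can be illustrated with a minimal Python sketch. The function names `strided_copy` and `classify` are hypothetical and not part of the paper; the copy mirrors the fixed-length block, fixed-stride description (src read with a stride, dest written contiguously), and the classifier only distinguishes patterns by the stride between successive references, assuming fixed-length accesses.

```python
def strided_copy(src, block_size, stride, count):
    """Copy `count` blocks of `block_size` items from `src`.

    The source position advances by `stride` per block (non-contiguous),
    while the destination grows by `block_size` per block (contiguous).
    """
    dest = []
    for b in range(count):
        start = b * stride
        dest.extend(src[start:start + block_size])
    return dest


def classify(offsets, elem_size):
    """Classify byte offsets by the stride between successive accesses.

    Simplification (an assumption of this sketch): every access has the
    same length, so only the stride decides the class.
    """
    strides = [b - a for a, b in zip(offsets, offsets[1:])]
    if all(s == 0 for s in strides):
        return "constant"
    if all(s == elem_size for s in strides):
        return "contiguous"
    if len(set(strides)) == 1:
        return "non-contiguous, fixed stride"
    return "non-contiguous, varying stride"
```

For example, `classify([0, 32, 64], 8)` reports a fixed stride, while `classify([0, 32, 48], 8)` reports a varying one.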
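[Figure 1. Example loops for the data access patterns; the loop in (a) copies fixed-length blocks from src to dest with a fixed stride.]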
Model parameters
To bridge the gap between processor and memory performance, modern computer architectures include multiple levels of memory hierarchy that consist of cache memory and a TLB. In this section, we discuss the details of the memory hierarchy parameters, which are used in developing the prediction model.
We treat the TLB as a level of the memory hierarchy. Its parameters are the page size P (similar to the cache line size of a cache) and the capacity. The capacity of a TLB is the amount of memory page mapping it can store and is equal to the number of page entries multiplied by the page size. Table 1 summarizes the memory hierarchy parameters. Subscript i of a parameter signifies the level of that cache/TLB in the memory hierarchy. Cache memory at level i has three properties: its size in bytes ($C_i$), its cache line size ($L_i$) and its associativity ($A_i$). The TLB is represented by the number of page table entries (T), the page size (P) and its associativity ($A_T$).
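As a minimal sketch, the parameters described above can be held in two small records. The class and field names are hypothetical. The sizes are the ones reported for the Sun Blade-100 node in the verification section (16 KB L1 cache with 16-byte lines, TLB with 48 entries and 4 KB pages); the associativity values are placeholders, since the source does not state them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheLevel:
    size_bytes: int      # C_i
    line_bytes: int      # L_i
    associativity: int   # A_i

    @property
    def num_lines(self):
        # Number of cache lines that fit in this level.
        return self.size_bytes // self.line_bytes


@dataclass(frozen=True)
class TLB:
    entries: int         # number of page table entries
    page_bytes: int      # P
    associativity: int   # A_T

    @property
    def capacity_bytes(self):
        # Capacity = number of page entries multiplied by page size.
        return self.entries * self.page_bytes


# Associativity 1 is a placeholder assumption, not a measured value.
l1 = CacheLevel(size_bytes=16 * 1024, line_bytes=16, associativity=1)
tlb = TLB(entries=48, page_bytes=4096, associativity=1)
```

With these values the TLB maps 48 × 4 KB = 192 KB of memory, so a working set larger than that starts to incur TLB misses even if it still fits in a large L2 cache.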
M refers to the total number of cache levels.
Cache misses are classified into three types [9]: compulsory misses, capacity misses and conflict misses. Cache misses at level i are represented by $M_i$. $M^{c}_{(k,i)}$ refers to the number of cache misses at level k of the memory hierarchy in accessing the ith array (variable) contiguously. If the array is accessed non-contiguously, the number of misses is represented by $M^{nc}_{(k,i)}$. Data access pattern parameters are shown in Table 2. The subscript i represents the ith array being accessed.
Memory access cost prediction
Our goal is to predict the memory access cost of a basic loop block with any of the data access patterns discussed in Section 2, and for multiple array variables. We assume an LRU replacement policy for the cache and the TLB. We also assume that the memory hierarchy follows the inclusive property. The total cost of accessing memory includes the access time and the miss penalties of the levels in the hierarchy. If there are k levels of cache memory and one level of TLB, the total cost is the sum of the access time, the miss penalties of the k cache levels and the TLB miss penalty. To predict this cost, we have to find the number of cache hits/misses at each level and the TLB hit rate. We predict the cache and TLB misses based on the access pattern.
Assuming that there are M levels of cache, the total miss penalty due to cache misses is the sum of the miss penalties at each level, where $M_k$ is the total number of cache misses and $T_k$ is the miss penalty at cache level k. $\alpha$ accounts for the overlapping of cache misses with prefetching and other OS optimizations.
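A minimal sketch of this sum is given below. The function name is hypothetical, and treating the overlap factor as a single multiplier on the summed penalty is an assumption of this sketch, since the exact form of the equation is not recoverable from the text.

```python
def cache_miss_penalty(misses, penalties, alpha=1.0):
    """Sum of (misses at level k) * (miss penalty at level k) over all levels.

    `alpha` scales the total to account for overlap with prefetching;
    applying it multiplicatively is an assumption of this sketch.
    """
    if len(misses) != len(penalties):
        raise ValueError("one miss count and one penalty per cache level")
    return alpha * sum(m * t for m, t in zip(misses, penalties))
```

For instance, with the 7-cycle and 70-cycle load miss penalties reported later for the Pentium III node and hypothetical miss counts of 1000 (L1) and 100 (L2), the penalty is 14000 cycles without overlap.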
Suppose that there are m array variables accessed contiguously and n array variables accessed non-contiguously. The total number of misses at cache level k is the sum of the misses caused by the contiguously accessed arrays and those caused by the non-contiguously accessed arrays:

$$M_k = \sum_{i=1}^{m} M^{c}_{(k,i)} + \sum_{i=1}^{n} M^{nc}_{(k,i)} \qquad (4.3)$$

where $M^{c}_{(k,i)}$ and $M^{nc}_{(k,i)}$ are the misses at level k for the ith contiguously and the ith non-contiguously accessed array, respectively.

Non-contiguous access patterns: As described in Section 2, there are four main types of access patterns. These patterns are classified based on the variability of the stride and of the data block size. The occurrence of cache misses is categorized into four regions based on the working set size, similar to Saavedra and Smith [16]. The first region is the one where the whole working set fits in the cache: as long as the working set size is less than the cache size, the total data fits into the cache. The number of misses is then

$$\lceil n \cdot (W / L_k) \rceil \qquad (4.8)$$

where n is the number of data accesses. Thus, we focus on the first two regions to count the number of cache misses.
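The two counts discussed above can be sketched as follows. The function names are hypothetical, and reading W as the number of bytes touched per access is an assumption of this sketch, since the source does not define it at this point.

```python
def total_misses_at_level(contiguous_misses, noncontiguous_misses):
    """Misses at one cache level: sum over contiguously accessed arrays
    plus sum over non-contiguously accessed arrays."""
    return sum(contiguous_misses) + sum(noncontiguous_misses)


def first_region_misses(n_accesses, bytes_per_access, line_bytes):
    """ceil(n * W / L_k) using integer arithmetic.

    Assumption: W is the number of bytes touched by each access, so the
    result is the number of cache lines needed to hold all accessed data.
    """
    total_bytes = n_accesses * bytes_per_access
    return (total_bytes + line_bytes - 1) // line_bytes
```

With 1000 accesses of 8 bytes each and a 16-byte line, `first_region_misses(1000, 8, 16)` gives 500, i.e. one miss for every two accesses.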
First we find the cache misses for a fixed size of data block accesses of one variable, with a fixed stride.
Let $R^{nc}_{(k,i)}$ be the number of non-contiguous references of the ith array at cache level k of the memory hierarchy, and let $W_{(i,j)}$ be the size of the jth data block being non-contiguously accessed in the ith variable. In this pattern, when the stride $S_{(i,j)}$ is less than the cache line size, we assume that the cache line has already been fetched into the cache. However, when such a stride does cause a new cache line to be fetched, the formula fails to count that cache miss. This can be corrected by maintaining a history of the cache line that was fetched most recently.
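The correction mentioned above, keeping a history of the most recently fetched cache line, can be sketched as a simple counter. This is an illustrative simulation, not the paper's closed-form formula: the function name is hypothetical, a cold cache and a positive stride are assumed, and only first-touch fetches are counted.

```python
def count_line_fetches(start, stride, block_size, n_refs, line_size):
    """Count cache line fetches for fixed-size blocks with a fixed stride.

    A line is counted only if it differs from the most recently fetched
    line, so strides smaller than the line size are handled correctly
    even when a reference crosses into a new line.
    """
    last_line = None
    fetches = 0
    for r in range(n_refs):
        first = (start + r * stride) // line_size
        last = (start + r * stride + block_size - 1) // line_size
        for line in range(first, last + 1):
            if line != last_line:
                fetches += 1
                last_line = line
    return fetches
```

With a 16-byte line, 4-byte blocks and a 12-byte stride, four references touch offsets 0, 12, 24 and 36, which fall in lines 0, 0, 1 and 2. The stride is below the line size, yet two of the later references still bring in a new line, which is exactly the case the uncorrected assumption misses.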
The number of cache references at level k, $R_{(k,i)}$, is the number of cache misses at the lower level cache (level k-1), i.e. $R_{(k,i)} = M_{(k-1,i)}$. For fixed or variable stride with variable-size block accesses, the cache misses have to be counted for each block size. In this case, we assume that the stride is always greater than $L_k$. If the data block size is less than $L_k$, it does not cause a cache miss, as the prefetched line of data is reused. The total number of cache misses is the sum of the misses counted for each block.
Refer to Table 3 and Table 4 (at the end of this document) for a summary of the formulae to calculate the cache misses for all data access patterns. Using (4.3), the total number of cache misses in accessing contiguous and non-contiguous data is calculated. Formula (4.2) gives the total memory access cost.
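The per-block counting described above can be sketched as follows. The function name is hypothetical. The rule that a block smaller than the line size contributes no miss comes from the text; charging ceil(block size / line size) misses to each larger block is an assumption of this sketch, as the original formula is not recoverable here.

```python
def misses_variable_blocks(block_sizes, line_size):
    """Count misses for variable-size blocks, assuming stride > line size.

    Blocks smaller than a cache line are assumed to reuse an already
    fetched line (per the text); larger blocks are charged one miss per
    cache line they span (an assumption of this sketch).
    """
    total = 0
    for size in block_sizes:
        if size >= line_size:
            total += (size + line_size - 1) // line_size
    return total
```

To model several cache levels, the misses returned for level k-1 would serve as the reference count for level k, following the relation between references and misses stated above.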
Performance verification
This section presents performance measurements to verify the predicted memory access cost with the measured cost on various architectures. We measure the performance of loops with all the data access patterns mentioned above and compare that performance with the predicted performance.
We took the measurements on a Sun Solaris based cluster called Sunwulf, which is located at the Scalable Computer Software Lab of Illinois Institute of Technology. Sunwulf is composed of a four-processor E450 server and 63 high-end workstations. We ran our experiments on one of the nodes. Each node is a Sun Blade-100 workstation with one 500 MHz UltraSPARC-IIe CPU. The L1 cache is 16 KB, with a 16-byte cache line size. The L2 cache has a capacity of 8 MB and its line size is 64 bytes. The node also has a TLB with a 4 KB page size and 48 entries. We used a microbenchmark to find the average access time and miss penalty of each level of the memory hierarchy. This is similar to the microbenchmark proposed by Saavedra and Smith [16].
Another platform we used for experiments is a 32-node Beowulf cluster, located at the University of South Carolina. Each node consists of a 933 MHz Pentium III processor. It has a 16 KB L1 cache and a 256 KB L2 cache. Both caches are on the die, and the average penalty for load misses is measured as 7 cycles and 70 cycles for L1 and L2 respectively. We chose these processors because they apply the inclusive property in the memory hierarchy and use less aggressive prefetching.
We used loops similar to those in Figure 1 and measured the time to execute them. In all these loops, two array variables are accessed with different access patterns. We chose these loops since many applications contain loop blocks where the data accesses are similar to the access patterns discussed in Section 2. The same prediction model can be applied to any number of arrays. The execution time of these loops contains only the data access cost, without any computation cost. We used pointer-to-pointer copy to avoid the cost of memcpy. In these experiments we ran many iterations of the program to find the minimum cost. We also flushed the cache after measuring the time for an iteration, to replace any cache blocks that are reusable. We compiled these programs using gcc 3.0 and padded the arrays to avoid cache thrashing. The comparison of predicted and measured memory access cost is presented in the following paragraphs. The memory access cost is presented as the ratio of execution time to the number of memory references. This normalization is done to fit all the data into the graph. Lower values indicate better performance.
... pessimistically, without taking prefetching into consideration. The prediction error was below 20% for small data and below 4% for large data with this data access pattern.
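The normalization and the error figures used in this section can be sketched as below. The helper names are hypothetical, and Python timing only illustrates the bookkeeping: the measurements in the paper come from compiled loops with explicit cache flushing, which this sketch does not reproduce.

```python
import time


def min_time_per_reference(loop, n_refs, repeats=5):
    """Run `loop` several times and return the minimum time per reference."""
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        loop()
        best = min(best, time.perf_counter() - t0)
    return best / n_refs


def prediction_error_pct(predicted, measured):
    """Relative prediction error, as a percentage of the measured cost."""
    return 100.0 * abs(predicted - measured) / measured
```

For example, a predicted cost of 12 units against a measured cost of 10 units is a 20% prediction error.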
To test the performance of the non-contiguous access patterns, we used three fixed strides (16 bytes, 32 bytes and 64 bytes), which are equal to the L1 cache line size, greater than the L1 line size, and equal to the L2 line size, respectively. For non-contiguous accesses with a stride equal to the L1 cache line size, the prediction error decreases as the data size increases. It can be seen from Figure 2.b and Figure 3.a that the caches are utilized more effectively when the data size is less than the L2 cache size. Overall, the error is below 20% in most of the cases. For the remaining two non-contiguous access patterns with fixed strides, the prediction error is below 10% for larger data sizes.
For the non-contiguous access pattern with variable strides, we initialized an array that contains the strides of the accesses. The predicted cost of this access pattern contains the cost of accessing the non-contiguous arrays as well as the cost of accessing the array of strides. The prediction error is below 15% (Figure 3.c). This error is caused by missing some of the cache misses in non-contiguous accesses; counting them requires maintaining the history of the cache lines that have already been fetched into the cache. Another reason for the prediction error in all these access patterns is that we use average miss penalties, which may not be accurate.
... processor on the Beowulf cluster (Fig 4 and 5). The prediction error is slightly higher for small data sizes, where the prefetching of this processor is effective. As the working set size increases, the L2 misses increase, and the prediction error is below 20% in these cases.
We also verified the performance of the loops in the NAS Parallel Benchmarks that perform the matrix transpose operation. We measured the performance of two variations of the matrix transpose algorithm from the Fast Fourier Transform program of the NAS Parallel Benchmarks. The first algorithm is a simple matrix transpose that copies the rows of one matrix to the columns of another matrix. The second algorithm uses a cache-blocking optimization to improve the performance. Both algorithms fit into the data access patterns explained in Section 2. The working set of the first algorithm increases with the dimension of the matrices. Due to the row-major ordering of arrays in C (or column-major ordering in Fortran), one matrix is accessed contiguously and the other is accessed non-contiguously with a fixed stride. The second algorithm makes sure that a block of data is fully utilized before it is replaced in the cache. In this algorithm, the two matrices are accessed non-contiguously with fixed strides. However, as the whole data block is reused before it is replaced, and we chose the block size such that it fits into the cache, the number of cache misses is much lower than in the unoptimized version of the matrix transpose. These experiments were performed on a Sun UltraSPARC-IIe processor node.
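The two transpose variants can be sketched as follows. This is an illustrative Python version over flat row-major lists, not the NAS benchmark code; the function names and the block size parameter `bs` are hypothetical.

```python
def transpose_naive(a, n):
    """Copy rows of `a` to columns of `b` (both n x n, flat, row-major).

    `a` is read contiguously; `b` is written with a fixed stride of n.
    """
    b = [0] * (n * n)
    for i in range(n):
        for j in range(n):
            b[j * n + i] = a[i * n + j]
    return b


def transpose_blocked(a, n, bs):
    """Cache-blocked transpose: finish one bs x bs tile before moving on."""
    b = [0] * (n * n)
    for ii in range(0, n, bs):
        for jj in range(0, n, bs):
            for i in range(ii, min(ii + bs, n)):
                for j in range(jj, min(jj + bs, n)):
                    b[j * n + i] = a[i * n + j]
    return b
```

Both functions produce the same result; they differ only in the order of memory references, which is the property the cost model is meant to capture. In the blocked version the tile size would be chosen so that the source and destination tiles fit in the cache together.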
As expected, the time per reference increases with the data size for the unoptimized transpose algorithm (Fig 6). The predicted values differ slightly from the measured values; the error is around 13%. In the second algorithm, the performance of the transpose improves due to the cache-blocking optimization (Fig 7). The prediction error was below 5% for most of the data sizes, but increased for large data sizes. This is mainly due to the increase in the average time per memory reference for the large data sizes.
An application of the model
The memory-communication cost for sending a data segment depends on architectural parameters, such as cache capacity, and on code characteristics, such as data distribution, as explained in the memory-logP model. In general, the overall communication cost includes the data-collection overhead, the cost of copying data to the network buffer, the cost of forwarding data to the receiver (network-communication cost), and other costs added by the middleware implementation. When the data distribution in memory is non-contiguous, the data is typically collected into a contiguous buffer before being copied to the network buffer. This process adds extra buffering overhead to the overall communication cost and is implementation dependent. The memory access cost predicted in this paper is a part of the latency (l) parameter of the memory-logP model.
... implementation to optimize the transfer of non-contiguous data. In practice, however, few MPI implementations provide derived datatypes in a way that performs better than what the user can achieve by manually packing data into a contiguous buffer and then calling an MPI function. Memory access cost has been the reason for this performance bottleneck. We use the memory-logP model with the help of the prediction formulae to predict this cost and apply memory access optimization techniques to improve the performance. Due to space restrictions, we cannot explain the optimization method here. Refer to [1] for full details.
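The manual packing of non-contiguous data into a contiguous buffer, which the discussion of MPI derived datatypes refers to, can be sketched as follows. The function names `pack` and `unpack` are hypothetical helpers, no MPI call is made, and a fixed block size with a fixed stride is assumed.

```python
def pack(buf, offset, block_size, stride, count):
    """Gather `count` strided blocks from `buf` into one contiguous buffer."""
    out = bytearray()
    for b in range(count):
        start = offset + b * stride
        out += buf[start:start + block_size]
    return bytes(out)


def unpack(packed, dest, offset, block_size, stride):
    """Scatter a contiguous buffer back into strided positions of `dest`."""
    for b in range(len(packed) // block_size):
        start = offset + b * stride
        dest[start:start + block_size] = packed[b * block_size:(b + 1) * block_size]
```

The gather loop reads the source with a fixed stride and writes the buffer contiguously, so its memory cost falls under the fixed-length block, fixed-stride pattern, and that is the part of the communication cost the prediction model addresses.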
Conclusion
Loop transformations and loop access reordering techniques improve memory access performance. To obtain the parameters for these loop optimizations, a simple, fast and accurate memory access cost prediction model is necessary. Such a model improves the standard of application-level optimizations and reduces the burden on programmers to learn the rapidly improving processor and computer architecture technology. Towards this goal, in this paper we proposed an analytical model to predict the memory access cost based on the data access patterns. We first classified the most common data access patterns in scientific computing applications. We then proposed a model to predict the memory access cost. We verified this model with measurements and showed that it is practical. The accuracy of our model is reasonable given its simplicity. We also applied this model to the matrix transpose routines in the Fast Fourier Transform program of the NAS benchmarks, which are implemented with different memory access patterns.
Our model is simple, effective, and easy to incorporate into memory cost tuning tools, where optimization parameters are to be found at runtime. Although prediction errors of 10% to 20% exist, the predictions are reasonably accurate for making optimization decisions. We are currently utilizing this model to improve the performance of MPI derived datatypes by optimizing the memory access cost. This cost prediction is a part of our memory-logP model, which emphasizes the importance of memory communication performance in point-to-point communication. Our model is practical because of its simplicity: it fits easily into any optimization library to choose optimization parameters dynamically at runtime. This is not possible with the existing models due to their complexity.
