Abstract
Introduction
Processor and memory speeds have been diverging at the rate of almost 50% a year, but architectural trends suggest that cache sizes will remain small to keep pace with decreasing processor cycle times. McFarland [10] shows that for a feature size of 0.1 micron and a 1-nsec cycle time, the L1 cache can be no bigger than 32 kilobytes to maintain a one-cycle access time, or 128 kilobytes for two-cycle latency. High memory latency, increasing CPU parallelism, poor cache utilization, and small cache sizes relative to application working sets all create a memory bottleneck that results in poor performance for applications with poor locality. Bridging the growing processor/memory performance gap requires innovative hardware and software solutions that use cache space and memory bandwidth more efficiently. Myriad approaches attempt to do this in different ways, and application and input characteristics highly influence the appropriate choice of optimizations. An analytic framework that models the costs and benefits of these optimizations can be useful in driving decisions.
Iteration space transformations [5, 11, 17] and data layout transformations [3, 8] are two classes of compiler optimizations that improve memory locality. Loop transformations, an example of the former, improve performance by changing the execution order of a set of nested loops so that the temporal and spatial locality of a majority of array accesses is increased. This class includes loop permutation, fusion, distribution, reversal, and tiling [11, 17].
Array restructuring is a common data layout transformation. It improves cache performance by changing the physical layout of arrays that are accessed with poor locality [8]. Static restructuring changes the compile-time layout of an array to match the way in which it is most often accessed. For example, the compiler might choose column-major order over row-major order if most accesses to an array are via column walks. Static restructuring is most useful when an array is accessed in the same way throughout the program. Dynamic restructuring creates a new array at run time, so that the new layout better matches how the data is accessed. This is most useful when access patterns change during execution, or when access patterns cannot be determined at compile time. Dynamic array restructuring is more widely applicable than static restructuring, and we focus on it in this study. The runtime change in array layout is most often accomplished by copying, and we refer to this optimization as copying-based array restructuring. We also consider the possibility of having smart memory hardware perform the restructuring [2]. Hardware mechanisms that support data remapping allow one to create array aliases that are optimal for a particular loop nest. We call this optimization remapping-based array restructuring. The tradeoffs involved differ from those of traditional copying-based array restructuring, and thus this optimization is sometimes a useful alternative when the latter is expensive.
However, hardware support does not come for free, and thus there is a need to determine automatically whether hardware support should be relied upon.
Loop transformations and array restructuring can be complementary, and often are synergistic. When applicable, loop transformations improve memory locality with no runtime overhead. However, it is often not possible to improve the locality of all arrays in a nest. For example, if an array is accessed via two conflicting patterns (e.g., a[i][j] and a[j][i]), no loop ordering can improve locality for all accesses. Furthermore, loop transformations cannot be applied when there are complex loop-carried dependences, insufficient or imprecise compile-time information, or non-trivial imperfect loop nests. In contrast, array restructuring can always be applied, since it affects only the locality of the target array. However, all forms of dynamic restructuring incur overheads, and these costs must be amortized across the accesses with improved locality for the restructuring to be profitable. Loop and array restructuring can be integrated, and the best choice depends on which combination has the minimum overall cost.
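The conflict between a[i][j] and a[j][i] can be made concrete by computing the stride each reference sees in the innermost loop. The following is a minimal sketch, assuming row-major storage and subscripts that are single loop variables; the function name and its arguments are hypothetical and are not part of the framework described in this paper.

```python
def innermost_stride(subscripts, innermost, dims):
    """Distance in elements between successive innermost-loop accesses.

    subscripts: loop variable used in each array dimension, e.g. ("i", "j")
    innermost:  name of the innermost loop variable
    dims:       extent of each array dimension (row-major storage assumed)
    """
    stride = 0
    elements_per_step = 1
    # Walk dimensions from the fastest-varying (last) to the slowest (first).
    for var, extent in zip(reversed(subscripts), reversed(dims)):
        if var == innermost:
            stride += elements_per_step
        elements_per_step *= extent
    return stride


N = 512
for inner in ("i", "j"):
    s_ij = innermost_stride(("i", "j"), inner, (N, N))  # a[i][j]
    s_ji = innermost_stride(("j", "i"), inner, (N, N))  # a[j][i]
    print(f"innermost={inner}: a[i][j] stride {s_ij}, a[j][i] stride {s_ji}")
```

Whichever loop is placed innermost, one of the two references has unit stride and the other has stride N, which is why loop permutation alone cannot help both.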
Others have integrated data restructuring and loop transformations [3, 5], but their approaches consider only static array restructuring and do not provide any mechanism to determine whether the integration will be profitable. In this paper, we present cost models that capture the cost/performance tradeoffs of loop transformations, copying-based array restructuring, and remapping-based restructuring. We use an integrated cost framework to decide which optimizations to apply, either singly or in combination, for any given loop nest. Code optimized using our cost model achieves performance within 95% of the best observed for a set of eight benchmarks. In contrast, the performance of any fixed optimization is at best 76% of the best combination we observed.
Restructuring Optimizations
Consider the simple example loop nest, irkernel, in Figure 1(a). If we assume row-major storage, two of the arrays are accessed sequentially, with U having good spatial locality and V having good temporal locality. Unfortunately, W is accessed along its columns, and X is accessed diagonally; neither will enjoy good cache performance. In this section, we review the candidates for integrated restructuring and examine their effects on irkernel.
Loop Transformations
The loop transformations we consider in this paper include loop permutation, loop fusion, loop distribution, and loop reversal. McKinley et al. [11] choose between candidate loop transformations based on a simple cost model.
Copying-based Array Restructuring
Array restructuring directly improves the spatial locality of array accesses. It makes sense to store one array in row-major order if it is accessed row-by-row and another array in column-major order if it is accessed column-by-column. Array restructuring generalizes this idea to any direction. For instance, Figure 2 shows how the skewed access pattern in the original array becomes sequential after array restructuring. We say that an array reference is in array order if its access pattern in the loop matches its storage order. When applying copying-based array restructuring to the loop nest in Figure 1(
Remapping-based Array Restructuring
In this section, we briefly explain the details of the hardware mechanism we use to implement remapping-based restructuring, and show how we apply the optimization to the example irkernel.
The Impulse adaptable memory system expands the traditional virtual memory hierarchy by adding address translation hardware to the main memory controller (MMC) [2]. The memory controller provides an interface for software (e.g., the operating system, compiler, or runtime libraries) to remap physical memory to support different views of data structures in the program. Thus, applications can improve locality by controlling how the new view of the data is created. To use Impulse's address remapping, an application performs a mapshadow system call indicating what kind of remapping to create (e.g., a virtual transpose of an array), along with the original starting virtual address, the element size, and the dimensions. The OS configures the Impulse MMC to respond to accesses to this remapped data appropriately. When the application accesses a cache line at an address corresponding to a remapped data structure, the Impulse MMC determines the real physical addresses corresponding to the remapped elements requested and loads them from memory. To support this functionality, the MMC contains both a pipelined address calculation unit and a TLB. The elements are loaded into an output buffer by an optimized DRAM scheduler and then sent back to the processor to satisfy the load request.
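To illustrate the kind of address arithmetic such a controller performs, the following minimal sketch computes which elements of a row-major array back one cache line of its virtual transpose. It models only the index calculation, not the Impulse interface, its TLB, or its DRAM scheduler; the function name, its parameters, and the default 128-byte line size (taken from the L2 configuration used later in the evaluation) are illustrative assumptions.

```python
def transpose_gather_addresses(base, elem_size, rows, cols, line_index,
                               line_bytes=128):
    """Addresses in the original rows x cols row-major array that back one
    cache line of its virtual transpose (a cols x rows row-major view)."""
    elems_per_line = line_bytes // elem_size
    first = line_index * elems_per_line
    last = min(first + elems_per_line, rows * cols)
    addresses = []
    for t in range(first, last):
        # Position of element t inside the transposed view.
        r_t, c_t = divmod(t, rows)
        # transposed[r_t][c_t] is original[c_t][r_t].
        addresses.append(base + (c_t * cols + r_t) * elem_size)
    return addresses


# One remapped line of a 4 x 3 array of 8-byte elements, with 32-byte lines:
# the four gathered elements are one original row (24 bytes) apart.
print(transpose_gather_addresses(0x1000, 8, rows=4, cols=3,
                                 line_index=0, line_bytes=32))
```

The gathered addresses are separated by a full row of the original array, which is why a remapped cache line must be assembled from disjoint regions of physical memory.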
We flush rX from the cache to maintain coherence when the offset changes.
After applying the remapping-based restructuring optimization, all accesses are in array order. The cost of setting up a remapping is quite small compared to copying. However, subsequent accesses to the remapped data structure are slower than accesses to an array restructured via copying, because the memory controller must re-translate addresses on the fly and "gather" cache lines from disjoint regions of physical memory. Our cost model accounts for this greater access latency for remapped data when considering which restructuring optimization to apply, if any.
Analytic Framework
To optimize a program, a compiler must perform the following two steps. First, it decides upon a part of the program to optimize and a particular transformation to apply to it. Second, it transforms the program and verifies that the transformation either does not change the meaning of the program or changes it in a way that is acceptable to the user. Bacon et al. [1] compare the first step to a black art, because it is difficult and poorly understood. In this section, we explore this first step in the domain of restructuring optimizations. We present analytic cost models for each of the optimizations, and analyze tradeoffs in individual restructuring optimizations, both qualitatively and quantitatively. We show why it might be beneficial to consider them in an integrated manner, and present a cost framework that allows integrated restructuring optimizations to be evaluated.
Modeling the Restructuring Strategies

Basic Model
We need careful cost/benefit analysis to determine when compiler optimizations may be profitably applied, because finding the best choice via detailed simulation or hardware measurements is time-consuming and expensive. We have developed an analytic model to estimate the memory costs of applications at compile time. Like McKinley, Carr, and Tseng [11], we estimate the number of cache lines accessed in the loop nests.
The memory cost of a single array reference, say Ra, in a loop nest is directly proportional to the number of cache lines accessed by it. Suppose cls is the cache line size of the cache closest to memory, stride is the distance in elements between successive accesses to this array, and f is the fraction of the cache lines reused from a previous iteration of the innermost loop. We estimate the memory cost of a single array reference Ra to be:
We do not have a framework for estimating f accurately. We expect f to be very small for large working sets, which allows us to neglect the (1 - f) factor [...] and by array U is [...].
We estimate the memory cost of a loop nest to be the sum of the memory costs of the independent array references in the nest. We define two array references to be independent if they access different cache lines in each iteration. This characterization is necessary to avoid overcounting cache lines. Thus, if both a[i][j] and [...] are present, we consider them as one. The cost of a loop nest depends on loopTripCount (the total number of iterations of the nest), the spatial and temporal locality of the array references, the stride of the arrays, and the cache line size. We estimate the memory cost of the i-th loop nest in the program to be:

$$\mathrm{MemoryCost}(L_i) = \sum_{\alpha\,:\,\mathrm{independentRef}} \mathrm{MemoryCost}(R_\alpha)$$
The memory cost of the entire program is estimated as the sum of the memory costs of all the loop nests. If there are n loop nests in a program, then the memory cost of the program is:
For the example code in Figure 1(a), the total memory
The goal of the cost model is to choose the combination of array and loop restructuring optimizations that has minimum memory cost for the entire program. We assume that the relative order of memory costs determines the relative ranking of execution times. The above formulation of the total memory cost of an application as the sum of the memory costs of individual loop nests makes an important assumption: the compiler knows the number of times each loop nest executes and can calculate the loop bounds of each loop nest at compile time. This assumption holds for many applications. In cases where it does not hold, we expect to be able to make empirical assumptions about the frequencies of loop nests. We also assume that the cache line size of the cache closest to memory is known to the compiler.
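The basic model can be sketched in a few lines of Python. This is a minimal sketch rather than the authors' implementation: it assumes that a reference with a given stride touches one new cache line every max(cls/stride, 1) accesses, that cls and stride are both measured in array elements with stride at least 1, and that reuse scales the result by (1 - f). The function names and the example trip counts are hypothetical.

```python
def reference_cost(trip_count, stride, cls, f=0.0):
    """Estimated cache lines fetched by one array reference.

    trip_count: number of times the reference executes
    stride:     distance in elements between successive accesses (>= 1)
    cls:        cache line size in elements (assumed unit)
    f:          fraction of lines reused from a previous iteration
    """
    accesses_per_line = max(cls / stride, 1)
    return trip_count / accesses_per_line * (1.0 - f)


def nest_cost(independent_refs, cls):
    """Sum over the independent references of one loop nest.

    independent_refs: iterable of (trip_count, stride) pairs
    """
    return sum(reference_cost(t, s, cls) for t, s in independent_refs)


def program_cost(nests, cls):
    """Sum over all loop nests, each given as a list of (trip_count, stride)."""
    return sum(nest_cost(refs, cls) for refs in nests)


N, cls = 64, 16
unit_stride = reference_cost(N**3, 1, cls)   # sequential walk
column_walk = reference_cost(N**3, N, cls)   # stride of one full row
print(unit_stride, column_walk)
```

With these assumptions a unit-stride reference costs loopTripCount/cls lines, while any reference whose stride is at least cls costs one line per access.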
Modeling Loop Transformations
When we consider only loop transformations, the recommendations from our model are the same as those of the simpler model by McKinley et al. [11]. However, their approach does not consider array restructuring. We consider the total memory cost of the loop nests, which allows us to compare the cost of loop transformations with that of independent optimizations such as array restructuring.
Modeling Array Restructuring
The cost of array restructuring is the sum of the initial cost of creating the new array and that of the optimized loop nest. In addition, there is the cost of updating the original array if the new array was modified. The cost of setup is the sum of the memory costs of the original array and the new array in the setup loop.
In Figure 1(c), the memory cost of creating array cX in the setup loop is [...] and the memory cost of array X in the setup loop is [...]. Thus, the setup cost of [...] is usually the case. The calculation for array cW is similar.
The cost of the optimized array reference in the loop nest is:
The cost of the loop nest with optimized references, cX and cW, is $\frac{1}{cls} \times (2N^3 + N^2 + N)$. Array restructuring is expected to be profitable if:
For this example, the total cost of the array-restructured program (assuming cls = 16) is ([...]), while the cost of the original program is ([...] + [...] + N^2). The latter is larger for almost all N, and so array restructuring is assumed to be always profitable for this particular loop nest. Simulation results bear this decision out.
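The profitability test for copying-based restructuring can be checked numerically. The sketch below is illustrative only. The optimized nest cost, (2N^3 + N^2 + N)/cls, is the expression given above; the original nest cost and the setup costs are assumptions made for this example (W and X each fetch a new line on every iteration, each source array is read with one line per element during setup, and each new array is written sequentially). The function names are hypothetical.

```python
def copying_total_cost(setup_read, setup_write, optimized_nest, copy_back=0.0):
    """Setup cost (source read plus new-array write) plus the optimized nest,
    plus an optional copy-back if the new array was modified."""
    return setup_read + setup_write + optimized_nest + copy_back


def copying_is_profitable(original_nest, setup_read, setup_write,
                          optimized_nest, copy_back=0.0):
    """Restructuring pays off when its total cost is below the original cost."""
    total = copying_total_cost(setup_read, setup_write, optimized_nest, copy_back)
    return total < original_nest


N, cls = 64, 16
optimized = (2 * N**3 + N**2 + N) / cls     # expression from the text
original = 2 * N**3 + (N**2 + N) / cls      # assumption: W and X miss every access
setup_read = 2 * N**2                       # assumption: strided reads of W and X
setup_write = 2 * N**2 / cls                # assumption: sequential writes of cW and cX
profitable = copying_is_profitable(original, setup_read, setup_write, optimized)
print(optimized, original, profitable)
```

Because the setup terms grow as N^2 while the savings grow as N^3, the copy is amortized quickly under these assumptions.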
Modeling Remapping-based Restructuring
The overhead costs involved in remapping-based array restructuring include the cost of remapping setup, the cost of updating the memory controller, and the cost of cache flushes. The initial cost of setting up a remapping through the mapshadow call is dominated by the cost of setting up the page table to cache virtual-to-physical mappings. The size of the page table depends on the number of elements that are to be remapped. We model this cost as K1 × #elementsToBeRemapped. Updating the remapping information prior to entering the innermost loop every time using the remapshadow system call incurs a fixed cost, which we model as K2. We also model the cost of flushing as being proportional to the number of cache lines flushed, where the constant of proportionality is K3. We have empirically estimated these constants using microbenchmarks.
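The three overhead terms combine into a simple linear expression. The following is a minimal sketch; the function name is hypothetical and the constants used in the example are placeholders, not the empirically estimated values of K1, K2, and K3.

```python
def remapping_overhead(elements_remapped, remap_updates, lines_flushed,
                       k1, k2, k3):
    """Overhead of remapping-based restructuring.

    k1: setup cost per remapped element
    k2: fixed cost of each remapping update
    k3: cost per cache line flushed
    """
    setup = k1 * elements_remapped
    updates = k2 * remap_updates
    flushes = k3 * lines_flushed
    return setup + updates + flushes


# Placeholder constants, chosen only to exercise the formula.
print(remapping_overhead(1000, 10, 50, k1=0.5, k2=20.0, k3=2.0))
```

This overhead is added to the cost of the remapped references in the loop nest before the result is compared against the other candidates.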
The memory cost of an array reference optimized with remapping support is:
1)
In summary, multiplying the number of cache lines gathered by the cost of gathering the cache lines makes remapping-based array restructuring comparable with pure software transformations such as copying-based array restructuring and loop transformations.
Returning to our example in Figure 1 [...]. This is less than the original program's memory cost, and thus remapping-based restructuring is correctly assumed to be profitable for this example.
Integrated Restructuring
By integrated restructuring we mean two things. First, the individual restructuring optimizations can be combined to optimize an application in complementary ways. For example, we could combine loop permutation, loop fusion, and copying-based array restructuring to optimize a single loop nest. Second, we should be able to select a good combination of restructuring optimizations from the legal options given the application and its inputs. Researchers have provided heuristics-driven algorithms for combining loop and array restructuring [3, 5], but there has not been any work on providing a framework for choosing the right set of optimizations among restructuring optimizations. Our cost model framework for selecting the right combination of restructuring optimizations is based on Equations 1-9. These equations allow us to decide which optimization to choose for a given application and input size. Loop transformation incurs no run-time costs and thus is the optimization of choice when it succeeds in rendering all references in array order. In the presence of conflicting array access patterns, however, loop transformations cannot improve the locality of all arrays. In such cases, array restructuring can be applied to individual references that lack the desired locality. The choice of what loop transformation to use might therefore depend on which array(s) are cheaper to restructure.
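The selection step itself amounts to taking the minimum estimated cost over the legal candidates. The sketch below assumes that each candidate's total memory cost (including any setup overhead) has already been estimated; the function name and the numbers in the example are hypothetical.

```python
def select_optimization(estimated_costs):
    """Return the candidate with the lowest estimated memory cost.

    estimated_costs maps a candidate name ("none", "L", "C", "R", "L+C",
    "L+R") to its estimated total cost, or to None when the candidate is
    illegal or inapplicable for the loop nest.
    """
    legal = {name: cost for name, cost in estimated_costs.items()
             if cost is not None}
    if not legal:
        raise ValueError("no legal candidate, not even the unoptimized nest")
    return min(legal, key=legal.get)


# Made-up estimates for one loop nest and one input size.
costs = {"none": 524548.0, "L": 290000.0, "C": 41732.0,
         "R": 60000.0, "L+C": 37000.0, "L+R": None}
print(select_optimization(costs))
```

Because the estimates depend on loop bounds, the same loop nest can yield a different choice for a different input size.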
We analyze the example loop nest in Figure 1(a) to see whether integrated restructuring improves performance. Recall that loop transformation could not improve the locality of array X, and array restructuring incurred the costs of creating the arrays cX and cW through copying. Consider the integrated restructuring optimization that permutes the loop nest into jik order and uses array restructuring to improve the locality of X. This [...] Table 3).
The same loop nest can be optimized using loop transformation with remapping-based restructuring (L+R), in which case the loop transformation would bring array W into array order, and array X would be remapped to rX, as in Figure 1(d).
In general, selecting the optimal set of transformations is a non-trivial problem. Finding the combination of loop fusion optimizations alone for optimal temporal locality has been shown to be NP-hard [6]. The problem of finding the optimal data layout between different phases of a program has also been proven to be NP-complete [7]. As a result, researchers have used heuristics to make integrated restructuring tractable. Our analytical cost framework can help evaluate various optimizations, and we show later in our results that cost-model-driven optimization comes closer to the best possible performance than any fixed optimization strategy.
Evaluation
To perform our studies, we used URSIM, an execution-driven simulator derived from RSIM [13]. URSIM models in detail a microprocessor similar to a MIPS R10000, a split-transaction MIPS R10000 bus that supports a snoopy coherence protocol, and the Impulse memory controller [12]. The processor modeled is four-way issue, out-of-order, and superscalar with a 64-entry instruction window. The L1 data cache is 32KB, non-blocking, write-back, virtually indexed, physically tagged, two-way associative, with 32-byte lines and has a one-cycle latency. The L2 data cache is 512KB, non-blocking, write-back, physically indexed, physically tagged, two-way set associative, with 128-byte lines and has an eight-cycle latency. The instruction cache is assumed to be perfect.
The TLB maps both instructions and data, has 128 entries, and is single-cycle, fully associative, and software-managed. The bus multiplexes addresses and data, is eight bytes wide, and has a three-cycle arbitration delay and a one-cycle turn-around time. The system bus, memory controller, and DRAMs all run at one-third the CPU clock rate. The memory supports critical word first and returns the critical quad-word for a load request 16 bus cycles after the corresponding L2 cache miss. The memory system models eight banks, pairs of which share an eight-byte wide bus between DRAM and the MMC.

Table 2 shows which optimizations we considered for each benchmark. The optimization candidates are copying-based array restructuring (C), remapping-based restructuring (R), loop transformations (L), a combination of loop and copying-based restructuring (L+C), and a combination of loop and remapping-based restructuring (L+R). A check mark indicates that the optimization was possible, N indicates that the optimization was not needed, and I indicates that the optimization was either illegal or inapplicable. In our study, we hand-coded all optimizations; work is ongoing to add our cost model to the Scale compiler [16] to automate the transformations. We ran each benchmark for ten different input sizes, with the smallest input size typically just fitting into the L2 cache. Whenever there were several choices for a single restructuring strategy, we chose the best option. In other words, the results that we report for L are for the best loop transformation (loop permutation, fusion, distribution, or reversal) among the ones we implemented. Similarly, the results for L+R are for the best combination of loop and remapping-based array restructuring that we implemented. Table 3 presents the geometric mean speedup obtained for each optimization compared to the baseline benchmark over the range of input sizes.
In addition to the performance obtained by applying only a static optimization per benchmark, we also present the results obtained when our cost model is used to select dynamically which optimization to perform for a given input size (CM-driven) and the post-facto best optimization selection for each input size (Best). We refer to best performance as that resulting from making the best choices among the optimizations we consider.
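The summary metrics described here are straightforward to compute. The following minimal sketch shows a geometric mean over input sizes and the post-facto "Best" selection per input size; the per-input speedups are made-up values, and only the overall figures 1.68 and 1.77 come from the results reported in Table 3. The function names are hypothetical.

```python
import math


def geometric_mean(values):
    """Geometric mean of a sequence of positive speedups."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def best_per_input(speedups_by_opt):
    """Post-facto best speedup for each input size.

    speedups_by_opt maps an optimization name to a list of speedups,
    one entry per input size, in the same order for every optimization.
    """
    return [max(column) for column in zip(*speedups_by_opt.values())]


# Made-up speedups for two fixed strategies over three input sizes.
speedups = {"C": [1.2, 1.5, 0.9], "L+R": [1.4, 1.1, 1.3]}
best = best_per_input(speedups)
print(best, geometric_mean(best))

# Fraction of the best possible speedup achieved by CM-driven selection,
# using the overall values reported in Table 3.
print(round(100 * 1.68 / 1.77, 1))
```

A fixed strategy can never exceed "Best", since "Best" takes the maximum separately for each input size.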
Benchmarks
[Table 2: Optimizations (C, R, L, L+C, L+R) considered for each benchmark, including irkernel and kernel6.]
Results
As can be seen from Table 3, CM-driven optimization obtains an average of 94.9% of the best possible speedup (1.68 versus 1.77), whereas the best single optimization (in this case, copying-based array restructuring) obtained only 77.0% of the best possible speedup (1.35 versus 1.77). The reason for the good performance of cost-model-driven optimization is that the best optimization strategy is highly application and input dependent. Even within the same benchmark, the best choice is dependent on the size of the input data. For example, in syr2k, the cost model was able to pick between C and L+R and got a higher speedup (1.88) than either C or L+R (i.e., it picked C when L+R was bad, and vice versa). Overall, for most of the benchmarks, our cost model was usually able to select the correct strategy to employ, and when it failed to pick the best strategy, the choice it made was generally very close to the post-facto best choice among the restructuring optimizations. To better understand why the cost model worked well in most cases, and poorly in a few, we discuss each benchmark program in turn.

[Table 3: Mean speedup obtained when each of these choices was held fixed for all inputs, the cost-model-driven speedup, and the best possible speedup. Overall row: 1.35, 1.23, 1.10, 1.04, 1.09, 1.68, 1.77.]

matmult: matmult involves multiplying an N by M matrix by an M by L matrix to get the product matrix. One array is walked by columns, and these strided accesses incur many TLB and cache misses. Remapping-based or copying-based array restructuring can be used to get unit-stride access for this array. This is the only application for which loop permutation alone is sufficient to achieve unit stride for all arrays in the loop nest. While array restructuring improves the performance of matmult, loop restructuring can do so with lower setup costs. Our cost model recognized this situation and recommended loop permutation over the other choices.
Though loop permutation was the best choice for a majority of cases, conflict misses caused remapping to be better for some data sizes. Since conflict misses are not modeled, the cost model incorrectly recommended loop permutation for such cases, and achieved a lower speedup than the best possible one. 
vpenta: This benchmark accesses eight two-dimensional arrays in seven loop nests with large strides. Loop permutation was used to optimize the two most expensive loop nests. The remaining loop nests had strided accesses that loop transformations could not optimize. For the remaining array references with strided accesses, we considered remapping-based restructuring. The cost model recommendations performed 4% worse than a fixed choice of L+R and 9% worse than best. The model did not account for a "side effect" of remapping, whereby for input sizes that are a power of two, remapping eliminates a significant number of conflict misses. Our cost model does not account for cache conflict effects, and thus in this case it underestimates the potential benefits that remapping can achieve. A more sophisticated cost model that employed cache miss equations [4] or a similar mechanism might be able to handle this case more effectively.
btrix: The innermost loop was written to be vectorizable, and involves strided accesses across four four-dimensional arrays. The access pattern of one array conflicted with that of the other three arrays. As a result, loop permutation alone could not bring all these arrays into array order. There were six non-obvious optimization candidates (two choices of L+R, two of L, and one each of L+C and C) involving loop fusion, permutation, and array restructuring. Our cost model recommended copying-based array restructuring (C). In contrast, Leung's heuristics-based decision model recommended that copying-based array restructuring not be done [8]. Our experiments validated our cost model, as we obtained a speedup of 49% despite the overhead of copying. Though the cost model framework correctly predicted that array restructuring would be beneficial, it correctly predicted that loop transformations alone were the best choice for all experiments but one. The single misprediction occurred when the array dimensions were a power of two, which induced many conflict misses. Combining loop transformations with array restructuring was not as beneficial because the overhead of creating the restructured array was higher than the benefit derived.
cfft2d: This application implements a two-dimensional FFT. Previous [...] with various loop and data set sizes. Our cost model accurately chose the best optimization in 21 cases, and even in the single misprediction the performance was very close. When we used the cost model to select the optimizations to perform, we achieved an overall geometric mean speedup of 1.99, whereas the best overall mean speedup was 2.00. In contrast, applying remapping (R) or copying (C) exclusively results in an average speedup of only 1.92 and 1.24, respectively.

In summary, we find that quantifying the overheads and access costs of the various optimizations allowed us to make good decisions about which optimization(s) to choose. The recommendations made by our cost model resulted in a mean overall speedup within 5% of the best combination of optimizations that we considered, whereas the best performance from any single optimization choice was 23% less than the best combination.
Caveats
Our cost framework has a number of known inaccuracies; improving its accuracy is an area for future work. First, we do not consider the impact of any optimization on TLB performance. For some input sizes, TLB effects can dwarf cache effects, so adding a TLB model is an interesting open issue. Second, we do not consider the impact of latency-tolerating features of modern processors, such as hit-under-miss caches and out-of-order issue instruction pipelines, in our cost model. This may lead us to overestimate the impact of cache misses on execution time. For example, multiple simultaneous cache misses that can be pipelined in the memory system have less impact on program performance than spread-out cache misses, but our model gives them equal weight. Third, we do not consider cache reuse (i.e., we estimate f to be zero) or cache interference. We assume that the costs of different loop nests are independent, and thus additive. If there is significant inter-nest reuse, our model will overestimate the memory costs and recommend an unnecessary optimization. We do not have a framework for calculating f, and would benefit from a framework such as the cache miss equations proposed by Ghosh et al. [4]. Similarly, not modeling cache interference could result in the model underestimating the memory costs if there are significant conflict misses in the optimized applications.
Related Work
Leung and Zahorjan [9] introduce array restructuring, a technique to improve locality of array-based scientific applications. They do not propose a profitability analysis framework for determining when the optimization should be applied. Instead, they use a simple heuristic that takes neither the size of the array nor the loop bounds into account. Consequently, their decisions are always fixed for a particular application, regardless of input.
Ghosh et al. [4] introduce Cache Miss Equations (CMEs), a precise analytical representation of cache misses in a loop nest. CMEs have some drawbacks, as they also cannot handle multiple loop nests. However, they represent a promising step towards being able to accurately model the cache behaviour of regular array accesses and can be used to further enhance the accuracy of our model.
Conclusions and Future Work
The widening processor-memory performance gap makes locality optimizations increasingly important. Optimizations such as loop transformations and array restructuring are effective, but do not have the same mileage in every application. Past studies have attempted to integrate the restructuring optimizations but our work is the first attempt to analytically model the costs and benefits of integration. This paper demonstrates that modeling the memory costs of applications as a whole allows us to compare multiple locality optimizations in the same framework. The accuracy of our cost model is encouraging, given its simplicity. This model can be used as the basis for a wider integration of locality strategies, including tiling, blocking, and other loop transformations. We also show how hardware support for remapping from an adaptable memory controller can be used to support data restructuring. Such hardware support enables more diverse kinds of data restructuring techniques. In general, we make the case for combining the benefits of software and hardware-based restructuring optimizations in the best possible manner, and provide a framework for reasoning about the combined effects of optimizations.
