When parallel programs are executed on multiprocessors with private caches, a set of data may be repeatedly used and modi ed by di erent threads. Such data sharing can often result in cache thrashing, which degrades memory performance. This paper presents and evaluates a loop restructuring method to reduce or even eliminate cache thrashing caused by true data sharing in nested parallel loops. This method uses a compiler analysis which applies linear algebra and the theory of numbers to the subscript expressions of array references. Due to this method's simplicity, it can be e ciently implemented in any parallel compiler. Experimental results show quite signi cant performance improvements over existing static and dynamic scheduling methods.
Introduction
Parallel processing systems with memory hierarchies have become quite common today. Commonly, most multiprocessor systems have a local cache in each processor to bridge the speed gap between the processor and the main memory. Some systems use multi-level caches 5, 14] . Very often, a copy-back snoopy cache protocol is employed to maintain cache coherence in these multiprocessor systems. Certain supercomputers also use a local memory which can be viewed as a program-controlled cache. When programs with nested parallel loops are executed on such parallel processing systems, it is important to assign parallel loop iterations to the processors in such a way that unnecessary data movement between di erent caches is minimized. For convenience, in this paper, we call each iteration of a parallel loop a thread. The following loop nest is an example. In this example, loop J is a parallel loop because, with the value of I xed, statement S has no loop-carried dependences, while J varies. Loop J is executed 100 times in the loop nest, creating 10,000 iterations, or 10,000 threads, in total. Each thread addresses 100 elements of array A. Many array elements are repeatedly accessed by these threads as shown in Table 1 and Figure 1 , where T i;j denotes the thread corresponding to loop index values I = i and J = j, while T i denotes the set of threads created by the index value I = i. As shown in Figure 2 , there exist lists of threads: (T 1;1 ), (T 1;2 , T 2;1 ), (T 1;3 , T 2;2 , T 3;1 ), (T 1;4 , T 2;3 , T 3;2 , T 4;1 ), ......, such that each thread modi es and reuses most of the array elements accessed by the neighboring threads in the same list. If the threads in the same list are assigned to di erent processors, the data of array A will unnecessarily move back and forth between di erent caches in the system, causing a cache thrashing problem due to true data sharing 12].
The nested loop construct shown in the above example is quite common in parallel code used for scienti c computation. J. Fang and M. Lu studied a large number of programs including the LINPACK benchmarks, the PERFECT Club benchmarks, and programs for mechanical CAE, computational chemistry, image and signal processing, and petroleum applications 11]. They reported that almost all of the most time-consuming loop nests contain at least three loop levels, out of which 60% contain at least one parallel loop. Even after using loop interchange to move parallel loops outwards when it was legal, they still found 94% of the parallel loops enclosed by sequential loops. Such loop nests include the cases in which a parallel loop appears in the outermost loop level in a subroutine, but the subroutine is called by a call-statement which is contained by a sequential loop. Most of these loop nests are not perfectly nested, i.e. there exist statements right before or after an inner loop. Fang and Lu proposed a thread alignment algorithm to solve the cache thrashing problem which may be created by such multi-nested loops. Their algorithm, however, assigns threads to processors either by solving linear equations at run time or by storing the precomputed numerical solutions to the equations in the processor's memory. Since storing all the numerical solutions requires quite a large memory space, and since the exact number of threads often cannot be determined statically due to unknown loop bounds, they favor the on-line computation approach. In this paper, we present a method to reduce the runtime overhead in Fang and Lu's algorithm by using a thorough compiler analysis of array references to derive a closed-form formula that solves the thread assignment equations. The thread assignment then becomes highly e cient at run time. Previously, we presented preliminary algorithms 22, 23 ] to deal with a simple case in which data-dependent array references use the same linear function in the subscripts. No experimental data were given. In this paper, we extend the work by covering multiple linear functions and by clarifying the underlying theory. We report experimental results using a Silicon Graphics (SGI) multiprocessor. With our method, the compiler analyzes the data dependences between the threads and uses that information to restructure the nested loop, perfectly nested or otherwise, in order to reduce or even eliminate true data sharing, which causes cache thrashing. Our method can be e ciently implemented in any parallel compiler, and our experimental results show quite signi cant improvement over existing static and dynamic scheduling methods.
In what follows, we rst address related work. We then introduce basic concepts and assumptions. After that, we present solutions to the cache thrashing problem due to true data sharing, and lastly we show the experimental results conducted on an SGI multiprocessor system.
Related Work
Extensive research regarding e cient memory hierarchies has been reported in the literature.
Abu-Sufah, Kuck and Lawrie use loop blocking to improve paging performance by improving the locality of references 2]. Wolfe proposes iteration space tiling as a way to improve data reuse in a cache or a local memory 35]. Gallivan, Jalby and Gannon de ne a reference window for a dependence as the variables referenced by both the source and the sink of the dependence 15, 16] . After executing the source of the dependence, they save the associated reference window in the cache until the sink has also T2  T1   T3,1   T2,1   T1,2  T1,3   T2,2 T100,1 . These optimizations all attempt to maximize the reuse of cached data on a single processor. They also have a secondary e ect of improving multiprocessor performance by reducing the bandwidth requirement of each processor, thereby reducing contention in the memory system. In contrast, our work considers a multiprocessor environment where each processor has its own local cache or its own local memory, and where di erent processors may share data. The work by Peir and Cytron 28], Shang and Fortes 30] , and by D' Hollander 9] share the common goal of partitioning an index set into independent execution subsets such that the corresponding loop iterations can execute on di erent processors without interprocessor communication. Their methods apply to a speci c type of loop nest called a uniform recurrence or a uniform dependence algorithm, in which the loops are perfectly nested, the loop bounds are constant, the loop-carried dependences have constant distances, and the array subscripts are of the form i + c, where i is a loop index and c an integer constant. Hudak and Abraham 1, 18] develop a static partitioning approach called adaptive data partitioning (ADP) to reduce interprocessor communication for iterative data-parallel loops. They also assume perfectly nested loops. The loop body is restricted to update a single data point A(i; j) within a two-dimensional global matrix A. The subscript expressions of right-hand side array references are restricted to be the sum of a parallel loop index and a small constant, while the subscript expressions of left-hand array references are restricted to contain the parallel loop indices only . Tomko and Abraham T1,2  T2,1   T1,3  T2,2  T3,1   T1,4  T2,3  T3,2  T4,1   T1,99  T2,98  T3,97   T99,1   T1,100  T2,99  T3,98   T99, 32] develop iteration partitioning techniques for data-parallel application programs. They assume that there is only one pair of data access functions and that each loop index variable can appear in only one dimension of each array subscript expression. Agarwal, Kranz, and Natarajan 3] propose a framework for automatically partitioning parallel loops to minimize cache coherence tra c on shared-memory multiprocessors. They restrict their discussion to perfectly nested doall loops, and assume rectangular iteration spaces. Unlike these previous works, our work considers nested loops which are not necessarily perfectly nested. Loop bounds can be any variables, and array subscript expressions are much more general. Many researchers have studied the cache false sharing problem 10, 17, 19, 33] in which cache thrashing occurs when di erent processors share the same cache line of multiple words, although the processors do not share the same word. Many algorithms have been proposed to reduce false sharing by better memory allocation, better thread scheduling, or by program transformations. Our work considers cache thrashing which is due to the true sharing of data words. Our work is most closely related to the research done by Fang and Lu 11, 12, 26, 13] . In their work, the iteration space is partitioned into a set of equivalence classes, and each processor uses a formula to determine which iterations belong to the same equivalence class at execution time. Each processor then executes the corresponding iterations so as to reduce or eliminate cache thrashing. These iterations are the solution vectors of a linear integer system. In Fang and Lu's work, these vectors may either be computed at run time or may be precomputed and later retrieved at run time when loop bounds are known before execution. Both approaches require additional execution time when a processor fetches the next iteration. Unlike Fang and Lu's approaches, we solve the thrashing problem at compile time to reduce run-time overhead, while we achieve the same e ect of reducing cache thrashing. Our new method restructures the loops at compile time and is based on a thorough analysis of the relationship between the array element accesses and the loop indices in the nested loop. We have performed experiments on a commercial multiprocessor, namely a Silicon Graphics Challenge Cluster, thereby obtaining real data regarding cache thrashing and its reduction. In contrast, previous data were mainly from simulations 11, 12, 26] .
Basic Concepts and Assumptions
Data dependences between statements are de ned in 6, 24, 4, 25] . If a statement S 1 uses the result of another statement S 2 , then S 1 is ow-dependent on S 2 . If S 1 can safely store its result only after S 2 fetches the old data stored in that location, then S 1 is anti-dependent on S 2 . If S 1 overwrites the result of S 2 , then S 1 is output-dependent on S 2 . A dependence within an iteration of a loop is called a loop-independent dependence. A dependence across the iterations of a loop is called a loop-carried dependence.
There can be no cache thrashing due to true data sharing if the outermost loop is parallel, because no data dependences will cross the threads. Therefore, in this paper, we consider only loop nests whose outermost loops are sequential. To simplify our discussion, we make the following assumptions about the program pattern: 1) All functions representing array subscript expressions are linear. 2) The loop construct considered here consists of a sequential loop which embraces one or several single-level parallel loops. If there exist multilevel parallel loops, only one level is parallelized, as on most commercial shared-memory multiprocessor systems. Hence, as shown below, a loop nest in our model has three levels: a parallel loop, its immediately enclosed sequential loop, and its immediately enclosing sequential loop: DO Since we are considering cache thrashing due to true data sharing, i.e. due to data dependences between threads, we can also write the loop nest in Figure 3 A (h 0 (I; J; K)))(1 m) are potentially dependent reference pairs, and m is the number of such pairs. Fang and Lu 11] reported that arrays involved in nested loops are usually two-dimensional or three-dimensional with a small-sized third dimension. The latter can be treated as a small number of two-dimensional arrays. Nested loops with the parallel loop at the innermost level are degenerate cases of the loop nest in Figure 3 . Therefore, our loop nest model seems quite general. Our method can also be applied to a loop nest which contains several separate parallel loops at the middle level. Each of these parallel loops may be restructured according to its own reference patterns, such that the threads in di erent instances of the same parallel loop are aligned. We currently do not align the threads created by di erent parallel inner loops. For programs with more complicated loop nests, pattern-matching techniques can be used to identify a loop subnest that matches the nest shown in Figure 3 . Other outer-or inner-loops that are not a part of the subnest can be ignored, as long as their loop indices do not appear in the array subscripts.
The compiler analysis is based on a simple multiprocessor model in which the cache memory has the following characteristics: 1) it is local to a processor; 2) it uses a copy-back snoopy coherence strategy; and 3) its line size is one word. The transformed code, however, will execute correctly on machines which have a multiword cache line and multilevel caches. Furthermore, as the experimental results will show, the performance of the transformed code is quite good on such realistic machines. Our analysis can also be extended to incorporate more machine parameters such as the cache line size.
Solutions
In this section, we analyze the relationship between linear functions in the array subscripts. Based on this analysis, we restructure a given loop nest to reduce or eliminate the cache thrashing due to true data sharing. We consider nested loops which are not necessarily perfectly nested and which may have variable loop bounds. For clarity of presentation, in Section 4.1 we rst discuss how to deal with dependent reference pairs such that the same subscript function is used in both references in each pair (Di erent pairs may use di erent subscript functions). Later in Section 4.2, we will discuss how to deal with more general cases by using simple a ne transforms to t to this model.
The Basic Model
In this subsection, we assume that for each pair of dependent references, the same subscript function is used in both references. Under this assumption, if we extract the subscript functionh (I; J; K) from each pair of dependent references, then a model for a nested loop which has m pairs of dependent references can be illustrated by the following code segment. In this example, no data dependences exist within loop J. However, two data dependences exist in the whole loop nest, one between the references to A, and the other between those to B. We have two linear functions to consider, one for each dependence: f 1 (i; j; k) = j + 3k; g 1 (i; j; k) = ?2i + j f 2 (i; j; k) = j + 3k + 1; g 2 (i; j; k) = i + j The iteration subspace N 1 N 2 is called the reduced iteration space because it omits the K loop. In order to nd the iterations in the reduced iteration space which may access common memory locations within the corresponding threads, we de ne a set of elements of array A which are accessed within thread T i0;j0 by using subscript functionh . The following lemma and two theorems establish the relationship between the loop indexes corresponding to two inter-dependent threads. We will use this index relationship to stagger the loop iteration space such that inter-dependent threads can be assigned to the same processors. These assumptions are almost always true in practice 31]. When they are not true, the parallel loops will be too small to be important. With these assumptions, we have the following theorem.
Theorem 2 In order to nd the threads which have data dependence relations with thread T i;j , we make the following de nition.
De nition 4: Given iteration (i; j) in the reduced iteration space, we let S i;j denote the following set of iterations in the space: The following theorem, derived from Theorems 1 and 2 and De nition 4, states that we can use the staggering parameters to uniquely partition the threads into independent sets. Theorem 3: S i;j as de ned above satis es: (1) if (i; j) 2 S i 0 ;j 0 and (i; j) 2 S i 00 ;j 00 then S i 0 ;j 0 = S i 00 ;j 00 , The theorem above indicates that S i;j includes all the iterations whose corresponding threads have a data dependence relation with T i;j . We call S i;j an equivalence class of the reduced iteration space. In order to eliminate true data sharing, threads in the same equivalence class should be assigned to the same processor. We want to restructure the reduced iteration space such that threads in the same equivalence class will appear in the same column. Each staggering parameter (L 1 ; L 2 ) computed for a dependent reference pair tells us that if we stagger the (i + L 1 )-th row in the reduced iteration space by jL 2 j columns to the right if L 2 < 0, or to the left if L 2 > 0, relative to the i-th, then the threads involved in the dependence pair will be aligned in the same column. Di erent staggering parameters may require staggering the iteration space in di erent ways. However, if these staggering parameters are in proportion, then staggering by the uni ed staggering parameter de ned below will satisfy all the requirements simultaneously. If a given loop nest satis es the condition Lk;1 Lk;2 = Lj;1 Lj;2 (1 j; k m) in De nition 5, then, according to Theorem 4 above, the reduced iteration space can be transformed into a staggered and reduced iteration space (SRIS) by leaving the rst g rows unchanged, g = G:C:D:(L 1;1 ; L 2;1 ; :::; L m;1 ), and by staggering each of the remaining rows using the uni ed staggering parameter. There will be no data dependences between di erent columns in the SRIS.
However, if the staggering parameters are not in proportion, i.e, if there exist (j; k) such that 1 j; k m and Lk;1 Lk;2 6 = Lj;1 Lj;2 , then we can no longer obtain a unique uni ed staggering parameter. Moreover, staggering alone is no longer su cient for eliminating data dependences between the di erent columns in the restructured iteration space. This is because some threads in the same equivalence class are still in di erent columns. We perform a procedure called compacting which stacks these columns onto each other. We will discuss staggering rst. After staggering by using any uni ed staggering parameter (g; g 0 ), the resulting SRIS has four possible shapes, as shown in Figure 5 (b ? e). Figure 5 (a) shows details of one of these shapes. Figure   6 (a) and 6(b) show the reduced iteration space for the example in Figure 4 before and after staggering with (3,-3) as the uni ed staggering parameter. Next, we compute the compacting parameter d using Algorithm 1 and 2, to be presented shortly. We then partition the SRIS into n chunks, where n = , which is the total number of columns in the SRIS devided by the compacting parameter d ( Figure 5(d ? e) ). These d-wide chunks are stacked onto each other to form a compacted iteration space of width d, as shown in Figure 7 . As we will explain later, the threads in di erent columns after compacting the SRIS with d are independent. Moreover, the product of d and g equals the number of equivalence classes. The SRIS shown in Figure  6 (b) for our example is transformed by being compacted with d = 9 into the form shown in Figure 8 .
The following algorithm computes the compacting parameter d. (1,1-g') (2,1-g') (g,1-g') (g+1,1) (g+2,1) (2g,1) ...... (1,1) ...... Step 3: Step Step 2: If there are an odd number of zero coe cients a i1 ; a i2 ; :::; a i2k+1 (0 2k + 1 p) among a 1 ; a 2 ; :::; a p , then let
Obviously, the non-zero integers b 1 ...... To further simplify the process of the staggering and the compacting of the reduced iteration space, the following theorem can be used to replace multiple staggering parameters, which are in proportion, with a single staggering parameter. Note that if all staggering parameters are in proportion, then compacting is unnecessary for data dependence elimination. However, to improve load balance, we compact the SRIS by a compacting factor d, which equals the number of the available processors. The restructured code is parameterized by the loop bounds and by the number of available processors, which can be obtained by a system call at runtime. There is no need to recompile for a di erent number of available processors.
If the given loop nest is perfectly nested, then the resulting code, after staggering and compacting, is shown in Code Segment 1 listed below. (2) We can now apply the algorithms in Section 4.1 toh 0 2 andh 1 , which yield a staggering parameter, say (L 1 ; L 2 ). For a given iteration (i 0 ; j 0 ), let (i 00 ; j 00 ) = (L 1 ; L 2 ) + (i 0 ; j 0 ). The iteration (i; j) must have a dependence with (i 00 ; j 00 ) before the a ne transformation if and only if the iteration (i 0 ; j 0 ) has a dependence with (i 00 ; j 00 ) after the transformation. We denote the distance between (i; j) and (i 00 ; j 00 ) as (L 2 ) will not be constant, meaning that the iterations cannot be aligned with a constant staggering parameter. In common practice, since loop J is DOALL in our loop nest model, the two linear functionsh 1 andh 2 will have the same co cients for loop index variables I and J, which implies that 1 = 2 = 1. In this paper, we will consider the case of 1 = 2 = 1 only. We now have L
We de ne (L (1) (1) and (2) Table 2 shows examples of staggering parameters for di erent subscript functions appearing in the dependent reference pair, where the loop index variables are listed in the order from the outermost loop level to the innermost. If we simultaneously consider two reference pairs: A(I; J) with A(I ? 3; J ? 1) , and B(I; J) with B(I ? 1; J ? 3), then the thread T i;j will share the same array element A(i; j) with thread T i+3;j+1 and the same array element B(i; j) with thread T i+1;j+3 . Using Theorem 9, the staggering parameters (L 
Experimental Results
The thread alignment techniques described in this paper have been implemented as backend optimizations in KD- PARPRO 20] , a knowledge-based parallelizing tool which can perform intra-and interprocedural data dependence analysis and a large number of parallelizing transformations, including loop interchange, loop distribution, loop skewing, and strip mining for FORTRAN programs.
To evaluate the e ect of the thread alignment techniques on the performance of multiprocessors with memory hierarchies, we experimented with three programs from the LINPACK benchmarks on a SGI Challenge cluster which can be con gured to contain up to twenty MIPS 4400 processors. First, the programs were parallelized and optimized using KD-PARPRO. To reduce or eliminate cache thrashing due to true data sharing, our tool recognized the nested loops which may cause the thrashing. It applied the techniques described in the previous section to analyze and restructure the loop nests. The parallelized programs were then compiled using SGI's f77 compiler with the optimization option -O2. The sequential versions of the programs were compiled on the same machine using the same optimization option for f77. The output binary codes were then executed on various con gurations with a di erent number of processors during dedicated time. Each MIPS 4400 processor has a 16K-byte primary data cache and a 4M-byte secondary cache. The cache block size is 32 bytes for the primary data cache and 128 bytes for the secondary cache. A fast and wide split transaction bus POWERpath-2 is used as its coherent interconnect. Cache coherence is maintained with a snoopy write-invalidate strategy.
We compared the results obtained by using our algorithm to align the threads with those obtained by using four di erent loop scheduling strategies provided by SGI system software, namely, simple, interleave, dynamic, and gss. The simple method divides the iterations by the number of processors and then assigns each chunk of consecutive iterations to one processor. The interleave scheduling method divides the iterations into chunks of the size speci ed by the CHUNK option, and execution of those chunks is statically interleaved among the processes. With dynamic scheduling, the iterations are also divided into CHUNK-sized chunks. As each process nishes a chunk, however, it enters a critical section to grab the next available chunk. With gss scheduling 29] , the chunk size is varied, depending on the number of iterations remaining. None of these SGI-provided methods consider task alignment.
The speedup of a parallel execution on shared memory machines just like the SGI cluster can be a ected by many factors, including: program parallelism, data locality, scheduling overhead, and load balance. Usually gss, dynamic and interleave schedulings with a small chunk size are supposed to show better load balance than simple scheduling. On the other hand, they tend to incur more scheduling overhead than simple. Furthermore, simple captures more data locality in most cases than other schedulings do.
The programs we selected from LINPACK are SGEFA, SPODI, and SSIFA. SGEFA factors a double precision matrix by Gaussian elimination. The main loop structure in this program consists of three imperfectly nested loops. The innermost loop is inside subroutine SAXPY, which multiplies a vector by a constant and then adds the result to another vector. In order to show the array access pattern inside the loop body, we inlined the SAXPY in the code section given below. However, we kept the subroutine call when we applied our techniques to the program. DO The SRIS, after staggering the iterations in the reduced iteration space (K; J), is shown in Figure  9 . For this program, we only consider the linear function (I; J). The staggering parameter is (1,0) according to De nition 4(4). The number of processors is used to determine the compacting factor. The SRISs, after staggering the iterations in the reduced iteration spaces (K; J) and (J; K) for these two loop nests, respectively, are shown in Figure 10 The SRISs, after staggering the iterations in the reduced iteration space (K; JJ) for two di erent ksteps, are shown in Figure 11 . For clarity, we use c kstep to denote the value of kstep in the current K iteration, and we use p kstep for its value in the previous K iteration. All threads will be aligned well if we properly align the threads created in the current K iteration with those in the previous K iteration. We need to consider four possible cases of dependences: one is between two references to The problem sizes we used in our experiments are n=100 and 1000. The performance of the par-allel codes transformed by our techniques, compared with the performance achieved by the scheduling methods provided by SGI, are shown in Figures 12-15 . Both SGEFA and SSIFA may require pivoting for non-positive-de nite symmetric matrices, but not for positive-de nite symmetric matrices. We show data for SGEFA with pivoting, and we show data both with and without pivoting for SSIFA. Pivoting may potentially destroy the task alignment. As the gures show, our method always outperforms all of the SGI's scheduling methods, with the exception of program SGEFA. For this program, our method's performance is almost the same as that of simple, although our method outperforms simple by 14% on 16 processors with n = 100, which should be attributed to the reduction of cache thrashing due to true data sharing, a problem that tends to be more severe when more processors are running. The simple scheduling method tends to get better performance than the dynamic, gss, and interleave methods, because it results in better locality and less cache thrashing in most cases, and it also incurs less scheduling overhead. But when the programs do not exhibit a good load balance, like SPODI, SSIFA, and other programs in LINPACK, which deal with symmetric matrices, simple's performance results degrade substantially. Our method outperforms simple quite signi cantly in most cases, especially for SPODI (Figure 13 ), as well as for SSIFA without pivoting (Figure 15 ), where our method beats simple by as much as 105%. We are not able to get improvement over simple for program SGEFA when n = 1000, because pivoting is much more likely to destroy the locality we try to keep. For the rest of the programs, we attribute our performance gain over simple both to the reduction of cache thrashing due to true data sharing and to a better load balance, although, for SSIFA with pivoting, we believe our method bene ts more from load balancing. We note that the SGI system software cannot pick the right scheduling method automatically to t the particular program. On the other hand, our method seems more capable of delivering good performance for di erent loop shapes. As the thrashing problem becomes more serious on parallel systems with more processors and greater communication overhead, our method will likely be even more e ective.
Conclusions
This paper presents a method in which the reduced iteration space is rearranged according to the staggering and the compacting parameters. The nested loop (either perfectly nested or imperfectly nested) is restructured to reduce or even eliminate cache thrashing due to true data sharing. This method can be e ciently implemented in any parallel compiler. Although the analysis per se is based on a simple machine model, the resulting code executes correctly on more complex models. Our experimental results show that the transformed code can perform quite well on a real machine. How to extend the techniques proposed in this paper to incorporate additional machine parameters is interesting future work. 
