Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association of Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission. ISCA '95, Santa Margherita Ligure, Italy. © 1995 ACM 0-89791-698-0/95/0006 ...$3.50
1 Introduction
Performance tuning on today's computers has become very complex.
One factor of this complexity is the use of memory hierarchies, and particularly of cache memories.
As the miss penalty keeps growing, overall performance becomes very sensitive to cache performance.
Unfortunately, the behaviors of direct-mapped and set associative caches are very sensitive to small variations of the application's parameters.
Since the caches are not perfect (limited associativity, non-optimal replacement strategy), performance may suffer unpredictably from conflict misses even with blocked loops [7]. For instance, in a recent study, Schlansker et al. [9] showed that, even with a very regular memory access pattern such as repeatedly reading a fixed-size memory subblock, the miss ratio on a 32-way set-associative cache depends heavily on parameters such as the number of rows of the whole matrix.
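The sensitivity described above can be illustrated with a small cache simulation. The following is a minimal sketch, not the experimental setup of the paper or of [9]: the function name `miss_ratio`, the cache geometry (8 KB, 32-byte lines, direct-mapped by default), the LRU replacement policy, the column-major layout and the 32 x 32 subblock of 8-byte elements are all assumptions chosen for illustration.

```python
from collections import OrderedDict


def miss_ratio(ld, rows=32, cols=32, sweeps=4,
               cache_bytes=8192, line_bytes=32, ways=1, elem_bytes=8):
    """Miss ratio for repeated reads of a rows x cols subblock of a
    column-major array whose leading dimension is ld elements.

    All default parameters are illustrative assumptions.
    """
    n_sets = cache_bytes // (line_bytes * ways)
    # One ordered dict per set; insertion order tracks LRU age.
    sets = [OrderedDict() for _ in range(n_sets)]
    misses = 0
    accesses = 0
    for _ in range(sweeps):
        for j in range(cols):
            for i in range(rows):
                addr = (j * ld + i) * elem_bytes   # column-major address
                line = addr // line_bytes          # cache line number
                lru = sets[line % n_sets]          # set selected by the line
                accesses += 1
                if line in lru:
                    lru.move_to_end(line)          # hit: mark most recent
                else:
                    misses += 1
                    if len(lru) >= ways:
                        lru.popitem(last=False)    # evict least recent line
                    lru[line] = True
    return misses / accesses


if __name__ == "__main__":
    # Same subblock, same cache, different leading dimensions.
    for ld in (32, 340, 512, 600, 1024):
        print(ld, miss_ratio(ld))
```

Under these assumed parameters, a leading dimension of 32 makes the subblock contiguous, so only the cold misses of the first sweep remain, whereas a leading dimension of 1024 makes every column start at a multiple of the cache size, so all columns compete for the same eight sets. Passing a larger `ways` value shows how associativity changes the picture.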
The execution times are given in Figure 5 for array leading sizes ranging from 340 to 600.
Original loop
For all cache organizations except the direct-mapped cache, the average number of misses is 50% higher than for the blocked loop.
LU factorization
The last loop nest we experimented with is a 100 x 100 LU factorization without pivoting.
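For readers unfamiliar with the kernel, an LU factorization without pivoting can be sketched as the loop nest below. This is a minimal illustration only: the function name `lu_no_pivot`, the in-place storage of L and U, and the particular loop ordering are assumptions, and the loop ordering and blocking actually used in the experiments may differ.

```python
def lu_no_pivot(a):
    """In-place LU factorization without pivoting of a square matrix
    given as a list of rows (illustrative sketch).

    On return, the strict lower triangle of `a` holds L (unit diagonal
    implied) and the upper triangle holds U. Assumes no zero pivot.
    """
    n = len(a)
    for k in range(n):
        pivot = a[k][k]
        # Scale the column below the diagonal to form column k of L.
        for i in range(k + 1, n):
            a[i][k] /= pivot
        # Rank-1 update of the trailing submatrix.
        for j in range(k + 1, n):
            for i in range(k + 1, n):
                a[i][j] -= a[i][k] * a[k][j]
    return a


if __name__ == "__main__":
    # A 100 x 100 diagonally dominant matrix, so no pivoting is needed.
    n = 100
    m = [[float(n) if i == j else 1.0 for j in range(n)] for i in range(n)]
    lu_no_pivot(m)
    print(m[0][0], m[1][0])
```

The innermost loops sweep the trailing submatrix once per value of `k`, which is the kind of repeated access to a shrinking subblock whose cache behavior depends on the array leading size.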
