Nowadays the performance gap between processors and main memory makes an e cient usage of the memory hierarchy necessary for good program performance. Several techniques have been proposed for this purpose. Nevertheless most of them consider only regular access patterns, while many scienti c and numerical applications give place to irregular patterns. A typical case is that of indirect accesses due to the use of compressed storage formats for sparse matrices. This paper describes an analytic approach to model both regular and irregular access patterns. The application modeled is an optimized sparse matrix-dense matrix product algorithm with several levels of blocking.
Introduction
Despite the fact that several hardware and software techniques have been developed and successfully applied in order to improve the memory hierarchy behavior, memory accesses continue to be a bottleneck for system performance. This is particularly true in the case of applications characterized by irregular access patterns, as they limit the spatial and temporal locality that caches try to exploit. It is the case of codes that deal with sparse matrices, whose compressed storage formats 1] generate indirect accesses. Our interest in sparse codes is justi ed by the broad spectrum of engineering and scienti c computing applications where such codes arise.
Cache performance evaluation tools are required both to provide quantitative predictions of miss ratio and information to guide cache optimization. The traditional approach has been the software simulation of the cache e ect for every memory access 2], which is very accurate but gives little insight into the reasons for the cache behavior and requires very long computation times. Some modern microprocessors overcome this problem by providing performance monitoring tools such as built-in counters. Still they present the limitations of requiring program compilation and execution and being restricted to a speci c platform. A more general approach is that of analytical models, which, in addition to requiring short computation times, make more exible the parametric study of the cache. Although many models extract input parameters from address traces 3], 4], 5] and combine them with cache de nition parameters, more general purpose models have been developed taking as input the code to analyze 6], 7], 8]. Nevertheless these models only consider dense algebra codes, with regular access patterns. There are very few works related to the analytic modeling of sparse codes, and their scope is very limited compared to ours. For example, 9] is restricted to the partial modeling (only taking into account self interferences) of the behavior of one vector in a very simple code such as the sparse matrix-vector product, and only on direct-mapped caches. Besides until now no attempts have been made to build a general framework for the modeling of these kinds of codes.
On the other hand, several studies have been carried out on sparse algebra kernels in order to improve their performance applying software techniques such as blocking, loop unrolling, loop interchanging or software pipelining. This work demonstrates the feasibility of modeling such types of complex algorithms that t hardware improvements of current high performance microprocessors. As an example, an optimized version of the sparse matrix-dense matrix product described in 10] is modeled. In this work we apply the probabilistic model for K-way associative caches with LRU replacement introduced in 11], where only simple algebra kernels such as sparse matrix-vector product and sparse matrix transposition were considered. Optimum block sizes for the memory hierarchy of an arbitrary system can be predicted by means of this model. A uniform distribution of the non zero elements is considered, but the model proposed can be extended to di erent types of distributions as we demonstrate in 12] .
The algorithm to model is described in the next section. Basic model concepts and an example of the modeling process are presented in Section 3. Section 4 validates the model and applies it in order to propose block sizes for obtaining minimum execution times on several platforms. A brief study of the cache behavior as a function of several paratemers is also carried out. Section 5 concludes the paper.
Optimized Sparse Matrix-Dense Matrix Product
The algorithm modeled is displayed in Figure 1 . It stores the sparse matrix using the Compressed Row Storage (CRS) format 1], which uses three vectors: vector A contains the sparse matrix non zero elements or entries, vector C stores the column
R2(I)=R (I)  5 ENDDO  6 DO K2=1, N, BK  7  LIMK=K2+MIN(BK, N-K2+1)  8  DO J=1, LIMJ-J2+1  9  DO K=1, LIMK-K2  10  WB(J,K)=B(K2+K-1 of each entry, and vector R points to the location in vectors A and C where a new row of the sparse matrix starts. This format forces a row-wise access of the sparse matrix. The dense matrix is B, while D stores the product matrix.
In this code the outer loop on I (line 13) of the product traverses the rows of the sparse matrix and matrix D (dimension I); then the loop on K (line 19) processes the entries of a given row (so the K dimension is the one common to the sparse and the dense matrices), and nally an inner loop, which has been completely unrolled due to a register blocking, considers the columns of matrices B and D. This is what we call the J dimension. In addition, this optimized version comprises one level of cache blocking using a block of BK rows and BJ columns of the dense matrix. When the processing of a new block begins, a transposed copy is stored in a temporal matrix WB which is the one the inner loop uses. The reason is that matrix B is accessed by rows in this loop, which does not favour the exploitation of locality, as matrices are stored by columns in FORTRAN. The use of a transposed copy permits taking advantage of both temporal and spatial locality. Vector R2 simpli es the access to the subrows of the sparse matrix associated to a given block of B by pointing to the location in vectors A and C where these subrows start.
Blocking at the register level is also considered to optimize the algorithm performance; for this purpose techniques such as strip mining, loop interchanging, full loop unrolling and the elimination of redundant load and store operations were applied. As mentioned above, these techniques have been applied to the inner loop, taking the load and stores for matrix D out of the loop on K. This modi cation requires BJ registers (d1, d2, : : : , dbj in Figure 1 ) to store these values. The resulting algorithm has a twodimensional blocking for registers, resulting in fewer loads and stores per arithmetic operation. Besides, the number of independent oating point operations in the loop body is increased.
Analytic Model
For our modeling purposes we distinguish two kinds of misses: the intrinsic ones take place the rst time a memory block corresponding to a cache line is accessed. The remaining misses a ecting the same line are called interference misses, as they are due to interferences with other lines that are placed in the same set. When the interfering lines belong to the same program vector, we say that they are self interferences, whereas when they belong to another vector, they are considered cross interferences.
In our model and the platforms we have studied (see Section 4) the K-way cache sets are managed following an LRU replacement policy (pseudo-LRU in real systems, actually). In this case K or more di erent lines mapped to the set where a given line resides must be accessed between two consecutive accesses to this line in order to obtain an interference miss in the second access. The probability that an access to a line results in an interference miss corresponds to the likelihood of having placed at least K lines in its set since the last time it was accessed. We use the concept of area vector to compute this probability, being S V = S V 0 ; S V 1 ; : : : ; S V K the area vector corresponding to a given program vector V. The element in the i-th position stands for the ratio of sets that have received K ? i lines of this program vector during the execution of a given portion of code. The exception is S V 0 , which is the ratio of sets that have received K or more lines. Di erent expressions have been developed to calculate the area vectors associated to typical access patterns 11]. Next, we describe those ones that arise in our particular code in Figure 1 .
Access Patterns Modeled
One of the most common patterns is the sequential access to n consecutive words, whose area vector S s (n) is Another useful expression is the one corresponding to an access to an n word vector in which any cache line has the same probability P l of being accessed. We denote the area vector in this case as S l (n; P l )
where m = maxf0; K ? dn=(L s N K )eg and B(n,p) is the binomial distribution a . In the case of k accesses of this type, the area vector is S k l (n; P l ) = S l (n; 1?(1?P l ) k ).
a We de ne the binomial distribution on a non integer number of elements n as 3.2 Other Important Issues and Input Parameters When calculating self interference area vectors we will need to compute the average number of lines that compete with a given line in an n word vector. Function C(n) calculates it as
where z is the average number of lines of the vector mapped to a set. As between consecutive accesses to a given line several program vectors may be accessed, and some of them may be referenced di erent access patterns, a mechanism is needed to add area vectors. Given two area vectors S U = (S U 0 ; S U 1 ; : : : ; S U K ) and S V = (S V 0 ; S V 1 ; : : : ; S V K ), the union area vector S U S V that comprises the accesses corresponding to both area vectors is de ned as
From now on the symbol will be used to denote the vector union operation. This method makes no assumptions on the relative positions of the program vectors in memory, as it is based on the addition, as independent probabilities, of the area ratios. Table 1 displays the main input parameters for our model. Word stands in our model for the size of the data elements the algorithm handles. The size of a oating point number has been chosen as word, but the model is totally scalable, as integers are considered through the use of parameter r.
Modeling Cache Behavior of Algorithm Arrays
The modeling of the proposed code has been performed through the modeling of the behavior of each one of the vectors and matrices involved. For each array, each point where it is referenced is considered separately. As an example, and due to space limitations, only the modeling of the behavior of matrix WB will be detailed here, as it usually accounts for the greatest portion of the misses and its access pattern is the most complex. We will rst calculate the number of misses on this matrix in the inner loop; later we will study matrix WB behaviour during the storage in it of the transposed blocks of matrix B.
Misses on Matrix WB in the Inner Loop
The probability of a hit in an access to a given line of this matrix during the processing of the j-th row is estimated as
where P = 1 ? (1 ? p n ) Av(BJ;Ls) is the probability of accessing a line of matrix WB during the processing of a subrow of the sparse matrix, being Av(a; b) the average number of di erent groups of a consecutive elements into which we divide an abstract in nite vector that fall in a group of b consecutive elements of that vector. 
This way P(1 ? P) i?1 in (5) is the probability that the last access to a line of matrix WB has taken place during the processing of row j ? i. On the other hand, S intf WB (i) is the area vector corresponding to the accesses to all of the vectors involved in the processing of i subrows of the sparse matrix along their corresponding i iterations of the loop on I. As explained in Section 3, its rst element (S intf WB 0 ) gives the probability that a line accessed just before those processes take place su ers an interference miss if it is rerefenced just after they have nished. This vector is calculated adding with the operation the area vectors for these accesses:
Sequential access to i elements of vectors R and R2 (S s (i r), see Section 3.1).
Access to a portion of i+L s ?1 elements of vector A with an uniform access probability per line (see Section 3.1) 
is the probability that the elements belonging to a group of BK consecutive columns in two consecutive rows of the sparse matrix share a line in vector A.
The term 1 ?(1?p n ) BK is used to take into account the probability that the sublock is empty in a row, as in this case no access to vector A takes place.
Idem for vector C, being the access probability per line
where P share C is calculated using expression (8) Finally, the area vector for the access to i subrows of BJ elements of matrix D is calculated using a deterministic algorithm described in 13]. The last term in expression (5) stands for the probability of a hit in a line that has not been accessed in the inner loop yet (probability (1 ? P) j?1 ) , but that is still resident in the cache after the copying of the present block of matrix B to WB. This will happen if it has not been a ected by interferences both in the iterations of the loop on I and in the copying process (1 ? S intf init WB 0 ). The calculation of this last probability is not included here due to space limitations.
The number of misses on matrix WB in this loop, M inner WB , is obtained as the product of three terms: the average miss probability, the average number of lines accessed when a column of this matrix is read, and the number of times that read takes place (once per entry and block in the J dimension): 
where A = maxf0; Ls BJ ? 1g is the average number of accesses after the rst one to a given line of matrix WB during an iteration of the loop on line 8. As the equation shows, the miss probability for these accesses is that associated to an access to an element of matrix B. The probability of a miss for the rst access to line i during iteration j of the loop, P miss rst (i; j), is calculated as P miss rst (i; j) =P acc (j ? 1)(S self (BJ; BK) S s (BK)) 0 + (1 ? P acc (j ? 1))(1 ? P hit WB (M)(1 ? P exp WB (i; j)))
where P acc (j) is the probability of the line having been accessed in the j previous iterations, which is minf1; (L s +j ?2)=BJg but for j = 0, for which the probability is null. The rst access to a line results in a miss unless it is in the cache when the copying begins (probability P hit WB (M)) and it has not been replaced during the copy of the elements previously accessed, (1 ? P exp WB (i; j)). On the other hand, if the line has been accessed in the previous iteration, only the accesses to BK elements of matrix B located in two consecutive columns (whose area we approach by a completely sequential access) and the accesses to a row of matrix WB can have replaced the considered line. This self interference area vector (S self (BJ; BK)) is calculated using the deterministic algorithm mentioned above. As for P exp WB (i; j), it is estimated as the rst element of the area vector resulting from the union of the area vectors associated to the following patterns: i L s =BJ consecutive elements of the subcolumn of matrix B that is being copied, S s (i L s =BJ).
j ? 1 subcolumns of BK elements of B corresponding to the previous copied subcolumns. It is calculated using the deterministic algorithm. 
Validation and Application
The code shown in Figure 1 was rewritten replacing the references to memory by functions that calculate the position to be accessed and write it to a trace le. Later this trace le was fed to the dineroIII cache simulator, integrated into the WARTS toolset 14]. Table 2 displays the prediction deviation for several combinations of the input parameters obtained from the execution on synthetic matrices. For each combination several simulations were made changing the data structures starting addresses. In the table is the average deviation of the number of misses measured in the simulations. The average error obtained in the trial set was under 1.9%, and the maximum error was under 15%. The errors have never given place to deviations in the prediction of the miss ratio above 1.5%. The largest deviations were obtained for the combinations in which is large or BJ and BK are not dividers of H and N respectively and there are few blocks in any of these dimensions. The latter is the case for the deviations over 10% in Table 2 : the use of blocks of size 2100 in a dimension with 5000 elements gives place to two blocks with 2100 elements and another one with only 800, instead of the three completely equal blocks of 2100 elements the model considers to simplify the problem. This leads to an overestimation of the number of misses.
In order to prove the usefulness of the model we have used it to derive the optimum block sizes for the execution of this code on several platforms. We have taken into account parameters such as the cache size, line size, associativity and miss penalty of each cache level. The policy followed to compare the di erent blocks Table 2 : Deviation of the model for optimized sparse matrix-dense matrix product : r=1; Order M = N, Nnz and H in thousands and Cs in Kwords. is the average deviation of the number of misses measured in the simulations; is the deviation of the model prediction with respect to the average number of measured misses.
Order (13) where N Levels is the number of cache levels in the considered system; C s i , L s i , and K i are the cache size, line size and associativity of level i, and Cost i provides the relative cost of a miss at this level. M(BJ; BK; C s ; L s ; K) is the number of misses predicted by the model for this block and cache. The parameters describing the matrices involved have been removed to simplify the expression. The platforms used to carry out our study were the following ones: a DEC 433au Personal Workstation, which is based on a 433MHz 21164 Alpha processor; a SGI Origin 200 server with R10000 processors at 180 MHz and a SUN Ultra 1 using a 167 MHz UltraSPARC-I processor. We think they represent a wide range of present-day architectures in order to validate our experiments. The results obtained for several matrices are shown in Table 3 . The trial set for BJ (block size in the J dimension) was 8, 10, 12, 16 and 20, while the values tested for BK (block size in the K dimension) were 50, 100, 200, 250, 500, 1000, 1250, 2500 and 5000 (these last ones only when applicable). The table displays the di erence between the execution times obtained using the block proposed by the model, that is, the one that minimizes Cost(BJ; BK), and the optimum block, that is, the block that leads to the minimum execution time. This di erence is expressed as a percentage of the time obtained for the optimum block.
The model was initially designed for write-back caches, considering the same cost and behavior for read and write misses. A variation has been developed to adapt it to write-through non-allocating caches, as the rst level caches of the 21164 and the UltraSPARC I processors follow this policy. It has consisted in considering any write on this caches as a miss, but giving it a weight proportional to the time required to solve it. Besides the write accesses to these caches produce no interference area vectors, as the a ected data are not placed in the cache. We see that the worst estimations have taken place for the case of the Origin server using a small matrix of order 1000. Although the percentage di erence is noticeable, the execution time of the block proposed by the model is only less than 0.1 seconds larger than the optimum time for p n = 0:05 and 0.18 seconds for p n = 0:1. Anyway, we have used the perfex utility of the operating system in order to access the internal counters available in the R10000, which can provide information such as the number of data misses in each of the two levels of the hierarchy. The number of misses reported is not close to that predicted by the model, which is not surprising, since there are several e ects not considered by it: the sharing of the second level cache with the code; the access patterns are not exactly those expected due to the use of intermediate variables and compiler optimizations, and the measurement of data misses that the model does not take into account such as those that take place when the sparse matrix is read from disk. Besides the CPU is always shared with other processes, which causes misses in the context switches. Anyway, we expect the most important patterns to be those re ected in the model, being our purpose not to give an accurate quantitative estimation of the real number of misses, but better give a fair idea of the evolution of the cache behavior with respect to the variations in the model input parameters, using the predicted number of misses as an indication of this behavior. The measurements obtained using the processor counters show that this goal has been achieved, as there is almost always a proportionality between the number of real misses measured in each cache level and the number of misses the model predicts. This fact is highlighted in Figure 2 . The block size proposed by the model is really more e ective than the optimum block size from the point of view of the memory hierarchy. What happens is that the total execution time depends on more factors: the block proposed by the model ts better in the caches, as it is smaller than the optimum block, but its use has more overhead, as it requires more iterations in the loops that control the blocks. On the other hand, we are talking about modern superscalar processors capable of issuing multiple instructions and hiding memory latencies in several ways, which makes the number of misses only a rough approximation of the program performance. Similar reasons can be given for the deviation observed in Table 3 for the largest matrix in the Ultra 1 system. Besides in this case we have the drawback that in the rst level data cache each line is divided into two sub-blocks and our model does not support sub-block placement. The model could be easily extended in order to support this feature. Nevertheless we think subblock placement has not reached a wide implementation in current caches by now.
The maximum BJ value in our trials has been 20 because there must be BJ free oating point registers in order to implement the register blocking and the FORTRAN compiler of the 433au personal workstation uses up to 19 registers for this purpose. The optimum block sizes have almost always used the maximum value of BJ in the three machines tested. This makes sense, since it leads to the minimum number of blocks in the J dimension. The number of accesses is reduced and the exploitation of the spatial locality is favoured. Access to matrix WB in the inner loop takes place sequentially in this dimension, whose components are stored in consecutive memory positions. Nevertheless there have been strong variations in the optimum values of BK for the di erent architectures and matrices, which has moved us to make a study of the in uence of di erent parameters on the cache behavior.
In Figures 3 to 5 we have tested the cache behavior during the product of a 5000x5000 sparse matrix as a function of BK keeping BJ constant and equal to 20. Each one of the three gures focuses on di erent degrees of associativity, line and cache sizes, respectively. Typical values of second level caches have been chosen for the cache parameters. In the following, cache behavior is explained based only on matrix WB, as it usually accounts for the greatest amount of misses, as said in Section 3.3. Figure 3 shows that small values of p n favour the use of large blocks no matter the size of the cache sets because all of the block sizes considered t in the cache and there are few cross interferences due to the small number of entries. Nevertheless as p n grows the optimum block decreases its size because of the increase in the number of cross interferences. Besides, as expected, the smaller the value of K the smaller the optimum block, as the interferences e ect is greater.
Line size (L s ) in uence on the number of cache misses is shown in Figure 4 . The references inside the inner loop are basically 20 read accesses to consecutive memory positions, so for typical second level cache line sizes, the larger the line the better. Only for extremely large line sizes ( 256) an increase in the number of misses arises, due to their negative e ect on the interference probability. On the other hand, the optimum block size experiences little variations with L s . Although it is not shown in the graph, the increase of the associativity gives place to an increase of the optimum block size with the line size, as it balances the interference growth that takes place with larger line sizes. It must be taken into account that the increase of the line size can lead to a longer execution time despite the reduction in the number of misses because of the longer miss time.
The behavior of the optimum block size as a function of the cache size is scalable for the two di erent values of p n in Figure 5 . For C s 128Kw any block ts in the cache, so the largest block is the best for large caches. Nevertheless we see the e ect observed in Figure 3 of reducing the optimum block size as p n grows because of the increase in the cross interference probability. The behavior of the only second level cache considered in Figure 5 for which some blocks do not t, that is, the 64Kw cache, limits the optimum block size to be the largest below C s in the case of p n = 0:004 in order to avoid self interferences and exploit the cache. As p n grows this limit is reduced because of the added e ect of self and cross interferences. The shape of the curve for the 64Kw cache and the matrix with p n = 0:04 shows clearly the relative importance of cross and self interferences: these latter are much more important.
Conclusions
A cache behavior model developed by our group has been applied in a systematic way to a remarkably complex code in comparison with those in the previous literature. The model has demonstrated its usability for proposing code parameters, such as block sizes, that lead to minimum execution times on real platforms. The model was rst validated by simulations showing a good degree of accuracy. Later it was used to derive optimum block sizes for di erent architectures and matrices taking into account the di erent cache levels of each system. The blocks proposed by the model were almost always the optimum ones or provided very similar results. Finally, a comparison of the predicted number of misses with the real values measured using the internal built-in counters of the R10000 microprocessor shown that, although the model is far from being quantitatively accurate due to several factors it does not take into account (context switches, sharing of certain cache levels with code, etc.), its predictions do follow the real cache behavior qualitatively.
Besides, the model presents the advantage of its high execution speed in comparison to simulations and even to real executions providing a exible architectureindependent tool for the analysis of the cache hierarchy behavior and the proposal of optimum blocks. As an example, let us consider the product of a 5000x5000 sparse matrix with 500K entries by a dense matrix with H = 100 using a 20x1000 block on a direct mapped 128Kw cache with L s = 16. These are the parameters of the second level cache of the Ultra 1 system mentioned above. The time required to generate the trace and process it in this system was 690 seconds using dineroIII. The real execution time of this product consumed 7 seconds of CPU time, while the model required 0.64 seconds. Although two executions of the model are needed to re ect the behavior of this two-level hierarchy, the model is still competitive with respect to the real execution. This is specially true if we take into account compilation times. The di erence grows with the dimensions of the matrices and the number of entries (as the simulation and execution times are proportional to the number of accesses) and with large miss rates, because of optimizations in the model implementation.
