Among the many features implemented in today's microprocessors, some can reduce execution time by overlapping different operations. Overlapping instructions with other instructions, and overlapping computation with memory activity, are the main ways in which execution time is reduced. In this paper we introduce and define a notion of overlap, together with a few different ways to capture its effects. We characterize some of the ASCI benchmarks using the overlap and some quantities related to it. We also present a characterization of overlap effects using a lower bound derived empirically from measured data. We conclude by using the lower bound to estimate other components of the overall execution time.
Introduction
In 1994-95, superscalar processors issuing three or more instructions per clock appeared on the market. All the major vendors were, and still are, supporting superscalar processors such as the MIPS R10000, Intel Pentium Pro, IBM Power2, HP 8000, DEC Alpha, and Sun UltraSPARC. Among the many features implemented in today's microprocessors, some can reduce execution time by overlapping different operations. Overlapping instructions with other instructions, and overlapping computation with memory activity, are the main ways in which execution time is reduced. For all those processors we can also speak of overlap between memory accesses at different levels of the memory hierarchy. To exploit even more parallelism between activities, some processors support out-of-order execution and non-blocking caches with more than one outstanding miss. In the study presented here, the MIPS R10000 was chosen as the reference for the experiments. The results presented in this paper focus on the estimation of overlap effects. A lower bound for the overlap is computed using an empirical approach based on measured data. The result is then applied to compute bounds that quantify other effects. The results presented in the paper are obtained using the ASCI^1 benchmarks. Because both the ideas and their implementation in today's processors are new, no other study, to the best of our knowledge, has approached the problem of quantifying the effects of overlap.
Overlap
The analysis in the following sections uses a simplified mean value parameterization [1] to separate CPU execution time from stall time due to memory loads and stores. Figure 1 is a pictorial description of the modeled times. The model projects the overall cpi of an application as a function of CPU execution time and average memory access times:

$$cpi = cpi_0 + \sum_{i=2}^{nlevels} h_i t_i \qquad (1)$$
where $cpi_0$ is defined to be the cpi of the application assuming that all memory accesses are from an infinite L1 cache and take one clock period (i.e., the $i=1$ term is included in $cpi_0$), and $h_i$ and $t_i$ are, respectively, the hits per instruction and the average non-overlapped access time for the $i$th level in the memory hierarchy. The second term on the right-hand side of equation (1) is also referred to as $cpi_{stall}$. If no overlap of CPU execution and memory accesses occurs, every memory access to the $i$th level incurs the full round-trip latency, which we denote as $T_i$.

$$cpi = cpi_0 + (1 - m_0) \sum_{i=2}^{nlevels} h_i T_i \qquad (2)$$

We define a measure of the overlap of memory accesses with computation as $m_0$ in (3.a), where cpi is formulated as in (2). The overlap contribution, expressed in clocks per instruction, is given in (3.b).

$$m_0 = 1 - \frac{\sum_{i=2}^{nlevels} h_i t_i}{\sum_{i=2}^{nlevels} h_i T_i} \qquad (3.a)$$

$$cpi_{overlap} = m_0 \sum_{i=2}^{nlevels} h_i T_i \qquad (3.b)$$

From (1), $m_0$ is derived as one minus the ratio of the average memory access time to the maximum memory access time.

^1 Department of Energy Accelerated Strategic Computing Initiative.
A preliminary study of which features have an impact on overlap, and of how much each contributes to the total overlap, can be found in [9]. The approach used in [9] combines measurements with least-squares fitting.
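The overlap measure $m_0$ and the overlap contribution in clocks per instruction can be illustrated with a short Python sketch. It assumes the stall term is the sum of $h_i t_i$ over levels $i \ge 2$ and that $m_0$ is one minus the ratio of effective to maximum stall, as described above. The function names and all numeric values are hypothetical, chosen only for illustration; they are not measurements from this study.

```python
from typing import Sequence


def stall_cpi(hits: Sequence[float], times: Sequence[float]) -> float:
    """Sum of h_i * time_i over memory levels i >= 2 (L2, main memory, ...)."""
    if len(hits) != len(times):
        raise ValueError("one access time is needed per memory level")
    return sum(h * t for h, t in zip(hits, times))


def overlap_measure(hits: Sequence[float],
                    effective: Sequence[float],
                    full: Sequence[float]) -> float:
    """m_0: one minus the ratio of effective stall to maximum stall."""
    max_stall = stall_cpi(hits, full)
    if max_stall == 0.0:
        return 0.0  # no memory stall at all, so there is nothing to overlap
    return 1.0 - stall_cpi(hits, effective) / max_stall


def overlap_cpi(hits: Sequence[float],
                effective: Sequence[float],
                full: Sequence[float]) -> float:
    """Overlap contribution in clocks per instruction: m_0 * max stall."""
    return overlap_measure(hits, effective, full) * stall_cpi(hits, full)


# Hypothetical values for levels i = 2 (L2 cache) and i = 3 (main memory).
h = [0.02, 0.005]   # hits per instruction (illustrative)
t = [4.0, 40.0]     # effective non-overlapped access times, in cycles (illustrative)
T = [10.0, 80.0]    # full round-trip latencies, in cycles (illustrative)

print(f"m_0         = {overlap_measure(h, t, T):.3f}")
print(f"cpi_overlap = {overlap_cpi(h, t, T):.3f}")
```

In practice the effective times $t_i$ are not available from counters, which is what motivates bounding the overlap rather than computing it directly.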
ASCI Benchmarks
Four applications, which form the building blocks for many ASCI simulations, were used in this study. A more detailed characterization of the ASCI benchmarks is presented in [2, 3] .
Code Descriptions
SWEEP3D is a three-dimensional solver for the time-independent, neutral particle transport equation on an orthogonal mesh [4]. The first-order form of the transport equation is solved by sweeping through the spatial mesh along discrete directions (ordinates). In SWEEP3D, the main part of the computation consists of a "balance" loop in which the particle flux out of a cell in the three Cartesian directions is updated based on the fluxes into that cell and on other quantities such as local sources, cross section data, and geometric factors. The cell-to-cell flux dependence implies a recursive or wavefront structure, i.e., a given cell cannot be computed until all of its upstream neighbors have been computed. The specific version used in these tests was a scalar-optimized "line-sweep" that involves separately nested quadrant, angle, and spatial-dimension loops. In contrast with vectorized plane-sweep versions of SWEEP3D, there are no gather/scatter operations, and memory traffic is significantly reduced through "scalarization" of some array quantities. Because of these features, L1 cache reuse in SWEEP3D is fairly high (the hit rate is about 85%). A problem size of N implies $N^3$ grid points.

HYDRO [...] vectorizable. An important characteristic of the code is that most arrays are accessed with a stride equal to the length of one dimension of the grid. HYDRO-T is a version of HYDRO in which most of the arrays have been transposed so that access is now largely unit-stride. A problem size of N implies $N^2$ grid points.

HEAT solves the implicit diffusion PDE using a conjugate gradient solver for a single timestep. The code was originally written for the CRAY T3D using SHMEM. The key aspect of HEAT is that its grid structure and data access methods are designed to support one type of adaptive mesh refinement (AMR) mechanism, although the benchmark code as supplied does not currently handle anything other than a single-level AMR grid (i.e., the coarse, regular level-1 grid only). A problem size of N implies $N^3$ grid points.

NEUT is a Monte Carlo particle transport code. It solves the same problem as SWEEP3D but uses a statistical solution of the transport equation. Particles are individually tracked through a three-dimensional mesh, where they have some probability of colliding with cell material. The output from the particle tracking is a spatial flux discretized over the mesh. Vector (or data parallel) versions of this type of code exist which track particle ensembles rather than individual particles. A problem size of N implies $N^3$ grid points and 10 particles per grid point.
Performance Characteristics
In this section we present some single-processor characteristics of the benchmark codes as obtained from performance counters on the Origin 2000 [6] . Detailed performance characteristic data for these codes were
Mem/FLOPS is the ratio of memory references to floating point instructions and reflects the density of load/store instructions in a code. The results show that the number of accesses is related to FLOPS by a small constant (greater than one) and that the growth rate of both memory accesses and FLOPS is O(n). In HEAT, the high Mem/FLOPS ratio is due to gather/scatter memory accesses in the code. The codes' overall cpi curves are generally the inverse of their corresponding MFLOPS curves; that is, an increasing cpi corresponds to a decreasing MFLOPS at nearly the same slope, and vice versa. The cpi of three of the codes (HEAT, HYDRO and SWEEP) is strongly dependent on problem size. Although not shown in the figures, we calculated the TLB hit ratio and the branch prediction hit ratio.
The calculation shows that the MIPS R10000 processor uses an effective technique for speculative branch prediction. All four benchmark codes (HEAT, HYDRO, HYDRO-T and SWEEP) have branch prediction hit ratios over 99%, meaning that over 99% of predicted branches are resolved as predicted in the actual execution. TLB hit ratios for all these codes are higher than 98%.
A Lower Bound for Estimating Overlapping Effects
A formal definition of lower bound can be found in [8] .
Under the assumptions presented in this section, a lower bound to estimate the effects of overlap can be computed. A lower bound is obtained for each code and for each problem size. The lower bound can be seen as a portion of the actual overlap contribution for a particular code and problem size on a given architecture. This portion is less than the actual overlap, according to the definition of lower bound. We do not claim that this lower bound is sharp, only that it is tight enough. The terminology and the concepts adopted to derive the lower bound are the same as those used in [9] and presented previously in section 2. Equation (2) was derived by transforming (1) and incorporating the overlap in the stall component. In particular, $1 - m_0$ estimates the fraction of memory references that pay the full latency (indicated by the presence of $T_i$). If we now consider (2) in a hypothetical scenario with no overlap, the stall component in (2) is maximized. Considering equation (2) with the maximized stall component produces a different $cpi_0$ contribution, as formulated in (4).
$$cpi = cpi_0^{min} + cpi_{stall}^{max} \qquad (4)$$

The $cpi_{stall}$ component is maximized since it represents the worst case, in which every memory access sees, from the processor's perspective, the full latency. Because the right-hand side of (4) must add up to cpi, the value of $cpi_0$ is then the minimum possible^2. At this point we can algebraically derive the minimum $cpi_0$, as described in (5).

$$cpi_0^{min} = cpi - \sum_{i=2}^{nlevels} h_i T_i \qquad (5)$$

In figure 3 we show the components of equality (5). The results presented are based on measured data obtained using the hardware performance counters available on the MIPS R10000 chip [6]. Counts of the number of cycles and the number of instructions executed by a given code are used to compute the overall cpi. The number of misses at each level of cache enables us to compute the hit ratios $h_i$, since we can measure the instructions executed. The maximized $cpi_{stall}$ can be computed since the $T_i$'s are given by the manufacturer [7, 11]. The minimum $cpi_0$ is then computed by difference. One can see that the maximized $cpi_{stall}$ can be greater than the overall cpi, producing a negative minimized $cpi_0$. Since the overall cpi is fixed and known, the component that maintains the balance in (4) is $cpi_0$. Thus, the fact that the maximized $cpi_{stall}$ is greater than the overall cpi tells us that overlap effects are present in the code analyzed. With this knowledge a first estimation of overlap impacts is possible. Figure 3 shows the difference between the overall cpi and the maximized $cpi_{stall}$ as a first quantification of the overlap on an Origin 2000 system using the ASCI benchmarks.

^2 The right-hand sides of (2) and (4) add up to the same value. Equation (4) differs from (2) in the way the two components on the right-hand side distribute their values, because of the maximized $cpi_{stall}$ part. We can say that, given a code and a problem size, the quantity represented by $cpi_0$ is the least computation contribution needed, under the assumption of maximum stall time, to obtain the same cpi value as in (2).

In our assumptions we maximized the stall component and, as a consequence, minimized the $cpi_0$ component. The minimized $cpi_0$ sets the minimum computation contribution to the overall cpi. To reach our lower bound on overlap effects, we have to add a concept specific to the architecture on which the measurements are taken: the best $cpi_0$ for that architecture. The best $cpi_0$ also coincides with the best overall cpi, because of the way in which $cpi_0$ has been defined. The best $cpi_0$ for the architecture under study is 0.25 [10], since the MIPS R10000 microprocessor is capable of graduating four instructions in the same cycle. Given the overlap as expressed in (3.b), a lower bound is defined in (6).
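A two-case bound of this kind can be derived in a few steps. The derivation below is a sketch under stated assumptions: that (2) has the form $cpi = cpi_0 + (1 - m_0)\sum h_i T_i$, that (3.b) reads $cpi_{overlap} = m_0 \sum h_i T_i$, and that $cpi_0$ can never be smaller than the architectural best value (0.25 here). The superscripts min, best and lb are notation introduced for this sketch.

```latex
\begin{align*}
cpi_{overlap} &= m_0 \sum_{i=2}^{nlevels} h_i T_i
  && \text{by (3.b)} \\
 &= cpi_0 - \Bigl( cpi - \sum_{i=2}^{nlevels} h_i T_i \Bigr)
  && \text{by (2)} \\
 &= cpi_0 - cpi_0^{min}
  && \text{by (5)} \\
 &\ge cpi_0^{best} - cpi_0^{min}
  && \text{since } cpi_0 \ge cpi_0^{best}
\end{align*}

\[
cpi_{overlap}^{lb} =
\begin{cases}
cpi_0^{best} - cpi_0^{min} & \text{if } cpi_0^{min} < cpi_0^{best} \\
0 & \text{otherwise}
\end{cases}
\]
```

The second case reflects only that an overlap contribution cannot be negative; it carries no information about the actual overlap.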
From the second case of (6) we can only say that the minimum overlap contribution is null; that does not imply that the actual overlap contribution is null. From the first case of (6) we can say something about those cases where we know that overlap is present: the overlap contribution cannot be less than that value. In figure 4 we show the lower bound for the overlap, expressed in cycles per instruction, together with the overall cpi. From the chart one can see how each code benefits from an overlap that is at least $cpi_{overlap}$ to obtain the overall cpi shown. With the data in figure 4 it is possible to relate the meaning of (6) to some codes.
From the chart one can see that there are cases in which the overlap is null. There are two possible interpretations of this case: either the code under study has no potential for overlap (e.g., every instruction depends on the previous one), or the code under study is cache resident and does not have a significant stall component. All the cases for the NEUT benchmark are an instance of the latter scenario; their working set fits in cache. From figure 3 one can see that for these codes even the maximized $cpi_{stall}$ is very small, and the hit ratio for L1 is virtually 100% (figure 2). Thus in this context a null lower bound for the overlap indicates a code that performs well and that needs no significant overlap contribution to achieve good performance.
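The computation described in this section, from raw counter values to a lower bound on the overlap, can be sketched in Python as follows. This is a minimal illustration, not the tooling used in the study. It assumes that a miss at one cache level is served by the next level, so that hits at level $i$ are the misses at level $i-1$ minus the misses at level $i$, and that the lower bound is the best $cpi_0$ minus the minimum $cpi_0$, clamped at zero. Function names and counter values are hypothetical; the 80-cycle main memory latency is the Origin 2000 figure quoted in this paper, while the L2 latency of 10 cycles is a placeholder.

```python
from typing import List, Sequence


def hits_per_instruction(cache_misses: Sequence[int],
                         instructions: int) -> List[float]:
    """h_i for levels i >= 2, given misses at L1, L2, ... (assumed model).

    Hits at a level are taken to be the misses of the level above it
    minus its own misses; main memory is assumed to always hit.
    """
    hits = []
    for level, misses in enumerate(cache_misses):
        deeper = cache_misses[level + 1] if level + 1 < len(cache_misses) else 0
        hits.append((misses - deeper) / instructions)
    return hits


def max_stall_cpi(hits: Sequence[float], full_latencies: Sequence[float]) -> float:
    """Maximized stall component: every access pays the full latency T_i."""
    return sum(h * T for h, T in zip(hits, full_latencies))


def overlap_lower_bound(cpi: float, max_stall: float,
                        best_cpi0: float = 0.25) -> float:
    """Lower bound on the overlap contribution, in cycles per instruction."""
    min_cpi0 = cpi - max_stall          # minimum cpi_0, obtained by difference
    return max(0.0, best_cpi0 - min_cpi0)


# Hypothetical counter readings for one run.
cycles, instructions = 1_500_000, 1_000_000
misses = [60_000, 20_000]               # L1 misses, L2 misses (illustrative)
T = [10.0, 80.0]                        # L2 latency (placeholder), memory latency

cpi = cycles / instructions
h = hits_per_instruction(misses, instructions)
S = max_stall_cpi(h, T)
print(f"cpi = {cpi:.2f}, max stall = {S:.2f}, "
      f"overlap lower bound = {overlap_lower_bound(cpi, S):.2f}")
```

For a cache-resident run the maximized stall is small, the minimum $cpi_0$ stays above 0.25, and the function returns zero, which corresponds to the null cases discussed above.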
Using the Lower Bound
There are different ways in which we could use the lower bound:
We can use it to characterize the overlapping capabilities of two different codes on the same architecture.
We can use it to characterize the overlapping capabilities of a code with varying problem size.
We could use it to characterize the same code over different architectures.
We could use it to estimate the capabilities of compilers in generating low-level code that takes advantage of overlapping features of the architecture.
This study was motivated by a larger project in which performance modeling has a key role. We used the lower bound to compare the overlapping capabilities of two systems currently available to us: a Power Challenge system and an Origin 2000 system, both from SGI. The lower bound effects on a Power Challenge system are presented in figure 5. In particular, we wanted to quantify the performance impact of two features that differ between the two systems and that, according to [9], can be seen as overlap effects: the number of outstanding misses and the main memory latency. In the Origin 2000 the number of outstanding misses is increased to 4, compared to 1.5 on the Power Challenge, and the latency to main memory is reduced to 80 cycles from the 205 cycles of the Power Challenge. Figure 6 shows the characterization of the ASCI benchmarks on the two systems using the lower bound for the overlap. From the chart one can see that more overlap resulted on the Power Challenge system. From the data in figures 5 and 3 it is clear that the Origin system performs better. The effect of the overlap can be interpreted in two ways. First, high overlap may be a consequence of a code that uses compiler and architecture features at their best, where the nature of the code is such that those features are needed. Second, high overlap may be a consequence of poor memory behavior: frequent and long memory accesses create a good potential for overlap, but this does not necessarily imply better performance than a system with a smaller overlap. The data from figure 6 present an instance of the latter scenario. The Origin system is a better machine from an architecture point of view: its stall time for main memory is significantly smaller, and the frequency of main memory accesses is reduced. Thus on the Origin the codes spend less time stalling.
Also, the implementation of the codes does not take full advantage of the higher number of outstanding misses available on the Origin. A more detailed comparison of the two systems using the ASCI codes can be found in [2,9]. An immediate consequence of our study is the possibility of characterizing the impact of different features on the overlap, as well as how the overlap effects vary when some of the features are changed. In the following section we use the lower bound result to bound the estimation of $cpi_{stall}$ and $cpi_0$.
Derived Bounds for $cpi_{stall}$ and $cpi_0$
The lower bound result also enables us to bound some terms of equality (1). Not all the terms in (1) can be measured directly: the effective latencies $t_i$ and $cpi_0$ are not directly measurable. We can use the lower bound for the overlap to compute a bound for both unknowns.
Upper bound for $cpi_{stall}$
Before we approach the problem, it is necessary to mention the result from [2,9] which shows that $cpi_0$ can be considered constant for a given problem. This means that each code has its own $cpi_0$, and that this value converges to the same value as the problem size increases. Also, cpi is a good approximation of $cpi_0$ if the problem size is small enough to make the problem fit in primary cache [2,9]; we denote this approximation $cpi_{L1}$. We can now rewrite (1), replacing $cpi_0$ with $cpi_{L1}$, and obtain (7).

$$cpi = cpi_{L1} + \sum_{i=2}^{nlevels} h_i t_i \qquad (7)$$
Knowing that $cpi_0$ stays constant attributes all the benefits of overlap to the $cpi_{stall}$ term. Thus, the difference between the maximized $cpi_{stall}$ and $cpi_{overlap}$ gives us an upper bound for the $cpi_{stall}$ term. The upper bound for $cpi_{stall}$ is always below the overall cpi. Note also that in some cases the approximation of $cpi_0$ with $cpi_{L1}$ introduces some error: two of the upper bound values are slightly negative (their first two decimal digits are zero). This is caused by the nature of the code and its problem size. The two codes in which this happens can be considered cache resident; therefore the amount of time spent stalling is negligible. Thus, in these cases the quantity $cpi_{stall}$ that we want to estimate using our approximation of $cpi_0$ is of the same order as the error introduced by the approximation. Looking at the whole picture, one can see that this problem occurs only in those cases where cpi is smaller than $cpi_{L1}$. In these cases the measurement of $cpi_{stall}$ is not so relevant, because it is negligible compared to $cpi_0$ (whether real or approximated). The upper bound for $cpi_{stall}$ is shown in figure 8 together with the overall cpi and the maximized $cpi_{stall}$ that we used to compute the lower bound for the overlap. From the chart one can observe how the bounded $cpi_{stall}$ differs from the maximized $cpi_{stall}$, which can be seen as the worst case. Thus, this upper bound gives a tighter estimate of the actual $cpi_{stall}$, also because its value is in all cases less than the overall cpi. Figure 9 compares the upper bound for $cpi_{stall}$ with another possible upper bound, derived using the ideal $cpi_0$ for the Origin 2000 system. This upper bound is computed as follows:
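A bound based on the ideal $cpi_0$ follows directly from (1) when $cpi_0$ is replaced by its smallest possible value. The expression below is a sketch of that step, assuming the ideal $cpi_0$ is the best value of 0.25 quoted above for the R10000; the symbol $cpi_0^{ideal}$ is introduced here for illustration only.

```latex
\[
cpi_{stall} = cpi - cpi_0 \;\le\; cpi - cpi_0^{ideal},
\qquad cpi_0^{ideal} = 0.25 \ \text{on the MIPS R10000}
\]
```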
The data in figure 9 show that the upper bound for $cpi_{stall}$ derived using the lower bound for the overlap is tighter than the one computed using the ideal $cpi_0$.
The ideal $cpi_0$ is independent of any code characteristic and depends only on the architecture, so a bound based on it cannot capture all the cases well. Our upper bound captures code characteristics and therefore provides a tighter upper bound on the actual $cpi_{stall}$.
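The comparison between the two upper bounds can be illustrated numerically. The Python sketch below makes two assumptions that are not stated explicitly above: the overlap term subtracted from the maximized $cpi_{stall}$ is the lower bound computed with a best $cpi_0$ of 0.25, and the ideal-based bound is the overall cpi minus 0.25. Function names and input values are hypothetical.

```python
def overlap_lower_bound(cpi: float, max_stall: float,
                        best_cpi0: float = 0.25) -> float:
    """Lower bound on the overlap: best cpi_0 minus minimum cpi_0, clamped at 0."""
    return max(0.0, best_cpi0 - (cpi - max_stall))


def stall_upper_bound(max_stall: float, overlap_lb: float) -> float:
    """Upper bound on cpi_stall: maximized stall minus the overlap estimate."""
    return max_stall - overlap_lb


def stall_upper_bound_ideal(cpi: float, ideal_cpi0: float = 0.25) -> float:
    """Alternative upper bound that uses only the ideal cpi_0 (assumed form)."""
    return cpi - ideal_cpi0


# (overall cpi, maximized stall cpi) for two illustrative situations.
cases = {"high stall": (1.5, 2.0), "low stall": (1.5, 0.6)}

for name, (cpi, max_stall) in cases.items():
    lb = overlap_lower_bound(cpi, max_stall)
    ours = stall_upper_bound(max_stall, lb)
    ideal = stall_upper_bound_ideal(cpi)
    print(f"{name}: overlap-based bound = {ours:.2f}, ideal-based bound = {ideal:.2f}")
```

Under these assumptions the overlap-based bound is never larger than the ideal-based one, and it is strictly smaller whenever the maximized stall is itself below the overall cpi minus 0.25.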
Lower bound for $cpi_0$
A lower bound for $cpi_0$ can be derived in a way similar to the derivation of the upper bound for $cpi_{stall}$. The overlap contribution is estimated using our previous result on the lower bound, as in equation (10).

$$cpi_0^{lb} = cpi - \sum_{i=2}^{nlevels} h_i T_i + cpi_{overlap}^{lb} \qquad (10)$$
The lower bound for $cpi_0$ is shown in figure 10 as a function of the overall cpi and $cpi_{L1}$ (where $cpi_{L1}$ is the approximated value, measurable for problem sizes that are primary cache resident). The chart shows that the lower bound is always smaller than the overall cpi and always smaller than $cpi_{L1}$. Also, the lower bound is always greater than the best $cpi_0$, which on the system under consideration is 0.25. While 0.25 can be considered a lower bound, it cannot be considered a tight one, since it is independent of any code characteristics and is only architecture dependent. Instead, the lower bound of equation (10) considers components such as the overall cpi, the stall component, and the overlap, all of which are code and architecture dependent. Thus, in those cases where the stall time is minimal, the lower bound reflects this with a value close to the overall cpi and $cpi_{L1}$.
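The behavior described above can be reproduced with a small Python sketch. It assumes that the lower bound for $cpi_0$ is the overall cpi minus the maximized stall plus the overlap lower bound, with the overlap lower bound computed from a best $cpi_0$ of 0.25. The function name and the input values are hypothetical.

```python
def cpi0_lower_bound(cpi: float, max_stall: float,
                     best_cpi0: float = 0.25) -> float:
    """Lower bound for cpi_0 (assumed form): cpi - max stall + overlap lower bound."""
    min_cpi0 = cpi - max_stall                     # minimum cpi_0, by difference
    overlap_lb = max(0.0, best_cpi0 - min_cpi0)    # overlap lower bound
    return min_cpi0 + overlap_lb


# Memory-bound case: the bound falls back to the architectural best value.
print(f"{cpi0_lower_bound(cpi=1.5, max_stall=2.0):.2f}")

# Cache-resident case: the bound stays close to the overall cpi.
print(f"{cpi0_lower_bound(cpi=0.8, max_stall=0.01):.2f}")
```

With these definitions the bound never drops below the best $cpi_0$ and approaches the overall cpi as the maximized stall shrinks, which is the pattern discussed for figure 10.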
Conclusion
This paper proposes an empirical technique to compute a lower bound that characterizes the effects of overlap. The result is a tight, though not sharp, lower bound. The lower bound obtained using the technique described in section 4 has been applied to characterize the performance of the ASCI benchmarks. Section 5 shows in what proportion overlap is present in each code, as well as how it varies with the problem size. A comparison of two different architectures using the lower bound is also presented in section 5. The lower bound result has also been applied to compute tight bounds for two quantities that cannot be directly measured: an upper bound for the $cpi_{stall}$ component and a lower bound for the $cpi_0$ component are presented in section 6. The technique is an empirical one: the lower bound is computed from measured data. Compared with iterative methods such as those used in [2,9], a lower bound computed as described here is a better approximation of the actual value. While an iterative method such as a least-squares fit has a margin of error that can be negative or positive, with the lower bound it is guaranteed that the actual value is at least as large. The result obtained using this technique could be used in combination with a least-squares fit to provide additional constraints that could reduce the percentage of error. Future studies will continue to analyze the problem of characterizing the impact of overlap, looking particularly at how different architecture features contribute to the overall overlap. Future studies will also try to apply this technique to performance prediction.
