As useful as performance counters are, the meaning of reported aggregate event counts is sometimes questionable. Questions arise due to unanticipated processor behavior, overhead associated with the interface, the granularity of the monitored code, hardware errors, and lack of standards w.r.t. event definitions. To explore these issues, we are conducting a sequence of studies using carefully-crafted microbenchmarks that permit the accurate prediction of event counts and investigation of the differences between hardware-reported and predicted event counts. This paper presents the methodology employed, some of the microbenchmarks developed, and some of the information uncovered to date. The information provided by this work allows application developers to better understand the data provided by hardware performance counters and better utilize it to tune application performance. A goal of this research is to develop a cross-platform microbenchmark suite that can be used by application developers for these purposes. Some of the microbenchmarks in this suite are discussed in the paper.
INTRODUCTION
Performance monitoring hardware consists of a set of registers that record information about different processor events that occur during application execution. This information is in the form of aggregate event counts or sampled event traces. For example, when used in the former manner, the registers accumulate counts of the occurrences of events triggered by memory hierarchy activity, such as level-one data-cache misses or translation-lookaside buffer misses, or functional unit activity, such as the execution of floating-point or branch instructions. Processors with hardware performance counters include the DEC Alpha, IBM Power, Intel Pentium, and Sun UltraSPARC series [5].
Different manufacturers provide software that interfaces with hardware performance counters on their processors. Higher-level user interfaces, such as the Performance Application Programming Interface (PAPI) [2], the Hardware Performance Monitor (HPM) Tool Kit [3], and the Hardware Activity Reporter (HAR) [7], facilitate access to performance counters. For example, PAPI provides a cross-platform user interface to access performance counters on various processors. It can be used to monitor a set of 104 different events. Note, however, that no platform supports all 104 PAPI events. For example, the Pentium II and Pentium III processors support 49 of these events, while the Power4 supports 22.
As useful as performance counters are, the meaning of reported aggregate event counts is sometimes questionable. Questions arise due to unanticipated processor behavior, overhead associated with the interface, the granularity of the monitored code, hardware errors, and lack of standards w.r.t. event definitions. To explore these issues, we are conducting a sequence of studies using carefully-crafted microbenchmarks that permit the accurate prediction of event counts and investigation of the differences between hardware-reported and predicted event counts. This paper presents the methodology employed, some of the microbenchmarks developed, and some of the information uncovered to date. The information provided by this work allows application developers to better understand the data provided by hardware performance counters and better utilize it to tune application performance. A goal of this research is to develop a cross-platform microbenchmark suite that can be used by application developers for these purposes. Some of the microbenchmarks in this suite are discussed in the paper.
METHODOLOGY
The methodology used to study aggregate event counts, which was presented in [5, 6, 8], has been refined. It now prescribes microbenchmarks that are parameterized and have built-in scalability, and it includes a feedback loop. In this context, parameterization and scalability are meant to ease experimentation across platforms and across a range of increasingly larger event counts. These modifications to the methodology enhance microbenchmark design and implementation.
The methodology, applied to a particular event, includes the following steps:
1. design and implement, when possible, a parameterized, scalable microbenchmark that permits event count predictions;
2. predict event counts for a range of increasingly larger event counts using tools and mathematical models developed with knowledge of the architecture, operating system, and compiler;
3. collect hardware-reported event counts using PAPI or another API;
4. compare predicted and reported event counts; and
5. analyze experimental results to identify and quantify differences between predicted and reported counts; if necessary, repeat all five steps to reflect the knowledge gained from the analysis.
Microbenchmarks are written in C and use PAPI to collect hardware-reported event counts. They are designed to allow validation of hardware event definitions and prediction of the number of event occurrences. These objectives are meant to ensure that hardware performance counters count what application programmers expect them to count and to explain why the reported counts are sometimes not what is expected. In this context, a microbenchmark is designed according to the following criteria:
1. functionality, in terms of stressing that part of the microarchitecture or memory hierarchy that triggers the target event;
2. compactness, in terms of static size;
3. efficiency, in terms of execution time;
4. simplicity, in terms of the amount of potential concurrency that can be exploited by the microarchitecture and memory subsystem; and
5. portability among microprocessors.
When designing a microbenchmark, the similarities across platforms are exploited when possible. In the best case, the high-level language microbenchmark is simply ported to the various platforms. In the worst case, the design is portable but the implementation of the design changes with the platform.
MICROBENCHMARKS AND RESULTS
We now present examples of the microbenchmarks that we have used to evaluate performance counter data as well as results of these evaluation studies.
Branch Misprediction
The platforms of interest that support the branch misprediction event are the Pentium III, Itanium, Power3-II, and R12K. For these super-scalar processors, which support speculative execution, effective branch prediction is essential for good performance. The branch prediction algorithms implemented on these processors are proprietary and quite sophisticated [4], so much so that reverse-engineering them is extremely difficult. Thus, rather than trying to characterize these algorithms and study various types of branching behaviors, we identify a control structure for which branch misprediction behavior is identical across platforms and exploit this observation to produce a cross-platform microbenchmark. The hypothesis is that, for all platforms of interest, a for-loop causes a branch misprediction event only upon exit and all other branches are predicted correctly. Using this hypothesis, the branch misprediction microbenchmark, shown in Figure 1, was designed. It consists of two nested for-loops. The inner loop is iterated a constant number of times (10), while the outer loop is iterated a variable number of times. Scalability is achieved by parameterizing the outer loop, which is iterated the specified number of times. To cause n mispredictions, the input parameter must cause n-1 iterations of the outer loop, which result in n-1 mispredictions on the n-1 exits of the inner loop plus one misprediction on exit from the outer loop. The initial port of the microbenchmark was to the Pentium III, where the hardware-reported count is within 1% of the predicted count. The code was ported to the remaining processors of interest, with similar (<1%) results, which makes this code our first verified cross-platform microbenchmark.
Floating-point Square Root
The Power3-II and Power4 are the only platforms of interest that support the square-root event. Ported to these systems, the relatively simple floating-point square-root microbenchmark monitors a for-loop that sequentially accesses an array of floating-point numbers, computing the square root of each. In order for square root to be implemented in hardware on these platforms, the microbenchmark must be compiled with an optimization level of at least three. Predicted event counts are obtained using a script that counts the number of high-level language square-root instructions. For the Power3-II, the hardware-reported square-root event counts match the predicted counts only when more than 100 square-root instructions are computed. Similarly, for the Power4, the reported counts match the predicted counts only when more than 86 square-root instructions are executed. In general, when the reported and predicted counts do not agree, the reported count is zero. This discrepancy indicates a possible hardware or firmware error.
Cycles
All platforms of interest support an event that permits the counting of the number of cycles elapsed during the execution of monitored code. To be able to predict the elapsed number of cycles, however, events that may not be modeled accurately, e.g., latency-hiding events, pipeline stalls, and resource contention, must be eliminated. For example, in most modern processors, multiple cache misses can be outstanding; miss handling is concurrent with instruction execution, thus hiding a portion of the miss penalty; multiple functional units execute concurrently and compete for access to resources such as caches and microarchitecture buses; and data dependencies may introduce bubbles in the pipeline. Such events can be hard to model and may make it difficult to predict event counts. The initial idea behind the cycles microbenchmark design was therefore to monitor only one type of instruction, stressing only one functional unit, the one with the smallest latency, and so generating an instruction stream that could be modeled easily. For the Pentium III, the first platform to which the microbenchmark was ported, this translates into stressing the integer ALU with integer add instructions, each of which has a latency of one cycle. However, this sounds easier than it actually is.
The first hurdle that presented itself is associated with the compiler. The microbenchmarks are compiled with no optimization so that the monitored code is not changed. If the compiler changes the code organization or eliminates instructions, event count prediction is thwarted and hand-tuning of the assembler code, when necessary, is difficult. However, with no optimization, no variables are allocated to registers, and a simple add instruction in the high-level language (C) results in three memory-accessing instructions: a load, load-add, and store, which may introduce additional cycles, especially when they generate cache misses. To eliminate these extra, unpredictable cycles, register variables are declared manually. To identify registers available for allocation to variables, the assembler code is analyzed to determine the liveness [4] of registers. Figure 2 presents an example of how this is done. Once enough free registers are identified, as shown in Figure 3, the sequence of three memory-accessing instructions is reduced to a single register-to-register integer add assembler instruction with one-cycle latency.
The next hurdles are associated with the microarchitecture and memory hierarchy. It was determined experimentally that for each of the two ALUs, two alternating integer add instructions, each using a unique register, need to be issued. If two instructions that use the same register are dispatched consecutively to the same ALU, then bubbles may be introduced to allow time for a register update.
Originally, to avoid the complexity associated with branch instructions, the code was implemented as an inline microbenchmark, i.e., one with sequential execution flow. However, as the code size grew, the monitored event counts (cycles, cache misses, and TLB misses) indicated that instruction cache and TLB misses contributed "extra" (unpredicted) cycles to the cycle event count. To minimize these "extra" cycles associated with misses in the memory hierarchy, a relatively simple for-loop microbenchmark with only integer-add instructions in the monitored loop body was implemented.
Results from the branch misprediction microbenchmark, discussed in Section 3.1, indicate that penalties due to a mispredicted branch occur only once, as the loop is exited, and do not contribute much overhead. To avoid data cache misses and the associated "extra" cycles, the for-loop variables also need to be assigned to registers. The final cycles microbenchmark, devoid of events that introduce "extra" cycles, resulted in hardware-reported cycle event counts within 1% of predicted event counts. The design was also ported to the Power4, with similar (<1%) results.
Instructions Issued and Completed
The instructions issued and completed microbenchmark includes a monitored for-loop with a body of 10-20 add instructions. The predicted event counts are derived from the disassembled object module. For these two events, the predicted counts are the same as the number of dynamic instructions between the PAPI directives. This microbenchmark has been used on a variety of platforms. For example, the results on the Power3-II and the Pentium III show a small (<1%) difference between predicted and reported event counts. However, not all platforms show such results. On the Itanium platform, the reported counts are consistently 17% larger than the predicted counts. This is due to no-ops introduced by the compiler to "pad" VLIWs (very long instruction words); this was discovered by inspection of the disassembled object module. Accounting for this in the predicted counts, the difference is close to 0%. A similar situation exists on the Power4, where the initial results showed a difference of up to 400%. One hypothesis for the difference is that the way instructions are packaged inside the Power4 affects the number of instructions issued. In order to keep track of the instructions in flight, groups of five instructions are formed. The groups cannot be dispatched until all resources are available [9]. For each high-level add instruction in the microbenchmark, a sequence of load, add, and store instructions is generated in the assembler code. It appears that the load and store instructions cannot be issued in the same group because of data hazards. This would force those instructions to be issued in different groups, with no-ops filling the rest of each group. However, this "packaging" is done within the microarchitecture, so, unlike on the Itanium, no extra instructions (no-ops) are visible in the disassembled object module. In order to minimize this effect, all load and store instructions were removed by using register variables. With this new microbenchmark, the results show a 2% difference between predicted and reported counts.
CONCLUSIONS AND FUTURE WORK
As illustrated by the events discussed in this paper, the majority of hardware-reported event counts agree with predicted counts. Earlier work [5, 6, 8], which presented work in progress, indicates differences between hardware-reported and predicted event counts that were not yet understood. Many of these differences were later understood as we refined our methodology.
Results for instruction cache misses and floating-point instructions are also available [1], while other events are currently under study.
As new processors are introduced into the marketplace, the representativeness of their event counts needs to be evaluated. Developing new architecture-specific microbenchmarks is too expensive. To minimize this work, architecture-independent microbenchmarks, when applicable, are essential. The goal of this research is to develop a suite of microbenchmarks that can be ported easily to any platform. For events for which the microbenchmark must be tailored to the architecture, rather than just ported or parameterized, the goal is to port the microbenchmark design, if not the implementation. The microbenchmarks presented in this paper are a first step towards this goal. Such a suite would facilitate the evaluation of performance counters and their use by application programmers to tune their codes.
Some microbenchmarks, for example the instruction cache miss microbenchmark, cannot be ported across platforms because, for the same event, the event definitions differ. This research shows that event definitions need to be standardized across platforms to avoid misunderstanding and misuse of data obtained from performance counters.
For future work, the porting of the cross-platform microbenchmarks to the platforms of interest will be completed. In addition, the events associated with the Cray X1 and R16K platforms will be evaluated using the cross-platform microbenchmarks; this will test the portability of these microbenchmarks.
