Abstract
Introduction
The microprocessor industry is capable of predicting the clock frequency and functionality of first silicon. However, predicting its performance on real programs is still a challenge. Performance models are used during the development of microprocessors to predict the average number of instructions executed per cycle (IPC). IPC measures the instruction throughput and is a key factor in the overall performance of a microprocessor. Performance models are usually implemented in a high-level language, such as C or C++. They measure instruction throughput by capturing the timing essence of instruction execution. This is done by modeling the latency of instructions, the dependencies between instructions, and the allocation of limited resources in the microprocessor implementation. Microprocessor designers use performance models to assist in design decisions [1]. These models allow the designer to evaluate new ideas without the cost of fully implementing them in hardware or a hardware description language, such as Verilog or VHDL.
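As a minimal illustration of how latency, dependencies, and a limited resource combine into an IPC figure, consider the toy in-order model sketched below. It is not the model used in this study: the `Instr` record, the `simulate` function, and the single `width` limit are hypothetical names chosen for illustration, and every real feature (renaming, out-of-order issue, caches) is deliberately left out.

```python
from dataclasses import dataclass, field


@dataclass
class Instr:
    name: str
    latency: int                               # cycles from issue to result
    deps: list = field(default_factory=list)   # indices of earlier producers


def simulate(instrs, width):
    """Toy in-order timing model: returns (total_cycles, ipc).

    An instruction issues once all of its producers have finished, never
    before an older instruction, and at most `width` instructions issue
    in any one cycle.
    """
    finish = []   # finish[i] = cycle in which instruction i produces its result
    cycle = 0     # cycle in which the next instruction may issue
    issued = 0    # instructions already issued in `cycle`
    for ins in instrs:
        ready = max((finish[d] for d in ins.deps), default=0)
        if ready > cycle:            # stall on a data dependency
            cycle, issued = ready, 0
        elif issued == width:        # stall on the issue-width limit
            cycle, issued = cycle + 1, 0
        finish.append(cycle + ins.latency)
        issued += 1
    total = max(finish, default=0)
    return total, (len(instrs) / total if total else 0.0)
```

In this sketch a chain of three dependent single-cycle instructions takes three cycles regardless of `width`, while four independent ones on a 2-wide machine take two cycles. An incorrect latency or a missing dependency changes the cycle count directly, which is the kind of error the validation method in this paper is meant to expose.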
Often the validation of such performance models is not an algorithmic process and relies largely on "validation by inspection."
As in all design efforts, performance modeling is susceptible to many sources of error, including but not limited to modeling errors, specification errors, and abstraction errors. Modeling errors occur when the developer understands the modeling task, yet incorrectly codes the desired functionality. Specification errors occur when the developer is misinformed about the correct functionality and models the wrong behavior. To keep performance models simple and fast, the details of some features are abstracted. An abstraction error occurs when the developer implements a feature at a higher level of abstraction without maintaining equivalent instruction timing. Abstraction errors also arise from features that are not implemented at all but turn out to have a significant impact on instruction execution timing.
Much like hardware functional validation in the past, today's performance model validation relies mostly on inspection. The inspection validation process is a programmer intensive effort that involves analyzing the simulation process and performing sanity checks on the simulation results.
This may require single stepping through the execution cycles of small code sequences and observing the internal states of the performance model. The developer is responsible for spotting incorrect behavior as the code sequences are executed. Sanity checking utilizes an array of instrumentation counters that are built into the performance model. These counters gather statistics, such as IPC, cache miss rate, instruction dispatch rate, and average and peak resource usage, and so provide a summary of the behavior of a code sequence. Simple code sequences are developed to exercise the boundary conditions of all resources in the performance model. The code sequences are simulated and the counter statistics are analyzed to try to find incorrect behavior. Longer code sequences and benchmarks are also used for sanity checks. Large benchmarks are simulated several times with different microarchitecture configurations. The counter statistics are analyzed, and if the expected behavior is found, the confidence in the performance model increases.
If the counter statistics do not match intuitive expectations and an explanation cannot be determined, the performance model is searched for possible errors.
Although these inspection techniques are effective at debugging a model, without a systematic method a high level of confidence cannot be placed on the results produced by these performance models. This paper documents an experimental study in calibrating a performance model against actual hardware, and demonstrates the risks of relying solely on an inspection-based validation. It proposes a systematic method for calibrating a performance model against a Register Transfer Level (RTL) implementation of the microprocessor. RTL models are gate level models of the hardware. The experimental results in this paper suggest the potential effectiveness of the proposed calibration method.
To demonstrate the effectiveness of the proposed method a performance model of the PowerPC 604 is developed (Section 2) using only inspection-based validation. This "infant" performance model is exercised by the validation method proposed in Section 3, and the results are presented in Section 4. The proposed method is used to debug the infant performance model and evolve it to a more mature "child" performance model in Section 5. Section 6 discusses the effectiveness of the proposed method.
Performance Model Development
The experimental vehicle chosen for this study is a performance model of the PowerPC 604, implemented in the Microarchitecture Workbench (MW), a performance modeling environment developed at Carnegie Mellon University [2, 3]. MW provides a framework that minimizes the amount of work required to develop a performance model and the amount of changes necessary to evaluate new features in a microarchitecture. For a more complete description of the MW tool refer to [2, 3, 4]. MW has been used to model a number of microprocessors, including Alpha 21064/21164 and PowerPC 601/604/620, and has been distributed to a number of universities and industrial companies.
The PowerPC 604 is a state-of-the-art 4-wide superscalar microprocessor capable of out-of-order execution. It can fetch, dispatch, and complete up to four instructions per cycle, and can issue and execute up to six instructions per cycle. The MW-based PowerPC 604 model is developed based on information in published reports [5, 6, 7]. The performance model is capable of modeling all the key microarchitecture features of the PowerPC 604, including branch prediction, instruction fetching, instruction dispatching, register renaming, out-of-order instruction issue and execution, execution result forwarding, the non-blocking cache hierarchy, load/store alias detection, instruction refetching, and in-order completion. Since the PowerPC 604 performance model is trace driven, mispredicted paths are not simulated. However, the PowerPC 604 performs a pipeline flush on branch misprediction, so the penalty cycles due to misprediction can be accurately modeled. The only shortcoming of this abstraction is that potential instruction cache pollution is not modeled.
are simply not verified in functional validation. This paper proposes using these same test suites to validate the instruction execution timing in a performance model.
Performance Model Validation

Performance model validation reference
All validation methods require a reference in order to determine if a test sequence passes or fails. Hence, in this study we chose the actual hardware, instead of the RTL model, as the reference machine, and validate our performance model against a physical PowerPC 604 system. More specifically, an IBM Power Series 850 system running AIX 4.1.3 is chosen as the reference machine. This is feasible because the PowerPC 604 chip provides a set of hardware embedded counters. These counters provide adequate observability into the execution of code to extract cycle counts, instruction count, completion rate, and other statistics.
PowerPC 604 hardware counters
The PowerPC 604 has a set of hardware embedded counters called the performance monitor. [7] provides a detailed description of the performance monitor features. The performance monitor of the PowerPC 604 has two 32-bit counters and a 32-bit control register. The two performance monitor counters count "events" during execution. The control register determines which "event" each counter will monitor during the execution of a program. The PowerPC 604 allows the counting of many "events" such as cache and TLB misses, instruction dispatching, instruction finish, instruction completion, and load/store miss latencies. Different "event" counting modes allow the user to count only supervisor code, user code, or specially marked processes.
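The counter arrangement described above can be pictured with a small software stand-in. The sketch below is only an illustration: the class name, the string event names, and the wrap-around on overflow are assumptions made here, and it does not reproduce the actual register encodings or overflow behavior documented in [7].

```python
MASK32 = 0xFFFFFFFF  # both counters are 32 bits wide


class PerformanceMonitor:
    """Software stand-in for two event counters plus an event selection."""

    def __init__(self):
        self.selected = [None, None]   # event monitored by each counter
        self.counters = [0, 0]

    def select(self, counter, event):
        """Choose which event a counter monitors and clear that counter."""
        self.selected[counter] = event
        self.counters[counter] = 0

    def record(self, event, n=1):
        """Report n occurrences of an event; only selected events are counted."""
        for i, sel in enumerate(self.selected):
            if sel == event:
                # Assumption: the counter wraps modulo 2**32 on overflow.
                self.counters[i] = (self.counters[i] + n) & MASK32
```

Because only two events can be observed at a time, gathering a fuller set of statistics for one test sequence means re-running it with different selections, for example cycles and completed instructions on one run and cache misses on the next.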
Hardware counter interface
The PowerPC 604 hardware counters are implemented as special registers accessible only in supervisory mode. Special supervisor instructions provide read/write access to the counter and control registers. Since these registers are accessible only in supervisory mode, a set of dynamically loadable AIX pseudo devices is developed to interface user code to the supervisor instructions. The header of each pseudo device initializes register state, flushes the machine pipeline, then configures and starts the performance monitor counters. The test sequence is executed, and a trailer stops the counters and returns the count results to the calling function. This arrangement provides a clean interface to the performance monitor with a small fixed overhead.
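The header/test/trailer structure, and the way a fixed overhead can be removed from a measurement, might be sketched as follows. Everything here is hypothetical: `FakeCycleCounter` stands in for the real supervisor-mode counter, and calibrating the overhead by timing an empty test sequence is an assumption of this sketch, not a procedure stated in the text.

```python
class FakeCycleCounter:
    """Stand-in for the hardware cycle counter; each read costs 3 cycles."""

    READ_COST = 3

    def __init__(self):
        self.cycles = 0

    def read(self):
        self.cycles += self.READ_COST
        return self.cycles

    def run(self, n):
        """Pretend to execute a test sequence that takes n cycles."""
        self.cycles += n


def measure(counter, test, overhead=0):
    """Header starts counting, the test runs, the trailer stops counting."""
    start = counter.read()    # header
    test()                    # test sequence under measurement
    stop = counter.read()     # trailer
    return stop - start - overhead


def calibrate(counter):
    """Estimate the fixed overhead by measuring an empty test sequence."""
    return measure(counter, lambda: None)
```

With the fake counter, `calibrate` reports an overhead of 3 cycles, and a 10-cycle test measured with that overhead subtracted reports exactly 10. On real hardware the overhead would have to be confirmed as constant across runs before being subtracted.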
Performance model validation test patterns
For this study, we define five different test suites to target specific portions of the PowerPC 604 performance model. Alpha tests exercise instruction latency by executing each instruction one at a time. Beta tests check pipeline dependencies within an instruction type, e.g., data forwarding and register renaming, by executing each instruction 2 to 100 times back to back. Gamma tests execute each instruction next to every other instruction, testing pipeline dependencies between instruction types and data forwarding across functional units. Random test sequences of up to 100 instructions are used to randomly exercise the interactions between instructions and the different components of the microarchitecture. Finally, handwritten patterns are created to test microarchitecture features for which tests are difficult to generate automatically or which are not sufficiently covered by random sequences.
These five test suites are only a small portion of the functional validation process used by IBM and Motorola for PowerPC microprocessors. A complete validation process involves dozens of automatically generated and specially directed test suites and years of random test sequence execution.
While this study only utilizes the five test suites to illustrate the need for performance model validation, the calibration method suggested by this work advocates the use of all functional validation test suites.
Alpha, beta, and gamma test suites
The alpha, beta, and gamma test suites are automatically generated by an Automatic Test Program Generator (ATPG) that is built into the MW framework. The ATPG tool extracts the Instruction Set Architecture (ISA) information from the PowerPC 604 performance model and generates executable code sequences that fall into the alpha, beta, and gamma categories of test sequences.
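The shape of the three generated suites can be sketched in a few lines. This is not the MW ATPG: the function names are hypothetical, instructions are reduced to bare mnemonics without operands, and the gamma suite is taken here to mean every ordered pair of distinct instructions, which is one reading of "each instruction next to every other instruction."

```python
from itertools import product


def alpha_tests(isa):
    """One sequence per instruction, executed on its own (latency)."""
    return [[op] for op in isa]


def beta_tests(isa, max_repeat=100):
    """Each instruction repeated 2..max_repeat times back to back."""
    return [[op] * n for op in isa for n in range(2, max_repeat + 1)]


def gamma_tests(isa):
    """Every ordered pair of distinct instructions, placed side by side."""
    return [[a, b] for a, b in product(isa, repeat=2) if a != b]
```

For an ISA of N instructions this yields N alpha sequences, 99N beta sequences, and N(N-1) gamma sequences, so the gamma suite dominates the test count as the instruction set grows.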
Random test sequences
The ATPG also generates executable random code sequences. All source and destination operands are randomly selected with the exception of branch instructions and load/store instructions.
Branches are not allowed to leave the program space and loop for a random amount of time less than a preprogrammed maximum. Load and store instructions access random data locations from an allocated data space.
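The constraints on random sequences can be illustrated with a small generator. The sketch below is an assumption-laden stand-in for the ATPG: the five mnemonics, the dictionary layout, the backward-only branch targets, and parameters such as `data_words` and `max_loops` are all chosen here for illustration.

```python
import random

OPS = ["add", "mullw", "lwz", "stw", "bc"]   # illustrative subset of the ISA


def random_sequence(rng, max_length=100, num_regs=32,
                    data_words=256, max_loops=8):
    """Generate a random test sequence of 1..max_length instructions."""
    seq = []
    for pc in range(rng.randint(1, max_length)):
        op = rng.choice(OPS)
        if op == "bc":
            # Branch stays inside the sequence and loops fewer than
            # max_loops times.
            seq.append({"op": op,
                        "target": rng.randrange(pc + 1),
                        "loops": rng.randrange(max_loops)})
        elif op in ("lwz", "stw"):
            # Word-aligned address inside the allocated data space.
            seq.append({"op": op,
                        "reg": rng.randrange(num_regs),
                        "addr": 4 * rng.randrange(data_words)})
        else:
            # Source and destination registers chosen freely at random.
            seq.append({"op": op,
                        "dst": rng.randrange(num_regs),
                        "src1": rng.randrange(num_regs),
                        "src2": rng.randrange(num_regs)})
    return seq
```

Passing a seeded generator, such as `random.Random(0)`, makes a failing sequence reproducible, which matters when the same sequence must be run on both the reference machine and the performance model.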
Handwritten test suite
Handwritten tests are usually designed to exercise boundary conditions and obscure functional states that a random test generator may not exercise. These tests are also used to improve "test coverage" by stimulating certain signals in the hardware implementation. As with hardware models, performance models have boundary conditions and behaviors involving obscure states. Proper validation of a performance model must include these tests.
The validation of the PowerPC 604 performance model includes several such test sequences. These test sequences are mostly branch and load/store tests, designed to exercise the branch paradigm and memory hierarchy. These tests are inserted into the performance monitor device drivers. Hardware counter statistics are gathered to pinpoint the behavior of the hardware. 
Putting it all together
Infant Model Validation
The initial MW-based PowerPC 604 performance model developed in this study is called the "infant" model. To illustrate the need for a systematic performance model validation, the infant model is validated using only the inspection process described in Section 1. After a long debugging period using inspection, many bugs were found and repaired. During this period, single stepping of the simulation process and sanity checks on the simulation results were performed. It is expected that the resultant infant model should accurately model the PowerPC 604, with few bugs remaining. It is also assumed that if the debugging process is continued the few remaining bugs will be found.
Subsequently, the proposed performance model validation method, described in Section 3, is applied using the five test suites stated above. 
Child Model Validation
The validation method outlined in Section 3 is then used to debug the infant model. After this validation process the infant model is promoted to a child model, to reflect a more mature performance model. Bugs infected every aspect of the microarchitecture and ranged from incorrect instruction latencies to instructions dispatching to the wrong functional units. Table 2 lists some of the more interesting errors found during validation. It is apparent from Table 2 that errors of all types play a role in the accuracy of performance models. Such a wide range of failures indicates the need for a systematic validation method. The significant number of modeling errors further demonstrates that inspection methods will not fully debug a performance model. Table 3 shows the number of tests or test sequences in each test suite, with the current number of passing and failing sequences.
Accurate modeling of instruction latency (Alpha tests) increased from 50% to 96%. Pipeline modeling (Beta tests) also shows improvement, from 30% to 75%. If the test pass requirement is relaxed from a zero cycle difference to a single cycle difference, the Alpha tests jump to 100% passing and the Beta tests pass 85% of the time. The Gamma tests now show a pass rate of 80% for a zero cycle difference. It is clear from the results of this validation process that significant effort is required in order to achieve a reliable and accurate performance model. Even our much-improved child model still requires further validation to ensure maturing into adulthood before it can be used with confidence to accurately predict performance.
a. Random sequence numbers were never recorded due to 100% failure.
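The effect of relaxing the pass requirement from a zero cycle difference to a single cycle difference can be made concrete with a short sketch. The function name and the sample cycle counts are hypothetical and are not data from this study.

```python
def pass_rate(results, tolerance=0):
    """Percentage of tests whose model cycle count is within `tolerance`
    cycles of the reference cycle count.

    `results` is a list of (reference_cycles, model_cycles) pairs.
    """
    passed = sum(1 for ref, model in results if abs(ref - model) <= tolerance)
    return 100.0 * passed / len(results)


# Hypothetical per-test cycle counts: (hardware, performance model).
sample = [(10, 10), (12, 13), (8, 11), (5, 5)]
strict = pass_rate(sample)                 # zero cycle difference required
relaxed = pass_rate(sample, tolerance=1)   # single cycle difference allowed
```

On the sample above, two of the four tests match exactly and a third is off by one cycle, so the strict rate is 50% and the relaxed rate is 75%. Tests that pass only under the relaxed criterion are still worth investigating, since a one-cycle error in a short sequence can accumulate over a long benchmark.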
Analysis of Validation Results
The results in Section 5 demonstrate the effectiveness of the proposed validation method at finding bugs in performance models. However, to verify the effectiveness of the proposed validation method at improving accuracy, longer code sequences need to be executed. A small set of real benchmarks are selected and executed on the reference machine as well as the infant and child performance models of the PowerPC 604. Obviously, the smaller the delta between the reference cycle counts and the performance model cycle counts the more accurate the performance model.
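One way to turn the delta between reference and model cycle counts into a single accuracy figure is sketched below. The text does not state how the average discrepancy is computed, so the mean of per-benchmark absolute relative errors used here is an assumption, and the cycle counts in the example are invented for illustration.

```python
def avg_cycle_discrepancy(pairs):
    """Mean absolute relative cycle-count error, in percent.

    `pairs` is a list of (reference_cycles, model_cycles) tuples, one per
    benchmark. Assumption: each benchmark is weighted equally.
    """
    errors = [abs(model - ref) / ref for ref, model in pairs]
    return 100.0 * sum(errors) / len(errors)


# Invented example: one benchmark overestimated by 10%, one underestimated by 5%.
example = [(1000, 1100), (2000, 1900)]
discrepancy = avg_cycle_discrepancy(example)   # approximately 7.5
```

Taking the absolute value per benchmark keeps overestimates and underestimates from cancelling each other, which would otherwise make a model with large errors in both directions look accurate.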
This section documents the results from executing real code on the hardware, extracting execution traces of the code execution, and running the traces on both the infant and child performance models. The primary source of error in this experiment is in trace gathering. Does the tracing tool accurately trace the actual run-time execution of the benchmarks?
Benchmark instruction count correlation
Dynamic instruction count is a microarchitecture-independent metric that is common to both the hardware execution and trace simulation. It is assumed that if the instruction count is the same between a hardware execution and a trace simulation, the trace accurately captures the run-time execution of the benchmark. The benchmarks used to verify this validation process are listed in Table 4, along with their input data sets. The table includes the instruction counts for both the hardware execution and the trace-driven performance simulation. These numbers demonstrate a strong correlation between the two instruction counts. There are two sources of error that can account for the small discrepancies:
Tool overhead: Both the hardware counters and the tracing tool have a fixed overhead of 621 and 13 hardware counters and trace gathering. However, these constant fixed overheads offset and only add 64 instructions to the hardware instruction counts.
Trace gathering: The tracing tool used in this study is designed to trace library calls, however it is unclear how extensive the trace reaches into each library call. The hardware counters count a user process up to the point it switches into supervisor mode. Some additional instructions in the hardware count are due to this discrepancy.
Supporting these assertions, the hardware counter instruction counts are consistently higher than those of the simulation traces. With the exception of the eqntott benchmark, there is strong correlation between the two counts. Therefore, very reliable cycle count results for cjpeg, grep, gperf, mpeg, and quick are expected.
Benchmark cycle count correlation
Final analysis
After debugging the infant model, using the proposed validation method, the accuracy of the performance model on benchmark execution improved from 7.49% to 4.25% average cycle-count discrepancy. The application of the validation method improves both the accuracy and the confidence in the performance model. However, it is observed in Table 5 Figure 2 illustrates the maturing of a performance model from infancy, through childhood and teenage period, and finally to adulthood.
The microprocessor designer uses the performance model to assist in design decisions by adding new features and comparing the execution results to a baseline model. This is called "A-B testing."
A-B testing is a normal practice in industry; however, it is based on the naive assumption that performance models asymptotically approach correctness very rapidly. If the actual behavior is very sporadic, as observed in this study, the designer cannot make quality design decisions without an
The experimental results obtained in this paper clearly highlight the difficulty in developing an accurate performance model. As microarchitecture complexity continues to increase, especially with the incorporation of very aggressive speculation techniques [8, 9, 10], accurate performance modeling and the validation of performance models will continue to be a great challenge.
Acknowledgments
The research efforts presented here have been supported by NSF (CCR 9423272) and ONR (N00014-95-1-1112 and N00014-96-1-0347). We have also benefited from the generous donation of a large number of Pentium Pro Systems from Intel. We would like to give special thanks to Mar-
