Abstract
Introduction
The superscalar microarchitecture has proven to be a good way to improve microprocessor performance while preserving backward software compatibility. One important design issue in superscalar architectures is resource allocation: how many integer units, load/store units, branch units, and floating point units are necessary? How should resources be divided between the register renaming unit and the branch prediction unit? The keys to answering these questions are the micro operation level parallelism (MLP) and the distribution of functional unit usage, both derived from typical, real application software. Such measurements can be obtained with a parameterized microarchitecture simulator or a trace simulator. However, during the early design exploration phase of a superscalar microprocessor, such simulators might not be available. In addition, the overhead of constructing the runtime environment support and the long simulation time (typically tens to hundreds of simulated instructions per second) may also hinder the use of simulators in this phase. Therefore, other approaches are necessary.
We participate in a research project to build a high performance superscalar x86 compatible microprocessor and face the aforementioned problem. Motivated by this problem, we have developed a time-saving approach based on our instruction set CAD system x86 Workshop, which consists of three tools: x86Bench, State Mapper, and ASIA-II. x86Bench is an x86 application analysis system that produces disassembled x86 basic blocks annotated with their execution counts. State Mapper is an automatic retargeting tool that maps a given instruction set to micro operations (MOPs) for the given microarchitecture. Both x86Bench and State Mapper serve as the front end of ASIA-II, the second generation of our instruction synthesis tool ASIA [1]. ASIA-II enables us to investigate many interesting instruction behaviors in typical superscalar architectures, including the distribution of functional unit usage, MLP, and CPI (cycles per instruction). Note that these tools, except x86Bench, are retargetable and can be applied to other superscalar architectures in addition to the x86 architecture reported in this paper.
In this paper we present the approach and its experimental results. Section 2 reviews related work. Section 3 provides a superscalar architecture model for efficient x86 instruction execution. Section 4 describes the CAD framework and the individual tools for x86 instruction analysis. Section 5 presents the analysis of x86 application software and its results. Section 6 draws conclusions and points out future directions.
Related work
Instruction level parallelism (ILP) and machine parallelism have been among the hot topics in computer architecture. See, for example, [11], [12], [13], [15], [16], [17], and [18]. Most of this research is based on simulation. For example, the work of Shinatani et al. is based on an RTL simulator that takes the instruction trace of benchmark programs [13]. Hara et al. build a trace-driven simulator to simulate three different superscalar architectures and obtain the functional unit utilization [12]. As explained in the introduction, these techniques are difficult to apply to an x86 study with real application programs. One way to overcome the inefficiency and overhead of simulation-based approaches is to instrument the application programs and execute them on the real hardware and environment. The instrumented programs generate profiling information while performing their normal operations. For example, Davidson and Jinturakar use a MIPS R4000 architecture measurement tool to instrument the code and measure the dynamic instruction counts and latency of benchmarks [15]. However, this approach is applicable only when high level source code is available for instrumenting and recompiling. In the x86 world, most of the frequently used application programs are commercial ones for which the source code is unavailable.
As for studies specific to the x86 architecture, Adams et al. analyze x86 instruction usage in the MS-DOS environment, based on the interrupt mechanism of the x86 microprocessors [11]. The analysis is performed at the mnemonic code level, which is insufficient for a superscalar study. For example, the mnemonic instruction mov may turn into a memory load, a memory store, a register to register move, or an immediate to register move. These variations use different hardware resources in a superscalar microarchitecture. Huang and Peng conduct similar but extended experiments for modern application programs under both MS-DOS and Windows95 [6]. In their study, mnemonic codes are further differentiated according to their actual hardware usage. However, further improvements, such as taking into account the effects of register renaming and branch prediction, are necessary in order to explore superscalar configurations.
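To make the mov example concrete, the following minimal sketch classifies a mov by its operand forms. The function name, the Intel-style operand syntax (memory operands written in square brackets), and the small register set are assumptions made for illustration only; they are not taken from the tools discussed here.

```python
# Hypothetical illustration: one mnemonic, several hardware behaviors.
# Assumes Intel-style operands where a memory operand is written in brackets.
REGISTERS = {"eax", "ebx", "ecx", "edx", "esi", "edi", "ebp", "esp"}


def classify_mov(dest: str, src: str) -> str:
    """Return the kind of operation a `mov dest, src` actually performs."""
    dest, src = dest.strip().lower(), src.strip().lower()
    if src.startswith("["):
        return "load"       # memory -> register: needs the load/store unit
    if dest.startswith("["):
        return "store"      # register or immediate -> memory
    if src in REGISTERS:
        return "reg_move"   # register -> register
    return "imm_move"       # immediate -> register
```

A mnemonic-level count would report four mov instructions for `mov eax, [ebx+4]`, `mov [esi], eax`, `mov eax, ebx`, and `mov eax, 42`, whereas this classification separates them into one load, one store, and two moves that do not touch memory.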
Bhandarkar et al. measure the instruction set usage and compare the performance of a Pentium processor and a Pentium Pro processor under the WindowsNT environment [14]. The instruction execution information is obtained by using special x86 instructions to access performance monitoring information that is automatically collected and stored inside the processors. Although the information is very accurate, it cannot be used to estimate the performance of superscalar configurations other than the one implemented in the chip.
A Superscalar Model for x86 Instruction Execution
The general practice to speed up x86 instruction execution is a two-layered microarchitecture.
Note that the superscalar core incorporates the branch prediction mechanism and the register renaming mechanism to increase the number of MOPs available for execution per cycle (micro operation level parallelism).
The objective of this study is to develop proper CAD tools for application software analysis in order to determine the allocation of functional units in the superscalar core, as depicted in the grey area.
The x86 Instruction Set CAD System: x86 Workshop

Framework
The framework of the x86 instruction set CAD system x86 Workshop is shown in Figure 2. [6] and [7] , respectively.
x86Bench is an instruction analysis tool for the x86 instruction set, built around Intel's performance tuning tool VTune [10]. x86Bench accepts an x86 program and its input data. The x86 program can be a DOS or Windows95 application. The tool can analyze x86 programs either with or without high level language source code. For a given x86 program and its input data, the tool generates the x86 instruction usage frequencies and the disassembled code annotated with the execution counts of its basic blocks. Currently we are able to analyze instructions belonging to the application programs but not instructions belonging to the operating system. To analyze instructions in the operating system we need the symbol files for the Windows95 kernel, which are not available to the general public [9].
State Mapper is an instruction retargeting tool. It translates a given assembly code from one instruction set to another, based on a machine state transition notation. It can be configured to solve our x86 problem as well, as illustrated in Figure 2. Each x86 instruction, due to its CISC nature, is treated as an assembly code to be translated into a sequence of MOPs (i.e., a micro sequence, or micro program). The MOPs of the superscalar architecture are treated as the target instruction set for State Mapper. The generated micro sequences can be viewed as the entries of the x86-to-MOP mapping table.
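As an illustration of what an x86-to-MOP mapping table might look like, the sketch below expands a few instruction forms into micro sequences. The MOP names, the unit labels (alu, ldst), the temporary register t0, and the table layout are all hypothetical; the actual table produced by State Mapper is not reproduced in this text.

```python
# Hypothetical x86-to-MOP mapping table keyed by (mnemonic, dest kind, src kind).
# Each entry is a micro sequence of (functional unit, MOP template) pairs.
MOP_TABLE = {
    ("mov", "reg", "mem"): [("ldst", "load {d}, {s}")],
    ("add", "reg", "reg"): [("alu", "add {d}, {d}, {s}")],
    ("add", "reg", "mem"): [("ldst", "load t0, {s}"),
                            ("alu", "add {d}, {d}, t0")],
    ("add", "mem", "reg"): [("ldst", "load t0, {d}"),
                            ("alu", "add t0, t0, {s}"),
                            ("ldst", "store {d}, t0")],
}


def operand_kind(operand: str) -> str:
    """Memory operands are assumed to be written in square brackets."""
    return "mem" if operand.startswith("[") else "reg"


def to_mops(mnemonic: str, dest: str, src: str):
    """Expand one x86 instruction into its micro sequence."""
    key = (mnemonic, operand_kind(dest), operand_kind(src))
    return [(unit, text.format(d=dest, s=src)) for unit, text in MOP_TABLE[key]]
```

Under these assumptions, `to_mops("add", "[ebx]", "eax")` yields three MOPs (load, add, store), while `to_mops("add", "eax", "ecx")` yields a single ALU MOP, which is why one CISC instruction can occupy several functional units.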
ASIA-II reads in the disassembled instruction sequences generated by x86Bench and maps them into MOPs, according to the x86-to-MOP mapping table generated by State Mapper. ASIA-II then schedules the MOPs into time steps, subject to their dependencies and the constraints of the given superscalar microarchitecture model. The superscalar microarchitecture model describes the supported micro-operations, the operational delays, and the topology of the data path components (i.e., the achievable data movements in the data path).
The numbers of data path resources, such as read/write ports of the register file, memory ports, and functional units, can be given as resource constraints. They can also be left unspecified, letting the tool search for the best combination with respect to the given objective function. The objective function controls how the MOPs are scheduled. It can be configured to optimize for performance (as in the experiment of this paper), functional unit cost, or a combination of both. MOPs scheduled into the same time step represent MOPs that are executed in parallel in the superscalar core. From the scheduled MOPs, the MLP and the distribution of functional unit usage can be obtained. In the following sub-section, we present ASIA-II, the investigation tool for superscalar microarchitecture, in more detail.
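The idea of scheduling MOPs into time steps under dependency and resource constraints can be sketched as follows. This is a simple greedy, in-order list scheduler written for illustration; it is not the algorithm used by ASIA-II. The function names, the unit labels, and the input format are assumptions. The latencies follow the assumption stated in Section 5 (one cycle per unit, two for load/store), and units are treated as fully pipelined, so a unit is only occupied in the step where a MOP is issued.

```python
from collections import defaultdict

# Assumed latencies: one cycle for every unit except load/store (two cycles).
LATENCY = {"alu": 1, "ldst": 2, "branch": 1, "fpu": 1}


def schedule(mops, deps, units):
    """Assign each MOP to a time step.

    mops  -- list of unit names, one per MOP, in program order
    deps  -- dict: MOP index -> list of earlier MOP indices it must wait for
    units -- dict: unit name -> number of available units of that type
    """
    step_of = []
    used = defaultdict(int)  # (step, unit) -> units of that type already issued
    for i, unit in enumerate(mops):
        # Earliest step at which all producers have delivered their results.
        earliest = 0
        for j in deps.get(i, []):
            earliest = max(earliest, step_of[j] + LATENCY[mops[j]])
        # Move later until a unit of the required type is free.
        step = earliest
        while used[(step, unit)] >= units[unit]:
            step += 1
        used[(step, unit)] += 1
        step_of.append(step)
    return step_of


def mlp(step_of):
    """MOPs per time step for one scheduled block."""
    return len(step_of) / (max(step_of) + 1)
```

For example, four MOPs `["ldst", "alu", "alu", "alu"]` with dependencies `{1: [0], 3: [2]}` and one unit of each type occupy three time steps, giving an MLP of 4/3 for that block. Leaving the unit counts unspecified, as the text describes, would correspond to searching over the `units` argument.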
ASIA-II for Superscalar Architecture
ASIA-II is the second generation of our instruction set analysis/synthesis tool ASIA (Automatic Synthesis of Instruction set Architecture) [1], which analyzes and synthesizes application specific instruction sets for pipelined uni-processors. Since the internal superscalar core in Figure 1 can be regarded as an application specific RISC-based core whose sole application is x86 instruction set emulation, ASIA-II, with the enhancements described in this section, can be tuned to study many design issues of x86 compatible microprocessors, such as the design of the internal RISC-based instruction set [8] and the resource allocation problem for the superscalar core, which is the focus of this paper. The last paragraph of Section 4.1 describes the basic operations of ASIA-II; this section focuses on the techniques adopted in ASIA-II for superscalar architecture study.
The basic scheduling algorithm
Using an instruction scheduling tool to investigate MLP and the distribution of functional unit usage for a superscalar architecture is a convenient approach but requires special care. Otherwise, non-optimal designs may result. In addition, in a superscalar core, the relative order of incoming operations is usually preserved during execution unless there are dependencies or some operations take many more cycles than others to finish, as observed in Chapter 2 of Johnson's well-known book on superscalar microprocessors [18]. For example, it is very unlikely that the sixth MOP (sub r15 r16 r17) in Figure 3(a) is executed in time step 1 while the second MOP (add r4 r5 r6) is executed in time step 2, although the dependency relationship allows it. Therefore, the scheduling algorithm should also try to preserve the relative order while optimizing for performance.
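One way to express the preference for program order in a scheduling objective is to penalize order inversions. The sketch below is only an assumption about how such a term could look; the cost function, its name, and the weight 0.1 are hypothetical and are not taken from ASIA-II.

```python
def order_inversions(step_of):
    """Count pairs (i, j) with i before j in program order but j issued earlier."""
    n = len(step_of)
    return sum(1 for i in range(n) for j in range(i + 1, n)
               if step_of[j] < step_of[i])


def cost(step_of, inversion_weight=0.1):
    """Schedule length plus a small penalty for breaking program order."""
    return (max(step_of) + 1) + inversion_weight * order_inversions(step_of)
```

With six independent MOPs and room for five per step, the schedules `[0, 1, 0, 0, 0, 0]` and `[0, 0, 0, 0, 0, 1]` have the same length of two steps, but the first delays the second MOP behind four later ones. Under this hypothetical cost the in-order schedule is preferred (2.0 versus 2.4).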
To take care of the above two issues, ASIA-II adopts a scheduling algorithm based on local compaction with a simulated annealing approach [19] . We understand that other scheduling algorithms, such as the trace scheduling algorithm [23] or the force-directed scheduling algorithm [24] , are also capable of investigating the same problem. Therefore, we don't claim any novelty for our simulated-annealing based scheduling algorithm over other algorithms. It is adopted in our work because it is easy to implement and it is 
Register renaming
The superscalar core in Figure 1 supports the register renaming mechanism in order to boost parallelism. Register renaming eliminates anti (write after read) and output (write after write) dependencies between a pair of operations by redirecting the write operation (the later operation in the dependent pair) to a different location. Later operations that read the write result are also redirected to the new location.
The register renaming feature is supported in ASIA-II by ignoring the anti and output dependencies among the MOP's during scheduling, as also commonly practiced in [21] and [22] .
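The effect of ignoring anti and output dependencies can be sketched with a small dependency analysis. The function name and the MOP representation (a destination register plus a list of source registers) are assumptions for illustration; only the treatment of the three dependency kinds follows the text.

```python
def dependencies(mops, renaming=True):
    """Return, for each MOP index, the earlier MOP indices it must wait for.

    mops     -- list of (dest_register, [source_registers]) in program order
    renaming -- when True, anti (WAR) and output (WAW) dependencies are ignored
    """
    last_writer = {}          # register -> index of its most recent writer
    readers_since_write = {}  # register -> indices that read it since that write
    deps = {}
    for j, (dest, srcs) in enumerate(mops):
        needed = set()
        for reg in srcs:
            if reg in last_writer:
                needed.add(last_writer[reg])              # true (RAW) dependency
        if not renaming:
            needed.update(readers_since_write.get(dest, []))  # anti (WAR)
            if dest in last_writer:
                needed.add(last_writer[dest])             # output (WAW)
        for reg in srcs:
            readers_since_write.setdefault(reg, []).append(j)
        last_writer[dest] = j
        readers_since_write[dest] = []
        deps[j] = sorted(needed)
    return deps
```

For the sequence `r1 = r2 op r3; r2 = r4 op r5; r1 = r6 op r7; r8 = op r1`, the analysis without renaming makes the second and third MOPs wait for the first (an anti and an output dependency, respectively). With renaming only the true dependency of the fourth MOP on the third remains, so the first three MOPs can be scheduled in the same time step.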
Sequentially executed complex instructions
As explained in Section 3, sequential instructions are executed sequentially in the superscalar core. Such instructions effectively reduce the basic block sizes. In addition, they incur a cycle penalty since they have to wait until all pending operations are completed. The effect of such instructions on superscalar performance can be modeled as a splitting of basic blocks.
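Modeling these instructions as block splits can be sketched as follows. The function name and the predicate that decides which instructions count as sequentially executed are hypothetical, and placing the complex instruction in a block of its own is an assumption of this sketch rather than a detail confirmed by the text.

```python
def split_block(block, is_sequential):
    """Split one basic block at every sequentially executed instruction.

    block         -- list of instructions in program order
    is_sequential -- predicate marking instructions that must run alone
    """
    pieces, current = [], []
    for instr in block:
        if is_sequential(instr):
            if current:
                pieces.append(current)   # close the block before the instruction
            pieces.append([instr])       # the complex instruction runs by itself
            current = []
        else:
            current.append(instr)
    if current:
        pieces.append(current)
    return pieces
```

For instance, treating string instructions with a rep prefix as sequential, the block `["mov", "add", "rep movsb", "sub", "jnz"]` becomes three smaller blocks, so the scheduler can no longer overlap the MOPs before and after the string instruction.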
Hardware branch prediction
Branch instructions impede the instruction fetcher's capability to supply instructions at a sufficient rate to keep functional units busy. When the outcome of a branching instruction is not known, the instruction fetcher has to stall or incorrect instructions are fetched. A stalled instruction fetcher or incorrectly fetched instructions decrease the number of instructions ready to execute in parallel.
Hardware branch prediction aims to reduce the branch penalty by predicting, with hardware support, the direction of the current branch instruction from its previous outcomes before the result of the current branch is known. If branches are successfully predicted, the instruction fetcher stalls less often and fewer incorrect instructions are fetched. The functional units in the data path then see more instructions ready for parallel execution. Therefore, branch prediction reclaims potential parallelism that is undermined by branch instructions.
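Predicting a branch from its previous outcomes can be illustrated with a two-bit saturating counter per branch address. This is a textbook scheme chosen here for illustration; the text does not state which predictor the superscalar core uses, and the class and function names are hypothetical.

```python
class TwoBitPredictor:
    """One two-bit saturating counter per branch address (states 0..3)."""

    def __init__(self):
        self.counters = {}  # branch address -> counter; 1 = weakly not taken

    def predict(self, addr):
        return self.counters.get(addr, 1) >= 2  # True means "taken"

    def update(self, addr, taken):
        c = self.counters.get(addr, 1)
        self.counters[addr] = min(3, c + 1) if taken else max(0, c - 1)


def accuracy(trace):
    """Fraction of correctly predicted branches in a list of (addr, taken) pairs."""
    predictor, hits = TwoBitPredictor(), 0
    for addr, taken in trace:
        hits += predictor.predict(addr) == taken
        predictor.update(addr, taken)
    return hits / len(trace)
```

For a loop branch that is taken nine times and then falls through, this predictor misses only the first and the last execution, so the fetcher is interrupted twice instead of ten times.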
From the viewpoint of the instruction fetcher, the benefit of branch prediction is that the instruction sequence which the fetcher can keep fetching without being interrupted becomes longer. Since the boundary of uninterrupted fetching is mainly marked by the branching related instructions, which also mark the boundaries of basic blocks, the effect of
The above observation suggests that scheduling with the annotated Eblock approach, with a medium extension distance parameter, provides practically the same parallelism measurement for the micro-operations executed in typical superscalar architectures as scheduling with entire instruction traces. On the other hand, scheduling annotated Eblocks requires much less computing time than scheduling entire traces. Therefore, we conclude that the annotated Eblock approach is a feasible solution to design exploration for superscalar architectures.
In this example, the weight is set to one since there is only one program (the flow graph in Figure 6(a)).
Derivation of distribution and parallelism
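\[
\mathrm{MLP} = \frac{\sum_{p} \sum_{b} w_{p} \times c_{p,b} \times m_{p,b}}{\sum_{p} \sum_{b} w_{p} \times c_{p,b} \times t_{p,b}} \qquad \text{(EQ1)}
\]
where $w_{p}$ is the weight of program $p$, $c_{p,b}$ is the execution count of block $b$ in program $p$, and $m_{p,b}$ and $t_{p,b}$ are the numbers of MOPs and of scheduled time steps of that block.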
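The aggregation of per-block schedules into an overall MLP and a functional unit usage distribution might be computed along the following lines. This is a sketch under the assumption that both measures are ratios of sums weighted by program weight and block execution count; the function name and the input layout are hypothetical.

```python
from collections import Counter


def aggregate(programs):
    """Combine scheduled blocks into a weighted MLP and a unit usage distribution.

    programs -- list of (weight, blocks); each block is a dict with
                "count" (execution count), "steps" (scheduled time steps),
                and "units" (functional unit name -> number of MOPs using it)
    """
    total_mops = 0.0
    total_steps = 0.0
    unit_mops = Counter()
    for weight, blocks in programs:
        for block in blocks:
            scale = weight * block["count"]
            total_steps += scale * block["steps"]
            for unit, n in block["units"].items():
                unit_mops[unit] += scale * n
                total_mops += scale * n
    mlp = total_mops / total_steps
    distribution = {unit: m / total_mops for unit, m in unit_mops.items()}
    return mlp, distribution
```

As a small check, a single program of weight one with a block executed 10 times (4 MOPs in 2 steps) and a block executed 5 times (3 MOPs in 3 steps) gives 55 weighted MOPs over 35 weighted time steps, an MLP of about 1.57.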
Analysis of x86 Application Software
In this section, we apply x86 Workshop to measure the potential MLP, the distribution of functional unit usage, and the performance characteristics of several commercial Windows95 applications. Table 1 lists the Windows95 applications used in this experiment. These applications are typical programs used by graduate students of computer engineering. In the table we list the numbers of executed instructions of the programs and the weights of the programs, which represent the relative frequencies with which these applications are used in a typical work environment. Note that Word and Excel execute relatively fewer instructions than some other applications, because they are interactive programs that spend most of their time idle, waiting for user input. The total number of instructions executed is over 524 million.
The superscalar core under measurement is based on the superscalar model in Section 3.
In order to obtain the maximal available MLP, which serves as the upper bound for the design space exploration, we adopt the following assumptions. They are similar to the ones used in Section 3.3.3 of [18] and Chapter 4 of [20] for superscalar architecture exploration.
1. The cache hit rate is 100%; i.e., no delay cycles are caused by cache misses.
2. The branch prediction is 100% accurate.
3. The instruction fetcher and decoders are fast enough to provide and decode sufficient instructions, in order to sustain the maximal MLP.
4. All the functional units are pipelined and have an execution latency of one cycle, except the load/store unit, which requires two cycles (the first cycle computes the effective address and the second accesses the cache).
5. The reservation station is large enough to accommodate all ready MOPs and perform all necessary register renaming.
The ASIA-II experiments take about 24 hours of computing time on four UltraSparc workstations (one at 143MHz, two at 200MHz, and one at 270MHz).
Micro operation level parallelism (MLP)
Table 2 lists the MLPs for the given software programs, which are derived with the equation EQ1 and with the weights given in Table 1.
Functional unit usage
Performance contribution of register renaming and branch prediction
Register renaming and branch prediction are two important techniques for improving the performance of superscalar processors. An interesting question is how they individually and cor-
The observation indicates that, in order to gain the maximal performance improvement, both techniques have to be implemented. However, when designing an x86 superscalar core under a very tight area constraint, register renaming may be granted more area than branch prediction. The reason is that the x86 instruction set has very few general registers, which severely limits the parallelism that can be exploited; register renaming unlocks this hidden parallelism.
Comparison with related approaches
Our approach is an approximate method to obtain a large amount of superscalar measurements with greater flexibility and less preparation and computing time. It is usually used for a design exploration in which possible design points have to be quickly evaluated and compared.
Although it may be less accurate, it is desired that its approximated data fall within a reasonable range and show trends similar to the trends found in data obtained by more accurate (but possibly much slower) approaches. Here we compare our results with other related research. Due to the limited accessibility of the experiment material and our resources, we could not repeat their experiments locally; instead, our comparison relies solely on published data.
Bhandarkar and Ding present performance characterization of the Pentium Pro processor, measured on the real chip [14] . They show that the average number of micro-operations per instruction (MPI) is 1.35, whereas our approximation shows 1.26 [6] , with only 6% difference between the two approaches. Regarding the trends among data, the MPI of Microsoft's Excel is about 11% higher than Microsoft's Word in their experiment, whereas our approximation shows 8% (computed from data in [6] ). Both the data ranges and trends between the two approaches are very close.
On the other hand, we compare our analysis of performance contributions in Section 5.3 with a similar analysis for the MIPS superscalar architecture based on a trace simulation approach [18]. In this related work the measured performance contributions of register renaming and branch prediction are 36% and 30%, whereas ours are 46% and 23% (in Table 4). There are two observations. First, both results show the same trend: register renaming provides a higher performance improvement than branch prediction. Second, our approximation shows that the contribution of register renaming is more significant in the x86 superscalar architecture than in the MIPS superscalar architecture. This is reasonable because the x86 architecture has only eight general purpose registers while the MIPS architecture has thirty-two. More potential parallelism is locked by the limited number of registers in the x86 architecture than in the MIPS architecture. Therefore, by releasing this heavily locked parallelism, register renaming contributes more to the x86 architecture than to the MIPS architecture.
Based on the previous discussion, we find that our quick approximation approach produces the same trends as the related approaches, which are more accurate but more complicated, and that our data fall within a reasonable range of their data. Where the data exhibit significant deviation, the main reason is the difference in the processor architecture features under investigation, not our approximation. Therefore, our approximate approach is a quick and viable solution for design space exploration.
Conclusions
We have developed an x86 instruction set CAD system x86 Workshop to measure the dis-
In the future, we would like to conduct experiments that cover a much wider spectrum of Windows95 applications. In addition, we would like to extend the capability of x86 Workshop to address other superscalar design issues, such as instruction pairing (instruction folding), elimination of short conditional branches, instruction decoder allocation, branch prediction depth, etc.
