This paper presents the author's experience in using architectural simulation tools to teach computer architecture courses. In particular, we develop the notion of incrementally building a programmable, trace-driven "timer" tool for use as a learning vehicle. We show how the cycle-by-cycle simulation output of such timers can be used to illustrate performance bottlenecks, and how this and other output statistics can be interpreted to convey key design tuning issues. As part of the overall simulation toolkit, we also use available cache simulators, trace generators and other utilities to illustrate key performance determinants and architectural trade-off issues.
fundamental pipeline-stage level design trade-off issues. Until very recently, use of architectural simulation tools (e.g., SPIM [2]) was not prevalent.
This author has found such simulation tools extremely useful for stimulating insight among students. In this paper, we first describe (in section II) TRISC, a simple load/store instruction set architecture, implemented using a basic superscalar machine organization [3]. We have used this simple machine and its associated trace-driven "timer" quite successfully as an instructional aid. In fact, for some courses, we have encouraged a few students to develop the timer itself from scratch as their course project. Other students have used a timer provided by the author to study fundamental design trade-offs. In section III, we explain the basic software structure of the simple, parametrized timer used to study cycle-by-cycle pipelined execution states of the TRISC machine. In section IV, we illustrate the use of the timer and other related simulation tools in explaining fundamental design trade-off issues. We conclude, in section V, by summarizing our experience in timer toolkit based instruction, and speculate on future trends.
II. The TRISC Architecture and Machine
The Instruction Set:
The TRISC architecture [3] is a simple but extensible load/store instruction set, with 32 integer and 32 floating point registers. It has fixed fields for opcode, register specifiers and immediate or displacement operands. The core ISA (instruction set architecture) consists of the following opcodes: The unconditional branch instruction (b) changes the program sequence by unconditionally jumping to the target address (TA). The conditional branch (bc) tests the value of the fixed point register specified by RC, which is used as a count register; RC is first decremented, and the branch is taken if the resulting value is non-zero. In both cases, the target address is computed by adding the branch displacement D to the program counter, i.e., the address of the branch instruction itself.
Notation for unconditional branch, b:
    TA <-- PC + D
    PC <-- TA        /* PC is the program counter */
Notation for conditional branch, bc:
Other instructions, such as logical operations and branches based on condition registers, are not described here. The ISA depicted here was used to study floating-point intensive, loop-oriented applications, where conditional branches are primarily loop-ending branches. The core ISA can easily be extended, if desired. However, we found it to be an adequate core for beginning undergraduate courses, which did not go into elaborate branch prediction and resolution schemes, for example.
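The two branch instructions described above can be sketched in a few lines of Python. This is only an illustrative model of the stated semantics, not the TRISC simulator itself: the function names, the dictionary used as a register file, and the 4-byte instruction size (`instr_size`) used for the fall-through address are assumptions introduced here.

```python
def branch_b(pc, d):
    """Unconditional branch: TA <-- PC + D; PC <-- TA."""
    ta = pc + d
    return ta


def branch_bc(pc, d, regs, rc, instr_size=4):
    """Conditional branch on count register RC.

    RC is pre-decremented; the branch is taken if the result is non-zero.
    instr_size is an assumed fixed instruction width (not given in the text).
    """
    regs[rc] -= 1                # pre-decrement the count register
    if regs[rc] != 0:
        return pc + d            # taken: PC <-- PC + D
    return pc + instr_size       # not taken: fall through to the next instruction


# Usage: a loop-ending bc at address 100 that branches back 8 bytes.
regs = {5: 3}                    # hypothetical register 5 holds the loop count
next_pcs = [branch_bc(100, -8, regs, rc=5) for _ in range(3)]
```

With a count of 3, the branch is taken twice and falls through on the third execution, which is the behaviour expected of a loop-ending branch.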
The Machine Organization:
We usually use a simple superscalar processor model. In its simplest form (see Figure 1), intended for a beginning course in computer architecture, we assume a centralized, in-order instruction fetch/dispatch process, with three functional units: a branch unit (BRU); a fixed point unit (FXU), which processes integer arithmetic as well as all load/store operations; and a floating point unit (FPU). Initially, a perfect (infinite) cache model is assumed. (Later in the course, time permitting, finite cache models may be used.) Re-order (completion) buffer mechanisms, which enforce in-order completion for precise interrupt support, are frequently omitted in an introductory course. Similarly, the concept of register renaming to eliminate certain kinds of data dependencies at run time is not introduced initially. Wherever possible, a particular hardware resource is parametrized in the simulation model (see next section) to enable trade-off and bottleneck analysis, which is the main intention in using or developing such a tool for class use.
III. The TRISC Parametrized Timer
We usually use a classical, trace-driven, cycle-by-cycle simulation approach [5, 7] in implementing a class tool, or in using one for analysis purposes. We shall not go into the details of such timer implementation methods in this paper. The key steps and elements will be illustrated in the actual talk. Here, we simply show (Figure 2) a sample cycle-by-cycle output for an input trace, to illustrate the benefit of using such a tool for understanding the pipeline-stage level, cycle-by-cycle behavior of such machines. For example, the degree of "slip" [6] between loads and their consuming operations in floating point loop kernels, and the sensitivity of overall performance to it, can be understood quite clearly from such "timeline" output. The TRISC timer is invoked with the following inputs: (a) a parameter file, which specifies settings of modifiable hardware parameters; (b) a program trace, generated by an instruction set simulator or some other trace generation mechanism (e.g., hand-tracing for simple loop kernels). It generates at least one output file, with a cycle-by-cycle listing of the program execution timeline and various processor statistics. The main timer loop looks as follows (using a Pascal-like notation):
BEGIN /* main program */
    init_system; /* set up files and initialize variables */
    REPEAT
        print_summary;
        cur_cycle := cur_cycle + 1;
        cache_access;
        fix;
        flt;
        dsp;
        compute_system_idle; /* sets Boolean variable system_idle */
    UNTIL end_of_trace AND system_idle;
    total_cycles := cur_cycle - 1;
    printstats;
    close_files;
END. /* main program */

Initially, the parameter file is read in, and the system is initialized. The main timer loop then services each functional pipeline unit once every cycle, and also updates the timeline output file. The loop terminates after the input trace has been consumed and all instructions have been emptied from all the unit pipelines.
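The structure of this loop can be mimicked by a very small, runnable Python sketch. It is a toy under strong assumptions, not the TRISC timer: it dispatches at most one instruction per cycle, ignores dependencies and the cache, and treats each unit as fully pipelined with a fixed latency. The function name `run_timer`, the trace format and the latency table are all hypothetical.

```python
from collections import deque


def run_timer(trace, latency):
    """Toy trace-driven timer.

    trace:   list of (unit, name) pairs in program order, unit in {"FXU", "FPU"}
    latency: dict mapping unit -> cycles an operation occupies that unit
    Returns (total_cycles, timeline).
    """
    pending = deque(trace)              # instructions not yet dispatched
    busy = {"FXU": [], "FPU": []}       # per unit: last busy cycle of each in-flight op
    timeline = []
    cur_cycle = 0
    while True:                         # REPEAT ... UNTIL end_of_trace AND system_idle
        # print_summary: record the machine state at the start of the cycle
        timeline.append((cur_cycle, {u: len(ops) for u, ops in busy.items()}))
        cur_cycle += 1
        for unit in busy:               # fix; flt: retire operations that have finished
            busy[unit] = [end for end in busy[unit] if end >= cur_cycle]
        if pending:                     # dsp: dispatch one instruction per cycle
            unit, _name = pending.popleft()
            busy[unit].append(cur_cycle + latency[unit] - 1)
        end_of_trace = not pending
        system_idle = not any(busy.values())   # compute_system_idle
        if end_of_trace and system_idle:
            break
    # The final iteration only detects that the machine has drained.
    return cur_cycle - 1, timeline


total, timeline = run_timer([("FXU", "load"), ("FPU", "fadd")],
                            latency={"FXU": 1, "FPU": 3})
```

As in the Pascal outline, the last pass through the loop does no useful work, which is why the cycle count is reported as `cur_cycle - 1`. Each unit's behaviour is kept behind its own step so that a student can replace one step at a time with a more detailed model.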
IV. Fundamental Issues: Tools-based Analysis
The primary use of such simulation tools in class is in studying the fundamental trade-offs that exist between machine organization parameters when optimizing instructions-per-cycle (IPC) performance. In this section we present a simple example to illustrate such use. In our experience, it is useful to encourage the students to first use analytical reasoning or "intuition" [5, 6] in answering a "what-if" question posed in class. They are then asked to validate their reasoning via detailed simulation-based analysis.
Test kernel for testing overlapped (decoupled) access-execute:
The additional test case [7]. Let us consider a case where we assume the presence of full register renaming. Let us also consider a 3-issue execution model, with an n-stage floating point execute pipe. In pipelined mode, the throughput of completed adds should be determined solely by the three load/store instructions. Since we are dealing with a single-ported (infinite) cache, the number of steady-state cycles per iteration should be 3; hence, with an adequate number of queue resources, the expected cycles-per-instruction (CPI) is 3/5 = 0.6 (three cycles for the five instructions of one iteration), and the expected cycles-per-flop (CPF) is 3.0. Unrolling the loop will not further improve the CPF, because of the limitation imposed by the single cache port. In a detailed timer model, the various resources, such as reservation station sizes, completion (reorder) buffer size, rename buffers, etc., can be varied to measure the variation of CPI and CPF. If, even with very large buffer sizes, the student is unable to achieve the expected CPI bound of 0.6, they may have exposed either a timer model bug or a limitation of some piece of the design logic that is causing an unexpected stall in some unit. The cycle-by-cycle output can then be used to isolate the cause of the bottleneck. Table 1 shows an example summary of experimental timer-aided results for the above loop (with and without unrolling), which would be indicative of a performance bug in the machine. In the table, CBUF size refers to the size of the completion (reorder) buffer and FRBF size refers to the number of floating point rename buffers.
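The analytical bound used in this example can be written down as a short calculation that students can check against timer output. The sketch below is an assumption-laden helper, not part of the TRISC toolkit: the function name and parameters are hypothetical, and the issue-width term is added here only for completeness (for this kernel the cache port is the binding limit).

```python
from math import ceil


def steady_state_bounds(instrs, mem_ops, flops, cache_ports=1, issue_width=3):
    """Return (cycles, CPI, CPF) per loop iteration in steady state.

    Assumes enough buffering that only the cache ports and the issue
    width limit throughput.
    """
    port_cycles = ceil(mem_ops / cache_ports)    # load/store traffic per iteration
    issue_cycles = ceil(instrs / issue_width)    # dispatch bandwidth per iteration
    cycles = max(port_cycles, issue_cycles)
    return cycles, cycles / instrs, cycles / flops


# Kernel from the text: 5 instructions, 3 loads/stores, 1 floating point add.
cycles, cpi, cpf = steady_state_bounds(instrs=5, mem_ops=3, flops=1)

# Unrolled twice; the count of 9 instructions assumes that a single
# loop-ending branch is shared by the two iterations (an assumption).
u_cycles, u_cpi, u_cpf = steady_state_bounds(instrs=9, mem_ops=6, flops=2)
```

Under these assumptions the unrolled loop needs twice the cycles for twice the flops, so the CPF is unchanged, in line with the argument that the single cache port sets the limit.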
In addition to trace-driven timer models, other related tools useful in instruction are trace generation and analysis programs, and cache simulators (e.g., [4]). The former are useful in computing various dynamic workload (i.e., application) statistics, such as instruction frequency mix or average basic block size. Students can use these statistics to check timer-generated performance numbers against expectations or bounds computed from them. Cache simulators are used to compute average miss ratios and miss rates (misses per instruction) for given applications and cache geometries. The finite cache effect (or CPI penalty) can be estimated as the product of the miss rate and the average miss penalty (in processor cycles). In a full-blown timer model, the detailed cache access pipeline is modeled as part of the timer, along with the effects of reload time, leading and trailing edge effects, etc. Such a model is able to compute the overall CPI performance much more accurately than the "averaging" method referred to above.
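The "averaging" estimate of the finite cache effect is simple enough to express directly. The following sketch uses hypothetical function names and made-up input values purely for illustration; only the product rule itself comes from the text.

```python
def misses_per_instruction(miss_ratio, accesses_per_instr):
    """Convert a miss ratio (misses per access) to a miss rate (misses per instruction)."""
    return miss_ratio * accesses_per_instr


def finite_cache_cpi(infinite_cache_cpi, miss_rate, miss_penalty_cycles):
    """Estimate overall CPI as infinite-cache CPI plus the finite cache penalty."""
    cpi_penalty = miss_rate * miss_penalty_cycles    # the product rule from the text
    return infinite_cache_cpi + cpi_penalty


# Illustrative numbers only (not measured results):
rate = misses_per_instruction(miss_ratio=0.05, accesses_per_instr=0.4)
estimate = finite_cache_cpi(infinite_cache_cpi=0.6, miss_rate=rate,
                            miss_penalty_cycles=10)
```

Because this estimate ignores overlap between miss handling and useful execution, it serves as a quick first approximation to compare against the detailed timer, not as a replacement for it.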
