Overview
Modern processors are incredibly complex marvels of engineering that are becoming increasingly hard to evaluate. This paper describes the SimpleScalar tool set (release 2.0), which performs fast, flexible, and accurate simulation of modern processors that implement the SimpleScalar architecture (a close derivative of the MIPS architecture [4]). The tool set takes binaries compiled for the SimpleScalar architecture and simulates their execution on one of several provided processor simulators. We provide sets of precompiled binaries (including SPEC95), plus a modified version of GNU GCC (with associated utilities) that allows you to compile your own SimpleScalar test binaries from FORTRAN or C code.
The advantages of the SimpleScalar tools are high flexibility, portability, extensibility, and performance. We include five execution-driven processor simulators in the release. They range from an extremely fast functional simulator to a detailed, out-of-order issue, superscalar processor simulator that supports non-blocking caches and speculative execution.
The tool set is portable, requiring only that the GNU tools can be installed on the host system. The tool set has been tested extensively on many platforms (listed in Section 2). The tool set is easily extensible. We designed the instruction set to support easy annotation of instructions, without requiring a retargeted compiler for incremental changes. The instruction definition method, along with the ported GNU tools, makes new simulators easy to write, and the old ones even simpler to extend. Finally, the simulators have been aggressively tuned for performance, and can run codes approaching "real" sizes in tractable amounts of time. On a 200-MHz Pentium Pro, the fastest, least detailed simulator simulates about four million machine cycles per second, whereas the most detailed processor simulator simulates about 150,000 per second.

This work was initially supported by NSF Grants CCR-9303030, CCR-9509589, and MIP-9505853, ONR Grant N00014-93-1-0465, a donation from Intel Corp., and by U.S. Army Intelligence Center and Fort Huachuca under Contract DABT63-95-C-0127 and ARPA order no. D346. The current support for this work comes from a variety of sources, to all of which we are indebted.
The current release (version 2.0) of the tools is a major improvement over the previous release. Compared to version 1.0, this release includes better documentation, enhanced performance, compatibility with more platforms, precompiled SPEC95 SimpleScalar binaries, cleaner interfaces, two new processor simulators, option and statistic management packages, a source-level debugger (DLite!), and a tool to trace the out-of-order pipeline.
The rest of this document contains information about obtaining, installing, running, using, and modifying the tool set. In Section 2 we provide a detailed procedure for downloading the release, installing it, and getting it up and running. In Section 3, we describe the SimpleScalar architecture and details about the target (simulated) system. In Section 4, we describe the SimpleScalar processor simulators and discuss their internal workings. In Section 5, we describe two tools that enhance the utility of the tool set: a pipeline tracer and a source-level debugger (for stepping through the program being simulated). In Section 6, we provide the history of the tools' development, describe current and planned efforts to extend the tool set, and conclude.

Note the "tar.gz" suffix: by requesting the file without the ".gz" suffix, the ftp server uncompresses it automatically. To get the compressed version, simply request the file with the ".gz" suffix. The five distribution files in the directory (which are symbolic links to the files containing the latest version of the tools) are:
• simplesim.tar.gz -contains the simulator sources, the instruction set definition macros, and test program source and binaries. The directory is 1 MB compressed and 4 MB uncompressed. When the simulators are built, the directory (including object files) will require 11 MB. This file is required for installation of the tool set.
• simpleutils.tar.gz -contains the GNU binutils source (version 2.5.2), retargeted to the SimpleScalar architecture. These utilities are not required to run the simulators themselves, but are required to compile your own SimpleScalar benchmark binaries (e.g., test programs other than the ones we provide). The compressed file is 3 MB, the uncompressed file is 14 MB, and the build requires 52 MB.
• simpletools.tar.gz -contains the retargeted GNU compiler and library sources needed to build SimpleScalar benchmark binaries (GCC 2.6.3, glibc 1.0.9, and f2c), as well as pre-built big- and little-endian versions of libc. This file is needed only to build benchmarks, not to compile or run the simulators. The tools are 11 MB compressed, 47 MB uncompressed, and the full installation requires 70 MB.
• simplebench.big.tar.gz -contains a set of the SPEC95 benchmark binaries, compiled to the SimpleScalar architecture running on a big-endian host. The binaries take under 5 MB compressed, and are 29 MB when uncompressed.
• simplebench.little.tar.gz -same as above, except that the binaries were compiled to the SimpleScalar architecture running on a little-endian host.

Once you have selected the appropriate files, place the downloaded files into the desired target directory. If you obtained the files with the ".gz" suffix, run the GNU decompress utility (gunzip). The files should now have a ".tar" suffix. To extract the directories from the archive:

    tar xf filename.tar
If you download and unpack all of the release files, you should have the following subdirectories, with the following contents:
• simplesim-2.0 -the sources of the SimpleScalar processor simulators, supporting scripts, and small test benchmarks. It also holds precompiled binaries of the test benchmarks.
• binutils-2.5.2 -the GNU binary utilities code, ported to the SimpleScalar architecture.
• ssbig-na-sstrix -the root directory for the tree in which the big-endian SimpleScalar binary utilities and compiler tools will be installed. The unpacked directories contain header files and a pre-compiled copy of libc and a necessary object file.
• sslittle-na-sstrix -same as above, except that this directory holds the little-endian versions of the SimpleScalar utilities.
• gcc-2.6.3 -the GNU C compiler code, targeted toward the SimpleScalar architecture.
• glibc-1.09 -the GNU libraries code, ported to the SimpleScalar architecture.
• f2c-1994.09.27 -the 1994 release of AT&T Bell Labs' FORTRAN to C translator code.
• spec95-big -precompiled SimpleScalar SPEC95 benchmark binaries (big-endian version).
• spec95-little -precompiled SimpleScalar SPEC95 benchmark binaries (little-endian version).
Installing and running SimpleScalar
We depict a graphical overview of the tool set in Figure 1. Benchmarks written in FORTRAN are converted to C using Bell Labs' f2c converter. Both benchmarks written in C and those converted from FORTRAN are compiled using the SimpleScalar version of GCC, which generates SimpleScalar assembly. The SimpleScalar assembler and loader, along with the necessary ...

If you use the precompiled SPEC95 binaries or the precompiled test programs, all you have to install is the simulator source itself. If you wish to compile your own benchmarks, you will have to install and build the GCC tree and optionally (recommended) the GNU binutils. If you wish to modify the support libraries, you will have to install, modify, and build the glibc source as well.
The SimpleScalar architecture, like the MIPS architecture [4], supports both big-endian and little-endian executables. The tool set supports compilation for either of these targets; the names for the big-endian and little-endian architectures are ssbig-na-sstrix and sslittle-na-sstrix, respectively. You should use the target endianness that matches your host platform; the simulators may not work correctly if you force the compiler to provide cross-endian support. To determine the endianness of your host, run the endian program located in the simplesim-2.0/ directory. For simplicity, the following instructions assume a big-endian installation. In the following instructions, we refer to the directory in which you are installing SimpleScalar as $IDIR.
The simulators come equipped with their own loader, and thus you do not need to build the GNU binary utilities to run simulations. However, many of these utilities are useful, and we recommend that you install them. If desired, build the GNU binary utilities:
    cd $IDIR/binutils-2.5.2
    configure --host=$HOST --target=ssbig-na-sstrix --with-gnu-as --with-gnu-ld --prefix=$IDIR
    make

Note: You must have GNU Make to do the majority of installations described in this document. To check if you have the GNU version, execute "make -v" or "gmake -v". The GNU version understands this switch and displays version information.

    configure --host=$HOST --target=ssbig-na-sstrix --with-gnu-as --with-gnu-ld --prefix=$IDIR
    make LANGUAGES=c
    ../simplesim-2.0/sim-safe ./enquire -f >! float.h-cross
    make install
We provide pre-built copies of the necessary libraries in ssbig-na-sstrix/lib/, so you do not need to build the code in glibc-1.09, unless you change the library code. Building these libraries is tricky, and we do not recommend it unless you have a specific need to do so. In that event, to build the libraries:

The entire tool set should now be ready for use. We provide precompiled test binaries (big- and little-endian) and their sources in $IDIR/simplesim-2.0/tests. To run a test:

    cd $IDIR/simplesim-2.0
    sim-safe tests/bin.big/test-math

The test should generate about a page of output, and will run very quickly.

The release has been ported to, and should run on, the following systems:
The SimpleScalar architecture
The SimpleScalar architecture is derived from the MIPS-IV ISA [4]. The tool suite defines both little-endian and big-endian versions of the architecture to improve portability (the version used on a given host machine is the one that matches the endianness of the host). The semantics of the SimpleScalar ISA are a superset of MIPS with the following notable differences and additions:
• There are no architected delay slots: loads, stores, and control transfers do not execute the succeeding instruction.
• Loads and stores support two addressing modes (for all data types) in addition to those found in the MIPS architecture. These are indexed (register+register) and auto-increment/decrement.
• A square-root instruction, which implements both single- and double-precision floating point square roots.
• An extended 64-bit instruction encoding.

We list all SimpleScalar instructions in Figure 2. A complete list of the instruction semantics (as implemented in the simulator) can be found elsewhere [2]. In Table 1, we list the architected registers in the SimpleScalar architecture, their hardware and software names (which are recognized by the assembler), and a description of each. Both the number and the semantics of the registers are identical to those in the MIPS-IV ISA.

In Figure 3, we depict the three instruction encodings of SimpleScalar instructions: register, immediate, and jump formats. All instructions are 64 bits in length.
The register format is used for computational instructions. The immediate format supports the inclusion of a 16-bit constant. The jump format supports specification of 24-bit jump targets. The register fields are all 8 bits, to support extension of the architected registers to 256 integer and floating point registers. Each instruction format has a fixed-location, 16-bit opcode field that facilitates fast instruction decoding.
The annote field is a 16-bit field that can be modified post-compile, with annotations to instructions in the assembly files. The annotation interface is useful for synthesizing new instructions without having to change and recompile the assembler. Annotations are attached to the opcode, and come in two flavors: bit and field annotations. A bit annotation is written as follows:

The annotation in this example is /a. It specifies that the first bit of the annotation field should be set. Bit annotations /a through /p set bits 0 through 15, respectively. Field annotations are written in the form:

This annotation sets the specified 3-bit field (from bit 4 to bit 6 within the 16-bit annotation field) to the value 7.
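The mapping from bit and field annotations onto the 16-bit annote field can be illustrated with a short sketch. The Python helpers below are hypothetical and are not part of the tool set; they assume that bit 0 is the least-significant bit of the field, which the text does not state explicitly.

```python
ANNOTE_BITS = 16  # width of the annote field, as stated in the text


def bit_annotation(letter: str) -> int:
    """Return the mask for a bit annotation /a .. /p (bits 0 .. 15).

    Assumption: bit 0 is the least-significant bit of the annote field.
    """
    if len(letter) != 1 or not "a" <= letter <= "p":
        raise ValueError("bit annotations run from 'a' to 'p'")
    return 1 << (ord(letter) - ord("a"))


def field_annotation(annote: int, high: int, low: int, value: int) -> int:
    """Set bits low..high (inclusive) of the annote field to value."""
    if not 0 <= low <= high < ANNOTE_BITS:
        raise ValueError("field must lie within the 16-bit annote field")
    width = high - low + 1
    if not 0 <= value < (1 << width):
        raise ValueError("value does not fit in the field")
    mask = ((1 << width) - 1) << low
    # Clear the field, then write the new value into it.
    return (annote & ~mask & 0xFFFF) | (value << low)
```

Under these assumptions, setting the 3-bit field from bit 4 to bit 6 to the value 7 corresponds to field_annotation(0, 6, 4, 7), and the /a annotation corresponds to bit_annotation("a").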
System calls in SimpleScalar are managed by a proxy handler (located in syscall.c) that intercepts system calls made by the simulated binary, decodes the system call, copies the system call arguments, makes the corresponding call to the host's operating system, and then copies the results of the call into the simulated program's memory. If you are porting SimpleScalar to a new platform, you will have to code the system call translation from SimpleScalar to your host machine in syscall.c. A list of all SimpleScalar system calls is available elsewhere [2].
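The proxy pattern described above (decode, copy arguments, call the host, copy results back) can be sketched in a few lines. This is a toy illustration only, not the code in syscall.c: the call number SYS_READ, the register convention, and the host_read callback are all hypothetical names chosen for the example.

```python
from typing import Callable, List

SYS_READ = 3  # hypothetical call number, for illustration only


def proxy_syscall(regs: List[int], memory: bytearray,
                  host_read: Callable[[int, int], bytes]) -> None:
    """Service one simulated system call on behalf of the target program.

    Assumed convention (hypothetical): regs[0] holds the call number and
    receives the result; regs[1..3] hold the arguments.
    """
    number = regs[0]                       # decode the system call
    if number != SYS_READ:
        raise NotImplementedError(f"unsupported system call {number}")
    fd, buf_addr, count = regs[1], regs[2], regs[3]   # copy the arguments
    data = host_read(fd, count)            # make the corresponding host call
    if buf_addr < 0 or buf_addr + len(data) > len(memory):
        raise MemoryError("buffer lies outside simulated memory")
    # Copy the results into the simulated program's memory.
    memory[buf_addr:buf_addr + len(data)] = data
    regs[0] = len(data)
```

Passing the host call in as a parameter keeps the sketch testable; a port to a new host would replace that callback with the host's own system call.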
SimpleScalar uses a 31-bit address space, and its virtual memory is laid out as follows:
                Start of text segment
    0x10000000  Start of data segment
    0x7fffc000  Stack base (grows down)
The top of the data segment (which includes init and bss) is held in mem_brk_point. The areas below the text segment and above the stack base are unused.
Simulator internals
In this section, we describe the functionality of the processor simulators that accompany the tool set. We describe each of the simulators, their functionality, command-line arguments, and internal structures.
The compiler outputs binaries that are compatible with the MIPS ECOFF object format. Library calls are handled with the ported version of GNU GLIBC and POSIX-compliant Unix system calls. The simulators currently execute only user-level code. All SimpleScalar-related extensions to GCC are contained in the config/ss subdirectory of the GCC source tree that comes with the distribution.
The architecture is defined in ss.def, which contains a macro definition for each instruction in the instruction set. Each macro defines the opcode, name, flags, operand sources and destinations, and actions to be taken for a particular instruction. The instruction actions (which appear as macros) that are common to all simulators are defined in ss.h. Those actions that require different implementations in different simulators are defined in each simulator code file.
When running a simulator, main() (defined in main.c) does all the initialization and loads the target binary into memory. The routine then calls sim_main(), which is simulator-specific and defined in each simulator code file. sim_main() predecodes the entire text segment for faster simulation, and then begins simulation from the target program entry point.
The following command-line arguments are available in all simulators included with the release:
Functional simulation
The fastest, least detailed simulator (sim-fast) resides in sim-fast.c. sim-fast does no time accounting, only functional simulation: it executes each instruction serially, simulating no instructions in parallel. sim-fast is optimized for raw speed; it assumes no cache and no instruction checking, and has no support for DLite!. A separate version of sim-fast, called sim-safe, also performs functional simulation, but checks for correct alignment and access permissions for each memory reference. Although similar, sim-fast and sim-safe are split (i.e., protection is not toggled with a command-line argument in a merged simulator) to maximize performance. Neither simulator accepts any additional command-line arguments. Both versions are very simple (less than 300 lines of code) and therefore make good starting points for understanding the internal workings of the simulators. In addition to the simulator file, both sim-fast and sim-safe use the following code files (not including header files): main.c, syscall.c, memory.c, regs.c, loader.c, ss.c, endian.c, and misc.c. sim-safe also uses dlite.c.
Cache simulation
The SimpleScalar distribution comes with two functional cache simulators: sim-cache and sim-cheetah. Both use the file cache.c, and they use sim-cache.c and sim-cheetah.c, respectively. These simulators are ideal for fast simulation of caches if the effect of cache performance on execution time is not needed.
sim-cache accepts the following arguments, in addition to the universal arguments described in ...

<repl>   replacement policy (l | f | r), where l = LRU, f = FIFO, r = random replacement.

The cache size is therefore the product of <nsets>, <bsize>, and <assoc>. To have a unified level in the hierarchy, "point" the instruction cache to the name of the data cache in the corresponding level, as in the following example:

Both of these simulators are ideal for performing high-level cache studies that do not take access time of the caches into account (e.g., studies that are concerned only with miss rates). To measure the effect of cache organization upon the execution time of real programs, however, the timing simulator described in Section 4.4 must be used.
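The cache size relation stated above (the product of the number of sets, the block size, and the associativity) can be checked with a small sketch. The helper below is hypothetical and not part of the tool set; it assumes that the block size is given in bytes, and the parameter values in the usage note are illustrative rather than defaults taken from the release.

```python
# Replacement policy letters as listed in the text.
REPLACEMENT_POLICIES = {"l": "LRU", "f": "FIFO", "r": "random"}


def cache_size_bytes(nsets: int, bsize: int, assoc: int, repl: str = "l") -> int:
    """Return the capacity of a cache: nsets * bsize * assoc.

    Assumption: bsize is the block size in bytes.
    """
    if repl not in REPLACEMENT_POLICIES:
        raise ValueError("replacement policy must be one of l, f, r")
    if min(nsets, bsize, assoc) <= 0:
        raise ValueError("nsets, bsize, and assoc must be positive")
    return nsets * bsize * assoc
```

For example, 256 sets of 32-byte blocks in a direct-mapped organization give cache_size_bytes(256, 32, 1), which is 8192 bytes.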
Profiling
The distribution comes with a functional simulator that produces voluminous and varied profile information. sim-profile can generate detailed profiles on instruction classes and addresses, text symbols, memory accesses, branches, and data segment symbols.
sim-profile takes the following command-line arguments, which toggle the various profiling features:

-iclass      instruction class profiling (e.g., ALU, branch).
-iprof       instruction profiling (e.g., bnez, addi).
-brprof      branch class profiling (e.g., direct, calls, conditional).
-amprof      addressing mode profiling (e.g., displaced, R+R).
-segprof     load/store segment profiling (e.g., data, heap).
-tsymprof    execution profile by text symbol (functions).
-dsymprof    reference profile by data segment symbol.
-taddrprof   execution profile by text address.
-all         turn on all profiling listed above.

Three of the simulators (sim-profile, sim-cache, and sim-outorder) support text segment profiles for statistical integer counters. The supported counters include any added by users, so long as they are correctly "registered" with the SimpleScalar stats package included with the simulator code (see Section 4.5). To use the counter profiles, simply add the command-line flag:
-pcstat <stat>

where <stat> is the integer counter that you wish to profile by text address. We show a segment of the text profile output in Figure 4. Make sure that "objdump" is the version created when compiling the binutils. Also, the first line of textprof.pl must be changed to reflect your system's path to Perl (which must be installed on your system for you to use this script). As an aside, note that "-taddrprof" is equivalent to "-pcstat sim_num_insn".
Out-of-order processor timing simulation
The most complicated and detailed simulator in the distribution, by far, is sim-outorder (the main code file for which is sim-outorder.c, about 3500 lines long). This simulator supports out-of-order issue and execution, based on the Register Update Unit [5]. The RUU scheme uses a reorder buffer to automatically rename registers and hold the results of pending instructions. Each cycle the reorder buffer retires completed instructions in program order to the architected register file.
The processor's memory system employs a load/store queue. Store values are placed in the queue if the store is speculative. Loads are dispatched to the memory system when the addresses of all previous stores are known. Loads may be satisfied either by the memory system or by an earlier store value residing in the queue, if their addresses match. Speculative loads may generate cache misses, but speculative TLB misses stall the pipeline until the branch condition is known.
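The load handling described in this paragraph can be sketched as a small decision function. This is a simplified model, not the code in sim-outorder.c: the Store record and resolve_load() are hypothetical names, memory is modeled as a dictionary, and the choice to forward from the youngest matching store is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Store:
    addr: Optional[int]  # None until the effective address is known
    value: int


def resolve_load(load_addr: int, earlier_stores: List[Store],
                 memory: Dict[int, int]) -> Tuple[str, Optional[int]]:
    """Decide how a load is satisfied.

    earlier_stores holds the stores that precede the load, oldest first.
    Returns ("stall", None), ("forward", value), or ("memory", value).
    """
    # The load waits until the addresses of all previous stores are known.
    if any(store.addr is None for store in earlier_stores):
        return ("stall", None)
    # Assumption: the youngest earlier store to the same address wins.
    for store in reversed(earlier_stores):
        if store.addr == load_addr:
            return ("forward", store.value)
    # No matching store in the queue: the memory system supplies the value.
    return ("memory", memory.get(load_addr, 0))
```

The three outcomes correspond to the three cases in the text: an unresolved earlier store address, a matching earlier store in the queue, and a load serviced by the memory system.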
We depict the simulated pipeline of sim-outorder in ...

The fetch stage of the pipeline is implemented in ruu_fetch(). The fetch unit models the machine instruction bandwidth, and takes the following inputs: the program counter, the predictor state, and misprediction detection from the branch execution unit(s). Each cycle, it fetches instructions from only one I-cache line (and it blocks on an I-cache miss until the miss completes). After fetching the instructions, it places them in the dispatch queue, and probes the line predictor to obtain the correct ...

The dispatch routine is where instruction decoding and register renaming are performed. The function uses the instructions in the input queue filled by the fetch stage, a pointer to the active RUU, and the rename table. Once per cycle, the dispatcher takes as many instructions as possible (up to the dispatch width of the target machine) from the fetch queue and places them in the scheduler queue. This routine is the one in which branch mispredictions are noted. (When a misprediction occurs, the simulator uses speculative state buffers, which are managed with a copy-on-write policy.) The dispatch routine enters and links instructions into the RUU and the load/store queue (LSQ), and splits memory operations into two separate instructions (the addition to compute the effective address and the memory operation itself).
The issue stage of the pipeline is contained in ruu_issue() and lsq_refresh(). These routines model instruction wakeup and issue to the functional units, tracking register and memory dependences. Each cycle, the scheduling routines locate the instructions for which the register inputs are all ready. The issue of ready loads is stalled if there is an earlier store with an unresolved effective address in the load/store queue. If the address of the earlier store matches that of the waiting load, the store value is forwarded to the load. Otherwise, the load is sent to the memory system.

The execute stage is also handled in ruu_issue(). Each cycle, the routine gets as many ready instructions as possible from the scheduler queue (up to the issue width). The functional units' availability is also checked, and if they have available access ports, the instructions are issued. Finally, the routine schedules writeback events using the latency of the functional units (memory operations probe the data cache to obtain the correct latency of the operation). Data TLB misses stall the issue of the memory operation, are serviced in the commit stage of the pipeline, and currently assume a fixed latency. The functional units' latencies are hardcoded in the definition of ...
The writeback stage resides in ruu_writeback(). Each cycle it scans the event queue for instruction completions. When it finds a completed instruction, it walks the dependence chain of instruction outputs to mark instructions that are dependent on the completed instruction. If a dependent instruction is waiting only for that completion, the routine marks it as ready to be issued. The writeback stage also detects branch mispredictions; when it determines that a branch misprediction has occurred, it rolls the state back to the checkpoint, discarding the erroneously issued instructions.

ruu_commit() handles the instructions from the writeback stage that are ready to commit. This routine does in-order committing of instructions, updating of the data caches (or memory) with store values, and data TLB miss handling. The routine keeps retiring instructions at the head of the RUU that are ready to commit until the head instruction is one that is not ready. When an instruction is committed, its result is placed into the architected register file, and the RUU/LSQ resources devoted to that instruction are reclaimed.
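The in-order commit rule (retire from the head of the RUU until the head entry is not ready) can be sketched as follows. This is a simplified model rather than the actual commit routine: RUUEntry and commit() are hypothetical names, and the sketch ignores store handling, TLB misses, and any per-cycle commit limit.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Dict, Optional


@dataclass
class RUUEntry:
    dest: Optional[str]  # architected destination register, if any
    result: int
    ready: bool          # True once the instruction has completed


def commit(ruu: Deque[RUUEntry], regfile: Dict[str, int]) -> int:
    """Retire ready instructions from the head of the RUU, in program order.

    Stops at the first head entry that is not ready, even if younger
    entries behind it have already completed. Returns the number retired.
    """
    retired = 0
    while ruu and ruu[0].ready:
        entry = ruu.popleft()          # reclaim the RUU slot
        if entry.dest is not None:
            regfile[entry.dest] = entry.result  # update architected state
        retired += 1
    return retired
```

Because only the head is examined, a completed instruction behind a pending one stays in the queue, which is what keeps the architected register file consistent with program order.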
sim-outorder runs about an order of magnitude slower than sim-fast. In addition to the arguments listed at the beginning of Section 4, sim-outorder uses the following command-line arguments:
Specifying the processor core

-fetch:ifqsize <size>    set the fetch width to be <size> instructions. Must be a power of two. The default is 4.
-fetch:speed <ratio>     set the ratio of the front end speed relative to the execution core (allowing <ratio> times as many instructions to be fetched as decoded per cycle).
-fetch:mplat <cycles>    set the branch misprediction latency. The default is 3 cycles.
-decode:width <insts>    set the decode width to be <insts>, which must be a power of two. The default is 4.
-issue:width <insts>     set the maximum issue width in a given cycle. Must be a power of two.

Specifying the branch predictor

Branch prediction is specified by choosing the following flag with one of the six subsequent arguments. The default is a bimodal predictor with 2048 entries.

-bpred <type>
    nottaken   always predict not taken.
    taken      always predict taken.
    perfect    perfect predictor.
    bimod      bimodal predictor, using a branch target buffer (BTB) with 2-bit counters.

The predictor-specific arguments are listed below:

-bpred:bimod <size>      set the bimodal predictor table size to be <size> entries.
-bpred:2lev <l1size> <l2size> <hist_size> <xor>
                         specify the 2-level adaptive predictor. <l1size> specifies the number of entries in the first-level table, <l2size> specifies the number of entries in the second-level table, <hist_size> specifies the history width, and <xor> allows you to xor the history and the address in the second level of the predictor. This organization is depicted in Figure 6. In Table 2 we show how these parameters correspond to modern prediction schemes. The default settings for the four parameters are 1, 1024, 8, and 0, respectively.
-bpred:comb <size>       set the meta-table size of the combined predictor to be <size> entries. The default is 1024.
-bpred:ras <size>        set the return stack size to <size> (0 entries means no return stack). The default is 8.

Visualization

-pcstat <stat>           record statistic <stat> by text address; described in Section 4.3.
-ptrace <file> <range>   pipeline tracing, described in Section 5.
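As an illustration of the bimodal scheme selected by "-bpred bimod", the sketch below models a table of 2-bit saturating counters. It is a generic textbook model, not the code in bpred.c: the class name, the initial counter value, and the indexing by pc >> 3 (chosen because instructions are 64 bits long) are assumptions, and the branch target buffer is omitted.

```python
class BimodalPredictor:
    """Table of 2-bit saturating counters indexed by branch address."""

    def __init__(self, size: int = 2048) -> None:
        # 2048 entries is the default table size given in the text.
        if size <= 0 or size & (size - 1):
            raise ValueError("table size must be a power of two")
        self.size = size
        # Assumption: counters start in the weakly not-taken state (1).
        self.counters = [1] * size

    def _index(self, pc: int) -> int:
        # Assumption: drop the low 3 bits, since instructions are 8 bytes.
        return (pc >> 3) & (self.size - 1)

    def predict(self, pc: int) -> bool:
        """Predict taken when the counter is in one of the two upper states."""
        return self.counters[self._index(pc)] >= 2

    def update(self, pc: int, taken: bool) -> None:
        """Move the counter toward the actual outcome, saturating at 0 and 3."""
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
```

The 2-bit hysteresis means that a single not-taken outcome in a strongly taken branch (for example, a loop exit) does not flip the next prediction.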
Simulator code file descriptions
The following list describes the functionality of the C code files in the simplesim-2.0/ directory, which are used by all of the simulators.
• bitmap.h: Contains support macros for performing bitmap manipulation.
• bpred.[c,h]: Handles the creation, functionality, and updates of the branch predictors. bpred_create(), bpred_lookup(), and bpred_update() are the key interface functions.
• cache.[c,h]: Contains general functions to support multiple cache types (e.g., TLBs, instruction and data caches). Uses a linked list for tag comparisons in caches of low associativity (less than or equal to four), and a hash table for tag comparisons in higher-associativity caches. The important interfaces are cache_create(), cache_access(), cache_probe(), cache_flush(), and cache_flush_addr().
• dlite.[c,h]: Contains the code for DLite!, the source-level target program debugger.
• endian.[c,h]: Defines a few simple functions to determine byte- and word-order on the host and target platforms.
• eval.[c,h]: Contains code to evaluate expressions, used in DLite!.
• eventq.[c,h]: Defines functions and macros to handle ordered event queues (used for ordering writebacks). The important interface functions are eventq_queue() and eventq_service_events().
• loader.[c,h]: Loads the target program into memory, sets up the segment sizes and addresses, sets up the initial call stack, and obtains the target program entry point. The interface is ld_load_prog().
• main.c: Performs all initialization and launches the main simulator function. The key functions are sim_options(), sim_config(), sim_main(), and sim_stats().
• ptrace.[c,h]: Contains code to collect and produce pipeline traces from sim-outorder.
• range.[c,h]: Holds code that interprets program range commands used in DLite!.
• regs.[c,h]: Contains functions to initialize the register files and dump their contents.
• resource.[c,h]: Contains code to manage functional unit resources, divided up into classes. The three defined functions create the resource pools and busy tables (res_create_pool()), return a resource from the specified pool if available (res_get()), and dump the contents of a pool (res_dump()).
• sim.h: Contains a few extern variable declarations and function prototypes.
• stats.[c,h]: Contains routines to handle statistics measuring target program behavior. As with the options package, counters are "registered" by type with an internal database. The stat_reg_*() routines register counters of various types, and stat_reg_formula() allows you to register expressions constructed of other statistics. stat_print_stats() prints all registered statistics. The statistics package also has facilities to measure distributions; stat_reg_dist() creates an array distribution, stat_reg_sdist() creates a sparse array distribution, and stat_add_sample() updates a distribution.
• ss.[c,h]: Defines macros to expedite the processing of instructions, numerous constants needed across simulators, and a function to print out individual instructions in a readable format.
• ss.def: Holds a list of macro calls (the macros are defined in the simulators and ss.h and ss.c), each of which defines an instruction. The macro calls accept as arguments the opcode, name of the instruction, sources, destinations, actions to execute, and other information. This file serves as the definition of the instruction set.
• symbol.[c,h]: Holds routines to handle program symbol and line information (used in DLite!).
• syscall.[c,h]: Contains code that acts as the interface between the SimpleScalar system calls (which are POSIX-compliant) and the system calls on the host machine.
• sysprobe.c: Determines byte and word order on the host platform, and generates appropriate compiler flags.
• version.h: Defines the version number and release date of the distribution.
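To make the role of the ordered event queue more concrete, the sketch below models a time-ordered queue of the kind the eventq entry above describes for ordering writebacks. It is a minimal illustration built on a binary heap, not the implementation in the release: the EventQueue class and its method signatures are hypothetical and only loosely mirror the two interface functions named in the list.

```python
import heapq
import itertools
from typing import Callable, List, Tuple


class EventQueue:
    """Events ordered by the simulated time at which they become due."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, Callable[[], None]]] = []
        # Sequence numbers keep events with equal times in insertion order.
        self._seq = itertools.count()

    def queue(self, when: int, action: Callable[[], None]) -> None:
        """Schedule action to run once simulated time reaches when."""
        heapq.heappush(self._heap, (when, next(self._seq), action))

    def service_events(self, now: int) -> int:
        """Run every event due at or before now, earliest first."""
        serviced = 0
        while self._heap and self._heap[0][0] <= now:
            _, _, action = heapq.heappop(self._heap)
            action()
            serviced += 1
        return serviced
```

A writeback stage built on such a queue would schedule each issued instruction at the current cycle plus its functional unit latency, then service the queue once per simulated cycle.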
Utilities
In this section we describe the utilities that accompany the SimpleScalar tool set: pipeline tracing and a source-level debugger.
Out-of-order pipeline tracing
The tool set provides the ability to extract and view traces of the out-of-order pipeline. Using the "-ptrace" option, a detailed history of all instructions executed in a range may be saved to a file. The information saved includes instruction fetch, retirement, and stage transitions. The syntax of this command is as follows:

-ptrace <file> <start>:<end>

<file> is the file to which the trace will be saved. <start> and <end> are the instruction numbers at which the trace will be started and stopped. If they are left blank, the trace will start at the beginning and/or stop at the end of the program, respectively. For example:

-ptrace FOO.trc 100:500   trace from instructions 100 to 500, store the trace in file FOO.trc.
-ptrace FOO.trc :10000    trace from program beginning to instruction 10000.
-ptrace FOO.trc :         trace the entire program execution.

The traces may be viewed with the pipeview.pl Perl script, which is provided in the simplesim-2.0 directory. (You will have to update the first line of pipeview.pl to have the correct path to your local Perl binary, and you must have Perl installed on your system.)

To use the debugger in a simulation, add the "-i" option (which stands for interactive) to the simulator command line. Below we list the set of commands that DLite! accepts.

cont [addr] continue execution (optionally continuing starting at <addr>).
break <addr> set breakpoint at <addr>, returns <id> of breakpoint.
dbreak <addr> [r,w,x] set data breakpoint at <addr> for (r)ead, (w)rite, and/or e(x)ecute, returns <id> of breakpoint.
rbreak <range> [r,w,x] set breakpoint at <range> for (r)ead, (w)rite, and/or e(x)ecute, returns <id> of breakpoint. 
Printing information:
extending the tool set to simulate ISAs other than SimpleScalar and MIPS (Alpha and SPARC ISA support will be the first additions).
As they stand now, these tools provide researchers with a simulation infrastructure that is fast, flexible, and efficient. Changes in both the target hardware and software may be made with minimal effort. We hope that you find these tools useful, and encourage you to contact us with ways that we can improve the release, documentation, and the tools themselves.
