Trace-driven simulation is an important aid in performance analysis of computer systems. Capturing address traces for these simulations is a difficult problem for single processors and particularly for multicomputers. Even when existing trace methods can be used on multicomputers, the amount of collected data typically grows with the number of processors, so I/O and trace storage costs increase. A new technique is presented in this paper which modifies the executable code to dynamically collect the address trace from the user code and analyzes this trace during the execution of the program. This method helps resolve the I/O and storage problems and facilitates parallel analysis of the address trace. If a trace stored on disk is desired, the generated trace information can also be written to files during execution, with a resultant drop in program execution speed. An initial implementation on the Intel iPSC/2 hypercube multicomputer is detailed, and sample simulation results are presented. The effect of this trace collection method on execution time is illustrated.
INTRODUCTION
Trace-driven simulation is an important method of analyzing the performance of computer systems [1, 2]. However, accurately and efficiently capturing address trace data for multicomputers is extremely difficult. In this paper, we examine the problem of address trace generation and collection for multicomputers, which are non-shared distributed memory parallel processors of the multiple-instruction, multiple-data stream class (MIMD) [3]. This class of machines is in contrast to MIMD multiprocessors with a global shared memory. Recording the address traces for multicomputers typically requires large amounts of memory, and therefore the I/O necessary for saving these traces is a significant overhead. In addition, the traces gathered are typically valid for only the number of processing nodes that participated in the execution.
Understanding how the execution time, speedup, and other system measures change as the number of processors changes is of vital importance to multicomputer hardware, software, and application designers. This necessity mandates having several sets of traces for any single application problem: one set of traces for each possible hypercube dimension, for example. Also, since speedup is heavily dependent on the size and characteristics of the input data for most parallel applications, there is a need for application program traces for several different sets of inputs. Keeping all of these traces in storage rapidly becomes impractical for a large number of processing nodes.
This paper presents a new software address tracing technique for multicomputers called TRAPEDS (TRAce Producing Execution Driven Simulation). This software technique modifies executable code (at the assembly language level), producing a new executable program which dynamically produces correct address traces of the user code and other information valuable in assessing computer system performance. The primary purpose of this tool is to enable hardware designers to model and simulate trade-offs in a multicomputer's computation and communication capabilities for the specific parallel algorithms that are traced.
As trace addresses are generated by the execution, analysis of cache performance and other design alternatives can be performed, eliminating the need for storing large amounts of trace data, and thereby reducing the massive trace storage requirement and the I/O bottleneck that would slow execution on a multicomputer. An added benefit of collecting and analyzing the address trace on the multicomputer is the speedup obtained as the number of processors increases. The simulation speedup is dependent on the speedup of the original executable code, since the effects of synchronization, message-passing, and unbalanced or replicated computation are also present in the modified executable code. Our approach to producing address trace data has been implemented on an Intel iPSC/2 hypercube multicomputer. In our implementation on the iPSC/2, the execution of the traced program is typically degraded by less than a factor of 50, which compares favorably with existing trace collection methods. Conventional stored trace data can also be obtained with the TRAPEDS approach with the resulting increase in storage cost and performance degradation.
A brief review of popular existing trace methods is presented in Section 2. The TRAPEDS methodology is discussed in Section 3, and its implementation on the iPSC/2 is outlined in Section 4. Section 5 presents results of performance evaluation of this method along with preliminary memory reference observations.
REVIEW OF EXISTING TRACING TECHNIQUES
Hardware Monitoring Based Traces
Hardware monitoring can directly record memory bus activity and the actual addresses sent to off-chip caches or main memory modules [4, 5]. This monitoring captures both user and operating system references, as well as multiprogrammed streams of references. The effect of on-chip caches on the reference stream is also included, but this implementation effect is also a drawback of hardware monitoring. The primary limitations of this approach are its complexity, cost, and lack of easy flexibility. Because of limited memory and bandwidth, hardware monitors typically cannot capture all of the reference trace, and must settle for isolated collections of contiguous references, or counts of events, rather than a listing of the events themselves. To collect trace information for all of the processors in a multicomputer, the complex hardware required grows at least linearly with the number of processors.
Instruction Interrupt Based Traces
Some computer systems (e.g., some VAX machines [6] ) provide the capability of interrupting the execution of a program after each instruction. The virtual address references for each type of instruction can then be calculated. Since operating system routines typically disable these interrupts, the operating system execution cannot be traced. The need to interrupt each instruction slows down the program execution considerably. For multicomputers, this distortion of instruction execution time inevitably changes the fashion in which different processing nodes interact with each other (except when the multicomputer message-passing is synchronized between a specified sender and receiver, such as in the Occam language [7] ), and thereby possibly changes the address trace.
Software Simulation Based Traces
Software simulation can also provide user traces, and can simultaneously model the execution time of a processor, which can enable accurate modeling of the interaction between different processors in multicomputers. This simulation can also provide emulation of operating system activities, although this emulation may not be exact. Software simulation is slow, however, since the simulator must model much of the real hardware, including the actual ALU operations, flag setting, instruction fetching, and main memory storage and accesses [8, 9] .
Microprogramming Based Traces
ATUM [10] is a recently introduced technique that alters a machine's microcode to capture address traces. This technique enables the capture of full address traces for multiprogrammed user code and operating system activity. It is also fast, with a reported overhead factor of 20. This technique was recently used to collect traces for a 4-processor system [11]. Despite the significance of this approach, there are obstacles to implementing microcode alteration on existing multicomputers. The main obstacle is that the processors on commercial multicomputers tend to be one-chip microprocessors, and either do not use microcode or contain their microcode in ROM (as in the iPSC/2, which uses the 80386 processor). Even if this microcode could be changed, there would typically not be extra space on the chip to allow the ATUM changes.
TRAPEDS Based Traces
The TRAPEDS method of this paper (TRAce Producing Execution Driven Simulation) addresses the issues of producing accurate and efficient multicomputer traces with a reduction in the burdensome storage and I/O requirements of stored traces. The traces produced by this method currently do not include operating system references. Also, the current implementation on the iPSC/2 does not provide the ability to collect multiprogrammed traces. At the present time, multicomputers such as hypercubes are rarely used in a multiprogramming mode, partly because each processing node has a fixed amount of space within which all currently executing programs must completely reside. TRAPEDS also attempts to mitigate the effects of execution time distortion on the interaction between processors by introducing a simulated time for each multicomputer processing node, and by passing these simulated times between nodes during communication.
TRACE-PRODUCING EXECUTION-DRIVEN SIMULATION METHOD
Execution-driven simulation is a term coined by Covington et al. [12] for an approach to gathering accurate timing statistics for a program as it is executing. Briefly, the method estimates the time to execute each basic block in the assembly code, where a basic block is defined as a set of machine instructions that will always execute together in the absence of interrupts. Calls to a simulation timer update routine are placed at the beginning of each basic block in the assembly code, and the estimated execution time for that basic block is passed as a parameter to this routine. The execution of this modified program also updates the timer, simulating the program execution time. This method was also used by Fujimoto [13], and in the instruction counting method introduced by Weinberger [14]. This general execution driven simulation approach could have been used to perform a static address analysis on each basic block. Static analysis can generate instruction address traces, but data addresses cannot be fully determined without register and memory information available only at run time. Our paper extends the execution-driven simulation concept, enabling user address tracing for both instructions and data, by also collecting and analyzing the run time, or dynamic, information necessary to calculate data addresses.
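The timer-update idea described above can be illustrated with a minimal Python sketch. It is not the code of [12]; the class name `SimTimer`, the block names, and the cycle estimates are all hypothetical values chosen for illustration.

```python
# Hypothetical sketch of execution-driven timing: every basic block has a
# statically estimated cycle count, and a timer-update call placed at the
# start of the block adds that estimate to a simulated clock.

class SimTimer:
    def __init__(self):
        self.cycles = 0  # simulated elapsed cycles

    def update(self, block_cycles):
        """Stand-in for the timer update routine called once per basic block."""
        self.cycles += block_cycles


# Assumed per-block estimates (illustrative numbers only).
BLOCK_CYCLES = {"entry": 12, "loop_body": 7, "exit": 5}


def run_instrumented(timer, iterations):
    """Mimics a program whose basic blocks each begin with a timer update."""
    timer.update(BLOCK_CYCLES["entry"])
    for _ in range(iterations):
        timer.update(BLOCK_CYCLES["loop_body"])
    timer.update(BLOCK_CYCLES["exit"])
    return timer.cycles
```

Because the estimate is attached to the block rather than to individual instructions, the run-time cost is one call per block, while the simulated clock still follows the path the program actually takes.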
The dynamic collection of information utilized in this paper requires additional modification of the assembly code. In addition, static analysis produces address information that must be stored in the virtual address space. For these reasons, addresses collected during execution may not be identical to the actual addresses in the unmodified code. Calculating the correct data addresses at execution time, therefore, is a major element of the TRAPEDS method. The steps used to produce a modified executable program are described in what follows. All steps are accomplished automatically by the TRAPEDS software.
TRAPEDS Steps in Modifying the Executable File
STEP 1: The original program's source files are compiled and linked with the library functions to produce the original executable file as is illustrated in Figure 1. This file is analyzed to record the beginning virtual addresses of the text (program), initialized data, and uninitialized data sections.
STEP 2: All source written in a high-level language is compiled to assembly code. Together with any source files written directly in assembly language, these compiled programs form the suite of assembly language files that will be modified by the TRAPEDS software.
STEP 3: For each resulting assembly language file, the corresponding machine language instructions in the executable file are analyzed. Utilizing both the assembly language and machine code is advantageous because extracting virtual address information requires the actual machine language instructions. However, it is far easier to modify the associated assembly language program to capture necessary run time address information.
STEP 4: The assembly source is divided into basic blocks by noting labels and statements such as jumps, calls, and returns that can break the normal sequential execution of the program. In a separate assembly language file (named auxfile.s in this discussion), the starting address of the basic block is recorded, the first of several types of data that will be recorded in auxfile.s for each basic block. A call to the basic block performance simulation routine (hereafter called X-bb-perf) is inserted at the beginning of each basic block in the assembly source file. A pointer to the auxfile.s address information is also saved in a global variable after this call to X-bb-perf.
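The block-splitting and call-insertion part of Step 4 can be sketched as follows. This is a simplified Python illustration, not the TRAPEDS tool: the mnemonic list, the label convention (a line ending in a colon), and the symbol spelling `X_bb_perf` (a hyphen is not valid in an assembler symbol) are assumptions.

```python
# Assumed subset of 80386 control-transfer mnemonics that end a basic block.
CONTROL_TRANSFER = ("jmp", "je", "jne", "jl", "jg", "call", "ret")
CALL_LINE = "    call X_bb_perf"  # hypothetical spelling of the inserted call


def is_label(line):
    return line.strip().endswith(":")


def ends_block(line):
    parts = line.split()
    return bool(parts) and parts[0] in CONTROL_TRANSFER


def split_basic_blocks(lines):
    """A label starts a new block; a jump, call, or return ends the current one."""
    blocks, current = [], []
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        if is_label(line) and current:
            blocks.append(current)
            current = []
        current.append(line)
        if ends_block(line):
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return blocks


def instrument(block):
    """Insert the analysis call at the start of a block, after any leading label."""
    if is_label(block[0]):
        return [block[0], CALL_LINE] + block[1:]
    return [CALL_LINE] + block
```

A real tool would also have to handle assembler directives, several labels on one block, and the recording of per-block data in auxfile.s, all of which are omitted here.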
Since the dynamic address information is collected during the execution of a basic block, the call to X-bb-perf must analyze the previously executed basic block. This is conceptualized in Figure 2, which shows the high-level organization of X-bb-perf in a C-like syntax.
Figure 2. High-level structure of X-bb-perf, the basic block performance analysis routine.
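The "analyze the previous block" ordering can be illustrated with a small Python sketch. It is not a reconstruction of Figure 2; the class `Tracer`, its field names, and the record layout (a dictionary with `start`, `cycles`, and `accesses`) are hypothetical, and each data access is simplified to a displacement added to one saved register value.

```python
class Tracer:
    """Toy stand-in for the analysis routine; all field names are assumptions."""

    def __init__(self):
        self.prev_record = None  # aux record of the most recently entered block
        self.saved_regs = []     # register values saved while that block ran
        self.sim_cycles = 0      # simulated elapsed cycles
        self.trace = []          # (access type, address) pairs

    def bb_perf(self):
        """Analyze the block that just finished executing, if any."""
        rec = self.prev_record
        if rec is not None:
            self.sim_cycles += rec["cycles"]
            self.trace.append(("fetch", rec["start"]))
            # Pair each static access description with its run-time base value.
            for (kind, disp), base in zip(rec["accesses"], self.saved_regs):
                self.trace.append((kind, base + disp))
        self.saved_regs = []

    def enter_block(self, record):
        """Models the inserted call followed by the global pointer update."""
        self.bb_perf()
        self.prev_record = record

    def save_reg(self, value):
        """Models inserted code that saves a register during the block."""
        self.saved_regs.append(value)
```

In this sketch the trace for a block appears only when the next block is entered, because its register values do not exist until it has run; a final call to `bb_perf` at program exit would be needed to flush the last block.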
dynamic address values needed in any given basic block of the executable file.
STEP 8: The modified assembly files are assembled again, and linked with X-bb-perf and any simulation routines called by X-bb-perf, resulting in a modified executable file capable of generating address trace information.
Solving the virtual address modification problem
The discussion in this section is based on UNIX, but the principles considered apply to many other operating systems as well. In UNIX, an executable file is commonly divided into three segments: .text, .data, and .bss (the stack) [15]. The .text section within the .text segment contains user code. The .data segment has two adjacent sections. The first section contains initialized static data, which we shall refer to as initialized permanent data, because the location of the data is reserved during the entire execution of the program. The second section contains uninitialized permanent data. The .bss segment contains no initial data, and merely indicates that a stack segment is required. The .text and .data sections are pictured in Figure 3(a).
The initialized .data section starts in the next page table directory after the last one used by the text section, in the first page of that directory, with an offset into that page equal to the first unused memory offset in the last page of text. This allows the .text and .data sections, which are physically adjacent in the executable file, to be loaded adjacently into physical memory. This also implies that any changes in the size of the .text section will change the starting virtual address of the .data section. In addition, the extra permanent data in auxfile.s, X-bb-perf, and the performance analysis routines called by X-bb-perf change the virtual addresses in the .data sections.
In TRAPEDS, the solution involves ensuring that all newly created permanent data are initialized and placed at the beginning of the initialized .data section of the executable file, as illustrated in Figure 3(b). In this case, all the original data in the .data sections will be displaced by an equivalent amount. This displacement is a result of both the .text and .data section changes, and can be easily determined by comparing symbol table information in the modified and original executable files. In addition, either static or dynamic analysis must be able to determine which segment the referenced data is in, because .bss segment addressing remains unaffected by the changes in the .text and .data sections (no address adjustment is needed).
UNIX is a registered trademark of AT&T.
Figure 3. Placement of code and data in the virtual address space of the executable files.
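The address correction described above can be sketched in a few lines of Python. The function names, the dictionary form of the symbol tables, and the segment labels are hypothetical; the sketch assumes, as the text states, that every original .data item moves by the same amount and that .bss addresses are unchanged.

```python
def data_displacement(orig_symbols, mod_symbols, symbol):
    """Displacement of the original .data contents, found by comparing the
    address of one permanent-data symbol in the two executable files."""
    return mod_symbols[symbol] - orig_symbols[symbol]


def original_address(addr, segment, displacement):
    """Map an address observed in the modified program back to the address
    the unmodified program would have used."""
    if segment == "data":
        return addr - displacement
    return addr  # .bss references need no adjustment
```

For example, if a symbol sits at 0x40001000 in the original file and at 0x40003400 in the modified file, every traced .data address is reduced by 0x2400 before it is reported.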
A subtle problem arises when X-bb-perf or its simulation routines directly or indirectly call library routines with permanent data that were not called by the original source files. In practice most library routines do not define permanent data, and this problem does not exist with our current performance routines. The solution to this problem involves linking the previously uncalled library routines to X-bb-perf and its routines during a first-pass linking phase. If the permanent data in these library routines is uninitialized (very rare), this data must be initialized. With this procedure, all new permanent data will be placed before the original permanent data by the normal linking of all routines.
IMPLEMENTATION ON THE 80386-BASED IPSC/2
The Intel iPSC/2 hypercube is an 80386/80387-based multicomputer that can contain up to 128 processing nodes, each with up to 16 Megabytes of main memory. The TRAPEDS method was implemented for a 16 node iPSC/2 with 4 Megabytes of main memory at each node. Each processor also has a 64 Kbyte zero wait-state write-through cache with a 4 byte line size and direct mapping. The 80386 pre-fetches instructions into a 16 byte buffer via its 4 byte data bus [16].
On the 80386, all explicit references to memory use the same addressing modes for the segment offset, which are subsets of the following general addressing mode:
$$\text{segment offset} = \text{base register} + (\text{index register} \times \text{scale factor}) + \text{displacement}$$
The displacement and scale factor, if present, constitute static information saved in auxfile.s during the assembly code modification. The base register and index register, if present, constitute dynamic information that must be saved at execution time.
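The split between static and dynamic information can be illustrated with a short Python sketch of the offset calculation. The names `StaticMode` and `segment_offset` and the dictionary of saved register values are hypothetical; only the base, index, scale, and displacement terms come from the 80386 addressing model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StaticMode:
    """Static part of an addressing mode, known when the assembly is modified."""
    displacement: int = 0
    base: Optional[str] = None    # register name, or None if unused
    index: Optional[str] = None   # register name, or None if unused
    scale: int = 1                # 1, 2, 4 or 8 on the 80386


def segment_offset(mode, regs):
    """Combine the static mode with register values saved at execution time."""
    offset = mode.displacement
    if mode.base is not None:
        offset += regs[mode.base]
    if mode.index is not None:
        offset += regs[mode.index] * mode.scale
    return offset & 0xFFFFFFFF    # offsets are 32 bits wide


# Usage: a scaled-index array access with a small displacement.
mode = StaticMode(displacement=16, base="ebx", index="esi", scale=4)
print(hex(segment_offset(mode, {"ebx": 0x1000, "esi": 3})))
```

Only the register values have to be captured while the program runs; everything in `StaticMode` can be written out once, when the assembly file is modified.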
As shown before, the calculated segment offset of the virtual address may not be correct. In the 80386, the instruction implicitly or explicitly indicates the segment used for memory references. Any references to the .data segment that also use a base register have incorrect (modified) segment offsets, and the correcting offset is subtracted from the calculated virtual address for these cases.
X-bb-perf recognizes several types of memory accesses, such as push, push memory, pop, pop memory, read memory, read and write memory, write memory, read 2 words of memory, etc. The segment (usually .data or .bss) referenced is also stored as part of the access type. Combined with the addressing mode, these types of accesses provide a full description of every memory access.
The current implementation records the type of access and addressing mode in auxfile.s as shown in Figure 4. If the recorded addressing mode contains a displacement, this displacement is placed in the 4 bytes following the mode information. The auxfile.s information for each basic block also contains the starting text address of the basic block, and along with the code fetching information saved in bits 16-23, allows X-bb-perf to fully reconstruct and interleave the code and data accesses to form an accurate trace.
In collecting traces on a multicomputer it is important to model the interaction between processors as accurately as possible. The trace collection slows down the execution of each program and thus potentially changes the order in which processors send messages. In this implementation, one of the functions of X-bb-perf is to simulate the elapsed number of cycles in each processor's execution. This number of cycles is stored in an 8 byte field X-time, since using only 4 bytes to count cycles would cause wrap-around of the time to zero in less than 5 minutes of simulated execution time of the 16-MHz 80386 processor. The information stored in auxfile.s for each basic block also contains the estimated number of processor cycles for that basic block (assuming no cache misses), and X-time is incremented by that amount when the basic block is executed. The effects of cache misses are modeled simplistically by adding a fixed cycle penalty for each type of memory access. The 80386 also contains a 32 entry 4 way associative TLB, but the TLB and the effects of TLB misses are not modeled in the current implementation.
The iPSC/2 message-passing routines have a top layer of code implemented in C that calls the actual operating system code. This top layer of code was provided to allow simple modifications of message passing. The TRAPEDS simulation routines contain redefined message sends that send an extra message containing X-time for every normal message send. The receive routines are recoded to receive X-time after every normal receive. Each modified send and receive routine models the cost of communication (both latency and per byte transmission speed) in a simple manner which does not account for possible delays in message routing caused by network congestion, two or more messages arriving at a node in the same time interval, etc.
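The choice of an 8 byte X-time field can be checked with simple arithmetic. The Python sketch below is only a worked calculation; the function name `wrap_seconds` is hypothetical, and the 16 MHz clock rate is the one stated in the text.

```python
CLOCK_HZ = 16_000_000  # 16-MHz 80386, as stated in the text


def wrap_seconds(counter_bytes, clock_hz=CLOCK_HZ):
    """Simulated seconds before a cycle counter of the given width wraps to zero."""
    return 2 ** (8 * counter_bytes) / clock_hz


four_byte = wrap_seconds(4)   # 2**32 cycles / 16e6 Hz
eight_byte = wrap_seconds(8)  # 2**64 cycles / 16e6 Hz
print(f"4 bytes: {four_byte:.1f} s ({four_byte / 60:.2f} min)")
print(f"8 bytes: {eight_byte / (3600 * 24 * 365.25):.0f} years")
```

A 4 byte counter wraps after 2^32 / 16,000,000 ≈ 268 seconds, or about 4.5 minutes, which is consistent with the "less than 5 minutes" figure; an 8 byte counter would not wrap for tens of thousands of years of simulated time.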
When a message is received, the received X-time can be compared with the local X-time to determine when communication waiting periods occur. However, it is possible to receive a sequence of messages (from different processor nodes) with non-monotonically increasing X-time. For these cases, a mechanism should exist to provide reordered reception of messages, so that the simulated execution will more closely model the actual execution. The implementation of this paper does not attempt any such message reordering, and this is a topic of current research. A related purpose of X-time is modeling the performance of the iPSC/2 hypercube as various hardware parameters are changed.
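The comparison of received and local simulated times can be sketched as follows. This Python fragment is an assumption-laden illustration rather than the TRAPEDS receive routine: the function names, the latency and per-byte constants, and the choice to charge the whole communication cost on the receiving side are all hypothetical.

```python
def message_cost(nbytes, latency_cycles, cycles_per_byte):
    """Simple cost model: fixed latency plus a per-byte transmission term."""
    return latency_cycles + cycles_per_byte * nbytes


def receive(local_time, sender_time, nbytes,
            latency_cycles=1000, cycles_per_byte=2):
    """Return (new local simulated time, cycles spent waiting).

    The message is assumed to arrive at the sender's simulated time plus
    the modeled communication cost. If that is later than the receiver's
    own simulated time, the receiver is considered to have waited.
    """
    arrival = sender_time + message_cost(nbytes, latency_cycles, cycles_per_byte)
    wait = max(0, arrival - local_time)
    return local_time + wait, wait
```

This sketch handles messages strictly in the order they are received, so it shares the limitation noted above: a message carrying an earlier simulated time than one already processed is not reordered.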
TRAPEDS PERFORMANCE AND CACHE SIMULATION RESULTS
This section discusses TRAPEDS simulation performance and data collected by the TRAPEDS method. Benchmark studies of the overhead of this method are particularly emphasized, since the modified executable file requires more memory space and more execution time than the original executable file.
Space and Time Overhead
Both the .text section and the initialized permanent .data section are substantially lengthened by additional information. This amount of additional information increases with the number of basic blocks and memory references in the original .text section. Table 1 shows the additional memory requirements for three modified iPSC/2 hypercube node programs. X-bb-perf and its associated routines are listed separately because their size is not dependent upon the size of the original .text section. Much of the X-bb-perf overhead is for a cache model which is included in the simulation routines (for Table 1 the cache model is large enough to store up to 16K tags for a direct mapped cache). The additions to the .text section and .data section together require roughly 4 times the memory space of the original .text section. When compared to the total iPSC/2 node memory space (4 Megabytes), the space overheads for the programs shown are quite small.
Extra execution time is incurred saving register values and calling X-bb-perf at the start of each basic block. Extra overhead is also incurred for any simulation routines called by X-bb-perf.
It is desirable to separate the effects of these two overheads, since for a given program and hypercube dimension the overhead due to X-bb-perf execution should be relatively constant, while the simulation routines can be changed for each new run (e.g., changing from a direct mapped cache to a set-associative cache model or simulating two or more cache models in the same run will cause an increase in the cache model simulation time). The plots to be shown for execution overhead assume the following definition:
$$\text{execution overhead} = \frac{\text{modified file execution time}}{\text{original file execution time}}$$
To separate the effects of address generation from simulation, performance benchmarks were run against a parallel version of the simplex algorithm [17]. This algorithm has moderate but not excessive parallelism, and this parallelism is sensitive to changes in the input data size, allowing some control over algorithm speedup. The complexity of sequential simplex is roughly proportional to m*n, where m is the number of rows in the input matrix, and n is the number of columns. For each graph shown, the number of rows and columns in the input are displayed. In the first performance experiment X-bb-perf generates addresses and calls the cache simulation routine, but the simulation routine immediately returns to X-bb-perf. A plot of the execution overhead for the original and modified executable files is shown in Figure 5. Execution overhead due to TRAPEDS is approximately 30 for execution on a single node. As the number of nodes increases, however, the overhead factor decreases. This is a consequence of the parallel nature of the address tracing overhead.
Because the overhead is decreasing with the hypercube dimension, the speedup S(mod) of the modified executable file is higher than the speedup S for the original file, where these speedups are defined as:
The second experiment shows the performance overhead when a direct mapped cache simulation model is used to find cache hit ratios. Figure 7 shows this overhead for different sizes of hypercubes. For a program executing on a single hypercube node, the execution overhead for producing trace addresses and simulating cache hits and misses for a direct mapped cache is less than a factor of 50. Again, increasing the number of hypercube nodes decreases the execution overhead, this time even more dramatically, which implies improved speedup S(mod).
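The link between a decreasing overhead and a higher speedup of the modified file can be made explicit with a short derivation. It assumes the conventional definition of speedup (single-node execution time divided by n-node execution time); the symbols $T_o(n)$ and $T_m(n)$ for the execution times of the original and modified files on $n$ nodes, and $O(n)$ for the execution overhead, are introduced here for illustration only.

```latex
\begin{align*}
O(n) &= \frac{T_m(n)}{T_o(n)}, \qquad
S(n) = \frac{T_o(1)}{T_o(n)}, \qquad
S_{\mathrm{mod}}(n) = \frac{T_m(1)}{T_m(n)} \\
S_{\mathrm{mod}}(n) &= \frac{O(1)\,T_o(1)}{O(n)\,T_o(n)}
                     = S(n)\,\frac{O(1)}{O(n)}
\end{align*}
```

Under these assumptions, whenever the overhead on $n$ nodes is smaller than the single-node overhead, $O(n) < O(1)$, the modified file shows a larger speedup than the original, $S_{\mathrm{mod}}(n) > S(n)$.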
Even for the simple direct mapped cache model used, the cache simulation constitutes a significant portion of the total execution overhead. Additional complexity in the cache model or simulating cache performance for two or more caches will increase the execution overhead further, but the resulting analysis will be conducted with an even higher degree of parallelism.
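A direct mapped cache model of the kind discussed here can be sketched in a few lines of Python. This is a minimal illustration, not the simulation routine used in the experiments: the class name and its interface are hypothetical, reads and writes are treated alike, and the default parameters (64 Kbytes, 4 byte lines, hence 16K tags) are taken from the iPSC/2 cache described in the text.

```python
class DirectMappedCache:
    """Tag-only model of a direct mapped cache; no data is stored."""

    def __init__(self, size_bytes=64 * 1024, line_bytes=4):
        self.line_bytes = line_bytes
        self.num_lines = size_bytes // line_bytes  # 16K lines by default
        self.tags = [None] * self.num_lines        # one tag per cache line
        self.hits = 0
        self.accesses = 0

    def access(self, address):
        """Record one reference; return True on a hit, False on a miss."""
        line = address // self.line_bytes
        index = line % self.num_lines   # which cache line the address maps to
        tag = line // self.num_lines    # identifies the memory block held there
        self.accesses += 1
        hit = self.tags[index] == tag
        if hit:
            self.hits += 1
        else:
            self.tags[index] = tag      # replace whatever was in this line
        return hit

    def hit_ratio(self):
        return self.hits / self.accesses if self.accesses else 0.0
```

Feeding each generated address to `access` as it is produced gives a hit ratio without the trace ever being stored, which is the usage pattern the paper describes. A set-associative model would need a replacement policy and more work per reference, consistent with the higher simulation overhead noted above.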
Also of interest is the execution overhead when traces are being saved to disk. For this experiment, each hypercube node was assigned a specific disk output file for trace address storage. On the iPSC/2, all communication to the host machine (which is the only processor with direct access to the disk) must pass through a single hypercube node, called node 0. Thus node 0 and the host machine provide a substantial I/O bottleneck as the number of nodes increases. The disk space on our system was limited, so we simulated the writing of these files by periodically reverting to the beginning of the file before writing the trace information. This procedure still requires the same node communication as is required for full trace storage, thus the location of disk writes is the only factor altered. Figure 8 shows the results of this experiment. It is obvious that storing the traces from the iPSC/2 is very costly, and this cost increases substantially as the number of nodes in the hypercube increases. The size of a block of addresses influences this overhead. Larger block sizes appear to be inefficient for a small number of nodes, but become more attractive as the number of nodes in the iPSC/2 increases. For 16 nodes, saving the trace addresses to disk is about 2 orders of magnitude slower than generating and analyzing these addresses concurrently.
Cache Data from Multicomputer Traces
Although the purpose of this paper is not the analysis of multicomputer cache performance, a small sample of memory access and cache hit ratio data will be presented to discuss some multicomputer issues different than issues found in single processor studies. One such issue is the variance in the number of memory accesses and cache hits among different nodes of a multicomputer. Another example issue is how the number of memory accesses and the cache hit ratio change with the number of nodes for a given problem. More extensive cache simulation studies have been performed by the authors using TRAPEDS data [18] .
To illustrate variance in the number of memory accesses, Figure 9 shows the total number of text reads, data writes, and data reads for the simplex algorithm on a 4 node hypercube. This data was captured by X-bb-perf as it generated the addresses.
Figure 10 shows the dependence of cache hit ratio on the number of nodes for the simplex algorithm with a 64 Kbyte direct mapped cache with a 4 byte line size. This data was captured by a cache model called by X-bb-perf after it generated each address. For a very small linear optimization problem (afiro), the cache hit ratio is highest for a single processor, and monotonically decreases as the number of processors increases. This is because the code and data fit easily into the 64 Kbyte cache, and the number of additional references to cached code and data is relatively small. Thus, as the number of processors increases, fewer instructions and data residing in the cache are reused, and the cache hit ratio decreases.
The other curves in Figure 10 represent hit ratios for larger problems that do not fit easily into a 64 Kbyte uniprocessor cache. For these problems, as the number of nodes increases, a larger percentage of the "working set" of code and data can fit in the cache. For the largest problem shown, share1b, cache hit ratio improves substantially as the number of nodes is increased from 1 to 4. A slight improvement is seen for 8 nodes, and then a decrease occurs for 16 nodes. For share2b, substantial improvement is seen changing from 1 node to 2 node execution, but the hit ratio alternately rises and falls for larger hypercubes. This type of behavior may suggest that after the working set is primarily contained in the cache, unpredictable factors such as the properties of the data distribution may have an important influence on the hit ratio.
CONCLUSIONS
This paper presents TRAPEDS, a new method of producing address traces. This method analyzes the program at the assembly language level to create modified executable files that produce the address traces. The modified executable files run less than a factor of 50 slower than the original executable files, which compares favorably with existing software trace gathering approaches. Benchmark studies show that the execution overhead of the TRAPEDS method decreases as the number of processors traced increases.
Current drawbacks of the TRAPEDS approach include no traces of operating system addresses, no current ability to collect multiprogrammed traces, and slower execution than with hardware trace capture. Only virtual addresses of user code are captured with TRAPEDS.
The TRAPEDS method has particular advantages for multicomputer systems. The problems of I/O and storage for trace generation and trace usage for multicomputers are resolved by analyzing in parallel the generated addresses during the collection process. 
