This paper presents novel sampling-based techniques for collecting statistical profiles of register contents, data values, and other information associated with instructions, such as memory latencies. Values of interest are sampled in response to periodic interrupts. The resulting value profiles can be analyzed by programmers and optimizers to improve the performance of production uniprocessor and multiprocessor systems.
INTRODUCTION
Hardware-based value prediction mechanisms were originally proposed by Lipasti and Shen [13] to reduce pipeline delays for long-latency operations. Simulations indicated a surprising amount of locality in the values computed by instructions, allowing some result values to be predicted accurately based on prior executions of the same instruction.
Software-based value profiling was first investigated by Calder, Feller, and Eustace [4, 5, 9]. A value profiler records values generated by the instructions in a program, and maintains statistics about the observed values. For example, a value profiler might report that, 53% of the time, the instruction at PC 0x2468 generates the result value 0, and the rest of the time its result value is 1. There are several possible approaches to implementing a value profiler. A binary-rewriting tool can be used to instrument a program, adding code to capture the results generated by instructions; Calder et al. used ATOM [16] to instrument binaries. Alternatively, a machine simulator or emulator can be modified to record values of interest during simulation. This was the approach used in various architectural studies of value prediction. Finally, timer-based interrupts can be employed to periodically sample values as a program executes. We pursued this last technique, which we refer to as value sampling when we wish to distinguish it from the other approaches.
We generalize the traditional notion of value profiling by allowing users to capture a wide variety of values associated with the execution of the code. For example, in addition to recording values generated by the program being profiled, we might also collect timing information (e.g., this load took 20ns), as well as state not directly visible to the running program (e.g., this load hit in the second-level cache; the physical address accessed by this store was 0x561c).
Value profiling has a number of practical uses. It can provide data for evaluating proposed hardware features [13]. Value profiles also provide feedback that can help focus manual tuning or drive automated optimizations [5]. Value profiling can also be used in debugging, although we currently have little experience with this application. Several code optimizations are enabled when a value profile reveals places where values are invariant (or semi-invariant [4]); that is, places where some variable or register (almost) always contains the same value. Such optimizations include:
Prefetching: a value profile can reveal which addresses are accessed, and identify absolute addresses or relative offsets that are highly predictable.
Specialization: a value profile can identify common values of procedure arguments, allowing significantly better code generation. For example, at a given call site, the log() routine may always be called with the argument 1.0, which admits a particularly fast implementation. Similarly, virtual method calls in object-oriented languages can be specialized for their most common receiver classes.
Speculation: a value profile can expose opportunities for software speculation, allowing predicted values to be used for dependent instructions while the actual values resulting from long-latency operations are still being computed. Such optimizations might be particularly effective on architectures that support predicated execution, such as IA-64.
Value profiles can highlight the reasons why a piece of code is performing poorly, allowing tuning effort to be focused more effectively. For example, by revealing load latency information, a value profile might lead a programmer to realize that a data structure is being shared between processors in an inefficient way.
Our value sampling system extends the Digital Continuous Profiling Infrastructure (DCPI) [2], which we briefly review here. DCPI is a profiler based on statistical sampling, combined with a set of profile analysis tools. DCPI uses frequent randomized periodic interrupts to obtain samples across almost all code running on the machine, including the operating system kernel. Each DCPI sample contains a PC and address space identifier, and may optionally include information about other events (such as cache misses or branch mispredictions) depending on the specific processor implementation [2, 7]. A device driver aggregates samples and passes them on to a user-space daemon process. The daemon uses information from the dynamic linker and the operating system to map address space identifiers to object files (executables and libraries) in the file system, and stores the samples in files grouped according to the object files they refer to. Analysis tools use the samples in various ways, from providing traditional CPU time profiles of procedures, to inferring the reasons for dynamic pipeline stalls at individual instructions. Careful implementation yields an overall overhead of a few percent, despite a sampling interval of about 64 thousand instructions.
By building on DCPI, we inherit its overall structure of a kernel device driver, a user-space daemon, and analysis tools that access profiles via the file system. We also inherit a number of DCPI's advantages:
Efficiency: Periodic sampling can have dramatically less overhead than value profiling schemes based on binary modification or interpretation. When we apply value sampling to all address spaces, we see overheads around 10%, using the same sampling intervals normally used by DCPI. This overhead compares favorably with the order-of-magnitude slowdowns reported for value profiling systems based on binary instrumentation.
Completeness: We are able to apply value sampling to the operating system kernel, and other privileged address spaces that would be difficult to handle by other means.
Transparency: Programs are slowed down slightly, but are otherwise unaffected by being profiled. There is no danger of unexpected interactions arising from the use of per-address-space resources (e.g., virtual addresses, or file descriptors).
Similarly, we inherit DCPI's primary disadvantage: it is a sampling-based approach, and so cannot capture all values observed in a run of a program. However, in practice we have not found this to be a problem.
In the following sections, we provide details of our implementation, our experiences using it, and what we believe we have learned.
OUR APPROACH
Our value sampling system augments DCPI's organization in three key ways. First, the interrupt routine captures values from the interrupted program, in addition to the usual PC samples and event records. Second, to limit the space needed to hold value samples, we employ further sampling techniques described by Gibbons and Matias [12]. These allow us to efficiently maintain hotlists containing the most frequently seen values at each PC, using constant space per hotlist. Finally, we have developed additional analysis tools to process the value samples. For example, the user can display values observed together with their associated assembly-language instructions and higher-level source statements. Other tools automatically find semi-invariant values in code that is being executed a significant number of times.
Gathering Value Samples
We use the performance counters available on Alpha processors to interrupt each running CPU periodically. At each interrupt, we record values from the current context. Typically, the sampling interval is 64 thousand instructions, though a small amount of randomization is added to avoid unwanted timing interactions.
The first question in such a system is how to obtain data values from an interrupted context. Without knowledge of the path followed by the processor's PC just prior to the interrupt, one cannot trivially associate the values in the registers with particular instructions.
An Early Attempt
Our first attempt at solving this problem was a "bounce back" technique that arranges for a second performance counter interrupt to occur after a small number of instructions (such as one issue block) has been executed immediately after resuming the original interrupted code. During the first "setup" interrupt, the return PC and other instructions in its issue block are fetched and recorded to determine which registers will contain values of interest. During the second "bounce back" interrupt, the values of interest (register values, return address, etc.) are captured and recorded.
Ensuring that exactly one issue block is executed between the two interrupts proved fairly difficult because a large number of kernel instructions are executed in the interrupt return path. We were assisted by a feature of the Alpha 21164 CPU, which can generate an interrupt after a specified number of cycles in user mode. We were able to make the delivery of this interrupt fairly predictable by evicting the i-cache line containing the issue block of interest, and taking the i-cache fill time into account. Nevertheless, we would sometimes observe that no progress had been made in user mode before the second interrupt was delivered. In such a case, we would increase the number of user-mode cycles that would trigger the "bounce back" interrupt. If too much progress was made, we would give up our attempt to collect data at this interrupt; this happened on a few percent of interrupts. A rarer problem was that the amount of progress made between two interrupts was sometimes ambiguous because of tight loops in the interrupted code.
Although we successfully prototyped the "bounce back" mechanism, it worked only on user-mode code and only with some Alpha processors (the 21164 family). In the light of these limitations, we sought an alternative.
Using An Interpreter
Ultimately, we added a complete interpreter for the Alpha instruction set to the DCPI kernel module. The interrupt routine interprets the next several instructions, advancing the interrupted context as though those instructions had been executed directly by the processor. Though conceptually simple, this approach raises some practical concerns.
First, the interpreter must be reliable and reasonably complete. Although the interpreter can give up if it should encounter an instruction it cannot handle, it is important that such instructions are rare or the profiling will have significant blind spots. Thus we handle the entire instruction set, and rigorous testing was used to gain confidence in the interpreter.
One might think that we could have run the interpreter without having side effects on the interrupted context, and this would relax the need for correctness in the interpreter. An error in the interpreter might produce erroneous value profiles, but would not affect the profiled program. We dismissed this approach because we wished to apply value sampling to the operating system kernel, which performs loads on device registers that may have side effects.
No matter how complete the interpreter, there are still coverage limitations. We are unable to apply it to code where no interrupts are permitted, such as Alpha's PAL code and certain small parts of the kernel. Some operations cannot easily be emulated by the interpreter because it is running at high interrupt level. In particular, the interpreter gives up when it encounters any of the following:
traps, such as page faults, that cannot be handled at high interrupt levels;
a change to the interrupt level; and
a change to the kernel stack pointer (the interpreter is using the same stack).

The interpreter provides flexibility not available through the earlier "bounce back" scheme. In particular, the interpreter can be modified to record timing information for individual long-running instructions such as loads; this will be discussed in Section 3.3. Similarly, the interpreter could record other system state, such as page table contents, or the interrupt level in the interrupted context. Or it can be modified to simulate some internal state of a particular processor in order to deduce where the processor might perform poorly. Quite complex analysis can be performed in the interrupt routine, provided that time-critical interrupts are not masked for too long.
User-Mode Interpretation
We also support an alternative means for invoking the interpreter, which has a different set of advantages and disadvantages. Instead of running the interpreter in the interrupt routine, we are able to run it as a user-mode library in the profiled address space, using an upcall mechanism.
When the address space is created, the dynamic linker loads a value-profiling shared library along with the application. The library registers the address space with the profiling driver. At each profiling interrupt, the driver revectors the user-mode context to the library's user-mode trap handler that runs the interpreter, logs the data obtained, and finally returns control to the interrupted context. This is similar to the intended use of the SIGPROF signal in some Unix systems.
The user-mode approach has different practical implications:
The address space is being disturbed in ways other than timing: a new shared library is being loaded, and new code is being run on the user-mode thread stacks.
The interpreter does not run at high interrupt level, so there is no limit on the amount of time that can be spent in the interpreter.
Page faults encountered by the interpreter will be resolved by the operating system in the normal way, so interpretation will not cease at page faults.
Some values available in the kernel, such as physical addresses, will not be available directly to user mode.
Similarly, some data may be easier to obtain in user mode, such as data revealed from a stack trace of the interrupted context.
The correctness of the interpreter affects only a single address space, so in principle users could modify the interpreter to collect specialized information.
The user-mode approach makes it straightforward to perform value sampling in interpreted languages.

We expect that some users will prefer to run the value-profiling interpreter in user mode, while others will want to run it in the kernel.
Data Reduction
Given a basic mechanism for capturing values, a second problem is that of data reduction. The number of values observed at any given point in the program might be very large, far too large to store conveniently. Calder, Feller, and Eustace [4, 5] employed a small table to hold the most frequently seen values. However, their ad hoc update policy required tuning to get good results.
We used Gibbons and Matias' techniques [12] for summarizing a stream of data. These techniques provide a statistically sound basis for keeping a list of the most frequently seen values in a stream of values; their main advantage over an ad hoc scheme is that no tuning is required, and they use less memory for a given result quality.
We briefly describe the simplest scheme for keeping track of the top $N$ most frequently seen values in a data stream; for more details we recommend Gibbons and Matias' paper. Conceptually, the algorithm keeps a probability $p$ and a table $C$ that maps each possible value $v$ to a counter $C[v]$. The algorithm maintains the invariant that $C[v]/p$ is an unbiased estimate of the number of times $v$ has been seen in the data stream. Let $NZ(C)$ be the number of non-zero counters in $C$. Initially, $p = 1$ and $\forall v : C[v] = 0$, so $NZ(C) = 0$. The table $C$ has space for at most $N$ values with non-zero counters; that is, $NZ(C) \le N$. For each $v$ in the data stream, one is added to counter $C[v]$ with probability $p$. If that causes $NZ(C)$ temporarily to exceed $N$, the following operation is repeated until $NZ(C) \le N$ once more: for some arbitrary value $f > 1$, $p$ is reduced to $p/f$, and each value instance recorded in $C$ is retained with probability $1/f$. That is, each non-zero $C[v]$ is replaced by the number of heads seen when tossing $C[v]$ biased coins, where the probability of heads is $1/f$. A typical value for $f$ is $N/(N-1)$.
We chose to keep track of the 16 most frequently seen values captured at each program location. That is, we run one instance of Gibbons and Matias' algorithm with $N = 16$ and $f = 16/15$ for each value type captured at each program location.
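The hotlist update described above can be sketched in a few lines of Python. This is only an illustration of the scheme as stated in the text, not the code used in the DCPI driver or daemon; the class name `Hotlist`, its method names, and the use of a dictionary to hold the non-zero counters are assumptions made for the example.

```python
import random


class Hotlist:
    """Illustrative sketch of the top-N sampling scheme (hypothetical names)."""

    def __init__(self, capacity=16, rng=None):
        if capacity < 2:
            raise ValueError("capacity must be at least 2")
        self.capacity = capacity            # N in the text
        self.f = capacity / (capacity - 1)  # typical choice f = N/(N-1)
        self.p = 1.0                        # current sampling probability
        self.counts = {}                    # holds only the non-zero counters C[v]
        self.rng = rng or random.Random()

    def add(self, value):
        # Count this occurrence with probability p.
        if self.rng.random() < self.p:
            self.counts[value] = self.counts.get(value, 0) + 1
        # If more than N counters are non-zero, thin until the table fits again.
        while len(self.counts) > self.capacity:
            self._thin()

    def _thin(self):
        # Reduce p to p/f and keep each recorded instance with probability 1/f.
        self.p /= self.f
        keep = 1.0 / self.f
        thinned = {}
        for value, count in self.counts.items():
            heads = sum(1 for _ in range(count) if self.rng.random() < keep)
            if heads:
                thinned[value] = heads
        self.counts = thinned

    def estimates(self):
        # C[v]/p estimates how often v occurred; most frequent first.
        pairs = ((v, c / self.p) for v, c in self.counts.items())
        return sorted(pairs, key=lambda vc: vc[1], reverse=True)
```

While fewer than $N$ distinct values have been seen, $p$ stays at 1 and the counts are exact; once the table overflows, the reported frequencies become estimates.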
Interesting Values
The value sampling system could capture many different values associated with the interrupted context, beyond those generated directly by the programmed instructions. We have implemented a few:
Stack context information, such as the current procedure's return address.
Latencies for long-running instructions, such as memory accesses. This is measured when the instruction is interpreted by surrounding the operation with reads of a cycle counter.

Other possibilities are:
Processor or hardware state, such as the current physical processor or processor set, physical addresses associated with memory accesses, and the processor interrupt priority level. Similarly, the interpreter can simulate execution of instructions for a processor architecture or memory system that does not exist, and capture relevant internal state.
OS or runtime system state, including various identifiers (current process, parent process, user, group, and controlling tty), privilege level (e.g., effective user), the set of pending or blocked Unix signals, and the current scheduling priority and policy.
Application state, such as whether or not the current thread holds certain locks.

One of the most useful values to capture in conjunction with other values is the return address of the current procedure. This allows the value sampling system to identify values that are mostly invariant by call site.
To obtain the return addresses we take the simple approach of logging two values: the value in the return address register, and the value at the top of the stack. Because of the conventions followed by compilers that generate Alpha code, the return address is almost always to be found in one of these two places. Downstream analysis tools can deduce which, if either, of the two values is valid using the stack unwinding information present in the object file.
Customized Value Profiling
Our system can be customized in various ways. In particular, a user may specify what information to capture, how to transform it into value samples that are merged into the profile database, and how to format values for reporting.
To do this, the user writes a dynamically loadable customization module, which is loaded by DCPI's user-mode daemon. Via this "plug-in" module, users may specify what should be captured by the interpreter for each instruction opcode. One option is to capture nothing for particular opcodes, but usually some basic information is collected, including the PC and the 32-bit instruction code. In addition, users may opt to record one or more of the following: the content of an explicitly named register, the instruction's operand or result, a memory operand's virtual address, and the latency of loads. For each instruction, the captured values form a value tuple. Thus, each time the interpreter runs, it generates a tuple sequence for the interpreted instruction sequence.
Tuple sequences generated in the interrupt handler are later processed by the user-mode daemon. For each sequence, the daemon calls a routine in the customization module to transform it into PC-value pairs that are merged into the profile database after data reduction. This transformation can be arbitrarily complex. For example, the daemon may transform value tuples consisting of the PC and operand address of load and store instructions into PC-value pairs (p, v) where v is the PC of another instruction accessing the same address as the instruction at p. (This is the idea behind the application in Section 3.2.) Our current implementation maintains only one value hotlist for each PC. It may be extended to maintain hotlists for different kinds of values (e.g., data address and latency of a load) or for composite values (e.g., address-latency pairs).
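As an illustration of such a transformation, the following Python sketch turns a tuple sequence of (PC, operand address) entries into PC-value pairs (p, v) in the manner of the example above. It is a hypothetical stand-in written for this discussion: the real customization modules are loaded by the DCPI daemon, and the function name `pair_by_address` and the tuple layout are assumptions.

```python
def pair_by_address(tuple_sequence):
    """Map (pc, address) tuples to (p, v) pairs.

    tuple_sequence: (pc, address) entries for the loads and stores of one
    interpreted instruction sequence, in interpretation order.
    Returns pairs (p, v) where v is the PC of the most recent earlier
    instruction in the sequence that accessed the same address as p.
    """
    last_pc_for_address = {}
    pairs = []
    for pc, address in tuple_sequence:
        previous_pc = last_pc_for_address.get(address)
        if previous_pc is not None and previous_pc != pc:
            pairs.append((pc, previous_pc))
        last_pc_for_address[address] = pc
    return pairs
```

For instance, a sequence in which the instructions at 0x100 and 0x108 both touch address 0x8000 yields the single pair (0x108, 0x100), which would then be merged into the hotlist for PC 0x108.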
We modified the analysis tool dcpilist to report the most frequent values associated with each instruction in a format specified by the customization module. For example, the operands of floating-point instructions can be printed as floating-point numbers, rather than the default hexadecimal format.
We have written several customization modules, such as a module for capturing load latencies, as discussed in Section 3.3.
EXPERIENCE
We have not used feedback from the value sampling system to direct automatic optimizations performed by the compiler. Nevertheless, we do have experience using it to highlight performance problems that programmers might then be able to address. Below we discuss some uses we expected, and some we did not.
Expected Uses
When collecting register values, we expected our system to provide information similar to that obtained from previous value profiling systems. We had no reason to believe that the quality of the data would be significantly better or worse than that obtained from those systems, though we might claim that our system is easier to use. We repeated the experiments of others only to verify that our value profiles agreed with prior work. We also looked at other programs to demonstrate that our profiles did give useful hints to programmers bent on optimization. We give only a few brief examples below.
Leveraging the ability of DCPI to pinpoint performance bottlenecks, our tools direct the programmer to places that both consume a significant amount of time and contain semi-invariant values. We showed that these tools made it straightforward for a programmer to rediscover specialization opportunities, such as those found by Calder et al. [4] in mk88sim.
We also found opportunities for specialization in the ray tracer povray when working on particular test images. An exponentiation routine was already specialized for a few integer exponents, but not the most common one. Adding an extra case yielded a 20% overall speedup. Similarly, specializing the routine buildsturm for degree 4 polynomials yielded a large improvement.
A smaller optimization opportunity was found in gzip, where a 2% speedup was achieved by noticing that a constant was being read repeatedly from a global variable.
Identifying Replay Traps
The Alpha 21264 processor attempts to execute memory access instructions as soon as possible, even if that means executing them out of order (that is, in an order other than program order). Part way through the processor pipeline, a load or store may exceed some resource limit, or an architectural constraint on instruction ordering may be encountered which prevents the immediate issue of the instruction. In this case, the 21264 performs a replay trap, which aborts the instruction and all instructions that follow it in program order, and replays them from the fetch stage of the pipeline. Further details about replay traps can be found in the 21264 reference manual [1].
Replay traps are quite expensive, and a programmer might care to know whether such an event is occurring in an inner loop. We have observed a few unusual programs in which the chip spends over half its time recovering from them. More commonly, one might expect to improve performance by a few percent by having good information about the causes of replay traps.
Some replay traps were of particular interest to us, because the chip's ProfileMe performance counters [7] do not provide all the information that one would wish for. The interesting replay trap types are:
Order: When a load issues out-of-order before a store that accesses the same bytes, the load must be replayed to ensure that it fetches the stored bytes.
Size: When a load follows a narrower store that accesses some of the same bytes, the load is replayed until the store has been merged with the other bytes.
Synonym: If two off-chip memory accesses use addresses with the same cache index (e.g., are congruent modulo 32K), one is replayed to avoid displacing data in the cache needed by the other.

In all these cases, a pair of memory accesses is involved. Even given one instruction of the pair, it can be difficult to identify the other simply by looking at the program text. For example, we have encountered an inner loop where a synonym trap was caused by a load from a global variable interacting with a load from a stack location.
Starting with the value sampling interpreter, we built a mechanism called vreplay to assist in these cases. The chip's ProfileMe hardware identifies the PC of one of the pair of memory accesses as one that incurred a large number of replay traps. The vreplay code identifies the likely PC of the other memory access, and which type of replay trap was involved.
The vreplay mechanism works by interpreting runs of instructions to detect accesses that may potentially conflict. Interpretation runs need to be long enough to include both instructions in a pair that cause a replay trap. On the 21264, the distance is bounded by the maximum number of instructions in flight (80), except for traps involving two loads; loads can retire before the data is back from memory.
Without expensive simulation, there is no way for the interpreter to know whether a replay trap would have really happened. However, combining data from the interpreter with data from ProfileMe samples ensures that the user's attention is directed only to instruction pairs that are in fact causing replay traps.
To eliminate some false alarms where accesses potentially conflict, the interpreter also tracks data dependencies between interpreted instructions. If there is a data dependence between two instructions that access memory, there can be no replay trap.
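The pairwise check can be sketched as follows. This is a simplified illustration of the three trap types as they are described above, not the actual vreplay logic: the `Access` record, the function names, the 32K index span taken from the example in the text, and the rule that a narrower store signals a size trap rather than an order trap are all assumptions made for the example.

```python
from dataclasses import dataclass

CACHE_INDEX_SPAN = 32 * 1024  # "congruent modulo 32K", as in the text's example


@dataclass(frozen=True)
class Access:
    pc: int         # PC of the memory instruction
    is_store: bool  # True for a store, False for a load
    addr: int       # virtual address of the first byte accessed
    size: int       # number of bytes accessed


def overlaps(a, b):
    """True if the byte ranges of the two accesses intersect."""
    return a.addr < b.addr + b.size and b.addr < a.addr + a.size


def classify_pair(earlier, later, data_dependent=False):
    """Guess which replay trap, if any, a pair of accesses might cause.

    earlier precedes later in program order. Returns "order", "size",
    "synonym", or None. A data dependence between the two instructions
    rules out a trap, as stated in the text.
    """
    if data_dependent:
        return None
    if earlier.is_store and not later.is_store and overlaps(earlier, later):
        # A load after a narrower overlapping store: size; otherwise: order.
        return "size" if earlier.size < later.size else "order"
    same_index = earlier.addr % CACHE_INDEX_SPAN == later.addr % CACHE_INDEX_SPAN
    if same_index and earlier.addr != later.addr:
        return "synonym"
    return None
```

A classification produced this way is only a candidate; as noted above, it is the combination with ProfileMe samples that confirms which pairs really cause traps.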
On Tru64 Unix, the TLB shootdown mechanism uses interprocessor interrupts (IPIs) to remove a TLB entry from each TLB in a multiprocessor. If a processor does not respond to the IPI within a time bound, the operating system crashes. This bound imposed a limit on the number of instructions that our uninterruptible, kernel-mode interpreter could process, and provided additional motivation for our user-mode value interpreter.
Load Latency Measurements
The value sampling system can measure the latencies of loads by reading a cycle counter before and after each one. Often, more than sixteen different latencies are observed for each load. To simplify the report generated, the user typically assigns latencies to bins using the known times for cache hits in various parts of the memory hierarchy. We use automatic programs to determine such interesting thresholds experimentally.
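A minimal sketch of such binning is shown below. The cycle thresholds (8 and 40) are placeholders chosen for the example rather than measured values for any Alpha system, and the function names are hypothetical; the labels follow the report format used in this paper, where "D" is a primary cache hit, "B" a board-cache hit, and "M" a memory reference.

```python
import bisect
from collections import Counter


def bin_latency(cycles, thresholds=(8, 40), labels=("D", "B", "M")):
    """Assign a measured load latency (in cycles) to a bin.

    thresholds must be sorted; a latency below thresholds[0] gets
    labels[0], one below thresholds[1] gets labels[1], and so on.
    The default thresholds are illustrative assumptions only.
    """
    return labels[bisect.bisect_right(thresholds, cycles)]


def binned_profile(latencies, **kwargs):
    """Count how many sampled latencies fall into each bin."""
    return Counter(bin_latency(c, **kwargs) for c in latencies)
```

Binning before the samples reach the hotlist keeps the number of distinct values per load well under the hotlist capacity.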
A concern when recording load latencies is that our system might disturb the measurements so much as to make them worthless. We measured how much our system perturbs the primary data cache by creating a program that repeatedly touches each block in the cache, where a block is a <cache line, cache set> coordinate. The program uses a separate load instruction for each block. In an ideal world without interrupts, none of these loads would miss in the cache. The results are shown in Table 1.
Miss rate    Fraction of cache blocks
0%           88%
1-20%        4%
21-40%       2%
41-60%       2%
61-80%       1%
81-100%      3%

Table 1: Fraction of primary cache blocks experiencing various miss rates due to perturbation by value profiling.
The key point to notice in Table 1 is that about 90% of the cache blocks are never evicted by the value profiling system, while a small number are almost always evicted. The fraction of blocks evicted from the second-level cache is lower due to its larger size. We expect these results could be improved by carefully tuning our interpreter to minimize the number of cache blocks touched.

Figure 1 is an example of a load latency value profile from a floating-point benchmark. The vtot column is the number of value samples for the instruction; the thld column is the probability of each value sample being added to the hotlist; the nv column is the length of the hotlist; and the latencies column is the hotlist of binned load latencies, where "D" means a primary cache hit, "B" means a secondary (board-cache) hit, and "M" means a memory reference. The retdelay column comes from ProfileMe data and indicates the average number of cycles that the instruction stalled the CPU. The mult instruction is an obvious bottleneck and is suffering from a cache miss that consumes more than 15% of all cycles in the benchmark. Because both operands (f11 and f17) of the mult are the first uses of loads, the load latency profiles are essential to tell whether the first, second, or both loads are missing. From the latency profile, it is clear that the first load usually hits and the second load usually goes out to memory.
Overhead Measurements
To assess the cost of value profiling, we measured how much it slowed down the CPU2000 benchmark suite on a 500 MHz Alpha 21264 machine. DCPI interrupts were generated on average every 62K instructions. Without value profiling (no vprof), the overhead is less than 4%. With basic value profiling (vprof), the interrupt handler interprets 4 instructions in one out of every two interrupts. The slowdown is nontrivial at about 10% but still much lower than that of instrumentation. This slowdown includes the effect of all DCPI-related work: value and traditional profiling, driver and daemon processing. Although vreplay requires more complex processing than vprof, it costs less for two reasons. First, the handler interprets 128 instructions at a time (versus 4 for vprof) but compensates for that high cost by doing it only once every 128 interrupts. Second, for most instructions, the driver emits no value samples to the daemon because there are no conflicting instructions to report, while in vprof it produces one sample for every interpreted instruction. For both vprof and vreplay, recording the return address information discussed in Section 2.3 (the "context") imposes a small extra cost as expected.
Tables 2(b) and (c) illustrate how we can manage the overhead by balancing how often to run the interpreter (interpret frequency, indicated as once every n interrupts) and how many instructions to interpret each time (interpret length). Table 2(b) shows the slowdown for different interpret lengths when the interpreter runs half the time. The slowdown increases approximately at the rate of 1% per instruction interpreted. Table 2(c) shows the slowdown for different cases that all lead to the same average number of interpreted instructions per interrupt. The overhead is roughly the same in each case but declines slightly if the interpreter is run less often because the per-interrupt cost is amortized over more instructions. Thus, we can increase the interpret length and keep overhead acceptable by interpreting less often. This is important because, in order to study interaction between instructions (as in vreplay), we may need to interpret a relatively long instruction sequence before getting any useful data at all.
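The trade-off between interpret length and interpret frequency comes down to simple arithmetic, sketched here for the two configurations quoted in the text. The function name is hypothetical, and the sketch computes only the average interpretation work per interrupt; it does not model the measured slowdowns.

```python
from fractions import Fraction


def avg_interpreted_per_interrupt(interpret_length, every_n_interrupts):
    """Average number of instructions interpreted per profiling interrupt
    when interpret_length instructions are interpreted once every
    every_n_interrupts interrupts."""
    return Fraction(interpret_length, every_n_interrupts)


# vprof as described: 4 instructions in one out of every two interrupts.
vprof_load = avg_interpreted_per_interrupt(4, 2)
# vreplay as described: 128 instructions once every 128 interrupts.
vreplay_load = avg_interpreted_per_interrupt(128, 128)
```

Under these figures vprof interprets an average of two instructions per interrupt and vreplay one, which is consistent with the first reason given for vreplay's lower cost.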
RELATED WORK
Our value sampling work was primarily influenced by prior research on hardware mechanisms for value prediction and software techniques for value profiling. We were also motivated by growing interest in static and dynamic optimizers capable of exploiting value profiles.
Lipasti and Shen first introduced the idea of value prediction [13], proposing hardware that attempts to predict the next result value computed by an instruction based on a cache of previous result values for the same instruction. Their studies revealed a surprising amount of temporal locality: nearly half of all instructions produced the same result value computed during their last execution. Several subsequent proposals have been made for improved hardware value predictors [11, 15, 14].
Gabbay and Mendelson explored the use of profiling techniques to identify instructions that exhibit a high degree of value locality [10]. They showed that hardware value misprediction rates could be reduced by tagging the opcodes of predictable instructions, marking them as candidates for hardware prediction. Our low-overhead value sampling techniques could be used to provide even more detailed information to such hardware predictors.
Calder, Feller, and Eustace were the first to investigate software-based techniques for value profiling [4, 5, 9]. They used the atom [16] binary-rewriting tool to instrument each executable to be profiled, adding code to keep track of the most frequently occurring values computed by each instruction. A table of the top N values was maintained for each instruction, limiting storage requirements. A heuristic replacement policy was used to maintain the top N values approximately. When the table was full, the least frequently encountered value was evicted; half of the table was also periodically cleared to avoid pathological behavior with certain value sequences. In contrast to this ad hoc approach, which required tuning for good results, our application of the Gibbons and Matias sampling algorithms [12] provides a sound statistical basis for maintaining such value table hotlists. Instrumentation-based approaches also impose substantial overhead on profiled programs; Calder et al. reported average slowdowns ranging from a factor of 3.8 to a factor of 33, depending on various parameters. Our sampling-based approach imposes dramatically less overhead, enabling transparent value profiling on production systems.
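To make the heuristic replacement policy described above concrete, the following is a minimal sketch of a top-N value table of that general kind. It is not the implementation of Calder et al.: the class name, the `clear_interval` parameter, and the choice to clear the lower-count half of the table are all assumptions made for illustration.

```python
class TopNValueTable:
    """Approximate top-N value table with a heuristic replacement policy.

    Assumptions (not from the original work): clearing happens every
    `clear_interval` recorded values and drops the lower-count half.
    """

    def __init__(self, capacity, clear_interval):
        self.capacity = capacity
        self.clear_interval = clear_interval
        self.counts = {}   # value -> number of times observed while resident
        self.seen = 0      # total values recorded so far

    def record(self, value):
        self.seen += 1
        if value in self.counts:
            self.counts[value] += 1
        else:
            if len(self.counts) >= self.capacity:
                # Table full: evict the least frequently encountered value.
                victim = min(self.counts, key=self.counts.get)
                del self.counts[victim]
            self.counts[value] = 1
        if self.seen % self.clear_interval == 0:
            self._clear_lower_half()

    def _clear_lower_half(self):
        # Keep only the higher-count half so newer values can gain a foothold.
        ranked = sorted(self.counts, key=self.counts.get, reverse=True)
        keep = ranked[: self.capacity // 2]
        self.counts = {v: self.counts[v] for v in keep}

    def top(self):
        """Resident values with their counts, most frequent first."""
        return sorted(self.counts.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    table = TopNValueTable(capacity=4, clear_interval=1000)
    for v in [0] * 53 + [1] * 47:
        table.record(v)
    print(table.top())
```

Both `capacity` and `clear_interval` are knobs that must be chosen by hand in a scheme like this, which is the kind of tuning a statistically grounded sampling algorithm aims to avoid.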
Deaver, Gorton, and Rubin explored the use of limited value profile information for dynamic runtime code specialization [8]. Their Wiggins/Redstone optimizer identified hot spots using DCPI-based statistical profiling, and dynamically added instrumentation to frequently executed code to collect path and value information. Suitable traces were dynamically specialized and optimized as the program executed. Our user-mode value-sampling interpreter is an ideal match for such an optimizer.
Another approach aimed at transparent dynamic optimization was developed by Bala, Duesterwald, and Banerjia for their Dynamo system [3]. Instead of instrumentation, Dynamo relies on interpretation to observe program behavior without requiring modifications. As it interprets, Dynamo increments counters to identify hot instruction traces. Hot traces are selected for dynamic recompilation, which emits optimized code into a fragment cache. When the interpreter encounters a branch, it jumps to optimized native code in the fragment cache if the cache contains an entry for the branch target. Dynamo resumes interpretation when program execution leaves the fragment cache. Dynamo's use of limited interpretation has much in common with our own value sampling approach, although Dynamo does not collect value information, and its interpreter is not triggered by periodic interrupts.
There are many examples of systems that employ techniques that are essentially limited forms of value profiling. For example, run-time systems for languages such as Self [6] examine the types in use at call sites in order to replace indirect procedure calls with direct procedure calls and to pick subroutines specialized to those types.
FUTURE WORK
Many profile-driven optimizations could exploit value profiles. The usual example is specializing code sequences for frequently occurring values; another example is speculatively reducing the critical path of a high-latency computation by assuming it computes the most common values and then checking the assumption. On the Alpha, a simple but effective optimization would be to set the hint bits used to predict the target of an indirect jump based on the most common jump target.
The new types of values that our system can collect enable additional optimizations. Load latency value profiles could guide prefetching. The vreplay profiles, together with ProfileMe profiles, could be used to eliminate replay traps.
Our upcall handler could allow profile-driven optimizations to be done as the program is running, following the work of Deaver et al. [8]. Fixing jump hint bits and eliminating size and order replay traps are likely candidates because the required code analysis is local. Such optimizations are even more practical on a multiprocessor, where the optimization cost is amortized over many CPUs.
We see some bias in the distribution of interrupted PC locations despite the randomization of the interrupt period. This occurs because on modern processors the probability of an interrupt being delivered at a given PC depends not just on how often the instruction at that PC is executed, but also on other microarchitectural issues such as how often it causes a pipeline trap. Because our interpretation runs begin at the interrupted PC location, the distribution of value samples inherits this bias. For example, the loads in Figure 1 have significantly different numbers of value samples, despite being from the same basic block. We believe that we could eliminate this bias by interpreting a random number of instructions before sampling values.
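The intuition behind this proposed fix can be illustrated with a toy model. The sketch below is not part of our system; it assumes a straight-line loop of n equally executed instructions with no branches, a skip count drawn uniformly from 0 to `max_skip`, and a made-up biased interrupt distribution. It computes exactly (using rational arithmetic) where value samples would land.

```python
from fractions import Fraction


def sample_distribution(interrupt_dist, max_skip):
    """Exact distribution of sampled PCs in a toy loop of len(interrupt_dist) slots.

    interrupt_dist[pc] is the probability that an interrupt lands on slot pc.
    Before sampling, a skip count is drawn uniformly from 0..max_skip and that
    many instructions are interpreted, wrapping around the loop.
    """
    n = len(interrupt_dist)
    out = [Fraction(0)] * n
    for pc, p in enumerate(interrupt_dist):
        for skip in range(max_skip + 1):
            out[(pc + skip) % n] += Fraction(p) / (max_skip + 1)
    return out


# Made-up example: slot 0 attracts 70% of interrupts, e.g. because it traps often.
biased = [Fraction(7, 10), Fraction(1, 10), Fraction(1, 10), Fraction(1, 10)]

if __name__ == "__main__":
    print(sample_distribution(biased, 0))  # no skip: samples inherit the bias
    print(sample_distribution(biased, 3))  # skip uniform over the loop length
```

In this idealized model, skipping a uniformly random number of instructions spanning the whole loop makes every slot equally likely to be sampled. Real code has branches and varying execution frequencies, so the model only suggests why the approach is plausible.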
CONCLUSION
We have presented a promising system for value sampling. We believe that it makes it more convenient to collect value profiles than previous approaches. We have also experimented with new types of values that can be collected. In the remainder of this section we discuss what we felt went well or badly in our design.
The interpreter was a success. Our fears that it might be difficult to make it sufficiently reliable proved groundless. This contrasts with the "bounce-back" technique that we used before we introduced the interpreter. Although "bounce-back" involved much less code than the interpreter, it was tied to a particular hardware type and harder to implement correctly.
Some issues remain with the interpreter. The main one is that, when using the interpreter at interrupt level to diagnose replay traps, long interpretation runs can cause the operating system to crash, as described in Section 3.2. The use of user-space upcalls may be the answer to this problem. A minor irritation is that there are a few things that we cannot interpret, such as instructions that cause operating system traps, and instructions that modify the kernel stack pointer.
Gibbons and Matias' algorithm for maintaining hotlists simplified things. Employing a well-founded algorithm saved time that we might otherwise have spent in tuning and experimenting with more ad hoc approaches. Our use of user-level upcalls for profiling shows promise, but we need more experience with it. We have already found that upcalls interact in interesting ways with unix signal handlers and exceptions. At present, we see no insurmountable problems.
