QProf--a scalable profiler for the Q back end by McLaren, Greg
QProf: A Scalable Profiler for the Q Back End
by
Greg McLaren
Submitted to the Department of Electrical Engineering and Computer Science
in partial fulfillment of the requirements for the degrees of
Master of Engineering in Computer Science and Engineering
and
Bachelor of Science in Electrical Engineering and Computer Science
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
May 1995
© Greg McLaren, MCMXCV. All rights reserved.
The author hereby grants to MIT permission to reproduce and distribute publicly
paper and electronic copies of this thesis document in whole or in part, and to grant
others the right to do so.
Author ....................
Department of Electrical Engineering and Computer Science
Certified by ....................
Certified by ....................
Alejandro Caro
Graduate Student of Electrical Engineering and Computer Science
Thesis Supervisor
Accepted by ....................
F. R. Morgenthaler
Chairman, Departmental Committee on Graduate Theses




QProf: A Scalable Profiler for the Q Back End
by
Greg McLaren
Submitted to the Department of Electrical Engineering and Computer Science
on March 17, 1995, in partial fulfillment of the
requirements for the degrees of
Master of Engineering in Computer Science and Engineering
and
Bachelor of Science in Electrical Engineering and Computer Science
Abstract
Current profiling tools available in the C environment rely either on program counter sampling or
instrumentation of every basic block to generate a performance profile based on actual or ideal
execution time, respectively. Program counter sampling (e.g. prof) yields only a coarse measure
of real time, and makes correct attribution of callee execution time to a caller difficult. However,
relying solely on ideal time, as do cycle-counting utilities like pixie, ignores delays caused by the
memory hierarchy and data-dependencies in execution pipelines. Qprof, a three-phase profiling
utility incorporated into the Q back end of the multi-threaded dataflow Id compiler, takes a step
beyond current schemes by uniting fine-grain real time measurement with ideal time estimates to
reveal memory hierarchy effects. Qprof's initial implementation generates live time and actual time
spent in each Id code block, actual time spent in each partition, and both actual and ideal times
spent in each basic block, taking advantage of the low-overhead timing facilities of the RS/6000
(or PowerPC) processor. Performance measurements are presented graphically by a post-processor
operating on files generated by the runtime system. Although Qprof incurs substantial overhead
in its initial implementation, groundwork is laid for reducing the overhead to a small fraction of
uninstrumented execution time by relocating real-time accumulation points and applying graph-
theoretic optimizations to reduce the number of instrumentation points. Qprof is easily scalable
to multiple processors with minimal modification, and should not, with the overhead reduction
schemes just mentioned, perturb parallel execution significantly.
Thesis Supervisor: Arvind
Title: Professor of Electrical Engineering and Computer Science
Thesis Supervisor: Alejandro Caro
Title: Graduate Student of Electrical Engineering and Computer Science
Acknowledgments




1.1 The Parallel Language Id
1.1.1 Parallelism and Storage in Id
1.1.2 Id's Low-Level Implementation
1.1.3 The Q Compiler Project
1.2 A Profiler for the Q Compiler




3 The Proposed Design
3.1 Instrumentation and Pipeline Simulation
3.2 PostProf
4 Detailed Design and Implementation
4.1 Ideal Time: Building a Simulator for the RS/6000
4.1.1 Approximating Ideal Time with Simulated Pipelines
4.1.2 CPU Models
4.1.3 Timing Experiments and the Detailed Design of the Simulator
4.2 Instrumentation: A Detailed View
4.2.1 Profiling Data Structures
4.2.2 Instrumenting a Partition




5 Evaluating Qprof
5.1 PostProf as a Performance Visualizing Tool
5.2 Shortcomings of QProf
6 Desired Improvements
6.1 Storing the Runtime Statistics More Efficiently
6.2 Separating Real Time Collection and Invocation Counting
6.3 Collecting Real Time at Coarser Intervals
6.4 Using Both the Fixed and Floating Point Units
6.5 Improvements for the Multiple-Processor Environment
6.6 Basic Block Invocation Count Optimizations
6.6.1 Adding Optimal Counters to a Basic Block Graph
7 Summary
A Appendix: The Simulator Code
B Appendix: Runtime System Profiling Support
C Appendix: The Instrumentor, the Extended Simulator, and ppc-testparse
D Appendix: PostProf




1-1 The runtime storage of a threaded implementation of Id
1-2 A Comparison of the Static Data Structures representing a Q *RISC module to the corresponding runtime structures
1-3 An Id program in Q will eventually be run on symmetric multiprocessors arranged in a fat tree
2-1 Gprof misrepresents the call-tree profile when two procedures make a fixed number of requests of unequal difficulty to a callee
3-1 The conceptual outlay of Qprof
3-2 A 3-d histogram
4-1 The Logical Organization of the RS/6000 CPU
4-2 The Logical Organization of the PowerPC 601
4-3 POWER assembly code for probing actual execution time
4-4 The RS/6000 Branch Unit
4-5 The RS/6000 Fixed-Point Unit
4-6 The RS/6000 Floating-Point Unit
4-7 The RS/6000 D-Cache Unit
4-8 A module-level diagram of a stand-alone version of the simulator
4-9 The QProf runtime data structures
4-10 A simplified view of the instrumentation header and trailer applied to each basic block
4-11 The exact instrumentation header and trailers applied to each basic block
4-12 An overview of how ppc-testparse was extended to comply with the diagram in Figure 3-1


















5-1 PostProf displays partition execution time for the partitions of fact.st's first code block
5-2 PostProf displays partition execution time for the partitions in fact.st's second code block
5-3 PostProf displays the actual and ideal execution times of the basic blocks in partition FACT.part0
5-4 PostProf displays the PowerPC instruction mix of partition FACT.part0
5-5 PostProf displays the *RISC instruction mix of partition FACT.part0
5-6 PostProf displays the time spent in C calls inside the code block TEST01
5-7 PostProf displays the live time and invocation count of each Id procedure (code block) called from within TEST01
5-8 Discrepancies between ideal and actual time in one of PostProf's performance snapshots of a partition
5-9 A plot of total real execution time as a function of the number of instructions added to a basic block
5-10 A similar plot, again showing the total real execution time as a function of the number of added instructions
6-1 A scheme for making variable the amount of profiler storage associated with CBDs




The driving goal behind the development of new parallel architectures is their potential for a sig-
nificant speedup over conventional sequential processors. However, parallel algorithms designed
without concern for the implementation architecture can often fail to achieve the desired speedup
due to the effects of the network (multicomputers), scheduling, and subtle memory hierarchy delays.
Freed from low-level concerns by a high-level parallel language, even an expert programmer can fail
to foresee dynamic inefficiencies in a particular program resulting from the interplay of scheduling
and the latency of split-phase transactions. Thus, if performance is to be maximized in a parallel
environment while preserving high-level abstractions, the programmer must be supplied with a tool
to diagnose runtime inefficiencies and guide him toward corrective action.
1.1 The Parallel Language Id
Qprof, the performance gauging tool developed to meet these needs, is designed around the
general-purpose parallel language Id, created by members of the Computation Structures Group in MIT's
Laboratory for Computer Science. Intended for use in programming dataflow and other parallel ma-
chines, Id, at its core, is a purely functional language with non-strict semantics much like the lambda
calculus. Layered over the referentially transparent core are Id's state-containing I-structures and
M-structures. While I-structures, which can only be defined once, break referential transparency,
M-structures, which can be modified at will, may even be non-deterministic. In certain applica-
tions, the expressive power inherent in the I- and M-structures is worth the loss of clean semantics
accompanying a break from the purely functional approach.
1.1.1 Parallelism and Storage in Id
An Id program, or module, at the highest level is a collection of Id procedures. Procedures, in turn,
can be broken down into Id code blocks, which are assigned to individual processors for execution;
an invocation of a given code block must execute on a single processor. Associated with each
invocation of a code block is a frame to provide for local storage, and at runtime a call tree of
invocation frames is constructed as shown in Figure 1-1. Within a frame live the active partitions,
which are subdivisions of a code block that execute atomically. Partitions are the fundamental
unit of scheduling in such an implementation of Id, and the set of active partitions of all frames
allocated on a given processor is that processor's scheduling pool. A partition can either be an inlet,
which is scheduled upon receipt of certain data values, or a thread, which is scheduled by an explicit
instruction from another partition. Important information associated with a code block, such as
its frame size, is stored in a permanent location called the code block descriptor (CBD). While a
procedure can have several activation frames if more than one invocation is live, it always has a
unique CBD. In addition to the tree of activation frames generated by Id's purely functional core,
also present in the diagram is the global heap, where I- and M-structures are stored. The global heap
is conceptually distributed across all processors in the system, and can be referenced by individual
partitions.
1.1.2 Id's Low-Level Implementation
As Id has evolved, it has been run both on simulated dataflow machines and, more recently, on
the custom Monsoon tagged dataflow machine. However, as high-performance, general-purpose
processor lines such as the POWER and PowerPC 1 have gained acceptance in industry and research,
it has become increasingly clear that to become a more useful and effective tool, Id must be compiled
down to a network of general-purpose processors rather than a custom architecture. As described
by Culler [3], the idea is to first compile Id down to a standard RISC-type language with network,
heap, and scheduling primitives, and then compile this "Portable RISC" code down to assembly level
on the target machine. By defining an interface specification, modular solutions can be developed
which, on the one hand, efficiently transform Id into "Portable RISC" without regard to hardware
issues, and on the other, efficiently implement "Portable RISC" on a given processor.
1 The PowerPC processor is a single-chip version of the POWER processor chip set designed for personal computer
use. The POWER and PowerPC instruction sets have many common members but are not identical.
[Diagram: a tree of activation frames, each frame containing partitions, with heap references from partitions into the global heap]
Figure 1-1: The runtime storage of a threaded implementation of Id. Procedure calls generate a tree
of activation frames, one frame per code block, and partitions running inside the frames may either
reference the local frame or the global heap.
1.1.3 The Q Compiler Project
The Q Project underway at MIT's Computation Structures Group is an effort to implement Id on
a network of conventional processors by compiling through just such a "Portable RISC" interface.
Conceptually, the Q compiler can be broken down into a five-part model. First, the front end
translates the original Id source code into a program graph. Next, the middle end builds a partitioned
program graph from the input graph. In the back end, the partitioned program graph, or PPG, is
mapped into *RISC (pronounced "star risk"), which is intended to function as the "Portable RISC"
interface language for various underlying conventional architectures. In Figure 1-2, some of the
data structures used to represent an Id program in *RISC are shown, and links are made to their
counterparts in the previous diagram. Note that in the current implementation of Q, a procedure
contains a single code block, so that the two terms are interchangeable.
The target hardware in the Q project consists of a fat tree of symmetric multiprocessors, each
composed of PowerPC processors, as depicted in Figure 1-3. The remaining two modules of the
Q compiler must therefore accomplish the translation of *RISC into PowerPC assembly code. To
modularize register allocation, this task is broken into two parts. Initially, the fourth module of
the Q compiler translates *RISC into "infinite register" PowerPC (PowerPC assembly code using
arbitrarily many registers), and then the last Q module performs register allocation on the "infinite
register" code, emitting actual PowerPC assembly code.

Figure 1-2: A Comparison of the Static Data Structures representing a Q *RISC module to the
corresponding runtime structures.

Figure 1-3: An Id program in Q will eventually be run on symmetric multiprocessors arranged in a
fat tree. Each multiprocessor will be composed of PowerPC processors.
1.2 A Profiler for the Q Compiler
Qprof, a profiling tool for the Q back end, was designed to help gauge the inefficiencies in an Id
program and guide the programmer to the source of the problem. As the Q project is currently
running only on a single-processor machine, the present implementation of Qprof is designed for
a single processor. However, a straightforward extension to multiple processors would allow a
per-processor, per-basic-block measurement of both the time that the block's execution should require
assuming no cache misses, and its actual execution time. Moreover, Qprof also determines how
frequently each basic block and machine instruction type is executed, as well as the live time and
invocation count of each C procedure and Id code block (split-phase) call.
This thesis is organized into seven chapters. In the following chapter, existing tools for performance
measurement are surveyed, and the reasons why they are inadequate for the parallel environment of
the Q project are discussed. Next, the proposed design for the initial version of Qprof is developed at
a high level and related to the existing Q compiler. With the groundwork for the profiler already laid,
a discussion of Qprof's detailed design and implementation follows as chapter four. In chapter five,
we evaluate Qprof on two *RISC programs, and note some shortcomings. Desired improvements are
collected into chapter six, including both overhead-reducing optimizations and new features useful
when more than a single processor comes into play. Such new features include the collection of idle
times at particular processors to diagnose load imbalance, and the measurement of elapsed time
between the completion of synchronizing partitions. A summary is included as the final chapter,
followed by appendices containing Qprof's source code.
Originally, support in the Q back end for cache simulations was discussed, in the form of an option
to generate data and instruction address traces. However, top-level analysis reveals that absent
some form of special hardware to supply transparent data bandwidth, collecting an instruction trace
would require enough bandwidth overhead to hopelessly perturb the execution of a parallel,
multiple-processor program. Programs running on a network of processors are dependent upon message arrival
time relative to the local state of a given processor, and adding the required storage bandwidth by
code augmentation would strongly skew any such timeline. The inherent problem is that unlike an
execution profile, which to first order should occupy constant space regardless of the program's run
time, the storage requirements of tracing vary directly with run time. Add to that the problem
that, unlike ideal execution times, accurate address traces cannot be created statically (due to data
dependencies), and any workable software solution is clearly going to perturb the original program
significantly. Thus, address trace generation is not supported in Qprof.
Chapter 2
Existing Models of Performance Measurement
When considering what features should be supported in constructing an effective profiler for the Q
project, it is helpful to consider existing Unix performance gauging tools developed for C
programmers. Some of the more common profiling tools designed to interface with the C compiler include
the real-time-gathering utilities prof [13] and gprof [12], which are available on most platforms, and
the cycle-counting tool pixie [4], which is currently available only on MIPS-processor-based work-
stations. Though these utilities have shortcomings, they provide a baseline against which Qprof can
be compared. In turn, each of the three approaches to performance measurement will be considered.
2.1 Prof
One of the simplest profiling tools available to the C programmer is prof, which takes advantage of
the existing 60-100 Hz operating system interrupts to sample the processor's program counter. At
run time, storage slots are allocated for the basic blocks in the target program's assembly code, and
during an interrupt, the value of the program counter is binned by its position relative to the target
program's assembly code labels, allowing the appropriate slot to be incremented. Although program
counter sampling, as this process is called, has the advantage of generating timing information for
every basic block in the program and not requiring that instrumentation code be added to the
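executable, it suffers from a lack of timer resolution. For example, given a processor clock rate of
42 MHz and a sampling rate of 100 Hz, the interval r between successive samples is
\[ r = \frac{42 \times 10^{6}}{100} = 420{,}000 \text{ instructions}, \tag{2.1} \]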
or perhaps half that for slower models of the POWER RS/6000. Given such coarse time resolution,
programs which execute in less than a second cannot be accurately profiled. Although the user can
vary the program's inputs to increase execution time, this may result in a very serious distortion of
the original distribution of run time amongst the basic blocks.
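The binning step described above can be sketched in C. This is a minimal illustration and not prof's actual implementation: the names find_block, record_sample, and sample_interval are hypothetical, the label table is assumed to be sorted by address, and the interval calculation assumes one instruction per clock cycle.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical illustration: map a sampled program counter to the basic
 * block whose label address is the greatest one not exceeding the sample.
 * `labels` must be sorted in ascending order. Returns the block index, or
 * -1 if the sample lies below the first label. */
long find_block(const uintptr_t *labels, size_t n, uintptr_t pc)
{
    size_t lo = 0, hi = n;              /* search window is [lo, hi) */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (labels[mid] <= pc)
            lo = mid + 1;               /* block at mid starts at or before pc */
        else
            hi = mid;
    }
    return (long)lo - 1;                /* lo counts the labels <= pc */
}

/* Called once per timer interrupt: increment the slot for the sampled pc. */
void record_sample(const uintptr_t *labels, unsigned long *slots,
                   size_t n, uintptr_t pc)
{
    long b;
    assert(labels != NULL && slots != NULL);
    b = find_block(labels, n, pc);
    if (b >= 0)
        slots[b]++;
}

/* Instructions that elapse between two samples, assuming (as a
 * simplification) one instruction per clock cycle. */
unsigned long sample_interval(unsigned long clock_hz, unsigned long sample_hz)
{
    assert(sample_hz > 0);
    return clock_hz / sample_hz;
}
```

With an assumed 42 MHz clock and 100 samples per second, sample_interval yields 420,000, so a short basic block can execute many times between two samples without ever being observed.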
While the use of existing interrupts makes prof non-intrusive, so that it can collect real time
without inducing a significant probe effect, the coarseness of its interrupt-driven timing disconnects
the accumulation of profiling statistics from the target program's fine-grained behavior. Reduced
to state sampling by this defect, prof can acquire little runtime information other than elapsed real
time. As a case in point, note that in prof there is no means to relate time spent in a callee procedure
back to the caller. At the time of the interrupt, only the instantaneous location within user code is
known; neither history information nor a detailed view of the state is available. As a result, given
two procedures that each call a function foo(), the agent responsible for the majority of the calls
cannot be determined.
The overhead of prof on a recursive Fibonacci function executing for sixteen seconds was found
to be negligible. This is reasonable, since a 1000-cycle overhead (as an upper bound) occurring
every 10 milliseconds would amount to a total overhead of only 0.2 percent. As the length of the
target program increases, however, some paging delays may surface as a result of maintaining the
accumulation bins for the timer.
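The overhead estimate can be checked with a line of arithmetic. The clock rates below are assumed values chosen for illustration; c is the per-interrupt cost in cycles, T the sampling period, and f the clock rate.

```latex
\[
\text{overhead} \le \frac{c}{f\,T},
\qquad c = 1000 \text{ cycles},\quad T = 10 \text{ ms}
\]
\[
f = 42 \text{ MHz}: \ \frac{1000}{420{,}000} \approx 0.24\%,
\qquad
f = 50 \text{ MHz}: \ \frac{1000}{500{,}000} = 0.20\%
\]
```

Either assumed clock rate gives a bound on the order of 0.2 percent, in line with the figure quoted above.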
2.2 Gprof
The development of gprof represents a significant improvement over prof. Using procedure-call
counts collected at run time, gprof assembles a weighted call-tree linking the procedures of the
target program. After the target program executes, the tree is used to allocate callee execution time,
collected via program-counter sampling, to each caller. Since the call tree is an image of program
execution, the picture presented to the user can lead to the detection of higher level inefficiencies,
and even bugs, which could not have been uncovered using prof.
However, the call tree generated is not exact: it attributes a fraction of the total time spent
in a given procedure to each caller by assuming that the time a given caller is responsible for is
directly proportional to the number of calls made by that caller relative to the total number of
callee invocations. Obviously, if one of the callers makes significantly more "difficult" calls to a
given subprocedure than other callers, this heuristic would not yield accurate results. For example,
suppose that, as shown in Figure 2-1, the procedure fact(n) is called by caller_one(), which requests
fact(n) for some very large values of n, and also by caller_two(), which issues much less difficult
calls to fact(n). Since caller_one() and caller_two() each invoke fact(n) 100 times, gprof will
allocate equal amounts of fact's execution time to each, despite the fact that the majority of time
spent in fact is almost certainly attributable to caller_one().

#include <stdio.h>

/* Loop body and return reconstructed; they are missing from the source. */
double fact(int n) {
    double product = 1.0;
    int i;
    for (i = 2; i <= n; i++)
        product *= i;
    return product;
}

void caller_one() {
    int i;
    for (i = 0; i < 100; i++)
        printf("Computing fact(%i) = %lf\n", i, fact(i));
}

/* Call argument assumed; the source says only that these calls are much cheaper. */
void caller_two() {
    int i;
    for (i = 0; i < 100; i++)
        printf("Computing fact(%i) = %lf\n", 1, fact(1));
}

Figure 2-1: Gprof misrepresents the call-tree profile when two procedures such as caller_one() and
caller_two() make a fixed number of requests of unequal difficulty to a callee.
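The proportional heuristic itself is easy to state in code. The following C sketch is an illustration of the rule as described here, not gprof's source; the function name attribute_by_call_count and its parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical illustration of call-count-proportional attribution:
 * caller i is charged callee_time * calls[i] / total_calls, regardless of
 * how expensive that caller's individual calls actually were. */
void attribute_by_call_count(double callee_time, const unsigned long *calls,
                             size_t n_callers, double *charged)
{
    unsigned long total = 0;
    size_t i;
    assert(calls != NULL && charged != NULL);
    for (i = 0; i < n_callers; i++)
        total += calls[i];
    for (i = 0; i < n_callers; i++)
        charged[i] = (total == 0)
                         ? 0.0
                         : callee_time * (double)calls[i] / (double)total;
}
```

With two callers making 100 calls each, both are charged exactly half of the callee's time, even if one caller's requests account for nearly all of it. Only the call counts enter the calculation, which is the source of the misattribution.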
Perhaps the most serious shortfall of either prof or gprof is simply that they are designed for use
with C, a relatively low-level language. It should be clear that the most useful tool is one tailored for
the conceptual model and programming structures of the target program's language. Only with such
a tool can the bottlenecks be seen at the programmer's abstraction level. Most of the groundwork
for moving on to consider the design of Qprof has now been laid, but before leaving this topic, one
more contemporary profiling tool should be considered to see an approach which avoids dealing with
real time.
2.3 Pixie
Developed for early MIPS platforms, pixie is a performance gauging tool which instruments, or
augments with performance gathering code, each basic block of a target program at the machine
language level. As the instrumented program executes, basic block counters are incremented and
the total ideal time required for program execution can be computed. In this context, ideal time
represents the number of cycles needed for execution of the target program absent overhead such as
cache misses and page faults. In the case of the early MIPS platform, pixie could easily transform
basic block counts into ideal time by examining the object code files, since the assembler inserted
pipelined delays and expanded macros before emitting machine code. The number of instructions
in a basic block thus could be mapped easily into a cycle count, so that multiplying by the number
of invocations of that basic block provided the total ideal time required for its execution.
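The count-times-length calculation can be sketched in a few lines of C. This is a simplified illustration rather than pixie's implementation; the function name ideal_cycles and the array layout are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical illustration: total ideal time in cycles is the sum over
 * basic blocks of (invocation count) x (static cycle count of the block). */
unsigned long long ideal_cycles(const unsigned long long *count,
                                const unsigned int *cycles_per_block,
                                size_t n_blocks)
{
    unsigned long long total = 0;
    size_t b;
    assert(count != NULL && cycles_per_block != NULL);
    for (b = 0; b < n_blocks; b++)
        total += count[b] * cycles_per_block[b];
    return total;
}
```

Because the per-block cycle counts are fixed before the program runs, only the counters need to be updated at run time; everything else can be computed afterwards.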
Such low level code augmentation, without overhead-reducing heuristics, perturbs real time mea-
surements of program execution by a significant amount. For example, a recursive Fibonacci function
coded in C that took 40 seconds to run in its original state took 98 seconds as a pixied executable,
an increase of 145 percent! While longer basic blocks would help reduce the impact of pixie's
instrumentation, any real time measurements (and timeline interactions between the processors of a
multicomputer) would certainly be skewed.
Within the context of pixie, real time is not even at issue, as the profile is based solely on
invocation counts and basic block lengths. However, an approach such as pixie's that focuses solely
on ideal time would be inadequate for the Q back end operating in multiple-processor mode, even
if heuristics were used to reduce the number of instrumentation points, because the important issue
of real time is not addressed. Without real time, one loses the ability to detect memory hierarchy
effects and unpredicted pipeline stalls which are characteristic of the POWER RS/6000 processor,
which is significantly more complicated than the MIPS. The basic problem stems from the network
of processors. The introduction of the network into the model of performance measurement so
complicates the situation that efficient solutions heavily demand emphasis on real time rather than





Now that the Q back end and its runtime storage structures have been described, and present
profiling tools have been shown inadequate in efficiently gauging the performance of an Id program
compiled through Q, the stage has been set for the design of a new profiling tool mated to Q and free
of the shortcomings found in the standard Unix performance gauging routines. First, we concisely
discuss incorporating ideal time calculation and code augmentation into the original Q model. Then,
the design of the post processor, PostProf, is laid out and issues of data presentation are confronted.
3.1 Instrumentation and Pipeline Simulation
As illustrated in Figure 3-1 below, the initial design of Qprof divided the work of profiling an Id
program into three steps. The goal of this process is to create an instrumented executable and a
table of ideal execution times for the original basic blocks. First, infinite register PowerPC code
taken from the Q compiler's fourth stage is routed into a code instrumentor, which augments the
code at certain points with instructions to collect runtime statistics (such as invocation counts and
actual pipeline delays) into the local activation frame. Next, the augmented code is routed back to
the last stage of the Q compiler to allow register allocation to occur, and is then fed into the simulator along
with the original, uninstrumented code to generate a set of ideal execution times for each basic block.
Inside the simulator, the uninstrumented code is needed to efficiently access the original assembly
code labels, since they are altered during instrumentation.
Unlike cycle counting on the MIPS, ideal time calculation on the POWER RS/6000 architecture
is a complicated task, and a simulator must be built to accomplish it. Given the multiple functional
units of the POWER RS/6000 processor, it should be clear that no direct correlation exists between












Figure 3-1: The conceptual outlay of Qprof. Shown are the five Q compiler stages and the additional
stages required by Qprof.
times is of crucial importance to Qprof because, when united with the runtime statistics, ideal times
can quantify and localize memory hierarchy effects.
As shown in the figure, after the first two phases of Qprof have run and the compiler has been in-
voked to generate an executable, all that remains is the data analysis phase fulfilled by the PostProf
visualizer. Running the augmented executable (fact in the figure) generates a file of runtime statis-
tics which can be used by PostProf together with the simulator's static ideal times and instruction
mixes to give an accurate picture of a program's performance.
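How the runtime and static data might be combined can be illustrated with a short C sketch. The structure and function names are hypothetical and do not reflect PostProf's actual file formats; actual and ideal times are assumed to be expressed in the same unit (cycles).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-basic-block record; field names are illustrative only. */
struct block_profile {
    unsigned long long invocations;   /* runtime: times the block executed  */
    unsigned long long actual_cycles; /* runtime: accumulated measured time */
    unsigned int ideal_cycles;        /* static: simulator estimate per run */
};

/* Cycles not explained by the ideal model (for example cache misses or
 * unmodeled pipeline stalls). Clamped at zero, since timer granularity
 * could in principle make the measured time fall below the estimate. */
unsigned long long excess_cycles(const struct block_profile *p)
{
    unsigned long long ideal_total;
    assert(p != NULL);
    ideal_total = p->invocations * p->ideal_cycles;
    return p->actual_cycles > ideal_total ? p->actual_cycles - ideal_total : 0;
}

/* Index of the block with the largest excess, or -1 if n == 0. */
long worst_block(const struct block_profile *blocks, size_t n)
{
    long worst = -1;
    unsigned long long worst_excess = 0;
    size_t b;
    for (b = 0; b < n; b++) {
        unsigned long long e = excess_cycles(&blocks[b]);
        if (worst < 0 || e > worst_excess) {
            worst = (long)b;
            worst_excess = e;
        }
    }
    return worst;
}
```

A block whose measured time greatly exceeds its invocation count times its ideal time is a natural place to look for memory hierarchy effects.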
By splitting the performance data into a runtime file and a static file, it was hoped that runtime overhead
could be kept to the minimum actually required to collect the statistics. All data analysis was to
be done offline, either in the pipeline simulator before the executable was even generated, or in the
post processor to reconstruct necessary information from a compacted minimal set. The decision
to place instrumentation before register allocation was not arbitrary, but the result of analyzing
a tradeoff between two evils. If the code were instrumented after register allocation, a number of
registers (those required in the instrumentation code) would have to be permanently allocated to
the profiler, which would mean either that significant performance would have to be sacrificed in
the design of the Q compiler by allowing fewer real registers to be emitted by the allocator, or
else a switch would have to be installed to squeeze the register allocator's output registers into a
tighter set when profiling is enabled. The first choice is obviously poor, and the second may result in a
significant probe effect; when fewer real registers are available to the allocator, the register allocation
strategy may assign a radically different set of registers to the segments of original code. However,
by placing the instrumentor before the register allocator, we suffer a significantly diminished probe
effect because nearly all of the registers used in the instrumentation code are only live inside
the instrumentation. Inside the original code segments, the register allocator should see the same
number of live registers and the allocation map should be relatively unaltered compared to the
unprofiled executable. Although one profiler register is live at all times, it has been reserved solely
for Qprof and thus is not subject to allocation. The exact design of the instrumentation will be
discussed in the next chapter.
Clearly, there is a fundamental assumption evident in augmenting code to collect real time
intervals: the existence of a low-overhead timing facility. Fortunately, the PowerPC and RS/6000
architectures provide such a timer in the form of the Time Base register and Run Time Counter
register, respectively. Clock accesses have a pipeline delay of a single cycle and a latency of two
cycles in these architectures, although more complication ensues if long durations 1 are to be captured,
since the time is stored across two registers which must be accessed independently. Much more will
be said about the timer below.
1 We refer here to time periods on the order of 1 second or more.
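The complication with a timer split across two registers is that the lower half can wrap between the two reads. A common remedy, shown here as a C sketch and not necessarily the approach Qprof takes, is to read the upper half twice and retry on a mismatch. The accessor functions and the simulated counter are hypothetical stand-ins for the hardware register reads.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical register accessors: on real hardware these would be single
 * instructions reading the upper and lower halves of the timer. */
typedef uint32_t (*read_reg_fn)(void);

/* Read a 64-bit time value held in two 32-bit registers. If the lower half
 * wraps between the two reads of the upper half, the upper value changes
 * and the read is retried, so the returned halves are mutually consistent. */
uint64_t read_time64(read_reg_fn read_upper, read_reg_fn read_lower)
{
    uint32_t hi, lo, hi_again;
    assert(read_upper != 0 && read_lower != 0);
    do {
        hi = read_upper();
        lo = read_lower();
        hi_again = read_upper();
    } while (hi != hi_again);
    return ((uint64_t)hi << 32) | lo;
}

/* Simulated counter for demonstration only: every register read advances
 * it by one tick, and it starts two ticks before the lower half wraps. */
uint64_t sim_now = 0x00000000FFFFFFFEULL;
uint32_t sim_upper(void) { sim_now += 1; return (uint32_t)(sim_now >> 32); }
uint32_t sim_lower(void) { sim_now += 1; return (uint32_t)sim_now; }
```

In the simulated run the first attempt straddles the wrap and is discarded; a single unguarded pair of reads would have returned a value off by roughly 2^32 ticks. Where the two halves count different units rather than forming one binary number, the final combining step would differ.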
3.2 PostProf
Early on, after a review of Kesselman's dissertation [8] on parallel program performance
measurement, it was decided that a graphic approach should be taken in the post processor. When a single processor
is at issue, a textual summary such as prof yields is readily analyzed, but as the number of processors
increases, Kesselman found that only a graphic format is likely to present the vastly increased volume
of information to the programmer in a manageable way. Given several processors, several basic
blocks (or code blocks), and several types of measurements, it seems only natural that a visual
approach might prove more useful than any other. The design conceived at this time was inspired
by Kesselman's "three-dimensional histogram," in which the code segment goes on the Y-axis, the
processor goes on the X-axis, and grey-level intensity of each box (pixel) represents the magnitude
of the measurement. In this scheme, only one type of measurement, such as basic block invocation
count, can be viewed at a time. An example of a three-dimensional histogram is shown in Figure 3-2,
in which the invocation count of each basic block in a particular code block is depicted. By scanning
horizontally, the user can tell which processor has the highest count for a given basic block, and by
scanning vertically he can tell which basic block has the highest count for a fixed processor.
Major operations that should be applicable to parallel execution data include folding, sorting, and
subsetting, each brought about by highlighting a selection set via X-window interaction. Folding
a selection set onto the Y-axis should generate a conventional histogram of magnitude vs. code
segment, summed over all processors, whereas folding it onto the X-axis should generate one of
magnitude vs. processor, summed over all code segments. Once a conventional histogram has been
obtained, different types of data can be "stacked" onto the single histogram, each represented by a
different shade of grey.
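The folding operation can be illustrated with a small sketch. It is not the post processor's code: the grid layout (rows are code segments, columns are processors), the function names, and the sample counts are all assumptions made for illustration.

```python
# Hypothetical sketch: a "three-dimensional histogram" stored as a 2-D grid.
# Rows are code segments (Y-axis); columns are processors (X-axis).

def fold_onto_y(grid):
    """Sum over all processors: one magnitude per code segment."""
    return [sum(row) for row in grid]

def fold_onto_x(grid):
    """Sum over all code segments: one magnitude per processor."""
    return [sum(column) for column in zip(*grid)]

# Made-up invocation counts for 3 basic blocks on 4 processors.
counts = [
    [10, 12, 0, 11],
    [5, 5, 1, 5],
    [7, 8, 0, 9],
]

per_block = fold_onto_y(counts)      # conventional histogram vs. code segment
per_processor = fold_onto_x(counts)  # conventional histogram vs. processor
```

In this made-up data the third processor's folded total is far below the others, which is the kind of imbalance the folded view is meant to expose.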
Although folding is perhaps the most powerful feature, the second two operations could also prove
useful when applied directly to a "three dimensional histogram." Subsetting reduces the clutter
presented and allows the user to focus on only those entries which are relevant to the user's problem.
Sorting of the selection set by a measurement magnitude could be carried out either on the processor
(X) or code segment (Y) axes to simplify the task of spotting the most costly code blocks according
to some performance metric.
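Subsetting and sorting can be sketched in the same spirit. The helper names, the labels, and the data below are hypothetical, and the selection set is assumed to be given as lists of row and column indices rather than obtained through X-window interaction.

```python
# Hypothetical sketch of subsetting and sorting on a grid whose rows are
# code segments and whose columns are processors.

def subset(grid, rows, cols):
    """Keep only the selected code segments (rows) and processors (cols)."""
    return [[grid[r][c] for c in cols] for r in rows]

def sort_segments(grid, labels):
    """Order code segments by total magnitude, most costly first."""
    order = sorted(range(len(grid)), key=lambda r: sum(grid[r]), reverse=True)
    return [labels[r] for r in order], [grid[r] for r in order]

counts = [
    [10, 12, 0, 11],
    [5, 5, 1, 5],
    [7, 8, 0, 9],
]
labels = ["bb0", "bb1", "bb2"]

selected = subset(counts, rows=[0, 2], cols=[1, 3])
sorted_labels, sorted_counts = sort_segments(counts, labels)
```

Sorting along the processor axis would follow the same pattern applied to the transposed grid.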
[Figure 3-2 graphic: "Invocation Count vs. Basic Block and Processor"; horizontal axis: Processor # (1 to 8).]
Figure 3-2: A 3-d histogram. Invocation count, represented by the intensity of the block, is plotted
as a function of both basic block and processor number. There appears to be a scheduling problem
starving the fifth processor in the first through fourth basic blocks, and another starving the odd
While improvements and extensions to Qprof suggested by experience with the initial implementation
are discussed later in the paper, the extended design of the post processor which has just
been discussed was conceived early on in the project, but could not be implemented since only a
single processor POWER RS/6000 system was available. The actual visualizer implemented will be
Although the Q compiler project is eventually intended to run on a multicomputer with many
PowerPC processors, during the design phase of Qprof only a
single processor RS/6000 system was available for the project. Given this reality, the post processor
phase was greatly simplified relative to the original design, leaving ideal time calculation and code
augmentation as the only real hurdles in accomplishing the implementation. Ideal time calculation
was chosen as the first objective, since the problem could be analyzed and solved independently of
the Q runtime system and compiler stages, which were under development while Qprof took shape.
4.1 Ideal Time: Building a Simulator for the RS/6000
The first major goal was the development of a simulator to compute the ideal time of a POWER
instruction sequence. The ideal time of a sequence of Power/PowerPC ISA instructions is defined
as the number of CPU cycles required to execute the sequence assuming no cache misses occur (in
either the I-cache or the D-cache).
The significance of ideal time comes from the fact that given (1) the actual time spent executing a
given basic block, (2) the number of times that basic block was called, and (3) the ideal time per
invocation, we can compute the number of cycles lost to the memory hierarchy due to misses in
the cache. Of course, this method assumes that every invocation of a basic block takes the same
number of cycles given a 100% cache hit ratio, which is clearly false since the immediate ancestor
of a given basic block can leave various clutter in the functional unit pipelines of the CPU that may
not be the same for each invocation. Still, our heuristic approach should prove effective. Note that
as discussed above, the register allocated code is timed to ensure we analyze code as close as possible
to that which will be executed. However, it is important to keep in mind that no matter how good
the heuristics are, the ideal time can never be computed correctly in every case without running a
complete simulation of the entire program, including register values, caches, main memory, etc. The
reason is that some pipeline delays are data dependent, and the Halting Problem tells us that we
cannot construct an algorithm to predict these delays without essentially running a simulation of the
original machine. We must settle for an approximation to the ideal time.
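The bookkeeping implied by items (1) through (3) above can be written out as a short sketch. The function name and the sample numbers are assumptions; the conversion assumes real time is measured in nanoseconds and that the CPU clock rate is known in MHz.

```python
# Hypothetical sketch: cycles lost to the memory hierarchy for one basic block.

def cycles_lost_to_memory(real_time_ns, invocations, ideal_cycles, cpu_mhz):
    """Real cycles minus the cycles an all-hits execution would have needed."""
    real_cycles = real_time_ns * cpu_mhz / 1000.0   # ns * (cycles per us) / 1000
    return real_cycles - invocations * ideal_cycles

# Made-up example: 1 ms of real time on a 42 MHz CPU,
# 1000 invocations at an ideal time of 30 cycles each.
lost = cycles_lost_to_memory(1_000_000, 1000, 30, cpu_mhz=42)
```

Because the ideal time is itself an approximation, a result computed this way should be read as an estimate rather than an exact count.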
4.1.1 Approximating Ideal Time with Simulated Pipelines
In this subsection, we give a brief overview of the components of the simulator, and then discuss
some interesting problems which arise when one attempts to use such a model to compute accurate
ideal time. Even an excellent simulator, if applied incorrectly to the instruction stream, can yield
poor results. By appropriate heuristics, however, the error can be greatly reduced for most of the
basic blocks in a partition after the first.
In computing the ideal time of a specific instruction sequence, we first place the simulated pipelines
of the CPU in a particular state (assume empty). Next, we feed in the instructions one CPU cycle
at a time, handing them off to the appropriate functional unit pipelines as they become available
while scoreboarding registers not renamed (footnote 2) to preserve data dependencies. In the parlance of an
object-oriented language, the functional units could be implemented as data types with manipulators
to insert machine instructions and advance the global time, and predicates to determine whether
the head of a pipeline is ready for an additional instruction. In such a design, the master process
feeding the instruction sequence into the appropriate pipelines would have access to a global dispatch
scoreboard to determine when and if instructions could be dispatched to a given pipeline. In the
end, when the last instruction reaches a chosen point in its execution pipeline, both the total number
of cycles required and the occupancy of the various pipelines must be stored. Generally speaking,
the accuracy of raw ideal time will improve with the length of the code sequence due to the relative
decrease in the "pipeline start up transient". By recording the state of the pipelines, we allow two
accuracy-enhancing improvements to be easily made to the algorithm.
2 Power/PowerPC implementations rename floating-point registers.
The first, and simplest, enhancement would be either to take a global average of the final relative
occupancy of the pipelines and apply a formula to it to determine some number of extra cycles to
add to each raw ideal time, or else to use the relative occupancy of the pipelines for each possible
preceding instruction sequence to determine a "fudge factor" in cycles to add to the raw ideal time
of the immediately succeeding code sequence. The catch is that this method can only be used to
add cycles to the raw ideal time of a code sequence if the preceding sequences are limited to a small
set of user code sequences for which pipeline occupancy data has been stored. Obviously, various
heuristics may be more successful than others, but we should be able to come close to the correct
ideal time by using a heuristic sufficient to deal with the pipeline's startup transient.
In the above method, all pipelines were assumed empty upon running each code sequence. A more
accurate calculation of ideal time can be made by simulating the ideal time of instruction sequences
in the order in which the sequences appear in the code, so that the pipeline is occupied to a degree
upon starting the ideal time calculation of each consecutive instruction sequence. This is in fact the
technique chosen in the implementation below. Obviously, a problem occurs when a basic block has
multiple ancestors, but if the ancestors are all user instruction sequences, each ancestor can be run
starting with empty pipelines before running the target instruction sequence, and the average of the
two (or more) ideal times can be used as the final ideal time. If accurate weighting information is
desired (assuming each ancestor equally weighted sometimes yields a distorted profile), the
control-graph weighting heuristics described in the "future improvements" section below may be used.
In a sense, the only reason this approach should be better than the previous paragraph's is that a
heuristically-derived fudge factor is being replaced by a recursive call to the ideal time simulator.
Particular attention must be given to branch instructions, since they end basic blocks (and thus
would not be found in the middle of an instruction sequence). Observe therefore that given two
contiguous branch instructions (the first being conditional), the second would be placed in its own
basic block and would encounter completely empty pipelines when run through the simulator, unless
one of the two accuracy-enhancing optimizations above is included. (This would yield an ideal
time of one cycle for the second branch, which would be erroneous on most CPU architectures.)
Fortunately, if the second branch has only a single ancestor, the second optimization technique
above would correctly capture the delay produced by the consecutive branch instructions. Indeed,
the chosen implementation captures this behavior.
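The idea of priming the pipelines with each ancestor and averaging the resulting ideal times can be shown on a toy model. This is not the Qprof simulator: it assumes a single-issue, in-order machine in which each instruction is a hypothetical (destination, sources, latency) tuple, and it stops timing one cycle after the last instruction issues.

```python
# Toy scoreboard model (an illustration only, far simpler than the RS/6000).

def simulate(block, pending=None):
    """Return (cycles, leftover) for a list of (dest, sources, latency) tuples.

    pending maps a register to the number of cycles until its value is ready,
    i.e. the pipeline state left behind by the preceding code sequence.
    """
    ready_at = dict(pending or {})
    cycle = 0
    for dest, sources, latency in block:
        # Scoreboard check: stall until every source register is ready.
        issue = max([cycle] + [ready_at.get(s, 0) for s in sources])
        ready_at[dest] = issue + latency
        cycle = issue + 1
    # Express results still in flight relative to the end of this block.
    leftover = {r: t - cycle for r, t in ready_at.items() if t > cycle}
    return cycle, leftover

def ideal_time(block, ancestors):
    """Run each ancestor from empty pipelines, then the block; average."""
    times = []
    for ancestor in ancestors:
        _, state = simulate(ancestor)
        cycles, _ = simulate(block, state)
        times.append(cycles)
    return sum(times) / len(times)

ancestor_a = [("f1", [], 3)]   # leaves f1 in flight for two more cycles
ancestor_b = [("r1", [], 1)]   # leaves nothing in flight
block = [("f2", ["f1"], 1), ("r2", ["r1"], 1)]

raw = simulate(block)[0]                               # empty pipelines
primed = ideal_time(block, [ancestor_a, ancestor_b])   # equal weighting
```

Here the block takes 2 cycles from empty pipelines but 4 cycles after ancestor_a, so the equally weighted average is 3 cycles. Unequal ancestor weights would replace the plain mean with a weighted one.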
4.1.2 CPU Models
The CPU models considered in the design of the ideal time computation engine were the RS/6000
model shown in Figure 4-1, and the PowerPC 601 model shown in Figure 4-2. The RS/6000 CPU
architecture is characterized by separate data and instruction caches, and three separate functional
units, one for each of branch, floating point, and fixed-point operations. The branch unit can grab
up to four instructions from the I-cache in a single cycle, and issue four instructions in a single cycle.
Figure 4-1: The Logical Organization of the RS/6000 CPU.
Supposing that a logical operation on CR (condition register) bits, a branch instruction, a
fixed-point add, and a floating-point add were available to the branch unit, it could dispatch the floating
and fixed-point operations to their respective functional units, and execute the branch and CR bit
logical operation itself, all in a single clock cycle. This feature allows it to branch transparently,
provided that in the case of conditional branches the branch is correctly predicted.
The main pipeline delays in the RS/6000 include a one cycle latency between a load from the cache
and an instruction that uses the result register, a delay of one cycle between two dependent floating
point instructions, a delay of three (eight) cycles between a fixed (floating) point unit compare
setting CR bits and a successful branch using those condition bits, and a three cycle delay between
two consecutive branch instructions, the first being conditional. All these except the last can be
modeled by the simple scoreboarded set of pipeline abstractions, and the last can be handled by one
of the accuracy-enhancing optimizations described above.
The PowerPC CPU architecture is somewhat more complicated, although a shared D-cache/I-
cache is employed. Instead of being channeled through the branch unit, as in the RS/6000 architec-
ture, the PowerPC 601 design incorporates an instruction queue (IQ) eight instructions deep that
feeds all three functional pipelines (FPU, IU, and BPU). Unlike the RS/6000, the PowerPC 601
can only issue three instructions per clock cycle; however, it can fetch eight in a single clock cycle
from the combined cache, if the cache is not in use. Floating-point and branch instructions may be











Figure 4-2: The Logical Organization of the PowerPC 601.
integer instructions and other logical operations assigned to the integer unit (IU), may only decode
from the bottom IQ slot. Just as in the case of the POWER RS/6000 processor, the PowerPC 601
can be simulated in an object-oriented language by building a data type for each functional unit and
new objects for each pipeline stage. The differences will lead to internal changes in the data types
representing the functional units, and a new master process to hand incoming instructions off to the
execution units from the IQ.
4.1.3 Timing Experiments and the Detailed Design of the Simulator
As soon as it was known that the RS/6000 processor had become the target of the first Qprof
implementation, detailed experimental design began on the simulator. By studying an issue of IBM
Journal of Research and Development dedicated to the RS/6000 architecture [7], much of the interior
design of the branch, fixed, and floating point units came to be known immediately. As the design
process progressed, the pipelines for each functional unit were laid out and interlocks or scoreboards
were developed to hold back the execution of certain instructions. However, there were two cases in
which the known layout inside a functional unit was not sufficient to predict actual execution time,
and experiments had to be conducted to resolve the appropriate functionality. The first case was that
of POWER ISA moves to and from special registers: the condition register, link register, counter
register, etc. It had been clear from the beginning that lock bits must be introduced for writes to




















register field, but it was not clear how special register operations involving special registers other
than the CR, or even moves from the CR, interacted. In many, but not all cases, adjacent special
register moves were found to create pipeline delays in the fixed-point unit where they executed. In
order to probe the time required to execute a given piece of code in an experimental setting, the
assembly code illustrated in Figure 4-3 below was developed.
The code takes advantage of the low overhead timer (Run Time Counter) on board the RS/6000
processor to time the execution of 5,000 repetitions of the target code segment. Given the RTC
clock rate and that of the CPU, the calling C routine can compute the actual execution time of
the target code segment. Knowing the actual pipeline delays produced by various combinations of
move-to- and move-from-special-register instructions, a table was constructed, and, finally, a set
of interlocks could be designed to mimic the correct pipeline stalls (footnote 3). In the final implementation,
many of the special registers were given both read and write lock bits, set when certain instructions
dispatched and cleared when those instructions passed through the execute or post-execute stages
of the fixed-point unit.
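The arithmetic performed by the calling routine can be sketched as follows. The function name is hypothetical, and the sketch assumes two readings of the lower RTC word in nanoseconds, a CPU clock of 42 MHz, and at most one rollover of the one-second lower word between the readings.

```python
# Hypothetical sketch: converting two RTC readings into CPU cycles per repetition.

RTC_WRAP_NS = 1_000_000_000   # the lower RTC word rolls over after one second

def cycles_per_repetition(rtc_start, rtc_end, cpu_mhz, repetitions=5000):
    """Average CPU cycles per pass through the timed loop."""
    elapsed_ns = (rtc_end - rtc_start) % RTC_WRAP_NS   # tolerates one rollover
    return elapsed_ns * cpu_mhz / 1000.0 / repetitions

# Made-up readings: 500 microseconds elapsed, with and without a rollover.
plain = cycles_per_repetition(100_000, 600_000, cpu_mhz=42)
wrapped = cycles_per_repetition(999_800_000, 300_000, cpu_mhz=42)
```

The figure obtained this way includes the loop-closing branch, so in practice an empty loop would be timed as well and its cost subtracted.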
The second complication which couldn't be accounted for via the original functional unit stages
was the pipeline occupancy delays. Several special instructions can occupy a single pipeline stage
for varying lengths of time beyond the one cycle norm. Some, such as stsi, lsi, stsx, and lscbx,
which are load/store string operations, take a variable amount of time depending on the contents
of registers and even memory! Luckily, none of this set of instructions was included among the
POWER instructions implemented in the Q back end. The load- and store-multiple instructions,
lm and stm, have an occupancy statically determined by their target register, and so could have
been simulated, but were also not included in the subset of POWER instructions implemented in
Q. The only extended-occupancy instructions dealt with were thus div/divs (integer divide), which
requires 19 to 20 clock cycles, fd (floating divide), which requires 19 cycles, and the fixed-point
multiply instructions, which require 3-5 cycles depending upon the size (byte, halfword, word) of
the last factor. Both divide operations were assigned 19 cycle occupancies, and the multiply was
assigned a four cycle occupancy.
        # load the link register into register 0
        # store several of the non-volatile registers
        # save reg. 0 (link reg. copy) on the stack
        # create new stack frame
        # load register 3 with 5000...
        # ... and store it in the count register
        # place the lower word of the on-board clock
        # in register 4
loop:
        <code to be timed goes here>
        # time the code, looping back repeatedly
        # as the counter is decremented
        bdn loop
        mfrtcl 5        # place the lower word of the on-board clock
        # retrieve link reg. from stk. and put in reg. 0
        # pop the stack frame
        # restore non-volatile registers
        # compute the elapsed time, and place it
        # in the result register
        mtlr 0          # restore the link register
        blr             # branch to it
Figure 4-3: POWER assembly code for probing actual execution time. The generic wrapper permits
calls to be made inside the loop.
After having determined all interior delays in each of the functional units, we next considered the
delays between units. Still to be resolved was the link between the store queues and data load inputs
of the fixed and floating point units. For example, when a floating-point load, which passes through
both the fixed and floating point units, executes in the real RS/6000, the FXU generates the address
for the load request and the FPU activates its register renaming machinery and locks the target
register until the data arrives there from memory. Our simulator must allow the FXU to send the
FPU a signal in the manner just described. In fact, the solution eventually accepted was to add a
fourth functional unit, a D-cache, to the simulator design to establish the correct message-passing
protocol between the FXU and FPU with regard to memory operations. Conceptual diagrams of
each of the four functional units developed during implementation of the simulator are shown in
Figures 4-4, 4-5, 4-6, and 4-7.
Once the diagrams for each functional unit had been prepared, developing most of the code for
the simulator was straightforward in C++. Each functional unit became an object, as did each
pipeline cell and buffer. An I-cache object was developed to feed the branch unit's input buffers
with four instructions per cycle as dictated by the specification. Initialized with a PowerPC module
representing the target program, the I-cache object can either return instructions consecutively until
the end of a partition is reached, or fetch from a new target address given the label of the target. In all
cases, simulated registers and memory are devoid of the actual data values; there is no dependency
on actual values built into the simulator. This permits it to run at reasonable speed.
Branches
One issue still unresolved at this point in the design is the handling of branches. To what extent are
branches to be simulated? How are they to be fed to the branch unit? Granted, the instructions
themselves get there via the I-cache, but how do the branch decisions get to the branch unit?
Remember, since no data values are kept in the simulator, the decision whether or not to take a
conditional branch must be inserted by a controller external to the simulator. After experimenting
to check feasibility, it was decided that given a partition and some target inside the partition, an
external routine could generate the correct sequence of branch decisions to move from the beginning
of the partition to the target and then exit the partition (footnote 4). If this sequence of branch decisions
were fed to the branch unit, the target basic block could be reached in such a manner as to prime
the pipeline with the correct instructions, at least in many instances. (The first basic block is an
exception, as is any basic block with more than one ancestor.) In fact, this is exactly how the branch
unit is controlled in the current implementation; given a target, the branch unit and other pipelines
are fed with a sequence of branch decisions known as a "branch chart" that leads to the target code
segment.
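One way an external routine could derive such a branch chart is a breadth-first search over the partition's control-flow graph. The sketch below is an assumption about how this might be done, not the routine used in Qprof: the graph encoding, the decision labels "taken" and "fall", and the block names are all hypothetical, and the decisions needed to exit the partition after reaching the target are omitted.

```python
from collections import deque

def branch_chart(cfg, entry, target):
    """Shortest list of branch decisions leading from entry to target.

    cfg maps a basic block name to a dict {decision: successor}, where
    decision is "taken" or "fall". Returns None if target is unreachable.
    """
    queue = deque([(entry, [])])
    seen = {entry}
    while queue:
        block, chart = queue.popleft()
        if block == target:
            return chart
        for decision, successor in cfg.get(block, {}).items():
            if successor not in seen:
                seen.add(successor)
                queue.append((successor, chart + [decision]))
    return None

# Made-up partition with four basic blocks and one back edge.
cfg = {
    "bb0": {"taken": "bb2", "fall": "bb1"},
    "bb1": {"fall": "bb2"},
    "bb2": {"taken": "bb0", "fall": "bb3"},
    "bb3": {},
}
chart = branch_chart(cfg, "bb0", "bb3")
```

A shortest path is only one possible choice; when a target has several ancestors, a separate chart per ancestor would be needed to prime the pipelines each way.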
4 The I-cache is equipped with an extra partition representing a generic C function that is entered whenever a
branch to link register is taken.
[Figures 4-4 through 4-7: conceptual diagrams of the simulated functional units; the graphics did not survive extraction.]
A very tricky issue which at first seems almost invisible is where to begin and end timing of a given
basic block. A naive answer might be, "as soon as the last instruction is dispatched from the branch
unit". However, this is problematic for short instruction sequences since the dispatched instruction,
although it will later cause a delay by sitting in the execution stage for several cycles, may well be
able to enter a fixed-point instruction buffer as soon as it becomes eligible for dispatch. Stopping
the time at the dispatch point would mean losing a significant portion of such an instruction's effect
on the pipeline. The only bright point in this enigma is that by running the simulator through a
partition from the beginning every time an interior basic block must be profiled, we can somewhat
blur the accounting of which previous instruction caused which delay and still come up with a valid
set of ideal times. The adopted solution to this problem was to cut off timing after the execute stage
for the fixed-point unit, the first execute stage for the floating-point unit, and the dispatch stage for
condition register operations and branch instructions. This strategy is not perfect but achieves the
correct behavior as long as fixed point instructions begin new basic blocks, which occurs often.
Shown below in Figure 4-8 is a module-level view of the simulator code. Its object-oriented design
is evident from the similarity to the RS/6000 CPU diagram seen previously. A test script was
prepared to run the simulator on a suite of short code sequences so that the integrity of the model could
be quickly checked after making small corrections or additions. After much tuning of the various
pipeline stages, the simulator was able to correctly simulate branch delays, moves to and from special
registers, and large-occupancy instructions. The C++ code is included in Appendix A.
4.2 Instrumentation: A Detailed View
The major issues which must be resolved in actually instrumenting code are:
* What statistics are to be collected via profiling, and where are they to be stored during execution?
* What methods will be used to collect these statistics?
* How can the user direct the profiler as to which procedures should be profiled and which of the
possible types of statistics should be collected?
We answer each of these questions in turn, starting with what information is to be collected
and where it will reside during runtime. Note that every form of runtime overhead (increased code size,
data memory requirements, and execution time) has a deleterious effect on the truthfulness with
which the profile data represents an execution of the unprofiled code (footnote 5). This is especially true for
increases in execution time. Thus, while analyzing the three guiding questions above to effect the
implementation of the instrumentation phase, we also must keep in mind that after a working
version of the instrumentation phase has been installed, we must work toward including time-saving
optimizations.
5 Kesselman notes in his dissertation that by keeping runtime overhead low, the probe effect can be eliminated
by leaving profiling code in even when statistics are not collected. Given the relatively higher overhead expected with
Figure 4-8: A module-level diagram of the data types and procedures used to implement a stand-
In designing the original implementation of the instrumentor, we wanted to collect statistics, for
each processor, on the number of times each basic block is executed and its total ideal and real
execution time. It was thought at one point that even smaller sections of PowerPC code than basic
blocks, such as the image of a single *RISC instruction under the *RISC-to-PowerPC translation
function, could be instrumented with code for collecting elapsed real time. However, even before
the coding phase, it became clear that such an object would almost certainly be too fine-grained to
profile. To see why, one must consider the RS/6000 on-board timer, the Run Time Counter (RTC),
in relation to its CPU clock.
The RTC is accessed via a fixed point unit mfspr instruction which takes a single cycle in the
FXU pipeline but has a latency of two cycles. For short intervals (<< one second), it suffices to
sample only the lower word of the clock, which has a period of one billion nanoseconds and counts
from 0 to 999,999,999 before rolling over. (Obviously the range of a 32 bit unsigned integer is larger
than this, but it is not all used.) Although the upper 24 bits are implemented in all versions of the
RS/6000, the lower 8 bits are not, with the result that "ticks" of the RTC in fact occur only every
256 ns (footnote 6). By contrast, the target RS/6000 model 550 used in this project had a CPU clock rate of 42
MHz. Comparing the two, it becomes evident that the RTC ticks only once every 10.8 CPU clock
cycles. Clearly, it's possible that problems might occur when targets must be measured which take
fewer than 11 or so CPU cycles, or even of that order of magnitude. In mixed fixed and floating
point code, superscalar effects imply that code segments of less than 20-30 instructions would fall in
that range. We'll consider this effect in the interpretation of our results in later sections, but suffice
it to say for now that attempting to profile below the basic block level would not be considered
feasible.
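The tick arithmetic above can be checked directly. The constants come from the text (a 256 ns tick and a 42 MHz clock); the helper that expresses one tick as a fraction of a block's length is an illustrative assumption about how the resolution limit might be quantified.

```python
# Worked check of the RTC resolution relative to the CPU clock.

RTC_TICK_NS = 2 ** 8   # lower 8 bits unimplemented, so one tick is 256 ns
CPU_MHZ = 42           # RS/6000 model 550

cycles_per_tick = RTC_TICK_NS * CPU_MHZ / 1000.0   # about 10.8 cycles

def tick_fraction(block_cycles):
    """One RTC tick expressed as a fraction of a block's execution time."""
    return cycles_per_tick / block_cycles

short_block = tick_fraction(11)     # close to 1: the timer cannot resolve it
long_block = tick_fraction(1000)    # about 0.01
```

A block of roughly 11 cycles is about as long as a single tick, so one measurement of it carries almost no information, whereas for a block of 1000 cycles a one-tick error is on the order of one percent.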
6 In the PowerPC 601, the RTC period is better (128 ns); the PowerPC line beyond the 601 will include a Time
Base (TB) rather than an RTC.
In addition to real time intervals, it should be possible to collect statistics on the types of
instructions executed, both at the PowerPC level and also at the *RISC level. For example, at the PowerPC
level, we might ask how many fixed point operations were executed in relation to the number of
floating point and memory operations. The crucial optimization here, which saves greatly on runtime
overhead, is that, provided basic block counts are collected, this second set of statistics is already
determined. To produce it, we merely read in the static instruction mix from the simulator output file
and scale each basic block's mix by the number of times it was actually executed. In order to have
some sense of how many threads are forked or how many messages are sent in a given partition, the
imix includes the distribution of instructions at the *RISC level as well, divided into categories such
as network, scheduling, and local memory operations. As implemented, the instruction mix even
reveals how many instructions were added by the register allocator so the user can gauge the local
efficiency of the allocation algorithm on the target code.
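Scaling the static instruction mix by the basic block counts can be sketched as below. The category names, block names, and data are made up, and the real imix file format is not modeled.

```python
from collections import Counter

def dynamic_mix(static_mix, block_counts):
    """Scale each basic block's static mix by its execution count and sum."""
    total = Counter()
    for block, mix in static_mix.items():
        executions = block_counts.get(block, 0)
        for category, instructions in mix.items():
            total[category] += instructions * executions
    return dict(total)

# Made-up static mixes (instructions per category in each basic block).
static_mix = {
    "bb0": {"fixed": 3, "float": 1, "memory": 2},
    "bb1": {"fixed": 1, "memory": 1},
}
block_counts = {"bb0": 10, "bb1": 4}

mix = dynamic_mix(static_mix, block_counts)
```

No extra instrumentation is involved: the only runtime data consumed are the per-block invocation counts that are collected anyway.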
Basic blocks are not the only unit of code which can be profiled; in particular, we would also
like, for each Id procedure, on a per-processor basis, the number of times it was invoked, the total
actual time that frames allocated to it have been live, and the number of calls and elapsed time spent
in each callee Id procedure. (For calls to C procedures rather than Id code, the actual execution
time is collected rather than the live time of the callee's frame, since the call is not split-phase.) The
overhead of these measurements is much less than that of the per-basic-block statistics, and indeed we
find below that fewer optimizations are called for here than at the basic-block-level instrumentation
points.
4.2.1 Profiling Data Structures
As shown in Figure 4-9, a two-level hierarchy is employed to store the statistics collected at runtime.
When an Id procedure is called, the frame allocated contains slots for profiling the execution of each
basic block and each called C function of that procedure. Using the activation frame to store this
data takes advantage of locality, since the frame pointer is already in a register during the execution
of any partition, and the page containing the frame is more likely to be in the cache than the page
containing the code block descriptor. Storing the called-C-procedure statistics in the frame also
makes sure that multiple invocations of the same Id procedure running on a given processor do not
interfere with each other, as they would if they maintained non-atomically updated fields in the
CBD. Each basic block is assigned four slots in the frame: one counting invocations that end in
a fall-through, one counting invocations that end in a branch, and two that together store the
total real time spent in the basic block (footnote 7). The distinction between fall-through and branching exits
from a basic block is made because the ideal time required for a branching exit is longer than that
needed for a fall through. The current implementation allows only a limited number of basic blocks
per Id procedure, determined by a constant; to profile large routines, the constant might have to be
increased.
7 Real time is a 64-bit quantity.
Below the region used for basic blocks lies the called-C-function statistics area, where each called
C routine is assigned five slots. The first slot counts the calls, the second and third, together, hold
the start time of a called routine during its execution, and the last two, together, hold the total
real time spent inside the calls. At the very end of the frame we cache the time of the code block's
allocation; the time can't be left in the second-level, per-procedure storage region since multiple live
copies of a procedure would then each write into the same location.
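The slot arithmetic for the frame-level statistics might look like the following sketch. The slot counts (four per basic block, five per called C function) come from the text; the 4-byte slot size, the base offset, the assumption that the C-function area sits at higher offsets than the basic block area, and the function names are illustrative guesses.

```python
# Hypothetical layout helper for the profiling slots in an activation frame.

WORD = 4        # assumed slot size in bytes
BB_SLOTS = 4    # fall-through count, branch count, real time (two words)
CFN_SLOTS = 5   # call count, start time (two words), total time (two words)

def bb_slot_offset(base, bb_index, slot):
    """Byte offset of one slot belonging to basic block bb_index."""
    return base + (bb_index * BB_SLOTS + slot) * WORD

def cfn_slot_offset(base, num_bbs, fn_index, slot):
    """Byte offset of one slot belonging to called C function fn_index."""
    cfn_base = base + num_bbs * BB_SLOTS * WORD
    return cfn_base + (fn_index * CFN_SLOTS + slot) * WORD

branch_count_bb2 = bb_slot_offset(0, bb_index=2, slot=1)
call_count_fn1 = cfn_slot_offset(0, num_bbs=3, fn_index=1, slot=0)
```

With a fixed number of slots per entry, every offset is a compile-time constant, which is what allows the instrumentation to address its slots directly from the frame pointer.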
The frame-level statistics are accumulated into the CBD at the end of each Id procedure call
just before the frame is deallocated and the code block invocation count, located in the CBD,
is incremented. In addition, the cached start time in the frame is used to accumulate the live
time of each invocation into the total live time stored in the CBD. Finally, the CBD also stores
callee invocation counts and live times, which can be used to generate an accurate call graph. As
implemented, the accumulation of callee live time into the parent's CBD is the job of the callee.
Looking at the end of its CBD to find its identification number, and using a pointer to its caller
stored in the local frame, the callee can determine the correct parent CBD location to update.
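The accumulation step can be modeled with plain dictionaries. This is a simplified sketch: the field names are hypothetical, times are abstract integers, and the lookup of the callee's identification number and of the parent's CBD is reduced to passing them in as arguments.

```python
# Hypothetical model of folding frame-level statistics into a CBD.

def retire_frame(cbd, frame, now):
    """Accumulate one invocation's statistics just before deallocation."""
    cbd["invocations"] += 1
    cbd["live_time"] += now - frame["alloc_time"]
    for bb, stats in frame["bb"].items():
        acc = cbd["bb"].setdefault(bb, {"fall": 0, "branch": 0, "real_time": 0})
        for key in acc:
            acc[key] += stats[key]

def credit_parent(parent_cbd, callee_id, live_time):
    """Performed by the callee: record its call and live time in the parent."""
    entry = parent_cbd["callees"].setdefault(
        callee_id, {"calls": 0, "live_time": 0})
    entry["calls"] += 1
    entry["live_time"] += live_time

cbd = {"invocations": 0, "live_time": 0, "bb": {}, "callees": {}}
frame = {"alloc_time": 100,
         "bb": {0: {"fall": 3, "branch": 1, "real_time": 40}}}

retire_frame(cbd, frame, now=250)
retire_frame(cbd, frame, now=300)
credit_parent(cbd, callee_id=7, live_time=150)
```

Keeping the per-invocation numbers in the frame until this point is what prevents concurrent invocations of the same procedure from interfering with each other.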
4.2.2 Instrumenting a Partition
This initial implementation of Qprof applies a uniform instrumentation strategy to each partition.
First, the partition is classified as either a cthread, inlet, or standard thread. Since the register
allocator assumes the frame pointer (FP) always contains a valid storage base register, if the partition
is a cthread (an initialization thread called from C), code must be added to save the nonvolatile
registers, 13-31, as per C convention, and move the frame pointer from the argument register (r3)
where it lies upon partition entrance to the FP register. Next, the basic block topology of the
original partition is analyzed and a header and trailer are inserted before and after each basic block
to increment the invocation count of that basic block and update its live time. The implementation
distinguishes between fall-through and branching exits from a procedure, since they lead to different
delays, but for illustrative purposes, a simplified header is shown in Figure 4-10 above which counts
all invocations identically. The registers r100,r101,etc., are from the infinite register model since
the instrumentation is added before register allocation. The exact header and trailer format used,
showing how double trailers were added to each basic block, can be found in Figure 4-11.
The basic block ID is summed with the frame pointer before entering a basic block so that if the
basic block exits via a branch and encounters a different trailer, its information still gets stored into
the correct frame slots. If the base register were hard-coded as rFP, then we would have to place the
branch ending a profiled basic block after the trailer, outside the instrumentation code. Only one
instruction of execution time overhead would be eliminated, but the I-cache overhead would be cut
significantly, since the current scheme, which requires two copies of the trailer for each basic block (one
for fall-throughs and one for branches to the succeeding basic block), could be abandoned in favor
[Figure panels: CBD Statistics Slots (accumulated over the program lifetime); Live Frame Statistics Slots (accumulated over a single invocation)]
Figure 4-9: The QProf runtime data structures.
    cal     r100, 4*5(rFP)       # save 4X the basic block's ID no., plus the FP, in r100
    mfrtcl  rSTAT                # grab the lower word of the start time into a special reg.
    <original basic block, bb#n>
<bb#(n+1)'s label>:              # here's where jump entries to the next basic block enter
    mfrtcl  r101                 # grab the lower word of the end time and save in r101
    lwz     r102, 0(r100)        # grab the invocation count of the BB and save in r102
    lwz     r103, 8(r100)        # grab the low word of the BB's live time into r103
    addi    r102, r102, 1        # increment the invocation count
    stw     r102, 0(r100)        # store back the invocation count
                                 # grab the high word of the BB's live time into r104
                                 # put elapsed time in r102, 0 if the low word rolled over
                                 # sum the elapsed time with the old live time
                                 # do the carry
    stw     r103, 8(r100)        # store back the low word of the live time
    stw     r104, 12(r100)       # store back the high word of the live time
Figure 4-10: A simplified view of the instrumentation header and trailer applied to each basic block.
The doz instruction stands for "difference or zero," and prevents a rollover of the RTC's low word
from throwing off the measurements.
    cal     r100, 4*5(rFP)       # save 4X the basic block's ID no., plus the FP, into r100
    mfrtcl  rSTAT                # grab the lower word of the start time into a special reg.
    <original user basic block, bb#n>
# FALL THROUGH TRAILER -- put inv. count in first slot
    mfrtcl  r101                 # grab the lower word of the end time and save in r101
    lwz     r102, 0(r100)        # grab the invocation count of the BB and save in r102
    lwz     r103, 8(r100)        # grab the low word of the BB's live time into r103
    addi    r102, r102, 1        # increment the invocation count
    stw     r102, 0(r100)        # store back the invocation count
    b       join_point
# JUMP ENTRY TRAILER -- put inv. count in second slot
<bb#(n+1)'s label>:              # here's where jump entries to next user basic block enter
    mfrtcl  r101                 # grab the lower word of the end time and save in r101
    lwz     r102, 4(r100)        # grab the invocation count of the BB and save in r102
    lwz     r103, 8(r100)        # grab the low word of the BB's live time into r103
    addi    r102, r102, 1        # increment the invocation count
    stw     r102, 4(r100)        # store back the invocation count
join_point:                      # The trailers share a large common section
                                 # grab the high word of the BB's live time into r104
                                 # save elapsed time in r102, or 0 if the low wd rolled over
                                 # sum the elapsed time with the old live time
                                 # do the carry
    stw     r103, 8(r100)        # store back the low word of the live time
    stw     r104, 12(r100)       # store back the high word of the live time
Figure 4-11: The exact instrumentation header and trailers applied to each basic block. A fall
through out of the original basic block takes the first trailer, and a jump entry to the following user
basic block takes the second. The label of basic block #(n+1) is removed from its original location
and placed as shown.
of a single trailer approach. However, since such an "optimization" would remove branches from
measurable real time, the double trailer format was kept.
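The arithmetic performed by the double trailer can be modeled as follows. This is an illustrative Python sketch rather than the PowerPC code of Figure 4-11: the function names are hypothetical, the four-slot layout (fall-through count, jump-entry count, live time low word, live time high word) follows the offsets 0, 4, 8, and 12 in the figure, and treating both live-time words as 32-bit binary values is an assumption.

```python
MASK32 = 0xFFFFFFFF

def difference_or_zero(end_low, start_low):
    """Model of "difference or zero": the elapsed time, or 0 if the
    low word of the clock rolled over between the two readings."""
    return end_low - start_low if end_low > start_low else 0

def trailer(slots, start_low, end_low, via_branch):
    """Update one basic block's slots:
    [fall-through count, jump-entry count, live time low, live time high]."""
    slots[1 if via_branch else 0] += 1          # pick the count slot by exit type
    elapsed = difference_or_zero(end_low, start_low)
    total = slots[2] + elapsed                  # sum elapsed time with old low word
    slots[2] = total & MASK32                   # store back the low word
    slots[3] = (slots[3] + (total >> 32)) & MASK32  # propagate the carry
    return slots

slots = [0, 0, 0xFFFFFFF0, 0]
trailer(slots, start_low=100, end_low=132, via_branch=False)  # low word overflows, carry taken
trailer(slots, start_low=500, end_low=20, via_branch=True)    # clock rolled over, adds 0
```

The second call shows the cost of the rollover guard: the measurement for that invocation is dropped rather than recorded as a huge bogus interval.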
The only complication to this uniform instrumentation process occurs when a branch-and-link
instruction is encountered. Since such an instruction signals a call to a C function, the instrumentor
essentially rewrites the code to vector the C call through a special trailer which (1) updates the
statistics information of the most recently executed basic block, and (2) records the current time to
create a reference point. When the C call returns, the reference point is used to compute and store
into the frame the real time spent in the call. To return to user code, the handler then makes a jump
back to the header of the basic block following the original branch-and-link instruction. Relatively
speaking, the overhead of this instrumentation point (a C call) is more acceptable than that of the
previous set of instrumentation points (the basic blocks), since C calls often take on the order
of 1,000 or more cycles to complete.
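The reference-point mechanism for C calls amounts to wrapping the call in a pair of clock readings. The sketch below is a simplified Python model under stated assumptions: `timed_c_call` is a hypothetical name, the clock is any zero-argument callable, and the five frame slots per C routine are collapsed to a call count and a total time.

```python
def timed_c_call(slots, clock, c_function, *args):
    """Call c_function(*args), charging the elapsed time to slots.

    slots is [call count, total real time]; clock returns the current time.
    """
    reference = clock()                 # record the reference point before the call
    result = c_function(*args)
    slots[0] += 1                       # count the call
    slots[1] += clock() - reference     # accumulate the real time spent in the call
    return result

# A scripted clock makes the example deterministic.
ticks = iter([1000, 2200, 5000, 5900])
slots = [0, 0]
first = timed_c_call(slots, lambda: next(ticks), len, "abc")
second = timed_c_call(slots, lambda: next(ticks), len, "abcd")
```

The two clock readings are a fixed cost per call, which is small next to a callee that runs for a thousand cycles or more.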
The remaining runtime processing of the statistics data, such as accumulating frame information
into the CBD before frame deallocation, and maintaining the invocation count and total live time
of each Id procedure, is carried out by extensions to the Q runtime system. Specifically, the frame
statistics slots are zeroed out and the initial timestamp is cached in the frame during frame
allocation, and the accumulation steps and invocation count handling are done by a hook in the frame
deallocation routine. Both of these modified Q files are listed in Appendix B. Before exiting, the
instrumented program collects information from the CBDs and creates a file, qresults, representing
the real time statistics gathered during execution.
Selective profiling, mentioned at the beginning of this section, has not yet been implemented in
Qprof, although the basic block invocation counts and real execution delays could easily be made
optional for each Id procedure according to user preference. It was felt that adding selective-profiling
features would not be needed on the first version of Qprof, since they would likely provide little
further insight into choosing the best instrumentation points and runtime storage strategies.
4.3 Integrating the Simulator and the Profiler
To implement the instrumentation algorithms discussed above, the instrument-CB and instrument
procedures were designed, and ppc-testparse, which converts *RISC to PowerPC assembly
code, was modified to pass all code through the instrumentor as shown in Figure 3-1. Instrumenting
a program generates a small file, qstat-short, which contains basic block counts, code block
names, and similar items that would be hard for PostProf to reconstruct without help. Finally, measure,
an extension to the simulator that records the instruction mix of each basic block in addition
to its ideal execution time, was incorporated into ppc-testparse so that the complete feed path of
Figure 3-1, save for PostProf, was put into practice. Calling measure generates a long file of static
measures such as ideal time and instruction mix, named qstat-long. The measure and instrument
procedures, together with all the associated instrumentor and simulator extension data types and
procedures, are collected in Appendix C. A dependency diagram appears below as Figure 4-12.
Figure 4-12: An overview of the data types and procedures used to extend ppc-testparse to comply





Figure 4-13: An overview of the data types and procedures of PostProf.
4.4 PostProf
Since the current implementation of Qprof runs on a single processor, the sophisticated graphics
originally conceived for PostProf were not needed. Instead, a simple interactive graphic visualizer
was written which pipes data through gnuplot in order to display it for the user. As currently
written, PostProf can display the total live times and relative live times per invocation of each
partition in an Id code block, or the live times and invocation counts of all C procedures called by
an Id code block, and the live times and invocation counts of all Id procedures called by a given
code block. Furthermore, each partition can be examined individually to see how the theoretical and
actual execution delays compare, and to discover the PowerPC or *RISC instruction mix. PostProf
responds to standard emacs keystrokes for "down" and "up" to advance through the code blocks, and
to "left" and "right" keystrokes to advance through the display options available within a single code
block. To see the instruction mix of a partition, one need only select the partition by
moving to it using the "go right" command, and then invoke Control-S. A useful feature of PostProf
is the rerun option, which is activated by Control-L; it runs the executable again to generate a new






Since the full Q compiler is not yet available, Qprof has been applied to two *RISC programs: call-
test-1.st, representative of a simple Id program which returns a reference to a pair of values, and
fact.st (see Appendix E), an iterative factorial function. To get the desired profile, we call the script
qprof on the *RISC code, which invokes the new ppc-testparse routine to translate *RISC into
PowerPC. By the time ppc-testparse is done, the two static measure files, qstat-short and qstat-
long, have been written out, and target.s contains the instrumented assembly code. Next, target.s
is assembled and linked with the Run Time System, and the resulting executable target.qpf is
called. As target.qpf runs, statistics are collected and, before target.qpf is done, dumped to the
file qresults. The profile can be viewed by calling PostProf as soon as the three statistics files are
available.
5.1 PostProf as a Performance Visualizing Tool
Shown below in Figure 5-1 is a graph of partition execution time for fact.st's first code block as
generated by PostProf. Each partition is represented by two bars-the bar on the left represents
the total amount of execution time accumulated by the given partition, and can be read against the
Y-axis scale, and the bar on the right is a dimensionless quantity representing the relative execution
time per invocation of the same partition. Whereas the left bar, representing total execution time,
can be read off in CPU cycles against the Y-axis, the right bar cannot be since its height only has
significance relative to the per-invocation execution time shown for the other partitions. The purpose
of the right bar is to allow the per-invocation execution times of partitions to be quickly compared.
For code blocks whose partitions are not all invoked the same number of times, it should be clear
that total partition execution time, summed over all invocations, cannot be used for this purpose,
since it is weighted in favor of the more heavily-invoked partitions. In such a case, the right bar of
each partition becomes useful. However, as total execution time was felt to be the most useful gauge
of code block performance, it was given primary emphasis and control of the Y-axis.
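The relationship between the two bars can be made concrete with a short sketch. The Python below is only an illustration: `partition_bars` is a hypothetical name, and scaling the per-invocation bars so that the tallest one matches the tallest total bar is an assumed convention, since the text says only that the right bar is dimensionless and meaningful relative to the other partitions.

```python
def partition_bars(totals, counts):
    """Return (left bar, right bar) heights for each partition.

    totals: total execution time per partition, in CPU cycles.
    counts: invocation count per partition.
    """
    per_call = [t / c if c else 0.0 for t, c in zip(totals, counts)]
    peak_per_call = max(per_call)
    # Assumed convention: stretch the per-call values to share the Y-axis range.
    scale = max(totals) / peak_per_call if peak_per_call else 0.0
    return [(t, p * scale) for t, p in zip(totals, per_call)]

# One heavily invoked partition and one invoked a single time.
bars = partition_bars(totals=[800.0, 200.0], counts=[100, 1])
```

Here the first partition dominates the total time only because it ran 100 times; the right bars reverse the ordering and show that the second partition is far more expensive per invocation.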
The convention of pairing two bars with each partition, one of them a dimensionless quantity, was
adopted to allow the two quantities to be displayed on the same screen with a single marked Y-axis.
Above the left bar, the total number of invocations of that partition is listed in the form of a factor
such as "1000X", and at the top of the screen some statistics on the whole code block are listed, such
as the number of times it has been invoked, its total live time, and its total work time. While live
time measures how long an activation frame for the given code block remains allocated, work time
measures the time spent in the code block executing user code. For a code block which calls C or Id
procedures, the total live time will exceed the total work time. A similar plot for the second code
block in fact.st is shown in Figure 5-2. Note that most time is spent inside FACT.part0, where the
iteration occurs. The large discrepancy between live and work time is due mostly to the overhead
associated with allocating and deallocating the activation frame, and with scheduling: live time begins at
some point during allocation, and ends at some point during deallocation. A less significant factor is
the 80% instrumentation overhead present in FACT, which appears as live time but not work time.
PostProf can also focus on performance data from a single partition. In Figure 5-3 the execution
of the partition FACT.part0 is broken down into statistics for each basic block. Each basic block is
represented in the plot by two vertical bars-the left bar representing actual execution time and the
right bar indicating the ideal execution time in the absence of cache misses, interrupts, etc. Below
the pair of bars is listed the total number of invocations of the particular basic block. From the plot,
it's evident that most of the time spent in this partition is spent in the sixth basic block, the most
complex basic block inside the iteration loop. The *RISC and PowerPC instruction mix graphs for
FACT.part0 follow as Figures 5-4 and 5-5. From the *RISC instruction mix plot of FACT.part0,
it's clear that the register allocator is responsible for generating more PowerPC instructions than all
the *RISC instructions in that partition combined! The reason is that the register allocator for the Q
back end currently spills every register, storing each value into the frame after it gets produced, and
loading it back right before it is needed again. Hopefully a better register allocator will be available
for Q in the future. (The *RISC instruction mix is computed by mapping each PowerPC instruction
present back to the *RISC instruction type which generated it, or to the register allocator.)
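The mapping described in the parenthetical above is a simple tally. The following Python sketch assumes each emitted PowerPC instruction carries a tag naming its origin; the function name, the tag strings, and the sample instruction list are all hypothetical and are not taken from Qprof's data files.

```python
from collections import Counter

def risc_instruction_mix(powerpc_instructions):
    """Tally PowerPC instructions by origin.

    Each item is (powerpc_mnemonic, origin), where origin names the *RISC
    instruction type that generated it, or "register allocator" for spill code.
    """
    return Counter(origin for _mnemonic, origin in powerpc_instructions)

# A made-up fragment in which every value is spilled to and reloaded from the frame.
sample = [
    ("lwz", "register allocator"),
    ("lwz", "register allocator"),
    ("add", "fixed ALU op"),
    ("stw", "register allocator"),
    ("cmpi", "fixed ALU op"),
    ("bc", "control op"),
]
mix = risc_instruction_mix(sample)
```

Even in this tiny fragment the spill loads and stores match the count of all other instructions combined, which mirrors the effect the plot reveals.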
PostProf is also capable of displaying a performance snapshot of calls to Id procedures and C
routines. After running Qprof on call-test-1.st, this feature of PostProf produced Figures 5-6
and 5-7. In the first of the two, we see that much more time is spent in C calls allocating and




[Plot: Livetime 3021742.1 cpu cycles; legend: Tot. Exe Time]
[Plot: Livetime 1229200.9 cpu cycles; Work time 7618.7 cpu cycles; legend: Tot. Exe Time, Ave ET/call (scaled)]
Figure 5-2: PostProf displays partition execution time for the partitions in fact.st's second code
block.













[Plot: FACT.part0 Basic Blocks; 100 invocations; Livetime 63082 cpu cycles; x-axis: BB Number (0 to 13); legend: Tot. Real Exe. Time, Tot. Ideal Exe. Time]
Figure 5-3: PostProf displays the actual and ideal execution times of the basic blocks in partition
FACT.part0.
[Plots: PowerPC instruction mix and *RISC instruction mix for FACT.part0]
and relative execution time per invocation can be seen in the case of RTS-istore-handler, which was
called twice as often as the other three C procedures and which thus has a relative execution time
per invocation that is shorter, compared to its total execution time, than is the case with the other three
procedures. Looking at the second figure, we see that no calls to MAIN or TEST01 were made from
procedure TEST01, but that one call is made to TEST02 each time TEST01 is run. In the event
several code blocks were called from within a single Id procedure, this call-graph snapshot provided
by PostProf would allow the user to determine which callee produced the longest delay.
In summary, Qprof can measure the real execution time of Id procedures, much as prof can
calculate the real time spent in C procedures. Moreover, the callee invocation counts and callee live
times which Qprof maintains for each Id procedure can be used to construct a call graph similar to,
but more accurate than, the one produced for C routines by gprof. The additional accuracy comes
from the fact that gprof propagates execution time up the call graph by using invocation counts,
whereas Qprof collects actual callee live time at run time. Finally, Qprof also keeps track of the ideal
time required by basic blocks, much as pixie does. However, unlike pixie, Qprof can use ideal time
to present the user with memory hierarchy overhead time, since it also has real time measurements of
execution available. In short, Qprof incorporates the features of prof, gprof, and pixie into a cohesive
whole that can accomplish more than any tool designed around a single approach to performance
measurement. Best of all, Qprof is scalable to a network of symmetric multiprocessors.
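The difference between count-based propagation and measured callee time can be seen in a small example. The Python below is a sketch with hypothetical function names and made-up numbers: one callee is called once each by two callers, but the calls differ greatly in cost.

```python
def attribute_by_counts(callee_total, calls_by_caller):
    """Split a callee's total time among callers in proportion to call counts,
    which is how a count-based call graph profile propagates time."""
    all_calls = sum(calls_by_caller.values())
    return {caller: callee_total * n / all_calls
            for caller, n in calls_by_caller.items()}

def attribute_by_measurement(measured_by_caller):
    """Use the callee live time actually recorded on behalf of each caller."""
    return dict(measured_by_caller)

calls = {"A": 1, "B": 1}            # each caller invokes the callee once
measured = {"A": 900, "B": 100}     # but the call from A takes nine times longer
estimated = attribute_by_counts(sum(measured.values()), calls)
actual = attribute_by_measurement(measured)
```

The count-based estimate charges each caller 500 cycles, while the measured figures show that caller A is responsible for 900 of the 1000.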
5.2 Shortcomings of QProf
Although an effective performance visualizing tool even in its first version, QProf suffers from some
difficulties. The most obvious is that Qprof generates a large instrumentation overhead. For avail-
able test programs, the overhead has in some cases approached 100 percent. In other words, the
instrumented executable takes twice as long to run as the original user code, despite the effort made
to keep the instructions added during basic block instrumentation to a bare minimum. Now on a
single processor, profiling overhead for schemes such as the one we employ only perturbs cache hit
rates since no live instrumentation registers are maintained inside user code. While altered cache
behavior can indeed create a probe effect, there can be no doubt that the probe effect of such a large
overhead would be much more significant in a network of processors in which timing patterns could
be thrown off. A remedy to this problem is discussed in the next chapter.
Aside from the question of overhead, the most puzzling result obtained is perhaps the large variation
between observed and ideal time for some basic blocks profiled. Because of the inaccuracies present








[Plot: x-axis: C Function (1 to 4)]
Figure 5-6: PostProf displays the time spent in C calls inside the code block TEST01.
[Plot: 100 invocations; legend: Tot. Live Time, Ave LT/call (scaled)]
Figure 5-7: PostProf displays the live time and
called from within TEST01.




















than a single ancestor, the calculated ideal time Qprof displays is sometimes one or two percent
off the true ideal value. This explains why the real execution time is sometimes shown marginally
below the ideal time in PostProf plots such as Figure 5-3, where such a difference can be seen
for the sixth basic block. Even when present, this phenomenon has never been significant enough
to be considered a flaw in Qprof's design.
By contrast, during the testing of Qprof it was often found that the real time required to execute
a segment of code was significantly greater than the ideal time, even in cases where no memory
hierarchy effects should be present. A typical example of this can be seen in Figure 5-8, in which
the third basic block's actual execution time is about 16 cycles, or 89%, longer than the predicted
ideal. A number of factors may be responsible for the inflation of real time recorded.
The first possibility is that the simulator is deficient in a way which causes it to underreport the
total number of cycles required in some instances. To eliminate the simulator from consideration,
a simple experiment was performed. Given a basic block with nearly equal ideal and actual times
according to PostProf, a fixed number of arithmetic instructions were inserted into the block. When
PostProf was run, it was found that the actual time per invocation of the basic block in question had
jumped by a significantly higher amount than the number of arithmetic instructions inserted. This
indicates that the simulator is not the source of the overreported actual time, since the simple code
sequence inserted had a trivially-easy-to-compute ideal time, and still the actual time expanded by
a significantly larger amount.
Interrupts need not be considered as the possible source, since the real time had a tendency to
jump up in steps during the arithmetic instruction insertion experiment discussed above. In other
words, the real time remains nearly constant as 1 through n arithmetic instructions are inserted into
a basic block, but suddenly jumps to a higher value when the (n + 1)st instruction is added. An
interrupted instruction stream would have a higher measured real time delay, since interrupts take
a while to service, and would not have the particular step-like behavior described for such short
sequences (<< an interrupt interval) of inserted instructions.
The next potential culprit, the graininess of the Run Time Counter relative to the CPU clock,
was known and feared before Qprof had even been written. In the case of the 42 MHz RS/6000
550 used in the testing of Qprof, the RTC ticks only once every 10.8 CPU clock cycles. It follows
immediately that unless several iterations are performed through a basic block with the RTC phase
randomly aligned with the first instruction, small basic blocks cannot be accurately timed using the
RTC. To attempt to prove this factor responsible, the executables originally compiled for the 550
were recompiled with a few modified constants and run on an RS/6000 320, whose CPU clock runs








[Plot: x-axis: BB Number (0 to 5)]
Figure 5-8: An illustration of the discrepancies between ideal and actual time in one of PostProf's





were responsible for the elevated real execution time, the discrepancies on the 320 should be nearly
half what they were on the 550.
Surprisingly, no substantial change to the real time inflation was observed when running executables
on the 320. When the third basic block shown in Figure 5-8 is examined by Qprof on the 320, the
overreported actual time is still present. Further investigation running a
large number of iterations of fact.st revealed that the Qprof runtime system subroutine designed
to randomize the phase of the RTC with respect to the timed edges of each basic block, or some
other factor such as interrupts, was effective at making the granularity of the RTC disappear when
many iterations were run. When the Qprof executable for fact.st is run 1000 times in succession to
accumulate data for 1000 trials, the results produced by the 320 and the 550 are nearly identical.
Unfortunately, this disqualifies timer resolution as a candidate for the observed anomaly.
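The claim that timer granularity averages out over many runs with a randomized phase can be checked with a small model. The Python below is a sketch under stated assumptions: the function names are hypothetical, the 18-cycle block is an arbitrary example, and evenly spaced phases stand in for the randomization performed by the runtime subroutine. Only the 10.8 cycles-per-tick figure comes from the text.

```python
import math

def measured_ticks(block_cycles, phase, cycles_per_tick):
    """Timer ticks observed across a block that starts `phase` cycles after a tick."""
    end = math.floor((phase + block_cycles) / cycles_per_tick)
    start = math.floor(phase / cycles_per_tick)
    return end - start

def mean_measured_cycles(block_cycles, cycles_per_tick, samples=1000):
    """Average the measurement over phases spread evenly across one tick period."""
    phases = [cycles_per_tick * i / samples for i in range(samples)]
    ticks = sum(measured_ticks(block_cycles, p, cycles_per_tick) for p in phases)
    return ticks * cycles_per_tick / samples

RTC_CYCLES_PER_TICK = 10.8   # RS/6000 550 figure quoted in the text
one_run_early = measured_ticks(18, 0.0, RTC_CYCLES_PER_TICK)   # sees 1 tick
one_run_late = measured_ticks(18, 5.0, RTC_CYCLES_PER_TICK)    # sees 2 ticks
average = mean_measured_cycles(18, RTC_CYCLES_PER_TICK)
```

A single run reports either 10.8 or 21.6 cycles for the 18-cycle block, but the average over many phases lands close to 18, consistent with the observation that the 320 and 550 agree once 1000 trials are accumulated.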
Certainly the actual time will reflect memory hierarchy overhead as it was designed to do, but
there is still a legitimate question as to whether the particular anomaly observed is also due to
memory access operations. However, the previous experiment also sheds light on the current question.
Namely, since adding a sequence of arithmetic operations to a basic block produced a large
differential between actual and ideal time where no significant difference had existed before, data
memory operations don't seem to be involved. Here we ignore the possibility that the remaining
instructions from the original basic block, perhaps containing some memory operations, are made to
take D-cache misses by the newly inserted arithmetic instructions. Our justification is that register
operations inserted directly into the assembly code should not change D-cache behavior.
Only one possibility remains-I-cache misses. In order to check whether the I-cache could be
responsible for the anomaly, a more detailed study of real time inflation in the presence of added
instructions was conducted. The target basic block was the one discussed in conjunction with
Figure 5-8, and the added instructions were addi operations. First on the 550, and then on the 320,
the real time inflation was measured as a function of added instructions for 1000 iterations of fact.st.
The resultant plots appear as Figures 5-9 and 5-10. As can be seen, instead of the expected linear
relationship, the plots are quasiperiodic with a period of 16 instructions. This means the period
cannot be related to the RTC or CPU clock. In fact, it turns out that since the I-cache has a line
size of 64 bytes, this is exactly the number of instructions in an I-cache line! The conclusion seems
clear that the I-cache is responsible for the observed behavior. As a check, note that the plateaus
are separated vertically by roughly 20 cpu cycles on each plot, exactly the time required to execute
a row of instructions from the I-cache and load a second row from main memory. To see this, note
that when an I-cache miss occurs, a delay of eight cycles ensues, after which the 16 instructions in










[Plot: x-axis: Added Arithmetic Instructions (0 to 35)]
Figure 5-9: A plot of total real execution time as a function of the number of instructions added to
a basic block. This plot represents 1000 runs of fact.st on the RS/6000 550.
this means that every 16 instructions a delay should be seen of exactly 8 - 4 = 4 cycles. When these
four cycles of delay are added to the 16 required to execute the instructions in a line, we obtain the
20 cycle figure as just mentioned.
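The figures quoted in this argument fit together as a short calculation. The Python below only restates numbers given in the text (64-byte line, 8-cycle miss delay, 4 of those cycles not exposed, one cycle per instruction); the constant names are hypothetical, and the 4-byte instruction width is the fixed PowerPC instruction size.

```python
LINE_BYTES = 64          # I-cache line size stated in the text
INSTRUCTION_BYTES = 4    # fixed PowerPC instruction width
MISS_DELAY = 8           # cycles of delay on an I-cache miss, per the text
HIDDEN_CYCLES = 4        # part of the miss delay the text subtracts as covered

instructions_per_line = LINE_BYTES // INSTRUCTION_BYTES       # period of the plots
exposed_delay = MISS_DELAY - HIDDEN_CYCLES                     # visible stall per line
cycles_per_line = instructions_per_line + exposed_delay        # plateau spacing
```

The 16-instruction period and the 20-cycle plateau spacing seen in Figures 5-9 and 5-10 both fall out of the line size and miss delay alone, with no reference to the RTC or the CPU clock rate.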
Although the exact shape of the graph cannot be explained, the sharp rises probably occur when
the last instruction in the basic block, a branch, advances from being the last instruction in a
particular I-cache line to being the first instruction in the next line loaded. Since the branch
automatically causes a new I-cache line to be loaded, pushing it off the end of the line causes an
additional 8-cycle delay to arise which cannot be covered. The flat portion before this sharp rise
occurs when the branch is near the end of the I-cache line. Since the branch target line must be
fetched from main memory anyway, several instructions may be executed "for free" during the fetch
latency.
[Plot: x-axis: Added Arithmetic Instructions (0 to 40)]
Figure 5-10: A similar plot, again showing the total real execution time as a function of the number
























After having implemented an initial version of Qprof, it has become clear that although it is an
effective profiling tool on a single-processor system, its present overhead will generate a heavy probe
effect in parallel environments. Four key optimizations must be made to the current version of Qprof
before it can become an effective tool on a multiple-processor network. Runtime statistics must be
stored more efficiently, basic block counter placement must be optimized, instrumentation points
must be moved to larger objects, and the instrumentation code must be made to use both functional
units.
6.1 Storing the Runtime Statistics More Efficiently
First, the runtime system must be modified to allow a variable number of statistics slots to be stored
in each CBD. The number of storage slots available in the CBD should be a function of the code
block rather than a global constant. Such an improvement would both save space when few statistics
need be collected in a given code block, and at the same time allow for a code block requiring an
arbitrarily large number of statistics slots. Instead of actually making the CBD of variable size, the
goal can be accomplished as shown in Figure 6-1, where only two dedicated slots in the CBD are
required to accumulate basic block and called procedure statistics. When a CBD is created, the top
location is set to the number of slots that will be required to store runtime statistics on all the basic
blocks in that code block together with all the procedures called by that code block. In the bottom
location, we store a pointer that is originally null. If a given code block is eventually invoked during
the execution of a particular program, space to hold the precalculated number of slots is malloc'ed,
and a pointer to that space is stored back into the bottom location. This approach has an advantage












Figure 6-1: A scheme for making variable the amount of profiler storage associated with CBDs. The
top slot contains a count of how many words should be malloc'ed by the runtime system, and the lower
slot contains a pointer, possibly null, to the allocated space.
is not allocated unless the associated code block is called at least once.
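The two-slot scheme of Figure 6-1 is a form of lazy allocation. The Python sketch below models it under stated assumptions: the class name, the `on_invoke` method, and the `slots_per_item` parameter are hypothetical, `None` stands in for the null pointer, and a Python list stands in for the malloc'ed region.

```python
class CodeBlockDescriptor:
    """Model of a CBD with two dedicated profiler slots: a count and a pointer."""

    def __init__(self, n_basic_blocks, n_callees, slots_per_item=4):
        # Top slot: number of statistics slots this code block will need,
        # computed once when the CBD is created.
        self.slot_count = (n_basic_blocks + n_callees) * slots_per_item
        # Bottom slot: pointer to the statistics storage, initially null.
        self.slots = None

    def on_invoke(self):
        """Allocate the statistics storage on the first invocation only."""
        if self.slots is None:
            self.slots = [0] * self.slot_count
        return self.slots

cbd = CodeBlockDescriptor(n_basic_blocks=3, n_callees=2)
before = cbd.slots            # nothing allocated for a code block never called
first = cbd.on_invoke()       # first invocation allocates the slots
again = cbd.on_invoke()       # later invocations reuse the same storage
```

A code block that is never invoked costs only the two dedicated slots, while a large code block can request as many statistics slots as it needs.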
6.2 Separating Real Time Collection and Invocation Counting
Next, the maintenance of basic block counts must be divorced from collection of elapsed real time to
allow graph-theory optimizations to be performed on basic block counter placement. As discussed
in-depth in the last section below, a minimal set of counters placed at the correct locations in each
code block can be used to reconstruct a full set of invocation counts for the basic blocks of every
partition. Unless real time collection and invocation counting are separated, therefore, the real time
measurements at many basic blocks will be lost. A slight complication that must be dealt with is
that two levels of control graphs must be studied in working with the set of basic blocks in a code
block-the *RISC control flow between partitions, and the assembly language control flow between
basic blocks in a partition. Even a naive application of the techniques of Section 6.6 should result
in an overhead reduction of 50% or more.
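The idea of reconstructing a full set of counts from a smaller set of counters rests on flow conservation. The Python below illustrates it on the simplest case, an if/else diamond; it is a generic illustration rather than the placement algorithm of Section 6.6, and the function name and block labels are hypothetical.

```python
def reconstruct_diamond(entry_count, then_count):
    """Recover all four block counts of an if/else diamond from two counters.

    Every execution of the entry block takes exactly one arm, and both arms
    fall into the join block, so the uncounted blocks follow by subtraction.
    """
    return {
        "entry": entry_count,
        "then": then_count,
        "else": entry_count - then_count,
        "join": entry_count,
    }

counts = reconstruct_diamond(entry_count=1000, then_count=900)
```

Two counters yield four counts here, so half of the counting instrumentation is avoided on this shape of control flow.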
6.3 Collecting Real Time at Coarser Intervals
Unlike invocation counts, real time must be collected at every point where it is desired-it can't be
reconstructed from a small set of measurements. Since it cannot be so optimized, the instrumentation
to collect it must be moved to coarser objects to alleviate the heavy overhead currently experienced
at each basic block. To maintain an acceptable level of overhead, the instrumentation points for
real time must be changed from basic blocks to partitions. As it turns out, 8 out of the 13 cycles
in the current Qprof header/trailer are required to accumulate the real time, and this is too great a
price to pay at each basic block. However, most partitions are long enough to make such a one-time
overhead acceptable. In addition, if the programmer knew a given application would run for strictly
less than one second, those eight cycles could be optimized down to five by ignoring the upper word
of the time. Since real time must be used to find memory hierarchy delays, our change will mean
that such effects can only be localized to a partition rather than a basic block, but this is an
acceptable loss since partitions represent a relatively fine-grained division of an Id procedure.
6.4 Using Both the Fixed and Floating Point Units
Finally, the invocation count headers and trailers should in some cases be made to use floating-point
registers. Since independent pipelines exist for floating-point and fixed-point operations, such a
scheme could save at least one cycle at each instrumentation point. No workable means exists to
perform the real time collection with the floating point unit, since the real time must be brought
through the fixed-point unit during retrieval from the RTC. However, the invocation-counting code
can easily be made to use the floating-point unit by permanently assigning one of the floating-point
registers to the profiler and keeping the value 1.0 in it at all times. While dependent floating-point
operations do create a pipeline bubble because of the latency of the two-stage multiply-add logic,
this delay only occurs if the second instruction is a floating-point register-to-register operation. If
the second instruction is a store, no delay ensues because of the floating-point unit's pending store
queue and data store queue. Fortunately, the code sequence for updating an invocation count follows
that pattern exactly: a calculation followed by a store. Moreover, since the fixed-point unit
often runs ahead of the floating-point unit by a few instructions, and the fixed-point unit is the one
that computes the address of a floating-point load and makes the D-cache request, the one-cycle
load-use delay found in fixed-point code is usually not present in floating-point code.
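The counting scheme described above can be mimicked in portable code. The sketch below keeps an invocation count in a double and updates it by adding a constant 1.0, which stands in for the dedicated floating-point register. `FpCounter`, `bump`, and `as_integer` are hypothetical names, and the sketch assumes IEEE-754 doubles, in which every integer below 2^53 is represented exactly.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical statistics slot that stores an invocation count as a double.
struct FpCounter {
    double count = 0.0;
};

// Every integer below this limit (2^53) is exactly representable in an
// IEEE-754 double, so counts stay exact until then.
const double kExactLimit = 9007199254740992.0;

// The update the text describes: one floating-point add, then a store.
// 'one' plays the role of the register permanently holding 1.0.
inline void bump(FpCounter& c, double one) {
    c.count = c.count + one;
}

// Convert the count back to an integer for post-processing.
std::uint64_t as_integer(const FpCounter& c) {
    assert(c.count >= 0.0 && c.count < kExactLimit);
    return static_cast<std::uint64_t>(c.count);
}
```

The point of the sketch is only that no precision is lost for realistic counts; whether the add and store actually overlap with fixed-point work depends on the pipeline behavior described in the text.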
6.5 Improvements for the Multiple-Processor Environment
Other optional improvements might also be considered. For example, we could collect the idle time
on a per-processor basis. The idle time is defined as the time spent unsuccessfully in the scheduling
loop looking for enabled threads in allocated frames belonging to user code blocks. By modifying the
runtime system's low-level assembly kernel, the scheduling loop could be appropriately instrumented
to collect elapsed time. A possible technique for more accurately gauging the cause of idle time
would involve dividing the idle time evenly among the code blocks associated with live frames (on
that processor); this would allow the idle time to be a per-code-block rather than a per-processor
statistic. The reasoning is that the live code blocks are waiting for some result to arrive; if
all computation at that processor were simply finished, all frames would have been
deallocated.
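The even division of idle time can be sketched as follows. This is a minimal illustration under stated assumptions: `IdleTable` and `attribute_idle` are hypothetical names, code blocks are identified by name, the caller passes each code block with a live frame exactly once, and the catch-all entry for the case with no live frames is an invention of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Accumulated idle time per code block, keyed by a hypothetical block name.
typedef std::map<std::string, double> IdleTable;

// Charge one idle interval evenly to the code blocks that own live frames
// on this processor. With no live frames, the time goes to a catch-all entry.
void attribute_idle(double idle_time,
                    const std::vector<std::string>& live_blocks,
                    IdleTable& table) {
    assert(idle_time >= 0.0);
    if (live_blocks.empty()) {
        table["<no live frames>"] += idle_time;
        return;
    }
    double share = idle_time / static_cast<double>(live_blocks.size());
    for (std::size_t i = 0; i < live_blocks.size(); ++i)
        table[live_blocks[i]] += share;
}
```

Calling `attribute_idle` once per exit from the scheduling loop turns the per-processor idle total into a per-code-block statistic, at the cost of walking the list of live frames.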
As a tentative proposal, it might also be possible to collect total synchronization times (the total
actual time between arrival of the first and last thread at a join control point, summed over all
invocations) for each *RISC join in a code block. In order to facilitate such a scheme, room would
have to be made in the statistics slots of the frame for a timestamp, and additional overhead would
have to be taken at each *RISC join. Whether the feedback is worth the overhead is
a matter for further research.
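A minimal sketch of the bookkeeping this proposal would need is given below. The per-frame timestamp slot and the per-code-block accumulator are modeled by the hypothetical types `JoinSlot` and `JoinTotal`; the caller is assumed to know when the arriving thread is the last one at the join.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical per-frame slot for one join: remembers the first arrival.
struct JoinSlot {
    bool waiting = false;
    std::uint64_t first_arrival = 0;
};

// Hypothetical per-code-block accumulator for the same join.
struct JoinTotal {
    std::uint64_t sync_time = 0;
};

// Called each time a thread reaches the join. The first arrival stores a
// timestamp in the frame; the last arrival adds the elapsed time to the total.
void thread_arrives(JoinSlot& slot, JoinTotal& total,
                    std::uint64_t now, bool is_last) {
    if (!slot.waiting) {
        slot.waiting = true;
        slot.first_arrival = now;
    }
    if (is_last) {
        assert(now >= slot.first_arrival);
        total.sync_time += now - slot.first_arrival;
        slot.waiting = false;
    }
}
```

Each arrival costs at least a test and possibly a timestamp read, which is the additional per-join overhead the text refers to.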
6.6 Basic Block Invocation Count Optimizations
In order to run the algorithm allocating a minimal set of basic block counters in the most efficient
configuration, a complete map of control flow through the basic blocks of a code block must be
obtained. This entails a two-step process. First, working at the partitioned program graph or *RISC
level, a heuristic must be applied to each code block to build a directed graph of a special type, T.
What makes graphs of type T special is that each node can either represent an ordinary graph node,
or contain a list of graphs, themselves of type T. In such a scheme, partitions in *RISC become
ordinary nodes, and control flow such as a conditional branch can be represented by multiple edges
leaving a single node. The special feature of type T is used to represent synchronization. Sequences
of partitions which must synchronize with respect to each other are placed within a single node, and
can themselves contain control-flow substructure. A sample type T graph is shown in Figure 6-2; it
happens to be the graph of type T corresponding to the Fib code block described in Culler's TAM
paper [3].
Figure 6-2: An instance of a graph of type T. This one represents the *T code block implementing
Fibonacci in Culler's TAM paper.
Second, after a graph of type T is obtained for each *RISC-level code block, a simple basic block
detection algorithm like that implemented in the simulator's nodemap module can be used to
construct a directed graph of the basic blocks in each partition. After flattening¹ the compound
nodes by recursively replacing each one with its interior control graphs linked in series, and inserting
each partition's basic block map into its node on the graph, a single directed graph of basic blocks
is obtained for the entire code block. The reason this strategy yields a valid graph for the purpose
of invocation counting is that treating the parallel, synchronizing threads as sequentially executed
threads enforces the constraint that if one thread in the set is executed n times, every member of
the set must be.
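The flattening step described above can be sketched as follows. This is an illustrative data structure, not the one used in the Q back end: `TNode`, `TGraph`, `Flat`, `Span`, and `flatten` are hypothetical names, node 0 of each graph is assumed to be its entry, and only partition-level flow is modeled (inserting each partition's basic block map is left out).

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

struct TNode;
struct TGraph {                      // a graph of type T; node 0 is the entry
    std::vector<TNode> nodes;
};
struct TNode {
    std::string name;                // partition name (ordinary nodes)
    std::vector<TGraph> interior;    // non-empty for a compound node
    std::vector<int> succ;           // successor indices within the same graph
};

struct Flat {                        // the flattened result
    std::vector<std::string> names;
    std::vector<std::pair<int, int> > edges;
};
struct Span {                        // where a flattened piece starts and ends
    int entry;
    std::vector<int> exits;
};

Span flatten(const TGraph& g, Flat& out);

// Ordinary node: one flat node. Compound node: interior graphs in series.
static Span flatten_node(const TNode& n, Flat& out) {
    if (n.interior.empty()) {
        out.names.push_back(n.name);
        int id = static_cast<int>(out.names.size()) - 1;
        return Span{id, {id}};
    }
    Span whole = flatten(n.interior[0], out);
    for (std::size_t k = 1; k < n.interior.size(); ++k) {
        Span next = flatten(n.interior[k], out);
        for (int x : whole.exits) out.edges.push_back({x, next.entry});
        whole.exits = next.exits;
    }
    return whole;
}

Span flatten(const TGraph& g, Flat& out) {
    assert(!g.nodes.empty());
    std::vector<Span> spans;
    for (const TNode& n : g.nodes) spans.push_back(flatten_node(n, out));
    Span result{spans[0].entry, {}};
    for (std::size_t i = 0; i < g.nodes.size(); ++i) {
        if (g.nodes[i].succ.empty())          // no successor: an exit of g
            result.exits.insert(result.exits.end(),
                                spans[i].exits.begin(), spans[i].exits.end());
        for (int s : g.nodes[i].succ)         // reconnect original edges
            for (int x : spans[i].exits)
                out.edges.push_back({x, spans[s].entry});
    }
    return result;
}
```

For a graph A, then a compound node holding the single-node graphs B and C, then D, the sketch yields the serial chain A, B, C, D, which enforces the constraint that synchronizing threads are counted the same number of times.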
6.6.1 Adding Optimal Counters to a Basic Block Graph
The key idea is the following: consider the control flow graph as an undirected graph G, and remove
a spanning tree to generate G'; then the flow of control in G (in terms of the number of traversals of each
edge) is uniquely determined by the flow of control in G'. Thus, by removing a maximum spanning
tree (where the edges have been weighted by expected number of traversals), the remaining edges
contain all the information necessary to reconstruct an invocation count of the basic blocks, and have minimal
overhead.² To recover the "lost" edge data later in PostProf, we follow Goldberg's algorithm [5]:
1. Set the labels of all edges in S = G-G' (the spanning tree) to 0; each edge of G' keeps its measured count as its label.
2. For each edge e that is not in S, there is a unique cycle in G, called a fundamental
cycle, composed of e and edges from S. Traverse the cycle in the direction of e, edge by edge. For each edge
in S on the cycle, add the label of e to the label of the edge if the direction of traversal matches the
direction of the edge; otherwise subtract e's label.
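The placement and recovery steps above can be sketched in a few dozen lines. This is an illustration under stated assumptions, not PostProf's code: `Edge`, `max_spanning_tree`, and `reconstruct` are hypothetical names, Kruskal's algorithm is used to pick the tree, the graph is assumed to be connected and to include the exit-to-entry pseudo-edge so that counts are conserved at every node, and self-loops and two-node loops are assumed to have been expanded already.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

// A directed control-flow edge; weight is the expected number of traversals.
struct Edge { int from; int to; long weight; };

typedef std::vector<std::vector<std::pair<int, int> > > TreeAdj; // (node, edge)

static int find_root(std::vector<int>& parent, int x) {
    while (parent[x] != x) { parent[x] = parent[parent[x]]; x = parent[x]; }
    return x;
}

// Kruskal on the undirected view, heaviest edges first. Edges marked true
// form the spanning tree S; the others must be instrumented.
std::vector<bool> max_spanning_tree(int n, const std::vector<Edge>& edges) {
    std::vector<int> parent(n);
    std::iota(parent.begin(), parent.end(), 0);
    std::vector<int> order(edges.size());
    std::iota(order.begin(), order.end(), 0);
    std::stable_sort(order.begin(), order.end(), [&edges](int a, int b) {
        return edges[a].weight > edges[b].weight;
    });
    std::vector<bool> in_tree(edges.size(), false);
    for (int i : order) {
        int a = find_root(parent, edges[i].from);
        int b = find_root(parent, edges[i].to);
        if (a != b) { parent[a] = b; in_tree[i] = true; }
    }
    return in_tree;
}

// Walk the tree path from 'at' to 'goal'. Tree edges followed along their
// direction receive +c, those followed against it receive -c.
static bool add_along_path(int at, int goal, int came_by, long c,
                           const TreeAdj& adj, const std::vector<Edge>& edges,
                           std::vector<long>& count) {
    if (at == goal) return true;
    for (std::size_t k = 0; k < adj[at].size(); ++k) {
        int next = adj[at][k].first, e = adj[at][k].second;
        if (e == came_by) continue;
        if (add_along_path(next, goal, e, c, adj, edges, count)) {
            count[e] += (edges[e].from == at) ? c : -c;
            return true;
        }
    }
    return false;
}

// measured[i] is used only for edges outside the tree.
std::vector<long> reconstruct(int n, const std::vector<Edge>& edges,
                              const std::vector<bool>& in_tree,
                              const std::vector<long>& measured) {
    TreeAdj adj(n);
    for (int i = 0; i < static_cast<int>(edges.size()); ++i)
        if (in_tree[i]) {
            adj[edges[i].from].push_back(std::make_pair(edges[i].to, i));
            adj[edges[i].to].push_back(std::make_pair(edges[i].from, i));
        }
    std::vector<long> count(edges.size(), 0);      // step 1: tree labels = 0
    for (int i = 0; i < static_cast<int>(edges.size()); ++i) {
        if (in_tree[i]) continue;
        count[i] = measured[i];
        // step 2: close the fundamental cycle from the head of e to its tail
        bool ok = add_along_path(edges[i].to, edges[i].from, -1, measured[i],
                                 adj, edges, count);
        assert(ok);
        (void)ok;
    }
    return count;
}
```

On a diamond-shaped graph with a pseudo-edge back to the entry, only two of the five edges need counters; the other three counts are recovered after the run.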
A major remaining problem is how to select a maximum spanning tree. One possibility is that
multiple passes could be made with Qprof, so that execution statistics from the first pass could be
used to reduce the cost of profiling the second run, perhaps leading to more accurate measurements.
However, a cleaner approach would be to use a heuristic to artificially weight the control graph before
removing the maximum spanning tree. Goldberg suggests a method which relies on a depth-first
search to weight control-flow nodes heavier at the "bottom" of the graph. When integrated with
the original idea of removing a spanning tree, we get the following algorithm:
1. Perform a depth-first search and assign post-order numbers to nodes.
2. Label each edge with the post-order number of its source node. (These numbers will range from
1 to N.)
3. Assign edges corresponding to induction variables the label 0 (highest priority for instrumentation).
4. Add N to the label of loop return edges that do not correspond to induction variables (lowest
priority).
5. Assign pseudo-edges from the leaves to the root (added to satisfy Kirchhoff's laws) the priority
N+1.
6. Eliminate the maximum cost spanning tree (which contains the lowest priority edges) and select
the remaining high priority edges to measure.³
¹ To flatten a compound node in a graph of type T, the interior graphs are removed and placed in sequence. In
other words, when control flow enters the compound node, it goes first to the initial node of one of the interior graphs,
travels down that graph until it reaches a node with no successor, and then enters the first node of one of the remaining
interior graphs. After all interior graphs are exhausted, control flow leaves the compound node.
² A slight complication is that self-loops and loops between a pair of nodes in the original control-flow graph must
be expanded to three-node loops before subtracting the spanning tree. Another complication is that if some user
calls (e.g. exit) do not return to the user code, yet are not in basic blocks represented by leaves in the control flow
diagram, additional edges must be added to the graph.
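Steps 1 through 5 of the labeling heuristic can be sketched as below; the resulting labels would then serve as edge weights when choosing the maximum cost spanning tree in step 6. `CfgEdge`, `Labeler`, and `priority_labels` are hypothetical names. The sketch assumes that every node is reachable from the root, that induction-variable edges and pseudo-edges are flagged by the caller, and that a loop return edge is one whose target is still on the depth-first search stack.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct CfgEdge {
    int from, to;
    bool induction;   // edge corresponds to an induction variable (step 3)
    bool pseudo;      // leaf-to-root pseudo-edge (step 5)
};

struct Labeler {
    const std::vector<CfgEdge>& edges;
    std::vector<std::vector<int> > out;  // real outgoing edge indices per node
    std::vector<int> color;              // 0 unvisited, 1 on stack, 2 done
    std::vector<int> post;               // post-order number, 1..N
    std::vector<bool> loop_return;       // edge closes a loop
    int next_post;

    Labeler(int n, const std::vector<CfgEdge>& e)
        : edges(e), out(n), color(n, 0), post(n, 0),
          loop_return(e.size(), false), next_post(1) {
        for (int i = 0; i < static_cast<int>(e.size()); ++i)
            if (!e[i].pseudo) out[e[i].from].push_back(i);
    }

    // Step 1: depth-first search assigning post-order numbers.
    void dfs(int u) {
        color[u] = 1;
        for (int i : out[u]) {
            int v = edges[i].to;
            if (color[v] == 1) loop_return[i] = true;
            else if (color[v] == 0) dfs(v);
        }
        color[u] = 2;
        post[u] = next_post++;
    }
};

// Returns one label per edge; label 0 is the highest instrumentation priority.
std::vector<int> priority_labels(int n, int root,
                                 const std::vector<CfgEdge>& edges) {
    assert(root >= 0 && root < n);
    Labeler lab(n, edges);
    lab.dfs(root);
    std::vector<int> label(edges.size());
    for (std::size_t i = 0; i < edges.size(); ++i) {
        if (edges[i].pseudo) label[i] = n + 1;                 // step 5
        else if (edges[i].induction) label[i] = 0;             // step 3
        else label[i] = lab.post[edges[i].from]                // step 2
                      + (lab.loop_return[i] ? n : 0);          // step 4
    }
    return label;
}
```

For a four-node graph with a single loop, the loop return edge receives the largest label and is therefore the first candidate for the spanning tree, which keeps a counter off the loop's hot path.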
In Goldberg's tests, this heuristic scored better than random selection of a spanning tree on most
benchmarks, although it remained far from ideal. Larus & Ball [9] suggest another heuristic in their paper on
optimal profiling; the heuristic assumes (1) each loop executes 10 times, (2) if a loop is entered N
times and has E exit edges then each exit edge gets weight N/E, and (3) predicates are equally likely
to take any of their non-exit branches. The evidence in favor of the Larus & Ball heuristic is slight
but it may be the better of the two.
Thus, by first creating a control-flow graph of the basic blocks in a code block, and then removing
the maximum spanning tree, we are left with the set of edges that should be instrumented to
collect basic block counts with the least overhead. In cases where the user code already maintains
a loop counter, counting along the loop return path can be done for free. Together with the other
optimizations, minimizing the number of instrumentation points as just discussed should lower the
overhead of Qprof enough to permit successful parallel profiling in a multi-computer environment.




Since the commonly available Unix tools for profiling the performance of a program lack the resolution
to make fine-grain time measurements and are not designed to be effective in a multicomputer
environment, Qprof was developed and implemented as an integral part of the Q runtime system
and the IBM POWER RS/6000. Taking advantage of the data structures of the runtime system to
store statistics near where they are updated, and relying on the high-resolution on-board timer of
the RS/6000 processor set, Qprof can successfully generate real time measurements of basic block
and C subroutine execution as well as Id procedure live time. By combining statically collected
measures such as basic block ideal execution time and instruction mix with real time measurements
and invocation counts, Qprof can present the user with an accurate picture of memory hierarchy hot
spots and dynamic instruction mix. Qprof even collects call-tree information (to a single level) so a
user can at a glance determine which Id procedures are called by a given code block, how often they
are called, and how much time they remain active. Although the current implementation of Qprof
suffers from a heavy overhead (100% in some test cases), methods have been outlined to optimize
overhead down to an acceptable level by measuring real time intervals at the partition level rather
than the basic block level, and by including only the minimal set of basic block counters at strategic
locations rather than blindly counting invocations at each basic block. When Qprof is eventually
reimplemented in a multicomputer setting, the post processor will be written using graphic presentation
techniques such as the three-dimensional histogram described by Kesselman in order to
communicate the performance statistics cleanly and effectively.
Appendix A
Appendix: The Simulator Code
// QPROF main.C file -- ybok@athena













#include "T_to_PPC.h"
#include "ICache.C"
#include "parser.C"
void main(int argc, char* argv[])
{ String& t_label = *(new String("begin"));
  bool exit_direction = TRUE;
  bool repeat = FALSE;
  bool print_module = FALSE;




    print_module = (strcmp("-p",argv[4]) == 0);
  case 4:
    repeat = (strcmp("1",argv[3]) == 0);
    print_module = print_module || (strcmp("-p",argv[3]) == 0);
  case 3:
    exit_direction = (strcmp("1",argv[2]) == 0);





    sigerr("main: Improper number of arguments passed to qprof.");
  };
  ppc_Mod& mod = file_to_ppc_Mod(test_module);
  ICache* ic1 = new ICache(mod);   // generate 1st ICache
  ICache* ic2 = new ICache(mod);   // generate 2nd ICache
  nodemap* nmap = new nodemap(ic1,mod);
  if (print_module)
    mod.print();
  // generate the list of all possible branch charts
  branch_chart_list bcl =
    nmap->make_branch_map(t_label,exit_direction,repeat);
  // This section can be enabled to print out the branch chart that
  // will be used by the simulator.
  //
  for(DListIter<String*> iter(bcl[0]); !iter.end_p(); iter++)
    cout << *(iter.value()) << endl;
  // create a trial branch unit using the head of this list
  bru r6000(ic1,ic2,*bcl[0],t_label);
  // report the results -- the Ideal Time
  cout << argv[1] << "," << exit_direction << "," << repeat << "-";
  cout << r6000.time_code() << endl;
// QPROF bru ADT -- ybok@athena






















// some useful enums
enum bfstatus {OFF, WAIT_ON_ADDR, CALC_ADDR, WAIT_ON_FETCH, LOADED};
void swap_icaches(ICache*& ic1, ICache*& ic2);   // forward declarations
// the class 'bru'
class bru {
public:
  // the constructor for a BRU; it's used by the high-level caller
  bru(ICache* ic1, ICache* ic2, DList<String*>& branch_chart,
      const String& start, const String& end = end_default);
  int time_code();     // simulate RS6000
  void bru_tick();     // advance the BRU one tick
private:



















  void fetch_instruction();        // get instructions from ICache
  void dispatch_instructions();    // send them to FXU and FPU
  void tick_fxfpdc();              // advance the FXU, FPU, and DCache one tick
  void detect_and_setup_branch();  // begin to simulate branch




// QPROF bru ADT -- ybok@athena




bru::bru(ICache* ic1, ICache* ic2, DList<String*>& branch_chart,
         const String& target, const String& end)
{ tmr = new timer(target,end);
  dcache_unit = new DCache;
  sc_unit = new synchro;
  ilock_unit = new binterlock(dcache_unit);
  fxu_unit = new fxu(ilock_unit,dcache_unit,sc_unit,tmr);
  fpu_unit = new fpu(ilock_unit,dcache_unit,sc_unit,tmr);









  ifeeder = new bqueue;
  icache_unit1->fetch(initial_fetch_targ);
  ifeeder->load_seq(icache_unit1->load());











  // look for branches and handle any seen
  // advance status of current branch
  // dispatch the instructions to FXU and FPU
fe
  ifeeder->print();





  // fetch ins. into BRU dispatch buffers
  // advance timer
  // complete current branch if proper
};
// finish the current branch if it's time to do so
void bru::complete_branch()





  else if ((fetch_stat == LOADED) || (fetch_stat == WAIT_ON_FETCH))
    {ifeeder->sideload();
     swap_icaches(icache_unit1, icache_unit2);
     fetch_stat = OFF;
     ilock_unit->branch_done();};
// dispatch instructions from the BRU through the binterlock object
// to the FXU and FPU
void bru::dispatch_instructions()
{ tick_fxfpdc();
  int fxufpu_ins = 0;
  bool can_dispatch = TRUE;
  bool sv_cr_op = FALSE;
  bool sv_branch = FALSE;
  ppc_Inst* curr_ins;
  while ((ifeeder->seq_size() > 0) && can_dispatch)
    {curr_ins = ifeeder->seq_peek();
     can_dispatch = ilock_unit->dispatch_ready_p(*curr_ins,*br_addr);
     switch(curr_ins->op().opt()) {
     case BRANCH:
       can_dispatch = can_dispatch && (!sv_branch);
       break;
     case CR_OP:
       can_dispatch = can_dispatch && (!sv_cr_op);
       break;
     default:




































// Advance the FXU, FPU, and DCache units one tick. Checkpointed
// copies are advanced as well. The check followed by the alternate
// ordering of calls to the FXU and FPU is to enforce the constraints








// look for branch in the BRU instruction buffers and handle any seen
void bru::detect_and_setup_branch()
















  {if ((b_ins->op().uses_lr_p()) && (ilock_unit->bra_mode()))
     {if (ilock_unit->link_reg_ready_p())
        fetch_stat = WAIT_ON_FETCH;}
















// Swap the sequential and branch target ICaches





// QPROF fxu ADT -- ybok@athena























// some useful constants
const IBUF_SIZE = 4;
const PSQ_FOR_FP_LOADS_SIZE = 3;
const SB_FOR_FX_LOADS_SIZE = 1;
const NO_SPR_REQ = -1;




  fxu(binterlock* bi_lock, DCache* dc, synchro* sc, timer* tmr);
  void tick();                     // advance FXU one clock tick
  void dispatch(ppc_Inst& ins);    // dispatch an FXU instruction
  bool full_p() {return (ibuf->room() <= 0);};   // Is the FXU full?
  void print();
private:
  int exec_delay;                  // FXU pipeline occupancy





  gqueue<ppc_Inst*> *d0, *d1, *exe, *exe_plus_1, *exe_plus_2;




  // and subordinate functional unit pointers
  int spr_req_buf;     // handle special register loads
  bool dcache_busy;    // is the DCache busy?












  // handle exe. of non-load instructions
  // handle exe. of load instructions
  // handle exe. of store instructions
  // clear flags in binterlock unit
  // treat load returning from DCache
  // push FX store buf. entry to DCache
  // push FP store buf. entry to DCache
  // advance exe. buffers
};
// these two procedures advance the line of execution history buffers;
// these buffers are used to clear flags in the dispatch unit
inline void fxu::advance_line(ppc_Inst* ins)
{ if (!exe_plus_2->empty_p())
    exe_plus_2->pop();
  if (!exe_plus_1->empty_p())
    exe_plus_2->load_element(exe_plus_1->pop());
  exe_plus_1->load_element(ins);
};
inline void fxu::advance_line()
{ if (!exe_plus_2->empty_p())
    exe_plus_2->pop();
  if (!exe_plus_1->empty_p())
    exe_plus_2->load_element(exe_plus_1->pop());
};
#endif
// QPROF fxu ADT -- ybok@athena
// The fxu ADT represents the FXU unit in the RS/6000.
#include "fxu.h"
// construct an FXU






  ibuf = new gqueue<ppc_Inst*>(IBUF_SIZE);
  d0 = new gqueue<ppc_Inst*>(1);
  d1 = new gqueue<ppc_Inst*>(1);
  exe = new gqueue<ppc_Inst*>(1);
  exe_plus_1 = new gqueue<ppc_Inst*>(1);
  exe_plus_2 = new gqueue<ppc_Inst*>(1);
  spr_req_buf = NO_SPR_REQ;
  fpu_psq = new gqueue<addr>(1);
  fpu_psq2 = new gqueue<addr>(1);
  fpu_psq3 = new gqueue<addr>(1);
  fpu_psq4 = new gqueue<addr>(1);





// dispatch an instruction to the FXU
void fxu::dispatch(ppc_Inst& ins)
{ if (ibuf->size() <= 0)
    {if (d0->room() > 0)
       d0->load_element(&ins);
     else if (d1->room() > 0)
       d1->load_element(&ins);
     else if (ibuf->room() > 0)
       ibuf->load_element(&ins);
     else
       sigerr("fxu::dispatch: No room to dispatch.");}
  else if (ibuf->room() > 0)
    ibuf->load_element(&ins);
  else
    sigerr("fxu::dispatch: No room to dispatch.");
};
void fxu::tick()
{ // push the instruction buffer entries, if any, down into the d0 and d1 stages
  if (d0->room() > 0)




  clear_dispatch_flags();   // clear flags in binterlock unit
  dcache_busy = FALSE;      // has cache bandwidth been used?
  l_advanced = FALSE;       // have exe. history buf. been advanced?
  // handle loads returning from the DCache unit
  handle_fxu_load();
  // handle clearing interlocks set during loads from special registers
  if (spr_req_buf != NO_SPR_REQ)
    {busyregs->unlock(spr_req_buf);
     spr_req_buf = NO_SPR_REQ;};
  // decrement pipeline occupancy count
  if ((exe->size() > 0) && (exec_delay > 0))
    exec_delay--;
  // push the pipeline through






  // handle the FXU store buffer
  // handle the FPU store buffer
  // handle the FXU execution stage
  // handle the movement of instructions
  // from decode into the execute stage
// handle the execution stage of the FXU pipeline, except loads
void fxu::handle_exe()
{ if ((exec_delay <= 0) && (exe->size() > 0) &&
      !(exe->peek()->op().load_p()))                   // don't handle loads here
    {if (exe->peek()->op().store_p())                  // handle stores
       perform_store();
     else if (exe->peek()->op().opt() == FXU_MFSPR)    // handle loads from
       {spr_req_buf = exe->peek()->rand(1)->reg().actual();   // special registers
        busyregs->lock(spr_req_buf);};
     advance_line(exe->pop());};




// push loads through the execution stage of the FXU pipeline
void fxu::handle_exe_load()
{ if ((exec_delay <= 0) && (exe->size() > 0))
    if (exe->peek()->op().load_p())
      {ppc_Inst* ins = exe->peek();
       addr* new_addr = new addr(ins);
       if (!(dcache_busy) &&                          // make sure the DCache is
           dcache_unit->request_load_ready_p())       // not being used for other






















// Send signals to the binterlock unit to clear dispatch flags when
// instructions pass through certain FXU pipeline stages
void fxu::clear_dispatch_flags()
{ if (exe->size() > 0)
    bra_lock->signal_FXU_Exe(*exe->peek());
  if (exe_plus_1->size() > 0)
    bra_lock->signal_FXU_Exe_plus_1(*exe_plus_1->peek());




  addr* new_addr = new addr(ins);
  if ((ins->op().fp_store_p()) &&











// advance decode region of FXU pipeline
void fxu::handle_decode()
{ for(int i = 1; i <= 2; i++)   // shift two instructions out, if possible,
    {                           // per clock cycle
     if ((d0->size() > 0) && sc_unit->fxu_shift_ok_p())
       switch(d0->peek()->op().opt()) {




       default:   // process FXU operations





         t_unit->alert_timer(exe->peek()->label());};
         break;
       };
  if ((d1->size() > 0) && (d0->room() > 0))   // advance decode buffers
    d0->load_element(d1->pop());
// pop and process FXU store buffer entries if possible
void fxu::handle_fxu_stbuf()




// pop and process FPU store buffer entries if possible
void fxu::handle_fpu_stbuf()
{ if ((fpu_psq4->size() > 0) && !dcache_busy &&





  if ((fpu_psq4->room() > 0) && (fpu_psq3->size() > 0))
    fpu_psq4->load_element(fpu_psq3->pop());
  if ((fpu_psq3->room() > 0) && (fpu_psq2->size() > 0))
    fpu_psq3->load_element(fpu_psq2->pop());
  if ((fpu_psq2->room() > 0) && (fpu_psq->size() > 0))
    fpu_psq2->load_element(fpu_psq->pop());
// print out the current state of the FXU
void fxu::print()
{ cout << "COUNT " << overload_count << endl;
  cout << "******************** FXU STATE ********************";
  cout << endl;
  cout << "IBUF:" << endl; ibuf->print();
  cout << "D1: " << endl; d1->print();
  cout << "D0: " << endl; d0->print();
  cout << "EXE(" << exec_delay << "):" << endl;
  exe->print();
  cout << "EXE+1: " << endl; exe_plus_1->print();
  cout << "EXE+2: " << endl; exe_plus_2->print();
  cout << "FPU_PSQ: " << endl; fpu_psq->print();
  cout << "FPU_PSQ: " << endl; fpu_psq2->print();
  cout << "FPU_PSQ: " << endl; fpu_psq3->print();
  cout << "FPU_PSQ: " << endl; fpu_psq4->print();




// QPROF fpu ADT -- ybok@athena
























// This ADT simulates the RS/6000 FPU unit.
const FPU_IBUF_SIZE = 6;
const FPU_DBUF_SIZE = 3;
const FPU_DSQ_SIZE = 3;
const FPU_PSQ_SIZE = 3;
const FPU_PTRQ_SIZE = 8;
const FPU_FLIST_SIZE = 8;
const FPU_OLQ_SIZE = ;   // Chosen for simulation purposes -- arbitrary
// the class fpu
class fpu {
public:
  // constructor, used by BRU unit
  fpu(binterlock* bi_lock, DCache* dc, synchro* sc, timer* tmr);
  void dispatch(ppc_Inst& ins);   // dispatch an FPU instruction
  void tick();                    // advance FPU one tick
  bool full_p() {return (ibuf->room() <= 0);};   // test whether FPU is full
  void print();                   // print FPU state
private:
  fpu* saved_state;               // saved state
  int exec_delay;                 // instruction pipeline occupancy
















  // and subordinate functional unit pointers
  bool dcache_busy;   // is the DCache busy?
  bool can_decode, can_pop;
  // does instruction use locked registers?
  bool clash(ppc_Inst* ins) const;

















  // clear flags in binterlock unit
  // advance input buffers
  // treat returning loads
  // do rename phase
  // rename stores and fill PSQ
  // rename arith. ins. and fill dbufs
  // rename loads and modify reg. map
  // remove psq entry and process
  // advance decode buffers
  // send out DSQ entry if possible
};
// handle the emptying of the pending target return queue
inline void fpu::empty_PTRQ()
{ if ((ptrq->size() > 0) && (ptrq->can_pop()))
    if (psq->contains(*(new psq_entry(0,ptrq->peek()))))
      psq->apply_to_mem(*(new psq_entry(0,ptrq->pop())),&set_gb);
    else if (free_list->room() > 0)
      free_list->load_element(ptrq->pop());
// move instructions in the decode buffer into the decode stage
inline void fpu::advance_dbufs()
{ if ((decode->room() > 0) && (dbuf->size() > 0))
    decode->load_element(dbuf->pop());
// If the data storage queue is not empty, try to send some data to
// the DCache
inline void fpu::advance_dsq()
{ if ((dsq->size() > 0) && (!dcache_busy) &&
      dcache_unit->dsq_store_ready_p())
    {dcache_busy = TRUE;
     dcache_unit->dsq_store(dsq->pop());
    };
// handle an incoming load from the DCache to the FPU
inline void fpu::handle_fpu_load()
{ if ((dcache_unit->fpu_load_val_ready_p()) && (olq->size() > 0))
    {dcache_busy = TRUE;
     olq->pop();                      // actual register
     dcache_unit->fpu_load_val();     // virtual register
// fill the rename stages if empty
inline void fpu::load_rename_stage()
{ if ((rn0->room() > 0) && (rn1->room() > 0))
    {if (pd0->size() > 0)
       rn0->load_element(pd0->pop());




// QPROF fpu ADT -- ybok@athena
// The fpu ADT is used to simulate the operation of the RS/6000 FPU unit
#include "fpu.h"
// construct an FPU




  fprmtable = new fprmap;
  ibuf = new gqueue<ppc_Inst*>(FPU_IBUF_SIZE);
  pd0 = new gqueue<ppc_Inst*>(1);
  pd1 = new gqueue<ppc_Inst*>(1);
  rn0 = new gqueue<ppc_Inst*>(1);
  rn1 = new gqueue<ppc_Inst*>(1);
  dbuf = new gqueue<fpdecode_entry>(FPU_DBUF_SIZE);
  decode = new gqueue<fpdecode_entry>(1);
  exe1 = new gqueue<ppc_Inst*>(1);
  exe2 = new gqueue<ppc_Inst*>(1);
  exe2_plus_1 = new gqueue<ppc_Inst*>(1);
  exe2_plus_2 = new gqueue<ppc_Inst*>(1);
  exe2_plus_3 = new gqueue<ppc_Inst*>(1);
  exe2_plus_4 = new gqueue<ppc_Inst*>(1);
  dsq = new gqueue<fpu_data>(FPU_DSQ_SIZE);
  olq = new gqueue<int>(FPU_OLQ_SIZE);
  psq = new gqueue<psq_entry>(FPU_PSQ_SIZE);
  free_list = new gqueue<int>(FPU_FLIST_SIZE);
  for(int i = 32; i < 40; i++)
    free_list->load_element(i);







// dispatch an FPU instruction
void fpu::dispatch(ppc_Inst& ins)
{ if (ibuf->size() <= 0)
    {if (pd0->room() > 0)
       pd0->load_element(&ins);
     else if (pd1->room() > 0)
       pd1->load_element(&ins);
     else if (ibuf->room() > 0)
       ibuf->load_element(&ins);
     else
       sigerr("fpu::dispatch: No room to dispatch.");}
  else if (ibuf->room() > 0)
    ibuf->load_element(&ins);
  else
    sigerr("fpu::dispatch: No room to dispatch.");
};
// advance the FPU by one clock tick
void fpu::tick()
{ advance_ibuf();           // move instructions down from the ins. buffers
  clear_dispatch_flags();   // clear flags in the binterlock unit
  dcache_busy = FALSE;
  if ((exe1->size() > 0) && (exec_delay > 0))
    exec_delay--;
  // setup flags for decoding and popping psq entry, respectively
  can_decode = (decode->size() > 0) ? !clash(decode->peek().get_ins()) : TRUE;









// Send signals to the binterlock unit to clear dispatch flags when
// instructions pass through certain FPU pipeline stages
void fpu::clear_dispatch_flags()
{ if (exe1->size() > 0)
    bra_lock->signal_FPU_Exe1(*exe1->peek());
  if (exe2_plus_1->size() > 0)
    bra_lock->signal_FPU_Exe2_plus_1(*exe2_plus_1->peek());
  if (exe2_plus_3->size() > 0)
    bra_lock->signal_FPU_Exe2_plus_3(*exe2_plus_3->peek());
};
// advance instructions in the buffers upstream from the predecode buffer
// into the predecode register
void fpu::advance_ibuf()
{ if ((pd0->room() > 0) && (pd1->room() > 0))
    {if (ibuf->size() > 0)
       pd0->load_element(ibuf->pop());
     if (ibuf->size() > 0)
       pd1->load_element(ibuf->pop());};
};
// Twice, do the following:
// Rename the instruction in rn0, whether an arithmetic FPU instruction
// or an FPU load or store, and then dispatch it to the appropriate pipeline.
// Finally, advance the preceding processor buffers one notch.
void fpu::rename_and_push_ins()
{ for(int disp_loop = 1; disp_loop <= 2; disp_loop++)
    if ((rn0->size() > 0) && sc_unit->fpu_shift_ok_p())
      switch(rn0->peek()->op().opt()) {












  if ((rn1->size() > 0) && (rn0->room() > 0))
    rn0->load_element(rn1->pop());
  load_rename_stage();





  if (dbuf->size() > 0)
    dbuf->apply_to_last(&add_store);
  else if (decode->size() > 0)
    decode->apply_to_last(&add_store);
  else
    psq->release();
  sc_unit->fpu_shift();
// update the virtual->real register map and various register queues
// to reflect an incoming load to an FP register
void fpu::prepare_for_fpu_load()
{ if (fprmtable->load_map_possible_p(free_list, ptrq, olq))
    {fprmtable->load_map(rn0->pop(), free_list, ptrq, olq);
     if (dbuf->size() > 0)
       dbuf->apply_to_last(&add_load);





// rename the instruction in rn0 and stick it into the decode stage, or
// the decode buffers, as room permits
void fpu::rename_into_decode_stage()
{ if ((dbuf->size() <= 0) && (decode->room() > 0))
    {decode->load_element(fprmtable->arith_map(rn0->pop()));
     sc_unit->fpu_shift();}




// Attempt to move the instruction in decode down to the first execute
// stage.
void fpu::handle_decode()
{ if ((exe1->room() > 0) && (decode->size() > 0) && can_decode)
    {exe1->load_element(decode->peek().get_ins());
     exec_delay = decode->peek().get_ins()->occupancy();
     t_unit->alert_timer(decode->peek().get_ins()->label());
     for(int i = decode->peek().get_count(); i > 0; i--)
       ptrq->release();




// handle the removal of entries from the FPU pending store queue
void fpu::pop_psq()
{ if ((exe1->room() > 0) && (psq->size() > 0) && (psq->can_pop()) &&
      can_pop)
    {if (psq->peek().give_back_p())
       {if (free_list->room() > 0)
          {exe1->load_element(psq->peek().get_ins());




// handle the execute stages of the pipeline
void fpu::handle_execute()
{ if (exe1->size() > 0)
    {if ((exe2->peek()->op().fp_store_p()) && (dsq->room() > 0))
       {dsq->load_element(PDATA);
        advance_line(exe2->pop());}








void fpu::advance_line(ppc_Inst* ins)
{ if (!exe2_plus_4->empty_p())
    exe2_plus_4->pop();
  if (!exe2_plus_3->empty_p())
    exe2_plus_4->load_element(exe2_plus_3->pop());













  if (!exe2_plus_1->empty_p())
    exe2_plus_2->load_element(exe2_plus_1->pop());
// look for "clash" which prevents an instruction from being executed
bool fpu::clash(ppc_Inst* ins) const
{ // check first for pipeline dependencies involving adjacent instructions
  if (exe2->size() > 0)   // check for FPR-dependent loads
    for(DListIter<int>
        j1(exe2->peek()->target_FPR_rands()); !j1.end_p(); j1++)
      for(DListIter<int> i1(ins->source_FPR_rands()); !i1.end_p(); i1++)
        if (i1.value() == j1.value())
          return TRUE;
  if (exe1->size() > 0)   // check for all other dependent loads
    for(DListIter<int>
        j2(exe1->peek()->target_FPR_rands()); !j2.end_p(); j2++)
      for(DListIter<int> i2(ins->source_FPR_rands()); !i2.end_p(); i2++)
        if (i2.value() == j2.value())
          return TRUE;
  // check for source or destination registers to which an outstanding load
  // is assigned
  for(DListIter<int> i3(ins->source_FPR_rands()); !i3.end_p(); i3++)
    if (olq->contains(i3.value()))
      return TRUE;





// print out the current state of the FPU
void fpu::print()
{ cout << "******************** FPU STATE ********************";
  cout << endl;
  cout << "IBUF:" << endl; ibuf->print();
  cout << "PD1: " << endl; pd1->print();
  cout << "PD0: " << endl; pd0->print();
  cout << "RN1: " << endl; rn1->print();
  cout << "RN0: " << endl; rn0->print();
  cout << "FREELIST:" << endl; free_list->print();
  cout << "PTRQ:" << endl; ptrq->print();
  cout << "DBUF:" << endl; dbuf->print();
  cout << "DECODE:" << endl; decode->print();
  cout << "PSQ:" << endl; psq->print();
  cout << "EXE1(" << exec_delay << "):" << endl; exe1->print();
  cout << "EXE2: " << endl; exe2->print();
  cout << "EXE2+1: " << endl; exe2_plus_1->print();
  cout << "EXE2+2: " << endl; exe2_plus_2->print();
  cout << "EXE2+3: " << endl; exe2_plus_3->print();
  cout << "EXE2+4: " << endl; exe2_plus_4->print();
  cout << "OLQ: " << endl; olq->print();
  cout << "DSQ: " << endl; dsq->print();

















enum parse_pos {AT_LABEL, AT_OP, AT_RANDS};
inline void addCR(char* l, ppc_Inst* inst);
void addGPR(char* l, ppc_Inst* inst);
void addFPR(char* l, ppc_Inst* inst);
void addNUM(char* l, ppc_Inst* inst);
void addLABEL(char* l, ppc_Inst* inst);
bool is_w_space(char c);
// The parser is used to convert an RS/6000 assembly file into
// a PowerPC module in order to test the ideal time module of QPROF
ppc_Mod& file_to_ppc_Mod(char* filename)
{ // first create a stream bound to the input file
  ifstream f_in(filename);
  if (f_in.fail())
    sigerr("parser: Given file name not found.");
  const char cond_reg[3] = "cr";
  const char fp_reg[4] = "fpr";




  parse_pos pos = AT_LABEL;
  String* l_name = 0;




  ppc_Mod* mod = new ppc_Mod("test mod");
  ppc_Proc* proc = new ppc_Proc("test proc",NORMAL_PROC);
  ppc_CB* cb = new ppc_CB("test cb");
  ppc_Part* part = new ppc_Part("test part",THREAD_PART); part->regidx(0);
  ppc_Inst* ins;
  while (!f_in.eof())
    {f_in.get(curr_char);
     while ((is_w_space(curr_char)) && (!f_in.eof()))
       {if (curr_char == '\n')





     while ((!is_w_space(curr_char)) && (!f_in.eof())) {







       rat = new ppc_Rator(l_name);
       op = rat->op();
       delete rat;
       if (op == ppc_none)
         pos = AT_OP;











       pos = AT_RANDS;
       break;
     default:
       if ((label == strstr(label,cond_reg)) && (strlen(label) == 3))
         addCR(label, ins);
       else if ((label == strchr(label,fx_reg)) && (strlen(label) <= 3))
         addGPR(label,ins);








     if (curr_char == '\n')








// add condition register operand to an instruction
inline void addCR(char* l, ppc_Inst* inst)
{ int r = (int) l[2]-'0';
  inst->addRand(new ppc_Rand(*(new ppc_Reg(r,CR_RTYPE,r))));
};
// add general purpose register operand to an instruction
void addGPR(char* l, ppc_Inst* inst)
{
  int r;
  if (l[2] == '\0')
    r = (int) l[1] - '0';
  else
    r = (int) ((l[1] - '0')*10 + (l[2]-'0'));
  inst->addRand(new ppc_Rand(*(new ppc_Reg(r,GPR_RTYPE,r))));
};
// add floating-point register operand to an instruction
void addFPR(char* l, ppc_Inst* inst)
{ int r;
  if (l[4] == '\0')
    r = (int) l[3] - '0';
  else
    r = (int) ((l[3] - '0')*10 + (l[4]-'0'));
  inst->addRand(new ppc_Rand(*(new ppc_Reg(r,FPR_RTYPE,r))));
};
// add an integer operand to an instruction
void addNUM(char* l, ppc_Inst* inst)
{ int num = 0;
  int index = 0;
  while (l[index] != '\0')
    num = 10*num+((int) l[index++]-'0');
  inst->addRand(new ppc_Rand(num));
};
// add a label operand to an instruction
void addLABEL(char* l, ppc_Inst* inst)
{ inst->addRand(new ppc_Rand(*(new String(l))));
};
// define a "white space" character as a conventional C white space
// character, or a comma.
bool is_w_space(char c) {return (isspace(c) || (c == ','));};
// QPROF ADT binterlock -- ybok@athena
// This ADT represents the instruction dispatch interlocks that limit


















bool check_intersection(int mask, int crfield);
// useful enums
enum status {CLEAR, SET};                   // Status of flags
enum bdstatus {RESOLVED, UNRESOLVED};       // Status of current branch
enum settype {NO_STYPE, FXU_STYPE, FPU_STYPE};
const NO_CRFIELD = -1;




// The class binterlock
class binterlock
{
public:
binterlock(DCache* dc); // constructor
void link(fxu* fixed, fpu* floating); // wire in FXU and FPU units
// to already constructed binterlock
// These are the main operations called by the high order user--they
// check whether a new (general) instruction can be dispatched, and
// dispatch it, respectively
bool dispatch_ready_p(const ppc_Inst& test_ins, String& targ) const;
void dispatch(ppc_Inst& ins, timer& tmr, String& targ);
// These procedures are called by the (subordinate) FXU and FPU units,
// to clear certain interlock flags.
void signal_FXU_Exe(ppc_Inst& ins);
void signal_FXU_Exe_plus_1(ppc_Inst& ins);




// these are signalers used by the caller to end a dispatch or branch
void signal_dispatch_done() {cr_op_exists = FALSE;};
void branch_done();
// here are some predicates useful for examining the binterlock's state
bool branch_mode() {return bmode;};
bool branch_resolved_p();
bool link_reg_ready_p() const {return (!lr.wflag);};
bool count_reg_ready_p() const {return (!ctr.wflag);};
private:
int cond_disp_ins; // the number of conditionally dispatched
// instructions
bool fpu_not_ready; // prevent instructions from being conditionally
// dispatched during a branch dependent
// upon an FPU computation?
settype stype[8];


















// record whether CR-setting instruction
// is an FXU or FPU operation
// conditional branch status
// pointer to the branch instruction
// and the branch condition field
// interlocks for CR operations
fxu* fxu_unit; // pointer to functional units
fpu* fpu_unit;
DCache* dcache_unit;
// flag manipulation operations
bool check_cr_field(int field) const;
bool check_cr_field_r(int field) const;




void set_cr_mask(int mask, settype st);
void clear_cr_mask(int mask);





bool can_dispatch_branch_p(const ppc_Inst& test_ins) const;
bool can_dispatch_mtspr_p(const ppc_Inst& test_ins) const;
bool can_dispatch_cr_op_p(const ppc_Inst& test_ins) const;
// link in FXU and FPU units after creating a binterlock object





// check whether the current branch has been resolved
inline bool binterlock::branch_resolved_p()
{
if (!bmode) sigerr("binterlock::branch_resolved_p: No branch.");
return ((resolve_stat == RESOLVED) && (ctr_rstat == RESOLVED));
};
// some useful macros for classifying PowerPC instructions
inline bool ins_is_mtlr(const ppc_Inst& ins);
inline bool ins_is_mtctr(const ppc_Inst& ins);
inline bool ins_is_mflr(const ppc_Inst& ins);
inline bool ins_is_mfctr(const ppc_Inst& ins);
inline bool ins_is_mfxer(const ppc_Inst& ins);
// some subprocedures for handling the dispatch of PowerPC instructions
// and testing dispatch readiness
// can a branch instruction be dispatched?
inline bool binterlock::can_dispatch_branch_p(const ppc_Inst& test_ins) const










// check for illegal dispatches and flag them
inline void binterlock::handle_illegal_dispatches(optype op)
{
if (op == ILLEGAL)
sigerr("binterlock::dispatch: Illegal dispatch.");
if (bmode) // if branch mode is in effect
if ((op == CR_OP) || (op == BRANCH))
sigerr("binterlock::dispatch: Illegal dispatch.");
else if (ctr_rstat != RESOLVED)
sigerr("binterlock::dispatch: Illegal dispatch.");
else if (resolve_stat != RESOLVED)
cond_disp_ins++;
if (cond_disp_ins > 4) sigerr("binterlock::dispatch: Illegal dispatch.");
};
// internal flag (CR fields and LR/CTR special register fields) operations //
// check write flag of branch field
inline bool binterlock::check_cr_field(int bf) const
{
if (cr_op_exists && (((cr_op_lock-bf) % 4) == 0))
return FALSE;
if (((bf == 0) || (bf == 4)) && cr0_4.wflag) return FALSE;
else if (((bf == 1) || (bf == 5)) && cr1_5.wflag) return FALSE;
else if (((bf == 2) || (bf == 6)) && cr2_6.wflag) return FALSE;
else if (((bf == 3) || (bf == 7)) && cr3_7.wflag) return FALSE;
else return TRUE;
};
// check read flag of branch field
inline bool binterlock::check_cr_field_r(int bf) const
{
if (cr_op_exists && (((cr_op_lock-bf) % 4) == 0))
return FALSE;
if (((bf == 0) || (bf == 4)) && cr0_4.rflag) return FALSE;
else if (((bf == 1) || (bf == 5)) && cr1_5.rflag) return FALSE;
else if (((bf == 2) || (bf == 6)) && cr2_6.rflag) return FALSE;
else if (((bf == 3) || (bf == 7)) && cr3_7.rflag) return FALSE;
else return TRUE;
};
// set both read and write flags of field
inline void binterlock::set_cr_field(int bf, settype st)
{ stype[bf] = st;
if ((bf == 0) || (bf == 4)) {cr0_4.rflag = SET; cr0_4.wflag = SET;}
else if ((bf == 1) || (bf == 5)) {cr1_5.rflag = SET; cr1_5.wflag = SET;}
else if ((bf == 2) || (bf == 6)) {cr2_6.rflag = SET; cr2_6.wflag = SET;}
else if ((bf == 3) || (bf == 7)) {cr3_7.rflag = SET; cr3_7.wflag = SET;}
else sigerr("binterlock::set_cr_field: Invalid field.");
};
// clear both read and write flags of field
inline void binterlock::clear_cr_field(int bf)
{
if ((bf == 0) || (bf == 4)) {cr0_4.rflag = CLEAR; cr0_4.wflag = CLEAR;}
else if ((bf == 1) || (bf == 5)) {cr1_5.rflag = CLEAR; cr1_5.wflag = CLEAR;}
else if ((bf == 2) || (bf == 6)) {cr2_6.rflag = CLEAR; cr2_6.wflag = CLEAR;}
else if ((bf == 3) || (bf == 7)) {cr3_7.rflag = CLEAR; cr3_7.wflag = CLEAR;}
else sigerr("binterlock::clear_cr_field: Invalid field.");
};
// check all read flags, including the CR operation lock field
inline bool binterlock::check_all_CR_r() const
{
return((!cr0_4.rflag) && (!cr1_5.rflag) && (!cr_op_exists) &&
(!cr2_6.rflag) && (!cr3_7.rflag));
};
// set read and write flags based on a mask
inline void binterlock::set_cr_mask(int mask, settype st)
{
if (mask & 0x88)
{cr0_4.rflag = SET; cr0_4.wflag = SET;
if (mask & 0x80) stype[0] = st;
if (mask & 0x08) stype[4] = st;};
if (mask & 0x44)
{cr1_5.rflag = SET; cr1_5.wflag = SET;
if (mask & 0x40) stype[1] = st;
if (mask & 0x04) stype[5] = st;};
if (mask & 0x22)
{cr2_6.rflag = SET; cr2_6.wflag = SET;
if (mask & 0x20) stype[2] = st;
if (mask & 0x02) stype[6] = st;};
if (mask & 0x11)
{cr3_7.rflag = SET; cr3_7.wflag = SET;
if (mask & 0x10) stype[3] = st;
if (mask & 0x01) stype[7] = st;};
};
inline void binterlock::set_cr_mask(int mask)
{ if (mask & 0x88) {cr0_4.rflag = SET; cr0_4.wflag = SET;};
if (mask & 0x44) {cr1_5.rflag = SET; cr1_5.wflag = SET;};
if (mask & 0x22) {cr2_6.rflag = SET; cr2_6.wflag = SET;};
if (mask & 0x11) {cr3_7.rflag = SET; cr3_7.wflag = SET;};
};
// clear read and write flags based on a mask
inline void binterlock::clear_cr_mask(int mask)
{ if (mask & 0x88) {cr0_4.rflag = CLEAR; cr0_4.wflag = CLEAR;};
if (mask & 0x44) {cr1_5.rflag = CLEAR; cr1_5.wflag = CLEAR;};
if (mask & 0x22) {cr2_6.rflag = CLEAR; cr2_6.wflag = CLEAR;};
if (mask & 0x11) {cr3_7.rflag = CLEAR; cr3_7.wflag = CLEAR;};
};
#endif
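// Illustrative sketch, not part of the QProf source: the mask routines above
// appear to follow an 8-bit field-mask convention in which 0x80 selects CR
// field 0 and 0x01 selects CR field 7, with fields f and f+4 sharing one
// pair of read/write flags. Under that assumption the long chains of tests
// reduce to shifts. All function names below are hypothetical.

```cpp
#include <cassert>

// Bit of the 8-bit field mask that selects CR field crfield (0..7):
// field 0 is the most significant bit (0x80), field 7 the least (0x01).
inline int field_bit(int crfield) { return 0x80 >> crfield; }

// Index (0..3) of the flag pair shared by fields f and f+4.
inline int flag_pair(int crfield) { return crfield % 4; }

// Mask covering both fields of a flag pair, e.g. pair 0 -> 0x88.
inline int pair_mask(int pair) { return 0x88 >> pair; }

// True iff mask selects crfield; out-of-range fields (such as a
// "no field" marker of -1) never intersect.
inline bool mask_selects(int mask, int crfield)
{
    return crfield >= 0 && crfield < 8 && (mask & field_bit(crfield)) != 0;
}
```

// For example, mask_selects(0x40, 1) holds because 0x40 is the bit for
// field 1, while mask_selects(0x40, 2) does not.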
// QPROF ADT binterlock -- sybok@athena
// This ADT represents the instruction dispatch interlocks that limit








for(int i = 0; i < 8; i++)
stype[i] = NO_STYPE;
lr.wflag = CLEAR; lr.rflag = CLEAR;
ctr.wflag = CLEAR; ctr.rflag = CLEAR;
cr0_4.wflag = CLEAR; cr0_4.rflag = CLEAR;
cr1_5.wflag = CLEAR; cr1_5.rflag = CLEAR;
cr2_6.wflag = CLEAR; cr2_6.rflag = CLEAR;
cr3_7.wflag = CLEAR; cr3_7.rflag = CLEAR;
cr_A = CLEAR;









// hook called by subordinate FXU unit
void binterlock::signal_FXU_Exe(ppc_Inst& ins)




if (opcode == ppc_mtcrf)
{int mask = ins.rand(1)->num();
lr.rflag = CLEAR;
ctr.rflag = CLEAR;
if (bmode && check_intersection(mask,bcrfield))
bl = CLEAR;}
else if (ins_is_mtlr(ins) || ins_is_mtctr(ins))
{cr0_4.rflag = CLEAR; cr1_5.rflag = CLEAR;
cr2_6.rflag = CLEAR; cr3_7.rflag = CLEAR;
lr.rflag = CLEAR; ctr.rflag = CLEAR;}
else if (opcode == ppc_mcrxr)








if ((opcode == ppc_andi_) || (opcode == ppc_andis_))




void binterlock::signal_FXU_Exe_plus_1(ppc_Inst& ins)




if (opcode == ppc_mtcrf)




if (resolve_stat == UNRESOLVED)
resolve_stat = RESOLVED;}





else if (opcode == ppc_mcrxr)
clear_cr_field(ins.rand(1)->reg().actual());
if (bmode && (bcrfield == ins.rand(1)->reg().actual()))











if (bmode && (bcrfield == ins.rand(1)->reg().actual()))




if ((opcode == ppc_andi_) || (opcode == ppc_andis_))
{clear_cr_field(0);
if (bmode && ((bcrfield == 0)))




void binterlock::signal_FXU_Exe_plus_2(ppc_Inst& ins)
ppc_Opcode opcode = ins.op().op();
optype op = ins.op().opt();
if (op == FXU_MFSPR)







lr.wflag = CLEAR; ctr.wflag = CLEAR;}
else if (ins_is_mfctr(ins))
{cr_A = CLEAR;
ctr.rflag = CLEAR;
ctr.wflag = CLEAR; lr.wflag = CLEAR;};
// hooks called by subordinate FPU unit
void binterlock::signal_FPU_Exe1(ppc_Inst& ins)
{
optype op = ins.op().opt();
if (op == FPU_CMP)
if (bmode && ((bcrfield == ins.rand(1)->reg().actual())))
bl = CLEAR;
};
void binterlock::signal_FPU_Exe2_plus_1(ppc_Inst& ins)
{
optype op = ins.op().opt();
if (op == FPU_CMP)
if (bmode && ((bcrfield == ins.rand(1)->reg().actual())))
fpu_not_ready = FALSE;
};
void binterlock::signal_FPU_Exe2_plus_3(ppc_Inst& ins)
{
ppc_Opcode opcode = ins.op().op();
optype op = ins.op().opt();
if (op == FPU_CMP)
{clear_cr_field(ins.rand(1)->reg().actual());
if (bmode && (bcrfield == ins.rand(1)->reg().actual()))
if (resolve_stat == UNRESOLVED)
resolve_stat = RESOLVED;};
};





// can a particular instruction be dispatched?
bool
binterlock::dispatch_ready_p(const ppc_Inst& test_ins, String& targ) const
{ ppc_Opcode opcode = test_ins.op().op();
optype op = test_ins.op().opt();
if (bmode) // if branch mode is in effect
if ((op == CR_OP) || (op == BRANCH))
return FALSE;












if ((op != CR_OP) && (op != BRANCH))
if (fxu_unit->full_p() || fpu_unit->full_p())
return FALSE;
// in any event, switch
switch (op) {













case FXU_CMP: case FPU_CMP:
return ((!bl) && (check_cr_field(test_ins.rand(1)->reg().actual())));
case FXU_LOG:
if ((opcode == ppc_andi_) || (opcode == ppc_andis_))
return ((!bl) && (check_cr_field(0)));
else return TRUE;
case CR_OP:





// dispatch an instruction
void binterlock::dispatch(ppc_Inst& ins, timer& tmr, String& targ)
{






























if (opcode == ppc_mcrf)
{cr_op_exists = TRUE;
cr_op_lock = ins.rand(1)->reg().actual() % 4;}
else
{cr_op_exists = TRUE;







if (binst->op().brc_p(binst->rands(),binst->nrands()))
{if (binst->rand(1)->reg_p())
bcrfield = binst->rand(1)->reg().actual();
else




!check_cr_field(bcrfield))
{resolve_stat = UNRESOLVED; bl = SET;}
else
{resolve_stat = RESOLVED; bl = CLEAR;}
if (ins.op().cdp(binst->rands(),binst->nrands()) && ctr.wflag)









// subprocedures for handling dispatch and dispatch readiness
// handle the dispatch of a move to special register instruction
void binterlock::handle_mtspr_dispatch(ppc_Inst& ins)





else if (ins.op().op() == ppc_mcrxr)
set_cr_field(ins.rand(1)->reg().actual(),FXU_STYPE);
else if (ins_is_mtlr(ins))
{cr0_4.rflag = SET; cr1_5.rflag = SET;





{cr0_4.rflag = SET; cr1_5.rflag = SET;
cr2_6.rflag = SET; cr3_7.rflag = SET;
lr.rflag = SET;
ctr.rflag = SET;
ctr.wflag = SET;};
fxu_unit->dispatch(ins);
fpu_unit->dispatch(ins);
// handle the dispatch of a move from special register instruction
void binterlock::handle_mfspr_dispatch(ppc_Inst& ins)




















((mask & 0x02) && (crfield == 6)) ||
((mask & 0x01) && (crfield == 7)));
// can a move to special register instruction be dispatched?
bool binterlock::can_dispatch_mtspr_p(const ppc_Inst& test_ins) const
if (test_ins.op().op() == ppc_mtcrf)




else if (test_ins.op().op() == ppc_mcrxr)
return ((!bl) && (check_cr_field(test_ins.rand(1)->reg().actual())));
else if (ins_is_mtlr(test_ins))




// can a CR operation instruction be dispatched?
bool binterlock::can_dispatch_cr_op_p(const ppc_Inst& test_ins) const
if (test_ins.op().op() == ppc_mcrf)
return ((check_cr_field(test_ins.rand(1)->reg().actual())) &&
(check_cr_field_r(test_ins.rand(2)->reg().actual())));
else if (test_ins.nrands() == 3)
return ((check_cr_field(test_ins.rand(1)->num()/4)) &&
(check_cr_field_r(test_ins.rand(2)->num()/4)) &&
(check_cr_field_r(test_ins.rand(3)->num()/4)));
else if (test_ins.nrands() == 2)
return ((check_cr_field(test_ins.rand(1)->num()/4)) &&




// some useful macros for classifying PowerPC instructions
inline bool ins_is_mtlr(const ppc_Inst& ins)
{return
((ins.op().op() == ppc_mtlr) ||
((ins.op().op() == ppc_mtspr) &&
(ins.rand(1)->reg().actual() == 8)));};
inline bool ins_is_mtctr(const ppc_Inst& ins)
{return
((ins.op().op() == ppc_mtctr) ||
((ins.op().op() == ppc_mtspr) &&
(ins.rand(1)->reg().actual() == 9)));};
inline bool ins_is_mflr(const ppc_Inst& ins)
{return
((ins.op().op() == ppc_mflr) ||
((ins.op().op() == ppc_mfspr) &&
(ins.rand(2)->reg().actual() == 8)));};
inline bool ins_is_mfctr(const ppc_Inst& ins)
{return
((ins.op().op() == ppc_mfctr) ||
((ins.op().op() == ppc_mfspr) &&
(ins.rand(2)->reg().actual() == 9)));};





// CR flag checking and setting/clearing procedures
// check whether a given mask interferes with set CR flags
bool binterlock::check_cr_mask(int mask) const
{ int local_lock = cr_op_exists ? cr_op_lock : -1;
if ((mask & 0x88) && (cr0_4.wflag || (local_lock == 0)))
return FALSE;
else if ((mask & 0x44) && (cr1_5.wflag || (local_lock == 1)))
return FALSE;
else if ((mask & 0x22) && (cr2_6.wflag || (local_lock == 2)))
return FALSE;




// check whether a mask intersects a CR field--this function is not a member
// of binterlock--same as above except using an explicit crfield rather than
// the current binterlock object's flags
bool check_intersection(int mask, int crfield)
{return
(((mask & 0x80) && (crfield == 0)) ||
((mask & 0x40) && (crfield == 1)) ||
((mask & 0x20) && (crfield == 2)) ||
((mask & 0x10) && (crfield == 3)) ||
((mask & 0x08) && (crfield == 4)) ||
((mask & 0x04) && (crfield == 5)) ||
// QPROF modmap ADT -- sybok@athena
// The modmap ADT represents a collection of nodemaps, one for each









modmap(ICache* ic, ppc_Mod& mod); // constructor
branch_chart_list& // build branch chart list for a given BB
make_branch_map(String& target, bool fall_through_p, bool repeat_p);
branch_chart_list& // build branch chart list for string of connected BBs
make_branch_map(String& target1, String& target2,
bool fall_through_p, bool repeat_p);
void print(ppc_Part* ptr)





// generate branch map to a given BB
inline branch_chart_list&




// generate a branch map to a string of topologically connected BBs
inline branch_chart_list&
modmap::make_branch_map(String& target1, String& target2,





// QPROF modmap ADT -- sybok@athena
// The modmap ADT represents a collection of nodemaps, one for each
// partition in the module
#include "modmap.h"
// construct a modmap
modmap::modmap(ICache* ic, ppc_Mod& mod): icache_unit(ic)
{
partmap = new QVHMap<ppc_Part*,nodemap*>(0,20);
for (ppc_ProcIter iProc(mod.procIter()); !iProc.end_p(); iProc++)
for (ppc_CBIter iCB(iProc.value()->CBIter()); !iCB.end_p(); iCB++)
for (ppc_PartIter iPart(iCB.value()->partIter());
!iPart.end_p(); iPart++)
(*partmap)[iPart.value()] = new nodemap(iPart.value(),ic);
};
// QPROF nodemap ADT -- sybok@athena
// The nodemap ADT is the top-level data type for "branch map" synthesis--
// generating a list of labels to guide the simulator through the target
// basic block. To each partition in the current module, there will












// make some shorthand notation
typedef DList<String*> branch_chart;
typedef DList<DList<String*>*> branch_chart_list;
// some forward declarations
branch_chart_list&
operator *(branch_chart_list& s1, branch_chart_list& s2);
branch_chart_list&
multi_append_list(branch_chart_list& ml, branch_chart& tail);
// the class nodemap
class nodemap
{
public:
nodemap(ppc_Part* partition, ICache* icache_p); // constructor
bbnode* grab_root() const // extract the root bbnode
{return root;}; // of a nodemap
branch_chart_list& // build branch chart for given BB
make_branch_map(String& target, bool fall_through_p, bool repeat_p);
branch_chart_list& // build branch chart for string of connected BBs
make_branch_map(String& target1, String& target2,
bool fall_through_p, bool repeat_p);
void print()
{root->print(); cout << endl;};
private:
ICache* icp; // store the ICache built from the modmap's module
ppc_Part* part; // store pointer to the nodemap's partition
bbnode* root; // store the root bbnode of the nodemap
void find_path(bbnode* curr_node, // find path from
String& target, // a node of the nodemap
branch_chart& working_path, // to a node labeled with
QVHMap<bbnode*,bool>& nr, // a target string
bbnode*& target_node,
branch_chart_list& ml);
void create_node(ppc_InstIter& iIter, // recursively build a node in the
bbnode*& bbnref); // nodemap given an iterator
// pointing at the beginning of the
// appropriate BB
void find_exit_path(branch_chart& exit_path, // find a path out of
bbnode*& t_node, // nodemap from




QVHMap<ppc_Inst*,bbnode*> node_registry; // this holds the node registry
// used to avoid duplicating
// nodes unnecessarily or
// entering infinite loops
void build_init_exit_path(branch_chart& exit_path, // treat exit from




String& btarget, bool& done_with_main_BB);
#endif
// QPROF nodemap ADT -- sybok@athena
// The nodemap ADT is the top-level data type for "branch map" synthesis--
// generating a list of labels to guide the simulator through the target
// basic block. To each partition in the current module, there will
// correspond a nodemap data object.
#include "nodemap.h"
// This is the constructor for the nodemap ADT. It takes as arguments
// the partition to be represented by the nodemap and the ICache object of
// the containing module. A call to create_node generates the nodal
// structure pointed to by the root node of nodemap.
nodemap::nodemap(ppc_Part* partition, ICache* icache_p):
part(partition), icp(icache_p), node_registry(0,10)
{bbnode* rootnode;
ppc_InstIter i(part->instIter()); // set up an iterator at the first






// build the nodes
// This member, create_node, recursively builds the nodemap structure. It
// takes an iterator over instructions and a reference to a pointer to a node.
// It first builds a node corresponding to the BB beginning at the instruction
// marked by the iterator's current position, then makes the reference-passed
// pointer point to the new node, and finally calls itself to generate
// the descendant nodes of the current node. It uses the node_registry
// variable to keep track of the nodes it's seen in a given partition.






{DList<ppc_Inst*>* bb = new DList<ppc_Inst*>; // this will point to the list
// of instructions,
// comprising BB, that
// will be passed to the
// bbnode constructor
ppc_Inst* curr_inst; // points at most current instruction
bbnode* new_node; // will be passed referentially to callee to
// permit them to set it to the created bbnode
ppc_Part* part; // hold current partition
ppc_Inst* ins; // hold an instruction
ppc_Inst* first_i = iIter.value(); // grab first instruction in BB
// the reason all variables had to be declared above is that none
// can be declared inside a loop.
while (!iIter.end_p())
curr_inst = iIter.value(); // grab pointer to current instruction and
bb->append(curr_inst); // append it to the object pointed at by bb
iIter++;
if ((!curr_inst->op().inert_p()) || // check for end of BB
((!iIter.end_p()) &&
(!(iIter.value()->label().empty()))))
{ bbnref = new bbnode(*bb,icp); // create the new bbnode
node_registry[first_i] = bbnref; // record in registry
// that this bbnode has now
// been already seen
if (bbnref->has_direct_target_p()) // if BB has a direct-target BB...
part = icp->partition(bbnref->target_label()); // grab partition
ins = icp->inst(bbnref->target_label()); // and inst of
found_right = TRUE; // top of target BB
for(ppc_InstIter inner_Iter(part->instIter()); // construct an
!(inner_Iter.end_p()); inner_Iter++) // iterator pointing
if (inner_Iter.value() == ins) // at the beginning
{ // of the target BB
create_node(inner_Iter,new_node); // create the





// set the right child
// of the original node
// bbnref to point to
// newly created new_node
if (bbnref->has_indirect_target_p() || // if BB has a fall-thru BB
bbnref->has_fall_thru_p()) // or an indirect-target BB...
create_node(iIter,new_node); // create a new bbnode, using
// the current iterator pos.






// set right child of
// bbnref to new bbnode





bbnref->set_guide(GO_LEFT);};}; //
// for fall-thru, set
// left child of bbnref





// The find_path member uses the connectivity of the nodemap to compute
// all unique paths from a given node to a second node (BB) with a given
// target label (which touch each node at most once). These paths are
// specified as lists of actions required at branches. A node registry
// is used to ensure that infinite loops do not occur during processing.
// The result is returned via the reference variable ml, a list of branch
// charts (action lists), and the pointer passed to target_node is set to
// the target BB's bbnode when it is found.
void nodemap::find_path(bbnode* curr_node, String& target,
branch_chart& working_path,
QVHMap<bbnode*,bool>& nr, bbnode*& target_node,
branch_chart_list& ml)
if (curr_node->contains_label(target)) // if node has target label, we are
{ml.append(&working_path); // done; append current branch chart
target_node = curr_node;} // and return to caller
else
if (!(nr.contains(curr_node))) // otherwise...
QVHMap<bbnode*,bool>* registry // duplicate registry,
= new QVHMap<bbnode*,bool>(nr); // since it is passed
// referentially
(*registry)[curr_node] = TRUE; // add current bbnode to registry
// In the case of a fall through label, simply recurse on descendant
if ((curr_node->get_nexit_type()) == FALLTHRU_2_LABEL)
find_path(curr_node->get_fall_thru(), target, working_path,
*registry, target_node, ml);
// Otherwise, for a branch-terminated BB...
else
{if (curr_node->has_fall_thru_p())
// If the branch has a fall through, add a "" to the branch chart and recurse




// If the branch has a non-empty direct target, add target label to branch
// chart and recurse,
if ((curr_node->has_direct_target_p()) &&
(!(curr_node->target_label().empty())))
{branch_chart* right_path = new branch_chart(working_path);
right_path->append(new String(curr_node->target_label()));
find_path(curr_node->get_direct_target(), target, *right_path,
*registry, target_node, ml);}
// or if instead it has a non-empty indirect target, add the target label
// of the simulated RTS stub and a return label to the branch chart--then
// recurse
else if ((curr_node->has_indirect_target_p()) &&
(!(curr_node->get_indirect_target()->get_label().empty())))









// The find_exit_path member, when given an initial node and two
// boolean parameters, finds a unique short path to the end of the
// partition, favoring fall-through over branches when the fall-through
// node has not been visited before.
// The exit_path reference variable is used to return the exit chart
// (which will become the tail end of the branch chart), and a node
// registry, nr, is again used to prevent infinite loops from occurring
// during processing.
// The flag repeat_p causes the BB to be repeated once before taking
// the exit path, and the flag fall_through_p determines whether the
// exit from the chosen BB occurs via fall-through or branch, where the
// chosen option is possible.
void nodemap::find_exit_path(branch_chart& exit_path, bbnode*& t_node,
QVHMap<bbnode*,bool>& nr,
bool fall_through_p, bool repeat_p,
String& btarget)
{bool done_with_main_BB = FALSE;
// while the current node still has children and is untouched...




((t_node->target_label() != btarget) || done_with_main_BB);
if (done_with_main_BB && fall_through_p && !repeat_p)
nr[t_node]=TRUE; // mark node reached
*/
// If the current node is a fall-through type, we can't loop back
// or determine the exit method to satisfy the 2 boolean flags,
// so turn off the initial mode flag and begin to work on the
// fall-through node.






// If the current node is a branch-type node (terminated by branch),
// then clear the initial mode flag (for the benefit of future loop







// If the initial stage is over,
else
// For the default case (other than the initial stage) ...
// If the current bbnode has a fall-through node which has not been touched,
// then add "" to the exit chart and recurse on the fall-through node.




// If the current bbnode has a direct target which has not been touched,






// If the current bbnode has an indirect target which has not been touched,
// then add a jmp to "_simulated_RTS_stub" to the exit chart, followed
// by a jmp back to the user bbnode pointed at by the right child of







sigerr("nodemap::find_exit_path: Partitioning error--Inf. loop.");
};
// At the end of the exit chart, add a jump into the simulated RTS stub
// to simulate the return to the runtime system code (scheduler) following
// the completion of a partition.
exit_path.append(new String("_simulated_RTS_stub"));
};
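// Illustrative sketch, not part of the QProf source: find_path above
// enumerates every path that touches each node at most once, giving each
// recursive call its own copy of the node registry. The same idea is shown
// below on a plain label graph using standard containers. The graph
// representation and the name all_paths are hypothetical, and the sketch
// records node labels rather than branch actions.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

typedef std::vector<std::string> chart;                          // one path
typedef std::map<std::string, std::vector<std::string> > graph;  // label -> successors

// Collect every path from curr to target that visits each node at most once.
// working and seen are passed by value, so each call works on its own copy,
// which is the role the duplicated registry plays in find_path.
inline void all_paths(const graph& g, const std::string& curr,
                      const std::string& target, chart working,
                      std::set<std::string> seen, std::vector<chart>& out)
{
    working.push_back(curr);
    if (curr == target) { out.push_back(working); return; }
    if (!seen.insert(curr).second) return;       // node already on this path
    graph::const_iterator it = g.find(curr);
    if (it == g.end()) return;                   // no successors
    for (std::size_t k = 0; k < it->second.size(); k++)
        all_paths(g, it->second[k], target, working, seen, out);
}
```

// Because the visited set is copied rather than shared, two different routes
// may pass through the same node, while a single route can never loop.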
// The member make_branch_map is one of the two user-callable routines
// for creating a complete branch chart through a partition. Given a
// single target, and the two boolean parameters fall_through_p and
// repeat_p, it computes all paths from the root node to the target BB
// (which touch each bbnode at most once), and then adds an exit chart
// to each route to the target bbnode to create a complete branch chart
// for that partition
branch_chart_list& nodemap::make_branch_map(String& target,
bool fall_through_p, bool repeat_p)
{branch_chart_list* master_list = new branch_chart_list;
QVHMap<bbnode*,bool>* nr = new QVHMap<bbnode*,bool>(FALSE,10);
bbnode* t_node;
// find all paths to the target bbnode
find_path(root,target,*(new branch_chart),*nr,t_node,*master_list);
// clear the node registry
nr->~QVHMap();
nr = new QVHMap<bbnode*,bool>(FALSE,10);
// compute the exit chart from the target bbnode
branch_chart& exit_path = *(new branch_chart);
find_exit_path(exit_path, t_node, *nr, fall_through_p, repeat_p, target);





// Although bearing the same name as the preceding member, this version
// of make_branch_map allows branch charts to be generated which, instead
// of merely passing through a single target bbnode, can be chosen to
// pass through a series of connected bbnodes. To call this branch charting
// procedure, therefore, you must supply *two* targets, which should
// correspond to the labels of the first and last bbnodes in the connected
// series you want to travel down. The parameters are the same as in
// the first version of make_branch_map, and determine exit chart
// characteristics.
branch_chart_list&
nodemap::make_branch_map(String& target1, String& target2,
bool fall_through_p, bool repeat_p)
{ branch_chart_list* master_list1 = new branch_chart_list;
QVHMap<bbnode*,bool>* nr = new QVHMap<bbnode*,bool>(FALSE,10);
bbnode* t_node;
// compute branch charts from the root bbnode to the first target bbnode
find_path(root,target1,*(new branch_chart),*nr,t_node,*master_list1);
// clear the node registry
nr->~QVHMap();
nr = new QVHMap<bbnode*,bool>(FALSE,10);
// compute branch charts from the first target bbnode to the second
branch_chart_list* master_list2 = new branch_chart_list;
find_path(t_node,target2,*(new branch_chart),*nr,t_node,*master_list2);
// take the direct product of the two preceding branch charts
branch_chart_list& ml = (*master_list1) * (*master_list2);
// clear the node registry
nr->~QVHMap();
nr = new QVHMap<bbnode*,bool>(FALSE,10);
// compute exit chart from the second target bbnode out of the partition
branch_chart& exit_path = *(new branch_chart);
find_exit_path(exit_path, t_node, *nr, fall_through_p, repeat_p, target2);
// append the exit chart onto each branch chart in ml running from the




// The following procedure appends one branch chart to each element of
// a list of branch charts. It's used to append the exit chart to the




{for(DListIter<branch_chart*> llIter(ml.iter()); !llIter.end_p(); llIter++)
llIter.value()->append(tail);
return ml;};
// The following procedure acting on branch_chart_lists was found
// useful in the implementation of the nodemap ADT. It takes the direct
// product of two branch_chart_lists.
branch_chart_list& operator *(branch_chart_list& s1, branch_chart_list& s2)
{
branch_chart_list* product = new branch_chart_list; // make the output list
DListIter<branch_chart*> iIter1 = *s1.iter(); // build the iterators
DListIter<branch_chart*> iIter2 = *s2.iter(); // over the two input
// lists
branch_chart* temp;
for ( ; !iIter1.end_p(); iIter1++, iIter2.first()) // loop over the two
for ( ; !iIter2.end_p(); iIter2++) // lists
{temp = new branch_chart(*iIter1.value());
temp->append(*iIter2.value());
product->append(temp);}; // appending each product of two
// terms from separate lists to
// the output list
return *product; // return the output list
};
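// Illustrative sketch, not part of the QProf source: the direct product and
// the append-to-each procedure above can be restated with standard
// containers, assuming a chart is simply an ordered list of labels. The
// names direct_product and append_to_each are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

typedef std::vector<std::string> chart;

// Direct product: for every a in s1 and every b in s2 (in that order),
// emit the chart formed by a followed by b.
inline std::vector<chart> direct_product(const std::vector<chart>& s1,
                                         const std::vector<chart>& s2)
{
    std::vector<chart> product;
    for (std::size_t i = 0; i < s1.size(); i++)
        for (std::size_t j = 0; j < s2.size(); j++) {
            chart temp(s1[i]);                               // copy a
            temp.insert(temp.end(), s2[j].begin(), s2[j].end()); // then b
            product.push_back(temp);
        }
    return product;
}

// Append the same tail (for example an exit chart) to every chart in ml.
inline void append_to_each(std::vector<chart>& ml, const chart& tail)
{
    for (std::size_t i = 0; i < ml.size(); i++)
        ml[i].insert(ml[i].end(), tail.begin(), tail.end());
}
```

// With m charts leading to the first target and n charts between the two
// targets, the product holds m*n charts, each of which then receives the
// single exit chart.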
// handle the initial portion of the exit path in a branch chart
void
nodemap::build_init_exit_path(branch_chart& exit_path, bbnode*& t_node,
QVHMap<bbnode*,bool>& nr,
bool& fall_through_p, bool& repeat_p,
String& btarget, bool& done_with_main_BB)
// If the repeat flag is set, and the loop can jump to the beginning
// of the current node, add such a loop back to the exit chart.






// If the fall-through flag is set, and the node has a fall-through
// descendant, then take the fall-through and add a "" to the exit chart.
if ((fall_through_p) && t_node->has_fall_thru_p())
{exit_path.append(new String(""));
t_node = t_node->get_fall_thru();}
// If the fall-through flag is clear, and the node has a direct
// branch target, add that branch label to the exit chart, and then take
// the branch to the target BB node.
else if (!(fall_through_p))
{fall_through_p = TRUE;
if (t_node->has_direct_target_p())
{exit_path.append(new String(t_node->target_label()));
t_node = t_node->get_direct_target();}
// If the fall-through flag is clear and the node has an indirect target,
// then jump to the indirectly linked node, and add a "_simulated_RTS_stub"







// QPROF bbnode ADT implementation -- sybok@athena
// This data type is used to represent basic blocks of PowerPC instructions,
// to allow the branch path generator to generate a correct sequence of

















enum ntype {NODE, INDIRECT_NODE, EMPTY}; // type of child
enum nexit_type {BRANCH, FALLTHRU_2_LABEL, UNINITIALIZED}; // type of bbnode
enum guide_elt {NO_GUIDE,GO_LEFT,GO_RIGHT,END_POINT};




bbnode(): _type(UNINITIALIZED)
{lbl = ""; left_node = 0; right_node = 0;
left_type = EMPTY; right_type = EMPTY; guide = NO_GUIDE;};





// left child predicates and modifiers
bool has_fall_thru_p() const





// right child direct-target predicates and modifiers
bool has_direct_target_p() const





// right child indirect-target predicates and modifiers
bool has_indirect_target_p() const
{return (right_type == INDIRECT_NODE);};
void set_indirect_target(bbnode* bbp)
{right_node = bbp;};
bbnode* get_indirect_target() const
{return right_node;};
// get branch bbnode's target
String target_label() const
{return t_label;};
// get type of bbnode
const nexit_type get_nexit_type()
{return _type;};
// check and get label of bbnode
bool contains_label(const String& s)
{return (lbl == s);};
String get_label()
{return lbl;};
void print(int level = 0)
{
if (level < 10)
{ cout << " " << lbl << "(";
if (has_fall_thru_p())
get_fall_thru()->print(level+1);
else cout << "EMPTY";









bool is_lbr_rts_jmp(ppc_Inst& i) const;
// local pointer to the ICache
ICache* icp;










// QPROF bbnode ADT implementation -- sybok@athena
// This data type is used to represent basic blocks of PowerPC instructions,
// to allow the branch path generator to generate a correct sequence of
// branches through a partition.
#include "bbnode.h"
// This is the constructor for bbnodes which works given a DList of
// the instructions in the bbnode and the ICache object for the containing
// module.
bbnode::bbnode(DList<ppc_Inst*>& dl, ICache* icache_p)
{t_label = *(new String(""));
guide = NO_GUIDE;
icp = icache_p;






















// The is_lbr_rts_jmp function looks at the target name of a direct
// jump to a label and determines whether the label is in user code or
// the RTS. It returns true iff the jump is into the RTS.
bool bbnode::is_lbr_rts_jmp(ppc_Inst& i) const
{
ppc_Rand** rarray = i.rands();
int location = NOT_FOUND;
if (i.op().inert_p() || i.op().reg_br_p())
sigerr("bbnode::is_lbr_rts_jmp: Argument not a branch to label.");
for (int j = 0; j < i.op().nrands(); j++)
if (rarray[j]->label_p()) location = j;
if (location == NOT_FOUND)
sigerr("bbnode::is_lbr_rts_jmp: Branch to label missing label.");
return !(icp->exists_p(rarray[location]->label()));
};
// The get_branch_target function returns the exit target label of a bbnode
// supposing that the bbnode is terminated by a branch to label instruction
// (rather than branch to register).
String get_branch_target(ppc_Inst& i)
{
ppc_Rand** rarray = i.rands();
int location = NOT_FOUND;
if (i.op().inert_p() || i.op().reg_br_p())
sigerr("bbnode::get_branch_target: Argument not a branch to label.");
for (int j = 0; j < i.op().nrands(); j++)
if (rarray[j]->label_p()) location = j;
if (location == NOT_FOUND)
sigerr("bbnode::get_branch_target: Branch to label missing label.");
String* local_str = new String(rarray[location]->label());
return *local_str;
};
// QPROF bqueue ADT -- sybok@athena
// The bqueue ADT is designed to simulate the instruction dispatch buffers
// of the branch processor. There are 8 buffers for dispatching sequential
// instructions, and 4 for collecting branch target instructions before
// the branch has been resolved.
#ifndef NULL
#define NULL 0
#endif
// define an improved modulus function
#ifndef NEWMOD
#define NEWMOD
inline int mod(int n, int m)









const int SEQSIZE = 8;
const int JMPSIZE = 4;
// some important primitives
inline int seq_succ(int j) {return mod(j + 1,SEQSIZE+1);};
inline int seq_pred(int j) {return mod(j - 1,SEQSIZE+1);};
inline int jmp_succ(int j) {return mod(j + 1,JMPSIZE+1);};
inline int jmp_pred(int j) {return mod(j - 1,JMPSIZE+1);};
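// Illustrative sketch, not part of the original QProf sources. The body of
// mod() is not legible in this listing, so the code below only assumes what
// the comments imply: a remainder that stays in [0, m) for negative n, used
// to step head/tail indices around a ring of SEQSIZE+1 slots. The names
// wrap_mod, ring_succ, ring_pred, ring_size and kSeqSize are hypothetical.

```cpp
#include <cassert>

// Remainder that is always in [0, m), even when n is negative.
// (The built-in % may return a negative value for negative n.)
inline int wrap_mod(int n, int m) {
    assert(m > 0);
    int r = n % m;
    return (r < 0) ? r + m : r;
}

const int kSeqSize = 8;  // mirrors SEQSIZE in the listing

// A ring of kSeqSize + 1 slots holds at most kSeqSize entries, so
// head == tail can unambiguously mean "empty".
inline int ring_succ(int j) { return wrap_mod(j + 1, kSeqSize + 1); }
inline int ring_pred(int j) { return wrap_mod(j - 1, kSeqSize + 1); }

// Number of occupied slots between head (inclusive) and tail (exclusive).
inline int ring_size(int head, int tail) {
    return wrap_mod(tail - head, kSeqSize + 1);
}
```

// The spare slot is why the primitives above wrap at SEQSIZE+1 and
// JMPSIZE+1 rather than at SEQSIZE and JMPSIZE.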
class bqueue
public:
  bqueue() {seqhead = 0; seqtail = 0;
            jmphead = 0; jmptail = 0;};
  int seq_empty_p() {return (seqtail == seqhead);};
  int jmp_empty_p() {return (jmptail == jmphead);};
  void load_seq( DList<ppc_Inst*>& fetched_list );
  void load_jmp( DList<ppc_Inst*>& fetched_list );
  void sideload();             // slam the seq. buffers with target buf. contents
  void purge_jmp();            // reset jmp. buffers to empty state (branch failed)
  bool see_branch_p();         // can you see branch downstream?
  ppc_Inst* extract_branch();  // extract it
  int seq_room() {return (SEQSIZE - seqsize());};




  int seqsize() {return mod(seqtail-seqhead,SEQSIZE+1);};
  int jmpsize() {return mod(jmptail-jmphead,JMPSIZE+1);};
  void print();
private:







// load seq. buffer with 1 instruction
inline void bqueue::internal_load_seq(ppc_Inst* ins)
{ if (seq_room() <= 0) sigerr("bqueue::internal_load_seq: No room left.");
  seqbuf[seqtail] = ins;
  seqtail = seq_succ(seqtail);
};
// load branch target buffer with 1 instruction
inline void bqueue::internal_load_jmp(ppc_Inst* ins)
{ if (jmp_room() <= 0) sigerr("bqueue::internal_load_jmp: No room left.");
  jmpbuf[jmptail] = ins;
  jmptail = jmp_succ(jmptail);
};
// load seq. buffer with list of instructions
inline void bqueue::load_seq( DList<ppc_Inst*>& fetched_list )
{ for(ppc_InstIter& iIter = (fetched_list.iter());
      !iIter.end_p(); iIter++)
    internal_load_seq(iIter.value());
};
// load branch target buffer with list of instructions
inline void bqueue::load_jmp( DList<ppc_Inst*>& fetched_list )




// pop seq. buffer entry
inline ppc_Inst* bqueue::seq_pop()
  if (seqsize() <= 0) sigerr("bqueue::seq_pop: Can't pop empty queue.");




// peek at head of seq. buffer
inline ppc_Inst* bqueue::seq_peek()
{ if (seqsize() <= 0) sigerr("bqueue::seq_peek: Can't peek at empty queue.");
  return (seqbuf[seqhead]);
};
// pop branch target buffer entry
inline ppc_Inst* bqueue::jmp_pop()
{ if (jmpsize() <= 0) sigerr("bqueue::jmp_pop: Can't pop empty queue.");
  ppc_Inst* ip = jmpbuf[jmphead];
  jmphead = jmp_succ(jmphead);
  return ip;
};
// smash seq. buf. with branch target buffer entries
inline void bqueue::sideload()
{ seqhead = 0; seqtail = 0;
  while (jmpsize() > 0)
    internal_load_seq(jmp_pop());
  jmphead = 0; jmptail = 0;
};
// reset jmp. buf. (failed branch)
inline void bqueue::purge_jmp()
{jmphead = 0; jmptail = 0;}
// look downstream for a branch
inline bool bqueue::see_branch_p()
{ for (int i = seqhead;
       (i != seqtail) && (i != (mod(seqhead+5,SEQSIZE+1)));
       i = seq_succ(i))
    if (!seqbuf[i]->op().inert_p())
      return TRUE;
  return FALSE;
};
// extract branch from sequential buffer
inline ppc_Inst* bqueue::extract_branch()
};
  for (int i = seqhead;




  sigerr("bqueue::extract_branch: No Branch.");
  return NULL;
inline void bqueue::print()
{ cout << "BQUEUE OBJECT:" << endl;
  cout << "------------------------------------" << endl;
  cout << "[SEQ QUEUE: size = " << seqsize();
  cout << ", room = " << seq_room() << "]" << endl;
  for(int i = seqhead; i != seqtail; i = seq_succ(i))
    seqbuf[i]->print();
  cout << "[JMP QUEUE: size = " << jmpsize();
  cout << ", room = " << jmp_room() << "]" << endl;
  for(i = jmphead; i != jmptail; i = jmp_succ(i))
    jmpbuf[i]->print();
  cout << "------------------------------------" << endl;
};
#endif
// QPROF ICache abstraction -- sybok@athena
// This ADT simulates the ICache by fetching up to 4 instructions at a time
// from a partition of a PowerPC module. A new ICache object must be
// generated for each PowerPC module. The ICache has its own internal
// instruction pointer which is used as a default when instructions are
// fetched and loaded from it. A label target can be specified instead,














// The class ins_pointer represents an ICache ip. It must contain
// a pointer to the partition to allow the instruction






  ins_pointer(ppc_Inst* i, ppc_Part* p)
    {i_pntr = i; p_pntr = p;};
  ppc_Inst* ins() const {return i_pntr;};





// some important definitions and constants
typedef QVHMap<String,ins_pointer*> labelMap;  // map targets to instructions
typedef DList<ppc_Inst*> inslist;              // The ICache will return a list of
                                               // pointers to PPC instructions
                                               // representing the fetched
                                               // instructions.
const String header("_simulated_RTS_stub");
const int FETCH_LIMIT = 4;
// the class ICache
class ICache
public:
  ICache(const ppc_Mod& in_mod): _map(0)           // construct from ppc module
    {_mod = in_mod; _curr_inst = 0; _outlist = 0; init_labelMap();};
  void fetch(const String& target);                // fetch at target
  void fetch(int lim = FETCH_LIMIT);               // fetch at current ICache ip
  inslist& load() const {return *_outlist;};       // load fetched instructions
  bool exists_p(const String& target) const        // checks whether label exists
    {return _map.contains(target);};               // in current module
  ppc_Part* partition(const String& target);       // returns partition containing
                                                   // target label--assumes label
                                                   // exists in current module

// the initialization function
// the original PPC module
// the target -> instruction map
// the ICache ip
// the fetched instructions
// This function returns the partition associated with label.
inline ppc_Part* ICache::partition(const String& target)
{ if (!_map.contains(target)) sigerr("ICache::partition: Target not found.");
  return _map[target]->part();
};
inline ppc_Inst* ICache::inst(const String& target)




// QPROF ICache abstraction -- sybok@athena
// This ADT simulates the ICache by fetching up to 4 instructions at a time
// from a partition of a PowerPC module. A new ICache object must be
// generated for each PowerPC module. The ICache has its own internal
// instruction pointer which is used as a default when instructions are
// fetched and loaded from it. A label target can be specified instead,
// which would typically be used to satisfy branches.
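// Illustrative sketch, not part of the original QProf sources. It models
// only the behavior described in the comment above: a fetch of at most 4
// instructions from an internal instruction pointer, and a fetch from a
// label that first redirects that pointer. Instructions are plain strings
// here, and FetchSketch, kFetchLimit and the member names are hypothetical
// stand-ins for the ppc_* classes used by the real ICache.

```cpp
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

class FetchSketch {
public:
    static const int kFetchLimit = 4;  // mirrors FETCH_LIMIT in the listing

    FetchSketch(const std::vector<std::string>& code,
                const std::map<std::string, std::size_t>& labels)
        : code_(code), labels_(labels), ip_(0) {}

    // Fetch up to `limit` (never more than kFetchLimit) instructions
    // starting at the internal instruction pointer, advancing it.
    std::vector<std::string> fetch(int limit = kFetchLimit) {
        if (limit > kFetchLimit) limit = kFetchLimit;
        std::vector<std::string> out;
        while (limit-- > 0 && ip_ < code_.size()) out.push_back(code_[ip_++]);
        return out;
    }

    // Fetch from a label: overwrite the instruction pointer with the
    // label's position, then fall back to the plain fetch.
    std::vector<std::string> fetch(const std::string& target) {
        std::map<std::string, std::size_t>::const_iterator it =
            labels_.find(target);
        if (it == labels_.end()) throw std::runtime_error("target not found");
        ip_ = it->second;
        return fetch(kFetchLimit);
    }

    // True iff the label is defined in this module.
    bool exists_p(const std::string& target) const {
        return labels_.count(target) != 0;
    }

private:
    std::vector<std::string> code_;
    std::map<std::string, std::size_t> labels_;
    std::size_t ip_;
};
```

// In the listing, a label absent from the map is what bbnode treats as a
// jump into the RTS (see exists_p and is_lbr_rts_jump).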
#include "ICache.h"
// To perform the fetch at the current ICache ip, traverse the
// partition corresponding to that ICache ip, and when the instruction
// marked by that ip is found, start building the list of fetched
// instructions. Collect exactly lim instructions, where lim is
// between 0 and 4 inclusive. The list becomes _outlist.
void ICache::fetch(int lim)
  lim = (lim < FETCH_LIMIT) ? lim : FETCH_LIMIT;
  if (_outlist != 0) _outlist->~DList();






  if ((_curr_inst != 0) && (lim > 0))
    for(ppc_InstIter iInst(_curr_inst->part()->instIter());
        !iInst.end_p() && (size_cnt < lim); iInst++)
      { if ((iInst.value()->op().op() == ppc_none) ||
            ((iInst.value()->op().op() >= ppc__proc) &&
             (iInst.value()->op().op() <= ppc__extern)))
          {if (iInst.value() == _curr_inst->ins()) {lim++; size_cnt++;};
if ((!int.valuO-lablOpty()) . (chlbelptyO) U ( hdlbl.epty())
             cached_label = iInst.value()->label();
           continue; };
        if ((iInst.value() == _curr_inst->ins()) || (size_cnt > 0))
          if ((cached_label.empty()) ||
              ((iInst.value() == _curr_inst->ins())




          ppci = new ppc_Inst(cached_label,iInst.value()->op(),
                              iInst.value()->source(),
                              iInst.value()->line());
          ppci->codegenProps(iInst.value()->codegenProps());
          pr = iInst.value()->rands();




      while ((!iInst.end_p()) &&
             ((iInst.value()->op().op() == ppc_none) ||








// To fetch from a target, smash the ICache ip with the value of the map
// function, then revert to the simple fetch function.
void ICache::fetch(const String& target)
{ if (!_map.contains(target)) sigerr("ICache::fetch: Target not found.");
  _curr_inst = _map[target];
  fetch(FETCH_LIMIT);
};
// To initialize the map, we traverse each Procedure, Code Block, Partition,




  for(ppc_ProcIter iProc(_mod.procIter()); !iProc.end_p(); iProc++)
    for(ppc_CBIter iCB(iProc.value()->CBIter()); !iCB.end_p(); iCB++)
      for(ppc_PartIter iPart(iCB.value()->partIter()); !iPart.end_p(); iPart++)
        for(ppc_InstIter




  ppc_Part* part = generate_RTS_stub();
  _map[header] = new ins_pointer((*part)[0],part);
// We generate an artificial partition simulating instructions that
// might be found in the run-time system code to simulate a jump into
// the RTS.
ppc_Part* ICache::generate_RTS_stub()
{ ppc_Part* part = new ppc_Part("RTS stub",THREAD_PART);
  ppc_Inst** i = new ppc_Inst*[12];   // slots 1..11 are used below
  i[1] = new ppc_Inst(ppc_mflr,header);
  i[1]->rands(new ppc_Rand(r0));
  i[2] = new ppc_Inst(ppc_mfcr);
  i[2]->rands(new ppc_Rand(r10));
  i[3] = new ppc_Inst(ppc_stw);
  i[3]->rands(new ppc_Rand(r0),new ppc_Rand(rSP),new ppc_Rand(8));
  i[4] = new ppc_Inst(ppc_stw);
  i[4]->rands(new ppc_Rand(r10),new ppc_Rand(rSP),new ppc_Rand(4));
  i[5] = new ppc_Inst(ppc_stwu);
  i[5]->rands(new ppc_Rand(rSP),new ppc_Rand(rSP),new ppc_Rand("neg_enda"));
  i[6] = new ppc_Inst(ppc_addi);
  i[6]->rands(new ppc_Rand(rSP),new ppc_Rand(rSP),new ppc_Rand("zda"));
  i[7] = new ppc_Inst(ppc_lwz);
  i[7]->rands(new ppc_Rand(r0),new ppc_Rand(rSP),new ppc_Rand(8));
  i[8] = new ppc_Inst(ppc_lwz);
  i[8]->rands(new ppc_Rand(r10),new ppc_Rand(rSP),new ppc_Rand(4));
  i[9] = new ppc_Inst(ppc_mtlr);
  i[9]->rands(new ppc_Rand(r0));
  i[10] = new ppc_Inst(ppc_mtcrf);
  i[10]->rands(new ppc_Rand(0x38),new ppc_Rand(r10));
  i[11] = new ppc_Inst(ppc_b);
  i[11]->rands(new ppc_Rand(header));
  for(int j = 1; j <= 11; j++)
    part->append(i[j]);
  return part;
};
// QPROF DCache ADT -- sybok@athena
// This ADT simulates the behavior of the data cache in accepting load
// and store requests from the FXU and FPU units and returning "data"









  DCache();      // the constructor
  void tick();   // advance the state of the DCache by one clock cycle
  // return the DCache's saved state
  // process the "return value" phase of an FXU load request
  bool fxu_load_val_ready_p() const {return fxu_load_return;};
  int fxu_load_val();
  // process the "return value" phase of an FPU load request
  bool fpu_load_val_ready_p() const {return fpu_load_return;};
  int fpu_load_val();
  // process the "request" phase of a load request (always made by FXU)





  // process an FXU "store buffer" store (FXU supplies DATA and ADDRESS)
  bool sb_store_ready_p() const {return !fxu_store_present;};
  void sb_store(addr target_addr);






















// slots for receiving load request
// slots for load returning to FXU
// slots for load returning to FPU
// slots for accepting FX store from FXU






// slots for accepting FP store from FXU
// and FPU
// delay slot for FPU store
addr make_generic_addr()
  {return *new addr(*(new ppc_Inst(ppc_lwzx)));};
inline int DCache::fxu_load_val()




inline void DCache::sb_store(addr target_addr)
{ if (fxu_store_present)
    sigerr("DCache::sb_store: Can't Send FXU Store Yet.");
  fxu_st_target_addr = target_addr;
  fxu_store_present = TRUE;
};
inline void DCache::psq_store(addr target_addr)
  if (fpu_store_addr_present)




inline int DCache::fpu_load_val()
{ if (!fpu_load_return) sigerr("DCache::fpu_load_val: No FPU Load Ready.");
  fpu_load_return = FALSE;
  return fpu_load_ret_target;
};
inline void DCache::dsq_store(fpu_data data)
{ if (fpu_store_data_present)
    sigerr("DCache::dsq_store: No FPU Load Ready.");
  fpu_store_data_present = TRUE;
};
#endif
// QPROF DCache ADT -- sybok@athena
// This ADT simulates the behavior of the data cache in accepting load
// and store requests from the FXU and FPU units and returning "data"
// (simulated) to the requesting functional unit.
#include "DCache.h"












  // check whether a store has just occurred with an address that matches
  // that of the load waiting due to a detected collision. If so, decrement
  // the collision counter.







  // if all collisions have been dealt with, and the load return slots
  // for the appropriate unit are free, fill them with the return data
  // for this load and free up the load request slot.
  if ((load_request_present) && (load_collisions == 0))












      fxu_performing_store = TRUE;
      fxu_perf_st_target_addr = fxu_st_target_addr;};
  // process an fpu store.
  fpu_performing_store = FALSE;























// QPROF fprmap ADT -- sybok@athena
// The fprmap ADT is designed to simulate the register remapping strategy














  // handle arithmetic FPU instructions
  fpdecode_entry arith_map(ppc_Inst* ins) const;
  // handle FP stores
  psq_entry store_map(ppc_Inst* ins) const;
  // handle FP loads
  bool load_map_possible_p(gqueue<int>* flist,
                           grqueue<int>* ptrq, gqueue<int>* olq);
  void load_map(ppc_Inst* ins, gqueue<int>* flist,
                grqueue<int>* ptrq, gqueue<int>* olq);
  // allow direct access to map
  int check_map(int orig_target) const {return maptable[orig_target];};







  for(int i = 0; i < FPR_COUNT; i++)
    maptable[i] = i;
};
// check whether doing a load is possible
inline bool fprmap::load_map_possible_p(gqueue<int>* flist,
                                        grqueue<int>* ptrq,
                                        gqueue<int>* olq)
{
  return((flist->size() > 0) && (ptrq->room() > 0) &&
         (olq->room() > 0));
};
// copy the fprmap ADT object
inline fprmap* fprmap::copy()
  fprmap* new_fprmap = new fprmap;




// QPROF fprmap ADT -- sybok@athena
// The fprmap ADT is designed to simulate the register remapping strategy
// used by the RS/6000 FPU.
#include "fprmap.h"
// arithmetic instructions are handled simply by remapping their registers
// according to the table.
fpdecode_entry fprmap::arith_map(ppc_Inst* ins) const
  optype op = ins->op().opt();
  if ((op != FPU) && (op != FPU_A) && (op != FPU_CMP))
    sigerr("fprmap::arith_map: Not an FPU arithmetic instruction.");
  ppc_Inst* new_ins =
    new ppc_Inst(ins->label(), ins->op(),ins->source(),ins->line());
  for(int i = 1; i <= ins->nrands(); i++)
    {ppc_Rand* rd = ins->rand(i);
     if ((rd->reg_p()) && (rd->reg().type() == FPR_RTYPE))
       {ppc_Reg* rg = new ppc_Reg(rd->reg().id(),
                                  FPR_RTYPE,






// Stores also have their FP registers remapped, but the result is
// a psq_entry with a "GB" bit of zero rather than a fpdecode_entry object
psq_entry fprmap::store_map(ppc_Inst* ins) const
  if (!(ins->op().fp_store_p()))
    sigerr("fprmap::store_map: Not an FPU store.");
  ppc_Inst* new_ins =
    new ppc_Inst(ins->label(), ins->op(),ins->source(), ins->line());
  for(int i = 1; i <= ins->nrands(); i++)
    {ppc_Rand* rd = ins->rand(i);
     if ((rd->reg_p()) && (rd->reg().type() == FPR_RTYPE))







// Do the load by manipulating the various register flag buffers and
// altering the table.
void fprmap::load_map(ppc_Inst* ins, gqueue<int>* flist,
                      grqueue<int>* ptrq, gqueue<int>* olq)
  if ((flist->size() <= 0) || (ptrq->room() <= 0) ||
      (olq->room() <= 0))
    sigerr("fprmap::load_map: Can't load now.");
  ppc_Reg rg = ins->rand(1)->reg();
  ptrq->load_element(maptable[rg.actual()]);
  maptable[rg.actual()] = flist->pop();




// QPROF timer ADT -- sybok@athena
// The timer ADT is used to keep track of the cycles required to
// execute a series of PowerPC instructions
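// Illustrative sketch, not part of the original QProf sources. It shows
// the three-state behavior that the timer code in this listing appears to
// implement: the timer waits for its start label, counts one cycle per
// tick while running, and stops at its stop label. TimerSketch, SketchState
// and the method names are hypothetical; the generic-end handling of the
// real timer (mark_end) is deliberately left out.

```cpp
#include <string>

enum SketchState { SKETCH_READY, SKETCH_RUNNING, SKETCH_DONE };

class TimerSketch {
public:
    TimerSketch(const std::string& start_label, const std::string& stop_label)
        : start_(start_label), stop_(stop_label),
          state_(SKETCH_READY), count_(0) {}

    // Tell the timer that the instruction carrying `label` was reached.
    void alert(const std::string& label) {
        if (state_ == SKETCH_READY && label == start_)
            state_ = SKETCH_RUNNING;
        else if (state_ == SKETCH_RUNNING && label == stop_)
            state_ = SKETCH_DONE;
    }

    // Advance one simulated clock cycle; only counted while running.
    void tick() { if (state_ == SKETCH_RUNNING) ++count_; }

    bool done_p() const { return state_ == SKETCH_DONE; }
    int count() const { return count_; }

private:
    std::string start_, stop_;
    SketchState state_;
    int count_;
};
```

// Cycles that elapse before the start label or after the stop label are
// not counted, which is what restricts the measurement to one code region.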
#ifndef QPROF_TIMER
#define QPROF_TIMER
enum timer_cond {TIMER_DONE, TIMER_READY, TIMER_ON};
const String end_default = "_qprof_single_BB_default";
class timer
public:
  timer(const String& s_label, const String& e_label):   // constructor
    start(s_label), stop(e_label), orig_stop(e_label)
    {tc = TIMER_READY; count = 0; seen_start = FALSE;};
  bool done_p() {return (tc == TIMER_DONE);};
  int get_count() const {return(count);};
  void operator++() {if (tc == TIMER_ON) count++;};
  void alert_timer(const String& label);




  String start, stop, orig_stop;
  bool seen_start;
// is timing done?
// get the count
// advance timer
// signal timer of
// event
// found generic end
inline void timer::mark_end(const String& label)
{ if (start == label)
    seen_start = TRUE;
  else if (seen_start && !label.empty())
    {if (stop == end_default)
       stop = label;
     seen_start = FALSE;};
};
inline void timer::alert_timer(const String& label)
  switch(tc) {
  case TIMER_READY:
    if (label == start)
      tc = TIMER_ON;
    break;
  case TIMER_ON:
    if (label == stop)
      tc = TIMER_DONE;
    break;





// QPROF synchro ADT -- sybok@athena
// The synchro ADT is designed to enforce the constraint that the FPU and
// FXU remain a certain number of cycles, at most, out of sync. By noting
// when each unit dispatches an instruction into the pipeline, it can









  bool fxu_shift_ok_p();   // is it OK to shift an FXU ins into FXU pipeline
  bool fpu_shift_ok_p();   // is it OK to shift an FPU ins into FPU pipeline
  void fxu_shift();        // shift an FXU ins into FXU pipeline
  void fpu_shift();        // shift an FPU ins into FPU pipeline
  bool fxu_ahead() const {return (displacement < 0);};   // which unit is ahead?
  bool fpu_ahead() const {return (displacement > 0);};




// ok to shift out FXU ins?
inline bool synchro::fxu_shift_ok_p()
  {return(displacement > -FXU_lead_max);};
// ok to shift out fpu ins?
inline bool synchro::fpu_shift_ok_p()
  {return(displacement < FPU_lead_max);};
// shift out an FXU ins
inline void synchro::fxu_shift()
  if (displacement > -FXU_lead_max)
    displacement--;





    sigerr("synchro::fpu_shift: FPU unit too far ahead.");
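// Illustrative sketch, not part of the original QProf sources. It restates
// the displacement bookkeeping visible in the fragments above: a single
// signed counter moves down when the FXU shifts an instruction and up when
// the FPU does, and each unit may shift only while it is within its
// maximum lead. SynchroSketch is a hypothetical name, and the lead limits
// are constructor parameters because their real values are not legible in
// this listing.

```cpp
#include <cassert>

class SynchroSketch {
public:
    SynchroSketch(int fxu_lead_max, int fpu_lead_max)
        : disp_(0), fxu_max_(fxu_lead_max), fpu_max_(fpu_lead_max) {}

    // A unit may shift only if doing so keeps it within its allowed lead.
    bool fxu_shift_ok_p() const { return disp_ > -fxu_max_; }
    bool fpu_shift_ok_p() const { return disp_ < fpu_max_; }

    // Negative displacement means the FXU is ahead, positive the FPU.
    void fxu_shift() { assert(fxu_shift_ok_p()); --disp_; }
    void fpu_shift() { assert(fpu_shift_ok_p()); ++disp_; }

    bool fxu_ahead() const { return disp_ < 0; }
    bool fpu_ahead() const { return disp_ > 0; }

private:
    int disp_;     // FPU shifts minus FXU shifts
    int fxu_max_;  // most instructions the FXU may run ahead
    int fpu_max_;  // most instructions the FPU may run ahead
};
```

// Here a violated precondition trips an assert, where the listing
// reports it through sigerr.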
inline void synchro::print()




// QPROF fpdecode_entry ADT -- sybok@athena
// The fpdecode_entry ADT is used to store the representation of a PowerPC
// instruction in the decode stage or decode buffers--it includes a pointer










  fpdecode_entry(ppc_Inst* ins = 0);
  const ppc_Rator& get_rator() const {return *rat;};
  ppc_Inst* get_ins() const {return inst;};
  int get_lcount() const {return load_cnt;};
  int get_scount() const {return store_cnt;};
  friend fpdecode_entry add_load(fpdecode_entry e);
  friend fpdecode_entry add_store(fpdecode_entry e);
  bool operator==(fpdecode_entry e) const {return FALSE;};








// QPROF fpdecode_entry ADT -- sybok@athena
// The fpdecode_entry ADT is used to store the representation of a PowerPC
// instruction in the decode stage or decode buffers--it includes a pointer




// construct an fpdecode_entry
fpdecode_entry::fpdecode_entry(ppc_Inst* ins):
  inst(ins), rat(&(ins->op()))
{ if (ins != 0)
    {optype op = ins->op().opt();
     if ((op != FPU) && (op != FPU_A) && (op != FPU_CMP))
       sigerr("fpdecode_entry::constructor: Not FP instruction.");};
  load_cnt = 0;
  store_cnt = 0;
};
// create new object with incremented LC
inline fpdecode_entry add_load(fpdecode_entry e)
{
  fpdecode_entry* new_entry = new fpdecode_entry(e);
  new_entry->load_cnt = e.load_cnt+1;
  return(*new_entry);
};
// create new object with incremented SC
inline fpdecode_entry add_store(fpdecode_entry e)




inline ostream& operator << (ostream& s, const fpdecode_entry& fpdec_obj)
{ cout << "[FPDECODE_ENTRY: LC = " << fpdec_obj.load_cnt;
  cout << ", SC = " << fpdec_obj.store_cnt << ", ";
  fpdec_obj.inst->print();
  cout << "]" << endl;
  return s;
};
inline void fpdecode_entry::print()
{ cout << "[FPDECODE_ENTRY: LC = " << load_cnt;
  cout << ", SC = " << store_cnt << ", ";
  inst->print();
  cout << "]" << endl;
};
#endif
// QPROF regbusy ADT -- sybok@athena
// The regbusy ADT is used to hold the lock status of the FXU registers














  bool clash(DList<int>& i) const;
// construct the register lock
// determine whether a given
// lock is or is not busy
// lock register
// unlock register
// does register list
// contain any locked regs.?
private:
  bool* lockset;
// construct a regbusy object
inline regbusy::regbusy()
{ lockset = new bool[REGCOUNT];
  for(int i = 0; i < REGCOUNT; i++)
    lockset[i] = FALSE;
};
// lock a register
inline void regbusy::lock(int n)
{ if (lockset[n])
    sigerr("regbusy::lock: Bit already locked.");
  else
    lockset[n] = TRUE;
};
// QPROF addr ADT -- sybok@athena
// This ADT was designed to hold the addresses stored in the SB and PSQ







  addr(const ppc_Inst& ins);               // standard constructor
  addr() {label = ""; generic = TRUE;};    // default constructor (generic addr)
  bool operator==(const addr& address) const;   // equality tester
  friend ostream& operator << (ostream& s, const addr& addr_obj);
  void print();
private:
  String label;    // target label of addr (if target is label)
  int n_label;     // numerical label of addr (if target is number)
  int reg;         // base register
  bool generic;    // whether or not the address is generic; above info
                   // only applies if it's not
inline ostream& operator << (ostream& s, const addr& addr_obj)
{ cout << "[ADDR object: label = " << addr_obj.label;
  cout << ", n_label = " << addr_obj.n_label;
  cout << ", register = " << addr_obj.reg;
  cout << ", generic = " << addr_obj.generic << "]";
  return s;
};
inline void addr::print()
{ cout << "[ADDR object: label = " << label << ", n_label = " << n_label;
  cout << ", register = " << reg << ", generic = " << generic << "]" << endl;
};
#endif
// unlock register
inline void regbusy::unlock(int n)
{ if (!lockset[n])
    sigerr("regbusy::unlock: Bit already clear.");
  else
    lockset[n] = FALSE;
};
// test whether any registers in an iterator are locked
inline bool regbusy::clash(DList<int>& i) const
  for (DListIter<int> j(i); !j.end_p(); j++)





// QPROF addr ADT -- sybok@athena
// This ADT was designed to hold the addresses stored in the SB and PSQ
// of the FXU and in certain internal DCache slots.
#include "addr.h"
#include "err.h"
// Construct an 'addr' object from a store. To be equal two objects must
// not be "generic" addresses (two register addressing), and must share the
// same base register and offset.
addr::addr(const ppc_Inst& ins)
  ppc_Opcode opcode = ins.op().op();
  optype op = ins.op().opt();
  if ((ins.op().load_p()) ||
      (ins.op().store_p()))















      sigerr("addr::constructor: Illegal Offset.");
  reg = base->reg().actual();
    sigerr("addr::constructor: Not a Storage Command.");
};
// test objects for equality based on previous definition.
bool addr::operator==(const addr& address) const
{
  // This is a best-case scenario; generic addresses are assumed
  // to miss.
  if (generic || address.generic)
    return FALSE;
  else if (reg != address.reg)
    return FALSE;
  else if ((label.empty()) && (address.label.empty()))
    return ((n_label == address.n_label));
  else
    return((label == address.label));
};
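// Illustrative sketch, not part of the original QProf sources. It restates
// the equality rule above on a plain struct so that the consequence of the
// "best-case" policy is easy to see: a generic address never compares
// equal to anything, not even to itself, so it can never trigger a
// load/store collision. AddrSketch and same_address are hypothetical names.

```cpp
#include <string>

struct AddrSketch {
    std::string label;  // symbolic offset, empty if the offset is numeric
    int n_label;        // numeric offset, used only when label is empty
    int reg;            // base register
    bool generic;       // true for register+register addressing
};

// Two addresses match only if neither is generic and they share the same
// base register and the same (numeric or symbolic) offset.
inline bool same_address(const AddrSketch& a, const AddrSketch& b) {
    if (a.generic || b.generic) return false;
    if (a.reg != b.reg) return false;
    if (a.label.empty() && b.label.empty()) return a.n_label == b.n_label;
    return a.label == b.label;
}
```

// Treating generic addresses as misses makes the simulated memory system
// optimistic: it may under-count collisions, but never invents one.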
// QPROF gqueue ADT -- sybok@athena
// The gqueue ADT is an abstraction representing a generalized queue of
// fixed length with head and tail pointers. The gqueue is a container
// class which can be instantiated so as to contain an arbitrary class as
// queue elements.
// an improved modulus function
#ifndef NEWMOD
#define NEWMOD
inline int mod(int n, int m)























  gqueue(const int size = 1): sz(size)         // our constructor; the
    {head = 0; tail = 0; buf = new T[sz+1];};  // size is specified as a
                                               // parameter
  int empty_p() {return (tail == head);};
  void load(DList<T>& fetched_list);           // load the queue
  void load_element(T ins);
  int room() {return (sz - size());};          // check how many vacancies
                                               // it currently contains
  int size() const {return mod(tail-head,sz+1);};   // check how many entries
                                                    // it currently contains
  T pop();                                     // pop an entry and return it
  T peek() const;                              // just peek at it
  // gqueue higher-order operations and tests
  int contains(T ins) const;                   // search for some elt.
  void apply_to_all(T (*remap)(T));            // apply a function to each elt.
  void apply_to_some(T target,                 // apply a function to each
                     T (*remap)(T));           // elt. matching "target"
  void apply_to_first(T (*remap)(T));          // apply function to first elt.
  void apply_to_last(T (*remap)(T));           // apply function to last elt.
  void print();
protected:   // protected so as to allow subclasses to be defined
  T* buf;             // the array storing the queue entry pointers
  int sz;             // gqueue size
  int head, tail;     // gqueue head and tail pointers
  int succ(int j) const {return mod(j + 1,sz+1);};   // useful primitives
  int pred(int j) const {return mod(j - 1,sz+1);};
// load the gqueue from a DList of the contained class
template <class T>
inline void gqueue<T>::load(DList<T>& fetched_list)





// load the gqueue using a single pointer to an instance of the contained
// class
template <class T>
inline void gqueue<T>::load_element(T ins)
{ if (room() > 0)
    {buf[tail] = ins;
     tail = succ(tail);}
  else
    sigerr("gqueue::load_element: No room left.");
};
// pop an elt. of the gqueue and return it
template <class T>
inline T gqueue<T>::pop()
{ if (size() <= 0) sigerr("gqueue::pop: Can't pop empty queue.");
  T ip = buf[head];
  head = succ(head);
  return ip;
};
// just peek at the top elt. of the gqueue
template <class T>
inline T gqueue<T>::peek() const
{ if (size() <= 0) sigerr("gqueue::peek: Can't peek at empty queue.");
  return(buf[head]);
};
  cout << "[GQUEUE: size = " << size() << ", room = " << room() << "]" << endl;




// check whether the gqueue contains the specified target "ins". Equality
// is determined using the == operator of the contained class.
template <class T>
inline bool gqueue<T>::contains(T ins) const
{int count = 0;
  for(int i = head; i != tail; i = succ(i))




// apply a specified function to each element of the gqueue
template <class T>
inline void gqueue<T>::apply_to_all(T (*remap)(T))
{ for(int i = head; i != tail; i = succ(i))
    buf[i] = (*remap)(buf[i]);
};
// apply a specified function to each element of the gqueue which matches
// the target "target". Equality is determined using the == operator of the
// contained class.
template <class T>
inline void gqueue<T>::apply_to_some(T target, T (*remap)(T))
{ for(int i = head; i != tail; i = succ(i))
    if (target == buf[i])
      buf[i] = (*remap)(buf[i]);};
// apply a specified function to the first elt. of the gqueue
template <class T>
inline void gqueue<T>::apply_to_first(T (*remap)(T))
{ if (head != tail)
    buf[head] = (*remap)(buf[head]);
};
// apply a specified function to the last elt. of the gqueue
template <class T>
inline void gqueue<T>::apply_to_last(T (*remap)(T))
{ if (head != tail)
    buf[pred(tail)] = (*remap)(buf[pred(tail)]);
};




// QPROF gqueue ADT -- sybok@athena
// The gqueue ADT is an abstraction representing a generalized queue of
// fixed length with head and tail pointers. The gqueue is a container
// class which can be instantiated so as to contain an arbitrary class as
// queue elements.
#include "gqueue.h"
// overload the output primitive 'outshow'
void outshow(ppc_Inst* i) {i->print();};
void outshow(addr a) {a.print();};
void outshow(int i) {cout << i;};
void outshow(String s) {cout << s;};
void outshow(fpdecode_entry e) {cout << e;};
void outshow(psq_entry e) {cout << e;};
// QPROF grqueue ADT -- sybok@athena
// The grqueue ADT is a subclass of the gqueue data type, and extends it
// by including a release pointer which prevents the head of the queue
// from being popped until it (the release pointer) has been advanced by
// an appropriate amount.
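// Illustrative sketch, not part of the original QProf sources. It shows
// the release-pointer idea described above on a self-contained ring
// buffer: entries are loaded at the tail, but the head can be popped only
// after release() has advanced past it. ReleaseQueueSketch is a
// hypothetical name, std::vector stands in for the raw array of the real
// gqueue, and asserts stand in for sigerr.

```cpp
#include <cassert>
#include <vector>

template <class T>
class ReleaseQueueSketch {
public:
    // One spare slot, so that head == tail always means "empty".
    explicit ReleaseQueueSketch(int size)
        : buf_(size + 1), head_(0), tail_(0), rel_(0) {}

    int size() const { return wrap(tail_ - head_); }
    int room() const { return static_cast<int>(buf_.size()) - 1 - size(); }

    // The head may be popped only once the release pointer has passed it.
    bool can_pop() const { return head_ != rel_; }

    void load(const T& x) {
        assert(room() > 0);
        buf_[tail_] = x;
        tail_ = wrap(tail_ + 1);
    }

    // Release the oldest unreleased entry; it cannot pass the tail.
    void release() {
        assert(rel_ != tail_);
        rel_ = wrap(rel_ + 1);
    }

    T pop() {
        assert(can_pop());
        T x = buf_[head_];
        head_ = wrap(head_ + 1);
        return x;
    }

private:
    // Non-negative remainder modulo the number of slots.
    int wrap(int j) const {
        int m = static_cast<int>(buf_.size());
        int r = j % m;
        return (r < 0) ? r + m : r;
    }

    std::vector<T> buf_;
    int head_, tail_, rel_;
};
```

// The release pointer always lies between head and tail, so the queue
// is split into a poppable prefix and a still-pending suffix.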
// an improved modulus function
#ifndef NEWMOD
#define NEWMOD
inline int mod(int n, int m)





// the class grqueue
template <class T>
class grqueue: public gqueue<T>
public:
  grqueue(const int size = 1): gqueue<T>(size)    // construct a grqueue
    {rel = 0;};
  bool can_pop() const {return (head != rel);};   // can we pop the head?
  T pop();                                        // pop the head




  // here's the additional state of grqueue above and beyond a gqueue
  int rel;   // its release pointer
};
// pop the head of grqueue
template <class T>
inline T grqueue<T>::pop()
{ if (size() <= 0) sigerr("grqueue::pop: Can't pop empty queue.");
  if (rel == head) sigerr("grqueue::pop: Can't pop unreleased items.");
  T ip = buf[head];
  head = succ(head);
  return ip;
};
// advance the release pointer of the grqueue
template <class T>
inline void grqueue<T>::release()
{ if (rel == tail)
    sigerr("grqueue::release: Too many releases.");
  rel = succ(rel);
};
// display the grqueue
template <class T>
inline void grqueue<T>::print()
{ int r_elt = 0;
  for(int i = head; i != rel; i = succ(i))   // count number of released elts.
    r_elt++;
  cout << "[GRQUEUE: size = " << size() << ", room = " << room();
  cout << ", rel. elts. = " << r_elt << "]" << endl;





// QPROF psq_entry ADT -- sybok@athena
// The psq_entry ADT represents entries in the FPU pending store queue--
// target register and "give back" bit. (It actually stores a pointer











const int ILLEGAL_TARGET = -1;
class psq_entry
public:
  psq_entry(ppc_Inst* st = 0, int t = ILLEGAL_TARGET);   // constructor
  int get_target() const {return target;};         // return target reg.
  ppc_Inst* get_ins() const {return store;};       // return ptr. to store ins.
  bool give_back_p() const {return give_back;};    // return "give back" bit
  bool operator==(psq_entry e)                     // equality definition
    {return (target == e.target);};
  friend psq_entry set_gb(psq_entry e);            // define an operation to






// yield new psq_entry
// with modified "give back"
// bit
  const psq_entry& psq_obj);
// construct the initial psq_entry (with a zero "give back" bit)
inline psq_entry::psq_entry(ppc_Inst* st, int t)
  store = st;
  if (t != 0)




// yield a new psq_entry with a set "give back" bit
inline psq_entry set_gb(psq_entry e)
  if (e.give_back)
    sigerr("psq_entry::set_gb: Already set.");




// handle output of a psq_entry object
inline ostream& operator << (ostream& s, const psq_entry& psq_obj)
{ cout << "[PSQ_ENTRY: " << psq_obj.store;
  cout << ", GB = " << psq_obj.give_back << "]" << endl;
  return s;
};
// handle printing psq_entry object
inline void psq_entry::print()













/* Number of slots used for statistics in frames and CBDs. */
#define MAX_BB_PER_CB 30
#define MAX_CALLED_FN_PER_CB 6
#define MAX_CALLED_CB_PER_CB
#define CBD_ACCUMULATORS (5+3*MAX_CALLED_CB_PER_CB)
#define FRAME_STATS_SLOTS (4*MAX_BB_PER_CB+5*MAX_CALLED_FN_PER_CB+2)
#define CBD_STATS_SLOTS (FRAME_STATS_SLOTS+CBD_ACCUMULATORS+1)
void getFrameStats( int p, IdWord* fp );
void getProgStats(void);
void printProgStats(int verbose);
void settime(IdWord* cbd_ptr, long offset);
void accumtime(IdWord* cbd_ptr, long offset);
double expt(double x,double y);
double convert_time(unsigned long highticks, unsigned long lowticks);
#endif
/* N.B. Since the CBD space for stats is fixed and independent of the
   local frame state, room must be allocated for worst case. Thus, some
   maximum number of BBs per CB must be included.
   This is not too bad in reality, since the number of BBs per CB is
   a static feature of programs, and thus if a particular program
   requires more than the currently allowed number, that number can
   be raised to allow that program to be profiled.
*/
/* Alejandro Caro





const long CPU_CLOCK_RATE = 42e6;                  /* hertz */
const long RTC_TICKS_PER_SECOND = 1e9;             /* hertz */
const double IDWORD_SCALE_FACTOR = 4294967296.0;   /* 2^32 */
/* Pointer to start of CBD area. Defined in assembly routines. */
extern IdWord* CBD_Area;
 * getFrameStats
 * This routine is invoked every time a frame is deallocated. It
 * traverses the frame and gathers information from the stats slots
 * in the frame into the CBD for the code block.
 */
void
getFrameStats( int p, IdWord* fp )
{ IdWord* cbd_ptr;
  IdWord* parent_cbd_ptr;
  int no_of_inlets, cbd_stat_start, frame_stat_start;
  int parent_no_of_inlets;
  int i, id, base;
  long high_word, low_word;
  cbd_ptr = (IdWord*) fp[2];
  parent_cbd_ptr = (IdWord*) ((IdWord*) fp[1])[2];
  if ((int) cbd_ptr[CBD_STATS] > 0)
    {no_of_inlets = cbd_ptr[CBD_INLET_COUNT];




     cbd_ptr[cbd_stat_start+1] = fp[frame_stat_start+FRAME_STATS_SLOTS-2];
     cbd_ptr[cbd_stat_start+2] = fp[frame_stat_start+FRAME_STATS_SLOTS-1];
     accumtime((IdWord*)cbd_ptr,((cbd_stat_start+1)*4));
     id = cbd_ptr[cbd_stat_start+CBD_STATS_SLOTS-1];
     (cbd_ptr[cbd_stat_start])++;
     for(i = 0; i < FRAME_STATS_SLOTS-2; i++)
       cbd_ptr[cbd_stat_start+5+i] += fp[i+frame_stat_start];








 * This routine is invoked when a program terminates. It gathers stats





  int no_of_inlets, cbd_stat_start, frame_stat_start, fn_data_start;
  int cb_data_start;
  double cb_live_time, bb_live_time, fn_live_time;
  int cb_count = 0;
  char codeblockname[20];









  fscanf(in_file, "%s", begin_token);
  while ( CBD_p( currCBD ) )
    if ((int) currCBD[CBD_STATS] > 0)
      {currCBD += 5 + currCBD[CBD_INLET_COUNT] + CBD_STATS_SLOTS;
       cb_count++; }
    else
      currCBD += 5 + currCBD[CBD_INLET_COUNT];
  currCBD = CBD_Area;
  while ( CBD_p( currCBD ))
    if ((int) currCBD[CBD_STATS] > 0)
      {/* Init thread function descriptor. */
       IdWord* FD = (IdWord*) currCBD[CBD_INIT_CODE];
       fscanf(in_file,"%d %d",&bb_count,&fn_count);
       for(i = 0; i < cb_count; i++)
         fscanf(in_file,"%s",codeblockname);
       fprintf(out_file, "%s\n", currCBD[CBD_NAME]);
       fprintf(out_file, "%d\n", currCBD[CBD_STATS]);
       no_of_inlets = currCBD[CBD_INLET_COUNT];
       cbd_stat_start = no_of_inlets+5;
       fprintf(out_file,"%d\n", currCBD[cbd_stat_start]);
       cb_live_time = convert_time(currCBD[cbd_stat_start+4],
                                   currCBD[cbd_stat_start+3]);
       fprintf(out_file, "%.0lf\n", cb_live_time);
       for(i = 0; i < bb_count; i++)
         {fprintf(out_file, "%d\n", (int) currCBD[cbd_stat_start+5+i*4]);
          fprintf(out_file, "%d\n", (int) currCBD[cbd_stat_start+6+i*4]);
          bb_live_time = convert_time(currCBD[cbd_stat_start+8+i*4],
                                      currCBD[cbd_stat_start+7+i*4]);
          fprintf(out_file, "%.0lf\n", bb_live_time);};
       fn_data_start = cbd_stat_start+5+MAX_BB_PER_CB*4;
       for(i = 0; i < fn_count; i++)
         {fprintf(out_file, "%d\n", currCBD[fn_data_start+5*i]);
          fn_live_time = convert_time(currCBD[fn_data_start+5*i+4],
                                      currCBD[fn_data_start+5*i+3]);
          fprintf(out_file, "%.0lf\n", fn_live_time);
         };
       cb_data_start = fn_data_start+MAX_CALLED_FN_PER_CB*5+2;
       for(i = 0; i < cb_count; i++)




       currCBD += 5 + currCBD[CBD_INLET_COUNT] + CBD_STATS_SLOTS;
else








int CBD_p( IdWord* CBDptr )
{ unsigned long* FD = (unsigned long*) CBDptr[0];
  int FSize = (int) CBDptr[1];
  int Stats = (int) CBDptr[3];
  int Ninlets = (int) CBDptr[4];
  return ( ( FD[0] > CODE_SEG ) && ( FD[0] < DATA_SEG )
        && ( FD[1] > DATA_SEG ) && ( FD[1] < HEAP_SEG )
        && ( FSize > 0 ) && ( FSize < 1000 )
        && ( Stats < FSize )
        && ( Ninlets > 0 ) && ( Ninlets < 1000 ) );
/****************************************************************
 * printProgStats





  int no_of_inlets, cbd_stat_start, frame_stat_start, fn_data_start;
  int cb_data_start;
  double cb_live_time, bb_live_time, fn_live_time;





  while ( CBD_p( currCBD ) )
    {/* Init thread function descriptor. */
     IdWord* FD = (IdWord*) currCBD[CBD_INIT_CODE];
     printf("%s\n", currCBD[CBD_NAME] );
     /* Name of init thread. */
     printf("\tInit Thread: %s\n", (char*)(FD[2]));
     /* Frame Size */
     printf("\tFrame Size: %d\n", currCBD[CBD_FRAME_SIZE] );
     /* Stats location in frame */
     printf("\tStats Frame Loc.: %d\n", currCBD[CBD_STATS] );
     /* Number of inlets */
     printf("\tInlets: %d\n\n", currCBD[CBD_INLET_COUNT] );
     no_of_inlets = currCBD[CBD_INLET_COUNT];
     cbd_stat_start = no_of_inlets+5;
     if ((int) currCBD[CBD_STATS] > 0)
       printf("\tInvocations: %d\n", currCBD[cbd_stat_start]);
     cb_live_time = convert_time(currCBD[cbd_stat_start+4],
                                 currCBD[cbd_stat_start+3]);
     printf("\tLive time: %.0f\n",cb_live_time);
     printf("\tBASIC BLOCK STATISTICS\n");
     for(i = 0; i < MAX_BB_PER_CB; i++)
       { if ((currCBD[cbd_stat_start+5+i*4] != 0) &&




printf("\t\t\t (fall thru) %d\n",






/* CBD_p: This is temporary code.
 * Tests if pointer is likely to be a pointer to a CBD.
 * The first word should point to something like a function descriptor.
 * The second word should be an integer that is not too large.
 */
fn_data_start = cbd_stat_start+5+MAX_BB_PER_CB*4;
for(i = 0; i < MAX_CALLED_FN_PER_CB; i++)
  { if (currCBD[fn_data_start+5*i] == 0)
      continue;







cb_data_start = fn_data_start+MAX_CALLED_FN_PER_CB*5+2;
for(i = 0; i < MAX_CALLED_CB_PER_CB; i++)
  { if (currCBD[cb_data_start+3*i] == 0)
      continue;
    printf("\n\t\tCalled Id Procedure #%d\n", i+1);
    printf("\t\tInvocations: %d\n", currCBD[cb_data_start+3*i] );
    cb_live_time = convert_time(currCBD[cb_data_start+3*i+2],
                                currCBD[cb_data_start+3*i+1]);
    printf("\t\tLive time: %.0f\n", cb_live_time);
  };
currCBD += 5 + currCBD[CBD_INLET_COUNT] + CBD_STATS_SLOTS;
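The statistics slots read by the report code appear to follow a fixed layout after the inlet table: a code-block record, then four slots per basic block, then five slots per called C function, then three slots per called code block. The sketch below only restates that offset arithmetic. The capacities `MAX_BB_PER_CB` and `MAX_CALLED_FN_PER_CB` are given placeholder values here (the real ones are defined elsewhere in the RTS), and `cbd_layout`, `cbd_stats_layout`, `bb_slot`, and `cb_slot` are hypothetical names introduced for illustration.

```c
#include <assert.h>

/* Placeholder capacities, assumed for this sketch only. */
enum { MAX_BB_PER_CB = 8, MAX_CALLED_FN_PER_CB = 4 };

typedef struct {
    int stat_start;     /* first statistics slot (invocation count)    */
    int fn_data_start;  /* first called-C-function record (5 slots)    */
    int cb_data_start;  /* first called-code-block record (3 slots)    */
} cbd_layout;

/* Derive the three base offsets from the number of inlets. */
static cbd_layout cbd_stats_layout(int no_of_inlets)
{
    cbd_layout l;
    assert(no_of_inlets >= 0);
    l.stat_start    = no_of_inlets + 5;
    l.fn_data_start = l.stat_start + 5 + MAX_BB_PER_CB * 4;
    l.cb_data_start = l.fn_data_start + MAX_CALLED_FN_PER_CB * 5 + 2;
    return l;
}

/* Slot of the first counter of basic block bb (4 slots per block). */
static int bb_slot(const cbd_layout *l, int bb)
{
    return l->stat_start + 5 + bb * 4;
}

/* Slot of the invocation count of called code block cb (3 slots each). */
static int cb_slot(const cbd_layout *l, int cb)
{
    return l->cb_data_start + 3 * cb;
}
```

Keeping the offsets in one helper like this would avoid repeating the arithmetic in both the screen report and the file report, though the listing itself computes them inline.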
1 1




double expt(double x, double y) { return exp(log(x)*y); }
double convert_time(unsigned long highticks, unsigned long lowticks)
{ double totalticks = highticks*IDWORD_SCALE_FACTOR + lowticks;
  return( (totalticks/RTC_TICKS_PER_SECOND) * CPU_CLOCK_RATE ); }
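A worked instance of the conversion may help: the two timer words are combined into one tick count, which is then scaled to CPU cycles. The three constants below are assumptions chosen only to make the arithmetic concrete (the listing does not show the real values), and `ticks_to_cycles` is a hypothetical stand-in for `convert_time`.

```c
#include <assert.h>

/* Illustrative values only; the RTS headers define the real constants. */
#define SCALE_FACTOR      1000000000.0  /* low-word ticks per high-word tick */
#define TICKS_PER_SECOND  1000000000.0  /* timer ticks per second            */
#define CLOCK_RATE_HZ     80000000.0    /* CPU cycles per second             */

/* Combine the high and low timer words, then convert ticks to cycles. */
static double ticks_to_cycles(unsigned long high, unsigned long low)
{
    double total_ticks = (double) high * SCALE_FACTOR + (double) low;
    return (total_ticks / TICKS_PER_SECOND) * CLOCK_RATE_HZ;
}
```

Under these assumed constants, half a second of ticks in the low word corresponds to 40 million cycles.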
/* Alejandro Caro */









/******************************************************************
 * Internal module definitions.
 */
/* Number of extra slots used in a frame by the RTS. They are actually
 * allocated from the preceding frame!
 */
#define FRAME_INFO_IDWORDS 1
#define FRAME_INFO_SIZE_OFF -1
/* Pointer to the next available frame. */
static IdWord *frame_available;
/* Start of frame area. */
static IdWord *frame_area_start;
/* End of frame area. */
static IdWord *frame_area_end;
/* The external interfaces simply send a message to one of these
 * handlers.
 */
void alloc_frame_eager_handler(void);
/* The actual allocator of frames. */





/*
 * Simple Frame Allocator
 * This frame allocator simply allocates. It never deallocates, so
 * frame usage will keep increasing as the program runs. Each frame
 * in the heap has several header words, which are "invisible" to the
 * user of the frame.
 *
 * init_frame_area
 * The RTS allocated the range [start, end] for the heap. These pointers
 * are passed to this routine in case the frame manager needs to initialize.
 * start: guaranteed aligned on page boundary
 * end: guaranteed to point to end of page
 */
void init_frame_area(void *start, void *end)
{ /* The following aligns the start pointer on a frame alignment
   * boundary, leaving enough space for the FRAME_INFO_IDWORDS before the
   * first frame.
   */
  void *frameBase = start + (FRAME_INFO_IDWORDS * sizeof(IdWord));
  frame_area_start =
    (IdWord *) align_ptr_next(frameBase, FRAME_ALIGNMENT_BYTES);
  frame_available = frame_area_start;
  frame_area_end =
    (IdWord *) align_ptr_prev(end - sizeof(IdWord), sizeof(IdWord));
  /* Check for bad bounds on frame area. */
  if (frame_area_start > frame_area_end)
    rts_error_msg( "(init_frame_area): bad frame area bounds." );
}
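The initialization relies on two alignment helpers whose definitions are not part of this listing. A minimal sketch of what such helpers commonly look like follows, assuming the alignment is a power of two and that an already aligned pointer is returned unchanged; `align_up` and `align_down` are hypothetical names standing in for `align_ptr_next` and `align_ptr_prev`, whose exact behavior may differ.

```c
#include <assert.h>
#include <stdint.h>

/* Round p up to the nearest multiple of align (a power of two). */
static void *align_up(void *p, uintptr_t align)
{
    uintptr_t a = (uintptr_t) p;
    assert(align != 0 && (align & (align - 1)) == 0);
    return (void *) ((a + align - 1) & ~(align - 1));
}

/* Round p down to the nearest multiple of align (a power of two). */
static void *align_down(void *p, uintptr_t align)
{
    uintptr_t a = (uintptr_t) p;
    assert(align != 0 && (align & (align - 1)) == 0);
    return (void *) (a & ~(align - 1));
}
```

With a 64-byte frame alignment, an address of 65 would round up to 128 and an address of 127 would round down to 64.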




/*
 * Allocate a local frame.
 */
IdWord *alloc_frame_local(IdWord *cbd_ptr, IdWord retcont1, IdWord retcont2)
{ long frame_stat_start = cbd_ptr[CBD_STATS];
  int no_of_inlets = cbd_ptr[CBD_INLET_COUNT];
  int cbd_stat_start = no_of_inlets+6;
  int size = (int) cbd_ptr[CBD_FRAME_SIZE];




  if (frame_stat_start > 0)
    settime((IdWord *) frame, (frame_stat_start+FRAME_STATS_SLOTS-2)*4);




  frame[0] = retcont1; frame[1] = retcont2;
  frame[2] = (IdWord) cbd_ptr;
  /* initialize frame stat slots to 0 */
  if (frame_stat_start > 0)
    for(i = frame_stat_start+FRAME_STATS_SLOTS-3; i > frame_stat_start; i--)
      frame[i] = 0;
  /* Initialize the frame to the proper value. */
  (*fi)(frame, retcont1, retcont2);
  /* Return the frame pointer to the caller. */
  return frame;
/*
 * alloc_frame_eager
 * Sends a message to alloc_frame_eager_handler. This "two-stage"
 * approach allows the code to return very quickly to the calling thread.
 * NOTE: This has been optimized for uniprocessors.
 */
void alloc_frae_ear( d D Idord rCBD, dW contip, IdWord retcontfp,
IdWord reqcontip, IdWord reqcontfp)
  /* Send message to the routine that actually handles the
   * heap allocation.
   * The message parameters are:
   *   - 6 pieces of data
   *   - pe zero (this is actually ignored)
   *   - target IP is alloc_frame_eager
   *   - target FP (ignored in this case)
   *   - data...
   */
s_..end( 6, 0. (IdWord) alloc_fraeeagr_handler. O.
CBD, rtontip rtontp, reonp, qcontip, rqcontfp );
/*
 * alloc_frame_eager_handler
 * Split-phase, eager allocation of frames on any processor.
























ug.recv( 6, cegbuffer );





  frame_stat_start = cbd_ptr[CBD_STATS];
  no_of_inlets = cbd_ptr[CBD_INLET_COUNT];
  cbd_stat_start = no_of_inlets+6;
  size = (int) cbd_ptr[CBD_FRAME_SIZE];








  frame[0] = retcont_ip; frame[1] = retcont_fp;
  frame[2] = (IdWord) cbd_ptr;
  /* initialize frame stat slots to 0 */
  if (frame_stat_start > 0)
    for(i = frame_stat_start+FRAME_STATS_SLOTS-3; i >= frame_stat_start; i--)
      frame[i] = 0;
  /* Initialize the frame */
  (*fi)(frame, retcont_ip, retcont_fp);
  /* Construct message to request continuation.
   * NOTE: This code is optimized for uniprocessors
   * since it ignores the PE field.
   */
  { IdWord ip = CONT_IP( reqcont_ip, reqcont_fp );
    IdWord fp = CONT_FP( reqcont_ip, reqcont_fp );
    /* For now, everything is on PE 0. */
    msg_send(2, 0, ip, fp, 0, (IdWord) frame);
  }
}
/*
 * dealloc_frame_at
 * Deallocates a remote frame. The implementation of this procedure is
 * currently a stub. It simply negates the value of the frame size to
 * indicate that it has been deallocated.
 * NOTE: This procedure is only for uniprocessors.
 */
void dealloc_frame_at(int pe, IdWord *fp)
{ int frameSize = *(((int *) fp) - 1);
  /* Check if the frame has been deallocated. */
  if (frameSize < 0)
    rts_error_msg( "(dealloc_frame_at): frame at [%p] already deallocated.",
                   (void *) fp );
  /* Gather statistics from the frame. */
  getFrameStat( pe, fp );
  /* "Deallocate" the frame. */






/*
 * This function allocates a frame from the frame area and returns a
 * pointer to the beginning of the frame.
 */
IdWord *alloc_frame_internal(int nidwords)
{ div_t frameInfo = div(nidwords, FRAME_ALIGNMENT_IDWORDS );
  int frameSize;
  IdWord *frame;
  /* The following frame (in memory) writes its internal information
   * at the end of this frame. Therefore, we have to leave enough room
   * for that information. The following code checks that we have enough
   * room, or otherwise, it increases the size of the frame by one allocation
   * unit (64 bytes or FRAME_ALIGNMENT_BYTES).
   */
  if (frameInfo.rem <= (FRAME_ALIGNMENT_IDWORDS - FRAME_INFO_IDWORDS))
    frameSize = (frameInfo.quot + 1) * FRAME_ALIGNMENT_IDWORDS;
  else
    frameSize = (frameInfo.quot + 2) * FRAME_ALIGNMENT_IDWORDS;
  /* Save the current frame_available pointer as the return value, and
   * increment it to point to the next available frame.
   */
  frame = frame_available;
  frame_available = frame_available + frameSize;
  if (frame_available > frame_area_end)
    rts_error_msg( "(alloc_frame_internal): run out of frames." );
  /* Write the frameSize info into the frame.
   */
  *(((int *) frame) + FRAME_INFO_SIZE_OFF) = frameSize;
  return frame;
}
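The size computation above rounds a request up to whole allocation units while reserving room at the end of the frame for the header words of the frame that follows. The sketch below isolates that rounding. It assumes the comparison is "remainder fits in what is left of the last unit", which is a reading of the comment and not something the damaged listing confirms; `frame_size_idwords` and its parameters are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>

/* Round nidwords up to whole allocation units of unit_idwords words,
 * leaving info_idwords free at the end for the next frame's header. */
static int frame_size_idwords(int nidwords, int unit_idwords, int info_idwords)
{
    div_t d;
    assert(nidwords >= 0 && unit_idwords > 0);
    assert(info_idwords >= 0 && info_idwords <= unit_idwords);
    d = div(nidwords, unit_idwords);
    if (d.rem <= unit_idwords - info_idwords)
        return (d.quot + 1) * unit_idwords;   /* header fits in last unit */
    return (d.quot + 2) * unit_idwords;       /* needs one more unit      */
}
```

For example, with 16-word units and one header word, a 15-word request fits in a single unit, while a 16-word request needs two.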
/******************************************************************
 * frame_report
 * Prints a report about all the frames in the system, beginning at the
 * start of the frame area.
 */
void frame_report(void)
{ IdWord *currentFrame = frame_area_start;
  int frameSize = (int) *(currentFrame - 1);
  if (currentFrame < frame_available)
    { printf("Frame Report\n");
      printf("Address\t\tSize (idwords)\tStatus\n");
      printf("--------------- ----------------- \n"); }
  else
    printf("No frame allocations/deallocations have occurred.\n");
  while ( currentFrame < frame_available)
    { if (frameSize < 0)
        { frameSize = -frameSize;
          printf("0x%p\t\t%d\t\tD\n", currentFrame, frameSize); }
      else
        printf("0x%p\t\t%d\t\tA\n", currentFrame, frameSize);
      currentFrame = currentFrame + frameSize;
      frameSize = (int) *(currentFrame-1);
    }
}
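The report walks the frame area by reading the size word stored just before each frame, treating a negative size as "deallocated". A self-contained sketch of that traversal over a plain `int` array is given below; `mark_deallocated` and `count_frames` are hypothetical helpers, and the array stands in for the RTS frame area.

```c
#include <assert.h>

/* The word just before a frame holds its size in words; negating it
 * marks the frame as deallocated without changing the layout. */
static void mark_deallocated(int *frame)
{
    assert(frame[-1] > 0);
    frame[-1] = -frame[-1];
}

/* Walk frames from start up to (not including) next_free and count
 * how many are still allocated and how many are marked deallocated. */
static void count_frames(int *start, int *next_free, int *live, int *dead)
{
    int *cur = start;
    *live = 0;
    *dead = 0;
    while (cur < next_free) {
        int size = cur[-1];
        if (size < 0) {
            size = -size;
            (*dead)++;
        } else {
            (*live)++;
        }
        assert(size > 0);   /* a zero size would never advance */
        cur += size;
    }
}
```

Because the size is kept (only its sign changes), the walk can still step over deallocated frames.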
/* huge hack continued! */
void
init_frame_handler( int *desc, char *name )
{ *((char **)(desc+2)) = name; }
/*
 * Unimplemented.
 */
/* Returns fp in message to reqcont. */
void alloc_frame_lazy_local(IdWord *cbd_ptr, IdWord retcont1, IdWord retcont2,
                            IdWord reqcont1, IdWord reqcont2)
{ rts_error_msg( "(alloc_frame_lazy_local): not implemented." ); }
/* Returns fp in message to reqcont. */
void alloc_frame_lazy_at(IdWord *cbd_ptr, IdWord retcont1, IdWord retcont2,
                         int pe, IdWord reqcont1, IdWord reqcont2)
{ rts_error_msg( "(alloc_frame_lazy_at): not implemented." ); }
/* Returns fp in message to reqcont. */
void alloc_frame_eager_at(IdWord *cbd_ptr, IdWord retcont1, IdWord retcont2,
                          int pe, IdWord reqcont1, IdWord reqcont2)
{ rts_error_msg( "(alloc_frame_eager_at): not implemented." ); }
/* Returns fp and pe in message to reqcont. */
void alloc_frame_lazy(IdWord *cbd_ptr, IdWord retcont1, IdWord retcont2,
                      IdWord reqcont1, IdWord reqcont2)
{ rts_error_msg( "(alloc_frame_lazy): not implemented." ); }
/******************************************************************
 * init_frame_handlers
 * This is a huge hack. It writes a string into the last position
 * of the function descriptor for each frame handler. This will
 * allow the RTS to print the name of the handler when it is invoked.
 */
static char alloc_frame_eager_handler_name[] = "alloc_frame_eager_handler";






















/******************************************************************
 * Declarations
 */
/* The return value of the Id program is stored here by the boot code. */
IdWord valueId;
IdWord mainArg;
inlia void rn.paue(void) 




  int index = 0;
  while (str[index] != '\0')




 * main is the standard function called to start C programs.
 * In the RTS, main performs initialization, and then calls the
 * rts_main_dispatch_loop to begin execution of the Id program.
 * By default, main returns the value returned by the Id version of




const int SEED = 2384901;
int main(int argc, char **argv)
{ int i, j, count;
  int verbose;
  srandom(SEED);
  if (argc > 1)
    count = str_to_num(argv[1]);
  else
    count = 100;
  if (argc > 2)
    verbose = TRUE;
  else
    verbose = FALSE;
  /* Initialize the RTS. This sets up the different memory areas and
   * internal data structures used by the Run Time System and the Id program.
   */
  rts_init();
  /* Establish longjmp target for exceptions, and jump
   * to exception handler if necessary.
   */
  { int statCode = setjmp(signalsExitContext);
    if ( statCode )
      rts_generic_signal_handler(statCode);
  }
  /* Create input vector for Id program. When the Id program
   * is booted by the dispatch loop, this value will be
   * passed as an argument to the Id version of "main".
   */
  mainArg = (IdWord) rts_create_Id_args(argc, argv);
  /* Start the dispatch loop, which causes the Id version of main
   * to be invoked.
   */
  { int statCode;
    /* Returns non-zero if there is an error */
    for(i = 0; i < count; i++)
























(ICach. icp - nv ICach(-od);
ppc_Hod. nvod - n v ppc_.od(od.n-eo);
for(DLitltrcs_l.nst itsr(od.diractivs()); !iter.nd_p(); itr++)
nv-od-)ppendDirctive(itr. valu ());
  int i = 1;
  int count = 0;
  int cb_count = 0;
  ofstream f_out("/home/jj/sybok/Profiler/Static_measures/002/qstatshort");
  if (f_out.fail())






  f_out << cb_count << endl;
  f_out.close();
  for(ppc_ProcIter iProc(mod.procIter()); !iProc.end_p(); iProc++, i++)
    { int j = 1;
      ppc_Proc *new_proc =
        new ppc_Proc(iProc.value()->name(), iProc.value()->type());
























ppc_Reg RTCL = *(new ppc_Reg(5,SPR_RTYPE,5));




  instrument_CB(int i, int j, ppc_CB& cb, ICache* ic);
  ppc_CB& new_CB() {return *new_cb;};
private:
  void build_header(DList<ppc_Inst*>& bb, int k, int l,
                    const String& bb_label,
                    int bb_id, int reg_idx_start, PartType ptype);
  DList<ppc_Inst*>& make_fall_thru_ins(int first_reg, String& join_pt);
  DList<ppc_Inst*>& make_jump_entry_ins(int first_reg, const String& lbl);
  DList<ppc_Inst*>& make_common_ins(int first_reg, const String& join_pt);
  DList<ppc_Inst*>& make_cold_entry(int first_reg, String& c_entry,
                                    String& new_label, int id, PartType ptype);
  DList<ppc_Inst*>& generate_bootstrap(const String& init_label, int k,
                                       int start_reg, bool cthread_p);
  ppc_Inst* filter(ppc_Inst* orig_ins, const String& target,
                   DList<ppc_Inst*>* suffix, int k, int l,
                   int start_reg, int& ccount, PartType ptype);
  String& generate_next_anon();
  int i,j; // procedure and code block numbers
  ppc_CB* new_cb;
  ICache* i_cache;




ppc_Prt new._prt new ppc_Prt(oldprt,8);
DLiat<ppc_nt) nev_prt.uffi * new DLitppc_Inat;
bool stt_BB - TRUE;
const String lut_llbel * 0;
DLiteppc_Inet*> ne.v_bb nec DListcppc_inst+>;
Stringt first_lnbel
newv Strin("first. +nunto_atr(i)4 . "*nu_to_r( j)+". n+
nm_to_tr(k)+ ".i");
tor(ppc_Ilntlter ilnet(oldprt.intltrO); !ilnt.nd_p); )
(ppc_Inat in.s new ppc_ls.t(*int.value();
it (in-)label()O.pty() t (in->op().op() -ppc.non))




(String lbl - nev String(genrate_next_nonO);
iInt. value()-)label(lbl);
ins-)label(lbl);];
if (1 > i)
{buildheader( nev_bb,k, l, lt label,
coult++, t rt_r..eg.,n_pert-type););
nev_part-)append(*nv_bb);}
else if (1 - i)
{firstBB . last_label;





nevbb * new DListppc_Inst*>;);
if (1 > 0)
n.v_bb-append (filtr (ins, *firt_BB., nv_prtsuffix,
kl,stert_reg, ccount, nv_part->typeO()));
else
nev_bb->ppend(filter (ins, *lat_label, naev_prt_suffix,
k,ltrt_reg, ccount, nevpert-)typeO()));
if (!in-op() .inert_p())









instrunent_CB:: instrn tCB(nttCB inu, int jnn, ppc_CB& old_cb, ICacha ic):
i(inmu), j (jnu)
{ non_id_codnt - 0;
nvw.cb nev ppc_CB(old.cb.b.ne,old_cb.frnSizeO);
aeu_b-)tt0of set (old_c.b statrsO st ( ) );
n.ev_cb->init(old_cb.init() );
n.v_cb-inlets(old_cb. inlet0););
int couat - 0;
it ccount - O;




con t String first_BB - 0;
for(ppc_Partlter iPrt(old_cb.prtItert(); !iPrt.end_p();iPart++, k+*)
(ppc_Part old_prt - *iPrt.vlue();
int stert_reg - old_part.regidl();
int 1 - 0;
hile (!ilnt.end_p() && int.vlueO-labelO .enptyO) tt
(inst.vlueO()->op() .op() - ppc_.one))
{ilnt+;);
it (ilnt.end_p() II
((!ilnt.end_pO) 55 (!ilnst.,vlu()O->labelO .ptyO)))
if (!atart_BB)
{.trtBB - TRUE; 1++;);
if (ilnt.end_pO)
if (1 > 1)
{buildheader ( *nv.bb , 1, *last..labe1,
count++,startreg, nev_prt->type());
nev_prt->append(nev._bb);)











  if (count > MAX_BB_PER_CB)
    sigerr("instrument_CB::constructor: Too many BB for a single code block.");
  ofstream f_out("/home/jj/sybok/Profiler/Static_measures/002/qstatshort",
                 ios::app);
  if (f_out.fail())
    sigerr("instrument_CB::constructor: Error in writing short stat file.");





*igrr("instruat_CB: Error in vriting tatie aurm file.);
void
instrument_CB::build_header(DList<ppc_Inst*>& bb, int k, int l,
                            const String& bb_label, int bb_id,
                            int reg_idx_start, PartType ptype)
{ String suffix =
    *(new String(num_to_str(i) + "." + num_to_str(j) + "." +
                 num_to_str(k) + "." + num_to_str(l)));
  String new_label = *(new String("beg." + suffix));
  String join_point = *(new String("join." + suffix));
  String end_point = *(new String("end." + suffix));
  String cold_entry = *(new String("ce." + suffix));











instrument_CB::make_fall_thru_ins(int first_reg, String& join_pt)
{ ppc_Reg r1 = *(new ppc_Reg(first_reg+1));
  ppc_Reg r2 = *(new ppc_Reg(first_reg+2));
  ppc_Reg r4 = *(new ppc_Reg(first_reg+4));
  ppc_Reg r5 = *(new ppc_Reg(first_reg+5));
  DList<ppc_Inst*>* code_seg = new DList<ppc_Inst*>;
  ppc_Inst* ft_1 = new ppc_Inst(ppc_mfspr);
  ft_1->rands(new ppc_Rand(r2), new ppc_Rand(RTCL));
  code_seg->append(ft_1);
  ppc_Inst* ft_2 = new ppc_Inst(ppc_lwz);
  ft_2->rands(new ppc_Rand(r4), new ppc_Rand(r1), new ppc_Rand(0));
  code_seg->append(ft_2);
  ppc_Inst* ft_3 = new ppc_Inst(ppc_lwz);
  ft_3->rands(new ppc_Rand(r5), new ppc_Rand(r1), new ppc_Rand(8));
  code_seg->append(ft_3);
  ppc_Inst* ft_4 = new ppc_Inst(ppc_addi);
  ft_4->rands(new ppc_Rand(r4), new ppc_Rand(r4), new ppc_Rand(1));
  code_seg->append(ft_4);
  ppc_Inst* ft_5 = new ppc_Inst(ppc_stw);
  ft_5->rands(new ppc_Rand(r4), new ppc_Rand(r1), new ppc_Rand(0));
  code_seg->append(ft_5);





instrument_CB::make_jump_entry_ins(int first_reg, const String& lbl = "")
{ ppc_Reg r1 = *(new ppc_Reg(first_reg+1));
  ppc_Reg r2 = *(new ppc_Reg(first_reg+2));
  ppc_Reg r4 = *(new ppc_Reg(first_reg+4));
  ppc_Reg r5 = *(new ppc_Reg(first_reg+5));
  DList<ppc_Inst*>* code_seg = new DList<ppc_Inst*>;
  ppc_Inst* je_1 = new ppc_Inst(ppc_mfspr, lbl);
  je_1->rands(new ppc_Rand(r2), new ppc_Rand(RTCL));
  code_seg->append(je_1);
  ppc_Inst* je_2 = new ppc_Inst(ppc_lwz);
  je_2->rands(new ppc_Rand(r4), new ppc_Rand(r1), new ppc_Rand(4));
  code_seg->append(je_2);
  ppc_Inst* je_3 = new ppc_Inst(ppc_lwz);
  je_3->rands(new ppc_Rand(r5), new ppc_Rand(r1), new ppc_Rand(8));
  code_seg->append(je_3);
  ppc_Inst* je_4 = new ppc_Inst(ppc_addi);
  je_4->rands(new ppc_Rand(r4), new ppc_Rand(r4), new ppc_Rand(1));
  code_seg->append(je_4);
  ppc_Inst* je_5 = new ppc_Inst(ppc_stw);




inmtrunt_CB :: aak_conins(int first_reg, cont Stringt join_pt - ")
ppe_Rg rl - *(nmv ppc_Rg(first_-rsl));
ppc_Rlg r2 s- (nea ppc._Rg(firmt_rg+2));
ppc_R g rJ *(na v ppcRg(tirt_rp3));
ppc_Rt$ r4 a o(nv ppc_Rg(firmt_rugS4));
ppc_Rb r6 S *(nev ppc_R.lg(firt_rg+5));
DLitc<ppc_Iet*>* codme_sg ne DLit<ppc_Inst>);
ppc_Int jan_l nv ppc_Int(ppc_lvz,join_pt);
jnJ->rand(n v ppc_Rnd(r4), nv ppc_ind(rl), nv ppc_Rand(l2));
cod_seg->appnd(jn_l);
ppe_Int jn_2 - nv ppcInst(ppc.doz);
ja_2->rnds(nae ppc_Rad(r3), ne ppc_Rnd(rSTAT), nov ppc_Rmd(r2));
code_mg->)ppand(jn_2);
ppc_Inat jn_.3 n- ppcInst(ppc_.ddc);
jn_3->rnd(nv ppc_Rnd(r6), ne ppc_Rnd(r6), nev ppc_Raad(r3));
cod.e_mg-)appnd(jn_3);
ppc_Int* jn_4 - n-v ppc.Int(ppc.ddz);
jn_4->rand(nev ppc_Rand(r4), ne ppc_Raand(r4));
cod_.mg->ppand(j.n_4);
ppc_Inst jn_6 - nv ppcInt(ppc.tv);
jn_6->rnda(nv ppc_and(r5). nv ppcRad(ri), nv ppc_Rand(8));
cod._seg->append(J.n_6);
ppc_Int* j.n_ - ne ppc_Int(ppc_stv);







int id, PartType ptype)
{ String id_str -
nv Strig(a_cb->mme()+"...ftatmbe+" + numto_str(i8id));
ppc_Reg ri - *(nev ppc_Rag(firt_rg+l));
ppc_Rag r2 - *(nv ppc_R(first_rg+2));













DLimt<ppc_Int>* codm_smg - nv DList<ppc_Inst>;
ppc_Insta co_l - nv ppc_Int(ppc_.ddi,c_entry);
cs_i->rnds(nv ppc_Rnd(rl), nev ppc_Rand(bCe_Rg),
nm ppc_Rand(id_tr));
code_..Sg->ppend(c._l);
ppc_lnt. c_2 - ne ppc_Int(ppc_fSpr,nav_lbel);





instrument_CB::filter(ppc_Inst* orig_ins, const String& target,
                      DList<ppc_Inst*>* suffix, int k, int l,
                      int start_reg, int& ccount, PartType ptype)
{ ppc_Inst* new_ins = new ppc_Inst(*orig_ins);
  ppc_Rand** rlist = orig_ins->rands();
  ppc_Rand* targ_address = 0;
  String* linker_label;
  ppc_Reg rCARG1 = *(new ppc_Reg(3,GPR_RTYPE,3));
  ppc_Reg rCARG2 = *(new ppc_Reg(4,GPR_RTYPE,4));
  ppc_Reg cr0 = *(new ppc_Reg(0,CR_RTYPE,0));
  ppc_Reg r1 = *(new ppc_Reg(start_reg+1));
  ppc_Reg r2 = *(new ppc_Reg(start_reg+2));
  ppc_Reg r3 = *(new ppc_Reg(start_reg+3));
  ppc_Reg r4 = *(new ppc_Reg(start_reg+4));
  ppc_Reg r5 = *(new ppc_Reg(start_reg+5));
  int call_count;
  for(int loop = 0; loop < orig_ins->nrands(); loop++)
    if ((rlist[loop]->label_p()) && (rlist[loop]->label() == target))
      new_ins->rands()[loop] =
        new ppc_Rand("first."+num_to_str(i)+"."+num_to_str(j)+"."+
                     num_to_str(k)+".1");
  if (orig_ins->op().linked_p() &&
      !i_cache->exists_p(get_branch_target(orig_ins)))
    for(loop = 0; loop < orig_ins->nrands(); loop++)
      if (rlist[loop]->label_p())
        { targ_address = rlist[loop];
          if (registry->contains(targ_address->label()))
            call_count = (*registry)[targ_address->label()];
          else
            { call_count = ccount++;
              (*registry)[targ_address->label()] = call_count; };
          if (call_count > MAX_CALLED_FN_PER_CB)
            sigerr("instrument_CB::filter: too many called procs.");
          linker_label =
            new String(targ_address->label()+".lnk."+num_to_str(i)+"."+
                       num_to_str(j)+"."+num_to_str(k)+"."+num_to_str(l+1));
          new_ins->rands()[loop] = new ppc_Rand(*linker_label);














nov String(newcb-no() "... ftatbae+" +
um_to_str (16jpX_BB_PER_CB+20*ca_count) );
          ppc_Inst* grab_t_1 = new ppc_Inst(ppc_mfspr);
          grab_t_1->rands(new ppc_Rand(r1), new ppc_Rand(RTCU));
          suffix->append(grab_t_1);
          ppc_Inst* grab_t_2 = new ppc_Inst(ppc_mfspr);
          grab_t_2->rands(new ppc_Rand(r2), new ppc_Rand(RTCL));
          suffix->append(grab_t_2);
          ppc_Inst* grab_t_3 = new ppc_Inst(ppc_mfspr);
          grab_t_3->rands(new ppc_Rand(r3), new ppc_Rand(RTCU));
          suffix->append(grab_t_3);
          ppc_Inst* grab_t_4 = new ppc_Inst(ppc_cmpw);
          grab_t_4->rands(new ppc_Rand(cr0), new ppc_Rand(r1),
                          new ppc_Rand(r3));
          suffix->append(grab_t_4);
ppc_Inlt sti-e_l new ppcInet(ppc_ddi);
_tine_l->rands(ane ppc_Rand(r4), new ppc.Rand(bnee.Reg),
new ppc_Rand(*cid_atr));
ouffix->append(s_t is_);
ppc_Inut* count_cal_l - new ppcInst(ppc_lv);
count_call->rand( ppc nd(r5) ne ppc_Rnd ),  nd(r4),
new ppc_Rad(O));
uffio->append(count_call_i);:
String local_t rg g nerate.next_anon(;
ppe_.Inet s_tine_2 - new ppc_Inet(ppc_beq);
_tie_2-rand(new ppc_Rand(crO),new ppc.Rnd(locl_targ));
uff ix->append(s.tie_.2);
ppc_Inutw _tine_3 - new ppc..lnet(ppefspr,geerate_net_no ());
s_ti_3-rnds(new ppcRjnd(r2), new ppc.Rnd(RTCL));
uffi->append (_tie_3);
ppc_Inet s_tie_4 - new ppeInet(ppc_st.,local_targ);
s_tin_e4-rnds(new ppc_Rand(r2), new ppc_Rand(r4),
new ppc_Rand(4));
nffiSix-append(_time_4);
ppc_Inste _tiJ_6 - nw ppc_lnet(ppc_stv);
s_tie.-rands(new ppc_Rand(r3), nov ppc_Rand(r4),
new ppcRand());
*uff in-)ppaad(s _ine5);
ppc_Inst' countcal_2 - nw ppc.Inet(ppc_ddi);
coant_caL_2-)randn(nev ppcand(r6), new ppc_Rnd(r),
new ppc_Rand(l));
uff ix-)append(cout_call_2);
ppc_Int count_cal._3 . new ppc.In-t(ppc_stw);
count_cal_-)randl(nv ppcRnd(r), newov ppc.Rand(r4),
new ppc_Rand(O));
suffi-appnd(count_call_3);
ppc_Inst ext_brnchb new ppc_Inst(ppc_bl);
ext_brnch-)rands(targ_ddress);
muffin->append(entbrnch);
ppc_Int nap - new ppc_Inet(ppc_cror,pneratt_next_anonO);
nop-,rande(nwv ppc_Rand(3l) ,nv ppc_Rand(3L), - ppc_Rnd(31));
suffix->append(nop);
*/
ppc_Int sve - new ppcfInt(ppcmr, nrats_anet_anonO);
save-)rand(new ppc_and(rl), nvew ppc_Rnd(rCRCG1));
uffix->append(eave);
ppc_Iant accu 1 - nwv ppc_Inet(ppc_ddi);
accun_l->rand(nvw ppc_Rnd(rCARCi), new ppc_Rand(bae_Reg).
new ppc.Rnd(cid_str));
uff ix-)append(accum_1);
ppcInet* acc. 2 - new ppc_Inet(ppc_li);
accu.m_2-)rand(new ppc_Rand(rCARO2), -w ppcRand(4));
uff in-)appnd(accu_2);
ppc_Inte accu._3 - nov ppc_Inet(ppc_bl);
accu.3-)rand(new ppc_Rand(".accutie"));
ff ix-)append(acc_u3);
ppc_Int restore n ppc_Inlt(ppc_.r, gnrat_next_anon());
reetore-)>rnd(new ppc_Rnd(rCARC1), new ppc._Rnd(rl));
cuffi-)append(restora);
ppc_Int ret_brnch - new ppc_Inst(ppc_b);
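The commented-out instruction sequence in this routine reads the upper timer register, then the lower, then the upper again, and re-reads the lower half only when the two upper reads disagree. That pattern guards against the lower half wrapping between the two reads. Below is a minimal C sketch of the same idea against a simulated timer; `sim_timer`, `read_upper`, `read_lower`, and `read_time` are hypothetical names, and the simulated carry exists only so the rollover case can be exercised.

```c
#include <assert.h>
#include <stdint.h>

/* Simulated two-register timer. When carry_pending is set, the lower
 * half wraps (and the upper half increments) right after the next
 * read of the lower half. */
typedef struct {
    uint32_t upper;
    uint32_t lower;
    int carry_pending;
} sim_timer;

typedef struct { uint32_t upper, lower; } time_stamp;

static uint32_t read_upper(const sim_timer *t) { return t->upper; }

static uint32_t read_lower(sim_timer *t)
{
    uint32_t value = t->lower;
    if (t->carry_pending) {
        t->upper += 1;
        t->lower = 0;
        t->carry_pending = 0;
    }
    return value;
}

/* Read upper, lower, upper; if the upper half changed in between,
 * the lower value may be stale, so read it again. */
static time_stamp read_time(sim_timer *t)
{
    time_stamp s;
    uint32_t first_upper;
    assert(t != 0);
    first_upper = read_upper(t);
    s.lower = read_lower(t);
    s.upper = read_upper(t);
    if (first_upper != s.upper)
        s.lower = read_lower(t);
    return s;
}
```

A single retry is enough as long as the lower half cannot wrap twice within the few instructions of the read sequence.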








instrunntCBg:: nerate_bootstrap(const String& nit_lbbl,
int k, int tart_rog, bool cthre.d_p)
{ ppc_Rg rO * '(new ppc_Rg(tzrtreg));
ppcng rCRlO1 - (new ppc_Rge(3,CPR_RtYPE,3));




{for(it I - 0; i <- INSTR_REGS; i++)
(ppc_-Reg tore_reg (i < INSTR_REGS) ?
(anv ppc_Reg(13+i,GPR_RTYPE.,13+)):
· (new ppc_Rg(31,GPR_RTPE,31));
ppc_Iant St_iant - new ppc_Iant(ppc_tw);




boot_setup-rand(new ppc_Rnnd(rFP), new ppcRand(rCARG));
hader->append(boot_etup););
ppc_Inet boot_l - new ppc_Inst(ppc_.flr);
boot_l->rand(nve ppc_Rand(rO));
hedr->append(boot_1);
pps_Into boot_2 - new ppc_Inet(ppc_bl);
boot_2->rnd (ane ppcRand("ce. ".*nu_to_tr(i)+'. "+n totr(j)+
. "+nu_to_etr(k)+". i "));
header-)append(boot_2);
/*
ppc_Inet boot_3 e new ppc_Iant(ppc_crrgnerate_nxt_anoO);










tor(it i O0; i IISTRJIEGS; i)
(ppc_bg loadreg - (i < INSTRREGS) ?
*(nev ppc_Rg(13i,GPR_RTYPE,13i)) : (nev ppc_Reg(3i,GPRtTYPE.31));
ppc_Inte ld.ist nw ppc_lnst(ppclvz);






{ return *(new String("anon."+num_to_str(i)+"."+num_to_str(j)+"."+
                      num_to_str(anon_id_count++))); }
//QPROF ppc_type ADT
// The ppc_type ADT is used to collect statistics about the types of













  type_count(ICache& ici, DList<String*>& branch_chart, const String& targ);
  int arith_cnt() {return arith;};
  int logl_cnt() {return logl;};
  int contrl_cnt() {return contrl;};
  int floatp_cnt() {return floatp;};
  int mem_cnt() {return mem;};
  int alu_int_cnt() {return alu_int;};
  int alu_flt_cnt() {return alu_flt;};
  int heap_cnt() {return heap;};
  int i_struct_cnt() {return i_struct;};
  int sched_cnt() {return sched;};
  int network_cnt() {return network;};
int hen_ ct() (raturn Se;);
  int regalloc_cnt() {return regalloc;};
  DList<String*>& get_calls() {return call_list;};
private:
  int arith, logl, contrl, floatp, mem;






void ccus_call(ppc_n .st ins);
ilie void type_count::-ccum_cell(ppc_Inlt is)
( ppc_RPdo rliet ins.rnd();
if (is.opO).linked_p.))
for(it loop - 0; loop < i.nnrndO); loop+)
if (rliet [loop] -)lal_p p )
if ((rliet[loop]-labl()O.froq('.') ) 6) t
(rlist [loop] -label () cotins (".lnk ) )
{it poe - rlist[loop]-)labelO.indor(.lnk.");















type_count: :type_count(ICach& ici, DLit<String>&L branchchhrt,
coast Stringt targ): trget(trg)
ic kicl;
bool collecting - FALSE;
bool done FALSE;
arith - 0; l o ntrl 0; lntr lp - O; s - O0;
ala_int O0; lu_lt - ; heap - O; i__truct O0;




ic-fltch (init il_f tch_t eget);




vhile (!vorking_i_li.t.eptyO) &t !done) {
for(i vorking.i_lit.iterO; !i->end_p(); (i)++)
if ((collecting) U (!i-fvelueO-)labelO.eptyO))
(collecting - FALSE; done - TRUE;)
else if ((!collecting) & (i-vlue()-l bel() -- target))
collecting - TRUE;
op - i->valu(O-)op .opt();
sT_op i->valu()-)sTlourc)O;



















void type_count::accs_ppc_tyype( op ) (
Ivitch(op) {
cage FU: case FXU_CP:
*rith++;
breik;
came FXU_LOG cae CR_OP:
logl+ +;
break;
cue FU_LX: ce FXU_LLI: c FXU_LPX: cuN FXU_LPI:
cage FXU_SXX: clse FXU_SXI: ce FXUSPX: came FXUSPI:
brek;
cae FXU_IIFSPR: cae FXU.NTSPR: came BSUCH:
contrl+;
break;












cae T_ul: came sT_div:
cu-e T_or: cme sT_xor:
case T_shiftr came T_ashiftr
cae Tcup: cae T_cpl:
cas T_tadd: cal T.ub: case s T_aul: case sT_fdiv:
cla sT_fug: case sT_cp: cae sT_fb:
cae sT_toeflot: cae sT_toint
aluflt++;
break;
case T_load: case T_stor; case mT_preftch:
break;
cae sT_hload: cae sT_hstor:; case sT_hptpb: ce sT_hsetpb:
heap++;
brek;
cas sT_iload: case T_istor c ae ITnlod:
cae sT*tore: case mTlxlin: c Tase update:
i__.truct++;
break;
case sT_ tcont: cale T_contpe cue sT_contip:
case sT_contfp: case sTkrgcont: cunes T_kvalcoat:
case T_send: cse sT_recv: cse sT_recvdone: cas sT_antpoll:
utvork+;
break;
case sT_lrge: case sT_jmp: case Tblt: case sT_bl: case sT_beq:
case sT_be: ca sbge: case sbt: case sTbjoia: case sTbnjoin:
case sT_fork: ce sT_post: case sT_hlt:








//QPROF make_measure module
// This module generates static measures from a PowerPC module,
// including ideal execution time, instruction mix, "flavor" mix,







#define MAKE_MEASURE_H 1
void make_measure(ppc_Mod& old_mod, ppc_Mod& new_mod);
void measure_BB(const String& str, ofstream& f_out,
















void make_measure(ppc_Mod& old_mod, ppc_Mod& new_mod) {
  ICache* ic1 = new ICache(new_mod); // generate 1st ICache
  ICache* ic2 = new ICache(new_mod); // generate 2nd ICache
  modmap* map = new modmap(ic1,new_mod);
  String srch_target;
  ofstream f_out("/home/jj/sybok/Profiler/Static_measures/002/qstatlong");
  if (f_out.fail())
    sigerr("make_measure: Error in writing static measure file.");
  int i = 1;
  for(ppc_ProcIter iProc(old_mod.procIter()); !iProc.end_p(); iProc++, i++)
    { int j = 1;









ilunt(iPart.vlue O()->inotIter);!iInst .endpO; )
ippe_InstA ins- oiInst.value();




iif (!ins.labl(). pty() )
({f_out <C i << " < <(j "( < k << " < < <" " ;
f_out << iPert.vlue()->nAe() < ndl;
srch_trgt ,, "boS."+
nu_to.str(i) + "+no_to_str tj) ". "+
nu_to_tr(k)+"." + nu_too.tr(l);
nensur_BB(srch_trpgt,f_out,icl,ic2,nap);}
sigerr("k_neaure BB detectd without valid label."););
if (!in.op(). iert_p())
{startBB - TRUE; 1++;};
iIant++;
while (!ilat.nd_p() ilnt.valu(O)-)label() .pty() 
(ilnt.vlueO-bop() - ppcnooe))
(int++;);
while (ino.label().opty() A (ins.op().op() ppcnone))
{ilnst++;);






void ea.ure_BB(const String trg_.tr, otree f_out,
IChache icl che i, I c2, odp p)
{ String tr(terg_str);
// tin "fll through" pth





cout <" " .--------- n << dl;
for(DLitltr<lStringe> iter(ebcllCO]); iter.nd_pO; itUr++)
coat <( "#" < (iter.valueO) <C endl;
cout C<<" --------- " <C ndl;
*/
bru r o6U000(icl,ic2,bcll[O]t ,tr);
fout (< r000.ti-e.code) << endl;
// tie "jup exit" pth




coat (<" 1+++:--------- "< C endl;
for(DLietlter(Stringo> itr(bcl120]); !itr.end_p); itr++)
cot <o "t" < (itr.valeO()) < ndl;
cout <<" ++ … - -------- " << ondl;
*/
bru rO000b(oicl,oic2,*bc1210] ,str);
f_out << rSOOOb.tie._code() << andl;
  // collect other statistics
  type_count itypes(*ic1, bcl1[0], str);
  f_out << itypes.arith_cnt() << " " << itypes.logl_cnt() << " ";
  f_out << itypes.contrl_cnt() << " " << itypes.floatp_cnt() << " ";
  f_out << itypes.mem_cnt() << endl;
  f_out << itypes.alu_int_cnt() << " " << itypes.alu_flt_cnt() << " ";
  f_out << itypes.heap_cnt() << " " << itypes.i_struct_cnt() << " ";
  f_out << itypes.sched_cnt() << " " << itypes.network_cnt() << " ";
f_out < itype.lne_cnt() << " << itypen.reglloc_cnt() <C endl;
  DList<String*>& c_list = itypes.get_calls();
  f_out << c_list.length() << endl;
  for(DListIter<String*> strIter(c_list); !strIter.end_p(); strIter++)
    f_out << *strIter.value() << endl;
}
String& num_to_str(int i)
{ String* str = new String;
  do
    { (*str) += (char) (48+(i % 10));
      i = i / 10; }











const CNTRL_N = 14;
const CNTRL_F = 6;
const CNTRL_B = 2;
const CNTRL_S = 19;
const CNTRL_L = 12;
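The command constants correspond to ASCII control characters: holding Control with a letter yields the letter's code with the upper three bits cleared. A small sketch of that mapping follows; `ctrl` is a hypothetical helper, not part of PostProf.

```c
#include <assert.h>

/* ASCII control code produced by Control plus an upper- or
 * lower-case letter: keep only the low five bits. */
static int ctrl(char letter)
{
    assert((letter >= 'A' && letter <= 'Z') ||
           (letter >= 'a' && letter <= 'z'));
    return letter & 0x1F;
}
```

With this mapping, C-n is 14, C-b is 2, C-l is 12, C-f is 6, C-s is 19, and C-p is 16.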






















  int cb_pos = 0;
  int part_pos = PARTITIONS;
  int inner_pos = BBS;
  int command = EOF;
  cout << "Entering PostProf visualizer." << endl;
  while ((command != ' ') && (command != '\n'))
    { cb_list[cb_pos]->generate_plot(part_pos, inner_pos);
      command = plot_and_capture(inner_pos, cb_list);
      switch(command) {
      case CNTRL_P:
        if (cb_pos > 0)
          { cb_pos--;
            part_pos = PARTITIONS;
            inner_pos = BBS;
            cout << "Moving to previous Code Block." << endl; };
        break;
      case CNTRL_N:




            cout << "Moving to next Code Block." << endl; };
        break;
      case CNTRL_F:
        if (part_pos < cb_list[cb_pos]->get_p_count()-1)
          { part_pos++;
            inner_pos = BBS;
            cout << "Advancing forward through current Code Block." << endl; };
        break;
      case CNTRL_B:
        if (part_pos > PARTITIONS)
          { part_pos--;
            inner_pos = BBS;
            cout << "Backing through current Code Block." << endl; };
        break;
      case CNTRL_S:
        if (part_pos >= 0)
          { inner_pos = (inner_pos+1) % 3;
            cout << "Switching statistics mode." << endl; };
        break;
      case CNTRL_L:
        cout << "Sea run installed!" << endl;
        break;
      case ' ': case '\n':
        break;
      default:
        cout << "Unknown command. ";
        cout << "Commands: C-n, C-p, C-f, C-b, C-s, C-L, RETURN.";
        cout << endl; };
      cout << "I. Id Code Block " << cb_pos+1 << "/" << cb_count << "." << endl;
    };
  cout << "Done." << endl;
};
int plot_and_capture(int stat, DList<CB_data*>* cb_list) {
  String prog("/home/jj/sybok/Profiler/graphics/gnuplot ");
  String arg("/home/jj/sybok/Profiler/graphics/comfile ");
  String option1("-geometry 860x600+200+100 ");
  String option2("-fn 5x8 -geometry 860x600+200+100 ");
  FILE* filep;
  freopen("/dev/null","w",stderr);
  if (stat == BBS)
    filep = popen(prog+option1+arg,"w");
  else
    filep = popen(prog+option2+arg,"w");



















DList<CB_data*>* cb_list - ney DList<CB_dat*>;







// This ADT holds CB data during the postprof phase of execution
#endif
#ifndef CB_DATA_H







enum plotstat {PARTITIONS = -3, CCALLS = -2, IDBCS = -1};
enum statstate {BBS, PPC, STCODE};
class CB_data {
public:
  CB_data(ifstream& f_qshort, ifstream& f_qstatic,








  int get_p_count() { return p_count; };

















  double fn_l_time[MAX_CALLED_FN_PER_CB];
  int fn_i_cnt[MAX_CALLED_FN_PER_CB];
  String ccb_name[MAX_CALLED_CB_PER_CB];










DLitString*>* p_list - new DList<String*);




iat CB_dati: :u_i_i_lt(int i, DLiettint>L dl)
{int su - 0;
for(DListltercint> dllter(dl); !dllter.end_p(); dllter+-)










CB_data::CB_data(ifstream& f_qshort, ifstream& f_qstatic,
                 ifstream& f_qresult, int called_cb_cnt):
  ccb_count(called_cb_cnt)
{ int cb_num, proc_num, part_num, bb_num;
  int frame_slots;
  String called_name;
  int current_fn = 0;
QPVIapString,int fn_registry(-l,10);
  int j;
  f_qshort >> bb_count >> fn_count;
  for(int i = 0; i < ccb_count; i++)
    f_qshort >> ccb_name[i];
  f_qresult >> cb_name >> frame_slots >> invocations >> l_time;
registry - nevm QVHpString,int(-1,iO0);
  for(j = 0; j < MAX_BB_PER_CB; j++)
    { p_l_time[j] = 0.0;
      p_ave_dur[j] = 0.0; };
  for(i = 0; i < bb_count; i++)
    { f_qresult >> bb_ft_cnt[i] >> bb_br_cnt[i] >> bb_l_time[i];
      f_qstatic >> cb_num >> proc_num >> part_num >> bb_num;
      f_qstatic >> bb_pname[i];
      p_count = part_num;
      f_qstatic >> bb_ft_thy[i] >> bb_br_thy[i];
      for(j = 0; j < 5; j++)
        f_qstatic >> bb_i_mix[i].ppc_count[j];
      for(j = 0; j < 8; j++)
        f_qstatic >> bb_i_mix[i].sT_count[j];








for(j - 0; j local_calls; j++)
{f_qetatic 0 called_nae;




  for(i = 0; i < p_count; i++)
    if (p_ave_dur[i] > 0)
      p_ave_dur[i] = p_l_time[i]/p_ave_dur[i];
  for(i = 0; i < fn_count; i++)
    f_qresult >> fn_i_cnt[i] >> fn_l_time[i];
  for(i = 0; i < ccb_count; i++)
    f_qresult >> ccb_i_cnt[i] >> ccb_l_time[i];
};
void CB_data::display_CB_parts()
{ int xleft = 0;
  int xright = p_count+1;
  double ytop = 20.0;
  double sc_factor = 0;
  for(int i = 0; i < p_count; i++)
    if (ytop < p_l_time[i])
      ytop = p_l_time[i];
  double ybottom = -ytop/20.0;








    sigerr("CB_data::display_CB_parts: Error in writing command file.");
  ofstream data_out1("/home/jj/sybok/Profiler/graphics/d1");
  ofstream data_out2("/home/jj/sybok/Profiler/graphics/d2");
  if (data_out1.fail() || data_out2.fail())
    sigerr("CB_data::display_CB_parts: Error in writing data.");
  f_out << "set boxwidth 0.4" << endl;
  f_out << "set xtics 0,1" << endl;
  f_out << "set xlabel 'Partition'" << endl;
  f_out << "set ylabel 'Live Time/Relative LT'" << endl;
  f_out << "cd '/home/jj/sybok/Profiler/graphics'" << endl;
  f_out << "set label '" << cb_name;
  f_out << " Partitions' at " << xright/2.0 << "," << ytop*.966;
  f_out << " center" << endl;
  f_out << "set label '" << invocations << " invocations' at " << xright/2.0;
  f_out << "," << ytop*.933 << " center" << setprecision(16) << endl;
  f_out << "set label 'Livetime " << l_time;
  f_out << " cpu cycles' at " << xright/2.0;
  f_out << "," << ytop*0.9 << " center" << endl;
  double work_time = 0.0;
  for(i = 0; i < p_count; i++)
    { f_out << "set label '" << p_name[i] << "' at " << i+1 << ",";
      f_out << ybottom/2.0 << " center" << endl;
      double duration = (p_ave_dur[i] < 0.001) ? 0.001 : p_ave_dur[i];
      f_out << "set label '(" << p_l_time[i]/duration << "X)' at ";
      f_out << i+.80 << "," << p_l_time[i]+ytop/30 << " center" << endl;
      f_out << "set label 'Av' at ";
      f_out << i+1.20 << "," << p_ave_dur[i]*sc_factor+ytop/30;
      f_out << " center" << endl;
      data_out1 << i+.80 << " " << p_l_time[i] << endl;
      data_out2 << i+1.20 << " " << p_ave_dur[i]*sc_factor << endl;
      work_time += p_l_time[i];};
  f_out << "set label 'Work time " << work_time;
  f_out << " cpu cycles' at " << xright/2.0;
  f_out << "," << ytop*.868 << " center" << endl;
  f_out << "plot [" << xleft << ":" << xright << "] [";
  f_out << ybottom << ":" << ytop;
  f_out << "] 'd1' title 'Tot. Live Time' with boxes";
  f_out << ", 'd2' title 'Ave LT/call (scaled)' with boxes" << endl;
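Each display routine in this listing follows the same pattern: find the tallest bar, reserve a strip below the axis for labels, and write a gnuplot command file that plots the data files with boxes. The following is a minimal standalone sketch of that pattern, not the QProf code itself. The function name `make_plot_commands` and its parameters are hypothetical, and the command text is returned as a string rather than written to a file.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not part of QProf): builds the text of a gnuplot
// command file for one bar series stored in the data file 'd1'.
std::string make_plot_commands(const std::string& title,
                               const std::vector<double>& live_time)
{
    double ytop = 20.0;                    // minimum axis height, as in the listing
    for (double t : live_time)
        if (ytop < t) ytop = t;            // grow the axis to the tallest bar
    const double ybottom = -ytop / 20.0;   // strip below the axis for labels
    const int xleft = 0;
    const int xright = static_cast<int>(live_time.size()) + 1;

    std::ostringstream out;
    out << "set boxwidth 0.4\n";
    out << "set xtics 0,1\n";
    out << "set label '" << title << "' at " << xright / 2.0 << ","
        << ytop * 0.966 << " center\n";
    out << "plot [" << xleft << ":" << xright << "] ["
        << ybottom << ":" << ytop
        << "] 'd1' title 'Tot. Live Time' with boxes\n";
    return out.str();
}
```

Returning a string keeps the sketch easy to check; a caller would write the result to the command file and the bar values to `d1` before invoking gnuplot.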





  int xleft = 0;
  int xright = fn_count+1;
  double ytop = 20.0;
  double sc_factor = 0;




  double ybottom = -ytop/20.0;
  for(i = 0; i < fn_count; i++)
    if ((fn_l_time[i] > 0.001) && (sc_factor < fn_l_time[i]/fn_i_cnt[i]))
      sc_factor = fn_l_time[i]/fn_i_cnt[i];





    sigerr("CB_data::display_CB_PFNs: Error in writing command file.");
  ofstream data_out1("/home/jj/sybok/Profiler/graphics/d1");
  ofstream data_out2("/home/jj/sybok/Profiler/graphics/d2");
  if (data_out1.fail() || data_out2.fail())
    sigerr("CB_data::display_CB_PFNs: Error in writing data.");
  f_out << "set boxwidth 0.4" << endl;
  f_out << "set xtics 0,1" << endl;
  f_out << "set xlabel 'C Function'" << endl;
  f_out << "set ylabel 'Live Time/Relative LT'" << endl;
  f_out << "cd '/home/jj/sybok/Profiler/graphics'" << endl;
  f_out << "set label '" << cb_name << " C Calls' at " << xright/2.0;
  f_out << "," << ytop*.966;
  f_out << " center" << endl;
  f_out << "set label '" << invocations << " invocations' at " << xright/2.0;
  f_out << "," << ytop*.933 << " center" << endl;
  f_out << "set label 'Livetime " << setprecision(16);
  f_out << l_time << setprecision(16) << " cpu cycles' at " << xright/2.0;
  f_out << "," << ytop*.9 << " center" << endl;
  for(i = 0; i < fn_count; i++)
    { f_out << "set label '" << fn_name[i] << "' at " << i+1 << ",";
      f_out << ybottom/2.0 << " center" << endl;
      f_out << "set label '(" << fn_i_cnt[i] << "X)' at ";
      f_out << i+.80 << "," << fn_l_time[i]+ytop/30 << " center" << endl;
      f_out << "set label 'Av' at ";
      double local_count = (fn_i_cnt[i] < 0.001) ? 0.001 : fn_i_cnt[i];
      f_out << i+1.20 << "," << fn_l_time[i]*sc_factor/local_count+ytop/30;
      f_out << " center" << endl;
      data_out1 << i+.80 << " " << fn_l_time[i] << endl;
      data_out2 << i+1.20 << " " << fn_l_time[i]*sc_factor/local_count;
      data_out2 << endl;};
  f_out << "plot [" << xleft << ":" << xright << "] [";
  f_out << ybottom << ":" << ytop;
  f_out << "] 'd1' title 'Tot. Live Time' with boxes";
  f_out << ", 'd2' title 'Ave LT/call (scaled)' with boxes" << endl;





  int xleft = 0;
  int xright = ccb_count+1;
  double ytop = 20.0;
  double sc_factor = 0;
  for(int i = 0; i < ccb_count; i++)
    if (ytop < ccb_l_time[i])
      ytop = ccb_l_time[i];
  double ybottom = -ytop/20.0;
  for(i = 0; i < ccb_count; i++)
    if ((ccb_l_time[i] > 0.001) && (sc_factor < ccb_l_time[i]/ccb_i_cnt[i]))
      sc_factor = ccb_l_time[i]/ccb_i_cnt[i];





    sigerr("CB_data::display_CCB: Error in writing command file.");
  ofstream data_out1("/home/jj/sybok/Profiler/graphics/d1");
  ofstream data_out2("/home/jj/sybok/Profiler/graphics/d2");
  if (data_out1.fail() || data_out2.fail())
    sigerr("CB_data::display_CCB: Error in writing data.");
  f_out << "set boxwidth 0.4" << endl;
  f_out << "set xtics 0,1" << endl;
  f_out << "set xlabel 'Called CB Name'" << endl;
  f_out << "set ylabel 'Live Time/Relative LT'" << endl;
  f_out << "cd '/home/jj/sybok/Profiler/graphics'" << endl;
  f_out << "set label '" << cb_name << " Id CB Calls' at ";
  f_out << xright/2.0 << "," << ytop*.966;
  f_out << " center" << endl;
  f_out << "set label '" << invocations << " invocations' at " << xright/2.0;
  f_out << "," << ytop*.933 << " center" << endl;
  f_out << "set label 'Livetime " << setprecision(16);
  f_out << l_time << setprecision(15) << " cpu cycles' at " << xright/2.0;
  f_out << "," << ytop*.9 << " center" << endl;
  for(i = 0; i < ccb_count; i++)
    { f_out << "set label '" << ccb_name[i] << "' at " << i+1 << ",";
      f_out << ybottom/2.0 << " center" << endl;
      f_out << "set label '(" << ccb_i_cnt[i] << "X)' at ";
      f_out << i+.80 << "," << ccb_l_time[i]+ytop/30 << " center" << endl;
      f_out << "set label 'Av' at ";
      double local_count = (ccb_i_cnt[i] < 0.001) ? 0.001 : ccb_i_cnt[i];
      f_out << i+1.20 << "," << ccb_l_time[i]*sc_factor/local_count+ytop/30;
      f_out << " center" << endl;
      data_out1 << i+.80 << " " << ccb_l_time[i] << endl;





  f_out << "plot [" << xleft << ":" << xright << "] [";
  f_out << ybottom << ":" << ytop;
  f_out << "] 'd1' title 'Tot. Live Time' with boxes";
  f_out << ", 'd2' title 'Ave LT/call (scaled)' with boxes" << endl;
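The 'Ave LT/call (scaled)' series plots average live time per call on the same axis as total live time, so the averages must be rescaled to fit. The listing finds the largest average per call and multiplies each average by `sc_factor`. The sketch below illustrates one way such a factor could be derived. The function names are hypothetical, and the final normalisation step (dividing `ytop` by the largest average) is an assumption, since that step does not survive in the listing.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: returns a factor that maps the largest average
// live time per call onto ytop. Dividing ytop by the maximum is an
// assumption; that step is not visible in the listing.
double scale_factor(const std::vector<double>& live_time,
                    const std::vector<double>& calls, double ytop)
{
    assert(live_time.size() == calls.size());
    double max_avg = 0.0;
    for (std::size_t i = 0; i < live_time.size(); ++i) {
        // Skip entries with negligible time (the listing's 0.001 threshold)
        // and, additionally here, entries with no recorded calls.
        if (live_time[i] > 0.001 && calls[i] > 0.001) {
            const double avg = live_time[i] / calls[i];
            if (max_avg < avg) max_avg = avg;
        }
    }
    return (max_avg > 0.0) ? ytop / max_avg : 0.0;
}

// Height of the scaled "average per call" bar for one entry, using the
// same 0.001 floor on the call count that the listing applies.
double scaled_average(double live_time, double calls, double factor)
{
    const double local_count = (calls < 0.001) ? 0.001 : calls;
    return live_time * factor / local_count;
}
```

With this normalisation the entry with the largest average per call reaches the top of the plot, and the other bars keep their proportions relative to it.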






  int p_num = (*registry)[p_name];
  DList<int> bb_list = bbs[p_num];
  int local_bb_count = bb_list.length();
  int xleft = 0;
  int xright = local_bb_count+1;
  double ytop = 20.0;
  double bb_tot_thy_time[MAX_BB_PER_CB];
  for(DListIter<int> iIter(bb_list); !iIter.end_p(); iIter++)
    { int idx = iIter.value();
      double local_live_time = bb_l_time[idx];
      bb_tot_thy_time[idx] = bb_ft_cnt[idx]*bb_ft_thy[idx]+
                             bb_br_cnt[idx]*bb_br_thy[idx];
      if (ytop < local_live_time)
        ytop = local_live_time;
      if (ytop < bb_tot_thy_time[idx])





    sigerr("CB_data::display_BBs: Error in writing command file.");
  ofstream data_out1("/home/jj/sybok/Profiler/graphics/d1");
  ofstream data_out2("/home/jj/sybok/Profiler/graphics/d2");
  if (data_out1.fail() || data_out2.fail())
    sigerr("CB_data::display_BBs: Error in writing data.");
  f_out << "set boxwidth 0.4" << endl;
  f_out << "set xtics 0,1" << endl;
  f_out << "set xlabel 'BB Number'" << endl;
  f_out << "set ylabel 'Real LT/Ideal LT'" << endl;
  f_out << "cd '/home/jj/sybok/Profiler/graphics'" << endl;
  f_out << "set label '" << p_name << " Basic Blocks' at ";
  f_out << xright/2.0 << "," << ytop*.966;
  f_out << " center" << endl;
  f_out << "set label '" << bb_ft_cnt[bb_list[0]]+bb_br_cnt[bb_list[0]];
  f_out << " invocations' at " << xright/2.0;
  f_out << "," << ytop*.933 << " center" << endl;
  f_out << "set label 'Livetime " << setprecision(15);
  f_out << p_l_time[p_num] << setprecision(15);
  f_out << " cpu cycles' at " << xright/2.0;
  f_out << "," << ytop*.9 << " center" << endl;
  int i = 0;
  for(DListIter<int> iIter2(bb_list); !iIter2.end_p(); iIter2++,i++)
    { int idx = iIter2.value();
      f_out << "set label '(" << bb_ft_cnt[idx]+bb_br_cnt[idx] << "X)' at ";
      f_out << i+1 << "," << ybottom/2.0 << " center" << endl;
      data_out1 << i+.80 << " " << bb_l_time[idx] << endl;
      data_out2 << i+1.20 << " " << bb_tot_thy_time[idx] << endl;
    };
  f_out << "plot [" << xleft << ":" << xright << "] [";
  f_out << ybottom << ":" << ytop;
  f_out << "] 'd1' title 'Actual Live Time' with boxes";
  f_out << ", 'd2' title 'Ideal Time' with boxes" << endl;
  f_out << "pause -1" << endl;
  f_out.close();
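The basic-block display compares measured live time against an ideal time computed as `bb_ft_cnt*bb_ft_thy + bb_br_cnt*bb_br_thy`. The sketch below isolates that calculation. The struct and function names are hypothetical, and reading `ft` and `br` as the fall-through and branch exits of a block is an assumption based on the identifier names alone.

```cpp
#include <cassert>

// Hypothetical record; the fields mirror the per-block arrays in the listing.
struct BasicBlockStats {
    double ft_cnt;   // executions that left the block by falling through (assumed meaning)
    double br_cnt;   // executions that left the block by branching (assumed meaning)
    double ft_thy;   // theoretical cycles for the fall-through path
    double br_thy;   // theoretical cycles for the branch path
    double l_time;   // measured live time in cycles
};

// Ideal time: each exit path's execution count times its theoretical cost.
double ideal_time(const BasicBlockStats& bb)
{
    return bb.ft_cnt * bb.ft_thy + bb.br_cnt * bb.br_thy;
}

// Ratio of measured to ideal time; returns 0 when the block never ran.
double real_to_ideal(const BasicBlockStats& bb)
{
    const double ideal = ideal_time(bb);
    return (ideal > 0.0) ? bb.l_time / ideal : 0.0;
}
```

A ratio well above 1 would flag a block whose measured cost exceeds what instruction counting alone predicts, which is the gap the 'Actual Live Time' and 'Ideal Time' bars are meant to expose.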




  int p_num = (*registry)[p_name];
  DList<int> bb_list = bbs[p_num];
  int xleft = 0;
  int xright = 5+1;
  double ytop = 20.0;
  int summed_stats[13];
  for(int i = 0; i < 13; i++)
    { summed_stats[i] = sum_i_mix_elt(i,bb_list);
      if (ytop < summed_stats[i])





    sigerr("CB_data::display_IM: Error in writing command file.");
  ofstream data_out1("/home/jj/sybok/Profiler/graphics/d1");
  if (data_out1.fail())
    sigerr("CB_data::display_IM: Error in writing data.");
  f_out << "set boxwidth 0.4" << endl;
  f_out << "set noxtics" << endl;
  f_out << "set xlabel 'Imix Statistic'" << endl;
  f_out << "set ylabel 'Frequency'" << endl;
  f_out << "cd '/home/jj/sybok/Profiler/graphics'" << endl;
  f_out << "set label '" << p_name << " Basic Blocks' at ";
  f_out << xright/2.0 << "," << ytop*.966;
  f_out << " center" << endl;
  f_out << "set label '" << bb_ft_cnt[bb_list[0]]+bb_br_cnt[bb_list[0]];
  f_out << " invocations' at " << xright/2.0;
  f_out << "," << ytop*.933 << " center" << endl;
  f_out << "set label 'Livetime " << setprecision(16);
  f_out << p_l_time[p_num] << setprecision(15) << " cpu cycles' at ";
  f_out << xright/2.0;
  f_out << "," << ytop*.9 << " center" << endl;
  for(i = 0; i < 5; i++)
    { f_out << "set label '" << bb_i_mix[0].get_ppc_name(i) << "' at ";
      f_out << i+1 << "," << ybottom/2.0 << " center" << endl;
      data_out1 << i+1 << " " << summed_stats[i] << endl;};
  f_out << "plot [" << xleft << ":" << xright << "] [";
  f_out << ybottom << ":" << ytop;
  f_out << "] 'd1' title 'PowerPC Imix' with boxes" << endl;





  int p_num = (*registry)[p_name];
  DList<int> bb_list = bbs[p_num];
  int xleft = 0;
  int xright = 8+1;
  double ytop = 20.0;
  int summed_stats[13];
  for(int i = 0; i < 13; i++)
    { summed_stats[i] = sum_i_mix_elt(i,bb_list);
      if (ytop < summed_stats[i])
        ytop = summed_stats[i];};
  double ybottom = -ytop/20.0;
  ytop = 1.25*ytop;
  ofstream f_out("/home/jj/sybok/Profiler/graphics/comfile");
  if (f_out.fail())
    sigerr("CB_data::display_IM: Error in writing command file.");
  ofstream data_out1("/home/jj/sybok/Profiler/graphics/d1");
  if (data_out1.fail())
    sigerr("CB_data::display_IM: Error in writing data.");
  f_out << "set boxwidth 0.4" << endl;
  f_out << "set noxtics" << endl;
  f_out << "set xlabel 'Imix Statistic'" << endl;
  f_out << "set ylabel 'Frequency'" << endl;
  f_out << "cd '/home/jj/sybok/Profiler/graphics'" << endl;
  f_out << "set label '" << p_name << " Basic Blocks' at ";
  f_out << xright/2.0 << "," << ytop*.966;
  f_out << " center" << endl;
  f_out << "set label '" << bb_ft_cnt[bb_list[0]]+bb_br_cnt[bb_list[0]];
  f_out << " invocations' at " << xright/2.0;
  f_out << "," << ytop*.933 << " center" << endl;
  f_out << "set label 'Livetime " << setprecision(15);
  f_out << p_l_time[p_num] << setprecision(15) << " cpu cycles' at ";
  f_out << xright/2.0;
  f_out << "," << ytop*.9 << " center" << endl;
  for(i = 0; i < 8; i++)
    { f_out << "set label '" << bb_i_mix[0].get_mT_name(i) << "' at ";
      f_out << i+1 << "," << ybottom/2.0 << " center" << endl;
      data_out1 << i+1 << " " << summed_stats[i+5] << endl;};
  f_out << "plot [" << xleft << ":" << xright << "] [";
  f_out << ybottom << ":" << ytop << "] 'd1' title 'MT Imix' with boxes";
  f_out << endl;
  f_out << "pause -1 " << endl;
  f_out.close();
  data_out1.close();
























// QPROF i_mix ADT -- used for storing info about instruction mix during
// the postprof phase of processing
#ifndef I_MIX_H








  String get_ppc_name(int index) {return ppc_names[index];};





int i_mix::get_count_univ(int idx)
{ if ((idx < 0) || (idx > 12))
    sigerr("i_mix::get_count_univ: Imix statistics out of range.");
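The instruction-mix type keeps two groups of counters (five in the first, eight in the second) and exposes them through one "universal" index in the range 0..12, which is why the display code reads the second group at `summed_stats[i+5]`. The sketch below shows one way such a mapping can work. The class name `InstructionMix`, its members, and the use of an exception in place of `sigerr` are all hypothetical.

```cpp
#include <array>
#include <cassert>
#include <stdexcept>

// Hypothetical stand-in for the i_mix type: two counter groups addressed
// through a single index, 0..4 for the first group and 5..12 for the second.
class InstructionMix {
public:
    static const int kFirst = 5;
    static const int kSecond = 8;

    // Read a counter through the unified index.
    int get_count_univ(int idx) const {
        check(idx);
        return (idx < kFirst) ? first_[idx] : second_[idx - kFirst];
    }

    // Add n to a counter through the unified index.
    void add(int idx, int n) {
        check(idx);
        if (idx < kFirst) first_[idx] += n;
        else              second_[idx - kFirst] += n;
    }

private:
    // The listing reports range errors with sigerr; an exception is used
    // here only to keep the sketch self-contained.
    static void check(int idx) {
        if (idx < 0 || idx >= kFirst + kSecond)
            throw std::out_of_range("Imix statistic out of range");
    }
    std::array<int, kFirst> first_{};    // zero-initialised
    std::array<int, kSecond> second_{};  // zero-initialised
};
```

A single index lets a caller sum all thirteen statistics in one loop, as the `summed_stats` loops in the display routines do, without knowing where one group ends and the next begins.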


















  for(int i = 0; i < 5; i++)
    {ppc_count[i] = 0;
     ppc_names[i] = *(new String(names[i]));};
  for(i = 0; i < 8; i++)
    {mT_count[i] = 0;


































































store ir11, IFP, 
store ir12, IFP, 6
bnjoin IFP, 3, lab10
post MAIN.part0, IFP
lab10:








store ir43, IFP, 
store ir44, IFP, 






load tr31, FP, 6
load tr32, FP, 2
padd tr32, tr32, 7
mkcont tr29, tr30, , tr32, FP
mov tr27, 10
mov tr28, FACT. .CBD
call tr36, allocre_local tr, t2tr28 ,r , tr30
padd tr34, tr28, 6
mkcont tr34, tr36, 0, tr34, tr35
mov tr24, 0
load tr34, tr34, tr24
send tr34, tr36, tr29, tr27
store tr36, FP, 7






load tr55, FP, 6
load tr67, FP, 7
load tr59, FP, 




call tr54, dealloc_frame_at, trl000000, tr58
mov tr61, 0
load tr1, FP, 0
load tre2, FP, 1
mov tr0, 0
load tr61, tr61, tr50
































































store ir9, IFP, 
store ir10, IFP, 4






load tr73, FP, 4
load tr34, FP, 5
Bibliography
[1] H. B. Bakoglu, G. F. Grohoski, and R. K. Montoye. The IBM RISC System/6000 processor:
Hardware overview. IBM Journal of Research and Development, 34(1):12-22, January 1990.
[2] Bernstein, et al. Performance Evaluation of Instruction Scheduling on the IBM RISC Sys-
tem/6000. IEEE Transactions 0-8186-3175-9, IBM Isreal Scientific Center, The Technion City,
1992.
[3] Culler, et al. TAM-A Compiler Controlled Threaded Abstract Machine. Journal of Parallel
and Distributed Computing, 18:347-370, 1993.
[4] Digital Equipment Corporation. pixie(1). Ultrix 4.0 General Information, Vol. 3A (Com-
mands(1): M-Z).
[5] Aaron J. Goldberg. Reducing Overhead in Counter-Based Execution Profiling. Technical Re-
port: CSL-TR-91-495, Stanford University, Stanford, October 1991.
[6] G. Grohoski, J. Kahle, and L. Thatcher. Branch and Fixed-Point Instruction Execution Units.
In IBM RISC System/6000 Technology, Order No. SA23-2619, pages 24-33. IBM, 1990.
[7] G. F. Grohoski. Machine organization of the IBM RISC System/6000 processor. IBM Journal
of Research and Development, 34(1):37-58, January 1990.
[8] Carl Kesselman. Tools and Techniques for Performance Measurement and Performance Im-
provement in Parallel Programs. Dissertation, University of California, Los Angeles, 1991.
129
[9] James R. Larus and Thomas Ball. Optimally Profiling and Tracing Programs. ACM 089791-
453-8, University of Wisconsin, Madison, 1992.
[10] R. R. Oehler and R. D. Groves. The IBM RISC System/6000 processor architecture. IBM
Journal of Research and Development, 34(1):23-36, January 1990.
[11] B. Olsson, R. Montoye, P. Markstein, and M. Nguyenphu. RISC System/6000 Floating-Point
Unit. In IBM RISC System/6000 Technology, Order No. SA23-2619, pages 34-43. IBM, 1990.
[12] Unix User's Reference Manual. gprof(1). Computer Systems Research Group, Computer Science
Division, Department of Electrical Engineering and Computer Science, Berkeley, California,
1986.
[13] Unix User's Reference Manual. prof(1). Computer Systems Research Group, Computer Science
Division, Department of Electrical Engineering and Computer Science, Berkeley, California,
1986.
[14] H. S. Warren, Jr. Instruction scheduling for the IBM RISC System/6000 processor. IBM Journal
of Research and Development, 34(1):85-92, January 1990.
130
