A Quantitative Framework for Automated Pre-Execution Thread Selection by Roth, Amir & Sohi, Gurindar  S.
University of Pennsylvania 
ScholarlyCommons 
Technical Reports (CIS) Department of Computer & Information Science 
January 2002 
A Quantitative Framework for Automated Pre-Execution Thread 
Selection 
Amir Roth 
University of Pennsylvania, amir@cis.upenn.edu 
Gurindar S. Sohi 
University of Pennsylvania 
Recommended Citation 
Amir Roth and Gurindar S. Sohi, "A Quantitative Framework for Automated Pre-Execution Thread
Selection", January 2002.
University of Pennsylvania Department of Computer and Information Science Technical Report No. MS-CIS-02-23. 
This paper is posted at ScholarlyCommons. https://repository.upenn.edu/cis_reports/158 
A Quantitative Framework for Automated Pre-Execution Thread Selection 
Abstract 
Pre-execution attacks cache misses for which conventional address-prediction driven prefetching is 
ineffective. In pre-execution, copies of cache miss computations are isolated from the main program and 
launched as separate threads called p-threads whenever the processor anticipates an upcoming miss. P-
thread selection is the task of deciding what computations should execute on p-threads and when they 
should be launched such that total execution time is minimized. P-thread selection is central to the 
success of pre-execution. 
We introduce a framework for automated static p-thread selection, a static p-thread being one whose 
dynamic instances are repeatedly launched during the course of program execution. Our approach is to 
formalize the problem quantitatively and then apply standard techniques to solve it analytically. The 
framework has two novel components. The slice tree is a new data structure that compactly represents 
the space of all possible static p-threads. Aggregate advantage is a formula that uses raw program 
statistics and computation structure to assign each candidate static p-thread a numeric score based on 
estimated latency tolerance and overhead aggregated over its expected dynamic executions. Our 
framework finds the set of p-threads whose aggregate advantages sum to a maximum. The framework is 
simple and intuitively parameterized to model the salient microarchitecture features. 
We apply our framework to the task of choosing p-threads that cover L2 cache misses. Using detailed 
simulation, we study the effectiveness of our framework, and pre-execution in general, under different
conditions. We measure the effect of constraining p-thread length, of adding localized optimization to p-
threads, and of using various program samples as a statistical basis for the p-thread selection, and show 
that our framework responds to these changes in an intuitive way. In the microarchitecture dimension, we 
measure the effect of varying memory latency and processor width and observe that our framework 
adapts well to these changes. Each experiment includes a validation component which checks that the 
formal model presented to our framework correctly represents actual execution. 
University of Pennsylvania, Department of Computer and Information Science Technical Report MS-CIS-02-23
available at http://www.cis.upenn.edu/~amir/pubs/tr/tselect-tr2002.pdf
A Quantitative Framework for Automated Pre-Execution Thread Selection 
Amir Roth
Department of Computer and Information Science
University of Pennsylvania
amir@cis.upenn.edu

Gurindar S. Sohi
Computer Sciences Department
University of Wisconsin-Madison
sohi@cs.wisc.edu
1 Introduction 
Second-level cache misses constrain processor performance and will constrain it further as memory latencies
increase in relative terms. Driven by address prediction, non-binding prefetching hides memory latency by
speculatively "hoisting" the cache miss portion of a load, overlapping it with many prior instructions. Prefetching
eliminates many misses. However, certain static problem loads defy address prediction and their misses elude prefetching.
Pre-execution is a recently proposed technique for dealing with problem loads (see footnote 1). Pre-execution sidesteps
address prediction and generates prefetch addresses by executing a copy of the load computation in parallel with the
main program as a separate thread, called a p-thread (see footnote 2), in a multithreaded processor. "Hoisting" is
accomplished as the p-thread fetches and executes many fewer instructions than the main program thread and thus
arrives at and initiates the cache miss first. The multithreaded execution model, in which p-threads are decoupled
from the main program and one another, has many advantages. P-thread execution and cache miss initiation are
accelerated because p-threads are isolated from stalls and squashes that occur in the main thread. Overlapping is
enhanced because while a cache miss stalls the p-thread, the main thread continues fetching, executing and retiring
instructions from the main program. With hardware multithreading becoming prevalent, pre-execution is gaining
popularity [3, 8, 11, 14, 20].

1. Pre-execution has also been proposed as a way of dealing with problem (i.e., frequently mis-predicted) branches. While we do
not explicitly discuss branch pre-execution here, all of our methods do apply in that scenario.
2. These have been alternately called data-driven threads, p-threads and p-slices. The term p-threads is the "average" of the three.

The benefits and limitations of pre-execution have been well-documented. Here, we attack the problem of p-thread 
selection, the task of deciding which p-threads to pre-execute and when to pre-execute them. P-thread selection is a 
crucial component of pre-execution. It is also a complex task that must balance many inter-related, often antagonistic 
concerns including cache miss latency tolerance, p-thread resource consumption (important when p-threads share 
resources with the main thread), and prefetch coverage and accuracy. To date, p-thread selection has been approached 
both manually [20] and automatically [2, 3, 5, 7, 11] and with promising results. However, past approaches have been
generally heuristic. We present a framework for attacking the problem in a formal, quantitative, and holistic fashion. 
We focus on static p-threads, copies of which are launched repeatedly during program execution. The dynamic program
intervals for which p-threads are chosen can be short, modeling on-the-fly p-thread generation, or a full run,
modeling an off-line implementation. For each program sample, we select p-threads using what is effectively an ana- 
lytical pre-execution limit study. First, we use an execution trace to enumerate all possible static p-threads. Then, we 
apply a simple model called aggregate advantage to calculate the performance benefit of each static p-thread aggre- 
gated over its dynamic invocations. Finally, we "solve" the selection problem by choosing the set of static p-threads 
that maximizes total performance benefit. Two novel components make this approach feasible. The first is aggregate 
advantage, which uses a few key abstractions to effectively model the microscopic interactions of a p-thread with the 
main thread using only a few intuitive high level parameters. The second is the slice tree, a data structure that com- 
pactly represents the space of all possible static p-threads and the relationships between them. The slice tree allows us 
to accurately assess miss coverage and to ensure that pre-execution work is not replicated. The framework also 
includes facilities for optimizing p-threads. Constructed from first principles, the framework is simple and, via a few 
intuitive parameters, applicable to a wide range of pre-execution implementations and processor configurations. In 
this work, we assume a simultaneous multithreading (SMT) [17] processor, where resources are shared among all
threads. The framework, however, is easily adapted to other multithreaded models. 
At first glance, the use of exhaustive analysis on dynamic execution traces seems impractical: the trace-driven 
approach meshes well with dynamic optimization while exhaustive search seems a better fit for off-line implementa- 
tions. However, representative execution samples can be obtained for off-line analysis or reconstructed from profiles 
and the structure of the problem allows us to perform our exhaustive search using a simple iterative procedure that 
converges quickly. Independently, the framework has intrinsic value in that the p-threads it finds are optimal insofar 
as aggregate advantage accurately models pre-execution. The conditional optimality statement derives from the stan- 
dard iterative techniques we use to solve the problem. To remove the condition, we use correlation and cross-valida- 
tion methodologies to measure the fidelity of aggregate advantage. Our results show that, although simple, this 
formula is quite accurate under many conditions. While perhaps not always optimal in reality, the p-threads produced 
by our framework are often close to it. Thus, our framework provides a robust analytical foundation for future p- 
thread selection algorithms. In addition, it allows us to characterize p-threads and evaluate the performance potential 
of pre-execution under different processor and pre-execution configurations and conditions. In this paper, we do 
exactly that in the context of L2 misses. Our experiments confirm an intuitive result: maximum pre-execution
effectiveness and the p-threads required to achieve it are a function of program structure and processor configuration. As
we remove constraints, our framework naturally gravitates to this canonical set of p-threads. 
The next section provides background and describes the important aspects of the problem. Section 3 describes the 
framework in detail. The final three sections contain an experimental evaluation, related work, and our conclusions. 
2 Background 
We review pre-execution and introduce the p-thread selection problem using an example. Figure 1 shows a loop exe- 
cuted by a mythical pharmacy cash register. The loop iterates over the day's transactions and sums the appropriate 
prices for the purchased drugs. Load #09 (in bold), which accesses the price of the drug, is a static problem load. Its 
cache misses cannot be handled via conventional prefetching (their addresses do not form an arithmetic series) and
must be attacked via pre-execution. The left side of the figure shows the static code. The right shows a
p-thread-assisted dynamic execution: the main thread is on the left with loop iterations separated by horizontal
lines, and the p-thread is to the right. As a running example throughout the paper, we construct this p-thread from
the ground up.
Abstract pre-execution model. Selecting proper p-threads requires an understanding of p-thread execution. A
p-thread has two components: the body is a list of instructions that mimics a cache miss computation, and the trigger
is the PC of an instruction in the main thread. A static p-thread is a trigger/body pair. A dynamic p-thread is an
instance of a p-thread body launched when the main thread executes an instance of the corresponding trigger. On the
right side of the figure, the p-thread body is shown in a box with the trigger as an annotation on top. The p-thread and the main thread
computation it mimics are in bold. Although not shown, a dynamic p-thread is launched by every main thread
instance of instruction #11.

FIGURE 1. Pre-Execution Example

STATIC CODE

    for (i = 0; i < N_XACT; i++) {
        if (xact[i].coverage == FULL)
            continue;
        else if (xact[i].coverage == PARTIAL)
            drug_id = xact[i].drug_id;
        else
            drug_id = xact[i].generic_drug_id;
        todays_take += drugs[drug_id].price;
    }

    #00: bge  R4, R1, #14
    #01: lw   R6, 0(R5)
    #02: beq  R6, R2, #11
    #03: bne  R6, R3, #06
    #04: lw   R7, 4(R5)
    #05: j    #07
    #06: lw   R7, 8(R5)
    #07: sll  R7, R7, #2
    #08: addi R7, R7, #drugs
    #09: lw   R8, 0(R7)
    #10: add  R9, R9, R8
    #11: addi R5, R5, #16
    #12: addi R4, R4, #1
    #13: j    #00

DYNAMIC EXECUTION

    MAIN THREAD
    #11: addi R5, R5, #16      (trigger: launches the p-thread)
    #12: addi R4, R4, #1
    #13: j    #00
    ------------------------
    #00: bge  R4, R1, #14
    #01: lw   R6, 0(R5)
    #02: beq  R6, R2, #11
    #03: bne  R6, R3, #06
    #04: lw   R7, 4(R5)
    #05: j    #07
    #07: sll  R7, R7, #2
    #08: addi R7, R7, #drugs
    #09: lw   R8, 0(R7)
    #10: add  R9, R9, R8
    #11: addi R5, R5, #16
    #12: addi R4, R4, #1
    #13: j    #00
    ------------------------
    #00: bge  R4, R1, #14
    #01: lw   R6, 0(R5)
    #02: beq  R6, R2, #11
    #03: bne  R6, R3, #06
    #04: lw   R7, 4(R5)
    #05: j    #07
    #07: sll  R7, R7, #2
    #08: addi R7, R7, #drugs
    #09: lw   R8, 0(R7)

    P-THREAD (trigger: #11)
    #11: addi R5, R5, #16
    #04: lw   R7, 4(R5)
    #07: sll  R7, R7, #2
    #08: addi R7, R7, #drugs
    #09: lw   R8, 0(R7)

As shown in the figure, a p-thread's trigger and body are closely related: the body corresponds to the miss
computation starting from the trigger. This relationship forms the basis of the abstract pre-execution model. The
p-thread and main thread execute in parallel, with the p-thread arriving at the cache miss first by virtue of fetching
and executing fewer instructions, i.e., only the cache miss computation as opposed to the full program. A different
view of the trigger/body relationship allows us to automate p-thread selection: the body and trigger form a dynamic
backwards data-dependence slice that starts at the problem instruction. For a given problem instruction, we enumerate
all possible static p-threads by constructing successively longer backward slices. P-thread selection thus becomes a
true selection process. From this enumerated set, we select the static p-threads whose dynamic instances will tolerate
the most miss latency while incurring the least amount of overhead.
Our abstract model makes two assumptions about the p-thread sequencing model. First, p-threads are control-less:
they are fixed sequences that are executed in their entirety. Second, p-thread launching is not chained [3]: only the
main thread may launch p-threads (with chaining, p-threads can launch other p-threads). These restrictions simplify
the pre-execution implementation. Allowing only the main thread to launch fixed-size p-threads naturally ties
p-thread progress to the main thread, limiting redundant and runaway pre-execution and reducing early prefetching
effects. They also allow us to analyze the benefit of a static p-thread as the aggregate benefit of its dynamic
instances: we know exactly what each instance looks like and exactly how many of them there will be. At the same
time, they do not excessively constrain the power of pre-execution. The primary use of control [20] and chaining [3]
in pre-execution is to implement p-thread loops for increased lookahead and latency tolerance. In our model, where
loops are needed, they are simulated by including multiple copies of the induction in a p-thread, an idiom called
induction unrolling [2, 14]. Our example p-thread uses a copy of instruction #11 to effectively skip one loop iteration ahead.
Our framework uses the abstract pre-execution model (the parallel execution of an isolated computation and the
dynamic main thread region which contains it) in its calculations. It ignores the "constant-overhead" mechanical
details of pre-execution (e.g., the mechanism which initializes p-threads with seed values from the main thread) and 
incorporates the important aspects (e.g., how much sequencing bandwidth p-threads are allocated) as parameters. 
Aspects of p-thread selection. Since p-threads are backwards slices of problem loads, the only thing we can vary in
a p-thread is its length. While choosing a proper p-thread length may sound straightforward, it can be quite subtle.
Obviously, a longer p-thread is launched earlier with respect to its target cache miss and will typically tolerate more
latency. However, it also executes more instructions and consumes more resources. In fact, if it executes too many
instructions, less latency will be tolerated. That is not all. A given static p-thread will launch a certain number of
useless dynamic instances. There are two kinds of useless p-threads: the first pre-executes loads that would have been
cache hits anyway, the second pre-executes no main thread load at all (i.e., the main thread executes along a different
path than the one the p-thread assumes). Our example p-thread is launched once per loop iteration by instruction #11
while not every loop iteration contains an instance of load #09. Increasing p-thread length often increases the
incidence of useless p-threads of the second kind. Another phenomenon is that longer p-threads, while tolerating more
latency per miss, cover fewer dynamic misses. A given instance of load #09 may be arrived at via two different
computations: one containing #04 (drug_id = xact[i].drug_id), the other containing #06 (drug_id = xact[i].generic_drug_id).
A longer p-thread that contains this portion of the computation will target only the subset of misses that exercise the
corresponding instruction. To fully cover all #09 misses potentially requires two static p-threads. Our framework 
simultaneously examines all of these considerations and makes trade-offs between them quantitatively. 
3 P-Thread Selection Framework 
We now construct our framework, using first principles of pre-execution as guides. We first describe aggregate
advantage (ADVagg), a formula that quantifies the performance impact of static p-thread candidates, and show how it is used
to select the best p-thread from within a single computation. We then introduce the slice tree and show how it enables 
the simultaneous selection of multiple p-threads from multiple, partially overlapping computations. Finally, we 
describe two enhancements to the basic framework: p-thread merging and p-thread optimization. 
3.1 Aggregate Advantage: Quantifying the Performance Impact of a Single Static P-Thread 
Obtaining a backward data-dependence slice of a single load is straightforward, even in hardware [2, 11]. A slice
comprising N instructions presents us with a choice of N possible p-threads that increase in length from 1 to N
instructions. The basic p-thread selection problem is to choose the sub-slice that makes the best p-thread, allowing
for the possibility that no sub-slice makes an acceptable p-thread.
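
The enumeration of candidates from one backward slice can be sketched as follows. This is a minimal illustration, not the framework's implementation: the `Candidate` type, the `enumerate_candidates` helper, and the list encoding of the slice are assumptions made here, and the slice contents are inferred from the running example (load #09 reached through #04, with instances of the induction #11 behind it). The oldest slice entry is used only as a trigger.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    trigger: str      # PC of the main-thread instruction that launches the p-thread
    body: List[str]   # p-thread instructions in execution order, ending at the problem load


def enumerate_candidates(backward_slice: List[str]) -> List[Candidate]:
    """Build one candidate per sub-slice length.

    `backward_slice` lists PCs starting at the problem load and walking
    backwards through its data dependences. The last entry is never part of
    a body; it only serves as the trigger of the longest candidate.
    """
    candidates = []
    for length in range(1, len(backward_slice)):
        body = list(reversed(backward_slice[:length]))  # oldest instruction first
        trigger = backward_slice[length]                # next older slice instruction
        candidates.append(Candidate(trigger, body))
    return candidates


# Hypothetical encoding of the example slice: 6 instructions plus a trigger.
slice_via_04 = ["#09", "#08", "#07", "#04", "#11", "#11", "#11"]

for cand in enumerate_candidates(slice_via_04):
    print(f"trigger {cand.trigger}: body {' '.join(cand.body)}")
```

Each successive candidate adds one older instruction to the body, so a slice of N instructions (plus a trigger) yields N candidates.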
P-thread selection balances four considerations: latency tolerance, overhead, miss coverage and useless p-thread
frequency. Longer p-threads tolerate more latency per miss, but incur more overhead, generally cover fewer misses, and
generally result in more useless p-thread instances. Aggregate advantage (ADVagg) combines these considerations 
into a single numerical score, allowing them to be simultaneously optimized. The advantage (ADV) of a dynamic p- 
thread instance is the estimated number of cycles by which it accelerates program execution; the aggregate advantage 
of a static p-thread is the sum of the advantages of all of its dynamic instances. Advantage is the difference of two 
terms: latency tolerance (LT) is the number of cycles by which the p-thread accelerates its targeted cache miss, and
overhead (OH) is the number of cycles by which the p-thread slows down the main thread by stealing resources from 
it. When computing ADVagg, it is convenient to aggregate latency tolerance (LTagg) and overhead (OHagg) separately. 
Every dynamic p-thread exacts overhead on the main thread, but only dynamic p-threads that pre-execute actual 
dynamic cache miss computations achieve any latency tolerance. Useless p-threads have no latency tolerance associ- 
ated with them because their associated main thread loads have no latency (they either hit in the cache or do not 
exist). If DCtrig is the dynamic count of triggers in the program (i.e., the number of times a p-thread is launched) and
DCpt-cm is the number of times a given launched p-thread actually pre-executes a main thread miss, then:

$$ADV_{agg} = LT_{agg} - OH_{agg} \qquad (1)$$
$$OH_{agg} = DC_{trig} \cdot OH \qquad (2)$$
$$LT_{agg} = DC_{pt\text{-}cm} \cdot LT \qquad (3)$$

where OH and LT are the overhead and latency tolerance for a single dynamic p-thread instance, respectively. 
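
Equations 1 through 3 translate directly into a few lines of code. The sketch below is illustrative only; the function names and the sample numbers are assumptions chosen to show that useless p-thread instances contribute overhead but no latency tolerance.

```python
def aggregate_advantage(dc_trig: int, dc_pt_cm: int, lt: float, oh: float) -> float:
    """ADVagg = LTagg - OHagg for one static p-thread.

    dc_trig  : number of dynamic launches (every launch pays overhead)
    dc_pt_cm : number of launches that pre-execute an actual miss
    lt, oh   : latency tolerance and overhead of one dynamic instance, in cycles
    """
    lt_agg = dc_pt_cm * lt   # only useful instances tolerate latency
    oh_agg = dc_trig * oh    # all instances steal sequencing cycles
    return lt_agg - oh_agg


def useless_instances(dc_trig: int, dc_pt_cm: int) -> int:
    """Launches that cover a cache hit or no main-thread load at all."""
    return dc_trig - dc_pt_cm


# Illustrative numbers: 100 launches, 30 of which cover a miss.
print(aggregate_advantage(dc_trig=100, dc_pt_cm=30, lt=3, oh=0.5))  # 90 - 50 = 40.0
print(useless_instances(100, 30))                                   # 70
```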
Overhead per dynamic p-thread (OH). Given our assumption of SMT execution, the number of sequencing cycles 
stolen from the main thread is the most direct way of measuring overhead. All other forms of contention are either 
subsumed by this measure (e.g., execution bandwidth), not easily estimated (e.g., bus bandwidth), or both (e.g., buff- 
ering resources). The number of cycles it takes to sequence a p-thread is the number of instructions in the p-thread 
(SIZEpt) divided by the sequencing width of the processor (BWseq). Since overhead is opportunity cost, we discount
overhead for a given p-thread by the expected main thread sequencing utilization (BWseq-mt / BWseq):

$$OH = \frac{SIZE_{pt}}{BW_{seq}} \cdot \frac{BW_{seq\text{-}mt}}{BW_{seq}} \qquad (4)$$

For instance, if on an 8-wide processor the main thread fetches 4 instructions per cycle, then a p-thread is only
penalized for half of its bandwidth consumption. One in two cycles it uses would not have been used by the main
thread anyway.
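
The per-instance overhead described above can be written out as a short function. This is a sketch under the stated model; the function name and the 8-instruction p-thread used in the example are assumptions.

```python
def overhead_per_pthread(size_pt: int, bw_seq: float, bw_seq_mt: float) -> float:
    """Sequencing cycles one dynamic p-thread steals from the main thread."""
    sequencing_cycles = size_pt / bw_seq   # cycles needed to sequence the p-thread
    utilization = bw_seq_mt / bw_seq       # fraction of bandwidth the main thread would use
    return sequencing_cycles * utilization


# 8-wide processor, main thread fetching 4 instructions per cycle:
# a hypothetical 8-instruction p-thread occupies 1 sequencing cycle
# but is charged for only half of it.
print(overhead_per_pthread(size_pt=8, bw_seq=8, bw_seq_mt=4))  # 0.5
```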
Latency tolerance per useful dynamic p-thread (LT). P-thread latency tolerance (LT) is quantified as a difference 
in execution times of the p-thread and the main thread. Our execution time estimation metric is sequencing-con- 
strained dataflow-height (SCDH), a function that models both data-dependences and limited sequencing bandwidth. 
Starting from the trigger instruction (when the main thread and p-thread begin executing in parallel), we calculate the 
number of cycles it would take the p-thread to execute the cache miss and the number of cycles it takes an unassisted 
main thread to do the same. The difference between these two estimates, SCDHmt - SCDHpt, is the number of cycles
by which the p-thread hoists the miss with respect to the main thread and thus the amount of latency it tolerates on its
behalf. Since it does not benefit the main thread to tolerate more latency than the latency of the miss, we bound LT by
the original miss latency (Lcm):

$$LT = \min(SCDH_{mt} - SCDH_{pt},\; L_{cm}) \qquad (5)$$

The recursive equations for SCDH are those for standard dataflow-height, except that the input height at a given
instruction also models a sequencing constraint (SC): the cycle at which the instruction is sequenced (fetched).
To calculate SC for a given instruction, we divide the instruction's trigger distance (DISTtrig), its distance in
dynamic instructions from the trigger, by the available sequencing bandwidth (BWseq-mt for the main thread,
BWseq-pt for the p-thread). SCDHpt is smaller than SCDHmt because of SC: the p-thread has fewer instructions to
sequence through, so each p-thread instruction has a smaller DISTtrig than its main thread counterpart. Now let us
define the values used for BWseq-mt and BWseq-pt. BWseq-mt is the rate at which the main thread actually sequences.
To account for main thread speculative execution, we heuristically calculate BWseq-mt as the average of the unassisted
main thread IPC and the sequencing width of the processor (BWseq), weighted 2-to-1 in favor of the IPC. BWseq-pt is
the rate at which a p-thread is allowed to sequence. We set BWseq-pt to 1 because p-threads are single computations
that execute serially and there is no sense allocating a p-thread more sequencing bandwidth than it will use.
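
The SCDH recurrences themselves do not appear in this copy of the text, so the sketch below is an assumption based on the description above: an instruction's input height is the maximum of its sequencing constraint and the output heights of its producers, and its output height is the input height plus its latency. No rounding of SC is applied. The `Inst` type, the function names, and the three-instruction chain are hypothetical and are not taken from Figure 2.

```python
from typing import Dict, List, NamedTuple


class Inst(NamedTuple):
    name: str
    dist_trig: int        # distance from the trigger, in dynamic instructions
    inputs: List[str]     # producers of this instruction's operands within the slice
    latency: float = 1.0  # execution latency in cycles


def scdh(insts: List[Inst], bw_seq: float) -> float:
    """Input height of the last instruction: the cycle at which it can issue."""
    out_height: Dict[str, float] = {}
    in_height = 0.0
    for inst in insts:  # program order, so producers are visited first
        sc = inst.dist_trig / bw_seq  # sequencing constraint (assumed unrounded)
        in_height = max([sc] + [out_height[p] for p in inst.inputs])
        out_height[inst.name] = in_height + inst.latency
    return in_height


def latency_tolerance(scdh_mt: float, scdh_pt: float, l_cm: float) -> float:
    """Hoisting distance, bounded by the miss latency."""
    return min(scdh_mt - scdh_pt, l_cm)


# Hypothetical dependence chain a -> b -> load. In the p-thread the chain is
# contiguous; in the main thread unrelated instructions spread it out.
pt_chain = [Inst("a", 0, []), Inst("b", 1, ["a"]), Inst("load", 2, ["b"])]
mt_chain = [Inst("a", 4, []), Inst("b", 9, ["a"]), Inst("load", 14, ["b"])]

scdh_pt = scdh(pt_chain, bw_seq=1)  # BWseq-pt = 1
scdh_mt = scdh(mt_chain, bw_seq=2)  # BWseq-mt = 2
print(scdh_pt, scdh_mt, latency_tolerance(scdh_mt, scdh_pt, l_cm=8))
```

In this made-up chain the p-thread is limited by data dependences while the main thread is limited by sequencing, which is the situation in which pre-execution gains lookahead.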
Working Example. To illustrate the working of ADVagg, we select a p-thread for one particular dynamic computa- 
tion of load #09 from our example, the one that contains instruction #04. We make the following assumptions. The 
loop executes 100 iterations. The first branch is taken 20 times such that only 80 iterations contain instances of #09. 
The second branch is taken 60 times, thus of the 80 iterations that contain instances of #09, 60 use the computation 
that includes #04, the remaining 20 use #06. Half of all #09 instances result in misses (there are 40 #09 misses). All 
operations have unit latency and cache miss latency is 8 cycles. Note, the highest possible ADVagg score in this case 
is 320: 8 cycles of latency tolerance for each of the 40 #09 misses, with 0 overhead. This score is impossible to 
achieve if p-threads have non-zero cost. The processor is 4 wide, and the unassisted IPC of the loop is 1 (BWseq-mt is
2). Finally, our slicing mechanism examines 40 instructions and limits p-threads to fewer than 8 instructions; these
constraints result in a slice with 6 instructions (plus the trigger), implying there are six p-thread candidates.
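
The quantities assumed in this example follow from simple arithmetic, spelled out below as a check. Variable names are illustrative; the even split of misses between the two paths (30 via #04, 10 via #06) matches the counts used later in the example.

```python
iterations = 100
skip_load_09 = 20                                # first branch taken: no instance of #09
load_09_instances = iterations - skip_load_09    # 80

via_04 = 60                                      # computations that include #04
via_06 = load_09_instances - via_04              # 20 computations that include #06

# Half of all #09 instances miss; the split across paths is assumed even.
misses = load_09_instances // 2                  # 40
misses_via_04 = via_04 // 2                      # 30
misses_via_06 = via_06 // 2                      # 10

l_cm = 8                                         # cache miss latency in cycles
adv_agg_upper_bound = misses * l_cm              # 320, reachable only with zero overhead

bw_seq = 4                                       # processor sequencing width
ipc = 1                                          # unassisted main-thread IPC
bw_seq_mt = (2 * ipc + bw_seq) / 3               # 2-to-1 weighted average = 2.0

print(misses, adv_agg_upper_bound, bw_seq_mt)
```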
Figure 2 shows the ADVagg calculation for each of the six candidate p-threads. The calculation for the winning
p-thread is shaded. Each calculation is represented by two tables. The table on the left shows the SCDH calculations. In
the p-thread, trigger distances (DISTtrig) are sequential and the sequencing constraint (SC) is obtained by dividing the
trigger distance by 1 (BWseq-pt). In the main thread, trigger distances are sparse and SC is obtained by dividing
DISTtrig by 2 (BWseq-mt). SCDHpt and SCDHmt are the SCDH values of the load #09 instance. The table on the right shows
the LTagg, OHagg, and ADVagg calculations. In the LTagg calculation, SCDHdiff is SCDHmt - SCDHpt, and latency
tolerance (LT) is the minimum of SCDHdiff and Lcm (8 in our example) per Equation 5. LTagg is the product of DCpt-cm
and LT per Equation 3. In the OHagg calculation, SIZE is the number of instructions in the p-thread, OH is SIZE
multiplied by 0.125 (using Equation 4 and plugging in the value 4 for BWseq and 2 for BWseq-mt), and OHagg is OH times
DCtrig per Equation 2.

FIGURE 2. Working Example: using aggregate advantage to select a single static p-thread.

Neither of the first two candidates provides a fetch advantage (latency tolerance) over the main thread (starting at the
trigger, the main thread and the p-thread fetch exactly the same instructions) while both incur overhead (which
increases linearly with p-thread size). Pre-executing either will reduce performance. Notice that for both p-threads
DCtrig is 80 while DCpt-cm is 40: each p-thread is executed 80 times, but only 40 executions correspond to misses.
Setting #04 (D) as the trigger for the third candidate imparts the p-thread with minimal sequencing advantage over the
main thread (the p-thread gets to skip instructions #05 and #06) and 1 cycle of latency tolerance. However, in
contrast with the first two candidates, #04 is executed only 60 times (#06 is executed the other 20 times) and the
computation triggered by #04 results in only 30 misses (the other 10 #09 misses have instruction #06 in their computation).
Unlike the first two, this p-thread has a positive advantage, tolerating 1 cycle of latency for each of 30 misses, and
incurring 0.375 cycles of overhead for each of 60 p-threads launched.
The trigger for the fourth candidate is instruction #11 from the previous iteration. Since an instance of #11 occurs
once per iteration, DCtrig is 100. DCpt-cm is still 30: the computation includes instruction #04 and correctly
pre-executes only 30 misses. The changes in DCpt-cm and DCtrig observed for the last two candidates illustrate two
general trends. First, within a given backwards slice, DCpt-cm monotonically decreases as p-thread length increases.
This is an intuitive result: the longer the slice, the fewer dynamic computations it corresponds to. In contrast,
DCtrig has no direct relationship to p-thread length. Trends aside, the fourth p-thread candidate is better than the
third. Although the number of useless p-threads (computed by subtracting DCpt-cm from DCtrig) rises from 30 to 70 and
the overhead of each p-thread increases, the additional 2 cycles of latency tolerance achieved for each miss produces
a net gain.
Once the induction instruction, #11, is encountered in a slice, further p-thread growth generally comes from the
addition of instances of this instruction. This pre-execution idiom is called induction unrolling [2, 14] and it generates
most of the fetch advantage (lookahead) used by pre-execution to achieve latency tolerance. Each additional level of
unrolling provides the latency tolerance of one full loop iteration for the price of one additional instruction. Induction
unrolling falls naturally from dynamic backward slicing and is automatically performed to the level dictated by Lcm.
The final two p-thread candidates are similar: the fifth uses a single level of induction unrolling, the sixth unrolls
twice. The first unrolling provides the p-thread with an additional fetch advantage of 12 instructions over the main 
thread, which translates into 5 additional cycles of execution time advantage (SCDHdiff) for a total of 8. This is as 
much latency tolerance as we need. The score achieved for this p-thread is 177: full latency tolerance for 30 misses, at 
the cost of 63 overhead cycles. Predictably, the final candidate has worse projected performance. With full latency tol- 
erance already achieved, adding another level of unrolling only serves to increase overhead. 
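The scores quoted for the candidates can be reproduced with a few lines of arithmetic. The sketch below assumes, as the worked numbers suggest, that aggregate advantage is the latency tolerated over all correctly pre-executed misses minus the overhead paid over all launched p-threads. The function name is hypothetical, and the per-launch overhead of 0.63 cycles for the fifth candidate is inferred from the quoted 63 overhead cycles and an assumed 100 launches (one per instance of trigger #11); it is not stated in the text.

```python
def aggregate_advantage(dc_pt_cm, lt, dc_trig, oh_per_launch):
    """Latency tolerated over covered misses minus overhead over all launches.

    dc_pt_cm:      number of misses the p-thread correctly pre-executes (DCpt-cm)
    lt:            cycles of latency tolerated per covered miss
    dc_trig:       number of p-thread launches, one per trigger instance (DCtrig)
    oh_per_launch: overhead cycles charged per launched p-thread
    """
    return dc_pt_cm * lt - dc_trig * oh_per_launch


# Third candidate: 1 cycle tolerated for 30 misses, 0.375 cycles for each of 60 launches.
third = aggregate_advantage(dc_pt_cm=30, lt=1, dc_trig=60, oh_per_launch=0.375)

# Fifth candidate: 8 cycles tolerated for 30 misses, 63 overhead cycles in total
# (0.63 cycles per launch over 100 launches is an inferred split, see above).
fifth = aggregate_advantage(dc_pt_cm=30, lt=8, dc_trig=100, oh_per_launch=0.63)

print(third)
print(round(fifth, 6))
```

Under these assumptions the third candidate scores 7.5 and the fifth scores 177, matching the positive advantage and the score reported above.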
3.2 Slice Tree: Selecting Multiple P-Threads Simultaneously 
The pre-execution solution for a given program typically involves multiple p-threads. Even the simple problem we 
tackled in the previous section is not fully solved by a single p-thread. The p-thread we found covers only 30 of the 40 
possible misses. In this section we show how to select a set of p-threads by solving multiple, partially overlapping 
sub-problems simultaneously. The specific question we answer is: "given the set of computations for all the misses of 
a given static load, what is the best set of p-threads for pre-executing as many of those misses as possible?" 
One approach to any large problem is "divide-and-conquer". In p-thread selection, the naive divide-and-conquer 
approach does not work because the underlying assumption that aggregate advantage adds, ADVagg(A+B) =
ADVagg(A) + ADVagg(B), does not always hold. If two p-threads overlap (if at least one dynamic miss is pre-executed
by both of them) then their aggregate latency tolerances do not add. Once one p-thread has tolerated the
latency of a miss, a second p-thread cannot tolerate it again. To ensure that p-threads across individual sub-problems
do not overlap, we divide the p-thread selection problem for an entire program into sub-problems each of which
handles the misses of a different static load. To solve each sub-problem, we use a new data structure, the slice tree, that
naturally and precisely represents p-thread overlap. 
Slice tree. The slice tree is a tree of static backward slices with the static load at the root. Each instruction node in a 
slice tree represents a static p-thread whose trigger is that instruction. The p-thread is constructed by walking from the 
node to the root. Figure 3 shows the slice tree representing both slices from our example. To save space, we represent 
linear tree regions as tables. The figure shows two partially overlapping slices targeting the misses of instruction #09 
which is at the root of the tree (node A). The slice formed by the tables along the left path (nodes A-G) is the one we 
optimized in the previous section. The slice formed by the tables along the right path (nodes A-C,H-K) represents the 
"other" computation, the one that contains instruction #06 rather than instruction #04. The slices triggered by B and C 
are shared suffixes of the larger slices. When discussing p-threads in a slice tree, we talk about parent-child relation- 
ships. Given a parent p-thread, a child p-thread is constructed by extending the slice by one instruction. In the figure, 
C is a child p-thread of B, and D and H are children of p-thread C. 
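The structure described above can be sketched as a small tree class in which every node knows its parent, so that the p-thread for a trigger is read off by walking to the root. This is an illustration only: the class and method names are hypothetical, and the assignment of instructions #08 and #07 to nodes B and C is inferred from the instruction order of the example slices rather than read from Figure 3.

```python
class SliceNode:
    """One slice tree node: a static instruction that can act as a p-thread trigger."""

    def __init__(self, label, insn, parent=None):
        self.label = label        # node label used in the figure, e.g. "A"
        self.insn = insn          # static instruction, e.g. "#09"
        self.parent = parent      # next instruction on the way to the root
        self.children = []        # p-threads that extend this one by one instruction
        if parent is not None:
            parent.children.append(self)

    def p_thread(self):
        """Walk from this node to the root, collecting instructions in execution order."""
        body, node = [], self
        while node is not None:
            body.append(node.insn)
            node = node.parent
        return body


# Part of the running example: the shared suffix A-C, the left path down to F,
# and the first node of the right path (H).
a = SliceNode("A", "#09")          # root: the static load whose misses are targeted
b = SliceNode("B", "#08", a)       # assumed mapping
c = SliceNode("C", "#07", b)       # assumed mapping
d = SliceNode("D", "#04", c)
e = SliceNode("E", "#11", d)
f = SliceNode("F", "#11", e)
h = SliceNode("H", "#06", c)

print(d.p_thread())                      # ['#04', '#07', '#08', '#09']
print(f.p_thread())                      # ['#11', '#11', '#04', '#07', '#08', '#09']
print([k.label for k in c.children])     # ['D', 'H']
```

Storing only a parent pointer per node is enough to recover any p-thread, which is why shared suffixes such as B and C need to be represented once.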
Each slice tree node is annotated with information that summarizes the behavior of dynamic instances of the
corresponding static trigger and static p-thread. DISTpl is a concise representation of the average DISTtrig in the main
thread context: an instruction's DISTtrig with respect to any trigger is obtained by subtracting its DISTpl from the
trigger's. DCtrig and DCpt-cm were previously defined in Section 3.1. Note, DCtrig is a trigger property while DCpt-cm is a
p-thread property. For instance, DCtrig is the same for nodes E, F, G, I, J, and K as these are all instances of instruction
#11. However, DCpt-cm is different for each one of these nodes as each corresponds to a different static p-thread.
FIGURE 3. Slice Tree
The use of the slice tree to summarize information for a static p-thread across its dynamic executions means that tem- 
poral information is lost and that we cannot compute the interactions of p-threads with one another. Our framework 
calculates the aggregate effects for each static p-thread individually, implicitly assuming that dynamic p-threads do 
not interact, whether they be instances of the same static p-thread or not. This assumption is not egregious for L2 
cache misses which are infrequent enough to make the likelihood of concurrent p-threads low. Either way, we will- 
ingly trade this inaccuracy for the computational leverage and statistical support provided by summary information. 
P-thread overlap and addition of aggregate advantages. The degree of overlap between a p-thread and its parent is
defined by DCpt-cm. Consider the p-threads corresponding to nodes C, D, and H. P-thread C pre-executes all 40
misses, but cannot achieve latency tolerance for any of them. If we want a longer p-thread, one that can tolerate more
latency, we are faced with a choice of two. P-thread D and its children (E-G) will pre-execute only 30 of the misses.
P-thread H and its children (I-K) will target the other 10. That a parent p-thread's DCpt-cm is the sum of the DCpt-cm
of its children is an invariant. Also note, a parent-child relationship is the only possible source of overlap between two 
p-threads. P-thread A can either be longer and more specialized than p-thread B or shorter and less specialized, not 
both at the same time. 
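The invariant stated above is easy to check mechanically. The following sketch uses the DCpt-cm values given for nodes C, D, and H; the dictionary layout and the function name are hypothetical.

```python
def invariant_violations(dc_pt_cm, children):
    """Return labels of parents whose DCpt-cm is not the sum of their children's DCpt-cm."""
    return [
        parent
        for parent, kids in children.items()
        if kids and dc_pt_cm[parent] != sum(dc_pt_cm[k] for k in kids)
    ]


# Values from the running example: C covers all 40 misses, D covers 30, H covers 10.
dc_pt_cm = {"C": 40, "D": 30, "H": 10}
children = {"C": ["D", "H"]}

print(invariant_violations(dc_pt_cm, children))  # []
```

An empty result means every branching node splits its misses exactly among its children, which is what allows overlap to be reasoned about purely through parent-child relationships.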
Now that we understand the relationship between different p-threads in a slice, we can explain how to combine the 
aggregate advantages of two p-threads. If two p-threads are not a parent and child (either directly or indirectly) then 
their aggregate advantages simply add. If the two are a parent and child, then there is some component of aggregate 
latency tolerance that is counted in both, and this double counted component must be subtracted from the total. The 
number of misses attacked by both p-threads is given by the child's DCpt-cm. The amount of latency that is
"double-tolerated" for each one of these is LT of the parent. Because the parent thread tolerates less latency per miss,
we typically associate advantage reduction with the parent p-thread: the parent's aggregate advantage ADVagg(P) is
reduced by DCpt-cm(C) × LT(P), where P is the parent and C is the child.
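To make the overlap correction concrete, the sketch below combines the advantages of a parent and a child by removing the latency tolerance that would otherwise be counted twice. It assumes that the third candidate of Section 3.1 corresponds to node D (aggregate advantage 30 × 1 - 60 × 0.375 = 7.5, LT of 1 cycle) and that the fifth candidate corresponds to node F (score 177, 30 misses covered); the function name is hypothetical.

```python
def combined_advantage(adv_parent, lt_parent, adv_child, dc_pt_cm_child):
    """Sum two aggregate advantages, charging the double-counted latency to the parent."""
    double_counted = dc_pt_cm_child * lt_parent   # misses covered by both, times parent LT
    return (adv_parent - double_counted) + adv_child


# Parent D (third candidate) and its indirect child F (fifth candidate).
both = combined_advantage(adv_parent=7.5, lt_parent=1, adv_child=177, dc_pt_cm_child=30)
print(both)  # 154.5
```

With these numbers, selecting D together with F would yield 154.5, which is less than the 177 obtained from F alone: once F tolerates the latency of the 30 misses, D contributes only overhead.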
The solution of a composite p-thread selection problem (covering the misses of a single static load) is the set of p- 
threads whose aggregate advantages-where latency tolerance reductions due to overlap have been accounted for- 
sum to a maximum. Because p-threads within a slice tree obey certain relationships to one another, we can find this 
set using an iterative procedure rather than an exhaustive search. For each leaf (separate linear slice) in the slice tree, 
we select a p-thread as in the previous section. If any of the independently selected p-threads overlap, we reduce the 
advantages of the parent p-threads and reselect. The process terminates once the reductions performed in one iteration 
do not influence the p-threads selected in the next iteration. 
Working example. Obtaining a complete solution for the slice tree in our example is trivial. Selecting the two p- 
threads separately we find that the best p-thread along the left hand side of the tree is p-thread F (found in the previ- 
ous section) and that the best p-thread along the right side of the tree is p-thread J. Since these two p-threads do not 
overlap, no corrections must be made, and no further iterations are necessary. 
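The iterative procedure can be sketched as follows. This is one possible reading of the description rather than the framework's actual algorithm: each leaf-to-root path picks its best p-thread, a candidate's advantage is reduced by the latency already claimed by descendants picked on other paths, and the loop stops when a full pass changes nothing. The tree is abbreviated (nodes A, B, E, and I are omitted), and every advantage and LT value except those of D (7.5) and F (177) is made up for illustration.

```python
def select_p_threads(nodes, max_iters=10):
    """nodes maps label -> dict(parent, adv, lt, dc_pt_cm); returns the selected labels."""
    parents = {n["parent"] for n in nodes.values()}
    leaves = [label for label in nodes if label not in parents]

    def path(label):
        chain = []
        while label is not None:
            chain.append(label)
            label = nodes[label]["parent"]
        return chain                      # leaf first, root last

    choice = {leaf: None for leaf in leaves}

    def score(cand, leaf):
        # Start from the raw advantage and subtract latency tolerance already
        # claimed by descendants of `cand` that other paths have selected.
        others = {p for l, p in choice.items() if l != leaf and p is not None}
        s = nodes[cand]["adv"]
        for picked in others:
            if picked != cand and cand in path(picked):
                s -= nodes[picked]["dc_pt_cm"] * nodes[cand]["lt"]
        return s

    for _ in range(max_iters):
        changed = False
        for leaf in leaves:
            best = max(path(leaf), key=lambda c: score(c, leaf))
            pick = best if score(best, leaf) > 0 else None
            if pick != choice[leaf]:
                choice[leaf], changed = pick, True
        if not changed:                   # reductions no longer influence selection
            break
    return {p for p in choice.values() if p is not None}


nodes = {
    "C": dict(parent=None, adv=-10.0, lt=0, dc_pt_cm=40),   # hypothetical advantage
    "D": dict(parent="C", adv=7.5, lt=1, dc_pt_cm=30),
    "F": dict(parent="D", adv=177.0, lt=8, dc_pt_cm=30),
    "G": dict(parent="F", adv=170.0, lt=8, dc_pt_cm=30),    # hypothetical advantage
    "H": dict(parent="C", adv=1.0, lt=1, dc_pt_cm=10),      # hypothetical values
    "J": dict(parent="H", adv=40.0, lt=8, dc_pt_cm=10),     # hypothetical values
    "K": dict(parent="J", adv=35.0, lt=8, dc_pt_cm=10),     # hypothetical values
}

print(sorted(select_p_threads(nodes)))  # ['F', 'J']
```

Because F and J lie on different branches, no reduction is ever applied and the second pass confirms the first, mirroring the working example.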
3.3 Framework Extensions: Merging and Optimization 
In addition to the basic facilities for solving the static p-thread selection problem, our framework contains two 
enhancements. First, we support merging of partially redundant p-threads to reduce overhead. Second, we support p- 
thread optimization or specialization. Both of these capabilities are automated. 
Merging. The two p-threads chosen in the previous section do not overlap from a cache miss standpoint: each targets
a disjoint set of misses. However, many of the instructions dynamically executed by these p-threads (e.g., every
instance of instruction #11) are redundant. Rather than execute two separate p-threads, one with instructions #11,
#04, #07, #08, and #09 and one with instructions #11, #06, #07, #08, and #09, we create a single p-thread with
instructions #11, #04, #06, #07, #08, and #09 that captures both computations. A merged p-thread achieves the same latency
tolerance as separate instances of each of the original p-threads and incurs less overhead. Our merging algorithm
merges p-threads with matching data-flow prefixes. A p-thread's data-flow prefix is its trigger instruction plus any
contiguous chunk of data-flow sub-graph connected directly to any other instruction external to the p-thread (trigger
included). Merging proceeds in dataflow order with register renaming and code duplication performed as needed to
preserve the computational semantics of each of the original component p-threads. In our example, instructions #07,
#08 and #09 cannot be merged in the final p-thread; they must be replicated: one copy must take the computation that
contains instruction #04 to its completion and the other must complete the computation that contains #06.
Optimization. Optimized p-threads are p-threads that are not exact copies of dynamic computations from the program,
but rather specialized versions of them. We fit p-thread optimization into our framework by allowing the calculations
for SCDHpt and SIZEpt to use any sequence of instructions that is functionally equivalent to the actual sub-slice.
P-thread optimization is both easier and more productive than full program optimization. First, since p-threads
are control-less, traditional control-flow and iterative data-flow analyses are replaced by a simple linear scan. Second,
only optimizations that are enabled by the highly specialized nature of the p-thread need be considered. Register
allocation was already performed by the compiler that generated the initial program and scheduling is unnecessary since
a p-thread is a single computation. We have found that store-load pair elimination and constant folding capture most
p-thread optimization opportunities. Figure 2 contains one optimization opportunity: in the final candidate, the two
instances of instruction #11 (addi R5, R5, #16) may be folded into a single instruction (addi R5, R5, #32). This
optimization reduces both p-thread latency (the height of the dataflow graph is cut by one instruction) and overhead.
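The folding of unrolled induction instructions lends itself to the simple linear scan mentioned above. The sketch below is a minimal peephole pass, not the framework's optimizer: the tuple encoding of instructions and the function name are hypothetical, and only adjacent addi instructions that update the same register in place are folded.

```python
def fold_addi(insns):
    """Fold adjacent `addi rd, rd, #imm` instructions on the same register into one.

    Each instruction is a tuple (opcode, dest, src, immediate).
    """
    out = []
    for op, dst, src, imm in insns:
        prev = out[-1] if out else None
        if (prev is not None and op == "addi" and dst == src
                and prev[0] == "addi" and prev[1] == dst and prev[2] == dst):
            out[-1] = ("addi", dst, dst, prev[3] + imm)   # merge the two increments
        else:
            out.append((op, dst, src, imm))
    return out


# Two levels of induction unrolling: two back-to-back copies of addi R5, R5, #16.
unrolled = [("addi", "R5", "R5", 16), ("addi", "R5", "R5", 16)]
print(fold_addi(unrolled))  # [('addi', 'R5', 'R5', 32)]
```

Because a p-thread has no control flow, a single left-to-right pass like this is sufficient; no iterative data-flow analysis is needed to know that the two increments are adjacent and dependent.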
4 Experimental Evaluation 
We evaluate our p-thread selection framework's capacity for selecting p-threads that target L2 misses. In section 4.2, 
we validate the framework's performance model by comparing predicted statistics against statistics measured from 
pre-execution simulations. In sections 4.3 and 4.4, we measure the framework's response to variations in several p- 
thread and machine parameters, respectively. 
4.1 Methodology 
Our experiments use a suite of tools built using the SimpleScalar Alpha AXP ISA and system call modules. A functional
cache simulator generates program traces and constructs backward slices of all dynamic L2 misses and collects
them into slice trees which are written out to files. The p-thread selection tool takes a slice tree file and parameters
describing the processor (sequencing width, memory latency), unoptimized program performance (IPC), and p-thread
construction constraints (maximum p-thread length, optimization level) and produces a list of static p-threads. This
arrangement allows multiple p-thread sets for the same cache configuration but different pipeline, latency and p-thread
optimization configurations to be generated quickly. Our default configuration has a maximum slicing scope of
1024 instructions, maximum p-thread length of 32 instructions, and full merging and p-thread optimization.
TABLE 1 (only the first row of values is recoverable; the remaining rows are Loads (M), L2 misses (M), IPC, and Perfect L2 IPC)
                   bzip2    crafty   gap     gcc     mcf     parser   twolf    vortex   vpr.p   vpr.r
Instructions (M)   6000.00  2600.00  900.00  500.00  900.00  1300.00  1300.00  1700.00  300.00  1100.00
Performance results are obtained via a detailed timing simulator that models a parametrizable pipeline with register 
renaming, reservation stations, load speculation, and an event-driven memory hierarchy with realistic bandwidth
contention. Our base configuration is an 8-wide dynamically scheduled processor, with a 14 stage pipeline, 80 reservation
stations, and a maximum of 128 instructions in-flight. The front-end has a 32KB, 2-way set-associative
instruction cache, 32-entry TLB, and 6K-entry hybrid branch predictor with 2K-entry BTB. The data-memory system 
includes a 16KB, 32B line, 2-way set associative, 2-cycle access, write-back data cache, a 256KB, 64B line, 4-way 
set-associative, 6-cycle access second-level cache, and a 32-entry TLB. Store-to-load forwarding via a 64-entry store 
queue also takes 2 cycles, with all memory accesses preceded by 1-cycle address generation. We model an infinite 
main memory with 70 cycle access latency, a 32B wide backside bus clocked at processor frequency, and a 32B mem- 
ory bus clocked at one fourth processor frequency. 32 simultaneously outstanding misses are allowed. 
The simulator models the run-time functions of pre-execution. A p-thread is launched when the main thread renames 
the corresponding trigger. The p-thread is allocated to one of three additional thread contexts or dropped if no context 
is available. P-thread instructions are injected into the execution core at register renaming in a bursty fashion, 8 
instructions once every 8 cycles per active p-thread. P-thread instructions are allocated physical registers and reserva- 
tion stations and contend with main thread instructions for these resources and for scheduling slots. A p-thread con- 
text is freed when all p-thread instructions have been renamed. Physical registers allocated to p-thread instructions are 
recycled in a circular fashion and our simulator models an additional 64 physical registers for p-thread use. The
simulator does not model the p-thread selection/pre-execution interface, assuming that p-threads are accessible in one
cycle from an ideal p-thread cache that experiences no misses. Our (untested) assumption is that this interface has no
first-order performance effects. Because our experiments target L2 misses, we disable the data cache fill path for
p-thread loads: p-thread loads prefetch only into the L2. While prefetching into the first level cache improves
performance, it perturbs our ability to validate the framework's performance model, an important aspect of this evaluation.
All simulation tools exploit sampling, cycling through off (fast-forwarding), warm-up (caches and branch predictor
only) and on (full detail) phases at regular intervals. We have performed experiments (not shown) which confirm that,
by both miss rates and IPCs, cyclic sampling is "equivalent" to unsampled execution. Our experiments use the
TABLE 2. Basic results and performance model validation 
[...] performance numbers are reported using the training inputs sampled at 100M of every 1B instructions with 10M
instruction warm-up phases. Unless otherwise noted, p-threads are selected using the same program sample on which they are
subsequently measured. This arrangement allows our framework to produce performance predictions which may then
be checked. A relevant benchmark characterization is shown in Table 1. Of the 16 benchmark/input combinations, we
use only 10; eon (3 inputs), gzip, and perlbmk (2 inputs) exhibit negligible L2 miss rates.
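The sampling schedule can be illustrated with a small phase function. The sketch assumes that each 1B-instruction period runs the phases in the order off, warm-up, on, and that the 10M warm-up immediately precedes the 100M detailed phase that closes the period; the text gives the phase lengths but not their placement, and the names used here are hypothetical.

```python
PERIOD = 1_000_000_000   # one sampling period: 1B instructions
WARMUP = 10_000_000      # warm-up: caches and branch predictor only
ON = 100_000_000         # full-detail simulation


def phase(insn_index):
    """Return the sampling phase for the dynamic instruction at insn_index."""
    pos = insn_index % PERIOD
    if pos < PERIOD - ON - WARMUP:
        return "off"          # fast-forwarding
    if pos < PERIOD - ON:
        return "warm-up"
    return "on"


print(phase(0))               # off
print(phase(890_000_000))     # warm-up
print(phase(900_000_000))     # on
```

Under this layout 10% of all instructions are simulated in full detail, and every detailed phase starts with caches and branch predictor that have seen the preceding 10M instructions.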
4.2 Primary Performance Results 
The middle section of Table 2 (Pre-exec, the shaded section) shows the performance of the p-threads selected by our
framework. In addition to IPC, we list the number of p-threads launched, the average number of instructions per
p-thread and L2 misses covered both in part and in full. A miss is fully covered if the p-thread initiates the miss far
enough in advance to overlap the full latency of the miss. If the latency is only partially overlapped, the miss is
partially covered. Overhead IPC and Latency tolerance IPC are produced by simulations which model only p-thread cost or
benefit and are used in the next section to validate the model. The results show that the p-threads selected by our
framework generally improve performance. The p-threads cover between 10% (mcf) and 82% (vpr.p) of the L2 misses in the
program (with full coverage in general achieved for about half of all misses covered) and result in performance
improvements of up to 24% (vpr.r). One benchmark, crafty, experiences a 1% performance degradation due to the
addition of p-threads. These results are good in absolute terms, but the point of using a quantitative framework is to
obtain the best possible results. In other words, we want assurance that the reason only 10% of mcf's L2 misses are
covered is that the structure of the program is such that our pre-execution model cannot be used to cover the other
90%, not because the framework couldn't find them. We devote the next section, and parts of the following sections,
to constructing experiments that increase our confidence in this regard.
4.3 Model Validation 
One way to gain confidence in our framework is to compare the performance of  p-threads it produces with that of all 
other sets of possible p-threads. This approach is infeasible. We take a different tack based on the following argument. 
Our framework uses standard optimization techniques to find good solutions for a given function. That we are sure of. 
What we are not sure of is whether the function our framework optimizes, ADVagg, accurately models reality such 
that solutions which are good in the model space are also good in the real world. Fortunately, the modeling fidelity of 
ADVagg is easier to verify. In performing its selections, our framework implicitly makes diagnostic predictions of
p-thread behavior. We check these predictions against simulated measurements. If they align, we can be confident that
ADVagg is the appropriate p-thread evaluation function, and that our framework produces solutions that are good in 
the real world. The bottom portion of Table 2 (Predict) lists the framework predictions that allow us to perform this 
validation. We validate overhead and latency tolerance individually to better identify model inaccuracies. 
Overhead. Three diagnostics check our overhead model: number of p-threads launched, average number of  instruc- 
tions per dynamic p-thread, and performance of an overhead-only implementation. We address the last diagnostic 
first. Overhead performance degradation is measured in two ways. In the first (execute), p-threads execute as usual, 
but do not access the data cache (thus do not have the pre-execution effect). In the second (sequence), p-thread 
instructions consume sequencing cycles but are immediately discarded. The first simulation measures true p-thread 
overhead, the second measures overhead as modeled by our framework (i.e., the only cost of a p-thread instruction is 
the bandwidth used to sequence it). As we see in the table, these simulations often produce identical results indicating 
that our "overhead as sequencing bandwidth consumption" assumption is valid. 
Estimates of performance loss due to overhead are generally accurate. Predictions of average p-thread length are self- 
fulfilling. Occasional p-thread launch count over-estimation (e.g., bzip2) is due to the finite number of p-thread con- 
texts-a p-thread launch request is dropped if a thread context is not immediately available. This effect is especially 
prominent in programs that require many p-threads. Typically, however, p-thread launch counts are under-estimated 
due to the fact that our framework does not account for p-threads launched from wrong-path trigger instructions. We 
have run simulations in which p-threads are launched only from correct path triggers and observed nearly perfect cor- 
relation between predicted and simulated launch counts. Interestingly, wrong-path p-thread launches do not increase 
overhead. Since the majority of p-threads are short and are sequenced within a cycle or two of launch, wrong-path p- 
threads primarily contend with wrong-path instructions whose latency does not directly impact performance. 
Latency Tolerance. Latency tolerance is also validated via three diagnostics: L2 misses covered (i.e., turned into 
either full or partial hits), L2 misses fully covered (i.e., turned into full hits) and performance of a latency-tolerance- 
only implementation. Miss coverage is measured by timestamping cache blocks with p-thread request, main thread 
request, and ready times. Fully and partially covered misses are detected by the appropriate relationships between 
timestamps and are tabulated at instruction retirement to avoid overinflating the counts with wrong-path data. 
Latency-tolerance impact is measured via an additional simulation in which p-threads are not charged for bandwidth. 
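The timestamp comparison behind the coverage diagnostics can be sketched as below. The text does not spell out the exact relationships, so this is a plausible reading rather than the simulator's logic: a miss counts as covered if a p-thread requested the block before the main thread did, and as fully covered if the block was also ready by the time the main thread asked for it. The function and argument names are hypothetical.

```python
def classify_miss(pt_request, mt_request, ready):
    """Classify one cache block access using its three timestamps (in cycles).

    pt_request: cycle at which a p-thread requested the block, or None if none did
    mt_request: cycle at which the main thread requested the block
    ready:      cycle at which the block arrived in the L2
    """
    if pt_request is None or pt_request >= mt_request:
        return "uncovered"    # the main thread initiated the miss itself
    if ready <= mt_request:
        return "full"         # the whole miss latency was overlapped
    return "partial"          # the main thread still waits for the remainder


# A p-thread issues the miss at cycle 100; with a 70-cycle memory latency it is ready at 170.
print(classify_miss(100, 180, 170))    # full
print(classify_miss(100, 150, 170))    # partial
print(classify_miss(None, 150, 220))   # uncovered
```

Tabulating these outcomes only for retired main thread loads, as described above, keeps wrong-path accesses from inflating the counts.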
Miss coverage, both full and partial, is more difficult to predict than overhead, as there are many factors which affect
miss coverage that are not considered by ADVagg. Miss coverage over-estimation (too few misses actually covered) is
the result of p-thread issue delays caused by contention with the main thread and other p-threads. Full miss coverage
over-estimation (too few actual full miss coverages) implies post-issue delays for p-thread misses, with the primary
source being contention in the memory bus. Full miss coverage under-estimation (too many actual full miss coverages,
a good problem to have) indicates main thread delays, primarily due to branch mis-predictions but also to contention
with p-threads. Miss coverage under-estimation (too many actual misses covered) implies the presence of
unintentional L2 prefetches within a p-thread. A given static load may not have statistical character that merits the
construction of p-threads for its own sake. However, such loads are sometimes embedded in p-threads targeted for 
other loads or consistently share cache lines with such embedded loads. Each of these factors acts to some degree in 
every benchmark, with the net effect determined by the dominant phenomena. For instance, full miss-coverage
under-estimation is dominant in benchmarks with high branch mis-prediction rates (e.g., crafty, gcc and vpr.r).
Unfortunately, the simulated metric that most poorly correlates with its predicted value is performance improvement. 
Even more unfortunately, it is generally overestimated. The primary cause is a single assumption built into our frame- 
work-that miss latency translates cycle for cycle into execution latency and, therefore, that miss latency tolerance 
translates cycle for cycle into performance improvement. Effectively, we assume that L2 misses are handled serially 
and are not overlapped with the execution of any instructions in the program. This is obviously not the case-in a 
dynamically scheduled processor, some degree of overlapping, either with other misses or other instructions in the 
program, is almost always possible-meaning that our framework is fooled into believing that there is more latency 
to tolerate than actually exists. An interesting piece of future work is to combine our framework with a critical path 
model [4] that can assign a "true" latency to each miss. Finally, while never completely true for L2 misses, the serial- 
ization assumption is even less true for L1 misses. Our experience shows that, while our framework easily finds p- 
threads that target L1 misses, its predictions in that scenario are less accurate. 
We have presented a single validation experiment on a single design point in the processor and p-thread selection con- 
figuration space. We have performed similar experiments with other configurations-narrower processors, slower 
memories, and different thread selection parameters-and have obtained similar qualitative results. While our frame- 
work does not always predict end performance and diagnostics with perfect accuracy, in many cases it comes quite 
close. This suggests that ADVagg accurately captures pre-execution behavior under many conditions.
4.4 Sensitivity to P-Thread Selection Parameters 
In this section and the next we measure our framework's response-i.e., changes in the p-threads it produces-to 
variations in p-thread selection parameters (this section) and the underlying microarchitecture (Section 4.5). By mea- 
suring this response, we directly observe the performance potential of pre-execution under those same conditions. 
From this point forward, our results are presented graphically. The graphs all have a format similar to the one in Fig- 
ure 4. Each bar in a group shows the results of one experiment using five diagnostics. Miss coverage (dark grey, top) 
and full miss coverage (light gray, bottom) are shown as stacked bars. Their units are in percentages of the number of 
L2 misses in the unoptimized program. Overhead is shown as a tick in each column and is computed as the number of 
FIGURE 4. Combined impact of slicing scope and p-thread length. (Legend: instruction overhead, p-thread length, L2 miss
coverage, L2 miss full coverage, percent speedup. Benchmarks: bzip2, crafty, gap, gcc, mcf, parser, twolf, vortex,
vpr.p, vpr.r.)
p-thread instructions executed over the number of instructions retired by the main thread. Average dynamic p-thread 
length is shown as a cross mark. Finally, percent speedup over the base configuration is shown in text over each bar. 
Slicing scope and p-thread length. Our first test measures the effects of slicing scope, the length of the dynamic
trace that is examined to construct a p-thread, and p-thread length. Limiting these is a concession to the finite
buffering of the p-thread constructor. Limiting p-thread length is also a concession to the implementation of the p-thread
memory hierarchy. Figure 4 characterizes the performance of p-threads selected in four scope/length combinations. 
For instance, the left bar corresponds to a scope limit of 256 instructions and a maximum p-thread length of 8 instruc- 
tions. 
Two intuitive and comforting trends are evident. First, actual p-thread length, miss coverage, full miss coverage, and 
performance increase as p-thread selection constraints are relaxed. Second, they all saturate at some point and do not 
benefit from further relaxations. These trends imply that pre-execution performance potential is a strong property of 
program structure, that each combination of program and processor configuration has a natural set of p-threads and,
more significantly, that our framework gravitates to this natural set. Quantitatively, the pre-execution performance of
most programs effectively saturates at slicing windows of 512 instructions and with post-optimization lengths of 16 
instructions, although several programs (e.g., vortex) benefit from further relaxations. 
For brevity, we only present the combined effects of length and scope restrictions. The importance of each constraint 
individually varies from one benchmark to the next. Most programs are more sensitive to p-thread length constraints, 
unable to achieve any gain whatsoever with very short p-threads, even with large (2K instruction) slicing windows. 
This is an indication that miss computations in these programs are dense in the locus leading up to the miss: small
computations are unable to obtain any sequencing advantage. Two programs which are more sensitive to scope
restrictions are parser and twolf. This is the signature of the complementary program structure, sparse computations
which can achieve latency tolerance with small computations, but need large windows to "see" these computations.
P-Thread optimization and merging. One of the stated strengths of our framework is its support for p-thread
optimization and merging. As Figure 5 shows, the addition of p-thread optimization and merging can have a profound
performance impact on pre-execution, as vpr.r demonstrates.
Optimization reduces average p-thread length, often significantly (e.g., crafty, parser, vortex, and vpr). Less intu- 
itively perhaps, it also often results in a significant increase in p-thread launches. With optimization reducing p-thread 
FIGURE 5. Impact of p-thread optimization and merging.
FIGURE 6. Impact of p-thread selection granularity. (Legend: instruction overhead, p-thread length, L2 miss coverage,
L2 miss full coverage, percent speedup.)
length, increasing latency tolerance by compressing p-thread dataflow graphs and decreasing overhead, p-threads that 
were either unprofitable or illegal (i.e., too long) in their unoptimized forms become viable. An increase in the num- 
ber of viable p-thread candidates results in increased miss coverage, more complete latency tolerance for covered 
misses, and improved performance. Vortex and vpr.r are prime examples of the power of this secondary effect, which
is much stronger than the primary effect of overhead reduction for pre-existing p-threads. Of our three optimizations, 
we have found store-load pair elimination to be the most effective. Register-move elimination has almost no impact. 
Intuitively, our experience shows that optimization becomes increasingly effective as length constraints are tightened. 
Merging primarily reduces overhead but does not increase the number of viable p-thread candidates-its performance 
effects are thus less pronounced. Merging increases p-thread length while decreasing the p-thread launch counts. 
Merging generally improves performance by reducing contention for p-thread contexts, although it can occasionally 
increase contention by creating long p-threads that occupy a single context for several 8-cycle sequencing periods. 
P-thread selection granularity. Our default selection granularity is an entire (sampled) run of the program. Figure 6 
compares this coarse grain approach with finer grain strategies in which p-threads are specialized for dynamic pro- 
gram regions of 100, 10, and 1 million instructions each. Intuition says that breaking the program into smaller chunks 
and optimizing each chunk separately will produce more highly specialized p-threads and higher performance. After 
all, at the limit of this process we would find a custom p-thread for every dynamic L2 miss! 
The trends we expect to see are those we observe in bzip2, parser, vpr.r or the first three bars of gap and gcc:
increased miss coverage, reduced overhead, and increased performance. Although these trends appear frequently,
they are not consistent or even monotonic within a benchmark. Counterintuitive trends are most often seen at the
finest (1 million instruction) granularity, although they may appear sooner, as they do in vortex. Finer selection
granularities do not always mean increased miss coverage. If a p-thread is deemed profitable over a coarse-grain
region but not in all of its finer-grain sub-regions, coverage for any misses that occur in the unselected
sub-regions is lost. This phenomenon suggests a slight miscalculation by our framework at fine granularities,
specifically one that is overly biased towards overhead. One reason for this may be that our overhead model is not
quite accurate at very high or very low IPCs, which are typically seen only over small dynamic regions.
Overall, the consistency of results across grains suggests a certain amount of self-similarity in programs and builds 
confidence in our approach of using coarse-grain information (IPC) to model microscopic behavior. 
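
The coverage loss described above can be illustrated with a small sketch. It is not the framework's aggregate advantage formula: the net-benefit rule, the RegionProfile type, and all numbers below are hypothetical stand-ins chosen only to show how a p-thread that is profitable over a whole run can be rejected in one sub-region, taking that sub-region's covered misses with it.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RegionProfile:
    """Hypothetical per-region statistics for one candidate p-thread."""
    launches: int        # dynamic p-thread launches in the region
    misses_covered: int  # L2 misses the p-thread would cover in the region


def net_benefit(launches: int, misses_covered: int,
                cycles_saved_per_miss: int, overhead_per_launch: int) -> int:
    # Assumed stand-in for a profitability test: cycles saved minus cycles spent.
    return misses_covered * cycles_saved_per_miss - launches * overhead_per_launch


def covered_misses(regions: List[RegionProfile], cycles_saved_per_miss: int,
                   overhead_per_launch: int, fine_grain: bool) -> int:
    """Misses covered when selecting per region (fine) or over the whole run (coarse)."""
    if not fine_grain:
        launches = sum(r.launches for r in regions)
        misses = sum(r.misses_covered for r in regions)
        selected = net_benefit(launches, misses,
                               cycles_saved_per_miss, overhead_per_launch) > 0
        return misses if selected else 0
    return sum(
        r.misses_covered
        for r in regions
        if net_benefit(r.launches, r.misses_covered,
                       cycles_saved_per_miss, overhead_per_launch) > 0
    )


# Two sub-regions: the p-thread pays off in the first but not in the second.
regions = [RegionProfile(launches=100, misses_covered=80),
           RegionProfile(launches=100, misses_covered=5)]
coarse = covered_misses(regions, cycles_saved_per_miss=50,
                        overhead_per_launch=4, fine_grain=False)
fine = covered_misses(regions, cycles_saved_per_miss=50,
                      overhead_per_launch=4, fine_grain=True)
print(coarse, fine)
```

Under these assumed numbers the coarse-grain selection keeps the p-thread everywhere, while the fine-grain selection drops it in the second sub-region and loses the few misses it would have covered there. Whether dropping it is the right decision depends on how accurate the overhead estimate is for that sub-region.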
FIGURE 7. Impact of p-thread selection input data-set.
[Plot not reproduced. Legend: instruction overhead, p-thread length, L2 miss coverage, L2 miss full coverage,
percent speedup. Benchmarks: bzip2, crafty, gap, gcc, mcf, parser, twolf, vortex, vpr.p, vpr.r.]
Input Data Sets. Our evaluation to this point has shown that automated p-thread selection is possible given perfect 
information: a previous execution of the same program using the same input data. In reality, such information is
never available. For our final experiment in p-thread selection parameter space, we vary the input data set used to con- 
struct p-threads to test the viability of performing automated p-thread selection in real world scenarios. We consider a 
dynamic scenario in which p-threads are selected on-line (ignoring for a moment that the exhaustive nature of our 
framework is not conducive to on-line implementation) using small profiling program phases. The dynamic scenario 
models p-thread selection as it would be implemented by a profile-driven, dynamically optimizing just-in-time com- 
piler. We also consider a static scenario in which p-threads are selected using profiles from test inputs. The static sce- 
nario models p-thread selection implemented by a profile-driven static compiler. A simulated performance evaluation 
of p-threads selected under each scenario is shown in Figure 7. 
The figure shows that good p-thread selection is theoretically possible under these two realistic implementation sce- 
narios. P-threads selected in the dynamic scenario often approach the performance of those selected with perfect 
information. This is a testament to the fact that programs have a finite number of characteristic behaviors (determined 
by program structure) and further proof that our sampling methodology is sound. The less encouraging results from 
the static scenario are the product of our choice of test inputs, which use small data sets and incur significantly
fewer L2 misses. In fact, the test data working sets for twolf and vpr.p fit into our L2 cache, resulting in no p-threads being
selected for those two benchmarks in the static scenario. However, for most other programs, static information is 
nearly as effective as dynamic information and even perfect information. This result reinforces our belief that p- 
threads and pre-execution performance potential are most strongly a function of program structure. 
Occasionally, imperfect information yields better results than perfect information. Our framework makes several 
unrealistic assumptions (primarily that miss latency translates directly to program execution latency), which we can
correct by essentially "lying" to the framework via adjusted parameters. In mcf, the framework's interpretation of the
characteristics of the dynamic sample matches the true characteristics of the perfect sample better than does its
interpretation of the perfect sample itself. While manifestations of this problem are rare, they do suggest that future work
may be needed to free the framework from these assumptions. 
4.5 Sensitivity to Machine Parameters 
An important aspect of our framework is its ability to accurately parametrize important processor features. In the 
interest of space, we cannot validate our framework's response to many parameters. We demonstrate our methodol- 
ogy using one important parameter, memory latency, which directly impacts latency tolerance requirements and 
hence p-thread structure. We have performed similar validation experiments on other parameters including processor 
width and L2 size, with similar results. 
Intuition says that our framework models the underlying microarchitecture correctly if for a given configuration, the 
p-threads selected with that configuration in mind are better than p-threads selected for another configuration. We 
conduct a limited version of this study. For a given parameter X, let $t_{X_1}$ be the set of p-threads chosen using
statistics obtained from a configuration where X is $X_1$, and let $p_{X_1}(t)$ be the performance of a processor in
which the parameter has the value $X_1$ pre-executing the set of p-threads t. Within each study, we perform four
experiments on two configurations, $X_1$ and $X_2$: $p_{X_1}(t_{X_1})$, $p_{X_2}(t_{X_2})$, $p_{X_1}(t_{X_2})$, and
$p_{X_2}(t_{X_1})$. We gain confidence in our framework's model of parameter X if
$p_{X_1}(t_{X_1}) > p_{X_1}(t_{X_2})$ and $p_{X_2}(t_{X_2}) > p_{X_2}(t_{X_1})$. This cross-validation also allows us
to test the framework's response to both under- and over-specification of parameter X.
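
The cross-validation criterion above can be written down as a short check. This is only a sketch: the function name cross_validate, the perf callback, and the speedup numbers are hypothetical and are not taken from the measurements reported in this paper.

```python
from typing import Callable, Dict


def cross_validate(perf: Callable[[int, int], float], x1: int, x2: int) -> Dict[str, bool]:
    """Check the self- vs. cross-validation criterion for one parameter.

    perf(simulated, assumed) is the performance of a processor whose parameter
    is `simulated`, pre-executing p-threads selected assuming it is `assumed`.
    """
    self_1, cross_1 = perf(x1, x1), perf(x1, x2)
    self_2, cross_2 = perf(x2, x2), perf(x2, x1)
    x1_ok = self_1 > cross_1  # p-threads chosen for x1 win on an x1 machine
    x2_ok = self_2 > cross_2  # p-threads chosen for x2 win on an x2 machine
    return {"x1_ok": x1_ok, "x2_ok": x2_ok, "validated": x1_ok and x2_ok}


# Hypothetical speedups keyed by (simulated latency, assumed latency), in cycles.
speedups = {
    (140, 140): 1.20, (140, 70): 1.12,
    (70, 70): 1.10, (70, 140): 1.06,
}
result = cross_validate(lambda sim, sel: speedups[(sim, sel)], 140, 70)
print(result)
```

A result with "validated" set to False for some benchmark does not necessarily indicate a bug in the selection process; it may point to an effect that the framework does not model.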
Memory Latency. Results for memory latency are shown in Figure 8. We show four experiments split into two 
groups. Within each group, the simulated memory latency value is constant. Within each group, the right bar is the 
self-validation experiment and the left bar is the cross-validation experiment. In the left group, simulated memory 
latency is 140 cycles. In this group, the left bar corresponds to p-threads selected assuming 70 cycle memory latency 
(cross-validation) and the right bar to p-threads selected assuming 140 cycle latency (self-validation). In the right bar 
group, the roles are reversed. Here, we simulate a memory latency of 70 cycles, meaning that p-threads chosen 
assuming 70 cycle latency are used for self validation and p-threads chosen with a 140 cycle latency in mind are used 
for cross-validation. Due to our cross-validation methodology, the inner columns of a given bar group are similar to 
each other, as are the outer two columns. 
Within each bar group, we are interested in two trends. First, we expect the self-validation experiments to outperform 
the corresponding cross-validation experiments. Second, we can intuitively gauge the framework's response to varia- 
tions in a given parameter by comparing the self-validation experiments to one another. 
Comparing $p_{70}(t_{70})$ with $p_{140}(t_{140})$ shows that our framework responds to memory latency variations in an
intuitive way. A latency increase results in the selection of longer p-threads which cover fewer misses and fully cover
fewer still. That this is the correct response is confirmed by cross-validation.
FIGURE 8. Response to variations in memory latency.
[Plot not reproduced. Legend: instruction overhead, p-thread length, L2 miss coverage, L2 miss full coverage,
percent speedup. Benchmarks: bzip2, crafty, gap, gcc, mcf, parser, twolf, vortex, vpr.p, vpr.r.]
In $p_{70}(t_{140})$ (third bar in each group) we over-specify memory latency. We pretend that the processor has more
latency to tolerate than it actually does, causing the framework to produce more aggressive p-threads that more
completely cover the latency that does exist. The framework does, in fact, produce longer p-threads and these fully
cover more misses (the light gray bars are highest in this group). However, end performance is mixed. On most
benchmarks, the expected effect is observed. As there is no more actual latency to tolerate, longer p-threads provide
no additional benefit. At the same time, fewer total misses are covered as an increase in perceived latency makes
previously viable p-threads unprofitable, with the net result being performance loss. However, on several benchmarks,
the reverse happens. On these benchmarks, high memory bus contention (which our framework does not explicitly model)
effectively increases memory latency and exploits the increased latency tolerance of the longer p-threads. By
over-specifying latency, we are effectively helping the framework model bus contention.
$p_{140}(t_{70})$ is an under-specification experiment. By pretending that the underlying processor has less latency to
tolerate than actually exists, we induce the framework to produce less aggressive p-threads that can cover more total
misses. Although sometimes unexpectedly successful (again, for the highlighted benchmarks), this strategy typically
backfires. The framework produces more and shorter p-threads capable of tolerating less latency but doing so for more
misses. The lost opportunity to tolerate longer latencies typically results in lower performance. However, in several
cases, memory latency under-specification can produce better results. This is a concrete example of parameter
adjustment overcoming the shortcomings of the framework. Recall that one of the framework's stated problems is that it
incorrectly assumes that latencies are serial, i.e., that they translate into performance cycle for cycle. In twolf and
vortex, by feeding the framework lower than actual latencies, we help it simulate the true conditions of naturally
overlapped misses. Actual latency tolerance is not reduced, as there were never 140 cycles worth of latency per miss
anyway. Reduced overhead and increased total miss coverage produce a net gain.
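
The gap between serial and naturally overlapped miss latency can be made concrete with a toy calculation. The sketch below assumes, purely for illustration, that the processor is exposed to memory latency whenever at least one miss is outstanding. The function exposed_latency and the miss start times are hypothetical and are not part of the framework.

```python
from typing import Iterable


def exposed_latency(miss_start_cycles: Iterable[int], latency: int) -> int:
    """Cycles during which at least one miss is outstanding (union of intervals)."""
    total = 0
    covered_until = None  # end of the latest interval counted so far
    for start in sorted(miss_start_cycles):
        end = start + latency
        if covered_until is None or start >= covered_until:
            total += latency              # no overlap with earlier misses
            covered_until = end
        elif end > covered_until:
            total += end - covered_until  # count only the non-overlapped tail
            covered_until = end
    return total


latency = 140
starts = [0, 70]                              # second miss issues halfway through the first
serial = len(starts) * latency                # what a serial-latency assumption charges
overlapped = exposed_latency(starts, latency)
print(serial, overlapped, overlapped / len(starts))
```

With these assumed start times the serial model charges 280 cycles, while only 210 cycles are actually exposed, or 105 cycles per miss. Feeding a selection process a latency below the nominal 140 cycles moves its estimate in that direction.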
5 Related Work 
The analysis of cache misses and techniques for avoiding and eliminating them has been an active subject of
research for as long as caches have existed. Software, hardware, and cooperative techniques for prefetching regular
and irregular accesses have been proposed and implemented [1, 6, 9, 12], including several that use finite state
machines (FSMs) to mimic execution and generate prefetch addresses from loaded prefetch values [10, 13].
An early proposal to accelerate a single sequential program via prefetching using general-purpose threads was
Assisted Execution [16]. Implementations of pre-execution in its current form include Speculative Data-Driven
Multithreading (DDMT) [14], Speculative Pre-Computation [2, 3, 18], Speculative Slices [20], Software Controlled
Pre-Execution [8], and Slice Processors [11]. Each implementation has its own special feature. Recent work proposes
pre-execution on an in-memory processor, creating a "push" prefetching model which cuts round-trip request/reply
effects [15, 19].
Our framework complements this body of work. We parametrize the pre-execution run-time model. From a run-time
perspective our results apply to all of these implementations, whether p-threads are executed on dedicated resources
[2, 3, 11, 15, 19] or in a shared resource environment [8, 14, 20] and whether they are executed at the architectural
level [8, 15, 19, 20] or the microarchitectural level [2, 3, 11, 14]. Our results are directly applicable to those
implementations which also use static p-threads [8, 14, 15, 19, 20]. Short p-thread selection intervals (Section 4.2)
can be used for comparison with systems in which p-threads are generated continuously and on-the-fly [2, 3, 11]. Even so,
applicability to dynamic p-thread selection systems is tenuous: such systems typically do not analyze p-thread
behavior and do not explicitly target aggregate effects but rather continuously adapt and modify p-threads via
feedback. For instance, one incarnation of Speculative Precomputation [2] initially selects a conservative p-thread and
adds levels of induction unrolling if more latency tolerance is needed. In addition to implementation differences,
there may be p-thread sequencing model differences. Our framework assumes control-less p-threads and no chaining, a
model used by several proposed systems [2, 11, 14]. The use of chaining [3] or control flow [8, 15, 19, 20] reduces the
applicability of our results, although extensions to handle these different p-thread sequencing models are possible.
More recently, several frameworks for compiler- or linker-based p-thread generation have appeared [5, 7, 8], one of
which even employs inter-procedural analysis [5]. These frameworks form a strong implementation path for
pre-execution. However, they have certain limitations which derive from their static nature. Primarily, they cannot
analyze the latency tolerance of general dataflow graphs and must instead approximate these from high-level static
constructs. Although the amount of main thread work available for overlapping can be approximated for simple loops,
such analysis is difficult for general conditional and call constructs. By dealing with execution traces, our framework
sees the dynamic instruction stream as a long straight-line piece of code (in which loops are unrolled) and sidesteps
this problem. We hope that some of the insight supplied by our framework can be combined with the practical aspects of
these frameworks.
6 Conclusions 
Memory latency is a significant component of total execution time for integer programs running on modern processors.
With multithreading becoming prevalent, pre-execution, a recently proposed technique for effectively moving cache miss
latency to other threads, is becoming popular. This paper presents a quantitative framework for reasoning about the
performance potential of pre-execution and, as a useful side effect, for selecting static p-threads. Our framework
contains two novel components. Aggregate advantage is a function that combines the important p-thread selection
criteria (latency tolerance per miss, overhead, and ratio of p-threads launched to misses covered) into a single
numerical value, allowing these often antagonistic considerations to be simultaneously optimized. The slice tree is a
data structure that naturally represents the set of all possible candidate p-threads and the overlap relationships
between them, allowing non-redundant solutions comprising multiple p-threads to be found. The framework is built
from first principles. A few external parameters allow it to model most processor configurations.
We apply our framework to find static p-threads for covering L2 misses in the SPEC2000 integer benchmarks, and
evaluate the performance of these p-threads using detailed timing simulation. In addition to measuring pre-execution
performance under different p-thread selection and processor conditions, we evaluate the framework itself by checking
its diagnostic predictions against simulated measurements and by verifying that it qualitatively responds to
underlying parameter variations as an optimization framework would. We find that aggregate advantage models p-thread
behavior accurately, and that it parametrizes the important aspects of the underlying processor (miss latency and
processor width) accurately to a first order. The performance results themselves reveal several interesting facts about
p-threads. P-thread effectiveness monotonically increases as selection constraints are relaxed but saturates at certain
characteristic points. This behavior strongly suggests that pre-execution effectiveness and p-thread structure are
properties of the program: a given program/processor pair is associated with a certain canonical set of static
p-threads. Encouragingly, our framework gravitates to this set when left to its own devices.
There are several interesting directions for future work. Primarily, alignment of the framework's performance model 
and its assumptions with reality is a continuing process. This may involve adding a critical path modeling [4]
component to the framework or enriching its vocabulary to allow it to quantitatively reason about naturally overlapped
misses. Such an addition is important as it would allow us to better model p-threads for pre-executing L1 misses.
7 References 
[1] T. Chen and J. Baer. "Effective Hardware Based Data Prefetching for High Performance Processors." IEEE Transactions on
Computers, 44:609-623, May 1995.
[2] J. Collins, D. Tullsen, H. Wang, and J. Shen. "Dynamic Speculative Precomputation." In Proc. 34th International
Symposium on Microarchitecture, pages 306-317, Dec. 2001.
[3] J. Collins, H. Wang, D. Tullsen, C. Hughes, Y. Lee, D. Lavery, and J. Shen. "Speculative Pre-Computation: Long Range
Prefetching of Delinquent Loads." In Proc. 28th International Symposium on Computer Architecture, pages 14-25, Jul. 2001.
[4] B. Fields, S. Rubin, and R. Bodik. "Focusing Processor Policies via Critical Path Prediction." In Proc. 28th Annual
International Symposium on Computer Architecture, pages 74-85, Jul. 2001.
[5] D. Kim and D. Yeung. "Design and Evaluation of Compiler Algorithms for Pre-Execution." In Proc. 10th International
Conference on Architectural Support for Programming Languages and Operating Systems (to appear), Oct. 2002.
[6] A. Lai, C. Fide, and B. Falsafi. "Dead-block prediction and dead-block correlating prefetchers." In Proc. 28th
International Symposium on Computer Architecture, pages 144-154, Jul. 2001.
[7] S. Liao, P. Wang, H. Wang, G. Hoflehner, D. Lavery, and J. Shen. "Post-Pass Binary Adaptation for Software-Based
Speculative Pre-Computation." In Proc. ACM 2002 Conference on Programming Language Design and Implementation, Jun. 2002.
[8] C.-K. Luk. "Tolerating Memory Latency through Software-Controlled Pre-Execution in Simultaneous Multithreading
Processors." In Proc. 28th International Symposium on Computer Architecture, pages 40-51, Jul. 2001.
[9] C.-K. Luk and T. Mowry. "Compiler Based Prefetching for Recursive Data Structures." In Proc. 7th International
Conference on Architectural Support for Programming Languages and Operating Systems, pages 222-233, Oct. 1996.
[10] S. Mehrotra and W. Harrison. "Examination of a Memory Access Classification Scheme for Pointer-Intensive and Numeric
Program." In Proc. 10th International Conference on Supercomputing, pages 133-139, May 1996.
[11] A. Moshovos, D. Pnevmatikatos, and A. Baniasadi. "Slice Processors: An Implementation of Operation-Prediction." In
Proc. 2001 International Conference on Supercomputing, Jun. 2001.
[12] T. Mowry, M. Lam, and A. Gupta. "Design and evaluation of a compiler algorithm for prefetching." In Proc. 5th
International Conference on Architectural Support for Programming Languages and Operating Systems, pages 62-73, Oct. 1992.
[13] A. Roth, A. Moshovos, and G. Sohi. "Dependence Based Prefetching for Linked Data Structures." In Proc. 8th Conference
on Architectural Support for Programming Languages and Operating Systems, pages 115-126, Oct. 1998.
[14] A. Roth and G. Sohi. "Speculative Data-Driven Multithreading." In Proc. 7th International Symposium on
High-Performance Computer Architecture, pages 37-48, Jan. 2001.
[15] Y. Solihin, J. Lee, and J. Torrellas. "Using a User Level Memory Thread for Correlation Prefetching." In Proc. 29th
International Symposium on Computer Architecture, pages 171-182, May 2002.
[16] Y. Song and M. Dubois. "Assisted Execution." Technical Report #CENG 98-25, Department of EE-Systems, University of
Southern California, Oct. 1998.
[17] D. M. Tullsen, S. J. Eggers, and H. M. Levy. "Simultaneous Multithreading: Maximizing On-Chip Parallelism." In Proc.
22nd International Symposium on Computer Architecture, pages 392-403, Jun. 1995.
[18] P. Wang, H. Wang, J. Collins, E. Grochowski, R. Kling, and J. Shen. "Memory Latency Tolerance Approaches for Itanium
Processors: Out-of-Order Execution vs. Speculative Precomputation." In Proc. 8th International Symposium on
High-Performance Computer Architecture, Jan. 2002.
[19] C.-L. Yang and A. Lebeck. "Push vs. Pull." In Proc. 2000 International Conference on Supercomputing, May 2000.
[20] C. Zilles and G. Sohi. "Execution Based Prediction Using Speculative Slices." In Proc. 28th International Symposium on
Computer Architecture, pages 2-13, Jul. 2001.
