Compiler-assisted multiple instruction rollback recovery using a read buffer by Chen, S.-K. et al.
May 1993 UILU-ENG-93-2220
CRHC-93-11
Center for Reliable and High-Performance Computing
COMPILER-ASSISTED
MULTIPLE INSTRUCTION
ROLLBACK RECOVERY
USING A READ BUFFER
N. J. Alewine, S.-K. Chen, W. K. Fuchs, and W.-M. Hwu
Coordinated Science Laboratory
College of Engineering
UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN
Approved for Public Release. Distribution Unlimited.
https://ntrs.nasa.gov/search.jsp?R=19930019981 2020-03-17T04:50:30+00:00Z
COMPILER-ASSISTED MULTIPLE INSTRUCTION
ROLLBACK RECOVERY USING A READ BUFFER
N. J. Alewine¹, S.-K. Chen, W. K. Fuchs, W.-M. Hwu
Center for Reliable and High-Performance Computing
Coordinated Science Laboratory
1308 West Main Street
University of Illinois
Urbana, IL 61801
Primary contact: W. Kent Fuchs
Phone: (217) 333-8294
FAX: (217) 244-5686
e-mail: fuchs@crhc.uiuc.edu
May, 1993
ABSTRACT
Multiple instruction rollback (MIR) is a technique that has been implemented in mainframe
computers to provide rapid recovery from transient processor failures. Hardware-based MIR designs
eliminate rollback data hazards by providing data redundancy implemented in hardware.
Compiler-based MIR designs have also been developed which remove rollback data hazards directly
with data-flow transformations.
This paper focuses on compiler-assisted techniques to achieve multiple instruction rollback
recovery. We observe that some data hazards resulting from instruction rollback can be resolved
efficiently by providing an operand read buffer while others are resolved more efficiently with
compiler transformations. A compiler-assisted multiple instruction rollback scheme is developed which
combines hardware-implemented data redundancy with compiler-driven hazard removal
transformations. Experimental performance evaluations indicate improved efficiency over previous
hardware-based and compiler-based schemes.
Index terms: fault-tolerance, error recovery, instruction retry, compilers, hardware assisted retry.
¹ International Business Machines Corporation, Boca Raton, FL.
This research was supported in part by the National Aeronautics and Space Administration (NASA) under grant
NASA NAG 1-613, in cooperation with the Illinois Computer Laboratory for Aerospace Systems and Software
(ICLASS), and in part by the Department of the Navy and managed by the Office of the Chief of Naval Research
under Contract N00014-91-J-1283.
1 Introduction
Instruction retry is a technique for rapid recovery from transient faults in a processing system.
Multiple instruction rollback recovery is particularly appropriate when error detection latencies or
when error reporting latencies are greater than a single instruction cycle.
When transient processor errors occur, multiple instruction rollback (also referred to as mul-
tiple instruction retry or simply instruction retry) can be an effective alternative to system-level
checkpointing and rollback recovery [1-6]. Multiple instruction retry within a sliding window of
a few instructions [2-5], or re-execution of a few cycles [7], can be implemented in parallel with
concurrent, algorithm-based, or control-flow error detection methods for recovery from transient
processor errors.
1.1 Hardware-Based Instruction Rollback
Hardware implemented instruction retry schemes belong to one of two groups: 1) full checkpointing
and 2) incremental checkpointing. Full checkpointing maintains "snapshots" of the required system
state space at regular, or predetermined, intervals. Upon error detection, the system can be rolled
back to the appropriate checkpointed system state. Incremental checkpointing maintains changes
to the system state in a "sliding window". Upon error detection the system state is restored
by undoing, or "backing-out" the system state changes up to the instruction in which the error
occurred.
The issues associated with instruction retry are similar to the issues encountered with exception
handling in an out-of-order instruction execution architecture. If an instruction is to write to a
register and N is the maximum error detection latency (or exception latency), two copies of the
data must be maintained for N cycles. Hardware schemes such as reorder buffers, history buffers,
future files [8], and micro-rollback [2] differ in where the updated and old values reside, circuit
complexity, CPU cycle times, and rollback efficiency.
Table 1 gives a description of various hardware-based methods to restore the general purpose
register file contents during single or multiple instruction rollback. In the VAX 8600 and VAX 9000,
errors are detected prior to the completion of a faulty instruction. For most VAX instructions,
updates to the system state occur at the end of the instruction. If the error is detected prior to
the updating of the system state, the instruction can be rolled back and re-executed. If the system
state has changed prior to detection of the error, a flag is set to indicate that instruction rollback
cannot be accomplished. Redundant data storage is not required for the VAX 8600 and VAX 9000.
Table 1: Hardware-based single and multiple instruction rollback schemes.
Rollback Scheme              Checkpoint Type   Rollback Distance   Primary Data Location   Redundant Data Location
IBM 4341 [9]                 full              single instr.       register file           shadow file
IBM 3081 [1]                 full              10-20 instr.        register file           shadow file
VAX 8600 [10]                full              single instr.       register file           not required
IBM patent 4,912,707 [6]     full              variable            register file           shadow file
IBM patent 4,044,337 [11]    incremental       single instr.       register file           shadow files
micro-rollback [2]           incremental       variable            write buffer            register file
history buffer [8]           incremental       variable            register file           history buffer
history file [8]             incremental       variable            register file           shadow file
VAX 9000 [12]                full              single instr.       register file           not required
IBM E/S 9000 [5]             incremental       variable            virtual file            physical file
The IBM 4341, IBM 3081, IBM patent 4,912,707, IBM patent 4,044,337, and history file all
require shadow file structures to maintain redundant data. This data is used to restore the system
state during rollback recovery. Shadow file structures can add significant circuit overhead, although
the level sensitive scan design [13] of the IBM 4341 and IBM 3081 provides this feature without
additional cost over that incurred to obtain testability.² The VAX 8600 and VAX 9000 schemes
avoid shadow files but require an error detection latency of no more than one instruction.
The micro-rollback scheme also avoids shadow files by using a delayed write buffer to prevent
old data from being overwritten until the error detection latency has expired, ensuring that the
new data is fault-free. In a delayed write scheme, the most recent write values are contained in
the delayed write buffer, and bypass circuitry is required to forward this data on subsequent reads.
The performance impact introduced by the bypass circuitry is a function of the register file size
and the maximum rollback distance [2].
The history buffer scheme maintains redundant data in a separate push-down array and therefore
does not require bypass circuitry [8]. The history buffer does, however, require an extra register
file port, which complicates the file design and can impact performance by increasing file access
times.
² The 126 scan rings of the IBM 3081 contain 35,000 bits of data.
In an effort to increase the register file size while maintaining down-level code compatibil-
ity relative to the 16 architectural registers, the IBM E/S 9000 has introduced a virtual register
management (VRM) system [14]. The VRM circuitry dynamically maps the 16 architectural
registers into 32 physical registers. When the data in a physical register becomes obsolete, the
physical register is released for reassignment as a new virtual register. Although the VRM system
was primarily intended to reduce register pressure and therefore improve system performance, it has
been extended to provide data redundancy to assist in rollback recovery. In the VRM extension,
remapping of a physical register to a new virtual register is postponed until the error detection
latency has been exceeded for the data contained in the physical register.
1.2 Compiler-Based Instruction Rollback
Recently, compiler-based approaches to multiple instruction rollback recovery have been
investigated [3,4]. Compiler-based MIR uses data-flow manipulations to remove data hazards that
result from multiple instruction rollback. Rollback data hazards (or just hazards) are identified
by antidependencies³ of length ≤ N, where N represents the maximum rollback distance.
Antidependencies are removed at three levels: 1) pseudo-code level, or the code level prior to variables
being assigned to physical registers, 2) machine-code level, or the code level in which variables are
assigned to physical registers, and 3) post-pass level, which represents assembler-level code emitted
by the compiler. Compiler-based multiple instruction rollback reduces the requirement for data
redundancy logic present in hardware-based instruction rollback approaches.
1.3 Compiler-Assisted Instruction Rollback
Compiler-based multiple instruction rollback resolves all data hazards using compiler transforma-
tions. This paper introduces a compiler-assisted instruction rollback scheme which uses dedicated
data redundancy hardware to resolve one type of rollback data hazard while relying on compiler
assistance to resolve the remaining hazards. Experimental results indicate that by exploiting the
unique characteristics of differing hazard types, the new compiler-assisted MIR design can achieve
superior performance to either a hardware-only or compiler-based instruction rollback scheme.
³ For a complete presentation of data-flow properties and manipulation methods, see [15].
2 Error Model and Hazard Classification
2.1 Rollback Data Hazard Model
The following four assumptions are used in the general error model: 1) the maximum error detection
latency is N instructions, 2) memory and I/O have delayed write buffers and can roll back N cycles,
3) the states of the program counter and program status word (PSW) are preserved by an external
recording device or by shadow registers [2], and 4) the CPU state can be restored by loading the
correct contents of the register file, program counter, and PSW.
Given the above assumptions, any error which does not manifest itself as an illegal path in the
control-flow graph (CFG) of the program is allowed provided that the following two conditions are
satisfied: 1) register file contents do not spontaneously change, and 2) data cannot be written to
an incorrect register location. There are four targeted error types: 1) CPU errors such as those
caused by an ALU failure, 2) incorrect values being read from I/O, memory, the register file, or
external functional units such as the floating point unit, 3) correct/incorrect values being read
from incorrect locations within the I/O, memory, or register file, and 4) incorrect branch decisions
resulting from error types 1, 2, or 3.
2.2 Hazard Classification
The code can be represented as a CFG G(V, E), where V is the set of nodes denoting instructions
and E is the set of edges denoting control-flow. If there is a direct control-flow from instruction
i, denoted I_i, to I_j, where I_i ∈ V and I_j ∈ V, then there is an edge (I_i, I_j) ∈ E. Let d_min(I_i, I_j)
denote the smallest number of instructions along any path from I_i to I_j.
The hazard set H_regs of the error model is defined as the set of pseudo registers (or machine
registers) whose values are inconsistent during different executions of an instruction sequence due
to retry. A formal classification of hazard set H_regs follows.
Property 1: z ∈ H_regs iff there exists a sequence of instructions I_1, I_2, ..., I_N which form a
legal walk⁴ in G such that z is live at I_1, and z is defined during the walk.
Proof: For the if case, an error occurring in I_1 will be detected by I_N. During the retry of I_1,
z will be in an inconsistent state since it was defined during the walk. Since z is live at I_1, there
is some path along which z is used prior to its redefinition, and since z is in an inconsistent state,
z ∈ H_regs. For the only if case, we suppose the contrary. Assume that among all legal walks of
length N in G, either z is not live at the beginning, or z is not defined during the walk. It then
follows that z either has no use, or z is not changed. (The error model does not allow a write to
a wrong location and the contents of register z cannot spontaneously change.) Therefore there is
no inconsistency problem for z, which implies z ∉ H_regs.
⁴ A walk is a sequence of edge traversals in a graph where the edges visited can be repeated [16].
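Property 1 can be read as a direct, if unoptimized, procedure for computing the hazard set. The following Python fragment is a minimal sketch under stated assumptions, not the authors' implementation: the CFG is a dictionary of successor lists, each instruction carries hypothetical `defs` and `uses` sets, "live at I_1" is taken to mean live on entry to I_1, and walks of at most N instructions are considered so that walks cut short by the program exit are still examined.

```python
from collections import deque

def live_in_sets(succ, defs, uses):
    """Backward liveness: live_in(n) = uses(n) | (live_out(n) - defs(n))."""
    live_in = {n: set() for n in succ}
    changed = True
    while changed:
        changed = False
        for n in succ:
            live_out = set()
            for s in succ[n]:
                live_out |= live_in[s]
            new_in = uses[n] | (live_out - defs[n])
            if new_in != live_in[n]:
                live_in[n] = new_in
                changed = True
    return live_in

def hazard_set(succ, defs, uses, N):
    """Registers live at the start of some walk of <= N instructions
    and defined somewhere along that walk (Property 1)."""
    live_in = live_in_sets(succ, defs, uses)
    hazards = set()
    for start in succ:
        # Instructions that can appear on a walk of at most N
        # instructions beginning at `start` (the walk includes `start`).
        seen = {start}
        frontier = deque([(start, 1)])
        while frontier:
            node, length = frontier.popleft()
            if length == N:
                continue
            for s in succ[node]:
                if s not in seen:
                    seen.add(s)
                    frontier.append((s, length + 1))
        defined = set()
        for n in seen:
            defined |= defs[n]
        hazards |= live_in[start] & defined
    return hazards

# Hypothetical three-instruction example (register names are illustrative):
#   I1: r1 = r2 + 1    I2: r2 = r1 * 2    I3: r3 = r2
succ = {"I1": ["I2"], "I2": ["I3"], "I3": []}
defs = {"I1": {"r1"}, "I2": {"r2"}, "I3": {"r3"}}
uses = {"I1": {"r2"}, "I2": {"r1"}, "I3": {"r2"}}
hazards_n2 = hazard_set(succ, defs, uses, 2)
hazards_n1 = hazard_set(succ, defs, uses, 1)
```

In this example r2 is read by I1 and overwritten by I2, so a rollback of two instructions would re-execute I1 with a changed r2; with a rollback distance of one, no register is both live and redefined within the window.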
Property 2: Hazards can be classified as one of two types: 1) those that appear as
antidependencies of length ≤ N in G(V, E), referred to as on-path hazards, and 2) those that appear at
branch boundaries, referred to as branch hazards. These two hazard types may overlap.
Proof: Since z ∈ H_regs, there exists a legal walk W_1 = I_1, I_2, ..., I_N in G, such that z is live at
I_1, and after the execution of I_1, I_2, ..., I_N in sequence, z has a different value. The latter implies
that there is at least one instruction defining z along W_1 (the error model does not allow a write to
a wrong location and the content of register z cannot spontaneously change). Let i be the largest
index such that I_i defines z, where i ∈ {1, 2, ..., N}. Property 1 implies that there exists a legal walk
W_2 in G, beginning with I_1, such that the first instruction I_j along W_2 referring to z is a use. Case
1: if W_2 ⊆ W_1, instructions I_j and I_i constitute an antidependency of length ≤ N, and there is
an on-path hazard on z. Case 2: if W_2 ⊄ W_1, there exists a branch instruction I_t between I_1 and
I_{i-1}. Since d_min(I_t, I_i) ≤ N, there is a hazard on z at a branch boundary.
An on-path or branch data hazard occurs when I_i defines variable z, and after rollback, I_j uses
the corrupted z value prior to its being redefined. To simplify subsequent discussion, such on-path
and branch hazards will be denoted h_o(i, j, z) and h_b(i, j, z) respectively. Figure 1 illustrates this
hazard notation.
3 Compiler-Assisted Instruction Rollback
As shown in Section 2, rollback data hazards are of two types: 1) on-path hazards, and 2) branch
hazards. Previous work has shown that compiler-driven data-flow manipulations can be used to
resolve both on-path [3] and branch [4] hazards. Compiler-assisted multiple instruction rollback
described in this section uses hardware to resolve on-path hazards and relies on compiler assistance
to resolve the remaining branch hazards.
Figure 1: On-path and branch hazards.
3.1 On-path Hazard Resolution Using a Read Buffer
Figure 2 shows a hardware scheme to resolve on-path hazards. A read buffer is attached to the
output ports of the register file. Each time a register is used it appears on the read port and is
saved in the read buffer. If a register r_k is defined in I_i and it is an on-path hazard, then r_k must
have been read within the last N cycles. In this case, the read buffer will contain the old value
and it is permissible to write the new value into the register file. In the event of a rollback of N
instructions, the contents of the read buffer are flushed in reverse order and stored back to the
register file. For an on-path hazard, the path taken after the rollback will be the same as the path
taken prior to rollback and each read of r_k will produce the same value as before. It is assumed
that the read buffer is an integral part of the register file and any error in the system does not
corrupt the transfer to the read buffer or its contents.
In contrast to a write history buffer, which forces a read of r_k prior to writing r_k, the read buffer
monitors the register file ports and stores only the values read as part of the normal program flow
and, therefore, should not significantly impact the register file performance or CPU cycle time. The
read buffer is twice the width of a register with a depth of N. This is twice the size of a delayed
write buffer, but eliminates the requirement for complex bypassing and prioritization logic.
Figure 2: Read buffer.
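The behavior of the read buffer can be illustrated with a small software model. The sketch below is an assumption-laden simplification rather than a description of the hardware: one buffer entry is recorded per instruction, each entry holds the (register, value) pairs seen on the read ports, and the class and method names are hypothetical.

```python
from collections import deque

class ReadBufferRegisterFile:
    """Toy model: a register file whose reads are logged in a depth-N buffer."""

    def __init__(self, regs, depth):
        self.regs = dict(regs)
        self.buffer = deque(maxlen=depth)  # oldest entries fall off after N

    def execute(self, dest, srcs, op):
        # Reading the sources places their current values in the buffer.
        values = [self.regs[s] for s in srcs]
        self.buffer.append(list(zip(srcs, values)))
        # The write may then proceed; no forced read of `dest` is needed.
        self.regs[dest] = op(*values)

    def rollback(self):
        # Flush the buffer in reverse order, storing values back.
        count = len(self.buffer)
        while self.buffer:
            for reg, old in reversed(self.buffer.pop()):
                self.regs[reg] = old
        return count

def program(rf):
    rf.execute("r3", ("r1", "r2"), lambda a, b: a + b)
    rf.execute("r1", ("r3", "r2"), lambda a, b: a * b)  # r1 was read above
    rf.execute("r2", ("r1", "r3"), lambda a, b: a - b)  # r2 was read above
    return dict(rf.regs)

rf = ReadBufferRegisterFile({"r1": 5, "r2": 3, "r3": 0}, depth=3)
first = program(rf)          # state after the first execution
undone = rf.rollback()       # roll back all three instructions
restored = dict(rf.regs)     # r1 and r2 hold their pre-window values
second = program(rf)         # retry reproduces the first execution
```

Note that r3 is not restored to its initial value, which is harmless here because r3 is redefined before it is used on the retried path; only registers that were read before being overwritten (the on-path hazards) need their old values back.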
3.1.1 Covering on-path hazards
In addition to resolving all on-path hazards, the read buffer will resolve some branch hazards.
Figure 3 shows an on-path hazard and a branch hazard, both with definitions of z in I_i and uses of
z, after rollback, in instructions I_j and I_k, respectively. Note that if path l is initially taken, the
read buffer will contain the old value of z and rollback would be successful. However, if path m is
taken, the read buffer will not contain the old value of z and rollback would be unsuccessful. If
only paths such as l exist, the presence of the on-path hazard assures successful rollback or "covers"
the branch hazard. In this case, resolution of the branch hazard using compiler techniques is not
necessary.
3.1.2 Post-pass transformation
Given the efficiency of the read buffer in resolving on-path hazards, a post-pass transformation on
assembler-level code becomes possible as an alternative to nop insertion transformations [3]. The
post-pass transformation creates on-path hazards when necessary to assure that all branch hazards
are resolved by the read buffer. Given one such branch hazard which defines physical register r_k
at instruction I_i, the transformation inserts a MOV r_k, r_k instruction immediately before I_i. This
guarantees that all paths leading to I_i are like path l in Figure 3.
Figure 3: Covering on-path hazard.
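The read insertion described in Section 3.1.2 amounts to a simple rewrite of the instruction stream. The following sketch assumes a hypothetical tuple encoding of assembler-level instructions, (opcode, destination, sources), and assumes that the indices of the hazard-defining instructions have already been identified by a separate analysis.

```python
def insert_covering_reads(code, hazard_defs):
    """Insert `MOV rk, rk` immediately before each flagged definition of rk.

    code        -- list of (opcode, dest, srcs) tuples (illustrative encoding)
    hazard_defs -- set of indices into `code` whose destination register
                   is involved in an unresolved branch hazard
    """
    out = []
    for idx, instr in enumerate(code):
        if idx in hazard_defs:
            _, dest, _ = instr
            # The MOV reads rk, so the old value reaches the read buffer
            # on every path leading to the definition.
            out.append(("MOV", dest, (dest,)))
        out.append(instr)
    return out

code = [
    ("ADD", "r1", ("r2", "r3")),
    ("SUB", "r4", ("r1", "r2")),   # assume r4 is the hazard register here
    ("MUL", "r5", ("r4", "r1")),
]
patched = insert_covering_reads(code, {1})
```

Each insertion lengthens the instruction path by one instruction, which is the small fixed cost referred to later when read insertion is compared with loop protection.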
3.2 Branch Hazard Resolution
Compiler transformations have been shown to be effective in resolving branch hazards [4]. Branch
hazards are resolved at three levels: 1) pseudo-level, 2) machine-level, and 3) post-pass level.
Pseudo-level hazards are removed by variable renaming, for example, renaming variable z to y in
instruction I_i of Figure 1. Machine-level branch hazards occur when register assignments result in
branch hazards that were not present at the pseudo-level. Machine-level hazards are resolved by
adding hazard constraints to live range constraints prior to register assignment. Branch hazards
which remain after pseudo-level and machine-level transformations are resolved at the post-pass
level with read insertions as described in Section 3.1.2.
The primary pseudo-level renaming transformation for the removal of branch hazards involves
node splitting [4]. This section presents a new one-pass node splitting algorithm which results in
marginally reduced code growth and dramatically reduced compile times relative to previous node
splitting algorithms.
3.2.1 Iterative node splitting algorithm
Node splitting breaks equivalence relationships which would prevent pseudo register renaming [3,
15]. When two definitions of a hazard variable reach a node in which the hazard variable is live, the
node is split. Node splitting to resolve one hazard variable often resolves other unrelated hazard
variables. This implies that the hazard set should be recalculated after splitting is performed
for each hazard variable. Previous node splitting algorithms use this iterative algorithm to avoid
unnecessary node splitting [3].
Figure 4 demonstrates the effect of the iterative node splitting algorithm on an example
subgraph. Node splitting relative to hazard variable z ensures that the definition of z in node n1 and
the definition of z in node n2 do not both reach the same use of z in node n5. Node splitting
relative to y ensures that the definition of y in node n3 and the definition of y in node n4 do not
both reach the same use of y in node n6. Figure 4 also shows an optimal subgraph which resolves
both hazards with less splitting than produced by the iterative algorithm, indicating that excessive
node splitting is possible with the iterative algorithm.
3.2.2 Node splitting using graph coloring
To ensure minimal splitting, a new node splitting algorithm is developed using the concept of
conflicting parents [17]. Ensuring that node n does not have conflicting parents enables resolution
of the hazard using variable renaming. The node splitting strategy for a particular node is to group
the parents of that node such that elements within a group do not conflict. Each group becomes
parent nodes for a duplicate of the original node. For example, if node n has six parent nodes and
these nodes can be organized into three nonconflicting groups, then only three total copies of n are
required.
Figure 5 illustrates the use of conflicting parents and graph coloring in node splitting for the
QSORT application described in Table 3 of Section 4.1. Node splitting is performed on pseudo-level
code, which for this example is represented by Lcode from the IMPACT C compiler [18]. Figure
5 shows node 48 from the QSORT application. Node 48 has six parent nodes prior to splitting.
These nodes can be arranged in a parent conflict graph, where each arc of the graph represents
two nodes which conflict. Establishing groups can be achieved by finding the minimum coloring
of the parent conflict graph, i.e., coloring the nodes such that no two nodes connected by an arc
have the same color. For the example shown in Figure 5, three colors are sufficient to color the
parent conflict graph, resulting in the splitting of node 48 into nodes 48, 48' and 48". Determining
whether a graph is k-colorable is NP-complete in general. The graph coloring heuristic used for our
one-pass node splitting algorithm is a modified version of an algorithm used for register allocation
[15].
Figure 4: Iterative node splitting relative to hazard variables z and y. (Panels: unsplit subgraph; split relative to variable z; split relative to variable y; optimally split subgraph.)
Figure 5: Node splitting using graph coloring; QSORT. (Panels: node 48 before splitting; parent conflict graph; nodes 48, 48', and 48" after splitting.)
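The grouping of parents by coloring the parent conflict graph can be sketched as follows. This is a generic greedy coloring (highest degree first), used here only for illustration; it is not the modified register allocation heuristic referred to in the text, and the six parent names and conflict arcs are hypothetical rather than taken from the QSORT example.

```python
def color_parents(parents, conflicts):
    """Group parents so that no two conflicting parents share a group.

    parents   -- list of parent node names
    conflicts -- iterable of (a, b) pairs, one per arc of the conflict graph
    Returns a list of groups; each group feeds one copy of the split node.
    """
    adj = {p: set() for p in parents}
    for a, b in conflicts:
        adj[a].add(b)
        adj[b].add(a)
    # Greedy heuristic: color the most constrained parents first.
    order = sorted(parents, key=lambda p: (-len(adj[p]), p))
    color = {}
    for p in order:
        used = {color[q] for q in adj[p] if q in color}
        c = 0
        while c in used:
            c += 1
        color[p] = c
    groups = {}
    for p in parents:
        groups.setdefault(color[p], []).append(p)
    return [groups[c] for c in sorted(groups)]

parents = ["p1", "p2", "p3", "p4", "p5", "p6"]
conflicts = [("p1", "p2"), ("p2", "p3"), ("p1", "p3"),
             ("p1", "p4"), ("p2", "p5"), ("p3", "p6")]
groups = color_parents(parents, conflicts)
```

With these arcs the triangle p1, p2, p3 forces at least three colors, and the greedy pass finds a three-group partition, so the node would be split into three copies. A greedy heuristic does not guarantee a minimum coloring in general.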
3.2.3 One-pass node splitting algorithm
Both live_in(n) and reaching_out(n)⁵ analyses are required to identify conflicting parent nodes. A
one-pass node splitting algorithm becomes possible by precalculating live_in and the hazard node
set, and then, beginning with the root node, splitting in a topological traversal of the CFG. A
topological traversal ensures that when processing node n, all ancestors of n have been processed
and no descendants of n have been processed. This latter case ensures that the presplit calculation
of live_in(n) can be used for parent conflict identification when processing a given node. Unlike
live_in(n), reaching_out(n) is affected by the splitting of ancestor nodes. Since reaching_out(n)
is based solely on node n and its ancestors, reaching_out(n) can be calculated as node splitting
proceeds. If a hazard node is split, each duplicate of the node must be added to the hazard node
set. Since the root node does not have conflicting parents, a topological traversal of the CFG
using the graph coloring node splitting technique ensures that no node in the resulting graph has
conflicting parents.
⁵ A complete description of data-flow terminology can be found in "Compilers: Principles, Techniques, and Tools",
Aho et al. [15].
Table 2: Node splitting algorithm comparisons: COMPRESS.
• Iterative algorithm run time = 614.0 seconds
• One-pass algorithm run time = 20.3 seconds
• Speedup = 30.2
Orig. Node Cnt.   Iterative Alg.   % Increase   One-pass Alg.   % Increase
547               601              9.9          566             3.5
461               499              8.2          496             7.6
144               147              2.1          147             2.1
181               209              15.5         207             14.4
75                80               6.7          80              6.7
21                28               33.3         27              28.6
45                79               75.6         48              6.7
Table 2 illustrates the improvement of the one-pass node splitting algorithm over the iterative
algorithm for the COMPRESS application described in Table 3 of Section 4.1. The COMPRESS
application was compiled on a SPARCserver 490 using the IMPACT C compiler [18] with a rollback
distance of 10. Node count values represent pseudo instructions (Lcode) created by the IMPACT C
compiler before and after splitting. Seven of the 14 COMPRESS functions which required splitting
are listed. Algorithm run times represent the overall compile times given each of the two node
splitting algorithms.
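The derived columns of Table 2 follow directly from the node counts and run times. As a quick check, the short script below recomputes the percentage increases and the speedup from the values listed in the table; the variable names are illustrative only.

```python
# (original node count, after iterative splitting, after one-pass splitting)
table2 = [
    (547, 601, 566),
    (461, 499, 496),
    (144, 147, 147),
    (181, 209, 207),
    (75, 80, 80),
    (21, 28, 27),
    (45, 79, 48),
]

def pct_increase(orig, after):
    """Code growth as a percentage of the original node count."""
    return round(100.0 * (after - orig) / orig, 1)

iterative_pct = [pct_increase(o, it) for o, it, _ in table2]
one_pass_pct = [pct_increase(o, op) for o, _, op in table2]

# Compile-time speedup of the one-pass algorithm over the iterative one.
speedup = round(614.0 / 20.3, 1)

for (o, it, op), pi, po in zip(table2, iterative_pct, one_pass_pct):
    print(f"{o:5d} {it:5d} {pi:6.1f} {op:5d} {po:6.1f}")
print("speedup:", speedup)
```

The recomputed values agree with the table, including the last listed function, where growth drops from 75.6% to 6.7%.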
Table 2 shows a marginal overall code growth reduction for the one-pass algorithm. Although
one function demonstrated a significant code growth reduction (6.7% down from 75.6%), the
function is small and has minimal effect on the overall code size. The improvement in compile time
of the one-pass algorithm is more dramatic, resulting in a speedup of 30.2. The compile-time
improvement can be explained as follows. If 60 hazard variables are present in a given function,
the iterative algorithm may require up to 60 passes through the CFG of that function, including
60 data-flow analysis and hazard calculations. Although processing a given node in the one-pass
algorithm is slightly more complex, a single data-flow analysis calculation and a single pass through
the CFG are sufficient.
Figure 6: Post-pass hazard removal using read insertion.
3.3 Performance Enhancement Through Profiling
3.3.1 Post-pass transformation versus loop protection
After hazards are removed by the compiler, some hazards remain and must be removed using a
post-pass transformation. Previous post-pass transformations used nop insertions to increase all
antidependency distances to > N [3]. Since nop insertion can be costly to performance, previous
compiler transformations removed all hazards possible, leaving only unresolvable hazards to be
removed by the post-pass transformation.
In Section 3.1.2, a new post-pass transformation was introduced in which nop insertion was
replaced by read insertions as the primary hazard removal technique. As illustrated in Figure 6, up
to two branch hazards can be removed by a single read instruction. The new post-pass
transformation is very efficient and in some cases can resolve branch hazards with less performance impact
than pseudo-level transformations. Figures 11 and 13 of Section 4.2 show performance overhead
comparisons between compiler-driven data-flow manipulations and the post-pass transformation for
the PUZZLE and TBL applications described in Table 3 of Section 4.1. Comp/PP indicates that
hazards are resolved by the compiler where possible, with the remaining hazards being resolved at
the post-pass level. PP (post-pass) indicates that compiler transformations have been disabled and
that all hazards are removed at the post-pass phase.
For the PUZZLE application, compiler transformations produce better performance than the
post-pass transformation alone. For the TBL application, using the post-pass transformation to
remove all hazards produces slightly better performance than the combination of compiler and
post-pass transformations. Hazard elimination via read insertion introduces a guaranteed but small
performance impact due to the longer instruction path length. As demonstrated by the PUZZLE
application, pseudo register renaming can eliminate hazards without impacting performance when
loop protection is infrequent. The save/restore operations of loop protection can result in more
performance impact than read insertion when loop protection is frequent, as demonstrated by
results for the TBL application.
Figure 7 illustrates the potential effect on performance given the following two types of hazard
removal: 1) hazard removal using register renaming that results in loop protection, and 2) hazard
removal using read insertion. If the protected loop of Figure 7 is executed 20 times and the hazard
instruction is executed two times, loop protection would require the execution of 40 additional
instructions, where read insertion would require the execution of only two additional instructions.
If the loop and hazard instruction execution frequencies were reversed, then read insertion would
produce more performance impact than loop protection. As shown in Figure 7, profiling data can
be used to aid in loop protection decisions.
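The arithmetic behind this comparison is small enough to spell out. The sketch below assumes, consistent with the example above, that loop protection costs one save/restore pair (two instructions) per execution of the protected loop and that read insertion costs one instruction per execution of the hazard instruction; the function and parameter names are hypothetical.

```python
def extra_instructions(loop_execs, hazard_execs,
                       save_restore_cost=2, read_cost=1):
    """Return (loop protection overhead, read insertion overhead)
    in dynamically executed instructions."""
    loop_protection = loop_execs * save_restore_cost
    read_insertion = hazard_execs * read_cost
    return loop_protection, read_insertion

# Loop executed 20 times, hazard instruction executed twice.
case_a = extra_instructions(loop_execs=20, hazard_execs=2)

# Frequencies reversed: loop executed twice, hazard instruction 20 times.
case_b = extra_instructions(loop_execs=2, hazard_execs=20)
```

In the first case read insertion is cheaper (2 versus 40 extra instructions); in the reversed case loop protection is cheaper (4 versus 20), which is why execution frequencies from profiling are useful for the decision.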
3.3.2 Profiling effectiveness
Profiled data was included in the pseudo-level transformations of Section 3.2. The profile data is
comprised of both dynamic profile sampling and static prediction. The static prediction is used as
a supplement for areas of the application code that are unexecuted during profile sampling. For
static profiling, a loop is assumed to iterate ten times. Inner loops, therefore, iterate multiples
of 10 times depending on the depth of loop nesting. All loop header nodes and hazard nodes are
assigned weights based on the profile data.
[Figure 7 diagram: panels "Loop Protection" and "Read Insertion", annotated with profile data.]
Figure 7: Loop protection versus read insertion.
Protection of loop l due to hazard node nh is required based on the following condition: if
nh_weight > 3 · (hdr_node(l)_weight), then protect loop l. The constant 3 adjusts the weights
to account for both direct and indirect loop protection costs. Direct loop protection costs result
from the save/restore instruction pair shown in Figure 7. Indirect loop protection costs result
from: 1) an increased number of hazards, which in turn require more node splitting and more loop
protection, and 2) increased register usage due to the save/restore instructions, which can result
in additional register spills. Figure 8 shows the run-time overhead for the TBL application with
rollback distances from 1 to 10. Prof/PP indicates that profiling data was used in loop protection
decisions.
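A minimal sketch of this decision rule follows, assuming the weights are plain execution counts. The helper `static_weight` is hypothetical: it reads "multiples of 10" as ten raised to the loop nesting depth, which is one possible interpretation of the static prediction and is not confirmed by the text.

```python
def static_weight(nesting_depth, iterations_per_loop=10):
    """Static estimate for unprofiled code: ten iterations per loop level (assumed 10**depth)."""
    return iterations_per_loop ** nesting_depth


def protect_loop(hazard_weight, header_weight, cost_factor=3):
    """Apply the stated condition nh_weight > 3 * hdr_node(l)_weight."""
    return hazard_weight > cost_factor * header_weight


# Hazard two loop levels deep, loop header one level deep (static estimates).
print(protect_loop(static_weight(2), static_weight(1)))   # True: 100 > 30
# A hazard executed only twice as often as the header is left for later resolution.
print(protect_loop(20, 10))                               # False: 20 <= 30
```

The factor of 3 is the constant from the text; making it a parameter only makes it easier to see how the decision shifts as the assumed indirect costs change.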
The results show that the use of profile data can improve application performance by postponing
some hazard resolutions until the post-pass phase. Using profile data to aid in loop protection
decisions did not produce performance equal to that of the post-pass transformation for the TBL
application. As an extension to this work, profile data can be used to aid in register allocation. As
discussed in Section 3.2, hazards that are present after pseudo register renaming are resolved by
adding hazard constraints to live range constraints prior to register allocation. These additional
constraints can cause increased register spillage and impact performance. Techniques similar to
those developed for loop protection can be used to enhance register allocation decisions.
[Figure 8 plot: Time OH (%) versus Rollback Distance (1 to 10) for TBL; curves PP, Comp/PP, and Prof/PP.]
Figure 8: TBL: profile data used for loop protection decisions.
4 Performance Evaluation
4.1 Implementation and Application Programs
The hazard removal transformation algorithms have been implemented in the MIPS code generator
of the IMPACT C compiler [18]. Transformations resolving pseudo register hazards (loop protec-
tion, node splitting, and loop expansion) are called just before register allocation. Transformations
resolving machine register hazards are called after the live range constraints have been generated
and before physical register allocation. The nop insertion algorithm, or post-pass algorithm, is
called before the assembly code output routine.
Table 3 lists the eleven application programs used in the evaluations. The applications were
cross-compiled on a SPARCserver 490 and then the compiled program was run on a DECstation
3100. Static Size is the number of assembly instructions emitted by the code generator, not including
the library routines and other fixed overhead.
The results are summarized in Figures 9 through 13. Each figure contains two plots: the first
plot shows the percent of run-time overhead (Time OH) of the referenced hazard resolution scheme,
and the second plot shows the percent of code growth overhead (Size OH) relative to the base values
in Table 3.
Table 3: Application programs.
Program     Static Size   Description
QUEEN       148           eight-queen program
WC          181           UNIX utility
QSORT       252           quick sort algorithm
CMP         262           UNIX utility
GREP        907           UNIX utility
PUZZLE      932           simple game
COMPRESS    1826          UNIX utility
LEX         6856          lexical analyzer
YACC        8099          parser-generator
TBL         8197          table formatting preprocessor
CCCP        8775          preprocessor for gnu C compiler
Four hazard resolution techniques were evaluated. Compiler 1 resolves on-path hazards only, using
the compiler-driven data-flow manipulations. Compiler 2 extends the compiler transformations
to resolve both on-path and branch hazards. PP (post-pass) disables the compiler transformations
and relies solely on the post-pass transformation presented in Section 3.1.2. Comp/PP uses
compiler transformations to resolve branch hazards with the techniques described in Section 3.2,
assumes a read buffer to resolve on-path hazards, and uses the post-pass transformation to remove
remaining branch hazards. Comp/PP represents the compiler-assisted multiple instruction rollback
scheme.
Due to the excessive compile times of the previous Compiler 1 and Compiler 2 algorithms for
large applications, the evaluations of these schemes were restricted to applications QUEEN, WC,
COMPRESS, CMP, PUZZLE, and QSORT. Both Comp/PP and PP were evaluated for all eleven
applications.
4.2 Performance analysis
Compiler transformations used for the removal of data hazards can impact performance in several
ways. Loop protection inserts save/restore operations at the head and tail of the loop. This increases
the path length and, therefore, the run time. Additional arcs in the dependency graph can cause
more spill code to be generated, increasing memory references and cache misses. Nop insertion
can be costly since up to N nops could be inserted for each unresolved hazard. The insertion of
MOV rk, rk instructions to create covering on-path hazards in the post-pass transformation also
increases path lengths, although typically less than with nop insertions. Finally, the increase in
code size, mainly due to loop expansion, may cause more run-time cache misses. The performance
numbers shown in Figures 9 through 13 are for execution of the eleven application programs on a
DECstation 3100 after they have been compiled with the transforms described.
4.3 Results: Compiler
As can be seen in Figures 9 through 11, extending the compiler hazard resolution scheme to include
branch hazards introduces little incremental performance impact or code growth overhead. Given a
rollback distance of 10, resolving both on-path and branch hazards using compiler transformations
resulted in a maximum performance impact of 32.6% and an average performance impact of 12.6%.
This compares with maximum and average impacts of 35.4% and 15.4%, respectively, for compiler-driven
on-path hazard resolution only. The maximum code size overhead measured for the extended
compiler-based technique was 328% with an average overhead of 207%, for a rollback distance of
10. This compares with a maximum and average overhead of 372% and 225%, respectively, for the
unextended compiler-based scheme.
These results indicate a small incremental run-time performance overhead and a small code
size overhead given compiler-based branch hazard removal compared to compiler-based on-path
hazard removal alone. Three factors account for these small incremental impacts. First, on-path
hazards dominate in frequency of occurrence. Second, resolving an on-path hazard at instruction
Ii through renaming can sometimes resolve a branch hazard at instruction Ii. Third, resolving
on-path hazards with nop insertion may resolve a corresponding branch hazard by increasing the
distance between the hazard node and its nearest predecessor branch node.
4.4 Results: PP
Figures 9 through 13 show the run-time and code size overheads for each application studied using
the read buffer to resolve on-path hazards and the post-pass transformation described in Section
3 to cover all branch hazards. The results are worst case in that many of the branch hazards
could have been resolved with no performance impact using the compiler techniques; instead,
they are resolved by the insertion of MOV instructions, which cause a guaranteed, although small,
performance impact. Given a rollback distance of 10, the post-pass transformation produced a
maximum performance impact of 7.69% with an average performance impact of 2.43%, significantly
below the levels produced by the compiler-based scheme. Code growth overhead measurements were
correspondingly lower with a maximum overhead of 13.0% and an average overhead of 8.59%.
4.5 Results: Comp/PP
The compiler-assisted scheme achieved consistently low performance overheads across all
applications and slightly better performance than with the post-pass transformation only. Given a rollback
distance of 10, the compiler-assisted scheme produced a maximum performance impact of 6.57%
with an average performance impact of 2.03%, and a maximum code growth overhead of 51.2%
with an average overhead of 15.5%. The run-time results of PUZZLE, YACC, and CCCP indicate
that compiler techniques are still useful in reducing run-time performance penalties. These
compiler techniques, however, have the disadvantage of requiring recompilation and additional code
growth. The primary advantage of the compiler-assisted and post-pass schemes is their utilization
of the read buffer to resolve the more frequent on-path hazards.
[Figure 9 plots: Time OH (%) and Size OH (%) versus Rollback Distance (1 to 10) for QUEEN; curves Compiler 1, Compiler 2, PP, and Comp/PP.]
Figure 9: Run-time overhead and code size overhead: QUEEN.
[Figure 10 plots: Time OH (%) and Size OH (%) versus Rollback Distance (1 to 10) for WC, COMPRESS, and CMP; curves Compiler 1, Compiler 2, PP, and Comp/PP.]
Figure 10: Run-time overhead and code size overhead: WC, COMPRESS, and CMP.
[Figure 11 plots: Time OH (%) and Size OH (%) versus Rollback Distance (1 to 10) for PUZZLE, QSORT, and GREP; curves Compiler 1, Compiler 2, PP, and Comp/PP for PUZZLE and QSORT, and PP and Comp/PP for GREP.]
Figure 11: Run-time overhead and code size overhead: PUZZLE, QSORT, and GREP.
[Figure 12 plots: Time OH (%) and Size OH (%) versus Rollback Distance (1 to 10) for LEX, YACC, and CCCP; curves PP and Comp/PP.]
Figure 12: Run-time overhead and code size overhead: LEX, YACC, and CCCP.
[Figure 13 plots: Time OH (%) and Size OH (%) versus Rollback Distance (1 to 10) for TBL; curves PP and Comp/PP.]
Figure 13: Run-time overhead and code size overhead: TBL.
5 Read Buffer Size Requirement
A practical lower bound and average size requirement for the read buffer are established in this
section by modifying the design to save only the data required for rollback. The study measures
the effect on the performance of ten application programs using six read buffer configurations with
varying read buffer sizes. Two alternative configurations are shown to be the most efficient.
Given a read buffer, rollback is accomplished by first flushing the read buffer back to the general
purpose register file (GPRF) in the reverse order in which the values were saved. Provided that the depth
of each of the dual first-in-first-out (FIFO) read buffers is N, redundant copies of the appropriate register
values are available to restore the register file given a rollback of ≤ N.
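The flush-in-reverse-order step can be illustrated with a toy model. The following Python sketch is not the hardware design: it assumes a made-up machine whose only instruction is a two-source add, logs both source reads of the last N instructions in one list rather than in two FIFOs, and restores only registers that were read inside the rollback window. The class and method names are hypothetical.

```python
from collections import deque


class ReadBufferMachine:
    """Toy register machine whose source reads are logged for rollback."""

    def __init__(self, registers, n):
        self.regs = dict(registers)
        self.n = n                # maximum rollback distance N
        self.log = deque()        # (instruction index, register, value read)
        self.executed = 0

    def add(self, dest, src1, src2):
        """Execute dest = src1 + src2, saving both source reads first."""
        a, b = self.regs[src1], self.regs[src2]
        self.log.append((self.executed, src1, a))
        self.log.append((self.executed, src2, b))
        # Keep only the reads of the last N instructions (at most 2N entries).
        while self.log and self.log[0][0] <= self.executed - self.n:
            self.log.popleft()
        self.regs[dest] = a + b
        self.executed += 1

    def rollback(self, distance):
        """Undo the last `distance` instructions by flushing in reverse order."""
        if not 0 <= distance <= min(self.n, self.executed):
            raise ValueError("rollback distance out of range")
        target = self.executed - distance
        while self.log and self.log[-1][0] >= target:
            _, reg, value = self.log.pop()
            self.regs[reg] = value
        self.executed = target


m = ReadBufferMachine({"r1": 1, "r2": 2}, n=10)
m.add("r1", "r1", "r2")   # r1 = 3
m.add("r2", "r1", "r2")   # r2 = 5
m.add("r1", "r1", "r2")   # r1 = 8
m.rollback(2)
print(m.regs)             # {'r1': 3, 'r2': 2}
```

Restoring newest entries first matters: when a register is read several times in the window, the oldest saved value is written last and is the one that survives.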
The read buffer size requirement of 2N is the worst case. The buffer maintains the last N
register reads from the GPRF, assuring data redundancy for all values required. The read buffer
may also save data which is not required during rollback. Register reads that must be saved can
be determined at compile time. If this information is added to the instruction encoding (e.g., as
an extra bit field for source 1 and for source 2), then the read buffer can be designed to save only
those values required. As long as the required values are maintained for N cycles, a read buffer
design with a size of less than 2N is possible.
Figure 14: Read buffer of size < 2N.
Figure 14 illustrates a case in which not all register reads have to be placed in the read
buffer. The register values (denoted value(r)) which require saving are marked with an "*." Since
only the required values are saved, the read buffer total size can now potentially be less than N.
In this case, however, the instruction count must also be saved so that the value can be maintained
for at least N cycles. In the event that the read buffer overflows, the oldest value in the buffer
must be pushed to memory and a record kept so that during rollback the value can be retrieved
from memory. Given a dual FIFO depth of M, memory would serve the function of the remaining
N - M of the two FIFOs.
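The reduced buffer with overflow to memory can be sketched as follows. This Python model is an illustration under stated assumptions: a single FIFO stands in for the dual FIFOs, a list stands in for memory, each saved entry carries the instruction count as described above, and entries older than N instructions are retired. The class name and its methods are hypothetical.

```python
from collections import deque


class SelectiveReadBuffer:
    """Bounded buffer holding only the reads marked as required for rollback."""

    def __init__(self, capacity, n):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.n = n                 # rollback distance N
        self.fifo = deque()        # (instruction count, register, value)
        self.memory = []           # overflowed entries, oldest first
        self.overflows = 0

    def save(self, count, reg, value):
        """Save a required read made by instruction number `count`."""
        cutoff = count - self.n
        # Retire values that can no longer be needed by a rollback of <= N.
        while self.fifo and self.fifo[0][0] <= cutoff:
            self.fifo.popleft()
        self.memory = [e for e in self.memory if e[0] > cutoff]
        if len(self.fifo) == self.capacity:
            # Overflow: push the oldest value to memory and keep a record.
            self.memory.append(self.fifo.popleft())
            self.overflows += 1
        self.fifo.append((count, reg, value))

    def restore_order(self, current, distance):
        """Entries to write back, newest first, for the given rollback."""
        target = current - distance
        saved = self.memory + list(self.fifo)
        return [e for e in reversed(saved) if e[0] >= target]


buf = SelectiveReadBuffer(capacity=2, n=10)
for count, reg, value in [(0, "r1", 1), (1, "r2", 2), (2, "r3", 3)]:
    buf.save(count, reg, value)
print(buf.overflows)               # 1
print(buf.restore_order(3, 3))     # [(2, 'r3', 3), (1, 'r2', 2), (0, 'r1', 1)]
```

Each overflow stands for a memory access, which is the source of the performance overhead measured in the evaluation that follows.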
5.1 Read Buffer Designs and Evaluation Methodology
Six read buffer configurations were studied. Configuration A1, shown in Figure 15, has a separate
FIFO for each source bus. Configuration A2 allows access to either FIFO from either source bus.
Configuration B1 contains a single FIFO and assumes that both source operands can be written into
the single FIFO within the same cycle. This latter split-cycle-save assumption is consistent with a
register file design that writes during the first half of the cycle and reads during the second half of
the cycle [19]. Configuration B2 assumes no split-cycle-save capability. Configuration C contains
a single level dual queue to absorb a simultaneous operand save and configuration D extends this
design to allow access to either queue from either source bus.
[Figure 15 diagram: six panels, Config. A1, A2, B1, B2, C, and D, each fed by source buses S1 and S2.]
Figure 15: Read buffer configurations.
The read buffer was simulated at the instruction level. The s-code emitted by the IMPACT
C compiler [18] was instrumented with procedure calls to a simulation program containing models
for the six read buffer configurations. Branch hazards were removed by the compiler for a rollback
distance of 10. Parameters such as which operands require saving in the read buffer were determined
at the post-pass level and instrumentation code segments were adjusted to pass this information to
the simulation program. Table 3 lists the ten⁶ application programs used in the evaluations. The
applications were cross-compiled on a SPARCserver 490 and run on a DECstation 3100 with read
buffer sizes ranging from 0 to 20 (note that 20 represents the maximum read buffer size of 2N).
5.2 Evaluation Results
5.2.1 Detailed analysis: QUEEN
Figure 16 shows changes in performance overhead (Cycles OH) for various read buffer sizes and
configurations running the QUEEN application. Looking at Figure 16, configuration A1, it can
be seen that significant performance impact is incurred even with a modest reduction in read
buffer size. Configuration A1 was consistently the least efficient of the six configurations across
the ten applications studied.⁷ This is due to the fact that the dual FIFOs are dedicated to a
single source bus. In many cases saving S1 will cause an overflow because the S1 FIFO is full, even
though there is room in the S2 FIFO. Configuration A1 does allow for simultaneous saves of S1 and
S2, given sufficient room in each, but this feature does not compensate for the latter inefficiency.
⁶ The TBL application was not included in the read buffer size evaluation.
⁷ An efficient configuration is one with a low performance overhead given a small read buffer size.
[Figure 16 plots: Cycles OH versus Read Buffer Size (0 to 20) for QUEEN; one panel with curves Conf. A1, A2, and B1, and one panel with curves Conf. B2, C, and D.]
Figure 16: Cycle overhead: QUEEN.
Configuration A2 demonstrates the improvement gained by allowing either source bus access to
either FIFO. Configuration B1 was the most efficient of the six configurations for the QUEEN
application. In this configuration a total read buffer size of 13 would produce zero performance
impact with a 35% reduction in read buffer size.
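The difference between dedicated and shared FIFOs can be seen in a simplified overflow count. The Python sketch below is a hypothetical model, not the simulator used in the study: it tracks only how many entries each FIFO holds, retires entries after N instructions, assumes a depth of at least one, and in the shared case sends a lone save to whichever FIFO has more room.

```python
def count_overflows(trace, depth, n, shared):
    """Count overflows for a dual-FIFO read buffer over an instruction trace.

    trace  -- one (save_s1, save_s2) pair of booleans per instruction
    depth  -- entries per FIFO (assumed >= 1)
    n      -- rollback distance; entries older than n instructions retire
    shared -- False: each bus owns one FIFO (as in configuration A1);
              True: a lone save may use either FIFO (as in configuration A2)
    """
    fifos = [[], []]          # instruction indices of the saved values
    overflows = 0
    for i, needs in enumerate(trace):
        for fifo in fifos:    # retire values no longer needed for rollback
            fifo[:] = [t for t in fifo if t > i - n]
        buses = [bus for bus in (0, 1) if needs[bus]]
        if shared and len(buses) == 1:
            # A lone save goes to whichever FIFO currently has more room.
            buses = [min((0, 1), key=lambda j: len(fifos[j]))]
        for target in buses:
            if len(fifos[target]) >= depth:
                overflows += 1            # oldest value is pushed to memory
                fifos[target].pop(0)
            fifos[target].append(i)
    return overflows


# Only S1 needs saving in this hypothetical trace of eight instructions.
trace = [(True, False)] * 8
print(count_overflows(trace, depth=3, n=10, shared=False))  # 5
print(count_overflows(trace, depth=3, n=10, shared=True))   # 2
```

With the same total capacity of six entries, the dedicated arrangement overflows as soon as the S1 FIFO fills, while the shared arrangement uses the idle S2 FIFO first.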
It should be noted that configuration B1 assumes that simultaneous saves of S1 and S2 can be
handled within the same cycle. If this latter assumption is invalid, Figure 16, configuration B2,
shows that a performance impact of no less than 9.4% is incurred regardless of the read buffer size. The
"leveling off" of B2 is due to the bottleneck at the single FIFO entry point and not the depth of
the FIFO. The flat part of the curve shows the percent of instructions requiring simultaneous saves
of S1 and S2 in the QUEEN application.
Figure 16, configuration C, shows how a single level dual queue placed between the source bus
and the single FIFO can alleviate some of the bottleneck effects. The dual queue can absorb a single
simultaneous save of S1 and S2, distributing the saves over multiple cycles. A nonzero minimum
performance overhead is still present due to cases in which the dual queue has not emptied before
the next simultaneous save occurs.
Figure 16, configuration D, shows the results of an improved queue structure which permits
saves from either bus into either queue. This configuration avoids stalls in some cases (e.g., S2
must be saved while the queue dedicated to S2 in configuration C is full and the other queue
is empty). Configuration D also has a nonzero minimum performance overhead but gives better
performance than configuration C.
Table 4: Read buffer size evaluation summary.
             RB_size       OH_level (%)
Program      A2    B1      A2      B1
QUEEN        14    12      1.66    1.36
WC           10     8      0.00    2.54
QSORT        16    15      2.28    0.94
CMP          12    11      0.00    0.00
GREP         10    10      0.18    0.18
PUZZLE       10     9      2.87    0.32
COMPRESS     12    12      2.87    1.12
LEX          12    12      2.73    1.55
YACC         16    15      1.07    0.00
CCCP         12    12      2.34    1.74
The simulation results for QUEEN show that configuration A1 is the least efficient and that,
given the ability to do split-cycle-saves, configuration B1 is the most efficient. Without the
split-cycle-save capability, configuration D is the best of the single FIFO designs, resulting in a minimum
performance overhead of 4.5%, and configuration A2 is the best of the dual FIFO designs, resulting
in a 1.7% performance overhead with a read buffer size of 14. For configurations B1, B2, C, and
D, a total read buffer size of 13 is sufficient to maximize performance.⁸
5.2.2 Evaluation of all application programs
Results for the other nine application programs are similar to those for QUEEN [17]. The differences
between the application results are the points at which the curve "levels off" (i.e., the buffer size)
and, in the case of configurations B2 through D, at what level the performance overhead stabilizes.
Table 4 summarizes measurements obtained for the ten applications given the two most efficient
configurations, A2 and B1. It is assumed for this study that minimal performance overhead can be
tolerated as a result of read buffer size reduction. For this reason, configuration comparisons are
made at read buffer size values which produce low values of performance overhead. Configuration
A2 does not level off like configuration D and does not rapidly approach zero like configuration
B1. For a better comparison of configurations A2 and B1, Table 4 gives the read buffer size value
where the performance overhead value drops below 3%. The read buffer size value is referred to as
RB_size and the performance overhead value is referred to as OH_level.
⁸ Two must be added to each read buffer size value in C and D to account for the queues.
It can be seen from Table 4 that the read buffer size requirement is roughly the same, per
application, regardless of the split-cycle-save assumption (i.e., comparing configurations A2 and
B1). The size requirement is application dependent, ranging from 8 for WC to 15 for QSORT and YACC.
The measurements show that a considerable reduction in read buffer size is achievable. Given the
split-cycle-save assumption and configuration B1, a minimum of 25%, a maximum of 60%, and an
average of 42% reduction was achieved. For configuration A2 and no split-cycle-save assumption,
a minimum of 20%, a maximum of 50%, and an average of 38.0% reduction was achieved. The
measurements indicate that care should be taken relative to the ultimate selection of read buffer
size. Given the steepness of the B1 curve around the RB_size value, small decreases in size can
produce large performance overheads.
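The reduction figures quoted above follow directly from the RB_size columns of Table 4 when each value is compared with the maximum read buffer size of 2N = 20. The short Python check below reproduces them; the dictionary layout and function name are illustrative only.

```python
MAX_SIZE = 20  # 2N for a rollback distance of N = 10

# RB_size values from Table 4 as (A2, B1) per program.
rb_size = {
    "QUEEN": (14, 12), "WC": (10, 8), "QSORT": (16, 15), "CMP": (12, 11),
    "GREP": (10, 10), "PUZZLE": (10, 9), "COMPRESS": (12, 12),
    "LEX": (12, 12), "YACC": (16, 15), "CCCP": (12, 12),
}


def reduction_stats(column):
    """Minimum, maximum, and mean percent size reduction for one column."""
    cuts = [100 * (MAX_SIZE - sizes[column]) / MAX_SIZE
            for sizes in rb_size.values()]
    return min(cuts), max(cuts), sum(cuts) / len(cuts)


print(reduction_stats(0))  # A2: (20.0, 50.0, 38.0)
print(reduction_stats(1))  # B1: (25.0, 60.0, 42.0)
```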
5.2.3 Read buffer size requirement summary
Results show that two read buffer configurations were the most efficient. A dual FIFO with source
bus access to each (configuration A2) and the single FIFO with the split-cycle-save capability
(configuration B1) consistently out-performed the other four configurations. Both the buffer size
required for minimum performance impact and the performance stabilization value assuming no
split-cycle-save capability varied moderately among the ten applications studied.
Up to a 55% read buffer size reduction was achieved with an average reduction of 39.5% given the
most efficient read buffer configuration for the applications. It was also found that given the
split-cycle-save assumption and single FIFO configuration, significant changes in the performance
overhead result from small changes in the read buffer size. Our results indicate that care should be
taken in the final selection of read buffer size in any given design.
6 Concluding Remarks
This paper has presented a compiler-assisted multiple instruction rollback scheme which combines
compiler-driven data-flow manipulations with dedicated data redundancy hardware to remove data
hazards that result from multiple instruction rollback. Experimental evaluation of the proposed
compiler-assisted scheme with a maximum rollback distance of ten showed performance impacts of
no more than 6.57% and an average impact of 1.80%, over the eleven application programs studied.
The performance evaluation indicates lower performance penalties than for previous compiler-only
approaches or comparable hardware-only approaches. Six read buffer configurations were studied
to determine the minimum size requirement for general applications. It was found that a 55% read
buffer size reduction is achievable with an average reduction of 39.5%, but that additional control
logic to handle read buffer overflows may limit the overall hardware savings.
Future research includes application of compiler-assisted multiple instruction rollback recovery
to super-scalar, VLIW, and parallel processing architectures. Evaluations of compiler-assisted
rollback recovery applied to speculative execution repair would include modifying compiler
transformations to operate in a super-scalar and VLIW environment.
7 Acknowledgements
The authors wish to thank C.-C. Jim Li for his help with the compiler aspects of this paper, and
Scott Mahlke and William Chen for their invaluable assistance with the IMPACT compiler. We
also express our thanks to Janak Patel for his contributions to this research.
References
[1] M. S. Pittler, D. M. Powers, and D. L. Schnabel, "System Development and Technology
Aspects of the IBM 3081 Processor Complex," IBM J. Res. Dev., vol. 26, pp. 2-11, Jan. 1982.
[2] Y. Tamir and M. Tremblay, "High-Performance Fault-Tolerant VLSI Systems Using Micro
Rollback," IEEE Trans. Comput., vol. 39, pp. 548-554, Apr. 1990.
[3] C.-C. J. Li, S.-K. Chen, W. K. Fuchs, and W.-M. W. Hwu, "Compiler-Assisted Multiple
Instruction Retry," Tech. Rep. CRHC-91-31, Coordinated Science Laboratory, University of
Illinois, May 1991.
[4] N. J. Alewine, S.-K. Chen, C.-C. J. Li, W. K. Fuchs, and W.-M. W. Hwu, "Branch Recovery
with Compiler-Assisted Multiple Instruction Retry," in Proc. 22nd Int. Symp. Fault-Tolerant
Comput., pp. 66-73, July 1992.
[5] L. Spainhower, J. Isenberg, R. Chillarege, and J. Berding, "Design for Fault-Tolerance in
System ES/9000 Model 900," in Proc. 22nd Int. Symp. Fault-Tolerant Comput., pp. 38-47,
July 1992.
[6] P. M. Kogge, K. T. Truong, D. A. Richard, and R. L. Schoenike, "Checkpoint Retry
Mechanism." United States Patent, no. 4912707, Mar. 1990. Assignee: International Business
Machines Corporation, Armonk, N.Y.
[7] Y. Tamir, M. Liang, T. Lai, and M. Tremblay, "The UCLA Mirror Processor: A Building Block
for Self-Checking Self-Repairing Computing Nodes," in Proc. 21st Int. Symp. Fault-Tolerant
Comput., pp. 178-185, June 1991.
[8] J. E. Smith and A. R. Pleszkun, "Implementing Precise Interrupts in Pipelined Processors,"
IEEE Trans. Comput., vol. 37, pp. 562-573, May 1988.
[9] M. L. Ciacelli, "Fault Handling on the IBM 4341 Processor," in Proc. 11th Int. Symp.
Fault-Tolerant Comput., pp. 9-12, June 1981.
[10] W. F. Bruckert and R. E. Josephson, "Designing Reliability into the VAX 8600 System,"
Digital Tech. J. Digital Equip. Corp., vol. 1, no. 1, pp. 71-77, Aug. 1985.
[11] G. L. Hicks, D. Howe, Jr., and A. Zurla, Jr., "Instruction Retry Mechanism for a Data
Processing System." United States Patent, no. 4044337, Aug. 1977. Assignee: International Business
Machines Corporation, Armonk, N.Y.
[12] D. B. Fite, T. Fossum, and D. Manley, "Design Strategy for the VAX 9000 System," Digital
Tech. J. Digital Equip. Corp., vol. 2, no. 4, pp. 13-24, Fall 1990.
[13] E. B. Eichelberger and T. W. Williams, "A Logic Design Structure for LSI Testability," in
Proc. 14th Design Autom. Conf., pp. 462-468, 1977.
[14] J. S. Liptay, "The ES/9000 High End Processor Design," IBM J. Res. Dev., vol. 36, no. 3,
May 1992.
[15] A. V. Aho, R. Sethi, and J. D. Ullman, Compilers: Principles, Techniques, and Tools. Reading,
MA: Addison-Wesley, 1986.
[16] J. A. Bondy and U. Murty, Graph Theory with Applications. London, England: Macmillan
Press Ltd., 1979.
[17] N. J. Alewine, Compiler-assisted Multiple Instruction Rollback Recovery using a Read Buffer.
PhD thesis, Tech. Rep. CRHC-93-06, University of Illinois at Urbana-Champaign, 1993.
[18] P. Chang, W. Chen, N. Warter, and W.-M. W. Hwu, "IMPACT: An Architecture Framework
for Multiple-Instruction-Issue Processors," in Proc. 18th Annu. Symp. Comput. Architecture,
pp. 266-275, May 1991.
[19] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach. San
Mateo, CA: Morgan Kaufmann Publishers, Inc., 1990.
