Hardware-Based Instruction Rollback
Hardware-implemented instruction retry schemes belong to one of two groups: 1) full checkpointing and 2) incremental checkpointing. Full checkpointing maintains "snapshots" of the required system state space at regular, or predetermined, intervals. Upon error detection, the system can be rolled back to the appropriate checkpointed system state. Incremental checkpointing maintains changes to the system state in a "sliding window". Upon error detection, the system state is restored by undoing, or "backing out", the system state changes up to the instruction in which the error occurred.
The issues associated with instruction retry are similar to the issues encountered with exception handling in an out-of-order instruction execution architecture. If an instruction is to write to a register and N is the maximum error detection latency (or exception latency), two copies of the data must be maintained for N cycles. Hardware schemes such as reorder buffers, history buffers, future files [8], and micro-rollback [2] differ in where the updated and old values reside, circuit complexity, CPU cycle times, and rollback efficiency. Table 1 gives a description of various hardware-based methods to restore the general purpose register file contents during single or multiple instruction rollback. In the VAX 8600 and VAX 9000, errors are detected prior to the completion of a faulty instruction. For most VAX instructions, updates to the system state occur at the end of the instruction. If the error is detected prior to the updating of the system state, the instruction can be rolled back and re-executed. If the system
The micro-rollback scheme also avoids shadow files by using a delayed write buffer to prevent old data from being overwritten until the error detection latency has expired, ensuring that the new data is fault-free. In a delayed write scheme, the most recent write values are contained in the delayed write buffer, and bypass circuitry is required to forward this data on subsequent reads.
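The delayed write behavior described above can be illustrated with a small software model. This is only a sketch under stated assumptions: the class name `DelayedWriteRegisterFile`, its methods, and the one-write-per-call interface are hypothetical and are not taken from the micro-rollback design in [2]; rollback here simply discards every uncommitted write.

```python
from collections import deque


class DelayedWriteRegisterFile:
    """Hypothetical model: writes wait `latency` cycles before committing."""

    def __init__(self, num_regs, latency):
        self.regs = [0] * num_regs
        self.latency = latency          # maximum error detection latency N
        self.pending = deque()          # (cycle_issued, reg, value), oldest first
        self.cycle = 0

    def tick(self):
        """Advance one cycle and commit writes that have outlived the latency."""
        self.cycle += 1
        while self.pending and self.cycle - self.pending[0][0] >= self.latency:
            _, reg, value = self.pending.popleft()
            self.regs[reg] = value

    def write(self, reg, value):
        """Queue a write; the register file itself is not touched yet."""
        self.pending.append((self.cycle, reg, value))

    def read(self, reg):
        """Bypass path: the newest pending write to `reg` is forwarded."""
        for _, pending_reg, value in reversed(self.pending):
            if pending_reg == reg:
                return value
        return self.regs[reg]

    def rollback(self):
        """Drop all uncommitted writes; committed state is assumed fault-free."""
        self.pending.clear()
```

In this model a read immediately after a write sees the new value through the bypass, while the register file keeps the old value until `latency` calls to `tick()` have elapsed.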
Hazard Classification
The code can be represented as a CFG G(V, E), where V is the set of nodes denoting instructions and E is the set of edges denoting control flow. An on-path or branch data hazard occurs when Ii defines variable z, and after rollback, Ij uses the corrupted z value prior to its being redefined. To simplify subsequent discussion, such on-path and branch hazards will be denoted ho(i, j, z) and hs(i, j, z), respectively. Figure 1 illustrates this hazard notation.
3 Compiler-Assisted Instruction Rollback
As shown in Section 2, rollback data hazards are of two types: 1) on-path hazards, and 2) branch hazards. Previous work has shown that compiler-driven data-flow manipulations can be used to resolve both on-path [3] and branch [4] hazards. Compiler-assisted multiple instruction rollback described in this section uses hardware to resolve on-path hazards and relies on compiler assistance to resolve the remaining branch hazards.
Figure 1: On-path and branch hazards.
3.1 On-path Hazard Resolution Using a Read Buffer
Figure 2 shows a hardware scheme to resolve on-path hazards. A read buffer is attached to the output ports of the register file. Each time a register is used, it appears on the read port and is saved in the read buffer. If a register rk is defined in Ii and it is an on-path hazard, then rk must have been read within the last N cycles. In this case, the read buffer will contain the old value and it is permissible to write the new value into the register file. In the event of a rollback of N instructions, the contents of the read buffer are flushed in reverse order and stored back to the register file. For an on-path hazard, the path taken after the rollback will be the same as the path taken prior to rollback, and each read of rk will produce the same value as before. It is assumed that the read buffer is an integral part of the register file and that any error in the system does not corrupt the transfer to the read buffer or its contents.
In contrast to a write history buffer, which forces a read of rk prior to writing rk, the read buffer
Figure 2: Read buffer.
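The read buffer mechanism of Section 3.1 can be sketched in software as follows. This is an illustrative model only: the class `ReadBufferRegisterFile`, the per-instruction `retire()` call, and the use of one buffer entry per instruction are assumptions made for the example and do not describe the hardware organization of Figure 2.

```python
from collections import deque


class ReadBufferRegisterFile:
    """Hypothetical model of a register file with an attached read buffer."""

    def __init__(self, num_regs, depth):
        self.regs = [0] * num_regs
        # One entry per instruction: the (reg, value) pairs that it read.
        # maxlen=depth makes the buffer a sliding window of the last N entries.
        self.read_buffer = deque(maxlen=depth)
        self._current = []

    def read(self, reg):
        """Every value placed on a read port is also saved for rollback."""
        value = self.regs[reg]
        self._current.append((reg, value))
        return value

    def write(self, reg, value):
        """Writes go straight to the register file; no forced read is needed."""
        self.regs[reg] = value

    def retire(self):
        """Close the current instruction and push its reads into the window."""
        self.read_buffer.append(self._current)
        self._current = []

    def rollback(self):
        """Flush the buffer in reverse order, storing old values back."""
        self._current = []
        while self.read_buffer:
            for reg, value in reversed(self.read_buffer.pop()):
                self.regs[reg] = value
```

Because entries are restored newest first, the oldest saved value of a register is written last, so the register ends up holding the value it had when it was first read inside the window. A register that was written without being read in the window is not restored, which matches the observation that such a write is not an on-path hazard.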
Covering on-path hazards
In addition to resolving all on-path hazards, the read buffer will resolve some branch hazards. Figure 3 shows an on-path hazard and a branch hazard, both with definitions of z in Ii and uses of z, after rollback, in instructions Ij and lj, respectively. Note that if path l is initially taken, the read buffer will contain the old value of z and rollback would be successful. However, if path m is taken, the read buffer will not contain the old value of z and rollback would be unsuccessful. If only paths such as l exist, the presence of the on-path hazard assures successful rollback, or "covers" the branch hazard. In this case, resolution of the branch hazard using compiler techniques is not necessary.
Post-pass transformation
Given the efficiency of the read buffer in resolving on-path hazards, 
Node splitting using graph coloring
To ensure minimal splitting, a new node splitting algorithm is developed using the concept of conflicting parents [17]. Ensuring that node n does not have conflicting parents enables resolution of the hazard using variable renaming. The node splitting strategy for a particular node is to group the parents of that node such that elements within a group do not conflict. Each group becomes the set of parent nodes for a duplicate of the original node. For example, if node n has six parent nodes and these nodes can be organized into three nonconflicting groups, then only three total copies of n are required. Figure 5 illustrates the use of conflicting parents and graph coloring in node splitting for the QSORT application described in
[Figure 5 panels: node 48 before splitting; parent conflict graph.]
Node 48, 48', and 48'' after splitting have the same color. For the example shown in Figure 5, three colors are sufficient to color the parent conflict graph, resulting in the splitting of node 48 into nodes 48, 48', and 48''. Determining whether a graph is k-colorable is NP-complete in general. The graph coloring heuristic used for our one-pass node splitting algorithm is a modified version of an algorithm used for register allocation [15].
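The grouping of parents into nonconflicting sets can be sketched with a plain greedy coloring of the parent conflict graph. This is not the modified register allocation heuristic of [15]; the function name `group_parents`, the edge-list input format, and the highest-degree-first ordering are assumptions chosen for illustration.

```python
def group_parents(parents, conflicts):
    """Greedy-color the parent conflict graph; each color is one parent group.

    parents:   list of parent node identifiers (hypothetical input format)
    conflicts: iterable of (a, b) pairs of parents that conflict
    """
    adjacency = {p: set() for p in parents}
    for a, b in conflicts:
        adjacency[a].add(b)
        adjacency[b].add(a)

    color = {}
    # Color the most constrained parents first (a common greedy ordering).
    for p in sorted(parents, key=lambda q: -len(adjacency[q])):
        used = {color[q] for q in adjacency[p] if q in color}
        c = 0
        while c in used:
            c += 1
        color[p] = c

    groups = {}
    for p in parents:
        groups.setdefault(color[p], []).append(p)
    # One copy of the split node is needed per group.
    return [groups[c] for c in sorted(groups)]
```

With six parents and the conflicts (1,2), (2,3), (1,3), and (4,5), this sketch produces three groups, so three copies of the node would be needed. A greedy coloring does not guarantee the minimum number of colors, which is consistent with the problem being NP-complete in general.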
3.2.3 One-pass node splitting algorithm
Both live_in(n) and reaching_out(n) analyses are required to identify conflicting parent nodes. A one-pass node splitting algorithm becomes possible by precalculating live_in and the hazard node set, and then, beginning with the root node, splitting in a topological traversal of the CFG. A topological traversal ensures that when processing node n, all ancestors of n have been processed and no descendants of n have been processed. This latter case ensures that the presplit calculation of live_in(n) can be used for parent conflict identification when processing a given node. Unlike live_in(n), reaching_out(n) is based solely on node n and its ancestors, so it can be calculated as node splitting proceeds. If a hazard node is split, each duplicate of the node must be added to the hazard node set. Since the root node does not have conflicting parents, a topological traversal of the CFG using the graph coloring node splitting technique ensures that no node in the resulting graph has conflicting parents. Table 2 illustrates the improvement of the one-pass node splitting algorithm over the iterative algorithm for the COMPRESS application described in Table 3.
If the loop and hazard instruction execution frequencies were reversed, then read insertion would produce more performance impact than loop protection. As shown in Figure 7, profiling data can be used to aid in loop protection decisions.
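The topological traversal used by the one-pass node splitting algorithm of this subsection can be sketched as below. The sketch assumes an acyclic graph in which every node appears as a key of `successors`; how loop back edges are handled is not stated in the text above, so ignoring them is an assumption of this example, and `topological_order` is a hypothetical helper name.

```python
from collections import deque


def topological_order(successors, root):
    """Return nodes so that each node follows all of its ancestors.

    successors: dict mapping every node to a list of its successor nodes
                (assumed acyclic; back edges are assumed already removed)
    root:       entry node of the CFG, which has no parents
    """
    indegree = {n: 0 for n in successors}
    for n in successors:
        for s in successors[n]:
            indegree[s] += 1

    ready = deque([root])
    order = []
    while ready:
        n = ready.popleft()
        order.append(n)   # a node splitting step would process n here
        for s in successors[n]:
            indegree[s] -= 1
            if indegree[s] == 0:   # all parents of s have been processed
                ready.append(s)
    return order
```

Processing nodes in this order means that when a node is visited, the splitting decisions for all of its parents are already final, which is the property the one-pass algorithm relies on.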
Profiling effectiveness
Profiled data was included in the pseudo-level transformations of Section 3.2. The profile data is
The results show that the use of profile data can improve application performance by postponing some hazard resolutions until the post-pass phase. Using profile data to aid in loop protection decisions did not produce performance equal to that for the post-pass transformation, for the TBL application. As an extension to this work, profile data can be used to aid in register allocation. As discussed in Section 3.2, hazards that are present after pseudo register renaming are resolved by adding hazard constraints to live range constraints prior to register allocation. These additional constraints can cause increased register spillage and impact performance. Similar techniques to those developed for loop protection can be used to enhance register allocation decisions.
Results: Compiler
As can be seen in Figures 9 through 11, extending the compiler hazard resolution scheme to include branch hazards introduces little incremental performance impact or code growth overhead. Given a rollback distance of 10, resolving both on-path and branch hazards using compiler transformations resulted in a maximum performance impact of 32.6% and an average performance impact of 12.6%.
This compares with maximum and average impacts of 35.4% and 15.4%, respectively, for compiler-driven on-path hazard resolution only. The maximum code size overhead measured for the extended compiler-based technique was 328% with an average overhead of 207%, for a rollback distance of 10. This compares with a maximum and average overhead of 372% and 225%, respectively, for the unextended compiler-based scheme.
These results indicate a small incremental run-time performance overhead and a small code size overhead given compiler-based branch hazard removal compared to compiler-based on-path hazard removal alone. Three factors account for these small incremental impacts. First, on-path hazards dominate in frequency of occurrence. Second, resolving an on-path hazard at instruction Ii through renaming can sometimes resolve a branch hazard at instruction Ii. Third, resolving on-path hazards with nop insertion may resolve a corresponding branch hazard by increasing the distance between the hazard node and its nearest predecessor branch node.
Results: PP
Figures 9 through 13 show the run-time and code size overheads for each application studied using the read buffer to resolve on-path hazards and the post-pass transformation described in Section 3 to cover all branch hazards. The results are worst case in that many of the branch hazards could have been resolved with no performance impact using the compiler techniques; instead, they are resolved by the insertion of MOV instructions, which cause a guaranteed, although small, performance impact. Given a rollback distance of 10, the post-pass transformation produced a maximum performance impact of 7.69% with an average performance impact of 2.43%, significantly below the levels produced by the compiler-based scheme. Code growth overhead measurements were correspondingly lower, with a maximum overhead of 13.0% and an average overhead of 8.59%.
Results: Comp/PP
The compiler-assisted scheme achieved consistently low performance overheads across all applications and slightly better performance than with the post-pass transformation only. Given a rollback distance of 10, the compiler-assisted scheme produced a maximum performance impact of 6.57% with an average performance impact of 2.03%, and a maximum code growth overhead of 51.2%
only the required values are saved, the read buffer total size can now potentially be less than N.
In this case, however, the instruction count must also be saved so that the value can be maintained for at least N cycles. In the event that the read buffer overflows, the oldest value in the buffer must be pushed to memory and a record kept so that during rollback the value can be retrieved from memory.
Evaluation Results
Detailed analysis: QUEEN
Figure 16 shows changes in performance overhead (Cycles OH) for various read buffer sizes and configurations running the QUEEN application. Looking at Figure 16, configuration A1, it can be seen that significant performance impact is incurred even with a modest reduction in read buffer size. Configuration A1 was consistently the least efficient of the six configurations across the ten applications studied. This is due to the fact that the dual FIFOs are each dedicated to a single source bus. In many cases saving S1 will cause an overflow because the S1 FIFO is full, even though there is room in the S2 FIFO. Configuration A1 does allow for simultaneous saves of S1 and S2, given sufficient room in each, but this feature does not compensate for the latter inefficiency. It should be noted that configuration B1 assumes that simultaneous saves of S1 and S2 can be handled within the same cycle. If this latter assumption is invalid, Figure 16, configuration B2, shows that a performance impact of no less than 9.4% is incurred regardless of the read buffer size. The "leveling off" of B2 is due to the bottleneck at the single FIFO entry point and not the depth of the FIFO. The flat part of the curve shows the percent of instructions requiring simultaneous saves of S1 and S2 in the QUEEN application. Figure 16, configuration C, shows how a single level dual queue placed between the source bus and the single FIFO can alleviate some of the bottleneck effects. The dual queue can absorb a single simultaneous save of S1 and S2, distributing the saves over multiple cycles. A nonzero minimum performance overhead is still present due to cases in which the dual queue has not emptied before the next simultaneous save occurs. Figure 16, configuration D, shows the results of an improved queue structure which permits saves from either bus into either queue. This configuration avoids stalls in some cases (e.g., S2 must be saved while the queue dedicated to S2 in configuration C is full and the other queue is empty). Configuration D also has a nonzero minimum performance overhead but gives better
Evaluation of all application programs
Results for the other nine application programs are similar to those for QUEEN [17]. The differences between the application results are the points at which the curve "levels off" (i.e., the buffer size) and, in the case of configurations B2 through D, at what level the performance overhead stabilizes. Up to a 55% read buffer size reduction was achieved with an average reduction of 39.5% given the most efficient read buffer configuration for the applications. It was also found that given the split-cycle-save assumption and single FIFO configuration, significant changes in the performance overhead result from small changes in the read buffer size. Our results indicate that care should be taken in the final selection of read buffer size in any given design.
Concluding Remarks
This paper has presented a compiler-assisted multiple instruction rollback scheme which combines compiler-driven data-flow manipulations with dedicated data redundancy hardware to remove data hazards that result from multiple instruction rollback. Experimental evaluation of the proposed compiler-assisted scheme with a maximum rollback distance of ten showed performance impacts of no more than 6.57% and an average impact of 1.80%, over the eleven application programs studied.
The performance evaluation indicates lower performance penalties than for previous compiler-only approaches or comparable hardware-only approaches. Six read buffer configurations were studied to determine the minimum size requirement for general applications. It was found that a 55% read buffer size reduction is achievable with an average reduction of 39.5%, but that additional control logic to handle read buffer overflows may limit the overall hardware savings.
Future research includes application of compiler-assisted multiple instruction rollback recovery to super-scalar, VLIW, and parallel processing architectures. Evaluations of compiler-assisted rollback recovery applied to speculative execution repair would include modifying compiler transformations to operate in a super-scalar and VLIW environment.
