Abstract
Introduction
Processor architectures are currently driven by the fast pace of changes in technology. This development is keeping pace with Moore's law, which states that computer speed and memory capacity double every 18 months. Technological advances in general purpose processors (GPPs) and on-chip memories have significantly increased the speed and density of these components. Progress has been made while maintaining, or even increasing, the reliability of the individual components.
The reliability of future computer systems will be threatened by a variety of factors. One factor is the occurrence of soft errors. An error can simply be defined as an inappropriate change in the value of a signal (from high to low, or vice versa). Soft errors are arbitrary, transient errors caused by unstable environmental conditions.
The research reported in this paper was supported in part by the NSF under Grant MIP 989602.5.
Although modern VLSI chip fabrication methods have improved performance dramatically, they also make chips more prone to soft errors. Microprocessor design goals include low power consumption and high performance. Designers are rewarded for systems with fast switching and low voltage levels, and tolerance levels of circuit designs are very small. Each new generation of technology pushes the envelope just a little bit further. Downward scaling of circuits imposes many design restrictions [1]. As this continues, soft errors due to environmental factors (e.g., cosmic rays) will become more frequent.
Soft errors do not necessarily translate into system failures, making it hard to measure their frequency of occurrence. The location and timing of the error determine the final effect. A given soft error may be inconsequential or may propagate for a certain time without affecting a computational result or the flow of instructions through the pipeline. Since memories have always been more susceptible to errors than processor pipelines, some recent research has focused on estimating the soft error rate in the context of memory technology [2, 3]. Traditionally, error-correcting codes (ECCs) have been used to ensure memory reliability [4]. ECCs can drastically reduce the chance that an error will cause system failure. Currently, a given single bit is expected to flip in a RAM only once every many billions of hours of operation. However, with the growing sizes of RAMs and other hardware components, one can expect the error frequency to become much more noticeable in the near future.
The bit error rate (BER) in a processor can be expected to be about ten times higher than in a memory chip due to the higher complexity of the processor. At current operating speeds, a processor experiences a bit flip once every 10 hours. Although not all soft errors result in a failure, the fact that even a single-bit error may cause fatal damage generates a need for error detection and recovery techniques for the system. As we become more dependent on computers, higher demand for affordable, yet dependable, computing is expected. In his recent article [5], John Hennessy, one of the leading computer architects of the last two decades, points out that we need to change the performance-centric processor design to "other performance"-centric design. The "other" category includes reliability, maintainability, and availability as the main parameters. This recommendation would represent a significant shift in processor design.
Our Solution
Many high-end computer systems already include soft-error tolerance. They implement it by replicating hardware, so that performance is not sacrificed. In today's competitive market, GPPs cannot afford decreased performance or large amounts of extra hardware. GPPs must be able to implement soft-error tolerance with minimal impacts to both die size and performance.
One method of incorporating soft-error detection capability in a processor system is to make use of the system's idle capacity. Idle capacity exists when a functional unit within a processor is not carrying out any computation during a certain period of time. The functional unit may be idle for a variety of reasons. During this idle time, the unit can be used for other purposes without decreasing overall processor performance. If idle capacity is not sufficient, spare capacity (extra hardware) can be added to meet performance goals. This paper uses functional units to add spare capacity.
In today's microprocessors, a large amount of idle capacity exists (30-40% of execution time). Idle capacity can be used to perform recomputation, which can detect the presence of errors in the system. However, this time is not sufficient to duplicate the entire computation. Therefore, if recomputation became a standard feature in computer systems, it might significantly increase the time needed to execute programs. As an alternative, one may perform selective recomputation or include additional hardware (such as functional units) to compensate for the performance loss. This would meet both performance and reliability goals. However, both methods have problems. Selective recomputation would not detect all soft errors, and adding a large amount of hardware is not a cost-effective solution.
However, a limited amount of spare capacity may be added to maintain performance standards without much extra cost.
In this paper, we propose and implement a scheme called REESE (REdundant Execution using Spare Elements). REESE is a microarchitectural method of detecting soft errors. It has two components. First, instruction recomputation capabilities are added to the microprocessor pipeline. Recomputation is an appropriate method for detecting the occurrence of soft errors because it can detect faults that are short-lived. Implementing recomputation requires no extra hardware, so chip area does not have to be increased. This method of soft-error tolerance is also sensitive to performance factors. Since performance is a top priority for GPPs, the second part of REESE includes adding a minimal amount of hardware to ensure that recomputation does not overly inhibit performance. When implementing REESE, we focussed on the following question: How much additional hardware is needed to incorporate soft error detection into a superscalar processor without increasing overall execution time? This is in contrast to past research which identifies the amount of overhead necessary to implement fault tolerance but does not attempt to decrease this overhead.
Paper Organization
Section 2 of this paper discusses the concept of time redundancy. Section 3 gives details concerning research related to instruction re-execution and soft error detection. The next section outlines how REESE can be implemented in a microprocessor pipeline to detect errors while speedily executing the extra computation. Section 5 gives details about the simulation environment. Test results and analysis are shown in Section 6. Section 7 presents a summary and conclusions.
Time Redundancy and Error Detection
To be fault tolerant, a microprocessor must be able to cope with all possible sources of errors. Time redundancy is one method of accomplishing this. In a processor with time redundancy, computations are performed multiple times on the same hardware [6, 7]. One advantage of this approach is that no new hardware is necessary to implement it. However, multiple executions require additional time and/or hardware to complete.
Executing a computation two times and comparing the results adds a degree of fault tolerance to the system. This comparison allows only for error detection, not error correction. In this paper, we define an instruction to belong to the primary instruction stream, or P stream, during its first execution. When the instruction is re-executed, it belongs to the redundant instruction stream, or R stream. If a soft error affects the result of a P-stream instruction, the error can be detected by the copy of the instruction in the R stream.
As the time period between P-stream and R-stream executions of an instruction grows, the probability that the R-stream version will be susceptible to the same environmental cause of a soft error decreases. Therefore, the R-stream execution of an instruction should not occur too soon after its P-stream execution. In other words, if the cause of a soft error is present for time Δt, then detection of the soft error is only guaranteed if the P-stream and R-stream executions are separated by a time greater than Δt. If the executions are separated by a smaller time period, then both might be susceptible to the same soft error. This paper is concerned with infrequent, short-lived soft errors.
A longer separation between P-stream and R-stream instruction execution will also cause commitment of every instruction to be delayed by at least Δt. All time-redundant schemes must balance the desire to detect all transient errors with the need to keep the time to instruction commit as low as possible. REESE attempts to maintain a good balance between these two factors.
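The separation condition can be illustrated with a small sketch. The function name, the interval model of a transient fault, and the cycle numbers used with it are illustrative assumptions, not part of REESE itself.

```python
def detection_guaranteed(p_time, r_time, fault_start, fault_duration):
    """Return True if at most one of the two executions falls inside the fault window.

    Assumption (not from the paper): a transient fault is modeled as the
    closed interval [fault_start, fault_start + fault_duration] on a cycle
    timeline, and an execution can be corrupted only inside that interval.
    """
    fault_end = fault_start + fault_duration
    p_hit = fault_start <= p_time <= fault_end
    r_hit = fault_start <= r_time <= fault_end
    # If both executions are hit, they may produce the same wrong result,
    # so the comparison could pass and the error could go undetected.
    return not (p_hit and r_hit)
```

Under this model, any P/R separation larger than the fault duration makes it impossible for both executions to fall inside the same fault window, which matches the condition stated above.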
Related Work
Time redundancy is only one of many methods of detecting soft errors. Other methods include using software for recomputation or code replication (software redundancy) and performing a computation on different hardware elements (hardware redundancy).
Software redundancy can detect errors when redundancy is introduced at the code level. One advantage of this scheme is that fault tolerance may be directly implemented at the source code level without modifying the underlying hardware structure. However, one recent implementation of this method in [8] caused a doubling of code size and a slowdown of five times compared to normal program execution. This slowdown would be unacceptable for GPPs.
Self-checking circuits [9, 10] use hardware redundancy to provide error coverage. However, this technique cannot usually be applied globally to the processor, and the extra hardware overhead might be impractical for GPPs. IBM has taken a different approach by using partial hardware replication in the G4 and G5 versions of its S/390 mainframe [13]. In [11, 12], both time and hardware redundancy are used to detect errors at the circuit level. Outputs of combinational circuits are tested to find soft errors. All of these circuit-level techniques differ from our microarchitectural approach. REESE tests for errors at the pipeline level by comparing the results of individual instructions.
Hardware redundancy is used at the microarchitectural level by Franklin in [14]. Various possible sources of errors in the microprocessor pipeline are analyzed and categorized. Then hardware is added to the pipeline to handle each type of error. This approach would need less additional hardware than gate-level hardware redundancy. Results also show that performance is only slightly affected.
In [15], time redundancy is combined with recomputation with shifted operands (RESO). Using RESO, the redundant instruction undergoes a shifting transformation both before and after it is executed. This can detect both permanent and transient errors in the processor pipeline. Instructions are duplicated early in the pipeline and temporarily stored. As a result, redundant instructions can be flexibly scheduled, allowing the redundant instruction to execute efficiently. The only drawback to this method is that it only covers the execution stage of the processor pipeline. Other research [16] also provides error detection that does not cover other pipeline stages.
This paper focuses on superscalar processors, since they are the type used in GPPs. However, error detection schemes have been developed for a wide variety of processor types. NOW (Network Of Workstations) [19] and other projects [20, 21, 22, 23] have utilized hardware on multiprocessor systems to perform global fault diagnosis. Instruction re-execution has also been implemented in other processor types, including multiscalar processors. Although the microarchitecture underlying REESE is different, we apply many of the same techniques in our simulation.
Franklin implements microarchitectural time redundancy in [24] . He notes that hardware added to boost microprocessor performance is often underutilized. The hardware is added without adding functionality. The law of diminishing marginal returns states that this type of extra hardware will be less useful than the hardware that was there beforehand. The leftover usefulness can be utilized to make the hardware more fault tolerant. In his paper, Franklin duplicates all instructions at either the functional units (in the execution stage) or at the dynamic scheduler. The only extra hardware required is the hardware that stores the duplicate instructions and the hardware necessary to compare the results at the end of the pipeline. In tests of four benchmark programs, the fault-tolerance overhead ranges from 9.6% down to only 0.5%.
Our approach goes a step further than Franklin by asking the following question: How much spare hardware is needed to decrease the fault-tolerance overhead to zero? This is in contrast to past research which identifies the amount of overhead necessary but does not attempt to decrease this overhead by adding spare components to the system. The next section discusses how our approach works.
Implementation

Utilizing Idle Capacity
Superscalar processors are currently capable of executing many instructions per cycle (IPC). To accomplish this, the instruction stream flowing through the processor needs to be partitioned into small blocks of instructions, where the instructions in a block can be executed simultaneously. Processor designers seek to exploit this instruction-level parallelism (ILP) as much as possible. However, control dependencies and data dependencies limit ILP. In the former case, an instruction that follows a branch instruction cannot execute until the result of the branch is known. A data dependency arises when an instruction cannot execute because its operand has not yet been generated by a previous instruction.
Since the amount of ILP in an instruction stream changes from cycle to cycle, a portion of the available hardware is left idle during most clock cycles. The average throughput for the microprocessor pipeline is approximately 2 IPC. However, modern processors are capable of executing 4-6 IPC. Experiments done over a broad range of programs have verified that approximately 30-40% of hardware is unused during any specific cycle. This idle capacity may be used for other purposes, without adding additional hardware to the processor.
Idle capacity is likely to increase in future processor generations. The raw computing power of processors is growing faster than the ability of these processors to exploit larger amounts of ILP. This results in increasing amounts of hardware that is underutilized. Fault tolerance can be implemented by this idle hardware. 
Implementing Time Redundancy
REESE utilizes the idle capacity that is inherent in GPPs to detect soft errors. It does this by interleaving the R-stream and P-stream instruction executions. An R-stream instruction is generated by putting the P-stream version of the instruction into a FIFO queue immediately before the instruction commits. We call this structure the R-stream Queue. The delay between execution of P and R versions of an instruction is equal to the time to execute the P version plus the time for an instruction to go from the tail to the head of the R-stream Queue. REESE executes every instruction two times by interleaving P and R streams. After both P and R versions of an instruction have been executed, the two results can be compared. This gives the GPP the ability to detect soft errors. When too few primary instructions are ready to issue, REESE can issue redundant instructions to any functional units that are still available. Functional units can be utilized nearly 100% of the time.
Necessary Hardware Additions
We add as little hardware as possible to our microprocessor to realize REESE. The following list shows the primary additions:
The R-stream Queue
Extra scheduling and forwarding logic
Hardware to compare the results of P-stream and R-stream versions of an instruction
Connections that allow for interaction between the pipeline and R-stream Queue
First of all, we add the R-stream Queue just before the commit stage. The R-stream Queue is a FIFO queue with an (initial) maximum of 32 entries. This queue holds P-stream instructions that are ready to be committed. An entry in the R-stream Queue stores much more than just the 32-bit instruction. It keeps the values of the instruction operands and the result of the operation. Thus the P-stream result is immediately available for comparison with its R-stream counterpart.
Any information from the P-stream execution that might speed up the R-stream execution could be included in the R-stream Queue entry. However, one must keep in mind that extra pipeline hardware might be necessary in order to use any extra information that is carried along with the R-stream instruction. We choose to include only operands and result information. This information is vital to the R-stream execution and final result comparison.
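The queue described above can be sketched as a bounded FIFO whose entries carry the operand values and the P-stream result. This is a minimal software model, not the hardware design; the class names, field names, and the use of a Python deque are assumptions made for illustration. Only the default of 32 entries and the entry contents come from the text.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class RStreamEntry:
    """One queue entry: the instruction plus what its P-stream execution produced."""
    opcode: str        # hypothetical mnemonic, e.g. "add"
    operands: tuple    # operand values captured during the P-stream execution
    p_result: int      # P-stream result, kept for the later comparison


class RStreamQueue:
    """Bounded FIFO holding P-stream instructions that await re-execution."""

    def __init__(self, capacity=32):
        self.capacity = capacity
        self._entries = deque()

    def __len__(self):
        return len(self._entries)

    def is_full(self):
        return len(self._entries) >= self.capacity

    def push(self, entry):
        # Entries are added at the tail, just before the commit stage.
        if self.is_full():
            raise OverflowError("R-stream Queue is full")
        self._entries.append(entry)

    def pop(self):
        # The oldest entry leaves from the head to be re-executed.
        return self._entries.popleft()
```

Because each entry already holds its operand values and expected result, an entry popped from the head can be re-executed and checked without consulting the register file.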
Extra scheduling logic is also important. After a P-stream instruction is decoded, a decision must be made whether to execute that instruction or to take an instruction from the head of the R-stream Queue. Since performance is a priority in REESE, we want to choose the P-stream instruction whenever possible. This is where the issue of idle capacity becomes important. When dependencies limit the ability of the pipeline to execute a P-stream instruction, an R-stream instruction can be scheduled instead.
The scheduler also needs logic to help avoid R-stream Queue overflow. Counters can be used to keep track of the number of P and R instructions in the pipeline, as well as the size of the R-stream Queue. This data determines whether or not an R-stream instruction must be scheduled. This is the only way that the R-stream Queue can inhibit the normal flow of instructions through the pipeline. Since a full R-stream Queue blocks the execution of P instructions, it is critical to set the buffer to an appropriate length.
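The scheduling policy in the two preceding paragraphs can be summarized as a small decision function. This is a simplified sketch: the function name and arguments are hypothetical, and it tracks only the R-stream Queue occupancy rather than the full set of counters mentioned above.

```python
def choose_stream(p_ready, r_queue_len, r_queue_capacity, free_units):
    """Decide which stream gets the next issue slot: "P", "R", or None.

    Assumptions (illustrative only): one decision per free functional unit,
    and queue occupancy is the only overflow signal that is consulted.
    """
    if free_units == 0:
        return None                      # nothing can issue this cycle
    if r_queue_len >= r_queue_capacity:
        return "R"                       # full queue: must drain before P can proceed
    if p_ready:
        return "P"                       # performance first: prefer the primary stream
    if r_queue_len > 0:
        return "R"                       # idle capacity: fill the slot with redundant work
    return None
```

The ordering of the checks encodes the priorities: a full queue is the only case in which a ready P-stream instruction is held back.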
Extra forwarding hardware is also necessary. No result from a primary instruction can be committed to the register or memory state before it has been compared with its redundant counterpart. However, a result can be forwarded to other instructions that need it to execute. This allows instruction execution to proceed at normal speed, while only instruction commitment is delayed. Results of P-stream STORE instructions may be forwarded to subsequent LOAD instructions, but the results may not be committed into memory before they have been compared to their R-stream counterparts.
Hardware to compare the P-stream and R-stream instruction results must be added between the writeback and commit stages of the pipeline. Very little hardware will be needed to accomplish this. If a comparison fails, the microprocessor is already capable of flushing the subsequent instructions in the pipeline. The R-stream Queue will also need to be cleared. The first instruction to enter the fetch stage after this happens will be the instruction where the error was detected. If the instruction is still found to be in error, the pipeline will have to stop and notify the user of the error.
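The comparison and recovery steps described above amount to a three-way decision. The sketch below is an assumption-laden illustration: the function name, the string return values, and the `already_retried` flag are hypothetical and stand in for pipeline control signals.

```python
def resolve_comparison(p_result, r_result, already_retried):
    """Return the action taken after comparing P-stream and R-stream results.

    "commit"            : results agree, the instruction may update state
    "flush_and_refetch" : first mismatch, flush the pipeline and the
                          R-stream Queue, then refetch the instruction
    "halt"              : mismatch persists after the retry, stop and
                          notify the user
    """
    if p_result == r_result:
        return "commit"
    if not already_retried:
        return "flush_and_refetch"
    return "halt"
```

A transient fault would normally disappear on the retry, so reaching the halt case suggests an error that re-execution alone cannot remove.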
Complex interactions between the RUU and R-stream Queue can increase overall efficiency. Specifically, the R-stream Queue can be allowed to remove instructions from the pipeline before the instructions are ready to commit. An instruction that completes quickly normally needs to wait for other instructions before it can commit (in-order commit). However, with REESE, P-stream instructions that complete early can be put into the R-stream Queue immediately. This speeds up execution, but requires additional hardware complexity. Justifying this hardware complexity is similar to justifying out-of-order execution in superscalar processors: increased speed, efficiency, and hardware utilization are worth the extra complexity.
It is difficult to know the exact amount of hardware overhead that would be required to implement REESE.
The majority of the hardware relates to forwarding, the R-stream Queue, and scheduling. Implementing these functions should not cause space problems on the chip, since similar functions are already implemented in the microprocessor without requiring large amounts of die area.
Increasing Implementation Efficiency
Since every instruction in the P stream is executed a second time, it is logical to assume that, in the worst case, total program execution time would double when REESE is used. Past research has shown that actual implementations can achieve much better execution times. This section gives three reasons why REESE can perform much better than the worst case.
The first reason is the idle capacity inherent in superscalar microprocessors. REESE can utilize this idle capacity to perform R-stream instructions. As stated earlier, the average throughput for the microprocessor pipeline is approximately half of the maximum possible throughput for modern microprocessors. Extra instructions can be strategically inserted into the instruction stream to increase IPC without increasing overall execution time. However, continuing to increase the number of extra instructions will eventually cause a time increase. REESE doubles the total number of instructions executed while minimizing this time increase.
A second way that REESE increases pipeline efficiency is by eliminating data and control dependencies between R-stream instructions. It does this by utilizing information from the P-stream execution of each instruction. After an instruction has executed once, REESE stores critical information gathered during the P-stream execution. This information is used by the instruction for value and control prediction during its redundant execution. Thus no data dependencies remain between R-stream instructions, and control dependencies are also eliminated. By the time a branch instruction enters the R-stream Queue, the outcome of the branch is known. That outcome can be used for predicting the direction of the branch in the R stream. The direction is indicated by the instructions that follow the branch into the R-stream Queue. R-stream instruction execution verifies that the value and control predictions were indeed correct.
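The absence of data dependencies in the R stream can be seen in a toy model. In the sketch below, the three-operation ALU table, the tuple layout of an instruction, and both function names are illustrative assumptions. The P stream runs in program order and records operand values and results; the R stream then re-executes the recorded entries in an arbitrary order.

```python
import operator

# Hypothetical three-operation ALU used only for this illustration.
ALU = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}


def run_p_stream(program, regs):
    """Execute (op, dst, src1, src2) tuples in order; log operand values and results."""
    log = []
    for op, dst, src1, src2 in program:
        a, b = regs[src1], regs[src2]
        result = ALU[op](a, b)
        regs[dst] = result
        log.append((op, (a, b), result))
    return log


def verify_r_stream(log, order=None):
    """Re-execute logged entries in any order; return the indices that mismatch."""
    indices = list(order) if order is not None else range(len(log))
    return [i for i in indices if ALU[log[i][0]](*log[i][1]) != log[i][2]]


# A dependent chain in the P stream: r3 feeds r4, which feeds r5.
program = [("add", "r3", "r1", "r2"), ("mul", "r4", "r3", "r3"), ("sub", "r5", "r4", "r1")]
log = run_p_stream(program, {"r1": 2, "r2": 3})
mismatches = verify_r_stream(log, order=[2, 0, 1])  # any order works for the R stream
```

Each logged entry is self-contained, so the R-stream check never waits on another R-stream instruction, even though the original instructions formed a dependency chain.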
A third reason why the R stream can execute efficiently is that it will cause no extra misses in the first level of the cache. When a load/store instruction executes in the P stream, it will bring the relevant data into the cache on a miss. Therefore, the R-stream version of the instruction will hit in the cache even if the corresponding P-stream instruction was a miss (unless there is rapid thrashing between the P-stream and R-stream executions, which is a very unlikely event).
Adding Spare Capacity
Even though we utilize the idle capacity of the processor, execution time will still increase. In addition to soft error detection, we want to eliminate this increase. Therefore, our research explores the possibility of compensating for this additional time by adding functional units to the microprocessor. This added hardware is what we call spare capacity. We will attempt to discover the minimum spare capacity needed to bring execution time back down to normal. We add functional units for two reasons: it is simpler than adding hardware to other portions of the pipeline, and it is an effective method of speeding up both P-stream and R-stream instruction executions. By using both existing idle capacity and added spare capacity, the resulting microprocessor will meet both performance and reliability goals.
The Simplescalar Simulator
We simulated REESE using the Simplescalar Tool Set, Version 2.0 [25]. We modified the execution-driven sim-outorder simulator, which supports out-of-order issue and execution. The simulator uses a Register Update Unit (RUU) to handle register renaming. The RUU commits instructions to the register file in program order by only committing instructions from the head of the RUU.
Simplescalar implements the RUU as a circular queue with pointers to the current head and tail of the queue. We simulated the R-stream Queue in a similar manner. R-stream instructions carry their operands and result with them as they proceed through the Simplescalar pipeline. This allowed us to avoid adding forwarding logic to the simulator.
In Simplescalar, a load/store queue (LSQ) handles all the memory instructions that flow through the processor.
The RUU calculates the effective address, and the LSQ handles the cache and translation lookaside buffer (TLB) accesses. A load instruction may also receive its value from loads or stores that are ahead of it in the LSQ.
The Simplescalar Tool Set allows us to easily change the number of functional units that are available for computation. We can also change the configuration of other processor hardware: caches, maximum IPC, branch predictors, et cetera. Table 1 shows the general simulator options that we set for REESE. From this point forward, this set of options is called the starting configuration.
Benchmark Programs
We tested six benchmark programs from the SPEC95 benchmark suite [27]. The programs we used were all integer benchmarks. This helps us to focus on how many integer units of spare capacity are necessary to bring performance back to baseline levels. We did not study floating point (FP) programs. We chose to execute 100 million instructions in each benchmark program. Table 2 gives a list of the benchmarks we used and the inputs to each benchmark.
Results and Analysis
First, we compare a simulated processor that implements REESE to a simulated processor that does not. We always test both models using the same hardware configuration before we add spare elements to the REESE model. From this point forward, the processor model that does not implement REESE will be called the baseline model.
We focus on the committed instructions per cycle (IPC) for the benchmark programs that were used. Every program was tested using five different hardware variations. The purpose of the following figures is to answer two distinct questions. First, how does a microprocessor that implements REESE compare to one that does not? Second, how does adding spare elements affect the comparison? Both of these questions are tested for a variety of processor configurations in order to quantify the influence of other hardware structures on IPC.
The results shown in Figure 2 correspond to the starting configuration detailed in Table 1 (RUU size = 32 and LSQ size = 16). Figure 3 shows the results of doubling the size of both the RUU and LSQ in both REESE and the baseline models. The sizes of the RUU and LSQ are two of the factors that can limit IPC results for both models. This could cause the models to appear to perform equally well. By increasing the RUU and LSQ sizes, we can measure the influence of these elements on IPC values. This influence can then be distinguished from the influence of REESE on IPC.
In Figure 4, the size of the models' datapaths is doubled from 8 to 16. The larger RUU and LSQ are maintained from Figure 3. In this figure, we want to make sure that the pipeline bandwidth is not artificially limiting the IPC of either model.
Figure 4. IPC for 16-wide datapath
We added more memory ports to the simulated processors in Figure 5. Of course, adding memory ports would be much more expensive than adding ALUs. However, we wanted to measure the influence of memory bandwidth on IPC results. We did not include the case of 2 spare ALUs and 1 spare multiplier/divider, because the data was the same as when only 2 spare ALUs are present. Examination of the previous figures shows that a spare multiplier/divider has little effect on average IPC values.
Analysis
Several general conclusions can be drawn from these figures. The first is that an RUU-based microprocessor cannot attain 2 IPC on a regular basis. This is probably due to high-latency instructions causing pipeline stalls. When an RUU is used, a high-latency instruction (like division) can reach the head of the RUU and cause other instructions to back up behind it. These stalls are reflected in the low IPC values. An alternative to the RUU-based scheme is to have reservation stations associated with each type of functional unit. This would be like a distributed RUU that would not fill up as easily as the RUU in SimpleScalar. A separate Reorder Buffer would be needed to ensure in-order instruction commit.
Figure 5. IPC for additional memory ports
More specifically, it is easy to see that the two schemes have similar IPC values before any spare elements are added. This supports the past research that was discussed in section two. Average IPC for REESE is only 11-16% worse than the baseline without any spare elements. When spare elements are added, this difference shrinks from an average of 14.0% to an average of 8.0% over the hardware configurations shown in the previous figures.
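The percentages quoted here can be read as relative IPC degradation with respect to the baseline. The sketch below assumes that definition, which the text does not state explicitly, and uses hypothetical IPC values rather than measured data.

```python
def ipc_degradation_percent(baseline_ipc, reese_ipc):
    """Relative IPC loss of REESE versus the baseline, in percent.

    Assumption: degradation is measured relative to the baseline IPC.
    """
    if baseline_ipc <= 0:
        raise ValueError("baseline IPC must be positive")
    return 100.0 * (baseline_ipc - reese_ipc) / baseline_ipc


# Hypothetical values chosen only to illustrate the arithmetic.
example = ipc_degradation_percent(2.0, 1.72)  # roughly 14 percent
```

A negative value from this function would correspond to the cases noted below where REESE has a higher IPC than the baseline.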
These graphs also show that the performance of some programs is erratic and unpredictable. For example, ijpeg has a baseline IPC that is usually much higher than REESE. Vortex has a baseline IPC that is lower than REESE before spare elements are added. These unpredictable results are due to complex interactions within the instruction stream. That is why the average is so important. A wider variety of programs need to be tested before REESE can be implemented on a larger scale.
Cycle time is another factor that should not be forgotten. Since cycle time is dependent upon implementation technology, our simulation model could not address possible cycle time dilation. Cycle time should not be adversely affected, though, because the R-stream Queue acts in parallel to the processor pipeline. REESE does add result comparison and forwarding hardware, but these should not overly burden the clock cycle. Figure 6 shows a summary of the previous results. It is clear that the added memory ports significantly improved the performance of REESE. It is more feasible to implement REESE on a system that has four or more memory ports. However, adding these ports adds much more cost and complexity than adding integer ALUs.
Figure 6. Summary of results
According to these results, REESE should work better as more hardware is added to a system. To test this, we increased the size of the RUU and ran more simulations. Figure 7 shows the result. We ran cases where the RUU increased to 64 and even 256 entries. The LSQ always remained at half the RUU size, as in the first simulations. We adjusted the RUU because it seemed to be a bottleneck to quick execution. We also wanted to compare the results of adding functional units in addition to the large RUU. The results show that the difference between the baseline system and REESE remains at approximately 15% when only the RUU is increased in size. However, additional functional units shrink this difference to about 1.5%. When a large number of functional units are present, we would expect that adding spare functional units would not impact performance. This expectation is proven correct by the figure. However, the figure also shows how the addition of only 2 integer ALUs can drastically improve the performance of the REESE model.
Discussion and Conclusions
REESE is an efficient implementation of time redundancy in a microprocessor pipeline. The purpose of REESE is to allow for the detection of soft errors in the processor while only adding a minimal amount of spare elements. When a small overall amount of hardware is present, results show that REESE has a 14% performance decrease when compared to the baseline. Simulations of computer systems with more hardware bring this difference down to about 12%.
This difference is significantly reduced by simply adding two spare integer ALUs to the REESE hardware configuration. This brings the difference down to 8.2% for the simpler systems that we simulated and 1.5% for the more complex systems. As our simulated system grows larger, it becomes clear that the practicality of REESE depends heavily on both the size of the register update unit and the total number of functional units. The baseline processor is not as dependent upon a large number of functional units.
From our simulations, it seems that we can only approach our goal of incorporating time redundancy into a GPP pipeline while attaining zero performance degradation. However, the fact that we approach our goal as we increase the amount of hardware is encouraging. This indicates that the time penalty due to time-redundant instruction execution should decrease for future generations of microprocessors. Of course, REESE will need different implementations for different processors. But the practicality of REESE, and methods like it, will increase with time.
To be completely practical, the hardware cost of implementing REESE must be compared to the model's benefits. We can assume that the R-stream Queue will take up a larger amount of die area than other hardware additions required by REESE. Depending on its size, the R-stream Queue requires slightly more area than the RUU. If the RUU takes up 10% of the die area, then we can expect REESE to add a total of about 20% to the die area and (as stated before) 1.5% to the execution time. These are the costs for the benefit of full duplication and reexecution of the instruction stream.
Future work could explore the possibility of executing less than 100% of P-stream instructions in the R stream.
Compared to the current simulation, this method could more easily meet processor performance goals. For example, one out of every two instructions could be re-executed. This would speed up execution, but it would decrease the number of soft errors that REESE would be able to detect. Adding only two integer ALUs to the execution stage of the pipeline approaches our goal of zero performance degradation with a reasonable processor configuration. ALUs are relatively inexpensive additions to the processor. REESE adds soft error detection to GPPs without adding expensive hardware or drastically increasing program execution time. REESE is a scheme that must be seriously considered as a defense against the reliability problems of future microprocessor pipelines.
