Static and dynamic variations, which degrade the reliability of microelectronic systems, increase as CMOS technology shrinks. Further downscaling is therefore only profitable if the costs of reliability in terms of area, energy, and delay stay within limits. As a consequence, the traditional worst-case design methodology will become infeasible. Future architectures have to be error resilient, i.e., the hardware architecture has to tolerate transient errors autonomously.
I. INTRODUCTION
With the continuing downscaling of CMOS technologies, static and dynamic variations, time-dependent device degradation, sporadic timing errors, and radiation-induced soft errors will result in unreliable components. The traditional worst-case design methodology becomes infeasible due to the large area and energy overhead and the required a priori knowledge of all error sources at design time. Further technology downscaling is only profitable if the costs of reliability stay within limits. A promising approach is to tolerate errors on the physical hardware level and correct them on higher levels [1]-[4].
In this paper, we focus on the design of multi-processor systems-on-chip (MPSoC) which can tolerate a specific amount of transient errors. Soft errors, one type of transient error, are caused by radiation. When a particle hits a semiconductor device, it can generate an electron-hole pair. This yields a false signal value for a short time, which is called a single-event transient (SET). When a particle hit flips a bit in a register, a single-event upset (SEU) occurs. Timing errors caused by temporary delay variations are another type of transient error.
We present an FPGA-based rapid prototyping system for an MPSoC consisting of autonomous hardware units. These units can monitor and analyze sporadic disturbances and autonomously trigger adequate reactions. Different techniques are integrated into the MPSoC for a holistic protection of the system: a self-correcting data path and control flow checking in LEON3 [5] processor cores and a run-time configurable data protection of the AMBA [6] advanced high-performance bus (AHB).
The presented work was partly done under the scope of the AIS project, which is supported by the German Federal Ministry of Education and Research, funding label 01M3083.
The system is dedicated to fast architectural exploration under different scenarios for transient errors on the physical hardware level. Errors can be injected at multiple places in the hardware for evaluation. Our adaptive FPGA prototyping system allows a fast design space exploration by emulating various error protection techniques with varying failure rates on the microarchitectural level. The impact on the system behavior can be evaluated with respect to overhead in area and latency. An alternative for such an exploration is simulation. Since errors occur on the microarchitectural level, simulation has to be performed on this level, which results in extremely long simulation times if the system behavior has to be monitored over a long time period. This makes simulation infeasible. A channel decoding system is used as the demonstration application. Channel decoding is an important building block in any communication system. Further, it is representative of probabilistic and iterative algorithms which have an algorithmic resilience. Many other applications also have a cognitive resilience, e.g., video or audio compression. This algorithmic or cognitive resilience, called application level resilience in the following, offers a large optimization potential for error resilient system architectures [7]-[9]. The system designer can make a trade-off between quality and robustness.
The paper is structured as follows. Related work is discussed in Section II. Sections III, IV and V focus on various techniques to detect and correct errors in data path, control path, and interconnect, respectively. The rapid prototyping platform and results from an exploration scenario are presented in Section VI.
II. RELATED WORK
Error detection and correction methods have important roots in the area of fault tolerance. Here, the most familiar methods for error detection or correction are the duplication or triplication, respectively, of a given processor core with subsequent comparison of the results [10]. Duplication or even triplication of processing units is, however, usually prohibitive due to its area and power consumption. Hence, these approaches are only affordable in safety-critical systems with a very high demand on reliability.
A technique called virtualization-assisted concurrent, autonomous self-test (VAST) [11] uses a concept similar to ours with autonomous, on-line failure protection. VAST tests the processor cores of a multi-core system while they are free from executing tasks. Only hard failures are detected because the on-line test is not running during normal operation.
Related work on data path protection includes arithmetic and logic units protected with residue or parity codes [12]. Ernst et al. [13] protect a CPU pipeline with Razor, a technique for timing error detection and correction. Whenever an error is detected, it is corrected by inserting a one-cycle delay. This technique does not protect the pipeline against SET and SEU errors.
Gaisler [14] developed a fault-tolerant version of the LEON processor. Fault tolerance is provided against bit flips (single-event upsets, SEUs) in cache memories, register file, and pipeline registers. This work does not address the increasing problem of timing errors and single-event transients.
Related work on control flow checking can be divided into approaches using an additional hardware checker unit or a watchdog processor [15]-[20], and approaches which are completely software-based [21]. In these approaches, the program code is first structured into basic blocks (see footnote 1).
Control flow checking using assertions (CCA) [21] is a software-based approach. After creating a basic block graph, a sequence of special control instructions is inserted into the program code at the beginning as well as at the end of each basic block. These additional instructions verify that only legal branch or jump destinations, as specified by the basic block graph, are taken. A good overview of software methods for control flow checking for security and fault tolerance is given in [22].
To check all types of instructions, a signature (a hash or a CRC value) of all instructions of a basic block can be calculated offline (at compile time). At run time, a hardware checker can calculate the signature of the executed instructions in a basic block. When leaving a basic block, the signatures can be compared and errors inside the basic block can be found. Signature methods can be divided into two groups, namely embedded signature monitoring (ESM) [15]-[17] and autonomous signature monitoring (ASM) [18], [19].
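The signature comparison described above can be illustrated with a minimal sketch. It assumes CRC-32 as the signature function and uses made-up 32-bit instruction words and block names (`bb0`, `bb1`); none of these choices are taken from the cited checker designs.

```python
import zlib

def block_signature(instructions):
    """CRC-32 over the 32-bit instruction words of one basic block."""
    data = b"".join(word.to_bytes(4, "big") for word in instructions)
    return zlib.crc32(data)

# Offline (compile time): reference signature per basic block.
# The instruction words below are hypothetical placeholders.
program = {
    "bb0": [0x9DE3BF98, 0x82102005, 0x80A06000],
    "bb1": [0x82006001, 0x81C7E008],
}
reference = {name: block_signature(words) for name, words in program.items()}

def check_block(name, executed_words):
    """Run-time check when leaving a block: True if the signatures match."""
    return block_signature(executed_words) == reference[name]

# An undisturbed execution passes the check.
ok = check_block("bb0", program["bb0"])

# A single bit flip in one fetched instruction changes the signature.
corrupted = list(program["bb0"])
corrupted[1] ^= 1 << 7
detected = not check_block("bb0", corrupted)
```

In hardware, the checker would accumulate the signature incrementally as instructions are fetched; the sketch recomputes it per block only to keep the comparison step visible.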
For interconnect protection, interconnect noise can be reduced [23] or general protection techniques on the circuit level to detect timing errors [13] or to mask SET errors [24] can be applied. On higher levels, spatial and time redundancy can be added [4]. Well-known techniques are error detection codes with automatic retransmission request (ARQ) or forward error correction (FEC) codes [25]-[27]. However, the general application of these techniques implies a large overhead in area, energy, and timing. In [25] it was shown that the efficiency of the error protection scheme strongly depends on the application constraints.
Footnote 1: A basic block is a sequence of code which is executed successively without any jumps or branches except, possibly, at the end. The basic block can only be left at the end of a block and can only be entered at the beginning. Only the last instruction can be a jump or branch and only the first instruction can be a jump or branch destination.
Bertozzi et al. [28] evaluated AMBA data bus protection schemes using Hamming codes. They conclude that using the Hamming code only for error detection and ARQ is the most effective coding scheme with respect to energy per useful bit. However, retransmission is not feasible for applications with strong latency constraints.
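As an illustration of using a Hamming code for detection only, combined with ARQ, the following sketch uses a Hamming(7,4) code. The code length, the bit layout, and the function names are assumptions chosen for brevity; they are not the configuration evaluated in [28].

```python
def syndrome(bits):
    """XOR of the 1-based positions of all set bits; 0 for a valid codeword."""
    s = 0
    for pos, bit in enumerate(bits, start=1):
        if bit:
            s ^= pos
    return s

def encode(data4):
    """Place 4 data bits at positions 3, 5, 6, 7 and fill parity at 1, 2, 4."""
    bits = [0] * 7
    for pos, d in zip((3, 5, 6, 7), data4):
        bits[pos - 1] = d
    s = syndrome(bits)
    for k, pos in enumerate((1, 2, 4)):
        bits[pos - 1] = (s >> k) & 1
    return bits

def receive(bits):
    """Detection-only use: return the data, or None to request a retransmission."""
    if syndrome(bits) != 0:
        return None
    return [bits[pos - 1] for pos in (3, 5, 6, 7)]

def transfer(data4, flips_per_attempt):
    """Resend until a word passes the check.

    flips_per_attempt lists, per attempt, the bit positions disturbed on the bus.
    """
    attempts = 0
    for flips in flips_per_attempt:
        attempts += 1
        word = encode(data4)
        for pos in flips:
            word[pos - 1] ^= 1
        data = receive(word)
        if data is not None:
            return data, attempts
    raise RuntimeError("no error-free attempt")

# First attempt: single flip, second attempt: double flip, third attempt: clean.
data, attempts = transfer([1, 0, 1, 1], [(3,), (2, 6), ()])
```

Used this way, the code detects any one or two flipped bits per word, but every detected error costs a full retransmission, which is the latency problem noted above.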
III. DATA PATH PROTECTION
Data path error detection uses Nicolaidis shadow registers [4] which detect SET, SEU, and timing errors. An extra shadow register is added to each inter-stage pipeline register, see Figure 1. Whenever an error is detected, the errant operation has to be retried. Assuming that a SET hits the execution stage (EX), it will be detected at the EX/ME inter-stage pipeline registers. The operation should be retried, but at that moment the execution stage input registers have already been overwritten. Therefore, a straightforward solution would be to flush the pipeline and to restart it at the errant instruction.
In order to minimize the error recovery overhead, a new customized micro-rollback [29] is presented. As the shadow register technique has an error detection latency of only one clock cycle, storing the last state in history registers (Figure 1) is sufficient to perform a micro-rollback [30]. Whenever an error is detected, operations are retried from the history registers. The recovery penalty equals two clock cycles, one cycle for error detection and one for correction. This penalty is independent of where the error occurs.
This concept has been implemented in a Leon3 CPU pipeline. Micro-architectural extensions are added to each pipeline register as shown in Figure 1. Whenever an error is detected, the global error signal (obtained by OR'ing the error signals of all protected registers) is forwarded to the control unit, which is extended to control the pipeline rollback. Errors are detected and corrected with a two-clock-cycle penalty.
Processor protection with shadow registers has been implemented in ASICs [13]. However, shadow-register-based designs cannot be implemented in FPGAs in a straightforward manner: FPGA tools do not allow hold times to be constrained, and routing two different clocks to the same clock domain is difficult or impossible with existing tools. To circumvent this problem, we use the flip-flops' clock enable signal to mimic the two-clock scheme of the main/shadow register concept.
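The detect-and-retry behavior can be pictured with a small behavioral model. This is only a sketch under simplifying assumptions: it models a single stage, treats the shadow register as always holding the undisturbed value, and models a disturbance as a flip of bit 0 in the main register. The function and variable names are illustrative and do not correspond to signals in the Leon3 implementation.

```python
def run_stage(inputs, op, disturbed):
    """Model one pipeline stage with main, shadow, and history registers.

    disturbed: set of input indices whose first evaluation is hit by a transient.
    Returns the stage outputs and the number of clock cycles spent.
    """
    results, cycles = [], 0
    for i, x in enumerate(inputs):
        history = x                       # history register keeps the stage input
        shadow = op(x)                    # shadow register (assumed undisturbed)
        main = shadow ^ 1 if i in disturbed else shadow
        cycles += 1                       # regular cycle for this operation
        if main != shadow:                # error signal from the comparison
            cycles += 2                   # one cycle detection, one cycle correction
            main = op(history)            # micro-rollback: retry from history
        results.append(main)
    return results, cycles

def double(x):
    return 2 * x

clean, clean_cycles = run_stage([3, 5, 7, 9], double, set())
fixed, fixed_cycles = run_stage([3, 5, 7, 9], double, {1, 3})
```

With two disturbed operations out of four, the model produces the same outputs as the clean run and spends four additional cycles, matching the fixed two-cycle penalty per error stated above.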
IV. CONTROL PATH PROTECTION
Control path protection can be achieved by checking the control flow of a program under execution and, in case of an error, re-executing the erroneous instruction. A quite general definition of control flow checking may be given as follows: Control flow checking denotes the task to test whether [...]
[...] of data has to be processed by the Turbo decoder. Therefore, the implementation of the complete simulation environment directly on the FPGA is an appropriate solution.
Turbo decoding belongs to the class of belief propagation algorithms. It is processed using an iterative algorithm with high computational complexity and high communication bandwidth. This allows exploring and verifying the error protection techniques used in the processors and the interconnect system. Figure 4 shows the mapping of the communication system (a) on the MPSoC (b). The decoder environment is implemented in the hardware unit called Turbo slave. We integrated an additive white Gaussian noise (AWGN) generator from Xilinx [33] for emulating the noisy wireless channel. Turbo decoding runs on the two Leon3 processors. Data is transmitted over the AMBA AHB for reading/writing the input/output data and reading/storing data from/into the SDRAM.
We injected various errors in the different components of the MPSoC. In the data and control path, we mimic single-event transients and timing errors by generating short pulses in register input signals. This is achieved by adding a multiplexer in front of register input signals. The multiplexer control signal is the injection control generated by a linear feedback shift register (LFSR). An LFSR for each bit line is used to generate random bit flips or delays in the data bus signals. An LFSR length of 51 ensures the stochastic independence of the bit flips. The injected error rate (IER) per bit line can be configured at run time. This allows an efficient design space exploration under varying IER. Three scenarios are emulated to evaluate various error protection techniques:
• In normal operation, no errors are injected. This is our reference.
• In failure operation, error injection is performed in different components.
• In autonomous operation, errors are injected and the autonomous error handling in processors and interconnect is activated.
A design space exploration with respect to different error protection scenarios is shown in Figure 5. We used two different codes to protect the transmission of the input data. The first code is a parity bit over all six bits of the input data. In the second code, only the sign bit is duplicated. In both cases, the input value is punctured if an error is detected, i.e., it is set to zero, which represents a 50:50 probability for a zero and a one. The Turbo decoding algorithm tolerates injected errors up to an IER of almost 10^-4 without any protection. This demonstrates the algorithmic resilience of Turbo decoding. Without error protection, the decoder fails completely at an IER of 10^-2, whereas the decoding loss with both codes is about 0.6 dB. With an IER of 10^-3, the decoding performance with protection by parity bit is nearly equal to that of normal operation. Sign duplication performs only about 0.05 dB worse.
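The per-line error injection and the two input-data codes can be sketched together. Several details here are assumptions rather than the implemented design: a 16-bit LFSR with the well-known taps 16, 14, 13, 11 stands in for the 51-bit LFSRs, the IER is realized by comparing the LFSR word against a threshold, the check bit travels as a seventh bus line, and bit 5 of the six-bit value is taken as the sign bit.

```python
class Lfsr16:
    """16-bit Fibonacci LFSR (taps 16, 14, 13, 11); stand-in for a 51-bit LFSR."""

    def __init__(self, seed):
        assert seed & 0xFFFF, "an all-zero state would lock the LFSR"
        self.state = seed & 0xFFFF

    def next_word(self):
        for _ in range(16):                       # shift out a fresh 16-bit word
            s = self.state
            bit = (s ^ (s >> 2) ^ (s >> 3) ^ (s >> 5)) & 1
            self.state = (s >> 1) | (bit << 15)
        return self.state

def flip_mask(lfsrs, ier):
    """One LFSR per bit line: flip line i when its word is within the threshold."""
    threshold = round(ier * 0xFFFF)               # assumed way to configure the IER
    mask = 0
    for i, lfsr in enumerate(lfsrs):
        if lfsr.next_word() <= threshold:
            mask |= 1 << i
    return mask

def parity6(value):
    """Even-parity bit over the six data bits."""
    return bin(value & 0x3F).count("1") & 1

def protect_parity(value, flips):
    """Send 6 data bits plus a parity bit; puncture to 0 on a parity mismatch."""
    word = ((parity6(value) << 6) | (value & 0x3F)) ^ flips
    data, check = word & 0x3F, (word >> 6) & 1
    return data if parity6(data) == check else 0

def protect_sign(value, flips):
    """Send 6 data bits plus a copy of the sign bit; puncture if the copies differ."""
    sign = (value >> 5) & 1
    word = ((sign << 6) | (value & 0x3F)) ^ flips
    data, copy = word & 0x3F, (word >> 6) & 1
    return data if ((data >> 5) & 1) == copy else 0

lines = [Lfsr16(0xACE1 + i) for i in range(7)]    # seven bus lines, arbitrary seeds
received = protect_parity(0b101101, flip_mask(lines, 1e-3))
```

The parity code punctures on any single flipped line, while sign duplication only reacts to flips of the sign bit or its copy and lets flips in the lower bits pass through unchanged.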
Injecting errors in the data or control path without error protection leads to a complete system failure or to wrong outputs. Enabling the autonomous error protection, in contrast, results in correct outputs with a very small performance overhead. In order to correct errors, some additional clock cycles are necessary, as described in the previous sections. The correction in the data path needs two additional clock cycles, whereas the number of clock cycles needed for re-execution of an erroneous control instruction depends on the currently executed instruction. On a simple program counter increment (no control flow instruction), we are able to correct the error in one additional clock cycle, whereas the correction of an erroneous return instruction needs five clock cycles. Furthermore, cache misses due to falsified branch or jump targets also have an impact on the latency.
The clock cycles per instruction (CPI) of the normal operation (no fault injection) and the autonomous operation (with error injection and correction) of the Turbo decoder running on the Leon3 processor are shown in Table III. In the case of the autonomous operation, the mean CPI is higher than in normal operation due to the additional clock cycles for correcting errors. Note that all measurements are taken from different runs, and the number of executed instructions varies from run to run.
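How the correction penalties enter the mean CPI can be shown with a short calculation. The penalties (two cycles for a data path error, one for a program counter increment, five for a return instruction) are the ones stated above; the instruction, cycle, and error counts are invented for illustration and are not the measurements of Table III. Cache-miss effects are ignored.

```python
DATA_PATH_PENALTY = 2      # cycles per corrected data path error
PC_INCREMENT_PENALTY = 1   # cycles per corrected program counter increment
RETURN_PENALTY = 5         # cycles per corrected return instruction

def mean_cpi(instructions, base_cycles, data_errors, pc_errors, return_errors):
    """Mean CPI including the extra cycles spent on error correction."""
    extra = (data_errors * DATA_PATH_PENALTY
             + pc_errors * PC_INCREMENT_PENALTY
             + return_errors * RETURN_PENALTY)
    return (base_cycles + extra) / instructions

# Hypothetical run: 1,000,000 instructions in 1,500,000 error-free cycles.
cpi_normal = mean_cpi(1_000_000, 1_500_000, 0, 0, 0)
cpi_autonomous = mean_cpi(1_000_000, 1_500_000, 1_000, 200, 50)
```

With these made-up counts, the corrections add 2,450 cycles, so the mean CPI rises from 1.5 to roughly 1.5025.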
The investigations show that reliability (in terms of resulting error rates and keeping latency constraints) depends on multiple conditions: application, transient error rate, where the errors occur, and protection technique. An exploration on the microarchitectural level is mandatory for analyzing the impact of transient errors in all parts of the hardware. The presented adaptive prototyping platform allows this exploration by emulating the scenarios under varying constraints and conditions, and thus makes it possible to exploit application level resilience in the design process. A typical emulation of the channel coding system under injected errors runs more than 20 hours with two Leon3 cores at a clock frequency of 60 MHz. In contrast, a software simulation on the microarchitectural level would take years and is thus infeasible. The presented platform is, to the best of our knowledge, the first rapid prototyping system which provides a holistic combination of techniques for error protection in processors and interconnect that drastically increases the reliability of MPSoCs.
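The gap between emulation and simulation can be made concrete with a rough estimate. The 20 hours and 60 MHz are taken from the text above; the simulator throughput of 1,000 clock cycles per second is purely an assumed figure for a microarchitectural simulation and is not a measured value.

```python
CLOCK_HZ = 60_000_000                  # 60 MHz emulation clock (from the text)
EMULATION_HOURS = 20                   # lower bound on the emulation run time

# Clock cycles emulated per core in one run: 4.32e12.
cycles_per_core = EMULATION_HOURS * 3600 * CLOCK_HZ

ASSUMED_SIM_CYCLES_PER_SECOND = 1_000  # hypothetical simulator throughput

sim_seconds = cycles_per_core / ASSUMED_SIM_CYCLES_PER_SECOND
sim_years = sim_seconds / (365 * 24 * 3600)
```

Under this assumption, a single core's workload would occupy the simulator for well over a century; even a simulator a hundred times faster would still need more than a year.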
VII. CONCLUSION
In this paper, we presented an adaptive system for rapid prototyping and verification of an error-resilient MPSoC architecture. A unique combination of multiple error protection techniques for processor cores and interconnect, which react autonomously to transient errors, has been implemented. The demonstration platform allows the fast architectural exploration of various error protection techniques under different failure rates for all signals in the circuit on the microarchitectural level. This is mandatory for exploiting application level resilience in the design process, because simulation including the error injection in hardware is infeasible. With LTE Turbo decoding, a relevant application for state-of-the-art wireless communication was chosen for demonstration.
