Abstract Computer-level faults leading to data errors in computations are predicted to occur increasingly frequently in future microprocessors. This work discusses the impact of such errors on closed-loop performance in implementations of digital control systems. A method is presented and exemplified that renders a control system more robust to data errors by introducing artificial signal limits and combining them with an anti-windup scheme.
Introduction
As computers, rather than electro-mechanical systems, are increasingly used for implementing control algorithms, control systems become more vulnerable to computer-level failures due to faults in semiconductor devices. Such faults are classified as: (i) transient faults, which are short-duration faults induced by neutron and alpha particles, power supply and interconnect noise, and electrostatic discharge, (ii) intermittent faults, which are recurring short-term faults that occur due to marginal hardware or aging effects, and (iii) permanent faults, which have the same causes as the intermittent faults but reflect irreversible physical changes. The trend towards decreasing transistor and interconnect dimensions, lower supply voltages, and higher operating frequencies contributes to increasing the occurrence rates of transient and intermittent faults, while improvements in design and manufacturing have led to decreasing permanent failure rates, less than 15 FIT (Failures In Time, or failures in $10^9$ hours of operation) for microprocessors in 2001 [4]. This paper will therefore focus on transient faults.
A microprocessor is built from SRAM memory cells, latches, and combinational logic. Faults in the microprocessor may result either in control-flow errors, in which case the instruction execution order is erroneous, or in data errors, in which case the executing program delivers erroneous results. Many control-flow errors may be detected with watchdog timers. Modern microprocessors have built-in protection against transient faults in memory cells (cache) using e.g. error-correcting codes and parity checks, while the remaining parts are essentially unprotected. Methods to implement pro- , when the failure rate will be comparable to that of unprotected memory elements. The corresponding fault rate for the present generation of microprocessors is approximately 1 FIT. The trend of increasing fault rates is also reported in [8], where it is predicted that a 32 Mbit static memory implemented in a 0.1 µm process will fail on average once every 5.7 years at sea level, a result that is expected to hold also for other logic circuits, such as flip-flops, latches, registers, and combinational circuits.
Orjan Askerdal

Department of Computer Engineering
In safety-critical applications it is imperative to have error detection and recovery functionality to meet the high demands on dependability. In some areas, such as the aircraft industry, fault tolerance is achieved by redundant hardware and high-end devices. In more cost-sensitive areas, such as the automotive industry, expensive solutions may not be feasible. Assuming a microprocessor with a (constant) transient failure rate of 10 FIT that hosts e.g. a braking function in a car, the mean time to computer failure (possibly leading to catastrophic failure for the vehicle) is more than 11,415 years for a single vehicle. For a series of 100,000 cars, the mean time to failure in one of the cars is 42 days. If a microprocessor executes a control algorithm that is likely to tolerate a transient data error, this device does not necessarily have to be as fault-tolerant, and can thus be less expensive. It is therefore of interest to investigate the impact of transient data errors on hardware hosting implementations of safety-critical control algorithms.
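The figures quoted above follow directly from the definition of FIT under a constant failure rate. The short sketch below reproduces the arithmetic; the function name `mttf_hours` and the 365-day year are assumptions made for illustration.

```python
HOURS_PER_YEAR = 24 * 365  # assumption: leap years ignored

def mttf_hours(fit: float, units: int = 1) -> float:
    """Mean time (hours) to the first failure among `units` identical devices.

    `fit` is the failure rate in failures per 1e9 device-hours,
    assumed constant and independent between devices.
    """
    return 1e9 / (fit * units)

single_vehicle_years = mttf_hours(10) / HOURS_PER_YEAR
fleet_days = mttf_hours(10, units=100_000) / 24

print(f"single vehicle: {single_vehicle_years:.1f} years")
print(f"fleet of 100,000: {fleet_days:.1f} days")
```

With 10 FIT the single-vehicle value is about 11,415.5 years and the fleet value about 41.7 days, in agreement with the rounded numbers in the text.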
Understanding the effect of data errors on general computer functionality is an intensively researched area, e.g., [12]. Methods for analyzing the effects of timing errors on control systems were presented in, e.g., [9, 19]. The effects on system stability of data errors caused by EMI bursts were investigated in [10]. However, catastrophic failures in a safety-critical system may occur before the system reaches instability, e.g. if some constraint on the control error is exceeded. Recent results [5, 18] show that many data errors will have a limited effect on control performance, i.e., control systems often have an inherent resilience or inertia to data errors. This path is pursued further in this paper. It is discussed how the implementation of the control algorithms influences the tolerance to data errors resulting from computer node failures. Related work is published in [2], where classical frequency-domain methods are used to investigate the effects of computer failures on linear feedback control systems.
The paper is organized as follows: In Section 2 a general model for the effect of data errors on a control algorithm is stated. Thereafter, controller realization and scaling are discussed in Section 3. Sections 4 and 5 show how computed signal bounds together with an anti-windup scheme may improve the recovery from data errors. This is illustrated with an example in Section 6. The results are discussed in Section 7, and summarized in Section 8.
Modeling Data Errors Caused by Computer Node Faults
Let
be a state-space realization of a general linear two-degrees-of-freedom discrete-time controller with internal state $z$, command signal $u_c$, process output $y$, and control signal $u$. The controlled process is assumed to be a linear time-invariant system
$Y(s) = G(s)U(s)$ (2)
with zero-order-hold sampling. A pseudo-code implementation of (1)
This algorithm would be implemented and executed in a computer node like the one in Figure 1. Erroneous computation results due to transient data errors would eventually be stored in $u$ or $z$, and would propagate to the controlled process and through the feedback loop. As illustrated in the figure, the internal components of a computer node are communication controllers, memories, microprocessors, and internal communication buses. All these components may be affected by faults. As the communication controller handles the information exchange with other nodes, quantities measured by sensors may become erroneous due to faults in the communication controller. The data path of a microprocessor consists of caches, registers, buses, and functional units (i.e., ALU, multiplier, etc.). If a fault occurs in any of these parts, it will affect the ongoing calculation if the faulty part is activated, rendering the result of that calculation incorrect. The communication inside the computer node is generally performed using buses. If a fault occurs on such a bus, the data currently being transferred will be affected, and the result may be a data error in any of the calculated data.
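To illustrate the difference between an erroneous result stored in the controller state and one stored in the control signal, the following minimal sketch steps a hypothetical scalar PI controller written in state-space form. The gains, the names `controller_step` and `run`, and the injected errors `err_z` and `err_u` are assumptions made for illustration; this is not the controller (1) of the paper, and the loop is held open (zero control error) to isolate the effect.

```python
def controller_step(z, u_c, y, err_z=0.0, err_u=0.0):
    """One sample of a scalar PI controller in state-space form.

    KP and KI_H (integral gain times sample period) are hypothetical values.
    err_z and err_u model additive data errors in the two computations.
    """
    KP, KI_H = 2.0, 0.5
    e = u_c - y
    u = z + KP * e + err_u          # output computation
    z_next = z + KI_H * e + err_z   # state update (integrator)
    return u, z_next

def run(error_target, n=4, magnitude=5.0):
    """Inject one error at sample k = 1 and record the control signal."""
    z, outputs = 0.0, []
    for k in range(n):
        hit = magnitude if k == 1 else 0.0
        u, z = controller_step(
            z, u_c=0.0, y=0.0,
            err_z=hit if error_target == "z" else 0.0,
            err_u=hit if error_target == "u" else 0.0,
        )
        outputs.append(u)
    return outputs

print(run("z"))  # [0.0, 0.0, 5.0, 5.0]
print(run("u"))  # [0.0, 5.0, 0.0, 0.0]
```

In this open-loop setting the error in the integrator state persists in every later sample, whereas the error in the control signal affects a single sample only. In closed loop both would additionally propagate through the process and the feedback path.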
Data errors may be regarded as bit-flips in the digital representation of the affected variable or signal. Single-bit errors tend to be more common than multiple-bit errors [1]. In an N-bit fixed-point representation with M fractional bits the number range is
with a resolution of $Q = 2^{-M}$. With this representation, the magnitudes of the errors will be in the same range as the control signals (assuming that the data have been properly scaled). In the IEEE floating-point standard with f fraction bits and e exponent bits the range is
and bit flips in the most significant exponent bits may lead to very large errors. A reasonable model of data errors is additive impulse disturbances with magnitudes of rectangular (uniform) distribution within the number representation range. The data error disturbances may then be included in (1) as
where $q_z(k)$ and $q_u(k)$ are impulse disturbances due to data errors affecting the computation of the controller state and control signal, respectively. Since transient data errors occur sporadically, it may be assumed that an error manifests in
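The contrast between bit-flips in fixed-point and floating-point words can be made concrete with a small sketch. It assumes a two's-complement fixed-point word (here 16 bits with 8 fractional bits, chosen arbitrarily) and the 64-bit IEEE 754 double format; the function names `flip_fixed` and `flip_double` are hypothetical.

```python
import struct

def flip_fixed(value: float, bit: int, n_bits: int = 16, frac_bits: int = 8) -> float:
    """Flip one bit of a two's-complement fixed-point word and decode it again."""
    scale = 1 << frac_bits
    raw = int(round(value * scale)) & ((1 << n_bits) - 1)  # encode as unsigned word
    raw ^= 1 << bit                                        # single-bit error
    if raw >= 1 << (n_bits - 1):                           # decode the sign bit
        raw -= 1 << n_bits
    return raw / scale

def flip_double(value: float, bit: int) -> float:
    """Flip one bit (0 = least significant) of an IEEE 754 double."""
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    raw ^= 1 << bit
    (out,) = struct.unpack("<d", struct.pack("<Q", raw))
    return out

print(flip_fixed(1.0, 0))    # least significant bit: 1.00390625
print(flip_fixed(1.0, 15))   # sign bit: -127.0
print(flip_double(0.5, 62))  # most significant exponent bit: 2**1023
```

In the fixed-point case the corrupted value stays within the word range (here from -128 to just below 128), while a flip of the top exponent bit of a double turns 0.5 into roughly 9e307.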
Controller Realizations
With a change of state variables $\bar{z}(k) = Tz(k)$ in (
Note that $q_z$ is not transformed in (4), since it appears internally in the controller implementation. Combining the controller realization (4) with a state-space realization of the sampled controlled process (2)
results in the closed-loop dynamics
Inspection of Dcl reveals that the structure (7) is preserved for all k > 1. Hence, the impulse responses will have the structures
The results are optimal realizations in the context of round-off errors, which have the same problem structure, but where the disturbances are close to stationary white noise processes.
Solutions to the minimization of $\|H_{qz}\|_p$ with respect to $T$, subject to the state scaling constraint $\|H_z\|_q \le \gamma$, are presented for $p, q \in \{2, \infty\}$ in [13]. For data errors it may be more natural to regard $p = q = 1$, i.e. peak-to-peak gain. It would be even more natural to minimize $\|h_{qz}\|_\infty$, since the disturbances are assumed to be impulses. However, in practice it is likely more feasible to optimize the realization with respect to the ever-present round-off noise than with respect to rare and sporadic data errors. In the following it will thus be assumed that the transformation $T$ is given, and we concentrate on the analysis of the influence of $q_z$ and $q_u$ on the closed-loop performance. Hence, we can consider the representations (3) and (5) of the controller and the process.
Signal Bounds
It is straightforward to include the effects of other bounded inputs to the closed-loop system, e.g. load disturbances, in the bounds of Equations (10). The bound (10a) may also be used for $\ell_1$ state scaling, as described in [7]. If the closed-loop system is well designed with proper damping, the bounds are not expected to be very conservative, which is also confirmed in [7]. If the signal bounds are exceeded, it can be concluded that an error has occurred in the system. Hence, the signal bounds may be used on-line to actively detect deviations from normal operation, similarly to the approach in e.g. [16]. Note, however, that in the cited work the bounds appear to be computed on the controller in open loop, which results in over-estimation of the bounds by orders of magnitude in comparison with the closed-loop bounds. The approach taken in the present work is to introduce explicit bounds in the controller that correspond to those of (10), and then use well-known anti-windup methods to handle these signal limitations in a graceful manner. This will give the system an inherent robustness to data errors that exceed the signal bounds.
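A standard way to obtain peak bounds on a signal is to multiply the peak of a bounded input by the $\ell_1$-norm of the relevant impulse response. The sketch below does this for a hypothetical first-order system. The system, the truncation length, and all function names are assumptions made for illustration; the paper's bounds (10) concern the closed loop rather than this toy example.

```python
def impulse_response(a, b, c, n):
    """First n samples of the impulse response of
    x(k+1) = a*x(k) + b*u(k), y(k) = c*x(k)."""
    h, x, u = [], 0.0, 1.0
    for _ in range(n):
        h.append(c * x)
        x = a * x + b * u
        u = 0.0  # the impulse is applied at k = 0 only
    return h

def peak_bound(h, input_peak):
    """Bound on |y(k)| for any input with |u(k)| <= input_peak,
    using the (truncated) l1-norm of the impulse response."""
    return input_peak * sum(abs(v) for v in h)

def simulate(a, b, c, u_seq):
    """Response of the same system to an arbitrary input sequence."""
    x, ys = 0.0, []
    for u in u_seq:
        ys.append(c * x)
        x = a * x + b * u
    return ys

h = impulse_response(a=0.5, b=1.0, c=1.0, n=200)
bound = peak_bound(h, input_peak=2.0)
peak = max(abs(y) for y in simulate(0.5, 1.0, 1.0, [2.0] * 200))

print(bound, peak)
exceeded = abs(10.0) > bound  # a sample far outside the bound indicates an error
```

For this system the bound is 4 (up to rounding), and a constant input at its peak value approaches the bound without exceeding it, so the bound is tight here. A sample well above the bound cannot have been produced by an admissible input and can be flagged as erroneous.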
Anti-windup
In the presence of control signal limitations, the control signal actually delivered to the controlled process will be $u(k) = \mathrm{sat}(v(k))$, where $v(k)$ is the linear control signal. When the output signal is saturated, the feedback path is broken and the controller states are driven in open loop, leading to deteriorated performance or even instability. If the controller has integral action this phenomenon is denoted integrator windup. To inhibit this behavior various anti-windup schemes may be applied [6]. Anti-windup should always be implemented in a controller with actuator saturation. In this work an explicit artificial limitation according to (10b) is introduced to make the system robust to data errors. In practice an actuator limitation will also be present. If the actuator saturation limit is smaller than the artificial limit (10b), then the smaller limit should be used. The gain K is chosen so as to obtain the desired observer dynamics given by &r. The anti-windup scheme will now reduce the effect of data errors that cause the controller output to exceed the estimated limits. Note that the observer-based anti-windup operates on all controller states, while certain other schemes only operate on the integrator state. Hence, errors in any of the controller states are eliminated by the anti-windup. Also note that the observer-based method does not require any additional states to be introduced. In the context of data errors this is important, since data errors affecting explicit anti-windup states would not be handled gracefully by the system. The closed-loop system resulting from combining (11) with (2) is shown in Figure 2.
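The mechanism by which anti-windup removes a data error from the controller state can be sketched for a hypothetical scalar PI controller. The correction term `k_aw * (u - v)` follows the usual observer-based form; the gains, the limit, and the function names are assumptions for illustration and not the paper's scheme (11). With `k_aw = 1.0` the state dynamics during saturation become dead-beat for this scalar case.

```python
def sat(v, limit):
    """Symmetric saturation to [-limit, limit]."""
    return max(-limit, min(limit, v))

def pi_step(z, e, limit, kp=2.0, ki_h=0.5, k_aw=1.0):
    """One sample of a PI controller with observer-style anti-windup.

    z    : integrator state
    e    : control error
    k_aw : anti-windup gain (0.0 disables it, 1.0 is dead-beat here)
    """
    v = z + kp * e                          # linear control signal
    u = sat(v, limit)                       # limited control signal
    z_next = z + ki_h * e + k_aw * (u - v)  # correction vanishes when u == v
    return u, z_next

corrupted_z = 1000.0  # integrator state after a hypothetical data error

_, z_with_aw = pi_step(corrupted_z, e=0.0, limit=1.0, k_aw=1.0)
_, z_without = pi_step(corrupted_z, e=0.0, limit=1.0, k_aw=0.0)
_, z_normal = pi_step(0.2, e=0.1, limit=1.0)  # unsaturated: anti-windup inactive

print(z_with_aw, z_without, z_normal)
```

With dead-beat anti-windup the corrupted state is pulled back to the limit value in a single sample, whereas without it the error stays in the integrator until the feedback loop removes it. In the unsaturated case the update is the ordinary PI update.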
Example
As an illustration of the inherent tolerance to transient faults that may be achieved with the combination of signal bounds and anti-windup, we study the control of the simple servo process $G(s) = \frac{100}{s(s + 10)}$
A discrete-time two-degrees-of-freedom tracking controller is synthesized using the polynomial pole-placement design method of [3]
(11) is implemented, with K chosen so as to obtain dead-beat dynamics, resulting in the closed-loop system of Figure 2. In Figure 3 the closed-loop response to a unit step in the command signal is shown. Note that the computed state bounds are very tight, while the control-signal bound seems to be a little more conservative. In Figure 4 the closed-loop response to a data error $q_{z_1}(k) = 5\delta(k)$, affecting the computation of the integrator state $z_1$, is shown for the case when the artificial signal limit and anti-windup are not applied. It can be seen how the integrator state slowly recovers, while the control error grows large. In the presence of actuator limitation, large data-error amplitudes even result in instability. Figure 5 shows the corresponding response with applica-
leads to larger control errors in this case. This is because the data error is interpreted as a saturation by the anti-windup, which is fast enough to react immediately. It is also important to note that the control error magnitude does not increase with the data error magnitude for large data errors.
(iii) Dead-beat anti-windup only on the integrator state (this is common in e.g. PI controllers). This is normally sufficient for handling actuator limitations, but in the case of data errors the performance is inferior compared with full-state anti-windup.
(iv) Dynamic anti-windup with a bandwidth of $2\omega_{cl}$. Here the response to data errors entering the control signal is better, since the anti-windup is too slow to react immediately to the error impulse. The response to data errors on the state is, however, worse than for the dead-beat design, as is clear from the figure. The recovery is also significantly slower than for the dead-beat design, which may be seen from time-domain response plots.
Discussion
In general it can be noted that the integrator state $z_1$ is most sensitive to data errors, which seems very intuitive since it depends on feedback of the control error to decay, while the controller state $z_2$ is stable with a short time constant, and decays by itself when the loop is broken by saturation. Another general observation is that the control-signal bound is computed from bounds on the command signal, and in consequence is related to the expected magnitude of control errors during normal operation. Hence, the artificial limits and the anti-windup scheme will capture data errors resulting in control errors larger than those expected during normal operation. The proposed method may also be combined with the dynamic bounds of [17], to increase the coverage for data errors.
Since the controller state $z$ is bounded by (10a), it may seem natural to introduce explicit limits also for this variable. However, simulations with saturations on the controller state indicate that the performance improvement is minor. The increased complexity resulting from additional saturations in the loop also makes the system difficult to analyze, even if it seems to perform well in the presence of data errors.
In the presence of stochastic signals such as measurement noise there will be a probability of false error detections. This may be handled by computing the resulting variance of the control signal. The deterministic control-signal bound (10b) is then adjusted by some measure depending on the variance. The size of the adjustment will determine the probability of false detections. Sporadic false detections will affect control performance as the anti-windup intervenes. By using slower anti-windup dynamics the noise sensitivity is decreased.
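One possible way to make the trade-off between the size of the adjustment and the false detection probability concrete is sketched below. It assumes that the noise contribution to the control signal is zero-mean Gaussian with standard deviation `sigma` and that the bound is widened by `k` standard deviations; both the Gaussian assumption and the function names are hypothetical and not taken from the paper.

```python
import math

def adjusted_bound(deterministic_bound: float, sigma: float, k: float) -> float:
    """Widen the deterministic bound by k standard deviations of the noise."""
    return deterministic_bound + k * sigma

def false_detection_upper_bound(k: float) -> float:
    """Upper bound on the per-sample false detection probability.

    If the noise-free signal respects the deterministic bound, a false
    detection requires |noise| > k*sigma, whose probability for a
    zero-mean Gaussian is erfc(k / sqrt(2)).
    """
    return math.erfc(k / math.sqrt(2.0))

for k in (1.0, 2.0, 3.0):
    print(k, adjusted_bound(1.0, sigma=0.05, k=k), false_detection_upper_bound(k))
```

Under these assumptions a margin of three standard deviations keeps the per-sample false detection probability below about 0.27 percent, at the cost of letting slightly larger data errors pass undetected.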
If a rate bound on the command signal is known, $|\Delta u_c(k)| = |u_c(k) - u_c(k-1)| \le M_\Delta$, then state and control-signal rate bounds may be computed in analogy with (10). An artificial rate bound on the control signal may then also be introduced in the controller, and used together with the anti-windup scheme. Note that the noise sensitivity will be worse than for the case with magnitude bounds, since the rate of the control signal will depend on the noise variance.
Summary
Data errors resulting from transient faults in the computer hardware may be modeled as impulse errors entering the control algorithm internally. The effect on closed-loop performance therefore depends on the controller realization. Previous methods to optimize controller realizations with respect to round-off errors in finite-precision numerics are applicable also in the context of transient data errors. Given a controller realization, robustness to data errors may be achieved by introducing artificial signal limitations based on $\ell_1$-bounds, in combination with an anti-windup scheme. As one may expect, the integrator state of the controller is most sensitive to data errors.
