Performance optimization for energy-aware adaptive checkpointing in embedded real-time systems by Li, Zhongwen et al.





Zhongwen Li, Hong Chen
Information Science and Technology College, Xiamen University, China

Shui Yu
School of Information Technology, Deakin University, Australia







Using additional store-checkpoints (SCPs) and compare-checkpoints (CCPs), we present an adaptive checkpointing scheme for double modular redundancy (DMR). The proposed approach dynamically adjusts the checkpoint intervals. We also design methods to calculate the optimal numbers of checkpoints, which minimize the average execution time of tasks. Further, the adaptive checkpointing is combined with dynamic voltage scaling (DVS) to reduce energy consumption. Simulation results show that, compared with previous methods, the proposed approach significantly increases the likelihood of timely task completion and





Checkpointing is an important fault-tolerance method for real-time systems that operate in harsh environments. Three types of checkpoints are well known: CSCP, SCP and CCP [1-3]. At a CCP, the states of the processors are compared without being stored, whereas at an SCP the processors store their states without comparing them. If both operations are performed at the same checkpoint, we call it a CSCP. Using CCPs and SCPs, Ziv and Bruck have shown numerically that the task execution time is significantly reduced [1,4]. Using additional CCPs and SCPs, Nakagawa and Fukumoto have analyzed, for triple modular redundancy and double modular redundancy respectively, the optimal checkpoint intervals that minimize the task execution time [5].
In addition, many real-time systems are energy-constrained, since the system lifetime is determined to a large extent by the battery lifetime [2]. Examples include autonomous airborne and sea-borne systems working on a limited battery supply, space systems working on a limited combination of solar and battery power, and time-sensitive systems deployed in remote locations where a steady power supply is not available [3,6]. DVS has emerged as a popular solution to the problem of reducing power consumption during system operation. DVS has become possible with the availability of embedded processors that can dynamically scale their frequency by adjusting the operating voltage [2,3]. Many embedded processors can now scale their operating voltage dynamically, for example the mobile processors from Intel with SpeedStep technology [7]. In the realm of real-time systems, DVS techniques focus on minimizing the energy consumption of the system while meeting the deadlines. DVS and fault tolerance for real-time systems have mostly been studied as separate problems. Only recently has an attempt been made to combine fault tolerance with DVS [3].

* This work is supported in part by the Fujian natural science grant (A0410004), the Fujian young science & technology innovation grant (2003J020), the NCETXMU 2004 program, the Program of 985 Innovation on Information in Xiamen Univ. (2004-2007) and the Xiamen Univ. research grant (0630-E23011).
Combining DVS with CSCPs and additional CCPs or SCPs can satisfy a system's DVS requirement and improve the performance of real-time systems. However, none of the papers mentioned above addresses these techniques in conjunction. In this paper, we extend the methods of [3] with additional SCPs and CCPs for double modular redundancy (DMR). Unlike the existing methods, our approach tunes the scheme to the specific system on which it is implemented and uses both the comparison and the storage operations efficiently, which improves the performance of the checkpointing scheme.
The notation used in this paper is listed below:
ts: the time to store the states of the processors.
tcp: the time to compare the processors' states.
tr: the time to roll back the processors to a consistent state.
Rt: the remaining execution time.
Rd: the time left before the deadline.
Authorized licensed use limited to: DEAKIN UNIVERSITY LIBRARY. Downloaded on May 12, 2009 at 00:35 from IEEE Xplore.  Restrictions apply.
Rf: an upper bound on the remaining number of faults that can be tolerated by the system.
 
2  Adaptive checkpointing scheme 
 
Assume that task τ has a period T, a deadline D, and a worst-case computation time N when there are no faults in the system. An upper bound k is given on the number of fault occurrences that have to be tolerated, and C is the overhead of a checkpoint. If faults arrive as a Poisson process with parameter λ, the average execution time of the task is minimized when a constant checkpoint interval of $\sqrt{2C/\lambda}$ is used [8]. We refer to this as the Poisson-arrival approach. If the Poisson-arrival scheme is used, the effective task execution time in the absence of faults must be less than the deadline D. If the fault-free execution time of a task is N, the worst-case execution time for up to k faults is minimized when the constant checkpoint interval is set to $\sqrt{NC/k}$ [9]. This is the k-fault-tolerant approach.
In addition, we assume that task τ is divided equally into n intervals of length $T = N/n$, and that a CSCP is always placed at the end of each interval.
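As a small numerical illustration of the two constant-interval rules above, the following Python sketch evaluates the Poisson-arrival interval and the k-fault-tolerant interval. The function names and the sample values are assumptions made for this example only; they do not come from the paper.

```python
import math


def poisson_arrival_interval(C: float, lam: float) -> float:
    """Constant interval of the Poisson-arrival approach: sqrt(2C / lambda)."""
    if C <= 0 or lam <= 0:
        raise ValueError("C and lam must be positive")
    return math.sqrt(2.0 * C / lam)


def k_fault_tolerant_interval(N: float, C: float, k: float) -> float:
    """Constant interval of the k-fault-tolerant approach: sqrt(N * C / k)."""
    if N <= 0 or C <= 0 or k <= 0:
        raise ValueError("N, C and k must be positive")
    return math.sqrt(N * C / k)


def cscp_interval_length(N: float, n: int) -> float:
    """Length T = N / n of each of the n equal CSCP intervals."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return N / n


if __name__ == "__main__":
    # Illustrative values only (hypothetical task, not a result from the paper).
    print(poisson_arrival_interval(C=22, lam=0.001))
    print(k_fault_tolerant_interval(N=8000, C=22, k=5))
    print(cscp_interval_length(N=8000, n=40))
```

Both intervals grow with the checkpoint overhead C and shrink as faults become more frequent (larger λ or larger k).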
 
2.1  Additional SCPs 
 
Each CSCP interval is divided equally into m intervals of length T1 = T/m (Figure 1). The SCPs are placed between the CSCPs, so the states of the two processors are stored at iT1 (i = 1, 2, …, m-1) and at jT. If the two states do not agree at time jT, we need to find the most recent SCP with identical states and roll back to it. As shown in Figure 1, when some errors have occurred during ((i-1)T1, iT1), the two processors are rolled back to (i-1)T1 and repeat the execution from (i-1)T1. The average execution time R1(m) for a CSCP interval ((j-1)T, jT) is given by a renewal equation [4, 10]:




Letting $m = T/T_1$, we have
$$R_1(T_1) = T + \frac{T}{T_1}t_s + t_{cp} + \left[\frac{T(T+T_1)}{2T_1^2}(T_1 + t_s) + \frac{T}{T_1}t_{cp}\right]\left(e^{2\lambda T_1} - 1\right) \qquad (1)$$
If $T_1 \to 0^+$, then $R_1(T_1) = +\infty$. Letting $T_1 = T$, we have $R_1(T_1) = (T + t_s + t_{cp})e^{2\lambda T}$. Thus, there exists a finite $\tilde{T}_1 \in ((j-1)T, jT]$ which minimizes $R_1(T_1)$. Differentiating equation (1) with respect to $T_1$ and setting the derivative equal to zero, we get $\tilde{T}_1$. Procedure num_SCP(T) for calculating the $\tilde{m}$ which minimizes $R_1(\tilde{m})$ is described in Figure 2.
The adaptive checkpointing with SCPs, adapchp-SCP(D, N, C, k, λ), is described in Figure 3. Line 4 checks whether the task has been completed, and line 5 checks the deadline constraint. The lengths of the SCP and CSCP intervals are set in line 6 and line 7, respectively. Line 9 checks whether a fault has been detected. If there is no fault, the task continues to run; otherwise, the processors roll back to the most recent SCP with identical states and continue execution, as described in lines 12 to 16. In lines 2 and 14, we use procedure interval(Rd, Rt, C, Rf, λ) [3] (Figure 4) to calculate the checkpoint interval.

Fig. 2  Procedure for calculating m
Procedure num_SCP(T){
1.  Find T̃1 which minimizes R1(T1);
2.  if (T̃1<T) {
3.    m=⌊T/T̃1⌋;
4.    if (R1(m)≤R1(m+1)) then
5.      m̃=m;
6.    else m̃=m+1;
7.  } else m̃=1;
8.  return m̃; }

Fig. 3  Adaptive checkpointing with SCPs
Procedure adapchp-SCP(D, N, C, k, λ){
1.  Rt=N; Rd=D; Rf=k;
2.  Itv=interval(Rd, Rt, C, Rf, λ);
3.  m=num_SCP(Itv); itv=Itv/m;
4.  while (Rt>0) do {
5.    if (Rt>Rd) break with task failure;
6.    Insert SCP with interval length itv;
7.    Insert CSCP with interval length Itv;
8.    Update Rt, Rd;
9.    if (no error has been detected at CSCP)
10.     Resume execution;
11.   else {
12.     Rollback to the most recent SCP with identical states;
13.     Rf=Rf-1;
14.     Itv=interval(Rd, Rt, C, Rf, λ);
15.     m=num_SCP(Itv); itv=Itv/m;
16.     Resume execution;}
}}

Fig. 1  Task execution with SCPs (timeline of one CSCP interval from (j-1)T to jT with SCPs at T1, …, (i-1)T1, iT1, …, (m-1)T1; the error is detected at jT and the rollback point is (i-1)T1)
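The selection logic of Figure 2 can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the continuous minimizer of step 1 and the cost function R1 (as a function of m) are supplied by the caller, and `toy_cost` is a hypothetical convex stand-in used only to exercise the code. It is not equation (1).

```python
import math
from typing import Callable


def num_scp(T: float, t1_opt: float, cost: Callable[[int], float]) -> int:
    """Choose the number m of sub-intervals per CSCP interval (steps 2-8 of Fig. 2).

    T      -- length of one CSCP interval
    t1_opt -- continuous minimizer of the cost (step 1), computed by the caller
    cost   -- cost(m): average execution time when m sub-intervals are used
    """
    if T <= 0 or t1_opt <= 0:
        raise ValueError("T and t1_opt must be positive")
    if t1_opt < T:
        m = math.floor(T / t1_opt)  # integer part of T / t1_opt, at least 1 here
        # Compare the two integer candidates around the continuous optimum.
        return m if cost(m) <= cost(m + 1) else m + 1
    return 1


def toy_cost(m: int) -> float:
    """Hypothetical cost: overhead grows with m, re-execution shrinks with m."""
    return 2.0 * m + 100.0 / m


if __name__ == "__main__":
    T = 100.0
    t1_opt = T / math.sqrt(50.0)  # continuous optimum of toy_cost, in terms of T1
    print(num_scp(T, t1_opt, toy_cost))
```

Passing the cost as a callable keeps the rounding step independent of the particular closed form used for R1, so the same helper could serve the CCP case with R2.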
In Figure 4, $I_1(C,\lambda) = \sqrt{2C/\lambda}$ is the checkpoint interval of the Poisson-arrival approach, and $I_2(N,k,C) = \sqrt{NC/k}$ is the checkpoint interval of the k-fault-tolerant approach. In addition, [3] defines the following quantities:
$Th_\lambda(R_d,\lambda,C) = (R_d + C)/(1 + \sqrt{\lambda C/2})$
$Th(R_d,R_f,C) = R_d + C + 2R_fC - 2\sqrt{R_fC(R_d + C + R_fC)}$
$I_3(N,D,C) = 2NC/(D + C - N)$
Line 1 of Figure 4 calculates the number of faults exp_error that are expected to occur in the remaining time Rt. If exp_error is less than or equal to Rf, the k-fault-tolerant requirement is deemed to be more stringent than the Poisson-arrival criterion. Line 3 checks whether Rt exceeds the threshold Thλ(Rd, λ, C). If this condition is satisfied, the checkpoint interval is set to I3(Rt, Rd, C). Line 5 checks whether Rt exceeds the threshold Th(Rd, Rf, C) but is below Thλ(Rd, λ, C). If this condition is satisfied, the checkpoint interval is set to I2(Rt, exp_error, C). If the k-fault-tolerant threshold is met, the checkpoint interval is set to I2(Rt, Rf, C) in line 7. Lines 8-10 handle the case in which the k-fault-tolerant requirement is deemed to be less stringent than the Poisson-arrival criterion.
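The decision logic of procedure interval can be sketched in Python as follows. This is a sketch under assumptions, not the reference implementation of [3]: the helper names are chosen for this example, and the two threshold formulas are written as we read them from the text above, so they should be checked against [3] before use.

```python
import math


def I1(C, lam):
    """Poisson-arrival interval."""
    return math.sqrt(2.0 * C / lam)


def I2(N, k, C):
    """k-fault-tolerant interval."""
    return math.sqrt(N * C / k)


def I3(N, D, C):
    """Interval used when the remaining time is close to the deadline."""
    return 2.0 * N * C / (D + C - N)


def th_lambda(Rd, lam, C):
    """Threshold Th_lambda(Rd, lambda, C), as read from the text (assumption)."""
    return (Rd + C) / (1.0 + math.sqrt(lam * C / 2.0))


def th_k(Rd, Rf, C):
    """Threshold Th(Rd, Rf, C), as read from the text (assumption)."""
    return Rd + C + 2.0 * Rf * C - 2.0 * math.sqrt(Rf * C * (Rd + C + Rf * C))


def interval(Rd, Rt, C, Rf, lam):
    """Checkpoint interval for remaining deadline Rd and remaining work Rt."""
    if lam <= 0 or C <= 0:
        raise ValueError("lam and C must be positive")
    exp_error = lam * Rt  # expected number of faults in the remaining time
    if exp_error <= Rf:
        # k-fault-tolerant requirement is the more stringent one
        if Rt > th_lambda(Rd, lam, C):
            return I3(Rt, Rd, C)
        if Rt > th_k(Rd, Rf, C):
            return I2(Rt, exp_error, C)
        return I2(Rt, Rf, C)
    # Poisson-arrival criterion is the more stringent one
    if Rt > th_lambda(Rd, lam, C):
        return I3(Rt, Rd, C)
    return I1(C, lam)


if __name__ == "__main__":
    # Illustrative call with D = 10000 and c = 22 as in Section 4.
    print(interval(Rd=10000, Rt=8000, C=22, Rf=5, lam=0.0014))
```

The sketch does not guard against Rt ≥ Rd + C, where I3 is undefined or negative; in the adaptive procedures the deadline check precedes the call.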
 
2.2  Additional CCPs 
 
Each CSCP interval is divided equally into m intervals of length T2 = T/m. The CCPs are placed between the CSCPs, and the states of the two processors are compared at iT2 (i = 1, 2, …, m-1) and at jT. If the two states do not agree at iT2 or jT, some errors have occurred during this interval, and the two processors are rolled back to (j-1)T (Figure 5).
The average execution time R2(m) for an interval ((j-1)T, jT) is given by a renewal equation:
Therefore, the average execution time is RCCP(n) = nR2(m).




















If $T_2 \to 0^+$, then $R_2(T_2) = +\infty$. If $T_2 = T$, then $R_2(T_2) = (T + t_s + t_{cp})e^{2\lambda T}$. Therefore, there exists a finite $\tilde{T}_2 \in ((j-1)T, jT]$ which minimizes $R_2(T_2)$. Differentiating equation (2) with respect to $T_2$ and setting the derivative to zero, we can get $\tilde{T}_2$. An approach similar to the one described in Figure 2 can be used to calculate the $\tilde{m}$ which minimizes $R_2(\tilde{m})$.
 
3  Adaptive checkpointing with DVS 
 
With additional SCPs and CCPs, we show how the adaptive checkpointing scheme can be combined with DVS to obtain fault tolerance and power savings in real-time systems. On the one hand, our approach maximizes the probability that the task meets its deadline in the presence of faults. On the other hand, it reduces energy consumption through DVS.
Assume that task τ requires a fixed number of computation cycles N in the fault-free condition. Because variable-voltage CPUs are available, the time to execute task τ depends on the processor speed. We therefore characterize τ by a fixed quantity N, namely the worst-case number of CPU cycles needed to execute the task at the minimum processor speed. For the rest of this paper, we normalize the units of N such that the minimum processor speed is 1. That is, if the minimum processor speed is S cycles per second, we express the number of cycles in units of S cycles and thus normalize the minimum processor speed to Smin = 1. Accordingly, period T and deadline D are expressed in terms of the number of CPU cycles at the
Fig. 4  Calculating the checkpointing interval
Procedure interval(Rd, Rt, C, Rf, λ){
1.  exp_error=λRt;
2.  if (exp_error≤Rf) {
3.    if (Rt>Thλ(Rd, λ, C)) then
4.      chk_interval=I3(Rt, Rd, C);
5.    else if (Rt>Th(Rd, Rf, C)) then
6.      chk_interval=I2(Rt, exp_error, C);
7.    else chk_interval=I2(Rt, Rf, C);}
8.  else { if (Rt>Thλ(Rd, λ, C)) then
9.      chk_interval=I3(Rt, Rd, C);
10.   else chk_interval=I1(C, λ);}
}



















( ) ( )

















R m mT mt t e

























Fig. 5  Task execution with CCPs (timeline of one CSCP interval ending at jT with CCPs at T2, …, (i-1)T2, iT2, …, (m-1)T2; legend: error, CCP, CSCP, error detection, rollback)
 minimum processor speed.  
To simplify the analysis and to allow the derivation of analytical formulas, we assume a single processor with two speeds f1 and f2, where f1 is the minimum processor speed, namely f1 = Smin = 1. Moreover, the processor can switch its speed in a negligible amount of time.
The additional notation we use is listed below:
Rc: the number of instructions of the task that remain to be executed at the time of the voltage scaling decision.
c: the number of clock cycles that a single checkpoint takes.
t_est: an estimate of the time that the task has to execute in the presence of faults and with checkpointing. The expected number of faults for the duration t_est is λt_est.
The checkpointing cost C at frequency f is given by C = c/f.
To ensure λt_est-fault tolerance during task execution, the checkpointing interval must be set to















We consider the voltage scaling to be feasible if $t_{est} \le R_d$. This forms the basis of the energy-aware adaptive checkpointing procedures adapchp_dvs_SCPs and adapchp_dvs_CCPs (Figure 6 and Figure 7).
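The feasibility test above amounts to a two-way speed choice, which the following Python sketch illustrates. It is a sketch under assumptions: the estimator of t_est is passed in as a callable because its closed form is not reproduced here, and `naive_t_est` is a hypothetical placeholder that ignores faults and checkpoint overhead.

```python
from typing import Callable


def checkpoint_cost(c: float, f: float) -> float:
    """Checkpointing cost C = c / f at frequency f."""
    if f <= 0:
        raise ValueError("f must be positive")
    return c / f


def select_speed(Rc: float, Rd: float, f1: float, f2: float,
                 t_est: Callable[[float, float], float]) -> float:
    """Return the low speed f1 if scaling is feasible (t_est <= Rd), else f2."""
    return f1 if t_est(Rc, f1) <= Rd else f2


def naive_t_est(Rc: float, f: float) -> float:
    """Hypothetical estimator: fault-free time only, for demonstration."""
    return Rc / f


if __name__ == "__main__":
    f1, f2 = 1.0, 2.0  # f2 = 2 * f1, as in the simulation setup of Section 4
    print(select_speed(8000, 10000, f1, f2, naive_t_est))
    print(select_speed(12000, 10000, f1, f2, naive_t_est))
    print(checkpoint_cost(22, f2))
```

A realistic estimator would add the expected re-execution and checkpointing time to Rc/f, which makes the switch to f2 happen earlier than with the naive placeholder.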
 
Fig. 6  adapchp_dvs_SCPs
Procedure adapchp_dvs_SCP(D, N, c, k, λ){
1.  Rc=N; Rd=D; Rf=k;
2.  if (t_est(Rc, f1)≤Rd) f=f1; else f=f2;
3.  Itv=interval(Rd, Rc/f, c/f, Rf, λ);
4.  m=num_SCP(Itv); itv=Itv/m;
5.  while (Rc>0) do {
6.    if (Rc/f>Rd) break with task failure;
7.    Insert SCP with interval length itv;
8.    Insert CSCP with interval length Itv;
9.    Update Rc, Rd according to speed f;
10.   if (no error has been detected at CSCP)
11.     Resume execution;
12.   else {
13.     Roll back to the most recent SCP/CSCP with identical states;
14.     Rf=Rf-1;
15.     if (t_est(Rc, f1)≤Rd) f=f1; else f=f2;
16.     Itv=interval(Rd, Rc/f, c/f, Rf, λ);
17.     m=num_SCP(Itv); itv=Itv/m;
18.     Resume execution;}
}}
 




Fig. 7  adapchp_dvs_CCPs
Procedure adapchp_dvs_CCP(D, N, c, k, λ){
1.  Rc=N; Rd=D; Rf=k;
2.  if (t_est(Rc, f1)≤Rd) f=f1; else f=f2;
3.  Itv=interval(Rd, Rc/f, c/f, Rf, λ);
4.  m=num_CCP(Itv); itv=Itv/m;
5.  while (Rc>0) do {
6.    if (Rc/f>Rd) break with task failure;
7.    Insert CCP with interval length itv;
8.    Insert CSCP with interval length Itv;
9.    Update Rc, Rd according to speed f;
10.   if (no error has been detected)
11.     Resume execution;
12.   else {
13.     Roll back to the last CSCP;
14.     Rf=Rf-1;
15.     if (t_est(Rc, f1)≤Rd) f=f1; else f=f2;
16.     Itv=interval(Rd, Rc/f, c/f, Rf, λ);
17.     m=num_CCP(Itv); itv=Itv/m;
18.     Resume execution;}
}}
 
4  Simulation results 
 
We carried out a set of simulation experiments to evaluate our adaptive checkpointing schemes adapchp_dvs_CCPs and adapchp_dvs_SCPs (referred to as A_D_C and A_D_S) and to compare them with the Poisson-arrival (referred to as Poisson) and k-fault-tolerant (referred to as k-f-t) checkpointing schemes and with ADT_DVS [3] (referred to as A_D). Faults are injected into the system using a Poisson process with various values of the arrival rate λ. Because the fault arrival process is stochastic, each experiment is repeated 10,000 times for the same task and the results are averaged over these runs. We are interested in the probability that the task completes on time and in the energy consumption. Energy consumption is measured by summing, over all segments of the task, the product of the square of the voltage and the number of computation cycles [3]. As in [3], we use the term task utilization U to refer to the ratio N/D. To compare with the results of the ADT_DVS scheme, we let tr = 0 and f2 = 2f1. Moreover, let P and E denote the probability of timely completion of tasks and the energy consumption, respectively.
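The two reported metrics can be illustrated with a short Python sketch. It is a simplified stand-in, not the simulator used for the tables: it assumes a single fixed checkpoint interval at unit speed, a fault detected at the checkpoint that ends the affected segment, and zero rollback time (tr = 0). All function and parameter names are chosen for this example.

```python
import math
import random


def energy(segments):
    """Sum of voltage^2 * cycles over (voltage, cycles) task segments."""
    return sum(v * v * cycles for v, cycles in segments)


def utilization(N, D):
    """Task utilization U = N / D."""
    return N / D


def timely_completion_probability(N, D, C, itv, lam, runs=10_000, seed=0):
    """Monte Carlo estimate of P for a fixed checkpoint interval itv."""
    rng = random.Random(seed)  # seeded so that runs are reproducible
    on_time = 0
    for _ in range(runs):
        remaining, clock = N, 0.0
        while remaining > 0 and clock <= D:
            work = min(itv, remaining)
            seg = work + C  # useful work plus one checkpoint
            clock += seg
            # A Poisson process with rate lam hits the segment at least once
            # with probability 1 - exp(-lam * seg); then the work is redone.
            if rng.random() >= 1.0 - math.exp(-lam * seg):
                remaining -= work
        if remaining <= 0 and clock <= D:
            on_time += 1
    return on_time / runs


if __name__ == "__main__":
    print(utilization(8000, 10000))
    print(energy([(1.0, 6000), (2.0, 2000)]))
    print(timely_completion_probability(8000, 10000, 22, 200, 0.0005, runs=2000))
```

Averaging over many seeded runs mirrors the repetition over 10,000 experiments described above; an adaptive scheme would recompute the interval after every detected fault instead of keeping itv fixed.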
 
4.1  Additional SCPs  
 
As mentioned previously, the additional-SCPs scheme fits systems in which the time overhead is determined mainly by the time to compare the processors' states. Therefore, the parameters are as follows: D = 10000, ts = 2, tcp = 20, c = 22.
First, we let the Poisson-arrival and the k-fault-tolerant schemes use the lower speed f1. The task utilization U in this case is N/(f1D). Our experimental results are shown in Table 1. In Table 1(a), for λ > 0.001 and 0.7 < U < 0.9 (high fault arrival rate and relatively high task utilization), the experimental results show that the adapchp_dvs_SCPs scheme clearly outperforms the ADT_DVS scheme. Although the Poisson-arrival and the k-fault-tolerant schemes have lower energy consumption, their probabilities of timely task completion are lower than 0.2. In Table 1(b), for λ < 0.001 and 0.9 < U ≤ 1 (low fault arrival rate and high task utilization), we draw similar conclusions.
Next, we let both the Poisson-arrival and the k-fault-tolerant schemes use the higher speed f2. The task utilization U in this case is N/(f2D). Our experimental results are shown in Table 2. We can again conclude that our scheme outperforms the other three schemes.
 
Tab. 1 The comparison between 
adapchp-dvs-SCPs and other algorithms, both 
the Poisson-arrival and the k-fault-tolerant 




















































































































































(b) k = 1
 
 
Tab. 2 The comparison between 
adapchp-dvs-SCPs and other algorithms, both 
the Poisson-arrival and the k-fault-tolerant 





























































































































(b) k = 1
 
  
4.2  Additional CCPs  
 
The additional-CCPs scheme fits systems in which the time overhead is determined mainly by the time to store the processors' states. Therefore, the parameters are as follows: D = 10000, ts = 20, tcp = 2, c = 22.
Tab. 3 The comparison between 
adapchp-dvs-CCPs and other algorithms, both 
the Poisson-arrival and the k-fault-tolerant 














































































































































(b) k = 1
Tab. 4 The comparison between 
adapchp-dvs-CCPs and other algorithms, both 
the Poisson-arrival and the k-fault-tolerant 





































































































































(b) k = 1
Our experimental results are shown in Table 3 and Table 4. As in Section 4.1, the simulation results show that, compared with the ADT_DVS scheme, the proposed scheme significantly increases the likelihood of timely task completion and reduces power consumption in the




In this paper, we presented an adaptive checkpointing scheme that uses a DMR with two processors and is tuned to the specific system on which it is implemented. The proposed scheme inserts two types of checkpoints (CCPs and SCPs) between CSCPs. Separating the comparison and store operations makes it possible to choose the optimal interval for each operation without regard to the other. We also discussed the optimal numbers of checkpoints that minimize the average execution time. Based on that, we combined the adaptive checkpointing with DVS to achieve energy reduction. We presented simulation results which showed the advantages of our scheme. We will extend the proposed scheme to other task




[1] Ziv A. Analysis of checkpointing schemes with task duplication, IEEE Trans. on Computers, 1998, 47(2):222-227
[2] Zhang Y, Chakrabarty K. Task feasibility analysis and dynamic voltage scaling in fault-tolerant real-time embedded systems, Proc. of the Design, Automation and Test in Europe Conference and Exhibition (DATE'04)
[3] Zhang Y, Chakrabarty K. Energy-aware adaptive checkpointing in embedded real-time systems, Proc. of the Design, Automation and Test in Europe Conference and Exhibition (DATE'03), 2003
[4] Ziv A, Bruck J. Performance optimization of checkpointing schemes with task duplication, IEEE Trans. on Computers, 1997, 46(12):1381-1386
[5] Nakagawa S, Fukumoto S, Ishii N. Optimal checkpointing intervals of three error detection schemes by a double modular redundancy, Mathematical and Computer Modelling, 2003, 38:1357-1363
[6] Melhem R, Mosse D, Elnozahy E. The interplay of power management and fault recovery in real-time systems, IEEE Trans. on Computers, 2004, 53(2):217-231
[7] Intel Corp. SpeedStep, http://developer.inte.com/mobile/pentiumIII, 2003
[8] Duda A. The effects of checkpointing on program execution time, Information Processing Letters, 1983, 16:221-229
[9] Lee H, Shin H, Min S. Worst case timing requirement of real-time tasks with time redundancy, Proc. Real-Time Computing Systems and Applications, 1999: 410-414
[10] Osaki S. Applied stochastic system modeling, Springer-Verlag, 1992
 
