Soft Error-Aware Design Optimization of Low Power and Time-Constrained Embedded Systems by Shafik, Rishad Ahmed et al.
Soft Error-Aware Design Optimization of Low
Power and Time-Constrained Embedded Systems
Rishad A. Shafik†, Bashir M. Al-Hashimi†, Krishnendu Chakrabarty‡
†School of ECS, University of Southampton, Southampton, SO17 1BJ, UK, e-mail: {ras06r, bmah}@ecs.soton.ac.uk
‡Department of ECE, Duke University, Durham, NC 27708, USA, e-mail: krish@ee.duke.edu
Abstract—In this paper, we examine the impact of application
task mapping on the reliability of MPSoC in the presence
of single-event upsets (SEUs). We propose a novel soft error-
aware design optimization using joint power minimization with
voltage scaling and reliability improvement through application
task mapping. The aim is to minimize the number of SEUs
experienced by the MPSoC for a suitably identified voltage
scaling of the system processing cores such that the power is
reduced and the specified real-time constraint is met. We evaluate
the effectiveness of the proposed optimization technique using an
MPEG-2 decoder and random task graphs. We show that for an
MPEG-2 decoder with four processing cores, our optimization
technique produces a design that experiences 38% fewer SEUs
than soft error-unaware design optimization for a soft error rate
of 10^{-9}, while consuming 9% less power and meeting a given
real-time constraint. Furthermore, we investigate the impact of
architecture allocation (varying the number of MPSoC cores)
on the power consumption and SEUs experienced. We show
that for an MPSoC with six processing cores and a given
real-time constraint, the proposed technique experiences up to 7%
fewer SEUs compared to soft error-unaware optimization, while
consuming only 3% more power.
I. INTRODUCTION
Dynamic Voltage Scaling (DVS) is an effective power
minimization technique often employed in hand-held devices
to extend battery life [1]. However, it has been reported that
the reduction of supply voltage causes an exponential increase
in the rate of soft errors, particularly that of SEUs, leading
to degradation of reliability [2]. This is further exacerbated
by device miniaturization and continuing technology scal-
ing [3]. As a result, reliability is emerging as a key challenge
for low power system design [4]. Several researchers have
proposed a number of power-aware fault tolerance techniques,
such as hardware redundancy [4], time and information
redundancy [5], task re-execution or replication [6], and
checkpointing or pre-emptive online scheduling [7]. Recently, fault
tolerance-based optimization of cost-constrained distributed
real-time systems has been proposed in [8]. The fault tolerance
in [8] is achieved through mapping and assignment of fault
tolerance policies to different processes. In [9], an approach to
fault-tolerant design using process re-execution and schedul-
ing of low power MPSoC applications has been proposed.
The works reported in [8, 9] do not consider the impact
of application task mapping on the reliability of MPSoC
architecture. In this work, we present the first study (to the
best of our knowledge) of the impact of task mapping on
reliability of MPSoC applications in the presence of SEUs.
Based on this study, we propose a novel soft error-aware
design optimization to minimize the power consumption and
improve reliability in terms of minimized number of SEUs
experienced through application task mapping, while meeting
a given real-time constraint.
Fig. 1: MPSoC architecture with four processing cores (figure omitted; blocks shown: a clock tree generator with outputs f1 to f4, and four cores, each with an ARM7 processor, cache and memory)
Scaling, s | f, MHz | Vdd, V
1 | 200 | 1
2 | 100 | 0.58
3 | 66.7 | 0.44
TABLE I: Different operating f and Vdd for different voltage scalings of ARM7TDMI
II. PRELIMINARIES
A. Architecture Model
We consider a homogeneous MPSoC architecture, A, composed
of C identical processing cores with dedicated inter-core
communication due to its high performance [1]. Fig. 1
shows one such architecture with four processing cores. Each
processing core consists of an ARM7 processor, data and
instruction cache (8kbits and 16kbits), and private memory
(512kbits). The cache and memory sizes have been chosen to
provide high availability of data and parallelism among the
processing cores. To minimize power, a clock tree generator is
used to feed different voltages and frequencies to the MPSoC
processing cores (Fig. 1). The processor dynamic power is
\[ P_{dyn} = \alpha C_L f V_{dd}^{2} . \tag{1} \]
The DVS technique reduces power consumption in (1) by
reducing Vdd and f. For ARM7TDMI, Vdd (in volts) is
expressed in terms of f (in MHz) as [10]
\[ V_{dd}(f,s) = 0.1667 + \frac{4.1667 \times f}{10^{3} \times s}, \tag{2} \]
where s is the voltage scaling coefficient. Table I shows the
different voltage scaling options used in this work. The impact
of choice of voltage scaling levels on design optimization is
discussed in Section V.
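As a quick numerical check of (2) against Table I, the following minimal Python sketch evaluates the supply voltage for the three scaling coefficients. It assumes that f in (2) is the nominal (s = 1) frequency of 200 MHz and that the operating frequency is f/s; this reading is not stated explicitly in the text, but it reproduces the Table I entries. The names `NOMINAL_F_MHZ`, `vdd` and `operating_point` are illustrative only.

```python
NOMINAL_F_MHZ = 200.0  # assumption: f in Eq. (2) is the unscaled (s = 1) frequency

def vdd(f_mhz: float, s: int) -> float:
    """Supply voltage in volts from Eq. (2)."""
    return 0.1667 + (4.1667 * f_mhz) / (1e3 * s)

def operating_point(s: int, f_nominal_mhz: float = NOMINAL_F_MHZ):
    """Return (operating frequency in MHz, Vdd in V) for scaling coefficient s."""
    return f_nominal_mhz / s, vdd(f_nominal_mhz, s)

for s in (1, 2, 3):
    f, v = operating_point(s)
    print(f"s={s}: f={f:.1f} MHz, Vdd={v:.2f} V")
```

Under this assumption the rounded values agree with Table I (200 MHz at 1 V, 100 MHz at 0.58 V, 66.7 MHz at 0.44 V).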
B. Application and Fault Models
We model an application as a directed, acyclic task graph
G(V,E) with N nodes. Each node ti∈V represents one
computational task within the application and each edge
dij∈E represents inter-task communication and dependency.
An application is realized on the MPSoC by distributing the tasks
among the processing cores through the task mapping process.
Fig. 2 shows an example task graph of an MPEG-2 video
decoder using eleven tasks. The computational and
communication costs of the tasks are shown as numbers on the
nodes and edges in Fig. 2. The computational cost represents the
execution time of each task and the communication cost
represents the time required to transfer data between tasks (all
costs are multiples of 5.5×10^6 clock cycles). All costs are
obtained from SystemC cycle-accurate simulation assuming
32-bit inter-core transfer.
Fig. 2: MPEG-2 video decoder task graph (figure omitted). Node labels: Decode Header Sequences; Decode Frame/Slice Headers; Decode Macroblock Sequences; Run-length Decode Block; Inverse Scan Blocks; Inverse Quantize Blocks; Inv. DCT by row; Inv. DCT by column; Motion Compens. Blocks; Add Blocks; Store/Display Frame. Computational costs: t1(10), t2(15), t3(16), t4(31), t5(25), t6(39), t7(63), t8(61), t9(48), t10(41), t11(21). Communication costs on the edges range from 1 to 4.
In this work, fault injection is carried out using an SEU-based
fault model employing the technique reported in [11]. The
fault injection is initiated by replacing the original
data/signal types with fault injection enabler types, which
facilitates a centralized list of register space for fault injections. For
a given soft error rate (SER, expressed as number of SEUs per
bit per cycle), the number of SEUs to be injected is identified
and their locations are determined using a Poisson distribution.
Using SystemC cycle-accurate simulation, the register usage,
including cache and memory registers, and the number of SEUs
experienced are found.
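To make the idea of Poisson-distributed fault locations concrete, the following Python sketch places upsets in a register space of `bits` bits observed over `cycles` clock cycles. It is not the injection technique of [11]; it is an illustrative model that assumes a homogeneous Poisson process over bit-cycles, realized by drawing exponentially distributed gaps between upsets. The function name and parameters are hypothetical.

```python
import random

def sample_seu_locations(ser, bits, cycles, seed=0):
    """Place SEUs in a bits x cycles space as a Poisson process of rate `ser`.

    Gaps between consecutive upsets are exponentially distributed, so the
    number of upsets is Poisson with mean ser * bits * cycles.
    Returns a list of (cycle, bit) pairs.
    """
    rng = random.Random(seed)
    space = bits * cycles          # total bit-cycles exposed to upsets
    locations = []
    position = rng.expovariate(ser)
    while position < space:
        index = int(position)
        locations.append((index // bits, index % bits))
        position += rng.expovariate(ser)
    return locations

# Illustrative rate, deliberately higher than 10^-9 so that the sample is not tiny
upsets = sample_seu_locations(ser=1e-6, bits=1024, cycles=1_000_000)
print(len(upsets), "upsets; expected about", 1e-6 * 1024 * 1_000_000)
```

The expected number of upsets is the product of SER, register bits and clock cycles, which is the same product that appears per core in the SEU count of Section III.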
III. IMPACT OF TASK MAPPING ON RELIABILITY
Reliability of an MPSoC application in the presence of
SEUs is related to the total number of SEUs experienced [12].
For a soft error rate (SER) of λi (SEUs per bit per clock
cycle), the total number of SEUs experienced by an MPSoC
with C processing cores is given by [12] as
\[ \Gamma = \sum_{i=1}^{C} R_i T_i \lambda_i = T_M \sum_{i=1}^{C} R_i \alpha_i \lambda_i , \tag{3} \]
where Ti is the execution time (in clock cycles), Ri is the
register usage (in bits per clock cycle) of the i-th processing
core and TM is the multiprocessor execution time (in clock
cycles, ∀i: Ti = αi TM). The register usage, Ri, is a measure of the
average number of registers used over the execution time by
a processing core and is defined as [11]
\[ R_i = \frac{1}{T_i} \sum_{t=1}^{T_i} R_{i,t} , \tag{4} \]
where Ri,t is the register usage (in bits) at the t-th clock cycle of
the i-th processing core. The Ri in (4) of a processing core
depends on the nature of processing being carried out by the
tasks mapped, data dependency and resource sharing among
them. The TM in (3) affects multiprocessor performance and
depends on the number of mapped tasks on a processing core
and the data dependency among them. As a result, when
more related tasks are mapped on a processing core, TM
increases for a given operating frequency but the register
resources related to tasks are localized reducing the overall
register usage (R = Σ_i Ri). On the other hand, when tasks
are distributed among processing cores to achieve higher
parallelism, TM decreases at the expense of increased R due
to higher duplication of shared register resources among tasks.
An example of this trade-off follows. In the MPEG-2 decoder
(Fig. 2), the tasks t5 and t6 share nearly 6.4kb registers,
while the tasks t6, t7 and t8 share about 8kb registers among
them. To reduce register usage, it is possible, for example, to
map tasks t5, t6, t7 and t8 on one processing core. However,
due to the computationally intensive nature of these tasks, TM
will be high. To reduce TM, an alternative option is to map
tasks t5 and t6 on one processing core, while tasks t7 and
t8 are mapped on another core. However, this gives a
duplication of about 14.4kb registers (increased R) between
the processing cores. Because of this trade-off between register usage (R) and
multiprocessor execution time (TM), the MPSoC
experiences a varying number of SEUs (Γ) for different task
mappings, as given by (3). Γ also depends on the voltage
scaling of the MPSoC processing cores as it affects λi in (3).
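The trade-off expressed by (3) and (4) can be sketched in a few lines of Python. The two-core register and cycle figures below are hypothetical and chosen only to show how a localized mapping (fewer registers, longer execution) and a distributed mapping (shorter execution, duplicated registers) can yield different SEU counts; they are not measurements from the decoder.

```python
def average_register_usage(bits_per_cycle):
    """Eq. (4): R_i as the mean of R_{i,t} over the T_i cycles of core i."""
    return sum(bits_per_cycle) / len(bits_per_cycle)

def seus_experienced(reg_usage, exec_cycles, ser):
    """Eq. (3): Gamma = sum_i R_i * T_i * lambda_i over all cores."""
    return sum(r * t * lam for r, t, lam in zip(reg_usage, exec_cycles, ser))

SER = 1e-9  # SEUs per bit per cycle, the example rate used in the text

# Hypothetical numbers: all tasks on core 0 (fewer registers, longer execution)
localized = seus_experienced([20_000, 0], [4_000_000, 0], [SER, SER])
# Hypothetical numbers: tasks split over two cores (shared registers duplicated)
distributed = seus_experienced([17_000, 17_400], [2_200_000, 2_200_000], [SER, SER])

print(f"localized: {localized:.2f} SEUs, distributed: {distributed:.2f} SEUs")
```

Which mapping experiences fewer SEUs depends entirely on the relative change in R and TM, which is why neither minimizing R nor minimizing TM alone is sufficient.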
To demonstrate the impact of application task mapping and
voltage scaling on the number of SEUs experienced (Γ), a
total of 120 task mappings were carried out using the MPEG-
2 decoder (Fig. 2) on the MPSoC architecture (Fig. 1). Fig. 3
shows the TM, R and Γ obtained through SystemC simulation
and fault injection (Section II-B) using an SER of 10^{-9} SEUs
per bit per cycle (i.e. 1 SEU per 10ms for 1kb register bank)
as an example. Three key observations are made:
Observation 1: Fig. 3(a) shows the trade-off between multi-
processor execution time (TM, ms) and overall register usage
(R). As can be seen, when tasks are mapped to reduce R by
localization of tasks, TM increases. On the other hand, as tasks
are mapped to reduce TM, register resources shared among
tasks are duplicated, leading to increased register usage, R.
Observation 2: Fig. 3(b) shows the total number of SEUs
experienced (Γ) and multiprocessor execution time, TM (in
ms), when all the decoder processing cores are scaled by 1
(f=200MHz and Vdd=1V). It can be seen that when tasks are
distributed among processing cores to reduce TM, the decoder
experiences a higher Γ, given by (3), due to higher R
(Fig. 3(a)). When tasks are localized to reduce R, the decoder
also experiences a higher number of SEUs due to increased TM
(Fig. 3(a)). This results in a U-shaped curve for Γ given by (3),
with the minimum Γ located around the middle of the TM range.
Observation 3: Fig. 3(c) shows the total number of SEUs
experienced (Γ) and multiprocessor execution time (TM, in
ms) when all the decoder processing cores are scaled by 2
(f=100MHz and Vdd=0.58V). As can be seen, Γ increases by
approximately 2.5 times due to lowered Vdd from 1V to 0.58V
(due to the Vdd and λ relationship in [2]) and TM increases by
a factor of approximately 2 due to reduced f from 200MHz to 100MHz.
The above observations demonstrate the impact of applica-
tion task mapping and voltage scaling on the MPSoC decoder
reliability in the presence of SEUs. Hence, an interesting
design optimization problem is to identify suitable voltage
scaling of the MPSoC processing cores to minimize power
consumption and to improve reliability through application
task mapping, while meeting a real-time constraint.
IV. PROPOSED DESIGN OPTIMIZATION
In this section, we propose a novel design optimization
using joint power minimization and reliability improvement
of an application implemented on an MPSoC architecture.
Fig. 4 shows the flowchart of the proposed design optimization
with three major steps: power minimization, soft error-aware
application task mapping and iterative assessment. For a
given SER and real-time constraint, the design optimization
is initiated by power minimization (step 1) through voltage
scaling of the MPSoC cores. This is followed by soft error-aware
application task mapping (step 2) to minimize the
number of SEUs experienced for the voltage scalings chosen
in step 1. These two steps are repeated and assessed in step
3 until a design with minimized power consumption and
minimized SEUs experienced is found, meeting the real-time
constraint. The design optimization steps are discussed next.
Fig. 3: (a) Trade-off between execution time and register usage, (b) SEUs experienced and MPSoC execution time, and (c) SEUs experienced and execution time (all cores scaled by 2), for task mappings with MPEG decoder with four processing cores (plots omitted; axes: multiprocessor execution time in ms against register usage in kbits/cycle for (a), and against SEUs experienced for (b) and (c))
Fig. 4: Flowchart of the proposed design optimization (figure omitted). Inputs: application task graph, soft error rate and real-time constraint. Step 1: power minimization through voltage scaling. Step 2: soft error-aware task mapping to minimize SEUs experienced (stage 1: initial mapping; stage 2: optimized mapping). Step 3: iterative assessment (power and SEUs minimized, time constraint met), with YES/NO branches leading to the optimized design or to a further iteration.
A. Power Minimization and Iterative Assessment
Power minimization of the proposed design optimization is
performed using the voltage scaling algorithm, Fig. 5(a). The
voltage scaling algorithm, nextScaling, starts with the lowest
voltage scaling on all identical cores and generates the next
set of higher voltage scalings, nextS, based on the previous
set of coefficients, prevS (Fig. 5(a)). In each iteration, nextS
is updated as the prevS reduced by 1 on a processing core until
the voltage scaling reaches the nominal voltage scaling level
(s=1, lines 3-6). When the nominal level (s=1) is reached
by a core, nextS is updated by increasing voltage scaling
of the core by 1 (line 9) and reducing the voltage scaling
of the next processing core by 1 in steps (lines 7-11). The
aim is to generate non-repetitive combinations and reduce
the number of voltage scalings that need to be investigated.
For example, for an architecture with four processing cores
(Fig. 1) and three scaling options (Table I), the voltage
scaling algorithm, Fig. 5(a), generates 15 unique combinations
starting with a scaling coefficient of 3 for all cores, Fig. 5(b),
compared to a total of 3^4 = 81 possible combinations.
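The enumeration in Fig. 5(b) can be reproduced with the short Python sketch below. It is one possible reading of the scaling algorithm, inferred from the order of the rows in Fig. 5(b) rather than transcribed from the pseudocode: scan from the last core backwards, lower the first coefficient that is still above 1, and reset every later core to the same value. The function names are illustrative.

```python
from itertools import combinations_with_replacement

def next_scaling(prev):
    """Return the combination that follows `prev`, or None after the last one.

    Scans from the last core backwards for a core not yet at the nominal
    level (s = 1), lowers its coefficient by 1 and resets every later core
    to that same value, so coefficients never increase from core 1 to core C.
    """
    nxt = list(prev)
    for i in range(len(nxt) - 1, -1, -1):
        if nxt[i] > 1:
            nxt[i] -= 1
            for k in range(i + 1, len(nxt)):
                nxt[k] = nxt[i]
            return nxt
    return None

def all_scalings(num_cores, lowest):
    """Enumerate combinations, starting with coefficient `lowest` on every core."""
    current = [lowest] * num_cores
    result = []
    while current is not None:
        result.append(tuple(current))
        current = next_scaling(current)
    return result

combos = all_scalings(num_cores=4, lowest=3)
# Cross-check: these are the multisets of size 4 drawn from the levels {3, 2, 1}
assert combos == list(combinations_with_replacement((3, 2, 1), 4))
print(len(combos), "of", 3 ** 4)
```

Because the cores are identical, permutations of the same coefficients are equivalent, which is why only the non-increasing sequences need to be assessed.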
For each voltage scaling produced by the scaling
algorithm, step 2 (Fig. 4) is carried out to minimize the number
of SEUs experienced through application task mapping.
The resulting power consumption and SEUs experienced are
assessed in step 3. Using (2), the dynamic power consumption,
P, of the MPSoC with C processing cores can be expressed
//C = no of cores, prevS = previous scaling
[nextS] = nextScaling(prevS): begin
 1:  copy prevS into nextS
 2:  for i := 1 to C
 3:    if prevS[i] > 1: begin //1 is lowest scale
 4:      nextS[i] := prevS[i]-1;
 5:      break;
 6:    end if
 7:    else
 8:      for k := i to C
 9:        nextS[k] = prevS[k]+1;
 10:     end for
 11:   end else
 12: end for
 13: return nextS;
end
(a) (b)
Scaling Coefficients
Iter. # | Core 1, s1 | Core 2, s2 | Core 3, s3 | Core 4, s4
1 | 3 | 3 | 3 | 3
2 | 3 | 3 | 3 | 2
3 | 3 | 3 | 3 | 1
4 | 3 | 3 | 2 | 2
5 | 3 | 3 | 2 | 1
6 | 3 | 3 | 1 | 1
7 | 3 | 2 | 2 | 2
8 | 3 | 2 | 2 | 1
9 | 3 | 2 | 1 | 1
10 | 3 | 1 | 1 | 1
11 | 2 | 2 | 2 | 2
12 | 2 | 2 | 2 | 1
13 | 2 | 2 | 1 | 1
14 | 2 | 1 | 1 | 1
15 | 1 | 1 | 1 | 1
Fig. 5: (a) Voltage scaling algorithm, (b) Voltage scaling coefficients for four processing cores
as a function of the voltage scalings identified in step 1, si, as
\[ P = C_L \sum_{i=1}^{C} \alpha_i f_i(s_i) V_{dd_i}^{2}(s_i), \tag{5} \]
where fi(si) and Vddi(si) take values (Table I) depending
on the chosen scaling coefficient si. The iterative assessment
(step 3) is carried out until a better design is found with
minimized power consumption (by (5)) and minimized SEUs
experienced (by (3)) within a chosen search-time.
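As an illustration of (5), the following Python sketch evaluates the dynamic power of a set of scaling coefficients using the operating points of Table I. The load capacitance and the per-core factors αi are not given numerically in the text, so both default to 1 here (an assumption); the result is therefore in arbitrary units and is meaningful only as a ratio between two scaling choices.

```python
# Operating points from Table I: scaling coefficient -> (f in MHz, Vdd in V)
TABLE_I = {1: (200.0, 1.0), 2: (100.0, 0.58), 3: (66.7, 0.44)}

def dynamic_power(scalings, activity=None, c_load=1.0):
    """Eq. (5): P = C_L * sum_i alpha_i * f_i(s_i) * Vdd_i(s_i)^2.

    `activity` (alpha_i) and `c_load` (C_L) are placeholders; with the
    defaults the result is in arbitrary units, useful only for ratios.
    """
    if activity is None:
        activity = [1.0] * len(scalings)
    total = 0.0
    for s, alpha in zip(scalings, activity):
        f, vdd = TABLE_I[s]
        total += alpha * f * vdd ** 2
    return c_load * total

nominal = dynamic_power([1, 1, 1, 1])   # all cores at 200 MHz, 1 V
scaled = dynamic_power([2, 2, 3, 2])    # one of the combinations of Fig. 5(b), reordered
print(f"relative power: {scaled / nominal:.3f}")
```

The quadratic dependence on Vdd is what makes the lower scaling levels attractive for power, and it is the same voltage reduction that raises the soft error rate in (3).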
// C: no of cores,  G: application task graph with  N tasks, t is the current task
//M: mapping of all cores,  A: MPSoC arch.,  Q: task queue,  L: temporary list
[M] = InitialSEAMapping (G, C, A):  begin
 1:   push G[0] into Q   // push the first task into Q
 2:   for i:= 1 to C-1  and  Q is not empty
 3:    t := Q.front(); M[i].map(t); delete all mapped tasks from Q
 4:     while  Ti < TMref and  no. of unmapped tasks in G > (C-i)
 5:      L : = dependents of t //(sorted by minimum SEUs)
 6:       if L is empty  and Q is not empty
 7:        swap last two elements in Q
 8:       else if  Q is not empty
 9:         t = first element in L //task with minimum SEUs and Time
 10:       M[i].map(t); delete t from L; move tasks in L into Q and empty L
 11:     else  break  while; end if
 12:     t = Q.front();
 13:   end while
 14: end for
15: return  M;
end
Fig. 6: Initial soft error-aware mapping algorithm
B. Soft Error-Aware Application Task Mapping
The problem of application task mapping on MPSoC
cores to minimize the SEUs experienced (Γ) is an NP-complete
problem [8]. We propose a soft error-aware application task
mapping in two stages (step 2, Fig. 4): stage 1 is the
initial soft error-aware application task mapping, followed by
stage 2, a search-based optimized application task mapping.
Fig. 6 shows the initial soft error-aware application task mapping
algorithm (stage 1), InitialSEAMapping, which aims to
simplify the optimization process by reducing the number of
task movements. InitialSEAMapping starts by mapping
the task with no predecessor in the task graph (G) (line 1). The
dependents of the currently mapped task in G are then sorted
by the SEUs experienced if they were to be mapped with the
current task and stored in a dependency list, L. The task
with the minimum SEUs in L is then mapped next (lines 5-10).
This is continued as long as the execution time of the current
core does not exceed the real-time constraint (TMref) and the
number of unmapped tasks left in task graph G is higher than
the number of remaining cores, to ensure that tasks are mapped
on all cores (lines 4-13). The unmapped tasks are stored in a
queue, Q (line 10), and are then mapped gradually to the
other cores using the same criteria. After all tasks are mapped,
the initial mapping (M) is returned (line 15).
After the initial soft error-aware task mapping
(InitialSEAMapping, Fig. 6), the design optimization is
continued further through optimized mapping (stage 2, step 2,
Fig. 4). We use a search-based mapping optimization,
OptimizedMapping, Fig. 7, employing list scheduling for
scheduling tasks [8]. OptimizedMapping starts with scheduling
Fig. 7: Flowchart of optimized mapping, OptimizedMapping (figure omitted). Blocks: (A) list schedule M; (B) decision: time not over, or TM(M) > TMref, or unSchedulable(M); (C) task movement in M for a neighbouring solution; (D) list scheduling of M; (E) decision: SEUs(M) < SEUs(Mbest) and TM(M) <= TMref; (F) Mbest := M; (G) Mbest is the optimized mapping.
the initial task mapping, M (step A, Fig. 7). The mapping
M is then checked to see if it violates the schedulability
requirements or real-time constraints (step B). If any such
violation is found, the optimization proceeds by generating
neighboring task movements to find a possible
next mapping solution (step C). This mapping (M) is then
list scheduled, if schedulable (step D). This is followed by
comparison with the previous best solution, Mbest. If
M is better than Mbest in terms of a lower number of SEUs
experienced and meets the given real-time constraint, it
becomes the new Mbest (steps E-F). The optimization steps
C-F are repeated until the specified search time is over
(step B). Once the search time is over, Mbest is returned as
the optimized design for the chosen voltage scalings.
In OptimizedMapping, the multiprocessor execution time
(TM, in seconds) for an application task mapping is found
by dividing the total number of execution cycles of all
mapped tasks by the effective number of cycles executed by
the processing cores per second for the chosen voltage scaling
(steps B and E, Fig. 7), i.e.
\[ T_M = \left[ \sum_{i=1}^{C} \sum_{j=1}^{N} \left( t^{i}_{j} + \sum_{k=1}^{N} d^{i}_{j,k} \right) \right] \Big/ \left( \sum_{i=1}^{C} \alpha_i f_i(s_i) \right), \tag{6} \]
where t^i_j is the execution time (in clock cycles) of the j-th task
mapped on the i-th processing core, and d^i_{j,k} is the dependency time
(in clock cycles) between the j-th and k-th tasks (j,k = 1:N)
due to selection of the j-th task on the i-th processing core. The
total number of SEUs experienced (Γ) for an application task
mapping and chosen voltage scaling on MPSoC processing
cores is found in InitialSEAMapping (line 5, Fig. 6) and in
OptimizedMapping (step E, Fig. 7) through (3). The per core
execution time (Ti) and register usage (Ri) in (3) are given
in terms of mapped tasks as
\[ \forall i: \; T_i = \sum_{j=1}^{N} \left( t^{i}_{j} + \sum_{k=1}^{N} d^{i}_{j,k} \right), \text{ and} \tag{7} \]
\[ \forall i: \; R_i = \left| \bigcup_{j=1}^{N} \bigcup_{k=1}^{N} R^{i}_{j,k} \right|, \tag{8} \]
where R^i_{j,k} is the set of registers shared between the j-th and
k-th tasks when mapped on the i-th processing core (j=k
defines the local register usage of the j-th task). As can be seen
in (8), Ri is given by the cardinality of the register set arising
out of the union of all R^i_{j,k} in the i-th processing core. The proposed
optimization is carried out by iterative search through N tasks,
with each iteration generating a maximum of two task movements
out of a maximum search space of (N-1) dependent tasks. This
is followed by a second stage search through a maximum of (N-1)
tasks for the minimum number of SEUs experienced. As a result,
OptimizedMapping has a worst-case complexity of
O(2N(N-1)(N-1)) ≈ O(N^3).
Fig. 8: Example illustration of the soft error-aware application task mapping (figure omitted). Panel (a): task graph with computational costs t1(5), t2(4), t3(4), t4(5), t5(6), t6(4). Panels (d)-(i): three cores with s1 = 1, s2 = 2, s3 = 2 and deadline (TMref) = 75 ms; the mappings shown include {t1, t3, t5}, {t2, t4}, {t6} and {t1, t3, t6}, {t2, t4}, {t5}. Panels (b) and (c) are reproduced below.
Reg. Set | Size, bits
r1 | 4096
r2 | 2048
r3 | 2048
r4 | 5120
r5 | 4096
r6 | 2048
r7 | 2048
r8 | 4096
r9 | 2048
Task | Register Usage
t1 | R1=[r1, r2, r3]
t2 | R2=[r2, r4, r5, r6]
t3 | R3=[r4, r5, r6]
t4 | R4=[r5, r6, r7]
t5 | R5=[r6, r7, r8]
t6 | R6=[r7, r8, r9]
An example illustrating the proposed soft error-aware application
task mapping algorithm is shown in Fig. 8. In
Fig. 8(a), an application task graph with six tasks is shown
(all costs are multiples of 60×10^4 cycles) and in Fig. 8(b)-(c)
the application registers and their distribution among the
tasks are shown. Fig. 8(d)-(g) show the incremental task
mapping using the InitialSEAMapping algorithm, Fig. 6, and
finally, Fig. 8(h)-(i) show scheduling and task movements using
OptimizedMapping, Fig. 7. The chosen voltage scalings for the
processing cores are s1=1, s2=2 and s3=2, and the deadline is
assumed to be TMref=75ms. As can be seen, after the first
task, t1, in the application task graph, Fig. 8(a), is mapped to
processing core 1, the InitialSEAMapping algorithm
selects t3, followed by t6 from dependency list, L. This is
because task t3 gives the least number of SEUs experienced
compared to t2 and t5 shown in gray, Fig. 8(d), with the Rj,k
values from Fig. 8(c). Note that after allocating t1, t3 and t5 on
core 1, the deadline constraint cannot be satisfied with further
allocation of tasks and the mapping algorithm carries on with
the mapping of core 2, selecting tasks t2 and t4, which give the
minimum SEUs experienced, Fig. 8(f). Finally, the unmapped
task t6 in the queue (Q) is mapped to core 3, Fig. 8(g). After
InitialSEAMapping (Fig. 6) is completed, OptimizedMapping
list schedules the tasks, Fig. 8(h), found through step A, Fig. 7.
However, with the chosen voltage scalings for the architecture
processing cores, this mapping cannot satisfy the real-time
constraint of 75ms. OptimizedMapping swaps t5 with t6
in the fourth iteration (step C, Fig. 7) and gives the minimum
number of SEUs experienced for the chosen voltage scaling,
while meeting TMref=75ms.
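The register usage term of (8) can be evaluated for the mappings of this example with the Python sketch below, using the register set sizes and per-task register sets listed in Fig. 8(b)-(c). Two assumptions are made: the cardinality in (8) is counted in bits (the sum of the sizes of the distinct register sets used on a core), and the mappings before and after the swap are {t1, t3, t5}, {t2, t4}, {t6} and {t1, t3, t6}, {t2, t4}, {t5}, as described in the text. The function names are illustrative.

```python
# Register set sizes in bits, from Fig. 8(b)
REG_BITS = {"r1": 4096, "r2": 2048, "r3": 2048, "r4": 5120, "r5": 4096,
            "r6": 2048, "r7": 2048, "r8": 4096, "r9": 2048}

# Register sets used by each task, from Fig. 8(c)
TASK_REGS = {
    "t1": {"r1", "r2", "r3"},
    "t2": {"r2", "r4", "r5", "r6"},
    "t3": {"r4", "r5", "r6"},
    "t4": {"r5", "r6", "r7"},
    "t5": {"r6", "r7", "r8"},
    "t6": {"r7", "r8", "r9"},
}

def core_register_usage(tasks):
    """Eq. (8): size in bits of the union of the register sets of `tasks`."""
    used = set().union(*(TASK_REGS[t] for t in tasks))
    return sum(REG_BITS[r] for r in used)

def mapping_register_usage(mapping):
    """Per-core register usage for a mapping given as a list of task lists."""
    return [core_register_usage(core) for core in mapping]

initial = [["t1", "t3", "t5"], ["t2", "t4"], ["t6"]]   # after InitialSEAMapping
swapped = [["t1", "t3", "t6"], ["t2", "t4"], ["t5"]]   # after swapping t5 and t6
print(mapping_register_usage(initial))
print(mapping_register_usage(swapped))
```

Registers shared by tasks on the same core are counted once, while registers needed on two cores are counted on both, which is the duplication effect discussed in Section III.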
V. EXPERIMENTAL RESULTS
We evaluate the effectiveness of the proposed soft error-
aware design optimization using four design optimization
experiments, Table II. The experiments are carried out using
MPEG decoder implemented with the architecture, Fig. 1.
The first three experiments, Exp:1, Exp:2 and Exp:3, are
soft error-unaware optimization with different design objec-
tives using application task mapping obtained through sim-
ulated annealing [13]. Exp:4 is the proposed design opti-
mization. In all experiments, power minimization is obtained
through iterative voltage scaling (step 1, Fig. 4) to meet
the real-time constraint of decoding a tennis video bitstream
(ftp://ftp.tek.com/tv/test/streams/Element/) of 437 frames at
29.97 frames per second (fps). The mapped tasks, the voltage
scaling and the resulting power consumption (P, mW) are
shown in cols. 2-4, while the register usage (R, kbits/cyc),
the multiprocessor execution time (TM, clock cycles) and
the number of SEUs experienced (Γ) are given in cols. 5-7
(Table II). We impose a time limit of 40 minutes to search
the design space for each voltage scaling of the MPSoC
processing cores. All experiments are carried out on an
Intel(R) Core(TM)2 2GHz CPU running RHEL5. The number
of SEUs experienced (col. 7, Table II) is found by the fault
injection technique (Section II-B) assuming an SER of 10^{-9}
SEUs/bit/cycle (i.e. 1 SEU per 10ms for 1kb register bank).
Exp:1 demonstrates the impact of design optimization
with minimized register usage, R. As expected, the design
produced gives the least register usage (R) when compared
to the other three experiments. The reduced R in Exp:1 is
obtained at the expense of highest multiprocessor execution
time, TM as described in Section III, which makes it harder
to scale down the voltages of the decoder cores. As a result,
Exp:1 gives a design that has higher power consumption
than the optimized design produced in Exp:4. However, the
design produced in Exp:1 experiences fewer SEUs than that
in Exp:4. This is because the proposed design optimization
in Exp:4 gives lower voltages of the decoder cores, and hence
lower power consumption compared to the design produced
in Exp:1. The design produced in Exp:2 is optimized for high
parallelism. This gives reduced multiprocessor execution time
(TM), which allows the voltages of the decoder processing
cores to be scaled down. As a result, Exp:2 gives lower
Exp. | Core | Mapped Tasks | scal. si | P, mW | R, kb/c. | TM (×10^9) | Γ (×10^5)
Exp:1 (Reg. Usage [13]) | Core 1 | t1, t2, t3 | 2 | 9.53 | 80 | 1.89 | 3.46
 | Core 2 | t4, t5 | 2 | | | |
 | Core 3 | t6, t7, t8, t9, t10 | 1 | | | |
 | Core 4 | t11 | 2 | | | |
Exp:2 (Parallelism [13]) | Core 1 | t1, t2, t3, t4, t9 | 3 | 4.04 | 118 | 1.18 | 5.22
 | Core 2 | t5, t6, t7 | 2 | | | |
 | Core 3 | t8 | 3 | | | |
 | Core 4 | t10, t11 | 2 | | | |
Exp:3 (Reg. Usage & Paral. [13]) | Core 1 | t1, t2, t3, t4, t5 | 2 | 4.15 | 92 | 1.26 | 4.18
 | Core 2 | t6, t7 | 2 | | | |
 | Core 3 | t8, t9 | 3 | | | |
 | Core 4 | t10, t11 | 2 | | | |
Exp:4 (Proposed) | Core 1 | t1, t2, t3, t4, t5, t6 | 2 | 4.25 | 89 | 1.32 | 3.93
 | Core 2 | t7, t8 | 2 | | | |
 | Core 3 | t9 | 3 | | | |
 | Core 4 | t10, t11 | 2 | | | |
TABLE II: Comparison of soft error-unaware and the proposed soft error-aware optimizations using MPEG decoder MPSoC with four cores
Fig. 9: Comparison of SEUs experienced (Γ) and power consumption (P) between Exp:1, Exp:2, Exp:3 and Exp:4 (plot omitted; series: comparative SEUs, % and comparative power consumption, % for Exp:1, Exp:2 and Exp:3)
power consumption than Exp:4. Note that this reduction in
multiprocessor execution time (TM) in Exp:2 is achieved at
the expense of the highest register usage (R). Due to lower
voltage scaling of the decoder cores and higher register usage,
the design optimized for high parallelism, Exp:2, experiences
the highest number of SEUs when compared to the other three
experiments. In Exp:3, the design has been optimized for both
register usage and high parallelism. Such optimization gives
a good trade-off between multiprocessor execution time and
register usage, and minimizes the product TM×R. However,
this does not necessarily minimize the number of SEUs
experienced given by (3). The design produced in Exp:4
employs soft error-aware task mapping (Fig. 6) and therefore
experiences fewer SEUs than the design
produced in Exp:3. Note that, although the voltage scalings
of the decoder cores are similar, the proposed design optimization
(Exp:4) gives slightly higher power consumption compared to
the design produced in Exp:3 due to more tasks being mapped
on core 1 and core 2 of the decoder. Fig. 9 shows a comparison
of the power consumption (P) and SEUs experienced (Γ) by the
decoder designs in Exp:1, Exp:2 and Exp:3, relative to those
of Exp:4. All experiments are carried out with the same voltage
scaling coefficients (s1=2, s2=2, s3=3 and s4=2) for an SER
of 10^{-9}. As can be seen, the design produced in Exp:4 reduces
the number of SEUs experienced by up to 38% compared to
the optimized design in Exp:2, while consuming 9% lower
power. When compared with the design produced in Exp:1,
the optimized design in Exp:4 reduces SEUs experienced by
28%, while consuming only 7% higher power.
To demonstrate the impact of architecture allocation on
the proposed design optimization, Table III shows the power
consumption (P) and the number of SEUs experienced (Γ)
using the design produced in Exp:4. A number of applications,
including MPEG decoder and random task graphs of 20 to
100 tasks were used. The cost and the number of dependents
in the random task graphs were generated using uniform
App. | 2 Cores: P, mW | 2 Cores: Γ ×10^5 | 3 Cores: P, mW | 3 Cores: Γ ×10^5 | 4 Cores: P, mW | 4 Cores: Γ ×10^5 | 5 Cores: P, mW | 5 Cores: Γ ×10^5 | 6 Cores: P, mW | 6 Cores: Γ ×10^5
MPEG-2 | 9.1 | 2.13 | 5.9 | 3.17 | 4.25 | 3.93 | 6.34 | 4.95 | 7.24 | 5.36
20 tasks | 10.1 | 0.47 | 4.15 | 1.13 | 4.34 | 2.27 | 5.16 | 2.73 | 6.36 | 3.49
40 tasks | 6.2 | 1.07 | 5.1 | 1.78 | 5.2 | 2.87 | 6.16 | 3.46 | 7.11 | 4.35
60 tasks | 7.8 | 1.87 | 4.13 | 3.25 | 5.1 | 4.82 | 4.9 | 5.74 | 5.3 | 7.15
80 tasks | 11.2 | 1.95 | 6.1 | 3.76 | 4.4 | 6.13 | 6.14 | 7.24 | 6.69 | 9.13
100 tasks | 10.4 | 2.40 | 5.48 | 4.58 | 4.8 | 6.25 | 5.94 | 8.83 | 6.34 | 11.13
TABLE III: Power consumption and SEUs experienced for different applications and different number of architecture processing cores
Fig. 10: P (mW) and Γ comparison between Exp:3 and Exp:4 (plot omitted; x-axis: 2 to 6 cores; y-axes: power, mW and SEUs experienced; series: Exp:4 (power), Exp:3 (power), Exp:4 (SEUs), Exp:3 (SEUs))
Fig. 11: P (mW) and Γ for different scaling levels (plot omitted; x-axis: voltage scaling levels 2, 3 and 4; y-axes: power (mW) and SEUs experienced)
probability distribution with computation cost between 1 and
30, communication cost between 1 to 10 (all costs as mul-
tiples of 3.5×106 clock cycles), task register usage between
1kbits to 5kbits and the number of dependents was found
by exponential distribution between 0 to N/2,w h e r eN is
the number of tasks. The deadline for random task graphs
were set to 1000×N/2ms. For these task graphs, the design
optimization is carried out with imposed time limits of 50,
65, 80, 100, 130 minutes for 20, 40, 60, 80 and 100 tasks,
respectively. The P (mW) and Γ for architecture allocations
from two to six cores are shown in cols. 2-6 (Table III). Two
observations are made. Firstly, the architecture allocation with
minimum power consumption (P) depends on the application
and real-time constraint. In the case of the MPEG decoder,
the least power consumption is found with four cores for the
given real-time constraint of decoding tennis video bitstream
at 29 fps. Secondly, as the number of architecture cores
increases, the number of SEUs experienced also increases. This is
because with more cores the multiprocessor execution time
(TM) falls due to higher parallelism, which allows greater power
reduction through lower supply voltages, and lower supply
voltages increase susceptibility to soft errors. In addition, the
reduced TM increases register usage (R) (Section III). As a result,
the decoder with six processing cores experiences the highest
number of SEUs and the decoder with two processing cores the
lowest (row 2, Table III). Similar trends in
power consumption and the number of SEUs experienced are
also observed with random task graphs. Fig. 10 shows the
power consumption (in mW) and the SEUs experienced by
the optimized designs produced in Exp:4 and Exp:3 using
the random task graph of 60 tasks. As can be seen, the
design produced by the proposed optimization, Exp:4, consistently
experiences fewer SEUs than the
design produced using joint optimization of reduced R and
high parallelism, Exp:3, with up to 7% reduction in SEUs
experienced for an SER of 10^−9. This reliability improvement
is achieved with only 3% higher power consumption using an
MPSoC with six cores.
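The random task graph parameters given above can be sketched as a small generator. This is a minimal illustration rather than the generator used in the experiments. It assumes uniform draws for the costs and register usage, since this passage does not restate the distribution type, and an exponential draw with mean N/8 clipped to [0, N/2] for the number of dependents. The function name `random_task_graph` and the dictionary layout are hypothetical.

```python
import random

COST_UNIT_CYCLES = 3.5e6  # all costs are multiples of 3.5x10^6 clock cycles


def random_task_graph(n_tasks, seed=0):
    """Build a random DAG using the parameter ranges quoted in the text.

    Assumptions (not from the paper): uniform cost draws, an exponential
    mean of N/8, and edges that only point to later tasks so the graph
    stays acyclic.
    """
    rng = random.Random(seed)
    max_deps = n_tasks // 2
    tasks = []
    for t in range(n_tasks):
        later = list(range(t + 1, n_tasks))
        # Exponential draw for the number of dependents, clipped to [0, N/2]
        # and to the number of tasks that can still depend on this one.
        k = min(int(rng.expovariate(8.0 / n_tasks)), max_deps, len(later))
        dependents = sorted(rng.sample(later, k))
        tasks.append({
            # Computation cost: 1 to 30 cost units.
            "compute_cycles": rng.randint(1, 30) * COST_UNIT_CYCLES,
            # Register usage: 1 to 5 kbits.
            "register_kbits": rng.uniform(1.0, 5.0),
            # Communication cost per outgoing edge: 1 to 10 cost units.
            "edges": {d: rng.randint(1, 10) * COST_UNIT_CYCLES for d in dependents},
        })
    deadline_ms = 1000 * n_tasks / 2  # deadline of 1000 x N/2 ms
    return tasks, deadline_ms


tasks, deadline_ms = random_task_graph(60)
print(len(tasks), "tasks, deadline", deadline_ms, "ms")
```

With N = 60 the deadline evaluates to 30000 ms, matching the 1000×N/2 ms rule stated in the text.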
To show the impact of the choice of voltage scaling levels,
Fig. 11 shows the power consumption (mW) and the number
of SEUs experienced by the optimized designs produced in
Exp:4 with different voltage scaling levels. The design optimizations
are carried out using an MPSoC with six processing
cores and the random task graph of 60 tasks, employing the
following voltage scaling levels: 2 levels (1 V at 200 MHz
and 0.58 V at 100 MHz), 3 levels (Table I) and 4 levels (adding
1.2 V at 236 MHz to the levels in Table I). As can be seen, with
4 scaling levels the proposed design optimization (Exp:4)
reduces power by a further 4% with only a 3% increase
in the number of SEUs experienced compared to 3 scaling
levels (Fig. 11). This is because with more scaling options,
the power minimization (step 1, Fig. 4) has greater flexibility,
with more combinations of voltage scaling generated by the
voltage scaling algorithm (Fig. 5(a)). With 2 scaling levels, the
number of SEUs experienced is reduced by 42%
at the cost of 28% higher power consumption compared to 3
scaling levels, due to the limited voltage scaling options (Fig. 11).
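As a rough illustration of why the available operating points matter for power, the sketch below compares the voltage and frequency pairs quoted above using the standard CMOS dynamic power relation, P proportional to V^2 f. This relation is an assumption made here for illustration. It is not the paper's power model, and it ignores leakage and any difference in switched capacitance. The names `LEVELS` and `relative_dynamic_power` are hypothetical.

```python
# Operating points quoted in the text: (supply voltage in V, frequency in MHz).
LEVELS = {
    "1.2V-236MHz": (1.2, 236.0),
    "1V-200MHz": (1.0, 200.0),
    "0.58V-100MHz": (0.58, 100.0),
}


def relative_dynamic_power(v, f, v_ref=1.0, f_ref=200.0):
    """Dynamic power relative to the reference point, assuming P ~ V^2 * f."""
    return (v / v_ref) ** 2 * (f / f_ref)


rel = {name: relative_dynamic_power(v, f) for name, (v, f) in LEVELS.items()}

for name, value in rel.items():
    print(f"{name}: {value:.3f} x reference dynamic power")
```

Under this simplified relation, a core running at 0.58 V and 100 MHz draws roughly 0.17 of the dynamic power it draws at 1 V and 200 MHz.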
VI. CONCLUSIONS
We investigated the impact of application task mapping
on the reliability of multiprocessor systems-on-chip (MPSoCs)
(Section III). We proposed a novel soft error-aware design
optimization for low power time-constrained MPSoCs (Section IV).
Using an MPEG decoder and random task graphs,
we showed that the proposed optimization can significantly reduce
the number of SEUs experienced compared to soft error-unaware
optimization, while power consumption is minimized
and the real-time constraint is met (Section V). Furthermore,
we investigated the impact of architecture allocation on the
power consumption and the number of SEUs experienced
using the proposed optimization technique (Section V).
ACKNOWLEDGEMENT
The authors would like to thank the EPSRC-UK for funding
this work in part under grant number EP/E035965/1.
REFERENCES
[1] B. M. Al-Hashimi, Ed., System-on-Chip: Next Generation Electronics.
IEE Press, 2006, May, Ch:17.
[2] V. Chandra and R. Aitken, “Impact of technology and voltage scaling on
the soft error susceptibility in nanoscale CMOS,” in Proc. of DFT-VLSI,
2008, pp. 114–122.
[3] F. Dabiri et al, “Reliability-aware optimization for DVS-enabled real-
time embedded systems,” in Proc. of ISQED’08, 2008, pp. 780–783.
[4] A. Maheshwari et al, “Trading off transient fault tolerance and power
consumption in deep submicron (DSM) VLSI circuits,” IEEE TVLSI,
12 (3), pp. 299–311, March, 2004.
[5] A. Ejlali et al, “Combined time and information redundancy for SEU-
tolerance in energy efficient real-time systems,” IEEE TVLSI, 14 (4),
pp. 323–335, April, 2006.
[6] G. Chen et al, “Energy-aware computation duplication for improving
reliability in embedded chip microprocessors,” in Proc. of ASPDAC,
Japan, 2006, pp. 134–139.
[7] Y. Zhang and K. Chakrabarty, “Energy-aware adaptive checkpointing in
embedded real-time systems,” in Proc. of DATE’03, 2003, pp. 10918.
[8] V. Izosimov et al, “Design optimization of time- and cost-constrained
fault-tolerant distributed embedded systems,” in Proc. of DATE’05,
2005, pp. 864–869.
[9] P. Pop et al, “Scheduling and voltage scaling for energy/reliability trade-offs
in fault-tolerant time-triggered embedded systems,” in Proc. of
CODES+ISSS’07, USA, 2007, pp. 233–238.
[10] J. Pouwelse et al, “Dynamic voltage scaling on a low-power micropro-
cessor,” in Proc. of 7th MobiCom, July, 2001, pp. 251–259.
[11] R. A. Shafik et al, “SystemC-based minimum intrusive fault injection
technique with improved fault representation,” in Proc. of IOLTS,
Greece, July, 2008, pp. 99–104.
[12] R. A. Shafik et al, “Soft error-aware voltage scaling technique for power
minimization in application-specific MPSoC,” ASP JOLPE, vol. 5, no. 2,
pp. 145–156, August, 2009.
[13] H. Orsilla et al, “Automated memory-aware application distribution for
multi-processor system-on-chips,” Journal of Systems Architecture: the
EUROMICRO Journal, 53 (11), pp. 795–815, 2007.