Parallel embedded systems: where real-time and low-power meet by Karakehayov, Zdravko & Guo, Yu
Syddansk Universitet
Parallel embedded systems: where real-time and low-power meet
Karakehayov, Zdravko; Guo, Yu
Published in:
Proc. of the ISCA's 21st International Conference on Parallel and Distributed Computing and Communications
Systems
Publication date:
2008
Document Version
Publisher's PDF, also known as Version of record
Link to publication
Citation for published version (APA):
Karakehayov, Z., & Guo, Y. (2008). Parallel embedded systems: where real-time and low-power meet. In Proc.
of the ISCA's 21st International Conference on Parallel and Distributed Computing and Communications
Systems. ISCA.
PARALLEL EMBEDDED SYSTEMS: WHERE REAL-TIME AND LOW-POWER MEET 
 
Zdravko Karakehayov 
Department of Computer Systems 
Technical University of Sofia 
Sofia, Bulgaria 
E-mail: zgk@computer.org 
Yu Guo 
University of Southern Denmark 
Software Engineering Group 
Sønderborg, Denmark 
E-mail: guo@mci.sdu.dk 
 
 
 
 
Abstract 
This paper introduces a combination of models 
and proofs for optimal power management via Dynamic 
Frequency Scaling and Dynamic Voltage Scaling. The 
approach is suitable for systems on a chip or 
microcontrollers where processors run in parallel with 
embedded peripherals. We have developed a software 
tool, called CASTLE, to provide computer assistance in 
the design process of energy-aware embedded systems. 
The tool considers single processor and parallel 
architectures. An example shows an energy reduction of 
23% when the tool allocates two microcontrollers for 
parallel execution. 
1     INTRODUCTION 
An embedded system is a computer within a host 
device, where the host device itself is not generally 
considered to be a computer. The computers within cars, 
mobile phones and digital cameras are typical embedded 
systems. Real-time behavior is the defining characteristic 
of embedded computers [1]. The timing constraints arise 
through the two ways that computational processes interact 
with the physical world: reaction to a physical 
environment and execution on a physical platform [2]. 
Using different CPUs to execute real-time tasks in parallel 
is a remarkably simple way to solve some performance 
problems. Parallel embedded systems provide an extra 
benefit: several smaller CPUs or microcontrollers can be 
cheaper than a complex and high clock rate CPU. 
Embedded computers are widely used, with billions sold 
every year, and the manufacturer’s primary concern is to 
reduce price. Cost effective designs usually allocate 
microcontrollers. Microcontrollers incorporate a CPU, 
memory and peripherals on a single chip. Along with the 
low price, the built-in mechanism for hardware-software 
interaction makes them ideal components for embedded 
systems. Also, power consumption is an important design 
metric for battery-powered embedded computers. Again, 
mapping the system’s functionality into a set of parallel 
running processors may be more power efficient than 
using a single CPU at a higher clock rate. This approach 
may help to improve the timing and the energy budget 
simultaneously. 
Ganging together several CPUs or 
microcontrollers makes the system fault tolerant. Most 
error treatment strategies are based on task replication [3]. 
Parallel embedded systems are capable of executing 
active or semi-active replication. Semi-active replication 
is similar to active replication except that decisions 
common to all replicas are taken by one processor, while 
in active replication they are taken by a consensus 
protocol. 
Sophisticated embedded systems deal with a 
large number of I/O variables. Given the timing, power 
and price constraints, a parallel system built from several 
microcontrollers would easily meet the I/O requirements 
as well [4]. In many cases the design starts from an 
already existing system running a certain application and 
the design effort is to implement new functionality on 
top of this system [5]. A parallel hardware platform 
provides better opportunities for incremental design due 
to reasonable processing and I/O capacity. 
The processor’s speed is an attribute designers 
can use to balance conflicting goals such as real-time 
behavior and low power. The number of CPUs running in 
parallel influences the interval of clock rates 
required to meet deadlines, save energy and allow 
incremental design. Once the architecture has been 
accepted, the power management is implemented via 
Dynamic Frequency Scaling (DFS) or Dynamic Voltage 
Scaling (DVS). 
2     RELATED WORK 
Prior research addresses different aspects of power 
management. Linear energy models are used in 
[6]. A concept called critical power slope is introduced to 
explain why it is energy efficient for a benchmark program 
to run only at the highest frequency on one hardware 
platform and at the lowest frequency on another hardware 
platform. Energy minimization of distributed embedded 
systems through DFS and DVS is addressed in [7]. A 
multitask real-time computational model is at the center 
of the discussion. Speed modulation in energy-aware 
real-time systems is investigated in [8]. An I/O-oriented 
computational model and a set of discrete clock rates 
underlie this research. A dynamic computational model is 
introduced in [9, 10]. Compiler-assisted techniques are 
combined with a power-aware operating system to reduce 
energy. Power management points specify when to 
change the CPU speed. In related research we applied a 
single task computational model for optimal power 
management via DFS [11, 12]. 
In this paper we first discuss DFS for the 
multitask computational model and then show how to 
map the system’s functionality onto a parallel architecture 
in an energy-efficient manner. 
3     COMPUTATIONAL MODEL 
Assume that the system's functionality is partitioned 
into tasks. A set of tasks,  
$T = \{T_1, T_2, \ldots, T_{n(T)}\}$          (1) 
is mapped to a particular processor. Each task $T_i$ is 
characterized by its workload, $N_i$, measured in number of 
clock cycles. All tasks have a common deadline, 
$T_{DL}$. The deadline is the time when all computation must 
finish. Fig. 1 shows the timing for two tasks. The period, 
$T_P$, is the interval between two consecutive executions. 
When the CPU completes the last task it enters a 
power-saving mode. 
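[Timing diagram: task 1 ($N_1$ clock cycles) runs for $T_{ACT1}$ and task 2 ($N_2$ clock cycles) runs for $T_{ACT2}$, followed by the power-saving interval $T_{PS}$. The deadline $T_{DL}$ and the period $T_P$ are marked on the time axis.]

Fig. 1 Real-time computational model 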
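
The timing relations of Fig. 1 can be sketched in a few lines of Python. This is only an illustration, assuming each task runs at its own constant clock rate; the function name `timing` and the 1 MHz clock rates are hypothetical, while the workloads, deadline and period are the ones used later in the simulations of Section 6.

```python
def timing(workloads, freqs, t_dl, t_p):
    """Return (active times, power-saving time, deadline met?) for one period.

    workloads: clock cycles N_i per task
    freqs:     clock rate f_i in Hz per task
    t_dl, t_p: common deadline and period in seconds
    """
    t_act = [n / f for n, f in zip(workloads, freqs)]  # T_ACTi = N_i / f_i
    busy = sum(t_act)
    t_ps = t_p - busy                                  # T_PS = T_P - sum of T_ACTi
    return t_act, t_ps, busy <= t_dl


# Two tasks, both clocked at 1 MHz (assumed value for illustration).
t_act, t_ps, deadline_met = timing([20000, 30000], [1e6, 1e6],
                                   t_dl=60e-3, t_p=100e-3)
print(t_act, t_ps, deadline_met)
```

Lowering the clock rates stretches the active intervals and shrinks $T_{PS}$; the deadline check fails as soon as the sum of the active intervals exceeds $T_{DL}$.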
 
 
4     HARDWARE PLATFORM 
Since the target architecture is based on a 
microcontroller, the CPU runs in parallel with a variable 
number of embedded peripherals. Fig. 2 shows a 
hardware platform, model Common Clock (CC). The 
architecture consists of an oscillator (OSC), 
divider/multiplier (D/M), CPU, divider for peripherals 
(DP) and peripherals (P). 
We assume the hardware provides the following 
mechanisms for power management: 
• The CPU and embedded peripherals can be 
individually enabled and disabled. When the CPU is 
running, the system is in an active mode. If the CPU is 
switched off, the system is in a power saving mode. 
• The clock rate can be scaled by division/multiplication 
of the oscillator frequency. 
• A timer counts the clock cycles to keep track of the 
workload processed. This is important in case of 
preemption. 
Hardware does not necessarily provide a 
mechanism for dynamic voltage scaling. In case of 
voltage-scalable systems, additional energy savings can 
be achieved. 
Fig. 3 shows another model for the hardware 
platform – Separate Clock (SC). Under this architecture, 
scaling of the clock rate does not affect the peripherals' 
speed. Consequently, the SC architecture allows a 
straightforward implementation of DFS. 
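[Block diagram: OSC feeds D/M, which clocks the CPU; DP takes the scaled clock from D/M and drives P]

Fig. 2 Hardware platform, model CC 

[Block diagram: OSC feeds D/M, which clocks the CPU; DP takes its clock directly from OSC and drives P]

Fig. 3 Hardware platform, model SC 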
 
Fig. 4 shows a parallel hardware platform. In order 
to facilitate replication of tasks, some input is shared [4]. 
Replicated functionality introduces fault tolerance and 
reduces communication. 
 
[Block diagram: two platforms, each consisting of OSC, D/M, CPU, DP and P, with some input shared between them]

Fig. 4 A parallel hardware platform with shared input 
 
5     ENERGY MODEL 
Assume that the supply current $I_{DD}$ scales 
linearly with the clock frequency in both the active mode 
($I_{DD,ACT}$) and the power-saving mode ($I_{DD,PS}$). 
Fig. 5 shows the empirical equations for the two modes and 
outlines the power management as a two-step process. First, the 
power control is confined to DFS. Second, the power 
management assumes both DFS and DVS. When the 
clock frequency, $f$, is changed, the supply voltage, $V_{DD}$, is 
adjusted accordingly. This, in turn, leads to new values for 
$k_{ACT}$ and $k_{PS}$. 
DFS:
  $I_{DD,ACT} = k_{ACT} f + n_{ACT}$
  $I_{DD,PS} = k_{PS} f + n_{PS}$

DVS (in addition to DFS):
  $k_{ACT} = p_{ACT} V_{DD}$
  $k_{PS} = p_{PS} V_{DD}$
  $V_{DD} = r f + s$

Fig. 5 Energy model 
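
The two levels of the energy model can be written down directly. The sketch below is illustrative only: the function names are hypothetical, and the numeric values are borrowed from the examples in Sections 6 and 7 of this paper purely to exercise the formulas.

```python
def supply_voltage(f, r, s):
    """DVS relation: V_DD = r*f + s, with f in Hz."""
    return r * f + s


def current_dfs(f, k, n):
    """DFS only: I_DD = k*f + n with a fixed slope k."""
    return k * f + n


def current_dvs(f, p, n, r, s):
    """DFS plus DVS: the slope k = p*V_DD follows the scaled supply voltage."""
    return p * supply_voltage(f, r, s) * f + n


R, S = 30.03e-9, 1.7982          # voltage line of Section 7.1
print(supply_voltage(60e6, R, S))                         # supply voltage at 60 MHz
print(current_dfs(20e6, k=0.6e-9, n=3e-3))                # power-saving current, DFS
print(current_dvs(12e6, p=0.1667e-9, n=3e-3, r=R, s=S))   # power-saving current, DVS
```

Under DVS the current is no longer linear in $f$, because $k$ itself grows with the supply voltage.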
 
Fig. 6 classifies three defining relations between 
the power consumption in active mode and power saving 
mode. The clock frequency $f_{PS}$ is applied during the 
power saving period. We assume that a peripheral requires 
this clock rate for proper operation. 
 
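[Three plots of $I_{DD}$ versus $f$, each showing the ACT and PS lines with $f_{PS}$ marked on the frequency axis, one plot per relation below]

Relation                                    Notation
$n_{ACT} > k_{PS} f_{PS} + n_{PS}$          A > (S)
$n_{ACT} = k_{PS} f_{PS} + n_{PS}$          A = (S)
$n_{ACT} < k_{PS} f_{PS} + n_{PS}$          A < (S)

Fig. 6 Three relations between the power consumption in 
active mode and power saving mode 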
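
The classification of Fig. 6 amounts to comparing $n_{ACT}$ with the power-saving current at $f_{PS}$. A minimal sketch follows; the function name `classify` and the tolerance used for the equality case are assumptions, and the parameter values are those of the two-task simulation in Section 6.

```python
import math


def classify(n_act, k_ps, f_ps, n_ps, rel_tol=1e-9):
    """Return 'A>(S)', 'A=(S)' or 'A<(S)' for one task (relations of Fig. 6)."""
    i_ps = k_ps * f_ps + n_ps            # power-saving current at f_PS
    if math.isclose(n_act, i_ps, rel_tol=rel_tol):
        return "A=(S)"
    return "A>(S)" if n_act > i_ps else "A<(S)"


# n_ACT1 = 10e-3, n_ACT2 = 12e-3, k_PS = 0.6e-9, n_PS = 3e-3
for f_ps in (20e6, 13e6, 10e6):
    labels = [classify(n, 0.6e-9, f_ps, 3e-3) for n in (10e-3, 12e-3)]
    print(f"f_PS = {f_ps / 1e6:.0f} MHz: {labels}")
```

With these values both tasks fall under A < (S) at 20 MHz, both under A > (S) at 10 MHz, and the two tasks are split at 13 MHz, which agrees with the model labels of Fig. 7, Fig. 8 and Fig. 9.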
 
 
6     DYNAMIC FREQUENCY SCALING 
Based on the models from previous sections we 
analyze how to change the clock rate from task to task in 
order to minimize the energy for the period of execution. 
Theorem 1. Let the computational model include 
$n(T)$ tasks. Each task has a workload of $N_i$ clock cycles. 
Referring to Fig. 1, the common deadline is $T_{DL}$ and the 
period of execution is $T_P$. The energy model is defined as 
$I_{DD,ACT} = k_{ACT} f + n_{ACT}$ and $I_{DD,PS} = k_{PS} f + n_{PS}$ on the closed 
interval $[f_{MIN}, f_{MAX}]$. Suppose that the clock frequency is 
switched to $f_{PS}$ for the power saving period. The energy 
per period has the smallest value for  
$f_{A<(S)} = \frac{\sum_{T_i,\,A<(S)} N_i}{T_{DL} - (1/f_{MAX}) \sum_{T_i,\,A \ge (S)} N_i}$  (2) 
when the energy model is A < (S) and for $f_{A \ge (S)} = f_{MAX}$ 
when the energy model is A = (S) or A > (S). 
Proof. The energy per period 
$E_P = \int_0^{T_P} P(t)\,dt = V_{DD} \sum_{i=1}^{n(T)} (k_{ACTi} f_i + n_{ACTi}) N_i / f_i + V_{DD} (k_{PS} f_{PS} + n_{PS}) \left( T_P - \sum_{i=1}^{n(T)} N_i / f_i \right)$  (3) 
The first partial derivative of $E_P$ with respect to $f_i$ 
$(E_P)'_{f_i} = V_{DD} N_i (k_{PS} f_{PS} + n_{PS} - n_{ACTi}) / f_i^2$  (4) 
If A > (S), $(E_P)'_{f_i} < 0$ and $f_{A \ge (S)} = f_{MAX}$ is 
selected. If A = (S), $(E_P)'_{f_i} = 0$, so the energy does not 
depend on $f_i$ and $f_{MAX}$ can be selected as well. In case 
of A < (S), $(E_P)'_{f_i} > 0$ and the clock 
frequency must be as low as possible. Since $f_{MAX}$ is used 
for the period  
$\sum_{T_i,\,A \ge (S)} N_i / f_{MAX}$         (5) 
the lowest possible frequency is 
$f_{A<(S)} = \frac{\sum_{T_i,\,A<(S)} N_i}{T_{DL} - (1/f_{MAX}) \sum_{T_i,\,A \ge (S)} N_i}$   (6) 
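
The frequency assignment of Theorem 1 can be sketched as follows. This is a minimal illustration rather than the CASTLE implementation: the function name `optimal_frequencies` is hypothetical, tasks with A = (S) are simply assigned $f_{MAX}$, and the result is not clamped to $[f_{MIN}, f_{MAX}]$. The sample values are those of the two-task simulation discussed after the proof.

```python
def optimal_frequencies(tasks, k_ps, n_ps, f_ps, t_dl, f_max):
    """Assign a clock rate (Hz) to every task according to Theorem 1.

    tasks: list of (N_i, n_ACTi) pairs.
    """
    i_ps = k_ps * f_ps + n_ps
    slow = [n_act < i_ps for _, n_act in tasks]            # True for A < (S)
    n_slow = sum(n for (n, _), s in zip(tasks, slow) if s)
    n_fast = sum(n for (n, _), s in zip(tasks, slow) if not s)
    if n_slow == 0:
        return [f_max] * len(tasks)
    # Tasks with A >= (S) occupy n_fast / f_max seconds at f_MAX;
    # the remaining time up to the deadline is shared by the A < (S) tasks.
    f_low = n_slow / (t_dl - n_fast / f_max)
    return [f_low if s else f_max for s in slow]


TASKS = [(20000, 10e-3), (30000, 12e-3)]
for f_ps in (20e6, 13e6, 10e6):
    freqs = optimal_frequencies(TASKS, 0.6e-9, 3e-3, f_ps, 60e-3, 40e6)
    print(f"f_PS = {f_ps / 1e6:.0f} MHz:", [round(f) for f in freqs])
```

A production version would also check that the computed frequency lies inside $[f_{MIN}, f_{MAX}]$ and that the hardware can actually generate it.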
 
Related simulations for Theorem 1 can be seen in 
Fig. 7, Fig. 8 and Fig. 9. Two tasks have the following 
workload: $N_1 = 20000$ and $N_2 = 30000$. All 
computation must finish within $T_{DL} = 60$ ms, 
$T_P = 100$ ms and $f_{MAX} = 40$ MHz. The energy model is 
characterized by $n_{ACT1} = 10 \times 10^{-3}$, $k_{ACT1} = 0.92 \times 10^{-9}$, 
$n_{ACT2} = 12 \times 10^{-3}$, $k_{ACT2} = 0.95 \times 10^{-9}$, $n_{PS} = 3 \times 10^{-3}$ and 
$k_{PS} = 0.6 \times 10^{-9}$. The only difference between the three 
simulations is the clock frequency in the power-saving 
mode, $f_{PS}$. 
[Surface plot: energy per period $E_P$ (J, scale $\times 10^{-3}$) versus $f_1$ and $f_2$ (MHz)]

Fig. 7 Energy per period, model A < (S), $f_{PS} = 20$ MHz 
 
[Surface plot: energy per period $E_P$ (mJ) versus $f_1$ and $f_2$ (MHz), with regions labeled A > (S) and A < (S)]

Fig. 8 Energy per period, $f_{PS} = 13$ MHz 
 
 
Parallel architectures may prove power efficient 
if the clock rates approach the low-frequency area where 
the savings would justify the overhead. 
[Surface plot: energy per period $E_P$ (mJ) versus $f_1$ and $f_2$ (MHz)]

Fig. 9 Energy per period, model A > (S), $f_{PS} = 10$ MHz 
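
The energy surfaces of Fig. 7 to Fig. 9 follow from the expression for the energy per period, equation (3). The sketch below evaluates that expression at two corner points of each surface. The function name is hypothetical, and the supply voltage of 3.0 V is an assumption, since the paper does not state the value used in the simulations.

```python
def energy_per_period(tasks, freqs, k_ps, n_ps, f_ps, t_p, v_dd):
    """Energy per period in joules for given per-task clock rates.

    tasks: list of (N_i, k_ACTi, n_ACTi); freqs: clock rate f_i in Hz per task.
    """
    active = sum((k * f + n) * N / f for (N, k, n), f in zip(tasks, freqs))
    t_ps = t_p - sum(N / f for (N, _, _), f in zip(tasks, freqs))
    return v_dd * (active + (k_ps * f_ps + n_ps) * t_ps)


TASKS = [(20000, 0.92e-9, 10e-3), (30000, 0.95e-9, 12e-3)]
V_DD = 3.0   # ASSUMPTION: supply voltage is not given in the paper

for f_ps in (20e6, 13e6, 10e6):
    for f in (1e6, 40e6):
        e = energy_per_period(TASKS, [f, f], 0.6e-9, 3e-3, f_ps, 100e-3, V_DD)
        print(f"f_PS = {f_ps / 1e6:.0f} MHz, f1 = f2 = {f / 1e6:.0f} MHz: "
              f"{e * 1e3:.3f} mJ")
```

For $f_{PS} = 20$ MHz the low clock rate gives the lower energy, while for $f_{PS} = 10$ MHz the energy is lower at $f_{MAX}$, which is the behavior Theorem 1 predicts for the A < (S) and A > (S) models.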
 
7     CASTLE SOFTWARE TOOL 
We have developed a tool called CASTLE, 
Crystal Annealing Software Tool for Low Energy, to 
provide computer assistance in the design process of 
energy-aware embedded systems. 
7.1    Parallel execution 
CASTLE evaluates the opportunities for parallel 
execution of a set of tasks. The tool utilizes a data flow 
graph (DFG) as a design entry. Fig. 10 shows an example DFG 
specification. Table 1 lists the workload of each task 
and the corresponding energy model. In addition, task T4 
requires $f_{PS} = 12$ MHz and task T3 requires $f_{PS} = 1.5$ MHz. The 
CPU’s highest speed is 60 MHz and the supply voltage 
can vary in the range 1.8 to 3.6 V. The following equation 
describes the required supply voltage for a certain clock 
frequency. 
$V_{DD} = 30.03 \times 10^{-9} f + 1.7982$      (7) 
The power-saving mode is characterized by 
$p_{PS} = 0.1667 \times 10^{-9}$ and $n_{PS} = 3 \times 10^{-3}$. All computation 
must finish within 1 ms and the period of execution is 1.2 
ms. Fig. 11 compares six energy levels calculated by 
CASTLE. First, the energy is calculated for $f_{MAX}$ 
(version #1) and for $f_{MIN} = 56$ MHz (version #2). Version #3 
is based on Theorem 1, with $f_{A<(S)} = 55.222$ MHz. Version #4 
utilizes DVS. In an attempt to save energy, CASTLE splits the 
DFG into two subtrees for parallel execution (version #5). 
Tasks T1, T2, T3, T4 and T5 are mapped to CPU1. Tasks T1, T6, 
T7 and T8 are mapped to CPU2. The resulting energy is 181 µJ, 
which gives a reduction of 23% compared to version #4. The 
DFG partitioning for three CPUs is version #6. Since 
version #6 shows an increase of energy, the process is 
terminated. 
 
 
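[DFG drawing: T1 (▲) on the first level; T2 (▼) and T6 (▼) on the second level; T3, T4, T7 and T8 (all ▼) on the third level; T5 (▲) on the fourth level. Legend: ▲ runs at $f_{MAX}$, ▼ runs at $f_{A<(S)}$.]

Fig. 10 A DFG specification 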
 
Table 1 
Task   N        $p_{ACT}$, $10^{-9}$   $n_{ACT}$, $10^{-3}$
T1     2000     0.25                   11.0 
T2     3000     0.29                   8.2 
T3     7000     0.27                   12.0 
T4     10000    0.31                   7.5 
T5     8000     0.24                   11.5 
T6     12000    0.32                   9.5 
T7     8000     0.35                   6.8 
T8     6000     0.31                   7.2 
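
The workloads of Table 1 already show why partitioning can pay off. The sketch below computes the lowest constant clock rate that meets the 1 ms deadline for the single-CPU case and for the two-CPU mapping described in the text, together with the supply voltage required by equation (7). It is an illustration only: the helper names are hypothetical, and CASTLE's actual frequency selection (task-dependent clock rates, $f_{PS}$ requirements) is not reproduced.

```python
WORKLOAD = {"T1": 2000, "T2": 3000, "T3": 7000, "T4": 10000,
            "T5": 8000, "T6": 12000, "T7": 8000, "T8": 6000}


def min_frequency(task_names, t_dl):
    """Lowest constant clock rate (Hz) that finishes the tasks within t_dl."""
    return sum(WORKLOAD[t] for t in task_names) / t_dl


def supply_voltage(f):
    """Equation (7): required V_DD in volts for a clock frequency f in Hz."""
    return 30.03e-9 * f + 1.7982


T_DL = 1e-3
single = min_frequency(WORKLOAD, T_DL)                        # all eight tasks
cpu1 = min_frequency(["T1", "T2", "T3", "T4", "T5"], T_DL)
cpu2 = min_frequency(["T1", "T6", "T7", "T8"], T_DL)          # T1 is replicated

for name, f in (("single CPU", single), ("CPU1", cpu1), ("CPU2", cpu2)):
    print(f"{name}: {f / 1e6:.1f} MHz, V_DD >= {supply_voltage(f):.3f} V")
```

Each of the two CPUs can meet the deadline at roughly half the single-CPU clock rate, which in turn permits a lower supply voltage under DVS.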
 
 
                 
 
[Bar chart: energy per period for six versions]
#1  $f_{MAX}$                              258 µJ
#2  $f_{MIN}$                              260 µJ
#3  $f_{MAX}$, $f_{A<(S)}$                 257 µJ
#4  $f_{MAX}$, $f_{A<(S)}$, DVS            239 µJ
#5  CPU1, CPU2, DVS                        181 µJ
#6  CPU1.1, CPU1.2, CPU2, DVS              192 µJ

Fig. 11 CASTLE outlines the design space 
 
7.2   $O^2$ execution 
With the assumption that two different clock 
rates are employed, CASTLE is capable of specifying Out 
of Order, $O^2$, execution of tasks. The goal is to minimize 
the overhead associated with clock rate control. Again, 
the tool utilizes a DFG as a design entry. Tasks are 
marked to run at one of the two clock frequencies. The 
tool generates a sequence of tasks with a minimal number 
of clock rate transitions. Fig. 12 compares the number of 
transitions for three examples processed by CASTLE. 
Each example is based on a symmetrical binary tree of 
1023 tasks. The tool calculated the number of clock rate 
transitions for breadth-first search (BFS), depth-first 
search (DFS; in this subsection the abbreviation does not 
denote Dynamic Frequency Scaling) and $O^2$ execution. 
Simulation results indicate ample room to reduce the 
clock rate control overhead. 
[Bar chart: number of clock rate transitions (axis from 100 to 500) for BFS, DFS and $O^2$ ordering, examples #1, #2 and #3]

Fig. 12 Number of clock rate transitions 
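
The idea of reordering tasks to avoid clock rate transitions can be illustrated on a small tree. The sketch below is not the CASTLE algorithm, which the paper does not describe in detail; it is a simple greedy heuristic, assumed here for illustration, that keeps executing ready tasks marked with the current clock rate and switches only when none is left. The labels "H" and "L" for the two clock rates, the seven-task tree and all function names are hypothetical.

```python
from collections import deque


def count_transitions(order, rate):
    """Number of clock rate changes between consecutive tasks in `order`."""
    return sum(1 for a, b in zip(order, order[1:]) if rate[a] != rate[b])


def bfs_order(children, root):
    """Breadth-first task sequence of a tree given as {parent: [children]}."""
    order, queue = [], deque([root])
    while queue:
        task = queue.popleft()
        order.append(task)
        queue.extend(children.get(task, []))
    return order


def greedy_order(children, root, rate):
    """Out-of-order sequence: exhaust ready tasks at the current rate first.

    Assumes a tree, so a task becomes ready as soon as its parent has run.
    """
    order, ready = [], [root]
    current = rate[root]
    while ready:
        same = [t for t in ready if rate[t] == current]
        if not same:                     # nothing left at this rate: switch
            current = rate[ready[0]]
            continue
        task = same[0]
        ready.remove(task)
        order.append(task)
        ready.extend(children.get(task, []))
    return order


children = {0: [1, 2], 1: [3, 4], 2: [5, 6]}
rate = {0: "H", 1: "L", 2: "H", 3: "L", 4: "H", 5: "L", 6: "H"}

bfs = bfs_order(children, 0)
o2 = greedy_order(children, 0, rate)
print("BFS:", bfs, count_transitions(bfs, rate))
print("O2 :", o2, count_transitions(o2, rate))
```

Both sequences respect the parent-before-child precedence of the tree; only the order among ready tasks differs, and that freedom is what reduces the number of transitions.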
 
8     CONCLUSION 
This paper presents a methodology for optimal 
power management of real-time embedded systems. The 
power consumption is controlled via DFS and DVS. The 
combination of a real-time computational model, an 
energy model and a model of the hardware platform is 
central to the method. The CASTLE software tool processes 
DFGs to select an optimal number of CPUs for the 
application. Finally, the tool can be used to decrease the 
overhead associated with power management via out of 
order execution of tasks. 
9     REFERENCES 
[1] W. Wolf, “The good news and the bad news”, IEEE 
Computer, pp. 104-105, November 2007. 
[2] T. A. Henzinger and J. Sifakis, “The discipline of 
embedded systems design”, IEEE Computer, pp. 32-
40, October 2007. 
[3] P. Chevochot and I. Puaut, “Scheduling fault-tolerant 
distributed hard real-time tasks independently of the 
replication strategies”, Proc. 6th Int. Conference on 
Real-Time Computing Systems and applications, 
Hong-Kong, China, Dec. 1999. 
[4] Z. Karakehayov and E. Saramov, "A fuzzy 
geography approach to hardware-software co-design 
of distributed embedded systems", IEEE International 
Workshop on Embedded Fault-Tolerant Systems, 
Dallas, USA, 1996. 
[5] P. Pop, P. Eles and Z. Peng, Analysis and Synthesis 
of Distributed Real-Time Embedded Systems, 
Kluwer, 2004. 
[6] A. Miyoshi, C. Lefurgy, E. V. Hensbergen, R. 
Rajamony and R. Rajkumar, "Critical power slope: 
Understanding the runtime effects of frequency 
scaling", Proceedings ICS’02, New York, 2002. 
[7] M. T. Schmitz, B. M. Al-Hashimi and P. Eles, 
System-Level Design Techniques for Energy-Efficient 
Embedded Systems, Kluwer, 2004. 
[8] E. Bini, G. Buttazzo and G. Lipari, “Speed 
modulation in energy-aware real-time systems”, 
Proceedings of the 17th Euromicro Conference on 
Real-Time Systems, pp. 3-10, 2005. 
[9] R. Melhem, N. A. Ghazaleh, H. Aydin and D. Mosse, 
"Power management points in power-aware real-time 
systems", in Power Aware Computing, edited by R. 
Graybill and R. Melhem, Kluwer, 2002. 
[10] N. A. Ghazaleh, D. Mosse, B. Childers and R. 
Melhem, "Toward the placement of power 
management points in real-time applications", in 
Compilers and operating systems for low power, 
edited by L. Benini, M. Kandemir and J. Ramanujam, 
Kluwer, 2003. 
[11] Z. Karakehayov, "Low-power design for Smart Dust 
networks", in Handbook of Sensor Networks: 
Compact Wireless and Wired Sensing Systems, 
edited by Mohammad Ilyas and Imad Mahgoub, CRC 
Press LLC, pp. 37-1 - 37-12, 2005. 
[12] Z. Karakehayov, "Dynamic clock scaling for energy-
aware embedded systems", Proceedings of the IEEE 
Fourth International Workshop on Intelligent Data 
Acquisition and Advanced Computing Systems, 
Dortmund, Germany, 6-8 September, pp. 96-99, 
2007. 
 
