Models and complexity results for performance and energy optimization of concurrent streaming applications by Benoit, Anne et al.
Models and complexity results for performance and
energy optimization of concurrent streaming applications
Anne Benoit, Paul Renaud-Goud, Yves Robert
To cite this version:
Anne Benoit, Paul Renaud-Goud, Yves Robert. Models and complexity results for performance
and energy optimization of concurrent streaming applications. [Research Report] RR-7589,
INRIA. 2011, pp.35. <inria-00583123>
HAL Id: inria-00583123
https://hal.inria.fr/inria-00583123
Submitted on 4 Apr 2011
Distributed and High Performance Computing
INSTITUT NATIONAL DE RECHERCHE EN INFORMATIQUE ET EN AUTOMATIQUE
N° 7589
April 2011

Centre de recherche INRIA Grenoble – Rhône-Alpes
655, avenue de l’Europe, 38334 Montbonnot Saint Ismier
Téléphone : +33 4 76 61 52 00 — Télécopie +33 4 76 61 52 52
Theme : Distributed and High Performance Computing
Project-Team GRAAL
Research Report No. 7589, April 2011, 35 pages
Abstract: In this report, we study the problem of finding optimal mappings
for several independent but concurrent workflow applications, in order to op-
timize performance-related criteria together with energy consumption. Each
application consists of a linear chain graph with several stages, and processes
successive data sets in pipeline mode, from the first to the last stage. The prob-
lem is to decide which processors to enroll, at which speed (or mode) to use
them, and which stages they should execute. There is a clear trade-off to reach,
since running faster and/or more processors leads to better performance, but
energy consumption is then very high. Energy savings can be achieved at the
price of a lower performance, by reducing processor speeds or enrolling fewer
resources. We study the problem complexity on different target execution
platforms, ranging from fully homogeneous platforms to fully heterogeneous ones.
We consider three mapping strategies: (i) one-to-one mappings, where a pro-
cessor is assigned a single stage; (ii) interval mappings, where a processor may
process an interval of consecutive stages of the same application; and (iii) gen-
eral mappings, which are fully arbitrary, i.e., a processor may process stages
of several distinct applications. Finally, we compare two different models for
the computation of the latency, which is the time elapsed between the begin-
ning and the end of the execution of a given data set: with the Path model,
it is computed as the length of the path taken by this data set, while with the
Wavefront model, each data set progresses concurrently within a period. For
all platform types, all mapping strategies and both latency models, we establish
the complexity of several multi-criteria optimization problems, whose objective
functions combine period, latency and energy criteria. In particular, we exhibit
instances where the problem is NP-hard with concurrent applications, while it
can be solved in polynomial time for a single application, and instances whose
problem complexity depends upon the latency model.
Key-words: mapping, concurrent streaming applications, heterogeneous plat-
forms, complexity results, resource sharing, energy, latency, period.
Models and complexity results for the optimization of energy and
performance of concurrent pipelined applications
Summary: In this report, we study the problem of finding mappings for
several concurrent pipelined applications, with the goal of optimizing both
application performance and energy consumption. We consider several
execution platforms with varying degrees of heterogeneity, as well as several
mapping strategies. Finally, we introduce two execution models to compute
the latency. For each variant of the problem, we establish the complexity of
several multi-criteria optimization problems.
Keywords: pipelined applications, heterogeneous platforms, complexity
results, resource sharing, energy, latency, period.
Performance and energy optimization 3
Contents
1 Introduction
2 Related work
3 Motivating example
3.1 Interval mappings
3.2 General mappings
4 Framework
4.1 Applicative framework
4.2 Target platform
4.3 Mapping strategies and scheduling
4.4 Performance optimization criteria
4.4.1 Without processor sharing
4.4.2 With resource sharing
4.5 Energy model
5 Complexity results with the Path model
5.1 Period minimization
5.1.1 One-to-one mappings
5.1.2 Interval mappings
5.2 Latency minimization
5.2.1 One-to-one mappings
5.2.2 Interval mappings
5.3 Period/latency minimization
5.4 Period/energy minimization
5.5 Period/latency/energy minimization
5.6 Summary of complexity results for the Path model
6 Complexity results with the Wavefront model
6.1 Period minimization
6.2 Period/latency minimization
6.3 Period/latency/energy minimization
7 Conclusion
RR n° 7589
1 Introduction
In this report, we aim at optimizing the parallel execution of several pipelined
applications that execute concurrently on a given platform. We focus in this
work on pipelined applications whose structure is a linear chain of tasks. Such
applications are ubiquitous in streaming environments, as for instance video and
audio encoding and decoding, DSP applications, image processing, and so on [13,
16, 26, 30, 31]. Furthermore, the regularity of these applications renders them
amenable to a high-level parallel programming approach based on algorithmic
skeletons [12,23]. Skeletons ease the task of the application developer and make
it easy to tailor his/her specific problem to a target platform.
In a linear pipelined application, a series of data sets enters the input stage
and progresses from stage to stage until the final result is computed. Each stage
corresponds to a distinct task and has its own communication and computation
requirements: it reads an input from the previous stage, processes the data and
outputs a result to the next stage. Each data set is input to the first stage, and
final results are output from the last stage. The pipeline operates in synchronous
mode: after a transient behavior due to the initialization delay, a new data set
is completed every period. Mapping such applications onto parallel platforms
is a challenging problem that becomes even more difficult when platforms are
heterogeneous (nowadays a standard assumption). Another level of difficulty
is added when considering several independent applications which are executed
concurrently on the platform and that compete for available resources.
The objective is to minimize the energy consumption of the whole plat-
form, while satisfying given performance-related bounds on the period and la-
tency of each application. This problem has recently become a critical prob-
lem, both for economic and environmental reasons [22]. The Green500 list
(www.green500.org) provides rankings of the most energy-efficient supercom-
puters in the world, therefore raising awareness about power consumption. The
multi-criteria approach targets a trade-off between the users and the platform
manager. The former have specific requirements for their applications, while
the latter has crucial economic and environmental constraints. Indeed, the
energy saving problem is becoming increasingly important, not only because
of the sole cost of energy, but also because of the cost of cooling systems and
related infrastructures. To help reduce energy costs, modern computing cen-
ters provide multi-modal processors: each processor has a discrete number of
predefined speeds (or modes), which correspond to different voltages that the
processor can be subjected to. This approach is the Dynamic Voltage Scaling
(DVS) technique; the slowest modes correspond to a lower voltage, and hence a
slower execution speed. The power consumption is the sum of a static part (the
cost for a processor to be turned on: power can be saved by shutting processors
down – ON/OFF technique) and a dynamic part. This dynamic part is a strictly
convex function of the processor speed, so that the execution of a given amount
of work costs more energy if a processor runs in a higher mode [17,19]. On the
one hand, faster modes (i.e., higher voltages) allow for fulfilling the performance
criteria; on the other hand, they lead to a higher energy consumption.
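The effect of convexity can be illustrated with a short worked derivation. It is only a sketch under an assumption that the text does not state at this point: the dynamic power at speed $s$ is taken to be $s^{\alpha}$ with $\alpha > 1$ (the motivating example of Section 3 uses the square of the speed). The symbols $W$ (amount of work), $\alpha$ and $E_{dyn}$ are introduced here for illustration.

```latex
\[
  E_{dyn}(s)
  \;=\; \underbrace{\frac{W}{s}}_{\text{execution time}}
  \times \underbrace{s^{\alpha}}_{\text{dynamic power}}
  \;=\; W \, s^{\alpha - 1},
  \qquad \alpha > 1 .
\]
% Example with alpha = 2 and W = 6 operations:
%   E_dyn(3) = 6 * 3 = 18   and   E_dyn(6) = 6 * 6 = 36.
```

Under this assumption the energy spent on a fixed amount of work grows with the speed, so doubling the speed halves the execution time but doubles the dynamic energy when $\alpha = 2$.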
The main performance-oriented criteria for pipelined applications are period
and latency [6,7,24,25,28–30]. The period of an application is the inverse of the
throughput, i.e., it corresponds to the time interval between the arrival of two
consecutive data sets. The period is dictated by the critical resource: it is equal
to the longest cycle time of a processor. For instance under a strict one-port
communication model with no overlap of communications and computations,
it is the sum of the time to perform all incoming communications, the time to
perform all outgoing communications, and the total computation time. With
overlap, we simply replace the sum of these three terms by their maximum. In
some cases, the period is fixed by the applicative setting, and we must ensure
that data sets are processed fast enough so that there is no accumulation of data
sets in the pipeline. The latency of an application is the time elapsed between
the beginning and the end of the execution of a given data set, hence it measures
the response time of the system to process the data set entirely. For streaming
applications, there are several approaches to compute the latency. The most
accurate model is the Path model, in which the latency is computed as the
length of the path taken by any data set. With the Wavefront model, we
rather consider that each data set progresses concurrently within a period, and
the latency is then a multiple of the period, as suggested by Hary and Özgüner
in [16].
The two performance criteria alone are already antagonistic. The smallest
latency is obtained when no communication occurs, i.e., when the same (fastest)
processor executes all the stages of an application. However, such a mapping
may well exceed the bound on the period, since the same processor must process
an entire application. Moreover, when several applications run concurrently, the
scheduler must decide which resources to select and assign to each application,
so that all users receive a fair share of the platform.
Adding energy consumption as a third criterion renders everything even
more complex. Obviously, energy is minimized by enrolling a single processor
for all applications, namely the one with the smallest speed available; but such
a mapping would most certainly exceed period and latency bounds.
Our goal is to execute all applications efficiently while minimizing the energy
consumed. Unfortunately, the goals of low power consumption and efficient
scheduling are contradictory. Indeed, period and/or latency can be minimized
by using more energy to speed up processors, while energy can be minimized
by reducing processor speeds, hence degrading performance-related objectives. How
should we deal with these contradictory objective functions? In traditional approaches,
one would form a linear combination of the different objectives and treat the
result as the new objective to be optimized. But is it natural for the user to
minimize the quantity 0.7P + 0.3E, where P is the period and E the energy?
Since criteria are very different in nature, it does not make much sense for
a user to make a linear combination of them. Thus we advocate the use of
multi-criteria mappings with thresholds. Now, each criteria combination can
be handled in a natural and meaningful way: one single criterion is optimized,
under the condition that a threshold is enforced for all other criteria. This leads
to two interesting questions. If we fix energy, we get the laptop problem, which
asks “What is the best schedule achievable using a particular energy budget,
before battery becomes critically low?” Fixing schedule quality gives the server
problem, which asks “What is the least energy required to achieve a desired level
of performance?”
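The threshold approach can be sketched in a few lines of Python. This is a minimal illustration, not the report's algorithm: it only selects among three candidate mappings whose (period, energy) values are those computed for interval mappings in Section 3.1, and the names `candidates`, `server_problem` and `laptop_problem` are hypothetical.

```python
# (period, energy) of three interval mappings taken from Section 3.1.
# The dictionary keys are illustrative labels, not names used in the report.
candidates = {
    "min-period": (1, 136),
    "compromise": (2, 46),
    "min-energy": (14, 10),
}


def server_problem(candidates, period_bound):
    """Least energy among the candidates whose period meets the bound."""
    feasible = {k: v for k, v in candidates.items() if v[0] <= period_bound}
    if not feasible:
        return None
    return min(feasible, key=lambda k: feasible[k][1])


def laptop_problem(candidates, energy_budget):
    """Best period among the candidates whose energy fits the budget."""
    feasible = {k: v for k, v in candidates.items() if v[1] <= energy_budget}
    if not feasible:
        return None
    return min(feasible, key=lambda k: feasible[k][0])
```

For instance, `server_problem(candidates, 2)` and `laptop_problem(candidates, 50)` both select the mapping labelled `"compromise"`, while an energy budget of 5 admits none of the three candidates.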
The optimization problem can then be stated as follows: given a set of appli-
cations and a computational platform, which stage to assign to which processor?
We consider three different mapping strategies: one-to-one mappings, for which
each application stage is allocated to a distinct processor; interval mappings,
where each participating processor is assigned an interval of consecutive stages
of the same application; and general mappings which are fully arbitrary. These
mapping strategies have been widely used in the literature when mapping one
single application (see [6,24,25]), and we extend them naturally to the mapping of
several concurrent applications.
We target three different platform types: fully homogeneous platforms have
identical processors and interconnection links; communication homogeneous plat-
forms have identical links but different-speed processors, thus introducing a first
degree of heterogeneity; and finally, fully heterogeneous platforms, with different-
speed processors and different capacity links, constitute the most difficult prob-
lem instance.
The report is organized as follows. We first review related work in Sec-
tion 2. Then, we illustrate and motivate the problem with a simple example
in Section 3. The framework is described in Section 4. The next two sections
constitute the heart of the report: we assess the complexity of all problem in-
stances. In Section 5, we establish the complexity of mapping problems with the
Path latency model, while we investigate the complexity with the Wavefront
latency model in Section 6. We conclude in Section 7.
2 Related work
The problem of mapping a single linear chain application onto parallel platforms
in order to minimize latency and/or period has already been widely studied, in
particular on homogeneous platforms (see the pioneering papers [24] and [25])
and later for heterogeneous platforms (see [6, 7]), considering the Path latency
model. These results focus on the mapping of one single application, while we
add the complexity of satisfying several users who each have different require-
ments for their applications. We were able to extend polynomial time algorithms
to this multi-application setting, and to exhibit cases in which the problem
becomes NP-hard because of this additional difficulty. Of course, problem in-
stances which were already NP-hard with a single application remain difficult
with several concurrent applications.
Moreover, we consider a new and important objective function, namely en-
ergy minimization, and this is the first study (to the best of our knowledge)
which combines performance-related objectives with energy in the context of
pipelined applications. As expected, combining all three criteria (period, latency
and energy) leads to even more difficult optimization problems: the problem is
NP-hard even with a single application on a fully homogeneous platform (for
interval mappings with the Path latency model).
In order to adjust energy consumption, we use the Dynamic Voltage Scaling
(DVS) technique. DVS has been extensively studied in several papers, for map-
ping onto a single-core processor, a multi-core processor, or a set of processors.
Slack reclamation techniques are used for frame-based hard real-time embedded
systems in [32]: a set of independent tasks, provided with their WCEC
(Worst Case Execution Cycle) and sharing a common deadline, has to be mapped
onto a processor. If a task needs fewer cycles than its WCEC, the dynamically
obtained slack allows the processor to run at a lower frequency and therefore to
save energy. This work is extended in [33], where the energy model includes
time and energy penalties when the processor frequency is changing. Those
transition overheads are also taken into account in [2], but tasks are interdepen-
dent.
In [11], applications consisting of a program modeled with a sequential part
and another part that can be parallelized are mapped onto a multi-core
processor. Bunde [9] focuses on the problem of offline scheduling of unit-time tasks
with release dates, while minimizing the makespan or the total flow time on one
processor. He extends this work from one processor to multiple processors.
The authors of [10] study the problem of scheduling real-time tasks on two
heterogeneous processors. They provide an FPTAS to derive a solution very
close to the optimal energy consumption with a reasonable complexity, while
in [15], the authors design heuristics to map a set of dependent tasks with
deadlines onto a set of homogeneous processors, with the possibility of changing a
processor speed during the execution of a task. The authors of [18] propose a greedy
algorithm based on affinity to assign frame-based real-time tasks, and then re-assign
them in pseudo-polynomial time when any processing speed can be assigned to
a processor. In [21], leakage energy is the focus for mapping applications
represented as DAGs. In [27], the authors are interested in scheduling task graphs
with data dependencies while minimizing the energy consumption of both the
processors and the inter-processor communication devices, assuming that
communication times are negligible compared to computation times.
All these problems are quite different from ours, since we focus on pipelined
applications of infinite duration, thus considering power instead of total energy
consumption. Due to the streaming nature of the applications, we do not allow
for changing the processor speed during execution.
3 Motivating example
In this example, we have two applications and three processors, as shown in
Figure 1. The first stage of App1 computes 3 operations, and then sends data
of size 3 to the second stage; the second stage first receives data of size 3, then
computes 2 operations, and finally sends data of size 1, and so on. If both
stages are assigned to the same processor, there is no communication cost to
pay; otherwise this cost depends on the communication volume (3 between the
first and the second stage in this case), and on the link bandwidth between the
corresponding processor pair. All communication link bandwidths are set to 1.
For the computational platform, each processor has two execution modes. For
instance, P1 can process 3 operations per time unit in its first mode, and 6 in
its second one, against 6 or 8 for P2, and 1 or 6 for P3.
We compute the global period as follows: T = max(T1, T2), where Ti is the
period of the ith application (i = 1, 2). The global latency is defined in a similar
way, as the maximum of the latency achieved by all applications. The energy
consumption of a processor is equal to the square of its speed, which is quite a
realistic assumption (see Section 4.5 for more details on the model for energy
consumption). Note that when the energy is not a criterion to minimize, all
processors can run in their highest mode (as fast as possible), because this can
only improve the performance-related criteria (period and latency). In this case,
either a processor is used at its fastest speed, or it is turned off.
Figure 1: Example with two applications and three multi-modal processors.
3.1 Interval mappings
First we restrict to interval mappings, where a processor can be assigned only
a set of consecutive stages of a single application.
In order to minimize the period without energy constraints, we map the
whole first application onto processor P3, the first half of the second application
onto processor P2, and the rest onto processor P1. The period is then:
\[
\max\left( \frac{3+2+1}{6},\ \max\left( \frac{2+6}{8},\ \frac{1}{1},\ \frac{4+2}{6} \right) \right) = 1 . \tag{1}
\]
Equation (1) reads as follows: we compute the cycle-time of each processor
as the maximum time spent for incoming communications, computations, and
outgoing communications, thus considering a model in which communications
and computations are overlapped. We then take the maximum of these quanti-
ties to derive the period. There is only one communication to pay in the second
application since it is split between two processors. Note that the cycle-time of
each processor is exactly 1 and there is no idle time on computation, thus it is
not possible to achieve a better period: this mapping is optimal for the period
minimization problem.
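The computation behind Equation (1) can be reproduced with a short Python sketch. It assumes the stage weights and the single communication of size 1 that appear in the equation, a bandwidth of 1 on every link, and the overlap model; the helper name `cycle_time` is introduced here for illustration only.

```python
from fractions import Fraction


def cycle_time(incoming, work, speed, outgoing, bandwidth=1):
    """Overlap model: receive, compute and send run in parallel, so take the max."""
    return max(
        Fraction(incoming, bandwidth),
        Fraction(work, speed),
        Fraction(outgoing, bandwidth),
    )


# App1 (stage weights 3, 2, 1) entirely on P3 at speed 6: no communication.
t_p3 = cycle_time(0, 3 + 2 + 1, 6, 0)
# App2, first two stages (weights 2, 6) on P2 at speed 8, sending data of size 1.
t_p2 = cycle_time(0, 2 + 6, 8, 1)
# App2, last two stages (weights 4, 2) on P1 at speed 6, receiving data of size 1.
t_p1 = cycle_time(1, 4 + 2, 6, 0)

# The period is the longest cycle time over all enrolled processors.
period = max(t_p3, t_p2, t_p1)
print(period)  # 1
```

Every processor has a cycle time of exactly 1, which matches the observation that no computation resource is idle.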
The minimum latency is obtained by removing all communications and using
the fastest processors. A mapping that returns the optimal latency (in the
absence of other criteria) is for instance the one which maps the first application
on P1 and the second application on P2, thus achieving a global latency of:
\[
\max\left( \frac{3+2+1}{6},\ \frac{2+6+4+2}{8} \right) = 1.75 . \tag{2}
\]
In Equation (2), we simply compute the longest execution path for each
application following the Path latency model. The period of each application
is, in this case, equal to its latency, and the Wavefront model returns the same
latency (one single period for the execution of a data set). The bottleneck is
the second application, and we cannot achieve a better latency since we pay no
communication and use the fastest processor for this application. This latency
is thus optimal.
The minimum energy is obtained when we use fewer processors, each running
in its slowest mode. Since we assume that a processor cannot be assigned stages
of two different applications, two processors are required in the example. For
instance, we can map the first application on P1 running in its lowest mode and
the second application on P3 running in its lowest mode too, thus achieving an
energy of $3^2 + 1^2 = 10$. This is the minimum energy consumption required to
run both applications. We observe that the period is then:
\[
\max\left( \frac{3+2+1}{3},\ \frac{2+6+4+2}{1} \right) = 14 .
\]
As expected, running at a slower mode to save energy leads to poorer
performance. Trade-offs must be found when considering several antagonistic
optimization criteria.
For instance, if we try to minimize the energy consumption under the con-
straint that the period is not greater than 2, we can use the first mode of each
processor. Then the first application is mapped onto P1, the first three stages
of the second application are mapped onto P2 and its last stage is mapped onto
P3. The global period is 2, and the consumed energy is $3^2 + 6^2 + 1^2 = 46$. This
may be quite a reasonable compromise between energy and period: indeed, with
the mapping minimizing the period (period of 1), the energy consumption was
$6^2 + 8^2 + 6^2 = 136$. With this mapping, the latency model impacts the result.
With the Path model, we compute the longest path followed by a data set, which is
\[
\max\left( \frac{3+2+1}{3},\ \frac{2+6+4}{6} + \frac{1}{1} + \frac{2}{1} \right) = 5 ,
\]
while with the Wavefront model, it takes three periods for a data set to be
computed by the second application, leading to a latency of $3 \times 2 = 6$.
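The figures of this last mapping can be checked with the following Python sketch. It assumes the overlap model with unit bandwidths, and it reads the Wavefront model as counting one period per processor traversed plus one per inter-processor communication, which is how the text counts three periods for the second application. Variable names are illustrative.

```python
from fractions import Fraction

# Every processor runs in its first mode.
speed = {"P1": 3, "P2": 6, "P3": 1}

# Energy: square of the chosen speed, summed over the enrolled processors.
energy = sum(s ** 2 for s in speed.values())

# Cycle times in the overlap model (max of communication and computation).
t_p1 = Fraction(3 + 2 + 1, speed["P1"])                     # App1 on P1
t_p2 = max(Fraction(2 + 6 + 4, speed["P2"]), Fraction(1))    # App2 stages 1-3, sends size 1
t_p3 = max(Fraction(1), Fraction(2, speed["P3"]))            # App2 stage 4, receives size 1
period = max(t_p1, t_p2, t_p3)

# Path model: length of the path followed by one data set in each application.
path_app1 = Fraction(3 + 2 + 1, speed["P1"])
path_app2 = Fraction(2 + 6 + 4, speed["P2"]) + Fraction(1, 1) + Fraction(2, speed["P3"])
path_latency = max(path_app1, path_app2)

# Wavefront model (assumed counting rule): App1 needs 1 period, App2 needs
# 2 computation steps + 1 communication step = 3 periods.
wavefront_latency = max(1 * period, 3 * period)

print(energy, period, path_latency, wavefront_latency)  # 46 2 5 6
```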
3.2 General mappings
With general mappings, it is possible to assign any subset of stages to the
processors. For instance, we consider the mapping in which the first stage of
application one, and the second and third stages of application two, are all
mapped onto the second processor, running at speed 6. The other stages are
mapped onto the first processor, running at speed 3.
The energy consumption is then $3^2 + 6^2 = 45$. For the period, we take the
maximum between the periods of both processors, accounting both for
computation and communication costs:
\[
\max\left( \frac{1+1}{1},\ \frac{2+1+2+2}{3},\ \frac{1}{1},\ \frac{3+6+4}{6},\ \frac{1+1}{1} \right) = \frac{7}{3} .
\]
Note that there are two communications from P2 to P1: one which cor-
responds to the communication in the first application between the first and
the second stage, and one in the second application between the third and the
fourth stage. For the computation of the latency with the Path model, it is
necessary to decide in which order these communications occur. If we start with
the communication in the first application, the latency is computed as follows:
\[
\max\left( \frac{3}{6} + \frac{1}{1} + \frac{2+1}{3},\ \frac{2}{3} + \frac{1}{1} + \frac{6+4}{6} + \frac{1}{1} + \frac{1}{1} + \frac{2}{3} \right) = 6 .
\]
There is one time unit of idle time in the computation of the latency of the
second application, which corresponds to the communication from P2 to P1 in
the first application. The latency can be reduced to 5 if we change the order of
communications. Actually, for general mappings, even if the mapping is fixed,
it is NP-hard to decide in which order communications should be executed in
order to minimize the latency with the Path model [1].
Because of this observation, we consider the Wavefront model when dealing
with general mappings. This model was introduced by Hary and Özgüner
in [16], and it is widely used in real-time systems. Note that the Wavefront
model assumes a full overlap of communications and computations. In the
example, the latency is still dictated by the second application: this application
needs 5 periods to execute a whole data set. The Wavefront latency is
therefore $5 \times \frac{7}{3} = \frac{35}{3} \approx 11.67$.
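As a sanity check, the period and the Wavefront latency of this general mapping can be recomputed with exact fractions. The sketch below takes the five terms of the period formula given earlier in this section as its input; the comments attached to each term are an interpretation of that formula, and the variable names are illustrative.

```python
from fractions import Fraction

s1, s2 = 3, 6  # speeds chosen for P1 and P2; every link has bandwidth 1

energy = s1 ** 2 + s2 ** 2

# One entry per term of the period formula (overlap model).
loads = [
    Fraction(1 + 1, 1),            # P1 receives two messages from P2
    Fraction(2 + 1 + 2 + 2, s1),   # P1 computes App1 stages 2-3 and App2 stages 1 and 4
    Fraction(1, 1),                # P1 sends one message to P2
    Fraction(3 + 6 + 4, s2),       # P2 computes App1 stage 1 and App2 stages 2-3
    Fraction(1 + 1, 1),            # P2 sends two messages to P1
]
period = max(loads)

# App2 goes P1 -> P2 -> P1: 3 computation steps and 2 communications,
# hence the 5 periods mentioned in the text.
wavefront_latency = 5 * period

print(energy, period, wavefront_latency)  # 45 7/3 35/3
```

The bottleneck is the computation load of P1, and `float(wavefront_latency)` is about 11.67.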
4 Framework
We start with a formal description of the applicative framework (Section 4.1) and
the target execution platform (Section 4.2). Next in Section 4.3, we introduce
and motivate the mapping strategies. We are then ready to formally describe
the performance objective criteria (period and latency) in Section 4.4, and then
to finally discuss the energy model in Section 4.5.
4.1 Applicative framework
We consider $A$ independent application workflows ($A \geq 1$) to be executed
concurrently; each application operates on a collection of data sets that are executed
in a pipelined fashion. For $1 \leq a \leq A$, let $n_a$ be the number of stages of
application $a$, and $N = \sum_{a=1}^{A} n_a$ be the total number of stages. For
$1 \leq k \leq n_a$, $w_a^k$ is the computation requirement of $S_a^k$, the $k$-th stage of
application $a$. For $1 \leq k < n_a$, $\delta_a^k$ is the size of the output data of $S_a^k$.
4.2 Target platform
The target platform is composed of $p$ processors, which are fully interconnected;
there is a bidirectional link $P_u \leftrightarrow P_v$ between any processor pair $P_u$ and $P_v$, of
bandwidth $b_{u,v}$.
We use a linear cost model for communications; it takes $X/b_{u,v}$ time units to
send (resp. receive) a message of size $X$ to (resp. from) $P_v$. With the mapping
rules that we enforce (see Section 4.3 below), it turns out that a processor
never has to perform two concurrent incoming communications, nor two concurrent
outgoing ones: at any time-step, a processor is involved in at most one send, one
computation and one receive. However, these three operations can either be parallel (as in
the example of Section 3) or serialized. With parallel operations, we have the
overlap model that corresponds to multi-threaded communication libraries such
as MPICH2 [20]. With sequential operations, we have the no-overlap model
that is well-suited to single-threaded programs.
Processors are multi-modal: every processor $P_u$ is associated with a set of
speeds $S_u = \{s_{u,1}, \dots, s_{u,m_u}\}$. During the mapping process, we need to choose
one speed in $S_u$ for each processor $P_u$ that is enrolled, and this speed is fixed
during the whole execution.
We then classify particular cases that are important, both from a theoretical
and a practical perspective. Fully homogeneous platforms, also called speed
homogeneous, have identical processors (all processors have a common speed
set: $S_u = S$) and homogeneous communication devices ($b_{u,v} = b$ for all link
bandwidths). They represent typical parallel machines. Communication
homogeneous platforms, also called speed heterogeneous, are still interconnected
with homogeneous communication devices, but they may have processors with
different speed sets ($S_u \neq S_v$). They correspond to networks of workstations
with plain TCP/IP interconnects or other LANs. Fully heterogeneous platforms
are the most general architectures, with different-speed processors and
different-capacity links. Hierarchical platforms made up of several clusters
interconnected by slower backbone links can be modeled this way.
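The three platform classes can be told apart mechanically. The following Python sketch is only an illustration of the definitions above; the function name `classify_platform` and the chosen data layout (a list of speed sets and a dictionary of link bandwidths) are assumptions, not part of the report.

```python
def classify_platform(speed_sets, bandwidths):
    """Classify a platform according to the three classes of Section 4.2.

    speed_sets: list of sets of speeds, one set per processor.
    bandwidths: dict mapping a processor pair (u, v) to the link bandwidth.
    """
    same_links = len(set(bandwidths.values())) <= 1
    same_speeds = all(s == speed_sets[0] for s in speed_sets)
    if same_links and same_speeds:
        return "fully homogeneous"
    if same_links:
        return "communication homogeneous"
    # Different link capacities: the most general class.
    return "fully heterogeneous"


# Platform of the motivating example: three bi-modal processors, unit bandwidths.
example_speeds = [{3, 6}, {6, 8}, {1, 6}]
example_links = {(1, 2): 1, (1, 3): 1, (2, 3): 1}
print(classify_platform(example_speeds, example_links))  # communication homogeneous
```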
4.3 Mapping strategies and scheduling
We consider three mapping strategies. One-to-one mappings obey the simplest
rule: each application stage is allocated to a distinct processor. While easier
to optimize and implement, this rule may be unduly restrictive, and is likely to
pay high communication costs. Obviously, it also requires that p ≥ N , thereby
limiting its applicability to larger platforms (or fewer and smaller applications).
A natural extension is to search for interval mappings, where each participating
processor is assigned an interval of consecutive stages. Intuitively, assigning several
consecutive stages to the same processor will increase its computational
load, but may well dramatically decrease communication requirements. Interval
mappings have been widely used in the literature, see [6, 24, 25, 30, 31] among
others. We point out that both one-to-one and interval mappings forbid any
processor sharing, or re-use, across applications. These mappings are relevant
in practice, for instance if we envision a computer center where applications,
or jobs, cannot share resources because of security rules or of batch-assignment
procedures. The goal of the platform manager is to secure an efficient (albeit
concurrent) execution for each application (performance-related criteria) while
minimizing the energy consumption of the whole platform.
We also introduce general mappings that allow any processor to execute any
number of stages, consecutive or not, taken from one or several applications.
Such mappings are likely to lead to a better resource utilization throughout the
platform.
4.4 Performance optimization criteria
We are now ready to formally define the period and the latency of the appli-
cations. We start with one-to-one and interval mappings with no processor
sharing, and then we discuss the impact of processor sharing on the metrics.
4.4.1 Without processor sharing
For one-to-one and interval mappings, since there is no processor sharing, we
can focus on a single application.
Formally, an interval mapping is a partition of the set of stages $S_1$ to $S_n$ into
$m$ intervals $I_j = [d_j, e_j]$ such that $d_j \le e_j$ for $1 \le j \le m$, $d_1 = 1$,
$d_{j+1} = e_j + 1$ for $1 \le j \le m-1$ and $e_m = n$. Then, the function
$al : [1, n] \to [1, p]$ associates a processor number to each stage number. In a
one-to-one mapping, this function is a one-to-one assignment. In an interval
mapping, for $1 \le j \le m$, the whole interval $I_j$ is mapped onto the same
processor $P_{al(d_j)}$, i.e., for $d_j \le i \le e_j$, $al(i) = al(d_j)$. Also, two
intervals (from the same application or from two different applications) cannot be
mapped onto the same processor, i.e., for $1 \le j, j' \le m$, $j \ne j'$,
$al(d_j) \ne al(d_{j'})$.
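The period of this single application is expressed in the overlap model as:
\[
T^{(overlap)} = \max_{j \in \{1,\dots,m\}} \left( \max\left(
\frac{\delta_{d_j-1}}{b_{al(d_j-1),al(d_j)}},\
\frac{\sum_{i=d_j}^{e_j} w_i}{s_{al(d_j)}},\
\frac{\delta_{e_j}}{b_{al(d_j),al(e_j+1)}}
\right) \right), \tag{3}
\]
with $\delta_{d_1-1} = \delta_{e_m} = 0$ for the boundaries.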
The maximum in the previous expression is replaced by a sum when considering
the no-overlap model, since all operations are serialized. The period is then:
\[
T^{(no-overlap)} = \max_{j \in \{1,\dots,m\}} \left(
\frac{\delta_{d_j-1}}{b_{al(d_j-1),al(d_j)}}
+ \frac{\sum_{i=d_j}^{e_j} w_i}{s_{al(d_j)}}
+ \frac{\delta_{e_j}}{b_{al(d_j),al(e_j+1)}}
\right). \tag{4}
\]
The latency is the time to process a single data set entirely, so it is identical in
both communication models, and computed with the Path model:
\[
L = \sum_{j=1}^{m} \left(
\frac{\sum_{i=d_j}^{e_j} w_i}{s_{al(d_j)}}
+ \frac{\delta_{e_j}}{b_{al(d_j),al(e_j+1)}}
\right), \tag{5}
\]
with $\delta_{e_m} = 0$ for the boundary.
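As an illustration of Equations (3), (4) and (5), the following Python sketch evaluates the two periods and the latency of one application for a given interval mapping. The function name `interval_metrics`, the 0-based processor indices and the data layout (lists for `w`, `delta`, `speed`, a matrix for `bw`) are assumptions made for this example only; boundary communications are set to zero as stated above.

```python
def interval_metrics(w, delta, intervals, alloc, speed, bw):
    """Return (T_overlap, T_no_overlap, L) for one application.

    w[i-1]     : computation cost of stage S_i (stages are numbered 1..n)
    delta[i]   : size of the data output by stage S_i (delta[0] is unused here)
    intervals  : list of (d_j, e_j), 1-based and inclusive
    alloc[j]   : index of the processor in charge of the j-th interval
    speed[u]   : speed of processor u
    bw[u][v]   : bandwidth of the link between processors u and v
    """
    m = len(intervals)
    t_overlap = t_no_overlap = latency = 0.0
    for j, (d, e) in enumerate(intervals):
        u = alloc[j]
        comp = sum(w[d - 1:e]) / speed[u]
        # Boundary conventions of Equations (3)-(5): no incoming communication
        # for the first interval, no outgoing one for the last interval.
        t_in = 0.0 if j == 0 else delta[d - 1] / bw[alloc[j - 1]][u]
        t_out = 0.0 if j == m - 1 else delta[e] / bw[u][alloc[j + 1]]
        t_overlap = max(t_overlap, t_in, comp, t_out)          # Equation (3)
        t_no_overlap = max(t_no_overlap, t_in + comp + t_out)  # Equation (4)
        latency += comp + t_out                                # Equation (5)
    return t_overlap, t_no_overlap, latency
```

For instance, with three stages of costs 2, 4, 6, the first two stages on a processor of speed 2 and the last one on a processor of speed 3, a unit bandwidth and $\delta_2 = 5$, the sketch returns a period of 5 (overlap), a period of 8 (no-overlap) and a latency of 10.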
Again, the simplicity of Equations (3), (4) and (5) is a very useful property of
interval mappings, and greatly simplifies the solution of multi-criteria problems.
These are the period and latency of one single application, and we need to
define a global period and latency function to be optimized. The simplest
approach is to minimize $X = \max_{a \in \{1,\dots,A\}} X_a$, where $X_a$ is the
period or latency of application $a$, for $a \in \{1,\dots,A\}$, as in the example of
Section 3. However, the concurrent applications can be of completely different
nature and/or economic value, so that their periods or latencies are not always
comparable. Therefore we aim at minimizing
\[
X = \max_{a \in \{1,\dots,A\}} W_a \times X_a, \tag{6}
\]
where $W_a > 0$ is a weight associated to each application and $X_a$ is the period
or latency of application $a$, for $a \in \{1,\dots,A\}$. $W_a$ can be 1 (we retrieve a
simple maximum) or a priority ratio (fixed by the platform manager and/or paid
by the user). We can also let $W_a = 1/X_a^*$, where $X_a^*$ is the objective function
computed when the application is executed alone on the platform; in this case
$W_a \times X_a$ represents the slowdown factor of application $a$, and $X$ corresponds to
the maximum stretch [3].
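A minimal Python sketch of Equation (6) is given below, together with the maximum-stretch variant obtained with $W_a = 1/X_a^*$. The function names `global_objective` and `max_stretch` are hypothetical and only serve this illustration.

```python
def global_objective(values, weights=None):
    """Equation (6): max over applications of W_a * X_a.

    values[a]  : period or latency X_a of application a
    weights[a] : weight W_a > 0 (defaults to 1, i.e., a plain maximum)
    """
    if weights is None:
        weights = [1.0] * len(values)
    return max(wa * xa for wa, xa in zip(weights, values))


def max_stretch(values, alone):
    """Maximum stretch: W_a = 1 / X_a^*, where alone[a] is X_a^*."""
    return global_objective(values, [1.0 / x for x in alone])
```

With periods 4 and 6 for two applications whose periods are 2 and 4 when executed alone, the plain maximum is 6 while the maximum stretch is 2: the first application is the one that suffers most from sharing the platform.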
4.4.2 With resource sharing
If we keep the classical latency definition (Path model) and consider general
mappings, it leads to intricate scheduling problems for period/latency bi-criteria
problems. Basically, even when the mapping is given, scheduling the execution
is a problem of combinatorial nature (it is NP-complete, see [1]). With general
mappings, a processor typically has several incoming and/or outgoing commu-
nications, and it is difficult to orchestrate these operations so as to minimize
conflicting objectives such as period and latency. This holds true both for the
overlap and no-overlap models.
Therefore, when considering resource sharing, we focus in this report on the
problem in which bounds on period and latency are fixed by the application
designer, and we relax the definition of the latency using the approach of Hary
and Özgüner [16], which we call the Wavefront model. Instead of computing
the longest path, we approximate the latency $L$ as $L = (2m - 1)P$, where
$P$ is the period, i.e., the rate at which data sets enter the system, and $m$ is
the number of intervals of consecutive stages mapped onto a same processor
in the mapping. A processor change occurs each time a stage and its
successor are not mapped onto the same processor, i.e., $m - 1$ times. The
intuition is that the whole application is executed synchronously, and each data
set progresses concurrently within a period. With $m$ successive computations
and $m - 1$ processor changes (i.e., communications), each data set traverses the
platform within $2m - 1$ periods.
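The Wavefront approximation can be sketched in a few lines of Python. The function name `wavefront_latency` and the representation of the mapping as a list of processor indices (one per stage) are assumptions made for this example.

```python
def wavefront_latency(alloc, period):
    """Wavefront latency L = (2m - 1) * P for one application.

    alloc[i] : processor index of the (i+1)-th stage
    period   : period P of the application
    """
    # A new interval starts each time a stage and its successor are
    # mapped onto different processors.
    changes = sum(1 for u, v in zip(alloc, alloc[1:]) if u != v)
    m = changes + 1
    return (2 * m - 1) * period
```

Note that with general mappings a processor may appear in several non-consecutive intervals: the allocation `[0, 0, 1, 1, 0]` has $m = 3$ intervals, hence a latency of 5 periods.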
The mapping is an allocation function, which associates a processor number
to each stage number, as well as a speed at which each processor is running.
For general mappings with processor reuse, we must carefully decide how the
speed of each processor is shared among all stages it is assigned to. Similarly,
a communication link or processor network card may be involved in several
communications, which implies sharing bandwidths and card capacities, too.
Hence the question is the following: given the mapping, and a threshold period
Pa and latency La for each application a ∈ {1, . . . , A}, is it possible to determine
which fraction of computing and communicating resources to assign to each
operation so that all period and latency thresholds are met?
Since we consider the Wavefront latency model, one period is accounted
for each computation of an interval of stages and for each inter-processor
communication. We observe that given the mapping, we know $m_a$, the number of
intervals ($m_a - 1$ processor changes), for each application $a$. We can thus check
immediately whether the bounds on the latency are respected, i.e.,
$(2m_a - 1)P_a \le L_a$ for $a \in \{1,\dots,A\}$.
Now for the periods, the key idea is to distribute platform resources parsi-
moniously, and to allocate only the needed CPU fraction to each computation,
and the needed bandwidth fraction to each communication, so that the period
constraint is fulfilled. The mapping is valid if neither processor speeds, nor
link bandwidths, nor network card capacities are exceeded. First, we merge
consecutive stages $[S_a^i, \dots, S_a^j]$ of application $a$ mapped onto a same processor
as one single coalesced stage $\hat{S}_a^k$, with computing cost
$\hat{w}_a^k = \sum_{k'=i}^{j} w_a^{k'}$, and output communication cost
$\hat{\delta}_a^k = \delta_a^j$. The transformed application now has exactly $m_a$
stages. In the following, stage $\hat{S}_a^k$ corresponds to the $k$-th stage of the
transformed application $a$, for $1 \le k \le m_a$.
As for computations, consider a processor $P_u$ and an application $a$. We
define $K_a^u$ such that $k \in K_a^u$ if and only if $\hat{S}_a^k$ is processed by
processor $P_u$; $K_a^u$ is the set of stages of (transformed) application $a$
processed by $P_u$. Then, for all $a$ and $u$, and for each $k \in K_a^u$, we allocate
the speed fraction $s_{a,u}^k = \hat{w}_a^k / P_a$ for $P_u$ to execute $\hat{S}_a^k$.
Similarly for communications, we define $K_a^{u,v}$ such that $k \in K_a^{u,v}$ if and
only if $\hat{S}_a^k$ is processed by $P_u$ and $\hat{S}_a^{k+1}$ is processed by $P_v$,
i.e., there is a communication to pay between $P_u$ and $P_v$. Note that $u \ne v$,
otherwise stages $\hat{S}_a^k$ and $\hat{S}_a^{k+1}$ would have been merged as a single
stage. Formally, $k \in K_a^{u,v} \Leftrightarrow k \in K_a^u$ and $k+1 \in K_a^v$. Then we
allocate the bandwidth fraction $b_{a,u,v}^k = \hat{\delta}_a^k / P_a$ to the communication.
The period of each application can be respected if and only if all the following
inequalities are satisfied. There might be some spare speed and bandwidth if
these are strict inequalities, and resources are fully utilized in case of equalities:
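• $\forall\, 1 \le u \le p$:
\[
\sum_{a=1}^{A} \sum_{k \in K_a^u} s_{a,u}^k \le s_u, \qquad
\sum_{v=1}^{p} \sum_{a=1}^{A} \sum_{k \in K_a^{u,v}} b_{a,u,v}^k \le B_u^{out}, \qquad
\sum_{v=1}^{p} \sum_{a=1}^{A} \sum_{k \in K_a^{v,u}} b_{a,v,u}^k \le B_u^{in};
\]
• $\forall\, 1 \le u, v \le p$, $u \ne v$:
\[
\sum_{a=1}^{A} \left( \sum_{k \in K_a^{u,v}} b_{a,u,v}^k + \sum_{k \in K_a^{v,u}} b_{a,v,u}^k \right) \le b_{u,v}.
\]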
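The validity test above can be sketched as follows in Python. This is only an illustration under stated assumptions: the function name `mapping_is_valid`, the dictionary-based description of applications and the `frozenset` keys used for (bidirectional) links are ours. As in the definition of $K_a^{u,v}$, only communications between consecutive stages are accounted for; summing $w_a^k/P_a$ stage by stage gives the same total as summing the coalesced costs $\hat{w}_a^k/P_a$.

```python
from collections import defaultdict

EPS = 1e-9  # tolerance for floating-point comparisons


def mapping_is_valid(apps, speed, b_out, b_in, link_bw):
    """Check the resource-sharing inequalities for a given general mapping.

    apps    : list of dicts with keys 'w' (stage costs), 'delta' (delta[i] is
              the size of the data sent by the i-th stage, 0-based), 'alloc'
              (processor index of each stage) and 'period' (threshold P_a)
    speed   : speed[u] of processor u
    b_out, b_in : network card capacities of each processor
    link_bw : dict {frozenset({u, v}): bandwidth of the link between u and v}
    """
    cpu = defaultdict(float)
    out = defaultdict(float)
    inc = defaultdict(float)
    link = defaultdict(float)
    for app in apps:
        w, delta, alloc, period = app["w"], app["delta"], app["alloc"], app["period"]
        for i, u in enumerate(alloc):
            cpu[u] += w[i] / period  # speed fraction needed by this stage
            if i + 1 < len(alloc) and alloc[i + 1] != u:
                v = alloc[i + 1]
                frac = delta[i] / period  # bandwidth fraction needed
                out[u] += frac
                inc[v] += frac
                link[frozenset((u, v))] += frac
    return (
        all(cpu[u] <= speed[u] + EPS for u in cpu)
        and all(out[u] <= b_out[u] + EPS for u in out)
        and all(inc[u] <= b_in[u] + EPS for u in inc)
        and all(link[e] <= link_bw.get(e, 0.0) + EPS for e in link)
    )
```

The latency thresholds are checked separately with $(2m_a - 1)P_a \le L_a$, so the function above only deals with the period constraints.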
Note that we can consider mappings without reuse with this latency model.
In this case, if we transform each application a as explained above, the allocation
function of stages Sˆka (for 1 ≤ a ≤ A and 1 ≤ k ≤ ma) is a one-to-one function:
each coalesced stage is allocated onto a distinct processor. It becomes then
much easier to check the validity of the mapping, since each processor is only
handling one single stage, receiving input data from one single other processor,
and sending output data to one single other processor.
4.5 Energy model
The energy consumption of the platform is defined as the sum of the energy
$E(u, \ell)$ consumed by each processor $P_u$ enrolled in the mapping in mode $\ell$.
We assume that $E(u, \ell)$ consists of a static part and of a dynamic part. The
static part $E_{stat}(u)$ is the static cost for a processor to be in service, and does
not depend on the speed $s_{u,\ell}$ at which the processor is running. However,
the static energy is consumed only in mode $\ell \ne 0$ (otherwise, the processor is
inactive, and not enrolled in the mapping). On the contrary, the dynamic part
$E_{dyn}(u, \ell)$ is of the form $E_{dyn}(u, \ell) = s_{u,\ell}^{\alpha}$, where $\alpha > 1$ is an
arbitrary rational number. It is sometimes assumed that $\alpha = 2$ [19], as we did in
the example of Section 3, but all our results hold for any value of $\alpha$. Finally,
for $\ell \ne 0$, we have $E(u, \ell) = E_{stat}(u) + E_{dyn}(u, \ell)$, while $E(u, 0) = 0$.
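The energy model can be illustrated by the short Python sketch below. The function name `platform_energy` is hypothetical, and we assume for the example that mode 0 (processor not enrolled) is represented by a speed equal to 0.

```python
def platform_energy(speeds, e_stat, alpha=2.0):
    """Energy consumed per time unit by the whole platform.

    speeds[u] : speed s_{u,l} of processor u in its current mode
                (assumption of this sketch: 0 means mode 0, i.e., inactive)
    e_stat[u] : static energy E_stat(u) of processor u
    alpha     : exponent of the dynamic part (alpha > 1)
    """
    return sum(e_stat[u] + s ** alpha
               for u, s in enumerate(speeds) if s > 0)
```

For instance, with three processors of static energies 5, 1, 1, the first one being inactive and the other two running at speeds 2 and 3, the energy is $(1 + 2^2) + (1 + 3^2) = 15$ for $\alpha = 2$.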
The energy E(u, ℓ) is an energy consumed per time unit, so we could also
speak of dissipated power. Note that it is mandatory to minimize energy con-
sumption per time unit, because the execution of streaming applications with
arbitrarily many data sets may last for an unbounded amount of time. Hence we
always consider a combination of energy and period objective criteria, because
the latency on its own takes only one single data set into account, and does not
reflect a pipelined execution.
5 Complexity results with the Path model
In this section, we consider the Path model for the computation of the latency,
and therefore we restrict the study to one-to-one and interval mappings with
no resource sharing. General mappings with resource sharing are investigated
in Section 6.
In the following, proc-hom denotes identical speed processors while proc-
het represents heterogeneous processors; com-hom means identical communi-
cation links, while they differ for com-het. We also report results for the case
special-app, which corresponds to applications whose stages are all identical
(all $w_a^k$ are equal), and no communication cost is paid (all $\delta_a^k$ are equal to 0).
We start with the mono-criterion problems of period or latency minimization
in Sections 5.1 and 5.2. In these cases, we do not consider energy minimization
issues, and therefore we can systematically run processors at their highest speed,
and thus use classical results established in a context with no energy. Then we
investigate the following multi-criteria problems: period/latency (Section 5.3),
period/energy (Section 5.4) and period/latency/energy (Section 5.5). We dis-
card the latency/energy combination since, as discussed above, the energy model
holds only in combination with the period criterion.
When dealing with multiple criteria, our approach is to minimize one of
them, given a threshold on the others. Actually, fixing the period or the latency
means fixing a threshold on the period or latency of each application, thus pro-
viding a table of period or latency values. Equivalently, we minimize the value
of Equation (6) with suitable coefficients. For the energy, only a bound on the
global energy consumption is required. Note that all results apply to both the
overlap and no-overlap models, and to all objective functions introduced in Sec-
tion 4.4: more precisely, polynomial problems remain polynomial for arbitrary
weights Wa in Equation (6), while NP-complete problems are already difficult
with Wa = 1. All complexity results are summarized in Section 5.6.
5.1 Period minimization
We show that a greedy assignment solves the problem of finding a one-to-one
mapping on communication homogeneous platforms, but the problem turns NP-
complete with heterogeneous links between the processors. For interval map-
pings, we use an existing algorithm which finds the minimum period in a single
application to build a new polynomial time algorithm that minimizes the global
period of many applications on fully homogeneous platforms, giving the right
number of processors to each application. The problem is NP-complete with
heterogeneous processors, even for the case special-app.
5.1.1 One-to-one mappings
Theorem 1 On communication homogeneous platforms, a one-to-one mapping
that minimizes the period can be determined in polynomial time.
Proof. The following proof is an adaptation of the algorithm described in [6],
which finds the minimum period under the same hypothesis but for a single
application. The main idea remains the same, since on communication homo-
geneous platforms the application that the stage belongs to does not matter for
a one-to-one mapping.
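The optimal period belongs to the set:
\[
T = \left\{ W_a \times \max\left( \frac{\delta_a^{k-1}}{b},\ \frac{w_a^k}{s_u},\ \frac{\delta_a^k}{b} \right) \right\}_{1 \le a \le A,\ 1 \le k \le n_a,\ 1 \le u \le p},
\]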
because it is equal to the product of $W_a$ by the cycle-time of some processor $P_u$,
running in its fastest mode $s_u$, and executing one of the $N$ stages, $S_a^k$. First we
compute the set $T$ and we sort its elements into an array TA. Then, we perform
a binary search on the array TA to find the optimal period, testing at each step
whether the current element $T$ is a feasible value. To do so, we use the greedy
assignment procedure of Algorithm 1. Initially, the current element $T$ is the
median of TA. If the greedy assignment procedure returns “failure”, we increase
the period by jumping to the median of the elements of TA which are larger
than $T$, and if it returns “success”, we jump to the median of the elements of
TA which are smaller than $T$. The algorithm terminates in $\lceil \log |T| \rceil$ iterations,
where $|T|$ is the number of candidate values.

Algorithm 1 Greedy-Assignment(T)
  Work with fastest N processors, numbered $P_1$ to $P_N$, where $s_1 \le s_2 \le \dots \le s_N$
  Mark all stages $S_1$ to $S_N$ as free
  for u = 1 to N do
    Pick up any free stage $S_a^k$ such that:
      $W_a \times \max\left( \frac{\delta_a^{k-1}}{b_a}, \frac{w_a^k}{s_u}, \frac{\delta_a^k}{b_a} \right) \le T$
    if no stage found then
      return “failure”
    end if
    Assign $S_a^k$ to $P_u$
    Mark $S_a^k$ as already assigned
  end for
  return “success”

Note that |T | ≤ N × p (N stages and p processors), hence the total compu-
tation time is O((N × p+ costGA) log(N × p)), where costGA is the cost of the
greedy assignment procedure.
We now describe the greedy assignment algorithm for a prescribed value T
of the achievable period. Recall that there are N stages to map onto p ≥ N
processors in a one-to-one fashion. Also, we target communication homogeneous
platforms with different-speed processors ($s_u \ne s_v$), with different-capacity links
between the applications, but with links of same capacities within an application.
First we retain only the fastest $N$ processors, which we rename $P_1, P_2, \dots, P_N$
such that $s_1 \le s_2 \le \dots \le s_N$. Then we consider the processors in the order $P_1$
to $P_N$, i.e., from the slowest to the fastest, and greedily assign them any free
(not already assigned) stage that they can process within the period.
The proof that the greedy procedure returns a solution if and only if there
exists a solution of period $T$ is done by a simple exchange argument. Indeed,
consider a valid one-to-one assignment of period $T$, denoted $\mathcal{A}$, and assume
that it has assigned stage $S_{a_1}^{k_1}$ to $P_1$. Note first that the greedy procedure will
indeed find a stage to assign to $P_1$ and cannot fail, since $S_{a_1}^{k_1}$ can be chosen.
If the choice of the greedy procedure is actually $S_{a_1}^{k_1}$, we proceed by induction
with $P_2$. If the greedy procedure has selected another stage $S_{a_2}^{k_2}$ for $P_1$, we find
which processor, say $P_u$, has been assigned this stage in the valid assignment $\mathcal{A}$.
Then we exchange the assignments of $P_1$ and $P_u$ in $\mathcal{A}$. As $P_u$ is faster than $P_1$,
which could process $S_{a_1}^{k_1}$ in time in the assignment $\mathcal{A}$, $P_u$ can process
$S_{a_1}^{k_1}$ in time too.
As $S_{a_2}^{k_2}$ has been mapped on $P_1$ by the greedy procedure, $P_1$ can process
$S_{a_2}^{k_2}$ in time. So the exchange is valid, and we can consider the new assignment,
which is valid and which makes the same assignment on $P_1$ as the greedy procedure.
The proof proceeds by induction with $P_2$ as before.
The complexity of the greedy assignment procedure is $cost_{GA} = O(N^2)$,
because of the two loops over processors and stages. Altogether, since N ≤ p,
the cost of Algorithm 1 can be neglected, and the complexity of the whole
algorithm is O((N × p) log(N × p)), which is indeed polynomial in the problem
size.
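In addition, note that this algorithm works with the no-overlap communication
model, by replacing
$W_a \times \max\left( \frac{\delta_a^{k-1}}{b_a}, \frac{w_a^k}{s_u}, \frac{\delta_a^k}{b_a} \right) \le T$
by
$W_a \times \left( \frac{\delta_a^{k-1}}{b_a} + \frac{w_a^k}{s_u} + \frac{\delta_a^k}{b_a} \right) \le T$.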
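The binary search combined with the greedy assignment of Algorithm 1 can be sketched in Python as follows, for the overlap model. The names `cycle_time`, `greedy_assignment` and `min_period`, as well as the encoding of a stage as a tuple $(W_a, \delta_a^{k-1}, w_a^k, \delta_a^k, b_a)$, are assumptions made for this example; the no-overlap variant is obtained by replacing the maximum by a sum in `cycle_time`.

```python
def cycle_time(stage, s):
    """Weighted cycle-time of a stage on a processor of speed s (overlap)."""
    weight, d_in, w, d_out, b = stage
    return weight * max(d_in / b, w / s, d_out / b)


def greedy_assignment(stages, speeds, T):
    """Return {processor: stage} achieving period T, or None on failure."""
    n = len(stages)
    if len(speeds) < n:
        return None
    # Keep the n fastest processors, sorted from the slowest to the fastest.
    procs = sorted(range(len(speeds)), key=lambda u: speeds[u])[-n:]
    free = list(range(n))
    assignment = {}
    for u in procs:
        pick = next((k for k in free
                     if cycle_time(stages[k], speeds[u]) <= T), None)
        if pick is None:
            return None
        assignment[u] = pick
        free.remove(pick)
    return assignment


def min_period(stages, speeds):
    """Binary search over the sorted candidate periods."""
    cand = sorted({cycle_time(st, s) for st in stages for s in speeds})
    lo, hi, best = 0, len(cand) - 1, None
    while lo <= hi:
        mid = (lo + hi) // 2
        assignment = greedy_assignment(stages, speeds, cand[mid])
        if assignment is not None:
            best, hi = (cand[mid], assignment), mid - 1
        else:
            lo = mid + 1
    return best
```

With three stages of costs 6, 2, 4 (unit weights, no communication) and four processors of speeds 1, 2, 3, 4, the sketch keeps the three fastest processors and returns a period of 1.5, the stage of cost 6 being mapped onto the processor of speed 4.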
Theorem 2 On fully heterogeneous platforms, the problem of finding a one-to-
one mapping that minimizes the period is NP-complete.
Proof. As the problem was already NP-complete with one single applica-
tion [6], it remains NP-complete with concurrent applications.
5.1.2 Interval mappings
Theorem 3 On fully homogeneous platforms, an interval mapping that mini-
mizes the period can be determined in polynomial time.
Proof. A polynomial algorithm has already been found to exhibit the minimal
period with one application, under a communication model without overlap [6],
and it can easily be extended to the overlap model, so the following proof is
valid for both models. We exhibit an algorithm (see Algorithm 2) which finds
an optimal interval mapping for concurrent applications, thanks to the previous
polynomial algorithm for a single application, and we show its validity.
Algorithm 2
  Assign all stages of each application to one processor
  Compute the period of all applications
  for i = A + 1 to p do
    Find an application $a'$ such that $W_{a'} \times T_{a'}$ is maximum
    Add one processor to this application
    Compute the new period $T_{a'}$ of this application
  end for
First, here are some notations:
• $(k^u_{a,i})$ is an $A$-tuple which represents the processor distribution among the
applications at step $i$ of Algorithm 2.
• $(k^o_{a,i})$ is an optimal processor distribution with $i$ processors.
• $T_a(n)$ is the period of the application numbered $a$, where $n$ is the number
of processors the application $a$ is assigned to.
• $T(d) = \max_{a \in \{1,\dots,A\}} W_a \times T_a(d_a)$, where $d$ is an $A$-tuple.
Let us prove now the optimality of Algorithm 2.
• $(k^u_{a,A})$ is the best distribution with $A$ processors, because it is the only
one.
• Let us assume that $(k^u_{a,i})$ is optimal with $i$ processors. We want $(k^u_{a,i+1})$
to be an optimal distribution with $i+1$ processors.
  – Either: $\exists a,\ k^o_{a,i+1} < k^u_{a,i}$.
    In this case, by construction,
    \[
    \exists i' < i, \quad T((k^u_{a,i'})) = W_a \times T_a(k^u_{a,i'}) = W_a \times T_a(k^o_{a,i+1}).
    \]
    Now, because every $T_a$ is non-increasing in the number of processors and
    $x \mapsto W_a \times x$ is non-decreasing,
    $T((k^u_{a,i+1})) \le T((k^u_{a,i})) \le T((k^u_{a,i'}))$, and by definition
    $W_a \times T_a(k^o_{a,i+1}) \le T((k^o_{a,i+1}))$.
    Finally, $T((k^u_{a,i+1})) \le T((k^o_{a,i+1}))$.
  – Or: $\exists! a,\ k^o_{a,i+1} = k^u_{a,i} + 1$
    ∗ either: $k^u_{a,i+1} = k^u_{a,i} + 1$ and we are done,
    ∗ or: $\exists a' \ne a,\ k^u_{a',i+1} = k^u_{a',i} + 1$.
      In this case, by construction,
      $T((k^u_{a,i})) = W_{a'} \times T_{a'}(k^u_{a',i}) = W_{a'} \times T_{a'}(k^o_{a',i+1})$
      because $k^u_{a',i} = k^o_{a',i+1}$. Thus $T((k^u_{a,i})) \le T((k^o_{a,i+1}))$.
      Finally, $T((k^u_{a,i+1})) \le T((k^o_{a,i+1}))$.
  Overall we have shown that $(k^u_{a,i+1})$ is as good as $(k^o_{a,i+1})$.
• By induction, the algorithm finds an optimal solution to map $A$ applications
onto $p$ processors.
The complexity of computing the period of application $a$ with $q \le p$ processors,
keeping the intermediate result with $q - 1$ processors, is bounded by
$O((n_a)^3 q)$ [6]. Let $n_{max} = \max_{a \in \{1,\dots,A\}} n_a$. Since we perform at most $p$ steps
in the algorithm, and $q \le p$, the complexity of Algorithm 2 is bounded by
$O(n_{max}^3 p^2)$, which is indeed polynomial in the problem size.
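The processor distribution of Algorithm 2 can be sketched in Python as shown below. Note that the single-application routine used here is a simple dynamic program written for this illustration (overlap model, at most `q` processors), not the algorithm of [6]; the names `single_app_period` and `distribute_processors` and the encoding of an application as a pair `(w, delta)` are assumptions of the sketch.

```python
def single_app_period(w, delta, s, b, q):
    """Minimum period of one application with at most q identical processors.

    w[i-1] is the cost of stage i, delta[i] the size of the data output by
    stage i (delta[0] is the input of the first stage), s the processor speed
    and b the link bandwidth. Overlap model.
    """
    n = len(w)
    prefix = [0]
    for x in w:
        prefix.append(prefix[-1] + x)

    def cost(j, i):  # stages j+1..i on one processor
        return max(delta[j] / b, (prefix[i] - prefix[j]) / s, delta[i] / b)

    inf = float("inf")
    # best[k][i]: minimum period for stages 1..i with at most k processors
    best = [[0.0] + [inf] * n for _ in range(q + 1)]
    for k in range(1, q + 1):
        for i in range(1, n + 1):
            best[k][i] = min(max(best[k - 1][j], cost(j, i)) for j in range(i))
    return best[q][n]


def distribute_processors(apps, weights, p, s, b):
    """Algorithm 2: greedy distribution of p processors among applications."""
    A = len(apps)
    if p < A:
        raise ValueError("at least one processor per application is needed")
    k = [1] * A
    T = [single_app_period(w, d, s, b, 1) for (w, d) in apps]
    for _ in range(p - A):
        # Give one more processor to an application of maximum weighted period.
        a = max(range(A), key=lambda x: weights[x] * T[x])
        k[a] += 1
        T[a] = single_app_period(*apps[a], s, b, k[a])
    return k, max(weights[a] * T[a] for a in range(A))
```

For two applications without communication, one with four stages of cost 4 and one with two stages of cost 2, and $p = 5$ unit-speed processors, the sketch allocates four processors to the first application and one to the second, for a global period of 4.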
Theorem 4 On communication homogeneous platforms, the problem of finding
an interval mapping that minimizes the period is NP-complete.
Proof. As the problem was already NP-complete with one single applica-
tion [6], it remains NP-complete with concurrent applications.
The case special-app is more interesting, because a polynomial algorithm
exists to find an interval mapping which minimizes the period of one single
application [7]; however, the problem becomes NP-complete with several appli-
cations.
Theorem 5 With several applications, heterogeneous processors, homogeneous
pipelines without communication, the problem of finding an interval mapping
which minimizes respectively $\max_{a \in \{1,\dots,A\}} T_a$,
$\max_{a \in \{1,\dots,A\}} W_a \times T_a$, or $\max_{a \in \{1,\dots,A\}} T_a / T_a^*$,
is NP-complete (in the strong sense).
Proof. First we focus on the first problem, i.e., minimizing maxa∈{1,...,A} Ta.
We consider the associated decision problem: given a period T, is there
a mapping of period less than T? The problem is obviously in NP: given a
period and a mapping, it is easy to check in polynomial time that it is valid by
computing its period.
To establish the completeness, we use a reduction from 3-partition [14].
We consider an instance I1 of 3-partition: given an integer $B$ and $3m$ positive
integers $a_1, a_2, \dots, a_{3m}$ such that for all $i \in \{1,\dots,3m\}$, $B/4 < a_i < B/2$ and
with $\sum_{i=1}^{3m} a_i = mB$, does there exist a partition $I_1, \dots, I_m$ of $\{1,\dots,3m\}$
such that for all $j \in \{1,\dots,m\}$, $|I_j| = 3$ and $\sum_{i \in I_j} a_i = B$?
As 3-partition is NP-complete in the strong sense, we can encode the 3m
numbers in unary, and assume that the size of I1 is O(mB).
We build an instance I2 of our problem with m identical applications such
that each application is composed of B stages, with w = 1, and p = 3m proces-
sors with speeds aj for each j ∈ {1, . . . , 3m}. We ask whether it is possible
to realize a period of 1. Clearly, the size of I2 is polynomial in the size of I1
(coded in unary). We now show that instance I1 has a solution if and only if
instance I2 does.
Suppose first that I1 has a solution. Let $I_j = \{a'_{1,j}, a'_{2,j}, a'_{3,j}\}$, for
$j \in \{1,\dots,m\}$. For each $j \in \{1,\dots,m\}$, we assign the $a'_{1,j}$ first consecutive
stages of the application $j$ to the processor of speed $a'_{1,j}$, the $a'_{2,j}$ next stages
to the processor of speed $a'_{2,j}$, and the $a'_{3,j}$ remaining stages to the processor
of speed $a'_{3,j}$. As the period of every processor is clearly equal to 1, the period is 1.
Suppose now that I2 has a solution. As the sum of all computation times
is equal to the sum of all processor speeds, and a processor cannot be assigned
stages of two different applications, for each application, the sum of its compu-
tation times is equal to the sum of the speed of processors which are assigned
a stage of this application. Now, for all i ∈ {1, . . . , 3m}, B/4 < ai < B/2, so
there are exactly three processors involved in the processing of each applica-
tion. We can derive easily a solution to I1 (set Ij corresponding to processors
of application j).
As there is no communication, this proof is valid for both communication
models.
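The forward direction of the reduction can be checked on a small example with the Python sketch below. It builds the interval mapping described above from a solution of 3-partition and computes the resulting period; the function names and the representation of a solution as triples of 0-based indices are assumptions of this illustration.

```python
def mapping_from_partition(a, B, triples):
    """For each application, return its intervals (first, last, speed).

    a       : the 3m integers of the 3-partition instance
    B       : target sum (each application has B unit-cost stages)
    triples : solution of 3-partition, as m triples of indices into a
    """
    mapping = []
    for triple in triples:
        assert sum(a[i] for i in triple) == B
        start, intervals = 1, []
        for i in triple:
            # The processor of speed a[i] receives a[i] consecutive stages.
            intervals.append((start, start + a[i] - 1, a[i]))
            start += a[i]
        mapping.append(intervals)
    return mapping


def global_period(mapping):
    """Unit-cost stages, no communication: cycle-time = #stages / speed."""
    return max((last - first + 1) / speed
               for intervals in mapping
               for first, last, speed in intervals)
```

With $B = 20$ and the integers 6, 7, 7, 6, 6, 8 (so that $m = 2$), the partition $\{6, 7, 7\}$, $\{6, 6, 8\}$ yields a mapping of period 1, as expected.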
For the second problem, we follow the previous proof, but we assume now
that, for each $a \in \{1,\dots,A\}$, for $k \in \{1,\dots,m\}$, $w_a^k = 1/W_a$. Then we scale
each application: each $w_a^k$ is multiplied by $W_a$ so that the new period $T'_a$ of the
application $a$ will be $W_a T_a$. We are now in the case of the previous proof.
Finally, for the third problem, we build the same instance as the one of
the first proof. As the pipeline applications are all similar, the period of those
applications when they are alone on the platform are all the same. We finally
just have to minimize maxa∈{1,...,A} Ta.
5.2 Latency minimization
We show that finding a one-to-one mapping which minimizes the latency is
NP-complete as soon as the processors do not have the same speed thanks
to a reduction from 3-partition. However we write a greedy algorithm that
finds the optimal interval mapping on communication homogeneous platforms.
The problem is still NP-complete on fully heterogeneous platforms for interval
mappings.
Note that the latency expression does not depend on the communication model,
thus the results of this section are valid for the overlap and no-overlap models.
5.2.1 One-to-one mappings
Theorem 6 The problem of finding the one-to-one mapping which minimizes
the latency on fully homogeneous platforms is polynomial.
Proof. As all mappings are equivalent, the theorem is true.
The case special-app is more interesting, because a polynomial algorithm
exists to find a one-to-one mapping which minimizes the latency of one sin-
gle application [8]; however, the problem becomes NP-complete with several
concurrent applications.
Theorem 7 With several applications, heterogeneous processors, homogeneous
pipelines without communication, the problem of finding the optimal one-to-one
mapping which minimizes respectively $\max_{a \in \{1,\dots,A\}} L_a$,
$\max_{a \in \{1,\dots,A\}} W_a \times L_a$, or $\max_{a \in \{1,\dots,A\}} L_a / L_a^*$,
is NP-complete (in the strong sense).
Proof. First we focus on the first problem, i.e., minimizing maxa∈{1,...,A} La.
We consider the associated decision problem: given a latency L, is there
a mapping of latency less than L? The problem is obviously in NP: given a
latency and a mapping, it is easy to check in polynomial time that it is valid by
computing its latency.
To establish the completeness, we use a reduction from 3-partition. We
consider an instance I1 of 3-partition: given an integer $B$ and $3m$ positive
integers $a_1, a_2, \dots, a_{3m}$ such that for all $i \in \{1,\dots,3m\}$, $B/4 < a_i < B/2$ and
with $\sum_{i=1}^{3m} a_i = mB$, does there exist a partition $I_1, \dots, I_m$ of $\{1,\dots,3m\}$
such that for all $j \in \{1,\dots,m\}$, $|I_j| = 3$ and $\sum_{i \in I_j} a_i = B$?
We build an instance I2 of our problem with m identical applications, each
one composed of 3 stages with w = 1, and p = 3m processors with speeds 1/aj
for j ∈ {1, . . . , 3m}. We ask whether it is possible to realize a global latency
of B. Clearly, the size of I2 is polynomial in the size of I1. We now show that
instance I1 has a solution if and only if instance I2 does.
Suppose first that I1 has a solution. Let, for each $j \in \{1,\dots,m\}$,
$I_j = \{a'_{1,j}, a'_{2,j}, a'_{3,j}\}$. For each $j \in \{1,\dots,m\}$, for $i \in \{1,2,3\}$ we
assign the $i$-th stage of the application $j$ to the processor whose speed is equal to
$1/a'_{i,j}$. The global latency is clearly $B$.
Suppose now that I2 has a solution. There exists a partition $I_1, \dots, I_m$ of
$\{1,\dots,3m\}$ such that for all $j \in \{1,\dots,m\}$, $|I_j| = 3$ and
$\sum_{i \in I_j} a_i \le B$. Since $\sum_{i=1}^{3m} a_i = mB$, we have,
$\forall j \in \{1,\dots,m\}$, $\sum_{i \in I_j} a_i = B$. We conclude that I1
has a solution.
For the second problem, the proof is the same as the previous one, but we
have now $w_a^1 = w_a^2 = w_a^3 = 1/W_a$.
For the third problem, the proof is similar to the first one, but we ask
now whether it is possible to realize a global latency of $K \times B$, where $K$ is
the sum of the three biggest $a_i$. All applications have indeed the same latency
when they are alone on the platform, and this latency is $K$. Instead
of minimizing $\max_{a \in \{1,\dots,A\}} \frac{L_a}{L_a^*}$, we minimize
$\max_{a \in \{1,\dots,A\}} \frac{L_a}{K}$, so we minimize $\max_{a \in \{1,\dots,A\}} L_a$.
5.2.2 Interval mappings
Theorem 8 On communication homogeneous platforms, the optimal interval
mapping which minimizes the latency can be determined in polynomial time.
Proof. First, note that with a single application, the optimal mapping is
obtained by mapping the whole application onto one processor. Indeed, if two
distinct processors were enrolled in the computation, mapping the entire applica-
tion onto the fastest processor would reduce the computation time and remove
the communication cost. Therefore, with several concurrent applications, we
keep the A fastest processors and map the applications onto those processors in
a one-to-one fashion. The greedy procedure written for the period minimization
problem with one-to-one mapping can be reused.
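The optimal latency belongs to the set:
\[
L = \left\{ W_a \times \left( \frac{\delta_a^0}{b} + \frac{\sum_{k=1}^{n_a} w_a^k}{s_u} + \frac{\delta_a^{n_a}}{b} \right) \right\}_{1 \le a \le A,\ 1 \le u \le p}.
\]
Since $|L| = Ap$, the complexity of the algorithm is $O((Ap + A^2) \log(Ap))$,
and it can be simplified to $O(Ap \log(Ap))$.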
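A possible Python sketch of this procedure is given below: each application is mapped as a whole onto one of the $A$ fastest processors, and the greedy test is driven by a binary search over the candidate latencies. The function name `min_global_latency` and the encoding of an application as a triple $(\delta_a^0, \sum_k w_a^k, \delta_a^{n_a})$ are assumptions made for this example.

```python
def min_global_latency(apps, weights, speeds, b):
    """Minimum weighted latency when each application uses one processor.

    apps[a]    : (d_in, total_work, d_out) for application a
    weights[a] : weight W_a
    speeds     : processor speeds (at least as many as applications)
    b          : bandwidth of every link
    """
    A = len(apps)
    fastest = sorted(speeds)[-A:]  # A fastest speeds, slowest first

    def latency(a, s):
        d_in, work, d_out = apps[a]
        return weights[a] * (d_in / b + work / s + d_out / b)

    def feasible(bound):
        free = list(range(A))
        for s in fastest:
            pick = next((a for a in free if latency(a, s) <= bound), None)
            if pick is None:
                return False
            free.remove(pick)
        return True

    cand = sorted({latency(a, s) for a in range(A) for s in speeds})
    lo, hi, best = 0, len(cand) - 1, None
    while lo <= hi:
        mid = (lo + hi) // 2
        if feasible(cand[mid]):
            best, hi = cand[mid], mid - 1
        else:
            lo = mid + 1
    return best
```

With two applications of total work 12 and 6 (no communication, unit weights) and processors of speeds 1, 2, 3, the larger application is mapped onto the fastest processor and the global latency is 4.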
Theorem 9 On fully heterogeneous platforms, the problem of finding an opti-
mal interval mapping, that minimizes the latency, is NP-complete.
Proof. As the problem of finding the interval mapping, which minimizes the
latency on fully heterogeneous platforms, was already NP-complete with one sin-
gle application [8], it remains NP-complete with several concurrent applications.
5.3 Period/latency minimization
In this section again, we are not concerned with energy minimization issues,
so, similarly to results of Sections 5.1 and 5.2, all processors can be run sys-
tematically at their highest speed. Therefore, on fully homogeneous platforms,
all one-to-one mappings are identical, and it is straightforward to minimize the
latency for a given period, or the converse.
However, for interval mappings, we must decide where to split applications
into intervals, and we provide a dynamic programming algorithm which solves
both variants of the problem with a single application. When considering mul-
tiple applications, we need to run the dynamic programming algorithm once
per application with the corresponding period (resp. latency) threshold, and
the minimum latency (resp. period) that can then be achieved is the maximum
over all applications.
Theorem 10 With one application, on fully homogeneous platforms, the opti-
mal interval mapping which minimizes the latency for a bounded period, or the
period for a bounded latency, can be determined in polynomial time.
Proof. We denote by n the number of stages, s the speed of every processor
and b their bandwidth.
We exhibit a dynamic programming algorithm which computes the optimal
mapping that minimizes the latency for a given period. We compute recursively
the values of (L, T )(i, q), which are the optimal latency and period that can
be achieved by any interval-based mapping of stages S1 to Si using exactly q
processors. The recurrence relation can be expressed as:
\[
(L, T)(i, q) = \min_{1 \le j < i} \left(
L(j, q-1) + \frac{\sum_{k=j+1}^{i} w^k}{s} + \frac{\delta^i}{b},\
\max\left( T(j, q-1),\ \max\left( \frac{\delta^j}{b}, \frac{\sum_{k=j+1}^{i} w^k}{s}, \frac{\delta^i}{b} \right) \right)
\right).
\]
This relation holds for all $i > 1$ and $q > 1$. The function “min” keeps the
pair such that the period is not greater than the given period and the latency
is minimum. If such a pair does not exist, it returns $(+\infty, +\infty)$.
The initialization relations are:
• If there is only one processor, we map the whole interval onto this processor.
For each $i \in \{1,\dots,n\}$:
\[
(L, T)(i, 1) = \left( \frac{\delta^0}{b} + \frac{\sum_{k=1}^{i} w^k}{s} + \frac{\delta^i}{b},\
\max\left( \frac{\delta^0}{b}, \frac{\sum_{k=1}^{i} w^k}{s}, \frac{\delta^i}{b} \right) \right)
\]
• If $q > 1$ (too many processors for one stage):
\[
(L, T)(1, q) = (+\infty, +\infty)
\]
Finally we aim at computing:
\[
\min_{q \in \{1,\dots,p\}} (L, T)(n, q).
\]
This dynamic programming algorithm solves the problem of finding a map-
ping, which minimizes the latency for a given period, with a complexity in
O(n2p).
For the converse problem of finding a mapping which minimizes the period
for a given latency, we use a binary search. The minimum period belongs to the
set:
\[
T = \left\{ \frac{\sum_{k=i}^{j} w^k}{s} \right\}_{1 \le i \le j \le n} \cup \left\{ \frac{\delta^i}{b} \right\}_{0 \le i \le n}
\]
Moreover, if a mapping realizes a period $T$ and a latency $L$, then it also satisfies
any period bound $T_2 > T$ with the same latency $L_2 = L$. We conclude that the
algorithm which minimizes the latency for a given period $T_{lim}$ will find a latency
at least as large as the one which minimizes the latency for a given period
$T^2_{lim} > T_{lim}$. We can thus minimize the period for a given latency thanks to a
binary search on the period and some calls to the previous algorithm, which
minimizes the latency for a given period.
Since $|T| = \frac{n(n+1)}{2} + n$, the complexity of this problem is
$O((n^2 + n^2 p) \log(n))$, i.e., $O(n^2 p \log(n))$.
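The dynamic program and the binary search of this proof can be sketched in Python as follows, for the overlap model. Since the period only acts as a threshold, the sketch stores the latency alone and discards any interval whose cycle-time exceeds the bound. The function names and argument layout are assumptions made for this illustration, and exact comparisons are used, which is safe for integer costs.

```python
import math


def min_latency_for_period(w, delta, s, b, p, t_max):
    """Minimum latency of an interval mapping whose period is at most t_max.

    w[i-1] is the cost of stage i, delta[i] the size of the data output by
    stage i (delta[0]: input of the first stage); s, b: speed and bandwidth;
    p: number of processors. Returns math.inf if no mapping exists.
    """
    n = len(w)
    prefix = [0]
    for x in w:
        prefix.append(prefix[-1] + x)

    def comp(j, i):  # computation time of stages j+1..i
        return (prefix[i] - prefix[j]) / s

    def cycle(j, i):  # cycle-time of the interval j+1..i
        return max(delta[j] / b, comp(j, i), delta[i] / b)

    # lat[i][q]: best latency for stages 1..i with exactly q processors
    lat = [[math.inf] * (p + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        if cycle(0, i) <= t_max:
            lat[i][1] = delta[0] / b + comp(0, i) + delta[i] / b
    for q in range(2, p + 1):
        for i in range(2, n + 1):
            for j in range(1, i):
                if lat[j][q - 1] < math.inf and cycle(j, i) <= t_max:
                    lat[i][q] = min(lat[i][q],
                                    lat[j][q - 1] + comp(j, i) + delta[i] / b)
    return min(lat[n][1:])


def min_period_for_latency(w, delta, s, b, p, l_max):
    """Binary search on the candidate periods; None if l_max is unreachable."""
    n = len(w)
    cand = {sum(w[i:j + 1]) / s for i in range(n) for j in range(i, n)}
    cand |= {d / b for d in delta}
    cand = sorted(cand)
    lo, hi, best = 0, len(cand) - 1, None
    while lo <= hi:
        mid = (lo + hi) // 2
        if min_latency_for_period(w, delta, s, b, p, cand[mid]) <= l_max:
            best, hi = cand[mid], mid - 1
        else:
            lo = mid + 1
    return best
```

With four stages of cost 3, internal communications of size 1, unit speed and bandwidth and $p = 3$ processors, a period bound of 6 forces a split into two intervals of two stages and yields a latency of 13; conversely, a latency bound of 13 yields a minimum period of 6.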
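The proof of this theorem under the no-overlap communication model is very
similar: all we have to do is to replace
$\max\left( \frac{\delta^j}{b}, \frac{\sum_{k=j+1}^{i} w^k}{s}, \frac{\delta^i}{b} \right)$ by
$\frac{\delta^j}{b} + \frac{\sum_{k=j+1}^{i} w^k}{s} + \frac{\delta^i}{b}$ in the recurrence relation,
$\max\left( \frac{\delta^0}{b}, \frac{\sum_{k=1}^{i} w^k}{s}, \frac{\delta^i}{b} \right)$ by
$\frac{\delta^0}{b} + \frac{\sum_{k=1}^{i} w^k}{s} + \frac{\delta^i}{b}$ in the
first initialization relation, and the previous $T$ by
\[
T = \left\{ \frac{\delta^{i-1}}{b} + \frac{\sum_{k=i}^{j} w^k}{s} + \frac{\delta^j}{b} \right\}_{1 \le i \le j \le n}.
\]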
Theorem 11 With several applications, on fully homogeneous platforms, the
optimal interval mapping which minimizes the latency
$L = \max_{a \in \{1,\dots,A\}} W_a \times L_a$ for a bounded period by application, or the
period $T = \max_{a \in \{1,\dots,A\}} W_a \times T_a$ for a bounded latency by application
can be determined in polynomial time.
Proof. For several applications, we can reuse the structure of Algorithm 2,
but instead of computing the period, we compute both period and latency,
thanks to one of the previous algorithms for one single application (dynamic
programming algorithm if we minimize the global latency for given periods, and
binary search combined with dynamic programming algorithm if we minimize
the global period for given latencies, see proof of Theorem 10). While there
are some processors which are not yet allocated, we add one processor to any
application which maximizes the criterion we want to minimize (if the bound
on the other criterion is exceeded, the first criterion is set to +∞, according to
the single application algorithm).
Since there is a total of p calls to the single-application algorithms, and
a total of N application stages, the complexity is in $O((Np)^2 \log(N))$ for the
period minimization with a bounded latency, and in $O((Np)^2)$ for the latency
minimization with a bounded period.
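The greedy allocation loop of this proof can be sketched as follows. This is only an illustration: Algorithm 2 is not reproduced here, so we assume that every application starts with one processor, and `criterion(a, q)` is a hypothetical oracle standing for the single-application algorithm (it returns the weighted criterion of application a on q processors, or +∞ when the bound on the other criterion is exceeded).

```python
def greedy_allocation(num_apps, p, criterion):
    """Distribute p processors among applications, one processor at a time.

    criterion(a, q) is a hypothetical oracle: the weighted value of the
    criterion to minimize for application a on q processors.
    """
    if p < num_apps:
        raise ValueError("each application needs at least one processor")
    alloc = [1] * num_apps  # assumed starting point: one processor each
    values = [criterion(a, 1) for a in range(num_apps)]
    for _ in range(p - num_apps):
        # Give the next processor to an application that currently
        # maximizes the criterion we want to minimize.
        worst = max(range(num_apps), key=lambda a: values[a])
        alloc[worst] += 1
        values[worst] = criterion(worst, alloc[worst])
    return alloc, max(values)
```

For instance, with a toy oracle `criterion = lambda a, q: work[a] / q` and `work = [8.0, 2.0]`, four processors are split as three for the first application and one for the second.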
When moving to a platform with heterogeneous processors, even if the ap-
plication is homogeneous with no communication (case special-app), the prob-
lem of finding a one-to-one or interval mapping that solves the bi-criteria pe-
riod/latency problem is NP-complete. This result is a direct consequence of the
NP-completeness of the mono-criterion cases, see Sections 5.1 and 5.2.
Theorem 12 With heterogeneous processors and homogeneous pipelines, with-
out communication, the problem of finding an interval or one-to-one mapping,
that solves the bi-criteria period/latency problem, is NP-complete.
Proof. The problem of minimizing the latency with a one-to-one mapping is
NP-complete, so finding a one-to-one mapping that minimizes the latency for a
given array of periods is NP-complete too.
The problem of minimizing the period with an interval mapping is
NP-complete, so finding an interval mapping that minimizes the period for a fixed
latency by application is NP-complete too.
5.4 Period/energy minimization
We first provide results for one-to-one mappings, and then discuss interval map-
pings. For fully heterogeneous platforms, the problem is NP-hard because the
period minimization problem already is NP-hard on such platforms. The inter-
esting result is the following:
Theorem 13 On communication homogeneous platforms, a one-to-one map-
ping which minimizes the energy consumption while enforcing a given period for
each application can be determined in polynomial time.
Proof. We build a bipartite graph G = (U, V, E), and prove that the problem
amounts to finding a minimum weighted matching in this graph. U is the
processor set, and V the stage set. For each processor and each stage, the
weight of the edge between the two vertices is set to +∞ if the processor cannot
execute the stage within the period; otherwise, it is the energy consumed by the
processor when it runs in the slowest mode that allows the stage to be executed
within the period. Finding a minimum weighted matching gives us the minimum
power consumption, in polynomial time $O\left((N+p)^{\frac{3}{2}}\right)$.
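The construction of the edge weights can be illustrated with the following Python sketch. Several points are assumptions made for the example only: a stage is a tuple (w, δ_in, δ_out, period bound), a processor is a pair (list of speeds, static energy), and the energy of a mode is taken as E_stat + s^α. For brevity, the matching itself is found by exhaustive search over assignments, which is exponential and only a stand-in for a polynomial minimum weighted matching algorithm.

```python
import math
from itertools import permutations

def edge_weight(stage, speeds, e_stat, b, alpha):
    """Energy of the slowest mode meeting the stage's period, or +inf."""
    w, d_in, d_out, period = stage
    for s in sorted(speeds):
        if max(d_in / b, w / s, d_out / b) <= period:
            return e_stat + s ** alpha
    return math.inf

def min_energy_one_to_one(stages, processors, b=1.0, alpha=3):
    """Exhaustive search over one-to-one assignments (illustration only)."""
    weights = [[edge_weight(st, speeds, e_stat, b, alpha)
                for (speeds, e_stat) in processors] for st in stages]
    best_energy, best_assignment = math.inf, None
    for procs in permutations(range(len(processors)), len(stages)):
        energy = sum(weights[k][u] for k, u in enumerate(procs))
        if energy < best_energy:
            best_energy, best_assignment = energy, procs
    return best_energy, best_assignment
```

An infinite result means that no one-to-one mapping satisfies all period bounds.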
For interval mappings, first note that the problem becomes NP-complete as
soon as we consider different speed processors, because of the NP-completeness
of the period minimization problem for such platforms. Thus we focus on fully
homogeneous platforms.
Theorem 14 On fully homogeneous platforms, an interval mapping which min-
imizes the energy consumption while enforcing a given period for each application
can be determined in polynomial time.
Proof. We first exhibit a dynamic programming algorithm that returns the
optimal energy consumption for a single application, when using exactly k
processors to compute the application. This algorithm fixes the processor speeds
so as to minimize the energy. Then, the multiple application case can be solved
using another dynamic programming algorithm, which decides how many
processors should be allocated to each application.
For a single application $a \in \{1,\dots,A\}$, and a processor number $q \in \{1,\dots,p\}$,
we compute $E_a^q$, the minimum energy consumed for the application a using at
most q processors. We recursively compute the value $E(i,j,k)$, which is the
optimal energy consumption that can be achieved by any interval-based
mapping of stages $S_a^i$ to $S_a^j$ using exactly k processors. The goal is to determine
$E_a^q = \min_{k \in \{1,\dots,q\}} E(1, n_a, k)$. The recurrence relation can be expressed as:
\[
E(i,j,k) = \min_{i \le \ell \le j-1} \left( E(i,\ell,k-1) + E(\ell+1,j,1) \right)
\]
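with the initialization:
• $E(i,i,r) = +\infty$ if $r > 1$
• Defining
\[
F_i^j = \left\{ E_{dyn}(s_\ell) + E_{stat} \ \middle|\ \max\left( \frac{\delta^{i-1}}{b}, \frac{\sum_{k=i}^{j} w^k}{s_\ell}, \frac{\delta^j}{b} \right) \le T \right\}_{1 \le \ell \le m}
\]
we have:
\[
E(i,j,1) = \begin{cases} \min F_i^j & \text{if } F_i^j \neq \emptyset \\ +\infty & \text{otherwise} \end{cases}
\]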
Here, m is the number of speed modes, and T is the period bound for the
application a. The complexity of this dynamic programming algorithm is bounded
by $O(n_a^2 (p + m))$.
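Note that for the no-overlap model, we simply replace
$\max\left( \frac{\delta^{i-1}}{b}, \frac{\sum_{k=i}^{j} w^k}{s_\ell}, \frac{\delta^j}{b} \right)$ by
$\frac{\delta^{i-1}}{b} + \frac{\sum_{k=i}^{j} w^k}{s_\ell} + \frac{\delta^j}{b}$ in the definition of $F_i^j$.
Note also that $E_a^k = +\infty$ if the algorithm fails to match the period T.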
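The single-application recurrence can be sketched in Python as follows, for the overlap model. This is a minimal sketch under assumptions: the function name and arguments are hypothetical, and the dynamic energy of a mode is taken as s^α, so that a processor running at speed s costs E_stat + s^α.

```python
import math
from functools import lru_cache

def single_app_energy(w, delta, speeds, b, e_stat, alpha, period, q):
    """Minimum energy for one application on at most q processors.

    w[k-1] is the cost of stage k and delta[i] the data size after stage i.
    Returns math.inf if the period bound cannot be met.
    """
    n = len(w)
    prefix = [0.0]
    for cost in w:
        prefix.append(prefix[-1] + cost)

    def one_processor(i, j):
        # E(i, j, 1): cheapest mode executing stages i..j within the period.
        work = prefix[j] - prefix[i - 1]
        feasible = [e_stat + s ** alpha for s in speeds
                    if max(delta[i - 1] / b, work / s, delta[j] / b) <= period]
        return min(feasible, default=math.inf)

    @lru_cache(maxsize=None)
    def energy(i, j, k):
        if k == 1:
            return one_processor(i, j)
        if i == j:
            return math.inf  # one stage cannot use several processors
        return min(energy(i, l, k - 1) + one_processor(l + 1, j)
                   for l in range(i, j))

    return min(energy(1, n, k) for k in range(1, q + 1))
```

Memoizing `energy` plays the role of the dynamic programming table; each entry is computed once.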
For several applications, let $E(a,k)$ be the minimum energy consumed by k
processors on the first a applications 1, . . . , a, so we are looking for $E(A,p)$. This
energy can be computed recursively, thanks to the recurrence relation:
\[
\forall k \in \{1,\dots,p\},\ \forall a \in \{2,\dots,A\}, \quad E(a,k) = \min_{q \in \{0,\dots,k-1\}} \left( E_a^q + E(a-1,k-q) \right)
\]
and the initialization: $\forall k \in \{1,\dots,p\},\ E(1,k) = E_1^k$.
The overall complexity is $O(A N^3 p^2)$.
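The allocation of processors to applications can be sketched as follows. The table `per_app` is a hypothetical input: `per_app[a][q]` stands for $E_a^q$ (with application indices starting at 0), and we assume `per_app[a][0]` is +∞ since an application cannot run without a processor.

```python
import math

def allocate_processors(per_app, p):
    """Minimum total energy when p processors are shared by all applications.

    per_app[a][q] is the minimum energy of application a on at most q
    processors, for q in 0..p (math.inf when infeasible).
    """
    best = [per_app[0][k] for k in range(p + 1)]  # E(1, k)
    for a in range(1, len(per_app)):
        new = [math.inf] * (p + 1)
        for k in range(1, p + 1):
            # q processors for application a, k - q for the previous ones.
            new[k] = min(per_app[a][q] + best[k - q] for q in range(0, k))
        best = new
    return best[p]
```

Only the previous row of the table is kept, since E(a, ·) depends on E(a-1, ·) alone.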
5.5 Period/latency/energy minimization
When mixing the three criteria, the problem becomes NP-hard even for fully
homogeneous platforms, no communication, and a single application. The com-
binatorial nature of the problem comes from the fact that even if processors are
identical, they are multi-modal and each of them may run at a different speed.
Theorem 15 On fully homogeneous platforms, with a single application and
without any communication cost, finding a one-to-one mapping that solves the
tri-criteria problem is NP-hard.
Proof. We consider the associated decision problem: given a period T, a
latency L and an energy E, does there exist a one-to-one mapping of period less
than T, latency less than L and energy less than E?
The problem is obviously in NP: given a period, a latency, an energy and a
mapping, it is easy to check in polynomial time that the mapping is valid.
To establish the completeness, we use a reduction from 2-partition [14].
We consider an instance I1 of 2-partition: given n strictly positive integers
$a_1, a_2, \dots, a_n$, does there exist a subset I of $\{1,\dots,n\}$ such that
$\sum_{i \in I} a_i = \sum_{i \notin I} a_i$? Let $S = \sum_{i=1}^{n} a_i$. Let $K = \alpha \times S + 2$, where α is the
exponent used in the computation of the energy (see Section 4.5).
We build an instance I2 of our problem with n identical processors, each
with m = 2n + 1 modes such that:
\[
\forall i \in \{1,\dots,n\} \quad \begin{cases} s_{2i-1} = K^i \\ s_{2i} = K^i + \frac{a_i X}{K^{i(\alpha-1)}} \end{cases}
\]
and a pipelined application composed of n stages, with computation costs
$w_i = K^{i(\alpha+1)}$.
Intuitively, the idea is to choose K such that (i) stage weights are far enough
from one another; and (ii) there is a gap between (s2i−1, s2i) and (s2j−1, s2j).
Then the mapping will use exactly one component of every pair (s2i−1, s2i).
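We claim that for each $j \in \{2,\dots,n\}$, we have
\[
K^{j\alpha} > \sum_{i=1}^{j-1} K^{i\alpha} + \alpha\left( \frac{S}{2} - \frac{1}{2} \right)
\quad \text{and} \quad
K^{j\alpha+1} > \sum_{i=1}^{j} K^{i\alpha} + \left( K^{1-\alpha} \times a_{j-1} + 1 - \frac{S}{2} \right) .
\]
To prove the claim, let $j \in \{2,\dots,n\}$. On the one side,
\[
\begin{aligned}
\sum_{i=1}^{j-1} K^{i\alpha} + \alpha\left( \frac{S}{2} - \frac{1}{2} \right)
&< \sum_{i=1}^{j-1} K^{i\alpha} + \alpha S \\
&< (j-1) K^{(j-1)\alpha} + K \\
&< j K^{(j-1)\alpha} < K^{j\alpha} .
\end{aligned}
\]
On the other side,
\[
\begin{aligned}
\sum_{i=1}^{j} K^{i\alpha} + \left( K^{1-\alpha} \times a_{j-1} + 1 - \frac{S}{2} \right)
&< \sum_{i=1}^{j} K^{i\alpha} + K^{1-\alpha} \times K \\
&< j K^{j\alpha} + K^{2-\alpha} < (j+1) K^{j\alpha} < K^{j\alpha+1} .
\end{aligned}
\]
We deduce that for each $j \in \{2,\dots,n\}$ and each $0 < X < 1$,
\[
K^{j\alpha} > \sum_{i=1}^{j-1} K^{i\alpha} + \alpha X \left( \frac{S}{2} - \frac{1}{2} \right)
\quad \text{and} \quad
K^{j\alpha+1} > \sum_{i=1}^{j} K^{i\alpha} + X \left( K^{1-\alpha} \times a_{j-1} + 1 - \frac{S}{2} \right) .
\]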
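For all $i \in \{1,\dots,n\}$, if we choose speed $s_{2i}$ instead of speed $s_{2i-1}$, the
additional energy is:
\[
\begin{aligned}
s_{2i}^{\alpha} - s_{2i-1}^{\alpha} &= \left( K^i + \frac{a_i X}{K^{i(\alpha-1)}} \right)^{\alpha} - K^{i\alpha} \\
&= K^{i\alpha} \left( 1 + \alpha \frac{a_i X}{K^{i\alpha}} + o(X) \right) - K^{i\alpha} \\
&= \alpha a_i X + f_i^E(X) ,
\end{aligned}
\]
where $f_i^E(X) = o(X)$ as $X \to 0$.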
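In the same way, for each $i \in \{1,\dots,n\}$, the difference in latency when
using speed $s_{2i}$ instead of speed $s_{2i-1}$ to execute stage $S_i$ is:
\[
\begin{aligned}
\frac{w_i}{s_{2i-1}} - \frac{w_i}{s_{2i}} &= \frac{K^{i(\alpha+1)}}{K^i} - \frac{K^{i(\alpha+1)}}{K^i + \frac{a_i X}{K^{i(\alpha-1)}}} \\
&= \frac{K^{i(\alpha+1)}}{K^i} - \frac{K^{i(\alpha+1)}}{K^i} \left( 1 - \frac{a_i X}{K^{i\alpha}} + o(X) \right) \\
&= a_i X - f_i^L(X) ,
\end{aligned}
\]
where $f_i^L(X) = o(X)$ as $X \to 0$.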
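For all $i \in \{2,\dots,n\}$, the time to execute $S_i$ at speed $s_{2i-2}$ is:
\[
\begin{aligned}
\frac{w_i}{s_{2i-2}} &= \frac{K^{i(\alpha+1)}}{K^{i-1} + \frac{a_{i-1} X}{K^{(i-1)(\alpha-1)}}} \\
&= \frac{K^{i(\alpha+1)}}{K^{i-1}} \left( 1 - \frac{a_{i-1} X}{K^{(i-1)\alpha}} + o(X) \right) \\
&= K^{i\alpha+1} - K^{1-\alpha} \times a_{i-1} X + f_{L_i}(X)
\end{aligned}
\]
So we choose $X < 1$ small enough, so that for each $i \in \{1,\dots,n\}$,
\[
\begin{cases} |f_i^E(X)| < X \times \frac{\alpha}{2n} \\ |f_i^L(X)| < X \times \frac{1}{2n} \end{cases}
\]
and for all $i \in \{2,\dots,n\}$, $|f_{L_i}(X)| < X \times \frac{1}{2}$.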
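The first two bounds on X can be checked numerically for a given instance. The following Python sketch builds the speeds and stage weights of the reduction with exact rational arithmetic and tests whether a candidate X satisfies the bounds on $f_i^E$ and $f_i^L$. It is only an illustration: the function names are hypothetical, α is assumed to be an integer so that powers stay exact, and the third bound (on $f_{L_i}$) is not checked.

```python
from fractions import Fraction

def build_instance(a, alpha, X):
    """Speeds s[1..2n] and weights w[1..n] of the instance built from a."""
    n, S = len(a), sum(a)
    K = alpha * S + 2
    speeds, weights = {}, {}
    for i in range(1, n + 1):
        speeds[2 * i - 1] = Fraction(K ** i)
        speeds[2 * i] = K ** i + Fraction(a[i - 1]) * X / K ** (i * (alpha - 1))
        weights[i] = Fraction(K ** (i * (alpha + 1)))
    return K, speeds, weights

def remainders_small_enough(a, alpha, X):
    """True if |f_i^E(X)| < X*alpha/(2n) and |f_i^L(X)| < X/(2n) for all i."""
    n = len(a)
    _, s, w = build_instance(a, alpha, X)
    for i in range(1, n + 1):
        # Remainders of the first-order expansions of energy and latency.
        f_e = s[2 * i] ** alpha - s[2 * i - 1] ** alpha - alpha * a[i - 1] * X
        f_l = a[i - 1] * X - (w[i] / s[2 * i - 1] - w[i] / s[2 * i])
        if abs(f_e) >= X * Fraction(alpha, 2 * n) or abs(f_l) >= X * Fraction(1, 2 * n):
            return False
    return True
```

For example, with a = [1, 2, 3] and α = 3, we get K = 20, and X = 1/1000 passes both tests.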
We are now ready to choose the latency, energy and period bounds. Let $E^*$
and $L^*$ be the energy and latency obtained when $S_i$ is executed at speed $s_{2i-1}$ for
all $i \in \{1,\dots,n\}$: $E^* = \sum_{i=1}^{n} s_{2i-1}^{\alpha} = \sum_{i=1}^{n} K^{i\alpha}$ and
$L^* = \sum_{i=1}^{n} \frac{w_i}{s_{2i-1}} = E^*$.
We ask whether it is possible to achieve an energy $E^o = E^* + \alpha X (S/2 + 1/2)$,
a latency $L^o = L^* - X (S/2 - 1/2)$ and a period $T^o = L^o$.
Clearly, the size of I2 is polynomial in the size of I1. We show that I1 has
a solution if and only if I2 does.
Assume first that I1 has a solution. For each i ∈ I, stage Si is executed at
speed s2i, and for each i ∈ {1, . . . , n} \ I, stage Si is executed at speed s2i−1.
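The mapping consumes an energy E and has a latency L, where:
\[
\begin{aligned}
E &= E^* + \sum_{i \in I} \left( s_{2i}^{\alpha} - s_{2i-1}^{\alpha} \right)
= E^* + \sum_{i \in I} \left( \alpha a_i X + f_i^E(X) \right) \\
&\le E^* + \sum_{i \in I} \left( \alpha a_i X + \frac{\alpha X}{2n} \right)
\le E^* + \alpha X \left( \frac{S}{2} + \frac{1}{2} \right) \le E^o
\end{aligned}
\]
\[
\begin{aligned}
L &= L^* - \sum_{i \in I} \left( \frac{w_i}{s_{2i-1}} - \frac{w_i}{s_{2i}} \right)
= L^* - \sum_{i \in I} \left( a_i X - f_i^L(X) \right) \\
&\le L^* - \sum_{i \in I} \left( a_i X - \frac{X}{2n} \right)
\le L^* - X \left( \frac{S}{2} - \frac{1}{2} \right) \le L^o
\end{aligned}
\]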
Because $T^o = L^o$, and because we fulfill the latency constraint, we fulfill the
period constraint too. We conclude that I2 has a solution.
Suppose now that I2 has a solution. We first show that for each $i \in \{1,\dots,n\}$,
stage $S_i$ is executed at speed either $s_{2i-1}$ or $s_{2i}$. Let $(P_j)$ be the property: for
each $i \in \{j,\dots,n\}$, there is a single processor running at speed $s_{2i-1}$ or $s_{2i}$,
and this processor is assigned stage $S_i$. We first prove that $(P_n)$ is true. On
the one hand, if two processors were running at speed $s_{2n-1}$ or $s_{2n}$, they would
consume an energy
\[
E \ge 2 s_{2n-1}^{\alpha} > K^{n\alpha} + \sum_{i=1}^{n-1} K^{i\alpha} + \alpha X \left( \frac{S}{2} + \frac{1}{2} \right) = E^o .
\]
On the other hand, if no processor was running at speed $s_{2n-1}$ or $s_{2n}$, the
latency would verify
\[
\begin{aligned}
L \ge \frac{w_n}{s_{2n-2}} &\ge K^{n\alpha+1} - K^{1-\alpha} \times a_{n-1} X + f_{L_n}(X) \\
&> \sum_{i=1}^{n} K^{i\alpha} + X \left( K^{1-\alpha} \times a_{n-1} + 1 - \frac{S}{2} \right) - K^{1-\alpha} \times a_{n-1} X + f_{L_n}(X) \\
&> \sum_{i=1}^{n} K^{i\alpha} - X \left( \frac{S}{2} - \frac{1}{2} \right) + \left( \frac{X}{2} + f_{L_n}(X) \right) \\
&> L^o .
\end{aligned}
\]
We conclude that $(P_n)$ is true. We now proceed by induction. If for some
$j \in \{3,\dots,n\}$, $(P_j)$ is true, then we show that $(P_{j-1})$ is true in a quite similar
way. In the end, $(P_2)$ is true (and the processor that is assigned stage $S_1$ is
running either at speed $s_1$, or at speed $s_2$). Let I be the subset of $\{1,\dots,n\}$ such
that the processor that is assigned the stage $S_i$ is running at speed $s_{2i}$. Then for
each $i \in \{1,\dots,n\} \setminus I$, the processor that is assigned stage $S_i$ is running at speed
$s_{2i-1}$. The consumed energy is $E = E^* + \sum_{i \in I} \left( \alpha a_i X + f_i^E(X) \right)$, and $E \le E^o$,
hence
\[
\sum_{i \in I} a_i \le \frac{S}{2} + \left( \frac{1}{2} - \frac{\sum_{i \in I} f_i^E(X)}{\alpha X} \right) .
\]
Therefore $\sum_{i \in I} a_i < \frac{S}{2} + \left( \frac{1}{2} + \frac{1}{2} \right)$.
Since the $a_i$ are integers, we conclude that $\sum_{i \in I} a_i \le \frac{S}{2}$.
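The achieved latency is $L = L^* - \sum_{i \in I} \left( a_i X - f_i^L(X) \right)$, and $L \le L^o$, hence
\[
\sum_{i \in I} a_i \ge \frac{S}{2} - \left( \frac{1}{2} - \frac{\sum_{i \in I} f_i^L(X)}{X} \right) .
\]
Since $\frac{\sum_{i \in I} f_i^L(X)}{X} \le \frac{1}{2}$, we get $\sum_{i \in I} a_i \ge \frac{S}{2}$.
Finally, $\sum_{i \in I} a_i = \frac{S}{2}$ and I1 has a solution, which concludes the proof.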
Theorem 16 On fully homogeneous platforms, with a single application and
without any communication cost, finding an interval mapping that solves the
tri-criteria problem is NP-hard.
Proof. We only give the sketch of the completeness proof, which reuses the
proof of Theorem 15. To construct the instance I2, we insert big stages between
the previous stages. We add a big speed to the processor modes, adjusted to
allow the execution of exactly one big stage during the period. More formally, we
build a pipeline composed of 2n − 1 stages, such that $\forall i \in \{1,\dots,n\}$,
$w_{2i-1} = K^{i(\alpha+1)}$ and $\forall i \in \{1,\dots,n-1\}$, $w_{2i} = K^{(n+1)(\alpha+1)}$. We use 2n − 1 identical
processors, that can run 2n + 1 modes, such that $\forall i \in \{1,\dots,n\}$, $s_{2i-1} = K^i$
and $s_{2i} = K^i + \frac{a_i X}{K^{i\alpha}}$. We also let $s_{2n+1} = K^{n+1}$.
We search for an interval mapping whose energy does not exceed
$E^o = (n-1) K^{(n+1)\alpha} + E^* + \alpha X (S/2 + 1/2)$, whose latency does not exceed
$L^o = (n-1) K^{(n+1)\alpha} + L^* - X (S/2 - 1/2)$, and whose period does not exceed
$T^o = K^{(n+1)\alpha}$.
If the instance I1 of 2-partition has a solution, we proceed as in the previous
proof, and map every big stage onto a processor that is running in its highest
mode. All constraints are fulfilled.
If the instance I2 has a solution, we have to run processors that are assigned a
big stage in their highest mode. Moreover, these processors cannot be assigned
other stages. All we have to do next is to find a one-to-one mapping of the
unassigned stages, with the additional constraint that we cannot run the
remaining processors in their highest modes without exceeding the energy
bound. We then conclude as in the proof of Theorem 15.

                          proc-hom                       proc-het                       proc-het                       proc-het
                          com-hom                        special-app                    com-hom                        com-het
Per - one-to-one          polynomial (binary search)     polynomial (binary search)     polynomial (binary search)     NP-complete
Per - interval            poly (dyn. prog. + greedy)     NP-complete(*)                 NP-complete                    NP-complete
Lat - one-to-one          polynomial                     NP-complete(*)                 NP-complete                    NP-complete
Lat - interval            polynomial (binary search)     polynomial (binary search)     polynomial (binary search)     NP-complete
Per/Lat - both            polynomial                     NP-complete                    NP-complete                    NP-complete
Per/En - one-to-one       polynomial (minimum matching)  polynomial (minimum matching)  polynomial (minimum matching)  NP-complete
Per/En - interval         poly (dyn. prog.)              NP-complete                    NP-complete                    NP-complete
Per/Lat/En - both         NP-complete                    NP-complete                    NP-complete                    NP-complete

Table 1: Complexity results with the Path latency model.
We conclude this section with some remarks on uni-modal processors. If we
restrict to processors with a single execution mode, the problem becomes poly-
nomial on fully homogeneous platforms, while it remains NP-hard otherwise
(because of the NP-completeness of the period/latency problem which is also
established with uni-modal processors). For one-to-one mappings, all mappings
are equivalent on fully homogeneous platforms, but the algorithm is more so-
phisticated for interval mappings. We first write an algorithm which partitions
the stages of a single application into intervals, for each of the three variants
of the tri-criteria optimization problem, and then we use this algorithm for the
multiple application problem. Details can be found in [4].
5.6 Summary of complexity results for the Path model
Table 1 summarizes all complexity results with the Path latency model, for
one-to-one and interval mappings without resource sharing.
For the mono-criterion problems, most NP-completeness proofs come from
the single application problem, which was already NP-hard; see [6, 8] for the
proofs. The two special entries denoted with (*) are problem instances which
can be solved in polynomial time for a single application, but become NP-hard
with several ones. Remaining entries correspond to polynomial algorithms
that already existed for a single application and that have been extended
to several ones.
For the bi-criteria problems, we provide new polynomial algorithms to
minimize one of the criteria, given a bound on the other one. NP-completeness
results are obtained from the mono-criterion complexity results.
Finally, the tri-criteria problem turns out to be NP-hard even for fully ho-
mogeneous platforms, no communication and a single application.
6 Complexity results with the Wavefront model
In the previous section, we have performed an exhaustive complexity study con-
sidering the Path latency model, and hence restricting to mapping rules with-
out resource sharing (one-to-one or interval mappings). We have provided new
polynomial algorithms for multiple applications and results of NP-completeness.
However, when considering resource sharing and general mappings, we use the
Wavefront latency model, as explained in the framework (see Section 4.4).
In this section, we investigate the impact of this model on the complexity re-
sults. Since the latency definition is now closely related to the period definition,
we consider only latency in combination with period. For the period/latency
combination, we minimize the latency for a fixed period. For the tri-criteria
problem, both period and latency are fixed, and we minimize the energy crite-
rion.
Also, we do not restrict the study to one-to-one and interval mappings,
but also discuss general mappings. It turns out that the period minimization
problem is NP-hard for such mappings, even for fully homogeneous platforms, no
communication and a single application. Therefore, all multi-criteria problems
with general mappings are NP-hard.
All results are summarized in Table 2.
6.1 Period minimization
All complexity results for period minimization were already established in Sec-
tion 5.1, except for general mappings. It turns out that the problem is NP-hard
for general mappings, even for fully homogeneous platforms, no communication
and a single application.
Theorem 17 On fully homogeneous platforms with no communication, the prob-
lem of finding a general mapping that minimizes the period of a single application
is NP-complete.
Proof. We use a straightforward reduction from 2-partition [14]:
the application consists of n stages and there are two identical processors. To
reach a period equal to half of the total computation time, stages
must be partitioned in two sets of equal computational weight, which amounts
to solving 2-partition on the stage weights.
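The correspondence used in this proof can be illustrated on small instances with the following Python sketch. It assumes, for the example only, that without communication the period of a general mapping on two identical processors of speed s is the larger of the two processor loads divided by s; the search over subsets is exponential, as expected for an NP-complete problem.

```python
from itertools import combinations

def min_period_two_processors(w, s=1.0):
    """Smallest period of a general mapping of stages w on two processors."""
    total = sum(w)
    best = total / s
    for r in range(len(w) + 1):
        for subset in combinations(range(len(w)), r):
            load = sum(w[k] for k in subset)
            # The period is dictated by the most loaded processor.
            best = min(best, max(load, total - load) / s)
    return best

def has_two_partition(a):
    """The instance a has a 2-partition iff the optimal period is sum(a)/2."""
    return min_period_two_processors(a) == sum(a) / 2
```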
As a corollary, all multi-criteria problems are NP-hard for general mappings,
since they all involve the period criterion (because of the energy and latency
definitions).
6.2 Period/latency minimization
With heterogeneous processors and interval mappings, we already know that
the period minimization problem is NP-hard, and therefore it remains NP-hard
when combining it with the latency criterion. However, the result does not hold
any longer for one-to-one mappings, while the bi-criteria problem was NP-hard
with the Path latency model. Actually, with homogeneous communications,
the latency of an application with n stages is always (2n − 1) × T , where T
is the period of the application, and therefore the latency is minimized when
the period is minimized. The bi-criteria problem amounts in this case to the
period minimization problem, which is polynomial (binary search algorithm, see
Section 5.1).

                          proc-hom            proc-het            proc-het            proc-het
                          com-hom             special-app         com-hom             com-het
Per/* - general           NP-complete         NP-complete         NP-complete         NP-complete
Per/Lat - one-to-one      polynomial          polynomial          polynomial          NP-complete
Per/Lat - interval        polynomial          NP-complete         NP-complete         NP-complete
Per/Lat/En - one-to-one   polynomial          polynomial          polynomial          NP-complete
Per/Lat/En - interval     poly (dyn. prog.)   NP-complete         NP-complete         NP-complete

Table 2: Complexity results with the Wavefront latency model.
For interval mappings on fully homogeneous platforms, we propose below a
polynomial algorithm for the period/latency/energy combination. This
algorithm can also be used to solve the easier bi-criteria problem
with no energy criterion.
6.3 Period/latency/energy minimization
As motivated earlier, we focus on the tri-criteria problem of minimizing energy
under constraints on period and latency. It turns out that this problem becomes
polynomial for interval mappings without resource sharing on fully homogeneous
platforms, while it was NP-complete with the classical definition of latency (see
Theorem 16).
For one-to-one mappings, the problem is polynomial for com-hom platforms
with different speed processors. Indeed, similarly to the period/latency problem,
minimizing the latency is equivalent to minimizing the period for such mappings
because of the Wavefront latency model and the one-to-one mapping.
Theorem 18 With the Wavefront latency model, the tri-criteria problem is
polynomial on fully homogeneous platforms for interval mappings without reuse.
Proof. The optimal solution for interval mappings relies on an intricate nesting
of two dynamic programming algorithms. The first one solves the problem with
one single application: it recursively computes the optimal energy consumption
that can be achieved by mapping one stage interval to exactly q processors.
Then another dynamic programming algorithm finds the minimum energy con-
sumption with several applications, recursively trying all possible distributions
of processors to applications, and using the first algorithm to compute the op-
timal energy consumption for each application, given the number of processors
allocated to this application.
For the single application problem, let n be the number of stages of this
application, $P_{giv}$ be the given period, and $L_{giv}$ be the given latency. First of all,
note that the latency is given by $L = (2m - 1) \times P_{giv}$, where m is the number
of intervals. Therefore, we can compute a priori the maximum possible number
of intervals in the mapping. Let $m_{max}$ be this number; note that it cannot
exceed n, the total number of stages, nor p, the number of processors:
\[
m_{max} = \min\left( n,\ p,\ \left\lfloor \left( \frac{L_{giv}}{P_{giv}} + 1 \right) / 2 \right\rfloor \right) .
\]
If we use more intervals, the bound on the latency
will be exceeded. Otherwise, we just have to check if the period constraint is
fulfilled.
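As a small worked example, the bound on the number of intervals can be computed as follows. The function name is hypothetical; exact rational arithmetic is used only to avoid rounding issues in the floor.

```python
import math
from fractions import Fraction

def max_intervals(n, p, l_giv, p_giv):
    """Largest number m of intervals such that (2m - 1) * p_giv <= l_giv."""
    bound = (Fraction(l_giv) / Fraction(p_giv) + 1) / 2
    return min(n, p, math.floor(bound))
```

For instance, with n = 10 stages, p = 8 processors, a latency bound of 7 and a period of 1, at most 4 intervals can be used, since (2 × 4 − 1) × 1 = 7. A result of 0 means that even a single interval violates the latency bound.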
We exhibit a dynamic programming algorithm that returns the optimal
energy consumption. We compute recursively the value $E(i,j,q)$, which is the
optimal energy consumption that can be achieved by any interval-based
mapping of stages $S_i$ to $S_j$ using exactly q processors. The goal is to determine
$\min_{m \in \{1,\dots,m_{max}\}} E(1,n,m)$. The recurrence relation can be expressed as:
\[
E(i,j,q) = \min_{i \le \ell \le j-1} \left( E(i,\ell,q-1) + E(\ell+1,j,1) \right) ,
\]
with the initializations:
• $E(i,i,q) = +\infty$ if $q > 1$ (we cannot run one stage with many processors);
• $E(i,j,1) = \begin{cases} \min F^{i,j} & \text{if } F^{i,j} \neq \emptyset \\ +\infty & \text{otherwise,} \end{cases}$
where
\[
F^{i,j} = \left\{ E_{dyn}(s) + E_{stat} \ \middle|\ \max\left( \frac{\delta^{i-1}}{b}, \frac{\sum_{k=i}^{j} w^k}{s}, \frac{\delta^j}{b} \right) \le P_{giv} \right\}_{s \in S}
\]
Since the platform is homogeneous, we denote by $E_{stat}$ the static energy of
all processors, and by $E_{dyn}(s)$ the dynamic energy consumed at speed s (s ∈ S).
Then, the recurrence is easy to justify: to compute $E(i,j,q)$, we create an
interval from stages $S_{\ell+1}$ to $S_j$ that is assigned to one single processor, and we
use the q − 1 remaining processors to process stages $S_i$ to $S_\ell$. The initialization
states that one single stage cannot be run on more than one processor,
and that a single processor in charge of interval [i, j] consumes the minimum
energy among the speeds that satisfy the bound on the period.
With many applications, for $a \in \{1,\dots,A\}$ and $q \in \{0,\dots,p\}$, let $E_a^q$ be the
minimum energy consumed by q processors on the application a, computed by
the previous dynamic programming algorithm. If the period constraint
cannot be fulfilled, or if the latency constraint cannot be fulfilled ($q > k_a^{max}$),
we set $E_a^q = +\infty$.
We recursively compute the value $E(a,q)$, which is the minimum energy
consumed by exactly q processors on applications 1, . . . , a. The goal is thus to
compute $\min_{1 \le q \le p} E(A,q)$. The recurrence relation can be expressed as:
\[
E(a,q) = \min_{1 \le r \le q-1} \left( E(a-1,q-r) + E_a^r \right) ,
\]
with the initialization:
\[
E(1,q) = E_1^q , \quad \forall\, 1 \le q \le p .
\]
Indeed, when there is only one application left, the result is known from the
previous dynamic programming algorithm. For several applications, we try to
assign r processors to application a, and find the value of r which returns the
lowest energy consumption.
7 Conclusion
In this report, we have studied the problem of mapping concurrent applications
onto computational platforms according to three criteria: period, latency and
energy. We restricted the study to the class of applications which have a pipeline
structure, and we established the complexity of the problems for different
variants of mapping strategies (one-to-one, interval and general mappings), and
different types of platforms (ranging from fully homogeneous to fully
heterogeneous).
First we focused on one-to-one and interval mappings with no resource
sharing. We considered performance criteria, namely period or latency
minimization. From this study of mono-criterion problems, one striking result is the
impact of having multiple concurrent applications on the problem complexity.
Indeed, when several applications are in competition for resources, the period
minimization problem turns out to be NP-hard for interval mappings with
heterogeneous processors, homogeneous pipelines and without communication, while
a polynomial algorithm had been found to solve the same problem with a single
application. The same phenomenon happens for latency minimization with
one-to-one mappings. For other period or latency minimization problems, either we
were able to extend polynomial algorithms for the single application case, or the
problem remained NP-complete. Considering bi-criteria problems, we were able
to derive sophisticated multi-criteria polynomial algorithms, through the
construction of bipartite graphs or the use of dynamic programming. Trade-offs
were found to allow for an efficient albeit energy-aware execution. Finally, the
most challenging tri-criteria problem period/latency/energy turned out to be
NP-hard even with a single application on a fully homogeneous platform and
no communication cost.
In order to handle processor sharing, we explained why it was mandatory
to use a simpler model for the latency, and we discussed the use of the Wave-
front latency model. Thanks to a combination of two dynamic programming
algorithms, we showed that finding an optimal interval mapping without reuse
on fully homogeneous platforms can be done in polynomial time, while the
same problem was shown to be NP-complete with the classical definition of
latency. However, finding an optimal general mapping on any platform type,
or finding any optimal interval mapping on speed-heterogeneous platforms, are
NP-complete problems.
We believe that this exhaustive complexity analysis provides a solid theo-
retical foundation for the study of multi-criteria mappings of several concurrent
applications, in particular when combining performance and energy optimiza-
tion criteria.
On the practical side, we designed several heuristics in [5], as well as an
integer linear program to compute the optimal solution (either interval-based or
general) in possibly exponential time, for the Wavefront latency model. The
comparison of heuristics with and without processor sharing does confirm that
sharing is most useful when: (i) the modes are not close to each other; and (ii)
the static energy is high.
As future work, we plan to add replication to the mapping rules: a stage
could be mapped onto several processors, each in charge of different data sets, in
order to improve the period. This problem, partially investigated in [7], would
become even more challenging in a framework accounting for energy issues.
Also, it would be interesting to include the consumption induced by memory,
disks, fans, and other devices, in the energy model. Finally, we would like to
consider different application settings, as for instance applications that share
some data paths. In this case, we expect the impact of resource sharing to be
even more important, since mapping two such applications on the same resource
may further reduce their period and latency.
Acknowledgment
A. Benoit and Y. Robert are with the Institut Universitaire de France. This
work was supported in part by the ANR StochaGrid and RESCUE projects.
References
[1] K. Agrawal, A. Benoit, L. Magnan, and Y. Robert. Scheduling algo-
rithms for workflow optimization. In Proceedings of the International
Parallel and Distributed Processing Symposium (IPDPS). IEEE CS Press,
May 2010. Also available as research report RR-LIP-2009-22 at
http://graal.ens-lyon.fr/~abenoit/.
[2] A. Andrei, M. Schmitz, P. Eles, Z. Peng, and B. M. Al-Hashimi. Overhead-
conscious voltage selection for dynamic and leakage energy reduction of
time-constrained systems. In Proceedings of the conference on Design, Au-
tomation and Test in Europe (DATE), page 10518, Washington, DC, USA,
2004. IEEE Computer Society Press.
[3] M. A. Bender, S. Chakrabarti, and S. Muthukrishnan. Flow and stretch
metrics for scheduling continuous job streams. In Proceedings of SODA’98,
1998.
[4] A. Benoit, P. Renaud-Goud, and Y. Robert. Performance and energy
optimization of concurrent pipelined applications. Research Report
RR-LIP-2009-27, LIP, ENS Lyon, France, September 2009. Available at
http://graal.ens-lyon.fr/~abenoit. Short version appears in IPDPS’2010.
[5] A. Benoit, P. Renaud-Goud, and Y. Robert. Sharing resources for per-
formance and energy optimization of concurrent streaming applications.
Research Report 2010-05, LIP, ENS Lyon, France, February 2010. Avail-
able at http://graal.ens-lyon.fr/~abenoit/. Short version appears in
SBAC-PAD’2010.
[6] A. Benoit and Y. Robert. Mapping pipeline skeletons onto hetero-
geneous platforms. Journal of Parallel and Distributed Computing (JPDC),
68(6):790–808, 2008.
[7] A. Benoit and Y. Robert. Complexity results for throughput and latency
optimization of replicated and data-parallel workflows. Algorithmica, 2009.
Available online at http://dx.doi.org/10.1007/s00453-008-9229-4.
[8] A. Benoit, Y. Robert, and E. Thierry. On the complexity of mapping
linear chain applications onto heterogeneous platforms. Parallel Processing
Letters (PPL), 19(3):383–397, 2009.
[9] D. P. Bunde. Power-aware scheduling for makespan and flow. In Proceedings
of the ACM Symposium on Parallelism in Algorithms and Architectures
(SPAA), pages 190–196, New York, NY, USA, 2006. ACM Press.
[10] J.-J. Chen and L. Thiele. Energy-efficient task partition for periodic real-
time tasks on platforms with dual processing elements. In Proceedings of
International Conference on Parallel and Distributed Systems (ICPADS),
pages 161–168, Washington, DC, USA, 2008. IEEE Computer Society
Press.
[11] S. Cho and R. G. Melhem. On the interplay of parallelization, program
performance, and energy consumption. IEEE Transactions on Parallel and
Distributed Systems (TPDS), 21:342–353, 2010.
[12] M. Cole. Bringing Skeletons out of the Closet: A Pragmatic Manifesto for
Skeletal Parallel Programming. Parallel Computing, 30(3):389–406, 2004.
[13] DataCutter. DataCutter Project: Middleware for Filtering Large Archival
Scientific Datasets in a Grid Environment.
http://www.cs.umd.edu/projects/hpsl/ResearchAreas/DataCutter.htm.
[14] M. R. Garey and D. S. Johnson. Computers and Intractability; A Guide to
the Theory of NP-Completeness. W. H. Freeman & Co., New York, NY,
USA, 1990.
[15] F. Gruian and K. Kuchcinski. Lenes: task scheduling for low-energy sys-
tems using variable supply voltage processors. In Proceedings of the Asia
South Pacific Design Automation Conference (ASPDAC), pages 449–455,
New York, NY, USA, 2001. ACM.
[16] S. L. Hary and F. Özgüner. Precedence-constrained task allocation onto
point-to-point networks for pipelined execution. IEEE Transactions on
Parallel and Distributed Systems (TPDS), 10(8):838–851, 1999.
[17] Y. Hotta, M. Sato, H. Kimura, S. Matsuoka, T. Boku, and D. Takahashi.
Profile-based optimization of power performance by using dynamic voltage
scaling on a pc cluster. In Proceedings of the International Parallel and
Distributed Processing Symposium (IPDPS), page 340, Los Alamitos, CA,
USA, 2006. IEEE Computer Society Press.
[18] T.-Y. Huang, Y.-C. Tsai, and E. T.-H. Chu. A near-optimal solution for
the heterogeneous multi-processor single-level voltage setup problem. In
Proceedings of the International Parallel and Distributed Processing
Symposium (IPDPS), page 57, Los Alamitos, CA, USA, 2007. IEEE Computer
Society Press.
[19] T. Ishihara and H. Yasuura. Voltage scheduling problem for dynamically
variable voltage processors. In Proceedings of International Symposium on
Low Power Electronics and Design (ISLPED), pages 197–202. ACM Press,
1998.
[20] N. T. Karonis, B. Toonen, and I. Foster. MPICH-G2: A grid-enabled
implementation of the message passing interface. Journal of Parallel and
Distributed Computing (JPDC), 63(5):551–563, 2003.
[21] P. Langen and B. Juurlink. Leakage-aware multiprocessor scheduling.
Journal of Signal Processing Systems, 57(1):73–88, 2009.
[22] M. P. Mills. The internet begins with coal. Environment and Climate News,
1999.
[23] F. Rabhi and S. Gorlatch. Patterns and Skeletons for Parallel and
Distributed Computing. Springer Verlag, 2002.
[24] J. Subhlok and G. Vondran. Optimal mapping of sequences of data parallel
tasks. In Principles and Practice of Parallel Programming (PPoPP), 1995.
[25] J. Subhlok and G. Vondran. Optimal latency-throughput tradeoffs for data
parallel pipelines. In Proceedings of the ACM Symposium on Parallelism
in Algorithms and Architectures (SPAA), 1996.
[26] K. Taura and A. A. Chien. A heuristic algorithm for mapping
communicating tasks on heterogeneous resources. In Heterogeneous Computing
Workshop, pages 102–115. IEEE Computer Society Press, 2000.
[27] G. Varatkar and R. Marculescu. Communication-aware task scheduling
and voltage selection for total systems energy minimization. In Proceedings
of the IEEE/ACM International Conference on Computer-Aided Design
(ICCAD), page 510, Washington, DC, USA, 2003. IEEE Computer Society
Press.
[28] N. Vydyanathan, U. Catalyurek, T. Kurc, P. Sadayappan, and J. Saltz.
A duplication based algorithm for optimizing latency under throughput
constraints for streaming workflows. In Proceedings of International
Conference on Parallel Processing (ICPP), pages 254–261, Washington, DC,
USA, 2008. IEEE Computer Society.
[29] N. Vydyanathan, U. Catalyurek, T. Kurc, P. Sadayappan, and J. Saltz.
Toward optimizing latency under throughput constraints for application
workflows on clusters. In Euro-Par’07, LNCS 4641, pages 173–183. Springer
Verlag, 2007.
[30] Q. Wu, J. Gao, M. Zhu, N. Rao, J. Huang, and S. Iyengar. On optimal
resource utilization for distributed remote visualization. IEEE Transactions
on Computers (TC), 57(1):55–68, 2008.
[31] Q. Wu and Y. Gu. Supporting distributed application workflows in
heterogeneous computing environments. In Proceedings of International
Conference on Parallel and Distributed Systems (ICPADS), pages 3–10,
Washington, DC, USA, 2008. IEEE Computer Society Press.
[32] R. Xu, D. Mossé, and R. Melhem. Minimizing expected energy in real-time
embedded systems. In Proceedings of the ACM Int. Conf. on Embedded
Software (EMSOFT), pages 251–254, 2005.
[33] R. Xu, D. Mossé, and R. Melhem. Minimizing expected energy
consumption in real-time systems through dynamic voltage scaling. ACM Trans.
Comput. Syst., 25(4):9, 2007.
Publisher
INRIA - Domaine de Voluceau - Rocquencourt, BP 105 - 78153 Le Chesnay Cedex (France)
http://www.inria.fr
ISSN 0249-6399
