In this paper, we study the problem of finding optimal mappings for several independent but concurrent workflow applications, in order to optimize performance-related criteria together with energy consumption. Each application consists of a linear chain graph with several stages, and processes successive data sets in pipeline mode, from the first to the last stage. The problem is to decide which processors to enroll, at which speed (or mode) to use them, and which stages they should execute. There is a clear trade-off to reach, since running faster and/or more processors leads to better performance, but energy consumption is then very high. Energy savings can be achieved at the price of a lower performance, by reducing processor speeds or enrolling fewer resources. We study the problem complexity on different target execution platforms, ranging from fully homogeneous platforms to fully heterogeneous ones. We consider three mapping strategies: (i) one-to-one mappings, where a processor is assigned a single stage; (ii) interval mappings, where a processor may process an interval of consecutive stages of the same application; and (iii) general mappings, which are fully arbitrary, i.e. a processor may process stages of several distinct applications. Finally, we compare two different models for the computation of the latency, which is the time elapsed between the beginning and the end of the execution of a given data set: with the PATH model, it is computed as the length of the path taken by this data set, while with the WAVEFRONT model, each data set progresses concurrently within a period. For all platform types, all mapping strategies and both latency models, we establish the complexity of several multi-criteria optimization problems, whose objective functions combine period, latency and energy criteria.
In particular, we exhibit instances where the problem is NP-hard with concurrent applications, while it can be solved in polynomial time for a single application, and instances whose problem complexity depends upon the latency model.
Introduction
In this paper, we aim at optimizing the parallel execution of several pipelined applications that execute concurrently on a given platform. We focus in this work on pipelined applications whose structure is a linear chain of tasks. Such applications are ubiquitous in streaming environments, as for instance video and audio encoding and decoding, DSP applications, image processing, and so on (Hary and Özgüner 1999; Taura and Chien 2000; Wu and Gu 2008). Furthermore, the regularity of these applications renders them amenable to a high-level parallel programming approach based on algorithmic skeletons (Cole 2004; Rabhi and Gorlatch 2002). Skeletons ease the task of the application developer and make it easy to tailor their specific problem to a target platform.
In a linear pipelined application, a series of data sets enters the input stage and progresses from stage to stage until the final result is computed. Each stage corresponds to a distinct task and has its own communication and computation requirements: it reads an input from the previous stage, processes the data and outputs a result to the next stage. Each data set is input to the first stage, and final results are output from the last stage. The pipeline operates in synchronous mode: after a transient behavior due to the initialization delay, a new data set is completed every period. Mapping such applications onto parallel platforms is a challenging problem, that becomes even more difficult when platforms are heterogeneous (nowadays a standard assumption). Another level of difficulty is added when considering several independent applications which are executed concurrently on the platform and that compete for available resources.
The objective is to minimize the energy consumption of the whole platform, while satisfying given performance-related bounds on the period and latency of each application. This problem has recently become critical, both for economic and environmental reasons (Mills 1999). The Green500 list (see http://www.green500.org) provides rankings of the most energy-efficient supercomputers in the world, therefore raising awareness about power consumption. The multi-criteria approach targets a trade-off between the users and the platform manager. The former have specific requirements for their applications, while the latter has crucial economic and environmental constraints. Indeed, the energy saving problem is becoming increasingly important, not only because of the sole cost of energy, but also because of the cost of cooling systems and related infrastructures. To help reduce energy costs, modern computing centers provide multi-modal processors: each processor has a discrete number of predefined speeds (or modes), which correspond to different voltages that the processor can be subjected to. This approach is the Dynamic Voltage Scaling (DVS) technique; the slowest modes correspond to a lower voltage, and hence a slower execution speed. The power consumption is the sum of a static part (the cost for a processor to be turned on: power can be saved by shutting processors down, the ON/OFF technique) and a dynamic part. This dynamic part is a strictly convex function of the processor speed, so that the execution of a given amount of work costs more energy if a processor runs in a higher mode (Ishihara and Yasuura 1998; Hotta et al. 2006). On the one hand, faster modes (i.e. higher voltages) allow for fulfilling the performance criteria, and on the other hand, they lead to a higher energy consumption.
The main performance-oriented criteria for pipelined applications are period and latency (Subhlok and Vondran 1995, 1996; Vydyanathan et al. 2007; Benoit and Robert 2008; Vydyanathan et al. 2008; Benoit and Robert 2009). The period of an application is the inverse of the throughput, i.e. it corresponds to the time interval between the arrival of two consecutive data sets. The period is dictated by the critical resource: it is equal to the longest cycle time of a processor. For instance, under a strict one-port communication model with no overlap of communications and computations, it is the sum of the time to perform all incoming communications, the time to perform all outgoing communications, and the total computation time. With overlap, we simply replace the sum of these three terms by their maximum. In some cases, the period is fixed by the applicative setting, and we must ensure that data sets are processed fast enough so that there is no accumulation of data sets in the pipeline. The latency of an application is the time elapsed between the beginning and the end of the execution of a given data set, hence it measures the response time of the system to process the data set entirely. For streaming applications, there are several approaches to compute the latency. The most accurate model is the PATH model, in which the latency is computed as the length of the path taken by any data set. With the WAVEFRONT model, we rather consider that each data set progresses concurrently within a period, and the latency is then a multiple of the period, as suggested by Hary and Özgüner (1999).
The two performance criteria alone are already antagonistic. The smallest latency is obtained when no communication occurs, i.e. when the same (fastest) processor executes all of the stages of an application. However, such a mapping may well exceed the bound on the period, since the same processor must process an entire application. Moreover, when several applications run concurrently, the scheduler must decide which resources to select and assign to each application, so that all users receive a fair share of the platform.
Adding energy consumption as a third criterion renders everything even more complex. Obviously, energy is minimized by enrolling a single processor for all applications, namely the one with the smallest speed available; but such a mapping would most certainly exceed period and latency bounds.
Our goal is to execute all applications efficiently while minimizing the energy consumed. Unfortunately, the goals of low power consumption and efficient scheduling are contradictory. Indeed, period and/or latency can be minimized by using more energy to speed up processors, while energy can be minimized by reducing processor speeds, which degrades performance-related objectives. How should we deal with these contradictory objective functions? In traditional approaches, one would form a linear combination of the different objectives and treat the result as the new objective to be optimized. However, is it natural for the user to minimize the quantity $0.7P + 0.3E$, where $P$ is the period and $E$ the energy? Since criteria are very different in nature, it does not make much sense for a user to make a linear combination of them. Thus, we advocate the use of multi-criteria mappings with thresholds. Now, each criteria combination can be handled in a natural and meaningful way: one single criterion is optimized, under the condition that a threshold is enforced for all other criteria. This leads to two interesting questions. If we fix energy, we obtain the laptop problem, which asks 'What is the best schedule achievable using a particular energy budget, before battery becomes critically low?'. Fixing schedule quality gives the server problem, which asks 'What is the least energy required to achieve a desired level of performance?'.
The optimization problem can then be stated as follows: given a set of applications and a computational platform, which stage should be assigned to which processor? We consider three different mapping strategies: one-to-one mappings, for which each application stage is allocated to a distinct processor; interval mappings, where each participating processor is assigned an interval of consecutive stages of the same application; and general mappings, which are fully arbitrary. These mapping strategies have been widely used in the literature when mapping one single application (Subhlok and Vondran 1995, 1996; Benoit and Robert 2008) and we extend them naturally to the mapping of several concurrent applications.
We target three different platform types: fully homogeneous platforms have identical processors and interconnection links; communication homogeneous platforms have identical links but different-speed processors, thus introducing a first degree of heterogeneity; and, finally, fully heterogeneous platforms, with different-speed processors and different capacity links, constitute the most difficult problem instance.
The paper is organized as follows. We first review related work in Section 2. Then, we illustrate and motivate the problem with a simple example in Section 3. The framework is described in Section 4. The next two sections constitute the heart of the paper: we assess the complexity of all problem instances. In Section 5, we establish the complexity of mapping problems with the PATH latency model, while we investigate the complexity with the WAVEFRONT latency model in Section 6. We provide conclusions in Section 7.
Related work
The problem of mapping a single linear chain application onto parallel platforms in order to minimize latency and/or period has already been widely studied, in particular on homogeneous platforms (see the pioneering papers of Subhlok and Vondran (1995, 1996)) and later for heterogeneous platforms (Benoit and Robert 2008, 2009), considering the PATH latency model. These results focus on the mapping of one single application, while we add the complexity of satisfying several users who each have different requirements for their applications. We were able to extend polynomial time algorithms to this multi-application setting, and to exhibit cases in which the problem becomes NP-hard because of this additional difficulty. Of course, problem instances which were already NP-hard with a single application remain difficult with several concurrent applications.
Moreover, we consider a new and important objective function, namely energy minimization, and this is the first study (to the best of our knowledge) which combines performance-related objectives with energy in the context of pipelined applications. As expected, combining all three criteria (period, latency and energy) leads to even more difficult optimization problems: the problem is NP-hard even with a single application on a fully homogeneous platform (for interval mappings with the PATH latency model).
In order to adjust energy consumption, we use the DVS technique. DVS has been extensively studied in several papers, for mapping onto a single-core processor, a multicore processor, or a set of processors.
Slack reclamation techniques are used for frame-based hard real-time embedded systems in Xu et al. (2005): a set of independent tasks, provided with their WCEC (worst-case execution cycle) and sharing a common deadline, has to be mapped onto a processor. If a task needs fewer cycles than its WCEC, the dynamically obtained slack allows the processor to run at a lower frequency and therefore to save energy. This work is extended in Xu et al. (2007), where the energy model includes time and energy penalties when the processor frequency is changing. Those transition overheads are also taken into account in Andrei et al. (2004), but tasks are interdependent.
Then Cho and Melhem (2010) mapped applications which consist of a program modeled with a sequential part and another part which can be parallel, onto a multi-core processor. Bunde (2006) focuses on the problem of offline scheduling unit time tasks with release dates, while minimizing the makespan or the total flow time on one processor. He extends this work from one processor to multiprocessors. Chen and Thiele (2008) studied the problem of scheduling real-time tasks on two heterogeneous processors. They provide a fully polynomial time approximation scheme (FPTAS) to derive a solution very close to the optimal energy consumption with a reasonable complexity, while Gruian and Kuchcinski (2001) designed heuristics to map a set of dependent tasks with deadlines onto a set of homogeneous processors, with the possibility of changing a processor speed during the execution of a task. Huang et al. (2007) proposed a greedy algorithm based on affinity to assign frame-based real-time tasks, and then they reassign them in pseudo-polynomial time when any processing speed can be assigned for a processor. Langen and Juurlink (2009) focused on leakage energy for mapping applications represented as directed acyclic graphs (DAGs). Varatkar and Marculescu (2003) were interested in scheduling task graphs with data dependencies while minimizing the energy consumption of both the processors and the interprocessor communication devices, while assuming that the communication times are negligible compared with the computation times.
All of these problems are quite different from ours, since we focus on pipelined applications of infinite duration, thus considering power instead of total energy consumption. Owing to the streaming nature of the applications, we do not allow for changing the processor speed during execution.
Motivating example
In this example, we have two applications and three processors, as shown in Figure 1. The first stage of App 1 computes 3 operations, and then sends data of size 3 to the second stage; the second stage first receives data of size 3, then computes 2 operations, and finally sends data of size 1, and so on. If both stages are assigned to the same processor, there is no communication cost to pay; otherwise, this cost depends on the communication volume (3 between the first and the second stage in this case), and on the link bandwidth between the corresponding processor pair. All communication link bandwidths are set to 1. For the computational platform, each processor has two execution modes. For instance, $P_1$ can process 3 operations per time unit in its first mode, and 6 in its second one, against 6 or 8 for $P_2$, and 1 or 6 for $P_3$.
We compute the global period as the maximum of the periods achieved by all applications.
The global latency is defined in a similar way, as the maximum of the latency achieved by all applications. The energy consumption of a processor is equal to the square of its speed, which is quite a realistic assumption (see Section 4.5 for more details on the model for energy consumption). Note that when the energy is not a criterion to minimize, all processors can run in their highest modes (as fast as possible), because this can only improve the performance-related criteria (period and latency). In this case, either a processor is used at its fastest speed, or it is turned off.
Interval mappings
First we restrict to interval mappings, where a processor can be assigned only a set of consecutive stages of a single application.
In order to minimize the period without energy constraints, we map the whole first application onto processor $P_3$, the first half of the second application onto processor $P_2$, and the rest onto processor $P_1$. The period is then
Equation (1) reads as follows: we compute the cycle time of each processor as the maximum time spent for incoming communications, computations, and outgoing communications, thus considering a model in which communications and computations are overlapped. We then take the maximum of these quantities to derive the period. There is only one communication to pay in the second application since it is split between two processors. Note that the cycle time of each processor is exactly 1 and there is no idle time on computation, thus it is not possible to achieve a better period: this mapping is optimal for the period minimization problem.
The minimum latency is obtained by removing all communications and using the fastest processors. A mapping that returns the optimal latency (in the absence of other criteria) is for instance the one which maps the first application on $P_1$ and the second application on $P_2$, thus achieving a global latency of
In Equation (2), we simply compute the longest execution path for each application following the PATH latency model. The period of each application is, in this case, equal to its latency, and the WAVEFRONT model returns the same latency (one single period for the execution of a data set). The bottleneck is the second application, and we cannot achieve a better latency since we use no communication and use the fastest processor for this application. This latency is thus optimal.
The minimum energy is obtained when we use fewer processors, each running in its slowest mode. Since we assume that a processor cannot be assigned stages of two different applications, two processors are required in the example. For instance, we can map the first application on $P_1$ running in its lowest mode and the second application on $P_3$ running in its lowest mode too, thus achieving an energy of $3^2 + 1^2 = 10$. This is the minimum energy consumption required to run both applications. We observe that the period is then
As expected, running at a slower mode to save energy leads to poorer performance. Trade-offs must be found when considering several antagonistic optimization criteria.
For instance, if we try to minimize the energy consumption under the constraint that the period is not greater than 2, we can use the first mode of each processor. Then the first application is mapped onto $P_1$, the first three stages of the second application are mapped onto $P_2$ and its last stage is mapped onto $P_3$. The global period is 2, and the consumed energy is $3^2 + 6^2 + 1^2 = 46$. This may be quite a reasonable compromise between energy and period: indeed, with the mapping minimizing the period (period of 1), the energy consumption was $6^2 + 8^2 + 6^2 = 136$. With this mapping, the latency model impacts the result. With the PATH model, the latency is the length of the longest path followed by a data set, while with the WAVEFRONT model, it takes three periods for a data set to be computed by the second application, leading to a latency of $3 \times 2 = 6$.
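The arithmetic of the interval-mapping example can be checked with a few lines of Python. This is only an illustrative sketch: the function names `energy` and `wavefront_periods` are hypothetical, the energy of a processor is taken as the square of its speed as stated for this example, and the $2m - 1$ rule is the WAVEFRONT rule given in the framework section.

```python
# Energy model of the example: a processor running at speed s consumes s**2.
def energy(speeds):
    """Total energy of the enrolled processors, given their chosen speeds."""
    return sum(s ** 2 for s in speeds)

# WAVEFRONT rule: an application split into m intervals needs 2m - 1 periods.
def wavefront_periods(num_intervals):
    return 2 * num_intervals - 1

# Speeds quoted in the text for each of the three interval mappings.
min_energy = energy([3, 1])             # P1 and P3 in their slowest modes
bounded_period = energy([3, 6, 1])      # first mode of P1, P2 and P3
min_period = energy([6, 8, 6])          # every processor in its fastest mode

# Second application under the period-2 mapping: two intervals, period 2.
wavefront_latency = wavefront_periods(2) * 2

print(min_energy, bounded_period, min_period)  # 10 46 136
print(wavefront_latency)                       # 6
```

The three energies reproduce the values 10, 46 and 136 quoted above, and the WAVEFRONT latency of the second application is three periods of length 2.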
General mappings
With general mappings, it is possible to assign any subset of stages to the processors. For instance, we consider the mapping in which the first stage of application one, and the second and third stages of application two, are all mapped onto the second processor, running at speed 6. The other stages are mapped onto the first processor, running at speed 3. The energy consumption is then $3^2 + 6^2 = 45$. For the period, we take the maximum between the periods of both processors, accounting both for computation and communication costs:
Note that there are two communications from $P_2$ to $P_1$: one which corresponds to the communication in the first application between the first and the second stage, and one in the second application between the third and the fourth stage. For the computation of the latency with the PATH model, it is necessary to decide in which order these communications occur. If we start with the communication in the first application, the latency is computed as follows:
There is one time unit of idle time in the computation of the latency of the second application, which corresponds to the communication from $P_2$ to $P_1$ in the first application. The latency can be reduced to 5 if we change the order of communications. Actually, for general mappings, even if the mapping is fixed, it is NP-hard to decide in which order communications should be executed in order to minimize the latency with the PATH model (Agrawal et al. 2010).
Owing to this observation, we consider the WAVEFRONT model when dealing with general mappings. This model was introduced by Hary and Özgüner (1999), and it is widely used in real-time systems. Note that the WAVEFRONT model assumes a full overlap of communications and computations. In the example, the latency is still dictated by the second application: this application needs 5 periods to execute a whole data set. The WAVEFRONT latency is therefore $5 \times \frac{7}{3} \approx 11.67$.
Framework
We start with a formal description of the applicative framework (Section 4.1) and the target execution platform (Section 4.2). Next in Section 4.3, we introduce and motivate the mapping strategies. We are then ready to formally describe the performance objective criteria (period and latency) in Section 4.4, and then to finally discuss the energy model in Section 4.5.
Applicative framework
We consider $A$ independent application workflows ($A \geq 1$) to be executed concurrently; each application operates on a collection of data sets that are executed in a pipelined fashion. For $1 \leq a \leq A$, let $n_a$ be the number of stages of application $a$, and $N = \sum_{a=1}^{A} n_a$ be the total number of stages. For $1 \leq k \leq n_a$, $w_a^k$ is the computation requirement of $S_a^k$, the $k$th stage of application $a$. For $1 \leq k < n_a$, $\delta_a^k$ is the size of the output data of $S_a^k$.
Target platform
The target platform is composed of $p$ processors, which are fully interconnected; there is a bidirectional link $P_u \leftrightarrow P_v$ between any processor pair $P_u$ and $P_v$, of bandwidth $b_{u,v}$. We use a linear cost model for communications; it takes $X / b_{u,v}$ time units to send (respectively receive) a message of size $X$ to (respectively from) $P_v$. With the mapping rules that we enforce (see Section 4.3), it turns out that a processor never has to perform two concurrent incoming or outgoing communications: at any time step, a processor is involved in at most one send, one computation and one receive. However, these three operations can either be parallel (as in the example of Section 3) or serialized. With parallel operations, we have the overlap model that corresponds to multi-threaded communication libraries such as MPICH2 (Karonis et al. 2003). With sequential operations, we have the no-overlap model that is well-suited to single-threaded programs.
Processors are multi-modal: every processor $P_u$ is associated with a set of speeds $S_u = \{s_{u,1}, \ldots, s_{u,m_u}\}$. During the mapping process, we need to choose one speed in $S_u$ for each processor $P_u$ that is enrolled, and this speed is fixed during the whole execution.
Then we classify particular cases which are important, both from a theoretical and practical perspective. Fully homogeneous platforms, also called speed homogeneous, have identical processors (all processors have a common speed set: $S_u = S$) and homogeneous communication devices ($b_{u,v} = b$ for all link bandwidths). They represent typical parallel machines. Communication homogeneous platforms, also called speed heterogeneous, are still interconnected with homogeneous communication devices, but they may have processors with different speed sets ($S_u \neq S_v$). They correspond to networks of workstations with plain TCP/IP interconnects or other local area networks (LANs). Fully heterogeneous platforms are the most general, fully heterogeneous architectures. Hierarchical platforms made up of several clusters interconnected by slower backbone links can be modeled this way.
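The three platform classes can be told apart by inspecting speed sets and link bandwidths. The following Python sketch is one possible way to do so; the function name `classify_platform` and the data layout (a list of speed sets and a dictionary of link bandwidths) are assumptions made for illustration only.

```python
def classify_platform(speed_sets, bandwidths):
    """Return the platform class described in the text.

    speed_sets: list of sets, speed_sets[u] is the speed set S_u of processor u.
    bandwidths: dict mapping a processor pair (u, v) to the bandwidth b_{u,v}.
    """
    same_links = len(set(bandwidths.values())) <= 1
    same_speed_sets = all(s == speed_sets[0] for s in speed_sets)
    if not same_links:
        # Heterogeneous links fall into the most general class, whatever the
        # processors look like.
        return "fully heterogeneous"
    if same_speed_sets:
        return "fully homogeneous"
    return "communication homogeneous"

# Platform of the motivating example: speed sets {3, 6}, {6, 8}, {1, 6},
# and all link bandwidths equal to 1.
example = classify_platform(
    [{3, 6}, {6, 8}, {1, 6}],
    {(0, 1): 1, (0, 2): 1, (1, 2): 1},
)
print(example)  # communication homogeneous
```

With identical links but different speed sets, the platform of the motivating example falls into the communication homogeneous class.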
Mapping strategies and scheduling
We consider three mapping strategies. One-to-one mappings obey the simplest rule: each application stage is allocated to a distinct processor. While easier to optimize and implement, this rule may be unduly restrictive, and is likely to pay high communication costs. Obviously, it also requires that $p \geq N$, thereby limiting its applicability to larger platforms (or fewer and smaller applications). A natural extension is to search for interval mappings, where each participating processor is assigned an interval of consecutive stages. Intuitively, assigning several consecutive stages to the same processor will increase its computational load, but may well dramatically decrease communication requirements. Interval mappings have been widely used in the literature (Subhlok and Vondran 1995, 1996; Benoit and Robert 2008; Wu and Gu 2008). We point out that both one-to-one and interval mappings forbid any processor sharing, or re-use, across applications. These mappings are relevant in practice, for instance if we envision a computer center where applications, or jobs, cannot share resources because of security rules or of batch-assignment procedures. The goal of the platform manager is to secure an efficient (albeit concurrent) execution for each application (performance-related criteria) while minimizing the energy consumption of the whole platform.
We also introduce general mappings that allow any processor to execute any number of stages, consecutive or not, taken from one or several applications. Such mappings are likely to lead to a better resource utilization throughout the platform.
Performance optimization criteria
We are now ready to formally define the period and the latency of the applications. We start with one-to-one and interval mappings with no processor sharing, and then we discuss the impact of processor sharing on the metrics.
Without processor sharing.
For one-to-one and interval mappings, since there is no processor sharing, we can focus on a single application.
Formally, an interval mapping is a partition of the set of stages $S^1$ to $S^n$ into $m$ intervals $I_j = [d_j, e_j]$ such that $d_j \leq e_j$ for $1 \leq j \leq m$, $d_1 = 1$, $d_{j+1} = e_j + 1$ for $1 \leq j \leq m - 1$, and $e_m = n$. Then, the function $al : [1, n] \to [1, p]$ associates a processor number to each stage number. In a one-to-one mapping, this function is a one-to-one assignment. In an interval mapping, for $1 \leq j \leq m$, the whole interval $I_j$ is mapped onto the same processor $P_{al(d_j)}$, i.e. for $d_j \leq i \leq e_j$, $al(i) = al(d_j)$. Also, two intervals (from the same application or from two different applications) cannot be mapped onto the same processor, i.e. for $1 \leq j, j' \leq m$, $j \neq j'$, $al(d_j) \neq al(d_{j'})$.
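The definition above translates into a simple check on the allocation function. The sketch below is a minimal illustration for a single application, assuming the allocation is given as a Python list `al` whose entry `al[i - 1]` is the processor number of stage $i$; the helper names are hypothetical, and the rule that forbids sharing a processor across applications is not covered.

```python
from itertools import groupby

def intervals_of(al):
    """Group consecutive stages mapped onto the same processor.

    Returns a list of (d_j, e_j, processor) with 1-based inclusive bounds.
    """
    intervals, start = [], 1
    for proc, run in groupby(al):
        length = len(list(run))
        intervals.append((start, start + length - 1, proc))
        start += length
    return intervals

def is_interval_mapping(al, p):
    """True if every processor number is in [1, p] and no processor
    receives two distinct intervals."""
    procs = [proc for _, _, proc in intervals_of(al)]
    in_range = all(1 <= proc <= p for proc in procs)
    return in_range and len(procs) == len(set(procs))

def is_one_to_one(al, p):
    """True if each stage is alone on its processor."""
    return is_interval_mapping(al, p) and len(set(al)) == len(al)

# Second application of the motivating example under the period-2 mapping:
# stages 1 to 3 on processor 2, stage 4 on processor 3.
print(intervals_of([2, 2, 2, 3]))        # [(1, 3, 2), (4, 4, 3)]
print(is_interval_mapping([1, 2, 1], 3)) # False: processor 1 gets two intervals
```

By construction, the intervals returned by `intervals_of` satisfy $d_1 = 1$, $d_{j+1} = e_j + 1$ and $e_m = n$, so only the distinct-processor condition remains to be checked.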
The period of this single application is expressed in the overlap model as 
The maximum in the previous expression is replaced by a sum when considering the no-overlap model, since all operations are serialized. The period is then
The latency is the time to process a single data set entirely, so it is identical in both communication models, and computed with the PATH model:
with $\delta^{e_m} = 0$ for the boundary. Again, the simplicity of Equations (3), (4) and (5) is a very useful property of interval mappings, and greatly simplifies the solution of multi-criteria problems.
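The period and latency of an interval mapping can be computed directly from the prose description: the cycle time of a processor is the maximum (overlap) or the sum (no-overlap) of its incoming communication, computation and outgoing communication times, and the PATH latency adds up all computations and inter-processor communications. The Python sketch below follows that description; the function name, the argument layout and the numerical instance are assumptions chosen for illustration, not taken from the paper.

```python
def interval_metrics(w, delta, intervals, speed, bandwidth, overlap=True):
    """Return (period, latency) of one application under an interval mapping.

    w[i - 1]: computation requirement of stage i.
    delta[i - 1]: output size of stage i (0 for the last stage).
    intervals: list of (d_j, e_j, processor), 1-based inclusive bounds.
    speed[u]: speed chosen for processor u.
    bandwidth[frozenset((u, v))]: bandwidth of the link between u and v.
    """
    cycle_times, latency = [], 0.0
    for j, (d, e, proc) in enumerate(intervals):
        compute = sum(w[d - 1:e]) / speed[proc]
        # Incoming data is the output of stage d - 1, sent by the processor
        # in charge of the previous interval.
        t_in = 0.0
        if j > 0:
            prev = intervals[j - 1][2]
            t_in = delta[d - 2] / bandwidth[frozenset((prev, proc))]
        # Outgoing data is the output of stage e, sent to the next interval.
        t_out = 0.0
        if j < len(intervals) - 1:
            nxt = intervals[j + 1][2]
            t_out = delta[e - 1] / bandwidth[frozenset((proc, nxt))]
        parts = (t_in, compute, t_out)
        cycle_times.append(max(parts) if overlap else sum(parts))
        latency += compute + t_out
    return max(cycle_times), latency

# Hypothetical two-stage application: work 3 then 2, one transfer of size 3,
# stage 1 on processor 1 (speed 3), stage 2 on processor 2 (speed 2).
args = ([3, 2], [3, 0], [(1, 1, 1), (2, 2, 2)], {1: 3, 2: 2}, {frozenset((1, 2)): 1})
print(interval_metrics(*args, overlap=True))   # (3.0, 5.0)
print(interval_metrics(*args, overlap=False))  # (4.0, 5.0)
```

In this instance the communication of size 3 dominates: it fixes the period at 3 with overlap, while serializing operations raises the period to 4; the latency is 5 in both models, in line with the remark that the latency does not depend on the communication model.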
These are the period and latency of one single application, and we need to define a global period and latency function to be optimized. The simplest approach is to minimize $X = \max_{a \in \{1, \ldots, A\}} X_a$, where $X_a$ is the period or latency of application $a$, for $a \in \{1, \ldots, A\}$, as in the example of Section 3. However, the concurrent applications can be of completely different nature and/or economic value, so that their periods or latencies are not always comparable. Therefore, we aim at minimizing $X = \max_{a \in \{1, \ldots, A\}} W_a \times X_a$,
where $W_a > 0$ is a weight associated with each application and $X_a$ is the period or latency of application $a$, for $a \in \{1, \ldots, A\}$. $W_a$ can be 1 (we retrieve a simple maximum) or a priority ratio (fixed by the platform manager and/or paid by the user). We can also let $W_a = 1 / X_a^*$, where $X_a^*$ is the objective function computed when the application is executed alone on the platform; in this case $W_a \times X_a$ represents the slowdown factor of application $a$, and $X$ corresponds to the maximum stretch (Bender et al. 1998).
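The weighted objective and its maximum-stretch special case can be sketched in a few lines of Python. The function names and the numerical values below are hypothetical and serve only to illustrate the definition.

```python
def global_objective(values, weights=None):
    """Weighted maximum of per-application periods or latencies.

    With no weights (all weights equal to 1), this is the plain maximum.
    """
    if weights is None:
        weights = [1.0] * len(values)
    return max(wt * x for wt, x in zip(weights, values))

def max_stretch(values, alone_values):
    """Maximum slowdown, using the weights 1 / X*_a where X*_a is the value
    obtained when application a runs alone on the platform."""
    return global_objective(values, [1.0 / x for x in alone_values])

# Hypothetical periods of two concurrent applications, and their periods
# when each one has the platform for itself.
periods = [2, 6]
alone = [1, 4]
print(global_objective(periods))     # 6.0
print(max_stretch(periods, alone))   # 2.0
```

The plain maximum is dominated by the second application, whereas the maximum stretch singles out the first one, which is slowed down by a factor of 2 against 1.5 for the second.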
With resource sharing.
If we keep the classical latency definition (PATH model) and consider general mappings, it leads to intricate scheduling problems for period/latency bi-criteria problems. Basically, even when the mapping is given, scheduling the execution is a problem of combinatorial nature (it is NP-complete; see Agrawal et al. 2010). With general mappings, a processor typically has several incoming and/or outgoing communications, and it is difficult to orchestrate these operations so as to minimize conflicting objectives such as period and latency. This holds true both for the overlap and no-overlap models.
Therefore, when considering resource sharing, we focus in this paper on the problem in which bounds on period and latency are fixed by the application designer, and we relax the definition of the latency using the approach of Hary and Özgüner (1999), which we call the WAVEFRONT model. Instead of computing the longest path, we approximate the latency $L$ as $L = (2m - 1)P$, where $P$ is the period, i.e. the rate at which data sets enter the system, and $m$ is the number of intervals of consecutive stages mapped onto the same processor in the mapping. A processor change occurs each time a stage and its successor are not mapped onto the same processor, i.e. $m - 1$ times. The intuition is that the whole application is executed synchronously, and each data set progresses concurrently within a period. With $m$ successive computations and $m - 1$ processor changes (i.e. communications), each data set traverses the platform within $2m - 1$ periods.
The mapping is an allocation function, which associates a processor number to each stage number, as well as a speed at which each processor is running. For general mappings with processor re-use, we must carefully decide how the speed of each processor is shared among all stages it is assigned to. Similarly, a communication link or processor network card may be involved in several communications, which implies sharing bandwidths and card capacities too. Hence, the question is as follows: given the mapping, and a threshold period $P_a$ and latency $L_a$ for each application $a \in \{1, \ldots, A\}$, is it possible to determine which fraction of computing and communicating resources to assign to each operation so that all period and latency thresholds are met?
Since we consider the WAVEFRONT latency model, one period is accounted for each computation of an interval of stages and for each inter-processor communication. We observe that given the mapping, we know $m_a$, the number of intervals ($m_a - 1$ processor changes), for each application $a$. We can thus check immediately whether the bounds on the latency are respected, i.e. $(2m_a - 1) P_a \leq L_a$ for $a \in \{1, \ldots, A\}$.
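The latency check under the WAVEFRONT model only needs the number of intervals of each application. The Python sketch below illustrates it, assuming that each application is described by the list of processor numbers of its successive stages; the function names are hypothetical. The usage example takes the second application of the general mapping of Section 3 (stage 1 on the first processor, stages 2 and 3 on the second, stage 4 on the first) with the period $7/3$ quoted there.

```python
from fractions import Fraction
from itertools import groupby

def num_intervals(alloc):
    """Number m of maximal runs of consecutive stages on the same processor."""
    return sum(1 for _ in groupby(alloc))

def wavefront_latency(alloc, period):
    """WAVEFRONT latency (2m - 1) * P of one application."""
    return (2 * num_intervals(alloc) - 1) * period

def latency_bounds_met(allocs, periods, latency_bounds):
    """Check (2 m_a - 1) P_a <= L_a for every application a."""
    return all(
        wavefront_latency(alloc, period) <= bound
        for alloc, period, bound in zip(allocs, periods, latency_bounds)
    )

# Second application of the general mapping in the motivating example.
app2 = [1, 2, 2, 1]
period = Fraction(7, 3)
print(num_intervals(app2))              # 3
print(wavefront_latency(app2, period))  # 35/3
```

The processor is revisited in this mapping, yet the count still yields three intervals and hence five periods, i.e. a latency of $35/3 \approx 11.67$.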
Now for the periods, the key idea is to distribute platform resources parsimoniously, and to allocate only the needed CPU fraction to each computation, and the needed bandwidth fraction to each communication, so that the period constraint is fulfilled. The mapping is valid if neither processor speeds, nor link bandwidths, nor network card capacities are exceeded. First, we merge consecutive stages $[S_a^i, \ldots, S_a^j]$ of application $a$ mapped onto the same processor into one single coalesced stage $\hat{S}_a^k$, with computing cost $\hat{w}_a^k = \sum_{k'=i}^{j} w_a^{k'}$ and output communication cost $\hat{d}_a^k = d_a^j$. The transformed application now has exactly $m_a$ stages. In the following, stage $\hat{S}_a^k$ corresponds to the $k$th stage of the transformed application $a$, for $1 \le k \le m_a$.
As for computations, consider a processor $P_u$ and an application $a$. We define $K_a^u$ such that $k \in K_a^u$ if and only if $\hat{S}_a^k$ is processed by processor $P_u$; $K_a^u$ is the set of stages of (transformed) application $a$ processed by $P_u$. Then, for all $a$ and $u$, and for each $k \in K_a^u$, we allocate the speed fraction $s_{a,u}^k = \hat{w}_a^k / P_a$ for $P_u$ to execute $\hat{S}_a^k$. Similarly for communications, we define $K_a^{u,v}$ such that $k \in K_a^{u,v}$ if and only if $\hat{S}_a^k$ is processed by $P_u$ and $\hat{S}_a^{k+1}$ is processed by $P_v$, i.e. there is a communication cost between $P_u$ and $P_v$. Note that $u \neq v$, otherwise stages $\hat{S}_a^k$ and $\hat{S}_a^{k+1}$ would have been merged as a single stage. Formally, $k \in K_a^{u,v} \Leftrightarrow k \in K_a^u$ and $k + 1 \in K_a^v$. Then we allocate the bandwidth fraction $b_{a,u,v}^k = \hat{d}_a^k / P_a$ to the communication.
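The coalescing step and the resulting allocations can be illustrated with a short Python sketch. The data layout (one dictionary per application with keys `alloc`, `w`, `d` and `period`) and the function names are assumptions made for this example. The sketch treats links as directed pairs and leaves out network cards as well as any communication with the outside world, which is a simplification.

```python
from collections import defaultdict


def coalesce(alloc, w, d):
    """Merge consecutive stages mapped on the same processor.

    alloc[k]: processor of stage k, w[k]: computing cost, d[k]: output
    communication cost. Returns a list of (processor, w_hat, d_hat).
    """
    merged = []
    for proc, wk, dk in zip(alloc, w, d):
        if merged and merged[-1][0] == proc:
            p, w_sum, _ = merged[-1]
            # Costs add up; the output cost is that of the last merged stage.
            merged[-1] = (p, w_sum + wk, dk)
        else:
            merged.append((proc, wk, dk))
    return merged


def resource_demand(apps):
    """Total speed needed per processor and bandwidth needed per directed link."""
    cpu = defaultdict(float)
    link = defaultdict(float)
    for app in apps:
        stages = coalesce(app["alloc"], app["w"], app["d"])
        period = app["period"]
        for k, (u, w_hat, d_hat) in enumerate(stages):
            cpu[u] += w_hat / period            # speed fraction w_hat / P_a
            if k + 1 < len(stages):
                v = stages[k + 1][0]
                link[(u, v)] += d_hat / period  # bandwidth fraction d_hat / P_a
    return dict(cpu), dict(link)
```

Comparing each entry of `cpu` with the speed of the corresponding processor, and each entry of `link` with the corresponding bandwidth, then gives the validity test for the period thresholds under these simplifying assumptions.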
The period of each application can be respected if and only if all of the following inequalities are satisfied. There might be some spare speed and bandwidth if these are strict inequalities, and resources are fully utilized in the case of equalities:
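With the allocations defined above, these constraints can plausibly be written along the following lines. The capacity symbols are assumptions introduced here for illustration only: $s_u$ denotes the speed at which $P_u$ runs, $b_{u,v}$ the bandwidth of the link from $P_u$ to $P_v$, and $B_u^{\mathrm{out}}$ and $B_u^{\mathrm{in}}$ the outgoing and incoming capacities of the network card of $P_u$.

```latex
\begin{align*}
\forall u: \quad
  & \sum_{a=1}^{A} \sum_{k \in K_a^{u}} s_{a,u}^{k} \le s_u \\
\forall u \neq v: \quad
  & \sum_{a=1}^{A} \sum_{k \in K_a^{u,v}} b_{a,u,v}^{k} \le b_{u,v} \\
\forall u: \quad
  & \sum_{a=1}^{A} \sum_{v \neq u} \sum_{k \in K_a^{u,v}} b_{a,u,v}^{k} \le B_u^{\mathrm{out}} \\
\forall u: \quad
  & \sum_{a=1}^{A} \sum_{v \neq u} \sum_{k \in K_a^{v,u}} b_{a,v,u}^{k} \le B_u^{\mathrm{in}}
\end{align*}
```

The first line bounds the total speed requested from each processor, the second the total bandwidth requested on each link, and the last two the total traffic leaving and entering each network card.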
Note that we can also consider mappings without re-use with this latency model. In this case, if we transform each application $a$ as explained above, the allocation function of stages $\hat{S}_a^k$ (for $1 \le a \le A$ and $1 \le k \le m_a$) is a one-to-one function: each coalesced stage is allocated onto a distinct processor. It then becomes much easier to check the validity of the mapping, since each processor is only handling one single stage, receiving input data from one single other processor, and sending output data to one single other processor.
Energy model
The energy consumption of the platform is defined as the sum of the energy $E(u, \ell)$ consumed by each processor $P_u$ enrolled in the mapping in mode $\ell$. We assume that $E(u, \ell)$ consists of a static part and of a dynamic part. The static part $E_{stat}(u)$ is the static cost for a processor to be in service, and does not depend on the speed $s_{u,\ell}$ at which the processor is running. However, the static energy is consumed only in mode $\ell \neq 0$ (otherwise, the processor is inactive, and not enrolled in the mapping). In contrast, the dynamic part $E_{dyn}(u, \ell)$ is of the form $E_{dyn}(u, \ell) = s_{u,\ell}^{\alpha}$, where $\alpha > 1$ is an arbitrary rational number. It is sometimes assumed that $\alpha = 2$ (Ishihara and Yasuura 1998), as we did in the example of Section 3, but all our results hold for any value of $\alpha$. Finally, for $\ell \neq 0$, we have $E(u, \ell) = E_{stat}(u) + E_{dyn}(u, \ell)$, while $E(u, 0) = 0$.
The energy $E(u, \ell)$ is an energy consumed per time unit, so we could also speak of dissipated power. Note that it is mandatory to minimize energy consumption per time unit, because the execution of streaming applications with arbitrarily many data sets may last for an unbounded amount of time. Hence, we always consider a combination of energy and period objective criteria, because the latency on its own takes only one single data set into account, and does not reflect a pipelined execution.
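As a small illustration of this energy model, the following Python sketch evaluates the consumption of a platform. The function names are hypothetical, and representing the inactive mode 0 by `None` is a choice made here for the example.

```python
def processor_energy(e_stat, speed, alpha=2.0):
    """Energy per time unit of one processor.

    speed is None when the processor is not enrolled (mode 0), in which
    case nothing is consumed; otherwise static part plus speed**alpha.
    """
    if speed is None:
        return 0.0
    return e_stat + speed ** alpha


def platform_energy(e_stats, speeds, alpha=2.0):
    """Sum of the energy of all processors, enrolled or not."""
    return sum(processor_energy(e, s, alpha) for e, s in zip(e_stats, speeds))
```

For instance, with static costs `[1.0, 2.0, 5.0]`, speeds `[2.0, None, 3.0]` and the default exponent 2, the second processor is not enrolled and contributes nothing, so the total is (1 + 4) + 0 + (5 + 9) = 19.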
Complexity results with the Path model
In this section, we consider the PATH model for the computation of the latency, and therefore we restrict the study to one-to-one and interval mappings with no resource sharing. General mappings with resource sharing are investigated in Section 6. Note that all proofs are available in the companion research report (Benoit et al. 2011), due to restrictions on space.
In the following, proc-hom denotes identical speed processors while proc-het represents heterogeneous processors; com-hom means identical communication links, while they differ for com-het. We also report results for the case special-app, which corresponds to applications whose stages are all identical (all $w_a^k$ are equal), and for which no communication cost is paid (all $d_a^k$ are equal to zero). We start with the mono-criterion problems of period or latency minimization in Sections 5.1 and 5.2. In these cases, we do not consider energy minimization issues, and therefore we can systematically run processors at their highest speed, and thus use classical results established in a context with no energy. Then we investigate the following multi-criteria problems: period/latency (Section 5.3), period/energy (Section 5.4) and period/latency/energy (Section 5.5). We discard the latency/energy combination since, as discussed above, the energy model holds only in combination with the period criterion.
When dealing with multiple criteria, our approach is to minimize one of them, given a threshold on the others. Actually, fixing the period or the latency means fixing a threshold on the period or latency of each application, thus providing a table of period or latency values. Equivalently, we minimize the value of Equation (6) with suitable coefficients. For the energy, only a bound on the global energy consumption is required. Note that all of the results apply to both the overlap and no-overlap models, and to all objective functions introduced in Section 4.4: more precisely, polynomial problems remain polynomial for arbitrary weights $W_a$ in Equation (6), while NP-complete problems are already difficult with $W_a = 1$. All complexity results are summarized in Section 5.6.
Period minimization
We show that a greedy assignment solves the problem of finding a one-to-one mapping on communication homogeneous platforms, but the problem becomes NP-complete with heterogeneous links between the processors.
Theorem 1. On communication homogeneous platforms, a one-to-one mapping that minimizes the period can be determined in polynomial time.
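To give an idea of what a greedy assignment can look like, here is a minimal Python sketch under strong simplifying assumptions: communication costs are ignored, all application weights are equal to 1, and every processor runs at its highest speed. It is not claimed to be the algorithm of the companion report. The function name and the input layout are hypothetical. The sketch pools the stages of all applications and gives the most costly stages to the fastest processors.

```python
def one_to_one_min_period(stage_costs, proc_speeds):
    """Greedy one-to-one mapping minimizing the largest stage cycle time.

    stage_costs: list of (application, stage, w) triples.
    proc_speeds: proc_speeds[u] is the (highest) speed of processor u.
    Returns (period, assignment) where assignment maps (application, stage)
    to a processor number.
    """
    if len(stage_costs) > len(proc_speeds):
        raise ValueError("a one-to-one mapping needs one processor per stage")
    # Most demanding stages first, fastest processors first.
    stages = sorted(stage_costs, key=lambda t: t[2], reverse=True)
    procs = sorted(range(len(proc_speeds)),
                   key=lambda u: proc_speeds[u], reverse=True)
    assignment = {}
    period = 0.0
    for (app, k, w), u in zip(stages, procs):
        assignment[(app, k)] = u
        period = max(period, w / proc_speeds[u])
    return period, assignment
```

Under these assumptions an exchange argument shows that the sorted pairing cannot be improved: swapping two processors so that a cheaper stage gets the faster one never lowers the larger of the two cycle times.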
Theorem 2. On fully heterogeneous platforms, the problem of finding a one-to-one mapping that minimizes the period is NP-complete.
For interval mappings, we use an existing algorithm which finds the minimum period for a single application to build a new polynomial time algorithm that minimizes the global period of many applications on fully homogeneous platforms, giving the right number of processors to each application. The problem is NP-complete with heterogeneous processors.
Theorem 3. On fully homogeneous platforms, an interval mapping that minimizes the period can be determined in polynomial time.
Theorem 4. On communication homogeneous platforms, the problem of finding an interval mapping that minimizes the period is NP-complete.
The case special-app is more interesting, because a polynomial algorithm exists to find an interval mapping which minimizes the period of one single application (Benoit and Robert 2009); however, the problem becomes NP-complete with several applications.
Theorem 5. With several applications, heterogeneous processors, homogeneous pipelines without communication, the problems of finding an interval mapping which minimizes respectively $\max_{a \in \{1,\ldots,A\}} T_a$, $\max_{a \in \{1,\ldots,A\}} W_a \times T_a$, or $\max_{a \in \{1,\ldots,A\}} T_a / T_a^*$, are NP-complete (in the strong sense).
Latency minimization
First we point out that the latency expression does not depend on the communication model, and therefore the results of this section are valid for both the overlap and no-overlap models. We show that finding a one-to-one mapping which minimizes the latency is NP-complete as soon as the processors do not have the same speed thanks to a reduction from 3-PARTITION.
Theorem 6. The problem of finding the one-to-one mapping which minimizes the latency on fully homogeneous platforms is polynomial.
The case special-app is more interesting, because a polynomial algorithm exists to find a one-to-one mapping which minimizes the latency of one single application (Benoit et al. 2009b); however, the problem becomes NP-complete with several concurrent applications.
Theorem 7. With several applications, heterogeneous processors, homogeneous pipelines without communication, the problems of finding the optimal one-to-one mapping which minimizes, respectively, $\max_{a \in \{1,\ldots,A\}} L_a$, $\max_{a \in \{1,\ldots,A\}} W_a \times L_a$, or $\max_{a \in \{1,\ldots,A\}} L_a / L_a^*$, are NP-complete (in the strong sense).
However, we write a greedy algorithm that finds the optimal interval mapping on communication homogeneous platforms. The problem is still NP-complete on fully heterogeneous platforms for interval mappings.
Theorem 8. On communication homogeneous platforms, the optimal interval mapping which minimizes the latency can be determined in polynomial time.
Theorem 9. On fully heterogeneous platforms, the problem of finding an optimal interval mapping that minimizes the latency is NP-complete.
Period/latency minimization
In this section again, we are not concerned with energy minimization issues, so, similarly to the results of Sections 5.1 and 5.2, all processors can be run systematically at their highest speed. Therefore, on fully homogeneous platforms, all one-to-one mappings are identical, and it is straightforward to minimize the latency for a given period, or the converse.
However, for interval mappings, we must decide where to split applications into intervals, and we provide a dynamic programming algorithm which solves both variants of the problem with a single application. When considering multiple applications, we need to run the dynamic programming algorithm once per application with the corresponding period (respectively latency) threshold, and the minimum latency (respectively period) that can then be achieved is the maximum over all applications.
Theorem 10. With one application, on fully homogeneous platforms, the optimal interval mapping which minimizes the latency for a bounded period, or the period for a bounded latency, can be determined in polynomial time.
Theorem 11. With several applications, on fully homogeneous platforms, the optimal interval mapping which minimizes the latency $L = \max_{a \in \{1,\ldots,A\}} W_a \times L_a$ for a bounded period per application, or the period $T = \max_{a \in \{1,\ldots,A\}} W_a \times T_a$ for a bounded latency per application, can be determined in polynomial time.
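The single-application step can be illustrated by a small dynamic program. The sketch below is not the algorithm of the companion report and relies on a deliberately simplified cost model, which is an assumption of this example: all processors have speed `s`, all links have bandwidth `b`, communications overlap with computations, and data exchanged with the outside world is ignored. Under this model the latency is the total computation time plus the communication time paid at each cut between two intervals, so the program minimizes the communication paid at the cuts while keeping every interval and every cut within the period bound. All names are hypothetical.

```python
import math


def min_latency_interval_mapping(w, d, p, s, b, period_bound):
    """Minimum latency of an interval mapping under a period bound.

    w[k]: computing cost of stage k; d[k]: data sent from stage k to k + 1
    (length len(w) - 1); p >= 1: number of available processors.
    Returns the latency, or None if no mapping meets the period bound.
    """
    n = len(w)
    prefix = [0.0]
    for wk in w:
        prefix.append(prefix[-1] + wk)
    # f[q][j]: least communication time paid at cuts when the first j
    # stages are split into exactly q intervals that all meet the bound.
    f = [[math.inf] * (n + 1) for _ in range(p + 1)]
    f[0][0] = 0.0
    for q in range(1, p + 1):
        for j in range(1, n + 1):
            for i in range(j):          # last interval holds stages i..j-1
                if f[q - 1][i] == math.inf:
                    continue
                compute = (prefix[j] - prefix[i]) / s
                comm_in = d[i - 1] / b if i > 0 else 0.0
                if compute <= period_bound and comm_in <= period_bound:
                    f[q][j] = min(f[q][j], f[q - 1][i] + comm_in)
    best = min(f[q][n] for q in range(1, p + 1))
    return None if best == math.inf else best + prefix[n] / s
```

The table has $O(pn)$ entries and each is filled in $O(n)$ time. With several applications, such a routine would be called once per application with its own period threshold, as described above.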
When moving to a platform with heterogeneous processors, even if the application is homogeneous with no communication (case special-app), the problem of finding a one-to-one or interval mapping that solves the bi-criteria period/latency problem is NP-complete. This result is a direct consequence of the NP-completeness of the mono-criterion cases, see Sections 5.1 and 5.2.
Theorem 12. With heterogeneous processors and homogeneous pipelines, without communication, the problem of finding an interval or one-to-one mapping that solves the bi-criteria period/latency problem is NP-complete.
Period/energy minimization
We first provide results for one-to-one mappings, and then discuss interval mappings. For fully heterogeneous platforms, the problem is NP-hard because the period minimization problem already is NP-hard on such platforms. The interesting result is the following.
Theorem 13. On communication homogeneous platforms, a one-to-one mapping which minimizes the energy consumption while enforcing a given period for each application can be determined in polynomial time.
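The structure of this problem can be illustrated with a toy Python sketch. It is emphatically not the polynomial algorithm behind Theorem 13: it ignores communications and enumerates all assignments, which takes exponential time and is only usable on tiny instances. A polynomial method would presumably replace the enumeration by a minimum-weight bipartite matching on the same cost matrix. All names and the input layout are assumptions of this example.

```python
import math
from itertools import permutations


def cheapest_mode_energy(w, period, modes, e_stat, alpha=2.0):
    """Energy of the slowest mode able to process cost w within the period.

    modes: positive speeds of the processor. Returns inf if none is fast enough.
    """
    feasible = [s for s in modes if w / s <= period]
    if not feasible:
        return math.inf
    return e_stat + min(feasible) ** alpha


def min_energy_one_to_one(stages, procs, alpha=2.0):
    """Brute-force one-to-one mapping of least energy under period bounds.

    stages: list of (w, period) pairs, one per stage of any application.
    procs: list of (modes, e_stat) pairs.
    Returns (energy, assignment) with assignment[k] the processor of stage k,
    or (inf, None) when no feasible mapping exists.
    """
    cost = [[cheapest_mode_energy(w, period, modes, e_stat, alpha)
             for (modes, e_stat) in procs]
            for (w, period) in stages]
    best_energy, best_assign = math.inf, None
    for assign in permutations(range(len(procs)), len(stages)):
        energy = sum(cost[k][u] for k, u in enumerate(assign))
        if energy < best_energy:
            best_energy, best_assign = energy, assign
    return best_energy, best_assign
```

The key observation carried by the sketch is that, once a stage is placed on a processor, the best mode is simply the slowest one that meets the period bound, so the only combinatorial choice left is the assignment itself.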
For interval mappings, first note that the problem becomes NP-complete as soon as we consider different speed processors, because of the NP-completeness of the period minimization problem for such platforms. Thus, we focus on fully homogeneous platforms.
Theorem 14. On fully homogeneous platforms, an interval mapping which minimizes the energy consumption while enforcing a given period for each application can be determined in polynomial time.
Period/latency/energy minimization
When mixing the three criteria, the problem becomes NP-hard even for fully homogeneous platforms, no communication, and a single application. The combinatorial nature of the problem comes from the fact that even if processors are identical, they are multi-modal and each of them may run at a different speed.
Theorem 15. On fully homogeneous platforms, with a single application and without any communication cost, finding a one-to-one mapping that solves the tri-criteria problem is NP-hard.
Theorem 16. On fully homogeneous platforms, with a single application and without any communication cost, finding an interval mapping that solves the tri-criteria problem is NP-hard.
We conclude this section with some remarks on uni-modal processors. If we restrict to processors with a single execution mode, the problem becomes polynomial on fully homogeneous platforms, while it remains NP-hard otherwise (because of the NP-completeness of the period/latency problem, which is also established with uni-modal processors). For one-to-one mappings, all mappings are equivalent on fully homogeneous platforms, but the algorithm is more sophisticated for interval mappings. We first write an algorithm which partitions the stages of a single application into intervals, for each of the three variants of the tri-criteria optimization problem, and then we use this algorithm for the multiple application problem. Details can be found in Benoit et al. (2009a).
Summary of complexity results for the Path model
Table 1 summarizes all complexity results with the PATH latency model, for one-to-one and interval mappings without resource sharing. For the mono-criterion problems, most NP-completeness proofs come from the single application problem which already was NP-hard, see Benoit and Robert (2008) and Benoit et al. (2009b) for the proofs. The two special entries denoted with * are problem instances which could be solved in polynomial time for a single application, but become NP-hard with several applications. Remaining entries correspond to polynomial algorithms that already existed for a single application and that have been extended to several applications.
For the bi-criteria problems, we provide new polynomial algorithms to minimize one of the criteria, given a bound on the other one. NP-completeness results are obtained from the mono-criterion complexity results.
Finally, the tri-criteria problem turns out to be NP-hard even for fully homogeneous platforms, no communication and a single application.
Complexity results with the Wavefront model
In the previous section, we have performed an exhaustive complexity study considering the PATH latency model, and hence restricting to mapping rules without resource sharing (one-to-one or interval mappings). We have provided new polynomial algorithms for multiple applications and results of NP-completeness. However, when considering resource sharing and general mappings, we use the WAVEFRONT latency model, as explained in the framework (see Section 4.4).
In this section, we investigate the impact of this model on the complexity results. Since the latency definition is now closely related to the period definition, we consider only latency in combination with period. For the period/latency combination, we minimize the latency for a fixed period. For the tri-criteria problem, both period and latency are fixed, and we minimize the energy criterion. Also, we do not restrict the study to one-to-one and interval mappings, but also discuss general mappings. It turns out that the period minimization problem is NP-hard for such mappings, even for fully homogeneous platforms, no communication and a single application. Therefore, all multi-criteria problems with general mappings are NP-hard. All results are summarized in Table 2. Note that all proofs are available in the companion research report (Benoit et al. 2011), owing to restrictions on space.
Period minimization
All complexity results for period minimization were already established in Section 5.1, except for general mappings. It turns out that the problem is NP-hard for general mappings, even for fully homogeneous platforms, no communication and a single application.
Theorem 17. On fully homogeneous platforms with no communication, the problem of finding a general mapping that minimizes the period of a single application is NP-complete.
As a corollary, all multi-criteria problems are NP-hard for general mappings, since they all involve the period criterion (because of the energy and latency definitions).
Period/latency minimization
With heterogeneous processors and interval mappings, we already know that the period minimization problem is NP-hard, and therefore it remains NP-hard when combining it with the latency criterion. However, the result does not hold any longer for one-to-one mappings, while the bi-criteria problem was NP-hard with the PATH latency model. Actually, with homogeneous communications, the latency of an application with $n$ stages is always $(2n - 1) \times T$, where $T$ is the period of the application, and therefore the latency is minimized when the period is minimized. The bi-criteria problem amounts in this case to the period minimization problem, which is polynomial (binary search algorithm, see Section 5.1).
For interval mappings on homogeneous platforms, we propose in the following a polynomial algorithm for the period/latency/energy combination. This algorithm can also be used to solve the easier bi-criteria problem with no energy criterion.
Period/latency/energy minimization
As motivated earlier, we focus on the tri-criteria problem of minimizing energy under constraints on period and latency. It turns out that this problem becomes polynomial for interval mappings without resource sharing on fully homogeneous platforms, while it was NP-complete with the classical definition of latency (see Theorem 16).
For one-to-one mappings, the problem is polynomial for com-hom platforms with different speed processors. Indeed, similarly to the period/latency problem, minimizing the latency is equivalent to minimizing the period for 
Conclusion
In this paper, we have studied the problem of mapping concurrent applications onto computational platforms according to three criteria: period, latency and energy. We restricted the study to the class of applications which have a pipeline structure, and we established the complexity of the problems for different variants of mapping strategies (one-to-one, interval and general mappings), and different types of platforms (ranging from fully homogeneous to fully heterogeneous).
First we focused on one-to-one and interval mappings with no resource sharing. We considered performance criteria, namely period or latency minimization. From this study of mono-criterion problems, one striking result is the impact of having multiple concurrent applications on the problem complexity. Indeed, when several applications are in competition for resources, the period minimization problem turns out to be NP-hard for interval mappings with heterogeneous processors, homogeneous pipelines and without communication, while a polynomial algorithm had been found to solve the same problem with a single application. The same phenomenon happens for latency minimization with one-to-one mappings. For other period or latency minimization problems, either we were able to extend polynomial algorithms for the single application case, or the problem remained NP-complete. Considering bi-criteria problems, we were able to derive sophisticated polynomial algorithms, through the construction of bipartite graphs or the use of dynamic programming. Trade-offs were found to allow for an efficient albeit energy-aware execution. Finally, the most challenging tri-criteria problem period/latency/energy turned out to be NP-hard even with a single application on a fully homogeneous platform and no communication cost.
In order to handle processor sharing, we have explained why it was mandatory to use a simpler model for the latency, and we have discussed the use of the WAVEFRONT model. Thanks to a combination of two dynamic programming algorithms, we have shown that finding an optimal interval mapping without re-use on fully homogeneous platforms can be done in polynomial time, while the same problem was shown to be NP-complete with the classical definition of latency. However, finding an optimal general mapping on any platform type, or finding any optimal interval mapping on speed-heterogeneous platforms, are NP-complete problems.
We believe that this exhaustive complexity analysis provides a solid theoretical foundation for the study of multi-criteria mappings of several concurrent applications, in particular when combining performance and energy optimization criteria.
On the practical side, we designed several heuristics in , as well as an integer linear program to compute the optimal solution (either interval-based or general) in possibly exponential time, for the WAVEFRONT latency model. The comparison of heuristics with and without processor sharing does confirm that sharing is most useful when: (i) modes are not close to each other; and (ii) static energy is high.
As future work, we plan to add replication to the mapping rules: a stage could be mapped onto several processors, each in charge of different data sets, in order to improve the period. This problem, partially investigated in Benoit and Robert (2009), would become even more challenging in a framework accounting for energy issues. Also, it would be interesting to include the consumption induced by memory, disks, fans, and other devices, in the energy model. Finally, we would like to consider different application settings, for instance applications that share some data paths. In this case, we expect the impact of resource sharing to be even more important, since mapping two such applications on the same resource may further reduce their period and latency.
Funding
This work was supported in part by the ANR StochaGrid and RESCUE projects. 
