Executing multiple applications on a single MPSoC brings the major challenge of satisfying multiple quality requirements regarding real-time, energy, and so on. Hybrid application mapping denotes the combination of design-time analysis with run-time application mapping. In this article, we present such a methodology, which comprises a design space exploration coupled with a formal performance analysis. This results in several resource reservation configurations, optimized for multiple objectives, with verified real-time guarantees for each individual application. The Pareto-optimal configurations are handed over to run-time management, which searches for a suitable mapping according to this information. To provide any real-time guarantees, the performance analysis needs to be composable and the influence of the applications on each other has to be bounded. We achieve this either by spatial or a novel temporal isolation for tasks and by exploiting composable networks-on-chip (NoCs). With the proposed temporal isolation, tasks of different applications can be mapped to the same resource, while, with spatial isolation, one computing resource can be exclusively used by only one application. The experiments reveal that the success rate in finding feasible application mappings can be increased by the proposed temporal isolation by up to 30% and energy consumption can be reduced compared to spatial isolation.
deepakg@seas.upenn.edu; M. Glaß, (Current address) Institute of Embedded Systems/Real-Time Systems, Ulm University, Albert-Einstein-Allee 11, 89081 Ulm, Germany; email: michael.glass@uni-ulm.de.
INTRODUCTION
Modern multiprocessor systems-on-chip (MPSoCs) contain an increasing number of heterogeneous resources, i.e., processing elements (PEs), distributed memories, and parallel communication interconnects. This allows more and more functionality to be integrated into a single chip, which is becoming a prerequisite for implementing modern mobile and multimedia devices, as well as near-future automotive and avionics multi-/many-core systems with varying mixes of concurrently running real-time applications. These application mixes are not always known a priori: At design time, applications may stem from different developer teams and/or may be added at different points in time to the already running system. Also, the number of possible application mixes is exponential in the number of applications. This renders the analysis of all application mixes practically infeasible, even if all applications were known. In this context, run-time management (RM) has the purpose of partitioning the system resources and mapping applications onto these partitions dynamically in such a way that certain objectives such as energy consumption are optimized. For this task, the RM has to be able to anticipate the impact of the different mapping options on (a) the individual application objectives and (b) the overall system.
A common approach in related work [29, 39, 45] is to try to reduce any side effects by assigning PEs exclusively to applications, so that only tasks of the same application may share the same PE. Yet this way of creating spatial isolation is realized for PEs only. These approaches do not consider that the on-chip communication infrastructure is typically shared to realize flexible memory accesses and data transmissions. In fact, Ref. [34] recently showed that approaches that neglect communication are too optimistic: applications pass admission control and are executed although their deadlines could actually be violated.
As a solution, Refs [34, 36] propose a hybrid application mapping approach for MPSoCs with a composable network-on-chip (NoC) architecture to also bound the interference in communication. Isolation of tasks of different applications is still obtained via spatial isolation by exclusive assignment of tasks to PEs, which, however, may result in poor PE utilization rates.
As a remedy, we present a novel hybrid mapping approach that is based on temporal isolation of applications on both computational and communication resources. In particular, we propose (a) novel composable scheduling and performance analysis techniques, and (b) a constraint-based run-time mapping approach supported by design-time analysis. Together, these make it possible to bound the interference effects between applications even if they share the same resources. The major benefit is that system resources can be utilized more efficiently even under real-time constraints.
We illustrate this by means of a motivational example according to Figure 2. We assume a given heterogeneous 2 × 2 NoC target architecture with PEs being either of resource type $r_1$ or $r_2$. An example application, see Figure 3(a), is specified by a task graph with four tasks $t_i$ and four messages $m_i$. Based on this specification, design space exploration (DSE) is performed (e.g., Refs [2, 22, 24, 34]) for generating and evaluating different mappings of tasks to resources. By employing static performance analysis, the worst-case end-to-end latencies can be determined for each of these mappings, and mappings that could violate deadlines are rejected. The result of the DSE is a set of Pareto-optimal operating points (OPs) that represents a tradeoff between several objectives. Now, as symmetric architectures may have a huge number of concrete mappings with the same number of PEs used in the mapping, each OP does not describe a concrete mapping of tasks to resources and messages to the NoC, but a constraint graph instead, which describes (a) which tasks are clustered together and (b) onto which resource type each cluster has to be mapped to achieve the quality numbers analyzed. For example, for OP1 in Figure 2, $t_0$ and $t_2$ should be mapped together onto a PE with resource type $r_1$ (denoted $C_4$) and $t_1$ and $t_3$ together onto any available PE of resource type $r_1$ (denoted $C_5$). For OP2, tasks $t_1$ and $t_3$ need to be mapped together onto a PE of type $r_2$ (denoted $C_3$), while tasks $t_0$ and $t_2$ should be mapped onto two different PEs of type $r_1$ (denoted $C_1$ and $C_2$). Overall, OP2 uses more resources (two $r_1$, one $r_2$) than operating point OP1 (two $r_1$, zero $r_2$). In this example, task mappings according to operating point OP2 can be executed more efficiently and, thus, have a lower energy consumption due to the higher degree of parallelism. Note that a constraint graph stands for a family of concrete and symmetrically identical mappings. The advantage of this separation of static quality analysis and run-time search for a suitable mapping is that the complexity of run-time mapping is reduced to the largest possible extent.
Fig. 2. Schematic overview of spatially and temporally isolated mappings: After DSE, the resulting Pareto-optimal operating points (OPs) are stored along with their quality numbers of resources used (#PE), energy consumption (E), and their worst-case end-to-end latency (L). When the application is released at run-time, it needs to be mapped to the system where another application is already admitted and currently executed (gray clusters $C_a$ and $C_b$). Via spatial isolation, only OP1 can be feasibly mapped (a), while with our proposed temporal isolation, also OP2 can be mapped (b). This choice results in a lower energy consumption while still meeting the application's deadline by construction. Unused PE $u_0$ could be power gated to save more energy.
Fig. 3. Representation of (a) an example application by a task graph and (b) an example system architecture including four PEs with two different resource types (colored white and gray) and a 2 × 2 NoC. One possible application mapping of the task graph onto the architecture is shown in (c), also illustrating the two paths in the task graph, which are relevant for the calculation of the worst-case end-to-end latency, by a solid blue and a dashed red line.
This information is now used by the RM prior to starting each application at run-time. In the example illustrated in Figure 2, tasks belonging to another application ($C_a$ and $C_b$) are already mapped to some resources. This means that the RM needs to determine a feasible mapping just for the new application. Figure 2(a) illustrates an RM strategy based on spatial isolation. Here, the already occupied resources cannot be used for mapping the new application. Thus, there does not exist a feasible mapping for operating point OP2, as there is no unoccupied instance of resource type $r_2$ for mapping the tasks represented by $C_3$. The RM, therefore, has to test the operating point with the next-lowest energy consumption, i.e., operating point OP1, which then can be feasibly mapped as illustrated in the figure. In contrast, the proposed approach is able to share PEs under certain conditions as introduced in this article. As a consequence, operating point OP2 can be mapped according to Figure 2(b), resulting in a lower energy consumption, where even the unoccupied PE $u_0$ could be powered down.
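The difference between the two RM strategies in this example can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the run-time mapping algorithm of this article: the PE names, the type assignment of the 2 × 2 architecture, the task counts of the already admitted clusters, the limit `k_max`, and the function `map_clusters` are all hypothetical, and the sharing condition is reduced to a simple bound on the number of tasks per PE.

```python
from typing import Dict, List, Optional, Tuple

# (cluster name, required resource type, number of tasks in the cluster)
Cluster = Tuple[str, str, int]

def map_clusters(clusters: List[Cluster], pe_type: Dict[str, str],
                 load: Dict[str, int], temporal: bool, k_max: int = 4,
                 assigned: Optional[Dict[str, str]] = None) -> Optional[Dict[str, str]]:
    """Backtracking search that places each cluster on a distinct PE of matching type."""
    assigned = {} if assigned is None else assigned
    if len(assigned) == len(clusters):
        return dict(assigned)
    name, rtype, n_tasks = clusters[len(assigned)]
    for pe, ptype in pe_type.items():
        if ptype != rtype or pe in assigned.values():
            continue
        if temporal:
            # Assumed sharing condition: at most k_max tasks per PE.
            fits = load.get(pe, 0) + n_tasks <= k_max
        else:
            # Spatial isolation: the PE must not host another application.
            fits = load.get(pe, 0) == 0
        if fits:
            assigned[name] = pe
            result = map_clusters(clusters, pe_type, load, temporal, k_max, assigned)
            if result is not None:
                return result
            del assigned[name]  # undo and try the next PE
    return None

# Hypothetical instance: three PEs of type r1, one of type r2;
# two PEs already host two tasks each of another application.
PE_TYPE = {"u0": "r1", "u1": "r1", "u2": "r1", "u3": "r2"}
LOAD = {"u2": 2, "u3": 2}
OP1 = [("C4", "r1", 2), ("C5", "r1", 2)]
OP2 = [("C1", "r1", 1), ("C2", "r1", 1), ("C3", "r2", 2)]

spatial_op2 = map_clusters(OP2, PE_TYPE, LOAD, temporal=False)
spatial_op1 = map_clusters(OP1, PE_TYPE, LOAD, temporal=False)
temporal_op2 = map_clusters(OP2, PE_TYPE, LOAD, temporal=True)
```

With these assumed inputs, OP2 has no spatially isolated mapping because the only PE of type `r2` is occupied, whereas it becomes mappable once sharing is permitted.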
As illustrated, the advantage of allowing temporal isolation is a higher utilization of the system resources while satisfying predictability requirements on execution time. This not only has the direct consequence that a higher number of applications can be executed concurrently, but it is also possible to execute them on fewer PEs than when PEs are reserved exclusively for an application. Unused PEs can be power gated, which may additionally reduce energy consumption. Moreover, in emerging many-core systems, temporary or even permanent unavailability of hardware resources is expected to be experienced more often because of hardware faults (manufacturing variability and aging) or temperature/power management (cf. Ref. [15]). In this context, the proposed mapping approach enhances robustness, as it is possible to react to unavailability of PEs by re-mapping affected applications onto the remaining PEs, which can be shared with other applications. As Pourmohseni et al. show in Ref. [26], the re-mapping of an application during run-time can even be performed with a bounded timing overhead. They achieve this by adding a post-DSE analysis, which determines efficient migration options and routes. By reserving additional NoC resources, the applications can be migrated during run-time without violating the deadline. This includes the time to suspend the current execution of the application as well as the time to transfer the tasks over the NoC to the new PEs. The allocation overhead for 95% of the investigated cases was less than 10%. The methods proposed by Pourmohseni et al. can be used in addition to the work at hand.
Specifically, the contributions of this article are the following:
- This article presents the hybrid application mapping methodology, first introduced in Ref. [34], in more detail and with more examples. This approach combines the strengths of design time, e.g., analysis and compute-intensive optimization, with the flexibility of run-time decision making to cope with dynamism. While related work often neglects or simplifies the NoC communication, with this hybrid application mapping (HAM) methodology, timing guarantees for state-of-the-art packet-switched NoC architectures can be given.
- We enhance this methodology by including the concept of temporal isolation. This, as opposed to Ref. [34], enables the sharing of PEs among different applications while still preserving real-time requirements. In consequence, this increases the utilization of the system and opens up possibilities for energy saving.
- We evaluate execution times of the RM through a simulation of embedded hardware, which is not considered in Ref. [34]. For bounding the execution times, we propose to use the backtracking algorithm with a timeout mechanism and outline the implications in the conducted experiment.
The remainder of the article is outlined as follows: In Section 2, we give an overview of related work. We formalize the used model of applications and system architecture in Section 3. Section 4 describes our formal design-time analysis while Section 5 details the design-time optimizations. In contrast, Section 6 deals with run-time mapping. In Section 7, we evaluate our approach through several experiments and conclude our work in Section 8.
89:6 A. Weichslgartner et al.
RELATED WORK
According to Ref. [32] , application mapping approaches for embedded multi-/many-cores can be classified as design-time mapping, (on-the-fly) run-time mapping, and hybrid (design-time analysis and then run-time use) mapping. In the following, we give a brief overview of the existing mapping approaches:
Design-time mapping approaches require a global view of the system for which application mapping is then optimized. While these approaches enable application execution with high predictability, support of varying sets of executed applications and/or unpredictable dynamic workload scenarios is not in their focus. In general, there are no strict requirements on the execution time of design-time approaches, so they can utilize well-known optimization techniques such as integer linear programming (ILP) [6], evolutionary algorithms (EAs) [7], simulated annealing (SA) [25], or divide-and-conquer [19].
Run-time mapping approaches use scalable run-time heuristics to determine an application mapping whenever the workload scenario of the system changes dynamically. However, they neglect or cannot guarantee the predictable execution of applications with (typically hard/soft) real-time requirements. In contrast to design-time mapping, the execution time and available power for determining a mapping are limited. In consequence, simple and fast heuristics such as nearest neighbor algorithms have been proposed here (e.g., Refs [5, 38]). The objectives for run-time optimization are typically soft real time (e.g., Ref. [3]), energy (e.g., Refs [8, 16]), or average speedup (e.g., Ref. [20]). In Ref. [16], an iterative online application mapping methodology for heterogeneous NoC architectures is proposed. After an initial greedy task-to-resource assignment, the mapping is optimized, and, afterward, it is checked whether all quality-of-service (QoS) requirements are met. If not, the mapping is marked as infeasible and feedback is given to the previous steps to remap the application. In contrast to this work, we propose to pre-define mapping classes at design time, which define the implementation, i.e., the task variant for a certain resource type.
Hybrid application mapping (HAM) attempts to combine the strengths of design-time and run-time mapping. Here, scenario-based (e.g., Ref. [33]) and multi-mode (e.g., Ref. [40]) embedded system design tries to optimize the mappings for different workload scenarios or execution modes at design time and then just applies them at run-time. Yet, considering all possible combinations of applications in different scenarios would result in a large number of mappings that need to be stored, as the number of combinations increases exponentially with the number of applications. To reduce this number of mappings, the authors in Ref. [27] propose to save only a "representative subset of scenarios for each cluster." For each application, two operating points (throughput-optimized and throughput under a certain energy budget) are stored after DSE. The RM then tries to detect a scenario at run-time and to customize and optimize the mapping accordingly. In contrast to this approach, we exploit the concept of composability to explore several mappings per application, which can be embedded at run-time with guaranteed upper bounds for the end-to-end latency and without the need for scenarios or any run-time optimization.
In Ref. [31] , a hybrid mapping methodology that determines energy and throughput optimized application mappings is proposed. Pareto-optimal mappings with iteratively increased hop distances between the tasks are generated at design time. At run-time, a heuristic selects a mapping based on the number of used processor tiles while only considering the maximal number of hops for the respective operating point. This approach is only viable when using a communication infrastructure, which provides dedicated point-to-point connections between all pairs of computational resources. This has the major advantage that the usage of such end-to-end connections results in fixed communication latencies between computational resources and, thus, supports the verification of real-time guarantees. However, implementing dedicated connections between all pairs of computational resources is not practicable and scalable in many-core systems with tens or even hundreds of PEs.
Refs [23, 44] present HAM approaches in which a design-time DSE generates operating points that are mapped onto bus-based MPSoCs during run-time by a light-weight multi-dimensional multiple-choice knapsack problem (MMKP) solver. Another approach for bus-based MPSoCs, which solves the MMKP heuristically during run-time by using Pareto-Algebra, is presented in Ref. [30]. Reference [18] proposes to explore Pareto-optimal schedules for data-flow modeled applications while a greedy run-time manager performs allocation and binding. As communication infrastructure, they assume NoC "point-to-point connections with fixed latency between tiles," and the real-time properties are assured by spatial isolation, i.e., exclusive tile usage by one application.
In fact, sophisticated NoC architectures multiplex multiple communication flows over shared resources, i.e., links [9] . They perform packet-switched routing by partitioning each communication into packets, which are then routed over shared links. While this enhances scalability, it makes it harder to give any guarantees regarding the communication latency as this requires a communication infrastructure with QoS guarantees. In order to give any QoS guarantee, each flow can only get a limited time budget of a multiplexing interval. There are different strategies to assign such budgets, e.g., priority-based [4] , global time division multiple access (TDMA) [12] , or weighted round robin [14] .
PRELIMINARIES
In the following, we introduce the required formal notations and models for applications as well as the MPSoC system architecture.
Application Model
In this work, we concentrate on periodic real-time applications (e.g., image/signal processing, control loops, streaming and multimedia applications). Such applications can typically be represented by acyclic, directed, bipartite task graphs. Figure 3(a) illustrates an example. A task graph is denoted by $G_A(V, E)$. The vertices $V = T \cup M$ are composed of the set of tasks $t \in T$, representing sequential code segments, and the set of messages $m \in M$, representing data exchanged between pairs of tasks. Consequently, tasks in $T$ are connected through directed edges in $E$ with messages in $M$ and vice versa, i.e., $E \subseteq (T \times M) \cup (M \times T)$.
Applications represented by task graphs shall be executed periodically once admitted with period $P$ and have to meet a certain deadline $\delta$. Furthermore, we assume that the period is at least as long as the deadline. Every message has a maximum data size $size(m)$ (i.e., payload), so that, together with the period, a minimum bandwidth requirement $bw(m)$ can be calculated. Each task is assumed to represent a sequentially executed code segment of an application; a worst-case execution time (WCET) $W(t, u)$ can be determined through WCET analysis for each task $t \in T$ on resource $u$. The determination of task WCETs itself is not in the focus of this work but can be derived by WCET analysis tools like aiT [11] or Chronos [21]. However, to prevent cache interferences in private PE caches when mapping different tasks to the same PE, cache partitioning, private scratchpads, or flushing caches after each scheduling interval (cf. Ref. [42]) may be considered.
System Architecture
The system architectures $G_{arch}(U, L)$ targeted by our approach are many-core systems, which consist of a set of heterogeneous PEs $u \in U$. The resource type $r \in R$ of a PE $u$ is specified by the function $type: U \rightarrow R$. A NoC is used as communication infrastructure where routers are connected with each other and to the PEs via links $l \in L$ to form a 2-dimensional mesh topology, as exemplified in Figure 3(b). Each link has a capacity $cap(l)$, which is proportional to the link width and the frequency $1/\tau$. Moreover, we concentrate on packet-based routing, where messages are partitioned into flits, which are transmitted one after the other over the infrastructure. Transmission happens from the sending to the receiving PE along a route of consecutive routers. The distance between two PEs $u_1$ and $u_2$ is determined by a hop count function $hops(u_1, u_2)$, i.e., the number of routers along the route.
APPLICATION MAPPING AND STATIC PERFORMANCE ANALYSIS
The worst-case end-to-end latency of an application depends on its mapping onto the available computing and communication resources. Using the proposed model, this is formulated as a mapping of the application graph $G_A(V, E)$ onto the architecture graph $G_{arch}(U, L)$ obtained by binding each task and routing each message: the binding $\beta: T \rightarrow U$ assigns each task to a PE, and the routing $\rho: M \rightarrow 2^L$ represents the routing of each message $m$ with sender $t_1$ and receiver $t_2$ over a set of connected links $L' \subseteq L$ that establish a path between PE $\beta(t_1)$ and PE $\beta(t_2)$.
An example mapping of the introduced task graph $G_A(V, E)$ from Figure 3(a) is shown in Figure 3(c).
For a mapping to be feasible, it must be guaranteed that the end-to-end latency for executing an application does not exceed its deadline $\delta$. The worst-case end-to-end latency of the application depends on the critical path of the mapped task graph. For determining the critical path, we calculate the end-to-end latency for each path of $G_A(V, E)$ by summing up the worst-case execution latencies $TL$ of all tasks in the path and the worst-case communication latencies $CL$ of all messages in the path. The worst-case end-to-end latency of a path $path$ for a given binding $\beta$ and routing $\rho$ may then be calculated according to
$$L(path, \beta, \rho) = \sum_{t \in path \cap T} TL(t, \beta(t)) + \sum_{m \in path \cap M} CL(m, \rho(m)). \tag{1}$$
The worst-case end-to-end latency is then the latency of the path with the highest worst-case end-to-end latency (i.e., the critical path):
$$L(\beta, \rho) = \max_{path \text{ of } G_A} L(path, \beta, \rho). \tag{2}$$
Figure 3(c) presents an example where $G_A(V, E)$ includes two paths from the source task $t_0$ to the sink task $t_3$. One path is $(t_0, m_0, t_1, m_2, t_3)$, and the other path is $(t_0, m_1, t_2, m_3, t_3)$. In the given mapping, $t_0$ and $t_1$ are mapped together on one PE so that $m_0$ does not have to be routed over the NoC but can be exchanged via local memory. Note that the resulting delay for doing so has to be already included in the WCET analysis.
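The critical-path computation for this example can be sketched as follows. The graph structure is taken from the two paths named above; the latency values in `TL` and `CL` are purely hypothetical placeholders (with `m0` set to zero because it is exchanged locally), and the function names are illustrative only.

```python
from typing import Dict, List

# Successor lists of the bipartite task graph of Figure 3(a).
SUCC: Dict[str, List[str]] = {
    "t0": ["m0", "m1"], "m0": ["t1"], "m1": ["t2"],
    "t1": ["m2"], "t2": ["m3"], "m2": ["t3"], "m3": ["t3"], "t3": [],
}

def all_paths(succ: Dict[str, List[str]], node: str, sink: str) -> List[List[str]]:
    """Enumerate all paths from node to sink in an acyclic graph."""
    if node == sink:
        return [[node]]
    return [[node] + rest for nxt in succ[node] for rest in all_paths(succ, nxt, sink)]

def path_latency(path: List[str], tl: Dict[str, int], cl: Dict[str, int]) -> int:
    """Sum task latencies and message latencies along one path."""
    return sum(tl[v] if v in tl else cl[v] for v in path)

def worst_case_latency(succ, source, sink, tl, cl) -> int:
    """Latency of the critical path, i.e., the maximum over all paths."""
    return max(path_latency(p, tl, cl) for p in all_paths(succ, source, sink))

# Hypothetical worst-case latencies (arbitrary time units).
TL = {"t0": 10, "t1": 20, "t2": 15, "t3": 5}
CL = {"m0": 0, "m1": 8, "m2": 6, "m3": 7}

paths = all_paths(SUCC, "t0", "t3")
latency = worst_case_latency(SUCC, "t0", "t3", TL, CL)
```

With these assumed numbers, the path via $t_2$ is the critical one.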
When permitting to execute the application on resources that are potentially shared with other applications, they may interfere and affect each other's timing behavior. For being able to bound this interference, and thus being able to calculate T L and CL without knowing whether and how other applications share resources, composability is required. In the following, we describe techniques for composable communication scheduling and composable task scheduling and their respective worst-case analysis as used in this work. Both techniques are based on the idea of reserving periodically available time slots for data transmission and task scheduling, respectively. The interesting aspect is that the worst-case execution and communication latencies obtained here can be composable even during run-time mapping of new tasks into the system if just certain mapping constraints are satisfied. This will be explained in detail in Section 6.
Composable Communication Scheduling
In order to provide the desired composability, the NoC architecture has to fulfill certain criteria and has to show a predictable timing behavior. One NoC architecture that adheres to these requirements is proposed in Ref. [14]. This architecture uses wormhole switching and the concept of virtual channels (VCs) to ensure a high throughput and low latencies. Further, guaranteed service (GS) connections supporting QoS can be set up, and physical links are arbitrated in a weighted round robin fashion for transmitting the flits of the different messages routed over them. A number of $SL_{max}$ time slots (one slot for transmitting one flit) is periodically available for the overall transmission, out of which a budget of $SL(m) \le SL_{max}$ time slots can be reserved for the transmission of a message $m$. Note that, in contrast to a global synchronous TDMA as presented in Ref. [12], only the number and not the position of the allocated time slots is fixed. This increases the utilization while still allowing upper bounds for throughput and worst-case latency to be computed.
The worst-case communication latency $CL(m, \rho(m))$ for transmitting message $m \in M$ depends on the number of flits $flits(m)$, the length of the route $\rho(m)$, and the number of reserved time slots $SL(m)$, and can be calculated as follows [14]:
In Equation (3a), Δ Rf denotes the delay for routing one flit in one router with the frequency $f$. Once the routing decision has been made in one router, one flit per clock cycle ($f^{-1}$) can be transmitted. Figure 4 illustrates the best case and the worst case for communication latencies with examples. The best case corresponds to the case without any interference. The message can utilize the whole scheduling interval $SL_{max}$, and the transmission delay only depends on the message size $flits$, the hop distance $hops$, and the router delay Δ Rf (see first summand in Equation (3a)). The second summand in Equation (3b) gives the maximal delay possible by interference with other messages. This interference can happen in $\lceil flits(m) / SL(m) \rceil - 1$ arbitration intervals and depends on the number of hops.
Composable Task Scheduling
Composability at PE level is achieved by temporally isolating the execution of tasks on it. To this end, the processing time on a PE is partitioned into service intervals of fixed duration. Within a service interval, tasks are scheduled exclusively. We consider service intervals of equal length $SI$ on each PE type $type(u)$. Transition between the scheduling of two tasks, i.e., task switching, takes place after each service interval $SI$. This incurs an operating system (OS) scheduling overhead after each interval, denoted by $SI_{os}$. Service intervals are made available to the tasks in the PE's waiting queue in a round robin fashion. Each task is assigned a fixed priority that determines the order in which intervals are allocated to tasks by the scheduler. This scheduling strategy is illustrated in Figure 5 for two tasks $t_1$ and $t_2$ in the ready queue of a PE. The priority of task $t_1$ is higher than the priority of task $t_2$, i.e., $pr(t_1) < pr(t_2)$ (the lower the value, the higher the priority). So, task $t_1$ is assigned the first service interval. Allocation then proceeds by means of round robin scheduling. Based on this scheduling mechanism, we next develop a performance analysis method to derive the worst-case execution latency of a task.
Fig. 4. [...] − 1) arbitration windows, the flits can be delayed by $(SL_{max} - SL(m_1)) \cdot \tau$. Note that the position of the time slot can vary in each hop, while the number of time slots is assumed fixed per message. Transmission can also use more time slots than actually reserved given there are unused time slots available. However, it is always guaranteed that at least the reserved time slots are available within the period.
As initially stated, it is our goal to achieve a high utilization of the given many-core system despite having to isolate applications from each other in order to satisfy real-time constraints. The worst-case execution latency of a task consists of two parts: The first part is the worst-case execution time of the task without interference, $TL_{exec}(t, \beta(t))$. The proposed analysis also considers an upper bound on the number of tasks that could share the same PE, denoted by $K_{max}$. Therefore, the second part is the worst-case interference $TL_{inter}(t, \beta(t))$ from other tasks that could possibly be mapped and scheduled on the same PE. Thus, the total worst-case execution latency $TL(t, \beta(t))$ of a task is given by
$$TL(t, \beta(t)) = TL_{exec}(t, \beta(t)) + TL_{inter}(t, \beta(t)). \tag{4}$$
Fig. 6. Example of the two cases of Equation (7). The priorities of the tasks are annotated in circles.
As each task is executed in service intervals and considered to finish at the end of an interval, the value of $TL_{exec}(t, \beta(t))$ is not necessarily equal to the WCET $W(t, \beta(t))$, but is given by
$$TL_{exec}(t, \beta(t)) = \left\lceil \frac{W(t, \beta(t))}{SI} \right\rceil \cdot (SI + SI_{os}). \tag{5}$$
The above expression is obtained from the fact that each task has to complete $\lceil W(t, \beta(t)) / SI \rceil$ service intervals to finish its execution. Moreover, each task execution incurs the OS scheduling overhead $SI_{os}$ every time there is a switch into its service interval from the service interval of the previously scheduled task (cf. Figure 5). The worst-case interference from other tasks consists of two components, the worst-case interference before, $TL^{b}_{inter}(t, \beta(t))$, and after, $TL^{a}_{inter}(t, \beta(t))$, the first service interval of $t$, and is given by
$$TL_{inter}(t, \beta(t)) = TL^{b}_{inter}(t, \beta(t)) + TL^{a}_{inter}(t, \beta(t)). \tag{6}$$
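Recall that $K_{max}$ is the maximum overall number of tasks allowed to be mapped onto a PE, and let $pred(t)$ be the predecessor of task $t$ in the currently analyzed path of the task graph. Then, the worst-case interference before the first service interval is formulated as follows:
$$TL^{b}_{inter}(t, \beta(t)) = \begin{cases} (pr(t) - pr(pred(t))) \cdot (SI + SI_{os}) & \text{if } \beta(t) = \beta(pred(t)), \\ (K_{max} - 1) \cdot (SI + SI_{os}) & \text{otherwise.} \end{cases} \tag{7}$$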
If task $pred(t)$ is mapped on the same PE as task $t$, data is exchanged locally and the number of time intervals with length $(SI + SI_{os})$ that $t$ has to wait is $pr(t) - pr(pred(t))$, as exemplified in Figure 6(a). On the other hand, if $pred(t)$ is mapped onto another PE, then the maximum interference is due to the service intervals of the possible number of other tasks $(K_{max} - 1)$ on the PE (see Figure 6(b)). This is because, in the worst case, the message from $pred(t)$ would have to wait until the service intervals of all other tasks finish. Worst-case interference after the first service interval is given by
$$TL^{a}_{inter}(t, \beta(t)) = \left( \left\lceil \frac{W(t, \beta(t))}{SI} \right\rceil - 1 \right) \cdot tl_{inter}, \tag{8}$$
where $tl_{inter} = (SI + SI_{os}) \times (K_{max} - 1)$ is the maximal total interference from all the remaining possible tasks between two consecutive service intervals of task $t$. The first part of the equation gives the number of service intervals of task $t$ between which interference could happen (analogous to Equation (5)). The worst-case execution latency of task $t$ can then be calculated by inserting Equations (5)-(8) into Equation (4).
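The task-level analysis can be sketched as a small calculation. The sketch assumes that the number of service intervals needed by a task is its WCET divided by $SI$ and rounded up, that every interval costs $SI + SI_{os}$, and that the two interference cases are exactly the ones described in the text; the function names and the numeric values are hypothetical.

```python
import math

def tl_exec(wcet: int, si: int, si_os: int) -> int:
    """Execution part: each needed service interval costs SI plus the OS overhead."""
    return math.ceil(wcet / si) * (si + si_os)

def tl_inter_before(pr_t: int, pr_pred: int, same_pe: bool,
                    si: int, si_os: int, k_max: int) -> int:
    """Interference before the first service interval of the task."""
    if same_pe:
        # Predecessor on the same PE: wait according to the priority distance.
        return (pr_t - pr_pred) * (si + si_os)
    # Predecessor on another PE: all other possible tasks may run first.
    return (k_max - 1) * (si + si_os)

def tl_inter_after(wcet: int, si: int, si_os: int, k_max: int) -> int:
    """Interference between consecutive service intervals of the task."""
    tl_inter = (si + si_os) * (k_max - 1)
    return (math.ceil(wcet / si) - 1) * tl_inter

def task_latency(wcet: int, pr_t: int, pr_pred: int, same_pe: bool,
                 si: int, si_os: int, k_max: int) -> int:
    """Total worst-case execution latency: execution plus both interference parts."""
    return (tl_exec(wcet, si, si_os)
            + tl_inter_before(pr_t, pr_pred, same_pe, si, si_os, k_max)
            + tl_inter_after(wcet, si, si_os, k_max))

# Hypothetical values in microseconds: WCET 250, SI 100, SI_os 5, K_max 3.
remote = task_latency(250, pr_t=2, pr_pred=1, same_pe=False, si=100, si_os=5, k_max=3)
local = task_latency(250, pr_t=2, pr_pred=1, same_pe=True, si=100, si_os=5, k_max=3)
```

In this assumed setting, the task needs three service intervals, so interference from up to two other tasks can occur twice between its own intervals. The bound does not depend on which tasks actually share the PE, only on $K_{max}$, which is what makes the analysis composable.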
Fig. 7. Flowchart of DSE using EA, including the iterative process of exploration and evaluation.
DESIGN SPACE EXPLORATION
Due to our composability assumptions and using the performance analysis techniques presented in Section 4.2, a DSE for finding Pareto-optimal mappings is applied to each application individually. Here, multiple mapping candidates are generated per application with verified real-time properties and optimized objectives. The gain of this separation is that the complexity of analyzing a single application is dramatically reduced over the exploration of a complete system with various application mixes.
To efficiently explore various mappings in our DSE, we apply an approach that combines an EA with a Pseudo-Boolean solver [22]. The EA constitutes an iterative optimization process: In the exploration phase, a set of new application mappings is generated by applying genetic operators; and in the evaluation phase, this set is evaluated by using analytical models (e.g., for timing, the one presented in Section 4). Both phases are carried out iteratively to obtain increasingly better solutions over time. In each iteration, the archive of the best non-dominated mappings explored so far is updated, and the archive is returned once the DSE terminates (see Figure 7). Again, to enable the individual exploration of classes of optimal application mappings by means of a formal analysis, the concept of composability is essential. Composability ensures that the addition of a new application to the mix only has a bounded effect on the performance values obtained for each application. Each application can therefore be analyzed completely in isolation, without considering the execution behavior of any other application, which would fail for complexity reasons.
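The archive update performed in each iteration can be sketched as follows. This is a generic non-dominated filter under the assumption that all objectives are minimized; the objective tuple (number of PEs, energy, latency) follows the quality numbers named in Figure 2, while the function names and the sample values are hypothetical and not taken from the DSE framework used in this article.

```python
from typing import List, Tuple

Objectives = Tuple[int, ...]  # e.g., (#PE, energy, latency), all to be minimized

def dominates(a: Objectives, b: Objectives) -> bool:
    """a dominates b if it is no worse in every objective and better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def update_archive(archive: List[Objectives], candidate: Objectives) -> List[Objectives]:
    """Insert candidate unless it is dominated or a duplicate; drop what it dominates."""
    if any(dominates(kept, candidate) or kept == candidate for kept in archive):
        return archive
    return [kept for kept in archive if not dominates(candidate, kept)] + [candidate]

archive: List[Objectives] = []
for point in [(3, 12, 45), (2, 10, 40), (3, 7, 30), (3, 7, 30)]:
    archive = update_archive(archive, point)
```

After processing the four sample points, only the two mutually non-dominated points remain in the archive.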
Generation of Feasible Application Mappings
We apply the composable scheduling techniques presented in the last section. This means that an application mapping during DSE is generated by (a) determining a binding $\beta(t)$ of each task $t \in T$ and (b) determining a routing $\rho(m)$ of each message $m \in M$. We consider deterministic xy-routing for the messages in the NoC. The routing of each message therefore does not have to be explored explicitly, as proposed in Ref. [13], because it follows implicitly from the binding of the sending and receiving tasks of message $m$. In addition, a priority $pr(t)$ has to be assigned to each task for scheduling tasks on the same PE, and $SL(m)$ has to be generated for the transmission of each message.
In our approach, unique priorities for each task mapped to the same PE are assigned in the exploration phase. In the evaluation phase, it is checked if the assignment is feasible. Through a depth-first search, we identify if a task is a predecessor of another task on the same PE and change the priorities if required.
Also, $SL(m)$ has to be explored per message $m$. To satisfy the minimal bandwidth requirements of the message, $SL(m)$ has to be at least a minimal number of time slots. By assigning more time slots than this minimum, the worst-case end-to-end latency $L(\beta, \rho)$ may be reduced. Therefore, the exploration interval of $SL(m)$ is defined as follows:
Only feasible application mappings are returned in the end. More formally, a mapping is feasible if the following conditions hold:
- First, the worst-case end-to-end latency has to stay within the deadline:
- Second, no PE is overutilized, meaning that the load induced by all tasks mapped onto the same PE stays below 100%:
- Finally, no communication link is overutilized. This means that the accumulated $SL(m)$ of all messages that are sent over the same route (same source PE and target PE) must not exceed the overall available budget of time slots $SL_{max}$. Let $M_\rho = \{m' \mid m, m' \in M : \rho(m) = \rho(m')\}$ be the set of messages that are sent over the same route. This constraint is then formulated as follows:
An example of such an infeasible mapping due to overutilization of a link is illustrated in Figure 8 (a).
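The three feasibility conditions can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `is_feasible`, the per-task load fractions in `task_load`, and the representation of a route as a hashable key are hypothetical and not taken from the article.

```python
from collections import defaultdict

def is_feasible(latency, deadline, task_load, binding, slots, route_of, sl_max):
    """Return True if the three feasibility conditions hold.

    latency, deadline: worst-case end-to-end latency and deadline (same unit)
    task_load: task -> load fraction induced on its PE (hypothetical input)
    binding: task -> PE
    slots: message -> SL(m)
    route_of: message -> hashable route, e.g. (source PE, target PE)
    sl_max: available budget of time slots SL_max
    """
    # Condition 1: the latency has to stay within the deadline.
    if latency > deadline:
        return False
    # Condition 2: no PE may be loaded beyond 100%.
    pe_load = defaultdict(float)
    for task, pe in binding.items():
        pe_load[pe] += task_load[task]
    if any(load > 1.0 for load in pe_load.values()):
        return False
    # Condition 3: messages sharing a route must fit into the slot budget.
    route_slots = defaultdict(int)
    for msg, sl in slots.items():
        route_slots[route_of[msg]] += sl
    return all(total <= sl_max for total in route_slots.values())
```

A mapping that passes all three checks would be kept as a candidate; any single violation, such as the overutilized link of Figure 8(a), rejects it.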
Optimization Objectives and Evaluation
Our DSE considers multiple objectives related to non-functional properties. As modern embedded systems have strict energy budgets, it is essential to minimize the energy consumption of application mappings. Therefore, we include energy consumption minimization as one objective in the DSE (Objective I). The maximal energy consumption $E_{OV}$ to be minimized is the sum of the energy consumed by the PEs, $E_{PE}$, and the energy used to route the messages over the NoC, $E_{NoC}$:
$E_{OV} = E_{PE} + E_{NoC}$
The maximal energy consumed in the PE is the product of the WCET of the task on the mapped PE and the maximal power consumption $power(r)$ for the given resource type $r$, which is derived by the function $type(u)$. The energy consumed by the communication infrastructure for a message $m$ is directly proportional to the number of hops and used links. We derive $E_{NoC}$ from the NoC energy model in Refs. [17, 43]:
In Equation (14), $E_{Sbit}$ is the energy consumed per bit inside the router, $E_{Lbit}$ is the energy consumed on a link, and $size(m)$ is the size of the message in bits. Contrary to conventional exploration, the outcome of the DSE is not used to encode a concrete task and communication assignment to be selected by the RM, but rather a class of mappings. More details are given in Section 6. To find mappings that allow greater run-time flexibility, we therefore also include objectives that quantify the resource overhead and flexibility of an application mapping as follows:
The overall number of messages routed over the NoC should be minimized (Objective II). The reason is that, if two communicating tasks are mapped to the same PE, they can exchange their data through local memory and, hence, $\rho = \emptyset$. This does not burden the NoC infrastructure. Consequently, congestion on NoC links is reduced, making it more likely that this operating point can be mapped at run-time.
Two further objectives are the maximization of the average and the minimal hop distances (Objectives III and IV). Here again, the idea is to increase flexibility by giving preference to routings that are more likely to be feasible at run-time: the longer the routes are allowed to be, the fewer mapping restrictions exist.
As the targeted architecture is heterogeneous, different PE types may be selected for the execution of the tasks. Only minimizing the overall number of allocated PEs without differentiating between their resource types would result in suboptimal mappings, e.g., by always using the same PE type, such as a powerful core that can execute many tasks within the application's period. If all instances of this PE type are occupied at run-time, no more operating points can be embedded in the system. To prevent this, we minimize the number of allocated PEs per resource type to generate diverse operating points (Objective V).
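To make the five objectives concrete, the following Python sketch evaluates them for a single mapping. It is not the article's evaluator. Tiles are assumed to be `(x, y)` coordinates with xy-routing, and the NoC energy term assumes that a message crossing `h` links passes `h + 1` routers, which may differ from the article's Equation (14). The function names and dictionary layouts are hypothetical.

```python
from collections import Counter

E_SBIT = 0.98e-9    # J per bit in a router (value from the experiments)
E_LBIT = 0.0936e-9  # J per bit on a link (value from the experiments)

def hop_distance(src, dst):
    """Number of links on an xy-route between two (x, y) tiles."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])

def objectives(binding, pe_type, wcet, power, messages, size):
    """Return (energy, routed messages, avg hops, min hops, PEs per type).

    binding: task -> (x, y) tile; pe_type: tile -> resource type
    wcet: (task, type) -> seconds; power: type -> watts
    messages: message -> (sender task, receiver task); size: message -> bits
    """
    # Objective I, PE part: WCET times maximal power of the bound PE type.
    e_pe = sum(wcet[t, pe_type[u]] * power[pe_type[u]] for t, u in binding.items())
    # Only messages between different PEs are routed over the NoC.
    hops = {m: hop_distance(binding[s], binding[d])
            for m, (s, d) in messages.items() if binding[s] != binding[d]}
    # ASSUMPTION: h links imply h + 1 routers; Equation (14) may differ.
    e_noc = sum(size[m] * ((h + 1) * E_SBIT + h * E_LBIT) for m, h in hops.items())
    avg_hops = sum(hops.values()) / len(hops) if hops else 0.0  # Objective III
    min_hops = min(hops.values()) if hops else 0                # Objective IV
    pes_per_type = Counter(pe_type[u] for u in set(binding.values()))  # Objective V
    return e_pe + e_noc, len(hops), avg_hops, min_hops, dict(pes_per_type)
```

In an optimizer that minimizes every entry, the two hop-distance values would be negated because Objectives III and IV are maximized.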
Our DSE therefore performs a multi-objective optimization with five objectives overall. This results not in a single optimal mapping, but in multiple Pareto-optimal application mappings that trade off the different objectives. Such a Pareto front is illustrated in Figure 9 for two objectives.
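The notion of a non-dominated archive can be illustrated with a short Python sketch. It assumes that all objectives are expressed as values to be minimized (maximized objectives would be negated first); the helper names `dominates` and `update_archive` are hypothetical.

```python
def dominates(a, b):
    """True if objective vector a dominates b (all objectives minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def update_archive(archive, candidate):
    """Return the archive of non-dominated vectors after offering candidate."""
    # Reject the candidate if an archived vector is at least as good everywhere.
    if any(dominates(kept, candidate) or kept == candidate for kept in archive):
        return archive
    # Otherwise drop everything the candidate dominates and add it.
    return [kept for kept in archive if not dominates(candidate, kept)] + [candidate]
```

For example, offering the vectors (5, 3), (4, 4), (6, 2), (4, 3), and (7, 7) one after another leaves only (6, 2) and (4, 3) in the archive.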
RUN-TIME CONSTRAINT SOLVING
The Pareto-optimal mappings generated by the DSE are handed over to the RM. Each DSE mapping corresponds to a fixed assignment of tasks to concrete resources in the architecture. However, in architectures with a multitude of equal resources, numerous equivalent mappings may exist. Therefore, we transform the application mapping (given by $\beta$ and $\rho$) into a constraint graph $G_C(V_C, E_C)$, as exemplified in Figure 9 (right). This graph represents a full class of symmetrical feasible mappings within the NoC that are all equivalent to the application mapping that was actually determined and analyzed during DSE. Consequently, all analyzed properties, particularly real-time properties, also apply to these symmetrical mappings.
Constraint Graphs
As illustrated in Figure 9, the vertices $V_C = T_C \cup M_C$ of a constraint graph are composed of task clusters belonging to the set $T_C$ and message clusters belonging to the set $M_C$.
Each task cluster $C \in T_C$ represents a set of tasks that are mapped to the same PE, so that $\forall t, t' \in C : \beta_{DSE}(t) = \beta_{DSE}(t')$.
Each task cluster is annotated with $type_{CG}(C) \in R$, specifying the PE type onto which the tasks are mapped, and furthermore with the load $load(C)$ induced by the tasks on this PE:
Also, the scheduling information is annotated to the task cluster, i.e., the maximum number $K_{max}(C)$ of tasks allowed on the PE for scheduling and the priorities $pr(t), \forall t \in C$, of all its tasks. Each message cluster $B \in M_C$ represents the set of all messages that are routed along the same path in the NoC between two such task clusters, so that $\forall m, m' \in B : \rho(m) = \rho(m')$. Each message cluster is also annotated with the routing information, i.e., the accumulated $SL(B) = \sum_{m \in B} SL(m)$ and the hop distance $hop(B) = hops(\rho(m))$ between the sending and the receiving task clusters of messages $m \in B$.
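The clustering step can be sketched as follows. This is a minimal illustration and not the article's implementation. It assumes that the cluster load is the sum of hypothetical per-task load fractions, that messages between the same pair of PEs share one xy-route, and it omits the $K_{max}$ and priority annotations for brevity. All names are illustrative.

```python
from collections import defaultdict

def build_constraint_graph(binding, pe_type, load, messages, slots, hops):
    """Derive task clusters and message clusters from a DSE mapping.

    binding: task -> PE; pe_type: PE -> resource type
    load: task -> load fraction (assumed input)
    messages: message -> (sender task, receiver task)
    slots: message -> SL(m); hops: message -> hop count of its route
    """
    # Tasks bound to the same PE form one task cluster.
    by_pe = defaultdict(set)
    for task, pe in binding.items():
        by_pe[pe].add(task)
    task_clusters = {
        frozenset(tasks): {"type": pe_type[pe], "load": sum(load[t] for t in tasks)}
        for pe, tasks in by_pe.items()
    }
    cluster_of = {t: frozenset(tasks) for tasks in by_pe.values() for t in tasks}
    # Messages between the same pair of clusters form one message cluster.
    msg_clusters = {}
    for m, (src, dst) in messages.items():
        c1, c2 = cluster_of[src], cluster_of[dst]
        if c1 == c2:
            continue  # exchanged through local memory, not routed over the NoC
        entry = msg_clusters.setdefault(
            (c1, c2), {"messages": set(), "SL": 0, "hop": hops[m]})
        entry["messages"].add(m)
        entry["SL"] += slots[m]  # accumulated SL(B)
    return task_clusters, msg_clusters
```

The concrete PE identifiers no longer appear in the result, which is what allows the same graph to stand for a whole class of symmetrical mappings.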
Serializing Operating Points
To hand over the set of operating points to the RM, the data has to be serialized. This includes the constraint graph as well as the values for the explored objectives. The memory requirement for these tuples can be calculated as follows:
where $size_{CG}$ is the memory requirement of the serialized constraint graph, $n_{obj}$ is the number of optimized objectives, and $size_{obj}$ is the memory requirement of one objective value. To serialize the constraint graph, the graph needs to be traversed and all task clusters, all message clusters, and all edges have to be serialized.
Run-Time Mapping of Constraint Graphs
The main task of the RM is to select a suitable operating point of the application that should be executed and to perform the actual run-time application mapping. In principle, the RM can select any operating point out of all found points that fulfills the application's requirements, e.g., performance. To fulfill system requirements, e.g., utilization, the RM may also re-map an already mapped application to another operating point. The step of run-time application mapping itself is to find a concrete application mapping based on the notion of a constraint graph $G_C(V_C, E_C)$ and the architecture $G_{arch}(U, L)$ by (a) binding each task cluster to a PE, i.e., $\beta_{CG}: T_C \to U$, and (b) routing each message cluster over a route of consecutive links, i.e., $\rho_{CG}: M_C \to 2^L$. Mapping the constraint graph $G_C(V_C, E_C)$ instead of the task graph $G_A(V, E)$ onto the architecture has several advantages. First, as tasks are clustered into task clusters and messages into message clusters, it is evident that $|T_C| \le |T|$ and $|M_C| \le |M|$. Consequently, the graph that needs to be mapped during run-time is no larger than the original task graph. Second, the constraint graph is a very compact representation of possibly multiple symmetrical run-time mappings. This basic idea is illustrated in Figure 10, where one constraint graph can be feasibly mapped in multiple ways while guaranteeing the analyzed quality bounds. Third, the time-consuming analysis is performed at design time. The analyzed properties apply to a mapped constraint graph due to the composability of our approach.
A feasible mapping of a constraint graph has to satisfy the following constraints: First, the routings of all message clusters $B \in M_C$ have to fulfill constraints C.1 and C.2:
C.1 Routing $\rho_{CG}(B)$ has to provide a connected route of links between $\beta_{CG}(C_1)$ and $\beta_{CG}(C_2)$, i.e., the target PEs of its sending and receiving task clusters, with $(C_1, B), (B, C_2) \in E_C$. The hop count of this route must not exceed the given maximal hop count associated with the message cluster:
C.2 Let $M'_C$ denote the set of all already routed message clusters in the system. The accumulated $SL(B)$ of the messages routed over each link $l \in \rho_{CG}(B)$ must not exceed the maximal number of time slots $SL_{max}$:
Figure 8(b) gives an example where this constraint is violated, resulting in an infeasible run-time mapping.
Second, the bindings of all task clusters $C \in T_C$ have to fulfill constraints C.3-C.5:
C.3 The resource type of the target PE has to be the same as is required for the task cluster:
C.4 Let $T'_C$ denote the set of task clusters that are already bound. The load induced by all task clusters that are mapped on a target PE $\beta_{CG}(C)$ together with the load of the new task cluster $C$ must not exceed 100%:
C.5 The overall number of tasks bound on a target PE must not exceed the maximal number $K_{max}$ allowed for feasibly scheduling any task cluster on the PE according to its performance analysis results:
In the case of spatial isolation, Constraint C.3 and the absence of other tasks on $\beta_{CG}(C)$ would be sufficient to guarantee the worst-case latency (see Equation (4)), as only tasks of one task cluster would be mapped together onto the same PE. However, when applying temporal isolation, all constraints need to hold. Figure 8(c) exemplifies a feasible run-time mapping that fulfills all mentioned constraints. If all constraints are fulfilled but the priority ranges of the tasks in $C$ and in an already bound task cluster $C'$ overlap, the priorities of $C$ are shifted after mapping to keep them unique on the PE. An example of this priority assignment and Constraint C.5 can be found in Figure 11.
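The binding constraints C.3 to C.5 can be checked with a few lines of Python. The sketch below is illustrative only: the function `can_bind` and the dictionary keys `type`, `load`, `tasks`, and `k_max` are hypothetical, and the priority shifting step is not modeled.

```python
def can_bind(cluster, pe, pe_type, bound):
    """Check Constraints C.3 to C.5 for binding `cluster` to `pe`.

    cluster: dict with keys "type", "load", "tasks", "k_max"
    pe_type: PE -> resource type
    bound: list of cluster dicts that are already bound to `pe`
    """
    # C.3: the PE has to offer the resource type the cluster was analyzed for.
    if cluster["type"] != pe_type[pe]:
        return False
    everyone = bound + [cluster]
    # C.4: the accumulated load on the PE must not exceed 100%.
    if sum(c["load"] for c in everyone) > 1.0:
        return False
    # C.5: the task count must respect K_max of every cluster on the PE.
    n_tasks = sum(len(c["tasks"]) for c in everyone)
    return all(n_tasks <= c["k_max"] for c in everyone)
```

With the values of the Figure 11 example (a two-task cluster with $K_{max} = 4$ joining two single-task clusters with $K_{max} = 5$), the check succeeds with four tasks on the PE, and any further task is then rejected by C.5. A spatially isolated variant would instead simply require `bound` to be empty.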
Backtracking Algorithm
To find a mapping that satisfies all five constraints for a given constraint graph, i.e., to solve the corresponding constraint satisfaction problem, we propose a backtracking algorithm as shown in Algorithm 1, which is an extension of the algorithm presented in Ref. [34]. This algorithm is executed for each application that should be started on the system. It starts with $A = \emptyset$ and then searches recursively for a valid variable assignment for $A$. As the backtracking algorithm would search exhaustively through all possible variable assignments, a timeout can be chosen to bound the maximal run-time of the algorithm. This condition is checked in line 6, and an empty set is returned if the maximal time has elapsed since the initial start of the backtracking algorithm for one operating point. In line 9, the next task cluster to map is selected, and in line 10 the domain $D_C$ containing all target PEs that fulfill C.1 and C.3 is created. In lines 11 to 17, the remaining constraints are checked when trying to map $C$ to the selected PE $u$. We use xy-routing to deterministically obtain routes $L_B$ for all message clusters that are sent or received by $C$ and whose communication partners are already mapped.
Fig. 11. Example of a binding of a task cluster $C = \{t_2, t_3\}$ to $u$. The maximal number of tasks allowed on a PE for scheduling $C$ is $K_{max}(C) = 4$. The tasks from task clusters $C' = \{t_0\}$ and $C'' = \{t_1\}$, $C', C'' \in T'_C$, are already present at $u$ and support a maximum task number of $K_{max}(C') = K_{max}(C'') = 5$. After mapping $C$, no further tasks can be mapped onto $u$ due to Constraint C.5. The priorities (annotated in circles) of the tasks in $C$ are updated to 3 and 5 in order to keep the priorities on $u$ unique.
Fig. 12. Success ratio of mapping operating points obtained for the E3S benchmarks to a 5×5 NoC for different utilization classes. Success ratios for resource management based on resource availability and for resource management using a constraint solver [34] are compared.
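The search described above can be sketched as a small backtracking routine in Python. This is a simplified illustration and not a reproduction of Algorithm 1: it re-checks all constraints on every partial assignment instead of building a filtered domain $D_C$, it represents PEs as `(x, y)` tiles, and all function names, dictionary keys, and the default timeout are assumptions.

```python
import time

def xy_route(src, dst):
    """Links of the xy-route from tile src to tile dst: first along x, then y."""
    links, (x, y) = [], src
    while x != dst[0]:
        nx = x + (1 if dst[0] > x else -1)
        links.append(((x, y), (nx, y)))
        x = nx
    while y != dst[1]:
        ny = y + (1 if dst[1] > y else -1)
        links.append(((x, y), (x, ny)))
        y = ny
    return links

def consistent(assign, clusters, msgs, pe_type, sl_max, used_slots, bound):
    """Check C.1 to C.5 for a partial assignment cluster name -> tile."""
    per_pe = {pe: list(cs) for pe, cs in bound.items()}
    for name, pe in assign.items():
        if clusters[name]["type"] != pe_type[pe]:              # C.3
            return False
        per_pe.setdefault(pe, []).append(clusters[name])
    for group in per_pe.values():
        if sum(c["load"] for c in group) > 1.0:                # C.4
            return False
        n_tasks = sum(c["n_tasks"] for c in group)
        if any(n_tasks > c["k_max"] for c in group):           # C.5
            return False
    slots = dict(used_slots)
    for src, dst, sl, max_hop in msgs:
        if src in assign and dst in assign:
            route = xy_route(assign[src], assign[dst])
            if len(route) > max_hop:                           # C.1
                return False
            for link in route:
                slots[link] = slots.get(link, 0) + sl
    return all(v <= sl_max for v in slots.values())            # C.2

def map_constraint_graph(clusters, msgs, pe_type, sl_max,
                         used_slots=None, bound=None, timeout=0.01):
    """Return a binding cluster name -> tile, or None if none is found in time."""
    used_slots, bound = used_slots or {}, bound or {}
    deadline = time.monotonic() + timeout
    order = list(clusters)

    def backtrack(assign):
        if len(assign) == len(order):
            return dict(assign)
        name = order[len(assign)]
        for pe in pe_type:
            if time.monotonic() > deadline:   # timeout: give up on this graph
                return None
            assign[name] = pe
            if consistent(assign, clusters, msgs, pe_type, sl_max, used_slots, bound):
                result = backtrack(assign)
                if result is not None:
                    return result
            del assign[name]
        return None

    return backtrack({})
```

A result of `None` means either that no feasible mapping exists or that the timeout expired first, so a caller would typically try the next operating point in that case.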
EXPERIMENTS
We use task graphs from the Embedded System Synthesis Benchmarks Suite (E3S) [10] for our experiments. These applications stem from various embedded domains such as automotive (18 tasks), telecommunication (14 tasks), consumer (11 tasks), and networking (7 tasks). The values for energy consumption, WCET of a task, and bandwidth requirements of messages reflect a realistic scenario of current embedded MPSoCs. We derived the energy consumption of each task on a certain PE from the E3S benchmark and the communication energy consumption from the model proposed in Refs. [17, 43] with a link length of 2 mm (resulting in $E_{Lbit} = 0.0936$ nJ) and $E_{Sbit} = 0.98$ nJ (see Section 5.2). Furthermore, we selected a heterogeneous 6×6 NoC-based architecture consisting of three different processor types from Ref. [10], including an IBM PowerPC and variants of the AMD K6.
Considering Communication Constraints
In a first experiment, we evaluate the influence of the communication constraints, i.e., C.1-C.2, on finding feasible mappings. As exemplified in Figure 8, checking only the availability of the needed processing resources, e.g., as proposed in Refs. [29, 39, 45], or assuming only point-to-point connections [31], is not sufficient for a feasible mapping in a packet-switched NoC architecture. Indeed, it only satisfies C.3 and C.4 and neglects the other constraints. To visualize this, we tried, in 6,000 test cases, to map operating points from the above-mentioned E3S benchmark applications to a preoccupied system using Algorithm 1 without a timeout. As a result, Figure 12 shows the gap between only considering the resource availability (blue curve) and the actual feasibility considering the communication constraints C.1 and C.2 tested by the introduced constraint solver (red curve). The utilization classes on the x-axis denote the percentage of utilized computing resources before testing to add the new application. For example, zero represents a completely empty system, and the utilization class 10 includes systems where 1% to 10% of the PEs are utilized by previously mapped applications. The gray area between the two curves highlights the optimism introduced by a run-time system that relies only on computing resource availability, as in Ref. [31]. In the 40% utilization class, 39% of the applications could be mapped to the system when only considering resource availability, whereas deadline guarantees could be given for only 13%. All remaining ones miss deadlines because of communication latencies or are actually not mapped because of congested communication resources. Overall, this underlines the importance of considering communication and routing constraints in methodologies for application mapping on composable NoC-based MPSoCs with predictable execution times.
Temporal Isolation versus Spatial Isolation
By applying the EA-based DSE illustrated in Figure 7, we generated and evaluated a total of 200,000 mappings per application, resulting from a population size of 200 and 1,000 iterations. For each of these mappings, we conducted the performance analysis proposed in Section 4. This was done with the number of additional tasks set to $K_{max} - |C| = 4$, $SI = 50$ μs, and $SI_{os} = 10$ μs. As outlined in Section 5.2, the optimization criteria were minimizing (I) the energy consumption of each mapping, (II) the number of routed messages, and (V) the number of allocated PEs per resource type $r \in R$. Further criteria were maximizing (III) the average and (IV) the minimal hop distance in order to generate more flexible mappings (the larger the hop count of a message cluster, the less stringent Constraint C.1 becomes). Out of these 200,000 mappings, all Pareto-optimal solutions that do not violate the application deadline are stored as operating points together with the created constraint graphs and the values of the evaluated objectives (less than 100 points per application).
We then implemented an RM for mapping different run-time mixes of the benchmark applications, where the applications are mapped iteratively. The operating points of each application were sorted in increasing order of energy consumption (the objective of main interest in our experiments). In this order, a run-time embedder following a first-fit scheme searches for the first operating point whose constraint graph can be feasibly mapped to the system. For comparison, we implemented two embedder variants based on Algorithm 1: (a) variant ti performs the proposed mapping with temporal isolation and (b) variant spi performs the mapping with spatial isolation (see Ref. [34]). These embedders try to map one constraint graph of each application (from the first fitting OP) to the architecture. Here, the mapping of the applications is incremental, i.e., first, a constraint graph from the first application is mapped, then a constraint graph from the second application, and so on. This simulates the arrival of different applications at different points in time during run-time that constitute an application mix, which was unknown at design time. In principle, the proposed run-time mapping would also support the remapping of OPs and the removal of mapped applications, but this is not considered in the following experiments.
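The first-fit scheme over energy-sorted operating points can be sketched as follows. The function `first_fit`, the dictionary keys `energy` and `graph`, and the callback `try_map` are hypothetical names; `try_map` stands in for whichever embedder variant (ti or spi) is used.

```python
def first_fit(operating_points, try_map):
    """Return (operating point, mapping) for the first point that can be mapped.

    operating_points: iterable of dicts with keys "energy" and "graph"
    try_map: callable taking a constraint graph, returning a mapping or None
    """
    # Visit the operating points in increasing order of energy consumption.
    for op in sorted(operating_points, key=lambda op: op["energy"]):
        mapping = try_map(op["graph"])
        if mapping is not None:
            return op, mapping
    return None  # no operating point of this application fits
```

Because the cheapest points are tried first, the returned operating point is the one with the lowest energy among those that the embedder can currently place.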
We evaluated how many applications out of an application mix we can map successfully to our system (referred to as success rate in the following) for both variants. For three different application mixes, experiments were repeatedly performed, but PEs were successively made unavailable for mapping any tasks so that the overall PE availability ranged from 100% down to 40% (which also captures scenarios with, e.g., faulty or powered down PEs). We generated 100 different sequences in which PEs are randomly made unavailable, starting from 100% availability of PEs down to 40%, and used the average values per number of available PEs as the result.
The result of such a set of experiments is depicted in Figure 13(a) for application mix 1, consisting of one telecom application and two networking applications. Application mix 2 (see Figure 13(b)) is composed of one telecom, three automotive, and one consumer application, while application mix 3 (see Figure 13(c)) consists of two automotive, two consumer, and two networking applications. In the graphs, the x-axis represents the percentage of initially available PEs, while the y-axis corresponds to the ratio of successful mappings. The main trend observed is that with decreasing PE availability, the success rate declines much faster when using spatial isolation. In the case of application mix 1, the success rate of spi drops to 65%, while it still remains at 95% using the proposed ti at a PE availability of 60%. The experiments with application mix 2 show a similar behavior. Even more drastically, in the experiments with application mix 3, all applications could be mapped with our proposed approach in the case where all PEs are available, whereas using spi, one application in the mix could not be mapped at all.
Fig. 13. Evaluation of the average success rate of run-time mapping of pre-explored Pareto-optimal operating points belonging to different application mixes for spatial isolation (spi) and temporal isolation (ti) depending on the percentage of initially available PEs. The average success rate refers to the number of applications that could be successfully mapped in an overall of 100 experiments, providing a good measure for the system utilization.
In our test cases, the obtained energy consumptions of ti mappings were always equal to or better than those using spi mappings for a PE availability of 100% for application mixes 1 and 2. In application mix 1, ti and spi reached the same results. In application mix 2, ti mapped operating points with an energy consumption of 351mJ, whereas spi mapping resulted in 477mJ per execution. Being able to obtain run-time application mappings, which are better with respect to the objective (energy), is a direct consequence of being able to better utilize the available resources. For all other rates of PE availability and also for 100% PE availability in application mix 3, a comparison is not meaningful as spi is not able to map as many applications as ti.
Execution Time
Constraint solving is the central concept for making use of the offline-explored operating points at run-time. However, this implies an additional overhead for determining a feasible mapping based on the provided constraint graphs. In this experiment, we evaluate the execution times of the run-time backtracking mapping algorithm (Algorithm 1) performed by a central RM. Here, we applied Ref. [28] to simulate the execution of the RM according to Ref. [41] on a 32-bit embedded processor with a clock frequency of 300 MHz. Overall, feasible mappings for 500 constraint graphs on an 8×8 NoC architecture were searched via the backtracking mapping algorithm. Figure 14 shows the cumulative distribution function (CDF) of the execution times (in ms) measured for executing the run-time backtracking algorithm. For a given percentage of runs, the CDF gives the maximal execution time needed by that share of the runs. Values are separated for the cases of (a) successful (i.e., at least one feasible mapping exists) and (b) failed constraint solving (no feasible mapping exists). Note that constraint solving is a complex task (in the worst case, Algorithm 1 has exponential run-time) and took up to 305 ms (marked in Figure 14 by a vertical line) for successful and 947,878 ms for failed mappings. The vast majority of the applications can be mapped much faster, e.g., 97% of the successful test cases took at most 10 ms. In the case of failed mappings, execution times were much higher: only 78% of test cases took less than 305 ms, and 19% took seconds or even minutes (see Figure 14(b)).
Note that this time only elapses before a newly arriving real-time application is started. As we are dealing with applications that, once mapped, are periodically executed for a long time, mapping times in the range of a few seconds might be tolerable. However, to bound the execution time of the run-time mapping and to support domains where mapping time matters, we propose the use of a timeout mechanism (see Algorithm 1): We stop the algorithm after the timeout interval expires and classify the currently tested mapping as infeasible. The timeout value needs to be chosen appropriately to fulfill the turn-around time requirements of the application being mapped, particularly because a value that is too low may increase the number of false negatives (i.e., feasible mappings that are classified as infeasible). However, for our experiments, even with a timeout value as low as 10 ms, we would reject feasible mappings (i.e., produce false negatives) in only 3% of the cases. As we provide multiple operating points per application, a mapping according to another constraint graph may then be obtained, e.g., by using run-time management algorithms such as Ref. [41]. However, the investigation of sophisticated RM strategies is out of the scope of this article.
Nevertheless, for larger systems, the execution times of this algorithm may no longer be acceptable. Therefore, we will conduct further research on solving the run-time constraint satisfaction problem (CSP). This may include a hierarchical decomposition of the architecture where the backtracking algorithm searches in a sub-architecture first, distributed CSP solving, or dedicated hardware support [35]. When isolated regions per application are used, fast heuristics for solving a 2D packing problem can also be applied [37]. This 2D packing problem can also be solved as a Boolean satisfiability (SAT) problem considering all applications present in the system. This can be used to realize a re-mapping if an application cannot be mapped due to fragmentation. However, this whole concept is based on spatial isolation only and makes temporal isolation infeasible, thus decreasing the utilization. Pourmohseni et al. [26] propose a concept to also enable switching between operating points represented by constraint graphs. This is achieved by an additional post-DSE analysis to determine efficient re-mapping options of OPs with a minimal transition overhead. During run-time, the re-mapping can be performed with bounded latency.
CONCLUSIONS
In this article, we proposed a technique to increase the utilization of many-core systems using hybrid application mapping combined with a static performance analysis considering bounds on temporal interference on tasks. More specifically, the design-time analysis for applications with real-time constraints was performed, considering, for the first time in a hybrid application mapping approach, temporal isolation of concurrent tasks with bounds on task interference. Via DSE of mappings, a set of Pareto-optimal operating points with composable performance values is obtained. The subsequent operating point mapping at run-time is achieved by solving a constraint satisfaction problem. It has been shown that this hybrid approach provides predictable application mappings with high system utilization and a reduced number of PEs needed to execute various application mixes while satisfying real-time requirements. Another major advantage of our approach over previous work is that the explorative search for feasible mappings is shifted to design time, leaving only the remaining freedom in finding a concrete mapping to the RM. This was possible through the concept of a constraint graph characterizing feasible mappings.
Yet, as detailed in this article, constraint solving during run-time is still a complex task. Therefore, we are further investigating how to make the RM more efficient. The proposed timeout is a first solution to bound the mapping time. A method to also bound the time for switching between OPs is presented in Ref. [26]. Investigating the influence of dynamic re-mapping on the utilization of the system remains future work.
