Abstract: Nowadays, owing to unpredictable changes of the environment and workload variation, optimally running multiple applications in terms of quality, performance and power consumption on embedded multi-core platforms is a huge challenge. A lightweight run-time manager, linked with an automated design-time exploration and incorporated in the host processor of the platform, is required to dynamically and efficiently configure the applications according to the available platform resources (e.g. processing elements, memories, communication bandwidth), for minimising the cost (e.g. power consumption), while satisfying the constraints (e.g. deadlines). This study presents a flow linking a design-time design space explorer, coupled with platform simulators at two abstraction levels, with a fast and lightweight priority-based heuristic integrated in the run-time manager to select near-optimal application configurations. To illustrate its feasibility and the very low complexity of the run-time selection, the proposed flow is used to manage the processors and clock frequencies of a multiple-stream MPEG4 encoder chip dedicated to automotive cognitive safety applications.
Introduction
The future of embedded computing is shifting to multi-core designs to boost performance due to the unacceptable power consumption and operating temperature increase of fast single-core CPUs. This introduces new big challenges. The first challenge is the support for a variety of applications: mobile communications, networking, automotive and avionic applications, multimedia in the automobile and Internet interfaced with many embedded control systems. These applications may run concurrently, start and stop at any time. Each application may have multiple configurations, with different constraints imposed by the external world or the user (deadlines and quality requirements, such as audio and video quality, output accuracy), different usages of various types of platform resources (processing elements (PEs), memories and communication bandwidth) and different costs (performance, power consumption). The second challenge is the platform heterogeneity, happening between platforms and within a platform. Even for similar platforms, process variation endows them with different performance characteristics. Furthermore, the resources required for each application may vary over time, as applications are launched or complete, and due to hardware adaptation to physical constraints (power, temperature, battery life and aging). Hence, it is untenable to ask software vendors to adapt or optimise their applications for each platform. The third challenge is time to market, which makes software development productivity of paramount importance.
To address the previously mentioned challenges and to allow exploration, integration and efficient collaboration of all complementary techniques, [1] presents a generic and structured framework for run-time resource management (RRM) of embedded multi-core platforms which fulfils the following features. First, this RRM supports a holistic view of the resources. This is needed for global resource allocation decisions, arbitrating between all applications, and minimising the total costs. Second, this RRM transparently optimises the resource usage and the application mapping on the platform. This is needed to facilitate the application development and mapping from diverse application domains. Third, this RRM dynamically adapts to changing context. This is needed to enable the best usage of resources and to achieve a high efficiency under changing environment and requirements. To that end, dynamic resource allocation and dynamic reconfiguration of applications must be supported. Also, quality requirements and resources must be scaled dynamically (e.g. by adjusting the processor clock frequency or by switching off some functions) in order to control the power consumption and the heat dissipation of the platform. Finally, this RRM allows different strategies (e.g. for resource allocation and task scheduling), since a single strategy cannot be expected to fit all application domains.
Since such an RRM is intended for embedded platforms, only a lightweight implementation is acceptable. To alleviate the run-time decision making and to avoid conservative worst-case assumptions, we already motivated the need for an RRM consisting of the following two phases [2]. First, a design-time exploration per application derives a multi-dimensional Pareto set of optimal configurations. Each configuration is characterised by a code version together with an optimal combination of constraints, used resources and costs. The different code versions refer to different parallelisations of the application into parallel tasks and data transfers to shared and local memories. Second, a low-complexity run-time manager is incorporated in the user space on top of the basic services of the platform OS and acts as an exception handler. Whenever the environment changes (e.g. when a new application/use case starts, or when the user requirements change), the run-time manager reacts as follows. First, it selects a configuration from the Pareto set of each active application, according to the available resources, in order to minimise the costs, while satisfying the constraints. Second, it reconfigures and maps the application on the platform, that is, it assigns the platform resources, adapts the platform parameters, loads the application tasks and issues the application executions according to the newly selected configurations.
With the growing complexity of applications and embedded devices, deriving Pareto sets of application configurations is not trivial. In [3], a practical approach is presented for using multiple platform simulators at different abstraction levels under a single design space exploration (DSE) tool. This approach can derive accurate Pareto sets much more quickly than full-space explorations linked to a cycle-accurate platform simulator.
The new contributions of our work presented in this paper are as follows. First, the flow, illustrated in Fig. 1, links a design-time design space explorer, coupled with platform simulators at two abstraction levels, with a run-time manager. Second, a fast and lightweight heuristic to select near-optimal configurations for the active applications is integrated in the run-time manager. This heuristic, taking the application priorities into account, is extended from [4], where it is shown to find solutions close (within 0-0.4%) to the ones obtained by the fastest state-of-the-art heuristics, in just a fraction of the execution time (more than 97.5% gain on a StrongARM processor). In this paper, with no solution quality loss, the complexity of our priority-based heuristic is also further reduced through stepwise filtering and sorting.
Finally, to illustrate the feasibility of our flow and the very low complexity of the run-time configuration selection, experiments are performed to manage the processors and clock frequencies of a multiple-stream MPEG4 encoder chip dedicated to automotive cognitive safety applications. The critical section of the flow is the one regularly executed at run time. For our demonstrator, experiments show that each execution of the run-time section runs in less than 0.5 ms on the previously mentioned StrongARM processor.
The remainder of the paper is organised as follows. Section 2 overviews the state-of-the-art on RRM techniques for embedded multi-core platforms. Section 3 introduces all assumptions and terminologies used in this paper. Section 4 formulates the configuration selection problem solved by our priority-based heuristic. Section 5 presents the flow linking the design space explorer with our run-time selection heuristic. Section 6 presents our demonstrator. Finally, Section 7 reports the experiments and results obtained for each step of our flow applied to the proposed demonstrator. Conclusions are drawn in Section 8.
Related works
In the context of RRM, traditional approaches can be roughly classified into either pure design-time approaches or pure run-time approaches. Nevertheless, they suffer from the following drawbacks. First, some of them are applicable only for single-processor platforms [5] or for homogeneous multi-processor platforms [6], but not for heterogeneous multi-processor platforms. Second, none of the existing approaches proposes a complete framework. Some of them are based only on task scheduling, that is, on task ordering and assignment. A good overview of available design-time algorithms can be found in [7]. Some others are based only on slowing or shutting down the platform resources [8] and on dynamic voltage and frequency scaling (DVFS) [9-12]. Third, the objective of the majority of these approaches is performance optimisation [13-17], and not power consumption optimisation. Finally, design-time approaches involve slow heuristics [12, 18, 19] using integer linear programming algorithms and cannot be used at run time. Also, to reach a lightweight implementation, run-time approaches hide the specification of the internal application tasks, and they do not fully exploit the task mapping choices of the target platform. Hence these approaches are sub-optimal.
Hence neither the existing pure design-time approaches nor the existing pure run-time approaches are efficient to solve this complex RRM problem. To alleviate the run-time decision making and to avoid worst-case assumptions, new research directions are ongoing and propose a mixed design-time and run-time approach.
The task concurrency management methodology [20-22] explores the energy-performance trade-off at the system level. To reach an efficient usage of the platform resources, this methodology models the application at a finer granularity than traditional task graphs. It identifies the sub-tasks of the application that can run in parallel on a heterogeneous multi-processor platform. It also includes data access and memory management at the task level [2, 23].
Scenario-based approaches [24, 25] are based on the concept of application scenarios identified at design time as follows. First, a profiling-based analysis of various run-time situations of the application is performed. Then, these run-time situations are clustered into a few dominant application scenarios and a backup scenario. At run time, the actual scenario is detected with a simple detector, and the application is executed with the configuration decided for that scenario.
Fig. 1 Our flow linking RRM with design space explorer
The task scheduling techniques [20, 21] schedule one task at a time, making use of the entire platform for each task. This leads to an inefficient usage of the platform. The techniques [4, 26] allow parallel execution of the tasks, while sharing the platform resources. But they assume that all tasks start at the same time. This assumption can lead to idle platform processors until the next RRM call. These techniques are extended in [27] , which allows overlapped sharing of the platform resources. This latter technique considers the start time and the periodic information present in applications such as frame processing in video decoding or packet processing in wireless applications.
RRM for multi-core embedded platforms [28-30] also exploits the number of available cores, in addition to voltage and frequency. In [28, 29], different parallelised versions of a single application are used to trade the available platform resources with the performance and the power consumption. In [30], the platform communication network is a shared bus and the bus traffic conditions are taken into account.
The addition of the parallelism to the set of platform parameters significantly increases the design space of operating modes. Innovative and efficient techniques for RRM are needed to extend the traditional approaches for power consumption optimisation. Recent studies in this field address the problem by modelling it as a multi-dimension multiple-choice knapsack problem (MMKP) and solving it through dedicated heuristics [4, 26]. Another approach [31] proposes a run-time management technique for task-level parallelism in order to optimise the performance under a power consumption budget.
Advanced technologies such as sub-45 nm CMOS and three-dimensional integration are known for the increased number of reliability failure mechanisms. Nevertheless, classical reliability-aware approaches are no longer viable, since they propose ad hoc failure or worst-case solutions, which incur a significant cost penalty. In [32], the state-of-the-art in reliability management techniques is summarised, and a new proactive energy management approach is proposed, which handles both temperature and lifetime at run time.
Another issue in the context of RRM is dynamic reconfiguration. Applications are becoming more complex and multiple use cases need to be supported. Moving from one application mapping to another may be needed due to user interactions or changes in the platform resource availability when new applications are activated. Adapting the mapping of an active application is called dynamic reconfiguration or task migration. The key challenge is to maintain the real-time behaviour and the data integrity of the overall set of active applications. On the one hand, dynamic reconfiguration is a powerful mechanism to improve the platform utilisation, avoiding some idle computing resources, while others are overloaded. On the other hand, important issues are not only task-state representation and message consistency during reconfiguration, but also reconfiguration time that limits the performance of the overall system [33]. [34-36] and [37] propose efficient run-time support for reconfiguration while improving the platform utilisation and preserving all timing constraints. The authors of [38] propose a process-level migration mechanism, which transfers much less state information and hence yields a shorter migration time. In [39] the authors propose a dynamic resource (re)allocation mechanism for large-scale high-performance computing platforms. In [1] the author points out that activation of new applications or requirement changes in already active applications may give rise to reselection of configurations. One main issue is also the implementation switching, which must be seamless. To that end, switching points are introduced in the applications where it is checked whether a switching is requested.
Our paper focuses on the following RRM issue: the flow (Fig. 1), linking a design-time design space explorer, coupled with platform simulators at two abstraction levels, with a fast and lightweight heuristic, executed at run time whenever the environment is changing, to select near-optimal configurations for the active applications. This flow works as follows:
For each application, a multi-dimension Pareto set of optimal configurations is identified at design time, by exploring parallelisations, measuring the platform resource usage (PEs, memories, communication bandwidth) and estimating power consumption and performance. This is automated through a design space explorer coupled with platform simulators.
The design space explorers used in our flow can be either the commercially available modeFRONTIER [40] or the open-source Multicube Explorer [41] . As emphasised in Section 5.1, the number of possible application configurations can be huge. Hence these exploration tools are needed within the embedded system design cycle to automatically generate these configurations.
Platform simulators are used to evaluate the application configurations generated by the design space explorer. They are available at two abstraction levels. As pointed out in [42], platform simulations can be done at many abstraction levels: for example, functional-level simulation, timed simulation, cycle-accurate simulation. But the key problem is the following one: the more accurate the simulator, the more time it takes to perform a simulation. Hence from the DSE point of view, which needs a large number of simulations, there is an important trade-off between the result accuracy and the simulation time. In our approach, we use two simulators as follows. First, an extensive DSE is performed to derive a large set of application configurations, using HLSim. This fast high-level timed and functional simulator uses timing back-annotations from the accurate commercial simulator TLMsim to generate both execution time and power consumption. Power consumption estimations are calculated using platform mapping on TSMC 90 nm process technology using standard cell libraries. Details can be found in [43], where a strong correspondence between the HLSim estimations and the actual measurements is reported. Then, only the configurations of the Pareto set are evaluated again and verified using TLMsim [44]. TLMsim is a SystemC-based cycle-accurate transaction-level model (TLM) [45], built using the CoWare virtual platform prototyping tools. This simulator consumes more time, but it reports more accurate estimations. For instance, TLMsim takes around 4 h to simulate the MPEG4 encoder encoding ten frames in our demonstrator, whereas HLSim takes 90 s. Recently, simulations based on a cycle-accurate TLM have gained importance due to standardisation efforts.
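The two-level evaluation described above can be sketched in a few lines. This is only an illustration under stated assumptions: `fast_sim` and `accurate_sim` are hypothetical stand-ins for HLSim and TLMsim, each returning a tuple of objectives to be minimised (for example power consumption and execution time), and the function names are not taken from the actual tools. With the figures quoted above (about 4 h against 90 s per run), each accurate simulation costs roughly 160 fast ones, which is why only the shortlisted configurations are simulated again.

```python
def dominates(a, b):
    """True if objective tuple a is no worse than b everywhere and better somewhere."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(evaluated):
    """Keep the (config, objectives) pairs that no other pair dominates."""
    return [(cfg, obj) for cfg, obj in evaluated
            if not any(dominates(other, obj) for _, other in evaluated)]


def two_level_exploration(candidates, fast_sim, accurate_sim):
    """Explore with the fast simulator, then re-evaluate only the Pareto shortlist."""
    coarse = [(cfg, fast_sim(cfg)) for cfg in candidates]
    shortlist = pareto_front(coarse)
    # Only the shortlisted configurations pay the cost of the accurate simulator.
    return [(cfg, accurate_sim(cfg)) for cfg, _ in shortlist]
```

The quadratic Pareto filter is acceptable here because it runs at design time; the expensive part is the number of calls to the accurate simulator.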
A fast and lightweight priority-based heuristic to select near-optimal configurations for the active applications is integrated in the run-time manager conforming to the framework proposed in [1] and implemented in C. This heuristic is extended from [4]. The latter finds solutions close (within 0-0.4%) to the ones obtained by the fastest state-of-the-art heuristics, in just a fraction of the execution time (more than 97.5% gain on a StrongARM processor). It can run in less than 1 ms for multi-processor problem sizes. This heuristic is extended as follows:
• The heuristic cannot always guarantee to find a feasible solution, that is, to select a set of configurations, one per active application, within the available platform resources. In that case, the deadline of the soft real-time application with the lowest priority is relaxed and the heuristic is executed once again.
• The size of the Pareto set, used as input to the heuristic, can still be high. Since this size is a critical parameter in the heuristic complexity, stepwise filtering and sorting are integrated prior to the heuristic. With no solution quality loss, this further alleviates the run-time configuration selection and allows it to run in less than 0.5 ms on the previously mentioned StrongARM processor, as illustrated in our experiments.
Assumptions and terminologies
In this section, the assumptions and terminologies used in the next sections are listed, in order to formally introduce the configuration selection problem and to present our flow.
Platform assumptions: The target platform is heterogeneous and consists of multiple intellectual property cores (e.g. ASIC, FPGA, multi-CPUs). One of the CPUs is intended to run the RRM, though not exclusively. $m$ platform resource types are assumed: for example, memory sizes, number of PEs per PE type, communication bandwidth. For each platform resource type $k$, $0 \le k < m$, the available amount is $r_k^{max}$. The possible clock frequencies per PE type are also considered and it is assumed that they can be dynamically scaled independently from each other. The set of possible clock frequencies is denoted as $F$.
RRM assumptions: This paper targets the application deadlines as being the constraints to be satisfied, and the total power consumption of the platform as being the cost to be minimised.
Application assumptions: $p$ applications are simultaneously active on the platform. Both number and type of active applications can change at run time. Each active application receives an identifier $a \in A = \{a_1, \ldots, a_p\}$. The deadline of an application $a$ is denoted as $t_a^{max}$. This deadline can be changed at run time by the external world or the user. The priority of an application $a$ is denoted as $v_a$. As described in [1], an application consists of communicating jobs, whose implementations can take several forms (fixed logic, configurable logic, software) and offer different characteristics. Binaries of these parallelised versions (resulting from previous HW/SW partitioning or parallelisation tool such as [43]) of the application among the PEs of the platform are available. The maximum number of such parallelisations is $s$. Also, for robustness reasons, for easy failure detection and recovery, and to simplify the platform-level power management, it is assumed that one job is mapped exactly on one PE. For each application $a$, the Pareto set of available configurations $c_a$ is $C_a$, and its size is $N_a \le N_{max}$, where $N_{max}$ is the maximum number of configurations in any $C_a$. As already mentioned in Section 2, the application configurations are generated by the design space explorer in our flow, and the power consumption and execution time estimations are derived by the platform simulators. The meta-data characterising each application configuration are structured and stored in a database of the RRM to enable fast exploration during run-time decisions. The implementation of this database is out of scope of this paper.
Configuration selection problem definition
As defined in [4], the run-time manager has to select exactly one configuration from each active set $C_a$, according to the available platform resources, in order to minimise the total power consumption of the platform, while satisfying the application deadlines. This problem (Fig. 2) can be mathematically formulated as follows. Given $p$ active applications $a_1, \ldots, a_p$ with deadlines $t_{a_1}^{max}, \ldots, t_{a_p}^{max}$, and sets of available configurations $C_{a_1}, \ldots, C_{a_p}$, identify a set of configurations $S = \{c_{a_1} \in C_{a_1}, \ldots, c_{a_p} \in C_{a_p}\}$ that minimises the total power consumption $\sum_{a \in A} c_a[p]$, while satisfying the deadlines $c_a[t] \le t_a^{max}$ for all $a \in A$ and the resource constraints $\sum_{a \in A} c_a[r_k] \le r_k^{max}$, $0 \le k < m$.
This selection is a combinatorial problem, called MMKP. According to [46], it belongs to the NP-hard class with respect to $p$, $N_{max}$ and $r_k^{max}$, $0 \le k < m$. The MMKP heuristic proposed in [4] to solve this problem has an overall worst-case complexity of $O(m + 2pN_{max} + pN_{max} \log(pN_{max}))$. The reported experiments reflect its very low complexity compared to the fastest state-of-the-art heuristics. It is also the only one to run in less than 1 ms for problems with $p \le 30$, $N_{max} \le 10$ and $m \le 10$ on a StrongARM processor running at 206 MHz. In our flow, $N_{max}$ is the size of the Pareto sets derived from the design space explorer. It can still be high and become a critical parameter in the heuristic complexity. Hence stepwise filtering and sorting are integrated prior to the MMKP heuristic to further alleviate the run-time selection (Sections 5.2 and 5.3).
A feasible solution does not exist when every set of configurations, one per active application, needs more than the available platform resources. In case of soft real-time applications, for which the deadline can be missed with the lowest possible penalty, their priority is taken into account in our selection strategy in order to relax the deadlines and to reach a solution (Section 5.3.3).
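To make the problem statement concrete, the following sketch solves a tiny instance exactly by enumerating every combination of configurations. It is a reference formulation for illustration only, not the heuristic of [4]: the enumeration grows exponentially with the number of applications, which is precisely why a heuristic is used at run time. The `Config` type and its field names are assumptions introduced here.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Config:
    time: float        # c_a[t]: estimated execution time
    power: float       # c_a[p]: estimated power consumption
    resources: tuple   # c_a[r_k]: amount used of each resource type k


def select_exhaustive(config_sets, deadlines, r_max):
    """Return (total_power, selection) with minimum total power, or None if infeasible."""
    best = None
    for selection in product(*config_sets):      # one configuration per application
        if any(c.time > d for c, d in zip(selection, deadlines)):
            continue                              # a deadline is violated
        usage = [sum(c.resources[k] for c in selection) for k in range(len(r_max))]
        if any(u > limit for u, limit in zip(usage, r_max)):
            continue                              # a resource constraint is violated
        total = sum(c.power for c in selection)
        if best is None or total < best[0]:
            best = (total, selection)
    return best
```

For example, with two applications sharing four PEs, the call returns the cheapest pair of configurations that meets both deadlines, and `None` when no such pair exists, which is the situation handled by the deadline relaxation.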
Observe that this configuration selection is executed at run time whenever the environment is changing (e.g. when a new application/use case starts, or when the user requirements change). Different strategies are possible. The first one is to select a configuration for each active application at once, but some downsides may be expected. Indeed, revisiting already active applications whenever a new application is activated can lead to a large number of reconfigurations for only a minimal power gain. In case enough platform resources are still available to accommodate new applications, the second strategy is not to revisit the already active applications, and to select a configuration only for the new applications. In spite of a performance gain, this strategy is less optimal. The third strategy is to revisit a limited, carefully chosen set of already active applications, together with the new ones. As mentioned in [1], the strategy choice should depend on the considered application domain and be selected by the user. The reconfiguration cost (time and power) has to be taken into account in both application deadlines and total power consumption. This is part of our future work.
Observe also that the configuration selection relies on estimations of individual applications. A proactive management technique, based on [32] , must complement the configuration selection in the run-time manager, as follows: based on real performance and power consumption, in case of deviations from the estimated ones, this technique proactively refines the configuration selection decision. The integration of this proactive technique is part of our future work too.
Flow linking design space explorer and run-time selection
Our flow (Fig. 1) links a design-time design space explorer, coupled with platform simulators at two abstraction levels, with a fast and lightweight run-time heuristic, executed whenever the environment is changing, to select near-optimal configurations for the active applications. Observe that the design space explorer and the platform simulators are external tools. Only the run-time selection heuristic is integrated in the RRM, which is implemented in C, and is loaded in the user space of the host processor of the platform. On the StrongARM processor of our target platform (Section 6.1), its binary size is 54 kbytes. The data structure size for one application configuration, whose characterisation is given in Section 3, is 292 bytes.
The remainder of this section describes the different steps (used heuristics and tools, and complexity analysis) of our flow and highlights the extensions brought to the MMKP heuristic [4].
Design-time Pareto set generation
Heuristic description:
The algorithm in Fig. 3 performs a design-time exploration of configurations for each application $a$, and it derives a Pareto set, according to the available platform resources, the corresponding power consumption of the platform and the execution time of the application. This algorithm works as follows:
• Lines 1 and 3: The algorithm iterates over the applications and the available parallelised versions of the applications.
• Line 4: For any parallelised version $a_{par}$, the algorithm explores the clock frequency combinations $f$ (one clock frequency per used PE) and derives the set of Pareto configurations $C_a^{new}$ according to the power consumption $p$ and the execution time $t$ estimated by the platform simulator. The size of the design space to be explored rapidly increases with the number of possible clock frequencies. Therefore an exhaustive exploration for each $a_{par}$ is not always possible, and the exploration is performed by one of the multi-objective optimisation strategies developed either in modeFRONTIER or in Multicube Explorer. Currently, for $a_{par}$ using at most 3 PEs, an exhaustive exploration is performed. Otherwise, the exploration is performed by the NSGA-II multi-objective genetic algorithm [47]. For each $a_{par}$ using $r$ PEs, the genetic algorithm runs with a population size $|F| \times r$ for $ngen$ generations, resulting for the worst case in $ngen \times |F| \times r$ simulations. During the exploration, $p$ and $t$ are estimated by the platform simulator.
• Line 7: At the end of the exploration, to alleviate the run-time selection heuristic (Section 5.3), the set of configurations $C_a$ is further filtered to only keep the Pareto configurations according to the number of used PEs per PE type too.
Complexity analysis:
The processing of one iteration over one application parallelisation (line 4 in Fig. 3 ) is completely independent of the processing over the other parallelisations. Thus, to save exploration time, all iterations can be performed in parallel. Since every iteration consists of a multi-objective optimisation (with the required platform simulations), the worst-case execution time for processing one application can then be approximated as the worst-case execution time of the multi-objective optimisation for one parallelisation, which is O(ngen |F| r).
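The simulation counts behind this analysis can be worked out with a short helper. This is a back-of-the-envelope sketch, assuming that every used PE chooses its clock frequency from the same set $F$, so that an exhaustive exploration of a version using $r$ PEs needs $|F|^r$ simulations; the function name, the parameter names and the numbers in the usage note are illustrative and not taken from the demonstrator.

```python
def simulation_budget(n_freqs, n_pes, ngen, exhaustive_limit=3):
    """Worst-case number of simulations for one parallelised version.

    n_freqs: |F|, number of possible clock frequencies (assumed equal for all PEs)
    n_pes:   r, number of PEs used by the parallelised version
    ngen:    number of generations of the genetic algorithm
    """
    if n_pes <= exhaustive_limit:
        # Exhaustive exploration: one simulation per frequency combination.
        return n_freqs ** n_pes
    # Genetic algorithm: population size |F| * r, evolved for ngen generations.
    return ngen * n_freqs * n_pes
```

With 5 frequencies, a version using 3 PEs needs 125 simulations exhaustively, whereas a version using 6 PEs would need 15625 exhaustively but only 600 with 20 generations of the genetic algorithm.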
Design-time pre-processing

Description:
To improve the performance of the run-time decision making (Section 5.3), and to reduce the effort needed to select the near-optimal application configurations, the following steps are also performed at design time in our flow.
Step 1 looks in each set $C_a$ for the configuration with the highest power consumption $p$. Let c
The new step 4 sorts the configurations of each set $C_a$ in ascending order according first to their resource price $r_{price}$ and then to their value $n$. First, this step allows reducing step 2 of the run-time selection to a linear scan of each active set $C_a$. Second, this step makes the selection of the initial solution immediate in step 4 of the run-time selection.
Complexity analysis:
The complexity analysis per application is as follows. The complexity of both steps 1 and 2 is O(N max ), where N max is the maximum number of configurations in any C a . The complexity of step 3 is
Step 4 is a sorting. From [48], its worst-case complexity is $O(N_{max} \log(N_{max}))$.
Run-time selection of c a
Heuristic description:
The optimisation strategy used in the run-time manager is a fast MMKP heuristic for MP-SoC run-time management, adapted from [4]. The problem solved by this heuristic is defined in Section 4, where the sets of configurations $C_{a_1}, \ldots, C_{a_p}$ are derived from the design-time exploration described in Section 5.1. The different steps of this heuristic are described below. The new step 1 removes from each active set $C_a$ any configuration $c_a$ whose execution time exceeds the application deadline, that is, such that $c_a[t] > t_a^{max}$.
The minimisation problem defined in Section 4 becomes the following maximisation problem. Given $p$ active applications $a_1, \ldots, a_p$ and sets of available configurations $C_{a_1}, \ldots, C_{a_p}$, identify a set of configurations $S = \{c_{a_1} \in C_{a_1}, \ldots, c_{a_p} \in C_{a_p}\}$ in order to maximise the total value
$$\sum_{a \in A} c_a[n] \qquad (1)$$
according to the resource constraints
$$\sum_{a \in A} c_a[r_k] \le r_k^{max}, \quad 0 \le k < m \qquad (2)$$
The new step 2 further filters each active set $C_a$ updated from step 1 as follows. All remaining configurations in $C_a$ satisfy the application deadlines. Some of them are Pareto optimal due to their low execution time, at the expense of using more resources or consuming more power than other configurations in $C_a$. For our minimisation problem, these configurations are undesirable and are removed from $C_a$ too. Concretely, the goal of step 2 is, for each active set $C_a$ and for each available parallelised version $a_{par}$, to identify the combination of clock frequencies $f$ that minimises the corresponding power consumption $p$, while satisfying the application deadline. Hence the resulting maximum size of $C_a$ is $s$.
Owing to the sorting of $C_a$ performed by step 4 during the design-time pre-processing, this step only consists of a linear scan of $C_a$.
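Steps 1 and 2 above can be combined in a single pass over one set of configurations, as the following sketch illustrates. The `Config` fields are hypothetical names for the meta-data of a configuration, and the dictionary keyed by parallelised version is a simplification chosen for readability; the source relies on the pre-sorted order of the set rather than on a dictionary.

```python
from typing import NamedTuple


class Config(NamedTuple):
    version: str    # parallelised version a_par
    freqs: tuple    # one clock frequency per used PE
    time: float     # c_a[t]: estimated execution time
    power: float    # c_a[p]: estimated power consumption


def filter_configs(configs, deadline):
    """Keep, per parallelised version, the lowest-power configuration meeting the deadline."""
    best = {}
    for c in configs:                       # single linear scan
        if c.time > deadline:               # step 1: deadline would be missed
            continue
        kept = best.get(c.version)
        if kept is None or c.power < kept.power:
            best[c.version] = c             # step 2: cheaper choice for this version
    return list(best.values())
```

The result holds at most one configuration per parallelised version, which matches the bound $s$ on the size of the filtered set.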
Step 3 sorts all configurations of all active sets $C_a$ together, regardless of the application they belong to. These configurations are sorted in descending order according to their angular coefficient ($c_a[n]/c_a[r_{price}]$) in the two-dimensional space (value against resource price). This implies that a high priority is given to configurations with high value and low resource price. The sorted set of all configurations, resulting from step 3, is denoted as C_sorted.
Step 4 is a greedy algorithm solving the knapsack problem (Fig. 4). The following terminologies are used. A solution is a set of configurations $c_a$, one per active application $a$. A feasible solution is a solution that satisfies the resource constraints (2). An optimal solution is a feasible solution with a maximum total value (1). cur_sol (resp. saved_sol) denotes the current knapsack solution (resp. saved feasible solution), including the configurations c_cur$_a$ (resp. c_saved$_a$). The Boolean variable feasible(cur_sol) indicates whether cur_sol is feasible or not. cur_val (resp. saved_val) denotes the total value of cur_sol (resp. saved_sol). The initial solution includes the configuration from each set $C_a$ first with the lowest resource price and then with the lowest value. Starting with the lowest resource price configurations helps the greedy algorithm to quickly reach a feasible solution when one exists. Starting with the lowest value configurations allows the greedy algorithm to perform better in finding a solution close to an optimal one. The procedure ExchangeConfiguration($c_a$) updates cur_sol by exchanging c_cur$_a$ against $c_a$ and by updating both the total resources it uses and its total value. However, this exchange is ignored if cur_sol, initially feasible, would become infeasible.
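Steps 3 and 4 can be illustrated with the following sketch. It follows the textual description only, since the algorithm of Fig. 4 is not reproduced here: the field names are assumptions, a configuration with a zero resource price is ranked first by convention, and feasibility is recomputed from scratch at every exchange for readability, whereas ExchangeConfiguration updates the totals incrementally to keep the scan linear.

```python
from typing import NamedTuple


class Config(NamedTuple):
    app: str           # application identifier a
    value: float       # c_a[n]
    r_price: float     # c_a[r_price]
    resources: tuple   # c_a[r_k] for each resource type k


def feasible(solution, r_max):
    """Check the resource constraints (2) for a dict {app: Config}."""
    return all(sum(c.resources[k] for c in solution.values()) <= r_max[k]
               for k in range(len(r_max)))


def total_value(solution):
    return sum(c.value for c in solution.values())


def ratio(c):
    # Angular coefficient value / resource price (infinite for a free configuration).
    return c.value / c.r_price if c.r_price > 0 else float("inf")


def greedy_select(config_sets, r_max):
    """Return the best feasible solution found, or None if none was reached."""
    # Initial solution: lowest resource price first, then lowest value.
    cur = {app: min(cs, key=lambda c: (c.r_price, c.value))
           for app, cs in config_sets.items()}
    saved = dict(cur) if feasible(cur, r_max) else None
    # Step 3: sort all configurations of all applications together.
    c_sorted = sorted((c for cs in config_sets.values() for c in cs),
                      key=ratio, reverse=True)
    # Step 4: greedy exchanges over the sorted set.
    for c in c_sorted:
        candidate = {**cur, c.app: c}
        if feasible(cur, r_max) and not feasible(candidate, r_max):
            continue                       # exchange ignored
        cur = candidate
        if feasible(cur, r_max) and (saved is None
                                     or total_value(cur) > total_value(saved)):
            saved = dict(cur)              # remember the best feasible solution
    return saved
```

A return value of `None` corresponds to the case where no feasible solution is found and a deadline has to be relaxed.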
Complexity analysis:
The complexity analysis of each step is as follows. Both step 1 and step 2 can be combined. Their global complexity is O(pN max), where p is the number of active applications and N a ≤ N max is the number of Pareto-optimal configurations in C a .
Step 3 is again a sorting. Its worst-case complexity [48] is O(ps log(ps)).
Step 4 consists of iterating over the whole C_sorted and has a linear complexity O(ps). Hence the run-time selection heuristic has an overall worst-case complexity of only O(2pN max + ps log(ps)), due to stepwise filtering and sorting performed before the greedy algorithm solving the knapsack problem. Observe that this selection is executed at run time whenever the environment is changing and that a low worst-case complexity is mandatory for this flow section.
Deadline relaxation:
Whenever no feasible solution is found by the heuristic of Section 5.3.1, the deadline of the active application with the lowest priority is relaxed, and the heuristic is performed again on the updated set of active applications. The deadline relaxation is performed as follows:
1. Let a denote the active application with the lowest priority.
2. Let k be the critical resource in finding a feasible solution.
3. Look in C a for the configurations with a minimum r k .
4. Among these configurations, detect the minimum execution time t.
5. Relax the deadline t a max to this minimum execution time t.
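Steps 3 to 5 of the relaxation can be sketched as below. The sketch assumes that the configurations of the lowest-priority application are given as dictionaries with hypothetical keys `usage` (one entry per resource) and `time`, and that the set passed in has not yet been filtered against the deadline being relaxed.

```python
def relaxed_deadline(configs, k):
    """Return the relaxed deadline of the lowest-priority application.

    configs: configurations of that application.
    k:       index of the critical resource.
    """
    # Step 3: configurations with the minimum usage of resource k.
    min_rk = min(cfg["usage"][k] for cfg in configs)
    cheapest = [cfg for cfg in configs if cfg["usage"][k] == min_rk]
    # Steps 4 and 5: the new deadline is their minimum execution time.
    return min(cfg["time"] for cfg in cheapest)
```

The returned value replaces t a max, after which the selection heuristic is run again on the updated set of active applications.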
Interface between design-time exploration and RRM
The interface between the design-time exploration and the RRM is implemented with XML, similarly to all interfaces between the design space explorer and platform simulators in our flow. For each application, the design-time exploration exports a Pareto set of optimal configurations, according to the used platform resources, the corresponding power consumption of the platform and the execution time of the application. Each configuration is also characterised by a combination of clock frequencies, one for each needed processor.
Whenever an application is loaded on the platform, the RRM parses the XML file describing all its configurations exported by the design-time exploration. This parsing is executed once and does not impact the complexity of the run-time configuration selection of Section 5.3.
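To make the one-off parsing concrete, the sketch below reads a small XML description with Python's standard `xml.etree.ElementTree` module. The element and attribute names (`application`, `configuration`, `frequency`, `cores`, `power`, `time`, `mhz`) and all numeric values are hypothetical: the actual schema exchanged between the design space explorer and the RRM is not given here.

```python
import xml.etree.ElementTree as ET

# Hypothetical file content; names and values are illustrative only.
EXAMPLE_XML = """<application name="mpeg4_encoder">
  <configuration cores="1" power="0.9" time="0.060">
    <frequency mhz="220"/>
  </configuration>
  <configuration cores="2" power="1.1" time="0.040">
    <frequency mhz="100"/>
    <frequency mhz="60"/>
  </configuration>
</application>"""


def parse_configurations(xml_text):
    """Parse all configurations of one application into dictionaries."""
    root = ET.fromstring(xml_text)
    configs = []
    for node in root.findall("configuration"):
        configs.append({
            "cores": int(node.get("cores")),
            "power": float(node.get("power")),
            "time": float(node.get("time")),
            # One clock frequency per needed processor.
            "freqs": tuple(int(f.get("mhz")) for f in node.findall("frequency")),
        })
    return configs
```

Because the result is kept in memory after loading, later selections only manipulate the parsed list and never touch the XML file again.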
Demonstrator
This section presents our demonstrator, that is, the target platform, the application and the use case scenario.
Target platform
The embedded multi-core platform (Fig. 5a) used in our experiments consists of one StrongARM host processor, running at 206 MHz, where the RRM is loaded, seven coarse-grain architecture for dynamically reconfigurable embedded system (ADRES) processors [49], where the active applications are mapped, one shared L2 memory, implemented as SDRAM, and a 32-bit wide full crossbar switch used as interconnection bus. Each ADRES processor also has a local L1 memory. The ADRES processor is a power-efficient flexible architecture template that combines a very long instruction word (VLIW) digital signal processor (DSP) with a coarse-grain reconfigurable array. The VLIW DSP efficiently executes control-flow code by exploiting instruction-level parallelism. The array, containing many functional units, accelerates data-flow loops by exploiting loop-level parallelism.
In this paper, it is assumed that three MPEG4 encoders are simultaneously activated and that the RRM is responsible for their mapping.
One important way to reduce the power consumption of the platform is to combine DVFS with dynamic power management. DVFS allows processors to dynamically change their supply voltage and clock frequency. In our target platform, it is assumed that the clock frequency of each ADRES processor can be dynamically scaled, independently from each other. Allowed clock frequencies are 20, 60, 100, 140, 180 and 220 MHz. 
Target application
The application is an industry-standard hybrid video encoder, namely the MPEG4 encoder [50, 51]. For this application, the quality-of-service requirements are specified through the frame rate and the frame resolution.
Using the MPSoC parallelisation assist (MPA) tool [43], several parallelisations of the application can be generated at design time for the target platform. As input, MPA takes sequential code and parallelisation specifications (i.e. functions and loops to be parallelised) of the application. As output, it generates a parallelised application code. For the MPEG4 encoder, parallelisations with 1-7 parallel threads have been generated.
For each MPEG4 encoder parallelisation, the MP-SoC platform simulator HLSim also provided at design time rapid feedback about the expected performance, memory space and communication bandwidth usage of the parallelisation, based on platform parameters and profiled data in the sequential implementation on the ADRES processor.
In our demonstrator, three MPEG4 encoders are simultaneously activated on the ADRES processors of the target platform with the same 4CIF frame resolution, but with different frame rates.
Use-case scenario
As use-case scenario, an enhanced automotive cognitive safety system (ACSS) is considered. An ACSS is an on-board vehicle digital system that analyses, anticipates and acts in response to ever-changing conditions to help keep passengers safer. For example, it exploits a wide range of sensors (cameras and radars) for actuating emergency measures such as forward collision warning, automatic pre-crash emergency brake, lane departure warning and guidance, lane change assistance and blind spot assistance.
The vehicle is assumed to have three cameras associated with the left, centre and right mirrors, respectively (Fig. 5b) . These cameras are connected to the host processor of our ADRES-based platform, where three MPEG4 encoders are simultaneously running to encode the three video streams. Then, these encoded streams are sent to an off-chip central safety unit (CSU), which reduces the needed bandwidth of the on-chip buses.
Moreover, the driver of the vehicle can watch the actual content of the video streams coming from the CSU on the displays of the dashboard. He can also independently select which video stream he wants to watch and set the frame rates.
Requirements:
In this use-case scenario, two main requirements are considered. First, to operate correctly and to achieve safety-critical operations, it is assumed that the CSU should meet a frame rate of at least 15 frames per second (FPS) for each video stream. Second, whenever the driver activates one or more video streams on the dashboard, to allow him to achieve a reasonable quality, the CSU regularly communicates to the platform a new set of frame rates and priorities. This is done whenever (i) another vehicle is detected in the proximity by the CSU or (ii) the vehicle speed crosses one of the following thresholds: every 10 km/h for lateral cameras and every 20 km/h for the central camera.
On the one hand, the frame rates are then set at run time by the CSU as follows:
† For lateral cameras: 15 FPS when the speed is under 10 km/h, 25 FPS when the speed is over 110 km/h, a frame rate being linearly interpolated for intermediate thresholds, and 30 FPS when another vehicle is detected in the proximity.
† For the centre camera: 15 FPS when the speed is under 20 km/h, 20 FPS when the speed is over 120 km/h, a frame rate being linearly interpolated for intermediate thresholds.
On the other hand, the priorities are also set at run time by the CSU as follows. Whenever there is no vehicle in the proximity, a high priority is given to the left and centre cameras. When a vehicle is detected in the proximity, a high priority is given to the centre camera and to the lateral camera close to the approaching vehicle.
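These rules translate directly into a few small functions. The sketch below interpolates on the vehicle speed itself, which is a simplifying assumption: the CSU only sends updates when a speed threshold is crossed, so the rates it actually communicates are sampled at those thresholds. Function names are illustrative.

```python
def _interpolate(speed, lo_speed, hi_speed, lo_fps, hi_fps):
    """Linear interpolation of the frame rate, clamped at both ends."""
    if speed <= lo_speed:
        return lo_fps
    if speed >= hi_speed:
        return hi_fps
    return lo_fps + (hi_fps - lo_fps) * (speed - lo_speed) / (hi_speed - lo_speed)


def lateral_frame_rate(speed_kmh, vehicle_nearby=False):
    """Required frame rate (FPS) for the left or right camera."""
    if vehicle_nearby:
        return 30.0
    return _interpolate(speed_kmh, 10.0, 110.0, 15.0, 25.0)


def centre_frame_rate(speed_kmh):
    """Required frame rate (FPS) for the centre camera."""
    return _interpolate(speed_kmh, 20.0, 120.0, 15.0, 20.0)


def high_priority_cameras(approaching_side=None):
    """Cameras given a high priority; approaching_side is 'left',
    'right' or None when no vehicle is in the proximity."""
    if approaching_side is None:
        return {"left", "centre"}
    return {"centre", approaching_side}
```

With these end points, the lateral rate grows by 1 FPS per 10 km/h and the centre rate by 1 FPS per 20 km/h, which matches the spacing of the speed thresholds used by the CSU.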
Simulation:
The vehicle speed and the arrival times of approaching vehicles have been simulated according to several reasonable driving patterns and conditions. For simplicity, the results associated with a single pattern in an urban environment are outlined in Fig. 6 . This figure illustrates the vehicle speed and the required video stream frame rates for the three cameras: (i) the three video streams are activated at the beginning of the simulation; (ii) the frame rate of the video stream on the right camera is only influenced by the vehicle speed (Fig. 6d) ; (iii) the frame rate of the video stream on the left camera is also influenced by two vehicles, which are consecutively approaching (Fig. 6b) and (iv) at around 50 s, the driver deactivates the centre camera, and the frame rate of the corresponding video stream is reduced to 15 FPS (Fig. 6c) .
Applying our flow to the demonstrator
This section reports the experiments and results obtained for each step of our flow applied to the demonstrator. The goal of these experiments is to show the very low complexity of the run-time selection step. As mentioned in Section 5.3, the complexity is independent from the number m of resource constraints. Hence in the following, only the processor usage is considered. As depicted in Fig. 1 , our flow links the Multicube Explorer exploring the MPEG4 encoder configurations at design time with a lightweight RRM running on the StrongARM host processor of the platform.
Design-time exploration of MPEG4 encoder configurations
In our demonstrator, any MPEG4 encoder configuration is characterised by (i) the parallelised code, generated by MPA; (ii) the number r of needed ADRES processors (also called cores in the remainder of the paper); (iii) the combination of clock frequencies for each of the needed ADRES processors; (iv) the platform power consumption associated with the configuration and the execution time to run the parallelised code on the platform, estimated by the platform simulator.
As mentioned in Section 6.1, the allowed clock frequencies for an ADRES processor are 20, 60, 100, 140, 180 and 220 MHz. Hence for any parallelisation of the MPEG4 encoder, depending on the number of needed ADRES processors, the number of clock frequency combinations varies from 6 to 6^7 = 279 936. An exhaustive exploration of all MPEG4 encoder configurations is therefore not feasible. Instead, the exploration is mainly performed by the NSGA-II multi-objective genetic algorithm [47] of the Multicube Explorer. As explained in Section 5.1.1:
† For r ≤ 3, an exhaustive exploration is performed. Hence for each such r, 6^r operating points are generated.
† For r ≥ 4, the exploration is performed by the NSGA-II multi-objective genetic algorithm [47]. For each such r, the genetic algorithm runs with a population size |F| × r for 50 generations, resulting for the worst case in 50 × |F| × r simulations. The exact number of simulations for each such r is shown in Table 1, together with the size of the corresponding complete design space (i.e. 6^r).
Among the 2905 operating points generated from the Multicube Explorer, only 153 explored configurations of the MPEG4 encoder are kept after the first filtering of the design-time exploration. They are represented in Fig. 7a. These configurations characterised by the same r are Pareto optimal according to the average execution time of the MPEG4 encoder and the corresponding average power consumption of the platform. The Pareto set of configurations for the MPEG4 encoder (in total, 41), after the second filtering of the design-time exploration, is depicted in Fig. 7b. These configurations are Pareto optimal according to the number of used ADRES processors, the average execution time of the MPEG4 encoder and the corresponding average power consumption of the platform.
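Both the size of the design space and the Pareto filtering can be illustrated with a short sketch. It is a generic quadratic-time dominance filter written for this illustration, not the filtering code of the Multicube Explorer, and it treats every objective as a quantity to be minimised (number of processors, execution time, power).

```python
def design_space_size(r, n_freqs=6):
    """Number of clock frequency combinations for r processors."""
    return n_freqs ** r


def dominates(a, b):
    """True if point a is no worse than b in every objective and
    strictly better in at least one (all objectives are minimised)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(points):
    """Keep the points that are not dominated by any other point."""
    return [p for p in points if not any(dominates(q, p) for q in points)]
```

Applied to (execution time, power) pairs sharing the same r, the filter corresponds to the first filtering; applied to (r, execution time, power) triples, it corresponds to the second one.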
This filtering, which reduces the 2905 explored operating points to 41 configurations, significantly alleviates the complexity of the run-time selection (Section 5.3.2), where the number of kept configurations is critical. Fig. 7c illustrates a first filtering that can be performed at design time. Indeed, a frame rate of at least 15 FPS (Section 6.3.1) is required to allow the CSU in the vehicle to operate correctly and to achieve safety-critical operations. Such a frame rate implies a maximum average execution time of 0.067 s per frame. This is already possible with one ADRES processor per MPEG4 encoder (Fig. 7c). Since three MPEG4 encoders must simultaneously run on the platform, three ADRES processors on the platform are at 
Run-time selection of MPEG4 encoder configurations
Whenever the CSU sets a new set of frame rates and priorities for the three video streams, the RRM is called to globally select one configuration for each MPEG4 encoder, according to the seven supplied ADRES processors and the allowed clock frequencies, in order to minimise the platform power consumption, while satisfying the required frame rates. Fig. 8 presents the results of each RRM call during the simulation pattern described in Section 6.3.2. It shows the number of needed ADRES processors and the average clock frequency among the used processors, as adopted by the RRM, for the selected configurations of the three MPEG4 encoders. Fig. 9a profiles the average power consumption of the platform during the simulation pattern of Section 6.3.2, whereas Fig. 9b profiles the difference between the required frame rates and the achieved ones.
From Figs. 8 and 9, it can be observed that:
1. When the required frame rate is 15 FPS, that is, when the vehicle speed is low, or when no vehicle is approaching, or when all cameras are deactivated by the driver (at the start, around 40 s and around 70 s in the simulation pattern):
† the RRM selects for the left MPEG4 encoder the configuration using one ADRES processor with high frequency;
† the RRM selects for both centre and right MPEG4 encoders the configuration using three ADRES processors with low frequency.
This selection yields an actual frame rate of 16 FPS, that is, one FPS more than the required one (Fig. 9b).
2. When a vehicle is approaching, that is, around 35 and 45 s:
† the RRM increases the number of ADRES processors for the left MPEG4 encoder, by stealing two ADRES processors from the centre MPEG4 encoder;
† the RRM also increases the clock frequency, yielding two peaks in the platform power consumption (Fig. 9a).
Before the driver deactivates the centre camera, the required frame rates of all cameras cannot be achieved (Fig. 9b) . Hence, the priority-based mechanism of the RRM comes into play. The frame rate of the low-priority right MPEG4 encoder is relaxed, up to five FPS. This relaxation allows the RRM to steal ADRES processors from this right MPEG4 encoder.
RRM overhead
The execution time of the run-time selection heuristic must stay within acceptable bounds. As a reference, the time required to start a new application under the Linux OS is in the order of 1-10 ms [52], depending on the platform. For our experiments, the run-time selection is simulated on the cycle-accurate SimIt-ARM simulator [53] for the StrongARM microarchitecture running at 206 MHz. The performance of the run-time selection is reported in Fig. 10 for the simulation pattern of Section 6.3.2, from which we can derive the following observations. Whenever the greedy algorithm finds a feasible solution, the run-time selection heuristic takes between 0.146 and 0.218 ms. Whenever the greedy algorithm does not find any feasible solution, the frame rate needs to be relaxed, and the greedy algorithm is executed more than once. This leads to an execution time between 0.316 and 0.388 ms. Hence overall, any execution of the run-time selection heuristic, including the frame rate relaxation when needed, takes less than 0.5 ms. Compared with the 1-10 ms reference above, this overhead can be considered negligible.
Conclusion
This paper presented a flow linking a design-time design space explorer, coupled with platform simulators, with a fast and lightweight priority-based heuristic to select near-optimal configurations for the active applications. As illustrated by our experiments, the number of possible application configurations can be huge. Hence a design-space explorer is needed within the embedded system design cycle. Coupling the explorer with platform simulators at different abstraction levels allows performance and power consumption estimates of the platform running given application configurations to be reported quickly and accurately. Our priority-based run-time selection, relying on the identified configurations, is based on a very fast optimisation heuristic to globally select configurations of active applications, while possibly relaxing the constraints of soft real-time applications. Future work will focus on the integration of further low-complexity and complementary techniques in the run-time manager, such as task mapping, run-time quality monitoring, dynamic reconfiguration and proactive energy management. It will also incorporate the reconfiguration cost into our run-time selection heuristic.
Acknowledgments
This work has been supported by the European research project MULTICUBE under project number FP7-216693, and by the Swiss National Science Foundation under the grant 'Design Space Exploration of Run-Time Resource Management for Multi-Processor Systems-on-Chip'.
