Abstract-Modern multicore processors for the embedded market are often heterogeneous in nature. One feature often available is a set of multiple sleep states with varying transition costs for entering and leaving them. This research effort explores energy-efficient task mapping on such a heterogeneous multicore platform to reduce the overall energy consumption of the system. This is performed in the context of a partitioned scheduling approach and a realistic power model, which improves over some of the simplifying assumptions often made in the state-of-the-art. The developed heuristic consists of two phases: in the first phase, tasks are allocated to minimise their active energy consumption, while the second phase trades off a higher active energy consumption for an increased ability to exploit savings through more efficient sleep states. Extensive simulations demonstrate the effectiveness of the approach.
I. INTRODUCTION
For embedded real-time (RT) systems it is imperative that timing constraints posed by the environment are met. In this general context a number of trends can be identified. Firstly, Moore's law is no longer sustained by increasing clock frequencies, but rather by the addition of extra cores in multiprocessors. This is driven, for example, by the performance-per-watt ratio, as higher clock rates also demand higher supply voltages. Besides symmetric multicore processors, homogeneous and heterogeneous multicores are gaining in popularity. The move beyond symmetric multicores is driven by the use of cores geared to perform specific tasks both well and cheaply. A second trend is an increased interest in multi-criticality devices, where part of the system is critical and other parts are executed in a best-effort manner. Finally, there is the move towards increased use of embedded devices with limited energy supply. These might be, for example, solar-powered devices in the field or handheld rechargeable devices. In these kinds of devices, effective management of the limited resource (energy) is another constraint in the system requirements.
The real-time community has recognised these trends and provided solutions to these challenges. However, in most cases, the power and energy models used make many simplifying assumptions, which limit the applicability of the presented solutions. Common assumptions are, on one side, homogeneous multicore processors with a constant speed factor between the different cores; on the other side, that the energy consumption of different applications is only a function of execution time rather than of other task characteristics (e.g. the number of cache misses). The latter has been shown to be widely off the mark [1]. Finally, the use of multiple available sleep states is rare.
The fact that task characteristics, such as the cache miss pattern, have an influence on the energy consumption beyond the mere change of execution time means that analytical solutions are bound to be suboptimal for most specific cases. As such, the way forward is an effective heuristic to be used for energy management. Within this work, we assume that the system has such non-linear dependencies between execution time and energy consumption, as well as several sleep states. In order to guarantee the temporal isolation requirement, we work with a partitioned scheduling approach. The underlying approach per CPU, ERTH [2], allows reconfiguration at run-time and thus enables limited migration; however, in this work we focus on the task partitioning and mapping problem. In the allocation stage, the approach uses the average-case energy consumption as its objective function, while the real-time constraints are based on the worst-case execution time and the minimum inter-arrival time.
The proposed approach is divided into two phases. Firstly, the novel algorithm performs assignments with the objective of reducing the active/dynamic energy consumption of the system by allocating tasks to their favourite processors. A processor is considered the favourite of a task if the task's active energy consumption on it is minimal when compared to all other processor types. In the second phase, the algorithm accepts a higher active energy consumption of tasks to enhance the processor's ability to use more efficient sleep states. The sleep states allow the processor to reduce the static power consumption of the system in idle intervals. The second phase is motivated by the fact that the static power consumption has become a non-negligible portion of the overall energy consumption of the system.
Traditional task assignment algorithms aim to reduce the active power consumption of the system by assigning the tasks to their favourite processor, while ignoring the static power consumption. The management of the static power consumption of the processor depends on the properties of the tasks, such as their respective minimum inter-arrival times and worst-case execution times. For instance, assume the task assignment is such that it generates a large number of idle intervals in combination with a short-period task. The processor may not be able to exploit these intervals to use deeper sleep states, due to the combination of the larger transition overhead of those states and the short-period task.
The paper is organised as follows. Section II discusses the related work followed by the system model in Section III. Section IV presents the two phase approach to do the task assignment followed by the experimental setup and results given in Section V. We conclude and provide future directions in Section VI.
II. RELATED WORK
Energy-efficient scheduling for homogeneous multiprocessors has been widely explored in RT systems in the last decade. For instance, Kandhalue et al. [3] recently presented a Single-clock domain multi-processor Frequency Assignment Algorithm (SFAA) for periodic, implicit-deadline tasks under fixed-priority (Rate-Monotonic) scheduling. It exploits the task period relationships to determine an energy-efficient frequency assignment. Chen et al. [4] provided a comprehensive survey of such techniques. In contrast, the state-of-the-art in power-aware heterogeneous multiprocessors is limited.
Yu and Prasanna [5] proposed the static allocation of tasks in an RT system onto heterogeneous processing units under Dynamic Voltage Scaling (DVS). They formulated the problem as an Integer Linear Programming (ILP) problem and provided a linearisation heuristic. A pseudo-polynomial-time greedy algorithm [6] was proposed by Huang et al. for the frame-based RT task model and heterogeneous systems. Furthermore, a greedy heuristic is provided to migrate tasks from the overloaded processor to reduce energy consumption. Luo and Jha addressed the task model with precedence constraints and proposed a list-scheduling strategy [7] for heterogeneous distributed systems. Chen and Thiele [8] considered the case of 2-type heterogeneous processors and proposed a polynomial-time approximation scheme based on the ratio of task execution times on the different processor types. The synthesis problem for heterogeneous platforms is addressed by Hsu et al. [9] for the RT task model. They proposed an approximation algorithm based on a rounding technique, applying a parametric relaxation to an ILP to minimise the processor cost under the given timing and energy cost. Hung et al. [10] considered a heterogeneous platform with 2 processing elements, one with a DVS-enabled core and the second without DVS capability, with the objective of reducing the overall energy consumption and maximising the energy saving in migration from the DVS-enabled core to the non-DVS core. While DVS has its advantages, the state-of-the-art [5]-[10] ignores the static power consumption. In this paper we focus on the shut-down mechanism, which effectively exploits the idle intervals in the schedule to reduce the static power consumption of the system, a consumption that has become a considerable factor in the overall power consumption of modern embedded systems.
Yang et al. [11] proposed an approximation algorithm based on dynamic programming that provides a polynomial-time solution when the number of processor types is a small constant. However, in the general case, when the restriction on the number of processor types is relaxed, this scheme has exponential time/space complexity. They also assume the static power consumption of the system to be a constant factor. Chen et al. [12] presented a task assignment algorithm for periodic real-time tasks on heterogeneous platforms. The problem is formulated as an ILP problem. They relax some of the assumptions to turn it into a linear programming (LP) problem and solve it through extreme point theory [13]. The tasks assigned fractionally in the previous steps are reassigned through known heuristics such as first-fit, best-fit, worst-fit or last-fit. Both works ([11], [12]) assume that the static power consumption of the system is a constant factor and that it cannot be reduced due to the significant overhead of the sleep transitions. This assumption does not hold for modern processors, which contain several sleep states to reduce the static power consumption of the system. Moreover, the static power consumption has become a considerable part of the overall energy consumption. Therefore, the effect of the task allocation on the power consumption in the sleep states should be considered to avoid suboptimal assignments.
Our proposed algorithm is based on a realistic power model. It considers the effect of task properties on both the active and the static power consumption of an assigned processor. In the context of heterogeneous multicores, the state-of-the-art considers only the dynamic power consumption, and either ignores the static power consumption or treats it as a constant factor while doing task allocation on such platforms.
III. SYSTEM MODEL
A. Platform
We assume a partitioned multicore architecture, with M different types of heterogeneous processors/cores. Each processor type has a unique characteristic of power consumption and execution capability when compared to others. We consider only a single processing unit of each processor type π m , for the separation of concerns and ease of notation. Each processor type π m has a utilisation of U m .
B. Task Model
We assume a sporadic task model with independent tasks $\tau_i = \langle C^{all}_i, D_i, T_i \rangle$, where $C^{all}_i$ is a vector of the worst-case execution times of $\tau_i$ on the M different processor types, $D_i$ is the deadline and $T_i$ is the minimum inter-arrival time. Each independent task releases an unbounded sequence of jobs $j^m_{i,k} = \langle r_{i,k}, \hat{c}_{i,k}, d_{i,k} \rangle$, where $r_{i,k}$, $\hat{c}_{i,k}$ and $d_{i,k}$ are the absolute release time, actual execution time and absolute deadline respectively. Jobs of the same task are allowed to vary their execution time between the best-case execution time (BCET) and the worst-case execution time (WCET) of $\tau_i$.
The Enhanced Race-To-Halt (ERTH) algorithm [2] is used on each processor, which is a leakage aware energy management approach for dynamic priority systems. It allows multiple sleep states per processor and utilises spare capacity available online to save total energy consumption of the system. ERTH is based on the Rate-Based Earliest Deadline first (RBED) framework [14] , which provides temporal isolation via an enforced budget associated with each task. This temporal isolation allows for mixed criticality workloads. Though RBED supports many application classes (such as Hard RT, Soft RT and Best Effort (BE) tasks), we focus in our discussion on BE and Hard RT tasks without loss of generality.
C. Power Model
The power model used in the state-of-the-art assumes two different parts: dynamic (active) power and static (leakage) power. Dynamic power consumption varies with the frequency of the processor, while static power consumption is considered a constant factor. Consequently, such a power model assumes that the energy consumption of an application on a processor is only a function of its execution time. However, in real terms, energy consumption on a certain processor also depends on the set of instructions it has to execute to perform the desired functionality. Different instructions use different parts of the CPU, and hence may result in a different energy consumption. Therefore, two applications with identical execution time may consume different amounts of energy depending on the characteristics of the instructions used and the number of cache misses involved. Secondly, the static power consumption of the system cannot be regarded as a constant factor. If the energy saving mechanism is based on sleep states, then the static power consumption of the system depends on the energy characteristics of the sleep states used. We employ this more refined power model, in which the energy consumption of a system is not constant per unit time, but rather depends on the behaviour of the application, the sleep-state characteristics of the processor and the use of sleep states by the scheduling algorithm.
We assume only a single speed per core (i.e. no DVS), as DVS would add another dimension for optimisation and is hence avoided for separation of concerns. The power consumption of the processor type π m in active mode and idle mode are P […] The sleep-state parameters can be used to derive the break-even time BET m n using any known technique [2]. Note that, for practical considerations, BET m n is at least 2 × tr m n. The average energy consumption of all tasks on all processor types is determined offline using any known technique (for instance, an energy measurement technique based on performance monitoring counters [15]). Nevertheless, one can also use our approach with the naïve power model, which assumes that the energy consumption of the processor in active mode is constant per unit time, or consider the worst-case energy consumption as the optimisation target. A task's preference among the processor types follows the ascending order of its energy consumption on them. The most favourite processor type for a task is the one where its energy consumption is minimal. Similarly, the least preferred processor type is the one where the energy consumption of the task is maximal. We assume the static power consumption of the system is not constant. It can be reduced by using efficient low-power sleep states in the idle intervals.
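The preference ordering described above can be illustrated with a minimal sketch. It assumes that the offline-determined average energy of one task on each processor type is available as a plain list; the function name `preference_order`, the tie-breaking rule and the sample values are hypothetical and only serve the illustration.

```python
from typing import List, Sequence


def preference_order(avg_energy: Sequence[float]) -> List[int]:
    """Rank processor types for one task, most favourite first.

    avg_energy[m] is the (offline-determined) average energy the task
    consumes on processor type m. Ties are broken by the lower index,
    which is an arbitrary choice made for this sketch.
    """
    return sorted(range(len(avg_energy)), key=lambda m: (avg_energy[m], m))


# Hypothetical energy values of one task on three processor types.
example_energy = [4.0, 2.5, 3.1]
ranking = preference_order(example_energy)
favourite, least_preferred = ranking[0], ranking[-1]
print(ranking, favourite, least_preferred)
```

The first index of the ranking is the most favourite processor type and the last index is the least preferred one.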
D. Problem Statement
We consider an M-type heterogeneous platform with several sleep states per core, each with its own energy/time transition overhead, in a setting of partitioned scheduling. The goal is to map a given task-set onto this platform such that the overall energy consumption (active + sleep) of the system is minimised.
IV. ALLOCATION HEURISTICS
In order to tackle active and static power consumption, a two-phase algorithm is proposed to perform the task assignment for the given M-type heterogeneous platform. The first phase of the algorithm optimises the assignment such that it reduces the active energy consumption of the system. The second phase trades off the tasks' active energy consumption to enhance the ability of the processors to use efficient sleep states and thereby reduce the static power consumption of the system. We will use the terms processor type, core type and core interchangeably.
A. First Phase of Allocation
We propose two different assignment algorithms to reduce the dynamic power consumption of the system.
1) Least Loss Energy Density Algorithm (LLED):
This algorithm attempts to allocate tasks to their favourite core to optimise the energy consumption of the individual tasks in the system. However, not all tasks may be allocated to their respective favourite core type due to the limited capacity on each core. In such a scenario, where more than one task is competing for its favourite core type, we need to rank the tasks against each other on the same core type.
We define the energy density $ED^m_i$ of a task $\tau_i$ on a core type $\pi_m$.
The energy density of a task gives its average energy consumption per unit time on the respective core type. This value does not provide any global perspective on how the power consumption of the system changes when a certain task is not allocated to its preferred core type. The global perspective can be achieved through a metric termed density difference (DD). The density difference is determined by subtracting the energy density of a task on the current core type from the next higher energy density value of the same task on another core. It can be computed with the following expression: $DD^m_i = ED^x_i - ED^m_i$, where $\pi_x$ is the core type with the next higher energy density for $\tau_i$.
It defines how much extra energy will be consumed if the task is allocated to the core with the next higher energy consumption instead of its current preferred core type. To get the ranking of the tasks on the given core, we sort all the tasks on this core in descending order with respect to their DD values. The tasks from the top of the list, i.e. tasks with higher DD values, are allocated first. The intuition behind such a mechanism is to reduce the losses by allocating the tasks with a higher energy density difference first. The process can be started from any core type. A task allocated to a core is not considered for an allocation on any other core where it consumes more energy than on its currently allocated core. The same procedure is repeated for all cores. In the worst-case scenario, the process is iterated over each core at most $|\tau|$ times, i.e. once per task in the task-set.
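The density difference and the resulting per-core ranking can be sketched as follows. This is an illustrative sketch only: the energy density values are taken as given inputs, the function names are hypothetical, and the value on the least preferred core follows the convention the text states for Algorithm 1 (the energy density is subtracted from 0).

```python
from typing import List


def density_differences(ed: List[List[float]]) -> List[List[float]]:
    """Compute DD for every task on every core type.

    ed[i][m] is the energy density of task i on core type m.
    dd[i][m] is the next higher energy density of task i minus ed[i][m].
    On the least preferred core no higher value exists, so the energy
    density is subtracted from 0, which yields a negative DD.
    """
    dd = []
    for row in ed:
        dd_row = []
        for value in row:
            higher = [v for v in row if v > value]
            dd_row.append((min(higher) if higher else 0.0) - value)
        dd.append(dd_row)
    return dd


def rank_on_core(dd: List[List[float]], core: int) -> List[int]:
    """Task indices on one core, highest DD first (ties: lower index)."""
    return sorted(range(len(dd)), key=lambda i: (-dd[i][core], i))


# Hypothetical energy densities: 2 tasks, 3 core types.
example_ed = [[2.0, 5.0, 3.0],
              [1.0, 4.0, 6.0]]
example_dd = density_differences(example_ed)
print(example_dd)
print(rank_on_core(example_dd, 0))
```

In this example the second task loses more energy than the first if it is not placed on core type 0, so it is ranked first on that core.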
The pseudo-code of the Least Loss Energy Density algorithm (LLED) is given in Algorithm 1. Initially, we compute the energy density $ED^m_i$ of every task on all core types (line 2). Using the energy density values, the DD values of all tasks are estimated on each core and stored in a matrix called MT (lines 3-6, 10). (Note: the value $MT^q_w$ in the matrix MT corresponds to the DD value of $\tau_w$ on core type $\pi_q$.) To obtain the DD value of the task $\tau_w$ on its least preferred core type ($\max_{x=1,\cdots,M} ED^x_w$), its energy density value on the least preferred core type is subtracted from 0 (line 8) to obtain a negative value. Afterwards, the algorithm iterates through the processors in any order (for example, we used the processor indices to order them). Starting from the first core type $\pi_q$, all tasks on $\pi_q$ have their entries in $MT^q$ sorted in descending order with respect to their $MT^q_w$ values. A task $\tau_x$ that can be allocated to $\pi_q$ is assigned to it, and its entries on the core types where it consumes more energy are removed. In other words, $\tau_x$ is not considered for allocation on other core types where it consumes more or equal energy compared to this core type $\pi_q$. If the task $\tau_x$ was previously allocated to one of these higher energy consuming core types, it is deallocated from that core (line 20). Once the allocation of $\tau_x$ on $\pi_q$ is completed, LLED attempts to allocate the next task in the sorted list. If any task in the order cannot be allocated to $\pi_q$, the algorithm moves to the next core type instead of checking the next tasks in the order. This action is performed to avoid allocating to the current core type an unfavourable task that may still have a chance of allocation in the next iteration. The same procedure is repeated for the next core type and so on. On completion of the first iteration, the algorithm starts again from the first processor type. These iterations are repeated until all the tasks are allocated to exactly one core type. In the worst case, the algorithm has to check each task on each core type $|\tau|$ times. Lines 13-26 in Algorithm 1 correspond to these steps.
Therefore, the complexity of this algorithm is O($|\tau|^2$ × M). The working of the algorithm is demonstrated with an example.
a) Example: We consider a set of 4 tasks and 3 core types. The task specifications are given in Figure 1 […] Figure 1(c). τ 4 can be allocated to π 1; therefore, its entry on π 3, where it consumes more energy compared to this core type, is deleted. τ 2 cannot be allocated; therefore, we move to π 2 and sort the task-set accordingly. On core type π 2, τ 1 and τ 4 can be allocated. The entry of τ 1 on π 3 and the entries of τ 4 on π 1 and π 3 are deleted due to the higher energy consumption. Similarly, after appropriate sorting of the tasks with respect to their DD values on π 3, τ 2 and τ 3 can be allocated to π 3. Therefore, the entry of τ 2 on π 2 and the entries of τ 3 on π 2 and π 1 are deleted. This completes our first iteration, and the status of the tasks after the first iteration is shown in Figure 1(c). Similarly, we perform the second iteration. On π 1, the entry of τ 4 is deleted, so it is not considered for allocation and the system attempts to allocate the next task in the order (i.e. τ 2). The rest of the process is similar to the first iteration. The end result of the second iteration is shown in Figure 1(d). We do not need any further iterations as all the tasks are assigned. The worst-case number of iterations is equal to the task-set size.
Algorithm 1 (LLED), recoverable lines only:
    ...
    end for
    12: end for
    13: for all Tasks do
    14:   for q = 1 to M do {/* For all processor types */}
    15:     Sort all tasks having an entry in MT q, w.r.t. MT q w values in descending order
    16:     for all τw ∈ τ on core type q in descending order of MT q w values do
              ...
              Assign τw to π q
              ...
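The iteration described above can be sketched in code. This is a hedged illustration rather than the authors' implementation: it assumes a simple utilisation-based feasibility test (the load of a core must not exceed a capacity of 1), breaks ties by task index, and uses hypothetical names (`lled`, `util`, `entries`). The energy density values are taken as given inputs.

```python
from typing import List, Optional


def density_difference(ed_row: List[float], m: int) -> float:
    """Next higher energy density minus the one on core m (0 - ED if none)."""
    higher = [v for v in ed_row if v > ed_row[m]]
    return (min(higher) if higher else 0.0) - ed_row[m]


def lled(ed: List[List[float]], util: List[List[float]],
         capacity: float = 1.0) -> List[Optional[int]]:
    """Sketch of the LLED allocation loop.

    ed[w][q]:   energy density of task w on core type q.
    util[w][q]: utilisation of task w on core type q (assumed input).
    Returns the core index per task, or None if a task stays unallocated.
    """
    n_tasks, n_cores = len(ed), len(ed[0])
    dd = [[density_difference(ed[w], q) for q in range(n_cores)]
          for w in range(n_tasks)]
    # entries[q] plays the role of row q of the matrix MT.
    entries = [set(range(n_tasks)) for _ in range(n_cores)]
    assigned: List[Optional[int]] = [None] * n_tasks
    load = [0.0] * n_cores

    for _ in range(n_tasks):              # at most task-set-size iterations
        for q in range(n_cores):
            ranked = sorted(entries[q], key=lambda w: (-dd[w][q], w))
            for w in ranked:
                if assigned[w] == q:
                    continue              # already placed on this core
                if load[q] + util[w][q] > capacity + 1e-9:
                    break                 # move on to the next core type
                if assigned[w] is not None:
                    # Deallocate from the higher-energy core used so far.
                    load[assigned[w]] -= util[w][assigned[w]]
                assigned[w] = q
                load[q] += util[w][q]
                # Drop entries on cores with more or equal energy.
                for p in range(n_cores):
                    if p != q and ed[w][p] >= ed[w][q]:
                        entries[p].discard(w)
        if all(a is not None for a in assigned):
            break
    return assigned


# Hypothetical input: both tasks favour core 0, but only one fits.
example_ed = [[1.0, 3.0], [1.0, 2.0]]
tight = lled(example_ed, [[0.6, 0.6], [0.6, 0.6]])
loose = lled(example_ed, [[0.4, 0.4], [0.4, 0.4]])
print(tight, loose)
```

In the tight case the task with the larger density difference keeps the shared favourite core and the other task is placed on the second core; in the loose case both tasks fit on their favourite core.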
2) MaxMin Algorithm (MM): Another simple heuristic, MaxMin (labelled MM), which can be used to assign tasks on an M-type heterogeneous platform to reduce the active power consumption, is given in Algorithm 2. Assume $ED^{min}_i$ is the energy density of task $\tau_i$ on its most favourite core type, while $ED^{max}_i$ corresponds to its energy density on the least preferred core type. For each task, this heuristic computes the difference of $ED^{max}_i$ and $ED^{min}_i$, and the tasks are ordered in a list with respect to this difference (line 5). The MM algorithm picks a task from the top of the list and assigns it to its favourite core type. If the favourite core cannot accommodate this task, an allocation attempt is made on the next core type in ascending order of the task's energy consumption (lines 8-14). If the task is assigned to a core type, the utilisation of the corresponding core type is incremented accordingly. The MaxMin algorithm is simple and has a complexity of O($|\tau|$ × M).
Algorithm 2 (MM), recoverable lines only:
    ...
    Sort cores with respect to the energy consumption of τi in ascending order
    8: for all Processors j = 1 to M do
         ...
         Assign τi to π j
         ...
       end if
    14: end for
    15: end for
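A compact sketch of the MaxMin heuristic follows. It rests on assumptions that should be read as such: tasks are taken in descending order of the difference between their largest and smallest energy density (inferred from the name of the heuristic), feasibility is a simple utilisation test against a capacity of 1, and the names `max_min` and `util` are hypothetical.

```python
from typing import List, Optional


def max_min(ed: List[List[float]], util: List[List[float]],
            capacity: float = 1.0) -> List[Optional[int]]:
    """Sketch of the MaxMin (MM) assignment.

    ed[i][j]:   energy density of task i on core type j.
    util[i][j]: utilisation of task i on core type j (assumed input).
    Returns the core index per task, or None if no core can host it.
    """
    n_tasks, n_cores = len(ed), len(ed[0])
    # Assumption: largest (ED_max - ED_min) first; ties by task index.
    task_order = sorted(range(n_tasks),
                        key=lambda i: (-(max(ed[i]) - min(ed[i])), i))
    load = [0.0] * n_cores
    assigned: List[Optional[int]] = [None] * n_tasks
    for i in task_order:
        # Cores in ascending order of the task's energy consumption.
        for j in sorted(range(n_cores), key=lambda c: (ed[i][c], c)):
            if load[j] + util[i][j] <= capacity + 1e-9:
                assigned[i] = j
                load[j] += util[i][j]   # update the core utilisation
                break
    return assigned


# Hypothetical input: task 1 loses far more than task 0 when displaced.
result = max_min([[1.0, 2.0], [1.0, 5.0]],
                 [[0.6, 0.6], [0.6, 0.6]])
print(result)
```

Here the task with the larger difference is served first and keeps the shared favourite core, while the other task falls back to its next core in ascending energy order.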
B. Second Phase of Optimisation
While the first phase of allocation is derived with the objective of optimising the active energy consumption of each individual task in the system, it ignores the effect of the allocation on the mechanism that reduces the static power consumption. For instance, a core may have a lower active energy consumption, but some small group of tasks allocated to it may prevent it from using a more efficient, deeper sleep state in the idle intervals of the schedule to reduce the static power consumption of the system. In this second phase of optimisation, our algorithm analyses the properties of the tasks allocated to a core in this broader context and considers their effect on the core's ability to use more efficient sleep states, trading off a higher active energy consumption of a task for energy savings in sleep states.
As mentioned previously, we assume ERTH per core. The ERTH scheduler is based on a race-to-halt strategy and reduces static power consumption with a shut-down mechanism. It determines offline the maximum time interval for which the processor may be forced into a sleep state without causing any task to miss its deadline under worst-case assumptions. This maximum time interval of a sleep state is termed the maximum-feasible-sleep-threshold th m, and it can be determined using the demand bound function (DBF) [2]. Assuming synchronous release of all tasks allocated to a core π m, […]. The intuition behind the second phase is to collate tasks with similar properties on a core such that it can use a more efficient sleep state. As we are using a heterogeneous platform, each core has sleep states with different characteristics. A task (or tasks) restricting a more efficient sleep state on one core may not affect the sleep state on another core and hence can be considered for migration. However, the algorithm must ensure that such a migration reduces the overall average energy consumption.
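How the threshold limits the choice of sleep state can be illustrated with a small sketch. The selection rule used here is an assumption made for illustration, not taken from ERTH: a sleep state is treated as usable when its break-even time does not exceed the maximum-feasible-sleep-threshold, and among the usable states the one with the lowest power is chosen. The type `SleepState`, its fields and all numbers are hypothetical.

```python
from typing import NamedTuple, Optional, Sequence


class SleepState(NamedTuple):
    name: str
    power: float   # power drawn while in this sleep state
    bet: float     # break-even time of this sleep state


def most_efficient_usable_state(states: Sequence[SleepState],
                                threshold: float) -> Optional[SleepState]:
    """Pick the lowest-power state whose break-even time fits the threshold.

    Assumption for this sketch: a state is usable if bet <= threshold.
    Returns None when no sleep state can be used at all.
    """
    usable = [s for s in states if s.bet <= threshold]
    return min(usable, key=lambda s: s.power) if usable else None


# Hypothetical sleep states of one core, from shallow to deep.
core_states = [
    SleepState("doze", power=3.0, bet=0.1),
    SleepState("nap", power=1.5, bet=1.0),
    SleepState("sleep", power=0.5, bet=5.0),
]
chosen = most_efficient_usable_state(core_states, threshold=2.0)
print(chosen.name if chosen else None)
```

Under this assumption, moving a task that keeps the threshold at 2.0 to another core could raise the threshold past 5.0 and unlock the deepest state, which is the kind of gain the second phase looks for.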
We propose the heuristic given in Algorithm 3 to perform such a trade-off. Tasks assigned in the first phase are sorted on each core with respect to the difference between T i and C […] We refer to these sets as groups of tasks corresponding to different sleep states. These groups of tasks are ordered from the least efficient to the most efficient sleep state. Thus, if we remove the top-most group of tasks, a core can achieve the next better sleep state. This complete process is repeated for all cores, and finally we have different groups of tasks on each core corresponding to its different sleep states. This step is given in line 4 (Algorithm 3).
Algorithm 3 (second phase), recoverable lines only:
    ...
    Group tasks per core such that next better sleep state can be achieved
    5: Order core by gains when removing group
    6: Feasible = TRUE
    7: for all Processor Types M do
    8:   for all Tasks in a top group do
    9:     Compute the local cost of migration on energy consumption of this task for all other cores
    10:    Sort other cores by decreasing order of cost
    11:    for all Cores except the core of the currently assigned task do
    12:      if Feasible on core then
    13:        Assign to a core
               ...
         end for
    29: until Previous Assignment == Current Assignment
All cores compete to gain the next more efficient sleep state to save energy by getting rid of the tasks in their top-most group, which enforces the less efficient sleep state. However, the algorithm will first consider the core which would result in the largest system energy gain. To identify this core, each core will remove all the tasks associated with the first group (which cause the less efficient sleep state). Let G […]
The algorithm attempts to assign it to the core type with the least migration cost, provided it is schedulable on that core. This process is repeated ∀τ j ∈ G […] If it is less than the previous expected TE consumption, we iterate over the algorithm again, and stop once an iteration no longer reduces the energy consumption compared to the previous one.
The maximum number of groups (of tasks) on each processor is equal to its number of sleep states, and we migrate the complete group to another core. The complexity of each iteration is O( M ). Theoretically, the complexity of the entire algorithm is combinatorial, as a migrant task from one core type can be reassigned to it in another iteration, but for all practical purposes it converges very quickly. The algorithm avoids already computed assignments through the constraint that a new assignment must reduce the energy consumption. The actual computation time and the number of migrations are discussed in Section V-B(2).
V. EVALUATION
A. Experimental Setup
In order to evaluate the effectiveness of our algorithms, we have extended SPARTS (Simulator for Power Aware and Real-Time Systems) [16] and implemented our algorithms for the experiments. SPARTS is used with the parameters defined in Table I. The underlined values are the default values if not specified otherwise in the description of an individual experiment. Heterogeneous multicore platforms are used for a wide variety of complex applications; therefore, the task-set size is varied from a small set of 100 coarse-grained tasks to a large set of 500 fine-grained tasks. The share distributions ξ divide the task-set size and the overall effective system utilisation between RT and BE tasks. Moreover, the utilisation allocated to each task type is randomly distributed among the tasks of the same class. The minimum inter-arrival time of RT and BE tasks is randomly chosen within a range of [30ms; 50ms] and [50ms; 200ms] respectively. SPARTS selects one of the core types and refers to it as the default core type π D. The task-set is initially generated for π D.
The average system capacity U a of the given platform is computed through the speed-up-factors η m. The speed-up-factor defines the ratio of the clock cycle of a core π m with reference to π D. Suppose the speed-up-factor of a core type π m is η m; then the average capacity of the system is U a = 1/η 1 + 1/η 2 + · · · + 1/η M. However, the effective utilisation U of the task-set in the experiments is controlled through a helper variable ζ, with U = U a × ζ. The range of ζ is (0; 1]. In our experiments, ζ is varied from 0.5 to 0.9 with a step size of 0.05. The individual utilisation of τ i on each π m is a random number within a range of […], where β is a characteristic factor that models the fact that different tasks will respond differently in terms of execution time when moved from one core to another.
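The relation between the speed-up-factors, the average capacity and the effective utilisation can be checked with a short calculation. The function names and the sample speed-up-factors below are hypothetical; only the two formulas, U a as the sum of 1/η m and U = U a × ζ, come from the text.

```python
from typing import Sequence


def average_capacity(speedups: Sequence[float]) -> float:
    """U_a = 1/eta_1 + 1/eta_2 + ... + 1/eta_M."""
    return sum(1.0 / eta for eta in speedups)


def effective_utilisation(speedups: Sequence[float], zeta: float) -> float:
    """U = U_a * zeta, with zeta restricted to the range (0; 1]."""
    if not 0.0 < zeta <= 1.0:
        raise ValueError("zeta must lie in (0; 1]")
    return average_capacity(speedups) * zeta


# Hypothetical platform with three core types; the default core has eta = 1.
example_speedups = [1.0, 2.0, 4.0]
u_a = average_capacity(example_speedups)
u = effective_utilisation(example_speedups, zeta=0.5)
print(u_a, u)
```

For these sample values the average capacity is 1.75, and ζ = 0.5 yields an effective utilisation of 0.875.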
Beyond those initial settings, a two-level approach is used for generating a wide variety of different tasks and their subsequently varying jobs on all cores. Tasks are further annotated with a limit on the sporadic delay ∆. The second level varies the behaviour of the individual jobs of a task. The interested reader is referred to [16] for details. Each set point of the parameters is evaluated with 100 different task-sets.
The hardware parameters of the heterogeneous platform used in our experiments are shown in Table II. The power model for the default core in our experiments is modelled after the FreeScale PowerQUICC III Integrated Communication Processor MPC8536 [17]. The FreeScale PowerQUICC III core specifications are given in Table II under m = 5.
B. Results
The parameters described previously remain the same, except where explicitly specified. The state-of-the-art offers no algorithm with a comparable power model against which this work could be compared. Moreover, fundamental assumptions made in the state-of-the-art restrict its extension to the more realistic power model proposed in this paper. Therefore, we have implemented a worst-fit decreasing (WFD) and a first-fit (FF) algorithm as a baseline to compare against our algorithms. It has been shown by Aydin and Yang [18] that WFD performs better than other conventional bin-packing algorithms on homogeneous platforms. In our experiments, we observed that WFD performs worst on heterogeneous platforms. It was able to schedule only few task-sets at higher utilisations, making it hard to compare against our algorithms. Therefore, we use only the FF algorithm for the comparison. The WFD experiments are omitted in this paper, but the results are available in a technical report [19] for interested readers. Moreover, the FF algorithm allocates the tasks sorted with respect to their D i or T i, following the order from the slowest core type to the fastest core type. The results under the labels LLED-SP and MM-SP represent the second phase applied to the allocation of LLED and MM respectively. We have created 2 different scenarios. In the first scenario, we have modelled the system with very efficient sleep states having a low transition overhead (time and energy). The second scenario models the system with substantially less efficient sleep states. All results are normalised to the corresponding values of the FF algorithm.
1) First Scenario:
In this scenario, as the overheads of the sleep states are low, different cores can still achieve the most efficient sleep state even at high utilisation. This scenario does not leave much room for the second phase to save any additional energy when compared to LLED. Nevertheless, MM-SP saves energy over MM in some cases, but the saving is fairly minimal. Therefore, for this scenario, we compare the energy consumption of MM-SP and LLED-SP.
Firstly, the performance of LLED-SP and MM-SP is analysed for different numbers of core types. Figure 3 shows the normalised energy consumption of the system with 4 core types. The figure for 2 cores looks similar to Figure 3 but does not provide energy gains over FF that are as high, due to the limited scope for optimisation. Similarly, in the second case of 4 core types, the difference between LLED-SP and MM-SP initially increases but then starts to shrink towards the higher utilisations. This behaviour is expected, as LLED-SP and MM-SP have more chance at low utilisation to allocate tasks to their favourite core. However, towards high utilisations, this flexibility decreases along with their difference. In the best case, LLED-SP consumes 10% less energy when compared to FF, while MM-SP saves slightly under 10%. We evaluate the effect of variation in the characteristic factor β on the normalised total energy consumption of the system. β controls the variation of a task's dynamic power consumption from the average dynamic power consumption of the core. Figure 4 demonstrates that the energy consumption of both approaches decreases with an increase in the range of β. The developed power model on average favours the slow core. However, this factor (β) can change this behaviour. With β = 10%, a small portion of the tasks are more favourable to the fast cores. Hence, the FF algorithm, which fills the slowest core first, allocates only a few tasks to their unfavourable cores. Consequently, the gains of LLED-SP and MM-SP are smaller at β = 10%. However, as the β range increases, the probability that a task favours a fast core becomes higher. Therefore, LLED-SP and MM-SP give better allocations for higher values of β. Similar to the previous observation, the difference between MM-SP and LLED-SP is higher at low utilisation and decreases with an increase in the system utilisation. Figure 5 demonstrates the effect of task-set size variation on the given allocation mechanism.
In general a large task-set size increases the probability of the tasks to be allocated to their unfavourable core with F F . Therefore, energy consumption of the LLED-SP and M M -SP algorithms decreases with an increase in the task-set size. However, this saving reduces with an increase in the effective utilisation. In the beginning LLED-SP with the different task-set sizes do the same allocation but with an increase in effective system utilisation, the difference in allocation also increases. The same observations hold for the M M -SP as well. For small task-set size of 100, F F also performs well at low utilisation. However, this effect deteriorates with an increase in the effective system utilisation.
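The normalisation used throughout these figures can be sketched as follows. This is a minimal illustration, assuming that each algorithm's total energy is simply divided by the energy of the FF allocation for the same task-set; the function names are hypothetical and not taken from the simulator.

```python
def normalised_energy(energy, baseline_energy):
    """Energy of an allocation relative to the baseline (here assumed to be FF)."""
    if baseline_energy <= 0:
        raise ValueError("baseline energy must be positive")
    return energy / baseline_energy


def relative_saving(energy, baseline_energy):
    """Fraction of the baseline energy saved; 0.1 corresponds to a 10% saving."""
    return 1.0 - normalised_energy(energy, baseline_energy)


# Illustrative numbers only (not measured results): an allocation that
# consumes 90 units against an FF baseline of 100 units.
example_ratio = normalised_energy(90.0, 100.0)
example_saving = relative_saving(90.0, 100.0)
```

Under this reading, a normalised value below 1 means the algorithm consumes less energy than FF, and a value above 1 means FF is the better choice for that task-set.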
The processor types given in Table II have approximately the same ratio, $P_a^x / P_a^y \approx \zeta^y / \zeta^x$. We have also generated a case where this ratio is not the same and tasks always favour the same core, i.e. $P_a^x / P_a^y \neq \zeta^y / \zeta^x$. This case allows us to evaluate a system where all the tasks compete for the best core types. For this experiment, we modified the heterogeneous platform given in Table II. Figure 6 presents the results for this dissimilar platform. The energy consumption of LLED-SP and MM-SP is low at low utilisation and gradually increases towards high utilisation. All the algorithms attempt to allocate tasks in order from the slowest core to the fastest core. LLED-SP ranks the tasks efficiently and therefore saves more energy. MM-SP also performs better than FF, as it does some ranking of the tasks, whereas FF does not prioritise the tasks to account for global energy benefits.
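The slowest-core-first behaviour of the FF baseline can be sketched as below. This is a simplified illustration under stated assumptions: one core per core type, cores indexed from slowest (0) to fastest, a per-core utilisation bound of 1.0, and a hypothetical input format in which `task_utils[i][c]` is the utilisation of task `i` on core `c`. It does not include the task ranking performed by LLED or MM.

```python
def first_fit_slowest_first(task_utils, num_cores, capacity=1.0):
    """Assign each task to the first (slowest) core that still has room.

    task_utils[i][c] is the assumed utilisation of task i on core c.
    Returns a list with the chosen core per task, or None if a task fits nowhere.
    """
    load = [0.0] * num_cores
    assignment = []
    for utils in task_utils:
        for core in range(num_cores):
            # Small tolerance guards against floating-point rounding at the bound.
            if load[core] + utils[core] <= capacity + 1e-12:
                load[core] += utils[core]
                assignment.append(core)
                break
        else:
            return None  # no core can accommodate this task
    return assignment


# Hypothetical task-set on two cores: each task is lighter on the faster core.
tasks = [[0.6, 0.3], [0.5, 0.25], [0.4, 0.2]]
allocation = first_fit_slowest_first(tasks, num_cores=2)
```

In this example the second task no longer fits on the slow core and spills onto the fast one, while the third task still fits on the slow core. FF makes this choice without considering which core a task favours energetically.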
2) Scenario 2:
In this scenario, we model a system in which the core types have large sleep-transition overheads (time/energy). To generate such a model, we scaled the transition delays of all the sleep states by a factor of 12 and determined their BET accordingly. We observed an interesting result: assigning tasks to their favourite cores does not necessarily reduce the overall energy consumption of the system. In this scenario, the overall energy consumption depends mostly on the characteristics of the cores and less on those of the tasks. This is evident in the following experiments, in which we compare LLED, MM, LLED-SP and MM-SP. The baseline is still the corresponding energy consumption of FF. Furthermore, the range of ζ is increased to [0.4; 0.9] with a step size of 0.05 for this scenario.
Figure 7 shows the normalised total energy consumption of the system for 4 core types. At low utilisation, although LLED and MM have the opportunity to allocate tasks to their favourite cores, the resulting allocation is not energy efficient globally. The reason is that these algorithms do not account for the effect of their allocation on the sleep states of the cores. The FF algorithm, which is also a sleep-state-agnostic allocation mechanism, performs surprisingly well compared to LLED and MM: it fills the cores starting from the slowest one and leaves the fast cores with enough idle time to use their efficient sleep states. Our LLED-SP and MM-SP algorithms, however, compare well to FF at low utilisations and compensate for the unfavourable allocations made by LLED and MM, respectively. For low utilisations, LLED-SP and MM-SP achieve substantial gains. For high utilisations, the energy consumption of LLED and MM reduces when compared to FF.
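To illustrate why scaling the transition delays matters, the following sketch computes a break-even time using a common textbook formulation, not necessarily the exact definition used in this work. It assumes that the transition energy equals a constant transition power multiplied by the transition delay, and all parameter values are hypothetical.

```python
def break_even_time(p_idle, p_sleep, p_transition, t_transition):
    """Shortest idle interval for which sleeping costs no more than idling.

    Assumptions (not taken from the paper): the transition consumes
    p_transition for t_transition seconds, and the core draws p_sleep for
    the remainder of the idle interval.
    """
    if p_idle <= p_sleep:
        raise ValueError("sleep state must draw less power than the idle state")
    e_transition = p_transition * t_transition
    # Solve p_idle * T = e_transition + p_sleep * (T - t_transition) for T.
    t_energy = (e_transition - p_sleep * t_transition) / (p_idle - p_sleep)
    # The interval can never be shorter than the transition itself.
    return max(t_transition, t_energy)


# Hypothetical sleep state: 1.0 W idle, 0.1 W asleep, 2.0 W while transitioning.
bet_base = break_even_time(1.0, 0.1, 2.0, t_transition=0.001)
bet_scaled = break_even_time(1.0, 0.1, 2.0, t_transition=0.012)  # delay x 12
```

Under these assumptions the break-even time grows in proportion to the transition delay, so a core needs much longer idle intervals before a deep sleep state pays off. This is consistent with the observation that leaving some cores lightly loaded can outweigh placing every task on its favourite core.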
Hence, combining a first-phase allocation (LLED or MM) with the second phase is a good choice for most system utilisations, except for some corner cases (at a utilisation of 9.9 in Figure 7). In a detailed analysis of utilisations between 7.2 and 9, we observed that FF loses the efficient sleep states earlier than LLED-SP or MM-SP. Hence, the energy consumption of LLED-SP and MM-SP drops at U = 8.1 when compared to FF. Figure 7 also shows that LLED always outperforms MM and, similarly, that LLED-SP always outperforms MM-SP.
The effect of varying the characteristic factor β is shown in Figure 8 and Figure 9. Similar to the results in scenario 1 (Figure 4), the performance of LLED-SP and MM-SP in Figure 8 improves as the value of β increases, and the same trend is followed by LLED and MM in Figure 9. Figure 8 also shows that LLED-SP always dominates MM-SP, and the same is true in Figure 9 for LLED and MM. The effect of varying the task-set size is presented in Figure 10 and Figure 11. Unlike in Figure 5, in this scenario the task-set size makes no difference to the performance of any of the algorithms.
To evaluate the platform where all the tasks prefer the same core type, the setup of Figure 6 is adopted. The results of this experiment are shown in Figure 12. All the algorithms race to allocate tasks to the slowest core. Furthermore, LLED dominates MM, and towards high utilisations it even consumes less energy than MM-SP. Overall, LLED-SP performs best for all utilisations.
Figure 13 and Figure 14 present, respectively, the execution times of the second phase of allocation and the number of task migrations between different core types performed by it. To generate the results in Figure 13, we used a server with 8 Intel Xeon 1.60GHz processors and a memory size of 8GB. The allocation process of the second phase is very fast, even for a large task-set size of 500. Figure 14 shows that the number of migrations (and hence the execution time) decreases as the effective utilisation increases, since the tasks have less freedom to manoeuvre at high utilisations. Lightly loaded systems (U = 7.2) are an exception: their cores can use their more efficient sleep states anyway. Therefore, U = 7.2 requires fewer migrations (and less execution time) than U = 8.1.
VI. CONCLUSION
Heterogeneous multicore platforms are becoming popular in industry, and this trend demands advances in RT scheduling theory. We have explored the problem of task assignment with the objective of reducing the average-case energy consumption of the system while satisfying RT constraints. This research effort demonstrates the importance of a realistic power model and its effect on the overall energy consumption. In the future, we intend to relax the assumption of a single processing unit per processor type and to extend this work to allow job migration, in order to further reduce the energy consumption of the system.
