The process of thread-to-core mapping and setting DVFS levels plays a crucial role in exploiting the system properties such that applications can meet their, often diverse, demands on performance and energy consumption [3]. In general, for each application, the management process first finds a thread-to-core mapping, and then the core DVFS level by inspecting the workload profile while satisfying the performance requirement. This problem becomes much more complex when dynamically mapping concurrently executing applications due to contention for resources, and when the mapping is coupled with DVFS, i.e., energy-efficient allocation of processing cores and selection of DVFS settings [5], [6].

The reported approaches for solving this problem fall into three categories: 1) offline, 2) online, and 3) hybrid approaches. Several offline approaches have been proposed targeting different application domains and hardware architectures [7], [8]. These typically use computationally intensive search methods to find the optimal or near-optimal mapping for the applications that may run on the system. Conversely, online approaches [4], [9]-[11] must not be computationally intensive, as they are required to make efficient application mapping/DVFS decisions at runtime. Therefore, these techniques generally use heuristics to find a suitable platform configuration. Design-time approaches usually find solutions of higher quality compared to online techniques, due to extensive design space exploration of the underlying hardware and applications. To address the drawbacks of pure offline and online approaches, various hybrid approaches [8], [12]-[17] that use offline analysis to make runtime decisions based on the current state of the system have been proposed.

However, a review of the prior art (see Section VI) shows that the existing approaches targeting heterogeneous multi-cores have the following shortcomings. They use heavy application-dependent profile data and thus are not efficient in managing dynamic workloads when unknown applications with different performance constraints are executing concurrently. For example, the number of different frequency and core configurations for the Odroid-XU3 platform [1] (four big and four LITTLE cores that can operate at 19 and 13 different frequencies, respectively) is 4080 ((4×13×4×19) + (4×13) + (4×19)). Most importantly, none of these approaches performs adaptation (changing the mappings and/or DVFS settings) on application arrival/completion or on performance variations. To this end, this paper presents AdaMD, an adaptive mapping approach coupled with DVFS for performance-constrained multi-threaded applications executing on heterogeneous multi-cores. AdaMD selects a resource combination (number of cores and their type) that meets the application's performance requirement while minimising energy consumption. This is achieved by employing performance prediction [...]

[...] execution, its resources can be allocated to App2 and App3, which may help them execute faster (and hence put them into a low-power mode sooner), as shown in Fig. 1(b). This may result in increased performance and lower energy consumption, because power is dissipated for a shorter duration.
For case ii), as reported by previous work [5], [20], ap- [...] interference from other applications due to shared resources such as the last-level cache, memory, etc. All the factors above culminate in variation in an application's workload, subsequently leading to variation in application performance. Therefore, the application's performance has to be monitored periodically, and appropriate action (changing the DVFS setting or remapping) taken to avoid/minimize performance violations. Fig. 1(c) demonstrates such a case, where more resources are allocated to App2 to mitigate the performance degradation experienced during runtime. If there are no free cores available, as in our case, the cores are taken from the over-performing App3.

For case iii), considering the processing capabilities of the underlying hardware, the user may launch a new application while other applications are running. If all the processing cores have been allocated to the already running applications, the runtime management software should check whether the current mapping can be re-adjusted to allocate resources to the newly arrived application without violating performance constraints. This is shown in Fig. 1(d), where App4 is added to the system while App1, App2, and App3 are executing. The resources of the over-performing applications App1 and App3 are allocated to App4 while keeping the same number of cores for App2.

As discussed before, existing approaches do not consider the above execution scenarios (cases i, ii and iii) for adaptation; moreover, they depend on extensive offline characterisation and/or instrumentation of the chosen applications. As experimentally demonstrated in Section V, adaptation at application arrival and completion, and on workload/performance variations, leads to better utilisation of the system resources, higher energy savings and better performance.
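To make the size of this design space concrete, the configuration count quoted earlier for the Odroid-XU3 can be reproduced with a few lines of Python. This is only an illustrative sketch: the function name `count_configurations` and its parameters are hypothetical, and the counting rule (both clusters active, big only, or LITTLE only, each with a core count and one cluster-wide frequency) is inferred from the expression given in the text.

```python
def count_configurations(n_big, n_little, f_big, f_little):
    """Count (core count, frequency) settings over the three usage modes.

    Assumed rule: a cluster contributes n_cores * n_frequencies settings,
    and the platform can use both clusters, only big, or only LITTLE.
    """
    big = n_big * f_big            # (number of big cores, big frequency) pairs
    little = n_little * f_little   # (number of LITTLE cores, LITTLE frequency) pairs
    return big * little + big + little


# 19 big-cluster levels (0.2 to 2.0 GHz) and 13 LITTLE-cluster levels
# (0.2 to 1.4 GHz), both in 0.1 GHz steps, as in the platform description.
total = count_configurations(n_big=4, n_little=4, f_big=19, f_little=13)
print(total)  # 4080
```

The count grows multiplicatively with every additional cluster or frequency level, which is why exhaustive search at runtime quickly becomes impractical.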

III. PROBLEM FORMULATION
Earlier studies have shown that the thread-to-core mapping problem alone is NP-complete [3]. Therefore, combining it with DVFS would increase the complexity of the mapping problem due to the huge design space, thereby making runtime management significantly inefficient. Similarly, if the number of cores, the heterogeneity or the number of frequency levels increases, the design space becomes too large for solving at runtime and even for offline analysis [5]. To address this, as per the literature, we consider thread-to-core mapping and DVFS separately to [...] heterogeneous multi-core platform supporting DVFS.
Find an initial thread-to-core mapping for each application [...] and Performance Monitor (details are given in Section IV-B). The performance monitoring unit (PMU) of the processor is initialized to monitor the above parameters through the routine PMU_initialize() (line 1, Algorithm 1). Note that all the parameters are collected only when an application arrives in the system, and they are used by the Performance Predictor. When an application arrives, the Initial Mapper adds it to the application queue and allocates a free core to the application to start its execution (lines 3-9, Algorithm 1). As application execution begins with the serial section, the Initial Mapper tends to allocate a big core to the application. However, if an application's serial section is memory-intensive, as measured by MRPI, the application is migrated to a LITTLE core, as this results in greater power efficiency [21] (line 10, Algorithm 1). Data collection starts in the region of interest (ROI), which indicates the parallel code in the application, as that is when actual computation starts and the benefit of allocating more than one processing core can be seen [18]. This is accomplished by notifying the Runtime Data Collector through the ROI_starts() routine when the ROI of an application starts, which is identified by the hook __parsec_roi_begin() [18] (lines 12-15, Algorithm 1). If an application does not support such hooks, a handshaking mechanism can be used that informs the runtime manager when threads are spawned (e.g., on a call to pthread_create()). This can be implemented using existing inter-process communication methods (e.g., shared memory variables, message queues, etc.).
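The Initial Mapper's choice for the serial section can be sketched as a small decision function. This is a hedged illustration, not the authors' implementation: it assumes MRPI means memory reads per instruction, and the function names and the threshold value `mrpi_threshold` are hypothetical, since no threshold is given in the text.

```python
def mrpi(memory_reads, instructions_retired):
    """Memory reads per instruction (assumed meaning of MRPI)."""
    if instructions_retired <= 0:
        raise ValueError("instructions_retired must be positive")
    return memory_reads / instructions_retired


def initial_cluster(mrpi_value, mrpi_threshold=0.05):
    """Pick the cluster for the serial section.

    Default is a big core; a memory-intensive serial section (MRPI above
    the hypothetical threshold) is moved to a LITTLE core instead.
    """
    return "LITTLE" if mrpi_value > mrpi_threshold else "big"


compute_bound = initial_cluster(mrpi(5_000, 1_000_000))    # low MRPI
memory_bound = initial_cluster(mrpi(200_000, 1_000_000))   # high MRPI
print(compute_bound, memory_bound)  # big LITTLE
```

In a real runtime manager the two counters would come from the PMU, and the threshold would have to be calibrated for the platform.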
The runtime data for the ROI is collected every 50 ms for the first 500 ms, and the average values are fed into the performance predictor.

2) Performance Predictor: To allocate resources in a heterogeneous multi-core system to meet the performance requirements of an application, it is essential to know how the application performs on various types of cores [21]. This can be achieved either by executing the application on all [...] 3.75 ms to move from a big cluster to a LITTLE cluster. This overhead grows with the number of cores and types. Considering the runtime overheads and scalability, this is not an efficient approach. However, this approach would not need offline analysis as everything is measured at runtime. On the other hand, a performance prediction-based approach [...] which can easily be adapted to a new platform/architecture.
Performance models: Application performance is usually measured in terms of IPS or IPC, and the relative improvement in performance is referred to as speedup. We define speedup η as

$$\eta = \frac{IPC_{CoreType1}}{IPC_{CoreType2}}$$

where $IPC_{CoreType1}$ and $IPC_{CoreType2}$ are the IPC of the application achieved on core type-1 and core type-2, respectively.
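As a small worked example of the speedup definition, the sketch below derives IPC from two PMC readings (instructions retired and CPU cycles) and forms the ratio between two core types. It assumes the speedup is the IPC on core type-1 divided by the IPC on core type-2; the function names and the counter values are hypothetical.

```python
def ipc(instructions_retired, cpu_cycles):
    """Instructions per cycle from two PMC readings."""
    if cpu_cycles <= 0:
        raise ValueError("cpu_cycles must be positive")
    return instructions_retired / cpu_cycles


def speedup(ipc_type1, ipc_type2):
    """Speedup of core type-1 over core type-2 (assumed ratio direction)."""
    if ipc_type2 <= 0:
        raise ValueError("ipc_type2 must be positive")
    return ipc_type1 / ipc_type2


# Hypothetical counter values for the same code region on two core types.
ipc_big = ipc(3_000_000, 2_000_000)      # 1.5
ipc_little = ipc(1_500_000, 2_000_000)   # 0.75
eta = speedup(ipc_big, ipc_little)
print(eta)  # 2.0
```

A value above 1 means the application benefits from the first core type; a value below 1 suggests it does not.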
The performance model estimates the speedup, which is used [...]

To build the performance models, three steps are followed. The first step is identifying the parameters/metrics that cap- [...]

Algorithm 1 (excerpt):
        for ∀i ∈ Apps do
     8:   Allocate a free core 'l' to 'i' and execute;
    10:   Measure MRPI and move onto an appropriate core (j);
    11:   /* Data collection for performance model */
    12:   Wait until ROI begins;
    13:   [...]
          pmcs.push_back(f);
    16:   η = speedup_estimate(pmcs, j);
    17:   Compute possible resource combinations and resource combination with minimum energy t_h (Eq. (4), (5) [...]
          [...]
          Increase the resources of app i ∈ list by y;
          [...]
          Distribute freed resources of 'p' to under-performing apps;
    38:   Allocate remaining resources to apps equally by sorting them based on η;

[...] (X) that relates X to y, such that the expected value ($E_{y,X}$) of some specified error function ψ(y, f(X)) is minimized. In general, boosting approximates f(X) by an additive expansion [...] shown below:

[...]
Here, the base learner functions h(X; β) are simple functions [...]

[Table: resource combinations of LITTLE (L) and big (B) cores: 3L, 2L+1B, 1L+2B, 0L+3B; 4L, 3L+1B, 2L+2B, 1L+3B, 0L+4B; 4L+1B, 3L+2B, 2L+3B, 1L+4B]

[...] energy consumption (line 17, Algorithm 1). Let R be the set of possible resource combinations on a platform, and let $Perf_{App_i}$ be the performance constraint for an application $App_i$; then the performance-meeting thread-to-core mappings ($T_{map_i}$) can be defined as follows:

$$T_{map_i} = \{\, r \in R \mid perf(r) \geq Perf_{App_i} \,\} \quad (4)$$

Here, perf(r) defines the performance of an application when executed on the resource combination r. For simplicity, let us take our chosen platform, the Odroid-XU3, with two types of cores: big (B) and LITTLE (L); $N_b$ and $N_l$ are the sets of big and LITTLE cores, respectively. Then, perf(r) is computed as:

[...]

where η = $IPC_b / IPC_l$, and the performance on the big and LITTLE core is denoted by $IPC_b$ and $IPC_l$, respectively. Furthermore, $n_b \in N_b$, $n_l \in N_l$ and $r = n_l \cup n_b$. $IPC_o$ is the performance overhead incurred when an application is mapped onto cores that do not share a cache. For instance, the big and LITTLE clusters in the Odroid-XU3 do not share caches, which results in an inter-cluster communication overhead when the threads of an application run on both the big and LITTLE clusters. As shown in Equation 5, for our chosen platform with eight cores, near-linear speedup is expected with an increase in the number of cores [29]. Even if there is an error in estimation, it is compensated by the performance monitor (Section IV-B2).
4) Resource Selector: The job of the resource selector is to minimize the energy consumption by selecting a resource combination with minimum energy from the performance-meeting thread-to-core mappings $T_{map_i}$ = {3L, 4L, 1L + 1B, ...}, where L and B refer to LITTLE and big cores, respectively. This can be achieved by selecting a thread-to-core mapping $t_h \in T_{map_i}$ that has the highest performance per watt (PPW) (line 17, Algorithm 1):

$$t_h = \arg\max_{t \in T_{map_i}} PPW(t) \quad (6)$$

where PPW(t) is computed as the ratio between the IPC achieved for the resource combination $t \in T_{map_i}$ and its power consumption. This requires measuring the power consumption using on-chip power sensors or employing a power model when a platform does not have power sensors [30]. However, the power model would also require the collection of various PMC data at regular intervals of time, and its PMCs [...]

[...] between the minimum number of big cores to the minimum number of LITTLE cores ($C_r$) is higher/close to the speedup.
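The enumerate, filter and select steps described for the resource enumerator and Resource Selector can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the form of `perf` (near-linear scaling, one big core worth η LITTLE cores, minus an inter-cluster overhead when both clusters are used) is an assumed stand-in for Eq. 5, and the per-core power figures in `example_power` are hypothetical.

```python
from itertools import product


def enumerate_combinations(n_little=4, n_big=4):
    """All (little, big) core counts with at least one core in use."""
    return [(n_l, n_b)
            for n_l, n_b in product(range(n_little + 1), range(n_big + 1))
            if n_l + n_b > 0]


def perf(combo, ipc_little, eta, ipc_overhead=0.0):
    """ASSUMED performance model standing in for Eq. 5."""
    n_l, n_b = combo
    total = ipc_little * (n_l + eta * n_b)
    if n_l > 0 and n_b > 0:
        total -= ipc_overhead  # threads span clusters that share no cache
    return total


def select_mapping(perf_constraint, ipc_little, eta, power_of, ipc_overhead=0.0):
    """Keep combinations meeting the constraint, then take the highest PPW."""
    feasible = [r for r in enumerate_combinations()
                if perf(r, ipc_little, eta, ipc_overhead) >= perf_constraint]
    if not feasible:
        return None
    return max(feasible,
               key=lambda r: perf(r, ipc_little, eta, ipc_overhead) / power_of(r))


def example_power(combo):
    """Hypothetical power in watts: 0.25 W per LITTLE core, 1.0 W per big core."""
    n_l, n_b = combo
    return 0.25 * n_l + 1.0 * n_b


best = select_mapping(perf_constraint=2.0, ipc_little=0.5, eta=2.0,
                      power_of=example_power, ipc_overhead=0.1)
print(best)  # (4, 0), i.e. 4L
```

With these illustrative numbers the four-LITTLE-core mapping wins because it meets the constraint at the lowest power; on real hardware the power term would come from sensors or a power model.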
As a big core can execute η times faster than a LITTLE core, [...]

[...] is done by first creating a sorted list of active applications in descending order of their speedup. Then, the application i at the top of the list is selected, and its allocated cores are increased by one. This process is repeated for the remaining applications in the list until no free cores are left (lines 22-27, Algorithm 1). Note that applications with η < 1 in the list are given only LITTLE cores, as they do not benefit from big cores in terms of energy efficiency.
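The distribution of free cores described above can be sketched as a round-robin pass over the applications sorted by speedup. This is an illustrative reading of the text, with assumptions labelled: the function name and data layout are hypothetical, and the choice to offer a big core first to applications with η ≥ 1 (falling back to a LITTLE core) is an assumption, since the text only states that applications with η < 1 receive LITTLE cores.

```python
def distribute_free_cores(speedups, free_big, free_little):
    """Hand out free cores one at a time in descending order of speedup.

    speedups: dict mapping application name -> eta.
    Returns a dict mapping application name -> extra cores per type.
    """
    extra = {name: {"big": 0, "LITTLE": 0} for name in speedups}
    order = sorted(speedups, key=speedups.get, reverse=True)
    progress = True
    while free_big + free_little > 0 and progress:
        progress = False  # stop if a full pass allocates nothing
        for name in order:
            if speedups[name] >= 1 and free_big > 0:
                extra[name]["big"] += 1      # assumption: big first for eta >= 1
                free_big -= 1
                progress = True
            elif free_little > 0:
                extra[name]["LITTLE"] += 1   # eta < 1 never receives a big core
                free_little -= 1
                progress = True
            if free_big + free_little == 0:
                break
    return extra


result = distribute_free_cores({"App1": 2.1, "App2": 0.8, "App3": 1.4},
                               free_big=1, free_little=2)
print(result)
```

Here App1 receives the only free big core, and App3 and App2 each receive one LITTLE core. The `progress` flag guards against looping forever when only big cores remain and every application has η < 1.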
The [...] If an application is under-performing, it then computes the amount of performance loss (the difference between the achieved performance and the given performance constraint), and then estimates the resources required to compensate for it using Eq. 5. If any resources remain after allocating the freed resources to under-performing applications, these resources are distributed among the applications as described in the previous paragraph. As discussed in Section IV-B2 and IV-B3, application performance/workload adaptation is also performed to avoid performance violations, as an application may experience contention from other applications or its workload may change over time. This may occur at any time during application execution. Therefore, to increase resource utilisation, free cores are distributed to active applications first. Furthermore, when a new application arrives in the system, the resource reallocator tries to identify and allocate the resources as per $t_h$ (Eq. 6). This is done by checking whether there are enough free resources available in the platform to satisfy the application requirements. If free resources are not available for meeting the performance constraints, the extra cores of over-performing applications are used. If the application requirements are still not met after this, application execution continues using the available resources until a running application completes and releases its allocated resources.

2) Performance Monitor: Applications usually exhibit varying workload profiles (e.g., compute-intensive to memory-intensive and vice versa) during execution. When multiple applications are executing simultaneously, the workload profile of each application is affected by contention on shared resources [20]. As a result, application performance will vary over time, which may lead to the violation of performance constraints. To address this, each application's performance is periodically monitored to detect and compensate for violations of the performance constraint (lines 30-34, Algorithm 1). An application's performance is measured by collecting PMCs corresponding to instructions retired and CPU cycles on all the cores that the application is currently running on. When an application's performance constraint is violated, either the operating frequency is increased, or more cores are allocated. Raising the operating frequency is given priority over assigning more cores as the latter incurs a migration [...]

Algorithm listing (excerpt):
          Measure MRPI and utilisation of each core i ∈ j;
    18:   Compute the minimum MRPI (mrpi_a) and utilisation (util_a);
          [...]

This allocation is done by computing the performance loss and the corresponding required cores using Eq. 5. As already explained in Section IV-B1, for applications with η < 1, LITTLE cores are preferred over big cores.
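The Performance Monitor's compensation policy (frequency first, extra cores second) can be sketched as a single decision function. This is a hedged illustration only: raising the frequency by exactly one level, preferring big cores when η ≥ 1, and the per-core contribution used to size the request are all assumptions standing in for details (including Eq. 5) that are not recoverable from this section.

```python
import math


def compensate(achieved_ipc, required_ipc, freq_index, max_freq_index,
               ipc_little, eta):
    """Return (action, value) for one monitoring period.

    action is "none", "raise_frequency", "add_big" or "add_LITTLE";
    value is the new frequency index or the number of cores to add.
    """
    loss = required_ipc - achieved_ipc
    if loss <= 0:
        return ("none", 0)  # constraint met, nothing to do
    if freq_index < max_freq_index:
        # Preferred: no thread migration is needed to change frequency.
        return ("raise_frequency", freq_index + 1)
    if eta >= 1:
        per_core = ipc_little * eta   # assumed contribution of one big core
        core_type = "big"
    else:
        per_core = ipc_little         # eta < 1: LITTLE cores are preferred
        core_type = "LITTLE"
    return ("add_" + core_type, math.ceil(loss / per_core))


print(compensate(1.0, 2.0, 5, 18, 0.5, 2.0))    # ('raise_frequency', 6)
print(compensate(1.0, 2.0, 18, 18, 0.5, 2.0))   # ('add_big', 1)
print(compensate(1.0, 2.0, 12, 12, 0.5, 0.8))   # ('add_LITTLE', 2)
```

Keeping the decision in one pure function makes the policy easy to test separately from the code that reads PMCs and applies affinity or frequency changes.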
V. EXPERIMENTAL RESULTS
This section presents the details of the experimental setup, covering the platform, benchmark applications and reported approaches considered for the comparison. Furthermore, an evaluation of the performance prediction models and the benefits of the AdaMD approach over previous approaches are discussed, including the associated overheads.

A. Experimental Setup

[...] This has four ARM Cortex-A15 (big) cores and four ARM Cortex-A7 (LITTLE) cores. The platform supports per-cluster DVFS, i.e., all cores within a cluster run at the same DVFS level. The big cores have a range of frequencies between 0.2 GHz and 2.0 GHz with a 0.1 GHz step, whereas the LITTLE cores can vary their frequencies from 0.2 GHz to 1.4 GHz in steps of 0.1 GHz. The device firmware automatically adjusts the voltage for a selected frequency. The platform also contains four real-time current sensors that facilitate measurement of the power consumption of each CPU cluster, the GPU and memory. We used Ubuntu OS with kernel version 3.10.96. Energy consumption is computed as the product of the average power consumption (dynamic and static) and the application execution time. This includes both the core and memory energy consumption of all the software components, including our implementation, OS, applications and other background processes.
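The energy figure described above (average power multiplied by execution time) can be computed as in the short sketch below. The function name and the sample values are hypothetical, and the sketch assumes the power samples are taken at a uniform interval so that a plain mean is a fair average.

```python
def energy_joules(power_samples_w, execution_time_s):
    """Energy as average power (W) times execution time (s)."""
    if not power_samples_w:
        raise ValueError("at least one power sample is required")
    average_power = sum(power_samples_w) / len(power_samples_w)
    return average_power * execution_time_s


# Hypothetical sensor readings in watts over a 10-second run.
energy = energy_joules([2.0, 3.0, 4.0], execution_time_s=10.0)
print(energy)  # 30.0
```

To cover both core and memory energy, the samples would be the sum of the relevant sensor rails at each sampling instant.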
Implementation: The proposed AdaMD approach is implemented as a user-space application using the Perfmon2 [34] and cpufrequtils frameworks. Perfmon2 enables user-space access to the performance monitoring unit (PMU), and cpufrequtils helps in setting/getting the operating frequencies. The standard Linux API (sched_setaffinity(2)) is used to control the CPU affinity of processes, i.e., to bind the applications to specific cores. The thread-to-core mapping algorithm operates at a coarser granularity (500 ms) considering its higher migration overhead. As the workload of an application changes randomly, the DVFS governor is operated at a finer granularity of 100 ms to capitalize on these changes for energy savings.

[...] and performance for individual applications to decide on an energy-efficient mapping when multiple applications are run concurrently. Furthermore, it also applies workload classification-based DVFS periodically to minimize the power consumption.
B. Evaluation of Performance Predictor
The performance prediction model estimates the performance of the big core given the performance of a LITTLE core ($P_{bl}$) and vice versa ($P_{lb}$). The number of base learners (decision stumps) M in Eq. 3 impacts the model accuracy and runtime overhead. We tested our model over 148 distinct samples to evaluate the model accuracy in IPC estimation; the corresponding box plots of the percentage error distribution for $P_{bl}$ and $P_{lb}$ are given in Figures 4a and 4b, respectively. As shown, the error range gets narrower as the number of decision stumps increases, as this helps in better predicting the speedup. Furthermore, increasing the number of decision stumps also reduces the outliers, shown as crosses in Figures 4a and 4b, improving model stability. However, choosing more decision stumps could increase the runtime overhead, and the prediction accuracy may not improve beyond a certain number of decision stumps. Therefore, to balance this, we built additive regression models for different numbers of decision stumps. It can be seen from Fig. 4a and 4b that the improvement in model accuracy is negligible after 900 and 1100 decision stumps for $P_{bl}$ and $P_{lb}$, respectively. Therefore, we have chosen these numbers for our models $P_{bl}$ (mean absolute percentage error (MAPE) = 1.57%; maximum error (ME) = 8.1%) and $P_{lb}$ (MAPE = 3.45%; ME = 8.5%). The maximum error of $P_{bl}$ and $P_{lb}$ is about 7.9% and 5% lower compared to the previous model [17], respectively. The prediction accuracy (MAPE) of $P_{lb}$ is 1.88% worse than that of $P_{bl}$, and $P_{lb}$ requires 200 extra decision stumps. This is because the LITTLE cores support accessing only four PMCs simultaneously, compared to six PMCs supported by the big cores.

C. Comparison of Energy Consumption

[...] abbreviated as DA Apps in Fig. 6, are from those mentioned in Section V-A.

[...] the Odroid-XU3 (4L+4B) and extrapolating for the considered application execution scenarios. We used linear extrapolation that takes into account the runtime overheads associated with each application, as these vary depending upon workload characteristics (e.g., frequent workload variations may incur DVFS transition latencies/thread migration overheads). As can be seen, AdaMD is able to adapt to the increased design space and achieve energy savings. The increase in energy savings is mainly due to the proposed DVFS, which exploits the synchronisation overheads and workload variations to lower the power consumption of a larger number of active cores.
D. Performance
The proposed approach outperforms all reported approaches in meeting application performance constraints, as shown in Fig. 9. We evaluated the percentage of performance constraint misses for all the application scenarios presented in [...]

[...] where,

$$T_{DVFS} = T_{pmc\_vf} + T_{metrics} + T_{wp} + T_{classify} + T_{vfs} \quad (9)$$

where $T_{pmc\_m}$, $T_{pm}$, $T_{t_h}$, $T_{rar}$, $T_{pmc\_vf}$, $T_{metrics}$, $T_{wp}$, $T_{classify}$, and $T_{vfs}$ represent the time taken for PMC data col- [...]

[...] operates at a finer granularity of 100 ms compared to the mapping time interval of 500 ms.

We observed an average runtime overhead of 600 µs and 1.4 ms for A to D when executed at 2 GHz and 1 GHz on a big core of the Odroid-XU3, respectively. The DVFS part of step E incurs 320 µs and the other parts take up to 15 µs when the overhead is measured at the maximum frequency (2 GHz). The DVFS algorithm operates at a granularity of 100 ms, so the overhead is less than 0.5%. The Performance and Resource manager part of E is invoked every 500 ms. The overhead associated with this part depends on the number of times the application misses its performance constraint and on thread migrations across the cores. Here, we observed an overhead between 0.15% and 0.75%. Our results show that the total runtime overhead is minimal; moreover, it has been included when computing energy consumption and performance.
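The "less than 0.5%" figure for the DVFS step follows directly from the quoted timings, as the short calculation below shows. The function name is hypothetical; the inputs (320 µs plus up to 15 µs per 100 ms period) are taken from the paragraph above.

```python
def overhead_percent(overhead_us, period_ms):
    """Overhead of a periodic task as a percentage of its period."""
    return 100.0 * overhead_us / (period_ms * 1000.0)


# DVFS part (320 us) plus the other parts (up to 15 us) every 100 ms.
dvfs_overhead = overhead_percent(320 + 15, period_ms=100)
print(round(dvfs_overhead, 3))  # 0.335
```

The same helper applies to any periodic component of the runtime manager once its per-invocation cost has been measured.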
VI. RELATED WORK
To achieve energy savings and/or to meet performance constraints in multi-core platforms, various approaches for DVFS and/or task mapping have been proposed [3]-[17], [20], [31], [36]-[41]. These works perform offline, online or hybrid (offline & online) optimization for resource management.

Approaches based on offline optimization utilize extensive design space exploration of the underlying hardware and target application(s). The techniques proposed in [7], [40] are used for DVFS and/or task mapping. However, they consider execution of a single application at a time, and thus are not suitable for the concurrent execution of applications. The approach presented in [40] generates multiple mappings for each application offering a trade-off between resource requirements and throughput, while Quan and Pimentel [8] proposed scenario-based online mapping approaches targeting homogeneous multi-core platforms in which mappings derived from design-time DSE are stored for runtime mapping decisions. Evidently, these techniques consume more time, and cannot cope with dynamic application behavior, especially when multiple applications are run concurrently.

To adapt to dynamic application workloads, pure online optimization-based approaches, which perform all processing at runtime, have also been investigated [4], [9]-[11]. In [4], an online reinforcement learning based adaptive DVFS approach targeting frame-based applications is presented to improve energy efficiency. In [9], an online spatial mapping technique to map streaming applications onto a multi-core system is discussed. Brião et al. [10] present dynamic task allocation strategies based on bin-packing algorithms for soft real-time applications. An online task allocator using the adaptive task allocation algorithm and a clustering approach for minimizing the communication load is described in [11]. All of these approaches can handle applications that are unknown until runtime, but lead to inefficient results as optimization decisions need to be taken quickly without offline analysis results [3].
VII. CONCLUSIONS
The increasing demand for performance and energy efficiency has forced mobile systems to employ heterogeneous multiprocessor system-on-chips. These systems offer a diverse set of core and frequency configurations to runtime management systems for online tuning. This paper has presented an adaptive thread-to-core mapping and DVFS technique, called AdaMD, for choosing a configuration for each performance-constrained application that minimises energy consumption. By using runtime information while applications are executing and eliminating the need for application-dependent offline results, AdaMD is capable of managing even unknown applications efficiently. The proposed algorithm first selects a resource combination (number of cores and their type) that meets the application performance requirement using an accurate performance prediction model and a resource enumerator/selector. It then monitors application performance, workload and status (finished or newly arrived) for tuning voltage-frequency settings and adjusting thread-to-core mappings. Our experiments show an improvement of up to 28% in energy consumption compared to the most promising existing approaches. The proposed approach also outperforms previous approaches in meeting application performance constraints. Our future work includes validation with a larger number of cores and core types with different ISAs (e.g., CPU, GPU, etc.) to show the scalability and adaptability of the approach.
ACKNOWLEDGEMENT
This work was supported in part by the EPSRC Grant EP/L000563/1 and the PRiME Programme Grant EP/K034448/1 (www.prime-project.org). Experimental data used in this paper can be found at https://doi.org/10.5258/SOTON/D1041.
