Abstract. This paper introduces a reinforcement-learning-based resource allocation framework for the dynamic placement of threads of parallel applications on Non-Uniform Memory Access (NUMA) many-core systems. We propose a two-level learning-based decision-making process, where at the first level each thread independently decides on which group of cores (NUMA node) it will execute, and at the second level it decides to which particular core of that group it will be pinned. Additionally, a novel performance-based learning dynamics is introduced to handle measurement noise and rapid variations in the performance of the threads. Experiments on a 24-core system show an improvement of up to 16% in the execution time of parallel applications under our framework, compared to the Linux operating system scheduler.
Introduction
Resource allocation has become an indispensable part of the design of any engineering system that consumes resources, such as electric power in home energy management [1], access bandwidth and battery life in wireless communications [8], and computing bandwidth under different settings [2, 3, 5]. When resource allocation is performed online and the number, arrival and departure times of the tasks are not known a priori (as in the case of CPU bandwidth allocation), the role of a resource manager (RM) is to guarantee an efficient operation of all tasks by appropriately distributing resources. Guaranteeing efficiency in this way typically requires the formulation of a centralized optimization problem (e.g., mixed-integer linear programming formulations [2]), which in turn requires information about the specifics of each task (i.e., application details). Such information may be available to neither the RM nor the task itself. Given the difficulties involved in the formulation of centralized optimization problems in resource allocation, not to mention their computational complexity, feedback from the running tasks in the form of performance measurements may provide valuable information for the establishment of efficient allocations. Such (feedback-based) techniques have recently been considered in several scientific domains [4, 5].
This paper proposes a distributed learning scheme specifically tailored to the problem of dynamically assigning/pinning threads of a parallelized application to the available processing units. The proposed scheme is flexible enough to incorporate alternative optimization criteria. In particular, we demonstrate its utility in maximizing the average processing speed of the overall application, which under certain conditions also implies a shorter completion time. The proposed scheme also reduces the computational complexity usually encountered in centralized optimization problems, while providing an adaptive response to the variability of the provided resources. This paper extends prior work of the authors [6, 12]. Compared to that work, this paper makes the following research contributions:
(C1) We propose a novel two-level scheduling process that is more appropriate for Non-Uniform Memory Access (NUMA) architectures. At the higher level, the scheduler decides to which NUMA node each thread should be assigned, while at the lower level the scheduler decides on which CPU core (within that NUMA node) to execute the thread.
(C2) We propose a novel learning dynamics, motivated by aspiration learning, for making decisions at the higher level.
(C3) We demonstrate the efficiency of the proposed approach on an application borrowed from the evolutionary computing domain.
The paper is organized as follows. Section 2 discusses related work. Section 3 describes the problem formulation and contributions of the paper. Section 4 presents the main features of the proposed Dynamic Scheduler. Section 5 presents experiments of the proposed resource manager in a many-core Linux platform and comparison tests with the operating system's response. Finally, Section 6 presents concluding remarks and future work.
Related work
Prior work has demonstrated the importance of thread-to-core bindings in the overall performance of a parallelized application. For example, [9] describes a tool that checks the performance of each of the available thread-to-core bindings and searches for an optimal placement. Unfortunately, the exhaustive-search type of optimization that is implemented may prohibit runtime implementation. Borguedis et al. [4] combine the problem of thread scheduling with scheduling hints related to thread-memory affinity issues. A similar scheduling policy is also implemented by [11] .
The resource allocation problem that we are considering can be seen as a centralized optimization problem, where $n$ threads need to be mapped to $m$ processing elements in an optimal way. To reduce this potentially very large search space ($m^n$ possible different allocations), distributed or game-theoretic optimizations have been attempted in the past for related problems, including a cooperative game formulation for allocating bandwidth in grid computing [13], and non-cooperative game formulations for medium access protocols in communications [14] and for allocating resources in cloud computing [15]. These approaches significantly reduce the computational complexity and also allow for the development of online selection rules, where tasks/agents make decisions often using current observations of their own performance.
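To give a sense of how quickly the centralized search space grows, the following short calculation evaluates the number of assignment profiles for a given number of threads and cores. The function name and the example values (16 threads, 28 cores) are illustrative assumptions and are not taken from the experiments.

```python
def num_allocations(n_threads: int, n_cores: int) -> int:
    """Number of distinct thread-to-core assignment profiles.

    Each of the n threads can be placed on any of the m cores
    independently, which gives m ** n profiles in total.
    """
    if n_threads < 0 or n_cores < 0:
        raise ValueError("thread and core counts must be non-negative")
    return n_cores ** n_threads


if __name__ == "__main__":
    # Hypothetical configuration: 16 threads on a 28-core machine.
    print(num_allocations(16, 28))
```

Even for such a moderate configuration the count is far beyond what an exhaustive search could evaluate at runtime, which is the motivation for distributed, per-thread decision rules.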
Contrary to this line of research, our recent work [6, 12] proposed a dynamic (learning-based) scheme for optimally allocating threads of a parallelized application to a set of available CPU cores. It requires minimal information exchange, namely only the performance measurements collected from each running thread, and its complexity is linear in the number of running threads. Furthermore, it is flexible enough to accommodate alternative optimization criteria depending on the available performance counters (e.g., average processing speed). It was shown both analytically and through experiments under the Linux operating system that the proposed methodology learns a locally optimal allocation, which under certain conditions also corresponds to the global optimum. However, one potential drawback was that no special consideration was given to possible non-uniform memory access (NUMA) architectures: the scheme did not distinguish between moving a thread to a "local" core (within the same NUMA node) and to a "remote" core (on a different NUMA node).
Problem Formulation and Objective
Let a parallel application comprise $n$ threads, $I = \{1, 2, ..., n\}$, that need to be pinned to a set $J$ of available CPU cores, $J = \{1, 2, ..., m\}$. We denote the assignment of a thread $i$ to the set of available CPU cores by $\alpha_i \in A_i \equiv J$, i.e., $\alpha_i$ denotes the CPU core to which the thread is assigned. Let also $\alpha = \{\alpha_i, i \in I\}$ denote the assignment profile.
The Resource Manager (RM) is responsible for the assignment of threads to CPU cores. It periodically checks the performance of a thread and makes decisions about its placement in the next scheduling iteration, so that a (user-specified) objective is maximized. For the remainder of the paper, we will assume that: a) the internal properties and details of the threads are not known to the RM; instead, the RM may only have access to measurements related to their performance (e.g., their processing speed); b) threads may not be idled or postponed; instead, the goal of the RM is to assign the currently available resources to the currently running threads.
Static optimization and issues. Let $v_i = v_i(\alpha, w)$ denote the processing speed of thread $i$, which depends on both the assignment profile $\alpha$ and the exogenous parameters aggregated within $w$, which summarizes, for example, the impact of other applications running on the same platform (disturbances). The centralized objective that we consider then takes the following form:
$$\max_{\alpha} f(\alpha, w). \tag{1}$$
In this paper, the centralized objective corresponds to the average processing speed of the running threads, i.e.,
$$f(\alpha, w) \doteq \frac{1}{n} \sum_{i=1}^{n} v_i(\alpha, w). \tag{2}$$
Any solution to the optimization problem (1) corresponds to an efficient assignment. However, there are two practical issues when posing an optimization problem in this form: a) the function $v_i(\alpha, w)$ is unknown and may only be evaluated through measurements of the processing speed, denoted $\tilde{v}_i$; and b) the exogenous disturbances $w = (w_1, ..., w_m)$ are unknown and may vary with time, thus the optimal assignment may not be fixed with time.
Measurement- or learning-based optimization. We wish to address a static objective of the form (1) through a measurement- or learning-based optimization approach. According to such an approach, the RM reacts to measurements of the objective function $f(\alpha, w)$, periodically collected at time instances $k = 1, 2, ...$ and denoted by $\tilde{f}(k)$. For example, in the case of objective (2), the measured objective takes on the form $\tilde{f}(k) \doteq \sum_{i=1}^{n} \tilde{v}_i(k)/n$. Given these measurements and the current assignment $\alpha(k)$ of resources, the RM selects the next assignment of resources $\alpha(k+1)$ so that the measured objective approaches the true optimum of the unknown performance function $f(\alpha, w)$. In other words, the RM employs an update rule of the form:
$$\{(\tilde{v}_i(1), \alpha_i(1)), \ldots, (\tilde{v}_i(k), \alpha_i(k))\}_{i \in I} \mapsto \{\alpha_i(k+1)\}_{i \in I}, \tag{3}$$
according to which prior pairs of measurements and assignments for each thread $i$ are mapped into a new assignment $\alpha_i(k+1)$ that will be employed during the next evaluation interval.
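As a small illustration of the measured objective in the case of objective (2), the sketch below averages the per-thread speed measurements collected in one evaluation interval. The function name `measured_objective` and the list-based input are assumptions made for this example only.

```python
from typing import Sequence


def measured_objective(speeds: Sequence[float]) -> float:
    """Average measured processing speed over all running threads.

    `speeds` holds one measurement per thread for the current
    evaluation interval k; the result is sum(speeds) / n.
    """
    if not speeds:
        raise ValueError("at least one thread measurement is required")
    return sum(speeds) / len(speeds)


if __name__ == "__main__":
    # Hypothetical measurements (e.g., instructions per second) for 4 threads.
    print(measured_objective([2.0e9, 1.5e9, 1.8e9, 2.1e9]))
```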
Multi-level decision-making and actuation. Recent work by the authors [6, 12] has demonstrated the potential of learning-based optimization of the CPU affinities. However, when an application runs on a Non-Uniform Memory Access (NUMA) machine, additional information can be exploited to enhance the scheduling of a parallelized application. To this end, a multi-level decision-making and actuation process is considered. We extend the PaRLSched dynamic scheduler presented in [6, 12] by introducing two nested decision processes, depicted in Figure 1. At the higher level, the performance of a thread is evaluated with respect to its own prior history of performances, and decisions are taken with respect to its NUMA placement (possibly involving memory affinities). At the lower level, the performance of a thread is evaluated with respect to its own prior history of performances, and decisions are taken with respect to its CPU placement (within the selected NUMA node).
Fig. 1. Schematic of a multi-layer dynamic resource allocation framework.
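The nesting of the two decisions can be sketched as follows. This is only an illustration of the control flow: the topology dictionary, the function name `choose_placement`, and the use of `random.choice` as a placeholder for the learning rules are all assumptions, and nodes are numbered from 0 here.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical topology: NUMA node id -> CPU cores that belong to it.
TOPOLOGY: Dict[int, List[int]] = {
    0: list(range(0, 14)),
    1: list(range(14, 28)),
}

Selector = Callable[[Sequence[int]], int]


def choose_placement(
    topology: Dict[int, List[int]],
    pick_node: Selector,
    pick_core: Selector,
) -> Tuple[int, int]:
    """Two-level decision: first a NUMA node, then a core inside that node."""
    node = pick_node(sorted(topology))   # higher level: NUMA placement
    core = pick_core(topology[node])     # lower level: CPU placement
    return node, core


if __name__ == "__main__":
    # Placeholder selectors; a real scheduler would plug in learning rules.
    print(choose_placement(TOPOLOGY, random.choice, random.choice))
```

Because the lower-level selector only ever sees the cores of the node chosen at the higher level, a core decision can never move a thread across NUMA nodes by accident.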
Dynamic Scheduler
In this section, we provide a detailed description of the new features of the updated PaRLSched Dynamic Scheduler.
Advancement of architecture According to the new architecture, the user may define the resources to be optimized as well as the corresponding methods used for establishing predictions and for computing optimal allocations. In particular, the initialization of the scheduler accepts the following parameters.
In the above example, we have defined two distinct resources to be optimized, namely "NUMA_BANDWIDTH", which refers to the placement of threads to specific NUMA nodes, and "NUMA_MEMORY", which refers to the placement/binding of the thread's memory pages into specific NUMA nodes. For each one of the resources to be optimized, there might be alternative optimization criteria, which summarize our objectives for guiding the placements. The selection of the optimization criteria is open-ended and directly depends upon the available performance metrics. In parallel to the selection of the optimized resources, we need to also define the corresponding "methods" for establishing predictions (which are used under the estimate() part of the Dynamic Scheduler, Figure 1 ). Similarly, we may also define the corresponding "methods" for the computation of the next placements (which are used under the optimize() part of the Dynamic Scheduler, Figure 1 ). For example, we may use the Reinforcement-Learning (RL) selection criterion (cf., [6] ) for both estimation and optimization, or the Aspiration-Learning (AL) criterion for optimization (described in a forthcoming section).
Hierarchical structure. The Dynamic Scheduler has been redesigned so that it accepts a nested description of resources, as is natural in NUMA architectures. The depth of such a description is not limited, although in the current implementation we have been experimenting with a single type of child resource. An example of how child resources can be defined is as follows.
CHILD_RESOURCES={"CPU_BANDWIDTH", "NULL"}
CHILD_OPT_CRITERIA={"PROCESSING_SPEED", "PROCESSING_SPEED"}
CHILD_RESOURCES_EST_METHODS={"RL", "RL"}
CHILD_RESOURCES_OPT_METHODS={"AL", "AL"}

We have defined a child resource for the NUMA bandwidth resources, which corresponds to a CPU-based description of the placement. For each one of the child resources, we may define separate estimation and optimization methods. Note that the decisions and algorithms over child resources may not coincide with the corresponding methods applied for the case of the original resources.
Aspiration-learning-based dynamics. We propose a class of learning dynamics that provide: a) fast response to rapid performance variations, and b) a varying exploration rate. Under the reinforcement learning scheme of [6], exploration of new affinities is performed with fixed probability and independently of the current performance of the application. It would be desirable, however, for the exploration rate to vary with the current performance (e.g., a larger exploration rate under low performance). To this end, we developed a novel learning scheme that is based upon the notions of benchmark actions/performances and bears similarities with so-called aspiration learning. At each iteration $k$, the following steps are executed:
1. Performance update. At time $k$, update the (discounted) running average performance of the thread (with respect to the optimized resource), denoted $\bar{v}_i(k)$, as follows:
where $\tilde{v}_i(k)$ is the current measurement of the processing speed of thread $i$.
2. Benchmark update. Define the upper benchmark performance $\bar{b}_i(k)$ as follows:
for some constant $\eta > 1$. The lower benchmark performance $\underline{b}_i(k)$ is defined as follows:
3. Action update. Given the current benchmarks and performance, a thread $i$ selects actions according to the following rule:
(a) if $\bar{v}_i(k) < \underline{b}_i(k)$, i.e., if the current average performance is smaller than the lower benchmark performance, then thread $i$ will perform a random switch to an alternative selection according to a uniform distribution.
(b) if $\underline{b}_i(k) \le \bar{v}_i(k) < \bar{b}_i(k)$, then each thread $i$ will keep playing the same action with high probability and experiment with any other action with a small probability $\lambda > 0$.
(c) if $\bar{v}_i(k) \ge \bar{b}_i(k)$, i.e., if the current average performance is larger than the upper benchmark performance, then thread $i$ will keep playing the same action.
It is important to note that the above learning scheme will react immediately to a rapid drop in the performance (thus, we indirectly reduce the response time to large performance variations). Furthermore, the exploration rate is considerably large under such rapid drops in the performance, and quite small ($\lambda > 0$) when performance remains within the safe region (i.e., between $\underline{b}_i(k)$ and $\bar{b}_i(k)$).
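The three steps above can be sketched for a single thread as follows. The exact update equations are not reproduced in this sketch, so several pieces are assumptions made purely for illustration: the running average is an exponential moving average with a hypothetical step size `step`, the upper benchmark tracks the best running average seen so far, and the lower benchmark is the upper benchmark divided by `eta`. Only the three-way action rule follows the description directly. Measurements are assumed to be positive.

```python
import random
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class AspirationLearner:
    actions: List[int]              # available actions, e.g. NUMA node ids
    action: int                     # currently selected action
    step: float = 0.5               # ASSUMED averaging step size in (0, 1]
    eta: float = 1.1                # eta > 1; ASSUMED to set the benchmark gap
    lam: float = 0.05               # small experimentation probability
    avg: Optional[float] = None     # running average performance
    upper: Optional[float] = None   # upper benchmark performance
    rng: random.Random = field(default_factory=random.Random)

    def _other_action(self) -> int:
        """Uniformly pick an action different from the current one, if any."""
        others = [a for a in self.actions if a != self.action]
        return self.rng.choice(others) if others else self.action

    def update(self, measurement: float) -> int:
        # 1. Performance update (ASSUMED exponential moving average).
        if self.avg is None:
            self.avg = measurement
        else:
            self.avg += self.step * (measurement - self.avg)

        # 2. Benchmark update (ASSUMED: best average so far, scaled by eta).
        self.upper = self.avg if self.upper is None else max(self.upper, self.avg)
        lower = self.upper / self.eta

        # 3. Action update.
        if self.avg < lower:
            # (a) below the lower benchmark: random switch.
            self.action = self._other_action()
        elif self.avg < self.upper:
            # (b) inside the safe region: experiment with probability lam.
            if self.rng.random() < self.lam:
                self.action = self._other_action()
        # (c) at or above the upper benchmark: keep the current action.
        return self.action
```

Under these assumptions, a sharp drop in the measured speed pushes the running average below the lower benchmark within a few iterations and forces a switch, whereas small fluctuations inside the safe region only trigger occasional experimentation.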
Experiments
In this section, we present an experimental study of the proposed framework. Experiments were conducted on 28 × Intel Xeon CPU E5-2650 v3 2.30 GHz running Linux Kernel 64bit 3.13.0-43-generic. The cores are divided into two NUMA nodes (Node 1: CPU cores 0-13, Node 2: CPU cores 14-27). The experiments were conducted in scenarios where the availability of resources (CPU cores) may vary over time. We compared the overall performance (in terms of processing speed of threads and completion time of an application) of the framework with that of the OS scheduler. The application that we used was Ant Colony Optimisation (ACO) [7], a metaheuristic used for solving NP-hard combinatorial optimization problems, applied here to the Single Machine Total Weighted Tardiness Problem (SMTWTP). This gives us an application that is both computationally intensive and data intensive, where potentially large portions of memory are used for storing and accessing data (making considerations of where the application data is located, in terms of NUMA nodes, non-trivial) and where each thread performs a large amount of computation over this data. More details about the application can be found in [12].
Experimental setup. For the purpose of reducing the number of experiments, we have fixed the scheduling interval to 0.2 sec, which is also the interval in which the RM collects measurements of the total instructions per sec (using the PAPI library [10]) for each one of the threads separately. This is used as an estimate of the processing speed of each thread. Pinning of threads to CPU cores is achieved through the pthread_setaffinity_np function (with CPU sets built using the sched.h facilities). In all of the experiments, the RM is executed by the master thread of an application, which always runs on a fixed CPU core (usually the first available CPU core of the first NUMA node).

In Table 1, we provide an overview of the conducted experiments with the ACO case study. To evaluate the dynamic scheduler under different amounts of exogenous interference, we consider four main sets of experiments (A, B, C, and D), where each set differs in the amount of provided resources and their temporal availability. In particular, under the non-uniform CPU availability condition, other applications occupy a constant number of the available CPU cores throughout the whole duration of the experiment. On the other hand, under the time-varying CPU availability condition, other applications occupy a non-constant part of the available bandwidth (i.e., exogenous applications start running 1 min after the beginning of the experiment). In both the non-uniform and the time-varying CPU availability cases, the exogenous disturbances (other applications) comprise computational tasks often equally distributed among the available CPU cores. In the experiment sets A, B, and C, these exogenous applications occupy the first 6 CPU cores of both NUMA nodes. Furthermore, in set D, we alternate the exogenous interferences between the two NUMA nodes. Our goal is to investigate the effect of the (stack) memory of threads on the overall performance of the application.
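The pinning step of the experimental setup relies on the C function pthread_setaffinity_np. As a stand-in, the sketch below uses Python's `os.sched_setaffinity`, which is available on Linux and restricts the caller to a given set of CPUs. The helper name `pin_to_core` is hypothetical, and the sketch is not the scheduler's actual actuation code.

```python
import os
from typing import Set


def pin_to_core(core: int) -> Set[int]:
    """Restrict the caller to a single CPU core and return the new mask.

    Raises NotImplementedError on platforms without affinity support.
    """
    if not hasattr(os, "sched_setaffinity"):
        raise NotImplementedError("CPU affinity control is not available here")
    os.sched_setaffinity(0, {core})       # 0 refers to the caller
    return set(os.sched_getaffinity(0))


if __name__ == "__main__":
    if hasattr(os, "sched_getaffinity"):
        allowed = os.sched_getaffinity(0)  # cores the caller may currently use
        target = min(allowed)
        print("pinned to", pin_to_core(target))
        os.sched_setaffinity(0, allowed)   # restore the original mask
```

Restoring the original mask afterwards matters in a sketch like this one, since a pinned caller otherwise stays on a single core for the rest of its lifetime.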
(experiments A.1 and B.1), the operating system slightly outperforms the PaRLSched scheduler. This is because the Linux scheduler performs internal load balancing of threads between cores, which has a notable effect on the execution time when there is little external interference (in terms of additional running applications). For the PaRLSched scheduler, this benefit is lost due to the explicit pinning of threads to cores, which prevents the operating system from balancing the load between cores. This is also why, in set C of the experiments, the OS scheduler consistently outperforms the PaRLSched scheduler: a large number of cores is available and the actual interference by the external application is much lower than in A and B. Note that in experiment C only 50% of the cores are occupied by the external application, as opposed to 80% and 75% in experiments A and B. However, when the interference by the external application is higher (A.
Thread Pinning

Thread pinning and memory binding
In this last experiment, the PaRLSched scheduler was extended with the binding of newly allocated (stack) memory to the selected NUMA node of the running thread. The intention here is to constrain any newly allocated memory to the selected NUMA node, which may potentially further increase the running speed of the overall application. We further introduce the variable ζ ∈ [0, 1], which captures the minimum percentage of occupancy (among threads) requested before binding the memory of a thread to a NUMA node. For example, ζ = 1/2 implies that a thread's memory will be bound to a NUMA node if and only if more than 1/2 of the threads also run on that node. Intuitively, the more threads occupy a NUMA node, the more likely it is that the larger part of the shared memory will be (or should be) attached to that node. Thus, through the introduction of ζ, we get an additional control variable that may potentially affect the speed of the overall application. This is indeed the case, as depicted in the statistical analysis of Table 3: under ζ = 1/2, a small decrease is observed in the completion time of the overall application (which is not observed under ζ = 0).
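The occupancy test controlled by ζ can be written as a small predicate. The function name `should_bind_memory` and the representation of the current placement as a list of NUMA node ids (one per thread) are assumptions made for this sketch.

```python
from collections import Counter
from typing import Sequence


def should_bind_memory(thread_nodes: Sequence[int], node: int, zeta: float) -> bool:
    """Return True if more than a fraction `zeta` of all threads run on `node`.

    `thread_nodes[i]` is the NUMA node currently selected by thread i.
    """
    if not 0.0 <= zeta <= 1.0:
        raise ValueError("zeta must lie in [0, 1]")
    if not thread_nodes:
        return False
    share = Counter(thread_nodes)[node] / len(thread_nodes)
    return share > zeta


if __name__ == "__main__":
    placement = [0, 0, 0, 1]          # hypothetical: 3 of 4 threads on node 0
    print(should_bind_memory(placement, 0, 0.5))
    print(should_bind_memory(placement, 1, 0.5))
```

With ζ = 0 the predicate holds for any node that hosts at least one thread, so a thread's memory is always bound to its own node, while ζ = 1/2 restricts binding to the node hosting a strict majority of the threads.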
Conclusions and future work
We proposed a measurement-(or performance-) based learning scheme for addressing the problem of efficient dynamic pinning of parallelized applications into many-core systems under a NUMA architecture. According to this scheme, a centralized objective is decomposed into thread-based objectives, where each thread is assigned its own utility function. Allocation decisions were organized into a hierarchical decision structure: at the first level, decisions are taken with respect to the assigned NUMA node, while at the second level, decisions are taken with respect to the assigned CPU core (within the selected NUMA node). The proposed framework is flexible enough to accommodate a large set of actuation decisions, including memory placement. Moreover, we introduced a novel learning-based optimization scheme that is more appropriate for administering actuation decisions under a NUMA architecture, since a) it provides better control over the switching frequency and b) it provides better adaptivity to variations in the performance, since the experimentation probability is directly influenced by the current performance.
We demonstrated the utility of the proposed framework in the maximization of the running average processing speed of the threads. Through experiments, we observed that the PaRLSched dynamic scheduler can ensure that the running average speed of the parallelized application will be either larger than or equal to the corresponding speed under the OS's scheduler. In some cases, this also corresponded to a decrease in the execution time of an application by up to 16%.
For future work, we plan to extend our heuristics with mechanisms that assign a group of processors to a single thread, instead of just one. This will allow us to take advantage of the internal load balancing performed by the Linux operating system, thus eliminating the cases where the OS outperforms our scheduler due to the latter's rigidity in the placement of threads. Additionally, we will evaluate our scheduler on a wider range of use cases and under more variable conditions, in terms of the type and amount of external load, to determine the exact settings in which it can bring benefits to the execution time, compared to the Linux scheduler.
