ABSTRACT Tiled many-core processors (e.g., KNL and the TILE-Gx72 processor), in which processing cores are fitted onto a single chip and interconnected via mesh-based networks, differ from traditional many-core systems. Their operating system (OS) should be optimized to take into account the unique characteristics of tiled many-core processors (for instance, all cores are integrated into a single chip), because these characteristics were not considered when OSes designed for traditional multicore (many-core) systems were deployed on tiled many-core processors. In this paper, we propose an optimized load balancing policy to improve the performance of multi-threaded applications. Letting a thread select an appropriate idle (lightweight) tile (processing core) from all tiles on the single chip, rather than from a portion of the tiles, reduces the overhead triggered by the load balancing policy, the penalty of cache misses caused by scheduling and by more threads sharing the same tile (processing core), and the contention for memory controllers due to cache misses. The experimental results demonstrate that the optimized load balancing policy provides up to 2.7× performance improvement on KNL and mitigates the performance degradation to varying extents on the TILE-Gx72 processor.
I. INTRODUCTION
Scalability problems, in which the execution time of a multi-threaded application designed to take advantage of software-level parallelism (and hardware-level parallelism) cannot be reduced as more threads (processing cores) cooperate in the parallel phase(s), remain a challenge for application programmers, library (e.g., heap manager) designers, and operating system (OS) designers. Previous research [1]-[6] demonstrated that program performance could be improved when load imbalance (e.g., data distributed unevenly across threads) was eliminated from applications, when a scalable heap manager (e.g., Jemalloc [7]) replaced the original heap manager (Ptmalloc [8]) of the GNU C library on Linux, when lock-free mechanisms were introduced in the Linux kernel, and when OSes designed to be deployed on many-core processors were rethought. Among all factors that can potentially hinder program performance, the OS is deemed the most significant and therefore deserves the most attention.
To mitigate overheads caused by the wire delay and limited throughput, tiled many-core processors have become prevalent. Their tiles (processing cores) interconnected through mesh-based two-dimensional network(s) are integrated onto a single chip and memory controllers are fitted onto the same chip as well. Thus, OS designers need to think about what the OS should do in order to fully utilize hardware resources for maximizing the benefit from software-level parallelism. It is important to note that an OS designed for traditional multicore (many-core) systems can be deployed on tiled many-core processors, but the OS might unintentionally introduce overhead. This is because unique characteristics of tiled many-core processors were not taken into consideration when OSes were designed for traditional multicore (many-core) systems. For instance, locality is important for a NUMA (non-uniform memory access) system because of the penalty when data is fetched from a remote memory controller rather than a local one, and therefore a throughput-oriented scheduler designer should avoid thread migrations among (NUMA) nodes as much as possible. In contrast, the local and remote memory controllers of any tile (processing core) on tiled many-core processors are fitted onto the same chip, and thus the scheduler design policy needs to be optimized, since the penalty between remote and local memory controllers is smaller on tiled many-core processors than on traditional many-core systems. Current tiled many-core processors including KNL [9] and the TILE-Gx72 processor [10] are described in detail in Section II-C.
Before moving to the OS design for emerging tiled many-core processors, it is necessary to learn what a future OS should take into account by analyzing the overhead introduced by an OS (here, the general-purpose OS Linux) designed for traditional multicore (many-core) systems but deployed on tiled many-core processors. In this paper, we explain that the performance of multi-threaded shared-memory applications designed for chip multiprocessors can be improved when the load balancing policy in the Linux kernel is optimized for tiled many-core processors. This is because the optimized policy takes into consideration specific features of tiled many-core processors that are easy for OS designers to overlook. Synchronization (e.g., barriers) is the main topic of this paper, as it is widely adopted in applications for conventional high performance computing (HPC) systems and for multicore (many-core) systems as well. Constructs such as transactions introduced in databases [11], programming languages [12], and transactional memory [13]-[16] are not covered, since our focus is existing throughput-oriented shared-memory multi-threaded applications for chip multiprocessors. Synchronization in multi-threaded applications guarantees correct program behavior, which is analogous to explicit fence instructions [17] from the perspective of computer architecture. The drawback is that synchronization, especially coarse-grained synchronization, forces threads that fail to acquire blocked resources to yield their processing cores on throughput-oriented and fairness-oriented systems. This is reasonable because these threads may wait a long time for resources to become available, and a thread that occupied a processing core continuously would delay other threads sharing the same core.
Overhead introduced by the OS (which refers to Linux in this paper) designed for traditional multicore (many-core) systems, can be observed when blocked threads are about to be awakened on tiled many-core processors. This is because blocked threads are requested to be awakened onto their previous scheduling domains as a result of locality and therefore threads may be distributed unevenly across tiles (processing cores) on tiled many-core processors, even though the total thread count is no more than the number of available cores. Since tiles (processing cores), remote memory controllers, and local memory controllers are integrated onto the same chip, and non-uniform memory access latency does not dominate program performance on tiled many-core processors, the blocked thread can be awakened on any idle (or lightweight) tile (processing core) on the single chip, instead of its previous scheduling domain that includes a portion of tiles. This is related to the optimized load balancing policy in the Linux kernel proposed in this paper. The main contributions are as follows:
• It is necessary to rethink what the OS on tiled many-core processors should do in order to take advantage of their unique characteristics, such as relatively small non-uniform memory access latency. By optimizing the load balancing policy of a general-purpose OS (Linux), performance can be improved for shared-memory multi-threaded applications designed for chip multiprocessors on existing tiled many-core processors, including KNL and the TILE-Gx72 processor. This is because, with the optimized load balancing policy, threads are proactively assigned to idle/lightweight cores in one step, instead of being eventually pulled by the idle/lightweight cores in two steps as in the original load balancing policy of the Linux kernel.
• Letting a thread select an appropriate idle (lightweight) tile (processing core) from all tiles on the single chip, rather than from a portion of the tiles, reduces the overhead triggered by the load balancing policy, the penalty of cache misses caused by scheduling and by more threads sharing the same tile (processing core), and the contention for memory controllers due to cache misses. This is the main reason program performance can be improved when the load balancing policy is optimized on tiled many-core processors. Scheduling domains, proposed to handle the complexity of conventional SMT and NUMA systems, might need to be eliminated on tiled many-core processors for shared-memory multi-threaded applications designed for chip multiprocessors.
The rest of this paper is organized as follows. Section II describes the background, including the Linux kernel scheduler (CFS: Completely Fair Scheduler), the scheduling domains proposed to mitigate the complexity of simultaneous multithreading NUMA systems, emerging tiled many-core processors including KNL and the TILE-Gx72 processor, and the current load balancing policy. Section III explains the motivation and the optimized load balancing policy. The performance of the optimized load balancing policy of Linux on KNL and the TILE-Gx72 processor is evaluated in Section IV. Section V discusses what OS designers should take into consideration for future OSes. Related work is presented in Section VI, and the conclusion and future work are discussed in Section VII.
II. BACKGROUND
To fully understand what OS designers should take into account when designing OS for tiled many-core processors, it is necessary to know what the unique characteristics of existing tiled many-core processors are and how an OS (Linux in this paper) designed for traditional multicore (many-core) systems but deployed on tiled many-core processors manages hardware resources and services applications. In this section, we mainly focus on the load balancing policy of the Linux kernel and the unique features of KNL and the TILE-Gx72 processor.
A. LINUX KERNEL SCHEDULER
Since we concentrate on throughput-oriented applications, in which the program execution time is expected to be reduced as the thread (processing core) count increases, the discussion of the load balancing policy is closely associated with the Linux kernel scheduler (CFS: Completely Fair Scheduler, introduced in Linux-2.6.23). CFS uses a two-level scheduling mechanism [18], [19]: it manages distributed per-core run queues to share the time of each core, and it (re)distributes tasks across processing cores in space to eliminate load imbalance and maximize hardware resource utilization. Tasks, which refer to both threads and processes because the Linux kernel does not differentiate them, are pulled by an idle (lightweight) core from the busiest (highly loaded) processing core.
VOLUME 7, 2019
B. SCHEDULING DOMAINS
The concept of scheduling domains [20]- [22] was originally introduced into the Linux kernel to deal with the complexity of SMT (simultaneous multithreading) [23] - [27] , NUMA, and SMT NUMA systems due to their respective specific features.
The instruction-level parallelism is maximized by SMT, which allows instructions to be fetched from multiple hardware contexts (logical processing cores) every cycle, but the OS (Linux) had the opportunity to assign two tasks to two logical processing cores from the same physical core when (more than) two physical cores with SMT but without power constraint existed. This is because the OS was not conscious of the difference among logical processing cores from separate physical cores. Moreover, when these two tasks did not share data at all and the working sets were larger than the shared (L2) cache attached to the physical core, assigning two tasks onto two logical processing cores (from the same physical core) could cause more cache misses and therefore contention for memory controllers (and the memory bus as well). For the (SMT) NUMA systems, the OS had the chance to migrate threads among (NUMA) nodes and thus forced the migrated threads to access the remote (original local) memory controllers.
To cope with the complexity triggered by these specific features, (logical) processing cores are grouped together as a scheduling domain in accordance with how hardware resources (SMT hardware context including the in-order front end and the out-of-order execution core, cache shared by multiple physical cores, package, and the whole system) are shared. Scheduling domains are established as a hierarchy from the basic domain (i.e., logical processing cores sharing an SMT-based physical core) to the top domain (the whole system), so that each upper scheduling domain includes the lower domains. A newly created task can be assigned to any core (processor) across the top domain, since no limitation of cache affinity (locality) exists. A blocked task is designated to be awakened onto a processing core (processor) of its previous scheduling domain, as it is costly to migrate to other domains because of cold caches and/or remote memory access latency.
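The nesting of scheduling domains can be illustrated with a minimal sketch. It is not the Linux kernel's implementation: the `Topology` class, the `domains_of` helper, and the three levels used here (SMT siblings, NUMA node, whole system) are hypothetical simplifications of the hierarchy described above, and CPU identifiers are assumed to be numbered contiguously within each core and node.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Topology:
    """Hypothetical machine: numa_nodes x cores_per_node x threads_per_core."""
    numa_nodes: int
    cores_per_node: int
    threads_per_core: int

    @property
    def num_cpus(self) -> int:
        return self.numa_nodes * self.cores_per_node * self.threads_per_core


def domains_of(cpu: int, topo: Topology) -> list:
    """Return the scheduling domains that contain `cpu`, lowest level first."""
    per_core = topo.threads_per_core
    per_node = topo.cores_per_node * per_core
    levels = []
    for span in (per_core, per_node, topo.num_cpus):
        first = (cpu // span) * span
        domain = frozenset(range(first, first + span))
        # Skip degenerate levels (for example, no SMT) that add no new CPUs.
        if len(domain) > 1 and (not levels or domain != levels[-1]):
            levels.append(domain)
    return levels


smt_numa = Topology(numa_nodes=2, cores_per_node=4, threads_per_core=2)
for level, domain in enumerate(domains_of(5, smt_numa)):
    print(level, sorted(domain))
```

In this sketch a newly created task may be placed on any CPU of the last (top) domain, whereas a woken task would be restricted to one of the lower domains that contained its previous CPU.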
C. TILED MANY-CORE PROCESSORS
Tiled many-core processors, which fit tiles (processing cores) interconnected via mesh-based two-dimensional networks, together with memory controllers, onto a single chip, are analogous to a conventional HPC system composed of hundreds (or thousands) of processors interconnected by (Gigabit) Ethernet. Fig. 1 illustrates the topology of two current tiled many-core processors (KNL and the TILE-Gx72 processor). The detailed hardware information is shown in Table 1. It is important to note that on KNL a tile consists of two physical cores, which have their respective private L1 caches but share an L2 cache, whereas on the TILE-Gx72 processor each processing core, with its private L1 and L2 caches, constitutes a tile. The conventional shared on-die last-level (L3) cache (LLC) is not supported on either processor, and the KNL designers [9] have explained that HPC workloads benefit less from it than from additional processing cores.
On the basis of the concept of scheduling domains described in Section II-B, multi-level domains can be created in accordance with the topology of KNL and the TILE-Gx72 processor. Note that Section II-B suggests establishing a scheduling domain on the basis of a shared memory controller, since a shared LLC (or package) is associated with a shared memory controller. As shown in Table 1, four NUMA nodes can be supported on each of KNL and the TILE-Gx72 processor. Therefore, on the TILE-Gx72 processor, two-level scheduling domains (each first-level scheduling domain includes a quarter of the tiles; the top-level scheduling domain includes all tiles) can be created by the Linux kernel. The creation process is straightforward because (1) SMT is not supported; (2) each tile (processing core) has its private L1 cache and L2 cache; (3) tiles are divided into four groups in accordance with the number of memory controllers; and (4) all tiles are fitted onto the same chip (package). On KNL without Hyper-Threading (SMT), three-level scheduling domains are expected to be established, as every two physical cores share an L2 cache in addition to features (3) and (4), which are identical to those of the TILE-Gx72 processor. However, the current Linux kernel for KNL does not take this specific feature (the L2 cache is shared by two physical cores) into account and creates two-level scheduling domains as well when Hyper-Threading is disabled. Three-level scheduling domains are formed when Hyper-Threading is enabled.
D. LOAD BALANCING
The crucial goal of the load balancing policy (of Linux) is to maximize the utilization of hardware resources and to distribute tasks across cores (processors) as evenly as possible, on the assumption that tasks are independent and that every core (processor) has identical computing power. When communication (dependency) among tasks is important [28] and/or the system is heterogeneous (i.e., faster and slower cores coexist) [18], [29], the load balancing policy should be reconsidered. Load balancing in Linux is invoked under two circumstances: (1) imbalance among scheduling domains is detected when the periodic timer tick takes place, and balancing is allowed by the balancing interval of the scheduling domain; and (2) a core (processor) becomes idle, and the corresponding flag of the scheduling domain is set. Note that the balancing interval of a scheduling domain becomes larger as the domain hierarchy goes up, since thread migrations among (NUMA) nodes should be triggered as infrequently as possible on traditional (SMT) NUMA systems.
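The two invocation conditions can be summarized in a short sketch. This is an illustration under stated assumptions rather than kernel code: `SchedDomain`, its fields, and `should_balance` are hypothetical names, the base interval of 4 ticks is arbitrary, and the doubling of the interval per level is assumed only to model the statement that the interval grows with the hierarchy.

```python
from dataclasses import dataclass


@dataclass
class SchedDomain:
    level: int                    # 0 = lowest level of the hierarchy
    base_interval: int = 4        # ticks; hypothetical value
    last_balance: int = 0         # tick of the previous balancing pass
    balance_on_idle: bool = True  # stands in for the per-domain flag

    @property
    def interval(self) -> int:
        # Assumption: the interval doubles per level; the text only says it grows.
        return self.base_interval << self.level


def should_balance(domain: SchedDomain, now: int,
                   imbalanced: bool, core_became_idle: bool) -> bool:
    """Return True if balancing would be invoked for this domain."""
    # Circumstance (2): a core became idle and the domain allows idle balancing.
    if core_became_idle and domain.balance_on_idle:
        return True
    # Circumstance (1): periodic tick, imbalance detected, interval elapsed.
    return imbalanced and (now - domain.last_balance) >= domain.interval


first_level, top_level = SchedDomain(level=0), SchedDomain(level=1)
print(should_balance(first_level, now=5, imbalanced=True, core_became_idle=False))
print(should_balance(top_level, now=5, imbalanced=True, core_became_idle=False))
```

With these assumed values, the same imbalance observed at tick 5 is acted on in the first-level domain but deferred in the top-level domain, which mirrors the intent of migrating threads across (NUMA) nodes less frequently.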
III. OPTIMIZING THE LOAD BALANCING POLICY
This section describes the motivation to optimize the load balancing policy of the Linux kernel and what the optimized load balancing policy does on emerging tiled many-core processors (KNL and the TILE-Gx72 processor). It is highly associated with the characteristics of tiled many-core processors described in Section II-C and the fact that current load balancing policy is observed to potentially hinder program performance.
A. TASK BLOCKING
A typical application of the SPMD (Single Program Multiple Data) programming model for HPC, which corresponds to a data-parallel multi-threaded program of the PARSEC benchmark suite [30], is designed to distribute data across independent tasks (processes/threads), and the parallel execution takes place between synchronization (e.g., barrier) points. It is well known that the performance of a fine-grained parallel application can be degraded by system noise (e.g., OS clock ticks) [31]-[35]. This is because each parallel phase completes only when all tasks have reached the synchronization point, so its duration is dominated by the last task that arrives at that point. When the last task is delayed by system noise as a result of sharing the processing core, the overall execution time of the application is prolonged, especially when there are numerous synchronization points.
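The effect of the last arriving task can be made concrete with a small worked example. The function name and the timing values below are hypothetical and chosen only for illustration; the model simply charges each barrier-delimited phase with the time of its slowest task.

```python
def barrier_execution_time(phases):
    """Total time of a barrier-synchronized run.

    `phases` is a list of phases, each a list of per-task times; a phase
    ends only when its slowest task reaches the barrier.
    """
    return sum(max(task_times) for task_times in phases)


# Four tasks, three phases, 10 time units of work per task and phase.
quiet = [[10, 10, 10, 10]] * 3
# The same run, but in every phase one (different) task is delayed by 2 units.
noisy = [[10, 10, 12, 10], [12, 10, 10, 10], [10, 10, 10, 12]]

print(barrier_execution_time(quiet))  # 30
print(barrier_execution_time(noisy))  # 36
```

Although each task is delayed only once, every phase pays for the delay, so the penalty accumulates with the number of synchronization points.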
Recall that a blocked thread of a multi-threaded application is designated to be awakened onto a core (processor) of its previous scheduling domain because of locality (cache affinity), as described in Section II-B. Since most synchronization implementations block earlier threads (threads that reach the synchronization point earlier), the corresponding cores become idle, and later threads (threads that reach the same synchronization point later) may be migrated (assigned) to cores on which earlier threads are blocked. That is, threads may not be evenly distributed across cores, even if the thread count is no more than the number of cores. When the last thread arrives at the synchronization point, multiple blocked threads belonging to the same parallel phase may be awakened onto the same core. Load balancing is then invoked when the load imbalance is detected or a core becomes idle. In the worst case, this procedure of awakening blocked threads onto cores of their previous scheduling domains, and thereby placing more than one thread on the same core, is repeated until the application terminates. Note that awakening blocked threads onto cores of their previous scheduling domains can evidently assign more than one thread to the same core, and that such an assignment means one thread has been placed on a core on which another thread of the same parallel phase is blocked.
Overall, the problem of prolonging the application execution time is caused by (1) the system noise; (2) the load balancing policy of Linux; and (3) the task blocking mechanism (blocked threads are awakened onto cores of their previous scheduling domains because of locality). In this paper, (2) and (3) are not regarded as part of the system noise (1).
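The uneven distribution described above can be reproduced with a toy model. The sketch assumes, as a simplification, that every blocked thread is woken on exactly the core on which it blocked (the text only requires a core of the previous scheduling domain); the function name and the thread-to-core placement are hypothetical.

```python
from collections import Counter


def loads_after_wakeup(num_cores, running_on, blocked_on):
    """Per-core runnable-thread counts once all blocked threads are woken in place.

    running_on: thread -> core it currently runs on
    blocked_on: thread -> core on which it blocked at the barrier
    """
    load = Counter()
    for core in running_on.values():
        load[core] += 1
    for core in blocked_on.values():
        load[core] += 1
    return [load[core] for core in range(num_cores)]


# Four threads on four cores. T0 and T1 reached the barrier early and blocked
# on cores 0 and 1; T2 was then migrated from core 2 to the idle core 0.
running_on = {"T2": 0, "T3": 3}
blocked_on = {"T0": 0, "T1": 1}
print(loads_after_wakeup(4, running_on, blocked_on))  # [2, 1, 0, 1]
```

After the wake-up, core 0 holds two threads of the same parallel phase while core 2 is idle, even though the thread count does not exceed the core count; a later balancing pass is needed to repair this.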
B. OPTIMIZED LOAD BALANCING POLICY
System noise seems to be the root cause of performance degradation on parallel systems, as discussed in Section III-A. The pipeline programming model, in which the application (e.g., dedup from PARSEC) is partitioned into a sequence of stages, is adopted in applications for HPC as well. As analyzed by [36], dedup is divided into five pipeline stages, and threads of the third and fourth stages contend for the lock of the queue read by the thread of the fifth stage. Meanwhile, threads at the same stage contend for the lock of the queue assigned to them.
Threads that are blocked after spinning, when they fail to acquire the lock, are awakened once the lock is released. An awakened thread may be blocked again, but before it fails to acquire the lock again it is likely to share the processing core (CPU time and cache) with other active threads. This is another category of noise, which is separate from traditional kernel-level noise and, more importantly, inevitable. Therefore, we concentrate on assigning threads to appropriate cores in one step instead of the current two steps ((2) and (3) described in Section III-A).
Recall that cores of tiled many-core processors are fitted onto a single chip and grouped into separate first-level scheduling domains in accordance with the number of memory controllers on the TILE-Gx72 processor and KNL without Hyper-Threading discussed in Section II-C.
Observing that non-uniform memory access latency on tiled many-core processors is smaller than that on a traditional multicore (many-core) system, we consider that load balancing can be performed across tiles (processing cores) on the whole chip, especially given that the tiles (processing cores) in the fifth column of the TILE-Gx72 processor are split among four first-level scheduling domains. The optimized load balancing policy is that whenever a thread is about to be assigned to a processing core, the appropriate (idle/lightweight) core is selected from all available cores on the single chip (the top-level scheduling domain). This differs from the conventional load balancing policy, because threads are proactively assigned to the appropriate (idle/lightweight) processing cores, rather than being pulled from the highly loaded cores afterwards. It is important to note that, under the optimized load balancing policy, a given thread is (1) first assigned to its previous processing core if that core is idle, (2) otherwise assigned to another idle core, and (3) finally assigned to a lightweight core if no idle core exists.
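The three-step selection can be sketched as follows. This is a user-level illustration, not the kernel patch evaluated in this paper: `select_core` is a hypothetical name, load is modeled as the number of runnable threads per core, a lightweight core is assumed to be the least loaded one, and ties are broken by the lowest core index.

```python
def select_core(prev_core, load):
    """Pick a core for a thread, searching the whole chip.

    `load[c]` is the number of runnable threads on core c (assumed metric).
    """
    # Step (1): keep the previous core if it is idle (best cache affinity).
    if load[prev_core] == 0:
        return prev_core
    # Step (2): otherwise take any idle core on the chip.
    for core, runnable in enumerate(load):
        if runnable == 0:
            return core
    # Step (3): otherwise fall back to the most lightly loaded core.
    return min(range(len(load)), key=load.__getitem__)


print(select_core(2, [1, 1, 0, 1]))  # 2: previous core is idle
print(select_core(2, [1, 0, 1, 1]))  # 1: another core is idle
print(select_core(2, [2, 2, 2, 1]))  # 3: no idle core, lightest one wins
```

Because the search spans all cores rather than one first-level scheduling domain, the decision is taken once at assignment time and no later pull by an idle core is required.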
C. CASE STUDY
In this subsection, we explain how the optimized load balancing policy works on tiled many-core processors with three separate cases shown in Fig. 2. To clarify the process, an assumed tiled many-core processor that combines the significant topological features of KNL and the TILE-Gx72 processor is used. As illustrated in Fig. 2, it has 36 tiles, represented by squares and interconnected via the mesh-based networks. A green square indicates that the tile (processing core) is idle, a red square indicates that the tile is busy, and an orange square indicates a lightweight tile. Gray tiles do not affect the optimized load balancing policy. Tiles (processing cores) are divided into four groups (first-level scheduling domains), each represented by a blue dashed box. A thread that is about to be awakened is assumed to be blocked on tile (1, 2), where 1 is the row index and 2 is the column index.
Case 1, shown in Fig. 2a, is the situation in which the previous tile (processing core) of the blocked thread is idle. In this case, the blocked thread is awakened onto that idle tile (tile (1, 2)). The final decisions of the original and the optimized load balancing policies are the same in Case 1. However, the implementation of the optimized load balancing policy is simpler than the original one. Case 2, shown in Fig. 2b, is the situation in which the tiles in the previous scheduling domain are all busy and a tile (tile (0, 3)) in another scheduling domain is idle. In this case, the optimized policy awakens the blocked thread onto the idle tile. In contrast, the original load balancing policy, designed for traditional many-core systems (e.g., a multi-socket system whose processing cores are fitted onto multiple chips interconnected via QuickPath Interconnect (QPI) [37] or HyperTransport [38]), does not migrate the blocked thread to the idle tile directly, for data locality reasons. Case 3, shown in Fig. 2c, is the situation in which tile (4, 2) is lightweight and the remaining tiles are all busy. In this case, the optimized load balancing policy finds the lightweight tile and awakens the blocked thread onto it, whereas the behavior of the original load balancing policy is identical to that in Case 2.
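The three cases can be replayed on a 6 × 6 grid with a small sketch. Several details are assumptions made for illustration: the four first-level domains are modeled as 3 × 3 quadrants, tiles not listed in a case are treated as busy, and the original policy is simplified to a search confined to the previous domain that falls back to the previous tile when nothing suitable is found there. `wake_target` and `domain_of` are hypothetical names.

```python
N = 6  # 6 x 6 mesh, 36 tiles


def domain_of(tile):
    """First-level domain of a tile; 3 x 3 quadrants are an assumption."""
    row, col = tile
    return (row // 3, col // 3)


def wake_target(prev, state, whole_chip):
    """Tile on which a thread blocked on `prev` is awakened.

    state: tile -> "idle" or "light"; unlisted tiles count as busy.
    whole_chip: True for the optimized policy, False for a search that is
    confined to the previous first-level domain.
    """
    tiles = [(r, c) for r in range(N) for c in range(N)]
    if not whole_chip:
        tiles = [t for t in tiles if domain_of(t) == domain_of(prev)]
    if state.get(prev) == "idle":
        return prev
    for wanted in ("idle", "light"):
        for tile in tiles:
            if state.get(tile) == wanted:
                return tile
    return prev  # nothing better in scope: queue behind a busy tile


prev = (1, 2)
cases = {
    "Case 1": {(1, 2): "idle"},
    "Case 2": {(0, 3): "idle"},
    "Case 3": {(4, 2): "light"},
}
for name, state in cases.items():
    print(name, wake_target(prev, state, False), wake_target(prev, state, True))
```

In Cases 2 and 3 the domain-confined search leaves the thread on its busy previous tile, so a second step (a later pull by the idle or lightweight tile) is needed, whereas the whole-chip search reaches tile (0, 3) or tile (4, 2) directly.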
HPC experts may argue that binding threads to cores can solve the problem caused by the synchronization. The approach of binding threads to cores may work for traditional many-core systems. However, that solution conceals the root cause derived from the scheduling domain of the Linux kernel scheduler on tiled many-core processors. This is because tiles (processing cores) are fitted onto a single chip but they are still divided into multiple groups (first-level scheduling domains). Moreover, that solution does not help OS designers fully understand what should be taken into consideration when designing their (future) OS for tiled many-core processors.
IV. PERFORMANCE EVALUATION
This section evaluates whether the optimized load balancing policy is feasible on current tiled many-core processors (the TILE-Gx72 processor and KNL) and analyzes why it is effective even though it may trigger additional thread migrations.
A. EXPERIMENTAL SETUP
Applications of the PARSEC benchmark suite (parsec-3.0), which were designed for chip multiprocessors (tiled many-core processors belong to chip multiprocessors), were selected to compare the program performance under the optimized load balancing policy with that under the original policy of the Linux kernel. The experimental settings are listed in Table 2 and the application configurations are listed in Table 3. Because canneal, vips (a data-parallel application), and x264 (a pipeline application) could not be compiled correctly for the TILE-Gx72 processor, they were run only on KNL. However, we noticed that x264 could not run correctly on KNL and that vips could not report the execution time of the ROI (region of interest, which represents the parallel phase). Therefore, vips and x264 are not included in Table 3, as the optimized load balancing policy targets the parallel phase(s). Swaptions could not run correctly on the TILE-Gx72 processor when more than 64 threads were configured with the simlarge input set.
Both the native and simlarge input sets were used to run each application. The figures in Section IV-B show that the scalability problem is evident on KNL without Hyper-Threading and on the TILE-Gx72 processor. We therefore do not take KNL with Hyper-Threading into account in this paper, since it seems unnecessary to add more threads (processing cores) when running an application that already has a scalability problem. Both platforms (KNL and the TILE-Gx72 processor) were configured to run in NUMA mode without the conventional LLC. Each program was run three times, and the average of the full (whole) execution time was used to plot the speedup figures. The speedup of the parallel phase alone is not shown in these figures, since the full (whole) execution time is able to reflect the performance when more threads are configured. The exceptions are canneal with the native input set on KNL and facesim with the simlarge input set on the TILE-Gx72 processor, as shown in Fig. 7.
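The way a speedup value is derived from repeated runs can be written down in a few lines. The sketch assumes that the speedup is the averaged single-thread execution time divided by the averaged multi-thread execution time; the function name and all timing values are hypothetical and are not measurements from this paper.

```python
from statistics import mean


def speedup(baseline_runs, parallel_runs):
    """Speedup of a configuration, normalized to the single-thread baseline.

    Each argument holds the full execution times (in seconds) of repeated runs.
    """
    return mean(baseline_runs) / mean(parallel_runs)


# Hypothetical timings of three runs each (not data from this paper).
single_thread = [100.0, 102.0, 98.0]
sixteen_threads = [10.0, 10.5, 9.5]
print(speedup(single_thread, sixteen_threads))  # 10.0
```

Averaging before dividing keeps a single slow run from dominating the plotted value, although with only three runs the result remains sensitive to outliers.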
B. PERFORMANCE
Program performance with the native and simlarge input sets is illustrated in Figs. 3 and 4 for KNL and in Figs. 5 and 6 for the TILE-Gx72 processor. The x-axis shows the number of threads involved in the parallel phase. Note that it does not always represent the total thread count of the program execution, because multiple parallel phases exist in a pipeline application. The y-axis shows the speedup normalized to the baseline, in which one thread is configured to run the program. Labels O and R in parentheses mean that the performance is evaluated under the original load balancing policy of the Linux kernel and the optimized load balancing policy, respectively. Fig. 3 shows that the performance of most applications can be improved with the optimized load balancing policy. Furthermore, the optimized load balancing policy does not hurt the program performance when it fails to improve it. Fig. 4 shows that the performance of all applications can be improved to different degrees. In contrast to Fig. 3, Fig. 5 illustrates that the optimized load balancing policy does not hurt the performance of most applications and clearly improves the performance of streamcluster. A similar observation can be made from Fig. 6 as well. The exception, in which the optimized load balancing policy degrades the program performance, occurs when facesim is run with 16 and with 32 threads on the TILE-Gx72 processor with the simlarge input set, as demonstrated in Fig. 6d. However, further investigation of the performance degradation of facesim reveals that the full (whole) execution time with 16 and 32 threads is prolonged by 2.4% and 1.6%, respectively, by the optimized load balancing policy, whereas the parallel (ROI) execution time is reduced by 0.04% and 0.83%. The speedup on the parallel phase of facesim is shown in Fig. 7b.
The contradiction between performance degradation on the full execution and performance improvement on the parallel execution might be related to the timer accuracy adopted by the application and/or the way the parallel phase(s) and the serial phase(s) are defined by the application programmer. The same happens to canneal with the native input set on KNL, as the comparison between Figs. 3j and 7a shows. Therefore, we believe that the optimized load balancing policy is feasible.
Note that the scalability problem on tiled many-core processors is not completely solved by the optimized load balancing policy. This is because the scalability problem is associated not only with load balancing but also with other factors, such as the heap manager, as discussed in Section I. Furthermore, the scalability problem differs among platforms and input data sets. For instance, with the native input set, the scalability problem of ferret occurs on KNL when the thread count reaches 49, as shown in Fig. 3e, and on the TILE-Gx72 processor when the number of threads reaches 16, as demonstrated in Fig. 5e, whereas with the simlarge input set the problem occurs when the thread count reaches 8 on KNL and 4 on the TILE-Gx72 processor, as shown in Figs. 4e and 6e, respectively.
Freqmine, which was designed with the OpenMP programming model instead of Pthreads, is a special case for analyzing whether the optimized load balancing policy is feasible. On KNL, the optimized load balancing policy does not hurt the performance with the native input set, as shown in Fig. 3g, and is able to improve the performance with simlarge, as shown in Fig. 4g. However, on the TILE-Gx72 processor, as illustrated in Figs. 5g and 6g, it is not easy to determine which load balancing policy is better. Since threads are first manipulated by the scheduler of the OpenMP programming model, the intended effect of both the original and the optimized kernel-level load balancing policies may be undermined. This is a potential drawback if OS designers want to manage threads (tasks) explicitly on the basis of user-level optimization. The reason the performance comparison on KNL differs from that on the TILE-Gx72 processor is discussed in Section IV-C.
C. ANALYSIS
It is relatively easy to understand the difference in the scalability problem between distinct input sets. The data size handled by each parallel phase tends to be smaller with the simlarge input set, and therefore the performance becomes more sensitive to factors (e.g., load balancing) that potentially prolong the execution time of each parallel phase. This also helps explain why, with the simlarge input set on the TILE-Gx72 processor (Fig. 6), the performance of blackscholes (more than 69 threads), bodytrack (more than 52 threads), and facesim (64 threads) clearly benefits from the optimized load balancing policy, which assigns threads to appropriate (idle/lightweight) cores in one step rather than two steps.
The difference in the scalability problem between KNL and the TILE-Gx72 processor is associated with the memory (cache) system of each tiled many-core processor, even though the two share common features (e.g., tiles are fitted onto a single chip), as discussed in Section II-C. As analyzed in our previous work [39], in the NUMA mode, a quarter of the L2 caches on KNL (8 MB in total, instead of the full 32-MB capacity) can be used as LLC to store any given memory block (multiple virtual LLCs coexist on KNL), whereas on the TILE-Gx72 processor, the whole L2 caches (with 18-MB capacity) are viewed as LLC. It is well known that, for HPC applications, program performance is dominated by the CPU time allocated to threads, the cache capacity available to hold the data that the threads need, and communication between threads.
When more than a quarter of threads are assigned (awakened) to cores of the first-level scheduling domain on KNL (without Hyper-Threading), threads contend not only for the CPU time of processing cores (with CFS) but also for the underlying shared cache (LLC with 8-MB capacity). Cache lines that one thread will access later can be evicted by another aggressive thread sharing the same scheduling domain (associated with the memory controller). Subsequent accesses to the memory controller(s) dominate the program performance, because they involve not only the long access latency (wire delay) but also the overhead of contention for the memory controller (and, furthermore, the row buffer of the DRAM bank) [40]-[44]. The optimized load balancing policy is able to mitigate the overhead on the CPU time, cache, and memory on KNL, and reduce the penalty of unnecessary contention for the CPU time on the TILE-Gx72 processor. This can explain why the optimized load balancing policy benefits most applications on KNL but fewer programs on the TILE-Gx72 processor. It is important to note that the optimized load balancing policy can potentially increase the number of thread migrations, but the overhead is trivial: owing to the directory-based cache coherence protocol, cache lines can be transferred, when needed and without being evicted, from the previous (or home) cache to the new cache on the same chip of the tiled many-core processor. One significant but implied finding is that cache lines are likely to be evicted later with the optimized load balancing policy than with the original one of the Linux kernel, because the load balancing is done in one step.
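The contrast between one-step and two-step core selection can be illustrated with a toy model. The sketch below is neither the Linux kernel code nor the patch evaluated in this paper; it assumes that the two steps are "pick the least-loaded group of tiles, then the least-loaded core inside it" and that the one-step variant scans all tiles directly. The function names, the per-core load numbers, and the grouping are hypothetical.

```python
def pick_two_step(loads, groups):
    """Assumed two-step model: choose a group first, then a core within it."""
    # Step 1: the group with the smallest aggregate load wins.
    best_group = min(groups, key=lambda g: sum(loads[c] for c in g))
    # Step 2: the least-loaded core inside that group is selected.
    return min(best_group, key=lambda c: loads[c])


def pick_one_step(loads):
    """Assumed one-step model: scan every tile on the chip at once."""
    return min(range(len(loads)), key=lambda c: loads[c])


# Hypothetical per-core loads: core 0 is idle, but its group is busy overall.
loads = [0, 5, 5, 5, 1, 1, 1, 1]
groups = [[0, 1, 2, 3], [4, 5, 6, 7]]

two_step_core = pick_two_step(loads, groups)  # lands in the lighter group
one_step_core = pick_one_step(loads)          # finds the idle core directly
```

In this toy configuration the group-level step hides the idle core behind its busy neighbours, so the two-step choice places the thread on a core that already has work, whereas the global scan selects the idle core.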
V. DISCUSSION
As analyzed in Section IV, the main reason the optimized load balancing policy can improve the program performance on tiled many-core processors (KNL and the TILE-Gx72 processor) is that it takes advantage of their unique features: processing cores (tiles) are fitted onto a single chip, and the penalty of transferring cache lines from the previous cache to a new cache is trivial. HPC experts may nevertheless worry that additional thread migrations are incurred. Given that the optimized load balancing policy is able to mitigate the overhead associated with the contention for shared resources (e.g., CPU time), as seen especially in the performance improvement on KNL illustrated in Figs. 3 and 4, OS designers need to think about what the OS should do on tiled many-core processors to better serve multi-threaded applications.
In addition to the computing power of processing cores, the latency of fetching data from main memory when a read cache miss takes place, and communication (dependency) patterns among threads, one important but implicit factor on tiled many-core processors is that (private) caches are virtually shared by threads because of the cache coherence protocol. It is well known that the cache sharing problem [45]-[54], in which cache lines that one thread will access later can be evicted by another co-scheduled aggressive thread, exists on chip multiprocessors. Note that most research on the cache sharing problem was done with threads from multiple applications co-scheduled. Since an application written in the SPMD (or data-parallel) programming model divides data into multiple parts and assigns each part to a thread of the parallel phase, the cache sharing problem caused by threads of the same application cannot be ignored on chip multiprocessors either.
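The eviction effect behind the cache sharing problem can be reproduced with a very small simulation. The following is a sketch under strong assumptions: a single fully associative cache with LRU replacement, a victim thread "A" that reuses four lines, and an aggressive thread "B" that streams through lines it never reuses. Real caches on KNL or the TILE-Gx72 processor are set-associative and distributed, so the numbers are illustrative only; all names and parameters are hypothetical.

```python
from collections import OrderedDict


def hit_rate(trace, capacity, owner="A"):
    """Hit rate of `owner` in a shared LRU cache holding `capacity` lines."""
    cache = OrderedDict()  # keys ordered from least to most recently used
    hits = total = 0
    for thread, line in trace:
        key = (thread, line)
        hit = key in cache
        if hit:
            cache.move_to_end(key)         # refresh recency on a hit
        else:
            cache[key] = None              # insert the missing line
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used
        if thread == owner:
            total += 1
            hits += hit
    return hits / total


rounds, working_set = 100, 4

# Thread A alone: it cycles over its small working set.
alone = [("A", i % working_set) for i in range(rounds * working_set)]

# Thread A co-scheduled with B: two streaming accesses by B follow each A access.
shared, next_b = [], 0
for i in range(rounds * working_set):
    shared.append(("A", i % working_set))
    for _ in range(2):
        shared.append(("B", next_b))
        next_b += 1

rate_alone = hit_rate(alone, capacity=8)
rate_shared = hit_rate(shared, capacity=8)
```

With these parameters thread A hits almost always when it runs alone, but its lines are pushed out by the streaming thread before they are reused once the cache is shared.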
OS designers who plan to move from traditional multicore (many-core) systems to emerging tiled many-core processors first need to fully understand the memory (cache) system of a given tiled many-core processor. For instance, as stated in our work [39], in the NUMA mode, any memory block may be stored in only a quarter of the total L2 caches when KNL is configured with SNC-4 cluster mode and flat cache mode, whereas on the TILE-Gx72 processor, it can be present anywhere in the whole L2 caches. Misunderstanding this risks causing unnoticed and difficult-to-detect bottlenecks in the OS. Meanwhile, OS designers should pay attention to thread-to-thread communication (dependency) patterns, even though on tiled many-core processors the penalty of transferring blocks among caches is trivial compared with fetching data from memory. For instance, for a pipeline application (e.g., ferret from PARSEC), the performance might be improved when communicating threads are assigned to adjacent processing cores, because of the reduced network traffic on the whole chip.
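To make the remark about adjacent cores concrete, the network distance between pipeline stages can be estimated with a simple hop count. This is a minimal sketch assuming a 2D mesh with row-major core numbering and dimension-ordered routing, so that the distance is the Manhattan distance; the mesh width of 8 and both placements are hypothetical and do not describe the actual layout of KNL or the TILE-Gx72 processor.

```python
def mesh_hops(core_a, core_b, width):
    """Manhattan distance between two cores on a row-major 2D mesh (assumed)."""
    ax, ay = core_a % width, core_a // width
    bx, by = core_b % width, core_b // width
    return abs(ax - bx) + abs(ay - by)


def pipeline_hops(placement, width):
    """Total hops between consecutive pipeline stages for a given placement."""
    return sum(mesh_hops(a, b, width) for a, b in zip(placement, placement[1:]))


WIDTH = 8  # hypothetical mesh width

adjacent = pipeline_hops([0, 1, 2, 3], WIDTH)     # stages on neighbouring cores
scattered = pipeline_hops([0, 63, 7, 56], WIDTH)  # stages in opposite corners
```

Under these assumptions the adjacent placement needs one hop per stage boundary, while the scattered placement crosses most of the chip for every transfer between stages.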
The system noise, which includes system calls, scheduling, and interrupt handling and was investigated by previous researchers [34], [35], should be taken into consideration as well, even though the problem it causes was studied on conventional systems. More importantly, the overhead of the system noise was amplified as more processors were configured. This observation is applicable to tiled many-core processors as well. Moreover, the scheduling domain, which is created on the basis of the way hardware resources are shared and also on the basis of the original load balancing policy of the Linux kernel, might need to be reconsidered (or eliminated) on tiled many-core processors.
Although the topic discussed in this paper is the optimized load balancing policy built on the original one of the Linux kernel, our analysis of the performance improvement of multi-threaded applications helps OS designers understand the importance of exploiting the unique characteristics (e.g., the shared cache (memory) system) of tiled many-core processors. We therefore believe that OS designers should take these unique features into account when designing their (future) OS for tiled many-core processors, and that they can learn more from the analysis of the general-purpose OS (Linux), which was designed for traditional multicore (many-core) systems but can be deployed on tiled many-core processors. In addition, the optimized load balancing policy is likely to be beneficial to parallel workloads that behave similarly to the multi-threaded applications evaluated in this paper.
VI. RELATED WORK
Our work is done on the basis of previous research work involving the scalability problem, the system noise, the analysis on the NUMA system, the load balancing policy of the Linux kernel, and the cache sharing problem.
A. SCALABILITY PROBLEM
The scalability problem, in which the program performance is not improved as more threads (processes) are added to the parallel phase(s), was clearly observed when running multi-threaded applications on multicore (many-core) systems [1], [55], [56]. The scalability problem incurred by the OS on future systems with hundreds (or thousands) of processing cores was discussed early on. New OSes (e.g., Corey [4], fos [5], and Barrelfish [6]) were designed to serve applications on scalable multicore systems. However, Boyd-Wickizer et al. [3] pointed out that scalability bottlenecks could be removed from the Linux kernel or avoided by modifying the applications slightly. Their solutions to the scalability problem on many cores motivated us to continue our work on Linux (with the traditional OS organization) rather than immediately move to other OS organizations (e.g., Barrelfish, which treats the processing cores of a many-core system as processors in a distributed system). This is because we need to deeply understand what bottlenecks the traditional OS (Linux) organization incurs on tiled many-core processors (KNL and the TILE-Gx72 processor) before rethinking what should be taken into account when designing (future) OSes.
B. SYSTEM NOISE
Overhead caused by the system noise, which is associated with system calls, scheduling, interrupt handling, daemons, and network operations, was studied a decade ago. Petrini et al. [31] observed that a performance problem clearly occurred when all four processors per node were used to run a large-scale application (named SAGE) on the 8192 processors of ASCI Q. They improved the performance by turning off unnecessary daemons. Tsafrir et al. [33] noticed that OS clock ticks were the main source of system noise for fine-grained parallel applications. This is because the duration of the parallel phase between two consecutive synchronization operations is determined by the last process to reach the latter synchronization point. The analysis of the overhead related to the system noise is applicable to multi-threaded applications as well. Nataraj et al. [34] analyzed the system noise derived from timers (global and local timer interrupts) and (preemptive) scheduling. Ferreira et al. [35] adopted the noise injection approach to analyze the application sensitivity. These studies helped us understand why performance varied on the same system when the same application was run multiple times. Their analyses made us pay more attention to the system noise on tiled many-core processors as well.
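The reasoning that a parallel phase is determined by its slowest participant can be written down in a few lines. The sketch below is an illustrative model rather than a reproduction of the cited analyses: it assumes that every process performs the same amount of work per phase and that each process is delayed independently with probability p. The function names and numeric values are hypothetical.

```python
def phase_time(work, delays):
    """Phase length when all processes wait for the last one at the barrier."""
    return max(work + d for d in delays)


def prob_phase_delayed(p, n):
    """Chance that at least one of n processes is hit, assuming independence."""
    return 1.0 - (1.0 - p) ** n


# One delayed process out of four stretches the whole phase.
stretched = phase_time(10.0, [0.0, 0.0, 0.0, 2.5])

# The same per-process noise probability matters far more at scale.
small_scale = prob_phase_delayed(0.01, 1)
large_scale = prob_phase_delayed(0.01, 100)
```

Under the independence assumption, a 1% chance of disturbance per process grows to more than 60% per phase with 100 processes, which is consistent with the observation that noise overhead is amplified as more processors are configured.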
C. ANALYSIS ON THE NUMA SYSTEM
The non-uniform memory access characteristic of the NUMA system has been widely researched in order to improve the program performance. For instance, Tam et al. [57] designed a scheduling scheme based on the sharing patterns among threads, to eliminate the overhead caused by assigning communicating threads to separate nodes. Majo and Gross [58] proposed that eliminating program-level data sharing and regularizing memory access patterns on NUMA systems can improve the program performance. As the non-uniform memory access latency became less important, researchers began to concentrate on the overhead triggered by contention for shared resources (e.g., memory controller, memory bus, interconnect, and prefetching hardware). Dashti et al. [59] pointed out that congestion on memory controllers and the interconnect hurt performance much more. Lepers et al. [60] observed that the program performance was affected by the asymmetric interconnect. Diener et al. [61] proposed that a mixed policy combining data locality and balance among nodes could improve the performance the most. Zhuravlev et al. [62] noticed that contention for the memory controller, memory bus, and prefetching hardware could cause performance degradation. Since KNL and the TILE-Gx72 processor can each be viewed as a NUMA system, these analyses of the NUMA system gave us a clear idea of what we should focus on for tiled many-core processors.
D. LOAD BALANCING POLICY
Load balancing, which is intended to distribute tasks to cores (processors) appropriately, is a well-studied topic. Lim et al. [63] optimized per-CPU utilization and minimized the task migration cost on multicore embedded systems, rather than balancing tasks across CPUs, to guarantee the real-time features (i.e., cache efficiency, user responsiveness, effective power consumption, and latency). Li et al. [29] proposed an asymmetric multiprocessor scheduler (AMPS), which was composed of asymmetry-aware load balancing, faster-core-first scheduling, and NUMA-aware migration, for performance-asymmetric multi-core architectures. Their asymmetry-aware load balancing took advantage of the distinct computing powers of the heterogeneous system. Hofmeyr et al. [18] proposed user-level speed balancing, which could be adopted when thread speeds were inconsistent (e.g., the system was heterogeneous, threads could not be distributed across cores evenly, or application threads competed with other threads for shared resources). These load balancing policies are different from the optimized policy, but the common feature is that the balancing is done on the basis of the characteristics of systems (or needs).
E. CACHE SHARING PROBLEM
The cache sharing problem refers to the observation that the performance of one application is affected when it is co-scheduled with another application, due to the contention for the shared LLC. Kim et al. [45] observed that the performance of an application (named gzip) varied when it was co-scheduled with other applications (e.g., apsi and art) separately on a 2-processor CMP sharing a 512-KB L2 cache. Chandra et al. [46] likewise noticed that the cache sharing problem existed when an application (called mcf) was co-scheduled with other applications on a system with a shared 512-KB L2 cache. To solve the problem, applications were categorized into multiple (red, yellow, green, and black) groups on the basis of the sensitivity to the shared L2 cache size and the cache access rate and then co-scheduled [50].
Moreover, Xie and Loh [51] proposed a dynamic classification algorithm to classify applications into four animal types (turtle, sheep, rabbit, and devil). Although our work discussed in this paper does not focus on the cache (memory) system of tiled many-core processors, these studies convinced us that the optimized load balancing policy is plausible, because assigning threads to appropriate (idle/lightweight) cores in one step is able to mitigate the penalty incurred by the cache sharing problem. This is also related to the reason the optimized load balancing policy works better on KNL than on the TILE-Gx72 processor, as discussed in Section IV-C.
VII. CONCLUSION AND FUTURE WORK
In this paper, we proposed an optimized load balancing policy, which proactively assigns threads to idle/lightweight cores in one step, on tiled many-core processors (KNL and the TILE-Gx72 processor). It has been designed on the basis of (1) unique features of tiled many-core processors, such as cores being fitted onto a single chip and caches being shared by cores because of the cache coherence protocol, and (2) the suggestion that the current load balancing policy of the Linux kernel, which is performed on the basis of the scheduling domains, can potentially degrade the performance. The experimental results revealed that the optimized load balancing policy can improve the program performance of multi-threaded applications from the PARSEC benchmark suite on both KNL and the TILE-Gx72 processor, though the scalability problem cannot be entirely solved. The performance improvement with the optimized load balancing policy, furthermore, demonstrates that OS designers need to rethink what the (future) OS on tiled many-core processors should be in order to take advantage of the unique features. In future work, we plan to pay attention to thread-to-thread communication patterns and combine them with the features of KNL and the TILE-Gx72 processor before we move to the (future) OS on tiled many-core processors.
