Scientific applications often contain large computationally-intensive parallel loops. Loop scheduling techniques aim to achieve load balanced executions of such applications. For distributed-memory systems, existing dynamic loop scheduling (DLS) libraries are typically MPI-based, and employ a master-worker execution model to assign variably-sized chunks of loop iterations. The master-worker execution model may adversely impact performance due to the master-level contention. This work proposes a distributed chunk-calculation approach that does not require the master-worker execution scheme. Moreover, it considers the novel features in the latest MPI standards, such as passive-target remote memory access, shared-memory window creation, and atomic read-modify-write operations. To evaluate the proposed approach, five well-known DLS techniques, two applications, and two heterogeneous hardware setups have been considered. The DLS techniques implemented using the proposed approach outperformed their counterparts implemented using the traditional master-worker execution model.
I. INTRODUCTION
Over the past decade, the increasing demand for computational power of scientific applications played a significant role in the development of modern high performance computing (HPC) systems. The advancements of modern HPC systems at both, hardware and software levels, raise questions regarding the benefits of these advantages for successful algorithms and techniques proposed in the past. Algorithms and techniques may, therefore, need to be revisited and re-evaluated to fully leverage the capabilities of modern HPC systems. Dynamic loop scheduling (DLS) techniques are important examples of successful scheduling techniques proposed over the years. The DLS techniques are critical for scheduling parallel loops that are the main source of parallelism in scientific applications [1] . A large body of work on DLS was introduced between the late 1980's and the early 2000's [2] - [8] . These DLS techniques were used in several scientific applications to achieve a load balanced execution of loop iterations. Several factors can hinder such a load balanced execution, and consequently, degrade applications' performance. Specifically, problem characteristics, non-uniform input data sets, as well as algorithmic and systemic variations lead to different execution times of each loop iteration. The DLS techniques are designed to mitigate load imbalance due to the aforementioned factors.
Dynamic loop self-scheduling-based techniques such as, self-scheduling (SS) [2] , guided self-scheduling (GSS) [3] , trapezoid self-scheduling (TSS) [4] , factoring (FAC) [5] , and weighted factoring (WF) [6] , constitute an important category of DLS techniques. The distinguishing aspect of loop self-scheduling is that whenever a processing element becomes available and requests work, it obtains a collection of loop iterations (called a chunk) from a central work queue. Each DLS technique uses a certain function to calculate chunk sizes.
Several implementations of self-scheduling-based techniques [8] - [12] employ the master-worker execution model, and use the classical two-sided MPI communication model. In the master-worker execution model, a processing element (called master) holds all the information required to calculate the chunks and serves work requests from other processing elements (called workers). Workers request new chunks once they become available, according to the self-scheduling principle. The master-worker execution model highlights an important performance-relevant detail concerning the implementation of self-scheduling-based techniques; it centralizes the chunk-calculations at the master. This centralization renders the master process a performance bottleneck in three scenarios: (1) the master process has a decreased processing capabilities; this may happen in heterogeneous systems, (2) the master process receives a large number of concurrent work requests; this may happen in large-scale distributed-memory systems, and (3) a combination of (1) and (2) may be the case when executing on large-scale heterogeneous systems.
The current work proposes an approach to address the first execution scenario described above. Intuitively, this scenario can be avoided by mapping the master process to the processing element with the highest processing capabilities. However, such a mapping may not always be guaranteed or feasible. For instance, system variations during applications' execution may adversely affect the computation or the communication capabilities of the master leading to performance degradation.
The MPI-2 standard [13] introduced remote memory access (RMA) semantics that were seldomly adopted in scientific applications because they had several issues [14] . For instance, the unrestricted allocation of exposed-memory regions makes the efficient implementation of one-sided functions extremely difficult. The MPI RMA model was significantly revised in the MPI-3 standard and more performance-capable RMA semantics were introduced [15] , [16] . For instance, MPI_Win_Allocate was introduced to restrict the memory allocation and to allow more efficient MPI one-sided implementations. The performance assessment of MPI RMA-based approaches has recently gained increased attention for different scientific applications [17] - [19] .
The present work proposes a novel approach for developing DLS techniques to execute scientific applications on distributed-memory systems. The proposed approach distributes the chunk-calculation of the DLS techniques among all processing elements. Moreover, the proposed approach is implemented using the recent features offered by the MPI-3 standard, such as passive-target synchronization, shared-memory window creation, and atomic read-modify-write operations. The present work is significant for improving the performance of various scientific applications, such as N-body simulations [20] , Monte-Carlo simulations, and computational fluid dynamics [7] , that employ DLS techniques on heterogeneous and large scale modern HPC systems. Moreover, the present work provides insights for improving existing DLS libraries [10] , [11] such that they take advantage of modern HPC systems by exploiting the remote direct memory access (RDMA) capabilities of modern interconnection networks.
The main contributions of this work are: (1) Proposal and implementation of five DLS techniques with distributed chunk-calculation for distributed-memory systems. (2) Evaluation of the benefit of using MPI one-sided communication and passive-target synchronization mode to implement the DLS techniques: SS [2] , GSS [3] , TSS [4] , FAC [5] , and WF [6] against other existing approaches [10] , [11] , [21] .
The remainder of this work is organized as follows. Section II contains a review of the selected DLS techniques, as well as of the relevant research efforts on the different implementation approaches of DLS techniques for distributed-memory systems. Also, the background on the MPI RMA model is presented in Section II. The proposed distributed chunk-calculation approach and its execution model are introduced in Section III. The design of experiments and the experimental results are discussed in Sections IV and V, respectively. The conclusions and the potential future work are outlined in Section VI.
II. BACKGROUND AND RELATED WORK
Loop Scheduling: Loops are the richest source of parallelism in scientific applications [1] . Computationally-intensive scientific applications spend most of their execution time in large loops. Therefore, efficient scheduling of loop iterations benefits scientific applications' performance. In the current work, five well-known DLS techniques: self-scheduling (SS) [2] , guided self-scheduling (GSS) [3] , trapezoid self-scheduling (TSS) [4] , factoring FAC [5] , and weighted factoring (WF) [6] are considered. These techniques are considered due to their competitive performance in different applications, and due to the fact that they are at the basis of other DLS techniques. For instance, trapezoid factoring self-scheduling (TFSS) [22] is based on TSS and FAC, while adaptive weighted factoring (AWF) [7] and its variants [23] are derived from FAC.
The common aspect of all selected DLS techniques is that new chunks of iterations are assigned to processing elements once they become available and request work. However, each of these DLS techniques employs a different function to calculate the size of the chunk to be assigned.
The notation used in the present work is given in Table I . In the SS [2] technique, the assigned chunk, K i , is always 1 loop iteration. Due to the fine-grained partitioning of the loop iterations, SS can achieve a highly load balanced execution. However, it incurs a high scheduling overhead. GSS [3] , TSS [4] , and FAC [5] outperform SS in terms of scheduling overhead, by assigning chunks of decreasing size. In each scheduling step i, GSS uses a non-linear function to calculate the chunk sizes. It divides the remaining loop iterations, R i , by the total number of processing elements, P .
TSS [4] employs a simplistic linear function to calculate the decreasing chunk sizes using a fixed decrement. This linearity results in low scheduling overhead in each scheduling step.
FAC [5] schedules the loop iterations in batches of P equally-sized chunks. The initial chunk size of FAC is smaller than the initial chunk size of GSS. If more time-consuming loop iterations exist at the beginning of the loop, GSS may not balance their execution as efficiently as FAC. Unlike GSS and TSS, which calculate the chunks deterministically, the chunk-calculation in FAC evolved from comprehensive probabilistic analyses. To calculate an optimal chunk size, FAC requires prior knowledge about the standard deviation, σ, of loop iterations' execution times and their mean execution time μ. A practical implementation of FAC, denoted FAC2, does not require μ and σ, and assigns half of the remaining work in every batch [5] . FAC2 evolved from the probabilistic 
Remaining loop iterations after i-th scheduling step
Standard deviation of the loop iterations' execution times μ
Mean of the loop iterations' execution times Tp
Parallel execution time of the entire application T loop p Parallel execution time of the application's parallelized loops analysis that gave birth to FAC, and is considered in the current work.
WF [6] uses the FAC function to calculate the batch size. However, the processing elements execute variably-sized chunks of this batch according to their relative weights. The processor weights, W p j , are determined prior to applications' execution and do not change during the execution. The chunk-calculation function of each technique is shown in Table II .
Related Work: Chronopoulos et al. introduced a distributed approach for implementing self-scheduling techniques (DSS) [24] . The goal was to improve the performance of the self-scheduling techniques on heterogeneous and distributed-memory resources. The proposed scheme was based on the master-worker execution model, similar to the one illustrated in Figure 1a , and was implemented using the classical two-sided MPI communication. The main idea was to enable the master to consider the speed of the processing elements and their loads when assigning new chunks. Chronopoulos et al. later enhanced the performance of the DSS scheme using a hierarchical master-worker model, and proposed the hierarchical distributed self-scheduling (HDSS) [22] that was similar to the one illustrated in Figure 1b . DSS and HDSS assume a dedicated master configuration in which the master processing element is reserved for handling the worker requests. Such a configuration may enhance the scalability of the proposed self-scheduling schemes. However, it results in low CPU utilization of the master. HDSS [22] suggested deploying the global-master and the local-master on one physical computing node that has multiple processing elements to overcome the low CPU utilization of the master (cf. Figure 1b ).
Cariño et al. proposed a load balancing (LB) tool that integrated several DLS techniques [11] . At the conceptual level, the LB tool is based on a single-level master-worker execution model (cf. Figure 1a ). However, it did not assume a dedicated-master. It introduced the breakAfter parameter which is user-defined and indicates how many iterations the master should execute before serving pending worker requests.
Hierarchical loop scheduling (HLS) [12] was one of the 
earliest efforts to utilize a hybrid MPI and OpenMP programming model to implement DLS techniques. HLS employed a hierarchical master-worker execution model, and was implemented using the classical two-sided MPI communication and OpenMP threads. Unlike HDSS [22] , HLS distributes the local masters across all physical computing nodes (cf. Figure 1c ). The local masters communicate with the global master to request a new chunk when they have no more iterations to distribute between their workers. The main limitation of HLS is that the OpenMP worker threads distribute the work using the #pragma omp parallel region without the explicit use of the nowait clause. This incurs implicit local synchronization at the OpenMP level on local masters. All the aforementioned research efforts [11] , [12] , [22] , [24] have a common aspect are master-worker-based methods that centralize and hence serialize the chunk-calculation at the master-side. The present work introduces an alternative approach to design and implement the DLS techniques by distributing the chunk-calculation among all workers. The uniqueness of the present work is the elimination of the master that could represent a performance bottleneck in many cases (cf. Section I).
MPI RMA Model: In MPI, the memory of each process is by default private, and other processes cannot directly access it. The MPI RMA model allows MPI processes to publicly expose different regions of their memory, called windows. One MPI process (origin) can directly access a memory window without any involvement of the other (target) process that owns the window. The MPI RMA has two synchronization modes: passive-and active-target. In the active-target synchronization, the target process determines the time boundaries, called epochs, when its window can be accessed. In the passive-target synchronization, the target process has no time limits when its window can be accessed. The present work focuses on the passive-target synchronization because it requires a minimal amount of synchronization, and it efficiently allows the greatest overlap of computation and communication. Moreover, it yields the development of DLS techniques for distributed-memory systems to be very similar to their original implementations for shared-memory systems.
III. THE PROPOSED APPROACH
The main challenge to design a distributed chunk-calculation approach is associated with the chunk-calculation functions of the DLS techniques. To calculate the current chunk to be assigned, these functions (except for SS) require either the value of the remaining loop iterations R i or the value of the previous chunk K i−1 (cf . Table II) . Therefore, the chunks have to be calculated in a sequential manner, i.e., two chunks cannot be calculated simultaneously because the values of R i and K i−1 change after each chunk-calculation. This serialization perfectly fits any master-worker-based execution approach because the master serves one request at a time, and consequently, the chunk-calculation can be performed in a centralized and sequential fashion. The approach proposed in this work introduces certain transformations of the respective chunk-calculation functions from Table II , such that the chunk-calculation depends only on the index of the last scheduling step i. These transformations are shown below in Equations 1-3. For GSS and FAC, the transformations were already introduced in the literature [5] . For TSS, our mathematical derivation of the transformation can be found in the companion research report [25] . WF uses the chunk-calculation function of FAC and can naturally inherit the transformed FAC function.
The proposed approach uses Equations 1-3 to distribute the chunk-calculation across all processing elements. Thus, only one processing element (called coordinator) stores index of the last scheduling step i and the index of the last scheduled loop iteration lp start .
TSS: Figure 2 illustrates the main steps of the proposed distributed chunk-calculation approach:
Step 1. the processing element p j obtains a copy of the last scheduling step index, i, and atomically increments the global i by one.
Step 2. p j only uses its local copy of i (before the increment) to calculate K i with the function of the selected DLS technique (Equations 1-3).
Step 3. p j obtains a copy of the last loop index start, lp start , and atomically accumulates the size of the calculated chunk, K i , into it. Finally, p j executes loop iterations between lp start (before accumulation) and min(lp start + K i , N ). The atomic operations in Steps 1 and 3 guarantee the exclusive access to i and lp start . The MPI RMA model provides the necessary function calls that can be used in the implementation of the proposed approach. For instance, the coordinator MPI process can use MPI_Win_create to expose the shared variables, such as i and lp start , to all other MPI processes. The passive-target synchronization mode (MPI_Win_lock(MPI_LOCK_SHARED)) can be used with certain MPI atomic operations, such as MPI_Get_accumulate, to grant the exclusive access to i and lp start by all MPI processes. For more information regarding the implementation, the reader is referred to the code that is developed under the LGPL license and available online [26] . Figure 3 illustrates the DLS execution using the proposed distributed chunk-calculation approach. The calculation of chunks K 0 and K 1 is distributed between processors p 0 and p 1 . The time required to calculate K 0 overlaps with the time taken to calculate K 1 . In the traditional master-worker execution model, there is no such overlap since all the chunk calculations are centralized and performed by the master in sequence. The time required to serve the first work request (including chunk-calculation and chunk-assignment) delays the second work request. Moreover, the time required to serve the work requests is proportional to the processing capabilities of the master processor, which may result in additional delays as discussed in Section I.
The proposed distributed chunk-calculation approach may result in a different ordering of assigning and executing loop iterations compared to the traditional master-worker execution model. For instance, when GSS is the chosen scheduling technique in Figure 3 and N = 10, p 0 obtains a local copy of the last scheduling index i = 0 at t 4 . Also, p 1 obtains at t 5 a local copy of the last scheduling index i = 1. Both, p 0 and p 1 use their copies of i and calculate K 0 = 5 and K 1 = 3, respectively. The proposed approach does not guarantee that p 0 and p 1 will execute loop iterations from lp start = 0 to lp start = 4 and lp start = 5 to lp start = 7. Figure 3 shows the case when the chunk-calculation on p 0 is longer than on p 1 , and results in assigning p 1 , loop iterations between lp start = 0 and lp start = 2, while p 0 is assigned loop iterations between lp start = 3 and lp start = 7. Given that DLS techniques address, by design, independent loop iterations with no restrictions on the monotonicity of the loop execution, the proposed approach does not affect the correctness of the loop execution.
IV. DESIGN AND SETUP OF EXPERIMENTS
In the present work, the performance of two different implementations of DLS techniques is assessed. The first implementation, denoted One_Sided_DLS, employs the proposed distributed chunk-calculation approach and uses one-sided MPI communication in the passive-target synchronization mode. The second implementation, denoted Two_Sided_DLS, employs a master-worker model and uses the two-sided MPI communication. Both implementations assume a non-dedicated coordinator (or a non-dedicated master) processing element.
Selected Applications: Two computationally-intensive parallel applications are considered in this study. The first application, called PSIA [27] , uses a parallel version of the well-known spin-image algorithm (SIA) [28] . SIA converts a 3D object into a set of 2D images. The generated 2D images can be used as descriptive features of the 3D object.
The second application calculates the Mandelbrot set [29] . The Mandelbrot set is used to represent geometric shapes that have the self-similarity property at various scales. Studying such shapes is important and of interest in different domains, such as biology, medicine, and chemistry [30] .
Both applications contain a single large parallel loop that dominates their execution times. Dynamic and static distributions of the most time-consuming parallel loop across all processing elements may enhance applications' performance. Each application has certain execution parameters that affects the average iteration execution time. For PSIA, these parameters are the input size (800,000 3D points), the output size (288,000 images), the image-size (5x5), the bin-size (0.01), and the support-angle (2π) [21] . For Mandelbrot set, these parameters are the output image-size (1152x1152), the maximum number of iterations (1000), and the Z exponent (4) [30] .
Hardware Platform Specifications: Two types of computing resources are used in this work. The first type, denoted KNL, refers to standalone Intel Xeon Phi 7210 manycore processors with 64 cores, 96 GB RAM (flat mode configuration), and 1.3 GHz CPU frequency. The second type, denoted Xeon, refers to two-socket Intel Xeon E5-2640 processors with 20 cores, 64 GB RAM, and 2.4 GHz CPU frequency.
These platform types are part of a fully-controlled computing cluster that consists of 26 nodes: 22 KNL nodes and 4 Xeon nodes. All nodes are interconnected in a non-blocking fat-tree topology. The network characteristics are: Intel Omni-Path fabric, 100 GBit/s link bandwidth, and 100 ns network latency. Each KNL node has one Intel Omni-Path host fabric interface adapter. Each Xeon node has two Intel Omni-Path host fabric interface adapters. All host fabric adapters use a single PCIe x16 100 Gbps port. As this computing cluster is actively used for research and educational purposes, only 40% of the cluster could be dedicated to the present work, at the time of writing, specifically 288 cores out of the total 696 available cores.
In the present work, the total number of cores is fixed to 288 cores, whereas, the ratio between the KNL and the Xeon cores is varied. Two ratios have been considered: 2:1 represents the case when the KNL cores are the dominant type of computing resources, and 1:2 represents the complementary case where the Xeon cores are the dominant computing resources. Table III illustrates these two ratios. Also, 48 KNL cores and 16 Xeon cores per node are used, while the remaining cores on each node were left for other system-level processes.
Mapping of the Coordinator (or the Master) Process to a Certain Core: Two mapping scenarios are considered for the assessment of the proposed One_Sided_DLS approach vs. the Two_Sided_DLS approach. In the first mapping scenario, the process that plays the role of the coordinator for One_Sided_DLS or the role of the master for Two_Sided_DLS is mapped to a KNL core. The CPU frequency of a single KNL core is 1.3 GHz, while the CPU frequency of a single Xeon core is 2.4 GHz. Therefore, this mapping represents a case when the coordinator (or the master) process is mapped to one of the cores that has the lowest processing capabilities. In the second mapping scenario, the process that plays the role of the coordinator (or the master) is mapped to a Xeon core, which is the most powerful processing element in the considered system. Comparing the results of both scenarios shows the adverse impact of reduced processing capabilities of the master on the performance of the DLS techniques using Two_Sided_DLS. On the contrary, the same mapping for the coordinator process did not affect the performance of the DLS techniques using One_Sided_DLS.
V. RESULTS AND DISCUSSION
The straightforward parallelization (STATIC) is used as a baseline to assess the performance of the selected DLS techniques on the target heterogeneous computing platform. For the PSIA application, Figure 4a shows that SS, GSS, and TSS implemented with One_Sided_DLS outperformed their respective versions using Two_Sided_DLS. For instance, when the ratio of the KNL cores to the Xeon cores was 2:1, the parallel loop execution time, T loop p , of SS required 109 and 233 seconds with One_Sided_DLS and Two_Sided_DLS, respectively. Similarly, when the ratio was 2:1, the parallel loop execution time, T loop p , of GSS and TSS increased from 185 and 125 seconds to 236 and 136 seconds, respectively. When the ratio was 1:2, the total processing capabilities of the system increased because the number of Xeon cores increased. However, the parallel loop execution time, T loop p , of SS, GSS, and TSS implemented using Two_Sided_DLS did not take the advantage of increasing the total number of Xeon cores. For instance using One_Sided_DLS, changing the ratio from 2:1 to 1:2 reduced the T loop p of SS from 109 to 68.5 seconds. FAC and WF behaved similarly using both, One_Sided_DLS and Two_Sided_DLS.
The performance degradation of the DLS techniques with Two_Sided_DLS is due to mapping the master to a KNL core, which has the lowest processing capabilities (cf. Section IV). Recall that in Two_Sided_DLS, the master is responsible for serving work requests, and therefore, it has to divide the time between serving the work requests and performing its own chunks. Therefore, if the master has a lower processing capabilities than the other processes, it becomes a performance bottleneck. Also, recall that One_Sided_DLS is designed to addresses this scenario (Sections I and III). The coordinator process executes its own chunks, and is not responsible for the calculation and the allocation of the chunks to the other processes. Figure 4b shows that the DLS techniques with One_Sided_DLS perform comparably to their versions with Two_Sided_DLS. For instance using the ratio 2:1, the One_Sided_DLS implementation of SS, GSS, TSS, FAC2, and WF required 108, 177, 125, 125, and 110 seconds, respectively. The Two_Sided_DLS implementation of the same techniques required 105, 175, 135.6, 125, and 106.45 seconds, respectively. Also, using the ratio 2:1, the DLS techniques behaved similarly regardless of their implementation approach.
For the Mandelbrot application, Figure 5 confirms the same performance advantages of the proposed approach as for the PSIA application. The DLS techniques implemented with One_Sided_DLS performed equally whether the coordinator was mapped to a KNL core or to a Xeon core. The performance of certain DLS techniques with Two_Sided_DLS degraded when the master was mapped to a KNL core compared to their performance when the master was mapped to a Xeon core.
Overall, Figures 4 and 5 highlight two important observations. First observation: The performance variation for executing a certain experiment using the One_Sided_DLS approach is higher than the performance variation when executing the same experiment using the Two_Sided_DLS approach. The reason behind such variation is the manner in which concurrent messages are implemented at the MPI layer in One_Sided_DLS and Two_Sided_DLS. In the current work, the Intel MPI is used to implement both approaches, One_Sided_DLS and Two_Sided_DLS. Intel MPI uses the Lock Polling strategy to implement MPI_Win_lock in which the origin process repeatedly issues lock-attempt messages to the coordinator process until the lock is granted [16] . On the contrary, Two_Sided_DLS uses the MPI_Send, MPI_Recv and MPI_Iprobe functions. For Intel MPI, in the case of simultaneous sends of multiple work requests to the master process, the master checks the outstanding work requests using MPI_Iprobe, and serves them by giving priority to the request of the process with the smallest MPI rank. The One_Sided_DLS has a high probability to grant the lock to different MPI processes at each trial, whereas, Two_Sided_DLS always prioritizes requests from the process with the smallest MPI rank. The GSS has the largest non-linear decrement among the decrements of the selected DLS techniques. Therefore, GSS is highly-sensitive to the chunk-assignment.
Second observation: FAC and WF exhibit a reduced sensitivity to mapping the master to a KNL or to a Xeon core. This low sensitivity could be due to the factoring-based nature of these techniques. Among all the assessed DLS techniques, FAC2 and WF assign chunks in batches, which increases the possibility for the master to have chunks of the same size Fig. 4 : Performance of the proposed approach vs. the existing master-worker based approach for the PSIA. The x-axis represents the two ratios between the KNL cores and the Xeon cores. as the other processing elements. However, further analysis is needed to better understand such reduced sensitivity.
VI. CONCLUSION AND FUTURE WORK
A number of DLS techniques has been revisited and re-evaluated in light of and to enable them to benefit from the significant advancements in modern HPC sys-tems, both at hardware and software levels. A distributed chunk-calculation approach (One_Sided_DLS) has been proposed herein and is implemented using the MPI RMA and atomic read-modify-write operations with passive-target synchronization mode. The One_Sided_DLS approach performs competitively against existing approaches, such as Two_Sided_DLS that uses MPI two-sided communication and employs the conventional master-worker execution model. One_Sided_DLS has the potential to alleviate the masterworker level contention of Two_Sided_DLS in large-scale HPC systems.
The present work revealed interesting aspects, planned as future work. Specifically, the applications that require the return of the intermediate results upon the execution of each chunk of work. These applications will help to assess the impact of the data distribution on the One_Sided_DLS approach. In such cases, the performance of both, One_Sided_DLS and Two_Sided_DLS, is expected to degrade. Yet, it remains of high interest to quantify this performance degradation. The performance assessment of the proposed approach with other MPI libraries is also of high interest for future work. The scalability aspect of the proposed One_Sided_DLS approach also requires further study and analysis.
