16 research outputs found
Dynamic Multigrain Parallelization on the Cell Broadband Engine
This paper addresses the problem of orchestrating and scheduling
parallelism at multiple levels of granularity on heterogeneous
multicore processors. We present policies and mechanisms for adaptive
exploitation and scheduling of multiple layers of parallelism on the
Cell Broadband Engine. Our policies combine event-driven task
scheduling with malleable loop-level parallelism, which is exposed
from the runtime system whenever task-level parallelism leaves cores
idle. We present a runtime system for scheduling applications with
layered parallelism on Cell and investigate its potential with RAxML,
a computational biology application which infers large phylogenetic
trees, using the Maximum Likelihood (ML) method. Our experiments show
that the Cell benefits significantly from dynamic parallelization
methods that selectively exploit the layers of parallelism in the
system in response to workload characteristics. Our runtime
environment outperforms naive parallelization and scheduling based on
MPI and Linux by up to a factor of 2.6. We are able to execute RAxML
on one Cell four times faster than on a dual-processor system with
Hyperthreaded Xeon processors, and 5-10% faster than on a
single-processor system with a dual-core, quad-thread IBM Power5
processor.
A Flexible Thread Scheduler for Hierarchical Multiprocessor Machines
With the current trend of multiprocessor machines towards more and more hierarchical architectures, exploiting the full computational power requires careful distribution of execution threads and data so as to limit expensive remote memory accesses. Existing multi-threaded libraries provide only limited facilities to let applications express distribution hints, so programmers end up explicitly distributing tasks according to the underlying architecture, which is difficult and not portable. In this article, we present: (1) a model for dynamically expressing the structure of the computation; (2) a scheduler interpreting this model so as to make judicious hierarchical distribution decisions; (3) an implementation within the Marcel user-level thread library. We evaluated our proposal on a scientific application running on a ccNUMA Bull NovaScale with 16 Intel Itanium II processors; results show a 30% gain compared to a classical scheduler, and are similar to what a handmade scheduler achieves in a non-portable way.
Automated Scheduling Algorithm Selection and Chunk Parameter Calculation in OpenMP
Increasing node and cores-per-node counts in supercomputers render scheduling and load balancing critical for exploiting parallelism. OpenMP applications can achieve high performance via careful selection of scheduling kind and chunk parameters on a per-loop, per-application, and per-system basis from a portfolio of advanced scheduling algorithms (Korndörfer et al., 2022). This selection approach is time-consuming, challenging, and may need to change during execution. We propose Auto4OMP, a novel approach for automated load balancing of OpenMP applications. With Auto4OMP, we introduce three scheduling algorithm selection methods and an expert-defined chunk parameter for the kind and chunk arguments of OpenMP's schedule clause, respectively. Auto4OMP extends the OpenMP schedule(auto) and chunk parameter implementation in LLVM's OpenMP runtime library to automatically select a scheduling algorithm and calculate a chunk parameter during execution. Loop characteristics are inferred in Auto4OMP from the loop execution over the application's time-steps. The experiments performed in this work show that Auto4OMP improves application performance by up to 11% compared to LLVM's schedule(auto) implementation and outperforms manual selection. Auto4OMP improves the performance of MPI+OpenMP applications by explicitly minimizing thread-load imbalance and implicitly reducing process-load imbalance.
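As a rough illustration of what the scheduling kind and chunk parameter control, the sketch below lists the chunk sizes that three textbook loop-scheduling rules would hand out. It is not Auto4OMP or the LLVM runtime: the function names are hypothetical, and the guided rule uses the classic "remaining iterations divided by thread count" form, which real OpenMP runtimes are free to implement differently.

```python
import math


def static_chunks(n_iters, n_threads):
    """One near-equal block per thread, similar to schedule(static) with no chunk."""
    base, extra = divmod(n_iters, n_threads)
    sizes = [base + (1 if t < extra else 0) for t in range(n_threads)]
    return [s for s in sizes if s > 0]


def dynamic_chunks(n_iters, chunk=1):
    """Fixed-size chunks handed out on demand, similar to schedule(dynamic, chunk)."""
    full, rest = divmod(n_iters, chunk)
    return [chunk] * full + ([rest] if rest else [])


def guided_chunks(n_iters, n_threads, min_chunk=1):
    """Shrinking chunks: each one is ceil(remaining / n_threads), never below min_chunk."""
    sizes = []
    remaining = n_iters
    while remaining > 0:
        size = max(min_chunk, math.ceil(remaining / n_threads))
        size = min(size, remaining)  # the last chunk cannot exceed what is left
        sizes.append(size)
        remaining -= size
    return sizes


if __name__ == "__main__":
    print("static :", static_chunks(20, 4))
    print("dynamic:", dynamic_chunks(20, 6))
    print("guided :", guided_chunks(20, 4))
```

Fewer, larger chunks reduce scheduling overhead but leave less room to rebalance uneven iterations; many small chunks do the opposite. That trade-off is why the best kind and chunk value differ per loop and per system.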
PRISM: an intelligent adaptation of prefetch and SMT levels
Current microprocessors include hardware to optimize specific workloads.
In general, these hardware knobs are set to a default configuration when the
machine boots. This default behavior cannot be beneficial for all types of
workloads, and the knobs are controlled by no one but the end user, who needs to
know which configuration is best for the running workload. Some of these
knobs are: (1) the Simultaneous MultiThreading (SMT) level, which specifies the number
of threads that can run simultaneously on a physical CPU, and (2) the data
prefetch engine, which manages prefetches from memory. Parallel programming
models are here to stay, and one programming model that has succeeded in allowing programmers
to easily parallelize applications is Open Multi-Processing (OpenMP). Also,
the architecture of microprocessors is becoming so complex that end users cannot
afford to optimize their workloads for all the architectural details. These architectural
knobs can help to increase performance, but an automatic and
adaptive system is needed to manage them. In this work we propose an independent library
for OpenMP runtimes that increases performance by up to 220% (14.7% on average)
while reducing dynamic power consumption by up to 13% (2% on average) on a real
POWER8 processor.
SiL: An Approach for Adjusting Applications to Heterogeneous Systems Under Perturbations
Scientific applications consist of large and computationally-intensive loops.
Dynamic loop scheduling (DLS) techniques are used to load balance the execution
of such applications. Load imbalance can be caused by variations in loop
iteration execution times due to problem, algorithmic, or systemic
characteristics (also, perturbations). The following question motivates this
work: "Given an application, a high-performance computing (HPC) system, and
both their characteristics and interplay, which DLS technique will achieve
improved performance under unpredictable perturbations?" Existing work only
considers perturbations caused by variations in the computational speeds
delivered by the HPC system. However, perturbations in available network bandwidth or
latency are inevitable on production HPC systems. Simulator in the loop (SiL)
is introduced, herein, as a new control-theoretic inspired approach to
dynamically select DLS techniques that improve the performance of applications
on heterogeneous HPC systems under perturbations. The present work examines the
performance of six applications on a heterogeneous system under all of the above
system perturbations. The SiL proof of concept is evaluated using simulation.
The performance results confirm the initial hypothesis that no single DLS
technique can deliver the best performance in all scenarios, while the SiL-based
DLS selection delivered improved application performance in most experiments.
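The idea of choosing a DLS technique by simulating the loop first can be sketched with a toy model. This is not the SiL implementation: the simulator, the per-chunk overhead parameter, and the three candidate techniques (STATIC, self-scheduling SS, guided self-scheduling GSS) are assumptions chosen only for illustration.

```python
import heapq
import math


def static_sizes(n, p):
    base, extra = divmod(n, p)
    return [s for s in (base + (1 if t < extra else 0) for t in range(p)) if s > 0]


def ss_sizes(n, p):
    return [1] * n  # self-scheduling: one iteration per request


def gss_sizes(n, p):
    sizes, remaining = [], n
    while remaining > 0:
        size = math.ceil(remaining / p)  # guided self-scheduling rule
        sizes.append(size)
        remaining -= size
    return sizes


TECHNIQUES = {"STATIC": static_sizes, "SS": ss_sizes, "GSS": gss_sizes}


def makespan(iter_times, chunk_sizes, n_workers, overhead=0.0):
    """Toy simulation: the earliest-free worker takes the next chunk."""
    free_at = [0.0] * n_workers
    heapq.heapify(free_at)
    start = 0
    for size in chunk_sizes:
        t = heapq.heappop(free_at)
        t += overhead + sum(iter_times[start:start + size])
        heapq.heappush(free_at, t)
        start += size
    return max(free_at)


def select_technique(iter_times, n_workers, overhead=0.0):
    """Simulate every candidate and return the one with the shortest makespan."""
    n = len(iter_times)
    results = {
        name: makespan(iter_times, sizes(n, n_workers), n_workers, overhead)
        for name, sizes in TECHNIQUES.items()
    }
    return min(results, key=results.get), results


if __name__ == "__main__":
    times = [1, 1, 1, 1, 1, 1, 9, 9]  # two expensive iterations at the end
    print(select_technique(times, 2, overhead=0.0))
    print(select_technique(times, 2, overhead=1.0))
```

In this toy setting the preferred technique changes as soon as the per-chunk scheduling overhead changes, which mirrors the abstract's point that no single technique wins in every scenario.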
Adaptive parallelism mapping in dynamic environments using machine learning
Modern day hardware platforms are parallel and diverse, ranging from mobiles to
data centers. Mainstream parallel applications execute in the same system competing
for resources. This resource contention may lead to a drastic degradation in a program’s
performance. In addition, the execution environment, composed of workloads
and hardware resources, is dynamic and unpredictable. Efficient matching of program
parallelism to machine parallelism under uncertainty is hard. The mapping policies
that determine the optimal allocation of work to threads should anticipate these variations.
This thesis proposes solutions to the mapping of parallel programs in dynamic environments.
It employs predictive modelling techniques to determine the best degree of
parallelism. Firstly, this thesis proposes a machine learning-based model to determine
the optimal thread number for a target program co-executing with varying workloads.
For this purpose, this offline trained model uses static code features and dynamic runtime
information as input.
Next, this thesis proposes a novel solution to monitor the proposed offline model
and adjust its decisions in response to changes in the environment. It develops a second
predictive model for determining what the future environment should look like if the current
thread prediction is optimal. Depending on how close this prediction is to the
actual environment, the predicted thread numbers are adjusted.
Furthermore, considering the multitude of potential execution scenarios where no
single policy is best suited in all cases, this work proposes an approach based on the
idea of a mixture of experts. It considers a number of offline experts, or mapping policies,
each specialized for a given scenario, and learns online which expert is best
for the current execution. When evaluated on highly dynamic executions, these solutions
are shown to outperform default, state-of-the-art adaptive, and analytic approaches.
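A minimal sketch of online expert selection follows. It is not the thesis's method: the two hand-written experts, the 8-core machine, and the use of environment-prediction error as the feedback signal (borrowed from the second contribution described above) are all assumptions made for illustration, where the thesis uses trained models.

```python
CORES = 8  # hypothetical machine size


def idle_threads(load):
    return CORES  # assumes the machine is ours: use every core


def idle_env(load):
    return 0  # predicts that co-running load disappears


def busy_threads(load):
    return max(1, CORES - int(load))  # leave room for co-runners


def busy_env(load):
    return load  # predicts that the current load persists


class OnlineSelector:
    """Trust the expert whose environment predictions have recently been closest."""

    def __init__(self, experts, decay=0.5):
        self.experts = experts  # name -> (thread policy, environment model)
        self.decay = decay
        self.error = {name: 0.0 for name in experts}
        self.pending = {}  # name -> environment predicted for the next step

    def step(self, load):
        # Score each expert's previous prediction against the load now observed.
        for name, predicted in self.pending.items():
            err = abs(predicted - load)
            self.error[name] = self.decay * self.error[name] + (1 - self.decay) * err
        best = min(self.experts, key=lambda n: self.error[n])
        # Record every expert's prediction so it can be scored on the next step.
        self.pending = {name: env(load) for name, (_, env) in self.experts.items()}
        return best, self.experts[best][0](load)


if __name__ == "__main__":
    selector = OnlineSelector({"idle": (idle_threads, idle_env),
                               "busy": (busy_threads, busy_env)})
    for load in [6, 6, 0]:
        print(load, selector.step(load))
```

The exponential decay lets the selector switch experts within a step or two when the co-running load changes, at the cost of reacting to short-lived noise.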
The fast multipole method at exascale
This thesis presents a top-to-bottom analysis of designing and implementing fast algorithms for current and future systems. We present new analysis, algorithmic techniques, and implementations of the Fast Multipole Method (FMM) for solving N-body problems. We target the FMM because it is broadly applicable to a variety of scientific particle simulations used to study electromagnetic, fluid, and gravitational phenomena, among others. Importantly, the FMM has asymptotically optimal time complexity with guaranteed approximation accuracy. As such, it is among the most attractive solutions for scalable particle simulation on future extreme-scale systems.
We specifically address two key challenges. The first challenge is how to engineer fast code for today’s platforms. We present the first in-depth study of multicore optimizations and tuning for FMM, along with a systematic approach for transforming a conventionally-parallelized FMM into a highly-tuned one. We introduce novel optimizations that significantly improve the within-node scalability of the FMM, thereby enabling high performance on multicore and manycore systems. The second challenge is how to understand scalability on future systems. We present a new algorithmic complexity analysis of the FMM that considers both intra- and inter-node communication costs. Using these models, we present results for choosing the optimal algorithmic tuning parameter. This analysis also yields the surprising prediction that although the FMM is largely compute-bound today, and therefore highly scalable on current systems, the trajectory of processor architecture designs, if there are no significant changes, could cause it to become communication-bound as early as the year 2015. This prediction suggests the utility of our analysis approach, which directly relates algorithmic and architectural characteristics, for enabling a new kind of high-level algorithm-architecture co-design.
To demonstrate the scientific significance of FMM, we present two applications, namely, direct simulation of blood, which is a multi-scale multi-physics problem, and large-scale biomolecular electrostatics. MoBo (Moving Boundaries) is the infrastructure for the direct numerical simulation of blood. It comprises two key algorithmic components, of which FMM is one. We were able to simulate blood flow using Stokesian dynamics on 200,000 cores of Jaguar, a petaflop system, and achieve a sustained performance of 0.7 Petaflop/s. The second application, which we propose as future work in this thesis, is biomolecular electrostatics, where we solve for the electrical potential using the boundary-integral formulation discretized with boundary element methods (BEM). The computational kernel in solving the large linear system is a dense matrix-vector multiply, which we propose can be calculated using our scalable FMM. We propose to begin with the two-dielectric problem, where the electrostatic field is calculated using two continuum dielectric media, the solvent and the molecule. This is only a first step towards solving biologically challenging problems that have more than two dielectric media, ion-exclusion layers, and solvent-filled cavities.
Finally, given the difficulty in producing high-performance scalable code, productivity is a key concern. Recently, numerical algorithms are being redesigned to take advantage of the architectural features of emerging multicore processors. These new classes of algorithms express fine-grained asynchronous parallelism and hence reduce the cost of synchronization. We performed the first extensive performance study of a recently proposed parallel programming model, called Concurrent Collections (CnC). In CnC, the programmer expresses her computation in terms of application-specific operations, partially ordered by semantic scheduling constraints. The CnC model is well-suited to expressing asynchronous-parallel algorithms, so we evaluate CnC using two dense linear algebra algorithms in this style for execution on state-of-the-art multicore systems. Our implementations in CnC were able to match and in some cases even exceed competing vendor-tuned and domain-specific library codes. We combine these two distinct research efforts by expressing FMM in CnC; our approach tries to marry performance with productivity, which will be critical on future systems. Looking forward, we would like to extend this to distributed-memory machines, specifically by implementing FMM in the new distributed CnC (distCnC) to express fine-grained parallelism, which would require significant effort in alternative models.