Dynamic Loop Scheduling Using MPI Passive-Target Remote Memory Access
Scientific applications often contain large computationally-intensive
parallel loops. Loop scheduling techniques aim to achieve load balanced
executions of such applications. For distributed-memory systems, existing
dynamic loop scheduling (DLS) libraries are typically MPI-based, and employ a
master-worker execution model to assign variably-sized chunks of loop
iterations. The master-worker execution model may adversely impact performance
due to the master-level contention. This work proposes a distributed
chunk-calculation approach that does not require the master-worker execution
scheme. Moreover, it considers the novel features in the latest MPI standards,
such as passive-target remote memory access, shared-memory window creation, and
atomic read-modify-write operations. To evaluate the proposed approach, five
well-known DLS techniques, two applications, and two heterogeneous hardware
setups have been considered. The DLS techniques implemented using the proposed
approach outperformed their counterparts implemented using the traditional
master-worker execution model.
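The idea of workers calculating their own chunks from shared state, instead of asking a master, can be illustrated with a minimal sketch. This is not the implementation from the abstract above: it assumes a single process in which threads stand in for MPI ranks, a lock emulates the atomic read-modify-write on a shared counter, and the guided self-scheduling rule (chunk = ceil(remaining / workers)) is picked as one example DLS technique. The names `SharedLoopCounter`, `claim_chunk`, and `run_workers` are hypothetical.

```python
import math
import threading


class SharedLoopCounter:
    """Stand-in for a remotely accessible loop counter.

    The lock emulates an atomic read-modify-write; in an MPI setting this
    state would live in a memory window (assumption, not shown here).
    """

    def __init__(self, total_iterations, num_workers):
        self.total = total_iterations
        self.num_workers = num_workers
        self._next = 0
        self._lock = threading.Lock()

    def claim_chunk(self):
        """Atomically claim the next chunk; return (start, end) or None when done."""
        with self._lock:
            remaining = self.total - self._next
            if remaining <= 0:
                return None
            # Guided self-scheduling rule: chunk sizes shrink as the loop drains.
            size = math.ceil(remaining / self.num_workers)
            start = self._next
            self._next += size
            return start, start + size


def run_workers(total_iterations, num_workers):
    """Let every worker claim chunks on its own until the loop is exhausted."""
    counter = SharedLoopCounter(total_iterations, num_workers)
    claimed = [[] for _ in range(num_workers)]

    def worker(rank):
        while True:
            chunk = counter.claim_chunk()
            if chunk is None:
                break
            # A real worker would execute iterations chunk[0]..chunk[1]-1 here.
            claimed[rank].append(chunk)

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return claimed
```

Calling `run_workers(1000, 4)` returns, per worker, the half-open iteration ranges it claimed. Which worker gets which chunk depends on thread timing, but the chunks together always cover the loop exactly once.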
Task mapping on a dragonfly supercomputer
The dragonfly network topology has recently gained traction in the design of high performance computing (HPC) systems and has been implemented in large-scale supercomputers. The impact of task mapping, i.e., placement of MPI ranks onto compute cores, on the communication performance of applications on dragonfly networks has not been comprehensively investigated on real large-scale systems. This paper demonstrates that task mapping affects the communication overhead significantly in dragonflies and the magnitude of this effect is sensitive to the application, job size, and the OpenMP settings. Among the three task mapping algorithms we study (in-order, random, and recursive coordinate bisection), selecting a suitable task mapper reduces application communication time by up to 47%.
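Of the three mappers named above, recursive coordinate bisection is the least self-explanatory, so a generic sketch may help. It is a textbook-style version, not the mapper evaluated in the paper: it assumes each task has a coordinate tuple, splits the task set along its widest coordinate, and recurses until the requested number of groups is reached. Assigning the resulting groups to nodes is left out. The function name `rcb` and its parameters are hypothetical.

```python
def rcb(points, ids, parts):
    """Split task `ids` into `parts` groups by recursive coordinate bisection.

    `points[i]` is the coordinate tuple of task i (an assumed input format).
    """
    if parts == 1:
        return [list(ids)]
    if not ids:
        return [[] for _ in range(parts)]
    dims = len(points[ids[0]])
    # Cut along the dimension in which the current task set is widest.
    axis = max(
        range(dims),
        key=lambda d: max(points[i][d] for i in ids) - min(points[i][d] for i in ids),
    )
    ordered = sorted(ids, key=lambda i: points[i][axis])
    left_parts = parts // 2
    # Give each side a share of tasks proportional to its number of groups.
    cut = len(ordered) * left_parts // parts
    return (rcb(points, ordered[:cut], left_parts)
            + rcb(points, ordered[cut:], parts - left_parts))


# Example: 16 tasks on a 4 x 4 grid, split into 4 groups of neighbouring tasks.
grid = [(x, y) for x in range(4) for y in range(4)]
groups = rcb(grid, list(range(16)), 4)
```

Tasks that are close in coordinate space end up in the same group, which is the property a mapper can exploit to keep communicating ranks near each other.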
Hierarchical Dynamic Loop Self-Scheduling on Distributed-Memory Systems Using an MPI+MPI Approach
Computationally-intensive loops are the primary source of parallelism in
scientific applications. Such loops are often irregular and a balanced
execution of their loop iterations is critical for achieving high performance.
However, several factors may lead to an imbalanced load execution, such as
problem characteristics and algorithmic and systemic variations. Dynamic loop
self-scheduling (DLS) techniques are devised to mitigate these factors, and
consequently, improve application performance. On distributed-memory systems,
DLS techniques can be implemented using a hierarchical master-worker execution
model and are, therefore, called hierarchical DLS techniques. These techniques
self-schedule loop iterations at two levels of hardware parallelism: across and
within compute nodes. Hybrid programming approaches that combine the message
passing interface (MPI) with open multi-processing (OpenMP) dominate the
implementation of hierarchical DLS techniques. The MPI-3 standard includes the
feature of sharing memory regions among MPI processes. This feature introduced
the MPI+MPI approach that simplifies the implementation of parallel scientific
applications. The present work designs and implements hierarchical DLS
techniques by exploiting the MPI+MPI approach. Four well-known DLS techniques
are considered in the evaluation proposed herein. The results indicate certain
performance advantages of the proposed approach compared to the hybrid
MPI+OpenMP approach.
TANGO: Transparent heterogeneous hardware Architecture deployment for eNergy Gain in Operation
The paper is concerned with the issue of how software systems actually use
Heterogeneous Parallel Architectures (HPAs), with the goal of optimizing power
consumption on these resources. It argues the need for novel methods and tools
to support software developers aiming to optimise power consumption resulting
from designing, developing, deploying and running software on HPAs, while
maintaining other quality aspects of software to adequate and agreed levels. To
do so, a reference architecture to support energy efficiency at application
construction, deployment, and operation is discussed, as well as its
implementation and evaluation plans. Comment: Part of the Program Transformation for Programmability in
Heterogeneous Architectures (PROHA) workshop, Barcelona, Spain, 12th March
2016, 7 pages, LaTeX, 3 PNG figure
Exploring Scientific Application Performance Using Large Scale Object Storage
One of the major performance and scalability bottlenecks in large scientific
applications is parallel reading and writing to supercomputer I/O systems. The
usage of parallel file systems and the POSIX consistency requirements, to which
all traditional HPC parallel I/O interfaces adhere, limit the
scalability of scientific applications. Object storage is a widely used storage
technology in cloud computing and is increasingly proposed for HPC workloads
to address and improve the current scalability and performance of I/O in
scientific applications. While object storage is a promising technology, it is
still unclear how scientific applications will use object storage and what the
main performance benefits will be. This work addresses these questions, by
emulating an object storage used by a traditional scientific application and
evaluating potential performance benefits. We show that scientific applications
can benefit from the usage of object storage on large scales. Comment: Preprint submitted to WOPSSS workshop at ISC 201
Towards a goal-oriented agent-based simulation framework for high-performance computing
Currently, agent-based simulation frameworks force the user to choose between simulations involving a large number of agents (at the expense of limited agent reasoning capability) or simulations including agents with increased reasoning capabilities (at the expense of a limited number of agents per simulation). This paper describes a first attempt at putting goal-oriented agents into large agent-based (micro-)simulations. We discuss a model for goal-oriented agents in High-Performance Computing (HPC) and then briefly discuss its implementation in PyCOMPSs (a library that eases the parallelisation of tasks) to build such a platform that benefits from a large number of agents with the capacity to execute complex cognitive agents. Peer Reviewed. Postprint (author's final draft)
MPI-Vector-IO: Parallel I/O and Partitioning for Geospatial Vector Data
In recent times, geospatial datasets are growing in terms of size, complexity and heterogeneity. High performance systems are needed to analyze such data to produce actionable insights in an efficient manner. For polygonal (a.k.a. vector) datasets, operations such as I/O, data partitioning, communication, and load balancing become challenging in a cluster environment. In this work, we present MPI-Vector-IO, a parallel I/O library that we have designed using MPI-IO specifically for partitioning and reading irregular vector data formats such as Well Known Text. It makes MPI aware of spatial data and spatial primitives, and provides support for spatial data types embedded within collective computation and communication using the MPI message-passing library. These abstractions along with parallel I/O support are useful for parallel Geographic Information System (GIS) application development on HPC platforms.