Search CORE

114 research outputs found

Design and Implementation of MPICH2 over InfiniBand with RDMA Support

Author: Ashton David
Buntinas Darius
Gropp William
Jiang Weihang
Liu Jiuxing
Panda Dhabaleswar K.
Toonen Brian
Wyckoff Pete
Publication venue
Publication date: 30/10/2003
Field of study

For several years, MPI has been the de facto standard for writing parallel applications. One of the most popular MPI implementations is MPICH. Its successor, MPICH2, features a completely new design that provides more performance and flexibility. To ensure portability, it has a hierarchical structure based on which porting can be done at different levels. In this paper, we present our experiences designing and implementing MPICH2 over InfiniBand. Because of its high performance and open standard, InfiniBand is gaining popularity in the area of high-performance computing. Our study focuses on optimizing the performance of MPI-1 functions in MPICH2. One of our objectives is to exploit Remote Direct Memory Access (RDMA) in Infiniband to achieve high performance. We have based our design on the RDMA Channel interface provided by MPICH2, which encapsulates architecture-dependent communication functionalities into a very small set of functions. Starting with a basic design, we apply different optimizations and also propose a zero-copy-based design. We characterize the impact of our optimizations and designs using microbenchmarks. We have also performed an application-level evaluation using the NAS Parallel Benchmarks. Our optimized MPICH2 implementation achieves 7.6

\mu

s latency and 857 MB/s bandwidth, which are close to the raw performance of the underlying InfiniBand layer. Our study shows that the RDMA Channel interface in MPICH2 provides a simple, yet powerful, abstraction that enables implementations with high performance by exploiting RDMA operations in InfiniBand. To the best of our knowledge, this is the first high-performance design and implementation of MPICH2 on InfiniBand using RDMA support.Comment: 12 pages, 17 figure

arXiv.org e-Print Archive

CiteSeerX

SKaMPI: A Comprehensive Benchmark for Public Benchmarking of MPI

Author
Publication venue: 'Hindawi Limited'
Publication date: 01/01/2002
Field of study

Crossref

One-Sided Communication for High Performance Computing Applications

Author: Barrett Brian William
Publication venue: [Bloomington, Ind.] : Indiana University
Publication date: 01/01/2009
Field of study

Thesis (Ph.D.) - Indiana University, Computer Sciences, 2009Parallel programming presents a number of critical challenges to application developers. Traditionally, message passing, in which a process explicitly sends data and another explicitly receives the data, has been used to program parallel applications. With the recent growth in multi-core processors, the level of parallelism necessary for next generation machines is cause for concern in the message passing community. The one-sided programming paradigm, in which only one of the two processes involved in communication actively participates in message transfer, has seen increased interest as a potential replacement for message passing. One-sided communication does not carry the heavy per-message overhead associated with modern message passing libraries. The paradigm offers lower synchronization costs and advanced data manipulation techniques such as remote atomic arithmetic and synchronization operations. These combine to present an appealing interface for applications with random communication patterns, which traditionally present message passing implementations with difficulties. This thesis presents a taxonomy of both the one-sided paradigm and of applications which are ideal for the one-sided interface. Three case studies, based on real-world applications, are used to motivate both taxonomies and verify the applicability of the MPI one-sided communication and Cray SHMEM one-sided interfaces to real-world problems. While our results show a number of short-comings with existing implementations, they also suggest that a number of applications could beneﬁt from the one-sided paradigm. Finally, an implementation of the MPI one-sided interface within Open MPI is presented, which provides a number of unique performance features necessary for efficient use of the one-sided programming paradigm

IUScholarWorks (University of Indiana)

Computational Methods in Science and Engineering : Proceedings of the Workshop SimLabs@KIT, November 29 - 30, 2010, Karlsruhe, Germany

Author: Kirner Ole
Kondov Ivan
Poghosyan Gevorg
Schmitz Frank
Schneider Olaf
Publication venue: KIT Scientific Publishing, Karlsruhe
Publication date: 01/01/2011
Field of study

In this proceedings volume we provide a compilation of article contributions equally covering applications from different research fields and ranging from capacity up to capability computing. Besides classical computing aspects such as parallelization, the focus of these proceedings is on multi-scale approaches and methods for tackling algorithm and data complexity. Also practical aspects regarding the usage of the HPC infrastructure and available tools and software at the SCC are presented

KITopen

Recommended from our members

Cray X1 Evaluation Status Report

Author: Vetter J.S.
Publication venue: 'Office of Scientific and Technical Information (OSTI)'
Publication date: 09/02/2004
Field of study

On August 15, 2002 the Department of Energy (DOE) selected the Center for Computational Sciences (CCS) at Oak Ridge National Laboratory (ORNL) to deploy a new scalable vector supercomputer architecture for solving important scientific problems in climate, fusion, biology, nanoscale materials and astrophysics. ''This program is one of the first steps in an initiative designed to provide U.S. scientists with the computational power that is essential to 21st century scientific leadership,'' said Dr. Raymond L. Orbach, director of the department's Office of Science The Cray X1 is an attempt to incorporate the best aspects of previous Cray vector systems and massively-parallel-processing (MPP) systems into one design. Like the Cray T90, the X1 has high memory bandwidth, which is key to realizing a high percentage of theoretical peak performance. Like the Cray T3E, the X1 has a high-bandwidth, low-latency, scalable interconnect, and scalable system software. And, like the Cray SV1, the X1 leverages commodity off-the-shelf (CMOS) technology and incorporates non-traditional vector concepts, like vector caches and multi-streaming processors. In FY03, CCS procured a 256-processor Cray X1 to evaluate the processors, memory subsystem, scalability of the architecture, software environment and to predict the expected sustained performance on key DOE applications codes. The results of the micro-benchmarks and kernel benchmarks show the architecture of the Cray X1 to be exceptionally fast for most operations. The best results are shown on large problems, where it is not possible to fit the entire problem into the cache of the processors. These large problems are exactly the types of problems that are important for the DOE and ultra-scale simulation

UNT Digital Library

Cray X1 Evaluation Status Report

Author
Publication venue: 'Office of Scientific and Technical Information (OSTI)'
Publication date
Field of study

Crossref

Runtime MPI Correctness Checking with a Scalable Tools Infrastructure

Author: Hilbrich Tobias
Publication venue
Publication date: 08/06/2015
Field of study

Increasing computational demand of simulations motivates the use of parallel computing systems. At the same time, this parallelism poses challenges to application developers. The Message Passing Interface (MPI) is a de-facto standard for distributed memory programming in high performance computing. However, its use also enables complex parallel programing errors such as races, communication errors, and deadlocks. Automatic tools can assist application developers in the detection and removal of such errors. This thesis considers tools that detect such errors during an application run and advances them towards a combination of both precise checks (neither false positives nor false negatives) and scalability. This includes novel hierarchical checks that provide scalability, as well as a formal basis for a distributed deadlock detection approach. At the same time, the development of parallel runtime tools is challenging and time consuming, especially if scalability and portability are key design goals. Current tool development projects often create similar tool components, while component reuse remains low. To provide a perspective towards more efficient tool development, which simplifies scalable implementations, component reuse, and tool integration, this thesis proposes an abstraction for a parallel tools infrastructure along with a prototype implementation. This abstraction overcomes the use of multiple interfaces for different types of tool functionality, which limit flexible component reuse. Thus, this thesis advances runtime error detection tools and uses their redesign and their increased scalability requirements to apply and evaluate a novel tool infrastructure abstraction. The new abstraction ultimately allows developers to focus on their tool functionality, rather than on developing or integrating common tool components. The use of such an abstraction in wide ranges of parallel runtime tool development projects could greatly increase component reuse. Thus, decreasing tool development time and cost. An application study with up to 16,384 application processes demonstrates the applicability of both the proposed runtime correctness concepts and of the proposed tools infrastructure

Technische Universität Dresden: Qucosa

Photonic architecture for scalable quantum information processing in NV-diamond

Author: Ashley M. Stephens
Burkhard Scharfenberger
Jörg Schmiedmayer
Kae Nemoto
Kathrin Buczak
Mark S. Everitt
Michael Trupke
Simon J. Devitt
Tobias Nöbauer
William J. Munro
Publication venue: 'American Physical Society (APS)'
Publication date: 17/09/2013
Field of study

Physics and information are intimately connected, and the ultimate information processing devices will be those that harness the principles of quantum mechanics. Many physical systems have been identified as candidates for quantum information processing, but none of them are immune from errors. The challenge remains to find a path from the experiments of today to a reliable and scalable quantum computer. Here, we develop an architecture based on a simple module comprising an optical cavity containing a single negatively-charged nitrogen vacancy centre in diamond. Modules are connected by photons propagating in a fiber-optical network and collectively used to generate a topological cluster state, a robust substrate for quantum information processing. In principle, all processes in the architecture can be deterministic, but current limitations lead to processes that are probabilistic but heralded. We find that the architecture enables large-scale quantum information processing with existing technology.Comment: 24 pages, 14 Figures. Comment welcom

arXiv.org e-Print Archive

Directory of Open Access Journals

High Performance Computing Applied to Structural Analysis of Concrete Structures

Author: Mena Ortiz Jose A
Publication venue: UNM Digital Repository
Publication date: 28/11/2016
Field of study

Continuum mechanics models are the main tool in structural analysis. Due to their continuity assumption, some of them do not predict accurately the behavior of concrete structures. While these models are widely used by design engineers, many are flawed for fundamental reasons. The State-Based Peridynamic Lattice Model (SPLM) is presented in this thesis as a viable alternative to continuum models. The SPLM is shown to have a simple formulation that allows the engineer to fully understand the underlying theory. The elastic, plastic and damage SPLM models are presented. Moreover, SPLM is shown to be capable of modelling essential mechanisms in concrete structures. SPLM is run on massive parallel computers, since it requires large computational power. In this thesis, a new parallel implementation is presented, and a study of the performance of the original and the new system is presented

Runtime support for irregular computation in MPI-based applications

Author: Zhao Xin
Publication venue
Publication date: 01/05/2016
Field of study

In recent years there are increasing number of applications that have been using irregular computation models in various domains, such as computational chemistry, bioinformatics, nuclear reactor simulation and social network analysis. Due to the irregular and data-dependent communication patterns and sparse data structures involved in those applications, the traditional parallel programming model and runtime need to be carefully designed and implemented in order to accommodate the performance and scalability requirements of those irregular applications on large-scale systems. The Message Passing Interface (MPI) is the industry standard communication library for high performance computing. However, whether MPI can serve as a suitable programming model / runtime for irregular applications or not is one of the most debated aspects in the community. The goal of this thesis is to investigate the suitability of MPI to irregular applications. This thesis consists of two subtopics. The first subtopic focuses on improving MPI runtime to support the irregular applications from perspective of scalability and performance. The first three parts in this subtopic focus on MPI one-sided communication. In the first part, we present a thorough survey of current MPI one-sided implementations and illustrate scalability limitations in those implementations. In the second part, we propose a new design and implementation of MPI one-sided communication, called ScalaRMA, to effectively address those scalability limitations. The third part in this subtopic focuses on various issuing strategies in MPI one-sided communication. We propose an adaptive issuing strategy which can adaptively choose between delayed issuing strategy and eager issuing strategy in MPI runtime to achieve high performance based on current communication volume in MPI-based application. The last part in this subtopic is to tackle the scalability limitations in the virtual connection (VC) objects in MPI implementation. We propose a scalable design to reduce the memory consumption of VC objects in MPI runtime. The second subtopic of this thesis focuses on improving MPI programming model to better support the irregular applications. Traditional two-sided data movement model in MPI standard designed for scientific computation provides a paradigm for user to specify how to move the data between processes, however, it does not provide interface to flexibly manage the computation, which means user needs to explicitly manage where the computation should be performed. This model is not well suited for irregular applications which involve irregular and data-dependent communication pattern. In this work, we combine Active Messages (AM), an alternative programming paradigm which is more suitable for irregular computations, with traditional MPI data movement model, and propose a generalized MPI-interoperable Active Messages framework (MPI-AM). The framework allows MPI-based applications to incrementally use AMs only when necessary, avoiding rewriting the entire MPI-based application. Such framework integrates data movement and computation together in the programming model and MPI can coordinate the computation and communication in a much more flexible manner. In this subtopic, we propose several strategies including message streaming, buffer management and asynchronous processing, in order to efficiently handle AMs inside MPI. We also propose subtle correctness semantics of MPI-AM to define how AMs can work correctly with other MPI messages in the system, from perspectives of memory consistency, concurrency, ordering and atomicity

Illinois Digital Environment for Access to Learning and Scholarship Repository