
    Grid and P2P middleware for scientific computing systems

    Grid and P2P systems have achieved notable success in the domain of scientific and engineering applications, which commonly demand considerable amounts of computational resources. However, Grid and P2P systems remain difficult for domain scientists and engineers to use, owing to the inherent complexity of the corresponding middleware and the lack of adequate documentation. In this paper we survey recent developments in Grid and P2P middleware in the context of scientific computing systems. The differences in the approaches taken for Grid and P2P middleware, as well as the points the two paradigms have in common, are highlighted. In addition, we discuss the corresponding programming models, languages, and applications.

    09191 Abstracts Collection -- Fault Tolerance in High-Performance Computing and Grids

    From June 4-8, 2009, the Dagstuhl Seminar 09191 "Fault Tolerance in High-Performance Computing and Grids" was held in Schloss Dagstuhl - Leibniz Center for Informatics. During the seminar, several participants presented their current research, and ongoing work and open problems were discussed. Abstracts of the presentations given during the seminar, as well as abstracts of seminar results and ideas, are put together in this paper. The first section describes the seminar topics and goals in general. Links to extended abstracts or full papers are provided, if available. Slides of the talks and abstracts are available online at http://www.dagstuhl.de/Materials/index.en.phtml?09191

    High performance Peer-to-Peer distributed computing with application to obstacle problem

    This paper deals with high performance Peer-to-Peer computing applications. We concentrate on the solution of large scale numerical simulation problems via distributed iterative methods. We present the current version of an environment that allows direct communication between peers. This environment is based on a self-adaptive communication protocol. The protocol configures itself automatically and dynamically according to application requirements, such as the computation scheme, and elements of context, such as the topology, by choosing the most appropriate communication mode between peers. A first series of computational experiments is presented and analyzed for the obstacle problem.
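
    The mode-selection idea can be illustrated with a small, hypothetical Python sketch (the names Mode and choose_mode and the decision rule are illustrative assumptions, not the paper's actual protocol): each peer-to-peer link is assigned a communication mode from the requested iterative scheme and a coarse view of the topology.

        from enum import Enum

        class Mode(Enum):
            SYNCHRONOUS = "blocking exchange each iteration"
            ASYNCHRONOUS = "non-blocking, latest available data"

        def choose_mode(scheme: str, same_cluster: bool) -> Mode:
            """Pick a communication mode for one peer-to-peer link.

            scheme       -- iterative scheme requested by the application
                            ("synchronous" or "asynchronous" iterations)
            same_cluster -- True if both peers share a low-latency cluster
            """
            if scheme == "synchronous" and same_cluster:
                # Synchronous iterations tolerate blocking exchanges on fast links.
                return Mode.SYNCHRONOUS
            # Over wide-area links, or for asynchronous iterative methods,
            # prefer non-blocking exchanges so slow peers do not stall the others.
            return Mode.ASYNCHRONOUS

        # Example: a link between peers in different administrative domains.
        print(choose_mode("synchronous", same_cluster=False))  # Mode.ASYNCHRONOUS

    In a real self-adaptive protocol the decision would be revisited at run time as the context changes; the sketch only fixes the interface of such a rule.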

    Toward Reliable and Efficient Message Passing Software for HPC Systems: Fault Tolerance and Vector Extension

    As the scale of High-Performance Computing (HPC) systems continues to grow, researchers devote themselves to achieving the best performance when running long computing jobs on these systems. My research focuses on the reliability and efficiency of HPC software. First, as systems become larger, the mean time to failure (MTTF) of these HPC systems is negatively impacted and tends to decrease, so handling system failures becomes a prime challenge. My research presents a general design and implementation of an efficient runtime-level failure detection and propagation strategy targeting large-scale, dynamic systems that is able to detect both node and process failures. It uses multiple overlapping topologies to optimize detection and propagation, minimizing the incurred overhead and guaranteeing the scalability of the entire framework. Results from different machines and benchmarks, compared to related works, show that my design and implementation outperform non-HPC solutions significantly and are competitive with specialized HPC solutions that can manage only MPI applications. Second, I employ instruction-level parallelism to achieve optimal performance. Recent processors support long vector extensions, which enable researchers to exploit the potential peak performance of target architectures. Intel introduced Advanced Vector Extensions (AVX-512 and AVX2) instructions for the x86 Instruction Set Architecture (ISA), and Arm introduced the Scalable Vector Extension (SVE) with a new set of A64 instructions; both enable greater parallelism. My research utilizes long vector reduction instructions to improve the performance of MPI reduction operations, and uses the gather and scatter features to speed up the packing and unpacking operations in MPI. The evaluation of the resulting software stack under different scenarios demonstrates that the approach is not only efficient but also generalizable to many vector architectures.
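
    A minimal sketch of the first idea, heartbeat-based failure detection over one overlay topology, is given below in Python. The class name RingDetector, the single-ring overlay, and the timeout value are illustrative assumptions; the dissertation combines multiple overlapping topologies and integrates the detector with the MPI runtime.

        import time

        class RingDetector:
            """Each rank heartbeats its successor on a ring and suspects it
            when no heartbeat arrives within `timeout` seconds."""

            def __init__(self, rank: int, world: list, timeout: float = 1.0):
                self.rank = rank
                self.alive = set(world)                    # ranks believed alive
                self.timeout = timeout
                now = time.monotonic()
                self.last_heartbeat = {r: now for r in world}

            def successor(self) -> int:
                # Next alive rank after self.rank on the ring.
                ranks = sorted(self.alive)
                i = ranks.index(self.rank)
                return ranks[(i + 1) % len(ranks)]

            def on_heartbeat(self, sender: int) -> None:
                self.last_heartbeat[sender] = time.monotonic()

            def check(self) -> list:
                """Return newly suspected ranks; the caller is expected to
                propagate them to the rest of the group (e.g. along a tree)."""
                succ = self.successor()
                if succ == self.rank:
                    return []                              # alone on the ring
                if time.monotonic() - self.last_heartbeat[succ] > self.timeout:
                    self.alive.discard(succ)
                    return [succ]
                return []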
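
    The second idea, lane-wise reductions, can be mimicked in plain Python with NumPy standing in for AVX-512/SVE registers; the lane width of 8 doubles (one 512-bit register) and the function name are assumptions for illustration, not the implementation evaluated in the dissertation.

        import numpy as np

        LANES = 8  # doubles per 512-bit register (illustrative assumption)

        def vector_sum_reduce(local, incoming):
            """Element-wise sum of two reduction buffers, LANES elements at a
            time, mirroring how a vectorized MPI_SUM loop walks the buffers."""
            out = np.asarray(local, dtype=np.float64).copy()
            inc = np.asarray(incoming, dtype=np.float64)
            n = out.size
            full = n - n % LANES
            for i in range(0, full, LANES):        # one "vector add" per chunk
                out[i:i + LANES] += inc[i:i + LANES]
            out[full:] += inc[full:]               # scalar tail
            return out

        # Example: reduce two 20-element buffers (two full chunks + a tail of 4).
        print(vector_sum_reduce(np.ones(20), np.arange(20.0))[:5])

    A production implementation would also cover the other MPI reduction operators and the gather/scatter-based pack and unpack path mentioned above.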