Achieving Efficient Strong Scaling with PETSc using Hybrid MPI/OpenMP Optimisation
The increasing number of processing elements and the decreasing memory-to-core
ratio in modern high-performance platforms make efficient strong scaling a key
requirement for numerical algorithms. In order to achieve efficient scalability
on massively parallel systems, scientific software must evolve across the entire
stack to exploit the multiple levels of parallelism exposed in modern
architectures. In this paper we demonstrate the use of hybrid MPI/OpenMP
parallelisation to optimise parallel sparse matrix-vector multiplication in
PETSc, a widely used scientific library for the scalable solution of partial
differential equations. Using large matrices generated by Fluidity, an open
source CFD application code which uses PETSc as its linear solver engine, we
evaluate the effect of explicit communication overlap using task-based
parallelism and show how to further improve performance by explicitly load
balancing threads within MPI processes. We demonstrate a significant speedup
over the pure-MPI mode and efficient strong scaling of sparse matrix-vector
multiplication on Fujitsu PRIMEHPC FX10 and Cray XE6 systems.
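The idea of load balancing threads within a process can be illustrated with a small sketch. This is not PETSc's implementation: it assumes a matrix stored in CSR form and splits the rows so that each thread receives roughly the same number of nonzeros, rather than the same number of rows. The function names (`balance_rows_by_nnz`, `spmv_rows`, `spmv_balanced`) are hypothetical, and the per-thread chunks are processed one after another here rather than by real OpenMP threads.

```python
# Sketch only: CSR sparse matrix-vector product with rows partitioned by
# nonzero count. All names are hypothetical, not PETSc API.
from bisect import bisect_left


def balance_rows_by_nnz(row_ptr, num_threads):
    """Return row boundaries so each chunk holds about nnz/num_threads nonzeros."""
    n_rows = len(row_ptr) - 1
    nnz = row_ptr[-1]
    bounds = [0]
    for t in range(1, num_threads):
        target = nnz * t / num_threads
        # First row whose starting offset reaches the target nonzero count.
        cut = bisect_left(row_ptr, target)
        # Keep boundaries monotone and inside the matrix.
        cut = min(max(cut, bounds[-1]), n_rows)
        bounds.append(cut)
    bounds.append(n_rows)
    return bounds


def spmv_rows(row_ptr, col_idx, values, x, y, start, end):
    """Compute y[start:end] = A[start:end, :] @ x for a CSR matrix A."""
    for i in range(start, end):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y[i] = acc


def spmv_balanced(row_ptr, col_idx, values, x, num_threads=2):
    """Run the product chunk by chunk; each chunk stands in for one thread."""
    y = [0.0] * (len(row_ptr) - 1)
    bounds = balance_rows_by_nnz(row_ptr, num_threads)
    for t in range(num_threads):
        spmv_rows(row_ptr, col_idx, values, x, y, bounds[t], bounds[t + 1])
    return y, bounds
```

With one dense row and several sparse rows, an equal-rows split would leave one thread with most of the work, whereas the nonzero-based split assigns the dense row to a chunk of its own.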
F-MPJ: scalable Java message-passing communications on parallel systems
This is a post-peer-review, pre-copyedit version of an article published in The Journal of Supercomputing. The final authenticated version is available online at: https://doi.org/10.1007/s11227-009-0270-0
[Abstract] This paper presents F-MPJ (Fast MPJ), a scalable and efficient Message-Passing in Java (MPJ) communication middleware for parallel computing. The increasing interest in Java as the programming language of the multi-core era demands scalable performance on hybrid architectures (with both shared and distributed memory spaces). However, current Java communication middleware lacks efficient communication support. F-MPJ improves this situation by: (1) providing efficient non-blocking communication, which allows communication overlapping and thus scalable performance; (2) taking advantage of shared memory systems and high-performance networks through the use of our high-performance Java sockets implementation (named JFS, Java Fast Sockets); (3) avoiding the use of communication buffers; and (4) optimizing MPJ collective primitives. Thus, F-MPJ significantly improves the scalability of current MPJ implementations. A performance evaluation on an InfiniBand multi-core cluster has shown that F-MPJ communication primitives outperform representative MPJ libraries by up to 60 times. Furthermore, the use of F-MPJ in communication-intensive MPJ codes has increased their performance by up to seven times.
Funding: Ministerio de Educación y Ciencia; TIN2004-07797-C02. Ministerio de Educación y Ciencia; TIN2007-67537-C03-2. Xunta de Galicia; PGIDIT06PXIB105228P.
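The communication overlapping that non-blocking primitives allow can be illustrated with a small sketch. It is written in Python rather than Java and does not use F-MPJ or any MPJ API: a worker thread and an in-memory queue stand in for the network, and the names `isend` and `overlap_demo` are hypothetical. The point is only the pattern: start the transfer, compute while it is in flight, then wait on the request.

```python
# Sketch only: overlap of computation with a simulated non-blocking send.
# A thread pool and a queue stand in for the network; no MPJ API is used.
import queue
from concurrent.futures import ThreadPoolExecutor


def isend(executor, channel, payload):
    """Hypothetical non-blocking send: returns a request (future) immediately."""
    return executor.submit(channel.put, payload)


def overlap_demo(data):
    """Start a send, do local work while it proceeds, then wait for completion."""
    channel = queue.Queue()
    with ThreadPoolExecutor(max_workers=1) as executor:
        request = isend(executor, channel, list(data))  # transfer starts here
        local_sum = sum(v * v for v in data)            # computation overlaps it
        request.result()                                # block until send is done
    received = channel.get()
    return local_sum, received
```

A blocking send would force the local computation to start only after the transfer finished. With the request-and-wait pattern, the two can proceed concurrently.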
A multi-agent architecture for internet distributed computing system
This thesis presents a taxonomy of agent-based distributed computing systems. Based on this taxonomy, the design, implementation, analysis, and distribution protocol of a multi-agent architecture for an internet-based distributed computing system were developed. A prototype of the designed architecture was implemented on Spider III using the IBM Aglets software development kit (ASDK 2.0) and the Java language.
QR Factorization of Tall and Skinny Matrices in a Grid Computing Environment
Previous studies have reported that common dense linear algebra operations do
not achieve speedup when using multiple geographical sites of a computational
grid. Because such operations are the building blocks of most scientific
applications, conventional supercomputers are still strongly predominant in
high-performance computing and the use of grids for speeding up large-scale
scientific problems is limited to applications exhibiting parallelism at a
higher level. We have identified two performance bottlenecks in the distributed
memory algorithms implemented in ScaLAPACK, a state-of-the-art dense linear
algebra library. First, because ScaLAPACK assumes a homogeneous communication
network, the implementations of ScaLAPACK algorithms lack locality in their
communication pattern. Second, the number of messages sent in the ScaLAPACK
algorithms is significantly greater than in other algorithms that trade flops for
communication. In this paper, we present a new approach for computing a QR
factorization -- one of the main dense linear algebra kernels -- of tall and
skinny matrices in a grid computing environment that overcomes these two
bottlenecks. Our contribution is to articulate a recently proposed algorithm
(Communication Avoiding QR) with a topology-aware middleware (QCG-OMPI) in
order to confine intensive communications (ScaLAPACK calls) within the
different geographical sites. An experimental study conducted on the Grid'5000
platform shows that the resulting performance increases linearly with the
number of geographical sites on large-scale problems (and is in particular
consistently higher than ScaLAPACK's).
Comment: Accepted at IPDPS10 (IEEE International Parallel & Distributed
Processing Symposium 2010 in Atlanta, GA, USA).
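The structure of a communication-avoiding QR factorization for tall and skinny matrices can be sketched as follows. This is a simplified illustration, not the paper's implementation: it uses a single flat reduction step instead of a reduction tree, computes only the R factor, and replaces the per-site ScaLAPACK calls with a plain modified Gram-Schmidt routine. The function names are hypothetical, and every row block is assumed to have full column rank and at least as many rows as columns.

```python
# Sketch only: R factor of a tall-skinny QR via block-local factorizations.
# Each row block stands in for one geographical site; names are hypothetical.
import math


def qr_r_factor(a):
    """Return the upper-triangular R of a thin QR of `a` (a list of rows).

    Uses modified Gram-Schmidt; assumes `a` has full column rank.
    """
    m, n = len(a), len(a[0])
    cols = [[a[i][j] for i in range(m)] for j in range(n)]  # column-major copy
    r = [[0.0] * n for _ in range(n)]
    for j in range(n):
        r[j][j] = math.sqrt(sum(v * v for v in cols[j]))
        q = [v / r[j][j] for v in cols[j]]
        for k in range(j + 1, n):
            r[j][k] = sum(qv * cv for qv, cv in zip(q, cols[k]))
            cols[k] = [cv - r[j][k] * qv for cv, qv in zip(cols[k], q)]
    return r


def tsqr_r_factor(a, num_blocks):
    """Factor each row block locally, then factor the stacked local R factors."""
    m = len(a)
    step = -(-m // num_blocks)  # ceiling division: rows per block
    local_rs = [qr_r_factor(a[s:s + step]) for s in range(0, m, step)]
    # Only the small n-by-n R factors would need to cross site boundaries.
    stacked = [row for r in local_rs for row in r]
    return qr_r_factor(stacked)
```

The heavy work on each block stays local, and the only data combined across blocks is one small triangular factor per block, which is the property that lets intensive communication be confined within a site.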