
    Achieving Efficient Strong Scaling with PETSc using Hybrid MPI/OpenMP Optimisation

    The increasing number of processing elements and the decreasing memory-to-core ratio in modern high-performance platforms make efficient strong scaling a key requirement for numerical algorithms. To achieve efficient scalability on massively parallel systems, scientific software must evolve across the entire stack to exploit the multiple levels of parallelism exposed in modern architectures. In this paper we demonstrate the use of hybrid MPI/OpenMP parallelisation to optimise parallel sparse matrix-vector multiplication in PETSc, a widely used scientific library for the scalable solution of partial differential equations. Using large matrices generated by Fluidity, an open-source CFD application code that uses PETSc as its linear solver engine, we evaluate the effect of explicit communication overlap using task-based parallelism and show how to further improve performance by explicitly load balancing threads within MPI processes. We demonstrate a significant speedup over the pure-MPI mode and efficient strong scaling of sparse matrix-vector multiplication on Fujitsu PRIMEHPC FX10 and Cray XE6 systems.
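The abstract does not spell out how threads are load balanced, so the following is only a minimal sketch of one common approach: splitting the rows of a CSR matrix into contiguous chunks with roughly equal numbers of nonzeros, so that each thread does a similar amount of work. It is plain Python, not PETSc code; the function names `balance_rows_by_nnz` and `spmv_chunk` and the CSR array names are assumptions made for illustration.

```python
from bisect import bisect_left


def balance_rows_by_nnz(row_ptr, num_threads):
    """Split rows into contiguous chunks holding roughly equal nonzero counts.

    row_ptr is the CSR row-pointer array (length n_rows + 1).
    Returns a list of (start_row, end_row) pairs, one per thread.
    """
    n_rows = len(row_ptr) - 1
    total_nnz = row_ptr[-1]
    bounds = [0]
    for t in range(1, num_threads):
        target = total_nnz * t / num_threads
        # First row boundary whose cumulative nonzero count reaches the target.
        cut = bisect_left(row_ptr, target)
        # Keep boundaries monotone and inside the matrix.
        cut = min(max(cut, bounds[-1]), n_rows)
        bounds.append(cut)
    bounds.append(n_rows)
    return [(bounds[i], bounds[i + 1]) for i in range(num_threads)]


def spmv_chunk(row_ptr, col_idx, values, x, y, start, end):
    """Compute y[start:end] = (A @ x)[start:end] for a CSR matrix A."""
    for i in range(start, end):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y[i] = acc


# Hypothetical 3x3 example: [[2, 0, 1], [0, 3, 0], [4, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [2.0, 1.0, 3.0, 4.0, 5.0]
x = [1.0, 2.0, 3.0]
y = [0.0] * 3

# In a hybrid code each chunk would be handled by one thread;
# here the chunks are processed one after another.
for start, end in balance_rows_by_nnz(row_ptr, num_threads=2):
    spmv_chunk(row_ptr, col_idx, values, x, y, start, end)
```

Splitting by nonzeros rather than by row count matters when a few rows are much denser than the rest: an even row split would leave the thread that owns the dense rows with most of the work.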

    F-MPJ: scalable Java message-passing communications on parallel systems

    This is a post-peer-review, pre-copyedit version of an article published in The Journal of Supercomputing. The final authenticated version is available online at: https://doi.org/10.1007/s11227-009-0270-0
    [Abstract] This paper presents F-MPJ (Fast MPJ), a scalable and efficient Message-Passing in Java (MPJ) communication middleware for parallel computing. The increasing interest in Java as the programming language of the multi-core era demands scalable performance on hybrid architectures (with both shared and distributed memory spaces). However, current Java communication middleware lacks efficient communication support. F-MPJ improves this situation by: (1) providing efficient non-blocking communication, which allows communication overlapping and thus scalable performance; (2) taking advantage of shared memory systems and high-performance networks through the use of our high-performance Java sockets implementation (named JFS, Java Fast Sockets); (3) avoiding the use of communication buffers; and (4) optimizing MPJ collective primitives. Thus, F-MPJ significantly improves the scalability of current MPJ implementations. A performance evaluation on an InfiniBand multi-core cluster has shown that F-MPJ communication primitives outperform representative MPJ libraries by up to 60 times. Furthermore, the use of F-MPJ in communication-intensive MPJ codes has increased their performance by up to seven times.
    Ministerio de Educación y Ciencia; TIN2004-07797-C02
    Ministerio de Educación y Ciencia; TIN2007-67537-C03-2
    Xunta de Galicia; PGIDIT06PXIB105228P

    A multi-agent architecture for internet distributed computing system

    This thesis presents a taxonomy of agent-based distributed computing systems. Based on this taxonomy, the design, implementation, analysis and distribution protocol of a multi-agent architecture for an internet-based distributed computing system were developed. A prototype of the designed architecture was implemented on Spider III using the IBM Aglets Software Development Kit (ASDK 2.0) and the Java language.

    QR Factorization of Tall and Skinny Matrices in a Grid Computing Environment

    Previous studies have reported that common dense linear algebra operations do not achieve speedup by using multiple geographical sites of a computational grid. Because such operations are the building blocks of most scientific applications, conventional supercomputers are still strongly predominant in high-performance computing, and the use of grids for speeding up large-scale scientific problems is limited to applications exhibiting parallelism at a higher level. We have identified two performance bottlenecks in the distributed memory algorithms implemented in ScaLAPACK, a state-of-the-art dense linear algebra library. First, because ScaLAPACK assumes a homogeneous communication network, the implementations of ScaLAPACK algorithms lack locality in their communication pattern. Second, the number of messages sent in the ScaLAPACK algorithms is significantly greater than in other algorithms that trade flops for communication. In this paper, we present a new approach for computing a QR factorization (one of the main dense linear algebra kernels) of tall and skinny matrices in a grid computing environment that overcomes these two bottlenecks. Our contribution is to couple a recently proposed algorithm (Communication Avoiding QR) with a topology-aware middleware (QCG-OMPI) in order to confine intensive communications (ScaLAPACK calls) within the different geographical sites. An experimental study conducted on the Grid'5000 platform shows that the resulting performance increases linearly with the number of geographical sites on large-scale problems (and is in particular consistently higher than ScaLAPACK's).
    Comment: Accepted at IPDPS10 (IEEE International Parallel & Distributed Processing Symposium 2010 in Atlanta, GA, USA).
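To illustrate why a communication-avoiding QR suits tall and skinny matrices, here is a minimal single-process sketch of the general tall-skinny QR idea, assuming NumPy: each "site" factors its own block of rows locally, and only the small R factors are combined in a final QR. It is not the paper's implementation and involves neither ScaLAPACK nor QCG-OMPI; the function name `tsqr_r` and the use of a flat, one-level reduction are assumptions made for illustration.

```python
import numpy as np


def tsqr_r(A, num_sites):
    """Return the R factor of A using a one-level tall-skinny QR.

    The rows of A are split into num_sites blocks (standing in for the
    data held at each site). Each block is factored independently, then
    the stacked local R factors are factored once more.
    """
    blocks = np.array_split(A, num_sites, axis=0)
    # Local step: no data exchange between blocks is needed here.
    local_rs = [np.linalg.qr(block, mode="r") for block in blocks]
    # Reduction step: only the small n-by-n R factors are combined.
    stacked = np.vstack(local_rs)
    return np.linalg.qr(stacked, mode="r")


rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 8))  # tall and skinny: 1000 rows, 8 columns
R = tsqr_r(A, num_sites=4)
```

In this sketch each site contributes only an 8-by-8 triangular factor to the final step, whatever the number of rows it holds, which is the property that keeps the volume of inter-site communication small. The R factor is unique only up to the signs of its rows, so it is best checked through the identity R^T R = A^T A rather than by direct comparison with another QR routine.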