
    Sparse Message Passing Based Preamble Estimation for Crowded M2M Communications

    Due to the massive number of devices in the M2M communication era, new challenges have been brought to the existing random-access (RA) mechanism, such as severe preamble collisions and resource block (RB) waste. To address these problems, a novel sparse message passing (SMP) algorithm is proposed, based on a factor graph on which Bernoulli messages are updated. The SMP enables an accurate estimation of the activity of the devices and the identity of the preamble chosen by each active device. Aided by the estimation, the RB efficiency for the uplink data transmission can be improved, especially among the collided devices. In addition, an analytical tool is derived to analyze the iterative evolution and convergence of the SMP algorithm. Finally, numerical simulations are provided to verify the validity of our analytical results and the significant improvement of the proposed SMP in estimation error rate even when preamble collision occurs. Comment: submitted to ICC 2018 with 6 pages and 4 figures

    ParFORM: recent development

    We report on the status of our project to parallelize the symbolic manipulation program FORM. We now have parallel versions of FORM running on cluster or SMP architectures. These versions can be used to run arbitrary FORM programs in parallel. Comment: 5 pages, 6 Encapsulated PostScript figures, LaTeX2e, uses espcrc2.sty (included). Talk given at ACAT0

    Achieving Efficient Strong Scaling with PETSc using Hybrid MPI/OpenMP Optimisation

    The increasing number of processing elements and decreasing memory-to-core ratio in modern high-performance platforms make efficient strong scaling a key requirement for numerical algorithms. In order to achieve efficient scalability on massively parallel systems, scientific software must evolve across the entire stack to exploit the multiple levels of parallelism exposed in modern architectures. In this paper we demonstrate the use of hybrid MPI/OpenMP parallelisation to optimise parallel sparse matrix-vector multiplication in PETSc, a widely used scientific library for the scalable solution of partial differential equations. Using large matrices generated by Fluidity, an open source CFD application code which uses PETSc as its linear solver engine, we evaluate the effect of explicit communication overlap using task-based parallelism and show how to further improve performance by explicitly load balancing threads within MPI processes. We demonstrate a significant speedup over the pure-MPI mode and efficient strong scaling of sparse matrix-vector multiplication on Fujitsu PRIMEHPC FX10 and Cray XE6 systems.
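
    As a rough illustration of what load balancing threads in a sparse matrix-vector product can mean, the sketch below splits the rows of a CSR matrix into contiguous chunks with roughly equal numbers of nonzeros rather than equal numbers of rows. This is an assumption about one plausible balancing criterion, not the paper's or PETSc's actual implementation; the function names `balance_rows_by_nnz` and `spmv_chunk` are hypothetical, and the code is plain Python rather than MPI/OpenMP.

```python
from bisect import bisect_left


def balance_rows_by_nnz(row_ptr, num_threads):
    """Split CSR rows into contiguous chunks with roughly equal nonzeros.

    row_ptr is the CSR row-pointer array (length n_rows + 1, row_ptr[0] == 0).
    Returns one half-open (start_row, end_row) range per thread.
    Hypothetical helper: balancing by nonzero count is an assumption.
    """
    n_rows = len(row_ptr) - 1
    total_nnz = row_ptr[-1]
    bounds = [0]
    for t in range(1, num_threads):
        target = total_nnz * t / num_threads
        # First row boundary whose cumulative nonzero count reaches the target.
        cut = bisect_left(row_ptr, target)
        # Keep boundaries monotone and inside the matrix.
        cut = min(max(cut, bounds[-1]), n_rows)
        bounds.append(cut)
    bounds.append(n_rows)
    return list(zip(bounds[:-1], bounds[1:]))


def spmv_chunk(row_ptr, col_idx, values, x, start, end):
    """Compute y[start:end] = A[start:end, :] @ x for a CSR matrix A."""
    return [
        sum(values[k] * x[col_idx[k]] for k in range(row_ptr[i], row_ptr[i + 1]))
        for i in range(start, end)
    ]


# One dense row followed by five sparse rows: 12 nonzeros in total.
example_row_ptr = [0, 6, 7, 8, 9, 10, 12]
chunks = balance_rows_by_nnz(example_row_ptr, 2)
```

    For this example, an equal-rows split would give the two threads 8 and 4 nonzeros, while the nonzero-based split gives each thread 6. In a real hybrid code each chunk would be handed to one thread, which would then run the equivalent of `spmv_chunk` on its own row range.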

    Optimizing message-passing performance within symmetric multiprocessor systems

    The Message Passing Interface (MPI) has been widely used in the area of parallel computing due to its portability, scalability, and ease of use. Message passing within Symmetric Multiprocessor (SMP) systems is an important part of any MPI library since it enables parallel programs to run efficiently on SMP systems, or on clusters of SMP systems when combined with other means of communication such as TCP/IP. Most message-passing implementations use a shared memory pool as an intermediate buffer to hold messages, some lock mechanism to protect the pool, and some synchronization mechanism for coordinating the processes. However, the performance varies significantly depending on how these are implemented. The work here implements two SMP message-passing modules using lock-based and lock-free approaches for MP_Lite, a compact library that implements a subset of the most commonly used MPI functions. Various optimization techniques have been used to improve the performance. These two modules are evaluated using a communication performance analysis tool called NetPIPE, and compared with the implementations of other MPI libraries such as MPICH, MPICH2, LAM/MPI and MPI/PRO. Performance tools such as PAPI and VTune are used to gather runtime information at the hardware level. This information, together with some cache theory and the hardware configuration, is used to explain various performance phenomena. Tests using a real application show how the different implementations perform in practice. These results all show the improvements of the new techniques over existing implementations.