Sparse Message Passing Based Preamble Estimation for Crowded M2M Communications
Due to the massive number of devices in the M2M communication era, new
challenges have arisen for the existing random-access (RA) mechanism, such as
severe preamble collisions and wasted resource blocks (RBs). To address these
problems, a novel sparse message passing (SMP) algorithm is proposed, based on
a factor graph on which Bernoulli messages are updated. The SMP enables
accurate estimation of the activity of the devices and of the identity of the
preamble chosen by each active device. Aided by this estimation, the RB
efficiency of the uplink data transmission can be improved, especially among
the collided devices. In addition, an analytical tool is derived to analyze the
iterative evolution and convergence of the SMP algorithm. Finally, numerical
simulations verify the validity of our analytical results and the significant
improvement of the proposed SMP in estimation error rate, even when preamble
collisions occur.
Comment: submitted to ICC 2018 with 6 pages and 4 figures
ParFORM: recent development
We report on the status of our project to parallelize the symbolic
manipulation program FORM. We now have parallel versions of FORM running on
cluster or SMP architectures. These versions can be used to run arbitrary FORM
programs in parallel.
Comment: 5 pages, 6 Encapsulated PostScript figures, LaTeX2e, uses espcrc2.sty
(included). Talk given at ACAT0
Achieving Efficient Strong Scaling with PETSc using Hybrid MPI/OpenMP Optimisation
The increasing number of processing elements and the decreasing memory-to-core
ratio in modern high-performance platforms make efficient strong scaling a key
requirement for numerical algorithms. In order to achieve efficient scalability
on massively parallel systems, scientific software must evolve across the
entire stack to exploit the multiple levels of parallelism exposed in modern
architectures. In this paper we demonstrate the use of hybrid MPI/OpenMP
parallelisation to optimise parallel sparse matrix-vector multiplication in
PETSc, a widely used scientific library for the scalable solution of partial
differential equations. Using large matrices generated by Fluidity, an
open-source CFD application code which uses PETSc as its linear solver engine,
we evaluate the effect of explicit communication overlap using task-based
parallelism and show how to further improve performance by explicitly load
balancing threads within MPI processes. We demonstrate a significant speedup
over the pure-MPI mode and efficient strong scaling of sparse matrix-vector
multiplication on Fujitsu PRIMEHPC FX10 and Cray XE6 systems.
Optimizing message-passing performance within symmetric multiprocessor systems
The Message Passing Interface (MPI) has been widely used in the area of parallel computing due to its portability, scalability, and ease of use. Message passing within Symmetric Multiprocessor (SMP) systems is an important part of any MPI library, since it enables parallel programs to run efficiently on SMP systems, or on clusters of SMP systems when combined with other means of communication such as TCP/IP. Most message-passing implementations use a shared memory pool as an intermediate buffer to hold messages, lock mechanisms to protect the pool, and synchronization mechanisms to coordinate the processes. However, performance varies significantly depending on how these are implemented. The work here implements two SMP message-passing modules, using lock-based and lock-free approaches, for MP_Lite, a compact library that implements a subset of the most commonly used MPI functions. Various optimization techniques have been used to improve performance. The two modules are evaluated using NetPIPE, a communication performance analysis tool, and compared with the implementations of other MPI libraries such as MPICH, MPICH2, LAM/MPI and MPI/PRO. Performance tools such as PAPI and VTune are used to gather runtime information at the hardware level. This information, together with cache theory and the hardware configuration, is used to explain various performance phenomena. Tests using a real application have shown the performance of the different implementations in practice. These results all demonstrate the improvements of the new techniques over existing implementations.