Hierarchical Parallel Matrix Multiplication on Large-Scale Distributed Memory Platforms
Matrix multiplication is a very important computation kernel both in its own
right as a building block of many scientific applications and as a popular
representative for other scientific applications. Cannon's algorithm, which dates
back to 1969, was the first efficient algorithm for parallel matrix
multiplication, providing theoretically optimal communication cost. However, this
algorithm requires a square number of processors. In the mid-1990s, the SUMMA
algorithm was introduced. SUMMA overcomes this shortcoming of Cannon's algorithm,
as it can also be used on a non-square number of processors. Since then, the
number of processors in HPC platforms has increased by two orders of magnitude,
making the contribution of communication to the overall execution time more
significant. Therefore, the state-of-the-art parallel matrix multiplication
algorithms should be revisited to reduce the communication cost further. This
paper introduces a new parallel matrix multiplication algorithm, Hierarchical
SUMMA (HSUMMA), which is a redesign of SUMMA. Our algorithm reduces the
communication cost of SUMMA by introducing a two-level virtual hierarchy into
the two-dimensional arrangement of processors. Experiments on an IBM BlueGene-P
demonstrate a reduction in communication cost of up to 2.08 times on 2048 cores
and up to 5.89 times on 16384 cores.
Comment: 9 pages
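As a rough illustration of the baseline SUMMA scheme mentioned in the abstract above, the sketch below simulates it in a single process: each simulated processor on a pr x pc grid owns one block of C and accumulates panel products. It is not the paper's HSUMMA and models no real communication; the names `summa` and `matmul_add`, and the divisibility assumptions, are hypothetical choices made for this example.

```python
# Single-process simulation of SUMMA on a pr x pc processor grid.
# Assumption: n is divisible by pr, pc, and panel. All names are illustrative.

def matmul_add(C, A, B):
    """Accumulate C += A * B for small dense lists of lists."""
    for i in range(len(A)):
        for k in range(len(B)):
            a = A[i][k]
            for j in range(len(B[0])):
                C[i][j] += a * B[k][j]


def summa(A, B, pr, pc, panel):
    """Multiply n x n matrices A and B on a simulated pr x pc grid."""
    n = len(A)
    assert n % pr == 0 and n % pc == 0 and n % panel == 0
    rb, cb = n // pr, n // pc  # block height and width owned by each processor

    # c_blocks[(i, j)] is the rb x cb block of C owned by processor (i, j).
    c_blocks = {(i, j): [[0] * cb for _ in range(rb)]
                for i in range(pr) for j in range(pc)}

    for k0 in range(0, n, panel):
        for i in range(pr):
            # Column panel of A that would be broadcast along processor row i.
            a_panel = [row[k0:k0 + panel] for row in A[i * rb:(i + 1) * rb]]
            for j in range(pc):
                # Row panel of B that would be broadcast along processor column j.
                b_panel = [row[j * cb:(j + 1) * cb] for row in B[k0:k0 + panel]]
                matmul_add(c_blocks[(i, j)], a_panel, b_panel)

    # Gather the distributed blocks back into one matrix for checking.
    C = [[0] * n for _ in range(n)]
    for (i, j), block in c_blocks.items():
        for r in range(rb):
            C[i * rb + r][j * cb:(j + 1) * cb] = block[r]
    return C
```

Calling `summa(A, B, 2, 3, 1)` on 6 x 6 inputs exercises a non-square grid, the case the abstract says Cannon's algorithm cannot handle.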
The physics of parallel machines
This work considers the idea that architectures for massively parallel computers must go beyond supporting a particular class of algorithms and instead support the underlying physical processes being modelled. Physical processes modelled by partial differential equations (PDEs) are discussed, as is the idea that an efficient architecture must go beyond nearest-neighbor mesh interconnections and support global and hierarchical communications.
Symmetric Tori connected Torus Network
A Symmetric Tori connected Torus Network (STTN) is
a 2D-torus network of multiple basic modules, in which
the basic modules are 2D-torus networks that are
hierarchically interconnected for higher-level networks.
In this paper, we present the architecture of the STTN,
its node addressing and message routing, and evaluate
the static network performance of STTN, TTN, TESH,
mesh, and torus networks. It is shown that the STTN
possesses several attractive features, including constant
degree, small diameter, low cost, small average
distance, moderate bisection width, and higher fault
tolerance than other conventional
and hierarchical interconnection networks.
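To make the static metrics named in the abstract above concrete, the following sketch computes node degree, diameter, and average distance for a plain 2D torus, one of the baseline networks listed. It does not model the STTN itself, whose interconnection rules are not given in the abstract; the function name `torus_metrics` and the assumption that both dimensions are at least 3 are choices made for this example.

```python
from collections import deque


def torus_metrics(rows, cols):
    """Return (degree, diameter, average_distance) of a rows x cols 2D torus.

    Assumes rows >= 3 and cols >= 3 so every node has four distinct neighbors.
    """
    def neighbors(r, c):
        # Wraparound links in both dimensions.
        return {((r + 1) % rows, c), ((r - 1) % rows, c),
                (r, (c + 1) % cols), (r, (c - 1) % cols)} - {(r, c)}

    # A torus looks the same from every node, so one BFS from (0, 0) suffices.
    dist = {(0, 0): 0}
    queue = deque([(0, 0)])
    while queue:
        node = queue.popleft()
        for nxt in neighbors(*node):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)

    degree = len(neighbors(0, 0))
    diameter = max(dist.values())
    # Average hop count from one node to each of the other nodes.
    average_distance = sum(dist.values()) / (rows * cols - 1)
    return degree, diameter, average_distance
```

For example, `torus_metrics(4, 4)` reports a degree of 4 and a diameter of 4; the degree stays fixed as the network grows, which is the "constant degree" property the abstract refers to.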
The Abacus Cosmos: A Suite of Cosmological N-body Simulations
We present a public data release of halo catalogs from a suite of 125
cosmological N-body simulations from the Abacus project. The simulations span
40 CDM cosmologies centered on the Planck 2015 cosmology at two mass
resolutions, and , in and
boxes, respectively. The boxes are phase-matched to
suppress sample variance and isolate cosmology dependence. Additional volume is
available via 16 boxes of fixed cosmology and varied phase; a few boxes of
single-parameter excursions from Planck 2015 are also provided. Catalogs
spanning to are available for friends-of-friends and Rockstar
halo finders and include particle subsamples. All data products are available
at https://lgarrison.github.io/AbacusCosmos
Comment: 13 pages, 9 figures, 3 tables. Additional figures added for mass
resolution convergence tests, and additional redshifts added for existing
tests. Matches ApJS accepted version
Programming Model to Develop Supercomputer Combinatorial Solvers
© 2017 IEEE. Novel architectures for massively parallel machines offer better scalability and the prospect of achieving linear speedup for sizable problems in many domains. The development of suitable programming models and accompanying software tools for these architectures remains one of the biggest challenges in exploiting their full potential. We present a multi-layer software abstraction model to develop combinatorial solvers on massively parallel machines with regular topologies. The model enables different challenges in the design and optimization of combinatorial solvers to be tackled independently (separation of concerns) while permitting problem-specific tuning and cross-layer optimization. Specifically, the model decouples the issues of inter-node communication, node-level scheduling, problem mapping, mesh-level load balancing, and expressing problem logic. We present an implementation of the model and use it to profile a Boolean satisfiability solver on simulated massively parallel machines with different scales and topologies.