ASCR/HEP Exascale Requirements Review Report
This draft report summarizes and details the findings, results, and
recommendations derived from the ASCR/HEP Exascale Requirements Review meeting
held in June 2015. The main conclusions are as follows. 1) Larger, more
capable computing and data facilities are needed to support HEP science goals
in all three frontiers: Energy, Intensity, and Cosmic. The expected scale of
the demand at the 2025 timescale is at least two orders of magnitude greater
-- and in some cases more -- than what is available currently. 2) The growth rate of data
produced by simulations is overwhelming the current ability, of both facilities
and researchers, to store and analyze it. Additional resources and new
techniques for data analysis are urgently needed. 3) Data rates and volumes
from HEP experimental facilities are also straining the ability to store and
analyze large and complex data volumes. Appropriately configured
leadership-class facilities can play a transformational role in enabling
scientific discovery from these datasets. 4) A close integration of HPC
simulation and data analysis will aid greatly in interpreting results from HEP
experiments. Such an integration will minimize data movement and facilitate
interdependent workflows. 5) Long-range planning between HEP and ASCR will be
required to meet HEP's research needs. To best use ASCR HPC resources the
experimental HEP program needs a) an established long-term plan for access to
ASCR computational and data resources, b) an ability to map workflows onto HPC
resources, c) the ability for ASCR facilities to accommodate workflows run by
collaborations that can have thousands of individual members, d) to transition
codes to the next-generation HPC platforms that will be available at ASCR
facilities, e) to build up and train a workforce capable of developing and
using simulations and analysis to support HEP scientific research on
next-generation systems.

Comment: 77 pages, 13 Figures; draft report, subject to further revision
Multi-Architecture Monte-Carlo (MC) Simulation of Soft Coarse-Grained Polymeric Materials: SOft coarse grained Monte-carlo Acceleration (SOMA)
Multi-component polymer systems are important for the development of new
materials because of their ability to phase-separate or self-assemble into
nano-structures. The Single-Chain-in-Mean-Field (SCMF) algorithm in conjunction
with a soft, coarse-grained polymer model is an established technique to
investigate these soft-matter systems. Here we present an implementation of
this method: SOft coarse grained Monte-carlo Acceleration (SOMA). It is
suitable to simulate large system sizes with up to billions of particles, yet
versatile enough to study properties of different kinds of molecular
architectures and interactions. We achieve efficient simulations by employing
accelerators such as GPUs on both workstations and supercomputers. The
implementation remains flexible and maintainable because it is written in a
scientific programming language enhanced by OpenACC pragmas for the
accelerators. We present implementation details and
features of the program package, investigate the scalability of our
implementation SOMA, and discuss two applications, which cover system sizes
that are difficult to reach with other, common particle-based simulation
methods.
Optimization of a parallel Monte Carlo method for linear algebra problems
Many problems in science and engineering can be represented by Systems of
Linear Algebraic Equations (SLAEs). Numerical methods such as direct or
iterative ones are used to solve these kinds of systems. Depending on their
size and other characteristics, such systems can sometimes be very difficult
to solve even for iterative methods, requiring long times and large amounts of
computational resources. In these cases a preconditioning approach should be
applied.
Preconditioning is a technique used to transform a SLAE into an equivalent
but simpler system which requires less time and effort to solve. The matrix
which performs such a transformation is called the preconditioner [7]. There
are preconditioners for both direct and iterative methods, but they are more
commonly used with the latter.
In the general case a preconditioned system will require less effort to
solve than the original one. For example, when an iterative method is used,
fewer iterations will be required or each iteration will take less time,
depending on the quality and the efficiency of the preconditioner.
There are different classes of preconditioners, but we will focus only on
those that are based on the SParse Approximate Inverse (SPAI) approach.
These algorithms are based on the fact that an approximate inverse of a
given SLAE matrix can be used to approximate its solution or to reduce its
complexity.
Monte Carlo methods are probabilistic methods that use random numbers
either to simulate stochastic behaviour or to estimate the solution of a
problem. They are good candidates for parallelization because many
independent samples are used to estimate the solution. These samples can be
computed in parallel, thereby speeding up the solution-finding process [27].
In the past there has been a lot of research on the use of Monte Carlo
methods to calculate SPAI preconditioners [1], [27], [10]. In this work we
present the implementation of a SPAI preconditioner based on a Monte Carlo
method. This algorithm calculates the matrix inverse by sampling a random
variable which approximates the Neumann series expansion.
Using the Neumann series it is possible to calculate the inverse of a system
matrix A by summing consecutive powers of (I − A), since A^(-1) is given by
the series expansion A^(-1) = I + (I − A) + (I − A)^2 + ..., provided the
series converges.
Given the stochastic nature of the Monte Carlo algorithm, the computational
effort required to find an element of the inverse matrix is independent of
the size of the matrix. This makes it possible to target systems that, due
to their size, are prohibitive for common deterministic approaches [27].
A great part of this work focuses on the enhancement of this algorithm.
First, the existing errors in the implementation were fixed, enabling the
algorithm to target larger systems. Then multiple optimizations were applied
at different stages of the implementation, making better use of the
resources and improving the performance of the algorithm.
Four optimizations, each yielding consistent improvements, have been
performed:
1. An inefficient implementation of the realloc function within the MPI
library was causing the application to rapidly run out of memory. This
function was replaced by the malloc function, together with some slight
modifications to estimate the size of matrix A.
2. A coordinate format (COO) was introduced within the algorithm's core to
make more efficient use of the memory, avoiding several unnecessary memory
accesses.
3. A method to produce an intermediate matrix P was shown to give results
similar to the default one, with matrix P reduced to a single vector and
thus requiring less data. Since this data was broadcast, reducing its size
translated into a reduction of the broadcast time.
4. Four individual procedures that each accessed the whole initial matrix in
memory were merged into two, thereby reducing the number of memory
accesses.
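The coordinate (COO) format of optimization 2 can be illustrated as follows (a minimal sketch, not the thesis implementation; the helper names are invented for this example):

```python
# COO stores only the nonzero entries as (row, col, value) triples,
# so later passes over the matrix touch only the stored nonzeros.

def to_coo(dense):
    rows, cols, vals = [], [], []
    for i, row in enumerate(dense):
        for j, v in enumerate(row):
            if v != 0.0:
                rows.append(i)
                cols.append(j)
                vals.append(v)
    return rows, cols, vals

def coo_matvec(rows, cols, vals, x, n):
    """y = A x computed from the COO triples of an n-row matrix."""
    y = [0.0] * n
    for i, j, v in zip(rows, cols, vals):
        y[i] += v * x[j]
    return y

dense = [[0.0, 1.5, 0.0],
         [2.0, 0.0, 0.0],
         [0.0, 0.0, 3.0]]
rows, cols, vals = to_coo(dense)
y = coo_matvec(rows, cols, vals, [1.0, 2.0, 3.0], len(dense))
```

For the sparse matrices typical of SLAEs, iterating over the triples avoids scanning the many zero entries a dense layout would force through memory.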
For each optimization applied, a comparison was performed to show the
particular improvements achieved. A set of different matrices, representing
different SLAEs, was used to show the consistency of these improvements.
To provide insights into the scalability issues of the algorithm, further
approaches are presented that expose its particular scalability behaviour:
1. Given that the original version of this algorithm was designed for a
cluster of single-core machines, a hybrid MPI + OpenMP approach was proposed
to target today's multi-core architectures. Surprisingly, this new approach
did not show any improvement, but it was useful in exposing a scalability
problem related to the random pattern used to access the memory.
2. Common MPI implementations of the broadcast operation do not take into
account the different latencies of inter-node and intra-node communications
[25]. We therefore decided to implement the broadcast in two steps: first
reaching a single process on each of the compute nodes, and then using those
processes to perform a local broadcast within their compute nodes. Results
for this approach showed that the method can lead to improvements when very
large systems are used.
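The two-step broadcast of approach 2 can be modelled without MPI as follows (a hypothetical sketch that only counts messages; a real implementation would use MPI sub-communicators):

```python
# Model of the two-step broadcast: the root first reaches one leader rank
# per remote node (inter-node messages), then each leader forwards the
# value to the other ranks on its own node (intra-node messages).

def two_step_broadcast(value, ranks_per_node, num_nodes, root=0):
    total = ranks_per_node * num_nodes
    received = [None] * total
    received[root] = value
    inter = intra = 0
    root_node = root // ranks_per_node
    # Step 1: one inter-node message per remote node's leader rank.
    for node in range(num_nodes):
        if node != root_node:
            received[node * ranks_per_node] = value
            inter += 1
    # Step 2: local forwarding within each node.
    for node in range(num_nodes):
        for r in range(node * ranks_per_node, (node + 1) * ranks_per_node):
            if received[r] is None:
                received[r] = value
                intra += 1
    return received, inter, intra

# 3 nodes with 4 ranks each: only 2 slow inter-node messages are needed,
# whereas a flat broadcast from rank 0 would cross nodes 8 times.
recv, inter, intra = two_step_broadcast(42, ranks_per_node=4, num_nodes=3)
```

Only the leader step pays inter-node latency; the remaining traffic stays on the fast intra-node path, which is why the benefit grows with system size.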
Finally, a comparison is carried out between the optimized version of the
Monte Carlo algorithm and the state-of-the-art Modified SPAI (MSPAI).
Four metrics are used to compare these approaches:
1. The amount of time needed for the preconditioner construction.
2. The time needed by the solver to calculate the solution of the preconditioned
system.
3. The sum of the two previous metrics, which gives an overview of the
quality and the efficiency of the preconditioner.
4. The number of cores used in the preconditioner construction, which gives
an idea of the energy efficiency of the algorithm.
Results from this comparison showed that the Monte Carlo algorithm can deal
with both symmetric and nonsymmetric matrices, while MSPAI only performs
well with the nonsymmetric ones. Furthermore, the Monte Carlo algorithm is
always faster for the preconditioner construction, and most of the time also
for the solver calculation. This means that Monte Carlo produces
preconditioners of better or equal quality compared to MSPAI. Finally, the
number of cores used in the Monte Carlo approach is always equal to or
smaller than in the case of MSPAI.
Research and Education in Computational Science and Engineering
Over the past two decades the field of computational science and engineering
(CSE) has penetrated both basic and applied research in academia, industry, and
laboratories to advance discovery, optimize systems, support decision-makers,
and educate the scientific and engineering workforce. Informed by centuries of
theory and experiment, CSE performs computational experiments to answer
questions that neither theory nor experiment alone is equipped to answer. CSE
provides scientists and engineers of all persuasions with algorithmic
inventions and software systems that transcend disciplines and scales. Carried
on a wave of digital technology, CSE brings the power of parallelism to bear on
troves of data. Mathematics-based advanced computing has become a prevalent
means of discovery and innovation in essentially all areas of science,
engineering, technology, and society; and the CSE community is at the core of
this transformation. However, a combination of disruptive
developments---including the architectural complexity of extreme-scale
computing, the data revolution that engulfs the planet, and the specialization
required to follow the applications to new frontiers---is redefining the scope
and reach of the CSE endeavor. This report describes the rapid expansion of CSE
and the challenges to sustaining its bold advances. The report also presents
strategies and directions for CSE research and education for the next decade.

Comment: Major revision, to appear in SIAM Review
Towards a Mini-App for Smoothed Particle Hydrodynamics at Exascale
The smoothed particle hydrodynamics (SPH) technique is a purely Lagrangian
method, used in numerical simulations of fluids in astrophysics and
computational fluid dynamics, among many other fields. SPH simulations with
detailed physics represent computationally-demanding calculations. The
parallelization of SPH codes is not trivial due to the absence of a structured
grid. Additionally, the performance of the SPH codes can be, in general,
adversely impacted by several factors, such as multiple time-stepping,
long-range interactions, and/or boundary conditions. This work presents
insights into the current performance and functionalities of three SPH codes:
SPHYNX, ChaNGa, and SPH-flow. These codes are the starting point of an
interdisciplinary co-design project, SPH-EXA, for the development of an
Exascale-ready SPH mini-app. To gain such insights, a rotating square patch
test was implemented as a common test simulation for the three SPH codes and
analyzed on two modern HPC systems. Furthermore, to stress the differences with
the codes stemming from the astrophysics community (SPHYNX and ChaNGa), an
additional test case, the Evrard collapse, has also been carried out. This work
extrapolates the common basic SPH features in the three codes for the purpose
of consolidating them into a pure-SPH, Exascale-ready, optimized, mini-app.
Moreover, the outcome of this serves as direct feedback to the parent codes, to
improve their performance and overall scalability.

Comment: 18 pages, 4 figures, 5 tables, 2018 IEEE International Conference on
Cluster Computing proceedings for WRAp1
SPH-EXA: Enhancing the Scalability of SPH codes Via an Exascale-Ready SPH Mini-App
Numerical simulations of fluids in astrophysics and computational fluid
dynamics (CFD) are among the most computationally-demanding calculations, in
terms of sustained floating-point operations per second, or FLOP/s. It is
expected that these numerical simulations will significantly benefit from the
future Exascale computing infrastructures, that will perform 10^18 FLOP/s. The
performance of the SPH codes is, in general, adversely impacted by several
factors, such as multiple time-stepping, long-range interactions, and/or
boundary conditions. In this work an extensive study of three SPH
implementations, SPHYNX, ChaNGa, and XXX, is performed to gain insights and to
expose any limitations and characteristics of the codes. These codes are the
starting point of an interdisciplinary co-design project, SPH-EXA, for the
development of an Exascale-ready SPH mini-app. We implemented a rotating square
patch as a joint test simulation for the three SPH codes and analyzed their
performance on a modern HPC system, Piz Daint. The performance profiling and
scalability analysis conducted on the three parent codes made it possible to
expose their performance issues, such as load imbalance, both in MPI and OpenMP.
Two-level load balancing has been successfully applied to SPHYNX to overcome
its load imbalance. The performance analysis shapes and drives the design of
the SPH-EXA mini-app towards the use of efficient parallelization methods,
fault-tolerance mechanisms, and load balancing approaches.

Comment: arXiv admin note: substantial text overlap with arXiv:1809.0801
Computational Methods in Science and Engineering : Proceedings of the Workshop SimLabs@KIT, November 29 - 30, 2010, Karlsruhe, Germany
In this proceedings volume we provide a compilation of article contributions equally covering applications from different research fields and ranging from capacity up to capability computing. Besides classical computing aspects such as parallelization, the focus of these proceedings is on multi-scale approaches and methods for tackling algorithm and data complexity. Practical aspects regarding the usage of the HPC infrastructure and the available tools and software at the SCC are also presented.
Parallel cross interpolation for high-precision calculation of high-dimensional integrals
We propose a parallel version of the cross interpolation algorithm and apply it to calculate high-dimensional integrals motivated by the Ising model in quantum physics. In contrast to mainstream approaches, such as Monte Carlo and quasi Monte Carlo, the samples calculated by our algorithm are neither random nor form a regular lattice. Instead, we evaluate the given function along individual dimensions (modes) and use these values to reconstruct its behaviour in the whole domain. The positions of the evaluated univariate fibres are chosen adaptively for the given function. The required evaluations can be executed in parallel along each mode (variable) and over all modes.
To demonstrate the efficiency of the proposed method, we apply it to compute high-dimensional Ising susceptibility integrals, arising from asymptotic expansions for the spontaneous magnetisation in the two-dimensional Ising model of ferromagnetism. We observe strong superlinear convergence of the proposed method, while the MC and qMC algorithms converge sublinearly. Using multiple-precision arithmetic, we also observe exponential convergence of the proposed algorithm. Combining high-order convergence, almost perfect scalability up to hundreds of processes, and the same flexibility as MC and qMC, the proposed algorithm can be a new method of choice for problems involving high-dimensional integration, e.g. in statistics, probability, and quantum physics.
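The idea of reconstructing a function from a few adaptively chosen fibres can be seen in a minimal rank-1 "cross" example (illustrative only; the actual algorithm handles higher ranks and many dimensions, and chooses its pivots adaptively):

```python
import math

# For a rank-1 (separable) function f(x, y) = g(x) * h(y), one row fibre
# f(x, y*), one column fibre f(x*, y), and the pivot value f(x*, y*)
# reconstruct f exactly:  f(x, y) = f(x, y*) * f(x*, y) / f(x*, y*).

def f(x, y):
    return math.exp(-x) * (1.0 + y * y)   # separable, hence rank 1

x_star, y_star = 0.5, 0.5                 # pivot (chosen adaptively in practice)

def cross_approx(x, y):
    return f(x, y_star) * f(x_star, y) / f(x_star, y_star)

# Maximum reconstruction error over a grid of evaluation points.
err = max(abs(f(0.1 * i, 0.1 * j) - cross_approx(0.1 * i, 0.1 * j))
          for i in range(11) for j in range(11))
```

The fibres play the role of the univariate evaluations described in the abstract: the whole domain is reconstructed from function values along individual modes rather than from random or lattice samples.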