XXVI IUPAP Conference on Computational Physics (CCP2014)
The 26th IUPAP Conference on Computational Physics, CCP2014, was held in Boston, Massachusetts, during August 11-14, 2014. Almost 400 participants from 38 countries convened at the George Sherman Union at Boston University for four days of plenary and parallel sessions spanning a broad range of topics in computational physics and related areas.
The first meeting in the series that developed into the annual Conference on Computational Physics (CCP) was held in 1989, also on the campus of Boston University and chaired by our colleague Claudio Rebbi. The express purpose of that meeting was to discuss the progress, opportunities and challenges of common interest to physicists engaged in computational research. With the conference having returned to the site of its inception, it is interesting to reflect on the development of the field during the intervening years. Though 25 years is a short time for mankind, computational physics has taken giant leaps during these years, not only because of the enormous increases in computer power but especially because of the development of new methods and algorithms, and the growing awareness of the opportunities the new technologies and methods can offer. Computational physics now represents a "third leg" of research alongside analytical theory and experiments in almost all subfields of physics, and because of this there is also increasing specialization within the community of computational physicists. It is therefore a challenge to organize a meeting such as CCP, which must have sufficient depth in different areas to hold the interest of experts while at the same time being broad and accessible. Still, at a time when computational research continues to gain in importance, the CCP series is critical in the way it fosters cross-fertilization among fields, with many participants specifically attending in order to get exposure to new methods in fields outside their own.
As organizers and editors of these Proceedings, we are very pleased with the high quality of the papers provided by the participants. These articles represent a good cross-section of what was presented at the meeting, and it is our hope that they will not only be useful individually for their specific scientific content but will also collectively represent a historical snapshot of the state of computational physics.
The remainder of this Preface contains lists detailing the organizational structure of CCP2014, endorsers and sponsors of the meeting, plenary and invited talks, and a presentation of the 2014 IUPAP C20 Young Scientist Prize.
We would like to take the opportunity to again thank all those who contributed to the success of CCP2014, as organizers, sponsors, presenters, exhibitors, and participants.
Anders Sandvik, David Campbell, David Coker, Ying Tang
Published version
Driving NEMO Towards Exascale: Introduction of a New Software Layer in the NEMO Stack Software
This paper addresses scientific challenges related to high-level implementation strategies that enable NEMO to exploit effectively the opportunities offered by exascale systems. We consider two software modules as proofs of concept: the Sea Surface Height equation solver and the Variational Data Assimilation system, which are components of the NEMO ocean model (OPA). The advantages arising from the introduction of consolidated scientific libraries in NEMO are highlighted: such advantages concern both improved "software quality" (in terms of software quality parameters such as robustness, portability, resilience, etc.) and reduced software development time.
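The core idea of such a software layer is to decouple the model code from any particular solver implementation, so that a hand-written numerical kernel can be swapped for a consolidated library without touching the calling code. The sketch below is purely illustrative (the class and function names are hypothetical, and the toy 2x2 system stands in for the Sea Surface Height solve; NEMO's actual layer is Fortran-based):

```python
def jacobi_solve(A, b, iters=100):
    """Stand-in for a hand-written iterative solver: plain Jacobi on Ax = b."""
    n = len(b)
    x = [0.0] * n
    for _ in range(iters):
        # simultaneous (Jacobi) update: new x built entirely from old x
        x = [(b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
             for i in range(n)]
    return x

def cramer_solve_2x2(A, b):
    """Stand-in for a consolidated library's direct solver (e.g. a LAPACK call)."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(b[0] * A[1][1] - A[0][1] * b[1]) / det,
            (A[0][0] * b[1] - b[0] * A[1][0]) / det]

class SolverLayer:
    """The interposed software layer: model code only ever calls solve();
    the backend can be swapped freely, which is what improves portability
    and reduces development time."""
    def __init__(self, backend):
        self._backend = backend

    def solve(self, A, b):
        return self._backend(A, b)
```

Because the model code depends only on `SolverLayer.solve`, migrating from the legacy kernel to a library backend is a one-line change at configuration time.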
A Tuned and Scalable Fast Multipole Method as a Preeminent Algorithm for Exascale Systems
Among the algorithms that are likely to play a major role in future exascale computing, the fast multipole method (FMM) appears as a rising star. Our previous work showed scaling of an FMM on GPU clusters, with problem sizes on the order of billions of unknowns. That work led to an extremely parallel FMM, scaling to thousands of GPUs or tens of thousands of CPUs. This paper reports on a campaign of performance tuning and scalability studies using multi-core CPUs, on the Kraken supercomputer. All kernels in the FMM were parallelized using OpenMP, and a test using 10^7 particles randomly distributed in a cube showed 78% efficiency on 8 threads. Tuning of the particle-to-particle kernel using SIMD instructions resulted in a 4x speed-up of the overall algorithm in single-core tests with 10^3 - 10^7 particles. Parallel scalability was studied in both strong and weak scaling. The strong scaling test used 10^8 particles and resulted in 93% parallel efficiency on 2048 processes for the non-SIMD code and 54% for the SIMD-optimized code (which was still 2x faster). The weak scaling test used 10^6 particles per process, and resulted in 72% efficiency on 32,768 processes, with the largest calculation taking about 40 seconds to evaluate more than 32 billion unknowns. This work builds up evidence for our view that the FMM is poised to play a leading role in exascale computing, and we end the paper with a discussion of the features that make it a particularly favorable algorithm for the emerging heterogeneous and massively parallel architectural landscape.
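The particle-to-particle (P2P) kernel that the paper tunes with SIMD is, at its core, a direct pairwise evaluation over nearby boxes. A minimal pure-Python sketch of that kernel, assuming a 1/r (Laplace) potential (the function name and the choice of kernel are illustrative; the paper's implementation is a compiled, vectorized loop):

```python
import math

def p2p_kernel(targets, sources, charges):
    """Direct particle-to-particle evaluation of the 1/r potential.

    This is the O(N*M) near-field kernel the FMM invokes for pairs of
    neighboring boxes -- the innermost loop is what SIMD tuning targets.
    """
    potentials = []
    for tx, ty, tz in targets:
        phi = 0.0
        for (sx, sy, sz), q in zip(sources, charges):
            dx, dy, dz = tx - sx, ty - sy, tz - sz
            r2 = dx * dx + dy * dy + dz * dz
            if r2 > 0.0:  # skip the self-interaction
                phi += q / math.sqrt(r2)
        potentials.append(phi)
    return potentials
```

In the tuned code, the inner loop over sources is unrolled and evaluated with SIMD instructions, which is why the kernel's arithmetic intensity makes it such a good fit for modern wide-vector hardware.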
To distribute or not to distribute: The question of load balancing for performance or energy
Heterogeneous systems are nowadays a common choice on the path to exascale. Through the use of accelerators they offer outstanding energy efficiency. The programming of these devices employs the host-device model, which is suboptimal because the CPU remains idle during kernel executions but still consumes energy. Making the CPU contribute computing effort might improve the performance and energy consumption of the system. This paper analyses the advantages of this approach and establishes the limits of when it is beneficial. The claims are supported by a set of models that determine how to share a single data-parallel task between the CPU and the accelerator for optimum performance, energy consumption or efficiency. Interestingly, the models show that optimising performance does not always mean optimum energy or efficiency as well. The paper experimentally validates the models, which represent an invaluable tool for programmers faced with the dilemma of whether to distribute their workload in these systems.
This work has been supported by the University of Cantabria (CVE-2014-18166), the Spanish Science and Technology Commission (TIN2016-76635-C2-2-R), the European Research Council (G.A. No 321253) and the European HiPEAC Network of Excellence. The Mont-Blanc project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 671697.
Peer reviewed. Postprint (author's final draft).
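The flavor of model the paper describes can be sketched as follows. The symbols (device throughputs `s_cpu`, `s_gpu` in work items per second, device powers `p_cpu`, `p_gpu`) and the simplifying assumption that both devices draw full power for the whole runtime are illustrative choices of ours, not the paper's actual equations:

```python
def best_split_for_time(s_cpu, s_gpu):
    """Fraction of a data-parallel task to give the CPU so that both
    devices finish simultaneously.

    Co-execution time for split a is max(a*W/s_cpu, (1-a)*W/s_gpu),
    which is minimised when the two terms are equal.
    """
    return s_cpu / (s_cpu + s_gpu)

def coexec_time(a, work, s_cpu, s_gpu):
    """Runtime when fraction a of `work` runs on the CPU, the rest on the GPU."""
    return max(a * work / s_cpu, (1.0 - a) * work / s_gpu)

def energy(a, work, s_cpu, s_gpu, p_cpu, p_gpu):
    """Toy energy model: both devices draw their full power for the whole
    runtime (an idle CPU still consumes), so E = (p_cpu + p_gpu) * T."""
    return (p_cpu + p_gpu) * coexec_time(a, work, s_cpu, s_gpu)
```

Under this simple model, distributing work always shortens the runtime and (since idle power is paid anyway) saves energy too; the paper's more refined models are precisely what expose the cases where the performance-optimal split and the energy-optimal split diverge.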
The Green500 List: Escapades to Exascale
Energy efficiency is now a top priority. The first four years of the Green500 have seen the importance of energy efficiency in supercomputing grow from an afterthought to the forefront of innovation as we near a point where systems will be forced to stop drawing more power. Even so, the landscape of efficiency in supercomputing continues to shift, with new trends emerging, and unexpected shifts in previous predictions.
This paper offers an in-depth analysis of the new and shifting trends in the Green500. In addition, the analysis offers early indications of the track we are taking toward exascale, and what an exascale machine in 2018 is likely to look like. Lastly, we discuss the new efforts and collaborations toward designing and establishing better metrics, methodologies and workloads for the measurement and analysis of energy-efficient supercomputing.
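The Green500's ordering principle is performance per watt rather than raw performance. A minimal sketch of the metric and the resulting re-ranking (the system names and numbers below are made up for illustration):

```python
def flops_per_watt(sustained_gflops, power_watts):
    """Green500 efficiency metric: sustained benchmark performance per watt."""
    return sustained_gflops / power_watts

# hypothetical systems: (name, sustained GFLOPS, measured power in watts)
systems = [
    ("SystemA", 1000.0, 500.0),  # faster in absolute terms
    ("SystemB",  900.0, 300.0),  # slower, but far more efficient
]

# rank by efficiency, most efficient first -- SystemB overtakes SystemA
ranked = sorted(systems, key=lambda s: flops_per_watt(s[1], s[2]), reverse=True)
```

The inversion between absolute-performance rank and efficiency rank in this toy example is exactly the phenomenon the list was created to surface.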
Development of an oceanographic application in HPC
High Performance Computing (HPC) is used for running advanced application programs
efficiently, reliably, and quickly.
In earlier decades, the performance of HPC applications was evaluated in terms of speed, thread scalability, and memory hierarchy. Now, it is also essential to consider the energy or power consumed by the system while executing an application.
In fact, high power consumption is one of the biggest problems for the High Performance Computing (HPC) community and one of the major obstacles to the design of exascale systems.
The new generations of HPC systems aim to achieve exaflop performance and will demand even more energy for processing and cooling. Nowadays, the growth of HPC systems is limited by energy issues.
Recently, many research centers have focused their attention on the automatic tuning of HPC applications, which requires a broad study of HPC applications in terms of power efficiency.
In this context, this paper studies an oceanographic application, named OceanVar, which implements a Domain Decomposition based 4D Variational model (DD-4DVar), one of the most commonly used HPC applications, evaluating not only the classic aspects of performance but also aspects related to power efficiency in different case studies.
This work was carried out at BSC (Barcelona Supercomputing Center), Spain, within the Mont-Blanc project, performing the tests first on an HCA server with Intel technology and then on the Thunder mini-cluster with ARM technology.
This thesis first explains the concept of data assimilation, the context in which it is developed, and gives a brief description of the 4DVAR mathematical model.
After a close examination of the problem, the Matlab description of the data-assimilation problem was ported to a sequential version in the C language. Secondly, after identifying the most time-consuming computational kernels, a parallel version of the application was developed in a parallel multiprocessor programming style, using the MPI (Message Passing Interface) protocol.
The experimental results show that, when running on the HCA server (an Intel architecture), the efficiency of the two most expensive functions stays at approximately 80% as the number of processes grows.
When running on the ARM architecture, specifically on the Thunder mini-cluster, the observed trend is instead a "superlinear speedup", which in our case can be explained by a more efficient use of resources (cache memory access) compared with the sequential case.
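The quantities involved are the standard parallel-performance metrics: speedup S(p) = T(1)/T(p) and efficiency E(p) = S(p)/p, with E(p) > 1 signalling a superlinear regime. A short sketch with illustrative timings (not the thesis's measurements):

```python
def speedup(t_serial, t_parallel):
    """S(p) = T(1) / T(p)."""
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, nprocs):
    """E(p) = S(p) / p; values above 1.0 indicate superlinear speedup,
    typically caused by the per-process working set fitting in cache."""
    return speedup(t_serial, t_parallel) / nprocs

# illustrative timings in seconds
t1, t16 = 160.0, 8.0
s = speedup(t1, t16)          # 20x speedup on 16 processes
e = efficiency(t1, t16, 16)   # efficiency 1.25 -> superlinear
```

Efficiency above 100% does not violate any law: the serial run pays the full cost of cache misses on the whole problem, while each parallel process works on a slice small enough to stay cache-resident.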
The second part of this paper presents an analysis of some aspects of this application that have an impact on energy efficiency.
After a brief discussion of the energy-consumption characteristics of the Thunder chip in the technological landscape, the energy consumption of the Thunder mini-cluster was measured with a power-consumption detector, the Yokogawa power meter, in order to establish an overview of the power-to-solution of this application, to be used as a baseline for subsequent analyses with other parallel styles.
Finally, a comprehensive performance evaluation, aimed at estimating the quality of the MPI parallelization, is conducted using a suitable performance tool named Paraver, developed by BSC.
Paraver is a performance analysis and visualisation tool that can be used to analyse MPI, threaded or mixed-mode programmes, and it is key to parallel profiling and to optimising code for High Performance Computing.
A set of graphical representations of these statistics makes it easy for a developer to identify performance problems. Some of the problems that can be easily identified are load-imbalanced decompositions, excessive communication overheads, and a poor average rate of floating-point operations per second.
Paraver can also report statistics based on hardware counters, which are provided by the
underlying hardware.
This project aimed to use Paraver configuration files to allow certain metrics to be
analysed for this application.
To explain the performance trend observed on the Thunder mini-cluster, traces were extracted from various case studies; the results match expectations, namely a drastic drop in cache misses from the ppn (processes per node) = 1 case to the ppn = 16 case.
This partly explains the more efficient use of cluster resources as the number of processes increases.
Parthenon -- a performance portable block-structured adaptive mesh refinement framework
On the path to exascale, the landscape of computer device architectures and corresponding programming models has become much more diverse. While various low-level performance portable programming models are available, support at the application level lags behind. To address this issue, we present the performance portable block-structured adaptive mesh refinement (AMR) framework Parthenon, derived from the well-tested and widely used Athena++ astrophysical magnetohydrodynamics code, but generalized to serve as the foundation for a variety of downstream multi-physics codes. Parthenon adopts the Kokkos programming model, and provides various levels of abstraction, from multi-dimensional variables, to packages defining and separating components, to the launching of parallel compute kernels. Parthenon allocates all data in device memory to reduce data movement, supports the logical packing of variables and mesh blocks to reduce kernel launch overhead, and employs one-sided, asynchronous MPI calls to reduce communication overhead in multi-node simulations. Using a hydrodynamics miniapp, we demonstrate weak and strong scaling on various architectures including AMD and NVIDIA GPUs, Intel and AMD x86 CPUs, IBM Power9 CPUs, as well as Fujitsu A64FX CPUs. At the largest scale on Frontier (the first TOP500 exascale machine), the miniapp reaches a total of zone-cycles/s on 9,216 nodes (73,728 logical GPUs) at ~92% weak scaling parallel efficiency (starting from a single node). In combination with being an open, collaborative project, this makes Parthenon an ideal framework to target exascale simulations in which the downstream developers can focus on their specific application rather than on the complexity of handling massively parallel, device-accelerated AMR.
Comment: 17 pages, 11 figures, accepted for publication in IJHPCA. Code available at https://github.com/parthenon-hpc-la
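The "logical packing" idea above, gathering many small mesh blocks into one buffer so a single kernel launch covers all of them, can be illustrated in pure Python. The launch overhead is modeled abstractly as a call counter, and every name here is ours, not Parthenon's API:

```python
def launch(fn, data, launch_counter):
    """Stand-in for a device kernel launch: each call pays a fixed
    overhead, which we model by incrementing a counter."""
    launch_counter[0] += 1
    return [fn(x) for x in data]

def per_block(blocks, fn, counter):
    """Naive scheme: one kernel launch per mesh block, so overhead
    grows linearly with the number of blocks."""
    return [launch(fn, b, counter) for b in blocks]

def packed(blocks, fn, counter):
    """Packed scheme: flatten all blocks into one buffer, launch once,
    then slice the result back into per-block views."""
    sizes = [len(b) for b in blocks]
    flat = launch(fn, [x for b in blocks for x in b], counter)
    out, i = [], 0
    for n in sizes:
        out.append(flat[i:i + n])
        i += n
    return out
```

Both schemes produce identical results; the packed variant simply amortizes the fixed launch cost over all blocks, which matters most when blocks are small and numerous, exactly the AMR regime.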