    XXVI IUPAP Conference on Computational Physics (CCP2014)

    The 26th IUPAP Conference on Computational Physics, CCP2014, was held in Boston, Massachusetts, during August 11-14, 2014. Almost 400 participants from 38 countries convened at the George Sherman Union at Boston University for four days of plenary and parallel sessions spanning a broad range of topics in computational physics and related areas. The first meeting in the series that developed into the annual Conference on Computational Physics (CCP) was held in 1989, also on the campus of Boston University and chaired by our colleague Claudio Rebbi. The express purpose of that meeting was to discuss the progress, opportunities and challenges of common interest to physicists engaged in computational research. The conference having returned to the site of its inception, it is interesting to reflect on the development of the field during the intervening years. Though 25 years is a short time for mankind, computational physics has taken giant leaps during these years, not only because of the enormous increases in computer power but especially because of the development of new methods and algorithms, and the growing awareness of the opportunities the new technologies and methods can offer. Computational physics now represents a "third leg" of research alongside analytical theory and experiment in almost all subfields of physics, and because of this there is also increasing specialization within the community of computational physicists. It is therefore a challenge to organize a meeting such as CCP, which must have sufficient depth in different areas to hold the interest of experts while at the same time being broad and accessible. Still, at a time when computational research continues to gain in importance, the CCP series is critical in the way it fosters cross-fertilization among fields, with many participants specifically attending in order to gain exposure to new methods in fields outside their own. As organizers and editors of these Proceedings, we are very pleased with the high quality of the papers provided by the participants. These articles represent a good cross-section of what was presented at the meeting, and it is our hope that they will not only be useful individually for their specific scientific content but will also, collectively, serve as a historical snapshot of the state of computational physics. The remainder of this Preface contains lists detailing the organizational structure of CCP2014, endorsers and sponsors of the meeting, plenary and invited talks, and a presentation of the 2014 IUPAP C20 Young Scientist Prize. We would like to take the opportunity to again thank all those who contributed to the success of CCP2014, as organizers, sponsors, presenters, exhibitors, and participants. Anders Sandvik, David Campbell, David Coker, Ying Tang

    Driving NEMO Towards Exascale: Introduction of a New Software Layer in the NEMO Stack Software

    This paper addresses scientific challenges related to high-level implementation strategies that lead NEMO to make effective use of the opportunities offered by exascale systems. We consider two software modules as proof of concept: the Sea Surface Height equation solver and the Variational Data Assimilation system, which are components of the NEMO ocean model (OPA). The advantages arising from the introduction of consolidated scientific libraries in NEMO are highlighted: these advantages concern both an improvement in software quality (with respect to parameters such as robustness, portability, resilience, etc.) and a reduction in software development time.
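
    The abstract contains no code, but the kind of kernel such a software layer would delegate to a library is easy to sketch. The self-contained C++ fragment below performs one Jacobi relaxation sweep on a 2D Poisson-like problem, the type of elliptic system a Sea Surface Height solver must invert; the function name, grid layout, and 5-point stencil are illustrative assumptions, not NEMO's actual implementation.

```cpp
#include <cstddef>
#include <vector>

// Illustrative only: one Jacobi sweep for a 2D Poisson-like problem of the
// sort that arises in a Sea Surface Height solver. Names, grid layout, and
// boundary handling are hypothetical; this is not NEMO code.
void jacobi_sweep(const std::vector<double>& u, std::vector<double>& u_new,
                  const std::vector<double>& rhs,
                  std::size_t nx, std::size_t ny, double h) {
  for (std::size_t j = 1; j + 1 < ny; ++j) {
    for (std::size_t i = 1; i + 1 < nx; ++i) {
      const std::size_t k = j * nx + i;
      // 5-point stencil: average of neighbours minus scaled right-hand side.
      u_new[k] = 0.25 * (u[k - 1] + u[k + 1] + u[k - nx] + u[k + nx]
                         - h * h * rhs[k]);
    }
  }
}
```

    In the approach the paper advocates, a hand-written loop like this would instead become a call into a consolidated, well-tested solver library, which is where the quoted gains in robustness, portability, and development time come from.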

    A Tuned and Scalable Fast Multipole Method as a Preeminent Algorithm for Exascale Systems

    Among the algorithms that are likely to play a major role in future exascale computing, the fast multipole method (FMM) appears as a rising star. Our recent work showed scaling of an FMM on GPU clusters, with problem sizes on the order of billions of unknowns. That work led to an extremely parallel FMM, scaling to thousands of GPUs or tens of thousands of CPUs. This paper reports on a campaign of performance tuning and scalability studies using multi-core CPUs, on the Kraken supercomputer. All kernels in the FMM were parallelized using OpenMP, and a test using 10^7 particles randomly distributed in a cube showed 78% efficiency on 8 threads. Tuning of the particle-to-particle kernel using SIMD instructions resulted in a 4x speed-up of the overall algorithm in single-core tests with 10^3 - 10^7 particles. Parallel scalability was studied in both strong and weak scaling. The strong scaling test used 10^8 particles and resulted in 93% parallel efficiency on 2048 processes for the non-SIMD code and 54% for the SIMD-optimized code (which was still 2x faster). The weak scaling test used 10^6 particles per process, and resulted in 72% efficiency on 32,768 processes, with the largest calculation taking about 40 seconds to evaluate more than 32 billion unknowns. This work builds up evidence for our view that FMM is poised to play a leading role in exascale computing, and we end the paper with a discussion of the features that make it a particularly favorable algorithm for the emerging heterogeneous and massively parallel architectural landscape.
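
    For readers unfamiliar with the kernel being tuned: the particle-to-particle (P2P) kernel is the direct near-field interaction sum, and it is the natural target for both OpenMP and SIMD. The C++ sketch below is a hedged illustration of its structure (the plain all-pairs loop and the names are assumptions; the paper's tuned kernel operates cell by cell over neighbour lists and uses explicit SIMD intrinsics).

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative P2P (near-field) kernel: direct particle-particle sums for the
// Laplace potential. The outer loop is parallelized over target particles with
// OpenMP, mirroring the paper's kernel-level parallelization strategy.
void p2p(const std::vector<double>& x, const std::vector<double>& y,
         const std::vector<double>& z, const std::vector<double>& q,
         std::vector<double>& phi) {
  const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(x.size());
#pragma omp parallel for schedule(static)
  for (std::ptrdiff_t i = 0; i < n; ++i) {
    double acc = 0.0;
    for (std::ptrdiff_t j = 0; j < n; ++j) {
      if (j == i) continue;  // skip self-interaction
      const double dx = x[i] - x[j];
      const double dy = y[i] - y[j];
      const double dz = z[i] - z[j];
      acc += q[j] / std::sqrt(dx * dx + dy * dy + dz * dz);
    }
    phi[i] = acc;
  }
}
```

    The structure-of-arrays layout used here is also what makes SIMD vectorization of the inner loop effective, which is where the reported 4x single-core speed-up comes from.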

    Office of Research, News & Opportunities, February 24, 2010


    To distribute or not to distribute: The question of load balancing for performance or energy

    Heterogeneous systems are nowadays a common choice on the path to exascale. Through the use of accelerators they offer outstanding energy efficiency. The programming of these devices employs the host-device model, which is suboptimal because the CPU remains idle during kernel executions but still consumes energy. Making the CPU contribute computing effort might improve the performance and energy consumption of the system. This paper analyses the advantages of this approach and establishes the limits within which it is beneficial. The claims are supported by a set of models that determine how to share a single data-parallel task between the CPU and the accelerator for optimum performance, energy consumption or efficiency. Interestingly, the models show that optimising for performance does not always yield optimum energy or efficiency as well. The paper experimentally validates the models, which represent an invaluable tool for programmers faced with the dilemma of whether to distribute their workload in these systems. This work has been supported by the University of Cantabria (CVE-2014-18166), the Spanish Science and Technology Commission (TIN2016-76635-C2-2-R), the European Research Council (G.A. No 321253) and the European HiPEAC Network of Excellence. The Mont-Blanc project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 671697.
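
    The abstract does not reproduce the models, but the core idea can be sketched: choose what fraction of a data-parallel task runs on each device, then compare time and energy against an accelerator-only baseline where the idle CPU still draws power. In the toy C++ model below, all throughputs and power draws are hypothetical parameters, not values from the paper.

```cpp
#include <cstdio>

// Toy load-balancing model (all parameters are illustrative, not from the
// paper). A fraction alpha of the work runs on the CPU and (1 - alpha) on the
// accelerator; both run concurrently, so the time-optimal split equalises the
// two finish times: alpha * W / Tc == (1 - alpha) * W / Ta.
struct Device { double throughput; double active_power; double idle_power; };

int main() {
  const Device cpu = {100.0, 80.0, 25.0};  // work-units/s, watts (hypothetical)
  const Device acc = {400.0, 150.0, 30.0};
  const double work = 1e4;                 // total work-units (hypothetical)

  // Time-optimal split and its time/energy cost (both devices active).
  const double alpha = cpu.throughput / (cpu.throughput + acc.throughput);
  const double t_split = alpha * work / cpu.throughput;  // == accelerator time
  const double e_split = t_split * (cpu.active_power + acc.active_power);

  // Accelerator-only baseline: the CPU idles but still consumes power.
  const double t_acc = work / acc.throughput;
  const double e_acc = t_acc * (acc.active_power + cpu.idle_power);

  std::printf("split: %.1f s, %.0f J  |  acc-only: %.1f s, %.0f J\n",
              t_split, e_split, t_acc, e_acc);
  return 0;
}
```

    With these particular numbers the split finishes faster (20 s vs. 25 s) but costs more energy (4600 J vs. 4375 J) than the accelerator alone, which is exactly the performance-versus-energy tension the paper explores.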

    The Green500 List: Escapades to Exascale

    Energy efficiency is now a top priority. The first four years of the Green500 have seen the importance of energy efficiency in supercomputing grow from an afterthought to the forefront of innovation as we near a point where systems will be forced to stop drawing more power. Even so, the landscape of efficiency in supercomputing continues to shift, with new trends emerging and unexpected shifts in previous predictions. This paper offers an in-depth analysis of the new and shifting trends in the Green500. In addition, the analysis offers early indications of the track we are taking toward exascale, and what an exascale machine in 2018 is likely to look like. Lastly, we discuss the new efforts and collaborations toward designing and establishing better metrics, methodologies and workloads for the measurement and analysis of energy-efficient supercomputing.
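
    The ranking metric itself is simple: sustained benchmark performance divided by average power during the run, reported in MFLOPS/W. A minimal sketch follows; the numbers are placeholders, not entries from any actual list.

```cpp
#include <cstdio>

// Green500-style efficiency metric: sustained FLOPS per watt.
// Both input values below are placeholders, not actual list entries.
int main() {
  const double rmax_gflops = 1.0e6;  // hypothetical sustained Linpack Rmax (GFLOPS)
  const double avg_power_w = 2.0e6;  // hypothetical average power during the run (W)
  const double mflops_per_w = rmax_gflops * 1000.0 / avg_power_w;
  std::printf("efficiency: %.1f MFLOPS/W\n", mflops_per_w);  // 500.0 here
  return 0;
}
```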

    Development of an oceanographic application in HPC

    High Performance Computing (HPC) is used for running advanced application programs efficiently, reliably, and quickly. In earlier decades, performance analysis of HPC applications was based on speed, thread scalability, and the memory hierarchy. Now, it is also essential to consider the energy or power consumed by the system while executing an application. In fact, high power consumption is one of the biggest problems for the High Performance Computing community and one of the major obstacles to exascale system design. The new generations of HPC systems aim to achieve exaflop performance and will demand even more energy for processing and cooling; nowadays, the growth of HPC systems is limited by energy issues. Recently, many research centers have focused their attention on automatic tuning of HPC applications, which requires a wide study of HPC applications in terms of power efficiency. In this context, this paper proposes the study of an oceanographic application, named OceanVar, which implements a Domain Decomposition based 4D Variational model (DD-4DVar) and is one of the most commonly used HPC applications, evaluating not only the classic aspects of performance but also aspects related to power efficiency in different case studies. This work was carried out at BSC (Barcelona Supercomputing Center), Spain, within the Mont-Blanc project, performing the tests first on an HCA server with Intel technology and then on the Thunder mini-cluster with ARM technology. The thesis first explains the concept of data assimilation, the context in which it is developed, and briefly describes the 4DVar mathematical model. After this close examination of the problem, the data-assimilation problem was ported from its MATLAB description to a sequential version in the C language. Secondly, after identifying the most time-consuming computational kernels, a parallel version of the application was developed in a multiprocessor programming style using the MPI (Message Passing Interface) protocol. The experimental results show that, when running on the HCA server (an Intel architecture), the efficiency of the two most expensive functions remains at approximately 80% as the number of processes grows. When running on the ARM architecture, specifically on the Thunder mini-cluster, the trend obtained is instead a "superlinear speedup", which in our case can be explained by a more efficient use of resources (cache memory access) compared with the sequential case. The second part of this work presents an analysis of some aspects of this application that affect energy efficiency. After a brief discussion of the energy consumption characteristics of the Thunder chip in the technological landscape, the energy consumption of the Thunder mini-cluster was measured with a power consumption detector, the Yokogawa Power Meter, in order to build an overview of the energy-to-solution of this application, to be used as a baseline for subsequent analyses with other parallel styles. Finally, a comprehensive performance evaluation, targeted at estimating the quality of the MPI parallelization, is conducted using a suitable performance tool named Paraver, developed by BSC.
Paraver is a performance analysis and visualisation tool that can be used to analyse MPI, threaded or mixed-mode programmes, and it is key to profiling parallel codes and optimising them for High Performance Computing. A set of graphical representations of these statistics makes it easy for a developer to identify performance problems, such as load-imbalanced decompositions, excessive communication overheads, and poor average floating-point operations per second. Paraver can also report statistics based on hardware counters provided by the underlying hardware. This project used Paraver configuration files to allow certain metrics to be analysed for this application. To explain the performance trend observed on the Thunder mini-cluster, traces were extracted from various case studies, and the results are as expected: a drastic drop in cache misses from the case ppn (processes per node) = 1 to the case ppn = 16, which in turn explains the more efficient use of cluster resources as the number of processes increases.
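
    The MPI parallelization is described only at a high level. As a hedged illustration of the pattern it implies (distribute grid slices across ranks, compute locally, combine a global quantity with a reduction), here is a minimal self-contained C++/MPI sketch; the kernel, names, and decomposition are assumptions for illustration, not OceanVar's actual code.

```cpp
#include <mpi.h>
#include <cstdio>
#include <vector>

// Illustrative MPI domain decomposition: each rank owns a contiguous slice of
// a 1D field, computes a local cost-function contribution, and the partial
// sums are combined with MPI_Allreduce. Not OceanVar's actual code.
int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank = 0, size = 1;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  const long n_global = 1 << 20;         // hypothetical global grid size
  const long n_local = n_global / size;  // assume size divides n_global

  // Each rank fills its own slice (a stand-in for reading its subdomain).
  std::vector<double> local(n_local);
  for (long i = 0; i < n_local; ++i)
    local[i] = 1.0 / (1.0 + rank * n_local + i);

  // Local piece of a misfit-like sum of squares, then a global reduction.
  double local_sum = 0.0;
  for (double v : local) local_sum += v * v;
  double global_sum = 0.0;
  MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
                MPI_COMM_WORLD);

  if (rank == 0) std::printf("global cost term: %.6f\n", global_sum);
  MPI_Finalize();
  return 0;
}
```

    Traces of exactly this kind of compute/communicate structure are what Paraver visualises when diagnosing load imbalance and communication overhead.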

    Parthenon -- a performance portable block-structured adaptive mesh refinement framework

    On the path to exascale, the landscape of computer device architectures and corresponding programming models has become much more diverse. While various low-level performance-portable programming models are available, support at the application level lags behind. To address this issue, we present the performance-portable block-structured adaptive mesh refinement (AMR) framework Parthenon, derived from the well-tested and widely used Athena++ astrophysical magnetohydrodynamics code, but generalized to serve as the foundation for a variety of downstream multi-physics codes. Parthenon adopts the Kokkos programming model and provides various levels of abstraction, from multi-dimensional variables, to packages defining and separating components, to the launching of parallel compute kernels. Parthenon allocates all data in device memory to reduce data movement, supports the logical packing of variables and mesh blocks to reduce kernel launch overhead, and employs one-sided, asynchronous MPI calls to reduce communication overhead in multi-node simulations. Using a hydrodynamics miniapp, we demonstrate weak and strong scaling on various architectures including AMD and NVIDIA GPUs, Intel and AMD x86 CPUs, IBM Power9 CPUs, as well as Fujitsu A64FX CPUs. At the largest scale on Frontier (the first TOP500 exascale machine), the miniapp reaches a total of 1.7×10^13 zone-cycles/s on 9,216 nodes (73,728 logical GPUs) at ~92% weak scaling parallel efficiency (starting from a single node). In combination with being an open, collaborative project, this makes Parthenon an ideal framework to target exascale simulations in which the downstream developers can focus on their specific application rather than on the complexity of handling massively parallel, device-accelerated AMR. Comment: 17 pages, 11 figures, accepted for publication in IJHPCA. Codes available at https://github.com/parthenon-hpc-la
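
    Parthenon's performance portability rests on the Kokkos programming model, whose core idea fits in a few lines: write the loop body once as a lambda and let the configured backend (CUDA, HIP, OpenMP, ...) decide where it runs. The minimal C++ sketch below uses the public Kokkos API; the axpy-style kernel is an illustrative stand-in, far simpler than Parthenon's packed, multi-block AMR kernels.

```cpp
#include <Kokkos_Core.hpp>
#include <cstdio>

// Minimal Kokkos example: the same source compiles into CPU or GPU kernels
// depending on the configured backend, which is the basis of Parthenon's
// performance portability. The kernel itself is illustrative only.
int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    const int n = 1 << 20;
    // Views live in the default execution space's memory
    // (device memory when a GPU backend is enabled).
    Kokkos::View<double*> x("x", n), y("y", n);

    Kokkos::parallel_for("init", n, KOKKOS_LAMBDA(const int i) {
      x(i) = 1.0;
      y(i) = 2.0;
    });

    double dot = 0.0;
    Kokkos::parallel_reduce("dot", n, KOKKOS_LAMBDA(const int i, double& sum) {
      sum += x(i) * y(i);
    }, dot);

    std::printf("dot = %.1f\n", dot);  // expect 2.0 * n
  }
  Kokkos::finalize();
  return 0;
}
```

    Keeping data resident in device memory, as the Views do here, is the same design choice the paper highlights for reducing host-device data movement.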