A highly scalable Met Office NERC Cloud model
Large Eddy Simulation is a critical modelling tool for scientists investigating atmospheric flows, turbulence and cloud microphysics. Within the UK, the principal LES model used by the atmospheric research community is the Met Office Large Eddy Model (LEM). The LEM was originally developed in the late 1980s using the computational techniques and assumptions of the time, which means that it does not scale beyond 512 cores. In this paper we present the Met Office NERC Cloud model, MONC, which is a re-write of the existing LEM. We discuss the software engineering and architectural decisions made in order to develop a flexible, extensible model which the community can easily customise for their own needs. The scalability of MONC is evaluated, along with numerous additional customisations made to further improve performance at large core counts. The result of this work is a model which delivers significant new scientific modelling capability to the community, taking advantage of current and future generations of HPC machines.
Type oriented parallel programming for Exascale
Whilst there have been great advances in HPC hardware and software in recent
years, the languages and models that we use to program these machines have
remained much more static. This is not for lack of effort, but because the
foundations that many programming languages are built on are not sufficient
for the level of expressivity that parallel work requires. The result is an
implicit trade-off between programmability and performance, made worse by the
fact that, whilst many scientific users are experts within their own fields,
they are not HPC experts.
Type oriented programming looks to address this by encoding the complexity of
a language via the type system. Most of the language functionality is contained
within a loosely coupled type library that can be flexibly used to control many
aspects such as parallelism. Due to the high-level nature of this approach,
much information is available during compilation which can be used for
optimisation and, in the absence of type information, the compiler can apply
sensible defaults, thus supporting both the expert programmer and the novice
alike.
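
To make the idea concrete, consider the minimal C++ analogy below; the paper's own language is not C++, and every name here is an illustrative assumption rather than the paper's API. A tag type carried in the call selects the parallelisation strategy, and a sequential default applies when the programmer supplies no type information at all:

    #include <iostream>
    #include <numeric>
    #include <thread>
    #include <vector>

    // Tag types standing in for the "type information" that controls parallelism.
    struct Sequential {};   // the sensible default
    struct TwoThreads {};   // extra type information an expert adds to tune

    double reduce(const std::vector<double>& v, Sequential) {
        return std::accumulate(v.begin(), v.end(), 0.0);
    }

    double reduce(const std::vector<double>& v, TwoThreads) {
        auto mid = v.begin() + static_cast<long>(v.size() / 2);
        double lo = 0.0;
        std::thread t([&] { lo = std::accumulate(v.begin(), mid, 0.0); });
        double hi = std::accumulate(mid, v.end(), 0.0);
        t.join();
        return lo + hi;
    }

    // With no type information given, a sensible default is applied,
    // supporting the novice; the expert tunes by adding a type.
    double reduce(const std::vector<double>& v) { return reduce(v, Sequential{}); }

    int main() {
        std::vector<double> v(1000000, 1.0);
        std::cout << reduce(v) << "\n";                 // novice: defaults applied
        std::cout << reduce(v, TwoThreads{}) << "\n";   // expert: tuned via types
    }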
We demonstrate that, at no performance or scalability penalty when running on
up to 8196 cores of a Cray XE6 system, codes written in this type oriented
manner provide improved programmability. The programmer is able to write
simple, implicit parallel, HPC code at a high level and then explicitly tune by
adding additional type information if required.Comment: As presented at the Exascale Applications and Software Conference
(EASC), 9th-11th April 201
NVIDIA Tensor Core Programmability, Performance & Precision
The NVIDIA Volta GPU microarchitecture introduces a specialized unit, called
the "Tensor Core", that performs one matrix-multiply-and-accumulate on 4x4
matrices per clock cycle. The NVIDIA Tesla V100 accelerator, featuring the
Volta microarchitecture, provides 640 Tensor Cores with a theoretical peak
performance of 125 Tflop/s in mixed precision. In this paper, we investigate
current approaches to programming NVIDIA Tensor Cores, their performance, and
the precision loss due to computation in mixed precision.
Currently, NVIDIA provides three different ways of programming
matrix-multiply-and-accumulate on Tensor Cores: the CUDA Warp Matrix Multiply
Accumulate (WMMA) API; CUTLASS, a templated library based on WMMA; and cuBLAS
GEMM. After experimenting with the different approaches, we found that NVIDIA
Tensor Cores can deliver up to 83 Tflop/s in mixed precision on a Tesla V100
GPU, seven and three times the performance achievable in single and half
precision respectively. A WMMA implementation of batched GEMM reaches a
performance of 4 Tflop/s. While the precision loss due to matrix
multiplication with half-precision input might be critical in many HPC
applications, it can be considerably reduced at the cost of increased
computation. Our results indicate that HPC applications using matrix
multiplications can strongly benefit from using NVIDIA Tensor Cores.
Comment: This paper has been accepted by the Eighth International Workshop on
Accelerators and Hybrid Exascale Systems (AsHES) 201
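
To illustrate the cuBLAS route named above, the hedged host-code sketch below (C++) requests Tensor Core dispatch for a mixed-precision GEMM with FP16 inputs and FP32 accumulation; the matrix size and the omitted initialization are assumptions, while the calls are the CUDA 9/10-era cuBLAS API:

    #include <cublas_v2.h>
    #include <cuda_fp16.h>
    #include <cuda_runtime.h>

    int main() {
        const int n = 4096;          // assumed square problem size
        cublasHandle_t handle;
        cublasCreate(&handle);
        // Allow cuBLAS to dispatch the GEMM to Tensor Cores.
        cublasSetMathMode(handle, CUBLAS_TENSOR_OP_MATH);

        __half *dA, *dB;             // half-precision inputs
        float *dC;                   // single-precision accumulator/output
        cudaMalloc(&dA, sizeof(__half) * n * n);
        cudaMalloc(&dB, sizeof(__half) * n * n);
        cudaMalloc(&dC, sizeof(float) * n * n);
        // ... fill dA and dB with data before calling the GEMM ...

        const float alpha = 1.0f, beta = 0.0f;
        // FP16 inputs with FP32 accumulation: the mixed-precision mode
        // the abstract benchmarks on the V100.
        cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                     &alpha, dA, CUDA_R_16F, n, dB, CUDA_R_16F, n,
                     &beta,  dC, CUDA_R_32F, n,
                     CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP);

        cudaFree(dA); cudaFree(dB); cudaFree(dC);
        cublasDestroy(handle);
        return 0;
    }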
GGDML: icosahedral models language extensions
Compilers do not exploit all of the optimization opportunities in a code base; some optimizations must be carried out within the source code itself, so performance is lost whenever developers skip such details. Developing performance-demanding software, e.g. climate models, in a general-purpose language therefore needs extra care from the developers, who must take the hardware details of the target machine into account.
Besides, code written for high performance on one machine will perform worse on another. Developers usually write multiple optimized sections, or even whole code versions, for the different target machines. Such codes are complex and hard to maintain.
In this article we introduce a higher-level code development approach, where we develop a set of extensions to the language that is used to write a model’s code. Our extensions form a domain-specific language (DSL) that abstracts domain concepts and leaves the lower level details to a configurable source-to-source translation process.
The purpose of the developed extensions is to support the development of icosahedral climate/atmospheric models. We started with three icosahedral models: DYNAMICO, ICON, and NICAM. Collaboration with scientists from the weather/climate sciences enabled agreed-upon extensions. When suggesting an extension, we kept in mind that it should represent a higher-level, domain-based concept and carry no lower-level details.
The introduced DSL (GGDML, the General Grid Definition and Manipulation Language) hides optimization details such as memory layout. It reduces a model's code size to less than one third of its original line count, significantly reducing the development costs of a model.
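
As an illustration of that abstraction boundary, the sketch below paraphrases (rather than quotes) a GGDML-style statement and shows one loop nest a configured source-to-source translation might emit in C++; all names are assumptions:

    // GGDML-style source (hypothetical rendering, shown as a comment):
    //   foreach c in grid
    //     flux[c] = coeff * (T[c.east_neighbor] - T[c]);
    //
    // One possible CPU translation, where the configuration has fixed the
    // memory layout (flat arrays indexed by cell) and the loop structure:
    void compute_flux(const double* T, double* flux, double coeff,
                      int ncells, const int* east_neighbor) {
        for (int c = 0; c < ncells; ++c)
            flux[c] = coeff * (T[east_neighbor[c]] - T[c]);
    }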
Performance portability of Earth system models with user-controlled GGDML code translation
The increasing need for performance in earth system modelling and other scientific domains pushes computing technology in diverse architectural directions. Developing models requires technical expertise and skill with tools that can exploit the hardware's capabilities, and the heterogeneity of architectures complicates both the development and the maintainability of the models. To improve the software development process of earth system models, we provide an approach that simplifies code maintainability by fostering separation of concerns while providing performance portability. We propose the use of high-level language extensions that reflect scientific concepts. Scientists can develop models in the programming language of their choice and use the language extensions optionally wherever they are needed. The code translation is driven by configurations that are separated from the model source code; these configurations are prepared by scientific programmers to make optimal use of the machine's features. The main contribution of this paper is the demonstration of a user-controlled source-to-source translation technique for earth system models written with higher-level semantics. We discuss a flexible code translation technique driven by the users through a configuration input prepared especially to transform the code, and we use this technique to produce OpenMP- or OpenACC-enabled codes, alongside MPI, to support multi-node configurations.
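
A minimal sketch of this configuration-driven translation, assuming a trivial field update as the high-level source: the same statement is expanded once for a multicore target with OpenMP and once for a GPU target with OpenACC (the pragmas are standard; the function names and the loop are assumptions):

    // Translation selected when the configuration targets multicore CPUs:
    void step_openmp(double* f, const double* g, int n) {
        #pragma omp parallel for
        for (int i = 0; i < n; ++i) f[i] += g[i];
    }

    // Translation of the same high-level source for a GPU target:
    void step_openacc(double* f, const double* g, int n) {
        #pragma acc parallel loop copy(f[0:n]) copyin(g[0:n])
        for (int i = 0; i < n; ++i) f[i] += g[i];
    }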
Iteration-fusing conjugate gradient for sparse linear systems with MPI + OmpSs
In this paper, we target the parallel solution of sparse linear systems via an iterative Krylov subspace-based method enhanced with a block-Jacobi preconditioner, on a cluster of multicore processors. In order to tackle large-scale problems, we develop task-parallel implementations of the preconditioned conjugate gradient method that improve the interoperability between the message-passing interface (MPI) and the OmpSs programming models. Specifically, we progressively integrate several communication-reduction and iteration-fusing strategies into the initial code, obtaining more efficient versions of the method. For all these implementations, we analyze the communication patterns and perform a comparative analysis of their performance and scalability on a cluster consisting of 32 nodes with 24 cores each. The experimental analysis shows that the techniques described in the paper outperform the classical method by a margin that varies between 6% and 48%, depending on the evaluation.
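
The hedged C++ sketch below conveys the flavor of such an implementation: OmpSs dependence clauses expose the local preconditioner application and sparse matrix-vector product as tasks, while an MPI reduction computes the global dot product whose cost the communication-reduction and iteration-fusing strategies target. The kernel names, array-section details and partitioning are assumptions, not the paper's code:

    #include <mpi.h>

    // Assumed local kernels (declarations only).
    void spmv(const double* A, const double* p, double* q, int nlocal);
    void apply_block_jacobi(const double* Minv, const double* r, double* z,
                            int nlocal);

    void pcg_step(const double* A, const double* Minv,
                  double* r, double* z, double* p, double* q,
                  int nlocal, double* rho) {
        // OmpSs tasks with dataflow dependences ([start;length] sections).
        #pragma omp task in(r[0;nlocal]) out(z[0;nlocal])
        apply_block_jacobi(Minv, r, z, nlocal);   // local preconditioner task

        #pragma omp task in(p[0;nlocal]) out(q[0;nlocal])
        spmv(A, p, q, nlocal);                    // local SpMV task

        #pragma omp taskwait                      // wait for the local tasks
        double local = 0.0;
        for (int i = 0; i < nlocal; ++i) local += r[i] * z[i];
        // Global reduction across nodes: the communication that the
        // iteration-fusing variants aim to reduce or overlap.
        MPI_Allreduce(&local, rho, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    }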