Comparative evaluation of bandwidth-bound applications on the Intel Xeon CPU MAX Series
In this paper we explore the performance of the Intel Xeon CPU MAX Series,
representing the most significant new variation upon the classical CPU
architecture since the Intel Xeon Phi Processor. Given the availability of a
large on-package high-bandwidth memory, the bandwidth-to-compute ratio has
significantly shifted compared to other CPUs on the market. Since a large
fraction of HPC workloads are sensitive to the available bandwidth, we explore
how this architecture performs on a selection of HPC proxies and applications
that are mostly sensitive to bandwidth, and how it compares to the previous 3rd
generation Intel Xeon Scalable processors (codenamed Ice Lake) and an AMD EPYC
7003 Series Processor with 3D V-Cache Technology (codenamed Milan-X). We
explore performance with different parallel implementations (MPI, MPI+OpenMP,
MPI+SYCL), compiled with different compilers and flags, and executed with or
without hyperthreading. We show how performance bottlenecks shift from
bandwidth to communication latencies for some applications, and demonstrate
speedups of 2.0x to 4.3x over the previous generation.
Automatic parallel implementations of adjoint codes for structured mesh applications
Algorithmic Differentiation (AD) has been shown to be an essential tool for obtaining sensitivity information in multiple areas of science, such as Computational Fluid Dynamics (CFD) applications or finance. Yet no adequate tool exists to reduce the cost of producing performance-portable AD codes, especially for modern hardware like GPU clusters. This paper sketches our plans and progress so far in extending the OPS framework with an adjoint tape (storage for descriptors of intermediate steps and intermediate states of variables) and shows preliminary performance results on CPU nodes. OPS (Oxford Parallel library for Structured mesh solvers) has shown good performance and scaling on a wide range of HPC architectures. Our work aims to exploit the benefits of OPS to provide performance-portable adjoint implementations for future structured mesh stencil applications using OPS with minimal modifications.
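To illustrate what an adjoint tape stores, here is a minimal sketch of tape-based reverse-mode AD in Python. It is a generic illustration, not the OPS API or the authors' implementation: the `Tape` class and its `variable`, `add`, `mul` and `gradient` methods are hypothetical names chosen for this example. The forward pass records, for every intermediate step, its value and its local partial derivatives; the reverse pass walks the tape backwards to accumulate sensitivities.

```python
class Tape:
    """Hypothetical minimal adjoint tape (illustration only, not the OPS API)."""

    def __init__(self):
        self.values = []  # intermediate states of variables, indexed by id
        self.ops = []     # step descriptors: (output id, [(input id, partial)])

    def variable(self, value):
        # Register an independent input and return its id.
        self.values.append(float(value))
        return len(self.values) - 1

    def _record(self, value, partials):
        # Store the result of one step together with its local derivatives.
        out = self.variable(value)
        self.ops.append((out, partials))
        return out

    def add(self, a, b):
        return self._record(self.values[a] + self.values[b],
                            [(a, 1.0), (b, 1.0)])

    def mul(self, a, b):
        return self._record(self.values[a] * self.values[b],
                            [(a, self.values[b]), (b, self.values[a])])

    def gradient(self, output):
        # Reverse sweep: propagate adjoints from the output back to the inputs.
        adjoint = [0.0] * len(self.values)
        adjoint[output] = 1.0
        for out, partials in reversed(self.ops):
            for inp, partial in partials:
                adjoint[inp] += partial * adjoint[out]
        return adjoint


# f(x, y) = x * y + x at x = 3, y = 4
tape = Tape()
x = tape.variable(3.0)
y = tape.variable(4.0)
f = tape.add(tape.mul(x, y), x)
grads = tape.gradient(f)
print(tape.values[f], grads[x], grads[y])
```

Here df/dx = y + 1 and df/dy = x, so a single reverse sweep yields both sensitivities regardless of how many inputs there are, at the cost of storing the tape.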
Bitwise Reproducible task execution on unstructured mesh applications
Many mesh applications use floating-point arithmetic, which does not necessarily obey the associative law of algebra. This can make the results of an application unreproducible. In this paper we present work on developing a method for unstructured mesh applications that provides bitwise reproducibility between separate runs, even if they are started with different numbers of MPI processes. We implement our work in the OP2 domain-specific library, which provides an API that abstracts the solution of unstructured mesh computations. We carry out a performance analysis of our method applied to two applications: a simple airfoil application, and a more complex Aero application which uses a finite element method and a conjugate-gradient algorithm. We show a 2.37× to 1.49× slowdown on these applications as the price for full bitwise reproducibility.
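The effect described above can be reproduced in a few lines of Python. The sketch below is not the method of the paper or part of OP2. It only shows that splitting a sum into a different number of contiguous chunks, as a reduction over a different number of processes would, can change the result. The helper names `naive_sum` and `partitioned_sum` are hypothetical, and `math.fsum` from the standard library stands in for an order-independent summation.

```python
import math


def naive_sum(values):
    # Plain left-to-right accumulation in double precision.
    total = 0.0
    for v in values:
        total += v
    return total


def partitioned_sum(values, parts):
    # Sum `parts` contiguous chunks separately, then add the partial sums,
    # mimicking a reduction over `parts` processes.
    n = len(values)
    bounds = [n * i // parts for i in range(parts + 1)]
    partials = [naive_sum(values[bounds[i]:bounds[i + 1]])
                for i in range(parts)]
    return naive_sum(partials)


# Floating-point addition is not associative.
not_associative = (0.1 + 0.2) + 0.3 != 0.1 + (0.2 + 0.3)

values = [1e16, 1.0, -1e16, 1.0]    # the exact sum is 2.0
one_part = partitioned_sum(values, 1)
two_parts = partitioned_sum(values, 2)

# math.fsum tracks exact partial sums, so the order of the inputs
# does not affect its result.
exact_forward = math.fsum(values)
exact_reversed = math.fsum(reversed(values))

print(not_associative, one_part, two_parts, exact_forward, exact_reversed)
```

The two partitionings give different results for the same data, while the order-independent sum does not depend on how the values are arranged. Making every reduction order-independent costs extra work, which is consistent with the slowdown the abstract reports as the price of reproducibility.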
Detection and characterization of DNA damage
The DNA molecule is constantly subjected to endogenous and exogenous sources of damage which, if left unrepaired, can lead to genotoxic and cytotoxic outcomes. These lesions have been implicated in the development of numerous diseases, carcinogenesis, and aging. This research focuses on the formation of such lesions and, unlike current research, demonstrates the potential of combining multiple analytical techniques to characterize a potential damage detection system.
- …