6 research outputs found

    Using AVX2 Instruction Set to Increase Performance of High Performance Computing Code

    In this paper we discuss the new Intel instruction set extension, Intel Advanced Vector Extensions 2 (AVX2), and what it brings to high performance computing (HPC). To illustrate this, new systems supporting AVX2 are evaluated to demonstrate how to exploit AVX2 effectively for HPC types of code, and to expose situations where AVX2 might not be the most effective way to increase performance.
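
    A minimal sketch of the kind of kernel AVX2 targets (illustrative only, not code from the paper): the loop below computes a double-precision AXPY with 256-bit fused multiply-add intrinsics, one of the headline features AVX2-era CPUs bring to HPC code. The function name and the assumption that n is a multiple of 4 are ours.

        #include <immintrin.h>

        /* y[i] += a * x[i]; processes 4 doubles per iteration.
           Compile with -mavx2 -mfma; assumes n % 4 == 0 for brevity. */
        void daxpy_avx2(int n, double a, const double *x, double *y)
        {
            __m256d va = _mm256_set1_pd(a);           /* broadcast scalar a */
            for (int i = 0; i < n; i += 4) {
                __m256d vx = _mm256_loadu_pd(&x[i]);  /* load 4 doubles of x */
                __m256d vy = _mm256_loadu_pd(&y[i]);
                vy = _mm256_fmadd_pd(va, vx, vy);     /* vy = va*vx + vy */
                _mm256_storeu_pd(&y[i], vy);
            }
        }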

    Effective Implementation of DGEMM on Modern Multicore CPU

    In this paper we present a detailed study on tuning double-precision matrix-matrix multiplication (DGEMM) on the Intel Xeon E5-2680 CPU. We selected an optimal algorithm from the instruction set perspective, as well as software tools optimized for Intel Advanced Vector Extensions (AVX). Our optimizations included the use of vector memory operations and AVX instructions. Our proposed algorithm achieves a performance improvement of 33% compared to the latest results achieved using the Intel Math Kernel Library DGEMM subroutine.
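
    The register-blocking idea behind tuned DGEMM kernels can be sketched as follows (a hedged illustration under assumed packed panel layouts, not the paper's actual algorithm): a small block of C is held in AVX registers while panels of A and B stream through multiply-add operations. The E5-2680 supports AVX but not FMA, so separate multiply and add intrinsics are used here.

        #include <immintrin.h>

        /* Illustrative 4x4 micro-kernel: C(4x4) += A_panel * B_panel.
           A is packed so position 4*p+i holds A[i][p]; B is packed so
           row p holds B[p][0..3]; ldc is the leading dimension of C. */
        void dgemm_ukernel_4x4(int k, const double *A, const double *B,
                               double *C, int ldc)
        {
            __m256d c0 = _mm256_loadu_pd(&C[0 * ldc]);  /* row 0 of C block */
            __m256d c1 = _mm256_loadu_pd(&C[1 * ldc]);
            __m256d c2 = _mm256_loadu_pd(&C[2 * ldc]);
            __m256d c3 = _mm256_loadu_pd(&C[3 * ldc]);
            for (int p = 0; p < k; ++p) {
                __m256d b = _mm256_loadu_pd(&B[4 * p]); /* row p of B panel */
                c0 = _mm256_add_pd(c0, _mm256_mul_pd(_mm256_set1_pd(A[4 * p + 0]), b));
                c1 = _mm256_add_pd(c1, _mm256_mul_pd(_mm256_set1_pd(A[4 * p + 1]), b));
                c2 = _mm256_add_pd(c2, _mm256_mul_pd(_mm256_set1_pd(A[4 * p + 2]), b));
                c3 = _mm256_add_pd(c3, _mm256_mul_pd(_mm256_set1_pd(A[4 * p + 3]), b));
            }
            _mm256_storeu_pd(&C[0 * ldc], c0);
            _mm256_storeu_pd(&C[1 * ldc], c1);
            _mm256_storeu_pd(&C[2 * ldc], c2);
            _mm256_storeu_pd(&C[3 * ldc], c3);
        }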

    InterCriteria Analysis of ACO Start Strategies

    Optimizing throughput of Seq2Seq model training on the IPU platform for AI-accelerated CFD simulations

    Intelligence Processing Units (IPUs) have proven useful for many AI applications. In this paper, we evaluate them within the emerging field of AI for simulation, where traditional numerical simulations are supported by artificial intelligence approaches. We focus specifically on a program for training machine learning models that support a computational fluid dynamics application. We use the custom TensorFlow provided by the Poplar Software Development Kit to adapt the program to the IPU-POD16 platform and investigate its ease of use and performance scalability. Training a model on data from OpenFOAM simulations allows us to obtain accurate simulation state predictions at test time. We describe how to optimize multi-threading runtime options and how to use the popdist library to overcome a performance bottleneck in feeding training data to the IPU on the host side. Due to communication overheads, using data parallelism to utilize two IPUs instead of one does not improve throughput. However, once these intra-IPU costs have been paid, the hardware capabilities for inter-IPU communication allow for good scalability: increasing the number of IPUs from two to 16 improves throughput from 560.8 to 2805.8 samples/s. Additionally, the experimental results show that reducing the precision of input data storage from FP32 to FP16 improves training throughput by 12%, while tuning selected runtime variables improves it by up to 6.3%.
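
    One way to picture the FP16 input-storage optimization (a rough C illustration of the data-layout idea only; the paper's actual pipeline is built on TensorFlow and the Poplar SDK): converting FP32 training samples to half precision on the host halves the bytes that must be streamed to the device per sample. The function name and the divisibility assumption are ours.

        #include <immintrin.h>
        #include <stdint.h>

        /* Convert n FP32 values to FP16 storage (F16C, 8 values at a time),
           halving host-to-device transfer volume. Assumes n % 8 == 0. */
        void pack_fp16(int n, const float *src, uint16_t *dst)
        {
            for (int i = 0; i < n; i += 8) {
                __m256 v = _mm256_loadu_ps(&src[i]);
                __m128i h = _mm256_cvtps_ph(v, _MM_FROUND_TO_NEAREST_INT);
                _mm_storeu_si128((__m128i *)&dst[i], h);
            }
        }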

    Adaptation of MPDATA Heterogeneous Stencil Computation to Intel Xeon Phi Coprocessor

    The multidimensional positive definite advection transport algorithm (MPDATA) belongs to the group of nonoscillatory forward-in-time algorithms and performs a sequence of stencil computations. MPDATA is one of the major parts of the dynamic core of the EULAG geophysical model. In this work, we outline an approach to adapting the 3D MPDATA algorithm to the Intel MIC architecture. In order to utilize the available computing resources, we propose a (3 + 1)D decomposition of the MPDATA heterogeneous stencil computations. This approach is based on a combination of loop tiling and loop fusion, which eases memory/communication bounds and better exploits the theoretical floating-point efficiency of the target computing platforms. An important method of improving the efficiency of the (3 + 1)D decomposition is partitioning the available cores/threads into work teams, which reduces inter-cache communication overheads. This method also increases opportunities for efficiently distributing the MPDATA computation onto the available resources of the Intel MIC architecture, as well as Intel CPUs. We discuss preliminary performance results obtained on two hybrid platforms, each containing two CPUs and an Intel Xeon Phi. The top-of-the-line Intel Xeon Phi 7120P gives the best performance results, executing MPDATA almost 2 times faster than two Intel Xeon E5-2697v2 CPUs.
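
    The tiling-plus-fusion idea can be made concrete with a simplified sketch (not the actual MPDATA kernels; the stencil, tile sizes, and halo handling are stripped-down assumptions): two dependent sweeps are executed back to back within each 3D tile, so the intermediate array is consumed while it is still cache-resident.

        /* Two dependent sweeps fused within one TI x TJ x TK tile.
           Inter-stage halo handling is omitted for brevity. */
        #define TI 8
        #define TJ 8
        #define TK 64

        void fused_tiled_sweeps(int ni, int nj, int nk,
                                const double *in, double *tmp, double *out)
        {
            #define IDX(i, j, k) (((i) * nj + (j)) * nk + (k))
            for (int ii = 1; ii < ni - 1; ii += TI)
            for (int jj = 1; jj < nj - 1; jj += TJ)
            for (int kk = 1; kk < nk - 1; kk += TK) {
                /* stage 1: first stencil sweep over the tile */
                for (int i = ii; i < ii + TI && i < ni - 1; ++i)
                for (int j = jj; j < jj + TJ && j < nj - 1; ++j)
                for (int k = kk; k < kk + TK && k < nk - 1; ++k)
                    tmp[IDX(i, j, k)] = 0.5 * (in[IDX(i + 1, j, k)] -
                                               in[IDX(i - 1, j, k)]);
                /* stage 2: second sweep consumes stage-1 results while
                   the tile is still in cache (loop fusion) */
                for (int i = ii; i < ii + TI && i < ni - 1; ++i)
                for (int j = jj; j < jj + TJ && j < nj - 1; ++j)
                for (int k = kk; k < kk + TK && k < nk - 1; ++k)
                    out[IDX(i, j, k)] = in[IDX(i, j, k)] +
                                        tmp[IDX(i, j, k)];
            }
            #undef IDX
        }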

    Steering Customized AI Architectures for HPC Scientific Applications

    AI hardware technologies have revolutionized computational science. While they have mostly been used to accelerate deep learning training and inference for machine learning, HPC scientific applications do not seem to benefit directly from these specific hardware features unless AI-based components are introduced into their simulation workflows, for instance, as a replacement for their numerical solvers. This paper proposes to take another direction in an attempt to democratize customized AI architectures for HPC scientific computing. The main idea consists in demonstrating how legacy applications can leverage these AI engines after a necessary algorithmic redesign. It is critical that the resulting software implementations map onto the underlying memory-austere hardware architectures to extract the expected performance. To facilitate this process, we promote the matricization technique for restructuring codes (1) by exploiting data sparsity via algebraic compression and (2) by expressing the critical computational phases in terms of tile low-rank matrix-vector multiplications (TLR-MVM) and batch matrix-matrix multiplications (batch GEMM). Algebraic compression reduces the memory footprint and allows the working set to fit into small local cache/memory, while batch execution ensures high occupancy. We highlight how we can steer the Graphcore AI-focused Wafer-on-Wafer Intelligence Processing Units (IPUs) to deliver high performance for both operations. We conduct a performance benchmarking campaign of these two matrix operations, which account for most of the elapsed time of four real applications in computational astronomy, seismic imaging, wireless communications, and climate/weather predictions. We report bandwidth and execution rates with speedup factors of up to 150X/14X/25X/40X, respectively, on IPUs compared to other systems.
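
    A hedged sketch of the TLR-MVM structure named above (a simplified fixed-rank layout of ours, not the authors' implementation): each b-by-b tile A_ij is stored as compressed factors U_ij * V_ij^T of rank r << b, so the per-tile matrix-vector cost drops from b^2 to roughly 2*b*r multiply-adds.

        /* Tile low-rank MVM: y += A*x, where each b x b tile A_ij is
           stored as U_ij (b x r) times V_ij^T, with r << b. */
        typedef struct { const double *U, *V; } tlr_tile; /* column-major factors */

        void tlr_mvm(int nt, int b, int r, const tlr_tile *tiles,
                     const double *x, double *y, double *work /* length r */)
        {
            for (int i = 0; i < nt; ++i)
            for (int j = 0; j < nt; ++j) {
                const tlr_tile *t = &tiles[i * nt + j];
                /* work = V^T * x_j : project the input slice onto rank-r basis */
                for (int p = 0; p < r; ++p) {
                    double s = 0.0;
                    for (int k = 0; k < b; ++k)
                        s += t->V[p * b + k] * x[j * b + k];
                    work[p] = s;
                }
                /* y_i += U * work : expand back to the output slice */
                for (int k = 0; k < b; ++k) {
                    double s = 0.0;
                    for (int p = 0; p < r; ++p)
                        s += t->U[p * b + k] * work[p];
                    y[i * b + k] += s;
                }
            }
        }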