10 research outputs found

Unleashing the performance of ccNUMA multiprocessor architectures in heterogeneous stencil computations

Author: A Eltablawy
A Lastovetsky
A Strugarek
D Culler
J Guo
Kamil Halbiniak
L Szustak
Lukasz Szustak
M Ciznicki
Ondřej Jakl
P Smolarkiewicz
Roman Wyrzykowski
X Cao
Publication venue: 'Springer Science and Business Media LLC'

About the granularity portability of block-based Krylov methods in heterogeneous computing environments

Author: Carracciuolo L.
Mele V.
Szustak L.
Publication venue: 'Wiley'
Publication date: 01/01/2021

Large-scale problems in engineering and science often require the solution of sparse linear algebra problems, and Krylov subspace iteration methods (KM) have led to a major change in how users deal with them. However, for these solvers to use extreme-scale hardware efficiently, considerable work has gone into redesigning both the KM algorithms and their implementations to address challenges such as extreme concurrency, complex memory hierarchies, costly data movement, and heterogeneous node architectures. All of these redesign approaches base the KM algorithm on block-based strategies, leading to the Block-KM (BKM) algorithm, which has high granularity (i.e., a high ratio of computation time to communication time). The work proposes a novel parallel revisitation of the modules used in BKM, based on overlapping communication and computation. This revisitation is evaluated with a model of the modules' granularity and verified in a case study on a classical problem from numerical linear algebra.

Archivio della ricerca - Università degli studi di Napoli Federico II

Correlation of Performance Optimizations and Energy Consumption for Stencil-Based Application on Intel Xeon Scalable Processors

Author: Mele V.
Olas T.
Szustak L.
Wyrzykowski R.
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2020

This article provides a comprehensive study of the impact of performance optimizations on the energy efficiency of a real-world CFD application called MPDATA, as well as an analysis of the performance-energy interaction of these optimizations with the underlying hardware, which represents the first generation of Intel Xeon Scalable processors. Taking the iterative MPDATA application as a use case, we explore the fundamentals of energy and performance analysis for a memory-bound application subjected to a set of optimization steps that increase application performance by improving the operational intensity of the code and utilizing resources more efficiently. It is shown that, for memory-bound applications, optimizing toward high performance can be a powerful strategy for improving energy efficiency as well. In fact, for the considered performance optimizations, the energy gain is correlated with the performance gain, but to varying degrees. As a result, these optimizations improve both performance and energy consumption radically, by up to about 10.9 and 8.8 times, respectively. The impact of the Intel AVX-512 SIMD extension on energy consumption and performance is demonstrated. We also discover limitations on the usability of CPU frequency scaling as a tool for balancing energy savings against admissible performance losses.

Archivio della ricerca - Università degli studi di Napoli Federico II

Dynamic workload prediction and distribution in numerical modeling of solidification on multi-/manycore architectures

Author: Halbiniak K.
Kulawik A.
Lapegna M.
Olas T.
Szustak L.
Publication venue: 'Wiley'
Publication date: 01/01/2021

This work is part of the global trend toward using modern computing systems to model phase-field phenomena. The main goal of this article is to improve the performance of a parallel application for solidification modeling, assuming that the intensity of computations changes dynamically in successive time steps, when calculations are performed using a carefully selected group of nodes in the grid. A two-step method is proposed to optimize the application for multi-/manycore architectures. In the first step, loop fusion is used to execute all kernels in a single nested loop and reduce the number of conditional operators. These modifications are vital to implementing the second step, which includes an algorithm for dynamic workload prediction and load balancing across the cores of a computing platform. Two versions of the algorithm are proposed, using 1D and 2D maps to predict the computational domain within the grid. The proposed optimizations significantly increase application performance for all tested configurations of computing resources. The highest performance gain is achieved for two Intel Xeon Platinum 8180 CPUs, where the new code based on the 2D map yields a speedup of up to 2.74 times, while using the proposed method with the 2D map on a single KNL accelerator reduces the execution time by a factor of up to 1.91.

Archivio della ricerca - Università degli studi di Napoli Federico II

Performance portable parallel programming of heterogeneous stencils across shared-memory platforms with modern Intel processors

Author: Bandishti V
Bobulski J
Eltablawy A
Hager G
Jeffers J
Lukasz Szustak
Pawel Bratek
Szustak L
Vladimirov A
Publication venue: 'SAGE Publications'

Crossref

Performance enhancement of a dynamic K-means algorithm through a parallel adaptive strategy on multicore CPUs

Author: Laccetti G.
Lapegna M.
Mele V.
Romano D.
Szustak L.
Publication venue: 'Elsevier BV'
Publication date: 01/01/2020

The K-means algorithm is one of the most popular algorithms in Data Science. It aims to discover similarities among the elements of large datasets by partitioning them into K distinct groups called clusters. The main weakness of this technique is that, in real problems, it is often impossible to specify the value of K as input data. Furthermore, the large amount of data used in useful simulations makes executing the algorithm on traditional architectures impractical. In this paper, we address these two issues. On the one hand, we propose a method that dynamically defines the value of K by optimizing a suitable quality index, with special attention to the computational cost. On the other hand, to improve the performance and effectiveness of the algorithm, we propose a strategy for parallel implementation on modern multicore CPUs.

Archivio della ricerca - Università degli studi di Napoli Federico II