The extensive use of HPC infrastructures and frameworks for running
data-intensive applications has led to a growing interest in data partitioning
techniques and strategies. In fact, finding an effective partitioning, i.e., a
suitable size for data blocks, is a key strategy to speed up parallel
data-intensive applications and increase scalability. This paper describes a
methodology for data block size estimation in HPC applications, which relies on
supervised machine learning techniques. The implementation of the proposed
methodology was evaluated using dislib, a distributed computing library focused
on machine learning algorithms and built on top of the PyCOMPSs framework, as a
testbed. We assessed the effectiveness of our solution through an
extensive experimental evaluation considering different algorithms, datasets,
and infrastructures, including the MareNostrum 4 supercomputer. The results we
obtained show that the methodology is able to efficiently determine a suitable
way to split a given dataset, thus enabling the efficient execution of
data-parallel applications in high-performance environments.