The extensive use of HPC infrastructures and frameworks for running
data-intensive applications has led to a growing interest in data partitioning
techniques and strategies. In fact, finding an effective partitioning, i.e., a
suitable size for data blocks, is a key strategy to speed up parallel
data-intensive applications and increase scalability. This paper describes a
methodology for data block size estimation in HPC applications, which relies on
supervised machine learning techniques. The implementation of the proposed
methodology was evaluated using dislib, a distributed computing library focused
on machine learning algorithms and built on top of the PyCOMPSs framework, as a
testbed. We assessed the effectiveness of our solution through an
extensive experimental evaluation considering different algorithms, datasets,
and infrastructures, including the MareNostrum 4 supercomputer. The results we
obtained show that the methodology is able to efficiently determine a suitable
way to split a given dataset, thus enabling the efficient execution of
data-parallel applications in high-performance environments.