4 research outputs found
Enabling EASEY deployment of containerized applications for future HPC systems
The upcoming exascale era will shift computing architectures from classical
CPU-based systems to hybrid, GPU-heavy systems with much higher levels of
complexity. While such clusters are expected to improve the performance of
certain optimized HPC applications, they will also increase the difficulties
for those users who have yet to adapt their codes or are starting from scratch
with new programming paradigms. Since there are still no comprehensive
automatic assistance mechanisms to enhance application performance on such
systems, we propose a support framework for future HPC architectures, called
EASEY (Enable exAScale for EverYone). The solution builds on a layered software
architecture, which offers different mechanisms on each layer for different
tuning tasks. This enables users to adjust the parameters on each of the
layers, thereby enhancing specific characteristics of their codes. We introduce
the framework with a Charliecloud-based solution, showcasing the LULESH
benchmark on the upper layers of our framework. Our approach can automatically
deploy optimized container computations with negligible overhead and at the
same time reduce the time a scientist needs to spend on manual job submission
configurations.
Comment: International Conference on Computational Science ICCS2020, 13 pages
Towards Exascale Computing Architecture and Its Prototype: Services and Infrastructure
This paper presents the design and implementation of a scalable compute platform for processing large data sets within the EU H2020 project PROCESS. We present the requirements of the platform, related work, and the infrastructure with a focus on the compute components, and finally report the results of our work.
A Container-Based Workflow for Distributed Training of Deep Learning Algorithms in HPC Clusters
Deep learning has been postulated as a solution for numerous problems in
different branches of science. Given the resource-intensive nature of these
models, they often need to be executed on specialized hardware such as
graphics processing units (GPUs) in a distributed manner. In the academic
field, researchers get access to this kind of resources through High
Performance Computing (HPC) clusters. These infrastructures make the training
of such models difficult due to their multi-user nature and limited user
permissions. In addition, different HPC clusters may have different
peculiarities that can complicate the research cycle (e.g., library
dependencies). In this paper we develop a workflow and methodology for the
distributed training of deep learning models in HPC clusters which provides
researchers with a series of novel advantages. It relies on udocker as the
containerization tool and on Horovod as the library for distributing the
models across multiple GPUs. udocker does not need any special permissions,
allowing researchers to run the entire workflow without relying on any
administrator. Horovod ensures the efficient distribution of the training
independently of the deep learning framework used. Additionally, due to
containerization and specific features of the workflow, it provides researchers
with a cluster-agnostic way of running their models. The experiments carried
out show that the workflow offers good scalability in the distributed training
of the models and that it adapts easily to different clusters.
Comment: Under review for Cluster Computing
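The framework-independent distribution that Horovod provides rests on an allreduce step that averages the gradients computed by each worker on its own data shard. As a minimal illustration of that arithmetic only (real Horovod performs a ring-allreduce across MPI/NCCL processes; the function and worker values below are hypothetical, not from the paper):

```python
# Sketch of the gradient averaging behind Horovod-style data-parallel
# training. We simulate N workers in a single process; each worker holds
# the gradient vector it computed on its local batch.

def allreduce_average(worker_grads):
    """Element-wise average of every worker's gradient vector."""
    num_workers = len(worker_grads)
    length = len(worker_grads[0])
    return [sum(g[i] for g in worker_grads) / num_workers
            for i in range(length)]

# Hypothetical gradients from two workers on different data shards.
grads = [
    [0.25, -0.5, 1.0],  # worker 0
    [0.75, -0.5, 0.0],  # worker 1
]
avg = allreduce_average(grads)
print(avg)  # [0.5, -0.5, 0.5] -- every worker applies this same update
```

Because every worker applies the same averaged gradient, all model replicas stay synchronized after each step, which is what makes the approach independent of the deep learning framework producing the gradients.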