
    Enabling EASEY deployment of containerized applications for future HPC systems

    The upcoming exascale era will push computing architectures from classical CPU-based systems to hybrid, GPU-heavy systems with much higher levels of complexity. While such clusters are expected to improve the performance of certain optimized HPC applications, they will also increase the difficulties for users who have yet to adapt their codes or are starting from scratch with new programming paradigms. Since there are still no comprehensive automatic assistance mechanisms to enhance application performance on such systems, we propose a support framework for future HPC architectures, called EASEY (Enable exAScale for EverYone). The solution builds on a layered software architecture, which offers different tuning mechanisms on each layer. This enables users to adjust the parameters on each layer, thereby enhancing specific characteristics of their codes. We introduce the framework with a Charliecloud-based solution, showcasing the LULESH benchmark on the upper layers of our framework. Our approach can automatically deploy optimized container computations with negligible overhead and, at the same time, reduce the time a scientist needs to spend on manual job submission configuration.
    Comment: International Conference on Computational Science (ICCS 2020), 13 pages
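    The abstract itself contains no code. As a rough illustration of the kind of job deployment the EASEY layers automate (the abstract notes that scientists otherwise configure such submissions by hand), the Python sketch below generates and submits a Slurm batch script that runs the LULESH binary inside a Charliecloud container. The image path, binary location, and resource counts are hypothetical placeholders, not values from the paper.

        import subprocess
        from pathlib import Path

        # Hypothetical placeholders: the image directory, binary path, and
        # resource counts are illustrative, not taken from the paper.
        IMAGE_DIR = "/var/tmp/lulesh-img"   # unpacked Charliecloud image
        NODES = 8
        TASKS_PER_NODE = 27                 # 8 * 27 = 216 = 6**3; LULESH
                                            # requires a cubic MPI rank count

        job = f"""#!/bin/bash
        #SBATCH --job-name=easey-lulesh
        #SBATCH --nodes={NODES}
        #SBATCH --ntasks-per-node={TASKS_PER_NODE}

        # ch-run executes a command inside an unpacked container image;
        # srun launches one MPI rank per task outside the container.
        srun ch-run {IMAGE_DIR} -- /lulesh/build/lulesh2.0 -s 30 -i 100
        """

        script = Path("easey_lulesh.sbatch")
        script.write_text(job)
        subprocess.run(["sbatch", str(script)], check=True)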

    Towards Exascale Computing Architecture and Its Prototype: Services and Infrastructure

    This paper presents the design and implementation of a scalable compute platform for processing large data sets within the scope of the EU H2020 project PROCESS. We present the requirements of the platform, related work, the infrastructure with a focus on the compute components, and finally the results of our work.

    A Container-Based Workflow for Distributed Training of Deep Learning Algorithms in HPC Clusters

    Deep learning has been postulated as a solution for numerous problems in different branches of science. Given the resource-intensive nature of these models, they often need to be executed on specialized hardware such as graphics processing units (GPUs) in a distributed manner. In academia, researchers gain access to this kind of resource through High Performance Computing (HPC) clusters. Such infrastructures make the training of these models difficult due to their multi-user nature and limited user permissions. In addition, different HPC clusters may have different peculiarities that can entangle the research cycle (e.g., library dependencies). In this paper we develop a workflow and methodology for the distributed training of deep learning models in HPC clusters that provides researchers with a series of novel advantages. It relies on udocker as the containerization tool and on Horovod as the library for distributing the models across multiple GPUs. udocker does not need any special permissions, allowing researchers to run the entire workflow without relying on an administrator. Horovod ensures the efficient distribution of the training independently of the deep learning framework used. Additionally, thanks to containerization and specific features of the workflow, it provides researchers with a cluster-agnostic way of running their models. The experiments carried out show that the workflow offers good scalability in the distributed training of the models and that it easily adapts to different clusters.
    Comment: Under review for Cluster Computing
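    As a minimal sketch of the Horovod side of such a workflow, assuming PyTorch as the deep learning framework (the paper stresses that Horovod is framework-independent), the following shows how a training script is adapted for multi-GPU distribution. The model and data are synthetic stand-ins, not the paper's experiments.

        import torch
        import torch.nn.functional as F
        import horovod.torch as hvd

        hvd.init()                               # one process per GPU
        torch.cuda.set_device(hvd.local_rank())  # pin process to its GPU

        model = torch.nn.Linear(100, 10).cuda()  # placeholder model
        # Common heuristic: scale the learning rate by the worker count.
        optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

        # Horovod wraps the optimizer to average gradients across workers
        # and broadcasts initial state so all workers start identically.
        optimizer = hvd.DistributedOptimizer(
            optimizer, named_parameters=model.named_parameters())
        hvd.broadcast_parameters(model.state_dict(), root_rank=0)
        hvd.broadcast_optimizer_state(optimizer, root_rank=0)

        for step in range(100):
            x = torch.randn(32, 100).cuda()      # synthetic batch
            y = torch.randint(0, 10, (32,)).cuda()
            optimizer.zero_grad()
            loss = F.cross_entropy(model(x), y)
            loss.backward()
            optimizer.step()
            if hvd.rank() == 0 and step % 20 == 0:
                print(f"step {step}: loss {loss.item():.4f}")

    In the workflow described, each such process would run inside a udocker container (no root privileges required) and be launched across nodes with horovodrun or mpirun; the exact launch mechanics depend on the cluster.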
