5 research outputs found

    A Study of Checkpointing in Large Scale Training of Deep Neural Networks

    Deep learning (DL) applications are increasingly being deployed on HPC systems to leverage the massive parallelism and computing power of those systems for DL model training. While significant effort has been put into facilitating distributed training by DL frameworks, fault tolerance has been largely ignored. In this work, we evaluate checkpoint-restart, a common fault tolerance technique in HPC workloads. We perform experiments with three state-of-the-art DL frameworks common in HPC (Chainer, PyTorch, and TensorFlow). We evaluate the computational cost of checkpointing, file formats and file sizes, the impact of scale, and deterministic checkpointing. Our evaluation shows some critical differences in checkpoint mechanisms and exposes several bottlenecks in existing checkpointing implementations. We provide discussion points that can aid users in selecting a fault-tolerant framework to use in HPC, as well as takeaway points that framework developers can use to facilitate better checkpointing of DL workloads in HPC.
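    As a concrete illustration of the checkpoint-restart technique evaluated here, the sketch below shows the pattern in PyTorch, one of the three frameworks studied; the model, checkpoint path, and epoch count are illustrative choices, not the paper's experimental setup.

    import os
    import torch
    import torch.nn as nn

    CKPT_PATH = "model_ckpt.pt"  # illustrative path, not from the paper

    model = nn.Linear(128, 10)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    start_epoch = 0

    # Restart: restore model and optimizer state if a previous run left a checkpoint.
    if os.path.exists(CKPT_PATH):
        ckpt = torch.load(CKPT_PATH)
        model.load_state_dict(ckpt["model"])
        optimizer.load_state_dict(ckpt["optimizer"])
        start_epoch = ckpt["epoch"] + 1

    for epoch in range(start_epoch, 10):
        # ... one training epoch would run here ...
        # Checkpoint: serialize the full training state so a failed job can resume.
        torch.save(
            {"model": model.state_dict(),
             "optimizer": optimizer.state_dict(),
             "epoch": epoch},
            CKPT_PATH,
        )

    In patterns like this, the checkpointing cost is dominated by serialization and file I/O in the save step; its frequency trades recomputation after a failure against I/O overhead during training.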

    An oracle for guiding large-scale model/hybrid parallel training of convolutional neural networks

    Deep Neural Network (DNN) frameworks use distributed training to enable faster time to convergence and alleviate memory capacity limitations when training large models and/or using high-dimension inputs. With the steady increase in dataset and model sizes, model/hybrid parallelism is deemed to have an important role in the future of distributed training of DNNs. We analyze the compute, communication, and memory requirements of Convolutional Neural Networks (CNNs) to understand the trade-offs between different parallelism approaches on performance and scalability. We use our model-driven analysis as the basis for an oracle utility that can help detect the limitations and bottlenecks of different parallelism approaches at scale. We evaluate the oracle on six parallelization strategies, with four CNN models and multiple datasets (2D and 3D), on up to 1024 GPUs. The results demonstrate that the oracle has an average accuracy of about 86.74% when compared to empirical results, and as high as 97.57% for data parallelism.
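    The abstract does not disclose the oracle's internal cost model, so the following is only a hedged sketch of what a model-driven communication analysis can look like: it estimates per-step gradient-synchronisation traffic for ring all-reduce data parallelism from the parameter count, using the standard 2(N-1)/N volume estimate and a bandwidth-only lower bound, not the paper's formulas.

    def allreduce_traffic_per_gpu(num_params: int, num_gpus: int,
                                  bytes_per_param: int = 4) -> float:
        # Bytes each GPU transfers per step under ring all-reduce:
        # the standard 2 * (N - 1) / N * model_size estimate.
        model_size = num_params * bytes_per_param
        return 2 * (num_gpus - 1) / num_gpus * model_size

    def comm_time_lower_bound_s(num_params: int, num_gpus: int,
                                link_gbps: float = 100.0) -> float:
        # Bandwidth-only lower bound on per-step gradient sync time.
        bits = allreduce_traffic_per_gpu(num_params, num_gpus) * 8
        return bits / (link_gbps * 1e9)

    # Example: a ~25M-parameter CNN (roughly ResNet-50 scale) on 1024 GPUs.
    print(f"{comm_time_lower_bound_s(25_000_000, 1024) * 1e3:.1f} ms per step")

    An oracle in this spirit would combine such communication estimates with per-strategy compute and memory models, then flag which resource becomes the bottleneck for each parallelism approach at a given scale.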

    AfriQA: Cross-lingual Open-Retrieval Question Answering for African Languages

    African languages have far less in-language content available digitally, making it challenging for question answering systems to satisfy the information needs of users. Cross-lingual open-retrieval question answering (XOR QA) systems -- those that retrieve answer content from other languages while serving people in their native language -- offer a means of filling this gap. To this end, we create AfriQA, the first cross-lingual QA dataset with a focus on African languages. AfriQA includes 12,000+ XOR QA examples across 10 African languages. While previous datasets have focused primarily on languages where cross-lingual QA augments coverage from the target language, AfriQA focuses on languages where cross-lingual answer content is the only high-coverage source of answer content. Because of this, we argue that African languages are one of the most important and realistic use cases for XOR QA. Our experiments demonstrate the poor performance of automatic translation and multilingual retrieval methods. Overall, AfriQA proves challenging for state-of-the-art QA models. We hope that the dataset enables the development of more equitable QA technology.
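    To make the XOR QA data flow concrete, here is a hypothetical pipeline sketch; every component below is a stub, since AfriQA is a dataset and the abstract does not prescribe an implementation. A real system would plug in machine translation, a cross-lingual retriever, and a reader model.

    from typing import List

    def translate(text: str, src: str, tgt: str) -> str:
        return text  # stub: stands in for a machine-translation model

    def retrieve(query: str, corpus_lang: str, k: int = 5) -> List[str]:
        return ["..."]  # stub: stands in for a passage retriever over corpus_lang

    def read(question: str, passages: List[str]) -> str:
        return "..."  # stub: stands in for an extractive or generative reader

    def xor_qa(question: str, user_lang: str, pivot_lang: str = "en") -> str:
        # Cross-lingual open retrieval: the answer content lives in pivot_lang,
        # but the user asks and is answered in user_lang.
        pivot_q = translate(question, src=user_lang, tgt=pivot_lang)
        passages = retrieve(pivot_q, corpus_lang=pivot_lang)
        pivot_answer = read(pivot_q, passages)
        return translate(pivot_answer, src=pivot_lang, tgt=user_lang)

    The reported weakness of automatic translation and multilingual retrieval affects the first two stages of exactly this kind of pipeline, which is why the benchmark is hard for current systems.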

    Convergence of deep learning and high performance computing: challenges and solutions

    Deep Learning has achieved outstanding results in many fields and led to groundbreaking discoveries. With the steady increase in dataset and model sizes, there has been a recent surge in Machine Learning applications in High-Performance Computing (HPC) to speed up training. Deep Neural Network (DNN) frameworks use distributed training to enable faster time to convergence and alleviate memory capacity limitations when training large models or using high-dimension inputs. However, training DNNs on HPC infrastructures presents a unique set of challenges: scalability, I/O contention, network congestion, and fault tolerance. Solving these problems is particularly challenging due to the nature of DL applications and the history of DL's adoption in HPC. This thesis addresses scalability and resilience challenges by looking at different parts of the Machine Learning workflow. We first address hyperparameter optimisation (HPO), one of the most time-consuming and resource-intensive parts of a Machine Learning workflow. We present an HPO scheme built on top of PyCOMPSs, a programming model and runtime which aims to ease the development of parallel applications for distributed infrastructures. We show that PyCOMPSs is a robust framework that can accelerate the process of hyperparameter optimisation across multiple devices and computing units, and we perform a detailed performance analysis showing different configurations to demonstrate the effectiveness of our approach. We then analyse the compute, communication, and memory requirements of DNNs to understand the trade-offs of different parallelism approaches on performance and scalability, and we use this model-driven analysis as the basis for an oracle utility that can help detect the limitations and bottlenecks of different parallelism approaches at scale. While significant effort has been put into facilitating distributed training by DL frameworks, fault tolerance has been largely ignored. We examine the checkpointing implementations of popular DL platforms, evaluating the computational cost of checkpointing, file formats and file sizes, the impact of scale, and deterministic checkpointing. We provide discussion points that can aid users in selecting a fault-tolerant framework to use in HPC, as well as takeaway points that framework developers can use to facilitate better checkpointing of DL workloads in HPC.

    Accelerating hyperparameter optimisation with PyCOMPSs

    Machine Learning applications now span multiple domains thanks to the increase in computational power of modern systems, and there has been a recent surge of Machine Learning applications in High Performance Computing (HPC) in an attempt to speed up training. Besides training, however, hyperparameter optimisation (HPO) is one of the most time-consuming and resource-intensive parts of a Machine Learning workflow. Numerous algorithms and tools exist to accelerate the process of finding the right parameters for a model, but most of them do not exploit the parallelism provided by modern systems: they are serial or limited to a single node, and the few that offer distributed execution demand substantial programming effort. There is, therefore, a need for a tool or scheme that can scale across HPC infrastructures such as supercomputers with minimal programmer effort and little or no performance overhead. We present an HPO scheme built on top of PyCOMPSs, a programming model and runtime which aims to ease the development of parallel applications for distributed infrastructures. We show that PyCOMPSs is a powerful framework that can accelerate the process of hyperparameter optimisation across multiple devices and computing units, and that it provides easy programmability, seamless distribution, and scalability, key features missing in existing tools. Furthermore, we perform a detailed performance analysis showing different configurations to demonstrate the effectiveness of our approach, as the sketch below illustrates.
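    As a minimal sketch of what such a scheme can look like, the snippet below parallelises a grid search with PyCOMPSs, using only its documented @task decorator and compss_wait_on synchronisation call; the objective function and search grid are invented for illustration.

    from itertools import product

    from pycompss.api.task import task           # PyCOMPSs task decorator
    from pycompss.api.api import compss_wait_on  # synchronisation barrier

    @task(returns=float)
    def evaluate(lr, batch_size):
        # Stub objective: a real workflow would train a model with these
        # hyperparameters and return its validation score.
        return -((lr - 0.01) ** 2) - 1e-6 * (batch_size - 64) ** 2

    if __name__ == "__main__":
        grid = list(product([0.001, 0.01, 0.1], [32, 64, 128]))
        # Each call returns a future immediately; the runtime schedules the
        # independent evaluations across the available nodes.
        futures = [evaluate(lr, bs) for lr, bs in grid]
        scores = compss_wait_on(futures)  # collect all results
        best_score, best_cfg = max(zip(scores, grid))
        print("best", best_score, "with (lr, batch_size) =", best_cfg)

    Launched with runcompss, the programmer writes sequential-looking Python and the runtime extracts the parallelism from the task graph, which is the programmability gap the abstract highlights in existing tools.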
