    Taming Unbalanced Training Workloads in Deep Learning with Partial Collective Operations

    Load imbalance is pervasive in distributed deep learning training systems, caused either by inherent imbalance in the learned tasks or by the system itself. Traditional synchronous Stochastic Gradient Descent (SGD) achieves good accuracy for a wide variety of tasks, but relies on global synchronization to accumulate the gradients at every training step. In this paper, we propose eager-SGD, which relaxes the global synchronization for decentralized accumulation. To implement eager-SGD, we propose two partial collectives: solo and majority. With solo allreduce, the faster processes contribute their gradients eagerly without waiting for the slower processes, whereas with majority allreduce, at least half of the participants must contribute gradients before continuing, all without using a central parameter server. We theoretically prove the convergence of the algorithms and describe the partial collectives in detail. Experimental results in load-imbalanced environments (CIFAR-10, ImageNet, and UCF101 datasets) show that eager-SGD achieves a 1.27x speedup over state-of-the-art synchronous SGD, without losing accuracy. Comment: Published in Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP'20), pp. 45-61. 202
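
The difference between the solo and majority collectives described above can be illustrated with a small single-process simulation. This is a minimal sketch and not the paper's implementation: the function name `partial_allreduce`, the `arrival_times` input, and the choice to treat late workers as contributing nothing to the current step are all assumptions made for illustration.

```python
import math
from typing import List, Tuple


def partial_allreduce(
    grads: List[List[float]],
    arrival_times: List[float],
    mode: str,
) -> Tuple[List[float], List[int], float]:
    """Simulate one allreduce step over len(grads) workers (hypothetical helper).

    The collective fires once `needed` workers have arrived:
      solo     -> 1 worker
      majority -> at least half of the workers
      full     -> all workers (classic synchronous allreduce)
    Assumption: workers that arrive later contribute nothing to this step.
    """
    p = len(grads)
    if mode == "solo":
        needed = 1
    elif mode == "majority":
        needed = math.ceil(p / 2)
    elif mode == "full":
        needed = p
    else:
        raise ValueError(f"unknown mode: {mode}")

    # Time at which the collective is triggered.
    trigger = sorted(arrival_times)[needed - 1]
    contributors = [i for i, t in enumerate(arrival_times) if t <= trigger]

    # Sum the gradients that arrived in time, then divide by the worker count.
    dim = len(grads[0])
    total = [0.0] * dim
    for i in contributors:
        for k in range(dim):
            total[k] += grads[i][k]
    return [x / p for x in total], contributors, trigger


grads = [[1.0], [2.0], [3.0], [4.0]]
arrival_times = [0.1, 0.4, 0.2, 0.9]  # worker 3 is the straggler
for mode in ("solo", "majority", "full"):
    print(mode, partial_allreduce(grads, arrival_times, mode))
```

In this toy run the full allreduce waits until time 0.9 for the straggler, while the majority variant fires at 0.2 with two contributors and the solo variant at 0.1 with one.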

    An In-Depth Analysis of the Slingshot Interconnect

    The interconnect is one of the most critical components in large-scale computing systems, and its impact on application performance will increase with system size. In this paper, we describe Slingshot, an interconnection network for large-scale computing systems. Slingshot is based on high-radix switches, which allow building exascale and hyperscale datacenter networks with at most three switch-to-switch hops. Moreover, Slingshot provides efficient adaptive routing and congestion control algorithms, and highly tunable traffic classes. Slingshot uses an optimized Ethernet protocol, which allows it to be interoperable with standard Ethernet devices while providing high performance to HPC applications. We analyze the extent to which Slingshot provides these features, evaluating it on microbenchmarks and on several applications from the datacenter and AI worlds, as well as on HPC applications. We find that applications running on Slingshot are less affected by congestion than on previous-generation networks. Comment: To be published in Proceedings of The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC '20) (2020)

    Breaking (Global) Barriers in Parallel Stochastic Optimization with Wait-Avoiding Group Averaging

    Deep learning at scale is dominated by communication time. Distributing samples across nodes usually yields the best performance, but poses scaling challenges due to global information dissemination and load imbalance across uneven sample lengths. State-of-the-art decentralized optimizers mitigate the problem, but require more iterations to achieve the same accuracy as their globally-communicating counterparts. We present Wait-Avoiding Group Model Averaging (WAGMA) SGD, a wait-avoiding stochastic optimizer that reduces global communication via subgroup weight exchange. The key insight is a combination of algorithmic changes to the averaging scheme and the use of a group allreduce operation. We prove the convergence of WAGMA-SGD, and empirically show that it retains convergence rates similar to Allreduce-SGD. For evaluation, we train ResNet-50 on ImageNet; Transformer for machine translation; and deep reinforcement learning for navigation at scale. Compared with state-of-the-art decentralized SGD variants, WAGMA-SGD significantly improves training throughput (e.g., 2.1x on 1,024 GPUs for reinforcement learning), and achieves the fastest time-to-solution (e.g., the highest score using the shortest training time for Transformer). Comment: Published in IEEE Transactions on Parallel and Distributed Systems (IEEE TPDS), vol. 32, no. 7, pp. 1725-1739, 1 July 202
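
The idea of replacing a global average with subgroup weight exchange can be sketched in a few lines. This is only an illustration under stated assumptions, not the WAGMA-SGD algorithm itself: each worker holds a single scalar weight, the helper name `group_average` is hypothetical, and the rotating `shift` used to change group membership between steps is an assumed grouping rule chosen for simplicity.

```python
from typing import List


def group_average(weights: List[float], group_size: int, shift: int = 0) -> List[float]:
    """Average model weights within disjoint groups of workers (hypothetical helper).

    Workers are ordered starting at index `shift` (assumed rotation rule) and
    cut into consecutive groups of `group_size`. Each worker ends up with the
    mean of its own group, so no global communication is needed in this step.
    """
    p = len(weights)
    if group_size <= 0 or p % group_size != 0:
        raise ValueError("group_size must evenly divide the number of workers")

    order = [(i + shift) % p for i in range(p)]
    out = list(weights)
    for start in range(0, p, group_size):
        members = order[start:start + group_size]
        mean = sum(weights[m] for m in members) / group_size
        for m in members:
            out[m] = mean
    return out


w = [0.0, 2.0, 4.0, 6.0]            # one scalar weight per worker
w = group_average(w, 2, shift=0)    # groups {0,1} and {2,3}
print(w)
w = group_average(w, 2, shift=1)    # groups {1,2} and {3,0}
print(w)
```

In this four-worker example, two group steps with different memberships bring every worker to the global mean of 3.0, although each step only exchanges data inside groups of two.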

    Etude d’applications émergentes en HPC et leurs impacts sur des stratégies d’ordonnancement

    With the expected convergence between HPC, Big Data, and AI, new applications with different profiles are coming to HPC infrastructures. We aim to better understand the features and needs of these applications in order to run them efficiently on HPC platforms. The approach followed is bottom-up: we thoroughly study an emerging application, Spatially Localized Atlas Network (SLANT, originating from the neuroscience community), to understand its behavior. Based on these observations, we derive a generic, yet simple, application model (namely, a linear sequence of stochastic jobs). We expect this model to be representative of a large set of upcoming applications that require the computational power of HPC clusters without fitting the typical behavior of large-scale traditional applications. In a second step, we show how one can manipulate this generic model in a scheduling framework. Specifically, we consider the problem of making reservations (both time and memory) for an execution on an HPC platform. We derive solutions using the model from the first step of this work. We experimentally show that the model is robust, even when it is generated from very little data or from another application, and that it provides performance gains.
    The convergence of high-performance computing, Big Data, and artificial intelligence is bringing new application profiles to HPC infrastructures. In this work, we study these new applications to better understand their characteristics and needs, with the goal of optimizing their execution on HPC platforms. To do so, we adopt a bottom-up approach. First, we study in detail an emerging application, SLANT, from the field of neuroscience. Through detailed profiling of the application, we expose its main characteristics and its needs in terms of computing resources. From these observations, we propose a generic application model, simple for now, composed of a linear sequence of stochastic tasks. We believe this model should suit a wide variety of emerging applications that require the computing power of HPC clusters without exhibiting the typical behavior of applications that run on large-scale machines. Second, we show how to use the generic application model when developing scheduling strategies. More precisely, we focus on the design of reservation strategies (in terms of both computation time and memory). We propose such solutions using the generic application model from the first step of this work. Finally, we show the robustness of the application model and of our scheduling strategies through experimental evaluations. In particular, we demonstrate that our solutions outperform the standard approaches of the neuroscience community, even with partial data or when extended to applications other than SLANT.
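
To make the reservation problem for a stochastic job concrete, the sketch below evaluates a sequence of increasing time reservations against sampled runtimes. The cost model is an assumption made for illustration and is not taken from the thesis: the job is retried with the next reservation whenever the current one is too short, and every attempted reservation is paid in full. The function name `expected_cost` and the sample values are hypothetical.

```python
from typing import Sequence


def expected_cost(reservations: Sequence[float], runtimes: Sequence[float]) -> float:
    """Average cost of a reservation sequence over sampled runtimes (hypothetical helper).

    Assumed cost model: a job with runtime t tries the reservations in order,
    pays each attempted reservation in full, and stops at the first one that
    is long enough to finish.
    """
    total = 0.0
    for t in runtimes:
        paid = 0.0
        for r in reservations:
            paid += r
            if t <= r:
                break
        else:
            # No reservation was long enough for this sample.
            raise ValueError("last reservation must cover every sampled runtime")
        total += paid
    return total / len(runtimes)


samples = [1, 2, 3, 10]                  # made-up runtimes, mostly short, one long
print(expected_cost([10], samples))      # always reserve the worst case
print(expected_cost([3, 10], samples))   # try a short reservation first
```

With these made-up samples, reserving the worst case every time costs 10.0 on average, while trying a short reservation first costs 5.5, which is the kind of trade-off a reservation strategy has to balance.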

    Adaptiveness and Lock-free Synchronization in Parallel Stochastic Gradient Descent

    The emergence of big data in recent years, due to vast societal digitalization and large-scale sensor deployment, has generated significant interest in machine learning methods that enable automatic data analytics. In a majority of the learning algorithms used in industrial as well as academic settings, the first-order iterative optimization procedure Stochastic Gradient Descent (SGD) is the backbone. However, SGD is often time-consuming, as it typically requires several passes through the entire dataset in order to converge to a solution of sufficient quality. In order to cope with increasing data volumes, and to facilitate accelerated processing on contemporary hardware, various parallel SGD variants have been proposed. In addition to traditional synchronous parallelization schemes, asynchronous ones have received particular interest in recent literature because they scale better, requiring less coordination and consequently less waiting time. However, asynchrony implies inherent challenges in understanding the execution of the algorithm and its convergence properties, due to the presence of both stale and inconsistent views of the shared state. In this work, we aim to increase the understanding of the convergence properties of SGD for practical applications under asynchronous parallelism, and to develop tools and frameworks that facilitate improved convergence properties as well as further research and development. First, we focus on understanding the impact of staleness, and introduce models for capturing the dynamics of parallel execution of SGD. This enables (i) quantifying the statistical penalty on the convergence due to staleness and (ii) deriving an adaptation scheme, introducing a staleness-adaptive SGD variant, MindTheStep-AsyncSGD, which provably reduces this penalty. Second, we explore the impact of synchronization mechanisms, in particular consistency-preserving ones, and their overall effect on the convergence properties. To this end, we propose Leashed-SGD, an extensible algorithmic framework supporting various synchronization mechanisms for different degrees of consistency, enabling in particular a lock-free and consistency-preserving implementation. In addition, the algorithmic construction of Leashed-SGD enables dynamic memory allocation, claiming memory only when necessary, which reduces the overall memory footprint. We perform an extensive empirical study, benchmarking the proposed methods together with established baselines, focusing on the prominent application of deep learning for image classification on the benchmark datasets MNIST and CIFAR, and showing significant improvements in convergence time for Leashed-SGD and MindTheStep-AsyncSGD
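
The notion of a staleness-adaptive step size can be sketched as follows. The abstract does not give the adaptation rule used by MindTheStep-AsyncSGD, so the rule below, which divides the base step size by one plus the staleness, is only a common illustrative choice and an assumption; the names `adaptive_step` and `apply_updates` are hypothetical.

```python
from typing import Iterable, Tuple


def adaptive_step(base_step: float, staleness: int) -> float:
    """Shrink the step size for stale gradients.

    Assumed rule (not from the source): base_step / (1 + staleness), where
    staleness counts the updates applied since the gradient's view was read.
    """
    if staleness < 0:
        raise ValueError("staleness must be non-negative")
    return base_step / (1 + staleness)


def apply_updates(x: float, updates: Iterable[Tuple[float, int]], base_step: float) -> float:
    """Apply (gradient, staleness) pairs to a scalar parameter in arrival order."""
    for grad, staleness in updates:
        x -= adaptive_step(base_step, staleness) * grad
    return x


# A fresh gradient takes the full step; a gradient that is one update stale takes half.
print(apply_updates(1.0, [(1.0, 0), (1.0, 1)], base_step=0.5))
```

The design intent of any such rule is that gradients computed on an outdated view of the parameters move the model less than fresh ones, which limits the statistical penalty that staleness introduces.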