
    Taming Unbalanced Training Workloads in Deep Learning with Partial Collective Operations

    Full text link
    Load imbalance is pervasive in distributed deep learning training systems, caused either by inherent imbalance in the learned tasks or by the system itself. Traditional synchronous Stochastic Gradient Descent (SGD) achieves good accuracy for a wide variety of tasks, but relies on a global synchronization to accumulate the gradients at every training step. In this paper, we propose eager-SGD, which relaxes the global synchronization in favor of decentralized accumulation. To implement eager-SGD, we propose two partial collectives: solo and majority. With solo allreduce, the faster processes contribute their gradients eagerly without waiting for the slower processes, whereas with majority allreduce, at least half of the participants must contribute gradients before continuing, all without using a central parameter server. We theoretically prove the convergence of the algorithms and describe the partial collectives in detail. Experimental results in load-imbalanced environments (CIFAR-10, ImageNet, and UCF101 datasets) show that eager-SGD achieves a 1.27x speedup over state-of-the-art synchronous SGD without losing accuracy. Comment: Published in Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP'20), pp. 45-61, 2020.
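
    As a rough intuition for the "majority" semantics (wait only for a quorum of workers, fold late gradients into a later step), here is a minimal single-process simulation in Python. The worker count, delay distribution, and carry-over policy are illustrative assumptions, not the paper's MPI-based implementation.

# A minimal simulation of the "majority" partial-allreduce semantics:
# each step waits only for a quorum of workers, and late gradients are
# folded into a later step.
import numpy as np

rng = np.random.default_rng(0)
P, D, STEPS = 8, 4, 5              # workers, gradient size, training steps
quorum = P // 2 + 1                # "majority": at least half must contribute
pending = [np.zeros(D) for _ in range(P)]   # late gradients carried over

for step in range(STEPS):
    grads = [rng.normal(size=D) for _ in range(P)]   # local gradients
    delays = rng.exponential(scale=1.0, size=P)      # simulated straggling
    deadline = np.sort(delays)[quorum - 1]           # wait only until a quorum arrives
    on_time = delays <= deadline

    total = np.zeros(D)
    for p in range(P):
        if on_time[p]:
            total += grads[p] + pending[p]           # include anything carried over
            pending[p] = np.zeros(D)
        else:
            pending[p] += grads[p]                   # straggler: contribute next time
    avg = total / P                                  # same divisor as synchronous SGD
    print(f"step {step}: {int(on_time.sum())}/{P} workers on time, |avg grad| = {np.linalg.norm(avg):.3f}")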

    Breaking (Global) Barriers in Parallel Stochastic Optimization with Wait-Avoiding Group Averaging

    Full text link
    Deep learning at scale is dominated by communication time. Distributing samples across nodes usually yields the best performance, but poses scaling challenges due to global information dissemination and load imbalance across uneven sample lengths. State-of-the-art decentralized optimizers mitigate the problem, but require more iterations to achieve the same accuracy as their globally-communicating counterparts. We present Wait-Avoiding Group Model Averaging (WAGMA) SGD, a wait-avoiding stochastic optimizer that reduces global communication via subgroup weight exchange. The key insight is a combination of algorithmic changes to the averaging scheme and the use of a group allreduce operation. We prove the convergence of WAGMA-SGD and empirically show that it retains convergence rates similar to Allreduce-SGD. For evaluation, we train ResNet-50 on ImageNet, a Transformer for machine translation, and deep reinforcement learning for navigation at scale. Compared with state-of-the-art decentralized SGD variants, WAGMA-SGD significantly improves training throughput (e.g., 2.1x on 1,024 GPUs for reinforcement learning) and achieves the fastest time-to-solution (e.g., the highest score using the shortest training time for the Transformer). Comment: Published in IEEE Transactions on Parallel and Distributed Systems (IEEE TPDS), vol. 32, no. 7, pp. 1725-1739, 1 July 2021.
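
    A rough sketch of the group-averaging idea follows: model replicas are averaged only within small, rotating groups instead of over all workers, and the toy simulation tracks how the spread between replicas shrinks over steps. Group size, rotation rule, and the quadratic toy loss are assumptions for illustration; WAGMA-SGD additionally makes the group allreduce itself wait-avoiding.

# A toy simulation of rotating group model averaging: each step, replicas
# are averaged only within small groups, and membership rotates so that
# information still spreads across all workers.
import numpy as np

rng = np.random.default_rng(1)
P, D, G, STEPS, LR = 16, 8, 4, 6, 0.1      # workers, model size, group size, steps, step size
w = rng.normal(size=(P, D))                # per-worker model replicas

for step in range(STEPS):
    grads = w + rng.normal(scale=0.1, size=(P, D))   # toy gradient of 0.5*||w||^2 plus noise
    w = w - LR * grads                                # local SGD update on every replica
    order = np.roll(np.arange(P), step % G)           # rotate group membership each step
    for g in range(P // G):
        members = order[g * G:(g + 1) * G]
        w[members] = w[members].mean(axis=0)          # group "allreduce": average within the group
    print(f"step {step}: replica spread = {w.std(axis=0).mean():.4f}")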

    An In-Depth Analysis of the Slingshot Interconnect

    Full text link
    The interconnect is one of the most critical components in large-scale computing systems, and its impact on application performance will only grow with system size. In this paper, we describe Slingshot, an interconnection network for large-scale computing systems. Slingshot is based on high-radix switches, which allow building exascale and hyperscale datacenter networks with at most three switch-to-switch hops. Moreover, Slingshot provides efficient adaptive routing and congestion control algorithms, and highly tunable traffic classes. Slingshot uses an optimized Ethernet protocol, which allows it to be interoperable with standard Ethernet devices while providing high performance to HPC applications. We analyze the extent to which Slingshot provides these features, evaluating it on microbenchmarks and on several applications from the datacenter and AI worlds, as well as on HPC applications. We find that applications running on Slingshot are less affected by congestion compared to previous-generation networks. Comment: To be published in Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC '20), 2020.
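
    Congestion sensitivity of this kind is typically probed with a latency microbenchmark run alongside background traffic. The sketch below (using mpi4py, assumed to be available; launched with something like mpirun -n 8 python pingpong.py) measures small-message round-trip time on ranks 0 and 1 while the remaining ranks exchange large messages. It illustrates the measurement methodology only; it is not the paper's benchmark suite.

# Ping-pong latency between ranks 0 and 1 while the other ranks act as
# "congestors" exchanging bulk messages in pairs.
from mpi4py import MPI
import numpy as np, time

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
small = np.ones(8, dtype=np.float64)          # tiny message: latency-bound
big = np.ones(1 << 20, dtype=np.float64)      # 8 MB message: congestor traffic
ITERS = 1000

comm.Barrier()
if rank == 0:                                  # latency pair: ranks 0 and 1
    t0 = time.perf_counter()
    for _ in range(ITERS):
        comm.Send(small, dest=1)
        comm.Recv(small, source=1)
    rtt = (time.perf_counter() - t0) / ITERS
    print(f"avg round-trip time: {rtt * 1e6:.2f} us with {max(size - 2, 0)} congestor ranks")
elif rank == 1:
    for _ in range(ITERS):
        comm.Recv(small, source=0)
        comm.Send(small, dest=0)
else:                                          # remaining ranks pair up and exchange bulk data
    peer = rank + 1 if rank % 2 == 0 else rank - 1
    if 1 < peer < size:
        recvbuf = np.empty_like(big)
        for _ in range(ITERS):
            comm.Sendrecv(big, dest=peer, recvbuf=recvbuf, source=peer)
comm.Barrier()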

    Big Data Application and System Co-optimization in Cloud and HPC Environment

    Get PDF
    The emergence of big data requires powerful computational resources and memory subsystems that can be scaled efficiently to accommodate its demands. Cloud is a well-established computing paradigm that can offer customized computing and memory resources to meet the scalable demands of big data applications; in addition, its flexible pay-as-you-go pricing model offers opportunities to use resources at large scale with low cost and no infrastructure-maintenance burden. High performance computing (HPC), on the other hand, also has powerful infrastructure with the potential to support big data applications. In this dissertation, we explore application and system co-optimization opportunities to support big data in both cloud and HPC environments. Specifically, we exploit the unique features of both application and system to uncover overlooked optimization opportunities, or to tackle challenges that are difficult to address by looking at the application or the system individually. Based on the characteristics of the workloads and their underlying systems, we derive optimized deployment and runtime schemes for four categories of applications: 1) memory-intensive applications; 2) compute-intensive applications; 3) applications that are both memory- and compute-intensive; and 4) I/O-intensive applications.
    When deploying memory-intensive big data applications to public clouds, one important yet challenging problem is selecting a specific instance type whose memory capacity is large enough to prevent out-of-memory errors, while the cost is minimized without violating performance requirements. We propose two techniques for efficient deployment of big data applications with dynamic and intensive memory footprints in the cloud. The first builds a performance-cost model that can accurately predict how, and by how much, virtual memory size would slow down the application and, consequently, impact the overall monetary cost. The second employs a lightweight memory-usage prediction methodology based on dynamic meta-models adjusted by the application's own traits. The key idea is to eliminate periodic checkpointing and migrate the application only when the predicted memory usage exceeds the physical allocation.
    When moving compute-intensive applications to the cloud, it is critical to make them scalable so that they can benefit from the massive cloud resources. We use Kirchhoff's law, one of the most widely used physical laws in many engineering disciplines, as an example workload for our study. The key challenge of applying Kirchhoff's law to real-world applications at scale lies in the high, if not prohibitive, computational cost of solving a large number of nonlinear equations. We propose a high-performance deep-learning-based approach for Kirchhoff analysis, namely HDK. HDK employs two techniques to improve performance: (i) early pruning of unqualified input candidates, which simplifies the equations and selects a meaningful input data range; and (ii) parallelization of forward labelling, which executes steps of the problem in parallel.
    For applications that are both memory- and compute-intensive in the cloud, we use a blockchain system as a benchmark. Existing blockchain frameworks present a technical barrier for many users who want to modify or test out new research ideas in blockchains. To make matters worse, many advantages of blockchain systems can be demonstrated only at large scale, which is not always available to researchers. We develop an accurate and efficient emulation system to replay the execution of large-scale blockchain systems on tens of thousands of nodes in the cloud.
    For I/O-intensive applications, we observe one important yet often neglected side effect of lossy scientific data compression. Lossy compression techniques have demonstrated promising results in significantly reducing scientific data size while guaranteeing compression error bounds, but the compressed data sizes are often highly skewed and thus degrade the performance of parallel I/O. We therefore believe it is critical to pay more attention to the unbalanced parallel I/O caused by lossy scientific data compression.
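
    As a toy illustration of the "predict, then migrate only if necessary" idea described above, the sketch below fits a simple linear trend to recent memory-usage samples and flags a migration only when the forecast exceeds the instance's physical memory. The linear forecast, the 16 GB instance, and the sample trace are stand-in assumptions; the dissertation's approach uses dynamic meta-models tuned to the application's own traits.

# Flag a migration only when forecast memory use exceeds the physical allocation.
import numpy as np

def should_migrate(samples_gb, physical_gb, horizon=10):
    """Forecast memory use `horizon` steps ahead with a least-squares line."""
    t = np.arange(len(samples_gb))
    slope, intercept = np.polyfit(t, samples_gb, deg=1)
    forecast = intercept + slope * (len(samples_gb) - 1 + horizon)
    return forecast > physical_gb, forecast

# Example: growing footprint on a hypothetical 16 GB instance.
trace = [4.0, 4.8, 5.9, 7.1, 8.4, 9.8, 11.3]
migrate, forecast = should_migrate(trace, physical_gb=16.0)
print(f"forecast {forecast:.1f} GB -> migrate: {migrate}")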

    CGX: Adaptive System Support for Communication-Efficient Deep Learning

    Full text link
    The ability to scale out training workloads has been one of the key performance enablers of deep learning. The main scaling approach is data-parallel GPU-based training, which has been boosted by hardware and software support for highly efficient point-to-point communication, in particular via hardware bandwidth overprovisioning. Overprovisioning comes at a cost: there is an order-of-magnitude price difference between "cloud-grade" servers with such support and their popular "consumer-grade" counterparts, even though individual server-grade and consumer-grade GPUs can have similar computational envelopes. In this paper, we show that the costly hardware-overprovisioning approach can be supplanted by algorithmic and system design, and propose a framework called CGX, which provides efficient software support for compressed communication in ML applications, both for multi-GPU single-node training and for larger-scale multi-node training. CGX is based on two technical advances. At the system level, it relies on a re-developed communication stack for ML frameworks, which provides flexible, highly efficient support for compressed communication. At the application level, it provides seamless, parameter-free integration with popular frameworks, so that end-users do not have to modify training recipes or make significant changes to training code. This is complemented by a layer-wise adaptive compression technique which dynamically balances compression gains against accuracy preservation. CGX integrates with popular ML frameworks, providing up to 3x speedups for multi-GPU nodes based on commodity hardware, and order-of-magnitude improvements in the multi-node setting, with negligible impact on accuracy.
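
    To make the compression primitive concrete, here is a minimal layer-wise top-k gradient sparsifier where the kept fraction can differ per layer (the knob an adaptive scheme would tune). The layer shapes, ratios, and the top-k choice are illustrative assumptions; CGX's actual stack implements compression (including quantization) inside a custom communication layer.

# Layer-wise top-k sparsification: keep only the largest-magnitude entries
# of each layer's gradient, with a per-layer kept fraction.
import numpy as np

def topk_compress(grad, ratio):
    """Return (indices, values, shape) of the `ratio` fraction of largest-|g| entries."""
    flat = grad.ravel()
    k = max(1, int(ratio * flat.size))
    idx = np.argpartition(np.abs(flat), -k)[-k:]
    return idx, flat[idx], grad.shape

def topk_decompress(idx, vals, shape):
    out = np.zeros(int(np.prod(shape)))
    out[idx] = vals
    return out.reshape(shape)

rng = np.random.default_rng(0)
layers = {"conv1": rng.normal(size=(64, 3, 3, 3)), "fc": rng.normal(size=(512, 10))}
ratios = {"conv1": 0.05, "fc": 0.01}          # per-layer ratios: the "adaptive" knob

for name, g in layers.items():
    idx, vals, shape = topk_compress(g, ratios[name])
    g_hat = topk_decompress(idx, vals, shape)
    err = np.linalg.norm(g - g_hat) / np.linalg.norm(g)
    print(f"{name}: kept {len(vals)}/{g.size} entries, relative error {err:.2f}")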

    Adaptiveness and Lock-free Synchronization in Parallel Stochastic Gradient Descent

    Get PDF
    The emergence of big data in recent years, due to vast societal digitalization and large-scale sensor deployment, has generated significant interest in machine learning methods that enable automatic data analytics. In the majority of learning algorithms used in industrial as well as academic settings, the first-order iterative optimization procedure stochastic gradient descent (SGD) is the backbone. However, SGD is often time-consuming, as it typically requires several passes through the entire dataset to converge to a solution of sufficient quality. To cope with increasing data volumes, and to facilitate accelerated processing on contemporary hardware, various parallel SGD variants have been proposed. In addition to traditional synchronous parallelization schemes, asynchronous ones have received particular interest in recent literature due to their improved ability to scale, owing to less coordination and hence less waiting time. However, asynchrony implies inherent challenges in understanding the execution of the algorithm and its convergence properties, due to the presence of both stale and inconsistent views of the shared state. In this work, we aim to increase the understanding of the convergence properties of SGD for practical applications under asynchronous parallelism, and to develop tools and frameworks that facilitate improved convergence as well as further research and development. First, we focus on understanding the impact of staleness and introduce models for capturing the dynamics of parallel execution of SGD. This enables (i) quantifying the statistical penalty on convergence due to staleness and (ii) deriving an adaptation scheme, introducing a staleness-adaptive SGD variant, MindTheStep-AsyncSGD, which provably reduces this penalty. Second, we explore the impact of synchronization mechanisms, in particular consistency-preserving ones, and their overall effect on convergence. To this end, we propose Leashed-SGD, an extensible algorithmic framework supporting various synchronization mechanisms for different degrees of consistency, enabling in particular a lock-free and consistency-preserving implementation. In addition, the algorithmic construction of Leashed-SGD enables dynamic memory allocation, claiming memory only when necessary, which reduces the overall memory footprint. We perform an extensive empirical study, benchmarking the proposed methods together with established baselines, focusing on the prominent application of deep learning for image classification on the benchmark datasets MNIST and CIFAR, showing significant improvements in convergence time for Leashed-SGD and MindTheStep-AsyncSGD.
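
    A toy, sequential simulation of the staleness-adaptive idea is sketched below: each update is computed from a parameter snapshot that is a random number of updates old, and the step size is damped as a function of that staleness. The damping rule lr/(1+tau), the quadratic loss, and the staleness distribution are illustrative assumptions, not the adaptation function actually derived for MindTheStep-AsyncSGD.

# Staleness-adaptive asynchronous SGD on a toy quadratic loss 0.5*||w||^2.
import numpy as np

rng = np.random.default_rng(0)
D, STEPS, LR, MAX_STALE = 10, 200, 0.1, 8
w = rng.normal(size=D)                     # shared parameters; the optimum is 0
history = [w.copy()]                       # past parameter versions

for t in range(STEPS):
    tau = rng.integers(0, min(MAX_STALE, len(history)))   # random staleness
    snapshot = history[-1 - tau]                           # stale read of the shared state
    grad = snapshot + rng.normal(scale=0.1, size=D)        # noisy gradient at the snapshot
    w = w - (LR / (1 + tau)) * grad                        # staleness-adaptive step size
    history.append(w.copy())

print(f"final ||w|| = {np.linalg.norm(w):.4f} (started at {np.linalg.norm(history[0]):.4f})")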

    A Study of Emerging HPC Applications and Their Impact on Scheduling Strategies

    Get PDF
    With the expected convergence of HPC, Big Data, and AI, new applications with different profiles are coming to HPC infrastructures. We aim at better understanding the features and needs of these applications in order to run them efficiently on HPC platforms. The approach followed is bottom-up: we thoroughly study an emerging application, Spatially Localized Atlas Network Tiles (SLANT, originating from the neuroscience community), to understand its behavior. Based on these observations, we derive a generic, yet simple, application model, namely a linear sequence of stochastic jobs. We expect this model to be representative of a large set of upcoming applications that require the computational power of HPC clusters without fitting the typical behavior of large-scale traditional applications. In a second step, we show how one can use this generic model in a scheduling framework. Specifically, we consider the problem of making reservations (of both time and memory) for an execution on an HPC platform, and we derive reservation strategies using the model from the first step of this work. Finally, we experimentally show the robustness of the model and of our scheduling strategies: they outperform the standard approaches used in the neuroscience community, even when very little data is available to build the model or when it is applied to another application than SLANT.
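
    The reservation problem above can be illustrated with a small expected-cost computation: a job with a stochastic walltime is submitted under a sequence of increasing reservations, and every attempt pays for its full reservation until one is long enough. The lognormal walltime distribution, the candidate sequences, and the simplified cost model (pay the reservation, not the actual runtime) are assumptions for illustration; the strategies and cost model in the report are richer.

# Compare candidate reservation sequences under a stochastic walltime.
import numpy as np

def expected_cost(reservations, walltime_samples):
    """Average total reserved time over sampled walltimes for a given sequence."""
    # The last reservation is assumed large enough for (almost) all samples.
    costs = []
    for wt in walltime_samples:
        total = 0.0
        for r in reservations:
            total += r                      # each attempt pays its full reservation
            if r >= wt:                     # the job fits: done
                break
        costs.append(total)
    return float(np.mean(costs))

rng = np.random.default_rng(0)
samples = rng.lognormal(mean=np.log(60), sigma=0.5, size=10000)   # walltimes in minutes

doubling = [30, 60, 120, 240, 480]          # geometric sequence of reservations
single = [480]                              # one conservative reservation
print(f"doubling sequence:  {expected_cost(doubling, samples):.1f} min reserved on average")
print(f"single reservation: {expected_cost(single, samples):.1f} min reserved on average")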

    DEVELOPING A NEW APPROACH TO CHANGE AND LEARNING IN PUBLIC SECTOR ORGANISATIONS

    Get PDF
    EThOS - Electronic Theses Online Service, United Kingdom.

    Provincial government and regional development

    Get PDF
    This research uses a case study of Xinjiang to challenge China's reform by addressing the problems rooted in its partiality and regionalisation. The reform started in the field of political administration, and toleration of decentralisation and marketisation in the economic sphere has generated economic prosperity in some regions. But economic reform was not necessarily accompanied by political transformation. Most characteristics of socialism have been retained, including political discretion and economic bailout, both regarded as major causes of economic weakness in some sectors and some provinces. The central argument for the continuation of the partial reform is the decentralisation of decision-making to the local political state, enabling local government to give a "helping hand" in facilitating change. But the partiality of the reform drives local governments in regions with political sensitivities to become "political defenders", holding back the progress of the reform there. Such unbalanced and unparalleled development amongst the regions and institutions has created imbalances in provinces such as Xinjiang, challenging the success of China's reform overall. In politically sensitive regions, the Communist Party has retained an administrative stranglehold and development has stagnated, not only calling into question the sustainability of the reforms but also potentially threatening China's unity and political stability. The thesis uses Xinjiang, which is politically very sensitive because of its ethnicity and strategic resources, to argue this point.