27,912 research outputs found

    Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs, and Performance Evaluation

    Full text link
    TensorFlow has been the most widely adopted Machine/Deep Learning framework. However, little exists in the literature that provides a thorough understanding of the capabilities which TensorFlow offers for the distributed training of large ML/DL models that need computation and communication at scale. Most commonly used distributed training approaches for TF can be categorized as follows: 1) Google Remote Procedure Call (gRPC), 2) gRPC+X: X=(InfiniBand Verbs, Message Passing Interface, and GPUDirect RDMA), and 3) No-gRPC: Baidu Allreduce with MPI, Horovod with MPI, and Horovod with NVIDIA NCCL. In this paper, we provide an in-depth performance characterization and analysis of these distributed training approaches on various GPU clusters including the Piz Daint system (6 on Top500). We perform experiments to gain novel insights along the following vectors: 1) Application-level scalability of DNN training, 2) Effect of Batch Size on scaling efficiency, 3) Impact of the MPI library used for no-gRPC approaches, and 4) Type and size of DNN architectures. Based on these experiments, we present two key insights: 1) Overall, No-gRPC designs achieve better performance compared to gRPC-based approaches for most configurations, and 2) The performance of No-gRPC is heavily influenced by the gradient aggregation using Allreduce. Finally, we propose a truly CUDA-Aware MPI Allreduce design that exploits CUDA kernels and pointer caching to perform large reductions efficiently. Our proposed designs offer 5-17X better performance than NCCL2 for small and medium messages, and reduces latency by 29% for large messages. The proposed optimizations help Horovod-MPI to achieve approximately 90% scaling efficiency for ResNet-50 training on 64 GPUs. Further, Horovod-MPI achieves 1.8X and 3.2X higher throughput than the native gRPC method for ResNet-50 and MobileNet, respectively, on the Piz Daint cluster.Comment: 10 pages, 9 figures, submitted to IEEE IPDPS 2019 for peer-revie

    JXTA-Overlay: a P2P platform for distributed, collaborative, and ubiquitous computing

    Get PDF
    With the fast growth of the Internet infrastructure and the use of large-scale complex applications in industries, transport, logistics, government, health, and businesses, there is an increasing need to design and deploy multifeatured networking applications. Important features of such applications include the capability to be self-organized, be decentralized, integrate different types of resources (personal computers, laptops, and mobile and sensor devices), and provide global, transparent, and secure access to resources. Moreover, such applications should support not only traditional forms of reliable distributing computing and optimization of resources but also various forms of collaborative activities, such as business, online learning, and social networks in an intelligent and secure environment. In this paper, we present the Juxtapose (JXTA)-Overlay, which is a JXTA-based peer-to-peer (P2P) platform designed with the aim to leverage capabilities of Java, JXTA, and P2P technologies to support distributed and collaborative systems. The platform can be used not only for efficient and reliable distributed computing but also for collaborative activities and ubiquitous computing by integrating in the platform end devices. The design of a user interface as well as security issues are also tackled. We evaluate the proposed system by experimental study and show its usefulness for massive processing computations and e-learning applications.Peer ReviewedPostprint (author's final draft

    The SATIN component system - a metamodel for engineering adaptable mobile systems

    Get PDF
    Mobile computing devices, such as personal digital assistants and mobile phones, are becoming increasingly popular, smaller, and more capable. We argue that mobile systems should be able to adapt to changing requirements and execution environments. Adaptation requires the ability-to reconfigure the deployed code base on a mobile device. Such reconfiguration is considerably simplified if mobile applications are component-oriented rather than monolithic blocks of code. We present the SATIN (system adaptation targeting integrated networks) component metamodel, a lightweight local component metamodel that offers the flexible use of logical mobility primitives to reconfigure the software system by dynamically transferring code. The metamodel is implemented in the SATIN middleware system, a component-based mobile computing middleware that uses the mobility primitives defined in the metamodel to reconfigure both itself and applications that it hosts. We demonstrate the suitability of SATIN in terms of lightweightedness, flexibility, and reusability for the creation of adaptable mobile systems by using it to implement, port, and evaluate a number of existing and new applications, including an active network platform developed for satellite communication at the European space agency. These applications exhibit different aspects of adaptation and demonstrate the flexibility of the approach and the advantages gaine

    Acute: high-level programming language design for distributed computation

    No full text
    Existing languages provide good support for typeful programming of standalone programs. In a distributed system, however, there may be interaction between multiple instances of many distinct programs, sharing some (but not necessarily all) of their module structure, and with some instances rebuilt with new versions of certain modules as time goes on. In this paper we discuss programming language support for such systems, focussing on their typing and naming issues. We describe an experimental language, Acute, which extends an ML core to support distributed development, deployment, and execution, allowing type-safe interaction between separately-built programs. The main features are: (1) type-safe marshalling of arbitrary values; (2) type names that are generated (freshly and by hashing) to ensure that type equality tests suffice to protect the invariants of abstract types, across the entire distributed system; (3) expression-level names generated to ensure that name equality tests suffice for type-safety of associated values, e.g. values carried on named channels; (4) controlled dynamic rebinding of marshalled values to local resources; and (5) thunkification of threads and mutexes to support computation mobility. These features are a large part of what is needed for typeful distributed programming. They are a relatively lightweight extension of ML, should be efficiently implementable, and are expressive enough to enable a wide variety of distributed infrastructure layers to be written as simple library code above the byte-string network and persistent store APIs. This disentangles the language runtime from communication intricacies. This paper highlights the main design choices in Acute. It is supported by a full language definition (of typing, compilation, and operational semantics), by a prototype implementation, and by example distribution libraries

    Gunrock: A High-Performance Graph Processing Library on the GPU

    Full text link
    For large-scale graph analytics on the GPU, the irregularity of data access and control flow, and the complexity of programming GPUs have been two significant challenges for developing a programmable high-performance graph library. "Gunrock", our graph-processing system designed specifically for the GPU, uses a high-level, bulk-synchronous, data-centric abstraction focused on operations on a vertex or edge frontier. Gunrock achieves a balance between performance and expressiveness by coupling high performance GPU computing primitives and optimization strategies with a high-level programming model that allows programmers to quickly develop new graph primitives with small code size and minimal GPU programming knowledge. We evaluate Gunrock on five key graph primitives and show that Gunrock has on average at least an order of magnitude speedup over Boost and PowerGraph, comparable performance to the fastest GPU hardwired primitives, and better performance than any other GPU high-level graph library.Comment: 14 pages, accepted by PPoPP'16 (removed the text repetition in the previous version v5
    corecore