27,912 research outputs found
Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs, and Performance Evaluation
TensorFlow has been the most widely adopted Machine/Deep Learning framework.
However, little exists in the literature that provides a thorough understanding
of the capabilities which TensorFlow offers for the distributed training of
large ML/DL models that need computation and communication at scale. Most
commonly used distributed training approaches for TF can be categorized as
follows: 1) Google Remote Procedure Call (gRPC), 2) gRPC+X: X=(InfiniBand
Verbs, Message Passing Interface, and GPUDirect RDMA), and 3) No-gRPC: Baidu
Allreduce with MPI, Horovod with MPI, and Horovod with NVIDIA NCCL. In this
paper, we provide an in-depth performance characterization and analysis of
these distributed training approaches on various GPU clusters including the Piz
Daint system (6 on Top500). We perform experiments to gain novel insights along
the following vectors: 1) Application-level scalability of DNN training, 2)
Effect of Batch Size on scaling efficiency, 3) Impact of the MPI library used
for no-gRPC approaches, and 4) Type and size of DNN architectures. Based on
these experiments, we present two key insights: 1) Overall, No-gRPC designs
achieve better performance compared to gRPC-based approaches for most
configurations, and 2) The performance of No-gRPC is heavily influenced by the
gradient aggregation using Allreduce. Finally, we propose a truly CUDA-Aware
MPI Allreduce design that exploits CUDA kernels and pointer caching to perform
large reductions efficiently. Our proposed designs offer 5-17X better
performance than NCCL2 for small and medium messages, and reduces latency by
29% for large messages. The proposed optimizations help Horovod-MPI to achieve
approximately 90% scaling efficiency for ResNet-50 training on 64 GPUs.
Further, Horovod-MPI achieves 1.8X and 3.2X higher throughput than the native
gRPC method for ResNet-50 and MobileNet, respectively, on the Piz Daint
cluster.Comment: 10 pages, 9 figures, submitted to IEEE IPDPS 2019 for peer-revie
JXTA-Overlay: a P2P platform for distributed, collaborative, and ubiquitous computing
With the fast growth of the Internet infrastructure and the use of large-scale complex applications in industries, transport, logistics, government, health, and businesses, there is an increasing need to design and deploy multifeatured networking applications. Important features of such applications include the capability to be self-organized, be decentralized, integrate different types of resources (personal computers, laptops, and mobile and sensor devices), and provide global, transparent, and secure access to resources. Moreover, such applications should support not only traditional forms of reliable distributing computing and optimization of resources but also various forms of collaborative activities, such as business, online learning, and social networks in an intelligent and secure environment. In this paper, we present the Juxtapose (JXTA)-Overlay, which is a JXTA-based peer-to-peer (P2P) platform designed with the aim to leverage capabilities of Java, JXTA, and P2P technologies to support distributed and collaborative systems. The platform can be used not only for efficient and reliable distributed computing but also for collaborative activities and ubiquitous computing by integrating in the platform end devices. The design of a user interface as well as security issues are also tackled. We evaluate the proposed system by experimental study and show its usefulness for massive processing computations and e-learning applications.Peer ReviewedPostprint (author's final draft
The SATIN component system - a metamodel for engineering adaptable mobile systems
Mobile computing devices, such as personal digital assistants and mobile phones, are becoming increasingly popular, smaller, and more capable. We argue that mobile systems should be able to adapt to changing requirements and execution environments. Adaptation requires the ability-to reconfigure the deployed code base on a mobile device. Such reconfiguration is considerably simplified if mobile applications are component-oriented rather than monolithic blocks of code. We present the SATIN (system adaptation targeting integrated networks) component metamodel, a lightweight local component metamodel that offers the flexible use of logical mobility primitives to reconfigure the software system by dynamically transferring code. The metamodel is implemented in the SATIN middleware system, a component-based mobile computing middleware that uses the mobility primitives defined in the metamodel to reconfigure both itself and applications that it hosts. We demonstrate the suitability of SATIN in terms of lightweightedness, flexibility, and reusability for the creation of adaptable mobile systems by using it to implement, port, and evaluate a number of existing and new applications, including an active network platform developed for satellite communication at the European space agency. These applications exhibit different aspects of adaptation and demonstrate the flexibility of the approach and the advantages gaine
Acute: high-level programming language design for distributed computation
Existing languages provide good support for typeful programming of standalone programs. In a distributed system, however, there may be interaction between multiple instances of many distinct programs, sharing some (but not necessarily all) of their module structure, and with some instances rebuilt with new versions of certain modules as time goes on. In this paper we discuss programming language support for such systems, focussing on their typing and naming issues. We describe an experimental language, Acute, which extends an ML core to support distributed development, deployment, and execution, allowing type-safe interaction between separately-built programs. The main features are: (1) type-safe marshalling of arbitrary values; (2) type names that are generated (freshly and by hashing) to ensure that type equality tests suffice to protect the invariants of abstract types, across the entire distributed system; (3) expression-level names generated to ensure that name equality tests suffice for type-safety of associated values, e.g. values carried on named channels; (4) controlled dynamic rebinding of marshalled values to local resources; and (5) thunkification of threads and mutexes to support computation mobility. These features are a large part of what is needed for typeful distributed programming. They are a relatively lightweight extension of ML, should be efficiently implementable, and are expressive enough to enable a wide variety of distributed infrastructure layers to be written as simple library code above the byte-string network and persistent store APIs. This disentangles the language runtime from communication intricacies. This paper highlights the main design choices in Acute. It is supported by a full language definition (of typing, compilation, and operational semantics), by a prototype implementation, and by example distribution libraries
Gunrock: A High-Performance Graph Processing Library on the GPU
For large-scale graph analytics on the GPU, the irregularity of data access
and control flow, and the complexity of programming GPUs have been two
significant challenges for developing a programmable high-performance graph
library. "Gunrock", our graph-processing system designed specifically for the
GPU, uses a high-level, bulk-synchronous, data-centric abstraction focused on
operations on a vertex or edge frontier. Gunrock achieves a balance between
performance and expressiveness by coupling high performance GPU computing
primitives and optimization strategies with a high-level programming model that
allows programmers to quickly develop new graph primitives with small code size
and minimal GPU programming knowledge. We evaluate Gunrock on five key graph
primitives and show that Gunrock has on average at least an order of magnitude
speedup over Boost and PowerGraph, comparable performance to the fastest GPU
hardwired primitives, and better performance than any other GPU high-level
graph library.Comment: 14 pages, accepted by PPoPP'16 (removed the text repetition in the
previous version v5
- …