
    Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs, and Performance Evaluation

    TensorFlow has been the most widely adopted Machine/Deep Learning framework. However, little exists in the literature that provides a thorough understanding of the capabilities that TensorFlow offers for the distributed training of large ML/DL models that need computation and communication at scale. The most commonly used distributed training approaches for TensorFlow (TF) can be categorized as follows: 1) Google Remote Procedure Call (gRPC), 2) gRPC+X, where X is one of InfiniBand Verbs, Message Passing Interface (MPI), or GPUDirect RDMA, and 3) No-gRPC: Baidu Allreduce with MPI, Horovod with MPI, and Horovod with NVIDIA NCCL. In this paper, we provide an in-depth performance characterization and analysis of these distributed training approaches on various GPU clusters, including the Piz Daint system (ranked 6th on the Top500 list). We perform experiments to gain novel insights along the following vectors: 1) application-level scalability of DNN training, 2) effect of batch size on scaling efficiency, 3) impact of the MPI library used for No-gRPC approaches, and 4) type and size of DNN architectures. Based on these experiments, we present two key insights: 1) overall, No-gRPC designs achieve better performance than gRPC-based approaches for most configurations, and 2) the performance of No-gRPC is heavily influenced by the gradient aggregation using Allreduce. Finally, we propose a truly CUDA-Aware MPI Allreduce design that exploits CUDA kernels and pointer caching to perform large reductions efficiently. Our proposed designs offer 5-17X better performance than NCCL2 for small and medium messages, and reduce latency by 29% for large messages. The proposed optimizations help Horovod-MPI to achieve approximately 90% scaling efficiency for ResNet-50 training on 64 GPUs. Further, Horovod-MPI achieves 1.8X and 3.2X higher throughput than the native gRPC method for ResNet-50 and MobileNet, respectively, on the Piz Daint cluster. Comment: 10 pages, 9 figures, submitted to IEEE IPDPS 2019 for peer-review
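
The abstract above attributes much of the No-gRPC performance to gradient aggregation with Allreduce. As a rough illustration of what an Allreduce computes, the sketch below simulates a ring allreduce in plain Python. This is an assumption made for illustration only: the ring algorithm is one common way to implement Allreduce, it is not the CUDA-Aware MPI design proposed in the paper, and the function name `ring_allreduce` and the list-of-lists representation of per-worker gradients are hypothetical.

```python
def ring_allreduce(grads):
    """Simulate a sum-allreduce over equally sized per-worker gradient lists.

    Illustrative only: workers are list entries, "sending" is a list copy.
    Returns one list per worker, each holding the element-wise sum.
    """
    n = len(grads)
    length = len(grads[0])
    # Split the index range into n contiguous chunks (some may be empty).
    bounds = [(i * length) // n for i in range(n + 1)]
    data = [list(g) for g in grads]

    def chunk(idx):
        return range(bounds[idx], bounds[idx + 1])

    # Phase 1, reduce-scatter: after n-1 steps, worker r holds the fully
    # summed chunk (r + 1) % n.
    for step in range(n - 1):
        # Snapshot all outgoing messages first so sends happen "simultaneously".
        sends = []
        for r in range(n):
            c = (r - step) % n
            sends.append((c, [data[r][i] for i in chunk(c)]))
        for r in range(n):
            dst = (r + 1) % n
            c, payload = sends[r]
            for i, v in zip(chunk(c), payload):
                data[dst][i] += v

    # Phase 2, allgather: circulate the finished chunks around the ring.
    for step in range(n - 1):
        sends = []
        for r in range(n):
            c = (r + 1 - step) % n
            sends.append((c, [data[r][i] for i in chunk(c)]))
        for r in range(n):
            dst = (r + 1) % n
            c, payload = sends[r]
            for i, v in zip(chunk(c), payload):
                data[dst][i] = v

    return data


if __name__ == "__main__":
    workers = [[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400]]
    for result in ring_allreduce(workers):
        print(result)
```

In this scheme each worker exchanges 2(n-1) chunk-sized messages regardless of how many workers take part, which is one reason Allreduce-based aggregation is often preferred over funnelling all gradients through a single server.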

    Supporting distributed computation over wide area gigabit networks

    The advent of high bandwidth fibre optic links that may be used over very large distances has led to much research and development in the field of wide area gigabit networking. One problem that needs to be addressed is how loosely coupled distributed systems may be built over these links, allowing many computers worldwide to take part in complex calculations in order to solve "Grand Challenge" problems. The research conducted as part of this PhD has looked at the practicality of implementing a communication mechanism proposed by Craig Partridge called Late-binding Remote Procedure Calls (LbRPC). LbRPC is intended to export both code and data over the network to remote machines for evaluation, as opposed to traditional RPC mechanisms that only send parameters to pre-existing remote procedures. The ability to send code as well as data means that LbRPC requests can overcome one of the biggest problems in Wide Area Distributed Computer Systems (WADCS): the fixed latency due to the speed of light. As machines get faster, the fixed multi-millisecond round trip delay equates to ever increasing numbers of CPU cycles. For a WADCS to be efficient, programs should minimise the number of network transits they incur. Allowing the application programmer to export arbitrary code to the remote machine makes this possible. This research has looked at the feasibility of supporting secure exportation of arbitrary code and data in heterogeneous, loosely coupled, distributed computing environments. It has investigated techniques for making placement decisions for the code in cases where there are a large number of widely dispersed remote servers that could be used. The latter has resulted in the development of a novel prototype LbRPC using multicast IP for implicit placement and a sequenced, multi-packet saturation multicast transport protocol. These prototypes show that it is possible to export code and data to multiple remote hosts, thereby removing the need to perform complex and error-prone explicit process placement decisions.
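
The latency argument in the abstract above can be made concrete with a little arithmetic. The sketch below is a simplified model under stated assumptions: the function names, the example round-trip time and clock rate are hypothetical, and the model assumes that exported code can run all dependent steps remotely within a single round trip while ignoring execution and transfer time.

```python
def rtt_in_cycles(rtt_ms: int, clock_mhz: int) -> int:
    """CPU cycles that elapse during one round trip.

    1 ms at 1 MHz is 1000 cycles, hence the factor of 1000.
    """
    return rtt_ms * clock_mhz * 1000


def network_delay_ms(rtt_ms: int, dependent_calls: int, ship_code: bool) -> int:
    """Total round-trip delay for a chain of dependent remote calls.

    Assumption: a traditional RPC pays one round trip per dependent call,
    while shipping the code pays a single round trip for the whole chain.
    """
    round_trips = 1 if ship_code else dependent_calls
    return rtt_ms * round_trips


if __name__ == "__main__":
    # Hypothetical figures: 50 ms round trip, 1000 MHz clock, 20 dependent calls.
    print(rtt_in_cycles(50, 1000))            # cycles idle per round trip
    print(network_delay_ms(50, 20, False))    # one round trip per call
    print(network_delay_ms(50, 20, True))     # code exported once
```

Doubling the clock rate doubles the number of cycles spent waiting on each round trip, while the round-trip time itself stays fixed, which is the trend the abstract describes.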

    Applications of agent architectures to decision support in distributed simulation and training systems

    This work develops the approach and presents the results of a new model for applying intelligent agents to complex distributed interactive simulation for command and control. In the framework of tactical command, control, communications, computers and intelligence (C4I), software agents provide a novel approach for efficient decision support and distributed interactive mission training. An agent-based architecture for decision support is designed, implemented, and applied in a distributed interactive simulation to significantly enhance command and control training during simulated exercises. The architecture is based on monitoring, evaluation, and advice agents, which cooperate to provide alternatives to the decision-maker in a time- and resource-constrained environment. The architecture is implemented and tested within the context of an AWACS Weapons Director trainer tool. The foundation of the work required a wide range of preliminary research topics to be covered, including real-time systems, resource allocation, agent-based computing, decision support systems, and distributed interactive simulations. The major contribution of our work is the construction of a multi-agent architecture and its application to an operational decision support system for command and control interactive simulation. The architectural design for the multi-agent system was drafted in the first stage of the work. In the next stage, rules of engagement and objective and cost functions were determined in the AWACS (Air Force command and control) decision support domain. Finally, the multi-agent architecture was implemented and evaluated inside a distributed interactive simulation test-bed for AWACS WDs. The evaluation process combined individual and team use of the decision support system to improve the performance results of WD trainees. The decision support system is designed and implemented as a distributed architecture for performance-oriented management of software agents. The approach provides new agent interaction protocols and utilizes agent performance monitoring and remote synchronization mechanisms. This multi-agent architecture enables direct and indirect agent communication as well as dynamic hierarchical agent coordination. Inter-agent communications use predefined interfaces, protocols, and open channels with specified ontology and semantics. Services can be requested and responses with results received over such communication modes. Both traditional (functional) parameters and nonfunctional requirements (e.g. QoS, deadline, etc.) are captured in service requests.

    Minor Whey Protein Purification Using Ion-Exchange Column Chromatography

    This thesis is concerned with the application of mechanistic models to the recovery and purification of two minor milk proteins, with the aim of developing an efficient and robust process. A fundamental and quantitative understanding of the underlying mechanisms helps to evaluate the opportunities and challenges in non-linear chromatography. The first chapter considers adsorption isotherm data for two minor whey proteins on a cation exchanger under various conditions, which are used as the basis for a predictive approach that correlates adsorption behavior using a mechanistic isotherm model. The steric mass action (SMA) isotherm model explicitly considers the contributions of protein-adsorbent and protein-protein interactions in the simulation of salt gradients in ion exchange chromatography. Sensitivity and robustness analysis by factorial design of experiments within this framework proved to be highly consistent and even allowed for scale-up predictions of excellent quality. In the next part of the thesis, the nonlinear gradient elution was optimized with respect to three process factors: the length of the gradient, the final salt concentration at the end of the gradient, and the flow velocity. Predictions based on a response surface modeling (RSM) approach were applied to reveal significant process factors. The optimal operating point was then determined by the calibrated mechanistic model within and outside the design space. The operating conditions containing optimal information were experimentally verified, which confirmed the accuracy of the simulations. The third chapter considers the effects of scale-up and operating conditions on the dynamic adsorption of proteins. For two columns of similar bed height, flow distribution properties were observed under non-binding conditions. Elution profiles were employed to determine the dominant mass transport mechanisms. Breakthrough profiles were compared at different flow rates and protein loading concentrations. The efficiency of the two columns, in terms of HETP and dynamic binding capacity, was calculated and compared. The outcomes resulting from the application of mechanistic models to the purification of lactoperoxidase and lactoferrin in this thesis provide a platform for the next step towards the recovery of high-value proteins at industrial scales.
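
The abstract above compares columns by HETP and dynamic binding capacity without stating how they were computed. The sketch below uses common textbook definitions as an assumption, not the thesis's own procedure: the plate number from the peak width at half height of a roughly Gaussian peak, HETP as bed length divided by plate number, and dynamic binding capacity as the protein loaded up to a chosen breakthrough level (often 10%) per unit of column volume. All function and parameter names are hypothetical.

```python
import math


def plate_number(retention_time: float, width_half_height: float) -> float:
    """Plate number N = 5.54 * (t_R / w_half)^2 for a near-Gaussian peak."""
    return 5.54 * (retention_time / width_half_height) ** 2


def hetp(bed_length: float, retention_time: float, width_half_height: float) -> float:
    """Height equivalent to a theoretical plate, in the units of bed_length."""
    return bed_length / plate_number(retention_time, width_half_height)


def dynamic_binding_capacity(feed_conc: float, v_breakthrough: float,
                             v_dead: float, v_column: float) -> float:
    """Mass bound per column volume at the chosen breakthrough level.

    feed_conc is in mass per volume; the three volumes share one unit.
    """
    return feed_conc * (v_breakthrough - v_dead) / v_column


if __name__ == "__main__":
    # Hypothetical inputs, chosen only to exercise the formulas.
    print(hetp(bed_length=55.4, retention_time=10.0, width_half_height=1.0))
    print(dynamic_binding_capacity(2.0, 12.0, 2.0, 5.0))
    assert math.isclose(plate_number(10.0, 1.0), 554.0)
```

A lower HETP indicates a more efficient column for the same bed length, so comparing HETP at several flow rates is one way to see how scale-up affects band broadening.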

    Reconfigurable middleware architectures for large scale sensor networks

    Wireless sensor networks, in an effort to be energy efficient, typically lack the high-level abstractions of advanced programming languages. Though strong, the dichotomy between these two paradigms can be overcome. The SENSIX software framework, described in this dissertation, uniquely integrates constraint-dominated wireless sensor networks with the flexibility of object-oriented programming models, without violating the principles of either. Though these two computing paradigms are contradictory in many ways, SENSIX bridges them to yield a dynamic middleware abstraction unifying low-level resource-aware task reconfiguration and high-level object recomposition. Through the layered approach of SENSIX, the software developer creates a domain-specific sensing architecture by defining a customized task specification and utilizing object inheritance. In addition, SENSIX performs better at large scales (on the order of 1000 nodes or more) than other sensor network middleware that does not include such unified facilities for vertical integration.

    RESTful Service Composition

    The Service-Oriented Architecture (SOA) has become one of the most popular approaches to building large-scale network applications. Web service technologies are the de facto default implementation of SOA. Simple Object Access Protocol (SOAP) is the key and fundamental technology of web services. Service composition is a way to deliver complex services based on existing partner services. Service orchestration with the support of Web Services Business Process Execution Language (WSBPEL) is the dominant approach to web service composition. WSBPEL-based service orchestration inherited the interoperability issues of SOAP, and it has furthermore been challenged on performance, scalability, reliability and modifiability. I present an architectural approach for service composition in this thesis to address these challenges. An architectural solution is generic enough to be applied to a large spectrum of problems. I name the architectural style RESTful Service Composition (RSC), because many of its elements and constraints are derived from Representational State Transfer (REST). REST is an architectural style developed to describe the architecture of the Web. The Web has demonstrated outstanding interoperability, performance, scalability, reliability and modifiability. RSC is designed for service composition on the Internet. The RSC style is composed of specific element types, including RESTful service composition client, RESTful partner proxy, composite resource, resource client, functional computation and relaying service. A service composition is partitioned into stages; each stage is represented as a computation that has a uniform identifier and a set of uniform access methods; and the transitions between stages are driven by computational batons. RSC is supplemented by a programming model that emphasizes on-demand function, map-reduce and continuation passing. An RSC-style composition does not depend on either a central conductor service or a common choreography specification, which makes it different from service orchestration or service choreography. Four scenarios are used to evaluate the performance, scalability, reliability and modifiability improvement of the RSC approach compared to orchestration. An RSC-style solution and an orchestration solution are compared side by side in every scenario. The first scenario evaluates the performance improvement of the X-Ray Diffraction (XRD) application in ScienceStudio; the second scenario evaluates the scalability improvement of the Process Variable (PV) snapshot application; the third scenario evaluates the reliability improvement of a notification application by simulation; and the fourth scenario evaluates the modifiability improvement of the XRD application in order to fulfil emerging requirements. The results show that the RSC approach outperforms the orchestration approach in every aspect.

    Analysis of current middleware used in peer-to-peer and grid implementations for enhancement by catallactic mechanisms

    This deliverable describes the work done in task 3.1, "Middleware analysis: Analysis of current middleware used in peer-to-peer and grid implementations for enhancement by catallactic mechanisms", from work package 3, Middleware Implementation. The document is divided into four parts: the introduction with application scenarios and middleware requirements, the Catnets middleware architecture, an evaluation of existing middleware toolkits, and conclusions. The work defines requirements for grid and peer-to-peer middleware architectures and analyses them for their suitability for a prototype implementation of catallaxy. A middleware architecture for implementing catallaxy in application-layer networks is presented. Grid Computing

    A Consensus Algorithm Based on Risk Assessment Model for Permissioned Blockchain

    Blockchain technology enables stakeholders to conduct trusted data sharing and exchange without a trusted centralized institution. These features make blockchain applications attractive for enhancing trustworthiness in very different contexts. Due to its unique design concepts and outstanding performance, blockchain has become a popular research topic in industry and academia in recent years. Every participant is anonymous in a permissionless blockchain, as represented by cryptocurrency applications such as Bitcoin. In this situation, special incentive mechanisms, such as mined native cryptocurrency, are applied to solve the trust issues of permissionless blockchains. In many use cases, permissionless blockchains have bottlenecks in transaction throughput, which restricts further application in the real world. A permissioned blockchain can reach a consensus among a group of entities that do not fully trust one another. Unlike permissionless blockchains, the participants must be identified in permissioned blockchains. By relying on traditional crash fault-tolerant consensus protocols, permissioned blockchains can achieve high transaction throughput and low latency without sacrificing security. However, how to balance security and consensus efficiency is still an issue that urgently needs to be solved in permissioned blockchains. As the core module of blockchain technology, the consensus algorithm plays a vital role in the performance of the blockchain system. Thus, this paper proposes a new consensus algorithm for permissioned blockchains, the Risk Assessment-based Consensus protocol (RAC), which combines the decentralized design concept with a risk-node assessment mechanism to address the imbalance among speed, scalability, and security. Comment: 32 pages, 11 figures