
    SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient

    Many deep learning applications benefit from using large models with billions of parameters. Training these models is notoriously expensive due to the need for specialized HPC clusters. In this work, we consider alternative setups for training large models: using cheap "preemptible" instances or pooling existing resources from multiple regions. We analyze the performance of existing model-parallel algorithms in these conditions and find configurations where training larger models becomes less communication-intensive. Based on these findings, we propose SWARM parallelism, a model-parallel training algorithm designed for poorly connected, heterogeneous and unreliable devices. SWARM creates temporary randomized pipelines between nodes that are rebalanced in case of failure. We empirically validate our findings and compare SWARM parallelism with existing large-scale training approaches. Finally, we combine our insights with compression strategies to train a large Transformer language model with 1B shared parameters (approximately 13B before sharing) on preemptible T4 GPUs with less than 200 Mb/s of network bandwidth. Comment: Accepted to International Conference on Machine Learning (ICML) 2023. 25 pages, 8 figures.
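
    The abstract's key mechanism is the temporary randomized pipeline: each stage of the model is served by a pool of interchangeable peers, work is routed through a randomly chosen healthy peer per stage, and peers are reassigned when instances are preempted. The Python sketch below illustrates only that routing/rebalancing idea; the class and policy names are hypothetical, and this is not SWARM's actual scheduler.

```python
import random

# Toy model of SWARM-style stochastic wiring: every microbatch builds a
# temporary pipeline by picking one random healthy peer per stage.
# Names and policies are illustrative, not SWARM's real scheduler.
class StagePool:
    def __init__(self, peers):
        self.peers = set(peers)          # peers currently serving this stage

    def pick(self):
        return random.choice(sorted(self.peers))

    def fail(self, peer):
        self.peers.discard(peer)         # a preempted instance drops out

    def join(self, peer):
        self.peers.add(peer)             # a rebalanced or new instance joins

def route_microbatch(stage_pools):
    """Build a temporary pipeline: one randomly chosen peer per stage."""
    return [pool.pick() for pool in stage_pools]

if __name__ == "__main__":
    pools = [StagePool([f"stage{i}-gpu{j}" for j in range(3)]) for i in range(4)]
    print("pipeline:", route_microbatch(pools))
    pools[1].fail("stage1-gpu0")         # instance preempted mid-training
    pools[1].join("stage3-gpu2")         # idle peer rebalanced onto stage 1
    print("pipeline after rebalance:", route_microbatch(pools))
```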

    Exploring Fully Offloaded GPU Stream-Aware Message Passing

    Modern heterogeneous supercomputing systems are composed of CPUs, GPUs, and high-speed network interconnects. Communication libraries supporting efficient data transfers involving buffers in GPU memory typically require the CPU to orchestrate the data transfer operations. A new offload-friendly communication strategy, stream-triggered (ST) communication, was explored to allow offloading the synchronization and data movement operations from the CPU to the GPU. An implementation based on Message Passing Interface (MPI) one-sided active target synchronization was used as an exemplar to illustrate the proposed strategy. A latency-sensitive nearest-neighbor microbenchmark was used to explore the various performance aspects of the implementation. The offloaded implementation shows significant on-node performance advantages over standard MPI active RMA (36%) and point-to-point (61%) communication. The current multi-node improvement is smaller (23% faster than standard active RMA but 11% slower than point-to-point), and work is in progress to pursue further improvements. Comment: 12 pages, 17 figures.
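
    For reference, the host-orchestrated baseline that the stream-triggered approach is compared against is MPI one-sided (RMA) communication with active target synchronization. The mpi4py sketch below shows a fence-synchronized nearest-neighbor exchange of that baseline flavor; it is a minimal illustration, not the paper's GPU-offloaded implementation or its actual benchmark harness.

```python
# Host-driven MPI RMA nearest-neighbor exchange with active target
# synchronization (fences), using mpi4py. Run with e.g. `mpirun -n 4 python x.py`.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
left, right = (rank - 1) % size, (rank + 1) % size

send = np.full(1024, rank, dtype=np.float64)   # halo data we expose to a neighbor
recv = np.zeros_like(send)                     # buffer our left neighbor writes into

win = MPI.Win.Create(recv, comm=comm)          # expose recv as an RMA window
win.Fence()                                    # open the access/exposure epoch
win.Put(send, right)                           # write our halo into the right neighbor
win.Fence()                                    # close the epoch; transfers are complete
print(f"rank {rank} received halo from rank {left}: {recv[0]}")
win.Free()
```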

    Use of Digital Tools to Support the Quality of Mediation

    This report examines the digitalisation of mediation and the current state of mediation technologies, as well as their opportunities and limitations, in Finland and eight other comparison countries. The report responds to the goal of developing the administration of justice and digital services on a research-driven basis and in keeping with the rule of law. The aim is to identify ways of utilising different technologies in mediation and to map the legal, experiential and administrative risks associated with the various options. The material consists of eight expert interviews, official reports, news sources and scientific articles. According to the report, the use of digital tools in mediation is generally limited, although the Covid pandemic temporarily increased the use of remote connections in mediation sessions. The main challenges in developing digital mediation procedures are the difficulty of reforming established practices and the scarcity of research and experience on the topic. The report recommends that digitalisation be advanced as part of the broader development of mediation and its structures. Goal-setting should emphasise quality rather than primarily economic efficiency. The guiding principles for developing the digitalisation of mediation must be the parties' self-determination, experiences of fairness, and user-friendliness.

    Towards A Practical High-Assurance Systems Programming Language

    Writing correct and performant low-level systems code is a notoriously demanding job, even for experienced developers. To make matters worse, formally reasoning about its correctness properties introduces yet another level of complexity to the task, requiring considerable expertise in both systems programming and formal verification. Development can be extremely costly due to the sheer complexity of the systems and the nuances in them, if not assisted by appropriate tools that provide abstraction and automation. Cogent is designed to alleviate the burden on developers when writing and verifying systems code. It is a high-level functional language with a certifying compiler, which automatically proves the correctness of the compiled code and also provides a purely functional abstraction of the low-level program to the developer. Equational reasoning techniques can then be used to prove functional correctness properties of the program on top of this abstract semantics, which is notably less laborious than directly verifying the C code. To make Cogent a more approachable and effective tool for developing real-world systems, we further strengthen the framework by extending the core language and its ecosystem. Specifically, we enrich the language to allow users to control the memory representation of algebraic data types, while retaining the automatic proof via a data layout refinement calculus. We repurpose existing tools in a novel way and develop an intuitive foreign function interface, which gives users a seamless experience when using Cogent in conjunction with native C. We augment the Cogent ecosystem with a property-based testing framework, which helps developers better understand the impact formal verification has on their programs and enables a progressive approach to producing high-assurance systems. Finally, we explore refinement type systems, which we plan to incorporate into Cogent for more expressiveness and better integration of systems programmers with the verification process.
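
    The property-based testing idea mentioned above amounts to checking, on many generated inputs, that a low-level implementation agrees with its purely functional specification. The Python sketch below illustrates that general refinement-testing pattern with the hypothesis library; it is not Cogent's actual framework, and spec_sum/impl_sum are made-up stand-ins for a functional spec and a systems-level implementation.

```python
# Illustrative refinement-style property test (not Cogent's tooling):
# an imperative, "low-level" implementation is checked against a purely
# functional specification on randomly generated inputs.
from hypothesis import given, strategies as st

def spec_sum(xs):
    """Purely functional specification."""
    return sum(xs)

def impl_sum(xs):
    """Imperative implementation under test."""
    acc = 0
    for x in xs:
        acc += x
    return acc

@given(st.lists(st.integers(min_value=-2**31, max_value=2**31 - 1)))
def test_impl_refines_spec(xs):
    # Refinement property: the implementation agrees with the spec everywhere.
    assert impl_sum(xs) == spec_sum(xs)
```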

    TeamSTEPPS and Organizational Culture

    Patient safety issues persist despite the many strategies developed to prevent them. While many safety initiatives bring about improvement, they are often unsustainable and short-lived. The index hospital's goal was to build an organizational culture on a foundation that improves teamwork and sustains healthcare team engagement. Teamwork influences the efficiency of patient care, patient safety, and clinical outcomes, as it has been identified as an approach for enhancing collaboration, decreasing medical errors, and building a culture of safety in healthcare. The facility implemented Team Strategies and Tools to Enhance Performance and Patient Safety (TeamSTEPPS), an evidence-based framework used for team training to produce valuable and needed changes, facilitate modification of organizational culture, increase patient safety compliance, and solve particular issues. This study aimed to identify the correlation between TeamSTEPPS enactment and improved organizational culture in the ambulatory care nursing department of a New York City public hospital.

    Eunomia: Enabling User-specified Fine-Grained Search in Symbolically Executing WebAssembly Binaries

    Although existing techniques have proposed automated approaches to alleviate the path explosion problem of symbolic execution, users still need to optimize symbolic execution by carefully applying various searching strategies. As existing approaches mainly support only coarse-grained global searching strategies, they cannot efficiently traverse complex code structures. In this paper, we propose Eunomia, a symbolic execution technique that allows users to specify local domain knowledge to enable fine-grained search. In Eunomia, we design an expressive DSL, Aes, that lets users precisely pinpoint local searching strategies for different parts of the target program. To further optimize local searching strategies, we design an interval-based algorithm that automatically isolates the context of variables for different local searching strategies, avoiding conflicts between local searching strategies for the same variable. We implement Eunomia as a symbolic execution platform targeting WebAssembly, which enables us to analyze applications written in various languages (such as C and Go) that can be compiled to WebAssembly. To the best of our knowledge, Eunomia is the first symbolic execution engine that supports the full features of the WebAssembly runtime. We evaluate Eunomia with a dedicated microbenchmark suite for symbolic execution and six real-world applications. Our evaluation shows that Eunomia accelerates bug detection in real-world applications by up to three orders of magnitude. According to the results of a comprehensive user study, users can significantly improve the efficiency and effectiveness of symbolic execution by writing a simple and intuitive Aes script. Besides verifying six known real-world bugs, Eunomia also detected two new zero-day bugs in a popular open-source project, Collections-C. Comment: Accepted by the ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA) 202
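
    The core idea of fine-grained search is that different regions of the target program get their own state-prioritization rule instead of one global strategy. The Python sketch below is a toy illustration of that idea in a generic priority-queue state search; the region table and scoring functions are hypothetical and bear no relation to the actual Aes DSL or Eunomia's engine.

```python
# Toy priority-driven state exploration with per-region ("local") strategies.
# A state here is just a dict; `step` expands a state into successor states.
import heapq
import itertools

# Hypothetical region table: program-counter ranges mapped to local scoring rules.
LOCAL_STRATEGIES = {
    (100, 200): lambda s: -s["loop_depth"],    # inside this region, favor deep loops
}

def priority(state):
    for (lo, hi), score in LOCAL_STRATEGIES.items():
        if lo <= state["pc"] < hi:
            return score(state)                # local strategy for this region
    return state["path_len"]                   # global default: shortest path first

def explore(initial_states, step, max_steps=1000):
    tie = itertools.count()                    # tie-breaker so the heap never compares dicts
    frontier = [(priority(s), next(tie), s) for s in initial_states]
    heapq.heapify(frontier)
    for _ in range(max_steps):
        if not frontier:
            break
        _, _, state = heapq.heappop(frontier)
        for succ in step(state):               # one symbolic-execution step
            heapq.heappush(frontier, (priority(succ), next(tie), succ))
```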

    Formal description of ML models for unambiguous implementation

    Implementing deep neural networks in safety-critical systems, in particular in the aeronautical domain, will require adequate specification paradigms to preserve the semantics of the trained model on the final hardware platform. We propose to extend the NNEF language in order to allow traceable distribution and parallelisation optimizations of a trained model. We show how such a specification can be implemented in CUDA on a Xavier platform.
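
    One way to read "traceable distribution and parallelisation" is that the mapping of each layer onto the target hardware becomes an explicit, checkable part of the model description rather than an opaque compiler decision. The sketch below expresses that idea as a plain Python data structure; the field names and values are hypothetical and are not NNEF syntax or the authors' actual extension.

```python
# Hypothetical placement annotation for a trained model (not NNEF syntax):
# every layer is explicitly mapped to a device and a parallelisation choice,
# so the deployed behavior can be traced back to the specification.
from dataclasses import dataclass

@dataclass(frozen=True)
class LayerPlacement:
    layer: str    # layer name in the trained model
    device: str   # execution unit on the target platform
    split: str    # how the layer is parallelised, if at all

deployment_spec = [
    LayerPlacement("conv1", device="gpu:0", split="none"),
    LayerPlacement("conv2", device="gpu:0", split="channels/2"),
    LayerPlacement("fc",    device="dla:0", split="none"),
]

# A simple consistency check: each layer is placed exactly once.
assert len({p.layer for p in deployment_spec}) == len(deployment_spec)
```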

    Scavenger: A Cloud Service for Optimizing Cost and Performance of ML Training

    While the pay-as-you-go nature of cloud virtual machines (VMs) makes it easy to spin up large clusters for training ML models, it can also lead to ballooning costs. The hundreds of virtual machine sizes offered by cloud platforms also make it extremely challenging to select the "right" cloud cluster configuration for training. Furthermore, the training time and cost of distributed model training are highly sensitive to the cluster configuration and present a large and complex tradeoff space. In this paper, we develop principled and practical techniques for optimizing the training time and cost of distributed ML model training on the cloud. Our key insight is that both parallel and statistical efficiency must be considered when selecting the optimum job configuration parameters, such as the number of workers and the batch size. By combining conventional parallel scaling concepts and new insights into SGD noise, our models accurately estimate the time and cost on different cluster configurations with < 5% error. Using the repetitive nature of training and our models, we can search for optimum cloud configurations in a black-box, online manner. Our approach reduces training times by 2x and costs by more than 50%. Compared to an oracle-based approach, our performance models are accurate to within 2%, such that the search imposes an overhead of just 10%.
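
    The search described above needs a model of both parallel efficiency (how step time scales with workers) and statistical efficiency (how the number of steps to converge scales with batch size). The sketch below combines toy versions of the two into a cost-minimizing grid search; all formulas and constants are illustrative placeholders, not Scavenger's fitted models or real cloud pricing.

```python
# Toy time/cost model for picking a (workers, batch_size) training configuration.
# The parallel-scaling and SGD-noise terms are illustrative, not Scavenger's models.
def steps_to_converge(batch_size, noise_scale=4096, base_steps=10_000):
    # Statistical efficiency: larger batches need fewer steps, with diminishing returns.
    return base_steps * (1 + noise_scale / batch_size) / (1 + noise_scale / 64)

def time_per_step(workers, batch_size, compute_per_sample=1e-3, comm_cost=5e-3):
    # Parallel efficiency: per-step compute shrinks with workers, communication grows.
    return (batch_size / workers) * compute_per_sample + comm_cost * workers ** 0.5

def time_and_cost(workers, batch_size, price_per_worker_hour=0.50):
    hours = steps_to_converge(batch_size) * time_per_step(workers, batch_size) / 3600
    return hours, hours * workers * price_per_worker_hour

configs = [(w, b) for w in (1, 2, 4, 8, 16, 32) for b in (64, 128, 256, 512, 1024)]
best = min(configs, key=lambda cfg: time_and_cost(*cfg)[1])  # minimize dollar cost
print("cheapest (workers, batch_size):", best, "->", time_and_cost(*best))
```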

    A Comprehensive Study on Off-path SmartNIC

    SmartNICs have recently emerged as attractive devices for accelerating distributed systems. However, there has been no comprehensive characterization of SmartNICs, especially of their network path. This paper presents the first comprehensive study of off-path SmartNICs. Our experimental study uncovers the key performance characteristics of the communication among the client, the SmartNIC SoC, and the host. We find that, without taking the SmartNIC hardware architecture into account, communication with it can suffer up to 48% bandwidth degradation due to performance anomalies. We also propose implications for addressing these anomalies. Comment: This is the short version. The full version will appear at OSDI2
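
    The characterization above rests on bandwidth measurements between pairs of endpoints (client, SmartNIC SoC, host). As a generic illustration of that kind of measurement only, the sketch below times bulk transfers over a loopback TCP connection and reports effective bandwidth; it is not the paper's RDMA/SmartNIC harness, and the message size and repetition count are arbitrary.

```python
# Generic point-to-point bandwidth microbenchmark over loopback TCP (illustrative only).
import socket
import threading
import time

SIZE, REPS = 1 << 20, 256            # 1 MiB messages, 256 repetitions

def sink(server_sock):
    conn, _ = server_sock.accept()
    buf = memoryview(bytearray(SIZE))
    for _ in range(REPS):
        got = 0
        while got < SIZE:            # drain one full message
            got += conn.recv_into(buf[got:])
    conn.close()

srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)
threading.Thread(target=sink, args=(srv,), daemon=True).start()

cli = socket.socket()
cli.connect(srv.getsockname())
payload = b"\x00" * SIZE
start = time.perf_counter()
for _ in range(REPS):
    cli.sendall(payload)             # push REPS messages back to back
elapsed = time.perf_counter() - start
print(f"effective bandwidth: {SIZE * REPS / elapsed / 1e9:.2f} GB/s")
cli.close()
```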

    The LDBC social network benchmark: Business intelligence workload

    The Social Network Benchmark's Business Intelligence workload (SNB BI) is a comprehensive graph OLAP benchmark targeting analytical data systems capable of supporting graph workloads. This paper marks the finalization of almost a decade of research in academia and industry via the Linked Data Benchmark Council (LDBC). SNB BI advances the state of the art in synthetic and scalable analytical database benchmarks in many aspects. Its base is a sophisticated data generator, implemented on a scalable distributed infrastructure, that produces a social graph with small-world phenomena, whose value properties follow skewed and correlated distributions and where values correlate with structure. This is a temporal graph in which all nodes and edges follow lifespan-based rules with temporal skew, enabling realistic and consistent temporal inserts and (recursive) deletes. The query workload exploiting this skew and correlation is based on LDBC's “choke point”-driven design methodology and will drive technical and scientific improvements in future (graph) database systems. SNB BI includes the first adoption of “parameter curation” in an analytical benchmark, a technique that ensures stable runtimes of query variants across different parameter values. Two performance metrics characterize peak single-query performance (power) and sustained concurrent query throughput. To demonstrate the portability of the benchmark, we present experimental results on a relational and a graph DBMS. Note that these do not constitute an official LDBC Benchmark Result; only audited results may use this trademarked term.
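
    The two metrics mentioned above, power and throughput, can be thought of as a single-query-at-a-time score dominated by the geometric mean of per-query runtimes and a wall-clock score for a concurrent query batch. The sketch below computes toy versions of both; the exact formulas, scale-factor handling, and units are defined in the LDBC specification and differ from this illustration.

```python
# Toy "power" and "throughput" scores for a benchmark run (not the official
# LDBC SNB BI formulas, which are defined in the benchmark specification).
import math

def power_score(query_runtimes_sec, scale_factor=1.0):
    # Geometric mean of single-query runtimes; faster queries -> higher score.
    gmean = math.exp(sum(math.log(t) for t in query_runtimes_sec)
                     / len(query_runtimes_sec))
    return scale_factor * 3600.0 / gmean          # a queries-per-hour-style figure

def throughput_score(queries_completed, wall_clock_sec, scale_factor=1.0):
    # Sustained rate of a concurrent query stream over wall-clock time.
    return scale_factor * queries_completed * 3600.0 / wall_clock_sec

print("power:", power_score([0.8, 2.5, 0.3, 12.0]))
print("throughput:", throughput_score(400, wall_clock_sec=1800.0))
```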