
    Breadth-First Pipeline Parallelism

    We introduce Breadth-First Pipeline Parallelism, a novel training schedule that optimizes the combination of pipeline and data parallelism. Breadth-First Pipeline Parallelism lowers training time, cost, and memory usage by combining high GPU utilization with a small batch size per GPU, and by making use of fully sharded data parallelism. Experimentally, we observed an increase of up to 43% in training throughput for a 52 billion-parameter model using a small batch size per GPU compared to Megatron-LM, which would reduce training time and cost by the same amount on a large GPU cluster.
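    A minimal sketch of the scheduling idea, not the paper's exact algorithm: the contrast below is only between a depth-first ordering, where a device finishes one micro-batch across all of its pipeline-stage chunks before starting the next, and a breadth-first ordering, where every micro-batch passes through one chunk before the device advances to the next chunk, keeping more micro-batches in flight so communication can overlap with compute. Function names and toy sizes are illustrative.

    def depth_first_order(num_microbatches, num_chunks):
        # One device holding several pipeline-stage chunks: finish each
        # micro-batch across all chunks before starting the next micro-batch.
        return [(mb, chunk)
                for mb in range(num_microbatches)
                for chunk in range(num_chunks)]

    def breadth_first_order(num_microbatches, num_chunks):
        # Breadth-first: run every micro-batch through one chunk before
        # moving on to the next chunk.
        return [(mb, chunk)
                for chunk in range(num_chunks)
                for mb in range(num_microbatches)]

    if __name__ == "__main__":
        print(depth_first_order(3, 2))    # (0,0) (0,1) (1,0) (1,1) (2,0) (2,1)
        print(breadth_first_order(3, 2))  # (0,0) (1,0) (2,0) (0,1) (1,1) (2,1)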

    Dynamic Pipeline: an adaptive solution for big data

    The Dynamic Pipeline is a concurrent programming pattern amenable to parallelization. Furthermore, the number of processing units used in the parallelization is adjusted to the size of the problem, and each processing unit uses a reduced memory footprint. Contrary to other approaches, the Dynamic Pipeline can be seen as a generalization of the (parallel) divide-and-conquer schema, where systems can be reconfigured depending on the particular instance of the problem to be solved. We claim that the Dynamic Pipeline is useful for dealing with Big Data related problems. In particular, we have designed and implemented algorithms for computing graph parameters such as the number of triangles, connected components, and maximal cliques, among others. Currently, we are focused on designing and implementing an efficient algorithm to evaluate conjunctive queries.
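    A minimal, sequential sketch of the pattern described above (the names are hypothetical, and a real Dynamic Pipeline runs its stages concurrently): processing units are created on demand, so the length of the pipeline adapts to the instance being solved. Here each stage owns one distinct key and absorbs matching items; an item that no stage accepts causes a new stage to be appended at the end of the pipeline.

    class Stage:
        """One processing unit of the pipeline; it owns a single key."""
        def __init__(self, key):
            self.key = key
            self.items = []

        def accept(self, key, value):
            if key == self.key:
                self.items.append(value)
                return True
            return False

    def dynamic_pipeline(stream):
        stages = []                        # the pipeline grows with the input
        for key, value in stream:
            for stage in stages:
                if stage.accept(key, value):
                    break
            else:                          # no stage matched: extend the pipeline
                new_stage = Stage(key)
                new_stage.items.append(value)
                stages.append(new_stage)
        return {s.key: s.items for s in stages}

    if __name__ == "__main__":
        items = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
        print(dynamic_pipeline(items))     # {'a': [1, 3], 'b': [2], 'c': [4]}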

    A Fast and Accurate Cost Model for FPGA Design Space Exploration in HPC Applications

    Heterogeneous High-Performance Computing (HPC) platforms present a significant programming challenge, especially because the key users of HPC resources are scientists, not parallel programmers. We contend that compiler technology has to evolve to automatically create the best program variant by transforming a given original program. We have developed a novel methodology based on type transformations for generating correct-by-construction design variants, and an associated lightweight cost model for evaluating these variants for implementation on FPGAs. In this paper we present a key enabler of our approach, the cost model. We discuss how we are able to quickly derive accurate estimates of performance and resource utilization from the design’s representation in our intermediate language. We show results confirming the accuracy of our cost model by testing it on three different scientific kernels. We conclude with a case study that compares a solution generated by our framework with one from a conventional high-level synthesis tool, showing better performance and power efficiency using our cost-model-based approach.
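    A hypothetical sketch in the spirit of this kind of lightweight cost model (the paper's intermediate language and actual per-operation figures are not reproduced here): performance and resource estimates are derived statically from operation counts, using a small table of assumed per-operation latencies and FPGA resource costs together with the standard pipelined-loop cycle formula depth + II * (iterations - 1).

    from dataclasses import dataclass

    @dataclass
    class OpCost:
        latency_cycles: int   # pipeline depth contributed by one instance
        dsp: int              # DSP blocks consumed
        lut: int              # look-up tables consumed

    # Illustrative per-operation figures; a real model calibrates these
    # against the target device and synthesis tool.
    COSTS = {
        "load": OpCost(latency_cycles=2, dsp=0, lut=50),
        "fmul": OpCost(latency_cycles=4, dsp=3, lut=300),
        "fadd": OpCost(latency_cycles=3, dsp=2, lut=400),
    }

    def estimate(op_counts, trip_count, ii=1):
        """Estimate cycles and resources for a fully pipelined loop.

        op_counts maps operation names to the number of instances in the
        loop body, trip_count is the iteration count, and ii is the
        initiation interval of the pipeline.
        """
        depth = sum(COSTS[op].latency_cycles * n for op, n in op_counts.items())
        cycles = depth + ii * (trip_count - 1)   # pipeline fill + steady state
        dsp = sum(COSTS[op].dsp * n for op, n in op_counts.items())
        lut = sum(COSTS[op].lut * n for op, n in op_counts.items())
        return {"cycles": cycles, "dsp": dsp, "lut": lut}

    if __name__ == "__main__":
        # e.g. a multiply-accumulate with two loads per iteration, 1024 iterations
        print(estimate({"load": 2, "fmul": 1, "fadd": 1}, trip_count=1024))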