93,756 research outputs found

    A compiler extension for parallelizing arrays automatically on the cell heterogeneous processor

    Get PDF
    This paper describes the approaches taken to extend an array programming language compiler using a Virtual SIMD Machine (VSM) model for parallelizing array operations on Cell Broadband Engine heterogeneous machine. This development is part of ongoing work at the University of Glasgow for developing array compilers that are beneficial for applications in many areas such as graphics, multimedia, image processing and scientific computation. Our extended compiler, which is built upon the VSM interface, eases the parallelization processes by allowing automatic parallelisation without the need for any annotations or process directives. The preliminary results demonstrate significant improvement especially on data-intensive applications

    Taking advantage of hybrid systems for sparse direct solvers via task-based runtimes

    Get PDF
    The ongoing hardware evolution exhibits an escalation in the number, as well as in the heterogeneity, of computing resources. The pressure to maintain reasonable levels of performance and portability forces application developers to leave the traditional programming paradigms and explore alternative solutions. PaStiX is a parallel sparse direct solver, based on a dynamic scheduler for modern hierarchical manycore architectures. In this paper, we study the benefits and limits of replacing the highly specialized internal scheduler of the PaStiX solver with two generic runtime systems: PaRSEC and StarPU. The tasks graph of the factorization step is made available to the two runtimes, providing them the opportunity to process and optimize its traversal in order to maximize the algorithm efficiency for the targeted hardware platform. A comparative study of the performance of the PaStiX solver on top of its native internal scheduler, PaRSEC, and StarPU frameworks, on different execution environments, is performed. The analysis highlights that these generic task-based runtimes achieve comparable results to the application-optimized embedded scheduler on homogeneous platforms. Furthermore, they are able to significantly speed up the solver on heterogeneous environments by taking advantage of the accelerators while hiding the complexity of their efficient manipulation from the programmer.Comment: Heterogeneity in Computing Workshop (2014

    GHOST: Building blocks for high performance sparse linear algebra on heterogeneous systems

    Get PDF
    While many of the architectural details of future exascale-class high performance computer systems are still a matter of intense research, there appears to be a general consensus that they will be strongly heterogeneous, featuring "standard" as well as "accelerated" resources. Today, such resources are available as multicore processors, graphics processing units (GPUs), and other accelerators such as the Intel Xeon Phi. Any software infrastructure that claims usefulness for such environments must be able to meet their inherent challenges: massive multi-level parallelism, topology, asynchronicity, and abstraction. The "General, Hybrid, and Optimized Sparse Toolkit" (GHOST) is a collection of building blocks that targets algorithms dealing with sparse matrix representations on current and future large-scale systems. It implements the "MPI+X" paradigm, has a pure C interface, and provides hybrid-parallel numerical kernels, intelligent resource management, and truly heterogeneous parallelism for multicore CPUs, Nvidia GPUs, and the Intel Xeon Phi. We describe the details of its design with respect to the challenges posed by modern heterogeneous supercomputers and recent algorithmic developments. Implementation details which are indispensable for achieving high efficiency are pointed out and their necessity is justified by performance measurements or predictions based on performance models. The library code and several applications are available as open source. We also provide instructions on how to make use of GHOST in existing software packages, together with a case study which demonstrates the applicability and performance of GHOST as a component within a larger software stack.Comment: 32 pages, 11 figure

    Dynamic Control Flow in Large-Scale Machine Learning

    Full text link
    Many recent machine learning models rely on fine-grained dynamic control flow for training and inference. In particular, models based on recurrent neural networks and on reinforcement learning depend on recurrence relations, data-dependent conditional execution, and other features that call for dynamic control flow. These applications benefit from the ability to make rapid control-flow decisions across a set of computing devices in a distributed system. For performance, scalability, and expressiveness, a machine learning system must support dynamic control flow in distributed and heterogeneous environments. This paper presents a programming model for distributed machine learning that supports dynamic control flow. We describe the design of the programming model, and its implementation in TensorFlow, a distributed machine learning system. Our approach extends the use of dataflow graphs to represent machine learning models, offering several distinctive features. First, the branches of conditionals and bodies of loops can be partitioned across many machines to run on a set of heterogeneous devices, including CPUs, GPUs, and custom ASICs. Second, programs written in our model support automatic differentiation and distributed gradient computations, which are necessary for training machine learning models that use control flow. Third, our choice of non-strict semantics enables multiple loop iterations to execute in parallel across machines, and to overlap compute and I/O operations. We have done our work in the context of TensorFlow, and it has been used extensively in research and production. We evaluate it using several real-world applications, and demonstrate its performance and scalability.Comment: Appeared in EuroSys 2018. 14 pages, 16 figure

    Speculative Segmented Sum for Sparse Matrix-Vector Multiplication on Heterogeneous Processors

    Full text link
    Sparse matrix-vector multiplication (SpMV) is a central building block for scientific software and graph applications. Recently, heterogeneous processors composed of different types of cores attracted much attention because of their flexible core configuration and high energy efficiency. In this paper, we propose a compressed sparse row (CSR) format based SpMV algorithm utilizing both types of cores in a CPU-GPU heterogeneous processor. We first speculatively execute segmented sum operations on the GPU part of a heterogeneous processor and generate a possibly incorrect results. Then the CPU part of the same chip is triggered to re-arrange the predicted partial sums for a correct resulting vector. On three heterogeneous processors from Intel, AMD and nVidia, using 20 sparse matrices as a benchmark suite, the experimental results show that our method obtains significant performance improvement over the best existing CSR-based SpMV algorithms. The source code of this work is downloadable at https://github.com/bhSPARSE/Benchmark_SpMV_using_CSRComment: 22 pages, 8 figures, Published at Parallel Computing (PARCO

    cphVB: A System for Automated Runtime Optimization and Parallelization of Vectorized Applications

    Full text link
    Modern processor architectures, in addition to having still more cores, also require still more consideration to memory-layout in order to run at full capacity. The usefulness of most languages is deprecating as their abstractions, structures or objects are hard to map onto modern processor architectures efficiently. The work in this paper introduces a new abstract machine framework, cphVB, that enables vector oriented high-level programming languages to map onto a broad range of architectures efficiently. The idea is to close the gap between high-level languages and hardware optimized low-level implementations. By translating high-level vector operations into an intermediate vector bytecode, cphVB enables specialized vector engines to efficiently execute the vector operations. The primary success parameters are to maintain a complete abstraction from low-level details and to provide efficient code execution across different, modern, processors. We evaluate the presented design through a setup that targets multi-core CPU architectures. We evaluate the performance of the implementation using Python implementations of well-known algorithms: a jacobi solver, a kNN search, a shallow water simulation and a synthetic stencil simulation. All demonstrate good performance
    corecore