806 research outputs found

    Towards high-level execution primitives for and-parallelism: preliminary results

    Full text link
    Most implementations of parallel logic programming rely on complex low-level machinery which is arguably difflcult to implement and modify. We explore an alternative approach aimed at taming that complexity by raising core parts of the implementation to the source language level for the particular case of and-parallelism. Therefore, we handle a signiflcant portion of the parallel implementation mechanism at the Prolog level with the help of a comparatively small number of concurrency-related primitives which take care of lower-level tasks such as locking, thread management, stack set management, etc. The approach does not eliminate altogether modiflcations to the abstract machine, but it does greatly simplify them and it also facilitates experimenting with different alternatives. We show how this approach allows implementing both restricted and unrestricted (i.e., non fork-join) parallelism. Preliminary experiments show that the amount of performance sacriflced is reasonable, although granularity control is required in some cases. Also, we observe that the availability of unrestricted parallelism contributes to better observed speedups

    A new parallelisation technique for heterogeneous CPUs

    Get PDF
    Parallelization has moved in recent years into the mainstream compilers, and the demand for parallelizing tools that can do a better job of automatic parallelization is higher than ever. During the last decade considerable attention has been focused on developing programming tools that support both explicit and implicit parallelism to keep up with the power of the new multiple core technology. Yet the success to develop automatic parallelising compilers has been limited mainly due to the complexity of the analytic process required to exploit available parallelism and manage other parallelisation measures such as data partitioning, alignment and synchronization. This dissertation investigates developing a programming tool that automatically parallelises large data structures on a heterogeneous architecture and whether a high-level programming language compiler can use this tool to exploit implicit parallelism and make use of the performance potential of the modern multicore technology. The work involved the development of a fully automatic parallelisation tool, called VSM, that completely hides the underlying details of general purpose heterogeneous architectures. The VSM implementation provides direct and simple access for users to parallelise array operations on the Cell’s accelerators without the need for any annotations or process directives. This work also involved the extension of the Glasgow Vector Pascal compiler to work with the VSM implementation as a one compiler system. The developed compiler system, which is called VP-Cell, takes a single source code and parallelises array expressions automatically. Several experiments were conducted using Vector Pascal benchmarks to show the validity of the VSM approach. The VP-Cell system achieved significant runtime performance on one accelerator as compared to the master processor’s performance and near-linear speedups over code runs on the Cell’s accelerators. Though VSM was mainly designed for developing parallelising compilers it also showed a considerable performance by running C code over the Cell’s accelerators

    A compiler extension for parallelizing arrays automatically on the cell heterogeneous processor

    Get PDF
    This paper describes the approaches taken to extend an array programming language compiler using a Virtual SIMD Machine (VSM) model for parallelizing array operations on Cell Broadband Engine heterogeneous machine. This development is part of ongoing work at the University of Glasgow for developing array compilers that are beneficial for applications in many areas such as graphics, multimedia, image processing and scientific computation. Our extended compiler, which is built upon the VSM interface, eases the parallelization processes by allowing automatic parallelisation without the need for any annotations or process directives. The preliminary results demonstrate significant improvement especially on data-intensive applications

    Some Notes on the Past and Future of Lisp-Stat

    Get PDF
    Lisp-Stat was originally developed as a framework for experimenting with dynamic graphics in statistics. To support this use, it evolved into a platform for more general statistical computing. The choice of the Lisp language as the basis of the system was in part coincidence and in part a very deliberate decision. This paper describes the background behind the choice of Lisp, as well as the advantages and disadvantages of this choice. The paper then discusses some lessons that can be drawn from experience with Lisp-Stat and with the R language to guide future development of Lisp-Stat, R, and similar systems.

    LINVIEW: Incremental View Maintenance for Complex Analytical Queries

    Full text link
    Many analytics tasks and machine learning problems can be naturally expressed by iterative linear algebra programs. In this paper, we study the incremental view maintenance problem for such complex analytical queries. We develop a framework, called LINVIEW, for capturing deltas of linear algebra programs and understanding their computational cost. Linear algebra operations tend to cause an avalanche effect where even very local changes to the input matrices spread out and infect all of the intermediate results and the final view, causing incremental view maintenance to lose its performance benefit over re-evaluation. We develop techniques based on matrix factorizations to contain such epidemics of change. As a consequence, our techniques make incremental view maintenance of linear algebra practical and usually substantially cheaper than re-evaluation. We show, both analytically and experimentally, the usefulness of these techniques when applied to standard analytics tasks. Our evaluation demonstrates the efficiency of LINVIEW in generating parallel incremental programs that outperform re-evaluation techniques by more than an order of magnitude.Comment: 14 pages, SIGMO

    Direct Inter-Process Communication (dIPC): Repurposing the CODOMs architecture to accelerate IPC

    Get PDF
    In current architectures, page tables are the fundamental mechanism that allows contemporary OSs to isolate user processes, binding each thread to a specific page table. A thread cannot therefore directly call another process's function or access its data; instead, the OS kernel provides data communication primitives and mediates process synchronization through inter-process communication (IPC) channels, which impede system performance. Alternatively, the recently proposed CODOMs architecture provides memory protection across software modules. Threads can cross module protection boundaries inside the same process using simple procedure calls, while preserving memory isolation. We present dIPC (for "direct IPC"), an OS extension that repurposes and extends the CODOMs architecture to allow threads to cross process boundaries. It maps processes into a shared address space, and eliminates the OS kernel from the critical path of inter-process communication. dIPC is 64.12× faster than local remote procedure calls (RPCs), and 8.87× faster than IPC in the L4 microkernel. We show that applying dIPC to a multi-tier OLTP web server improves performance by up to 5.12× (2.13× on average), and reaches over 94% of the ideal system efficiency.We thank Diego Marr´on for helping with MariaDB, the anonymous reviewers for their feedback and, especially, Andrew Baumann for helping us improve the paper. This research was partially funded by HiPEAC through a collaboration grant for Lluís Vilanova (agreement number 687698 for the EU’s Horizon2020 research and innovation programme), the Israel Science Fundation (ISF grant 769/12) and the Israeli Ministry of Science, Technology and Space.Peer ReviewedPostprint (author's final draft

    Finite difference methods fengshui: alignment through a mathematics of arrays

    Get PDF
    Numerous scientific-computational domains make use of array data. The core computing of the numerical methods and the algorithms involved is related to multi-dimensional array manipulation. Memory layout and the access patterns of that data are crucial to the optimal performance of the array-based computations. As we move towards exascale computing, writing portable code for efficient data parallel computations is increasingly requiring an abstract productive working environment. To that end, we present the design of a framework for optimizing scientific array-based computations, building a case study for a Partial Differential Equations solver. By embedding the Mathematics of Arrays formalism in the Magnolia programming language, we assemble a software stack capable of abstracting the continuous high-level application layer from the discrete formulation of the collective array-based numerical methods and algorithms and the final detailed low-level code. The case study lays the groundwork for achieving optimized memory layout and efficient computations while preserving a stable abstraction layer independent of underlying algorithms and changes in the architecture.Peer ReviewedPostprint (author's final draft

    Automatic parallelization of array-oriented programs for a multi-core machine

    Get PDF
    Abstract We present the work on automatic parallelization of array-oriented programs for multi-core machines. Source programs written in standard APL are translated by a parallelizing APL-to-C compiler into parallelized C code, i.e. C mixed with OpenMP directives. We describe techniques such as virtual operations and datapartitioning used to effectively exploit parallelism structured around array-primitives. We present runtime performance data, showing the speedup of the resulting parallelized code, using different numbers of threads and different problem sizes, on a 4-core machine, for several examples

    A Fast Causal Profiler for Task Parallel Programs

    Full text link
    This paper proposes TASKPROF, a profiler that identifies parallelism bottlenecks in task parallel programs. It leverages the structure of a task parallel execution to perform fine-grained attribution of work to various parts of the program. TASKPROF's use of hardware performance counters to perform fine-grained measurements minimizes perturbation. TASKPROF's profile execution runs in parallel using multi-cores. TASKPROF's causal profile enables users to estimate improvements in parallelism when a region of code is optimized even when concrete optimizations are not yet known. We have used TASKPROF to isolate parallelism bottlenecks in twenty three applications that use the Intel Threading Building Blocks library. We have designed parallelization techniques in five applications to in- crease parallelism by an order of magnitude using TASKPROF. Our user study indicates that developers are able to isolate performance bottlenecks with ease using TASKPROF.Comment: 11 page
    corecore