807 research outputs found
Towards high-level execution primitives for and-parallelism: preliminary results
Most implementations of parallel logic programming rely on complex low-level machinery which is arguably difflcult to implement and modify. We explore an alternative approach aimed at taming that complexity by raising core parts of the implementation to the source language level for the particular case of and-parallelism. Therefore, we handle a signiflcant portion of the parallel implementation mechanism at the Prolog level with the help of a comparatively small number of concurrency-related primitives which take care of lower-level tasks such as locking, thread management, stack set management, etc. The approach does not eliminate altogether modiflcations to the abstract machine, but it does greatly simplify them and it also facilitates experimenting with different alternatives. We show how this approach allows implementing both restricted and unrestricted (i.e., non fork-join) parallelism. Preliminary experiments show that the amount of performance sacriflced is reasonable, although granularity control is required in some cases. Also, we observe that the availability of unrestricted parallelism contributes to better observed speedups
A new parallelisation technique for heterogeneous CPUs
Parallelization has moved in recent years into the mainstream compilers, and the demand
for parallelizing tools that can do a better job of automatic parallelization is higher than
ever. During the last decade considerable attention has been focused on developing programming
tools that support both explicit and implicit parallelism to keep up with the
power of the new multiple core technology. Yet the success to develop automatic parallelising
compilers has been limited mainly due to the complexity of the analytic process
required to exploit available parallelism and manage other parallelisation measures such
as data partitioning, alignment and synchronization.
This dissertation investigates developing a programming tool that automatically parallelises
large data structures on a heterogeneous architecture and whether a high-level programming
language compiler can use this tool to exploit implicit parallelism and make use
of the performance potential of the modern multicore technology. The work involved the
development of a fully automatic parallelisation tool, called VSM, that completely hides
the underlying details of general purpose heterogeneous architectures. The VSM implementation
provides direct and simple access for users to parallelise array operations on the
Cell’s accelerators without the need for any annotations or process directives. This work
also involved the extension of the Glasgow Vector Pascal compiler to work with the VSM
implementation as a one compiler system. The developed compiler system, which is called
VP-Cell, takes a single source code and parallelises array expressions automatically.
Several experiments were conducted using Vector Pascal benchmarks to show the validity
of the VSM approach. The VP-Cell system achieved significant runtime performance
on one accelerator as compared to the master processor’s performance and near-linear
speedups over code runs on the Cell’s accelerators. Though VSM was mainly designed for
developing parallelising compilers it also showed a considerable performance by running
C code over the Cell’s accelerators
A compiler extension for parallelizing arrays automatically on the cell heterogeneous processor
This paper describes the approaches taken to extend an array
programming language compiler using a Virtual SIMD Machine (VSM)
model for parallelizing array operations on Cell Broadband Engine heterogeneous
machine. This development is part of ongoing work at the
University of Glasgow for developing array compilers that are beneficial
for applications in many areas such as graphics, multimedia, image processing
and scientific computation. Our extended compiler, which is built
upon the VSM interface, eases the parallelization processes by allowing
automatic parallelisation without the need for any annotations or process
directives. The preliminary results demonstrate significant improvement
especially on data-intensive applications
Some Notes on the Past and Future of Lisp-Stat
Lisp-Stat was originally developed as a framework for experimenting with dynamic graphics in statistics. To support this use, it evolved into a platform for more general statistical computing. The choice of the Lisp language as the basis of the system was in part coincidence and in part a very deliberate decision. This paper describes the background behind the choice of Lisp, as well as the advantages and disadvantages of this choice. The paper then discusses some lessons that can be drawn from experience with Lisp-Stat and with the R language to guide future development of Lisp-Stat, R, and similar systems.
LINVIEW: Incremental View Maintenance for Complex Analytical Queries
Many analytics tasks and machine learning problems can be naturally expressed
by iterative linear algebra programs. In this paper, we study the incremental
view maintenance problem for such complex analytical queries. We develop a
framework, called LINVIEW, for capturing deltas of linear algebra programs and
understanding their computational cost. Linear algebra operations tend to cause
an avalanche effect where even very local changes to the input matrices spread
out and infect all of the intermediate results and the final view, causing
incremental view maintenance to lose its performance benefit over
re-evaluation. We develop techniques based on matrix factorizations to contain
such epidemics of change. As a consequence, our techniques make incremental
view maintenance of linear algebra practical and usually substantially cheaper
than re-evaluation. We show, both analytically and experimentally, the
usefulness of these techniques when applied to standard analytics tasks. Our
evaluation demonstrates the efficiency of LINVIEW in generating parallel
incremental programs that outperform re-evaluation techniques by more than an
order of magnitude.Comment: 14 pages, SIGMO
Direct Inter-Process Communication (dIPC): Repurposing the CODOMs architecture to accelerate IPC
In current architectures, page tables are the fundamental mechanism that allows contemporary OSs to isolate user processes, binding each thread to a specific page table. A thread cannot therefore directly call another process's function or access its data; instead, the OS kernel provides data communication primitives and mediates process synchronization through inter-process communication (IPC) channels, which impede system performance. Alternatively, the recently proposed CODOMs architecture provides memory protection across software modules. Threads can cross module protection boundaries inside the same process using simple procedure calls, while preserving memory isolation. We present dIPC (for "direct IPC"), an OS extension that repurposes and extends the CODOMs architecture to allow threads to cross process boundaries. It maps processes into a shared address space, and eliminates the OS kernel from the critical path of inter-process communication. dIPC is 64.12× faster than local remote procedure calls (RPCs), and 8.87× faster than IPC in the L4 microkernel. We show that applying dIPC to a multi-tier OLTP web server improves performance by up to 5.12× (2.13× on average), and reaches over 94% of the ideal system efficiency.We thank Diego Marr´on for helping with MariaDB, the anonymous reviewers for their feedback and, especially, Andrew Baumann for helping us improve the paper. This research was partially funded by HiPEAC through a collaboration grant for Lluís Vilanova (agreement number 687698 for
the EU’s Horizon2020 research and innovation programme), the Israel Science Fundation (ISF grant 769/12) and the Israeli Ministry of Science, Technology and Space.Peer ReviewedPostprint (author's final draft
Finite difference methods fengshui: alignment through a mathematics of arrays
Numerous scientific-computational domains make use of array data. The core computing of the numerical methods and the algorithms involved is related to multi-dimensional array manipulation. Memory layout and the access patterns of that data are crucial to the optimal performance of the array-based computations. As we move towards exascale computing, writing portable code for efficient data parallel computations is increasingly requiring an abstract productive working environment. To that end, we present the design of a framework for optimizing scientific array-based computations, building a case study for a Partial Differential Equations solver. By embedding the Mathematics of Arrays formalism in the Magnolia programming language, we assemble a software stack capable of abstracting the continuous high-level application layer from the discrete formulation of the collective array-based numerical methods and algorithms and the final detailed low-level code. The case study lays the groundwork for achieving optimized memory layout and efficient computations while preserving a stable abstraction layer independent of underlying algorithms and changes in the architecture.Peer ReviewedPostprint (author's final draft
Automatic parallelization of array-oriented programs for a multi-core machine
Abstract We present the work on automatic parallelization of array-oriented programs for multi-core machines. Source programs written in standard APL are translated by a parallelizing APL-to-C compiler into parallelized C code, i.e. C mixed with OpenMP directives. We describe techniques such as virtual operations and datapartitioning used to effectively exploit parallelism structured around array-primitives. We present runtime performance data, showing the speedup of the resulting parallelized code, using different numbers of threads and different problem sizes, on a 4-core machine, for several examples
A Fast Causal Profiler for Task Parallel Programs
This paper proposes TASKPROF, a profiler that identifies parallelism
bottlenecks in task parallel programs. It leverages the structure of a task
parallel execution to perform fine-grained attribution of work to various parts
of the program. TASKPROF's use of hardware performance counters to perform
fine-grained measurements minimizes perturbation. TASKPROF's profile execution
runs in parallel using multi-cores. TASKPROF's causal profile enables users to
estimate improvements in parallelism when a region of code is optimized even
when concrete optimizations are not yet known. We have used TASKPROF to isolate
parallelism bottlenecks in twenty three applications that use the Intel
Threading Building Blocks library. We have designed parallelization techniques
in five applications to in- crease parallelism by an order of magnitude using
TASKPROF. Our user study indicates that developers are able to isolate
performance bottlenecks with ease using TASKPROF.Comment: 11 page
- …