    pocl: A Performance-Portable OpenCL Implementation

    OpenCL is a standard for parallel programming of heterogeneous systems. The benefits of a common programming standard are clear: multiple vendors can provide support for application descriptions written according to the standard, thus reducing the program porting effort. While the standard brings the obvious benefits of platform portability, the performance portability aspects are largely left to the programmer. The situation is made worse by multiple proprietary vendor implementations with different characteristics and, thus, different required optimization strategies. In this paper, we propose an OpenCL implementation that is both portable and performance portable. At its core is a kernel compiler that can be used to exploit the data parallelism of OpenCL programs on multiple platforms with different parallel hardware styles. The kernel compiler is modularized to perform target-independent parallel region formation separately from the target-specific parallel mapping of the regions to enable support for various styles of fine-grained parallel resources such as subword SIMD extensions, SIMD datapaths and static multi-issue. Unlike previous similar techniques that work on the source level, the parallel region formation retains the information of the data parallelism using the LLVM IR and its metadata infrastructure. This data can be exploited by the later generic compiler passes for efficient parallelization. The proposed open source implementation of OpenCL is also platform portable, enabling OpenCL on a wide range of architectures, both those already commercialized and those still under research. The paper describes how the portability of the implementation is achieved. Our results show that most of the benchmarked applications, when compiled using pocl, were faster or close to as fast as the best proprietary OpenCL implementation for the platform at hand. Comment: This article was published in 2015; it is now openly accessible via arxi
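As a rough illustration of what "parallel region formation" can mean, the sketch below emulates one work-group of a made-up kernel that contains a single barrier. It is not pocl's code or API: the function name `run_work_group` and the kernel itself are hypothetical, and the assumption is that the code before and after the barrier each becomes its own loop over work-items, which a later target-specific pass could then map to SIMD lanes or other parallel resources.

```python
def run_work_group(local_size, data):
    """Emulate one work-group of a hypothetical kernel with one barrier.

    Assumed kernel body (illustrative only):
        local_mem[lid] = 2 * data[lid]
        barrier()
        out[lid] = local_mem[local_size - 1 - lid]
    """
    local_mem = [0] * local_size
    out = [0] * local_size

    # Parallel region 1: all work-items run the code before the barrier.
    # Iterations are independent, so the loop could be vectorized.
    for lid in range(local_size):
        local_mem[lid] = 2 * data[lid]

    # The barrier is satisfied implicitly: region 1 has finished for
    # every work-item before region 2 starts.

    # Parallel region 2: all work-items run the code after the barrier.
    for lid in range(local_size):
        out[lid] = local_mem[local_size - 1 - lid]

    return out
```

Calling `run_work_group(4, [1, 2, 3, 4])` doubles each element and reverses the order through the emulated local memory. The point of the split is that each loop has no cross-iteration dependence, which is the kind of information the abstract says is preserved for later compiler passes.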

    A Special Purpose Architecture for Finite Element Analysis

    The analysis of aerospace structures by the finite element method consumes considerable computer time. The cost of this resource and the designer's desire to have rapid feedback concerning such questions as the effect of a change in loading of the structure or in a parameter of some structural material led to the design of a special purpose parallel computing system for finite element analysis. As a special purpose computer, the architecture of this finite element computer is closely tied to computational aspects of the particular problem. Various aspects of an MIMD array of microprocessors are related to the requirements of the class of finite element analysis problems which it is intended to solve.

    Beyond the Fokker-Planck equation: Pathwise control of noisy bistable systems

    We introduce a new method that describes slowly time-dependent Langevin equations through the behaviour of individual paths. This approach yields considerably more information than the computation of the probability density. The main idea is to show that for sufficiently small noise intensity and slow time dependence, the vast majority of paths remain in small space-time sets, typically in the neighbourhood of potential wells. The size of these sets often has a power-law dependence on the small parameters, with universal exponents. The overall probability of exceptional paths is exponentially small, with an exponent also showing power-law behaviour. The results cover time spans up to the maximal Kramers time of the system. We apply our method to three phenomena characteristic of bistable systems: stochastic resonance, dynamical hysteresis and bifurcation delay, where it yields precise bounds on transition probabilities, and the distribution of hysteresis areas and first-exit times. We also discuss the effect of coloured noise. Comment: 37 pages, 11 figures
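To make "individual paths of a slowly time-dependent Langevin equation" concrete, here is a minimal sketch that simulates one sample path with the standard Euler-Maruyama scheme. The specific equation is an assumption chosen for illustration, not taken from the paper: a double-well drift `x - x**3` with a slow periodic forcing `amplitude * cos(eps * t)` and white noise of intensity `sigma`. The function name and all parameters are hypothetical.

```python
import math
import random


def simulate_path(x0, sigma, amplitude, eps, dt, n_steps, seed=0):
    """Euler-Maruyama path of dx = (x - x^3 + A cos(eps t)) dt + sigma dW.

    The drift has wells near x = -1 and x = +1 when the forcing is small;
    eps controls how slowly the forcing varies in time.
    """
    rng = random.Random(seed)  # fixed seed so a run is reproducible
    x = x0
    path = [x]
    for k in range(n_steps):
        t = k * dt
        drift = x - x**3 + amplitude * math.cos(eps * t)
        # Brownian increment over dt has standard deviation sqrt(dt)
        x += drift * dt + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        path.append(x)
    return path
```

With `sigma=0` and `amplitude=0` the path relaxes to the well at `x = 1` from any positive start, which is a quick sanity check. With small positive `sigma`, one would expect most paths to stay near a well for a long time, in line with the concentration behaviour the abstract describes.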

    Sequential escapes: onset of slow domino regime via a saddle connection

    We explore sequential escape behaviour of coupled bistable systems under the influence of stochastic perturbations. We consider transient escapes from a marginally stable "quiescent" equilibrium to a more stable "active" equilibrium. The presence of coupling introduces dependence between the escape processes: for diffusive coupling there is a strongly coupled limit (fast domino regime) where the escapes are strongly synchronised, while for intermediate coupling (slow domino regime) without partially escaped stable states, there is still a delayed effect. These regimes can be associated with bifurcations of equilibria in the low-noise limit. In this paper we consider a localized form of non-diffusive (i.e. pulse-like) coupling and find similar changes in the distribution of escape times with coupling strength. However, we find a transition to a slow domino regime that is not associated with any bifurcations of equilibria. We show that this transition can be understood as a codimension-one saddle connection bifurcation for the low-noise limit. At transition, the most likely escape path from one attractor hits the escape saddle from the basin of another partially escaped attractor. After this bifurcation we find an increasing coefficient of variation of the subsequent escape times.

    A Multi-GPU Programming Library for Real-Time Applications

    We present MGPU, a C++ programming library targeted at single-node multi-GPU systems. Such systems combine disproportionate floating point performance with high data locality and are thus well suited to implement real-time algorithms. We describe the library design, programming interface and implementation details in light of this specific problem domain. The core concepts of this work are a novel kind of container abstraction and MPI-like communication methods for intra-system communication. We further demonstrate how MGPU is used as a framework for porting existing GPU libraries to multi-device architectures. Putting our library to the test, we accelerate an iterative non-linear image reconstruction algorithm for real-time magnetic resonance imaging using multiple GPUs. We achieve a speed-up of about 1.7 using 2 GPUs and reach a final speed-up of 2.1 with 4 GPUs. These promising results lead us to conclude that multi-GPU systems are a viable solution for real-time MRI reconstruction as well as signal-processing applications in general. Comment: 15 pages, 10 figures
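The reported speed-ups can be put in context with the usual definition of parallel efficiency, speed-up divided by device count. The short sketch below applies that definition to the two figures quoted in the abstract. The helper name `parallel_efficiency` is hypothetical, and the calculation is a reader's aid rather than an analysis from the paper.

```python
def parallel_efficiency(speedup, n_devices):
    """Return speed-up per device; 1.0 would mean ideal linear scaling."""
    if n_devices < 1:
        raise ValueError("n_devices must be at least 1")
    return speedup / n_devices


# Speed-ups quoted in the abstract, keyed by GPU count.
reported = {2: 1.7, 4: 2.1}
efficiencies = {n: parallel_efficiency(s, n) for n, s in reported.items()}
```

This gives an efficiency of about 0.85 on 2 GPUs and about 0.53 on 4 GPUs, so the gain per added device falls noticeably between the two configurations.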