Percolation scheduling for non-VLIW machines
Percolation Scheduling, a technique for compile-time code parallelization, has proven very successful for exploiting fine-grain irregular parallelism in ordinary programs. Currently, this technology is targeted only at VLIW (Very Long Instruction Word) machines, which have the advantages of 'free' synchronization and communication. Shared memory multi-processors can simulate the execution characteristics of VLIW machines with the use of static barriers. Preliminary results show that Percolation Scheduling can be used with good results on this type of architecture by increasing the granularity from operation level to source statement level, removing any redundant synchronization, and providing an efficient implementation of multi-way jumps.
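The idea of simulating VLIW-style lockstep execution with static barriers can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `run_lockstep` helper, the row-and-slot schedule layout, and the four toy statements are hypothetical and are not taken from the paper. Each row plays the role of one wide instruction whose slots are mutually independent, and a barrier after each row keeps any thread from starting the next row early.

```python
import threading

def run_lockstep(schedule, env):
    """Run each column of `schedule` on its own thread, one row at a time."""
    n_workers = len(schedule[0])
    barrier = threading.Barrier(n_workers)

    def worker(slot):
        for row in schedule:
            op = row[slot]
            if op is not None:   # None marks an empty slot (a no-op)
                op(env)
            barrier.wait()       # no thread starts the next row early

    threads = [threading.Thread(target=worker, args=(s,)) for s in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return env

# Hypothetical source statements; statements in the same row are independent.
def set_x(env): env["x"] = 1 + 2
def set_y(env): env["y"] = 3 * 4
def set_z(env): env["z"] = env["x"] + env["y"]   # needs row 0 to be finished
def set_w(env): env["w"] = env["y"] - env["x"]   # needs row 0 to be finished

schedule = [
    [set_x, set_y],   # row 0: one "wide instruction"
    [set_z, set_w],   # row 1: reads values written in row 0
]
result = run_lockstep(schedule, {})
```

In this sketch the barrier is the only synchronization: without it, `set_w` could read `x` before the other thread has written it.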
pocl: A Performance-Portable OpenCL Implementation
OpenCL is a standard for parallel programming of heterogeneous systems. The
benefits of a common programming standard are clear; multiple vendors can
provide support for application descriptions written according to the standard,
thus reducing the program porting effort. While the standard brings the obvious
benefits of platform portability, the performance portability aspects are
largely left to the programmer. The situation is made worse by multiple
proprietary vendor implementations with different characteristics and, thus,
different required optimization strategies.
In this paper, we propose an OpenCL implementation that is both portable and
performance portable. At its core is a kernel compiler that can be used to
exploit the data parallelism of OpenCL programs on multiple platforms with
different parallel hardware styles. The kernel compiler is modularized to
perform target-independent parallel region formation separately from the
target-specific parallel mapping of the regions to enable support for various
styles of fine-grained parallel resources such as subword SIMD extensions, SIMD
datapaths and static multi-issue. Unlike previous similar techniques that work
on the source level, the parallel region formation retains the information of
the data parallelism using the LLVM IR and its metadata infrastructure. This
data can be exploited by the later generic compiler passes for efficient
parallelization.
The proposed open source implementation of OpenCL is also platform portable,
enabling OpenCL on a wide range of architectures, both already commercialized
and on those that are still under research. The paper describes how the
portability of the implementation is achieved. Our results show that most of
the benchmarked applications when compiled using pocl were faster or close to
as fast as the best proprietary OpenCL implementation for the platform at hand.
Comment: This article was published in 2015; it is now openly accessible via
arxi
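One way to picture parallel region formation is the following rough sketch. It assumes that a region is the code between two work-group barriers and that each region is wrapped in a loop over the work-items of a work-group. The abstract does not spell this out, so treat it as an illustrative assumption. The code is plain Python, does not use pocl or LLVM, and the names `run_work_group`, `load_region` and `store_region` are hypothetical.

```python
def run_work_group(regions, local_size, local_mem, global_mem):
    """Execute a kernel that has been split into barrier-free regions."""
    for region in regions:               # a barrier sits between two regions
        for lid in range(local_size):    # loop over the work-items
            region(lid, local_mem, global_mem)

# A toy kernel that reverses a buffer through local memory.
def load_region(lid, local_mem, global_mem):
    # Code before the barrier: every work-item copies one element.
    local_mem[lid] = global_mem["src"][lid]

def store_region(lid, local_mem, global_mem):
    # Code after the barrier: read an element written by another work-item.
    n = len(local_mem)
    global_mem["dst"][lid] = local_mem[n - 1 - lid]

src = [10, 20, 30, 40]
global_mem = {"src": src, "dst": [0] * len(src)}
run_work_group([load_region, store_region], len(src), [0] * len(src), global_mem)
```

Because the iterations of each inner loop are independent, a target-specific stage could in principle map them to SIMD lanes or multiple issue slots instead of running them one after another.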
A Special Purpose Architecture for Finite Element Analysis
The analysis of aerospace structures by the finite element method consumes considerable computer time. The cost of this resource and the designer's desire to have rapid feedback concerning such questions as the effect of a change in loading of the structure or in a parameter of some structural material led to the design of a special purpose parallel computing system for finite element analysis. As a special purpose computer, the architecture of this finite element computer is closely tied to computational aspects of the particular problem. Various aspects of an MIMD array of microprocessors are related to the requirements of the class of finite element analysis problems which it is intended to solve.
Beyond the Fokker-Planck equation: Pathwise control of noisy bistable systems
We introduce a new method that describes slowly time-dependent
Langevin equations through the behaviour of individual paths. This approach
yields considerably more information than the computation of the probability
density. The main idea is to show that for sufficiently small noise intensity
and slow time dependence, the vast majority of paths remain in small space-time
sets, typically in the neighbourhood of potential wells. The size of these sets
often has a power-law dependence on the small parameters, with universal
exponents. The overall probability of exceptional paths is exponentially small,
with an exponent also showing power-law behaviour. The results cover time spans
up to the maximal Kramers time of the system. We apply our method to three
phenomena characteristic of bistable systems: stochastic resonance, dynamical
hysteresis and bifurcation delay, where it yields precise bounds on transition
probabilities, and the distribution of hysteresis areas and first-exit times.
We also discuss the effect of coloured noise.
Comment: 37 pages, 11 figure
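The pathwise picture can be illustrated numerically with a minimal sketch. It assumes a specific model that the abstract does not state: an overdamped particle in the double-well potential x^4/4 - x^2/2 with a slow periodic forcing, integrated with the Euler-Maruyama scheme. The function name `simulate_path` and all parameter values are hypothetical choices for illustration.

```python
import math
import random

def simulate_path(x0, sigma, eps, amplitude, dt, n_steps, seed=0):
    """Euler-Maruyama path of dx = (x - x^3 + A cos(eps t)) dt + sigma dW."""
    rng = random.Random(seed)
    x, t = x0, 0.0
    path = [x]
    for _ in range(n_steps):
        drift = x - x**3 + amplitude * math.cos(eps * t)
        # Brownian increment over a step dt has standard deviation sqrt(dt).
        x += drift * dt + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        t += dt
        path.append(x)
    return path

# Small noise and slow forcing, starting in the well near x = 1.
path = simulate_path(x0=1.0, sigma=0.05, eps=0.01, amplitude=0.1,
                     dt=0.01, n_steps=20000)
max_excursion = max(abs(x - 1.0) for x in path)
```

With these assumed values the sampled path stays in a small neighbourhood of the well it started in, which is the kind of concentration behaviour the abstract describes for small noise intensity and slow time dependence.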
Sequential escapes: onset of slow domino regime via a saddle connection
We explore sequential escape behaviour of coupled bistable systems under the
influence of stochastic perturbations. We consider transient escapes from a
marginally stable "quiescent" equilibrium to a more stable "active"
equilibrium. The presence of coupling introduces dependence between the escape
processes: for diffusive coupling there is a strongly coupled limit (fast
domino regime) where the escapes are strongly synchronised while for
intermediate coupling (slow domino regime) without partially escaped stable
states, there is still a delayed effect. These regimes can be associated with
bifurcations of equilibria in the low-noise limit. In this paper we consider a
localized form of non-diffusive (i.e. pulse-like) coupling and find similar
changes in the distribution of escape times with coupling strength. However, we
find a transition to a slow domino regime that is not associated with any
bifurcations of equilibria. We show that this transition can be understood as a
codimension-one saddle connection bifurcation for the low-noise limit. At the
transition, the most likely escape path from one attractor hits the escape
saddle from the basin of another partially escaped attractor. After this
bifurcation we find an increasing coefficient of variation of the subsequent
escape times.
A Multi-GPU Programming Library for Real-Time Applications
We present MGPU, a C++ programming library targeted at single-node multi-GPU
systems. Such systems combine disproportionate floating point performance with
high data locality and are thus well suited to implement real-time algorithms.
We describe the library design, programming interface and implementation
details in light of this specific problem domain. The core concepts of this
work are a novel kind of container abstraction and MPI-like communication
methods for intra-system communication. We further demonstrate how MGPU is used
as a framework for porting existing GPU libraries to multi-device
architectures. Putting our library to the test, we accelerate an iterative
non-linear image reconstruction algorithm for real-time magnetic resonance
imaging using multiple GPUs. We achieve a speed-up of about 1.7 using 2 GPUs
and reach a final speed-up of 2.1 with 4 GPUs. These promising results lead us
to conclude that multi-GPU systems are a viable solution for real-time MRI
reconstruction as well as signal-processing applications in general.
Comment: 15 pages, 10 figure
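The reported speed-ups can be put in perspective by converting them to parallel efficiency, defined as speed-up divided by the number of devices. The short sketch below uses only the two figures quoted in the abstract (about 1.7 on 2 GPUs, 2.1 on 4 GPUs); the helper name `parallel_efficiency` is hypothetical.

```python
def parallel_efficiency(speedup, n_devices):
    """Fraction of ideal linear scaling that was achieved."""
    return speedup / n_devices

# Speed-ups quoted in the abstract, keyed by number of GPUs.
reported = {2: 1.7, 4: 2.1}
efficiency = {n: parallel_efficiency(s, n) for n, s in reported.items()}
# efficiency[2] is about 0.85 and efficiency[4] is about 0.525
```

Read this way, going from 2 to 4 GPUs still shortens the run time, but each additional device contributes noticeably less than the first two did.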