Accelerating sequential programs using FastFlow and self-offloading
FastFlow is a programming environment specifically targeting cache-coherent
shared-memory multi-cores. FastFlow is implemented as a stack of C++ template
libraries built on top of lock-free (fence-free) synchronization mechanisms. In
this paper we present a further evolution of FastFlow enabling programmers to
offload part of their workload on a dynamically created software accelerator
running on unused CPUs. The offloaded function can be easily derived from
pre-existing sequential code. We emphasize in particular the effective
trade-off between human productivity and execution efficiency of the approach. Comment: 17 pages + cove
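The self-offloading idea described above can be sketched in a few lines. This is a minimal illustration only, not FastFlow's C++ API: the `Accelerator` class, its `offload` and `finish` methods, and the `kernel` function are hypothetical names, and a Python thread stands in for a software accelerator running on a spare core (CPython threads do not give real parallelism for CPU-bound work, so only the program structure is shown).

```python
import queue
import threading

_STOP = object()  # sentinel marking the end of the offloaded stream


class Accelerator:
    """Hypothetical helper: applies `fn` to tasks on a worker thread."""

    def __init__(self, fn):
        self._fn = fn
        self._tasks = queue.Queue()
        self._results = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def _run(self):
        # Worker loop: pull tasks until the sentinel arrives.
        while True:
            task = self._tasks.get()
            if task is _STOP:
                self._results.put(_STOP)
                return
            self._results.put(self._fn(task))

    def offload(self, task):
        # Non-blocking from the caller's point of view.
        self._tasks.put(task)

    def finish(self):
        # Signal end of stream, then collect results in submission order
        # (order is preserved because there is a single worker).
        self._tasks.put(_STOP)
        out = []
        while True:
            result = self._results.get()
            if result is _STOP:
                break
            out.append(result)
        self._worker.join()
        return out


def kernel(x):
    # Stand-in for the body of a pre-existing sequential loop.
    return x * x


def sequential(data):
    return [kernel(x) for x in data]


def offloaded(data):
    # Same loop, with the body handed to the accelerator.
    acc = Accelerator(kernel)
    for x in data:
        acc.offload(x)
    return acc.finish()
```

The offloaded version keeps the shape of the sequential loop: only the loop body moves into the function passed to the accelerator.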
An Optimization Theory for Structured Stencil-based Parallel Applications
In this thesis, we introduce a new optimization theory for stencil-based applications, centered both on a modification of the well-known owner-computes rule and on basic but powerful properties of toroidal spaces. The proposed optimization techniques provide notable results in different computational aspects: from the reduction of communication overhead to the reduction of computation time, through the minimization of memory requirements without performance loss.
All classical optimization theory is based on defining transformations that can produce optimized programs which are computationally equivalent to the original ones. According to Kennedy, two programs are equivalent if, from the same input data, they produce identical output data.
Like other proposed modifications to the owner-computes rule, we exploit the fact that stencil applications are characterized by a set of consecutive steps. For such configurations, it is possible to define specific two-phase optimizations.
The first phase applies program transformations that result in an efficient computation of an output that can be easily converted into the original one. In other words, the transformed program defined by the first phase is not computationally equivalent to the original one.
The second phase converts the output of the previous phase back into the original one, using optimized techniques in order to introduce the lowest possible additional overhead. This phase guarantees the computational equivalence of the approach.
Obviously, in order to define an interesting new optimization technique, we have to prove that the overall performance of the two-phase sequence is better than that of the original program.
Exploiting a structured approach and studying this optimization theory on stencils featuring specific patterns of functional dependencies, we discover a set of novel transformations which result in significant optimizations.
Among the new transformations, the most notable one, which aims to reduce the number of communications necessary to implement a stencil-based application, turns out to be the best optimization technique amongst those cited in the literature.
All the improvements provided by the transformations presented in this thesis have been both formally proved and experimentally tested on a heterogeneous set of architectures including clusters and different types of multi-cores.
FastFlow: Efficient Parallel Streaming Applications on Multi-core
Shared memory multiprocessors have come back to popularity thanks to the rapid
spreading of commodity multi-core architectures. As ever, shared memory
programs are fairly easy to write and quite hard to optimise; providing
multi-core programmers with optimising tools and programming frameworks is a
current challenge. Few efforts have been made to support effective streaming
applications on these architectures. In this paper we introduce FastFlow, a
low-level programming framework based on lock-free queues explicitly designed
to support high-level languages for streaming applications. We compare FastFlow
with state-of-the-art programming frameworks such as Cilk, OpenMP, and Intel
TBB. We experimentally demonstrate that FastFlow is always more efficient than
all of them in a set of micro-benchmarks and on a real world application; the
speedup edge of FastFlow over other solutions may be substantial for fine grain
tasks, for example +35% over OpenMP, +226% over Cilk, +96% over TBB for the
alignment of protein P01111 against UniProt DB using the Smith-Waterman algorithm. Comment: 23 pages + cove
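The abstract above names lock-free queues as FastFlow's building block. As a rough sketch of the general idea, and not FastFlow's actual implementation, a single-producer single-consumer ring buffer can avoid locks by letting each side update only its own index and by using an empty slot as the "nothing here" marker. The class name `SPSCQueue` and its methods are hypothetical, and a real lock-free version in C++ would also need to handle memory ordering, which this Python illustration does not model.

```python
class SPSCQueue:
    """Bounded ring buffer for exactly one producer and one consumer."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self._buf = [None] * capacity  # None marks an empty slot
        self._head = 0  # next slot to read; updated only by the consumer
        self._tail = 0  # next slot to write; updated only by the producer

    def push(self, item):
        """Producer side. Returns False instead of blocking when full."""
        if item is None:
            raise ValueError("None is reserved as the empty-slot marker")
        if self._buf[self._tail] is not None:
            return False  # slot still occupied: queue is full
        self._buf[self._tail] = item
        self._tail = (self._tail + 1) % len(self._buf)
        return True

    def pop(self):
        """Consumer side. Returns None instead of blocking when empty."""
        item = self._buf[self._head]
        if item is None:
            return None  # nothing to read yet
        self._buf[self._head] = None  # hand the slot back to the producer
        self._head = (self._head + 1) % len(self._buf)
        return item
```

Because the producer never touches `_head` and the consumer never touches `_tail`, the two sides coordinate only through the contents of the slots themselves.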
Minimizing Communications with Q-transformations in Uniform and Affine Stencils
In stencil-based parallel applications, communications represent the main overhead, especially when targeting a fine grain parallelization in order to reduce the completion time. Techniques that minimize the number and the impact of communications are clearly relevant. In the literature, the best optimization reduces the number of communications per step from 3^dim, featured by a naive implementation, to 2*dim, where dim is the number of domain dimensions. To break this bound, in the paper we introduce and formally prove Q-transformations, for stencils featuring data dependencies that can be expressed as geometric affine translations. Q-transformations, based on data dependency orientations through space translations, lower the number of communications per step to dim.
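To make the three communication counts concrete, the following small sketch tabulates them for a given number of dimensions. It takes the naive count to be 3 raised to the power dim, which is an assumption about how the abstract's figure should be read; the function name `communications_per_step` and the dictionary keys are hypothetical labels, not terms from the paper.

```python
def communications_per_step(dim):
    """Per-step communication counts quoted in the abstract, for `dim` dimensions."""
    if dim < 1:
        raise ValueError("dim must be a positive integer")
    return {
        "naive": 3 ** dim,          # assumed reading: 3 to the power dim
        "best_prior": 2 * dim,      # best optimization cited in the literature
        "q_transformations": dim,   # bound claimed for Q-transformations
    }


if __name__ == "__main__":
    for d in (1, 2, 3):
        print(d, communications_per_step(d))
```

The gap widens quickly with dimensionality: the naive count grows exponentially in dim, while the other two grow linearly.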
Efficient Smith-Waterman on multi-core with FastFlow
Shared memory multiprocessors have returned to popularity thanks to the rapid spreading of commodity multi-core architectures. However, little attention has been paid to supporting effective streaming applications on these architectures. In this paper we describe FastFlow, a low-level programming framework based on lock-free queues explicitly designed to support high-level languages for streaming applications. We compare FastFlow with state-of-the-art programming frameworks such as Cilk, OpenMP, and Intel TBB. We experimentally demonstrate that FastFlow is always more efficient than them on a given real world application: the speedup of FastFlow over other solutions may be substantial for fine grain tasks, for example +35% over OpenMP, +226% over Cilk, +96% over TBB for the alignment of protein P01111 against UniProt DB using the Smith-Waterman algorithm.
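For readers unfamiliar with the algorithm named above, here is a minimal sequential sketch of Smith-Waterman local alignment scoring. It is not the paper's parallel implementation: it uses a simple linear gap penalty and made-up scoring values (`match=2`, `mismatch=-1`, `gap=-1`), whereas protein alignment against a database such as UniProt would normally use a substitution matrix and affine gaps.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score of strings a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)  # previous row of the scoring matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                   # start a new local alignment
                prev[j - 1] + sub,   # align a[i-1] with b[j-1]
                prev[j] + gap,       # gap in b
                curr[j - 1] + gap,   # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best
```

Each cell depends only on its left, upper, and upper-left neighbours, and each query-versus-database-entry alignment is independent of the others, which is what makes the workload a natural fit for a stream of fine grain tasks.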