56 research outputs found
Accelerating sequential programs using FastFlow and self-offloading
FastFlow is a programming environment specifically targeting cache-coherent
shared-memory multi-cores. FastFlow is implemented as a stack of C++ template
libraries built on top of lock-free (fence-free) synchronization mechanisms. In
this paper we present a further evolution of FastFlow enabling programmers to
offload part of their workload on a dynamically created software accelerator
running on unused CPUs. The offloaded function can be easily derived from
pre-existing sequential code. We emphasize in particular the effective
trade-off between human productivity and execution efficiency of the approach.Comment: 17 pages + cove
Enhancing the performance of Decoupled Software Pipeline through Backward Slicing
The rapidly increasing number of cores available in multicore processors does
not necessarily lead directly to a commensurate increase in performance:
programs written in conventional languages, such as C, need careful
restructuring, preferably automatically, before the benefits can be observed in
improved run-times. Even then, much depends upon the intrinsic capacity of the
original program for concurrent execution. The subject of this paper is the
performance gains from the combined effect of the complementary techniques of
the Decoupled Software Pipeline (DSWP) and (backward) slicing. DSWP extracts
threadlevel parallelism from the body of a loop by breaking it into stages
which are then executed pipeline style: in effect cutting across the control
chain. Slicing, on the other hand, cuts the program along the control chain,
teasing out finer threads that depend on different variables (or locations).
parts that depend on different variables. The main contribution of this paper
is to demonstrate that the application of DSWP, followed by slicing offers
notable improvements over DSWP alone, especially when there is a loop-carried
dependence that prevents the application of the simpler DOALL optimization.
Experimental results show an improvement of a factor of ?1.6 for DSWP + slicing
over DSWP alone and a factor of ?2.4 for DSWP + slicing over the original
sequential code
Single-Producer/Single-Consumer Queues on Shared Cache Multi-Core Systems
Using efficient point-to-point communication channels is critical for
implementing fine grained parallel program on modern shared cache multi-core
architectures.
This report discusses in detail several implementations of wait-free
Single-Producer/Single-Consumer queue (SPSC), and presents a novel and
efficient algorithm for the implementation of an unbounded wait-free SPSC queue
(uSPSC). The correctness proof of the new algorithm, and several performance
measurements based on simple synthetic benchmark and microbenchmark, are also
discussed
FastFlow: Efficient Parallel Streaming Applications on Multi-core
Shared memory multiprocessors come back to popularity thanks to rapid
spreading of commodity multi-core architectures. As ever, shared memory
programs are fairly easy to write and quite hard to optimise; providing
multi-core programmers with optimising tools and programming frameworks is a
nowadays challenge. Few efforts have been done to support effective streaming
applications on these architectures. In this paper we introduce FastFlow, a
low-level programming framework based on lock-free queues explicitly designed
to support high-level languages for streaming applications. We compare FastFlow
with state-of-the-art programming frameworks such as Cilk, OpenMP, and Intel
TBB. We experimentally demonstrate that FastFlow is always more efficient than
all of them in a set of micro-benchmarks and on a real world application; the
speedup edge of FastFlow over other solutions might be bold for fine grain
tasks, as an example +35% on OpenMP, +226% on Cilk, +96% on TBB for the
alignment of protein P01111 against UniProt DB using Smith-Waterman algorithm.Comment: 23 pages + cove
Lock-free Concurrent Data Structures
Concurrent data structures are the data sharing side of parallel programming.
Data structures give the means to the program to store data, but also provide
operations to the program to access and manipulate these data. These operations
are implemented through algorithms that have to be efficient. In the sequential
setting, data structures are crucially important for the performance of the
respective computation. In the parallel programming setting, their importance
becomes more crucial because of the increased use of data and resource sharing
for utilizing parallelism.
The first and main goal of this chapter is to provide a sufficient background
and intuition to help the interested reader to navigate in the complex research
area of lock-free data structures. The second goal is to offer the programmer
familiarity to the subject that will allow her to use truly concurrent methods.Comment: To appear in "Programming Multi-core and Many-core Computing
Systems", eds. S. Pllana and F. Xhafa, Wiley Series on Parallel and
Distributed Computin
- …