A general and efficient divide-and-conquer algorithm framework for multi-core clusters
This is a post-peer-review, pre-copyedit version of an article published in Cluster Computing.
The final authenticated version is available online at: https://doi.org/10.1007/s10586-017-0766-y
[Abstract] Divide-and-conquer is one of the most important patterns of parallelism, being applicable to a large variety of problems. In addition, the most powerful parallel systems available nowadays are computer clusters composed of distributed-memory nodes that contain an increasing number of cores that share a common memory. The optimal exploitation of these systems often requires resorting to a hybrid model that mimics the underlying hardware by combining a distributed and a shared memory parallel programming model. This results in longer development times and increased maintenance costs. In this paper we present a very general skeleton library that allows any divide-and-conquer problem to be parallelized in hybrid distributed-shared memory systems with little effort while providing much flexibility and good performance. Our proposal combines a message-passing paradigm at the process level and a threaded model inside each process, hiding the related complexity from the user. The evaluation shows that this skeleton provides performance comparable to, and often better than, that of manually optimized codes while requiring considerably less effort when parallelizing applications on multi-core clusters.
Ministerio de Economía y Competitividad; TIN2013-42148-P
Ministerio de Economía y Competitividad; TIN2016-75845-P
Xunta de Galicia; GRC2013/05
Towards an Adaptive Skeleton Framework for Performance Portability
The proliferation of widely available, but very different, parallel architectures
makes the ability to deliver good parallel performance
on a range of architectures, or performance portability, highly desirable.
Irregularly-parallel problems, where the number and size
of tasks is unpredictable, are particularly challenging and require
dynamic coordination.
The paper outlines a novel approach to delivering portable parallel
performance for irregularly parallel programs. The approach
combines declarative parallelism with JIT technology, dynamic
scheduling, and dynamic transformation.
We present the design of an adaptive skeleton library, with a task
graph implementation, JIT trace costing, and adaptive transformations.
We outline the architecture of the prototype adaptive skeleton
execution framework in Pycket, describing tasks, serialisation,
and the current scheduler. We report a preliminary evaluation of the
prototype framework using 4 micro-benchmarks and a small case
study on two NUMA servers (24 and 96 cores) and a small cluster
(17 hosts, 272 cores). Key results include Pycket delivering good
sequential performance, e.g. almost as fast as C for some benchmarks;
good absolute speedups on all architectures (up to 120 on
128 cores for sumEuler); and that the adaptive transformations do
improve performance.
An Evaluation of the X10 Programming Language
As predicted by Moore's law, the number of transistors on a chip has doubled approximately every two years. As miraculous as it sounds, for many years the extra transistors massively benefited the whole computer industry: they were used to increase CPU clock speed, thus boosting performance. However, due to the heat wall and power constraints, the clock speed cannot be increased limitlessly. Hardware vendors now have to take a path other than increasing clock speed, which is to use the transistors to increase the number of processor cores on each chip. This hardware structural change presents inevitable challenges to software structure, where software targeting a single thread will not benefit from newer chips or may even suffer from lower clock speed. The two fundamental challenges are: 1. How to deal with the stagnation of single core clock speed and cache memory. 2. How to utilize the additional power generated from more cores on a chip. Most software programming languages nowadays have distributed computing support, such as C and Java [1]. Meanwhile, some new programming languages were invented from scratch just to take advantage of the more distributed hardware structures. The X10 Programming Language is one of them. The goal of this project is to evaluate X10 in terms of performance, programmability and tool support.
A Divide-and-Conquer Algorithm for Betweenness Centrality
The problem of efficiently computing the betweenness centrality of nodes has
been researched extensively. To date, the best known exact and centralized
algorithm for this task is an algorithm proposed in 2001 by Brandes. The
contribution of our paper is Brandes++, an algorithm for exact efficient
computation of betweenness centrality. The crux of our algorithm is that we
create a sketch of the graph, that we call the skeleton, by replacing subgraphs
with simpler graph structures. Depending on the underlying graph structure,
using this skeleton and keeping appropriate summaries, Brandes++ can
achieve significantly low running times in our computations. Extensive
experimental evaluation on real life datasets demonstrates the efficacy of our
algorithm for different types of graphs. We release our code for the benefit of the
research community.
Comment: Shorter version of this paper appeared in SIAM Data Mining 201
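For context, the 2001 Brandes baseline mentioned in the abstract can be sketched as follows. This is a minimal illustration for unweighted, undirected graphs only, not Brandes++ itself: the skeleton construction and summaries described above are not shown. The function name `brandes_betweenness` and the adjacency-dictionary input format are assumptions made for this example.

```python
from collections import deque


def brandes_betweenness(graph):
    """Exact betweenness centrality (Brandes, 2001) for an unweighted,
    undirected graph given as {node: iterable of neighbours}.
    The input format is an assumption made for this sketch."""
    bc = {v: 0.0 for v in graph}
    for s in graph:
        # Phase 1: BFS from s, counting shortest paths (sigma) and
        # recording predecessors on those paths.
        stack = []
        preds = {v: [] for v in graph}
        sigma = {v: 0 for v in graph}
        dist = {v: -1 for v in graph}
        sigma[s] = 1
        dist[s] = 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in graph[v]:
                if dist[w] < 0:            # w discovered for the first time
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:  # v lies on a shortest path to w
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Phase 2: accumulate dependencies in order of decreasing distance.
        delta = {v: 0.0 for v in graph}
        while stack:
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Each unordered pair was counted from both endpoints.
    return {v: c / 2.0 for v, c in bc.items()}


path = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
print(brandes_betweenness(path))
```

On the three-node path above, only the middle node lies between another pair, so it is the only node with a non-zero score.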
On Designing Multicore-aware Simulators for Biological Systems
The stochastic simulation of biological systems is an increasingly popular
technique in bioinformatics. It is often an enlightening technique, which may
however turn out to be computationally expensive. We discuss the main
opportunities to speed it up on multi-core platforms, which pose new challenges
for parallelisation techniques. These opportunities are developed in two
general families of solutions involving both the single simulation and a bulk
of independent simulations (either replicas or derived from a parameter sweep).
Proposed solutions are tested on the parallelisation of the CWC simulator
(Calculus of Wrapped Compartments) that is carried out according to proposed
solutions by way of the FastFlow programming framework, making possible fast
development and efficient execution on multi-cores.
Comment: 19 pages + cover pag
FastFlow tutorial
FastFlow is a structured parallel programming framework targeting shared
memory multicores. Its layered design and the optimized implementation of the
communication mechanisms used to implement the FastFlow streaming networks
provided to the application programmer as algorithmic skeletons support the
development of efficient fine grain parallel applications. FastFlow is
available (open source) at SourceForge
(http://sourceforge.net/projects/mc-fastflow/). This work introduces FastFlow
programming techniques and points out the different ways used to parallelize
existing C/C++ code using FastFlow as a software accelerator. In short: this is
a kind of tutorial on FastFlow.
Comment: 49 pages + cove
Performance predictability of divide and conquer skeletons
Parallel divide and conquer computations, encompassing a wide variety of applications, can be modeled and encapsulated as a high level primitive called skeleton.
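As an illustration of what such a primitive looks like to a user, the following is a minimal sketch of a divide-and-conquer skeleton. It is not the skeleton described in this paper: the names `dac_sequential`, `dac_parallel` and `merge_sort`, the four user-supplied functions, and the choice to parallelize only the top-level split with a thread pool are all assumptions made for the example.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def dac_sequential(problem, is_base, solve_base, divide, combine):
    """Plain recursive divide and conquer driven by four user functions."""
    if is_base(problem):
        return solve_base(problem)
    parts = [dac_sequential(sub, is_base, solve_base, divide, combine)
             for sub in divide(problem)]
    return combine(parts)


def dac_parallel(problem, is_base, solve_base, divide, combine, workers=2):
    """Hypothetical skeleton: split once, solve each subproblem in its own
    worker with the sequential version, then combine the partial results.
    Workers never submit further tasks, so the pool cannot deadlock."""
    if is_base(problem):
        return solve_base(problem)
    subproblems = divide(problem)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(
            lambda sub: dac_sequential(sub, is_base, solve_base, divide, combine),
            subproblems))
    return combine(parts)


def merge_sort(values, workers=2):
    """Merge sort expressed only through the skeleton's four hooks."""
    return dac_parallel(
        list(values),
        is_base=lambda p: len(p) <= 1,
        solve_base=lambda p: p,
        divide=lambda p: [p[:len(p) // 2], p[len(p) // 2:]],
        combine=lambda parts: list(heapq.merge(*parts)),
        workers=workers,
    )


print(merge_sort([5, 2, 9, 1, 5, 6]))
```

The point of the sketch is the separation of concerns: the algorithm is stated once through the four hooks, and the coordination strategy can change without touching it. In CPython, threads will not speed up CPU-bound work like this, so the pool here only demonstrates the structure.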
The paper deals with a skeleton designed for parallel divide and conquer algorithms that provides hypercubical communications among processes. The paper also introduces an accurate timing model designed for predicting the performance of the proposed primitive. The timing analysis model presented here still characterizes the communication time through architecture parameters but introduces a few novelties. The proposal is to introduce different kinds of components into the analytical model by associating a performance constant with each specific conceptual block of the skeleton. The trace files obtained from executing the resulting code with the skeleton are processed by linear regression techniques, giving us, among other information, the values of the parameters of those blocks. An extended example showing the relative accuracy of the proposed approach concludes the paper.
Workshop de Procesamiento Distribuido y Paralelo (WPDP)
Red de Universidades con Carreras en Informática (RedUNCI