Search CORE

2,534 research outputs found

SL: a "quick and dirty" but working intermediate language for SVP systems

Author: Poss Raphael
Publication venue
Publication date: 01/01/2012
Field of study

The CSA group at the University of Amsterdam has developed SVP, a framework to manage and program many-core and hardware multithreaded processors. In this article, we introduce the intermediate language SL, a common vehicle to program SVP platforms. SL is designed as an extension to the standard C language (ISO C99/C11). It includes primitive constructs to bulk create threads, bulk synchronize on termination of threads, and communicate using word-sized dataflow channels between threads. It is intended for use as target language for higher-level parallelizing compilers. SL is a research vehicle; as of this writing, it is the only interface language to program a main SVP platform, the new Microgrid chip architecture. This article provides an overview of the language, to complement a detailed specification available separately.Comment: 22 pages, 3 figures, 18 listings, 1 tabl

arXiv.org e-Print Archive

CiteSeerX

UvA-DARE

International Migration, Integration and Social Cohesion online publications

An efficient multi-core implementation of a novel HSS-structured multifrontal solver using randomized sampling

Author: Ghysels Pieter
Li Xiaoye S.
Napov Artem
Rouet Francois-Henry
Williams Samuel
Publication venue
Publication date: 25/02/2015
Field of study

We present a sparse linear system solver that is based on a multifrontal variant of Gaussian elimination, and exploits low-rank approximation of the resulting dense frontal matrices. We use hierarchically semiseparable (HSS) matrices, which have low-rank off-diagonal blocks, to approximate the frontal matrices. For HSS matrix construction, a randomized sampling algorithm is used together with interpolative decompositions. The combination of the randomized compression with a fast ULV HSS factorization leads to a solver with lower computational complexity than the standard multifrontal method for many applications, resulting in speedups up to 7 fold for problems in our test suite. The implementation targets many-core systems by using task parallelism with dynamic runtime scheduling. Numerical experiments show performance improvements over state-of-the-art sparse direct solvers. The implementation achieves high performance and good scalability on a range of modern shared memory parallel systems, including the Intel Xeon Phi (MIC). The code is part of a software package called STRUMPACK -- STRUctured Matrices PACKage, which also has a distributed memory component for dense rank-structured matrices

arXiv.org e-Print Archive

Crossref

eScholarship - University of California

DI-fusion

Efficient Irregular Wavefront Propagation Algorithms on Hybrid CPU-GPU Machines

Author: Cooper Lee
Kong Jun
Kurc Tahsin
Pan Tony
Saltz Joel
Teodoro George
Publication venue
Publication date: 14/09/2012
Field of study

In this paper, we address the problem of efficient execution of a computation pattern, referred to here as the irregular wavefront propagation pattern (IWPP), on hybrid systems with multiple CPUs and GPUs. The IWPP is common in several image processing operations. In the IWPP, data elements in the wavefront propagate waves to their neighboring elements on a grid if a propagation condition is satisfied. Elements receiving the propagated waves become part of the wavefront. This pattern results in irregular data accesses and computations. We develop and evaluate strategies for efficient computation and propagation of wavefronts using a multi-level queue structure. This queue structure improves the utilization of fast memories in a GPU and reduces synchronization overheads. We also develop a tile-based parallelization strategy to support execution on multiple CPUs and GPUs. We evaluate our approaches on a state-of-the-art GPU accelerated machine (equipped with 3 GPUs and 2 multicore CPUs) using the IWPP implementations of two widely used image processing operations: morphological reconstruction and euclidean distance transform. Our results show significant performance improvements on GPUs. The use of multiple CPUs and GPUs cooperatively attains speedups of 50x and 85x with respect to single core CPU executions for morphological reconstruction and euclidean distance transform, respectively.Comment: 37 pages, 16 figure

arXiv.org e-Print Archive

CiteSeerX

Parallel structurally-symmetric sparse matrix-vector products on multi-core processors

Author: Ainsworth Jr. George O.
Batista Vicente H. F.
Ribeiro Fernando L. B.
Publication venue: 'Civil-Comp, Ltd.'
Publication date: 28/05/2010
Field of study

We consider the problem of developing an efficient multi-threaded implementation of the matrix-vector multiplication algorithm for sparse matrices with structural symmetry. Matrices are stored using the compressed sparse row-column format (CSRC), designed for profiting from the symmetric non-zero pattern observed in global finite element matrices. Unlike classical compressed storage formats, performing the sparse matrix-vector product using the CSRC requires thread-safe access to the destination vector. To avoid race conditions, we have implemented two partitioning strategies. In the first one, each thread allocates an array for storing its contributions, which are later combined in an accumulation step. We analyze how to perform this accumulation in four different ways. The second strategy employs a coloring algorithm for grouping rows that can be concurrently processed by threads. Our results indicate that, although incurring an increase in the working set size, the former approach leads to the best performance improvements for most matrices.Comment: 17 pages, 17 figures, reviewed related work section, fixed typo

arXiv.org e-Print Archive

Crossref

Analytical/ML Mixed Approach for Concurrency Regulation in Software Transactional Memory

Author: Ciciani Bruno
DI SANZO Pierangelo
Quaglia Francesco
Rughetti Diego
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2014
Field of study

In this article we exploit a combination of analytical and Machine Learning (ML) techniques in order to build a performance model allowing to dynamically tune the level of concurrency of applications based on Software Transactional Memory (STM). Our mixed approach has the advantage of reducing the training time of pure machine learning methods, and avoiding approximation errors typically affecting pure analytical approaches. Hence it allows very fast construction of highly reliable performance models, which can be promptly and effectively exploited for optimizing actual application runs. We also present a real implementation of a concurrency regulation architecture, based on the mixed modeling approach, which has been integrated with the open source Tiny STM package, together with experimental data related to runs of applications taken from the STAMP benchmark suite demonstrating the effectiveness of our proposal. © 2014 IEEE

CiteSeerX

Crossref

Archivio della Ricerca - Università di Roma 3

Archivio della ricerca- Università di Roma La Sapienza

Programming Languages for Distributed Computing Systems

Author: Bal H.E.
Steiner J.G.
Tanenbaum A.S.
Publication venue
Publication date: 01/01/1989
Field of study

When distributed systems first appeared, they were programmed in traditional sequential languages, usually with the addition of a few library procedures for sending and receiving messages. As distributed applications became more commonplace and more sophisticated, this ad hoc approach became less satisfactory. Researchers all over the world began designing new programming languages specifically for implementing distributed applications. These languages and their history, their underlying principles, their design, and their use are the subject of this paper. We begin by giving our view of what a distributed system is, illustrating with examples to avoid confusion on this important and controversial point. We then describe the three main characteristics that distinguish distributed programming languages from traditional sequential languages, namely, how they deal with parallelism, communication, and partial failures. Finally, we discuss 15 representative distributed languages to give the flavor of each. These examples include languages based on message passing, rendezvous, remote procedure call, objects, and atomic transactions, as well as functional languages, logic languages, and distributed data structure languages. The paper concludes with a comprehensive bibliography listing over 200 papers on nearly 100 distributed programming languages

CiteSeerX

VU Research Portal

Crossref

A Parallel Tree code for large Nbody simulation: dynamic load balance and data distribution on CRAY T3D system

Author: A Pagliaro
Antonuccio-Delogu
Barnes
Barnes
Becciani
Brooks
Cray Research Inc
Cray Research Inc
Dubunski
G Erbacci
Gouhing
M Gambera
R Ansaloni
Romeel
Salmon
Salmon
U Becciani
V Antonuccio-Delogu
Publication venue: 'Elsevier BV'
Publication date: 01/01/1997
Field of study

N-body algorithms for long-range unscreened interactions like gravity belong to a class of highly irregular problems whose optimal solution is a challenging task for present-day massively parallel computers. In this paper we describe a strategy for optimal memory and work distribution which we have applied to our parallel implementation of the Barnes & Hut (1986) recursive tree scheme on a Cray T3D using the CRAFT programming environment. We have performed a series of tests to find an " optimal data distribution " in the T3D memory, and to identify a strategy for the " Dynamic Load Balance " in order to obtain good performances when running large simulations (more than 10 million particles). The results of tests show that the step duration depends on two main factors: the data locality and the T3D network contention. Increasing data locality we are able to minimize the step duration if the closest bodies (direct interaction) tend to be located in the same PE local memory (contiguous block subdivison, high granularity), whereas the tree properties have a fine grain distribution. In a very large simulation, due to network contention, an unbalanced load arises. To remedy this we have devised an automatic work redistribution mechanism which provided a good Dynamic Load Balance at the price of an insignificant overhead.Comment: 16 pages with 11 figures included, (Latex, elsart.style). Accepted by Computer Physics Communication

arXiv.org e-Print Archive

Crossref

OA@INAF - Istituto Nazionale di Astrofisica

CERN Document Server

Asynchronous Games over Tree Architectures

Author: A. Pnueli
B. Bollig
B. Genest
G. Katz
P. Gastin
P. Gastin
P. Madhusudan
P. Madhusudan
P.-A. Melliès
P.J.G. Ramadge
R. Meyden van der
S. Schewe
W. Zielonka
Publication venue
Publication date: 01/01/2013
Field of study

We consider the task of controlling in a distributed way a Zielonka asynchronous automaton. Every process of a controller has access to its causal past to determine the next set of actions it proposes to play. An action can be played only if every process controlling this action proposes to play it. We consider reachability objectives: every process should reach its set of final states. We show that this control problem is decidable for tree architectures, where every process can communicate with its parent, its children, and with the environment. The complexity of our algorithm is l-fold exponential with l being the height of the tree representing the architecture. We show that this is unavoidable by showing that even for three processes the problem is EXPTIME-complete, and that it is non-elementary in general

arXiv.org e-Print Archive

HAL-CentraleSupelec

CiteSeerX

Crossref

INRIA a CCSD electronic archive server

HAL Descartes

Hal-Diderot

HAL-Rennes 1

Graph Oracle Models, Lower Bounds, and Gaps for Parallel Stochastic Optimization

Author: McMahan Brendan
Smith Adam
Srebro Nathan
Wang Jialei
Woodworth Blake
Publication venue
Publication date: 01/12/2018
Field of study

We suggest a general oracle-based framework that captures different parallel stochastic optimization settings described by a dependency graph, and derive generic lower bounds in terms of this graph. We then use the framework and derive lower bounds for several specific parallel optimization settings, including delayed updates and parallel processing with intermittent communication. We highlight gaps between lower and upper bounds on the oracle complexity, and cases where the "natural" algorithms are not known to be optimal

arXiv.org e-Print Archive

Boston University Institutional Repository (OpenBU)