
2,224 research outputs found

Convergence of the D-iteration algorithm: convergence rate and asynchronous distributed scheme

Author: Burnside Gérard
Hong Dohy
Mathieu Fabien
Publication date: 14/01/2013

In this paper, we define a general framework for describing the diffusion operators associated with a positive matrix. We define the equations associated with diffusion operators and present some general properties of their state vectors. We show how this can be applied to prove and improve the convergence of a fixed point problem associated with the matrix iteration scheme, including in a distributed computation framework. The approach can be understood as a decomposition of the matrix-vector product operation into elementary operations at the vector entry level. Comment: 9 pages
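
The entry-level decomposition mentioned in the abstract can be illustrated with a minimal sketch of one common formulation of D-iteration for the fixed point problem X = P X + B. The function name `d_iteration`, the greedy choice of which entry to diffuse next, and the `tol` and `max_steps` parameters are assumptions made for illustration; they are not taken from the paper.

```python
def d_iteration(P, B, tol=1e-12, max_steps=100_000):
    """Approximate the solution of X = P X + B by entry-level diffusion.

    P is a square matrix given as a list of rows, B a list of floats.
    Assumes the iteration X <- P X + B converges (spectral radius of P < 1).
    """
    n = len(B)
    F = list(B)      # "fluid": residual that still has to be diffused
    H = [0.0] * n    # "history": total fluid already diffused per entry
    for _ in range(max_steps):
        # Illustrative schedule: diffuse the entry holding the most fluid.
        i = max(range(n), key=lambda k: abs(F[k]))
        if abs(F[i]) < tol:
            break
        f = F[i]
        H[i] += f
        F[i] = 0.0
        # Elementary operation: push column i of P, scaled by f, into F.
        for j in range(n):
            F[j] += P[j][i] * f
    return H


if __name__ == "__main__":
    P = [[0.0, 0.5], [0.25, 0.0]]
    B = [1.0, 1.0]
    print(d_iteration(P, B))
```

Each step preserves the invariant H + F = B + P H, so H approaches the fixed point as the remaining fluid F shrinks. Because a step touches only one column of P, the order of the steps is free, which is what makes asynchronous and distributed schedules possible in principle.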

arXiv.org e-Print Archive

INRIA a CCSD electronic archive server

Hal-Diderot

GraphBLAST: A High-Performance Linear Algebra-based Graph Framework on the GPU

Author: Buluc Aydin
Owens John D.
Yang Carl
Publication date: 14/11/2020

High-performance graph algorithms are challenging to implement on new parallel hardware such as GPUs for three reasons: (1) the difficulty of coming up with graph building blocks, (2) load imbalance on parallel hardware, and (3) the low arithmetic intensity of graph problems. To address some of these challenges, GraphBLAS is an innovative, ongoing effort by the graph analytics community to propose building blocks based on sparse linear algebra, which will allow graph algorithms to be expressed in a performant, succinct, composable and portable manner. In this paper, we examine the performance challenges of a linear-algebra-based approach to building graph frameworks and describe new design principles for overcoming these bottlenecks. Among the new design principles is exploiting input sparsity, which allows users to write graph algorithms without specifying push and pull direction. Exploiting output sparsity allows users to tell the backend which values of the output in a single vectorized computation they do not want computed. Load-balancing is an important feature for balancing work amongst parallel workers. We describe the important load-balancing features for handling graphs with different characteristics. The design principles described in this paper have been implemented in "GraphBLAST", the first open-source, high-performance, linear algebra-based graph framework on NVIDIA GPUs. The results show that on a single GPU, GraphBLAST has on average at least an order of magnitude speedup over previous GraphBLAS implementations SuiteSparse and GBTL, comparable performance to the fastest GPU hardwired primitives and shared-memory graph frameworks Ligra and Gunrock, and better performance than any other GPU graph framework, while offering a simpler and more concise programming model. Comment: 50 pages, 14 figures, 14 tables

arXiv.org e-Print Archive

eScholarship - University of California

Energy efficiency improvement through MPC-based peripherals management for an industrial process test-bench

Author: Bermeo Ayerbe Miguel Ángel
Ocampo-Martínez Carlos
Publication venue: Elsevier BV
Publication date: 01/01/2019

High energy costs evince the growing need for energy efficiency in industrial companies. This paper presents a solution at the industrial machine level to obtain efficient energy consumption. To that end, a controller inspired by the well-known model predictive control (MPC) strategy was developed for the management of peripheral devices. The validation of the control requires a test-bench to emulate the energy consumption of a manufacturing machine. The test-bench has four devices: two are used to emulate the periodic and fixed energy consumption of the manufacturing process, and two act as peripherals, subject to rules associated with the process. Subspace identification (SI) was employed to identify energy models to simulate the behavior of the device. As a final step, a performance comparison between a rule-based control (RBC) and the proposed predictive-like controller revealed remarkable energy savings. The MPC results show an energy saving of around 3% with respect to RBC, as well as a reduction of approximately 8% in instantaneous maximum energy consumption. Peer Reviewed. Postprint (published version)

UPCommons. Portal del coneixement obert de la UPC

A Case Study in Coordination Programming: Performance Evaluation of S-Net vs Intel's Concurrent Collections

Author: Gijsbers Bert
Grelck Clemens
Shafarenko Alex
Tveretina Olga
Zaichenkov Pavel
Publication date: 01/01/2014

We present a programming methodology and runtime performance case study comparing the declarative data flow coordination language S-Net with Intel's Concurrent Collections (CnC). As a coordination language, S-Net achieves a near-complete separation of concerns between sequential software components implemented in a separate algorithmic language and their parallel orchestration in an asynchronous data flow streaming network. We investigate the merits of S-Net and CnC with the help of a relevant and non-trivial linear algebra problem: tiled Cholesky decomposition. We describe two alternative S-Net implementations of tiled Cholesky factorization and compare them with two CnC implementations, one with explicit performance tuning and one without, that have previously been used to illustrate Intel CnC. Our experiments on a 48-core machine demonstrate that S-Net manages to outperform CnC on this problem. Comment: 9 pages, 8 figures, 1 table, accepted for PLC 2014 workshop
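
For readers unfamiliar with the benchmark named in the abstract, the following is a minimal sequential sketch of a right-looking tiled Cholesky factorization in plain Python. It is not the S-Net or CnC code from the paper; the function names (`potrf`, `trsm`, `update`, `tiled_cholesky`) follow common BLAS/LAPACK naming but are defined here purely for illustration, and the tile grid layout is an assumption.

```python
import math


def potrf(a):
    """Unblocked Cholesky of one square tile: returns lower L with a = L L^T."""
    b = len(a)
    l = [[0.0] * b for _ in range(b)]
    for r in range(b):
        for c in range(r + 1):
            s = a[r][c] - sum(l[r][k] * l[c][k] for k in range(c))
            l[r][c] = math.sqrt(s) if r == c else s / l[c][c]
    return l


def trsm(l, a):
    """Solve X L^T = a for X, where l is lower triangular."""
    x = [[0.0] * len(l) for _ in range(len(a))]
    for r in range(len(a)):
        for c in range(len(l)):
            s = a[r][c] - sum(x[r][k] * l[c][k] for k in range(c))
            x[r][c] = s / l[c][c]
    return x


def update(c, a, b):
    """Return c - a b^T (a SYRK-like step when a is b, GEMM-like otherwise)."""
    inner = len(a[0])
    return [[c[r][s] - sum(a[r][k] * b[s][k] for k in range(inner))
             for s in range(len(b))] for r in range(len(a))]


def tiled_cholesky(T):
    """Factor an nt x nt grid of square tiles in place.

    Only tiles T[i][j] with j <= i are read or written; on return they
    hold the corresponding tiles of the lower Cholesky factor.
    """
    nt = len(T)
    for k in range(nt):
        T[k][k] = potrf(T[k][k])                  # factor the diagonal tile
        for i in range(k + 1, nt):
            T[i][k] = trsm(T[k][k], T[i][k])      # panel solve below it
        for i in range(k + 1, nt):
            for j in range(k + 1, i + 1):
                T[i][j] = update(T[i][j], T[i][k], T[j][k])  # trailing update
    return T


if __name__ == "__main__":
    tiles = [[[[4.0, 2.0], [2.0, 10.0]], None],
             [[[0.0, 3.0], [2.0, 1.0]], [[5.0, 2.0], [2.0, 3.0]]]]
    print(tiled_cholesky(tiles))
```

Each kernel call reads and writes whole tiles, so the calls form a dependency graph: panel solves wait on the diagonal factorization, and trailing updates wait on the panel solves. Coordination frameworks of the kind compared in the abstract schedule such tile-level tasks in parallel; this sketch simply runs them in a valid sequential order.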

arXiv.org e-Print Archive

Crossref

Ghent University Academic Bibliography

International Migration, Integration and Social Cohesion online publications

Dynamic Task Execution on Shared and Distributed Memory Architectures

Author: YarKhan Asim
Publication venue: TRACE: Tennessee Research and Creative Exchange
Publication date: 01/12/2012

Multicore architectures with high core counts have come to dominate the world of high performance computing, from shared memory machines to the largest distributed memory clusters. The multicore route to increased performance has a simpler design and better power efficiency than the traditional approach of increasing processor frequencies. But standard programming techniques are not well adapted to this change in computer architecture design. In this work, we study the use of dynamic runtime environments executing data driven applications as a solution to programming multicore architectures. The goals of our runtime environments are productivity, scalability and performance. We demonstrate productivity by defining a simple programming interface to express code. Our runtime environments are experimentally shown to be scalable and give competitive performance on large multicore and distributed memory machines. This work is driven by linear algebra algorithms, where state-of-the-art libraries (e.g., LAPACK and ScaLAPACK) using a fork-join or block-synchronous execution style do not use the available resources in the most efficient manner. Research work in linear algebra has reformulated these algorithms as tasks acting on tiles of data, with data dependency relationships between the tasks. This results in a task-based DAG for the reformulated algorithms, which can be executed via asynchronous data-driven execution paths analogous to dataflow execution. We study an API and runtime environment for shared memory architectures that efficiently executes serially presented tile based algorithms. This runtime is used to enable linear algebra applications and is shown to deliver performance competitive with state-of-the-art commercial and research libraries. We develop a runtime environment for distributed memory multicore architectures extended from our shared memory implementation. The runtime takes serially presented algorithms designed for the shared memory environment, and schedules and executes them on distributed memory architectures in a scalable and high performance manner. We design a distributed data coherency protocol and a distributed task scheduling mechanism which avoid global coordination. Experimental results with linear algebra applications show the scalability and performance of our runtime environment.

University of Tennessee, Knoxville: Trace