Search CORE

35 research outputs found

Computing the R of the QR factorization of tall and skinny matrices using MPI_Reduce

Author: Langou Julien
Publication venue
Publication date: 01/01/2010
Field of study

A QR factorization of a tall and skinny matrix with n columns can be represented as a reduction. The operation used along the reduction tree has in input two n-by-n upper triangular matrices and in output an n-by-n upper triangular matrix which is defined as the R factor of the two input matrices stacked the one on top of the other. This operation is binary, associative, and commutative. We can therefore leverage the MPI library capabilities by using user-defined MPI operations and MPI_Reduce to perform this reduction. The resulting code is compact and portable. In this context, the user relies on the MPI library to select a reduction tree appropriate for the underlying architecture

arXiv.org e-Print Archive

CiteSeerX

Scheduling Two-Sided Transformations Using Tile Algorithms on Multicore Architectures

Author
Publication venue: 'Hindawi Limited'
Publication date: 01/01/2010
Field of study

Crossref

Pipelining the Fast Multipole Method over a Runtime System

Author: Agullo Emmanuel
Bramas Béranger
Coulaud Olivier
Darve Eric
Messner Matthias
Toru Takahashi
Publication venue
Publication date: 01/01/2012
Field of study

Fast Multipole Methods (FMM) are a fundamental operation for the simulation of many physical problems. The high performance design of such methods usually requires to carefully tune the algorithm for both the targeted physics and the hardware. In this paper, we propose a new approach that achieves high performance across architectures. Our method consists of expressing the FMM algorithm as a task flow and employing a state-of-the-art runtime system, StarPU, in order to process the tasks on the different processing units. We carefully design the task flow, the mathematical operators, their Central Processing Unit (CPU) and Graphics Processing Unit (GPU) implementations, as well as scheduling schemes. We compute potentials and forces of 200 million particles in 48.7 seconds on a homogeneous 160 cores SGI Altix UV 100 and of 38 million particles in 13.34 seconds on a heterogeneous 12 cores Intel Nehalem processor enhanced with 3 Nvidia M2090 Fermi GPUs.Comment: No. RR-7981 (2012

arXiv.org e-Print Archive

CiteSeerX

INRIA a CCSD electronic archive server

A study of possible optimizations for the task scheduler ‘QUARK’ on the shared memory architecture

Author: Joshi Vijay Gopal
Publication venue: TRACE: Tennessee Research and Creative Exchange
Publication date: 01/05/2013
Field of study

Multicore processors are replacing most of the single core processors nowadays. Current trends show that there will be increasing numbers of cores on a single chip in the coming future. However, programming multicore processors remains bug prone and less productive. Thus, making use of a runtime to schedule tasks on multicore processor hides most of the complexities of parallel programming to improve productivity. QUARK is one of the runtimes available for the multicore processors. This work looks at identifying and solving performance bottlenecks for QUARK on the shared memory architecture. The problem of finding bottlenecks is divided into two parts, low level details and high level details. Low level details deal with issues like length of the critical section and locking mechanisms. High level details involve use of a suitable scheduling algorithm and better load balancing. We discuss the possible solutions of the bottlenecks and its impact on the overall performance

University of Tennessee, Knoxville: Trace

New Multithreaded Hybrid CPU/GPU Approach to Hartree−Fock

Author: Andrey Asadchev
Asadchev A.
Buttari A.
Davidson E. R.
Furlani T. R.
Gordon M. S.
Ishimura K.
Janssen C. L.
Mark S. Gordon
Rys J.
Turney J. M.
Ufimtsev I. S.
Ufimtsev I. S.
Wilkinson K. A.
Yasuda K.
Publication venue: Iowa State University Digital Repository
Publication date: 01/09/2012
Field of study

In this article, a new multithreaded Hartree–Fock CPU/GPU method is presented which utilizes automatically generated code and modern C++ techniques to achieve a significant improvement in memory usage and computer time. In particular, the newly implemented Rys Quadrature and Fock Matrix algorithms, implemented as a stand-alone C++ library, with C and Fortran bindings, provides up to 40% improvement over the traditional Fortran Rys Quadrature. The C++ GPU HF code provides approximately a factor of 17.5 improvement over the corresponding C++ CPU code

Digital Repository @ Iowa State University (ISU)

Crossref

Designing LU-QR hybrid solvers for performance and stability

Author: Bradley Lowery
Jack Dongarra
Julien Herrmann
Julien Langou
Mathieu Faverge
Yves Robert
Publication venue
Publication date: 21/01/2014
Field of study

Abstract—This paper introduces hybrid LU-QR algorithms for solving dense linear systems of the form Ax = b. Throughout a matrix factorization, these algorithms dynamically alternate LU with local pivoting and QR elimination steps, based upon some robustness criterion. LU elimination steps can be very efficiently parallelized, and are twice as cheap in terms of floatingpoint operations, as QR steps. However, LU steps are not necessarily stable, while QR steps are always stable. The hybrid algorithms execute a QR step when a robustness criterion detects some risk for instability, and they execute an LU step otherwise. Ideally, the choice between LU and QR steps must have a small computational overhead and must provide a satisfactory level of stability with as few QR steps as possible. In this paper, we introduce several robustness criteria and we establish upper bounds on the growth factor of the norm of the updated matrix incurred by each of these criteria. In addition, we describe the implementation of the hybrid algorithms through an extension of the PaRSEC software to allow for dynamic choices during execution. Finally, we analyze both stability and performance results compared to state-of-the-art linear solvers on parallel distributed multicore platforms. I

arXiv.org e-Print Archive

HAL-ENS-LYON

CiteSeerX

Crossref

INRIA a CCSD electronic archive server

Hal-Diderot

Task scheduling techniques for asymmetric multi-core systems

Author: Ayguadé Parra Eduard
Badia Sala Rosa Maria
Casas Marc
Chronaki Kallia
Labarta Mancho Jesús José
Moreto Planas Miquel
Rico Alejandro
Valero Cortés Mateo
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2017
Field of study

As performance and energy efficiency have become the main challenges for next-generation high-performance computing, asymmetric multi-core architectures can provide solutions to tackle these issues. Parallel programming models need to be able to suit the needs of such systems and keep on increasing the application’s portability and efficiency. This paper proposes two task scheduling approaches that target asymmetric systems. These dynamic scheduling policies reduce total execution time either by detecting the longest or the critical path of the dynamic task dependency graph of the application, or by finding the earliest executor of a task. They use dynamic scheduling and information discoverable during execution, fact that makes them implementable and functional without the need of off-line profiling. In our evaluation we compare these scheduling approaches with two existing state-of the art heterogeneous schedulers and we track their improvement over a FIFO baseline scheduler. We show that the heterogeneous schedulers improve the baseline by up to 1.45 in a real 8-core asymmetric system and up to 2.1 in a simulated 32-core asymmetric chip.This work has been supported by the Spanish Government (SEV2015-0493), by the Spanish Ministry of Science and Innovation (contract TIN2015-65316-P), by Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272), by the RoMoL ERC Advanced Grant (GA 321253) and the European HiPEAC Network of Excellence. The Mont-Blanc project receives funding from the EU’s Seventh Framework Programme (FP7/2007-2013) under grant agreement no 610402 and from the EU’s H2020 Framework Programme (H2020/2014-2020) under grant agreement no 671697. M. Moretó has been partially supported by the Ministry of Economy and Competitiveness under Juan de la Cierva postdoctoral fellowship number JCI-2012-15047. M. Casas is supported by the Secretary for Universities and Research of the Ministry of Economy and Knowledge of the Government of Catalonia and the Cofund programme of the Marie Curie Actions of the 7th R&D Framework Programme of the European Union (Contract 2013 BP B 00243).Peer ReviewedPostprint (author's final draft

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC

Digital.CSIC

Tiled QR factorization algorithms

Author: Bouwmeester Henricus
Jacquelin Mathias
Langou Julien
Robert Yves
Publication venue
Publication date: 01/01/2011
Field of study

This work revisits existing algorithms for the QR factorization of rectangular matrices composed of p-by-q tiles, where p >= q. Within this framework, we study the critical paths and performance of algorithms such as Sameh and Kuck, Modi and Clarke, Greedy, and those found within PLASMA. Although neither Modi and Clarke nor Greedy is optimal, both are shown to be asymptotically optimal for all matrices of size p = q^2 f(q), where f is any function such that \lim_{+\infty} f= 0. This novel and important complexity result applies to all matrices where p and q are proportional, p = \lambda q, with \lambda >= 1, thereby encompassing many important situations in practice (least squares). We provide an extensive set of experiments that show the superiority of the new algorithms for tall matrices

arXiv.org e-Print Archive

HAL-ENS-LYON

CiteSeerX

Crossref

INRIA a CCSD electronic archive server

Hal-Diderot

Task-based programming for Seismic Imaging: Preliminary Results

Author: Agullo Emmanuel
Boillot Lionel
Bosilca George
Calandra Henri
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/08/2014
Field of study

International audienceThe level of hardware complexity of current supercomputers is forcing the High Performance Computing (HPC) community to reconsider parallel programming paradigms and standards. The high-level of hardware abstraction provided by task-based paradigms make them excellent candidates for writing portable codes that can consistently deliver high performance across a wide range of platforms. While this paradigm has proved efficient for achieving such goals for dense and sparse linear solvers, it is yet to be demonstrated that industrial parallel codes relying on the classical Message Passing Interface (MPI) standard and that accumulate dozens of years of expertise (and countless lines of code) may be revisited to turn them into efficient task-based programs. In this paper, we study the applicability of task-based programming in the case of a Reverse Time Migration (RTM) application for Seismic Imaging. The initial MPI-based application is turned into a task-based code executed on top of the PaRSEC runtime system. Preliminary results show that the approach is competitive with (and even potentially superior to) the original MPI code on an homogenous multicore node and can exploit much more efficiently complex hardware such as a cache coherent Non Uniform Memory Access (ccNUMA) node or an Intel Xeon Phi accelerator

Crossref

INRIA a CCSD electronic archive server

HAL-Rennes 1