6 research outputs found

    Solving Dense Generalized Eigenproblems on Multi-threaded Architectures

    Get PDF
    We compare two approaches to compute a fraction of the spectrum of dense symmetric definite generalized eigenproblems: one based on the reduction to tridiagonal form, and the other on Krylov-subspace iteration. Two large-scale applications, arising in molecular dynamics and materials science, are employed to investigate the contributions of the application, the architecture, and the parallelism of the method to the performance of the solvers. The experimental results on a state-of-the-art 8-core platform, equipped with a graphics processing unit (GPU), reveal that in realistic applications, iterative Krylov-subspace methods can be a competitive approach even for the solution of dense problems.
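    As a concrete illustration of the reduction-based route (not the solvers evaluated in the paper), the sketch below requests a slice of the spectrum of a small symmetric definite pencil A x = λ B x through LAPACKE's dsygvx driver; the matrices, their size, and the eigenvalue index range are placeholder values.

```c
/* Sketch: reduction-based solution of A x = lambda B x for a slice of the
 * spectrum via LAPACKE_dsygvx. Matrix sizes and contents are placeholders.
 * Compile with: gcc eig.c -llapacke -llapack -lblas -lm */
#include <stdio.h>
#include <lapacke.h>

int main(void) {
    const lapack_int n = 4, il = 1, iu = 2;    /* request the 2 smallest eigenpairs */
    double A[16] = { 4,1,0,0,  1,4,1,0,  0,1,4,1,  0,0,1,4 };   /* symmetric */
    double B[16] = { 2,0,0,0,  0,2,0,0,  0,0,2,0,  0,0,0,2 };   /* SPD       */
    double w[4], Z[4 * 2];
    lapack_int m, ifail[4];

    lapack_int info = LAPACKE_dsygvx(LAPACK_ROW_MAJOR, 1, 'V', 'I', 'U', n,
                                     A, n, B, n, 0.0, 0.0, il, iu,
                                     0.0 /* default abstol */, &m, w, Z,
                                     iu - il + 1, ifail);
    if (info != 0) { fprintf(stderr, "dsygvx failed: %d\n", (int)info); return 1; }
    for (lapack_int i = 0; i < m; ++i)
        printf("lambda[%d] = %f\n", (int)i, w[i]);
    return 0;
}
```

    The Krylov-subspace alternative would instead hand A and B to an iterative eigensolver, which only needs matrix-vector (or solve) operations rather than an explicit reduction of the dense pencil.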

    A new generation of task-parallel algorithms for matrix inversion in many-threaded CPUs

    Get PDF
    We take advantage of the new tasking features in OpenMP to propose advanced task-parallel algorithms for the inversion of dense matrices via Gauss-Jordan elimination. Our algorithms perform a partitioning of the matrix operand into two levels of tasks: the matrix is first divided vertically, by column blocks (or panels), in order to accommodate the standard partial pivoting scheme that ensures the numerical stability of the method. In addition, depending on the particular kernel to be applied, each panel is partitioned either horizontally by row blocks (tiles) or vertically by µ-panels (of columns), in order to extract sufficient task parallelism to feed a many-threaded general-purpose processor (CPU). The results of the experimental evaluation show the performance benefits of the advanced tasking algorithms on an Intel Xeon Gold processor with 20 cores. This research was sponsored by projects RTI2018-093684-B-I00 and TIN2017-82972-R of Ministerio de Ciencia, Innovación y Universidades; project S2018/TCS-4423 of Comunidad de Madrid; and project PR65/19-22445 of Universidad Complutense de Madrid.
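    The snippet below is a minimal sketch of the two-level tasking pattern described above, written with standard OpenMP tasks: an outer sequential sweep over column panels and, inside each panel, one task per row tile. The kernel is a placeholder rather than the actual Gauss-Jordan update, and the block sizes are illustrative.

```c
/* Sketch of two-level OpenMP tasking: an outer partition into column panels
 * and an inner partition of each panel into row tiles. The kernel below is a
 * stand-in (it just scales entries), not the Gauss-Jordan update of the paper.
 * Compile with: gcc -fopenmp two_level.c */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N   1024      /* matrix dimension (placeholder) */
#define NB  128       /* panel width                    */
#define MB  64        /* tile height inside a panel     */

static void tile_kernel(double *A, int row, int col, int rows, int cols) {
    for (int i = row; i < row + rows; ++i)
        for (int j = col; j < col + cols; ++j)
            A[i * N + j] *= 0.5;            /* placeholder update */
}

int main(void) {
    double *A = malloc((size_t)N * N * sizeof *A);
    for (size_t i = 0; i < (size_t)N * N; ++i) A[i] = 1.0;

    #pragma omp parallel
    #pragma omp single
    for (int col = 0; col < N; col += NB) {          /* outer level: panels */
        for (int row = 0; row < N; row += MB) {      /* inner level: tiles  */
            #pragma omp task firstprivate(row, col)
            tile_kernel(A, row, col, MB, NB);
        }
        #pragma omp taskwait   /* panels processed in order, tiles in parallel */
    }

    printf("A[0] = %f\n", A[0]);
    free(A);
    return 0;
}
```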

    Programming parallel dense matrix factorizations and inversion for new-generation NUMA architectures

    Get PDF
    We propose a methodology to address the programmability issues derived from the emergence of new-generation shared-memory NUMA architectures. For this purpose, we employ dense matrix factorizations and matrix inversion (DMFI) as a use case, and we target two modern architectures (AMD Rome and Huawei Kunpeng 920) that exhibit configurable NUMA topologies. Our methodology pursues performance portability across different NUMA configurations by proposing multi-domain implementations for DMFI plus a hybrid task- and loop-level parallelization that configures multi-threaded executions to fix core-to-data binding, exploiting locality at the expense of minor code modifications. In addition, we introduce a generalization of the multi-domain implementations for DMFI that offers support for virtually any NUMA topology in present and future architectures. Our experimentation on the two target architectures for three representative dense linear algebra operations validates the proposal, reveals insights on the necessity of adapting both the codes and their execution to improve data access locality, and reports performance across architectures and inter- and intra-socket NUMA configurations competitive with state-of-the-art message-passing implementations, maintaining the ease of development usually associated with shared-memory programming. This research was sponsored by project PID2019-107255GB of Ministerio de Ciencia, Innovación y Universidades; project S2018/TCS-4423 of Comunidad de Madrid; project 2017-SGR-1414 of the Generalitat de Catalunya; and the Madrid Government under the Multiannual Agreement with UCM in the line Program to Stimulate Research for Young Doctors in the context of the V PRICIT, project PR65/19-22445. This project has also received funding from the European High-Performance Computing Joint Undertaking (JU) under grant agreement No 955558. The JU receives support from the European Union’s Horizon 2020 research and innovation programme and from Spain, Germany, France, Italy, Poland, Switzerland, and Norway. The work is also supported by grants PID2020-113656RB-C22 and PID2021-126576NB-I00 of MCIN/AEI/10.13039/501100011033 and by ERDF "A way of making Europe".
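    As a minimal sketch of the core-to-data binding idea (not the DMFI codes themselves), the snippet below combines OpenMP thread binding with first-touch allocation so that each thread's block of data is paged into its own NUMA domain; the block size and the dummy computation are placeholders.

```c
/* Sketch of NUMA-aware data placement via first-touch allocation plus a fixed
 * core-to-data binding. Each thread initializes (and later works on) its own
 * block, so pages land in the NUMA domain of the core that touches them.
 * Run, e.g., with: OMP_PLACES=cores OMP_PROC_BIND=spread ./a.out */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N (1 << 24)   /* total number of doubles (placeholder) */

int main(void) {
    double *x = malloc((size_t)N * sizeof *x);

    #pragma omp parallel proc_bind(spread)
    {
        int t  = omp_get_thread_num();
        int nt = omp_get_num_threads();
        size_t chunk = ((size_t)N + nt - 1) / nt;
        size_t lo = (size_t)t * chunk;
        size_t hi = lo + chunk > (size_t)N ? (size_t)N : lo + chunk;

        /* First touch: the thread that will use a block is the one that pages it in. */
        for (size_t i = lo; i < hi; ++i) x[i] = 1.0;

        #pragma omp barrier
        double s = 0.0;
        for (size_t i = lo; i < hi; ++i) s += 2.0 * x[i];   /* local access */
        #pragma omp critical
        printf("thread %d on place %d: partial sum %.0f\n",
               t, omp_get_place_num(), s);
    }
    free(x);
    return 0;
}
```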

    Energy efficiency optimization of task-parallel codes on asymmetric architectures

    No full text
    We present a family of policies that, integrated within a runtime task scheduler (Nanox), pursue the goal of improving the energy efficiency of task-parallel executions with no intervention from the programmer. The proposed policies tackle the problem by modifying the core operating frequency via DVFS mechanisms, or by enabling/disabling the mapping of tasks to specific cores at selected execution points, depending on the internal status of the scheduler. Experimental results on an asymmetric SoC (Exynos 5422) and for a specific operation (Cholesky factorization) reveal gains of up to 29% in terms of energy efficiency and considerable reductions in average power.
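    For illustration only, the sketch below shows the DVFS knob that such runtime policies ultimately rely on under Linux: capping the frequency of selected cores through the cpufreq sysfs files. It is not part of the Nanox scheduler; the core ids and target frequency are placeholders, and writing these files normally requires root privileges.

```c
/* Sketch of the DVFS mechanism a runtime policy could use on Linux: capping a
 * core's frequency through the cpufreq sysfs interface. This is only the knob,
 * not the Nanox scheduling policies of the paper. */
#include <stdio.h>

/* Set an upper frequency bound (in kHz) for one core. Returns 0 on success. */
static int cap_core_freq(int core, long khz) {
    char path[128];
    snprintf(path, sizeof path,
             "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_max_freq", core);
    FILE *f = fopen(path, "w");
    if (!f) { perror(path); return -1; }
    fprintf(f, "%ld\n", khz);
    return fclose(f);
}

int main(void) {
    /* Example (placeholder): throttle cores 0-3 of a big.LITTLE SoC to 800 MHz. */
    for (int core = 0; core < 4; ++core)
        if (cap_core_freq(core, 800000) != 0)
            fprintf(stderr, "could not throttle core %d\n", core);
    return 0;
}
```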

    Fine‐grain task‐parallel algorithms for matrix factorizations and inversion on many‐threaded CPUs

    Get PDF
    We extend a two-level task partitioning previously applied to the inversion of dense matrices via Gauss–Jordan elimination to the more challenging QR factorization, as well as to the initial orthogonal reduction to band form found in the singular value decomposition. Our new task-parallel algorithms leverage the tasking mechanism currently available in OpenMP to exploit “nested” task parallelism, with a first outer level that operates on matrix panels and a second inner level that processes the matrix either by µ-panels or by tiles, in order to expose a large number of independent tasks. We present a detailed performance analysis, including execution traces, which shows that the two-level refinement into fine-grain tasks allows for improved load balancing and delivers high performance on current general-purpose many-core processors (CPUs) from Intel and AMD. This research was sponsored by projects RTI2018-093684-B-I00, PID2019-107255GB and TIN2017-82972-R of Ministerio de Ciencia, Innovación y Universidades; project S2018/TCS-4423 of Comunidad de Madrid; project 2017-SGR-1414 of the Generalitat de Catalunya; and the Madrid Government under the Multiannual Agreement with UCM in the line Program to Stimulate Research for Young Doctors in the context of the V PRICIT, project PR65/19-22445.
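    To complement the panel/tile description above, here is a simplified skeleton of a right-looking blocked factorization expressed with OpenMP task dependences: one panel task per step followed by one trailing-update task per tile, so updates from step k can overlap with later panels. The kernels are empty placeholders, not the QR or band-reduction routines of the paper.

```c
/* Skeleton of a right-looking blocked factorization with OpenMP task
 * dependences. The kernels are placeholders. Compile with: gcc -fopenmp skel.c */
#include <stdlib.h>
#include <omp.h>

#define NT 8          /* number of block rows/columns (placeholder) */
#define B  256        /* tile size (placeholder)                    */

static void panel(double *blk)               { (void)blk;             /* factorize panel  */ }
static void update(double *pnl, double *blk) { (void)pnl; (void)blk;  /* trailing update  */ }

int main(void) {
    /* NT x NT grid of B x B tiles, stored tile by tile. */
    double (*A)[B * B] = calloc((size_t)NT * NT, sizeof *A);
    #define TILE(i, j) A[(i) * NT + (j)]

    #pragma omp parallel
    #pragma omp single
    for (int k = 0; k < NT; ++k) {
        #pragma omp task depend(inout: TILE(k, k))
        panel(TILE(k, k));

        for (int j = k + 1; j < NT; ++j) {
            #pragma omp task depend(in: TILE(k, k)) depend(inout: TILE(k, j))
            update(TILE(k, k), TILE(k, j));
        }
    }

    free(A);
    return 0;
}
```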

    Extending OpenMP to survive the heterogeneous multi-core era

    No full text
    This paper advances the state of the art in programming models for exploiting task-level parallelism on heterogeneous many-core systems, presenting a number of extensions to the OpenMP language inspired by the StarSs programming model. The proposed extensions allow the programmer to write portable code easily for a number of different platforms, relieving them from developing the specific code to offload tasks to the accelerators and to synchronize those tasks. Our results, obtained from the StarSs instantiations for SMPs, the Cell, and GPUs, report reasonable parallel performance. However, the real impact of our approach lies in the productivity gains it yields for the programmer.
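    Standard OpenMP has since absorbed much of this functionality through the target constructs with task dependences. The snippet below illustrates that present-day syntax purely as a point of comparison; it is not the StarSs-inspired extensions proposed in the paper, and the compiler flags in the comment are one possible toolchain.

```c
/* Illustration in standard OpenMP (4.5+) of the kind of annotation-driven
 * offloading the StarSs-inspired extensions aimed at: the programmer marks
 * tasks and their data, and the runtime handles transfers and synchronization.
 * Compile with an offloading-capable compiler, e.g.
 *   clang -fopenmp -fopenmp-targets=nvptx64 offload.c
 * (the region falls back to the host otherwise). */
#include <stdio.h>

#define N 1000000

int main(void) {
    static float x[N], y[N];
    for (int i = 0; i < N; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    /* Asynchronous offloaded task: axpy on the device (or host fallback). */
    #pragma omp target teams distribute parallel for \
            map(to: x[0:N]) map(tofrom: y[0:N]) nowait depend(out: y)
    for (int i = 0; i < N; ++i)
        y[i] += 2.0f * x[i];

    /* ... further host tasks could be ordered after it with depend(in: y) ... */

    #pragma omp taskwait   /* wait for the deferred target task */
    printf("y[0] = %f\n", y[0]);
    return 0;
}
```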