
    Guest editorial: Special issue on parallel matrix algorithms and applications (PMAA’16)

    This special issue of Parallel Computing contains nine articles, selected after peer review from invited and contributed presentations made at the 8th International Workshop on Parallel Matrix Algorithms and Applications (PMAA'16), which took place at the Université de Bordeaux, France, on July 6-8, 2016. The workshop attracted around 120 participants from all continents; about 25% were PhD students and around 10% came from industry. The workshop was co-chaired by Emmanuel Agullo, Peter Arbenz, Luc Giraud, and Olaf Schenk. The members of the program committee were: P. D'Ambra, H. … A total of twelve high-quality submissions were received, of which nine were eventually accepted to appear in this special issue. The nine papers address diverse aspects of linear algebra and high-performance computing.

    1. Jack Dongarra, Mark Gates, and Stanimire Tomov address accelerating the two-stage SVD reduction and divide-and-conquer using GPUs. The increasing gap between memory bandwidth and computation speed motivates the choice of algorithms that take full advantage of today's high-performance computers. For dense matrices, the classic SVD algorithm uses a one-stage reduction to bidiagonal form, whose performance is limited by memory bandwidth. To overcome this limitation, a two-stage reduction to bidiagonal form has been gaining popularity. As accelerators such as GPUs and co-processors become increasingly widespread in high-performance computing, the authors present an accelerated SVD employing a two-stage reduction to bidiagonal form, as well as a parallelized and accelerated divide-and-conquer algorithm for the subsequent bidiagonal SVD. The new implementation provides a significant speedup compared to existing multi-core and GPU-based SVD implementations.
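    The divide-and-conquer bidiagonal SVD accelerated here is the same strategy behind LAPACK's gesdd driver. As a minimal illustration of the driver choice (SciPy's standard API, not the authors' GPU implementation), both the QR-iteration and divide-and-conquer paths can be selected explicitly:

        import numpy as np
        from scipy.linalg import svd

        rng = np.random.default_rng(0)
        A = rng.standard_normal((2000, 500))

        # 'gesdd' uses the divide-and-conquer bidiagonal SVD (typically faster);
        # 'gesvd' uses implicit-shift QR iteration on the bidiagonal form.
        U_dc, s_dc, Vt_dc = svd(A, full_matrices=False, lapack_driver='gesdd')
        U_qr, s_qr, Vt_qr = svd(A, full_matrices=False, lapack_driver='gesvd')

        # Both drivers first reduce A to bidiagonal form; the singular values agree.
        assert np.allclose(s_dc, s_qr)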

    A Parallel Structured Divide-and-Conquer Algorithm for Symmetric Tridiagonal Eigenvalue Problems

    In this article, a parallel structured divide-and-conquer (PSDC) eigensolver is proposed for symmetric tridiagonal matrices, based on ScaLAPACK and a parallel structured matrix multiplication algorithm called PSMMA. Computing the eigenvectors via matrix-matrix multiplications is the most computationally expensive part of the divide-and-conquer algorithm, and one of the matrices involved in such multiplications is a rank-structured Cauchy-like matrix. By exploiting this property, PSMMA constructs the local matrices from the generators of the Cauchy-like matrix without any communication, and further reduces the computation cost by using a structured low-rank approximation algorithm. Thus, both communication and computation costs are reduced. Experimental results show that both PSMMA and PSDC are highly scalable, scaling to at least 4096 processes. PSDC has better scalability than PHDC, which was proposed in [16] and only scaled to 300 processes for the same matrices. Compared with PDSTEDC in ScaLAPACK, PSDC is always faster and achieves 1.4x-1.6x speedup for some matrices with few deflations. PSDC is also comparable with ELPA: PSDC is faster when using few processes and slightly slower when using many processes.

    The authors thank the referees for their valuable comments, which greatly improved the presentation of this article. This work was supported by the National Natural Science Foundation of China (No. NNW2019ZT6-B20, NNW2019ZT6-B21, NNW2019ZT5-A10, U1611261, 61872392, and U1811461), the National Key R&D Program of China (2018YFB0204303), the NSF of Hunan (No. 2019JJ40339), the NSF of NUDT (No. ZK18-03-01), the Guangdong Natural Science Foundation (2018B030312002), and the Program for Guangdong Introducing Innovative and Entrepreneurial Teams under Grant 2016ZT06D211. The work of Jose E. Roman was supported by the Spanish Agencia Estatal de Investigación (AEI) under project SLEPc-DA (PID2019-107379RB-I00).

    Liao, X.; Li, S.; Lu, Y.; Román Moltó, J. E. (2021). A Parallel Structured Divide-and-Conquer Algorithm for Symmetric Tridiagonal Eigenvalue Problems. IEEE Transactions on Parallel and Distributed Systems, 32(2), 367-378. https://doi.org/10.1109/TPDS.2020.3019471
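    In the divide-and-conquer algorithm, the eigenvector matrix of the rank-one-modified diagonal subproblem has entries of the form z_i / (d_i - λ_j) (up to column scaling), which is exactly the Cauchy-like structure PSMMA exploits: any local block can be rebuilt on a process from the short generator vectors d, z, and λ instead of being communicated. A minimal sketch of that idea (illustrative only; these names and this code are ours, not the PSMMA implementation):

        import numpy as np

        def cauchy_like_block(z, d, lam, rows, cols):
            """Build the local block Q[rows, cols] with Q[i, j] = z[i] / (d[i] - lam[j])
            directly from the generators (z, d, lam); no matrix data is communicated."""
            zi = z[rows, None]      # numerator generators for the selected rows
            di = d[rows, None]      # poles for the selected rows
            lj = lam[None, cols]    # shifts for the selected columns
            return zi / (di - lj)

        rng = np.random.default_rng(1)
        n = 8
        d = np.sort(rng.uniform(0.0, 1.0, n))    # diagonal entries of the subproblem
        lam = d + rng.uniform(0.01, 0.1, n)      # stand-in eigenvalue approximations
        z = rng.standard_normal(n)

        # Each process would form only its own block from the O(n) generators.
        Q_full = cauchy_like_block(z, d, lam, np.arange(n), np.arange(n))
        Q_local = cauchy_like_block(z, d, lam, np.arange(0, 4), np.arange(4, 8))
        assert np.allclose(Q_full[0:4, 4:8], Q_local)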

    Divide and Conquer Symmetric Tridiagonal Eigensolver for Multicore Architectures

    Computing eigenpairs of a symmetric matrix is a problem arising in many industrial applications, including quantum physics and finite-element computations for automobiles. A classical approach is to reduce the matrix to tridiagonal form before computing the eigenpairs of the tridiagonal matrix; a back-transformation then yields the final solution. Parallelism issues of the reduction stage have already been tackled in different shared-memory libraries. In this article, we focus on solving the tridiagonal eigenproblem and describe a novel implementation of the divide-and-conquer algorithm. The algorithm is expressed as a sequential task flow, scheduled in an out-of-order fashion by a dynamic runtime that lets the programmer tune task granularity. The resulting implementation is between two and five times faster than the equivalent routine from the Intel MKL library, and outperforms the best MRRR implementation for many matrices.
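    The divide step such solvers parallelize writes a symmetric tridiagonal T as a block-diagonal pair of smaller tridiagonals plus a rank-one correction, T = diag(T1, T2) + beta * u * u^T; the subproblems are solved recursively and merged through a secular equation. A small NumPy sketch of just the splitting identity (illustrative, not the task-flow implementation described above):

        import numpy as np

        def split_tridiagonal(d, e, m):
            """Split T(d, e) at index m: T = diag(T1, T2) + beta * outer(u, u),
            where beta = e[m] and u has ones at positions m and m + 1."""
            beta = e[m]
            d1, e1 = d[:m + 1].copy(), e[:m].copy()
            d2, e2 = d[m + 1:].copy(), e[m + 1:].copy()
            d1[-1] -= beta          # absorb the coupling into the last diagonal of T1
            d2[0] -= beta           # ... and into the first diagonal of T2
            return (d1, e1), (d2, e2), beta

        def dense_tridiag(d, e):
            return np.diag(d) + np.diag(e, 1) + np.diag(e, -1)

        d = np.array([4.0, 3.0, 5.0, 2.0, 6.0])
        e = np.array([1.0, 0.5, 0.25, 0.75])
        (d1, e1), (d2, e2), beta = split_tridiagonal(d, e, m=2)

        n1 = len(d1)
        u = np.zeros(len(d)); u[n1 - 1] = u[n1] = 1.0
        T_rebuilt = np.zeros((len(d), len(d)))
        T_rebuilt[:n1, :n1] = dense_tridiag(d1, e1)
        T_rebuilt[n1:, n1:] = dense_tridiag(d2, e2)
        T_rebuilt += beta * np.outer(u, u)
        assert np.allclose(T_rebuilt, dense_tridiag(d, e))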

    MRRR-based Eigensolvers for Multi-core Processors and Supercomputers

    The real symmetric tridiagonal eigenproblem is of outstanding importance in numerical computations; it arises frequently as part of eigensolvers for standard and generalized dense Hermitian eigenproblems that are based on a reduction to tridiagonal form. For its solution, the algorithm of Multiple Relatively Robust Representations (MRRR, or MR³ for short), introduced in the late 1990s, is among the fastest methods. To compute k eigenpairs of a real n-by-n tridiagonal matrix T, MRRR requires only O(kn) arithmetic operations; in contrast, all other practical methods require O(k²n) or O(n³) operations in the worst case. This PhD thesis centers on the performance and accuracy of MRRR.
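    SciPy exposes the LAPACK MRRR kernel (stemr) through scipy.linalg.eigh_tridiagonal, which makes the O(kn) subset computation easy to try. A small sketch (assuming a recent SciPy; this calls the stock LAPACK routine, not the thesis's own implementations):

        import numpy as np
        from scipy.linalg import eigh_tridiagonal

        rng = np.random.default_rng(2)
        n = 1000
        d = rng.standard_normal(n)        # diagonal of T
        e = rng.standard_normal(n - 1)    # off-diagonal of T

        # MRRR ('stemr') computes eigenpairs in O(kn) operations; here all n of them.
        w, V = eigh_tridiagonal(d, e, lapack_driver='stemr')

        # Only k selected eigenpairs, e.g. the 10 smallest, by index range.
        k = 10
        wk, Vk = eigh_tridiagonal(d, e, select='i', select_range=(0, k - 1))
        assert np.allclose(w[:k], wk)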

    Fast Algorithm Development for SVD: Applications in Pattern Matching and Fault Diagnosis

    The project aims for fast detection and diagnosis of faults occurring in process plants by designing a low-cost FPGA module for the computation. Fast detection and diagnosis, while the process is still operating in a controllable region, helps avoid further advancement of the fault and reduces the productivity loss. Model-based methods are not popular in the domain of process control, as obtaining an accurate model is expensive and requires expertise. Data-driven methods like Principal Component Analysis (PCA) are quite popular diagnostic methods for process plants, as they do not require any model. PCA is a widely used tool for dimensionality reduction, and thus for reducing the computational effort. The trends are captured in principal components, as it is difficult to have the same amount of disturbance as simulated in the historical database.

    The historical database has multiple instances of various kinds of faults and disturbances, along with normal operation. A moving-window approach has been employed to detect similar instances in the historical database based on the standard PCA similarity factor. The measurements of the variables of interest over a certain period of time form the snapshot dataset S. At each instant, a window of the same size as the snapshot dataset is picked from the historical database, forming the historical window H. The two datasets are then compared using similarity factors such as the standard PCA similarity factor, which quantifies the angular difference between the principal components of the two datasets (a sketch of this similarity factor follows below). Since many operating conditions are quite similar to each other and a significant number of misclassifications have been observed, a candidate pool is formed that ranks the historical data windows by similarity factor. Based on the operation detected most often among the top-ranked windows, the operating personnel take the necessary action.

    The Tennessee Eastman Challenge process has been chosen as an initial case study for evaluating the performance. The measurements are sampled every minute, and the fault having the smallest maximum duration lasts 8 hours. Hence the snapshot window size m has been chosen to be 500 samples, i.e., 8.33 hours of the most recent data for all 52 variables. Ideally, the moving window should replace the oldest sample with a new one; it would then take approximately as many comparisons as the size of the historical database. The historical database holds 4.32 million measurements (the past 8 years of data) for each of the 52 variables. With a software simulation in MATLAB, sweeping through the whole 4.32-million-sample historical database takes around 80-100 minutes. Since most of the computation is spent finding the principal components of the two datasets using SVD, a hardware design has to be incorporated to accelerate the pattern-matching approach.
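    The standard PCA similarity factor compared here can be written as S_PCA = trace(L_S^T L_H L_H^T L_S) / k, where L_S and L_H hold the top-k principal directions of the snapshot and historical windows; it equals the average squared cosine of the angles between the two subspaces. A minimal NumPy sketch (illustrative; the variable names are ours, not the thesis's):

        import numpy as np

        def top_k_directions(X, k):
            """Top-k principal directions (right singular vectors) of a centered window."""
            Xc = X - X.mean(axis=0)
            _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
            return Vt[:k].T                  # columns = top-k principal directions

        def pca_similarity(S, H, k):
            """Standard PCA similarity factor: mean squared cosine between subspaces."""
            Ls = top_k_directions(S, k)
            Lh = top_k_directions(H, k)
            M = Ls.T @ Lh
            return np.trace(M @ M.T) / k     # 1.0 means identical k-dim subspaces

        rng = np.random.default_rng(3)
        S = rng.standard_normal((500, 52))           # snapshot: 500 samples, 52 variables
        H = S + 0.1 * rng.standard_normal(S.shape)   # a similar historical window
        print(pca_similarity(S, H, k=5))             # close to 1 for similar operation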
    The thesis is organized as follows. Chapter 1 describes the moving-window approach and the various similarity factors and metrics used for pattern matching. The previous work by Ashish Singhal is based on skipping a few samples to reduce the computational effort, and employs windows as large as 5761 samples, i.e., four days of snapshot. Instead, a new method is proposed that skips samples when the similarity factor is quite low. A simplified form of the standard PCA similarity factor is also proposed, without any trade-off in accuracy. Pre-computation over the historical database could be done as well, since the data is available a priori, but it entails a large memory requirement, with most of the time spent in read/write operations; the large memory requirement is due to the fact that every sample gives rise to a 52x35 matrix, assuming the top 35 PCs are sufficient to capture the variance of the dataset.

    Chapter 2 describes various popular algorithms for SVD. Algorithms apart from Jacobi methods, such as the Golub-Kahan and divide-and-conquer SVD algorithms, are briefly discussed. While bidiagonalization methods are very accurate, they suffer from large latency and are computationally intensive. On the other hand, Jacobi methods are computationally inexpensive and parallelizable, thus reducing the latency. The performance of the proposed hybrid Golub-Kahan-Jacobi algorithm on our application is also evaluated.

    Chapter 3 describes the basic building block, CORDIC, which is used to perform the rotations required by Jacobi methods and the n-D Householder reflections of Golub-Kahan SVD. CORDIC is widely employed in hardware design for computing trigonometric, exponential, or logarithmic functions, as it makes use of simple shift and add/subtract operations. Two modes of CORDIC, rotation mode and vectoring mode, are discussed; they are used in the derivation of the two-sided Jacobi SVD (a rotation-mode sketch appears after this outline).

    Chapter 4 describes the Jacobi methods for SVD, which are quite popular in hardware implementations as they are amenable to parallel computation. Two variants, the one-sided and two-sided Jacobi methods, are briefly discussed, and the two-sided Jacobi method making use of CORDIC is derived. The systolic-array implementation, popular in hardware designs for the past three decades, is also discussed.

    Chapter 5 deals with the hardware implementation of pattern matching and surveys the literature on various architectures developed for computing the SVD. The Xilinx ZC7020 has been chosen as the target FPGA device, as it is inexpensive and has many built-in peripherals. Latency reports with both Vivado HLS and Vivado SDSoC are given for the application of interest. Evaluation of other case studies and of other data-driven methods similar to PCA, such as Correspondence Analysis (CA) and Independent Component Analysis (ICA), the development of an efficient hybrid method for computing the SVD in hardware and of a highly discriminating similarity factor, and the extension of CORDIC to n dimensions for Householder reflections are considered for future research.
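    In rotation mode, CORDIC drives an input angle to zero through a fixed sequence of shift-add micro-rotations by arctan(2^-i), accumulating the rotation of the vector; after correcting by the constant gain it yields cos/sin, and hence the plane rotations Jacobi SVD needs. A small fixed-iteration sketch in floating point (hardware versions use fixed point and a precomputed gain; this is illustrative only):

        import math

        def cordic_rotate(x, y, angle, iters=32):
            """Rotation-mode CORDIC: rotate (x, y) by `angle` (|angle| < ~1.74 rad)
            using only shift-and-add style updates, then undo the CORDIC gain."""
            atan_table = [math.atan(2.0 ** -i) for i in range(iters)]
            K = 1.0
            z = angle
            for i in range(iters):
                d = 1.0 if z >= 0 else -1.0   # steer to drive the residual angle to zero
                x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
                z -= d * atan_table[i]
                K *= math.cos(atan_table[i])  # accumulated gain correction factor
            return x * K, y * K

        # Rotating (1, 0) by theta yields (cos theta, sin theta).
        c, s = cordic_rotate(1.0, 0.0, 0.7)
        assert abs(c - math.cos(0.7)) < 1e-6 and abs(s - math.sin(0.7)) < 1e-6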

    PACF: A precision-adjustable computational framework for solving singular values

    Singular value decomposition (SVD) plays a significant role in matrix analysis, and the differential quotient-difference with shifts (DQDS) algorithm is an important technique for computing the singular values of upper bidiagonal matrices. However, ill-conditioned and large-scale matrices may cause inaccurate results or long computation times when solving for singular values, and it is difficult for users to effectively find the solution that matches their needs. In this paper, we design a precision-adjustable computational framework for solving singular values, named PACF. In our framework, the same solution algorithm comes in three variants: an original mode, a high-precision mode, and a mixed-precision mode. The first is the original version of the algorithm; the second is a reliable numerical algorithm we designed using error-free transformation (EFT) techniques; the last is an efficient numerical algorithm we developed using mixed-precision ideas. PACF can accommodate different solution algorithms for different types of matrices, making it universal and extensible, and users can choose among the algorithms according to their needs. This paper implements the high-precision DQDS and mixed-precision DQDS algorithms and conducts extensive experiments on a supercomputing platform, demonstrating that our algorithms are reliable and efficient. In addition, we present an error analysis of the inner loop of the DQDS and HDQDS algorithms.
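    One dqds transform maps the squared bidiagonal data {q_k, e_k} (q_k = b_k², e_k = c_k² for diagonal b and superdiagonal c) to a new pair with the shift subtracted; iterating drives the e_k to zero, so that sqrt(q_k) converge to the singular values. A bare-bones zero-shift version of the classical Fernando-Parlett recurrence (without the shift strategy, deflation, or the EFT/mixed-precision refinements of PACF) might look like:

        import numpy as np

        def dqds_sweep(q, e, shift=0.0):
            """One differential qd transform with shift (Fernando-Parlett recurrence)."""
            n = len(q)
            q_new, e_new = np.empty(n), np.empty(n - 1)
            d = q[0] - shift
            for k in range(n - 1):
                q_new[k] = d + e[k]
                t = q[k + 1] / q_new[k]
                e_new[k] = e[k] * t
                d = d * t - shift
            q_new[-1] = d
            return q_new, e_new

        # Upper bidiagonal matrix B with diagonal b and superdiagonal c.
        b = np.array([3.0, 2.5, 1.5, 0.5])
        c = np.array([1.0, 0.8, 0.3])
        q, e = b**2, c**2
        for _ in range(200):              # zero-shift sweeps: e -> 0 linearly
            q, e = dqds_sweep(q, e)

        B = np.diag(b) + np.diag(c, 1)
        assert np.allclose(np.sort(np.sqrt(q)),
                           np.sort(np.linalg.svd(B, compute_uv=False)))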