A hierarchically blocked Jacobi SVD algorithm for single and multiple graphics processing units
We present a hierarchically blocked one-sided Jacobi algorithm for the singular value decomposition (SVD), targeting both single and multiple graphics processing units (GPUs). The blocking structure reflects the levels of the GPU's memory hierarchy. The algorithm may outperform MAGMA's dgesvd while retaining high relative accuracy. To this end, we developed a family of parallel pivot strategies for the GPU's shared address space that are also applicable to inter-GPU communication. Unlike common hybrid approaches, in a single-GPU setting our algorithm needs the CPU for control purposes only, while utilizing the GPU's resources to the fullest extent permitted by the hardware. When required by the problem size, the algorithm, in principle, scales to an arbitrary number of GPU nodes. The scalability is demonstrated by a more than twofold speedup for sufficiently large matrices on a Tesla S2050 system with four GPUs versus a single Fermi card. (Accepted for publication in SIAM Journal on Scientific Computing.)
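As a sketch of the underlying technique (a serial toy version, not the paper's hierarchically blocked, GPU-resident implementation), one-sided Jacobi repeatedly rotates column pairs until they are mutually orthogonal; the singular values then emerge as the final column norms:

```python
import math

def one_sided_jacobi_svd(A, sweeps=30, tol=1e-12):
    """Singular values of an m x n matrix (list of rows) via one-sided
    Jacobi: rotate column pairs until all pairs are mutually orthogonal."""
    m, n = len(A), len(A[0])
    A = [row[:] for row in A]  # work on a copy
    for _ in range(sweeps):
        off = 0.0
        for p in range(n - 1):
            for q in range(p + 1, n):
                # Gram-matrix entries for columns p and q
                app = sum(A[i][p] * A[i][p] for i in range(m))
                aqq = sum(A[i][q] * A[i][q] for i in range(m))
                apq = sum(A[i][p] * A[i][q] for i in range(m))
                off = max(off, abs(apq))
                if abs(apq) < tol:
                    continue
                # Jacobi rotation that zeroes the (p, q) Gram entry
                tau = (aqq - app) / (2.0 * apq)
                t = math.copysign(1.0, tau) / (abs(tau) + math.sqrt(1.0 + tau * tau))
                c = 1.0 / math.sqrt(1.0 + t * t)
                s = c * t
                for i in range(m):
                    aip, aiq = A[i][p], A[i][q]
                    A[i][p] = c * aip - s * aiq
                    A[i][q] = s * aip + c * aiq
        if off < tol:  # all column pairs orthogonal: converged
            break
    # singular values are the final column norms, sorted descending
    return sorted((math.sqrt(sum(A[i][j] ** 2 for i in range(m)))
                   for j in range(n)), reverse=True)
```

The independence of the column-pair rotations within a sweep is exactly what the parallel pivot strategies in the paper exploit.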
Efficient GPU implementation of a Boltzmann‑Schrödinger‑Poisson solver for the simulation of nanoscale DG MOSFETs
An earlier work (81–102, 2019) describes an efficient and accurate solver for nanoscale DG MOSFETs through a deterministic Boltzmann-Schrödinger-Poisson model with seven electron–phonon scattering mechanisms on a hybrid parallel CPU/GPU platform. The transport computational phase, i.e. the time integration of the Boltzmann equations, was ported to the GPU using CUDA extensions, but the computation of the system's eigenstates, i.e. the solution of the Schrödinger-Poisson block, was parallelized only with OpenMP due to its complexity. This work fills that gap by describing a GPU port of the solver for the Schrödinger-Poisson block. This new proposal
implements on GPU a Scheduled Relaxation Jacobi method to solve the sparse linear
systems which arise in the 2D Poisson equation. The 1D Schrödinger equation
is solved on GPU by adapting a multi-section iteration and the Newton-Raphson
algorithm to approximate the energy levels, and the Inverse Power Iterative Method
is used to approximate the wave vectors. We stress that this solver for the Schrödinger-Poisson block can be regarded as a module independent of the transport (Boltzmann) phase and can be combined with solvers using different levels of description for the electrons; it is therefore of particular interest, since it can be adapted to other macroscopic, hence faster, solvers for confined devices used at an industrial level. Project PID2020-117846GB-I00 funded by the Spanish Ministerio de Ciencia e Innovación. Project A-TIC-344-UGR20 funded by the European Regional Development Fund.
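The Scheduled Relaxation Jacobi idea can be sketched as weighted Jacobi sweeps that cycle through a schedule of relaxation factors; the two-value schedule below is purely illustrative, not one of the optimized schedules used in the actual solver:

```python
def srj_solve(A, b, omegas, cycles=50):
    """Scheduled Relaxation Jacobi (sketch): weighted Jacobi sweeps
    x <- x + omega * D^{-1} (b - A x), cycling through a schedule of
    relaxation factors. Over-relaxed sweeps damp smooth error modes,
    under-relaxed sweeps damp oscillatory ones."""
    n = len(b)
    x = [0.0] * n
    for _ in range(cycles):
        for w in omegas:
            # residual r = b - A x
            r = [b[i] - sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
            # diagonal-scaled weighted update
            x = [x[i] + w * r[i] / A[i][i] for i in range(n)]
    return x

# A small 1D Poisson system (the same structure that arises from the
# 2D Poisson discretization targeted by the solver, in miniature)
A = [[2.0, -1.0, 0.0, 0.0],
     [-1.0, 2.0, -1.0, 0.0],
     [0.0, -1.0, 2.0, -1.0],
     [0.0, 0.0, -1.0, 2.0]]
x = srj_solve(A, [1.0, 0.0, 0.0, 1.0], omegas=[1.5, 0.6])
```

Like plain Jacobi, every component update within a sweep is independent, which is what makes the method attractive for a GPU port.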
Parallel methods for linear systems solution in extreme learning machines: an overview
This paper presents an updated review of parallel algorithms for solving square and rectangular, single- and double-precision linear systems on multi-core central processing units and graphics processing units. The methods for solving linear systems based on elementary operations, factorization, and iteration are briefly described. The methodology followed in this article is a documentary review of about 17 papers reported in the literature over the last five years (2016-2020). The findings demonstrate the potential of parallelism to significantly decrease extreme learning machine training times for problems with large amounts of data, given the cost of computing the Moore-Penrose pseudoinverse. Implementing parallel algorithms for the pseudoinverse can contribute significantly to applications in diverse areas, since it accelerates the training of extreme learning machines while preserving optimal results.
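The pseudoinverse step that dominates extreme learning machine training can be sketched, for a full-column-rank hidden-layer matrix H, via the normal equations (H^T H) beta = H^T T; this is the computation the reviewed papers parallelize, shown here as a serial toy with a small dense Gaussian-elimination helper:

```python
def solve(M, v):
    """Gaussian elimination with partial pivoting for small dense systems."""
    n = len(v)
    M = [row[:] + [v[i]] for i, row in enumerate(M)]  # augmented matrix
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(M[i][k]))  # pivot row
        M[k], M[p] = M[p], M[k]
        for i in range(k + 1, n):
            f = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= f * M[k][j]
    x = [0.0] * n
    for k in range(n - 1, -1, -1):  # back substitution
        x[k] = (M[k][n] - sum(M[k][j] * x[j] for j in range(k + 1, n))) / M[k][k]
    return x

def elm_output_weights(H, T):
    """Least-squares ELM output weights beta = pinv(H) T computed through
    the normal equations (valid when H has full column rank)."""
    m, n = len(H), len(H[0])
    HtH = [[sum(H[k][i] * H[k][j] for k in range(m)) for j in range(n)]
           for i in range(n)]
    HtT = [sum(H[k][i] * T[k] for k in range(m)) for i in range(n)]
    return solve(HtH, HtT)
```

The matrix products forming H^T H and H^T T are the embarrassingly parallel part, which is why GPUs pay off for large training sets.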
PHIST: a Pipelined, Hybrid-parallel Iterative Solver Toolkit
The increasing complexity of hardware and software environments in high-performance computing poses great challenges for the development of sustainable and hardware-efficient numerical software. This paper addresses these challenges in the context of sparse solvers. Existing solutions typically target sustainability, flexibility, or performance, but rarely all of them.
Our new library PHIST provides implementations of solvers for sparse linear systems and eigenvalue problems. It is a productivity
platform for performance-aware developers of algorithms and application software with abstractions that do not obscure the view on
hardware-software interaction.
The PHIST software architecture and the PHIST development process were designed to overcome shortcomings of existing packages.
An interface layer for basic sparse linear algebra functionality that can be provided by multiple backends ensures sustainability, and
PHIST supports common techniques for improving scalability and performance of algorithms such as blocking and kernel fusion.
We showcase these concepts using the PHIST implementation of a block Jacobi-Davidson solver for non-Hermitian and generalized
eigenproblems. We study its performance on a multi-core CPU, a GPU and a large-scale many-core system. Furthermore, we show
how an existing implementation of a block Krylov-Schur method in the Trilinos package Anasazi can benefit from the performance engineering techniques used in PHIST.
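The blocking idea can be illustrated with a sparse matrix applied to a block of vectors at once (a serial Python sketch of the access pattern; PHIST's actual kernels are hybrid-parallel and provided by its backends):

```python
def spmmv(vals, cols, rowptr, X):
    """Sparse matrix (CSR: vals, cols, rowptr) times a block of k
    vectors X. Each stored matrix entry is loaded once and applied to
    all k right-hand sides, raising arithmetic intensity compared with
    k separate sparse matrix-vector products."""
    n = len(rowptr) - 1
    k = len(X[0])
    Y = [[0.0] * k for _ in range(n)]
    for i in range(n):                       # for each matrix row
        for idx in range(rowptr[i], rowptr[i + 1]):
            a, j = vals[idx], cols[idx]      # one load of the entry...
            for c in range(k):               # ...reused across the block
                Y[i][c] += a * X[j][c]
    return Y

# the 2x2 matrix [[2, 0], [1, 3]] in CSR form, applied to two vectors
Y = spmmv([2.0, 1.0, 3.0], [0, 0, 1], [0, 1, 3],
          [[1.0, 0.0], [0.0, 1.0]])
```

Kernel fusion follows the same logic one level up: combining such a product with, say, a dot product into one pass avoids streaming the vectors through memory twice.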
High performance digital signal processing: Theory, design, and applications in finance
The way scientific research and business is conducted has drastically changed over the last decade. Big data and data-intensive scientific discovery are two terms that have been coined recently. They describe the tremendous amounts of noisy data, created extremely rapidly by various sensing devices and methods that need to be explored for information inference. Researchers and practitioners who can obtain meaningful information out of big data in the shortest time gain a competitive advantage. Hence, there is more need than ever for a variety of high performance computational tools for scientific and business analytics. Interest in developing efficient data processing methods like compression and noise filtering tools enabling real-time analytics of big data is increasing.
A common concern in digital signal processing applications has been the lack of fast handling of observed data. This problem has been an active research topic, addressed by progress in analytical tools allowing fast processing of big data. One particular tool is the Karhunen-Loève transform (KLT), also known as principal component analysis, in which the covariance matrix of a stochastic process is decomposed into its eigenvectors and eigenvalues to form the optimal orthonormal transform. Specifically, eigenanalysis is utilized to determine the KLT basis functions. The KLT is a widely employed signal analysis method used in applications including noise filtering of measured data and compression. However, defining the KLT basis for a given signal covariance matrix demands prohibitive computational resources in many real-world scenarios.
In this dissertation, the engineering implementation of the KLT as well as the theory of eigenanalysis for auto-regressive order one, AR(1), discrete stochastic processes are investigated and novel improvements are proposed. The new findings are applied to well-known problems in quantitative finance (QF). First, an efficient method to derive the explicit KLT kernel for AR(1) processes, utilizing a simple root-finding method for the transcendental equations, is introduced. A performance improvement over a popular numerical eigenanalysis algorithm, called divide and conquer, is shown. Second, the implementation of the parallel Jacobi algorithm for eigenanalysis on graphics processing units is improved such that access to the dynamic random access memory is entirely coalesced. The speed is improved by a factor of 68.5 by the proposed method compared to a CPU implementation for a square matrix of size 1,024. Third, several tools developed and implemented in the dissertation are applied to QF problems such as risk analysis and portfolio risk management. In addition, several topics in QF, such as price models, the Epps effect, and jump processes, are investigated and new insights are suggested from a multi-resolution (multi-rate) signal processing perspective. This dissertation is expected to contribute to a better understanding and bridging of the analytical methods in digital signal processing and applied mathematics, and to their wider utilization in the finance sector. The emerging joint research and technology development efforts in QF and financial engineering will benefit investors, bankers, and regulators in building and maintaining more robust and fair financial markets in the future.
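The eigenanalysis step behind the KLT can be sketched with power iteration on an AR(1) covariance matrix (a toy serial version; the dissertation instead derives the kernel explicitly from transcendental equations and uses a coalesced GPU Jacobi method):

```python
import math

def ar1_covariance(n, rho):
    """Covariance of a unit-variance AR(1) process: R[i][j] = rho^|i-j|."""
    return [[rho ** abs(i - j) for j in range(n)] for i in range(n)]

def dominant_klt_basis(R, iters=500):
    """First KLT basis function (dominant eigenvector of the covariance
    matrix) and its eigenvalue, via plain power iteration."""
    n = len(R)
    v = [1.0] * n
    lam = 0.0
    for _ in range(iters):
        w = [sum(R[i][j] * v[j] for j in range(n)) for i in range(n)]  # w = R v
        lam = math.sqrt(sum(x * x for x in w))   # ||R v|| -> eigenvalue
        v = [x / lam for x in w]                 # renormalize
    return lam, v

lam, v = dominant_klt_basis(ar1_covariance(2, 0.5))
```

Projecting a signal onto the leading basis vectors found this way is the compression/noise-filtering use of the KLT mentioned above.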
GHOST: Building blocks for high performance sparse linear algebra on heterogeneous systems
While many of the architectural details of future exascale-class high
performance computer systems are still a matter of intense research, there
appears to be a general consensus that they will be strongly heterogeneous,
featuring "standard" as well as "accelerated" resources. Today, such resources
are available as multicore processors, graphics processing units (GPUs), and
other accelerators such as the Intel Xeon Phi. Any software infrastructure that
claims usefulness for such environments must be able to meet their inherent
challenges: massive multi-level parallelism, topology, asynchronicity, and
abstraction. The "General, Hybrid, and Optimized Sparse Toolkit" (GHOST) is a
collection of building blocks that targets algorithms dealing with sparse
matrix representations on current and future large-scale systems. It implements
the "MPI+X" paradigm, has a pure C interface, and provides hybrid-parallel
numerical kernels, intelligent resource management, and truly heterogeneous
parallelism for multicore CPUs, Nvidia GPUs, and the Intel Xeon Phi. We
describe the details of its design with respect to the challenges posed by
modern heterogeneous supercomputers and recent algorithmic developments.
Implementation details which are indispensable for achieving high efficiency
are pointed out and their necessity is justified by performance measurements or
predictions based on performance models. The library code and several
applications are available as open source. We also provide instructions on how
to make use of GHOST in existing software packages, together with a case study
which demonstrates the applicability and performance of GHOST as a component
within a larger software stack. (32 pages, 11 figures.)
Asynchronous and Multiprecision Linear Solvers - Scalable and Fault-Tolerant Numerics for Energy Efficient High Performance Computing
Asynchronous methods minimize idle times by removing synchronization barriers, and therefore allow the efficient usage of computer systems. The implied high tolerance with respect to communication latencies improves the fault tolerance. As asynchronous methods also enable the usage of the power and energy saving mechanisms provided by the hardware, they are suitable candidates for the highly parallel and heterogeneous hardware platforms that are expected for the near future.