
651 research outputs found

Performance Analysis of a Novel GPU Computation-to-core Mapping Scheme for Robust Facet Image Modeling

Author: Cao Yong
Park Seung In
Quek Francis
Watson Layne T.
Publication date: 01/01/2012

Though the GPGPU concept is well known in image processing, much more work remains to be done to fully exploit GPUs as an alternative computation engine. This paper investigates computation-to-core mapping strategies to probe the efficiency and scalability of the robust facet image modeling algorithm on GPUs. Our fine-grained computation-to-core mapping scheme shows a significant performance gain over the standard pixel-wise mapping scheme. With in-depth performance comparisons across the two mapping schemes, we analyze the impact of the level of parallelism on the GPU computation and suggest two principles for optimizing future image processing applications on the GPU platform.

Computer Science Technical Reports @Virginia Tech

A bibliography on parallel and vector numerical algorithms

Author: Ortega J. M.
Voigt R. G.

This is a bibliography of numerical methods. It also includes a number of other references on machine architecture, programming languages, and other topics of interest to scientific computing. Certain conference proceedings and anthologies that have been published in book form are also listed.

NASA Technical Reports Server

A hierarchically blocked Jacobi SVD algorithm for single and multiple graphics processing units

Author: Novaković Vedran
Publication venue: Society for Industrial & Applied Mathematics (SIAM)
Publication date: 27/09/2014

We present a hierarchically blocked one-sided Jacobi algorithm for the singular value decomposition (SVD), targeting both single and multiple graphics processing units (GPUs). The blocking structure reflects the levels of the GPU's memory hierarchy. The algorithm may outperform MAGMA's dgesvd, while retaining high relative accuracy. To this end, we developed a family of parallel pivot strategies on the GPU's shared address space, but applicable also to inter-GPU communication. Unlike common hybrid approaches, our algorithm in a single GPU setting needs a CPU for the controlling purposes only, while utilizing the GPU's resources to the fullest extent permitted by the hardware. When required by the problem size, the algorithm, in principle, scales to an arbitrary number of GPU nodes. The scalability is demonstrated by more than twofold speedup for sufficiently large matrices on a Tesla S2050 system with four GPUs vs. a single Fermi card.

Comment: Accepted for publication in SIAM Journal on Scientific Computing

arXiv.org e-Print Archive

CiteSeerX

Design and analysis of numerical algorithms for the solution of linear systems on parallel and distributed architectures

Author: Rosni Abdullah (7169939)
Publication date: 01/01/1997

The increasing availability of parallel computers is having a very significant impact on all aspects of scientific computation, including algorithm research and software development in numerical linear algebra. In particular, the solution of linear systems, which lies at the heart of most calculations in scientific computing, is an important computation found in many engineering and scientific applications. In this thesis, well-known parallel algorithms for the solution of linear systems are compared with implicit parallel algorithms or the Quadrant Interlocking (QI) class of algorithms to solve linear systems. These implicit algorithms are (2x2) block algorithms expressed in explicit point form notation. [Continues.]

Loughborough University Institutional Repository

A GPU-based hyperbolic SVD algorithm

Author: A.H. Sameh
F.T. Luk
G.S. Sachdev
H. Zha
I. Slapničar
J.R. Bunch
K. Veselić
R. Mathias
R.P. Brent
S. Lahabar
S. Singer
S. Zhang
Sanja Singer
V. Hari
Vedran Novaković
Z. Drmač
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2011

A one-sided Jacobi hyperbolic singular value decomposition (HSVD) algorithm, using a massively parallel graphics processing unit (GPU), is developed. The algorithm also serves as the final stage of solving a symmetric indefinite eigenvalue problem. Numerical testing demonstrates the gains in speed and accuracy over sequential and MPI-parallelized variants of similar Jacobi-type HSVD algorithms. Finally, possibilities of hybrid CPU-GPU parallelism are discussed.

Comment: Accepted for publication in BIT Numerical Mathematics

arXiv.org e-Print Archive

CiteSeerX

Crossref

FAMENA Repository

Solution of partial differential equations on vector and parallel computers

Author: Ortega J. M.
Voigt R. G.

The present status of numerical methods for partial differential equations on vector and parallel computers is reviewed. The relevant aspects of these computers are discussed and a brief review of their development is included, with particular attention paid to those characteristics that influence algorithm selection. Both direct and iterative methods are given for elliptic equations, as well as explicit and implicit methods for initial boundary value problems. The intent is to point out attractive methods as well as areas where this class of computer architecture cannot be fully utilized because of either hardware restrictions or the lack of adequate algorithms. Application areas utilizing these computers are briefly discussed.

NASA Technical Reports Server

Accelerating Industrial Applications: The Development of Basic GPU Kernels for the New Block AMG Algorithms for Solving SLE with Explicitly Calculated Sparse Basis

Author: Afanasyev Ilya
Kharchenko Sergey
Potapov Yury
Sobolev Sergey
Publication venue: The Authors. Published by Elsevier B.V.
Publication date: 31/12/2015

Nowadays, GPU computations play a significant role in supercomputing technologies. This work is part of a project on modeling hydro- and aerodynamics problems, where linear algebra operations are used frequently and occupy most of the execution time. Although GPUs are traditionally used for solving large problems, in our project we need to solve many small ones. Because of this, modern library solutions such as cuBLAS (1) and cuSPARSE (2) are not well suited to the task, so we need to implement more efficient functions for specific linear algebra operations, taking their particularities into account.

Elsevier - Publisher Connector

Architectural support for task dependence management with flexible software scheduling

Author: Beivide Palacio Ramon
Bosque Jose L.
Casas Marc
Castillo Emilio
Moreto Planas Miquel
Valero Cortés Mateo
Vallejo Enrique
Álvarez Martí Lluc
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 01/01/2018

The growing complexity of multi-core architectures has motivated a wide range of software mechanisms to improve the orchestration of parallel executions. Task parallelism has become a very attractive approach thanks to its programmability, portability and potential for optimizations. However, with the expected increase in core counts, finer-grained tasking will be required to exploit the available parallelism, which will increase the overheads introduced by the runtime system. This work presents Task Dependence Manager (TDM), a hardware/software co-designed mechanism to mitigate runtime system overheads. TDM introduces a hardware unit, denoted Dependence Management Unit (DMU), and minimal ISA extensions that allow the runtime system to offload costly dependence tracking operations to the DMU and to still perform task scheduling in software. With lower hardware cost, TDM outperforms hardware-based solutions and enhances the flexibility, adaptability and composability of the system. Results show that TDM improves performance by 12.3% and reduces EDP by 20.4% on average with respect to a software runtime system. Compared to a runtime system fully implemented in hardware, TDM achieves an average speedup of 4.2% with 7.3x less area requirements and significant EDP reductions. In addition, five different software schedulers are evaluated with TDM, illustrating its flexibility and performance gains.

This work has been supported by the RoMoL ERC Advanced Grant (GA 321253), by the European HiPEAC Network of Excellence, by the Spanish Ministry of Science and Innovation (contracts TIN2015-65316-P, TIN2016-76635-C2-2-R and TIN2016-81840-REDT), by the Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272), and by the European Union’s Horizon 2020 research and innovation programme under grant agreements No. 671697 and No. 671610. M. Moretó has been partially supported by the Ministry of Economy and Competitiveness under Juan de la Cierva postdoctoral fellowship number JCI-2012-15047.

Peer Reviewed. Postprint (author's final draft)

Crossref

UPCommons. Portal del coneixement obert de la UPC