115 research outputs found

Symmetric indefinite triangular factorization revealing the rank profile matrix

Author: Dumas Jean-Guillaume
Pernet Clement
Publication date: 26/02/2018

We present a novel recursive algorithm for reducing a symmetric matrix to a triangular factorization which reveals the rank profile matrix. That is, the algorithm computes a factorization $\mathbf{P}^T\mathbf{A}\mathbf{P} = \mathbf{L}\mathbf{D}\mathbf{L}^T$, where $\mathbf{P}$ is a permutation matrix, $\mathbf{L}$ is lower triangular with a unit diagonal, and $\mathbf{D}$ is symmetric block diagonal with $1{\times}1$ and $2{\times}2$ antidiagonal blocks. The novel algorithm requires $O(n^2r^{\omega-2})$ arithmetic operations. Furthermore, experimental results demonstrate that our algorithm can even be slightly more than twice as fast as the state-of-the-art unsymmetric Gaussian elimination in most cases; that is, it achieves approximately the same computational speed. By adapting the pivoting strategy developed in the unsymmetric case, we show how to recover the rank profile matrix from the permutation matrix and the support of the block-diagonal matrix. There is an obstruction in characteristic $2$ for revealing the rank profile matrix, which requires relaxing the shape of the block diagonal by allowing the 2-dimensional blocks to have a non-zero bottom-right coefficient. This relaxed decomposition can then be transformed into a standard $\mathbf{P}\mathbf{L}\mathbf{D}\mathbf{L}^T\mathbf{P}^T$ decomposition at a negligible cost.
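
To make the term "rank profile matrix" concrete, the following is a minimal sketch that computes it by brute force from its definition: the 0/1 matrix whose leading submatrices all have the same rank as the corresponding leading submatrices of the input. It is not the recursive algorithm of the abstract. It works over the rationals instead of a finite field, and the function names `rank` and `rank_profile_matrix` are assumptions made for this illustration.

```python
from fractions import Fraction


def rank(rows):
    """Rank of a matrix (list of rows) over the rationals, by Gaussian elimination."""
    m = [[Fraction(x) for x in row] for row in rows]
    if not m or not m[0]:
        return 0
    r = 0  # number of pivots found so far
    for col in range(len(m[0])):
        pivot = next((i for i in range(r, len(m)) if m[i][col] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(r + 1, len(m)):
            factor = m[i][col] / m[r][col]
            m[i] = [a - factor * b for a, b in zip(m[i], m[r])]
        r += 1
        if r == len(m):
            break
    return r


def rank_profile_matrix(a):
    """Brute-force rank profile matrix (hypothetical helper, not the paper's method).

    Entry (i, j) is the second difference of the ranks of the leading
    submatrices, which is 1 exactly where a new pivot must appear.
    """
    def lead(i, j):
        # Rank of the leading i-by-j submatrix of a.
        return rank([row[:j] for row in a[:i]])

    return [
        [lead(i + 1, j + 1) - lead(i, j + 1) - lead(i + 1, j) + lead(i, j)
         for j in range(len(a[0]))]
        for i in range(len(a))
    ]


print(rank_profile_matrix([[0, 1], [1, 0]]))
print(rank_profile_matrix([[1, 2], [2, 4]]))
```

The first input, the symmetric matrix with zero diagonal and ones off the diagonal, is its own rank profile matrix. No symmetric permutation moves a non-zero entry onto its diagonal, which illustrates why a purely diagonal $\mathbf{D}$ cannot suffice and $2{\times}2$ antidiagonal blocks are allowed.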

arXiv.org e-Print Archive

Crossref

Hal - Université Grenoble Alpes

INRIA a CCSD electronic archive server

PYDAC: A DISTRIBUTED RUNTIME SYSTEM AND PROGRAMMING MODEL FOR A HETEROGENEOUS MANY-CORE ARCHITECTURE

Author: Huang Bin
NC DOCKS at The University of North Carolina at Charlotte
Publication date: 01/01/2014

Heterogeneous many-core architectures that consist of big, fast cores and small, energy-efficient cores are very promising for future high-performance computing (HPC) systems. These architectures offer a good balance between single-threaded performance and multithreaded throughput. Such systems impose challenges on the design of the programming model and runtime system. Specifically, these challenges include (a) how to fully utilize the chip's performance, (b) how to manage heterogeneous, unreliable hardware resources, and (c) how to generate and manage a large number of parallel tasks. This dissertation proposes and evaluates a Python-based programming framework called PyDac. PyDac supports a two-level programming model. At the high level, a programmer creates a very large number of tasks using the divide-and-conquer strategy. At the low level, tasks are written in an imperative programming style. The runtime system seamlessly manages the parallel tasks, system resilience, and inter-task communication with architecture support. PyDac has been implemented on both a field-programmable gate array (FPGA) emulation of an unconventional heterogeneous architecture and a conventional multicore microprocessor. To evaluate the performance, resilience, and programmability of the proposed system, several micro-benchmarks were developed. We found that (a) PyDac abstracts away task communication and achieves programmability, (b) the micro-benchmarks are scalable on the hardware prototype, but (predictably) serial operation limits some micro-benchmarks, and (c) the degree of protection versus speed can be varied in redundant threading that is transparent to programmers.
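
The two-level model described in this abstract can be illustrated with a minimal sketch in standard Python. This is not PyDac's API, which the abstract does not show. The names `divide`, `leaf_task`, and `sum_of_squares` are hypothetical, and a thread pool from the standard library stands in for the runtime system.

```python
from concurrent.futures import ThreadPoolExecutor


def divide(data, leaf_size):
    """High level: recursively split the input into many small, independent tasks.

    Assumes leaf_size >= 1.
    """
    if len(data) <= leaf_size:
        return [data]
    mid = len(data) // 2
    return divide(data[:mid], leaf_size) + divide(data[mid:], leaf_size)


def leaf_task(chunk):
    """Low level: each task is plain imperative code over its own chunk."""
    total = 0
    for x in chunk:
        total += x * x
    return total


def sum_of_squares(data, leaf_size=64, workers=4):
    """Create the tasks, let the pool schedule them, then combine the results."""
    tasks = divide(data, leaf_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(leaf_task, tasks))
    return sum(partials)


print(sum_of_squares(list(range(1000))))
```

The programmer only writes the split and the leaf computation. Scheduling and result collection are left to the pool, which is the role the abstract assigns to the runtime system.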

The University of North Carolina at Greensboro

Optimization Techniques for Mapping Algorithms and Applications onto CUDA GPU Platforms and CPU-GPU Heterogeneous Platforms

Author: Wu Jing
Publication date: 01/01/2014

An emerging trend in processor architecture seems to indicate the doubling of the number of cores per chip every two years with the same or decreased clock speed. Of particular interest to this thesis is the class of many-core processors, which are becoming more attractive due to their high performance, low cost, and low power consumption. The main goal of this dissertation is to develop optimization techniques for mapping algorithms and applications onto CUDA GPUs and CPU-GPU heterogeneous platforms. The fast Fourier transform (FFT) constitutes a fundamental tool in computational science and engineering, and hence a GPU-optimized implementation is of paramount importance. We first study the mapping of the 3D FFT onto recent CUDA GPUs and develop a new approach that minimizes the number of global memory accesses and overlaps the computations along the different dimensions. We obtain some of the fastest known implementations for the computation of multi-dimensional FFTs. We then present a highly multithreaded FFT-based direct Poisson solver that is optimized for recent NVIDIA GPUs. In addition to the massive multithreading, our algorithm carefully manages the multiple layers of the memory hierarchy so that all global memory accesses are coalesced into 128-byte device memory transactions. As a result, we have achieved up to 375 GFLOPS with a bandwidth of 120 GB/s on the GTX 480. We further extend our methodology to deal with CPU-GPU heterogeneous platforms for the case when the input is too large to fit in the GPU global memory. We develop optimization techniques for memory-bound and computation-bound applications. The main challenge here is to minimize data transfer between the CPU memory and the device memory and to overlap these transfers with kernel execution as much as possible. For memory-bound applications, we achieve a near-peak effective PCIe bus bandwidth of 9-10 GB/s and performance as high as 145 GFLOPS for multi-dimensional FFT computations and for solving the Poisson equation. We extend our CPU-GPU software pipeline to a computation-bound application, DGEMM, and achieve the illusion of a memory of the CPU memory size and a computation throughput similar to a pure GPU
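
The benefit of overlapping transfers with kernel execution can be estimated with a simple timing model. The sketch below is an idealized illustration and is not taken from the thesis. It assumes equal-sized chunks, a single copy direction, enough device buffers to keep the pipeline full, and no copy of results back to the host. The function names are hypothetical.

```python
def serial_time(n_chunks, transfer, kernel):
    """Total time when every chunk is copied and then processed, with no overlap."""
    return n_chunks * (transfer + kernel)


def pipelined_time(n_chunks, transfer, kernel):
    """Total time when the copy of chunk i+1 overlaps the kernel of chunk i.

    The first copy and the last kernel cannot be hidden. In between, each
    step costs whichever of the two stages is slower.
    """
    if n_chunks == 0:
        return 0.0
    return transfer + (n_chunks - 1) * max(transfer, kernel) + kernel


# Example with made-up per-chunk costs (arbitrary time units).
print(serial_time(4, 2.0, 3.0))
print(pipelined_time(4, 2.0, 3.0))
```

In this model a memory-bound application (transfer slower than kernel) ends up limited by the bus, while a computation-bound one ends up limited by the kernel, which matches the distinction the abstract draws between the two cases.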

Digital Repository at the University of Maryland

Iterative methods for option pricing in Merton's Jump diffusion model


The main purpose of this thesis is to study numerical schemes for the solution of the PDE associated with the Merton jump-diffusion model. These schemes are implemented with the finite difference method; in particular, two implicit methods (Backward Euler and Crank-Nicolson) are developed. The resulting linear systems are solved by two iterative methods, namely Multigrid and GMRES; for the latter, the effectiveness of several preconditioners is tested. Moreover, an explicit-implicit scheme is applied with the Backward Euler method, and the resulting system is solved with a direct tridiagonal solver.
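
A direct tridiagonal solver of the kind mentioned at the end of this abstract is commonly implemented with the Thomas algorithm. The sketch below is a generic version of that algorithm, not code from the thesis. It assumes the matrix needs no pivoting (for example, because it is diagonally dominant), and the storage convention for the three diagonals is an assumption made for this illustration.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Solve a tridiagonal system with the Thomas algorithm (no pivoting).

    All four lists have length n. Row i of the matrix holds lower[i],
    diag[i], upper[i] on the sub-, main and super-diagonal;
    lower[0] and upper[n - 1] are ignored.
    """
    n = len(diag)
    c = [0.0] * n  # modified super-diagonal
    d = [0.0] * n  # modified right-hand side

    # Forward sweep: eliminate the sub-diagonal.
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom if i < n - 1 else 0.0
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom

    # Back substitution.
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


# A small diffusion-like system whose exact solution is [1, 1, 1].
x = solve_tridiagonal([0.0, -1.0, -1.0], [2.0, 2.0, 2.0], [-1.0, -1.0, 0.0], [1.0, 0.0, 1.0])
print(x)
```

The solver runs in time linear in the number of unknowns. An explicit-implicit scheme presumably treats the non-local jump term explicitly, so that the implicit part stays tridiagonal and this cost applies at each time step.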

Padua Thesis and Dissertation Archive

Custom optimization algorithms for efficient hardware implementation

Author: Juan Luis Jerez
Supervisors: George A. Constantinides
Eric C. Kerrigan
Publication venue: Electrical and Electronic Engineering, Imperial College London
Publication date: 01/01/2013

The focus is on real-time optimal decision making with application in advanced control systems. These computationally intensive schemes, which involve the repeated solution of (convex) optimization problems within a sampling interval, require more efficient computational methods than currently available for extending their application to highly dynamical systems and setups with resource-constrained embedded computing platforms. A range of techniques are proposed to exploit synergies between digital hardware, numerical analysis and algorithm design. These techniques build on top of parameterisable hardware code generation tools that generate VHDL code describing custom computing architectures for interior-point methods and a range of first-order constrained optimization methods. Since memory limitations are often important in embedded implementations, we develop a custom storage scheme for KKT matrices arising in interior-point methods for control, which reduces memory requirements significantly and prevents I/O bandwidth limitations from affecting the performance in our implementations. To take advantage of the trend towards parallel computing architectures and to exploit the special characteristics of our custom architectures, we propose several high-level parallel optimal control schemes that can reduce computation time. A novel optimization formulation was devised for reducing the computational effort in solving certain problems independent of the computing platform used. In order to be able to solve optimization problems in fixed-point arithmetic, which is significantly more resource-efficient than floating-point, tailored linear algebra algorithms were developed for solving the linear systems that form the computational bottleneck in many optimization methods. These methods come with guarantees for reliable operation. We also provide finite-precision error analysis for fixed-point implementations of first-order methods that can be used to minimize the use of resources while meeting accuracy specifications. The suggested techniques are demonstrated on several practical examples, including a hardware-in-the-loop setup for optimization-based control of a large airliner.
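
For readers unfamiliar with fixed-point arithmetic, the sketch below shows the basic idea using Python integers: a real number is stored as an integer scaled by a power of two, and products are rescaled after multiplication. The 16 fractional bits and the helper names are assumptions chosen for illustration. The thesis targets custom hardware, and this is not its implementation.

```python
FRAC_BITS = 16            # assumed number of fractional bits
SCALE = 1 << FRAC_BITS    # 2**16


def to_fixed(x):
    """Quantize a float to the nearest representable fixed-point value."""
    return int(round(x * SCALE))


def to_float(q):
    """Convert a fixed-point integer back to a float."""
    return q / SCALE


def fixed_mul(a, b):
    """Multiply two fixed-point values, rounding the result to FRAC_BITS."""
    return (a * b + (SCALE >> 1)) >> FRAC_BITS


def fixed_dot(xs, ys):
    """Dot product in fixed point, the core operation of a linear solver."""
    acc = 0
    for x, y in zip(xs, ys):
        acc += fixed_mul(x, y)
    return acc


a, b = to_fixed(1.5), to_fixed(2.25)
print(to_float(fixed_mul(a, b)))
print(abs(to_float(to_fixed(0.1)) - 0.1))  # quantization error of one value
```

Each stored value carries a rounding error of at most half a unit in the last place (here 2**-17), and such errors accumulate through an iterative method. Bounding that accumulation is the kind of question a finite-precision error analysis addresses.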

CiteSeerX

Spiral - Imperial College Digital Repository

Dynamics of the solar tachocline II: the stratified case

Author: P. Garaud
J.-D. Garaud
Publication venue: 'Wiley'
Publication date: 16/06/2008

We present a detailed numerical study of the Gough & McIntyre model for the solar tachocline. This model explains the uniformity of the rotation profile observed in the bulk of the radiative zone by the presence of a large-scale primordial magnetic field confined below the tachocline by flows originating from within the convection zone. We attribute the failure of previous numerical attempts to reproduce Gough & McIntyre's idea, even qualitatively, to the use of boundary conditions which inappropriately model the radiative-convective interface. We emphasize the key role of flows downwelling from the convection zone in confining the assumed internal field. We carefully select the range of parameters used in the simulations to guarantee a faithful representation of the hierarchy of expected length scales. We then present, for the first time, a fully nonlinear and self-consistent numerical solution of the Gough & McIntyre model which qualitatively satisfies the following set of observational constraints: (i) the quenching of the large-scale differential rotation below the tachocline, including in the polar regions, as seen by helioseismology, and (ii) the confinement of the large-scale meridional flows to the uppermost layers of the radiative zone, as required by observed light element abundances and suggested by helioseismic sound-speed data. Comment: 21 pages, 15 figures, submitted to MNRA

arXiv.org e-Print Archive

Crossref

Future Computer Requirements for Computational Aerodynamics


Recent advances in computational aerodynamics are discussed as well as motivations for and potential benefits of a National Aerodynamic Simulation Facility having the capability to solve fluid dynamic equations at speeds two to three orders of magnitude faster than presently possible with general computers. Two contracted efforts to define processor architectures for such a facility are summarized

NASA Technical Reports Server