115 research outputs found
Symmetric indefinite triangular factorization revealing the rank profile matrix
We present a novel recursive algorithm for reducing a symmetric matrix to a
triangular factorization which reveals the rank profile matrix. That is, the
algorithm computes a factorization $P^T A P = L D L^T$ where $P$ is a
permutation matrix, $L$ is lower triangular with a unit diagonal and $D$ is
symmetric block diagonal with $1 \times 1$ and $2 \times 2$ antidiagonal
blocks. The novel algorithm requires $O(n^2 r^{\omega-2})$ arithmetic
operations. Furthermore, experimental results demonstrate that our algorithm
can even be slightly more than twice as fast as the state-of-the-art
unsymmetric Gaussian elimination in most cases; that is, it achieves
approximately the same computational speed. By adapting the pivoting strategy
developed in the unsymmetric case, we show how to recover the rank profile
matrix from the permutation matrix and the support of the block-diagonal
matrix. There is an obstruction in characteristic $2$ for revealing the rank
profile matrix, which requires relaxing the shape of the block diagonal by
allowing the 2-dimensional blocks to have a non-zero bottom-right coefficient.
This relaxed decomposition can then be transformed into a standard
$P L D L^T P^T$ decomposition at a negligible cost.
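As a small illustration of why antidiagonal blocks appear in the block-diagonal factor, the sketch below builds a symmetric matrix from a unit lower triangular factor and a block-diagonal factor containing one 2-dimensional antidiagonal block. The matrices are made-up values chosen for illustration, the permutation is taken to be the identity, and the code only multiplies the factors back together; it is not the recursive algorithm described in the abstract.

```python
def matmul(X, Y):
    """Multiply two small dense matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def transpose(X):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*X)]

# Hypothetical factors: L is unit lower triangular, D has one 2x2
# antidiagonal block followed by one 1x1 block.
L = [[1, 0, 0],
     [0, 1, 0],
     [2, 3, 1]]
D = [[0, 1, 0],
     [1, 0, 0],
     [0, 0, 5]]

# A = L * D * L^T is symmetric by construction.
A = matmul(matmul(L, D), transpose(L))

# The leading entry of A is zero, so no 1x1 pivot is available in the
# top-left position; the 2x2 antidiagonal block in D absorbs that case.
leading_entry_is_zero = (A[0][0] == 0)
```

With these values the product works out to a symmetric matrix whose leading 2-by-2 block is itself antidiagonal, which is the situation a purely diagonal factor cannot represent without permuting rows and columns.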
PyDac: A Distributed Runtime System and Programming Model for a Heterogeneous Many-Core Architecture
Heterogeneous many-core architectures that consist of big, fast cores and small, energy-efficient cores are very promising for future high-performance computing (HPC) systems. These architectures offer a good balance between single-threaded performance and multithreaded throughput. Such systems impose challenges on the design of the programming model and runtime system. Specifically, these challenges include (a) how to fully utilize the chip’s performance, (b) how to manage heterogeneous, unreliable hardware resources, and (c) how to generate and manage a large number of parallel tasks.
This dissertation proposes and evaluates a Python-based programming framework called PyDac. PyDac supports a two-level programming model. At the high level, a programmer creates a very large number of tasks, using the divide-and-conquer strategy. At the low level, tasks are written in imperative programming style. The runtime system seamlessly manages the parallel tasks, system resilience, and inter-task communication with architecture support. PyDac has been implemented on both a field-programmable gate array (FPGA) emulation of an unconventional heterogeneous architecture and a conventional multicore microprocessor. To evaluate the performance, resilience, and programmability of the proposed system, several micro-benchmarks were developed. We found that (a) PyDac abstracts away task communication and achieves programmability, (b) the micro-benchmarks are scalable on the hardware prototype, but (predictably) serial operation limits some micro-benchmarks, and (c) the degree of protection versus speed could be varied in redundant threading that is transparent to programmers.
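The two-level model described above (divide-and-conquer task creation at the high level, imperative task bodies at the low level) can be sketched with Python's standard library. This is not PyDac's API, which the abstract does not show; the function names `split`, `leaf_task`, and `parallel_sum` and the `grain` parameter are hypothetical, and a thread pool stands in for the runtime system.

```python
from concurrent.futures import ThreadPoolExecutor

def split(lo, hi, grain):
    """High level: recursively divide [lo, hi) into ranges of at most `grain` items."""
    if hi - lo <= grain:
        return [(lo, hi)]
    mid = (lo + hi) // 2
    return split(lo, mid, grain) + split(mid, hi, grain)

def leaf_task(data, lo, hi):
    """Low level: ordinary imperative code for a single task."""
    total = 0
    for i in range(lo, hi):
        total += data[i]
    return total

def parallel_sum(data, grain=4, workers=4):
    """Create one task per leaf range, run them on a pool, and combine the results."""
    ranges = split(0, len(data), grain)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(leaf_task, data, lo, hi) for lo, hi in ranges]
        return sum(f.result() for f in futures)
```

The split step decides only how work is divided, while the leaf function contains all of the computation, so the scheduling policy can change without touching the task bodies.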
Optimization Techniques for Mapping Algorithms and Applications onto CUDA GPU Platforms and CPU-GPU Heterogeneous Platforms
An emerging trend in processor architecture seems to indicate the doubling of the number of cores per chip every two years with the same or decreased clock speed. Of particular interest to this thesis is the class of many-core processors, which are becoming more attractive due to their high performance, low cost, and low power consumption. The main goal of this dissertation is to develop optimization techniques for mapping algorithms and applications onto CUDA GPUs and CPU-GPU heterogeneous platforms.
The fast Fourier transform (FFT) constitutes a fundamental tool in computational science and engineering, and hence a GPU-optimized implementation is of paramount importance. We first study the mapping of the 3D FFT onto recent CUDA GPUs and develop a new approach that minimizes the number of global memory accesses and overlaps the computations along the different dimensions. We obtain some of the fastest known implementations for the computation of multi-dimensional FFTs.
We then present a highly multithreaded FFT-based direct Poisson solver that is optimized for recent NVIDIA GPUs. In addition to the massive multithreading, our algorithm carefully manages the multiple layers of the memory hierarchy so that all global memory accesses are coalesced into 128-byte device memory transactions. As a result, we have achieved up to 375 GFLOPS with a bandwidth of 120 GB/s on the GTX 480.
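The idea behind an FFT-based direct Poisson solver can be sketched in one dimension with periodic boundaries: transform the right-hand side, divide each mode by the negative squared wavenumber, and transform back. The code below is a plain Python sketch under those assumptions, with a simple recursive radix-2 FFT; it is not the thesis's CUDA implementation, it does not address memory coalescing, and the function names are hypothetical.

```python
import cmath
import math

def fft(x, inverse=False):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two.
    The inverse transform is unnormalised (the caller divides by n)."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def solve_poisson_periodic(f):
    """Solve u'' = f on [0, 2*pi) with periodic boundaries and zero-mean u.
    `f` holds samples on a uniform grid whose length is a power of two."""
    n = len(f)
    f_hat = fft([complex(v) for v in f])
    u_hat = [0j] * n                      # the k = 0 mode stays zero (zero mean)
    for k in range(1, n):
        wave = k if k <= n // 2 else k - n    # signed integer wavenumber
        u_hat[k] = -f_hat[k] / (wave * wave)  # (i*wave)^2 * u_hat = f_hat
    return [v.real / n for v in fft(u_hat, inverse=True)]
```

For example, sampling f(x) = -sin(x) on a 16-point grid and calling `solve_poisson_periodic` recovers samples of sin(x) up to rounding error, since that mode is represented exactly on the grid.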
We further extend our methodology to deal with CPU-GPU based heterogeneous platforms for the case when the input is too large to fit in the GPU global memory. We develop optimization techniques for memory-bound and computation-bound applications. The main challenge here is to minimize data transfer between the CPU memory and the device memory and to overlap these transfers with kernel execution as much as possible. For memory-bound applications, we achieve a near-peak effective PCIe bus bandwidth of 9-10 GB/s and performance as high as 145 GFLOPS for multi-dimensional FFT computations and for solving the Poisson equation. We extend our CPU-GPU based software pipeline to a computation-bound application, DGEMM, and achieve the illusion of a memory of the CPU memory size and a computation throughput similar to a pure GPU.
Iterative methods for option pricing in Merton's Jump diffusion model
The main purpose of this thesis is to study numerical schemes for the solution of the PDE associated with the Merton jump-diffusion model. These schemes will be implemented with the finite difference method, and in particular two implicit methods (Backward Euler and Crank-Nicolson) will be developed. The resulting linear systems will be solved by two iterative methods, namely Multigrid and GMRES; for the latter, the effectiveness of some preconditioners will be tested. Moreover, an explicit-implicit scheme will be applied with the Backward Euler method, and the resulting system will be solved by a direct tridiagonal solver.
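A direct tridiagonal solve of the kind mentioned above is commonly done with the Thomas algorithm. The sketch below pairs it with one Backward Euler step for the plain heat equation with zero boundary values, which stands in for the diffusion part only; the jump integral of the Merton model is not included, and the function names and the parameter `r` (time step divided by the squared grid spacing) are assumptions made for illustration.

```python
def thomas_solve(lower, diag, upper, rhs):
    """Solve a tridiagonal system by forward elimination and back substitution.
    lower[0] and upper[-1] are ignored. No pivoting is done, so the matrix is
    assumed to be diagonally dominant."""
    n = len(diag)
    c = [0.0] * n   # modified super-diagonal
    d = [0.0] * n   # modified right-hand side
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom if i < n - 1 else 0.0
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x

def backward_euler_step(u, r):
    """One implicit step for u_t = u_xx on interior nodes with zero boundaries:
    solves (1 + 2r) * v[i] - r * v[i-1] - r * v[i+1] = u[i]."""
    n = len(u)
    return thomas_solve([-r] * n, [1 + 2 * r] * n, [-r] * n, u)
```

Because the implicit matrix is tridiagonal, each time step costs a number of operations proportional to the number of grid points, which is why a direct solver is a natural choice for this part of an explicit-implicit scheme.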
Custom optimization algorithms for efficient hardware implementation
The focus is on real-time optimal decision making with application in advanced control
systems. These computationally intensive schemes, which involve the repeated solution of
(convex) optimization problems within a sampling interval, require more efficient computational
methods than currently available for extending their application to highly dynamical
systems and setups with resource-constrained embedded computing platforms.
A range of techniques are proposed to exploit synergies between digital hardware, numerical
analysis and algorithm design. These techniques build on top of parameterisable
hardware code generation tools that generate VHDL code describing custom computing
architectures for interior-point methods and a range of first-order constrained optimization
methods. Since memory limitations are often important in embedded implementations we
develop a custom storage scheme for KKT matrices arising in interior-point methods for
control, which reduces memory requirements significantly and prevents I/O bandwidth
limitations from affecting the performance in our implementations. To take advantage of
the trend towards parallel computing architectures and to exploit the special characteristics
of our custom architectures we propose several high-level parallel optimal control
schemes that can reduce computation time. A novel optimization formulation was devised
for reducing the computational effort in solving certain problems independent of the computing
platform used. In order to be able to solve optimization problems in fixed-point
arithmetic, which is significantly more resource-efficient than floating-point, tailored linear
algebra algorithms were developed for solving the linear systems that form the computational
bottleneck in many optimization methods. These methods come with guarantees
for reliable operation. We also provide finite-precision error analysis for fixed-point implementations
of first-order methods that can be used to minimize the use of resources while
meeting accuracy specifications. The suggested techniques are demonstrated on several
practical examples, including a hardware-in-the-loop setup for optimization-based control
of a large airliner. Open Access
Dynamics of the solar tachocline II: the stratified case
We present a detailed numerical study of the Gough & McIntyre model for the
solar tachocline. This model explains the uniformity of the rotation profile
observed in the bulk of the radiative zone by the presence of a large-scale
primordial magnetic field confined below the tachocline by flows originating
from within the convection zone. We attribute the failure of previous numerical
attempts at reproducing even qualitatively Gough & McIntyre's idea to the use
of boundary conditions which inappropriately model the radiative-convective
interface. We emphasize the key role of flows downwelling from the convection
zone in confining the assumed internal field. We carefully select the range of
parameters used in the simulations to guarantee a faithful representation of
the hierarchy of expected lengthscales. We then present, for the first time, a
fully nonlinear and self-consistent numerical solution of the Gough & McIntyre
model which qualitatively satisfies the following set of observational
constraints: (i) the quenching of the large-scale differential rotation below
the tachocline - including in the polar regions - as seen by helioseismology,
and (ii) the confinement of the large-scale meridional flows to the uppermost
layers of the radiative zone as required by observed light element abundances
and suggested by helioseismic sound-speed data. Comment: 21 pages, 15 figures, submitted to MNRAS
Future Computer Requirements for Computational Aerodynamics
Recent advances in computational aerodynamics are discussed as well as motivations for and potential benefits of a National Aerodynamic Simulation Facility having the capability to solve fluid dynamic equations at speeds two to three orders of magnitude faster than presently possible with general computers. Two contracted efforts to define processor architectures for such a facility are summarized.
Research in computerized structural analysis and synthesis
Computer applications in dynamic structural analysis and structural design modeling are discussed.
A Spline LR Test for Goodness-of-Fit
Goodness-of-Fit tests, nuisance parameters, cubic spline, Neyman smooth test, Lagrange Multiplier test, stable distributions, student t distributions