    Performance Improvements of Common Sparse Numerical Linear Algebra Computations

    Manufacturers of computer hardware continue to sustain an unprecedented pace of progress in the computing speed of their products, partly through increased clock rates but also through ever more complicated chip designs. With new processor families appearing every few years, it is increasingly hard to achieve high performance rates in sparse matrix computations. This research proposes new methods for sparse matrix factorizations and applies generalizations of known concepts from related disciplines to an iterative code. The proposed solutions and extensions are implemented in ways that tend to deliver efficiency while retaining the ease of use of existing solutions. The implementations are thoroughly timed and analyzed using a commonly accepted set of test matrices. The tests were conducted on modern processors that have gained an appreciable level of popularity and are fairly representative of a wider range of processor types available on the market now or in the near future. The new factorization technique, formally introduced in the early chapters, is later shown to be quite competitive with the state-of-the-art software currently available. Although not superior in all cases (as probably no single approach could be), the new factorization algorithm exhibits a few promising features. In addition, a comprehensive optimization effort is applied to an iterative algorithm that stands out for its robustness. This also gives satisfactory results on the tested computing platforms in terms of performance improvement. The same set of test matrices is used to enable an easy comparison between the two investigated techniques, even though they are customarily treated separately in the literature. Possible extensions of the presented work are discussed. They range from easily conceivable merging with existing solutions to more elaborate schemes that depend on hard-to-predict progress in theoretical and algorithmic research.

    A framework for efficient execution of matrix computations

    Matrix computations lie at the heart of most scientific computational tasks. The solution of linear systems of equations is a very frequent operation in many fields of science, engineering, surveying, physics and others. Other matrix operations occur frequently in fields such as pattern recognition and classification, or multimedia applications. Therefore, it is important to perform matrix operations efficiently. The work in this thesis focuses on the efficient execution on commodity processors of matrix operations which arise frequently in different fields. We study some important operations which appear in the solution of real world problems: some sparse and dense linear algebra codes and a classification algorithm. In particular, we focus our attention on the efficient execution of the following operations: sparse Cholesky factorization; dense matrix multiplication; dense Cholesky factorization; and Nearest Neighbor classification. A lot of research has been conducted on the efficient parallelization of numerical algorithms. However, the efficiency of a parallel algorithm depends ultimately on the performance obtained from the computations performed on each node. The work presented in this thesis focuses on sequential execution on a single processor. There exist a number of data structures for sparse computations which can be used to avoid the storage of, and computation on, zero elements. We work with a hierarchical data structure known as a hypermatrix. A matrix is subdivided recursively an arbitrary number of times. Several pointer matrices are used to store the location of submatrices at each level. The last level consists of data submatrices which are dealt with as dense submatrices. When the block size of these dense submatrices is small, the number of stored zeros can be greatly reduced. However, the performance obtained from BLAS3 routines drops heavily.
Consequently, there is a trade-off in the size of the data submatrices used for a sparse Cholesky factorization with the hypermatrix scheme. Our goal is to reduce the overhead introduced by unnecessary operations on zeros when a hypermatrix data structure is used to produce a sparse Cholesky factorization. In this work we study several techniques for reducing such overhead in order to obtain high performance. One of our goals is the creation of codes which work efficiently on different platforms when operating on dense matrices. To obtain high performance, the resources offered by the CPU must be properly utilized. At the same time, the memory hierarchy must be exploited to tolerate increasing memory latencies. To achieve the former, we produce inner kernels which use the CPU very efficiently. To achieve the latter, we investigate nonlinear data layouts. Such data formats can contribute to the effective use of the memory system. The use of highly optimized inner kernels is of paramount importance for obtaining efficient numerical algorithms. Often, such kernels are created by hand. However, we want to create efficient inner kernels for a variety of processors using a general approach and avoiding hand coding in assembly language. In this work, we present an alternative way to produce efficient kernels automatically, based on a set of simple codes written in a high level language, which can be parameterized at compilation time. The advantage of our method lies in the ability to generate very efficient inner kernels by means of a good compiler. Working on regular codes for small matrices, most of the compilers we used on different platforms created very efficient inner kernels for matrix multiplication. Using the resulting kernels we have been able to produce high performance sparse and dense linear algebra codes on a variety of platforms. In this work we also show that techniques used in linear algebra codes can be useful in other fields.
We present the work we have done on the optimization of Nearest Neighbor classification, focusing on the speed of the classification process. Tuning several codes for different problems and machines can become a heavy burden. For this reason we have developed an environment for the development and automatic benchmarking of codes, which is presented in this thesis. As a practical result of this work, we have been able to create efficient codes for several matrix operations on a variety of platforms. Our codes are highly competitive with other state-of-the-art codes for some problems.
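
    The hypermatrix trade-off described above can be illustrated with a minimal sketch. It assumes a single level of subdivision (one pointer matrix over dense data submatrices) rather than the arbitrary recursion the thesis describes, and the names build_hypermatrix and stored_zeros, as well as the tridiagonal test matrix, are hypothetical choices made only for illustration.

```python
def build_hypermatrix(dense, block):
    """One-level hypermatrix: a pointer matrix whose entries are either
    None (an all-zero block that is not stored) or a dense data submatrix."""
    n = len(dense)
    nb = (n + block - 1) // block  # number of block rows and block columns
    pointers = [[None] * nb for _ in range(nb)]
    for bi in range(nb):
        for bj in range(nb):
            rows = range(bi * block, min((bi + 1) * block, n))
            cols = range(bj * block, min((bj + 1) * block, n))
            sub = [[dense[i][j] for j in cols] for i in rows]
            # Keep the block only if it holds at least one nonzero.
            if any(v != 0 for row in sub for v in row):
                pointers[bi][bj] = sub
    return pointers


def stored_zeros(pointers):
    """Count explicit zeros kept inside the stored data submatrices."""
    return sum(
        1
        for prow in pointers
        for sub in prow
        if sub is not None
        for row in sub
        for v in row
        if v == 0
    )


# Hypothetical 8 x 8 tridiagonal matrix, used only as test data.
n = 8
A = [[0.0] * n for _ in range(n)]
for i in range(n):
    A[i][i] = 4.0
    if i + 1 < n:
        A[i][i + 1] = A[i + 1][i] = -1.0

# Stored zeros for several data submatrix sizes.
zeros_by_block = {b: stored_zeros(build_hypermatrix(A, b)) for b in (1, 2, 4, 8)}
```

    Smaller data submatrices store fewer explicit zeros, but each dense block then offers less work to an optimized dense kernel, which is the tension the abstract refers to.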

    Maintaining High Performance Across All Problem Sizes and Parallel Scales Using Microkernel-based Linear Algebra

    Linear algebra underlies a large proportion of computational problems. With the continuous increase of scale on modern hardware, the performance of small-sized linear algebra has become increasingly important. To overcome the shortcomings of conventional approaches, we employ a new approach using a microkernel framework provided by ATLAS to improve the performance of a few linear algebra routines for all problem sizes. Our initial research consists of improving the performance of parallel LU factorization in ATLAS, for which we were able to achieve up to 2.07x and 2.66x speedup for small problems, and up to 91% and 87% of theoretical peak performance for asymptotic problems, on a 12-core Intel Xeon and a 32-core AMD Opteron machine, respectively, outperforming all the state-of-the-art libraries at the time. Such performance was achieved via an exhaustive search of all the tuning parameters, which could take days. This motivated us to try to develop a computational model for our LU factorization that could predict those parameters by combining some basic empirical timings and a theoretical model based on the amount of required computation. While our model provided good predictions for mid-to-asymptotic sized problems, there were some unknown factors for small problems that could possibly be answered by extending the ATLAS tuning framework. While this extension is underway, we decided to pursue the model research using a simpler serial BLAS-based approach. We investigated and implemented two Level-3 BLAS routines, TRSM and TRMM, which are widely used, primarily by LAPACK operations such as the aforementioned LU factorization. With the microkernel-based approach, we were able to improve the performance of both routines by up to 15% and 73% for square and fat problems, respectively, over prior ATLAS implementations on modern hardware.
Finally, in collaborative research with ARM Inc., we improved the performance of the most important Level-3 BLAS operation, GEMM, in ATLAS by up to 53% by implementing microkernels for two 64-bit ARM architectures. This automatically improves other BLAS and LAPACK routines that rely on GEMM for high performance.
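
    As a rough illustration of the microkernel idea, the sketch below builds a full GEMM out of a small fixed-shape tile update. It does not use or reproduce the ATLAS framework or its interfaces; the function names and the tile sizes mu, nu and ku are hypothetical stand-ins for the kind of tuning parameters the abstract mentions.

```python
def microkernel(A, B, C, i0, j0, k0, mu, nu, ku):
    """Update one mu x nu tile of C using a ku-deep slice of A and B."""
    for i in range(i0, i0 + mu):
        for j in range(j0, j0 + nu):
            acc = C[i][j]
            for k in range(k0, k0 + ku):
                acc += A[i][k] * B[k][j]
            C[i][j] = acc


def gemm(A, B, mu=2, nu=2, ku=2):
    """C = A * B, computed as a sequence of microkernel calls on tiles."""
    m, kdim, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, mu):
        for j0 in range(0, n, nu):
            for k0 in range(0, kdim, ku):
                # Edge tiles are clipped when a dimension is not a multiple
                # of the tile size.
                microkernel(A, B, C, i0, j0, k0,
                            min(mu, m - i0), min(nu, n - j0), min(ku, kdim - k0))
    return C


def naive(A, B):
    """Reference triple loop used to check the tiled version."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


# Hypothetical 3 x 5 and 5 x 4 integer-valued operands, so both orders of
# summation give exactly the same floating-point result.
A = [[float(i + k) for k in range(5)] for i in range(3)]
B = [[float(k - j) for j in range(4)] for k in range(5)]
C_tiled = gemm(A, B)
C_ref = naive(A, B)
```

    In a tuned library the tile update would be a hand-optimized or generated kernel; the point here is only that the outer loops stay the same while the kernel and its tile sizes vary.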

    Algorithms and Methods for High-Performance Model Predictive Control

    High performance Cholesky and symmetric indefinite factorizations with applications

    Factorizing a symmetric matrix A using the Cholesky (LL^T) or indefinite (LDL^T) factorization allows the efficient solution of systems Ax = b. This thesis describes the development of new serial and parallel techniques for this problem and demonstrates them in the setting of interior point methods. In serial, the effects of various scalings are reported, and a fast and robust mixed precision sparse solver is developed. In parallel, DAG-driven dense and sparse factorizations are developed for the positive definite case. These achieve performance comparable with other world-leading implementations using a novel algorithm in the same family as those given by Buttari et al. for the dense problem. The performance of these techniques in the context of an interior point method is assessed.
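
    For reference, the way a Cholesky factorization is used to solve Ax = b can be sketched as follows. This is a plain dense, serial, double-precision textbook version, not the mixed precision or DAG-driven algorithms of the thesis; the function names and the 3 x 3 example are hypothetical.

```python
import math


def cholesky(A):
    """Return lower-triangular L with A = L L^T for a symmetric positive
    definite matrix A given as a list of lists."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        d = A[j][j] - sum(L[j][k] ** 2 for k in range(j))
        if d <= 0.0:
            raise ValueError("matrix is not positive definite")
        L[j][j] = math.sqrt(d)
        for i in range(j + 1, n):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = (A[i][j] - s) / L[j][j]
    return L


def solve(L, b):
    """Solve L L^T x = b by forward then backward substitution."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):  # forward substitution: L y = b
        y[i] = (b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):  # backward substitution: L^T x = y
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x


# Hypothetical example: b is chosen so that the exact solution is [1, 2, 3].
A = [[4.0, 2.0, 2.0],
     [2.0, 5.0, 3.0],
     [2.0, 3.0, 6.0]]
b = [14.0, 21.0, 26.0]
L = cholesky(A)
x = solve(L, b)
```

    Once L is available, each additional right-hand side costs only the two triangular solves, which is why the factorization is reused across the linear systems of an interior point iteration where possible.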

    Parallel algorithms for three dimensional electrical impedance tomography

    This thesis is concerned with Electrical Impedance Tomography (EIT), an imaging technique in which pictures of the electrical impedance within a volume are formed from current and voltage measurements made on the surface of the volume. The focus of the thesis is the mathematical and numerical aspects of reconstructing the impedance image from the measured data (the reconstruction problem). The reconstruction problem is mathematically difficult and most reconstruction algorithms are computationally intensive. Many of the potential applications of EIT in medical diagnosis and industrial process control depend upon rapid reconstruction of images. The aim of this investigation is to find algorithms and numerical techniques that lead to fast reconstruction while respecting the real mathematical difficulties involved. A general framework for Newton based reconstruction algorithms is developed which describes a large number of the reconstruction algorithms used by other investigators. Optimal experiments are defined in terms of current drive and voltage measurement patterns and it is shown that adaptive current reconstruction algorithms are a special case of their use. This leads to a new reconstruction algorithm using optimal experiments which is considerably faster than other methods of the Newton type. A tomograph is tested to measure the magnitude of the major sources of error in the data used for image reconstruction. An investigation into the numerical stability of reconstruction algorithms identifies the resulting uncertainty in the impedance image. A new data collection strategy and a numerical forward model are developed which minimise the effects of what were previously major sources of error. A reconstruction program is written for a range of Multiple Instruction Multiple Data (MIMD) distributed-memory parallel computers. These machines offer high computational power at low cost and so are attractive as components in medical tomographs.
The performance of several reconstruction algorithms on these computers is analysed in detail.

    Application of Efficient Matrix Inversion to the Decomposition of Hierarchical Matrices

    Anisotropy and heterogeneity in finite deformation: resolving versus upscaling

    On High Performance Computing in Geodesy : Applications in Global Gravity Field Determination

    Autonomously working sensor platforms deliver an increasing number of precise data sets, which are often usable in geodetic applications. Owing to their volume and quality, models determined from the data can be parameterized in a more complex and more detailed way. To derive model parameters from these observations, the solution of a high dimensional inverse data fitting problem is often required. To solve such high dimensional adjustment problems, this thesis proposes a systematic, end-to-end use of a massively parallel implementation of the geodetic data analysis, using standard concepts of massively parallel high performance computing. It is shown how these concepts can be integrated into a typical geodetic problem which requires the solution of a high dimensional adjustment problem. It is also shown how the proposed parallel use of the computing and memory resources of a compute cluster makes general Gauss-Markoff models solvable which previously could be solved only by means of computationally motivated simplifications and approximations. A basic, easy-to-use framework is developed which is able to perform all relevant operations needed to solve a typical geodetic least squares adjustment problem. It provides the interface to the standard concepts and libraries used. Examples covering different characteristics of the adjustment problem show how the framework is used and how it can be adapted for specific applications. In a computational sense, rigorous solutions become possible for hundreds of thousands to millions of unknown parameters, which have to be estimated from a huge number of observations. Three special problems with different characteristics, as they arise in global gravity field recovery, are chosen and massively parallel implementations of the solution processes are derived.
The first application covers global gravity field determination from real data collected by the GOCE satellite mission (comprising 440 million highly correlated observations and 80,000 parameters). In the second application, high dimensional global gravity field models are estimated from the combination of complementary data sets via the assembly and solution of full normal equations (scenarios with 520,000 parameters and 2 TB of normal equations). The third application solves a comparable problem but uses an iterative least squares solver, allowing for a parameter space of even higher dimension (scenarios with two million parameters). This thesis forms the basis for a flexible, massively parallel software package, which can be extended to address current and future research topics studied in the department. Within this thesis, the main focus lies on the computational aspects.
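
    The assembly and solution of normal equations in a least squares adjustment can be sketched on a toy problem. The sketch assumes uncorrelated, equally weighted observations (identity weight matrix) and a two-parameter straight-line model, and it processes the observations in chunks as a serial stand-in for the distribution over compute nodes; none of the names below come from the thesis software.

```python
def accumulate_normals(design_rows, observations, N, rhs):
    """Add one chunk's contribution A_k^T A_k to N and A_k^T l_k to rhs."""
    for a, l in zip(design_rows, observations):
        for i in range(len(a)):
            rhs[i] += a[i] * l
            for j in range(len(a)):
                N[i][j] += a[i] * a[j]


def solve_2x2(N, rhs):
    """Solve the 2 x 2 normal equations N x = rhs by Cramer's rule."""
    det = N[0][0] * N[1][1] - N[0][1] * N[1][0]
    return [(rhs[0] * N[1][1] - N[0][1] * rhs[1]) / det,
            (N[0][0] * rhs[1] - N[1][0] * rhs[0]) / det]


# Hypothetical noise-free observations of the line l = 2.0 + 0.5 * t.
t = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
design = [[1.0, ti] for ti in t]      # one design-matrix row per observation
obs = [2.0 + 0.5 * ti for ti in t]

N = [[0.0, 0.0], [0.0, 0.0]]          # normal matrix A^T A
rhs = [0.0, 0.0]                      # right-hand side A^T l
chunk = 2                             # observations handled per chunk
for start in range(0, len(t), chunk):
    accumulate_normals(design[start:start + chunk], obs[start:start + chunk], N, rhs)

x_hat = solve_2x2(N, rhs)             # estimated [offset, slope]
```

    Because the normal matrix is a sum of per-chunk contributions, the chunks can be accumulated independently and added together, which is what makes the assembly step amenable to parallel execution.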

    Parameter estimation supplement to the Mission Analysis Evaluation and Space Trajectory Operations program (MAESTRO)

    This Parameter Estimation Supplement describes the PEST computer program and gives instructions for its use in the determination of lunar gravitation field coefficients. PEST was developed for use in the RAE-B lunar orbiting mission as a means of lunar field recovery. The observations processed by PEST are short-arc osculating orbital elements. These observations are the end product of an orbit determination process obtained with another program. PEST's end product is a set of harmonic coefficients to be used in long-term prediction of the lunar orbit. PEST employs some novel techniques in its estimation process, notably a square batch estimator and linear variational equations in the orbital elements (both osculating and mean) for measurement sensitivities. The program's capabilities are described, and operating instructions and input/output examples are given. PEST utilizes MAESTRO routines for its trajectory propagation. PEST's program structure and the subroutines which are not common to MAESTRO are described. Some of the theoretical background for the estimation process and a derivation of linear variational equations for the Method 7 elements are included.
