
10 research outputs found

Density-Aware Linear Algebra in a Column-Oriented In-Memory Database System

Author: Kernert David
Publication date: 20/09/2016

Linear algebra operations appear in nearly every application in advanced analytics, machine learning, and various science domains. To this day, many data analysts and scientists tend to use statistics software packages or hand-crafted solutions for their analysis. In the era of data deluge, however, the external statistics packages and custom analysis programs that often run on single workstations are unable to keep up with the vast increase in data volume and size. In particular, there is increasing demand among scientists for large-scale data manipulation, orchestration, and advanced data management capabilities. These are among the key features of a mature relational database management system (DBMS). With the rise of main-memory database systems, it has now become feasible to also consider applications that build on linear algebra. This thesis presents a deep integration of linear algebra functionality into an in-memory column-oriented database system. In particular, this work shows that it has become feasible to execute linear algebra queries on large data sets directly in a DBMS-integrated engine (LAPEG), without the need to transfer data and without being restricted by hard disk latencies. From various application examples that are cited in this work, we deduce a number of requirements that are relevant for a database system that includes linear algebra functionality. Besides the deep integration of matrices and numerical algorithms, these include optimization of expressions, transparent matrix handling, scalability and data-parallelism, and data manipulation capabilities. These requirements are addressed by our linear algebra engine. In particular, the core contributions of this thesis are as follows. Firstly, we show that the columnar storage layer of an in-memory DBMS allows easy adoption of efficient sparse matrix data types and algorithms.
Furthermore, we show that the execution of linear algebra expressions benefits significantly from different techniques that are inspired by database technology. In a novel way, we implemented several of these optimization strategies in LAPEG’s optimizer (SpMachO), which uses an advanced density estimation method (SpProdest) to predict the matrix density of intermediate results. Moreover, we present an adaptive matrix data type, AT Matrix, to spare scientists the need to select appropriate matrix representations. The tiled substructure of AT Matrix is exploited by our matrix multiplication to saturate the different sockets of a multicore main-memory platform, reaching a speed-up of up to 6x compared to alternative approaches. Finally, a major part of this thesis is devoted to the topic of data manipulation, where we propose a matrix manipulation API and present different mutable matrix types to enable fast insertions and deletions. We conclude that our linear algebra engine is well suited to process dynamic, large matrix workloads in an optimized way. In particular, the DBMS-integrated LAPEG fills the linear algebra gap and makes a columnar in-memory DBMS attractive as an efficient, scalable ad-hoc analysis platform for scientists.

Technische Universität Dresden: Qucosa

Evaluation of Distributed Programming Models and Extensions to Task-based Runtime Systems

Author: Pei Yu
Publication venue: TRACE: Tennessee Research and Creative Exchange
Publication date: 01/12/2022

High Performance Computing (HPC) has always been a key foundation for scientific simulation and discovery. More recently, the training of deep learning models has further accelerated the demand for computational power and lower-precision arithmetic. In this era following the end of Dennard's Scaling, and when Moore's Law seemingly still holds true to a lesser extent, it is not a coincidence that HPC systems are equipped with multi-core CPUs and a variety of hardware accelerators that are all massively parallel. Coupling this with interconnect networks' speed improvements lagging behind those of computational power, the current state of HPC systems is heterogeneous and extremely complex. This was heralded as a great challenge to the software stacks and their ability to extract performance from these systems, but also as a great opportunity to innovate at the programming model level, to explore the different approaches, and to propose new solutions. With usability, portability, and performance as the main factors to consider, this dissertation first evaluates the ability of some of the widely used parallel programming models (MPI, MPI+OpenMP, and task-based runtime systems) to manage the load imbalance among the processes computing the LU factorization of a large dense matrix stored in the Block Low-Rank (BLR) format. Next, I proposed a number of optimizations and implemented them in PaRSEC's Dynamic Task Discovery (DTD) model, including user-level graph trimming and direct Application Programming Interface (API) calls to perform data broadcast operations, to further extend the limit of the STF model.
On the other hand, the Parameterized Task Graph (PTG) approach in PaRSEC is the most scalable approach for many different applications. With it, I then explored the possibility of combining the algorithmic approach of Communication-Avoiding (CA) with the communication-computation overlapping benefits provided by runtime systems, using a 2D five-point stencil as the test case. This broad evaluation and extension of programming models highlighted the ability of task-based runtime systems to achieve scalable performance and portability on contemporary heterogeneous HPC systems. Finally, I summarized the profiling capability of the PaRSEC runtime system and demonstrated with a use case its important role in identifying performance bottlenecks, which leads to optimizations.

University of Tennessee, Knoxville: Trace

RT-CUDA: A Software Tool for CUDA Code Restructuring

KFUPM ePrints

MASSIVELY PARALLEL OIL RESERVOIR SIMULATION FOR HISTORY MATCHING

A language and a system for program optimization

Author: Santos Faria Xavier Teixeira Thiago
Publication date: 01/12/2020

Hardware complexity has increased over time, and as architectures evolve and new ones are adopted, programs must often be altered by numerous optimizations to attain maximum computing power on each target environment. As a result, the code becomes unrecognizable over time, hard to maintain, and challenging to modify. Furthermore, as the code evolves, it is hard to keep the optimizations up to date. The need to develop and maintain separate versions of the application for each target platform is an immense undertaking, especially for the large and long-lived applications commonly found in the high-performance computing (HPC) community. This dissertation presents Locus, a new system and language for optimizing complex, long-lived applications for different platforms. We describe the requirements that we believe are necessary for making automatic performance tuning widely adopted. We present the design and implementation of a system that fulfills these requirements. It includes a domain-specific language that can represent complex collections of transformations, an interface to integrate external modules, and a database to manage platform-specific efficient code. The database allows the system’s users to access optimized code without having to install the code generation toolset. The Locus language allows the definition of a search space combined with the programming of optimization sequences separated from the application’s reference code. Overall, we present an approach for performance portability. Our thesis is that we can ameliorate the difficulty of optimizing applications using a methodology based on optimization programming and automated empirical search. Our system automatically selects, generates, and executes candidate implementations to find the one with the best performance. We present examples to illustrate the power and simplicity of the language.
The experimental evaluation shows that exploring the space of candidate implementations typically leads to better-performing code than that produced by conventional compiler optimizations that are based solely on heuristics. Locus was able to generate a matrix-matrix multiplication code that outperformed the IBM XLC internal hand-optimized version by 2× on the Power 9 processors. On Intel E5, Locus generates code with performance comparable to Intel MKL’s. We also improve performance relative to the reference implementation by up to 4× on stencil computations. Locus's ability to integrate complex search spaces with optimization sequences can result in very complicated optimization programs. The Locus compiler applies optimizations to remove unnecessary search statements from the optimization sequences, making the exploration for faster implementations more accessible. We optimize matrix transpose, matrix-matrix multiplication, fast Fourier transform, symmetric eigenproblem, and sparse matrix-vector multiplication through divide and conquer. We implement three strategies using the Locus language to create search spaces to find the best shapes of the base case and the best ways of subdividing the problem. The search space representation for the divide-and-conquer strategy uses a combination of recursion and OR blocks. The Locus compiler automatically expands the recursion and ensures that the search space is correctly represented. The results showed that the empirical search was important to improve performance by generating faster base cases and finding the best splitting. We also use Locus to optimize large, complex applications. We match the performance of hand-optimized kernels of the Kripke transport code for different input data layouts. The Plascom2 multi-physics application is optimized to find the best way to use a multi-core CPU and GPU.
The use of Tangram, Hydra, and OpenMP provided an interesting search space that improved performance by approximately 4.3× on ZAXPY and ZXDOTY kernels. Lastly, in a similar fashion to how a compiler works, we applied a search space representing a collection of optimization sequences to 856 loops extracted from 16 benchmarks, which resulted in good performance improvements.

Illinois Digital Environment for Access to Learning and Scholarship Repository

Software for Exascale Computing - SPPEXA 2016-2019

Publication venue: Springer Science and Business Media LLC

This open access book summarizes the research done and results obtained in the second funding phase of the Priority Program 1648 "Software for Exascale Computing" (SPPEXA) of the German Research Foundation (DFG) presented at the SPPEXA Symposium in Dresden during October 21-23, 2019. In that respect, it both represents a continuation of Vol. 113 in Springer’s series Lecture Notes in Computational Science and Engineering, the corresponding report of SPPEXA’s first funding phase, and provides an overview of SPPEXA’s contributions towards exascale computing in today's supercomputer technology. The individual chapters address one or more of the research directions (1) computational algorithms, (2) system software, (3) application software, (4) data management and exploration, (5) programming, and (6) software tools. The book has an interdisciplinary appeal: scholars from computational sub-fields in computer science, mathematics, physics, or engineering will find it of particular interest.

OAPEN Library

Uncertainty in Artificial Intelligence: Proceedings of the Thirty-Fourth Conference

Publication venue: AUAI Press
Publication date: 01/09/2018

UCL Discovery