12 research outputs found

    Parallel structurally-symmetric sparse matrix-vector products on multi-core processors

    We consider the problem of developing an efficient multi-threaded implementation of the matrix-vector multiplication algorithm for sparse matrices with structural symmetry. Matrices are stored using the compressed sparse row-column format (CSRC), designed for profiting from the symmetric non-zero pattern observed in global finite element matrices. Unlike classical compressed storage formats, performing the sparse matrix-vector product using the CSRC requires thread-safe access to the destination vector. To avoid race conditions, we have implemented two partitioning strategies. In the first one, each thread allocates an array for storing its contributions, which are later combined in an accumulation step. We analyze how to perform this accumulation in four different ways. The second strategy employs a coloring algorithm for grouping rows that can be concurrently processed by threads. Our results indicate that, although incurring an increase in the working set size, the former approach leads to the best performance improvements for most matrices. Comment: 17 pages, 17 figures, reviewed related work section, fixed typo
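    The first strategy in the abstract above (private per-thread arrays followed by an accumulation step) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the array names `ad`, `ia`, `ja`, `al`, `au` and the function `csrc_spmv_local_buffers` are assumptions, the layout is a simplified CSRC-like one (diagonal, lower-triangle values by row, and the mirrored upper-triangle values), and the accumulation shown is only the most naive of the possible variants.

```python
from concurrent.futures import ThreadPoolExecutor


def csrc_spmv_local_buffers(n, ad, ia, ja, al, au, x, num_threads=2):
    """Compute y = A @ x for a structurally symmetric matrix.

    Assumed layout (hypothetical names): ad holds the diagonal, ia/ja describe
    the strictly lower-triangular pattern by rows, al[k] is the lower entry
    a_ij and au[k] is its mirrored upper entry a_ji.
    """
    def work(rows):
        buf = [0.0] * n  # private destination buffer, so no write races
        for i in rows:
            acc = ad[i] * x[i]
            for k in range(ia[i], ia[i + 1]):
                j = ja[k]
                acc += al[k] * x[j]     # contribution of a_ij to y[i]
                buf[j] += au[k] * x[i]  # contribution of a_ji to y[j]
            buf[i] += acc
        return buf

    # Split the rows into contiguous blocks, one per worker.
    chunk = max(1, (n + num_threads - 1) // num_threads)
    blocks = [range(s, min(s + chunk, n)) for s in range(0, n, chunk)]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        buffers = list(pool.map(work, blocks))

    # Accumulation step: sum the private buffers entry by entry.
    return [sum(col) for col in zip(*buffers)]


# A = [[4, 1, 0],
#      [2, 5, 3],
#      [0, 6, 7]]
ad = [4.0, 5.0, 7.0]
ia = [0, 0, 1, 2]
ja = [0, 1]
al = [2.0, 6.0]  # a_10, a_21
au = [1.0, 3.0]  # a_01, a_12
y = csrc_spmv_local_buffers(3, ad, ia, ja, al, au, [1.0, 2.0, 3.0])
```

    The private buffers multiply the memory needed for the destination vector by the number of workers, which is the working-set increase the abstract mentions.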

    Improving the performance of tensor matrix vector multiplication in quantum chemistry codes.


    Computing the sparse matrix vector product using block-based kernels without zero padding on processors with AVX-512 instructions

    The sparse matrix-vector product (SpMV) is a fundamental operation in many scientific applications from various fields. The High Performance Computing (HPC) community has therefore continuously invested a lot of effort to provide an efficient SpMV kernel on modern CPU architectures. Although it has been shown that block-based kernels help to achieve high performance, they are difficult to use in practice because of the zero padding they require. In the current paper, we propose new kernels using the AVX-512 instruction set, which makes it possible to use a blocking scheme without any zero padding in the matrix memory storage. We describe mask-based sparse matrix formats and their corresponding SpMV kernels highly optimized in assembly language. Considering that the optimal blocking size depends on the matrix, we also provide a method to predict the best kernel to be used utilizing a simple interpolation of results from previous executions. We compare the performance of our approach to that of the Intel MKL CSR kernel and the CSR5 open-source package on a set of standard benchmark matrices. We show that we can achieve significant improvements in many cases, both for sequential and for parallel executions. Finally, we provide the corresponding code in an open source library, called SPC5. Comment: Published in Peer J C
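    To make the idea of a mask-based block format without zero padding concrete, here is a small scalar sketch. It is an assumption-laden illustration and not the SPC5 layout or its kernels: the function names, the 1 x `width` block shape, and the list-based storage are all hypothetical, and the inner loop over mask bits stands in for what a vectorized kernel would do with a masked load.

```python
def to_masked_blocks(dense, width=4):
    """Encode each row as (start column, bitmask) blocks plus packed values.

    Only non-zero entries are appended to `values`, so no zero padding is
    stored; the bitmask records which positions of a block are occupied.
    """
    row_ptr, starts, masks, values = [0], [], [], []
    for row in dense:
        j = 0
        while j < len(row):
            if row[j] == 0:
                j += 1
                continue
            mask = 0
            for b in range(width):
                if j + b < len(row) and row[j + b] != 0:
                    mask |= 1 << b
                    values.append(row[j + b])
            starts.append(j)
            masks.append(mask)
            j += width
        row_ptr.append(len(starts))
    return row_ptr, starts, masks, values


def masked_spmv(row_ptr, starts, masks, values, x, width=4):
    """Compute y = A @ x by expanding each block's packed values via its mask."""
    y = [0] * (len(row_ptr) - 1)
    v = 0  # running index into the packed value array
    for i in range(len(y)):
        for blk in range(row_ptr[i], row_ptr[i + 1]):
            for b in range(width):
                if (masks[blk] >> b) & 1:
                    y[i] += values[v] * x[starts[blk] + b]
                    v += 1
    return y


dense = [
    [1, 0, 2, 0, 0, 3],
    [0, 0, 0, 0, 0, 0],
    [0, 4, 5, 6, 7, 8],
]
row_ptr, starts, masks, values = to_masked_blocks(dense)
y = masked_spmv(row_ptr, starts, masks, values, [1, 2, 3, 4, 5, 6])
```

    In this example the eight non-zeros are stored as exactly eight values, with four small masks carrying the positional information.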

    Implementation of a general algorithm for incompressible and compressible flows within the multi-physics code Kratos and preparation of fluid-structure coupling

    This diploma thesis deals with the implementation of a fluid solver for incompressible and compressible flows within the multi-physics framework Kratos. The presentation of this environment based on the finite element method (FEM) and an introduction to multidisciplinary problems in general are the starting point of this work and make the following steps easier to understand. Originating from the basic conservation equations for mass, momentum and energy, the Euler equations for inviscid flow are derived. In this context some approximations are presented that avoid the solution of the energy equation and allow the use of a general approach for the simulation of incompressible, slightly compressible and barotropic flow. The implementation of the incompressible case is outlined step-by-step: Having discretized the continuous problem, a fractional step scheme is presented in order to uncouple pressure and velocity components by a split of the momentum equation. Emphasis is placed on the nodal implementation using an edge-based data structure. Moreover, the orthogonal subscale stabilization - necessary because of the finite element discretization - is explained very briefly. Subsequently, the solver is extended to the compressible regime, with the respective modifications noted. For validation purposes, numerical examples of incompressible and compressible flows in two and three dimensions round off this first part. In a second step, the implemented flow solver is prepared for the fluid-structure coupling. After presenting solving procedures for multi-disciplinary problems, the arbitrary Lagrangian Eulerian (ALE) formulation is introduced and the conservation equations are modified accordingly. Some preliminary tests are performed, particularly with regard to mesh motion and adjustment of the boundary conditions. Finally, expectations for the envisaged fluid-structure coupling are brought forward.

    SMASH: Co-designing Software Compression and Hardware-Accelerated Indexing for Efficient Sparse Matrix Operations

    Important workloads, such as machine learning and graph analytics applications, heavily involve sparse linear algebra operations. These operations use sparse matrix compression as an effective means to avoid storing zeros and performing unnecessary computation on zero elements. However, compression techniques like Compressed Sparse Row (CSR) that are widely used today introduce significant instruction overhead and expensive pointer-chasing operations to discover the positions of the non-zero elements. In this paper, we identify the discovery of the positions (i.e., indexing) of non-zero elements as a key bottleneck in sparse matrix-based workloads, which greatly reduces the benefits of compression. We propose SMASH, a hardware-software cooperative mechanism that enables highly-efficient indexing and storage of sparse matrices. The key idea of SMASH is to explicitly enable the hardware to recognize and exploit sparsity in data. To this end, we devise a novel software encoding based on a hierarchy of bitmaps. This encoding can be used to efficiently compress any sparse matrix, regardless of the extent and structure of sparsity. At the same time, the bitmap encoding can be directly interpreted by the hardware. We design a lightweight hardware unit, the Bitmap Management Unit (BMU), that buffers and scans the bitmap hierarchy to perform highly-efficient indexing of sparse matrices. SMASH exposes an expressive and rich ISA to communicate with the BMU, which enables its use in accelerating any sparse matrix computation. We demonstrate the benefits of SMASH on four use cases that include sparse matrix kernels and graph analytics applications
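    The bitmap-hierarchy encoding described above can be illustrated in software alone. The sketch below is a simplified two-level version under stated assumptions: the names `encode_bitmaps` and `nonzero_positions`, the `block` and `group` sizes, and the choice to keep both bitmaps in full are all hypothetical, and nothing here models the Bitmap Management Unit or the ISA extensions. It only shows how a top-down scan can skip regions that contain no non-zero elements.

```python
def encode_bitmaps(flat, block=2, group=4):
    """Two-level bitmap encoding of a flattened (row-major) matrix.

    level0 has one bit per block of `block` elements (1 if any is non-zero);
    level1 has one bit per `group` consecutive level0 bits. Only blocks whose
    level0 bit is set are kept in nz_blocks.
    """
    chunks = [flat[s:s + block] for s in range(0, len(flat), block)]
    level0 = [int(any(c)) for c in chunks]
    level1 = [int(any(level0[s:s + group])) for s in range(0, len(level0), group)]
    nz_blocks = [c for c in chunks if any(c)]
    return level1, level0, nz_blocks


def nonzero_positions(level1, level0, nz_blocks, block=2, group=4):
    """Return (flat index, value) pairs, skipping groups whose level1 bit is 0."""
    out, nxt = [], 0  # nxt indexes the next stored non-zero block
    for g, bit1 in enumerate(level1):
        if not bit1:
            continue  # the whole group is zero, so no level0 bits are scanned
        for b in range(g * group, min((g + 1) * group, len(level0))):
            if level0[b]:
                for off, v in enumerate(nz_blocks[nxt]):
                    if v != 0:
                        out.append((b * block + off, v))
                nxt += 1
    return out


flat = [0, 0, 5, 0] + [0] * 12 + [0, 7, 0, 0] + [0, 0, 0, 9]
level1, level0, nz_blocks = encode_bitmaps(flat)
positions = nonzero_positions(level1, level0, nz_blocks)
```

    Here the middle group of eight zeros is represented by a single 0 bit at the upper level and is never inspected at the lower level.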

    Component-based software engineering

    To solve the problems coming with the current software development methodologies, component-based software engineering has caught many researchers' attention recently. In component-based software engineering, a software system is considered as a set of software components assembled together instead of as a set of functions from the traditional perspective. Software components can be bought from third party vendors as off-the-shelf components and be assembled together. Component-based software engineering, though very promising, needs to solve several core issues before it becomes a mature software development strategy. The goal of this dissertation is to establish an infrastructure for component-based software development. The author identifies and studies some of the core issues such as component planning, component building, component assembling, component representation, and component retrieval. A software development process model is developed in this dissertation to emphasize the reuse of existing software components. The software development process model addresses how a software system should be planned and built to maximize the reuse of software components. It conducts domain engineering and application engineering simultaneously to map a software system to a set of existing components in such a way that the development of a software system can reuse the existing software components to the full extent. Besides the planning of software development based on component technology, the migration and integration of legacy systems, most of which are non-component-based systems, to the component-based software systems are studied. A framework and several methodologies are developed to serve as the guidelines of adopting component technology in legacy systems. Component retrieval is also studied in this dissertation. One of the most important issues in component-based software engineering is how to find a software component quickly and accurately in a component repository. A component representation framework is developed in this dissertation to represent software components. Based on the component representation framework, an efficient searching method that combines neural network, information retrieval, and Bayesian inference technology is developed. Finally, a prototype component retrieval system is implemented to demonstrate the correctness and feasibility of the proposed method.

    Performance Optimization With An Integrated View Of Compiler And Application Knowledge

    Compiler optimization is a long-standing research field that enhances program performance with a set of rigorous code analyses and transformations. Traditional compiler optimization focuses on general programs or program structures without much consideration of high-level application operations or data structure knowledge. In this thesis, we claim that an integrated view of the application and compiler is helpful to further improve program performance. Particularly, we study integrated optimization opportunities for three kinds of applications: irregular tree-based query processing systems such as B+ tree, security enhancement such as buffer overflow protection, and tensor/matrix-based linear algebra computation. The performance of B+ tree query processing is important for many applications, such as file systems and databases. Latch-free B+ tree query processing is efficient since the queries are processed in batches without locks. To avoid long latency, the batch size cannot be very large. However, modern processors provide opportunities to process larger batches in parallel with acceptable latency. From studying real-world data, we find that there are many redundant and unnecessary queries, especially when the real-world data is highly skewed. We develop a query sequence transformation framework, Qtrans, to reduce the redundancies in queries by applying classic dataflow analysis to queries. To further confirm the effectiveness, we integrate Qtrans into an existing BSP-based B+ tree query processing system, PALM tree. The evaluations show that the throughput can be improved by up to 16X. Heap overflows are still the most common vulnerabilities in C/C++ programs. Common approaches incur high overhead since they check every memory access. By analyzing dozens of bugs, we find that all heap overflows are related to arrays, so we only need to check array-related memory accesses. We propose Prober to efficiently detect and prevent heap overflows. It contains Prober-Static to identify the array-related allocations and Prober-Dynamic to protect objects at runtime. In this thesis, our contributions lie on the Prober-Static side. The key challenge is to correctly identify the array-related allocations. We propose a hybrid method. Some objects can be identified as array-related (or not) by static analysis. For the remaining ones, we instrument the basic allocation type size statically and then determine the real allocation size at runtime. The evaluations show Prober-Static is effective. Tensor algebra is widely used in many applications, such as machine learning and data analytics. Tensors representing real-world data are usually large and sparse. There are many sparse tensor storage formats, and the kernels differ from format to format. These different kernels make performance optimization for sparse tensor algebra challenging. We propose a tensor algebra domain-specific language and a compiler, called SPACe, to automatically generate kernels for sparse tensor algebra computations. This compiler supports a wide range of sparse tensor formats. To further improve the performance, we integrate data reordering into SPACe to improve data locality. The evaluations show that the code generated by SPACe outperforms state-of-the-art sparse tensor algebra compilers.