1,715 research outputs found

    Performance Analysis and Optimization of Sparse Matrix-Vector Multiplication on Modern Multi- and Many-Core Processors

    Full text link
    This paper presents a low-overhead optimizer for the ubiquitous sparse matrix-vector multiplication (SpMV) kernel. Architectural diversity among different processors together with structural diversity among different sparse matrices lead to bottleneck diversity. This justifies an SpMV optimizer that is both matrix- and architecture-adaptive through runtime specialization. To this direction, we present an approach that first identifies the performance bottlenecks of SpMV for a given sparse matrix on the target platform either through profiling or by matrix property inspection, and then selects suitable optimizations to tackle those bottlenecks. Our optimization pool is based on the widely used Compressed Sparse Row (CSR) sparse matrix storage format and has low preprocessing overheads, making our overall approach practical even in cases where fast decision making and optimization setup is required. We evaluate our optimizer on three x86-based computing platforms and demonstrate that it is able to distinguish and appropriately optimize SpMV for the majority of matrices in a representative test suite, leading to significant speedups over the CSR and Inspector-Executor CSR SpMV kernels available in the latest release of the Intel MKL library.Comment: 10 pages, 7 figures, ICPP 201

    A VLSI Approach for Cache Compression in Microprocessor

    Get PDF
    Speed is one of the important issues that generally customers consider for selecting any electronic component in the market. Speed of a microprocessor based system mainly depends on the speed of the microprocessor which in turn depends on the memory access time. Accessing on chip memory takes more time than accessing off-chip memory. Because of these, designers of memory system may find cache compression as an advantageous method to increase speed of a microprocessor based system, as it increases cache capacity and off-chip bandwidth. The However, most past work, and all work on cache compression, has made unsubstantiated assumptions about the performance, power consumption, and area overheads of the proposed compression algorithms and hardware. It is not possible to determine whether compression at levels of the memory hierarchy closest to the processor is beneficial without understanding its costs. Proposed hardware compression algorithms fall into the dictionary-based category, which depend on building a dictionary and using its entries to encode repeated data values. Proposed algorithm has number of novel features like including combining pairs of compressed lines into one cache line and allowing parallel compression of multiple words while using a single dictionary and without degradation in compression ratio

    PReaCH: A Fast Lightweight Reachability Index using Pruning and Contraction Hierarchies

    Full text link
    We develop the data structure PReaCH (for Pruned Reachability Contraction Hierarchies) which supports reachability queries in a directed graph, i.e., it supports queries that ask whether two nodes in the graph are connected by a directed path. PReaCH adapts the contraction hierarchy speedup techniques for shortest path queries to the reachability setting. The resulting approach is surprisingly simple and guarantees linear space and near linear preprocessing time. Orthogonally to that, we improve existing pruning techniques for the search by gathering more information from a single DFS-traversal of the graph. PReaCH-indices significantly outperform previous data structures with comparable preprocessing cost. Methods with faster queries need significantly more preprocessing time in particular for the most difficult instances

    Fast Rendering of Forest Ecosystems with Dynamic Global Illumination

    Get PDF
    Real-time rendering of large-scale, forest ecosystems remains a challenging problem, in that important global illumination effects, such as leaf transparency and inter-object light scattering, are difficult to capture, given tight timing constraints and scenes that typically contain hundreds of millions of primitives. We propose a new lighting model, adapted from a model previously used to light convective clouds and other participating media, together with GPU ray tracing, in order to achieve these global illumination effects while maintaining near real-time performance. The lighting model is based on a lattice-Boltzmann method in which reflectance, transmittance, and absorption parameters are taken from measurements of real plants. The lighting model is solved as a preprocessing step, requires only seconds on a single GPU, and allows dynamic lighting changes at run-time. The ray tracing engine, which runs on one or multiple GPUs, combines multiple acceleration structures to achieve near real-time performance for large, complex scenes. Both the preprocessing step and the ray tracing engine make extensive use of NVIDIA\u27s Compute Unified Device Architecture (CUDA)
    • …
    corecore