    Automatically Optimizing Tree Traversal Algorithms

    Many domains in computer science, from data mining to graphics to computational astrophysics, focus heavily on irregular applications. In contrast to regular applications, which operate over dense matrices and arrays, irregular programs manipulate and traverse complex data structures such as trees and graphs. As irregular applications operate on ever larger datasets, their performance suffers from poor locality and parallelism, and programmers are burdened with the arduous task of manually tuning such applications for better performance. Generally applicable techniques to optimize irregular applications are highly desired, yet scarce. In this dissertation, we argue that for an important subset of irregular programs arising in many domains, namely tree traversal algorithms such as Barnes-Hut, nearest neighbor and ray tracing, there exist general techniques to enhance performance. We investigate two sources of performance improvement: locality enhancement and vectorization. Furthermore, we demonstrate that these techniques can be applied automatically by an optimizing compiler, relieving programmers of manual, error-prone, application-specific effort. Achieving high performance in many applications requires good locality of reference. We propose two novel transformations, point blocking and traversal splicing, inspired by the classic loop-tiling transformation, and show that they can substantially enhance temporal locality in tree traversals. We then present a transformation framework, TreeSplicer, that automatically applies these transformations and uses autotuning techniques to determine appropriate parameters for them. For six benchmark algorithms, we show that a combination of point blocking and traversal splicing can deliver single-thread speedups of up to 8.71x (geometric mean: 2.48x), just from better locality. Modern commodity processors support SIMD instructions, and using these instructions to process multiple traversals at once has the potential to provide substantial performance improvements. Unfortunately, tree algorithms often feature highly divergent traversals, which inhibit efficient SIMD utilization to the point that other, less profitable sources of vectorization must be exploited instead. We propose dynamically reordering traversals based on their past behavior, exploiting the insight that traversals which have behaved similarly so far are likely to continue behaving similarly, and show that this reordering can bring the SIMD utilization of divergent traversals close to ideal. We present a transformation framework, SIMTree, which facilitates vectorization of tree algorithms, and demonstrate speedups of up to 6.59x (geometric mean: 2.78x). Furthermore, our techniques can effectively SIMDize algorithms that prior, manual vectorization attempts could not.
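
    The point-blocking idea can be illustrated with a minimal sketch (an illustration of the general technique under assumed toy types, not TreeSplicer's generated code; Node, Point, visit and the pruning test are hypothetical): rather than letting each point perform its own full traversal, a block of points walks the tree together, so each node is brought into cache once per block instead of once per point. Each point still performs exactly the same node visits as in the per-point traversal; only the interleaving changes.

```cpp
// Minimal point-blocking sketch (illustrative toy model, not TreeSplicer output).
#include <cmath>
#include <cstdio>
#include <memory>
#include <vector>

struct Node {
    double value = 0.0;
    std::unique_ptr<Node> left, right;
};

struct Point {
    double x   = 0.0;
    double acc = 0.0;                    // result accumulated during traversal
};

// Per-point node visit: do some work and decide whether to descend further.
bool visit(const Node& n, Point& p) {
    p.acc += n.value * p.x;              // illustrative per-node work
    return std::fabs(n.value) > 0.1;     // illustrative truncation criterion
}

// Baseline: one depth-first traversal per point (poor temporal locality).
void traverse(const Node* n, Point& p) {
    if (!n || !visit(*n, p)) return;
    traverse(n->left.get(), p);
    traverse(n->right.get(), p);
}

// Point-blocked traversal: all points in the block share each node visit.
void traverse_block(const Node* n, const std::vector<Point*>& block) {
    if (!n || block.empty()) return;
    std::vector<Point*> survivors;       // points that keep descending
    for (Point* p : block)
        if (visit(*n, *p)) survivors.push_back(p);
    traverse_block(n->left.get(), survivors);
    traverse_block(n->right.get(), survivors);
}

int main() {
    Node root;  root.value = 0.5;
    root.left  = std::make_unique<Node>(); root.left->value  = 0.3;
    root.right = std::make_unique<Node>(); root.right->value = 0.05;

    std::vector<Point> pts = {{1.0}, {2.0}, {3.0}};
    std::vector<Point*> block;
    for (Point& p : pts) block.push_back(&p);

    traverse_block(&root, block);        // same visits as per-point traverse()
    for (const Point& p : pts) std::printf("acc = %.3f\n", p.acc);
    return 0;
}
```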

    Compiling Tree Transforms to Operate on Packed Representations

    When written idiomatically in most programming languages, programs that traverse and construct trees operate over pointer-based data structures, using one heap object per leaf and per node. This representation is efficient for random access and for shape-changing modifications, but it can be inefficient for traversals, such as compiler passes, that process most or all of a tree in bulk. In this work we instead compile tree traversals to operate on pointer-free, pre-order serializations of trees. On modern architectures such programs often run significantly faster than their pointer-based counterparts, and they are additionally suited to storage and transmission without requiring marshaling. We present a prototype compiler, Gibbon, that compiles a small first-order, purely functional language sufficient for tree traversals. The compiler transforms this language into an intermediate representation with explicit pointers into input and output buffers for packed data. The key compiler technologies include an effect system for capturing traversal behavior, combined with an algorithm to insert destination cursors. We evaluate our compiler on tree transformations over a real-world dataset of source-code syntax trees. For traversals touching the whole tree, such as maps and folds, packed data allows speedups of over 2x compared to a highly optimized pointer-based baseline.
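
    A minimal sketch of a pointer-free, pre-order packed tree in the spirit described above (the tag values, layout, and helper names are assumptions for illustration, not Gibbon's actual output format): the tree is serialized as tags and payloads in one contiguous buffer, and a fold advances an input cursor through it instead of chasing pointers.

```cpp
// Minimal packed pre-order tree sketch (illustrative layout, not Gibbon output).
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <vector>

enum Tag : uint8_t { LEAF = 0, NODE = 1 };

// Append Leaf(n) in pre-order: tag byte, then an 8-byte payload.
void pack_leaf(std::vector<uint8_t>& buf, int64_t n) {
    buf.push_back(LEAF);
    uint8_t bytes[8];
    std::memcpy(bytes, &n, 8);
    buf.insert(buf.end(), bytes, bytes + 8);
}

// Append Node(l, r) in pre-order: tag byte, then the two subtrees follow.
void pack_node(std::vector<uint8_t>& buf) { buf.push_back(NODE); }

// Fold over the packed tree: returns the sum of all leaves and advances the
// input cursor past the subtree it just consumed.
int64_t sum_packed(const uint8_t* buf, std::size_t& cur) {
    if (buf[cur] == LEAF) {
        int64_t n;
        std::memcpy(&n, buf + cur + 1, 8);
        cur += 9;
        return n;
    }
    cur += 1;                              // skip NODE tag
    int64_t l = sum_packed(buf, cur);      // left subtree is next in the buffer
    int64_t r = sum_packed(buf, cur);      // right subtree follows immediately
    return l + r;
}

int main() {
    // Node(Node(Leaf 1, Leaf 2), Leaf 3), serialized in pre-order.
    std::vector<uint8_t> buf;
    pack_node(buf);
    pack_node(buf);
    pack_leaf(buf, 1);
    pack_leaf(buf, 2);
    pack_leaf(buf, 3);

    std::size_t cur = 0;
    std::printf("sum = %lld\n", (long long)sum_packed(buf.data(), cur));
    return 0;
}
```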

    Simulating Stellar Merger using HPX/Kokkos on A64FX on Supercomputer Fugaku

    The increasing availability of machines relying on non-GPU architectures, such as ARM A64FX, in high-performance computing presents an interesting set of challenges to application developers. In addition to requiring code portability across different parallelization schemes, programs targeting these architectures have to be highly adaptable in terms of compute-kernel sizes to accommodate different execution characteristics for various heterogeneous workloads. In this paper, we demonstrate an approach to code and performance portability that is based entirely on established industry standards. In addition to applying Kokkos as an abstraction over the execution of compute kernels on different heterogeneous execution environments, we show that the use of standard C++ constructs, as exposed by the HPX runtime system, enables excellent code and performance portability, demonstrated with the real-world Octo-Tiger astrophysics application. We report our experience porting Octo-Tiger to the ARM A64FX architecture provided by Stony Brook's Ookami and RIKEN's Supercomputer Fugaku, and compare the resulting performance with that achieved on well-established GPU-oriented HPC machines such as ORNL's Summit, NERSC's Perlmutter and CSCS's Piz Daint. Octo-Tiger scaled well on Supercomputer Fugaku without any major code changes thanks to the abstraction levels provided by HPX and Kokkos. Adding vectorization support for ARM's SVE to Octo-Tiger was trivial thanks to the use of standard C++.
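
    To make the kernel-abstraction idea concrete, here is a minimal, hypothetical Kokkos sketch (this is not Octo-Tiger code): a single parallel_for kernel whose backend (CUDA, HIP, OpenMP host, vectorized A64FX host, ...) is selected by the Kokkos build configuration rather than by the application source.

```cpp
// Minimal Kokkos portability sketch (illustrative only, not Octo-Tiger code).
#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
    Kokkos::initialize(argc, argv);
    {
        const int n = 1 << 20;
        Kokkos::View<double*> x("x", n), y("y", n);

        // One kernel source; the execution space is chosen at build time.
        Kokkos::parallel_for("axpy", n, KOKKOS_LAMBDA(const int i) {
            y(i) += 2.0 * x(i);
        });
        Kokkos::fence();
    }
    Kokkos::finalize();
    return 0;
}
```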

    Characterization of vectorization strategies for recursive algorithms

    A successful architectural trend in parallelism is the emphasis on data parallelism with SIMD hardware. Since SIMD extensions on commodity processors tend to require relatively little extra hardware, executing a SIMD instruction is essentially free from a power perspective, making vector computation an attractive target for parallelism. SIMD instructions are designed to accelerate applications such as motion video, real-time physics and graphics, which perform repetitive operations on large arrays of numbers. While the key idea is to replace several sequential instructions operating on separate data items with a single instruction operating on all of them at once, not every application can be parallelized automatically. Regular applications with dense matrices and arrays are easier to vectorize than irregular applications that involve pointer-based data structures such as trees and graphs, and programmers are burdened with the arduous task of manually tuning such applications for better performance. One such class of applications is recursive programs. While they are not traditional serial instruction sequences, they follow a serialized pattern in their control-flow graph and exhibit dependencies. They can be visualized as directed tree data structures. Vectorizing recursive applications on SIMD hardware cannot be achieved by using existing intrinsics directly because of the nature of these algorithms. In this dissertation, we argue that for an important subset of recursive programs arising in many domains, there exist general techniques to efficiently vectorize these programs for SIMD architectures. Recursive algorithms are very popular in graph problems, tree traversal algorithms, gaming applications, and more. While multi-core and GPU implementations of such algorithms have been explored, methods to execute them efficiently on SIMD vector units such as AVX have not. We investigate techniques for work generation and efficient vectorization to enable vectorization of recursion. We further implement a generic tree model that allows us to guarantee lower bounds on its utilization efficiency.
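
    A minimal sketch of the work-generation idea on a toy recursive computation (an illustrative assumption, not the dissertation's framework): the recursion is first expanded breadth-first into flat work arrays, separating the irregular control flow from the arithmetic, so the leaf-level work becomes a contiguous loop a compiler can auto-vectorize with SIMD/AVX.

```cpp
// Work-generation sketch: recursion flattened into SIMD-friendly work arrays.
#include <cstddef>
#include <cstdio>
#include <utility>
#include <vector>

// Recursive form: split the interval until it is small, then do the leaf work.
double recurse(double lo, double hi) {
    if (hi - lo < 1e-3)
        return (hi - lo) * 0.5 * (lo * lo + hi * hi);      // leaf arithmetic
    double mid = 0.5 * (lo + hi);
    return recurse(lo, mid) + recurse(mid, hi);
}

// Work-generation form: the same splitting, but intervals are materialized
// into arrays first; the final loop over contiguous data is vectorizable.
double vectorizable(double lo, double hi) {
    std::vector<std::pair<double, double>> work = {{lo, hi}};
    std::vector<double> leaf_lo, leaf_hi;                   // SoA leaf storage
    while (!work.empty()) {
        std::vector<std::pair<double, double>> next;
        for (auto [a, b] : work) {
            if (b - a < 1e-3) { leaf_lo.push_back(a); leaf_hi.push_back(b); }
            else {
                double m = 0.5 * (a + b);
                next.push_back({a, m});
                next.push_back({m, b});
            }
        }
        work.swap(next);
    }
    double sum = 0.0;
    for (std::size_t i = 0; i < leaf_lo.size(); ++i)        // SIMD-friendly loop
        sum += (leaf_hi[i] - leaf_lo[i]) * 0.5
             * (leaf_lo[i] * leaf_lo[i] + leaf_hi[i] * leaf_hi[i]);
    return sum;
}

int main() {
    std::printf("recursive: %.6f  work-list: %.6f\n",
                recurse(0.0, 1.0), vectorizable(0.0, 1.0));
    return 0;
}
```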

    Generating optimized Fourier interpolation routines for density functional theory using SPIRAL

    Upsampling of a multi-dimensional dataset is an operation with wide application in image processing and in quantum mechanical calculations using density functional theory. For the small upsampling factors seen in the quantum chemistry code ONETEP, a time-shift based implementation, which shifts samples by a fraction of the original grid spacing to fill in intermediate values using a frequency-domain Fourier property, can be a good choice: readily available, highly optimized multidimensional FFT implementations are leveraged, at the expense of extra passes through the entire working set. In this paper we present an optimized variant of the time-shift based upsampling. Since ONETEP handles threading, we address the memory hierarchy and SIMD vectorization, and focus on problem dimensions relevant to ONETEP. We present a formalization of this operation within the SPIRAL framework and demonstrate auto-generated and auto-tuned interpolation libraries. We compare the performance of our generated code against the previous best implementations using highly optimized FFT libraries (FFTW and MKL), and demonstrate speedups averaging 3x in isolation and of up to 15% within ONETEP.
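
    To make the frequency-domain time-shift property concrete, here is a minimal sketch of 2x upsampling of a periodic, band-limited signal (an illustration only, not the SPIRAL-generated code; a naive O(N^2) DFT stands in for FFTW/MKL, and the function names are assumptions): multiplying bin k by exp(i*2*pi*f_k*tau/N) and transforming back yields the signal sampled at positions shifted by tau, which interleave with the original samples to form the upsampled grid.

```cpp
// Time-shift upsampling sketch via the DFT shift property (naive DFT for demo).
#include <cmath>
#include <complex>
#include <cstdio>
#include <vector>

using cd = std::complex<double>;
static const double PI = std::acos(-1.0);

// Naive O(N^2) DFT; sign = -1 for forward, +1 for inverse (unscaled).
std::vector<cd> dft(const std::vector<cd>& x, int sign) {
    int n = (int)x.size();
    std::vector<cd> y(n);
    for (int k = 0; k < n; ++k) {
        cd acc = 0.0;
        for (int j = 0; j < n; ++j)
            acc += x[j] * std::polar(1.0, sign * 2.0 * PI * k * j / n);
        y[k] = acc;
    }
    return y;
}

// Samples of x shifted by tau grid points: multiply bin k by exp(i*2*pi*f_k*tau/n),
// where f_k is the signed frequency of bin k, then transform back.
std::vector<cd> shifted_samples(const std::vector<cd>& x, double tau) {
    int n = (int)x.size();
    std::vector<cd> X = dft(x, -1);
    for (int k = 0; k < n; ++k) {
        double f = (k <= n / 2) ? k : k - n;   // signed frequency of bin k
        X[k] *= std::polar(1.0, 2.0 * PI * f * tau / n);
    }
    std::vector<cd> y = dft(X, +1);
    for (auto& v : y) v /= n;                  // inverse-DFT scaling
    return y;
}

int main() {
    // Coarse grid: 8 samples of one period of cos(2*pi*t).
    std::vector<cd> x(8);
    for (int j = 0; j < 8; ++j) x[j] = std::cos(2.0 * PI * j / 8.0);

    // Samples at the half-grid positions j + 1/2.
    std::vector<cd> h = shifted_samples(x, 0.5);

    // Interleaving x and h gives the 2x upsampled signal.
    for (int j = 0; j < 8; ++j)
        std::printf("%.4f %.4f\n", x[j].real(), h[j].real());
    return 0;
}
```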

    Inference of Many-Taxon Phylogenies

    Phylogenetic trees are tree topologies that represent the evolutionary history of a set of organisms. In this thesis, we address computational challenges related to the analysis of large-scale datasets with Maximum Likelihood based phylogenetic inference. We approach this using several strategies: reducing memory requirements, reducing running time, and reducing man-hours.