    Automatically Optimizing Tree Traversal Algorithms

    Many domains in computer science, from data mining to graphics to computational astrophysics, focus heavily on irregular applications. In contrast to regular applications, which operate over dense matrices and arrays, irregular programs manipulate and traverse complex data structures such as trees and graphs. As irregular applications operate on ever larger datasets, their performance suffers from poor locality and parallelism, and programmers are burdened with the arduous task of manually tuning such applications for better performance. Generally applicable techniques to optimize irregular applications are highly desired, yet scarce. In this dissertation, we argue that for an important subset of irregular programs arising in many domains, namely tree traversal algorithms such as Barnes-Hut, nearest neighbor and ray tracing, there exist general techniques to enhance performance. We investigate two sources of performance improvement: locality enhancement and vectorization. Furthermore, we demonstrate that these techniques can be applied automatically by an optimizing compiler, relieving programmers of manual, error-prone, application-specific effort. Achieving high performance in many applications requires good locality of reference. We propose two novel transformations, point blocking and traversal splicing, inspired by the classic loop-tiling transformation, and show that they can substantially enhance temporal locality in tree traversals. We then present a transformation framework, TreeSplicer, that automatically applies these transformations and uses autotuning techniques to determine appropriate parameters for them. For six benchmark algorithms, we show that a combination of point blocking and traversal splicing can deliver single-thread speedups of up to 8.71x (geometric mean: 2.48x), just from better locality. Modern commodity processors support SIMD instructions, and using these instructions to process multiple traversals at once has the potential to provide substantial performance improvements. Unfortunately, tree algorithms often feature highly divergent traversals, which inhibit efficient SIMD utilization to the point that other, less profitable sources of vectorization must be exploited instead. We propose dynamically reordering traversals based on their past behavior, exploiting the insight that traversals which have behaved similarly so far are likely to continue behaving similarly, and show that this reordering can bring the SIMD utilization of divergent traversals close to ideal. We present a transformation framework, SIMTree, which facilitates vectorization of tree algorithms, and demonstrate speedups of up to 6.59x (geometric mean: 2.78x). Furthermore, our techniques can effectively SIMDize algorithms that prior, manual vectorization attempts could not.
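
    The point-blocking idea can be illustrated with a minimal sketch (an illustration of the general technique under assumed toy types, not TreeSplicer's generated code; Node, Point, visit and the pruning test are hypothetical): rather than letting each point perform its own full traversal, a block of points walks the tree together, so each node is brought into cache once per block instead of once per point. Each point still performs exactly the same node visits as in the per-point traversal; only the interleaving changes.

```cpp
// Minimal point-blocking sketch (illustrative toy model, not TreeSplicer output).
#include <cmath>
#include <cstdio>
#include <memory>
#include <vector>

struct Node {
    double value = 0.0;
    std::unique_ptr<Node> left, right;
};

struct Point {
    double x   = 0.0;
    double acc = 0.0;                    // result accumulated during traversal
};

// Per-point node visit: do some work and decide whether to descend further.
bool visit(const Node& n, Point& p) {
    p.acc += n.value * p.x;              // illustrative per-node work
    return std::fabs(n.value) > 0.1;     // illustrative truncation criterion
}

// Baseline: one depth-first traversal per point (poor temporal locality).
void traverse(const Node* n, Point& p) {
    if (!n || !visit(*n, p)) return;
    traverse(n->left.get(), p);
    traverse(n->right.get(), p);
}

// Point-blocked traversal: all points in the block share each node visit.
void traverse_block(const Node* n, const std::vector<Point*>& block) {
    if (!n || block.empty()) return;
    std::vector<Point*> survivors;       // points that keep descending
    for (Point* p : block)
        if (visit(*n, *p)) survivors.push_back(p);
    traverse_block(n->left.get(), survivors);
    traverse_block(n->right.get(), survivors);
}

int main() {
    Node root;  root.value = 0.5;
    root.left  = std::make_unique<Node>(); root.left->value  = 0.3;
    root.right = std::make_unique<Node>(); root.right->value = 0.05;

    std::vector<Point> pts = {{1.0}, {2.0}, {3.0}};
    std::vector<Point*> block;
    for (Point& p : pts) block.push_back(&p);

    traverse_block(&root, block);        // same visits as per-point traverse()
    for (const Point& p : pts) std::printf("acc = %.3f\n", p.acc);
    return 0;
}
```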

    Compiling Tree Transforms to Operate on Packed Representations

    When written idiomatically in most programming languages, programs that traverse and construct trees operate over pointer-based data structures, using one heap object per leaf and per node. This representation is efficient for random access and for shape-changing modifications, but it can be inefficient for traversals, such as compiler passes, that process most or all of a tree in bulk. In this work we instead compile tree traversals to operate on pointer-free, pre-order serializations of trees. On modern architectures such programs often run significantly faster than their pointer-based counterparts, and they are additionally suited to storage and transmission without requiring marshaling. We present a prototype compiler, Gibbon, that compiles a small first-order, purely functional language sufficient for tree traversals. The compiler transforms this language into an intermediate representation with explicit pointers into input and output buffers for packed data. The key compiler technologies include an effect system for capturing traversal behavior, combined with an algorithm to insert destination cursors. We evaluate our compiler on tree transformations over a real-world dataset of source-code syntax trees. For traversals touching the whole tree, such as maps and folds, packed data allows speedups of over 2x compared to a highly optimized pointer-based baseline.
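
    A minimal sketch of a pointer-free, pre-order packed tree in the spirit described above (the tag values, layout, and helper names are assumptions for illustration, not Gibbon's actual output format): the tree is serialized as tags and payloads in one contiguous buffer, and a fold advances an input cursor through it instead of chasing pointers.

```cpp
// Minimal packed pre-order tree sketch (illustrative layout, not Gibbon output).
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <vector>

enum Tag : uint8_t { LEAF = 0, NODE = 1 };

// Append Leaf(n) in pre-order: tag byte, then an 8-byte payload.
void pack_leaf(std::vector<uint8_t>& buf, int64_t n) {
    buf.push_back(LEAF);
    uint8_t bytes[8];
    std::memcpy(bytes, &n, 8);
    buf.insert(buf.end(), bytes, bytes + 8);
}

// Append Node(l, r) in pre-order: tag byte, then the two subtrees follow.
void pack_node(std::vector<uint8_t>& buf) { buf.push_back(NODE); }

// Fold over the packed tree: returns the sum of all leaves and advances the
// input cursor past the subtree it just consumed.
int64_t sum_packed(const uint8_t* buf, std::size_t& cur) {
    if (buf[cur] == LEAF) {
        int64_t n;
        std::memcpy(&n, buf + cur + 1, 8);
        cur += 9;
        return n;
    }
    cur += 1;                              // skip NODE tag
    int64_t l = sum_packed(buf, cur);      // left subtree is next in the buffer
    int64_t r = sum_packed(buf, cur);      // right subtree follows immediately
    return l + r;
}

int main() {
    // Node(Node(Leaf 1, Leaf 2), Leaf 3), serialized in pre-order.
    std::vector<uint8_t> buf;
    pack_node(buf);
    pack_node(buf);
    pack_leaf(buf, 1);
    pack_leaf(buf, 2);
    pack_leaf(buf, 3);

    std::size_t cur = 0;
    std::printf("sum = %lld\n", (long long)sum_packed(buf.data(), cur));
    return 0;
}
```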

    Simulating Stellar Merger using HPX/Kokkos on A64FX on Supercomputer Fugaku

    The increasing availability of machines relying on non-GPU architectures, such as ARM A64FX, in high-performance computing presents an interesting set of challenges to application developers. In addition to requiring code portability across different parallelization schemes, programs targeting these architectures have to be highly adaptable in terms of compute-kernel sizes to accommodate different execution characteristics for various heterogeneous workloads. In this paper, we demonstrate an approach to code and performance portability that is based entirely on established industry standards. In addition to applying Kokkos as an abstraction over the execution of compute kernels on different heterogeneous execution environments, we show that the use of standard C++ constructs, as exposed by the HPX runtime system, enables excellent code and performance portability, demonstrated with the real-world Octo-Tiger astrophysics application. We report our experience porting Octo-Tiger to the ARM A64FX architecture provided by Stony Brook's Ookami and RIKEN's Supercomputer Fugaku, and compare the resulting performance with that achieved on well-established GPU-oriented HPC machines such as ORNL's Summit, NERSC's Perlmutter and CSCS's Piz Daint. Octo-Tiger scaled well on Supercomputer Fugaku without any major code changes thanks to the abstraction levels provided by HPX and Kokkos. Adding vectorization support for ARM's SVE to Octo-Tiger was trivial thanks to the use of standard C++.
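
    To make the kernel-abstraction idea concrete, here is a minimal, hypothetical Kokkos sketch (this is not Octo-Tiger code): a single parallel_for kernel whose backend (CUDA, HIP, OpenMP host, vectorized A64FX host, ...) is selected by the Kokkos build configuration rather than by the application source.

```cpp
// Minimal Kokkos portability sketch (illustrative only, not Octo-Tiger code).
#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
    Kokkos::initialize(argc, argv);
    {
        const int n = 1 << 20;
        Kokkos::View<double*> x("x", n), y("y", n);

        // One kernel source; the execution space is chosen at build time.
        Kokkos::parallel_for("axpy", n, KOKKOS_LAMBDA(const int i) {
            y(i) += 2.0 * x(i);
        });
        Kokkos::fence();
    }
    Kokkos::finalize();
    return 0;
}
```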

    Characterization of vectorization strategies for recursive algorithms

    A successful architectural trend in parallelism is the emphasis on data parallelism with SIMD hardware. Since SIMD extensions on commodity processors tend to require relatively little extra hardware, executing a SIMD instruction is essentially free from a power perspective, making vector computation an attractive target for parallelism. SIMD instructions are designed to accelerate applications such as motion video, real-time physics and graphics, which perform repetitive operations on large arrays of numbers. While the key idea is to replace several sequential instructions operating on separate data items with a single instruction operating on all of them at once, not every application can be parallelized automatically. Regular applications with dense matrices and arrays are easier to vectorize than irregular applications that involve pointer-based data structures such as trees and graphs, and programmers are burdened with the arduous task of manually tuning such applications for better performance. One such class of applications is recursive programs. While they are not traditional serial instruction sequences, they follow a serialized pattern in their control-flow graph and exhibit dependencies. They can be visualized as directed tree data structures. Vectorizing recursive applications on SIMD hardware cannot be achieved by using existing intrinsics directly because of the nature of these algorithms. In this dissertation, we argue that for an important subset of recursive programs arising in many domains, there exist general techniques to efficiently vectorize these programs for SIMD architectures. Recursive algorithms are very popular in graph problems, tree traversal algorithms, gaming applications, and more. While multi-core and GPU implementations of such algorithms have been explored, methods to execute them efficiently on SIMD vector units such as AVX have not. We investigate techniques for work generation and efficient vectorization to enable vectorization of recursion. We further implement a generic tree model that allows us to guarantee lower bounds on its utilization efficiency.
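
    A minimal sketch of the work-generation idea on a toy recursive computation (an illustrative assumption, not the dissertation's framework): the recursion is first expanded breadth-first into flat work arrays, separating the irregular control flow from the arithmetic, so the leaf-level work becomes a contiguous loop a compiler can auto-vectorize with SIMD/AVX.

```cpp
// Work-generation sketch: recursion flattened into SIMD-friendly work arrays.
#include <cstddef>
#include <cstdio>
#include <utility>
#include <vector>

// Recursive form: split the interval until it is small, then do the leaf work.
double recurse(double lo, double hi) {
    if (hi - lo < 1e-3)
        return (hi - lo) * 0.5 * (lo * lo + hi * hi);      // leaf arithmetic
    double mid = 0.5 * (lo + hi);
    return recurse(lo, mid) + recurse(mid, hi);
}

// Work-generation form: the same splitting, but intervals are materialized
// into arrays first; the final loop over contiguous data is vectorizable.
double vectorizable(double lo, double hi) {
    std::vector<std::pair<double, double>> work = {{lo, hi}};
    std::vector<double> leaf_lo, leaf_hi;                   // SoA leaf storage
    while (!work.empty()) {
        std::vector<std::pair<double, double>> next;
        for (auto [a, b] : work) {
            if (b - a < 1e-3) { leaf_lo.push_back(a); leaf_hi.push_back(b); }
            else {
                double m = 0.5 * (a + b);
                next.push_back({a, m});
                next.push_back({m, b});
            }
        }
        work.swap(next);
    }
    double sum = 0.0;
    for (std::size_t i = 0; i < leaf_lo.size(); ++i)        // SIMD-friendly loop
        sum += (leaf_hi[i] - leaf_lo[i]) * 0.5
             * (leaf_lo[i] * leaf_lo[i] + leaf_hi[i] * leaf_hi[i]);
    return sum;
}

int main() {
    std::printf("recursive: %.6f  work-list: %.6f\n",
                recurse(0.0, 1.0), vectorizable(0.0, 1.0));
    return 0;
}
```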

    Generating optimized Fourier interpolation routines for density functional theory using SPIRAL

    Upsampling of a multi-dimensional dataset is an operation with wide application in image processing and in quantum mechanical calculations using density functional theory. For the small upsampling factors seen in the quantum chemistry code ONETEP, a time-shift based implementation, which shifts samples by a fraction of the original grid spacing to fill in intermediate values using a frequency-domain Fourier property, can be a good choice: readily available, highly optimized multidimensional FFT implementations are leveraged, at the expense of extra passes through the entire working set. In this paper we present an optimized variant of the time-shift based upsampling. Since ONETEP handles threading, we address the memory hierarchy and SIMD vectorization, and focus on problem dimensions relevant to ONETEP. We present a formalization of this operation within the SPIRAL framework and demonstrate auto-generated and auto-tuned interpolation libraries. We compare the performance of our generated code against the previous best implementations using highly optimized FFT libraries (FFTW and MKL), and demonstrate speedups averaging 3x in isolation and of up to 15% within ONETEP.
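
    To make the frequency-domain time-shift property concrete, here is a minimal sketch of 2x upsampling of a periodic, band-limited signal (an illustration only, not the SPIRAL-generated code; a naive O(N^2) DFT stands in for FFTW/MKL, and the function names are assumptions): multiplying bin k by exp(i*2*pi*f_k*tau/N) and transforming back yields the signal sampled at positions shifted by tau, which interleave with the original samples to form the upsampled grid.

```cpp
// Time-shift upsampling sketch via the DFT shift property (naive DFT for demo).
#include <cmath>
#include <complex>
#include <cstdio>
#include <vector>

using cd = std::complex<double>;
static const double PI = std::acos(-1.0);

// Naive O(N^2) DFT; sign = -1 for forward, +1 for inverse (unscaled).
std::vector<cd> dft(const std::vector<cd>& x, int sign) {
    int n = (int)x.size();
    std::vector<cd> y(n);
    for (int k = 0; k < n; ++k) {
        cd acc = 0.0;
        for (int j = 0; j < n; ++j)
            acc += x[j] * std::polar(1.0, sign * 2.0 * PI * k * j / n);
        y[k] = acc;
    }
    return y;
}

// Samples of x shifted by tau grid points: multiply bin k by exp(i*2*pi*f_k*tau/n),
// where f_k is the signed frequency of bin k, then transform back.
std::vector<cd> shifted_samples(const std::vector<cd>& x, double tau) {
    int n = (int)x.size();
    std::vector<cd> X = dft(x, -1);
    for (int k = 0; k < n; ++k) {
        double f = (k <= n / 2) ? k : k - n;   // signed frequency of bin k
        X[k] *= std::polar(1.0, 2.0 * PI * f * tau / n);
    }
    std::vector<cd> y = dft(X, +1);
    for (auto& v : y) v /= n;                  // inverse-DFT scaling
    return y;
}

int main() {
    // Coarse grid: 8 samples of one period of cos(2*pi*t).
    std::vector<cd> x(8);
    for (int j = 0; j < 8; ++j) x[j] = std::cos(2.0 * PI * j / 8.0);

    // Samples at the half-grid positions j + 1/2.
    std::vector<cd> h = shifted_samples(x, 0.5);

    // Interleaving x and h gives the 2x upsampled signal.
    for (int j = 0; j < 8; ++j)
        std::printf("%.4f %.4f\n", x[j].real(), h[j].real());
    return 0;
}
```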

    Inference of Many-Taxon Phylogenies

    Phylogenetic trees are tree topologies that represent the evolutionary history of a set of organisms. In this thesis, we address computational challenges related to the analysis of large-scale datasets with Maximum Likelihood based phylogenetic inference. We approach this using several strategies: reducing memory requirements, reducing running time, and reducing man-hours.