
    Automatically Optimizing Tree Traversal Algorithms

    Many domains in computer science, from data mining to graphics to computational astrophysics, focus heavily on irregular applications. In contrast to regular applications, which operate over dense matrices and arrays, irregular programs manipulate and traverse complex data structures like trees and graphs. As irregular applications operate on ever larger datasets, their performance suffers from poor locality and limited parallelism, and programmers are burdened with the arduous task of manually tuning such applications for better performance. Generally applicable techniques to optimize irregular applications are highly desired, yet scarce. In this dissertation, we argue that for an important subset of irregular programs that arises in many domains, namely tree traversal algorithms such as Barnes-Hut, nearest neighbor, and ray tracing, there exist general techniques to enhance performance. We investigate two sources of performance improvement, locality enhancement and vectorization, and demonstrate that these techniques can be applied automatically by an optimizing compiler, relieving programmers of manual, error-prone, application-specific effort.

    Achieving high performance in many applications requires good locality of reference. We propose two novel transformations, point blocking and traversal splicing, inspired by the classic loop tiling transformation, and show that they can substantially enhance temporal locality in tree traversals. We then present a transformation framework called TreeSplicer, which automatically applies these transformations and uses autotuning techniques to determine appropriate parameters for them. For six benchmark algorithms, we show that a combination of point blocking and traversal splicing delivers single-thread speedups of up to 8.71x (geometric mean: 2.48x), purely from better locality.

    Modern commodity processors support SIMD instructions, and using these instructions to process multiple traversals at once has the potential to provide substantial performance improvements. Unfortunately, tree algorithms often feature highly diverging traversals, which inhibit efficient SIMD utilization to the point that other, less profitable sources of vectorization must be exploited instead. We propose a dynamic reordering of traversals, based on the insight that traversals which have behaved similarly so far are likely to behave similarly in the future, and show that this reordering can bring the SIMD utilization of diverging traversals close to ideal. We present a transformation framework, SIMTree, which facilitates vectorization of tree algorithms, and demonstrate speedups of up to 6.59x (geometric mean: 2.78x). Furthermore, our techniques can effectively SIMDize algorithms that prior, manual vectorization attempts could not.
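
    The point-blocking idea can be pictured with a small sketch. The Scala code below is illustrative only: the Node, Point, and visit names are hypothetical, and TreeSplicer itself operates on compiler IR rather than on such an API. Instead of one full tree traversal per point, a block of points moves down the tree together, so each node is reused by the whole block while it is still cache-resident.

        // Minimal sketch of point blocking (hypothetical API, not TreeSplicer's).
        case class Node(value: Double, left: Option[Node], right: Option[Node])
        case class Point(x: Double)

        // Truncation test: does point p need to descend below node n?
        // (An arbitrary stand-in criterion for this sketch.)
        def visit(p: Point, n: Node): Boolean = math.abs(p.x - n.value) < 1.0

        // Baseline: one full tree traversal per point; the whole tree is
        // streamed through the cache once for every point.
        def traverse(p: Point, n: Node): Unit =
          if (visit(p, n)) {
            n.left.foreach(traverse(p, _))
            n.right.foreach(traverse(p, _))
          }

        // Point-blocked: a block of points descends together, so each node
        // is touched once per block rather than once per point.
        def traverseBlocked(block: Seq[Point], n: Node): Unit = {
          val survivors = block.filter(visit(_, n)) // points still active at n
          if (survivors.nonEmpty) {
            n.left.foreach(traverseBlocked(survivors, _))
            n.right.foreach(traverseBlocked(survivors, _))
          }
        }

        def run(points: Seq[Point], root: Node, blockSize: Int): Unit =
          points.grouped(blockSize).foreach(traverseBlocked(_, root))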

    Postcondition-preserving fusion of postorder tree transformations

    Tree transformations are commonly used in applications such as program rewriting in compilers. Using a series of simple transformations to build a more complex system can make the resulting software easier to understand, maintain, and reason about. Fusion strategies for combining such successive tree transformations promote this modularity, whilst mitigating the performance impact of the increased number of tree traversals. However, it is important to ensure that fused transformations still perform their intended tasks. Existing approaches to fusing tree transformations tend to take an informal approach to soundness, or are too restrictive to accommodate the kinds of transformations needed in a compiler. We use postconditions to define a more useful formal notion of successful fusion, namely postcondition-preserving fusion. We also present criteria that are sufficient to ensure postcondition preservation and facilitate modular reasoning about the success of fusion.
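
    A minimal Scala sketch of the idea, on a hypothetical mini-IR (not the paper's formalism): two postorder rewrites are fused into one traversal, and each phase's postcondition is expressed as a node-level predicate that should hold everywhere in the result.

        sealed trait Expr
        case class Lit(n: Int) extends Expr
        case class Add(l: Expr, r: Expr) extends Expr
        case class Mul(l: Expr, r: Expr) extends Expr

        // Generic postorder traversal parameterised by a per-node rewrite.
        def postorder(e: Expr)(rw: Expr => Expr): Expr = e match {
          case Add(l, r) => rw(Add(postorder(l)(rw), postorder(r)(rw)))
          case Mul(l, r) => rw(Mul(postorder(l)(rw), postorder(r)(rw)))
          case leaf      => rw(leaf)
        }

        // Phase 1 folds constant additions; postcondition: no Add of two Lits.
        val foldAdd: Expr => Expr = {
          case Add(Lit(a), Lit(b)) => Lit(a + b)
          case e                   => e
        }
        // Phase 2 removes multiplications by one; postcondition: no Mul by Lit(1).
        val dropMulOne: Expr => Expr = {
          case Mul(e, Lit(1)) => e
          case Mul(Lit(1), e) => e
          case e              => e
        }

        // Fused: a single traversal applying both rewrites at every node.
        def fused(e: Expr): Expr = postorder(e)(foldAdd andThen dropMulOne)

        // A postcondition holds when its node-level predicate holds everywhere.
        def everywhere(e: Expr)(p: Expr => Boolean): Boolean = p(e) && (e match {
          case Add(l, r) => everywhere(l)(p) && everywhere(r)(p)
          case Mul(l, r) => everywhere(l)(p) && everywhere(r)(p)
          case _         => true
        })
        val noConstAdd: Expr => Boolean = {
          case Add(Lit(_), Lit(_)) => false
          case _                   => true
        }
        // Successful fusion here means everywhere(fused(e))(noConstAdd)
        // (and the analogous predicate for dropMulOne) remains true.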

    Recursive tree traversal dependence analysis

    While there has been much work done on analyzing and transforming regular programs that operate over linear arrays and dense matrices, comparatively little has been done to carry these optimizations over to programs that operate over heap-based data structures using pointers. Previous work has shown that point blocking, a technique similar to loop tiling in regular programs, can help increase the temporal locality of repeated tree traversals. Point blocking, however, has only been shown to work on tree traversals where each traversal is fully independent and would allow parallelization, greatly limiting the types of applications that the transformation can be applied to. The purpose of this study is to develop a new framework for analyzing recursive methods that perform traversals over trees, called tree dependence analysis. This analysis translates dependence analysis techniques for regular programs to the irregular space, identifying the structure of dependences within a recursive method that traverses trees. In this study, a dependence test that exploits the dependence structure of such programs is developed, and it is shown to be able to prove the legality of several locality- and parallelism-enhancing transformations, including point blocking. In addition, the analysis is extended with a novel path-dependent, conditional analysis that refines the dependence test and proves the legality of transformations for a wider range of algorithms. These analyses are then used to show that several common algorithms that manipulate trees recursively are amenable to several locality- and parallelism-enhancing transformations. This work shows that classical dependence analysis techniques, which have largely been confined to nested loops over array data structures, can be extended and translated to work for complex, recursive programs that operate over pointer-based data structures.
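
    To make the distinction concrete, here is a hypothetical Scala sketch (not the paper's formalism) contrasting a traversal that only reads the tree, whose instances are independent and therefore candidates for point blocking or parallelization, with one that writes the tree, whose instances carry dependences between successive traversals.

        case class TNode(var value: Double,
                         var left: Option[TNode],
                         var right: Option[TNode])

        // Read-only on the tree: writes only the per-traversal value `best`,
        // so traversals for different query points are independent.
        def nearest(px: Double, n: TNode, best: Double): Double = {
          val b0 = math.min(best, math.abs(px - n.value))
          val b1 = n.left.map(nearest(px, _, b0)).getOrElse(b0)
          n.right.map(nearest(px, _, b1)).getOrElse(b1)
        }

        // Writes tree fields: a later traversal can read a node created by an
        // earlier one, so reordering or blocking traversals is not legal
        // without the kind of analysis the paper develops.
        def insert(px: Double, n: TNode): Unit = {
          if (px < n.value) {
            n.left match {
              case Some(l) => insert(px, l)
              case None    => n.left = Some(TNode(px, None, None))
            }
          } else {
            n.right match {
              case Some(r) => insert(px, r)
              case None    => n.right = Some(TNode(px, None, None))
            }
          }
        }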

    Scheduling Transformation and Dependence Tests for Recursive Programs

    Scheduling transformations reorder the execution of operations in a program to improve locality and/or parallelism. The polyhedral model provides a general framework for performing instance-wise scheduling transformations for regular programs, reordering the iterations of loops that operate over dense arrays through transformations like tiling. There is no analogous framework for recursive programs, despite recent interest in optimizations like tiling and fusion for recursive applications. This paper presents PolyRec, the first general framework for applying scheduling transformations, such as inlining, interchange, and code motion, to nested recursive programs, and for reasoning about their correctness. We describe the phases of PolyRec (representing dynamic instances, applying transformations, and reasoning about correctness) and show that PolyRec is able to apply sophisticated, composed transformations to complex, nested recursive programs and improve performance through enhanced locality.
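
    As a flavour of one such scheduling transformation, the Scala sketch below shows interchange on a toy nested recursion. This is illustrative only: PolyRec itself works instance-wise on a formal representation of dynamic instances, and the names here are hypothetical.

        sealed trait Tree
        case class Leaf(v: Int) extends Tree
        case class Bin(l: Tree, r: Tree) extends Tree

        def sumFor(t: Tree, i: Int): Int = t match {
          case Leaf(v)   => v * i
          case Bin(l, r) => sumFor(l, i) + sumFor(r, i)
        }

        // Before: the outer recursion over `items` re-traverses the tree
        // once per item.
        def before(items: List[Int], t: Tree): Int =
          items.map(i => sumFor(t, i)).sum

        // After interchange: a single tree traversal, with the item loop
        // moved to the leaves; same result, better temporal locality on
        // trees too large for the cache.
        def after(items: List[Int], t: Tree): Int = t match {
          case Leaf(v)   => items.map(v * _).sum
          case Bin(l, r) => after(items, l) + after(items, r)
        }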

    Miniphases: Compilation using Modular and Efficient Tree Transformations

    Production compilers commonly perform dozens of transformations on an intermediate representation. Running those transformations in separate passes harms performance. One approach to recovering performance is to combine transformations by hand in order to reduce the number of passes. Such an approach harms modularity, making the compiler hard to maintain and evolve over the long term, and its performance harder to reason about. This paper describes a methodology that allows a compiler writer to define multiple transformations separately, but fuse them into a single traversal of the intermediate representation when the compiler runs. This approach has been implemented in a compiler for the Scala language. Our performance evaluation indicates that this approach reduces the running time of tree transformations by 35% and shows that this is due to improved cache friendliness. At the same time, the approach reduces total memory consumption by cutting the object tenuring rate by 50%. This approach enables compiler writers to write transformations that are both modular and fast.
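
    A minimal Scala sketch of the miniphase idea follows; it is a simplification (the real implementation in the Scala compiler has per-node-kind prepare and transform hooks), and the mini-IR is hypothetical. Each phase rewrites a single node, so a whole group of phases can share one bottom-up traversal.

        sealed trait Tr
        case class Num(n: Int) extends Tr
        case class Plus(l: Tr, r: Tr) extends Tr

        trait MiniPhase {
          def transformNode(t: Tr): Tr // rewrites one node, not a subtree
        }

        object SimplifyZero extends MiniPhase {
          def transformNode(t: Tr): Tr = t match {
            case Plus(Num(0), r) => r
            case Plus(l, Num(0)) => l
            case other           => other
          }
        }

        object FoldConstants extends MiniPhase {
          def transformNode(t: Tr): Tr = t match {
            case Plus(Num(a), Num(b)) => Num(a + b)
            case other                => other
          }
        }

        // One bottom-up traversal applies every phase at each node,
        // instead of one full traversal per phase.
        def runFused(phases: List[MiniPhase])(t: Tr): Tr = {
          val rebuilt = t match {
            case Plus(l, r) => Plus(runFused(phases)(l), runFused(phases)(r))
            case leaf       => leaf
          }
          phases.foldLeft(rebuilt)((node, phase) => phase.transformNode(node))
        }

        // Usage: runFused(List(FoldConstants, SimplifyZero))(tree) visits
        // the tree once, where running the phases separately would visit twice.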

    Laying Tiles Ornamentally: An approach to structuring container traversals

    Having hardware more capable of parallel execution means that more program scheduling decisions have to be taken to utilize that hardware efficiently. To this end, compilers implement coarse-grained loop transformations in addition to the traditionally used fine-grained instruction reordering. Implementors of embedded domain-specific languages face a difficult choice: translate operations on collections to a low-level language naively, hoping that its optimizer will do the job, or implement their own optimizer as part of the EDSL.

    We take the concept of loop tiling from the imperative world and find its equivalent for recursive functions. We show the construction of a tiled functorial map over containers that can be naively translated to a corresponding nested loop.

    We illustrate the connection between untiled and tiled functorial maps by means of the type-theoretic notion of algebraic ornament. This approach produces a family of container traversals indexed by tile sizes and serves as the basis of a proof that untiled and tiled functorial maps have the same semantics.

    We evaluate our approach by designing a language of tree traversals as a DSL embedded in Haskell which compiles into C code. We use this language to implement tiled and untiled tree traversals, which we benchmark under varying choices of tile sizes and shapes of input trees. For some tree shapes, we show that a tiled tree traversal can be up to 50% faster than an untiled one under a good choice of tile size.
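
    Although the paper's DSL is embedded in Haskell and its equivalence proof uses algebraic ornaments, the operational idea behind a tiled map can be sketched in Scala for consistency with the other examples here (hypothetical names, and only the schedule changes, not the semantics).

        sealed trait T
        case class L(v: Int) extends T
        case class B(l: T, r: T) extends T

        // Untiled functorial map.
        def mapT(f: Int => Int)(t: T): T = t match {
          case L(v)    => L(f(v))
          case B(l, r) => B(mapT(f)(l), mapT(f)(r))
        }

        // Tiled map (assumes tile >= 1): `go` descends at most `tile` levels,
        // so each activation of tileMap handles one bounded-depth tile of the
        // tree, bounding the working set per tile; at a tile boundary a fresh
        // tile is started for the remaining subtree.
        def tileMap(f: Int => Int, tile: Int)(t: T): T = {
          def go(t: T, depth: Int): T = t match {
            case L(v)                    => L(f(v))
            case B(l, r) if depth < tile => B(go(l, depth + 1), go(r, depth + 1))
            case sub                     => tileMap(f, tile)(sub) // new tile
          }
          go(t, 0)
        }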

    æœšă‚’ç”šă„ăŸæ§‹é€ ćŒ–äžŠćˆ—ăƒ—ăƒ­ă‚°ăƒ©ăƒŸăƒłă‚°

    High-level abstractions for parallel programming are still immature. Computations on complicated data structures such as pointer structures are considered irregular algorithms. The general graph structures that irregular algorithms typically deal with are difficult to divide and conquer. Because the divide-and-conquer paradigm is essential for load balancing in parallel algorithms and a key to parallel programming, general graphs are genuinely difficult. Trees, however, lead to divide-and-conquer computations by definition and are sufficiently general and powerful as a programming tool. We therefore deal with abstractions of tree-based computations. Our study started from Matsuzaki's work on tree skeletons. We have improved the usability of tree skeletons by enriching their implementation. Specifically, we have dealt with two issues. First, we implemented a loose coupling between skeletons and data structures and developed a flexible tree skeleton library. Second, we implemented a parallelizer that transforms sequential recursive functions in C into parallel programs that use tree skeletons implicitly. This parallelizer hides the complicated API of tree skeletons and lets programmers use them with little burden. The practicality of tree skeletons, however, remained limited.

    On the basis of observations from this experience with tree skeletons, we deal with two application domains: program analysis and neighborhood computation. In the domain of program analysis, compilers treat input programs as control-flow graphs (CFGs) and perform analysis on CFGs; program analysis is therefore difficult to divide and conquer. To resolve this problem, we developed divide-and-conquer methods for program analysis in a syntax-directed manner on the basis of Rosen's high-level approach. Specifically, we dealt with data-flow analysis based on Tarjan's formalization and value-graph construction based on a functional formalization. In the domain of neighborhood computations, the primary issue is locality: a naive parallel neighborhood computation without locality enhancement causes many cache misses. The divide-and-conquer paradigm is known to be useful for locality enhancement as well. We therefore applied algebraic formalizations and a tree-segmenting technique derived from tree skeletons to the locality enhancement of neighborhood computations.
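
    The shape of a tree skeleton can be sketched in a few lines of Scala (hypothetical names; Matsuzaki's skeleton set also includes maps and upward and downward accumulations, and real implementations distribute segmented trees across processors). The divide-and-conquer structure is written once, and users supply only the leaf and branch operators.

        sealed trait BTree[A]
        case class BLeaf[A](a: A) extends BTree[A]
        case class BNode[A](l: BTree[A], a: A, r: BTree[A]) extends BTree[A]

        // Reduction skeleton: the two recursive calls are independent, so a
        // parallel implementation can evaluate them on different workers.
        def reduceTree[A, B](leaf: A => B, node: (B, A, B) => B)(t: BTree[A]): B =
          t match {
            case BLeaf(a)       => leaf(a)
            case BNode(l, a, r) =>
              node(reduceTree(leaf, node)(l), a, reduceTree(leaf, node)(r))
          }

        // Usage: summing all values without writing any recursion by hand.
        def total(t: BTree[Int]): Int =
          reduceTree[Int, Int](identity, (l, a, r) => l + a + r)(t)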

    Modular reasoning about combining modular compiler phases

    Compilers are large and complex pieces of software, which can be challenging to work with. Modularity has significant benefits in such cases: building a complex system from a series of simpler components can make understanding, maintaining, and reasoning about the resulting software more straightforward. Not only does this modularity aid the compiler developer, but the compiler user benefits too, from a compiler that is more likely to be correct and regularly updated. A good focus for modularity in a compiler lies in the phases that make up the compiler pipeline. Often, compiler phases transform some graph structure in order to perform program rewriting. Techniques for automatically combining such graph transformations aim to promote modularity whilst mitigating the increased performance overheads that can arise from a larger number of separate transformations. Nevertheless, it is important that the effectiveness and correctness of compiler phases are not compromised in favour of modularity or performance; the combined graph transformations must still satisfy the intended outcomes of their individual components.

    Many existing approaches either take an informal approach to soundness or impose conditions that are too restrictive for the kind of graph transformations found in a realistic compiler. Some approaches only allow transformations to be combined if the ensuing transformation will produce identical results; however, certain compiler optimisations behave more effectively in combination, producing a different but better optimised result. Another limitation of some approaches is that, although the compiler phases are intentionally modular, the process of combining them is often tested or reasoned about in a non-modular way, once they have already been combined. Thus, this thesis outlines an approach for modular reasoning about successfully combining modular compiler phases, where success refers to preserving only the truly necessary behaviour of transformations. Focusing on postorder transformations of, first, abstract syntax trees and, then, program expression graphs, the fusion technique of interleaving transformations combines compiler phases, reducing the number of graph traversals required. Postconditions allow compiler developers to encode the behaviour required of a given compiler phase, with preservation of postconditions then a significant part of successful fusion. Building on these ideas, this thesis formalises the idea of postcondition-preserving fusion, and presents criteria that are sufficient to facilitate modular reasoning about the success of fusion.

    Design and implementation of an optimizing type-centric compiler for a high-level language

    Production compilers for programming languages face multiple requirements. They should be correct, as we rely on them to produce correct code. They should be fast, in order to provide a good developer experience. They should also be easy to maintain and evolve. This thesis shows how an expressive, high-level type system can be used to simplify the development of a compiler, and demonstrates this on a compiler for Scala. First, it shows how the expressive types of high-level languages can be used to build internal data structures that provide a statically checked API, ensuring that important properties hold at compile time. Second, it shows how high-level language features can be used to abstract the components of a compiler. We demonstrate this by introducing a type-safe layer on top of the bytecode emission phase, which makes it possible to abstract away the implementation details of the compiler frontend and run the same bytecode emission phase in two different Scala compilers. Third, it presents MiniPhases, a novel way to organize transformation passes in a compiler. MiniPhases impose constraints on the organization of passes that are beneficial for maintainability, performance, and testability. We include a detailed performance evaluation of MiniPhases, which indicates that their speedup is due to improved cache friendliness and to a lower rate of promotion of objects into the old generations of garbage collectors. Finally, we demonstrate how the expressive type system of the language being compiled can be used for static analysis. We present a novel call graph construction algorithm which uses the typing context for context sensitivity. The resulting algorithm is both substantially faster and more precise than existing alternatives. We demonstrate its applicability by extending common subexpression elimination to idempotent expression elimination.
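
    The idea behind idempotent expression elimination can be sketched on a hypothetical let-based mini-IR in Scala. This is not the thesis's algorithm or IR: it ignores scoping and effects ordering, and simply treats an annotated set of functions as idempotent, meaning a repeated identical call may reuse the earlier result even if the callee is not fully pure.

        sealed trait E
        case class Ref(name: String) extends E
        case class Call(fn: String, arg: E) extends E
        case class Let(name: String, rhs: E, body: E) extends E

        // Assumed annotations: calls to these reuse their earlier results.
        val idempotentFns = Set("normalize", "cachedHash")

        // Walk let-bindings, remembering the first binding of each idempotent
        // call and rewriting later identical calls to reuse that binding.
        def eliminate(e: E, seen: Map[E, String]): E = e match {
          case Let(n, rhs, body) =>
            val rhs2 = eliminate(rhs, seen)
            seen.get(rhs2) match {
              case Some(prev) => // same idempotent call bound earlier: reuse it
                Let(n, Ref(prev), eliminate(body, seen))
              case None =>
                val seen2 = rhs2 match {
                  case Call(f, _) if idempotentFns(f) => seen + (rhs2 -> n)
                  case _                              => seen
                }
                Let(n, rhs2, eliminate(body, seen2))
            }
          case Call(f, a) => Call(f, eliminate(a, seen))
          case other      => other
        }

        // e.g.  let a = normalize(x); let b = normalize(x); body
        // becomes let a = normalize(x); let b = a; body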