
    Partial Redundancy Elimination for Multi-threaded Programs

    Multi-threaded programs are widely used in many applications, including operating systems. Analyzing a multi-threaded program differs from analyzing a sequential one: many threads execute at the same time, and the effect of all other running threads must be taken into account. Partial redundancy elimination is among the most powerful compiler optimizations: it performs loop-invariant code motion and common subexpression elimination. We present a type system with an optimization component which performs partial redundancy elimination for multi-threaded programs.
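    As a concrete, entirely illustrative sketch of the sequential core of this optimization (all names and values below are invented; the paper's contribution is making this kind of reasoning sound when other threads run concurrently), partial redundancy elimination applied by hand to a tiny C++ function looks like this:

        // Hand-applied PRE on a tiny sequential example (illustrative only).
        // The expression a + b is available on one path but recomputed on all
        // paths afterwards, so its later evaluation is *partially* redundant.
        #include <cstdio>

        int before(int a, int b, bool flag) {
            int x = 0;
            if (flag)
                x = a + b;        // a + b computed on this path only
            int y = a + b;        // recomputed on every path: partially redundant
            return x + y;
        }

        int after(int a, int b, bool flag) {
            int x = 0, t;
            if (flag) {
                t = a + b;        // already computed on this path
                x = t;
            } else {
                t = a + b;        // PRE inserts the computation on the other path...
            }
            int y = t;            // ...so the later evaluation becomes fully
            return x + y;         // redundant and is replaced by a reuse of t
        }

        int main() {
            std::printf("%d %d\n", before(1, 2, true), after(1, 2, true));
        }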

    Profile-guided redundancy elimination

    Program optimisations analyse and transform programs so that better performance can be achieved. Classical optimisations mainly use the static properties of programs to analyse program code and make sure that the optimisations work for every possible combination of the program and the input data. This approach is conservative in those cases where the programs show the same runtime behaviour for most of their execution time. Profile-guided optimisations, on the other hand, use runtime profiling information to discover these common behaviours and exploit optimisation opportunities that are missed by classical, non-profile-guided optimisations. Redundancy elimination is one of the most powerful optimisations in compilers. In this thesis, a new partial redundancy elimination (PRE) algorithm and a new partial dead code elimination (PDE) algorithm are proposed for a profile-guided redundancy elimination framework. During the design and implementation of the algorithms, we address three critical issues: optimality, feasibility and profitability. First, we prove that both our speculative PRE algorithm and our region-based PDE algorithm are optimal for given edge profiling information: the total number of dynamic occurrences of redundant expressions or dead code cannot be further reduced by any other code motion. Moreover, our speculative PRE algorithm is lifetime optimal, which means that the lifetimes of newly introduced temporary variables are minimised. Second, we show that both algorithms are practical and can be efficiently implemented in production compilers. For the SPEC CPU2000 benchmarks, the average compilation overhead for our PRE algorithm is 3%, and the average overhead for our PDE algorithm is less than 2%. Moreover, edge profiling rather than expensive path profiling is sufficient to guarantee the optimality of the algorithms. Finally, we demonstrate that the proposed profile-guided redundancy elimination techniques can provide speedups on real machines by conducting a thorough performance evaluation. To the best of our knowledge, this is the first performance evaluation of profile-guided redundancy elimination techniques on real machines.
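    A minimal sketch of the intuition behind speculative, profile-guided PRE (this is not the thesis's algorithm; the code and profile numbers are invented): with an edge profile, the compiler may compute an expression on a path where it is not strictly needed if doing so lowers the expected number of dynamic evaluations.

        // Speculative PRE guided by an edge profile (illustrative only).
        #include <cstdio>

        // Original: a + b is side-effect free but is not evaluated on *every*
        // path through the loop body, so safe (non-speculative) PRE cannot
        // hoist it out of the loop.
        int original(const int* v, int n, int a, int b) {
            int sum = 0;
            for (int i = 0; i < n; ++i) {
                if (v[i] > 0)          // edge profile: taken on ~99% of iterations
                    sum += a + b;      // evaluated ~0.99 * n times
            }
            return sum;
        }

        // Speculative PRE trusts the profile and hoists the expression anyway:
        // it may be computed "uselessly" when the branch is never taken, but
        // the expected number of dynamic evaluations drops to 1.
        int transformed(const int* v, int n, int a, int b) {
            const int t = a + b;       // speculatively computed once
            int sum = 0;
            for (int i = 0; i < n; ++i) {
                if (v[i] > 0)
                    sum += t;
            }
            return sum;
        }

        int main() {
            const int v[] = {1, 2, -3, 4};
            std::printf("%d %d\n", original(v, 4, 2, 3), transformed(v, 4, 2, 3));
        }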

    AUTOMATING DATA-LAYOUT DECISIONS IN DOMAIN-SPECIFIC LANGUAGES

    A long-standing challenge in High-Performance Computing (HPC) is the simultaneous achievement of programmer productivity and hardware computational efficiency. The challenge has been exacerbated by the onset of multi- and many-core CPUs and accelerators. Only a few expert programmers have been able to hand-code the domain-specific data transformations and vectorization schemes needed to extract the best possible performance on such architectures. In this research, we examined the possibility of automating these methods by developing a Domain-Specific Language (DSL) framework. Our DSL approach extends C++14 by embedding into it a high-level data-parallel array language, and by using a domain-specific compiler to compile to hybrid-parallel code. We also implemented an array index-space transformation algebra within this high-level array language to manipulate array data-layouts and data-distributions. The compiler introduces a novel method for SIMD auto-vectorization based on array data-layouts. Our new auto-vectorization technique is shown to outperform the default auto-vectorization strategy by up to 40% for stencil computations. The compiler also automates distributed data movement, overlapping local computation with remote data movement using polyhedral integer set analysis. Along with these main innovations, we developed a new technique using C++ template metaprogramming for developing embedded DSLs in C++. We also proposed a domain-specific compiler intermediate representation that simplifies data flow analysis of abstract DSL constructs. We evaluated our framework by constructing a DSL for the HPC grand-challenge domain of lattice quantum chromodynamics. Our DSL yielded performance gains of up to twice the flop rate of existing production C code for selected kernels, while using less than one-tenth the lines of code. The performance of this DSL is also competitive with the best hand-optimized and hand-vectorized code, and is an order of magnitude better than existing production DSLs.
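    The following sketch illustrates the basic data-layout idea in plain C++14; the names are invented and it is far simpler than the dissertation's DSL framework, but it shows how template parameters can keep kernels written against a fixed logical index space while the physical layout (array-of-structures versus structure-of-arrays, the latter typically friendlier to SIMD) is swapped underneath.

        // Sketch only: logical index space stays fixed, physical layout varies.
        #include <cstdio>
        #include <vector>

        struct AoSLayout {                 // physical order: site-major
            static int offset(int i, int f, int n, int nf) { return i * nf + f; }
        };
        struct SoALayout {                 // physical order: field-major
            static int offset(int i, int f, int n, int nf) { return f * n + i; }
        };

        template <typename Layout>
        struct Field {
            int n, nf;
            std::vector<double> data;
            Field(int n_, int nf_) : n(n_), nf(nf_), data(n_ * nf_, 0.0) {}
            double& operator()(int i, int f) { return data[Layout::offset(i, f, n, nf)]; }
        };

        template <typename Layout>
        double sum_field0(Field<Layout>& x) {      // kernels are layout-agnostic
            double s = 0.0;
            for (int i = 0; i < x.n; ++i) s += x(i, 0);
            return s;
        }

        int main() {
            Field<AoSLayout> a(8, 3);
            Field<SoALayout> b(8, 3);
            a(2, 0) = b(2, 0) = 1.5;
            std::printf("%f %f\n", sum_field0(a), sum_field0(b));
        }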

    Safe code transformations for speculative execution in real-time systems

    Although compiler optimization techniques are standard and successful in non-real-time systems, if naively applied they can destroy safety guarantees and deadlines in hard real-time systems. For this reason, real-time systems developers have tended to avoid automatic compiler optimization of their code. However, real-time applications in several areas have been growing substantially in size and complexity in recent years. This size and complexity makes it impossible for real-time programmers to write optimal code, and consequently indicates a need for compiler optimization. Recently, researchers have developed or modified analyses and transformations to improve performance without degrading worst-case execution times. Moreover, these optimization techniques can sometimes transform programs which may not meet constraints/deadlines, or which result in timeouts, into deadline-satisfying programs. One such technique, speculative execution, also used for example in parallel computing and databases, can enhance performance by executing parts of the code whose execution may or may not be needed. In some cases, rollback is necessary if the computation turns out to be invalid. However, speculative execution must be applied carefully to real-time systems so that the worst-case execution path is not extended. Deterministic worst-case execution for satisfying hard real-time constraints, and speculative execution with rollback for improving average-case throughput, appear to lie on opposite ends of a spectrum of performance requirements and strategies. Nonetheless, this thesis shows that there are situations in which speculative execution can improve the performance of a hard real-time system, either by enhancing average performance while not affecting the worst case, or by actually decreasing the worst-case execution time. The thesis proposes a set of compiler transformation rules to identify opportunities for speculative execution and to transform the code. Proofs of semantic correctness and timeliness preservation are provided to verify the safety of applying the transformation rules to real-time systems. Moreover, an extensive experiment using simulation of randomly generated real-time programs has been conducted to evaluate the applicability and profitability of speculative execution. The simulation results indicate that speculative execution improves average execution time and program timeliness. Finally, a prototype implementation is described in which these transformations can be evaluated for realistic applications.
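    As a rough, invented illustration of the kind of code motion involved (not one of the thesis's actual transformation rules): when the speculated computation is side-effect free, "rollback" amounts to discarding a value, which is why such motion can be applied without adding undo work to the worst-case path. The benefit comes from overlapping the speculated work with otherwise idle time, for example while the data that decides the guard is still being obtained; the sketch below is sequential purely to show the shape of the transformation.

        // Shape of a speculative transformation (illustrative only).
        #include <cstdio>

        int expensive(int x) { return x * x + 1; }   // side-effect free

        // Original: expensive(x) is evaluated only after the guard is known.
        int original(int x, bool guard) {
            if (guard) return expensive(x);
            return 0;
        }

        // Speculative version: expensive(x) is started before the guard is
        // resolved (in a real system, during slack time or in parallel with
        // obtaining the guard). Because it writes no state, rolling back is
        // simply ignoring the result, so no undo code is added anywhere.
        int speculative(int x, bool guard) {
            int speculated = expensive(x);   // started early
            if (guard) return speculated;    // speculation pays off
            return 0;                        // speculation discarded
        }

        int main() {
            std::printf("%d %d\n", original(3, true), speculative(3, false));
        }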

    Declassification: transforming Java programs to remove intermediate classes

    Computer applications are increasingly being written in object-oriented languages like Java and C++. Object-oriented programming encourages the use of small methods and classes. However, this style of programming introduces considerable overhead, as each method call results in a dynamic dispatch and each field access becomes a pointer dereference to the heap-allocated object. Many of the classes in these programs are included to provide structure rather than to act as reusable code, and can therefore be regarded as intermediate. We have therefore developed an optimisation technique, called declassification, which transforms Java programs into equivalent programs from which these intermediate classes have been removed. The optimisation technique involves two phases, analysis and transformation. The analysis identifies intermediate classes for removal; a suitable class is defined to be a class which is used exactly once within a program. The subsequent transformation eliminates these intermediate classes from the program by inlining the fields and methods of each intermediate class within the enclosing class which uses it. In theory, declassification reduces the number of classes which are instantiated and used in a program during its execution. This should reduce the overhead of object creation and maintenance, as child objects are no longer created, and it should also reduce the number of field accesses and dynamic dispatches required by a program to execute. An important feature of the declassification technique, as opposed to other similar techniques, is that it guarantees there will be no increase in code size. An empirical study was conducted on a number of reasonably sized Java programs, and it was found that very few suitable classes were identified for inlining. The results showed that the declassification technique had a small influence on memory consumption and a negligible influence on the run-time performance of these programs. It is therefore concluded that the declassification technique was not successful in optimising the test programs, but further extensions to this technique, combined with an intrinsically object-oriented set of test programs, could greatly improve its success.
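    A before/after sketch of the transformation (the thesis operates on Java, where the nested object lives on the heap; the example below uses C++ for consistency with the other sketches, and all names are invented): a class used exactly once has its fields and methods folded into the enclosing class.

        // Invented before/after example of "declassification".
        #include <cstdio>

        // Before: Point is an intermediate class used exactly once, inside Circle.
        struct Point {
            double x, y;
            double dist2() const { return x * x + y * y; }
        };
        struct CircleBefore {
            Point centre;                 // in Java this would be a separate heap object
            double radius;
            double centreDist2() const { return centre.dist2(); }
        };

        // After: the single-use class has been removed; its fields and methods
        // are inlined into the enclosing class, so no separate object is created.
        struct CircleAfter {
            double centre_x, centre_y;    // fields of Point, flattened in
            double radius;
            double centreDist2() const { return centre_x * centre_x + centre_y * centre_y; }
        };

        int main() {
            CircleBefore b{{3.0, 4.0}, 1.0};
            CircleAfter  a{3.0, 4.0, 1.0};
            std::printf("%f %f\n", b.centreDist2(), a.centreDist2());
        }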

    Simplification and Run-time Resolution of Data Dependence Constraints for Loop Transformations

    Loop transformations such as tiling, parallelization or vectorization are essential tools in the quest for high-performance program execution. But precise data dependence analysis is required to determine the validity of a loop transformation and whether the compiler can apply it or not. In particular, current static analyses typically fail to provide precise enough dependence information when the code contains indirect memory accesses or even polynomial subscript functions to index arrays. This leads to considering superfluous may-dependences between instructions, in turn preventing many loop transformations from being applied. In this work we present a new framework to overcome several limitations of static dependence analyses and to enable aggressive loop transformations on programs with may-dependences. We statically generate a test to be evaluated at runtime which uses data dependence information to determine whether a program transformation is valid; if so, it triggers the execution of the transformed code, falling back to the original code otherwise. These tests, originally constructed as loop-based code with O(n^(2d)) iterations (d being the maximal loop depth of the program, n being the loop trip count), are reduced to a loop-free test of O(1) complexity thanks to a new quantifier elimination scheme that we introduce in this paper. The precision and low overhead of our method are demonstrated over 25 kernels containing may-alias memory pointers and polynomial memory access subscripts.
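    The general shape of the approach can be sketched as follows; this is hand-written illustrative code, not the paper's generated tests, and the simple pointer-overlap check stands in for the much richer conditions their quantifier elimination scheme produces.

        // A constant-time runtime check guards the transformed loop and falls
        // back to the original code when the check fails (illustrative only).
        #include <cstdint>
        #include <cstdio>
        #include <vector>

        void kernel(double* a, const double* b, int n) {
            // Statically, a and b may alias; at runtime the question is O(1).
            std::uintptr_t pa = reinterpret_cast<std::uintptr_t>(a);
            std::uintptr_t pb = reinterpret_cast<std::uintptr_t>(b);
            std::uintptr_t bytes = static_cast<std::uintptr_t>(n) * sizeof(double);
            bool disjoint = (pa + bytes <= pb) || (pb + bytes <= pa);
            if (disjoint) {
                // Transformed version: no cross-iteration dependences, so the
                // loop may be vectorized (pragma honoured when OpenMP SIMD is on).
                #pragma omp simd
                for (int i = 0; i < n; ++i) a[i] = b[i] + 1.0;
            } else {
                // Fallback: original loop, honouring all may-dependences.
                for (int i = 0; i < n; ++i) a[i] = b[i] + 1.0;
            }
        }

        int main() {
            std::vector<double> x(8, 1.0), y(8, 2.0);
            kernel(x.data(), y.data(), 8);
            std::printf("%f\n", x[0]);
        }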

    Service Abstractions for Scalable Deep Learning Inference at the Edge

    Deep-learning-driven intelligent edge has already become a reality, where millions of mobile, wearable, and IoT devices analyze real-time data and transform it into actionable insights on-device. Typical approaches for optimizing deep learning inference mostly focus on accelerating the execution of individual inference tasks, without considering the contextual correlation unique to edge environments and the statistical nature of learning-based computation. Specifically, they treat inference workloads as individual black boxes and apply canonical system optimization techniques, developed over the last few decades, to handle them as yet another type of computation-intensive application. As a result, deep learning inference on edge devices still faces the ever-increasing challenges of customization to edge device heterogeneity, fuzzy computation redundancy between inference tasks, and end-to-end deployment at scale. In this thesis, we propose the first framework that automates and scales the end-to-end process of deploying efficient deep learning inference from the cloud to heterogeneous edge devices. The framework consists of a series of service abstractions that handle DNN model tailoring, model indexing and query, and computation reuse for runtime inference, respectively. Together, these services bridge the gap between deep learning training and inference, eliminate computation redundancy during inference execution, and further lower the barrier for deep learning algorithm and system co-optimization. To build efficient and scalable services, we take a unique algorithmic approach of harnessing the semantic correlation between learning-based computations. Rather than viewing individual tasks as isolated black boxes, we optimize them collectively in a white-box approach, proposing primitives to formulate the semantics of deep learning workloads, algorithms to assess their hidden correlation (in terms of the input data, the neural network models, and the deployment trials), and techniques to merge common processing steps to minimize redundancy.

    Lightweight Modular Staging and Embedded Compilers: Abstraction without Regret for High-Level High-Performance Programming

    Programs expressed in a high-level programming language need to be translated to a low-level machine dialect for execution. This translation is usually accomplished by a compiler, which is able to translate any legal program to equivalent low-level code. But for individual source programs, automatic translation does not always deliver good results: Software engineering practice demands generalization and abstraction, whereas high performance demands specialization and concretization. These goals are at odds, and compilers can only rarely translate expressive high-level programs to modern hardware platforms in a way that makes best use of the available resources. Explicit program generation is a promising alternative to fully automatic translation. Instead of writing down the program and relying on a compiler for translation, developers write a program generator, which produces a specialized, efficient, low-level program as its output. However, developing high-quality program generators requires a very large effort that is often hard to amortize. In this thesis, we propose a hybrid design: Integrate compilers into programs so that programs can take control of the translation process, but rely on libraries of common compiler functionality for help. We present Lightweight Modular Staging (LMS), a generative programming approach that lowers the development effort significantly. LMS combines program generator logic with the generated code in a single program, using only types to distinguish the two stages of execution. Through extensive use of component technology, LMS makes a reusable and extensible compiler framework available at the library level, allowing programmers to tightly integrate domain-specific abstractions and optimizations into the generation process, with common generic optimizations provided by the framework. Compared to previous work on program generation, a key aspect of our design is the use of staging not only as a front-end, but also as a way to implement internal compiler passes and optimizations, many of which can be combined into powerful joint simplification passes. LMS is well suited to develop embedded domain-specific languages (DSLs) and has been used to develop powerful performance-oriented DSLs for demanding domains such as machine learning, with code generation for heterogeneous platforms including GPUs. LMS has also been used to generate SQL for embedded database queries and JavaScript for web applications.
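    LMS itself is a Scala framework and considerably more sophisticated, but the core idea of using types to separate stages can be loosely illustrated in C++ (all names below are invented and this is only an analogy, not LMS's API): the same generic function either computes a result immediately or, when instantiated at a "staged" type, records the computation so specialized code can be emitted.

        // Loose analogy only: the value's type decides whether an operation is
        // executed now (int) or recorded for code generation (Rep).
        #include <cstdio>
        #include <string>
        #include <utility>

        struct Rep {                                  // a "staged" integer
            std::string code;                         // code generated so far
            Rep(int c) : code(std::to_string(c)) {}   // staged constant
            explicit Rep(std::string s) : code(std::move(s)) {}
        };
        Rep operator+(Rep a, Rep b) { return Rep("(" + a.code + " + " + b.code + ")"); }
        Rep operator*(Rep a, Rep b) { return Rep("(" + a.code + " * " + b.code + ")"); }

        // One generic program text, two stages of execution.
        template <typename T>
        T poly(T x) { return x * x + 3 * x + 1; }

        int main() {
            std::printf("%d\n", poly(5));             // evaluated now: 41
            std::printf("%s\n", poly(Rep(std::string("x"))).code.c_str());
            // prints the recorded expression: (((x * x) + (3 * x)) + 1)
        }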

    Removing and restoring control flow with the Value State Dependence Graph

    This thesis studies the practicality of compiling with only data flow information. Specifically, we focus on the challenges that arise when using the Value State Dependence Graph (VSDG) as an intermediate representation (IR). We perform a detailed survey of IRs in the literature in order to discover trends over time, and we classify them by their features in a taxonomy. We see how the VSDG fits into the IR landscape, and look at the divide between academia and the 'real world' in terms of compiler technology. Since most data flow IRs cannot be constructed for irreducible programs, we perform an empirical study of irreducibility in current versions of open source software, and then compare them with older versions of the same software. We also study machine-generated C code from a variety of different software tools. We show that irreducibility is no longer a problem, and is becoming less so with time. We then address the problem of constructing the VSDG. Since previous approaches in the literature have been poorly documented or ignored altogether, we give our approach to constructing the VSDG from a common IR: the Control Flow Graph. We show how our approach is independent of the source and target language, how it is able to handle unstructured control flow, and how it is able to transform irreducible programs on the fly. Once the VSDG is constructed, we implement Lawrence's proceduralisation algorithm in order to encode an evaluation strategy whilst translating the program into a parallel representation: the Program Dependence Graph. From here, we implement scheduling and then code generation using the LLVM compiler. We compare our compiler framework against several existing compilers, and show how removing control flow with the VSDG and then restoring it later can produce high-quality code. We also examine specific situations where the VSDG can put pressure on existing code generators. Our results show that the VSDG represents a radically different, yet practical, approach to compilation.
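    A rough intuition for what "removing control flow" means (the VSDG itself is a graph-based IR, not source code, so this is only an analogy): branches over side-effect-free computations can be re-expressed as value selections, leaving only data dependences; code generation later decides whether to restore a branch, emit a conditional move, and so on.

        // Analogy only: control flow expressed as data flow plus a selection,
        // in the spirit of the VSDG's gamma (selection) nodes.
        #include <cstdio>

        // Explicit control flow: which expression executes depends on a branch.
        int with_branch(int a, int b, bool p) {
            if (p) return a + b;
            return a - b;
        }

        // Control flow removed: both candidate values are defined purely by
        // data dependences, and p merely selects between them. A code
        // generator may later reintroduce a branch or use a conditional move.
        int as_selection(int a, int b, bool p) {
            int t = a + b;
            int f = a - b;
            return p ? t : f;
        }

        int main() {
            std::printf("%d %d\n", with_branch(4, 2, true), as_selection(4, 2, false));
        }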