    Future value based single assignment program representations and optimizations

    An optimizing compiler's internal representation fundamentally affects the clarity, efficiency and feasibility of the optimization algorithms the compiler employs. Static Single Assignment (SSA), the state-of-the-art program representation, has great advantages but can still be improved. This dissertation explores the domain of single assignment beyond SSA and presents two novel program representations: Future Gated Single Assignment (FGSA) and Recursive Future Predicated Form (RFPF). Both FGSA and RFPF embed control flow and data flow information, enabling efficient traversal of program information and thus leading to better and simpler optimizations. We introduce the future value concept, the design basis of both FGSA and RFPF, which permits a consumer instruction to be encountered before the producer of its source operand(s) in a control flow setting. We show that FGSA is efficiently computable using a series of T1/T2/TR transformations, yielding an expected linear time algorithm that combines the construction of the pruned single assignment form with liveness analysis for both reducible and irreducible graphs. The approach yields an average reduction of 7.7%, with a maximum of 67%, in the number of gating functions compared to the pruned SSA form on the SPEC2000 benchmark suite. We present a solid and near-optimal framework for performing the inverse transformation from single assignment programs. We demonstrate the importance of unrestricted code motion and present RFPF. We develop algorithms that enable instruction movement in acyclic as well as cyclic regions, and show the ease of performing optimizations such as Partial Redundancy Elimination on RFPF.
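The T1/T2 transformations named in the abstract above are the classical interval-style graph reductions: T1 deletes a self-loop, and T2 merges a node into its unique predecessor. The sketch below illustrates only those two textbook rules as a reducibility test. It does not model the dissertation's TR transformation or the FGSA construction itself, and the function name and graph encoding are assumptions made for illustration. It also assumes every node appears as a key of the successor map and is reachable from the entry.

```python
def collapses_to_single_node(succ, entry):
    """Apply T1 and T2 until neither applies; report whether one node remains.

    succ maps each node to the set of its successors (hypothetical encoding).
    A flow graph is reducible exactly when T1/T2 collapse it to a single node.
    """
    graph = {node: set(targets) for node, targets in succ.items()}
    changed = True
    while changed:
        changed = False
        # T1: remove every self-loop.
        for node in graph:
            if node in graph[node]:
                graph[node].discard(node)
                changed = True
        # Recompute predecessors for the current graph.
        preds = {node: set() for node in graph}
        for node, targets in graph.items():
            for target in targets:
                preds[target].add(node)
        # T2: merge one non-entry node that has a unique predecessor.
        for node in list(graph):
            if node != entry and len(preds[node]) == 1:
                (parent,) = preds[node]
                graph[parent].discard(node)
                graph[parent] |= graph.pop(node)
                changed = True
                break  # predecessor sets are stale after a merge
    return len(graph) == 1


# A loop around an if-then-else (reducible) and a two-entry cycle (irreducible).
reducible_cfg = {"A": {"B", "C"}, "B": {"D"}, "C": {"D"}, "D": {"A"}}
irreducible_cfg = {"A": {"B", "C"}, "B": {"C"}, "C": {"B"}}
```

On the second graph neither rule applies after the first pass, because both `B` and `C` have two predecessors. Presumably that is the situation an extra transformation such as TR has to handle, although the abstract does not say how.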

    Profile-guided redundancy elimination

    Program optimisations analyse and transform programs so that better performance can be achieved. Classical optimisations mainly use the static properties of programs to analyse program code and make sure that the optimisations work for every possible combination of the program and the input data. This approach is conservative in those cases when the programs show the same runtime behaviours for most of their execution time. Profile-guided optimisations, on the other hand, use runtime profiling information to discover these common behaviours and explore optimisation opportunities that are missed by classical, non-profile-guided optimisations. Redundancy elimination is one of the most powerful optimisations in compilers. In this thesis, a new partial redundancy elimination (PRE) algorithm and a partial dead code elimination (PDE) algorithm are proposed for a profile-guided redundancy elimination framework. During the design and implementation of the algorithms, we address three critical issues: optimality, feasibility and profitability. First, we prove that both our speculative PRE algorithm and our region-based PDE algorithm are optimal for given edge profiling information: the total number of dynamic occurrences of redundant expressions or dead code cannot be further reduced by any other code motion. Moreover, our speculative PRE algorithm is lifetime optimal, which means that the lifetimes of newly introduced temporary variables are minimised. Second, we show that both algorithms are practical and can be efficiently implemented in production compilers. For SPEC CPU2000 benchmarks, the average compilation overhead for our PRE algorithm is 3%, and the average overhead for our PDE algorithm is less than 2%. Moreover, edge profiling rather than expensive path profiling is sufficient to guarantee the optimality of the algorithms.
Finally, we demonstrate that the proposed profile-guided redundancy elimination techniques can provide speedups on real machines by conducting a thorough performance evaluation. To the best of our knowledge, this is the first performance evaluation of profile-guided redundancy elimination techniques on real machines.
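To make the idea of speculative, profile-dependent redundancy elimination concrete, the sketch below counts dynamic evaluations of a loop-invariant expression before and after hoisting it out of a guarded position. It is an illustration only, not the thesis's algorithm: the function names, the `counter` argument and the example inputs are assumptions, and the multiplication stands in for any expression that cannot fault when executed speculatively.

```python
def original(values, a, b, counter):
    """Evaluate a * b only on iterations where the guard holds."""
    total = 0
    for v in values:
        if v > 0:
            counter[0] += 1  # one dynamic evaluation of a * b
            total += a * b
    return total


def speculative(values, a, b, counter):
    """Hoist a * b above the loop; it now runs even if the guard never holds."""
    counter[0] += 1  # the single, speculatively placed evaluation
    t = a * b
    total = 0
    for v in values:
        if v > 0:
            total += t
    return total


hot_input = [1, 2, -1, 3]  # guard mostly true: hoisting removes evaluations
cold_input = [-1]          # guard never true: hoisting adds one evaluation
```

Both versions return the same result, but the evaluation counts differ with the input. That is why a transformation of this kind needs profile information to decide whether the move pays off.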

    IR-Level Versus Machine-Level If-Conversion for Predicated Architectures

    If-conversion is a simple yet powerful optimization that converts control dependences into data dependences. It allows elimination of branches and increases available instruction level parallelism and thus overall performance. If-conversion can either be applied alone or in combination with other techniques that increase the size of scheduling regions. The presence of hardware support for predicated execution allows if-conversion to be broadly applied in a given program. This makes it necessary to guide the optimization using heuristic estimates regarding its potential benefit. Similar to other transformations in an optimizing compiler, if-conversion inherently suffers from phase ordering issues. Driven by these facts, we developed two algorithms for if-conversion targeting the TI TMS320C64x+ architecture within the LLVM framework. Each implementation targets a different level of code abstraction. While one targets the intermediate representation, the other addresses machine-level code. Both make use of an adapted set of estimation heuristics and prove to be successful in general, but each one exhibits different strengths and weaknesses. High-level if-conversion, applied before other control flow transformations, has more freedom to operate. But in contrast to its machine-level counterpart, which is more restricted, its estimations of runtime are less accurate. Our results from experimental evaluation show a mean speedup close to 14% for both algorithms on a set of programs from the MiBench and DSPstone benchmark suites. We give a comparison of the implemented optimizations and discuss gained insights on the topics of if-conversion, phase ordering issues and profitability analysis.
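The conversion of a control dependence into a data dependence can be shown on a small diamond-shaped branch. The sketch below models only the dataflow shape of the transformation in Python; it is not the paper's LLVM implementation, and the function names are made up for illustration. Both arms are computed unconditionally and a predicate selects the result, which is roughly what predicated instructions or a select operation would do on real hardware.

```python
def abs_diff_branchy(a, b):
    """Original form: the result depends on which branch is taken."""
    if a > b:
        r = a - b
    else:
        r = b - a
    return r


def abs_diff_if_converted(a, b):
    """If-converted form: no branch, the predicate feeds a data-dependent select."""
    p = int(a > b)   # predicate definition (1 or 0)
    t1 = a - b       # would be guarded by p on a predicated machine
    t2 = b - a       # would be guarded by not-p
    return p * t1 + (1 - p) * t2  # arithmetic stand-in for a select
```

The converted version always executes both subtractions. That extra work is the reason the abstract stresses profitability heuristics: if-conversion pays off only when the removed branch costs more than the additional instructions.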

    Algebraic aggregation of random forests

    Random Forests are one of the most popular classifiers in machine learning. The larger they are, the more precise the outcome of their predictions. However, this comes at a cost: it is increasingly difficult to understand why a Random Forest made a specific choice, and its running time for classification grows linearly with the size (number of trees). In this paper, we propose a method to aggregate large Random Forests into a single, semantically equivalent decision diagram, which has the following two effects: (1) minimal, sufficient explanations for Random Forest-based classifications can be obtained by means of a simple three-step reduction, and (2) the running time is radically improved. In fact, our experiments on various popular datasets show speed-ups of several orders of magnitude while, at the same time, also significantly reducing the size of the required data structure.

    FPGA acceleration of a quantized neural network for remote-sensed cloud detection

    The capture and transmission of remote-sensed imagery for Earth observation is both computationally and bandwidth expensive. In the analyses of remote-sensed imagery in the visual band, atmospheric cloud cover can obstruct up to two-thirds of observations, resulting in costly imagery being discarded. Mission objectives and satellite operational details vary; however, assuming a cloud-free observation requirement, a doubling of useful data downlinked with an associated halving of delivery cost is possible through effective cloud detection. A minimal-resource, real-time inference neural network is ideally suited to perform automatic cloud detection, both for pre-processing captured images prior to transmission and preventing unnecessary images being taken by larger payload sensors. Much of the hardware complexity of modern neural network implementations resides in high-precision floating-point calculation pipelines. In recent years, research has been conducted in identifying quantized, or low-integer precision equivalents to known deep learning models, which do not require the extensive resources of their floating-point, full-precision counterparts. Our work leverages existing research on binary and quantized neural networks to develop a real-time, remote-sensed cloud detection solution using a commodity field-programmable gate array. This follows on developments of the Forwards Looking Imager for predictive cloud detection developed by Craft Prospect, a space engineering practice based in Glasgow, UK. The synthesized cloud detection accelerator achieved an inference throughput of 358.1 images per second with a maximum power consumption of 2.4 W. 
This throughput is an order of magnitude faster than alternate algorithmic options for the Forwards Looking Imager, at the cost of around a one-third reduction in classification accuracy, and approximately two orders of magnitude faster than the CloudScout deep neural network, deployed with HyperScout 2 on the European Space Agency PhiSat-1 mission. Strategies for incorporating fault tolerance mechanisms are expounded.

    Mitigating the Effect of Misspeculations in Superscalar Processors

    Modern superscalar processors rely heavily on speculative execution, which executes instructions speculatively and verifies them afterwards. If the prediction differs from the execution result, a misspeculation recovery is performed. Misspeculation recovery penalties still account for a substantial amount of performance reduction. This work focuses on techniques to mitigate the effect of recovery penalties and proposes practical mechanisms which are thoroughly implemented and analyzed. In general, we can divide the misspeculation penalty into four parts: misspeculation detection delay, stale instruction elimination delay, state restoration delay and pipeline fill delay. This dissertation does not consider the detection delay; instead, we design four innovative mechanisms. Some of these mechanisms target a specific recovery delay whereas others target multiple types of delay in a unified algorithm. Mower was designed to address the stale instruction elimination delay and the state restoration delay by using a special walker. When a misprediction is detected, the walker scans and repairs the instructions which are younger than the mispredicted instruction. During the walking procedure, the correct state is restored and the stale instructions are eliminated. Based on Mower, we further simplify the design and develop a Two-Phase recovery mechanism. This mechanism uses only a basic recovery mechanism except for the case in which the retire stage is stalled by a long latency instruction. When the retire stage is stalled, the second phase is launched and the instructions in the pipeline are re-fetched. The Two-Phase mechanism recovers from an earlier point in the program and overlaps the recovery penalty with the long latency penalty. In reality, some of the instructions on the wrong path can be reused during the recovery. However, such reuse of misprediction results is not easy and most of the time involves significant complexity.
We design Passing Loop to reduce the pipeline fill delay. We apply the mechanism only to short forward branches, which eliminates a substantial amount of complexity. To address memory dependence speculation and the associated delays due to memory ordering violations, we develop a mechanism that optimizes store-queue-free architectures. A store-queue-free architecture experiences more memory dependence mispredictions due to its aggressive approach to speculation. A common solution is to delay the execution of an instruction which is more likely to be mispredicted. We propose a mechanism that dynamically inserts predicates for comparing the addresses of memory instructions, called “Dynamic Memory Dependence Predication” (DMDP). This mechanism boosts the instruction execution to its earliest point and reduces the number of mispredictions.

    Model-driven Code Optimization

    Although code optimizations have been applied by compilers for over 40 years, much of the research has been devoted to the development of particular optimizations. Certain problems with the application of optimizations have yet to be addressed, including when, where and in what order to apply optimizations to get the most benefit. A number of current developments demand that these problems be considered. For example, cost-sensitive embedded systems are widely used, where any performance improvement from applying optimizations can help reduce cost. Although several approaches have been proposed for handling some of these issues, there is no systematic way to address the problems. This dissertation presents a novel model-based framework for effectively applying optimizations. The goal of the framework is to determine optimization properties and use these properties to drive the application of optimizations. This dissertation describes three framework instances: FPSO for predicting the profitability of scalar optimizations; FPLO for predicting the profitability of loop optimizations; and FIO for determining the interaction property. Based on the profitability and interaction properties, compilers will selectively apply only beneficial optimizations and determine code-specific optimization sequences to get the most benefit. We implemented the framework instances and performed experiments to demonstrate their effectiveness and efficiency. On average, FPSO and FPLO can accurately predict profitability 90% of the time. Compared with a heuristic approach for selectively applying optimizations, our model-driven approach can achieve similar or better performance improvement without tuning the parameters necessary in the heuristic approach.
Compared with an empirical approach that experimentally chooses a good order in which to apply optimizations, our model-driven approach can find similarly good sequences with compile-time savings of up to 43 times. This dissertation demonstrates that analytic models can be used to address the effective application of optimizations. Our model-driven approach is practical and scalable. With model-driven optimizations, compilers can produce higher quality code in less time than is possible with current approaches.

    Efficient Precise Dynamic Data Race Detection for CPU and GPU

    Data races are notorious bugs. They introduce non-determinism into program behavior and complicate program semantics, making it challenging to debug parallel programs. To make parallel programming easier, efficient data race detection has been a research topic in recent decades. However, existing data race detectors either sacrifice precision or incur high overhead, limiting their application to real-world applications and scenarios. This dissertation proposes approaches to improve the performance of dynamic data race detection without undermining precision, by identifying and removing metadata redundancy dynamically. This dissertation also explores ways to make it practical to detect data races dynamically for GPU programs, whose programming and execution model is disparate from that of CPU workloads. Further, this dissertation shows how the structured synchronization model in GPU programs can simplify the algorithm design of data race detection for GPU, and how the unique patterns in GPU workloads enable an efficient implementation of the algorithm, yielding a high-performance dynamic data race detector for GPU programs.
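For readers unfamiliar with precise dynamic race detection, the sketch below shows a generic happens-before detector built on vector clocks, processing a recorded trace of lock and memory operations. It is a textbook-style illustration under stated assumptions, not the dissertation's detector: the trace format, the function name and the per-variable metadata layout are all hypothetical, and none of the dissertation's redundancy removal or GPU-specific techniques are modelled.

```python
from collections import defaultdict


def find_races(trace, num_threads):
    """Return (event index, variable) for each access that races with an earlier one.

    trace is a list of (thread, op, name) tuples with op in
    {"acq", "rel", "rd", "wr"} (hypothetical format).
    """
    clocks = [[0] * num_threads for _ in range(num_threads)]
    for t in range(num_threads):
        clocks[t][t] = 1
    lock_clock = defaultdict(lambda: [0] * num_threads)
    last_write = {}            # variable -> (thread, clock) of the last write
    reads = defaultdict(dict)  # variable -> {thread: clock} reads since that write
    races = []

    def ordered_before(epoch, clock):
        thread, time = epoch
        return time <= clock[thread]

    for index, (t, op, name) in enumerate(trace):
        if op == "acq":
            # Acquire: join the lock's clock into the thread's clock.
            clocks[t] = [max(a, b) for a, b in zip(clocks[t], lock_clock[name])]
        elif op == "rel":
            # Release: publish the thread's clock, then advance it.
            lock_clock[name] = list(clocks[t])
            clocks[t][t] += 1
        else:
            now = clocks[t]
            prior = [last_write[name]] if name in last_write else []
            if op == "wr":
                prior.extend(reads[name].items())  # writes also conflict with reads
            if any(not ordered_before(epoch, now) for epoch in prior):
                races.append((index, name))
            if op == "wr":
                last_write[name] = (t, now[t])
                reads[name] = {}
            else:
                reads[name][t] = now[t]
    return races


unsynchronized = [(0, "wr", "x"), (1, "wr", "x")]
lock_protected = [(0, "acq", "m"), (0, "wr", "x"), (0, "rel", "m"),
                  (1, "acq", "m"), (1, "wr", "x"), (1, "rel", "m")]
```

The per-variable metadata kept here (the last write plus a map of reads) is the kind of bookkeeping whose cost dominates detectors of this style. That gives some intuition for why reducing metadata redundancy is a natural target for lowering overhead.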
