489 research outputs found
On-stack replacement, distilled
On-stack replacement (OSR) is essential technology for adaptive optimization, allowing changes to code actively executing in a managed runtime. The engineering aspects of OSR are well-known among VM architects, with several implementations available to date. However, OSR is yet to be explored as a general means to transfer execution between related program versions, which can pave the road to unprecedented applications that stretch beyond VMs. We aim at filling this gap with a constructive and provably correct OSR framework, allowing a class of general-purpose transformation functions to yield a special-purpose replacement. We describe and evaluate an implementation of our technique in LLVM. As a novel application of OSR, we present a feasibility study on debugging of optimized code, showing how our techniques can be used to fix variables holding incorrect values at breakpoints due to optimizations
Observable dynamic compilation
Managed language platforms such as the Java Virtual Machine rely on a dynamic compiler to achieve high performance. Despite the benefits that dynamic compilation provides, it also introduces some challenges to program profiling. Firstly, profilers based on bytecode instrumentation may yield wrong results in the presence of an optimizing dynamic compiler, either due to not being aware of optimizations, or because the inserted instrumentation code disrupts such optimizations. To avoid such perturbations, we present a technique to make profilers based on bytecode instrumentation aware of the optimizations performed by the dynamic compiler, and make the dynamic compiler aware of the inserted code. We implement our technique for separating inserted instrumentation code from base-program code in Oracle's Graal compiler, integrating our extension into the OpenJDK Graal project. We demonstrate its significance with concrete profilers. On the one hand, we improve accuracy of existing profiling techniques, for example, to quantify the impact of escape analysis on bytecode-level allocation profiling, to analyze object life-times, and to evaluate the impact of method inlining when profiling method invocations. On the other hand, we also illustrate how our technique enables new kinds of profilers, such as a profiler for non-inlined callsites, and a testing framework for locating performance bugs in dynamic compiler implementations. Secondly, the lack of profiling support at the intermediate representation (IR) level complicates the understanding of program behavior in the compiled code. This issue cannot be addressed by bytecode instrumentation because it cannot precisely capture the occurrence of IR-level operations. Binary instrumentation is not suited either, as it lacks a mapping from the collected low-level metrics to higher-level operations of the observed program. To fill this gap, we present an easy-to-use event-based framework for profiling operations at the IR level. We integrate the IR profiling framework in the Graal compiler, together with our instrumentation-separation technique. We illustrate our approach with a profiler that tracks the execution of memory barriers within compiled code. In addition, using a deoptimization profiler based on our IR profiling framework, we conduct an empirical study on deoptimization in the Graal compiler. We focus on situations which cause program execution to switch from machine code to the interpreter, and compare application performance using three different deoptimization strategies which influence the amount of extra compilation work done by Graal. Using an adaptive deoptimization strategy, we manage to improve the average start-up performance of benchmarks from the DaCapo, ScalaBench, and Octane suites by avoiding wasted compilation work. We also find that different deoptimization strategies have little impact on steady- state performance
Retrofitting Security in COTS Software with Binary Rewriting
We present a practical tool for inserting security features against low-level software attacks into third-party, proprietary or otherwise binary-only software. We are motivated by the inability of software users to select and use low-overhead protection schemes when source code is unavailable to them, by the lack of information as to what (if any) security mechanisms software producers have used in their toolchains, and the high overhead and inaccuracy of solutions that treat software as a black box. Our approach is based on SecondWrite, an advanced binary rewriter that operates without need for debugging information or other assist. Using SecondWrite, we insert a variety of defenses into program binaries. Although the defenses are generally well known, they have not generally been used together because they are implemented by different (non-integrated) tools. We are also the first to demonstrate the use of such mechanisms in the absence of source code availability. We experimentally evaluate the effectiveness and performance impact of our approach. We show that it stops all variants of low-level software attacks at a very low performance overhead, without impacting original program functionality
Optimizing SIMD execution in HW/SW co-designed processors
SIMD accelerators are ubiquitous in microprocessors from different computing domains. Their high compute power and hardware simplicity improve overall performance in an energy efficient manner. Moreover, their replicated functional units and simple control mechanism make them amenable to scaling to higher vector lengths. However, code generation for these accelerators has been a challenge from the days of their inception. Compilers generate vector code conservatively to ensure correctness. As a result they lose significant vectorization opportunities and fail to extract maximum benefits out of SIMD accelerators.
This thesis proposes to vectorize the program binary at runtime in a speculative manner, in addition to the compile time static vectorization. There are different environments that support runtime profiling and optimization support required for dynamic vectorization, one of most prominent ones being: 1) Dynamic Binary Translators and Optimizers (DBTO) and 2) Hardware/Software (HW/SW) Co-designed Processors. HW/SW co-designed environment provides several advantages over DBTOs like transparent incorporations of new hardware features, binary compatibility, etc. Therefore, we use HW/SW co-designed environment to assess the potential of speculative dynamic vectorization.
Furthermore, we analyze vector code generation for wider vector units and find out that even though SIMD accelerators are amenable to scaling from the hardware point of view, vector code generation at higher vector length is even more challenging. The two major factors impeding vectorization for wider SIMD units are: 1) Reduced dynamic instruction stream coverage for vectorization and 2) Large number of permutation instructions. To solve the first problem we propose Variable Length Vectorization that iteratively vectorizes for multiple vector lengths to improve dynamic instruction stream coverage. Secondly, to reduce the number of permutation instructions we propose Selective Writing that selectively writes to different parts of a vector register and avoids permutations.
Finally, we tackle the problem of leakage energy in SIMD accelerators. Since SIMD accelerators consume significant amount of real estate on the chip, they become the principle source of leakage if not utilized judiciously. Power gating is one of the most widely used techniques to reduce leakage energy of functional units. However, power gating has its own energy and performance overhead associated with it. We propose to selectively devectorize the vector code when higher SIMD lanes are used intermittently. This selective devectorization keeps the higher SIMD lanes idle and power gated for maximum duration. Therefore, resulting in overall leakage energy reduction.Postprint (published version
Code optimizations for narrow bitwidth architectures
This thesis takes a HW/SW collaborative approach to tackle the problem of computational inefficiency in a holistic manner.
The hardware is redesigned by restraining the datapath to merely 16-bit datawidth (integer datapath only) to provide an
extremely simple, low-cost, low-complexity execution core which is best at executing the most common case efficiently. This
redesign, referred to as the Narrow Bitwidth Architecture, is unique in that although the datapath is squeezed to 16-bits, it
continues to offer the advantage of higher memory addressability like the contemporary wider datapath architectures. Its
interface to the outside (software) world is termed as the Narrow ISA. The software is responsible for efficiently mapping the
current stack of 64-bit applications onto the 16-bit hardware. However, this HW/SW approach introduces a non-negligible
penalty both in dynamic code-size and performance-impact even with a reasonably smart code-translator that maps the 64-
bit applications on to the 16-bit processor.
The goal of this thesis is to design a software layer that harnesses the power of compiler optimizations to assuage this
negative performance penalty of the Narrow ISA. More specifically, this thesis focuses on compiler optimizations targeting the
problem of how to compile a 64-bit program to a 16-bit datapath machine from the perspective of Minimum Required
Computations (MRC). Given a program, the notion of MRC aims to infer how much computation is really required to generate
the same (correct) output as the original program.
Approaching perfect MRC is an intrinsically ambitious goal and it requires oracle predictions of program behavior. Towards
this end, the thesis proposes three heuristic-based optimizations to closely infer the MRC. The perspective of MRC unfolds
into a definition of productiveness - if a computation does not alter the storage location, it is non-productive and hence, not
necessary to be performed. In this research, the definition of productiveness has been applied to different granularities of the
data-flow as well as control-flow of the programs.
Three profile-based, code optimization techniques have been proposed :
1. Global Productiveness Propagation (GPP) which applies the concept of productiveness at the granularity of a function.
2. Local Productiveness Pruning (LPP) applies the same concept but at a much finer granularity of a single instruction.
3. Minimal Branch Computation (MBC) is an profile-based, code-reordering optimization technique which applies the
principles of MRC for conditional branches.
The primary aim of all these techniques is to reduce the dynamic code footprint of the Narrow ISA. The first two optimizations
(GPP and LPP) perform the task of speculatively pruning the non-productive (useless) computations using profiles. Further,
these two optimization techniques perform backward traversal of the optimization regions to embed checks into the nonspeculative
slices, hence, making them self-sufficient to detect mis-speculation dynamically.
The MBC optimization is a use case of a broader concept of a lazy computation model. The idea behind MBC is to reorder the
backslices containing narrow computations such that the minimal necessary computations to generate the same (correct)
output are performed in the most-frequent case; the rest of the computations are performed only when necessary.
With the proposed optimizations, it can be concluded that there do exist ways to smartly compile a 64-bit application to a 16-
bit ISA such that the overheads are considerably reduced.Esta tesis deriva su motivación en la inherente ineficiencia computacional de los procesadores actuales: a pesar de que
muchas aplicaciones contemporáneas tienen unos requisitos de ancho de bits estrechos (aplicaciones de enteros, de red y
multimedia), el hardware acaba utilizando el camino de datos completo, utilizando más recursos de los necesarios y
consumiendo más energÃa.
Esta tesis utiliza una aproximación HW/SW para atacar, de forma Ãntegra, el problema de la ineficiencia computacional. El
hardware se ha rediseñado para restringir el ancho de bits del camino de datos a sólo 16 bits (únicamente el de enteros) y
ofrecer asà un núcleo de ejecución simple, de bajo consumo y baja complejidad, el cual está diseñado para ejecutar de
forma eficiente el caso común. El rediseño, llamado en esta tesis Arquitectura de Ancho de Bits Estrecho (narrow bitwidth
en inglés), es único en el sentido que aunque el camino de datos se ha estrechado a 16 bits, el sistema continúa
ofreciendo las ventajas de direccionar grandes cantidades de memoria tal como procesadores con caminos de datos más
anchos (64 bits actualmente). Su interface con el mundo exterior se denomina ISA estrecho. En nuestra propuesta el
software es responsable de mapear eficientemente la actual pila software de las aplicaciones de 64 bits en el hardware de
16 bits. Sin embargo, esta aproximación HW/SW introduce penalizaciones no despreciables tanto en el tamaño del código
dinámico como en el rendimiento, incluso con un traductor de código inteligente que mapea las aplicaciones de 64 bits en
el procesador de 16 bits.
El objetivo de esta tesis es el de diseñar una capa software que aproveche la capacidad de las optimizaciones para reducir
el efecto negativo en el rendimiento del ISA estrecho. Concretamente, esta tesis se centra en optimizaciones que tratan el
problema de como compilar programas de 64 bits para una máquina de 16 bits desde la perspectiva de las MÃnimas
Computaciones Requeridas (MRC en inglés). Dado un programa, la noción de MRC intenta deducir la cantidad de cómputo
que realmente se necesita para generar la misma (correcta) salida que el programa original.
Aproximarse al MRC perfecto es una meta intrÃnsecamente ambiciosa y que requiere predicciones perfectas de
comportamiento del programa. Con este fin, la tesis propone tres heurÃsticas basadas en optimizaciones que tratan de
inferir el MRC. La utilización de MRC se desarrolla en la definición de productividad: si un cálculo no altera el dato que ya
habÃa almacenado, entonces no es productivo y por lo tanto, no es necesario llevarlo a cabo.
Se han propuesto tres optimizaciones del código basadas en profile:
1. Propagación Global de la Productividad (GPP en inglés) aplica el concepto de productividad a la granularidad de función.
2. Poda Local de Productividad (LPP en inglés) aplica el mismo concepto pero a una granularidad mucho más fina, la de
una única instrucción.
3. Computación MÃnima del Salto (MBC en inglés) es una técnica de reordenación de código que aplica los principios de
MRC a los saltos condicionales.
El objetivo principal de todas esta técnicas es el de reducir el tamaño dinámico del código estrecho. Las primeras dos
optimizaciones (GPP y LPP) realizan la tarea de podar especulativamente las computaciones no productivas (innecesarias)
utilizando profiles. Además, estas dos optimizaciones realizan un recorrido hacia atrás de las regiones a optimizar para
añadir chequeos en el código no especulativo, haciendo de esta forma la técnica autosuficiente para detectar,
dinámicamente, los casos de fallo en la especulación.
La idea de la optimización MBC es reordenar las instrucciones que generan el salto condicional tal que las mÃnimas
computaciones que general la misma (correcta) salida se ejecuten en la mayorÃa de los casos; el resto de las
computaciones se ejecutarán sólo cuando sea necesario
Profile-guided redundancy elimination
Program optimisations analyse and transform the programs such
that better performance results can be achieved. Classical optimisations
mainly use the static properties of the programs to analyse program
code and make sure that the optimisations work for every possible
combination of the program and the input data. This approach
is conservative in those cases when the programs show the same runtime
behaviours for most of their execution time. On the other hand,
profile-guided optimisations use runtime profiling information to discover
the aforementioned common behaviours of the programs and explore
more optimisation opportunities, which are missed in the classical,
non-profile-guided optimisations. Redundancy elimination is one of the
most powerful optimisations in compilers. In this thesis, a new partial
redundancy elimination (PRE) algorithm and a partial dead code elimination
algorithm (PDE) are proposed for a profile-guided redundancy
elimination framework. During the design and implementation of the
algorithms, we address three critical issues: optimality, feasibility and
profitability.
First, we prove that both our speculative PRE algorithm and our
region-based PDE algorithm are optimal for given edge profiling information.
The total number of dynamic occurrences of redundant expressions
or dead codes cannot be further eliminated by any other code
motion. Moreover, our speculative PRE algorithm is lifetime optimal,
which means that the lifetimes of new introduced temporary variables
are minimised.
Second, we show that both algorithms are practical and can be efficiently
implemented in production compilers. For SPEC CPU2000
benchmarks, the average compilation overhead for our PRE algorithm
is 3%, and the average overhead for our PDE algorithm is less than 2%.
Moreover, edge profiling rather than expensive path profiling is sufficient
to guarantee the optimality of the algorithms.
Finally, we demonstrate that the proposed profile-guided redundancy
elimination techniques can provide speedups on real machines by conducting
a thorough performance evaluation. To the best of our knowledge,
this is the first performance evaluation of the profile-guided redundancy
elimination techniques on real machines
Partial Redundancy Elimination for Multi-threaded Programs
Multi-threaded programs have many applications which are widely used such as
operating systems. Analyzing multi-threaded programs differs from sequential
ones; the main feature is that many threads execute at the same time. The
effect of all other running threads must be taken in account. Partial
redundancy elimination is among the most powerful compiler optimizations: it
performs loop-invariant code motion and common subexpression elimination. We
present a type system with optimization component which performs partial
redundancy elimination for multi-threaded programs.Comment: 7 page
Unconventional Applications of Compiler Analysis
Previously, compiler transformations have primarily focused on
minimizing program execution time. This thesis explores some examples
of applying compiler technology outside of its original scope.
Specifically, we apply compiler analysis to the field of software
maintenance and evolution by examining the use of global data
throughout the lifetimes of many open source projects. Also, we
investigate the effects of compiler optimizations on the power
consumption of small battery powered devices. Finally, in an area
closer to traditional compiler research we examine automatic program
parallelization in the form of thread-level speculation
Run-time optimization of adaptive irregular applications
Compared to traditional compile-time optimization, run-time optimization could offer significant performance improvements when parallelizing and optimizing adaptive irregular applications, because it performs program analysis and adaptive optimizations during program execution. Run-time techniques can succeed where static techniques fail because they exploit the characteristics of input data, programs' dynamic behaviors, and the underneath execution environment. When optimizing adaptive irregular applications for parallel execution, a common observation is that the effectiveness of the optimizing transformations depends on programs' input data and their dynamic phases. This dissertation presents a set of run-time optimization techniques that match the characteristics of programs' dynamic memory access patterns and the appropriate optimization (parallelization) transformations. First, we present a general adaptive algorithm selection framework to automatically and adaptively select at run-time the best performing, functionally equivalent algorithm for each of its execution instances. The selection process is based on off-line automatically generated prediction models and characteristics (collected and analyzed dynamically) of the algorithm's input data, In this dissertation, we specialize this framework for automatic selection of reduction algorithms. In this research, we have identified a small set of machine independent high-level characterization parameters and then we deployed an off-line, systematic experiment process to generate prediction models. These models, in turn, match the parameters to the best optimization transformations for a given machine. The technique has been evaluated thoroughly in terms of applications, platforms, and programs' dynamic behaviors. Specifically, for the reduction algorithm selection, the selected performance is within 2% of optimal performance and on average is 60% better than "Replicated Buffer," the default parallel reduction algorithm specified by OpenMP standard. To reduce the overhead of speculative run-time parallelization, we have developed an adaptive run-time parallelization technique that dynamically chooses effcient shadow structures to record a program's dynamic memory access patterns for parallelization. This technique complements the original speculative run-time parallelization technique, the LRPD test, in parallelizing loops with sparse memory accesses. The techniques presented in this dissertation have been implemented in an optimizing research compiler and can be viewed as effective building blocks for comprehensive run-time optimization systems, e.g., feedback-directed optimization systems and dynamic compilation systems
- …