    Array bounds check elimination in the context of deoptimization

    Whenever an array element is accessed, Java virtual machines execute a compare instruction to ensure that the index value is within the valid bounds. This reduces the execution speed of Java programs. Array bounds check elimination identifies situations in which such checks are redundant and can be removed. We present an array bounds check elimination algorithm for the Java HotSpotℱ VM based on static analysis in the just-in-time compiler. The algorithm works on an intermediate representation in static single assignment form and maintains conditions for index expressions. It fully removes bounds checks if it can be proven that they never fail. Whenever possible, it moves bounds checks out of loops. The static number of checks remains the same, but a check inside a loop is likely to be executed more often than one placed before the loop. If such a hoisted check fails, the executing program falls back to interpreted mode, avoiding the problem that an exception is thrown at the wrong place. The evaluation shows a speedup close to the theoretical maximum for the scientific SciMark benchmark suite and also significant improvements for some Java Grande benchmarks. The algorithm slightly increases the execution speed for the SPECjvm98 benchmark suite. The evaluation of the DaCapo benchmarks shows that array bounds checks do not have a significant impact on the performance of object-oriented applications.
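
The idea of moving a bounds check out of a loop and falling back when it fails can be illustrated with a minimal Python sketch. This is not the HotSpot implementation: `load`, `sum_checked`, `sum_hoisted`, `run`, and the `Deoptimize` exception are hypothetical names, and calling `sum_checked` only stands in for resuming in the interpreter.

```python
class Deoptimize(Exception):
    """Hypothetical signal: a hoisted check failed, leave the optimized code."""


def load(arr, i):
    # Explicit bounds check, mimicking what a JVM does on every array access.
    if i < 0 or i >= len(arr):
        raise IndexError(f"index {i} out of bounds for length {len(arr)}")
    return arr[i]


def sum_checked(arr, lo, hi):
    """Baseline version: one bounds check per loop iteration."""
    total = 0
    for i in range(lo, hi):
        total += load(arr, i)
    return total


def sum_hoisted(arr, lo, hi):
    """Optimized version: a single check before the loop covers every index."""
    if lo < hi and (lo < 0 or hi > len(arr)):
        raise Deoptimize()
    total = 0
    for i in range(lo, hi):
        total += arr[i]  # safe: the hoisted check proved lo <= i < len(arr)
    return total


def run(arr, lo, hi):
    try:
        return sum_hoisted(arr, lo, hi)
    except Deoptimize:
        # Fall back to the checked version, so the IndexError is raised at
        # the same iteration as in the unoptimized program.
        return sum_checked(arr, lo, hi)
```

In this sketch the hoisted check fails before the loop has run, so the fallback can simply restart in the checked version; a real virtual machine instead transfers the current program state to the interpreter.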

    Optimizing JavaScript Engines for Modern-day Workloads

    In recent years, we have seen a tremendous increase in the popularity and usage of web-based applications. Applications such as presentation software and word processors, which were traditionally considered desktop applications, are being ported to the web by compiling them to JavaScript. Since JavaScript is the de facto language of the web, JavaScript engine performance significantly affects the overall web application experience. JavaScript, initially intended solely as a client-side scripting language for web browsers, is now being used to implement server-side web applications (node.js) that traditionally have been written in languages like Java. Web application developers expect "C"-like performance out of their applications. Thus, there is a need to reevaluate the optimization strategies implemented in modern-day engines. Thesis statement: I propose that by using run-time and ahead-of-time profiling and type specialization techniques it is possible to improve the performance of JavaScript engines to cater to the needs of modern-day workloads. In this dissertation, we present an improved synergistic type specialization strategy for optimized JavaScript code execution, implemented on top of a research JavaScript engine called MuscalietJS. Our technique combines type feedback and type inference so that each reinforces and augments the other. We then present a novel deoptimization strategy that enables type-specialized code generation on top of typed, stack-based virtual machines like the CLR. We also describe a server-side offline profiling technique that collects profile information for web applications, which helps client JavaScript engines (running in the browser) avoid deoptimizations and improve application performance. Finally, we describe a technique to improve the performance of server-side JavaScript code by making use of intelligent profile caching and two new type stability heuristics.

    Verified lifting of stencil computations

    This paper demonstrates a novel combination of program synthesis and verification to lift stencil computations from low-level Fortran code to a high-level summary expressed using a predicate language. The technique is sound and mostly automated, and leverages counter-example guided inductive synthesis (CEGIS) to find provably correct translations. Lifting existing code to a high-performance description language has a number of benefits, including maintainability and performance portability. For example, our experiments show that the lifted summaries can enable domain-specific compilers to do a better job of parallelization as compared to an off-the-shelf compiler working on the original code, and can even support fully automatic migration to hardware accelerators such as GPUs. We have implemented verified lifting in a system called STNG and have evaluated it using microbenchmarks, mini-apps, and real-world applications. We demonstrate the benefits of verified lifting by first automatically summarizing Fortran source code into a high-level predicate language, and subsequently translating the lifted summaries into Halide, with the translated code achieving median performance speedups of 4.1X and up to 24X for non-trivial stencils as compared to the original implementation. Funding: United States Department of Energy, Office of Science (Awards DE-SC0008923 and DE-SC0005288).
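
To make the notion of a lifted summary concrete, here is a small Python sketch under loud assumptions: it is not STNG, it uses a hand-written 1-D three-point stencil instead of Fortran, and it replaces the verification step with bounded random testing. All function names are hypothetical. A candidate summary is a predicate relating input and output arrays, and a failing input plays the role of a counterexample that would send a CEGIS loop back to synthesis.

```python
import random


def stencil_loop(a):
    """Low-level style code: explicit index arithmetic over a 1-D array."""
    b = [0.0] * len(a)
    for i in range(1, len(a) - 1):
        b[i] = (a[i - 1] + a[i] + a[i + 1]) / 3.0
    return b


def summary_holds(a, b):
    """Candidate summary: interior cells average three neighbours, borders are 0."""
    n = len(a)
    interior = all(
        b[i] == (a[i - 1] + a[i] + a[i + 1]) / 3.0 for i in range(1, n - 1)
    )
    borders = all(b[i] == 0.0 for i in range(n) if i in (0, n - 1))
    return len(b) == n and interior and borders


def wrong_summary(a, b):
    """Incorrect candidate: forgets the centre point of the stencil."""
    return all(b[i] == (a[i - 1] + a[i + 1]) / 2.0 for i in range(1, len(a) - 1))


def find_counterexample(candidate, trials=200, seed=0):
    """Bounded testing stand-in for the verifier: return a failing input or None."""
    rng = random.Random(seed)
    for _ in range(trials):
        a = [rng.uniform(-10.0, 10.0) for _ in range(rng.randint(0, 8))]
        if not candidate(a, stencil_loop(a)):
            return a
    return None
```

Random testing can only refute a candidate; unlike the verification described in the abstract, finding no counterexample here proves nothing.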

    Efficient Implementation of Parametric Polymorphism using Reified Types

    Parametric polymorphism is a language feature that lets programmers define code that behaves independently of the types of values it operates on. Using parametric polymorphism enables code reuse and improves the maintainability of software projects. The approach that a language implementation uses to support parametric polymorphism can have important performance implications. One such approach, erasure, converts generic code to non-generic code that uses a uniform representation for generic data. Erasure is notorious for introducing primitive boxing and other indirections that harm the performance of generic code. More generally, erasure destroys type information that could be used by the language implementation to optimize generic code. This thesis presents TASTyTruffle, a new interpreter for the Scala language. Whereas the standard Scala implementation executes erased Java Virtual Machine (JVM) bytecode, TASTyTruffle interprets TASTy, a different representation that has precise type information. This thesis explores how the type information present in TASTy empowers TASTyTruffle to implement generic code more effectively. In particular, TASTy's type information allows TASTyTruffle to reify types as objects that can be passed around the interpreter. These reified types are used to support heterogeneous box-free representations of generic values. Reified types also enable TASTyTruffle to create specialized, monomorphic copies of generic code that can be easily and reliably optimized by its just-in-time (JIT) compiler. Empirically, TASTyTruffle is competitive with the standard JVM implementation. Both implementations perform similarly on monomorphic workloads, but when generic code is used with multiple types, TASTyTruffle consistently outperforms the JVM. TASTy's type information enables TASTyTruffle to find additional optimization opportunities that could not be uncovered with erased JVM bytecode alone.
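
The following Python sketch illustrates, under stated assumptions, how a reified type argument can select a box-free representation. It is not TASTyTruffle code: the `Buffer` class and the `_TYPECODES` table are hypothetical, and the standard `array` module stands in for unboxed primitive storage.

```python
from array import array

# Hypothetical mapping from a reified element type to a compact typecode.
_TYPECODES = {int: "q", float: "d"}


class Buffer:
    """Generic buffer whose element type is passed in as a runtime value."""

    def __init__(self, elem_type, values=()):
        self.elem_type = elem_type  # the reified type argument
        code = _TYPECODES.get(elem_type)
        # Primitive-like element types get unboxed storage; every other type
        # falls back to a uniform list of object references.
        self.storage = array(code, values) if code else list(values)

    def append(self, value):
        if not isinstance(value, self.elem_type):
            raise TypeError(f"expected {self.elem_type.__name__}")
        self.storage.append(value)

    def is_unboxed(self):
        return isinstance(self.storage, array)
```

Under erasure the buffer could not see `elem_type` and would have to use the uniform list representation for every instantiation.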

    On the fly type specialization without type analysis

    Dynamically typed programming languages such as JavaScript and Python defer type checking to run time. In order to maximize performance, dynamic language virtual machine implementations must attempt to eliminate redundant dynamic type checks. This is typically done using type inference analysis. However, type inference analyses are often costly and involve tradeoffs between compilation time and resulting precision. This has led to the creation of increasingly complex multi-tiered VM architectures. We introduce lazy basic block versioning, a simple just-in-time compilation technique which effectively removes redundant type checks from critical code paths. This novel approach lazily generates type-specialized versions of basic blocks on the fly while propagating context-dependent type information. This does not require the use of costly program analyses, is not restricted by the precision limitations of traditional type analyses and avoids the implementation complexity of speculative optimization techniques. Three extensions are made to the basic block versioning technique in order to give it interprocedural optimization capabilities. Typed object shapes give it the ability to attach type information to object properties and global variables. Entry point specialization allows it to pass type information from callers to callees, and call continuation specialization makes it possible to pass return value type information back to callers without dynamic overhead. We empirically demonstrate that these extensions enable basic block versioning to exceed the capabilities of static whole-program type analyses.
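
A toy Python sketch of the versioning idea follows. It is an illustration under stated assumptions rather than the technique as implemented in any real VM: there is a single block (`increment`), the `Versioner` class and its fields are hypothetical, and closures stand in for generated machine code. Each version returns its result together with the type context that holds on the taken path, so the successor version can be compiled without the type test.

```python
class Versioner:
    """Lazily creates versions of one block, keyed by the incoming type context."""

    def __init__(self):
        self.versions = {}       # (block name, type context) -> callable
        self.tests_executed = 0  # dynamic type tests actually run

    def get(self, name, ctx):
        key = (name, ctx)
        if key not in self.versions:  # lazy: generate on first use only
            self.versions[key] = self._compile_increment(ctx)
        return self.versions[key]

    def _compile_increment(self, ctx):
        known = dict(ctx).get("i")
        if known == "int":
            return lambda i: (i + 1, ctx)    # type test removed
        if known == "float":
            return lambda i: (i + 1.0, ctx)  # type test removed

        def guarded(i):
            # Unknown type: test once and propagate what the branch learned.
            self.tests_executed += 1
            if isinstance(i, int):
                return i + 1, (("i", "int"),)
            return float(i) + 1.0, (("i", "float"),)

        return guarded


def count_up(versioner, start, n):
    """Run the increment block n times, starting with an empty type context."""
    i, ctx = start, ()
    for _ in range(n):
        i, ctx = versioner.get("increment", ctx)(i)
    return i
```

For example, `count_up(Versioner(), 0, 100)` executes one type test in total: the first iteration runs the guarded version, and the remaining iterations reuse the version specialized for an integer `i`.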

    Specializing Scala with Truffle

    Scala is a generic object-oriented programming language with higher-order abstractions. Programming abstractions in Scala exemplify reusability and extensibility in the context of type safety. In particular, generic programming allows user-defined data structures to behave identically irrespective of the types of their values while remaining free of type errors. The implementation of reusability in Scala comes at a cost: the standard implementation of Scala compiles to Java bytecode, where type erasure discards much of a Scala program's type information in order to produce compatible Java bytecode. Consequently, autoboxing operations, which are needed when primitive values are used in a generic context, are introduced into the final program. The current state-of-the-art techniques for eliminating boxing and achieving optimal data representations at runtime, known as specialization, rely on static program analysis. Such techniques must mitigate the problem of code duplication; static optimizations cannot use runtime information to best select which data structures to specialize. This thesis proposes a new approach to the specialization of Scala programs. The approach integrates type information from a high-level source-like input language with the mechanisms of just-in-time compilation. We propose an ad-hoc specialization mechanism using a whole-program approach; specializations of data structures are created based on concrete type arguments. In our approach, specialized objects are compatible with non-specialized code. The thesis uses Truffle, a framework that simplifies the implementation of just-in-time compilers, to implement an experimental research prototype. We demonstrate that our approach is viable and produces improvements in throughput for simplified implementations of real-world Scala programs. While these programs are simple, it is still challenging for state-of-the-art approaches to specialize them optimally. 
We show that our approach can improve performance by an order of magnitude in the context of polymorphic data structures and methods that use bulk storage. We compare the results of our approach to our interpreter without specialization and to compiled Scala on GraalVM, a state-of-the-art Java Virtual Machine.

    Making non-volatile memory programmable

    Byte-addressable, non-volatile memory (NVM) is emerging as a revolutionary memory technology that provides persistence, near-DRAM performance, and scalable capacity. By using NVM, applications can directly create and manipulate durable data in place without the need for serialization out to SSDs. Ideally, through NVM, persistent applications will be able to maintain crash-consistency at a minimal cost. However, before this is possible, improvements must be made at both the hardware and software level to support persistent applications. Currently, software support for NVM places too high of a burden on the developer, introducing many opportunities for mistakes while also being too rigid for compiler optimizations. Likewise, at the hardware level, too little information is passed to the processor about the instruction-level ordering requirements of persistent applications; this forces the hardware to require the use of coarse fences, which significantly slow down execution. To help realize the promise of NVM, this thesis proposes both new software and hardware support that make NVM programmable. From the software side, this thesis proposes a new NVM programming model which relieves the programmer from performing much of the accounting work in persistent applications, instead relying on the runtime to perform error-prone tasks. Specifically, within the proposed model, the user only needs to provide minimal markings to identify the persistent data set and to ensure data is updated in a crash-consistent manner. Given this new NVM programming model, this thesis next presents an implementation of the model in Java. I call my implementation AutoPersist and build my support into the Maxine research Java Virtual Machine (JVM). In this thesis I describe how the JVM can be changed to support the proposed NVM programming model, including adding new Java libraries, adding new JVM runtime features, and augmenting the behavior of existing Java bytecodes. 
In addition to being easy to use, another advantage of the proposed model is that it is amenable to compiler optimizations. In this thesis I highlight two profile-guided optimizations: eagerly allocating objects directly into NVM and speculatively pruning control flow to include only the paths that are expected to be taken. I also describe how to apply these optimizations to AutoPersist and show they have a substantial performance impact. While designing AutoPersist, I often observed that dependency information known by the compiler cannot be passed down to the underlying hardware; instead, the compiler must insert coarse-grain fences to enforce needed dependencies. This is because current instruction set architectures (ISAs) cannot describe arbitrary instruction-level execution ordering constraints. To fix this limitation, I introduce the Execution Dependency Extension (EDE), and describe how EDE can be added to an existing ISA as well as be implemented in current processor pipelines. Overall, emerging NVM technologies can deliver programmer-friendly high performance. However, for this to happen, both software and hardware improvements are necessary. This thesis takes steps to address the current software and hardware gaps: I propose new software support to assist in the development of persistent applications and also introduce new instructions which allow arbitrary instruction-level dependencies to be conveyed to and enforced by the underlying hardware. With these improvements, the goal of programmable high-performance NVM is hopefully one step closer to being realized.

    New techniques for adaptive program optimization

    Adaptive optimization technology is a key ingredient in modern runtime systems. This technology aims at improving performance by making optimization decisions on the basis of a program’s observed behavior. Application virtual machines indeed face different and perhaps more compelling issues compared to traditional static optimizers, as dynamic language features can force the deferral of most effective optimizations until run time. In this thesis, we present novel ideas to improve adaptive optimization, focusing on two main problems: collecting fine-grained program profiles with low overhead to guide feedback-directed optimization, and supporting continuous optimization and deoptimization by diverting execution across dynamically generated code versions. We present two profiling techniques: the first works at inter-procedural level to collect calling context information for hot code portions, while the second captures cyclic-path profiles within a function’s boundaries. Both techniques rely on efficient and elegant data structures, advancing the state of the art of the theory and practice of the performance profiling literature. We then focus our attention on supporting continuous optimization through on-stack replacement (OSR) mechanisms. We devise a new OSR framework encoded entirely at intermediate-representation level, which extends the best OSR practices with the ability to perform OSR at nearly any program location. Our techniques pave the road to aggressive optimizations and debugging techniques that were not supported by previous approaches. The main technical challenge is how to automatically generate compensation code to fix the program’s state across an OSR transition between different code versions. We present a conceptual framework for OSR, distilling its essence to a core calculus with an operational semantics. 
Using bisimulation techniques, we describe how OSR can be correctly supported in the presence of common compiler optimizations, providing the first soundness results in this context. We implement our ideas in production systems such as Jikes RVM and the LLVM compiler toolchain, and evaluate their performance against a variety of prominent benchmarks. We investigate the end-to-end utility of our techniques in a series of case studies: we illustrate two possible applications of multi-iteration path profiling, and show how our OSR techniques advance the state of the art for MATLAB code optimization and for source-level debugging of optimized code. Part of the results of this thesis has been published in PLDI, OOPSLA, CGO, and Software: Practice and Experience.

    Java Virtual Machine Optimizations for Java and Dynamic Languages

    Thesis (Ph.D.) -- Seoul National University Graduate School: Department of Electrical and Computer Engineering, February 2017. Soo-Mook Moon. The Java virtual machine (JVM) was introduced as a machine-independent runtime environment for running Java programs. As a 32-bit stack machine, the JVM can execute bytecode instructions generated by compiling a Java program on any machine to which the JVM runtime has been correctly ported. The machine independence of the JVM brought about the huge success of both the Java programming language and the Java virtual machine itself on various systems, ranging from cloud servers to embedded systems including handsets and smart cards. Since a bytecode instruction must be interpreted by the JVM runtime for execution on top of a specific underlying system, a Java program runs inherently slower, due to the interpretation overhead, than a C/C++ program that is compiled directly for the system. Java just-in-time (JIT) compilers, the de facto performance add-on modules, are employed to improve the performance of a JVM by translating Java bytecode into native machine code on demand. One important problem in Java JIT compilation is how to map stack entries and local variables of the JVM runtime to physical registers efficiently and quickly, since register-based computations are much faster than memory-based ones, while JIT compilation overhead is part of the whole running time. This paper introduces LaTTe, an open-source Java JIT compiler that performs fast generation of efficiently register-mapped RISC code. LaTTe first maps all local variables and stack entries into pseudo registers, followed by real register allocation, which also aggressively coalesces copies corresponding to pushes and pops between local variables and stack entries. In addition to the efficient register allocation, LaTTe is equipped with various traditional and object-oriented optimizations such as CSE, dynamic method inlining, and specialization. 
We also devised new mechanisms for Java exception handling and monitor handling in LaTTe, named on-demand exception handling and lightweight monitor, respectively, to further boost JVM performance. Our experimental results indicate that LaTTe's sophisticated register mapping and allocation really pay off, achieving twice the performance of a naive JIT compiler that maps all local variables and stack entries to memory. It is also shown that LaTTe makes a reasonable trade-off between quality and speed of register mapping and allocation for the bytecode. We expect these results will also be beneficial to parallel and distributed Java computing 1) by enhancing single-thread Java performance and 2) by significantly reducing the number of memory accesses which the rest of the system must properly order to maintain coherence and keep threads synchronized. Furthermore, the Java virtual machine (JVM) has recently evolved into a general-purpose language runtime environment that executes popular programming languages such as JavaScript, Ruby, Python, or Scala. These languages have complex non-Java features including dynamic typing and first-class functions, so additional language runtimes (engines) are provided on top of the JVM to support them with bytecode extensions. Although there are high-performance JVMs with powerful just-in-time (JIT) compilers, running these languages efficiently on the JVM is still a challenge. This paper introduces a simple and novel technique for the JVM JIT compiler called exceptionization to improve the performance of JVM-based language runtimes. We observed that the JVM executing some non-Java languages encounters at least twice as many branch bytecodes as when executing Java, most of which are highly biased toward a single target. Exceptionization treats such a highly biased branch as an implicit exception-throwing instruction. 
This allows the JVM JIT compiler to prune the infrequent target of the branch from the frequent control flow, thus compiling the frequent control flow more aggressively with better optimization. If a pruned path is taken, it runs like a Java exception handler, i.e., a catch block. We also devised de-exceptionization, a mechanism to cope with the case when a pruned path is actually executed more often than expected. Since exceptionization is a generic JVM optimization, independent of any specific language runtime, it would be generally applicable to any language runtime on the JVM. Our experimental results show that exceptionization accelerates the performance of several non-Java languages. The JavaScript-on-JVM runs faster by as much as 60%, and by 6% on average, when running the Octane benchmark suite on Oracle's latest Nashorn JavaScript engine and HotSpot 1.9 JVM. Additionally, the Ruby-on-JVM improves by as much as 60% and by 6% on average, while the Python-on-JVM improves by as much as 6%. We found that exceptionization is most effectively applicable to the branch bytecode of the language runtime itself, rather than the bytecode corresponding to the application code or the bytecode of the Java class libraries. This implies that the performance benefit of exceptionization comes from better JIT compilation of the non-Java language runtime. Contents: 1. Introduction; 2. Java Virtual Machine Optimization for Java; 3. Java Virtual Machine Optimization for Dynamic Languages; 4. Summary and Conclusion; Abstract (in Korean).
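
The control-flow idea behind exceptionization and de-exceptionization can be sketched in Python as follows. This is only an analogy under stated assumptions, not the JIT transformation itself: the `ExceptionizedAbs` class, the `RarePath` exception, and the fixed `threshold` policy are all hypothetical.

```python
class RarePath(Exception):
    """Hypothetical signal raised when the pruned branch target is needed."""


class ExceptionizedAbs:
    """abs(x) where the `x < 0` branch is assumed to be rarely taken."""

    def __init__(self, threshold=3):
        self.threshold = threshold  # hypothetical de-exceptionization limit
        self.rare_hits = 0
        self.exceptionized = True

    def _fast(self, x):
        # Frequent path only; the infrequent target is pruned from this code.
        if x < 0:
            raise RarePath()
        return x

    def _general(self, x):
        # Ordinary two-way branch, used after de-exceptionization.
        return -x if x < 0 else x

    def __call__(self, x):
        if not self.exceptionized:
            return self._general(x)
        try:
            return self._fast(x)
        except RarePath:
            # Runs like a catch block: the pruned path is handled out of line.
            self.rare_hits += 1
            if self.rare_hits >= self.threshold:
                self.exceptionized = False  # de-exceptionization
            return -x
```

Once the rare path has been taken `threshold` times, calls switch to the ordinary branch, mirroring the case where a pruned path turns out to be executed more often than expected.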