76 research outputs found

    Workload characterization of JVM languages

    Get PDF
    Although it was developed with a single language in mind, namely Java, the Java Virtual Machine (JVM) is nowadays targeted by numerous programming languages. Automatic memory management, Just-In-Time (JIT) compilation, and adaptive optimizations provided by the JVM make it an attractive target for different language implementations. Even though it is targeted by so many languages, the JVM has been tuned with respect to the characteristics of Java programs only; heuristics for the garbage collector and compiler optimizations are geared towards Java programs. In this dissertation, we aim to contribute to the understanding of the workloads imposed on the JVM by both dynamically typed and statically typed JVM languages. We introduce a new set of dynamic metrics and an easy-to-use toolchain for collecting them. We apply our toolchain to applications written in six JVM languages -- Java, Scala, Clojure, Jython, JRuby, and JavaScript. We identify differences and commonalities between the examined languages and discuss their implications. Moreover, we take a close look at one of the most effective compiler optimizations, method inlining. We present the inlining decision tree of the HotSpot JVM's JIT compiler and analyze how well the JVM inlines workloads written in different JVM languages.
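
    To make the notion of a dynamic metric more concrete, the sketch below (plain Java, not the dissertation's toolchain; names such as CallSiteProfiler and record are illustrative) counts how many distinct receiver types each virtual call site observes at run time, one of the simplest signals relevant to the inlining behavior discussed above.

        import java.util.Map;
        import java.util.Set;
        import java.util.concurrent.ConcurrentHashMap;

        // Minimal sketch of one dynamic metric: receiver-type polymorphism per
        // virtual call site. An instrumentation agent would insert a call to
        // record() before each invokevirtual/invokeinterface in the base program.
        public final class CallSiteProfiler {
            // call-site id -> set of receiver classes observed at run time
            private static final Map<String, Set<Class<?>>> SITES = new ConcurrentHashMap<>();

            public static void record(String callSiteId, Object receiver) {
                SITES.computeIfAbsent(callSiteId, k -> ConcurrentHashMap.newKeySet())
                     .add(receiver.getClass());
            }

            public static void report() {
                long mono = SITES.values().stream().filter(s -> s.size() == 1).count();
                System.out.printf("%d/%d call sites were monomorphic (easy inlining candidates)%n",
                                  mono, SITES.size());
            }

            public static void main(String[] args) {
                // Simulated instrumentation events from two call sites.
                record("Foo.bar:12", "a string receiver");
                record("Foo.bar:12", "another string");              // still monomorphic
                record("List.add:7", new java.util.ArrayList<>());
                record("List.add:7", new java.util.LinkedList<>());  // polymorphic
                report();
            }
        }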

    Observable dynamic compilation

    Get PDF
    Managed language platforms such as the Java Virtual Machine rely on a dynamic compiler to achieve high performance. Despite the benefits that dynamic compilation provides, it also introduces some challenges to program profiling. Firstly, profilers based on bytecode instrumentation may yield wrong results in the presence of an optimizing dynamic compiler, either because they are not aware of optimizations or because the inserted instrumentation code disrupts such optimizations. To avoid such perturbations, we present a technique to make profilers based on bytecode instrumentation aware of the optimizations performed by the dynamic compiler, and to make the dynamic compiler aware of the inserted code. We implement our technique for separating inserted instrumentation code from base-program code in Oracle's Graal compiler, integrating our extension into the OpenJDK Graal project. We demonstrate its significance with concrete profilers. On the one hand, we improve the accuracy of existing profiling techniques, for example, to quantify the impact of escape analysis on bytecode-level allocation profiling, to analyze object lifetimes, and to evaluate the impact of method inlining when profiling method invocations. On the other hand, we also illustrate how our technique enables new kinds of profilers, such as a profiler for non-inlined call sites and a testing framework for locating performance bugs in dynamic compiler implementations. Secondly, the lack of profiling support at the intermediate representation (IR) level complicates the understanding of program behavior in compiled code. This issue cannot be addressed by bytecode instrumentation, which cannot precisely capture the occurrence of IR-level operations, nor by binary instrumentation, which lacks a mapping from the collected low-level metrics to higher-level operations of the observed program. To fill this gap, we present an easy-to-use event-based framework for profiling operations at the IR level. We integrate the IR profiling framework into the Graal compiler, together with our instrumentation-separation technique. We illustrate our approach with a profiler that tracks the execution of memory barriers within compiled code. In addition, using a deoptimization profiler based on our IR profiling framework, we conduct an empirical study of deoptimization in the Graal compiler. We focus on situations that cause program execution to switch from machine code to the interpreter, and we compare application performance under three different deoptimization strategies that influence the amount of extra compilation work done by Graal. Using an adaptive deoptimization strategy, we improve the average start-up performance of benchmarks from the DaCapo, ScalaBench, and Octane suites by avoiding wasted compilation work. We also find that different deoptimization strategies have little impact on steady-state performance.
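
    The sketch below illustrates the first problem on a small scale: a naive bytecode-instrumentation allocation profiler (the class and method names are hypothetical, and this is not the Graal-integrated technique of the paper) counts every allocation visible at the bytecode level, including those the optimizing compiler may later remove via escape analysis, which is exactly the kind of perturbation the technique above makes observable.

        import java.util.concurrent.atomic.LongAdder;

        // Naive allocation profiler: an instrumentation agent would insert a call
        // to onAlloc() right after every `new` in the base program.
        public final class AllocProfiler {
            private static final LongAdder BYTECODE_LEVEL_ALLOCS = new LongAdder();

            public static void onAlloc(Object justAllocated) {
                // Counts every allocation visible at the bytecode level. An optimizing
                // compiler may later remove some of these via escape analysis and
                // scalar replacement, so this number can be far higher than the number
                // of actual heap allocations.
                BYTECODE_LEVEL_ALLOCS.increment();
            }

            public static long count() { return BYTECODE_LEVEL_ALLOCS.sum(); }

            public static void main(String[] args) {
                for (int i = 0; i < 1_000; i++) {
                    Object o = new Object();  // the agent would insert onAlloc(o) here
                    onAlloc(o);
                }
                System.out.println("bytecode-level allocations observed: " + count());
            }
        }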

    Infrastructures and Compilation Strategies for the Performance of Computing Systems

    Get PDF
    This document presents our main contributions to the field of compilation, and more generally to the quest for performance of computing systems. It is structured by type of execution environment, from static compilation (execution of native code), to JIT compilation, and purely dynamic optimization. We also consider interpreters. In each chapter, we focus on the most relevant contributions. Chapter 2 describes our work on static compilation. It covers a long time frame (from PhD work in 1995--1998 to recent work on real-time systems and worst-case execution times at Inria in 2015) and various positions, both in academia and in industry. My research on JIT compilers started in the mid-2000s at STMicroelectronics and is still ongoing. Chapter 3 covers the results we obtained on various aspects of JIT compilers: split compilation, interaction with real-time systems, and obfuscation. Chapter 4 reports on dynamic binary optimization, a research effort started more recently, in 2012. This considers the optimization of a native binary (without source code) while it runs, which raises significant challenges but also opportunities. Interpreters represent an alternative way to execute code. Instead of generating native code, an interpreter executes an infinite loop that continuously reads an instruction, decodes it, and executes its semantics. Interpreters are much easier to develop than compilers, and they are also much more portable, often requiring only a simple recompilation. The price to pay is reduced performance. Chapter 5 presents some of our work related to interpreters. All this research often required significant software infrastructures for validation, from early prototypes to robust quasi-products, and from open-source to proprietary software. We detail them in Chapter 6. The last chapter concludes and gives some perspectives.
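
    As a concrete illustration of the interpreter loop described above (fetch an instruction, decode it, execute its semantics), here is a minimal toy stack-machine interpreter in Java; the instruction set is invented for the example.

        // Toy bytecode interpreter: the classic fetch/decode/execute loop.
        public final class ToyInterpreter {
            static final int PUSH = 0, ADD = 1, PRINT = 2, HALT = 3;

            public static void run(int[] code) {
                int[] stack = new int[64];
                int sp = 0, pc = 0;
                while (true) {                       // the "infinite loop"
                    int op = code[pc++];             // fetch
                    switch (op) {                    // decode
                        case PUSH:  stack[sp++] = code[pc++]; break;          // execute
                        case ADD:   stack[sp - 2] += stack[sp - 1]; sp--; break;
                        case PRINT: System.out.println(stack[sp - 1]); break;
                        case HALT:  return;
                        default: throw new IllegalStateException("bad opcode " + op);
                    }
                }
            }

            public static void main(String[] args) {
                run(new int[] { PUSH, 2, PUSH, 40, ADD, PRINT, HALT }); // prints 42
            }
        }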

    nelli: a lightweight frontend for MLIR

    Full text link
    Multi-Level Intermediate Representation (MLIR) is a novel compiler infrastructure that aims to provide modular and extensible components to facilitate building domain-specific compilers. However, since MLIR models programs at an intermediate level of abstraction, and most extant frontends are at a very high level of abstraction, the semantics and mechanics of the fundamental transformations available in MLIR are difficult to investigate and employ in and of themselves. To address these challenges, we have developed nelli, a lightweight, Python-embedded, domain-specific language for generating MLIR code. nelli leverages existing MLIR infrastructure to develop Pythonic syntax and semantics for various MLIR features. We describe nelli's design goals, discuss key details of our implementation, and demonstrate how nelli enables easily defining and lowering compute kernels to diverse hardware platforms.

    Efficient Implementation of Parametric Polymorphism using Reified Types

    Get PDF
    Parametric polymorphism is a language feature that lets programmers define code that behaves independently of the types of values it operates on. Using parametric polymorphism enables code reuse and improves the maintainability of software projects. The approach that a language implementation uses to support parametric polymorphism can have important performance implications. One such approach, erasure, converts generic code to non-generic code that uses a uniform representation for generic data. Erasure is notorious for introducing primitive boxing and other indirections that harm the performance of generic code. More generally, erasure destroys type information that could be used by the language implementation to optimize generic code. This thesis presents TASTyTruffle, a new interpreter for the Scala language. Whereas the standard Scala implementation executes erased Java Virtual Machine (JVM) bytecode, TASTyTruffle interprets TASTy, a different representation that has precise type information. This thesis explores how the type information present in TASTy empowers TASTyTruffle to implement generic code more effectively. In particular, TASTy's type information allows TASTyTruffle to reify types as objects that can be passed around the interpreter. These reified types are used to support heterogeneous box-free representations of generic values. Reified types also enable TASTyTruffle to create specialized, monomorphic copies of generic code that can be easily and reliably optimized by its just-in-time (JIT) compiler. Empirically, TASTyTruffle is competitive with the standard JVM implementation. Both implementations perform similarly on monomorphic workloads, but when generic code is used with multiple types, TASTyTruffle consistently outperforms the JVM. TASTy's type information enables TASTyTruffle to find additional optimization opportunities that could not be uncovered with erased JVM bytecode alone.
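
    The Java sketch below illustrates the difference the abstract describes: an erased generic container must box primitive values because its type parameter is represented uniformly as Object, whereas a specialized, monomorphic shape, the kind a reifying implementation can generate automatically, keeps the value as a primitive. The class names are illustrative; this is not TASTy or TASTyTruffle code.

        public final class ErasureVsSpecialization {
            // Erased generic: under the hood T is Object, so ints must be boxed.
            static final class Box<T> {
                private T value;
                void set(T v) { value = v; }
                T get() { return value; }
            }

            // The specialized, box-free shape a reifying implementation can generate
            // when it knows the instantiation is Box[Int].
            static final class IntBox {
                private int value;
                void set(int v) { value = v; }
                int get() { return value; }
            }

            public static void main(String[] args) {
                Box<Integer> erased = new Box<>();
                erased.set(42);                 // boxes 42 into a java.lang.Integer
                int a = erased.get();           // unboxes on the way out

                IntBox specialized = new IntBox();
                specialized.set(42);            // stays a primitive int throughout
                int b = specialized.get();

                System.out.println(a + b);
            }
        }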

    Java Virtual Machine Optimizations for Java and Dynamic Languages

    Get PDF
    Ph.D. dissertation, Seoul National University, Department of Electrical and Computer Engineering, February 2017 (advisor: Soo-Mook Moon). The Java virtual machine (JVM) was introduced as a machine-independent runtime environment for running Java programs. As a 32-bit stack machine, the JVM can execute the bytecode instructions generated by compiling a Java program on any machine to which the JVM runtime has been correctly ported. The machine independence of the JVM brought about the huge success of both the Java programming language and the JVM itself on systems ranging from cloud servers to embedded devices such as handsets and smart cards. Since a bytecode instruction must be interpreted by the JVM runtime to execute on a specific underlying system, a Java program innately runs slower, due to interpretation overhead, than a C/C++ program compiled directly for that system. Java just-in-time (JIT) compilers, the de facto performance add-on modules, are employed to improve the performance of the JVM by translating Java bytecode into native machine code on demand. One important problem in Java JIT compilation is how to map stack entries and local variables of the JVM runtime to physical registers efficiently and quickly, since register-based computations are much faster than memory-based ones, while JIT compilation overhead is part of the whole running time. This paper introduces LaTTe, an open-source Java JIT compiler that quickly generates efficiently register-mapped RISC code. LaTTe first maps all local variables and stack entries to pseudo registers, and then performs real register allocation, which also aggressively coalesces the copies corresponding to pushes and pops between local variables and stack entries. In addition to efficient register allocation, LaTTe is equipped with various traditional and object-oriented optimizations such as CSE, dynamic method inlining, and specialization. We also devised new mechanisms for Java exception handling and monitor handling in LaTTe, named on-demand exception handling and lightweight monitors, respectively, to further boost JVM performance. Our experimental results indicate that LaTTe's sophisticated register mapping and allocation really pay off, achieving twice the performance of a naive JIT compiler that maps all local variables and stack entries to memory. It is also shown that LaTTe makes a reasonable trade-off between the quality and the speed of register mapping and allocation for the bytecode. We expect these results to benefit parallel and distributed Java computing as well, 1) by enhancing single-thread Java performance and 2) by significantly reducing the number of memory accesses that the rest of the system must properly order to maintain coherence and keep threads synchronized. Furthermore, the JVM has recently evolved into a general-purpose language runtime environment for executing popular programming languages such as JavaScript, Ruby, Python, and Scala. These languages have complex non-Java features, including dynamic typing and first-class functions, so additional language runtimes (engines) are provided on top of the JVM to support them with bytecode extensions. Although there are high-performance JVMs with powerful just-in-time (JIT) compilers, running these languages efficiently on the JVM is still a challenge. This paper introduces a simple and novel technique for the JVM JIT compiler, called exceptionization, to improve the performance of JVM-based language runtimes.
We observed that the JVM executing some non-Java languages encounters at least twice as many branch bytecodes as Java, most of which are highly biased to take only one target. Exceptionization treats such a highly biased branch as an implicit exception-throwing instruction. This allows the JVM JIT compiler to prune the infrequent target of the branch from the frequent control flow, thereby compiling the frequent control flow more aggressively with better optimization. If a pruned path is taken, it runs like a Java exception handler, i.e., a catch block. We also devised de-exceptionization, a mechanism to cope with the case in which a pruned path is actually executed more often than expected. Since exceptionization is a generic JVM optimization, independent of any specific language runtime, it is generally applicable to any language runtime on the JVM. Our experimental results show that exceptionization accelerates the performance of several non-Java languages. JavaScript-on-JVM runs faster by as much as 60%, and by 6% on average, when running the Octane benchmark suite on Oracle's latest Nashorn JavaScript engine and HotSpot 1.9 JVM. Similarly, Ruby-on-JVM improves by as much as 60% and by 6% on average, and Python-on-JVM by as much as 6%. We found that exceptionization is most effective when applied to the branch bytecode of the language runtime itself, rather than to the bytecode corresponding to the application code or to the Java class libraries. This implies that the performance benefit of exceptionization comes from better JIT compilation of the non-Java language runtime.
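
    The sketch below gives a source-level analogue of exceptionization (the actual transformation operates on branch bytecodes inside the JIT compiler; the names FAST_TAG, slowPath, and RarePathException are hypothetical): a highly biased branch is rewritten so that its rare target is reached only through an exception handler, leaving the hot path as straight-line code.

        public final class ExceptionizationSketch {
            static final class RarePathException extends RuntimeException {}
            static final int FAST_TAG = 0;

            // Original runtime code: a branch that is almost always taken one way.
            static int dispatchOriginal(int tag, int x) {
                if (tag != FAST_TAG) {        // highly biased: almost never true
                    return slowPath(x);
                }
                return x + 1;                 // hot path
            }

            // Exceptionized form: the rare side is pruned from the hot control flow
            // and reached only through an (implicit) exception, like a catch block.
            static int dispatchExceptionized(int tag, int x) {
                try {
                    if (tag != FAST_TAG) throw new RarePathException();
                    return x + 1;             // straight-line code, optimized aggressively
                } catch (RarePathException e) {
                    return slowPath(x);       // de-exceptionization would undo this if it gets hot
                }
            }

            static int slowPath(int x) { return x - 1; }

            public static void main(String[] args) {
                System.out.println(dispatchOriginal(FAST_TAG, 41));
                System.out.println(dispatchExceptionized(FAST_TAG, 41));
            }
        }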

    Fast and Lean Immutable Multi-Maps on the JVM based on Heterogeneous Hash-Array Mapped Tries

    Get PDF
    An immutable multi-map is a many-to-many, thread-friendly map data structure with expected fast insert and lookup operations. Such data structures are used in applications that process graphs or many-to-many relations, as in the static analysis of object-oriented systems. When processing such big data sets, the memory overhead of the data structure encoding itself becomes a bottleneck. Motivated by reuse and type safety, libraries for Java, Scala, and Clojure typically implement immutable multi-maps by nesting sets as the values within the keys of a trie map. Based on our measurements, the expected overhead of such an encoding for a sparse multi-map adds up to around 65 bytes per stored entry, which makes it infeasible to compute with effectively on the JVM. In this paper we propose a general framework for Hash-Array Mapped Tries on the JVM that can store type-heterogeneous keys and values: the Heterogeneous Hash-Array Mapped Trie (HHAMT). Among other applications, this allows for a highly efficient multi-map encoding by (a) not reserving space for empty value sets and (b) inlining the values of singleton sets, while maintaining (c) a type-safe API. We detail the encoding and the optimizations necessary to mitigate the overhead of storing and retrieving heterogeneous data in a hash trie. Furthermore, we evaluate HHAMT specifically for the multi-map application, comparing it to state-of-the-art encodings of multi-maps in Java, Scala, and Clojure. We isolate key differences using microbenchmarks and validate the resulting conclusions on a real-world case in static analysis. The new encoding brings the per key-value storage overhead down to 30 bytes: a 2x improvement. With additional inlining of primitive values it reaches a 4x improvement.
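
    The sketch below illustrates ideas (a) and (b) on top of a plain java.util.HashMap rather than a hash-array mapped trie, and it is mutable where the paper's structure is immutable: a key maps directly to its single value until a second value arrives, at which point the singleton is promoted to a set. All names are illustrative.

        import java.util.HashMap;
        import java.util.HashSet;
        import java.util.Map;
        import java.util.Set;

        // Multi-map whose value slot heterogeneously holds either a V or a Set<V>.
        public final class InliningMultiMap<K, V> {
            private final Map<K, Object> entries = new HashMap<>();

            @SuppressWarnings("unchecked")
            public void put(K key, V value) {
                Object current = entries.get(key);
                if (current == null) {
                    entries.put(key, value);          // singleton: store the value inline
                } else if (current instanceof Set) {
                    ((Set<V>) current).add(value);    // already a set: just add
                } else {
                    Set<V> set = new HashSet<>();     // promote singleton to a set
                    set.add((V) current);
                    set.add(value);
                    entries.put(key, set);
                }
            }

            @SuppressWarnings("unchecked")
            public Set<V> get(K key) {
                Object current = entries.get(key);
                if (current == null) return Set.of();
                if (current instanceof Set) return (Set<V>) current;
                return Set.of((V) current);
            }

            public static void main(String[] args) {
                InliningMultiMap<String, Integer> mm = new InliningMultiMap<>();
                mm.put("a", 1);            // stored without a set wrapper
                mm.put("a", 2);            // promoted to {1, 2}
                System.out.println(mm.get("a"));
            }
        }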

    On the fly type specialization without type analysis

    Full text link
    Dynamically typed programming languages such as JavaScript and Python defer type checking to run time. In order to maximize performance, dynamic language virtual machine implementations must attempt to eliminate redundant dynamic type checks. This is typically done using type inference analysis. However, type inference analyses are often costly and involve trade-offs between compilation time and resulting precision, which has led to the creation of increasingly complex multi-tiered VM architectures. We introduce lazy basic block versioning, a simple just-in-time compilation technique which effectively removes redundant type checks from critical code paths. This novel approach lazily generates type-specialized versions of basic blocks on the fly while propagating context-dependent type information. It does not require the use of costly program analyses, is not restricted by the precision limitations of traditional type analyses, and avoids the implementation complexity of speculative optimization techniques. Three extensions are made to the basic block versioning technique in order to give it interprocedural optimization capabilities. Typed object shapes give it the ability to attach type information to object properties and global variables. Entry point specialization allows it to pass type information from callers to callees, and call continuation specialization makes it possible to pass return value type information back to callers without dynamic overhead. We empirically demonstrate that these extensions enable basic block versioning to exceed the capabilities of static whole-program type analyses.
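
    A conceptual Java sketch of the core mechanism, not the thesis's implementation (the names and the String stand-in for machine code are illustrative): compiled block versions are keyed by the type context in which they are requested and generated lazily on first use, so only contexts that actually occur at run time get specialized code.

        import java.util.HashMap;
        import java.util.Map;

        public final class LazyBBV {
            record VersionKey(int blockId, String typeContext) {}

            private final Map<VersionKey, String> versions = new HashMap<>();

            // Returns (lazily generating on first request) the specialized version.
            public String versionFor(int blockId, String typeContext) {
                return versions.computeIfAbsent(new VersionKey(blockId, typeContext),
                    k -> compileSpecialized(k.blockId(), k.typeContext()));
            }

            private String compileSpecialized(int blockId, String typeContext) {
                // In a real JIT this would emit machine code in which type checks the
                // context already proves redundant are removed; the known types are
                // then propagated to successor blocks when their versions are requested.
                return "code(block=" + blockId + ", assuming " + typeContext + ")";
            }

            public static void main(String[] args) {
                LazyBBV jit = new LazyBBV();
                System.out.println(jit.versionFor(3, "x:int32"));   // generated now
                System.out.println(jit.versionFor(3, "x:int32"));   // reused
                System.out.println(jit.versionFor(3, "x:float64")); // new version for a new context
            }
        }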

    ACOTES project: Advanced compiler technologies for embedded streaming

    Get PDF
    Streaming applications are built of data-driven computational components that consume and produce unbounded data streams. Streaming-oriented systems have become dominant in a wide range of domains, including embedded applications and DSPs. However, programming efficiently for streaming architectures is a challenging task: the computation must be carefully partitioned and mapped to processes in a way that best matches the underlying streaming architecture, taking into account the distributed resources (memory, processing, real-time requirements) and communication overheads (processing and delay). These challenges have led to a number of proposed solutions whose goal is to improve the programmer's productivity in developing applications that process massive streams of data on programmable, parallel embedded architectures. StreamIt is one such example. Another, more recent approach is the one developed by the ACOTES project (Advanced Compiler Technologies for Embedded Streaming). The ACOTES approach to streaming applications consists of compiler-assisted mapping of streaming tasks to highly parallel systems in order to maximize cost-effectiveness, both in terms of energy and in terms of design effort. The analysis and transformation techniques automate large parts of the partitioning and mapping process, based on the properties of the application domain, on quantitative information about the target systems, and on programmer directives. This paper presents the outcomes of the ACOTES project, a 3-year collaborative effort of industrial (NXP, ST, IBM, Silicon Hive, NOKIA) and academic (UPC, INRIA, MINES ParisTech) partners, and advocates the use of the advanced compiler technologies that we developed to support embedded streaming.