35 research outputs found

    Java Virtual Machine Optimizations for Java and Dynamic Languages

    Get PDF
    Thesis (Ph.D.) -- Seoul National University Graduate School: Department of Electrical and Computer Engineering, February 2017. Soo-Mook Moon.

    The Java virtual machine (JVM) was introduced as a machine-independent runtime environment for running Java programs. As a 32-bit stack machine, the JVM can execute the bytecode instructions generated by compiling a Java program on any machine to which the JVM runtime has been correctly ported. The machine independence of the JVM brought about the huge success of both the Java programming language and the JVM itself on systems ranging from cloud servers to embedded devices such as handsets and smart cards.

    Since a bytecode instruction must be interpreted by the JVM runtime for execution on a specific underlying system, a Java program innately runs slower, due to interpretation overhead, than a C/C++ program compiled directly for that system. Java just-in-time (JIT) compilers, the de facto performance add-on modules, are employed to improve the performance of a JVM by translating Java bytecode into native machine code on demand. One important problem in Java JIT compilation is how to map stack entries and local variables of the JVM runtime to physical registers efficiently and quickly, since register-based computations are much faster than memory-based ones, while JIT compilation overhead is part of the whole running time. This paper introduces LaTTe, an open-source Java JIT compiler that performs fast generation of efficiently register-mapped RISC code. LaTTe first maps all local variables and stack entries to pseudo registers, followed by real register allocation, which also aggressively coalesces the copies corresponding to pushes and pops between local variables and stack entries. In addition to the efficient register allocation, LaTTe is equipped with various traditional and object-oriented optimizations such as common subexpression elimination (CSE), dynamic method inlining, and specialization. We also devised new mechanisms for Java exception handling and monitor handling in LaTTe, named on-demand exception handling and lightweight monitor respectively, to boost JVM performance further.

    Our experimental results indicate that LaTTe's sophisticated register mapping and allocation really pay off, achieving twice the performance of a naive JIT compiler that maps all local variables and stack entries to memory. They also show that LaTTe makes a reasonable trade-off between the quality and the speed of register mapping and allocation for the bytecode. We expect these results to benefit parallel and distributed Java computing as well, 1) by enhancing single-thread Java performance and 2) by significantly reducing the number of memory accesses that the rest of the system must properly order to maintain coherence and keep threads synchronized.

    Furthermore, the JVM has recently evolved into a general-purpose language runtime environment for executing popular programming languages such as JavaScript, Ruby, Python, and Scala. These languages have complex non-Java features, including dynamic typing and first-class functions, so additional language runtimes (engines) are provided on top of the JVM to support them with bytecode extensions. Although there are high-performance JVMs with powerful just-in-time (JIT) compilers, running these languages efficiently on the JVM is still a challenge. This paper introduces a simple and novel technique for the JVM JIT compiler called exceptionization to improve the performance of JVM-based language runtimes.
    We observed that, when executing some non-Java languages, the JVM encounters at least two times more branch bytecodes than for Java, most of which are highly biased to take only one target. Exceptionization treats such a highly biased branch as an implicit exception-throwing instruction. This allows the JVM JIT compiler to prune the infrequent target of the branch from the frequent control flow, thus compiling the frequent control flow more aggressively with better optimization. If a pruned path is taken, it runs like a Java exception handler, i.e., a catch block. We also devised de-exceptionization, a mechanism to cope with the case when a pruned path is actually executed more often than expected. Since exceptionization is a generic JVM optimization, independent of any specific language runtime, it is generally applicable to any language runtime on the JVM.

    Our experimental results show that exceptionization accelerates the performance of several non-Java languages. JavaScript-on-JVM runs faster by as much as 60%, and by 6% on average, when running the Octane benchmark suite on Oracle's latest Nashorn JavaScript engine and HotSpot 1.9 JVM. Ruby-on-JVM likewise improves by as much as 60% and by 6% on average, and Python-on-JVM by as much as 6%. We found that exceptionization is most effectively applicable to the branch bytecode of the language runtime itself, rather than to the bytecode of the application code or of the Java class libraries. This implies that the performance benefit of exceptionization comes from better JIT compilation of the non-Java language runtime.

    1. Introduction
    2. Java Virtual Machine Optimization for Java
    3. Java Virtual Machine Optimization for Dynamic Languages
    4. Summary and Conclusion
    Abstract (In Korean)
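    The branch transformation at the heart of exceptionization can be illustrated with a minimal Java sketch. The real optimization is performed by the JIT compiler on bytecode, not on source code, and the class and method names below are illustrative only:

        // Sketch of the exceptionization idea: a highly biased branch is
        // treated as an implicit exception-throwing instruction, so the JIT
        // can prune the cold target from the hot control flow.
        public class ExceptionizationSketch {
            static final class SlowPathException extends RuntimeException {
                // Preallocated and stackless, so the implicit "throw" stays cheap.
                static final SlowPathException INSTANCE = new SlowPathException();
                private SlowPathException() { super(null, null, false, false); }
            }

            // Before: both targets of the biased branch stay in the compiled
            // control flow, limiting optimization of the hot path.
            static int biasedBranch(int x) {
                if (x < 0) return slowPath(x);   // taken almost never
                return x * 2;                    // hot path
            }

            // After exceptionization: the infrequent target is pruned from the
            // main flow and materialized as a catch block, i.e., it runs like a
            // Java exception handler when it is actually taken.
            static int exceptionized(int x) {
                try {
                    if (x < 0) throw SlowPathException.INSTANCE;
                    return x * 2;                // compiled more aggressively
                } catch (SlowPathException e) {
                    return slowPath(x);          // rare, de-optimized path
                }
            }

            static int slowPath(int x) { return -x; }

            public static void main(String[] args) {
                System.out.println(exceptionized(21));  // 42, hot path
                System.out.println(exceptionized(-7));  // 7, pruned path
            }
        }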

    Proxy compilation for Java via a code migration technique

    Get PDF
    There is an increasing trend of using intermediate representations (IRs) to deliver programs in more and more languages, such as Java. Although Java can provide many advantages, including wider portability and better optimisation opportunities on execution, it introduces extra overhead by requiring an IR translation for program execution. For maximum execution performance, an optimising compiler is placed in the runtime to selectively optimise code regions regarded as “hotspots”. This common approach has been effectively deployed in many implementations of programming languages. However, the computational resources demanded by this approach make it less efficient, or even difficult, to deploy directly in a resource-constrained environment. One implementation approach is to use a remote compilation technique to support compilation during execution. The work presented in this dissertation supports the thesis that execution performance can be improved by efficient optimising compilation using a proxy dynamic optimising compiler. After surveying various approaches to the design and implementation of remote compilation, a proxy compilation system called Apus is defined. To demonstrate the effectiveness of using a dynamic optimising compiler as a proxy compiler, a complete proxy compilation system is written based on a research-oriented Java Virtual Machine (JVM). The proxy compilation system is discussed in detail, showing how to deliver remote binaries and manage a cache of binaries by using a code migration approach. The proxy compilation client shows how the proxy compilation service is integrated with the selective optimisation system to maximise execution performance. The results of empirical measurements of the system are given, showing the efficiency of code optimisation from either the proxy compilation service or a local binary cache. The conclusion of this work is that Java execution performance can be improved by efficient optimising compilation with a proxy compilation service using a code migration technique.
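    A hypothetical sketch of the client side of such a system is shown below; the RemoteCompiler interface, the hotness threshold, and all names are illustrative assumptions, not the actual Apus API:

        import java.util.HashMap;
        import java.util.Map;

        // Sketch of a proxy-compilation client: hot methods are compiled by a
        // remote proxy and the returned binaries are cached locally.
        public class ProxyCompilationClient {
            interface RemoteCompiler {                    // the remote proxy service
                byte[] compile(String signature, byte[] bytecode);
            }

            private final RemoteCompiler proxy;
            private final Map<String, byte[]> binaryCache = new HashMap<>();
            private final Map<String, Integer> counts = new HashMap<>();
            private static final int HOT_THRESHOLD = 1000; // selective optimisation

            ProxyCompilationClient(RemoteCompiler proxy) { this.proxy = proxy; }

            // Called on each invocation; returns native code once available,
            // or null to keep interpreting the method locally.
            byte[] onInvoke(String signature, byte[] bytecode) {
                byte[] binary = binaryCache.get(signature);
                if (binary != null) return binary;        // local binary cache hit
                if (counts.merge(signature, 1, Integer::sum) >= HOT_THRESHOLD) {
                    binary = proxy.compile(signature, bytecode); // offload compilation
                    binaryCache.put(signature, binary);   // migrate binary locally
                    return binary;
                }
                return null;
            }

            public static void main(String[] args) {
                ProxyCompilationClient client = new ProxyCompilationClient(
                        (sig, bc) -> new byte[] { 0x42 }); // stub proxy service
                byte[] code = null;
                for (int i = 0; i < HOT_THRESHOLD && code == null; i++) {
                    code = client.onInvoke("Foo.bar()V", new byte[0]);
                }
                System.out.println("compiled: " + (code != null)); // true
            }
        }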

    A general framework to realize an abstract machine as an ILP processor with application to Java

    Get PDF
    Ph.D. (Doctor of Philosophy)

    A Co-Processor Approach for Efficient Java Execution in Embedded Systems

    Get PDF
    This thesis deals with a hardware-accelerated Java virtual machine, named REALJava. The REALJava virtual machine is targeted at resource-constrained embedded systems. The goal is to attain increased computational performance with reduced power consumption. While these objectives are often seen as trade-offs, in this context both of them can be attained simultaneously by using dedicated hardware. The computational performance target of the REALJava virtual machine is initially set to match that of the currently available full-custom ASIC Java processors. As a secondary goal, all components of the virtual machine are designed so that the resulting system can be scaled to support multiple co-processor cores.

    The virtual machine is designed using the hardware/software co-design paradigm. The partitioning between the two domains is flexible, allowing customizations of the resulting system; for instance, floating-point support can be omitted from the hardware in order to decrease the size of the co-processor core. The communication between the hardware and software domains is encapsulated into modules. This allows the REALJava virtual machine to be easily integrated into any system, simply by redesigning the communication modules. Besides the virtual machine and the related co-processor architecture, several performance-enhancing techniques are presented. These include techniques related to instruction folding, stack handling, method invocation, constant loading, and control in the time domain.

    The REALJava virtual machine is prototyped using three different FPGA platforms. The original pipeline structure is modified to suit the FPGA environment. The performance of the resulting Java virtual machine is evaluated against existing Java solutions in the embedded systems field. The results show that the goals are attained, both in terms of computational performance and power consumption. The computational performance in particular is evaluated thoroughly, and the results show that REALJava is more than twice as fast as the fastest full-custom ASIC Java processor. In addition to standard Java virtual machine benchmarks, several new Java applications are designed both to verify the results and to broaden the spectrum of the tests.
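    Instruction folding, one of the techniques listed above, can be sketched in a few lines of Java; the opcode names and the folded form are simplified stand-ins, not the actual REALJava encoding:

        import java.util.ArrayList;
        import java.util.List;

        // Folds the stack-machine pattern LOAD a; LOAD b; ADD; STORE c into a
        // single register-style operation, eliminating the stack traffic.
        public class InstructionFoldingSketch {
            record Insn(String op, int arg) {}

            static List<Insn> fold(List<Insn> in) {
                List<Insn> out = new ArrayList<>();
                for (int i = 0; i < in.size(); i++) {
                    if (i + 3 < in.size()
                            && in.get(i).op().equals("LOAD")
                            && in.get(i + 1).op().equals("LOAD")
                            && in.get(i + 2).op().equals("ADD")
                            && in.get(i + 3).op().equals("STORE")) {
                        // one folded three-address operation replaces four
                        // stack operations
                        out.add(new Insn("ADD r" + in.get(i).arg() + ", r"
                                + in.get(i + 1).arg() + " -> r"
                                + in.get(i + 3).arg(), 0));
                        i += 3;
                    } else {
                        out.add(in.get(i));
                    }
                }
                return out;
            }

            public static void main(String[] args) {
                List<Insn> prog = List.of(new Insn("LOAD", 1), new Insn("LOAD", 2),
                        new Insn("ADD", 0), new Insn("STORE", 3));
                System.out.println(fold(prog)); // one folded instruction
            }
        }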

    Dynamic Binary Translation for Embedded Systems with Scratchpad Memory

    Get PDF
    Embedded software development has recently changed with advances in computing. Rather than fully co-designing software and hardware to perform a relatively simple task, nowadays embedded and mobile devices are designed as platforms where multiple applications can be run, new applications can be added, and existing applications can be updated. In this scenario, traditional constraints in embedded systems design (i.e., performance, memory and energy consumption, and real-time guarantees) are more difficult to address. New concerns (e.g., security) have become important and increase software complexity as well. In general-purpose systems, Dynamic Binary Translation (DBT) has been used to address these issues with services such as Just-In-Time (JIT) compilation, dynamic optimization, virtualization, power management, and code security. In embedded systems, however, DBT is not usually employed due to its performance, memory, and power overhead.

    This dissertation presents StrataX, a low-overhead DBT framework for embedded systems. StrataX addresses the challenges faced by DBT in embedded systems using novel techniques. To reduce DBT overhead, StrataX loads code from NAND-flash storage and translates it into a scratchpad memory (SPM), a software-managed on-chip SRAM with limited capacity. SPM has an access latency similar to that of a hardware cache, but consumes less power and chip area. StrataX manages the SPM as a software instruction cache, and employs victim compression and pinning to reduce retranslation cost and capture frequently executed code in the SPM. To prevent performance loss due to excessive code expansion, StrataX minimizes the amount of code inserted by DBT to maintain control of program execution. When a hardware instruction cache is available, StrataX dynamically partitions translated code between the SPM and main memory. With these techniques, StrataX has low performance overhead relative to native execution for MiBench programs. Further, it simplifies embedded software and hardware design by operating transparently to applications, without any special hardware support. StrataX achieves sufficiently low overhead to make it feasible to use DBT in embedded systems to address important design goals and requirements.
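    The SPM-as-instruction-cache policy can be modelled abstractly in Java; the capacity, eviction policy, and all names below are illustrative assumptions rather than StrataX internals:

        import java.util.HashSet;
        import java.util.LinkedHashMap;
        import java.util.Set;
        import java.util.function.Function;

        // Conceptual model of managing an SPM as a software instruction cache
        // with pinning of hot code and eviction of unpinned victims.
        public class SpmInstructionCache {
            private final int capacity;                          // SPM blocks
            private final Set<String> pinned = new HashSet<>();  // hot fragments
            private final LinkedHashMap<String, byte[]> blocks =
                    new LinkedHashMap<>(16, 0.75f, true);        // LRU order

            SpmInstructionCache(int capacity) { this.capacity = capacity; }

            void pin(String fragment) { pinned.add(fragment); }  // keep resident

            // Fetch a translated fragment, translating on a miss and evicting
            // an unpinned victim if the SPM is full; a real system could
            // compress the victim in main memory to cut retranslation cost.
            byte[] fetch(String fragment, Function<String, byte[]> translate) {
                byte[] code = blocks.get(fragment);
                if (code != null) return code;                   // SPM hit
                if (blocks.size() >= capacity) evictVictim();
                code = translate.apply(fragment);                // DBT translation
                blocks.put(fragment, code);
                return code;
            }

            private void evictVictim() {
                for (String candidate : blocks.keySet()) {       // LRU first
                    if (!pinned.contains(candidate)) {
                        blocks.remove(candidate);  // compress-and-stash point
                        return;
                    }
                }
            }

            public static void main(String[] args) {
                SpmInstructionCache spm = new SpmInstructionCache(2);
                spm.pin("hot-loop");
                spm.fetch("hot-loop", f -> new byte[16]);  // translated once
                spm.fetch("init", f -> new byte[8]);
                spm.fetch("cleanup", f -> new byte[8]);    // evicts "init"
                System.out.println("resident blocks: " + spm.blocks.size());
            }
        }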

    Guided rewriting and constraint satisfaction for parallel GPU code generation

    Get PDF
    Graphics Processing Units (GPUs) are notoriously hard to optimise for manually due to their scheduling and memory hierarchies. What is needed are good automatic code generators and optimisers for such parallel hardware. Functional approaches such as Accelerate, Futhark and LIFT leverage a high-level algorithmic Intermediate Representation (IR) to expose parallelism and abstract the implementation details away from the user. However, producing efficient code for a given accelerator remains challenging. Existing code generators depend on user input to choose a subset of hard-coded optimisations, or on automated exploration of the implementation search space. The former suffers from a lack of extensibility, while the latter is too costly due to the size of the search space. A hybrid approach is needed, where a space of valid implementations is built automatically and explored with the aid of human expertise.

    This thesis presents a solution combining user-guided rewriting and automatically generated constraints to produce high-performance code. The first contribution is an automatic tuning technique to find a balance between performance and memory consumption. Leveraging its functional patterns, the LIFT compiler is empowered to infer tuning constraints and limit the search to valid tuning combinations only. Next, the thesis reframes parallelisation as a constraint satisfaction problem. Parallelisation constraints are extracted automatically from the input expression, and a solver is used to identify valid rewritings. The constraints truncate the search space to valid parallel mappings only by capturing the scheduling restrictions of the GPU in the context of a given program. A synchronisation barrier insertion technique is proposed to prevent data races and improve the efficiency of the generated parallel mappings. The final contribution of this thesis is the guided rewriting method, where the user encodes a design space of structural transformations using high-level IR nodes called rewrite points. These strongly typed pragmas express macro rewrites and expose design choices as explorable parameters. The thesis proposes a small set of reusable rewrite points to achieve tiling, cache locality, data reuse and memory optimisation.

    A comparison with the vendor-provided handwritten kernels of the ARM Compute Library and with the TVM code generator demonstrates the effectiveness of this thesis' contributions. With convolution as a use case, LIFT-generated direct and GEMM-based convolution implementations are shown to perform on par with the state-of-the-art solutions on a mobile GPU. Overall, this thesis demonstrates that a functional IR lends itself well to user-guided and automatic rewriting for high-performance code generation.
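    LIFT itself is implemented in Scala; to keep this listing's examples in one language, the constraint-satisfaction idea is rendered below as a simplified Java sketch in which the device limits and constraints are illustrative, not LIFT's actual model:

        import java.util.ArrayList;
        import java.util.List;

        // Enumerates only the thread mappings that satisfy the GPU's
        // scheduling restrictions, instead of searching the full space.
        public class ParallelMappingSolver {
            static final int MAX_WORKGROUP_SIZE = 1024; // typical device limit
            static final int WARP_SIZE = 32;

            // Valid (global, local) sizes for a 1D problem of size n: the
            // local size must divide the global size, respect the device
            // maximum, and stay a multiple of the warp size for efficiency.
            static List<int[]> validMappings(int n) {
                List<int[]> valid = new ArrayList<>();
                for (int local = WARP_SIZE; local <= MAX_WORKGROUP_SIZE;
                        local += WARP_SIZE) {
                    if (n % local == 0) valid.add(new int[] { n, local });
                }
                return valid;
            }

            public static void main(String[] args) {
                for (int[] m : validMappings(4096)) {
                    System.out.println("global=" + m[0] + " local=" + m[1]);
                }
            }
        }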

    ProcessJ: The JVMCSP Code Generator

    Full text link
    We as a society have achieved greatness because we work together. There is power in numbers. However, when it comes to programming we have not been able to achieve the same level of symbiosis. This is because concurrent programming has been stigmatized as an advanced and abstract subject allegedly harder than sequential programming. Additionally, traditional approaches to solving concurrent problems using sequential programming become unnecessarily difficult, because most of what newcomers are taught about concurrent programming (e.g., message passing and threads), while technically correct, is completely irrelevant to the problems at hand. Rather than examining the preconceived notions of the problem, we stubbornly try to fix it using thread-and-lock models or non-shared-memory and message-passing models, making reasoning about the concurrent behavior of the problem extremely complicated, if possible at all.

    Exploiting threads effectively depends on the concurrency model supported by the programming language being used. What is also needed is fine-grained parallelism without the explicit use of locks and without asynchronicity, so that programs can be made easy to reason about. I believe that ProcessJ can be the programming language that provides a bridge from today's languages to tomorrow's concurrent programs. This thesis introduces ProcessJ, a new process-oriented language with Java-like syntax and CSP-based communication that uses synchronous channels. ProcessJ is cooperatively scheduled, runs on the Java Virtual Machine (JVM), and allows hundreds of millions of concurrent processes on a single core. Next, I describe its implementation and features. Following this, I explain the translation scheme from ProcessJ source code to Java, and how the generated code is used to create processes that correctly cooperate in scheduling without using the Thread or Runnable Java classes.
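    A minimal Java sketch of this execution model is given below; the class names and the retry-based blocking discipline are illustrative assumptions, not the code the ProcessJ compiler actually generates:

        import java.util.ArrayDeque;
        import java.util.Deque;

        // Processes are resumable objects scheduled cooperatively (no Thread
        // or Runnable), and a synchronous channel hands a value over only
        // when both a writer and a reader have arrived.
        public class JvmCspSketch {
            abstract static class Process {
                abstract boolean step();   // returns false when finished
            }

            static final class Channel {
                private Integer slot;      // value parked by a writer
                boolean write(int v) {     // succeeds only if the slot is free
                    if (slot != null) return false;
                    slot = v;
                    return true;
                }
                Integer read() {           // null until a writer has parked
                    Integer v = slot;
                    slot = null;
                    return v;
                }
            }

            // Cooperative scheduler: a process that could not make progress
            // is simply re-enqueued, yielding the core to other processes.
            static void schedule(Deque<Process> runQueue) {
                while (!runQueue.isEmpty()) {
                    Process p = runQueue.pollFirst();
                    if (p.step()) runQueue.addLast(p);
                }
            }

            public static void main(String[] args) {
                Channel c = new Channel();
                Process writer = new Process() {
                    int next = 0;
                    boolean step() { if (c.write(next)) next++; return next < 3; }
                };
                Process reader = new Process() {
                    int seen = 0;
                    boolean step() {
                        Integer v = c.read();
                        if (v != null) { System.out.println("got " + v); seen++; }
                        return seen < 3;
                    }
                };
                Deque<Process> q = new ArrayDeque<>();
                q.add(writer);
                q.add(reader);
                schedule(q);               // prints got 0, got 1, got 2
            }
        }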

    Reconfiguration of legacy software artifacts in resource-constrained embedded systems

    Get PDF
    Highly resource-constrained embedded systems are everywhere. Some of them can be found inside smartphones and electronic control units, others in wireless sensor networks or smart cards. The last two kinds of systems are among the most restrictive in terms of processing power, energy consumption and memory availability. Pricing policies often lead to a reduction in software functionality, as cheaper hardware with fewer resources is demanded for the final product. In order to allow more complex software to run on such constrained systems, this thesis proposes the use of software reconfiguration. In contrast to traditional uses of reconfiguration, this thesis proposes the use of reconfiguration mechanisms to reduce the footprint of a deeply embedded application while maintaining real-time constraints. Today's adaptable architectures require support for reconfigurability and adaptability at the design level. However, modern software products are often constructed out of reusable but non-adaptable legacy software artifacts to meet early time-to-market requirements. This thesis proposes a methodology to semi-automatically use existing binaries in a reconfigurable manner. It is based on binary analysis techniques that reconstruct the semantics of the binary application, allowing the system developer to select meaningful code parts as components from the binary code. Using a set of high-level constraints, the user is able to extract components from the binary application. These components are then subject to a design space exploration step, which optimizes the resulting reconfigurable system with respect to parameters such as worst-case blocking time and flash lifetime.
    With this approach, reconfiguration can be added with low effort to non-adaptive binary software in order to decrease the footprint of the application while maintaining real-time constraints. Date of defense: 05.04.2013. Paderborn, Univ., Diss., 2013.
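    The design space exploration step can be illustrated with a toy Java sketch; the component set, the cost model, and the real-time bound are invented for illustration and are not taken from the thesis:

        import java.util.List;

        // Chooses which extracted components stay RAM-resident and which are
        // reloaded on demand, minimizing footprint while keeping the
        // worst-case reload (blocking) time under a real-time bound.
        public class ReconfigurationDse {
            record Component(String name, int sizeBytes, int reloadMicros) {}

            static final int BLOCKING_BOUND_MICROS = 500; // real-time budget

            // Exhaustively tries every subset of components to evict from RAM.
            static int bestFootprint(List<Component> comps, int baseFootprint) {
                int n = comps.size(), best = baseFootprint;
                for (int mask = 0; mask < (1 << n); mask++) {
                    int saved = 0, worstBlocking = 0;
                    for (int i = 0; i < n; i++) {
                        if ((mask & (1 << i)) != 0) {
                            saved += comps.get(i).sizeBytes();
                            worstBlocking = Math.max(worstBlocking,
                                    comps.get(i).reloadMicros());
                        }
                    }
                    if (worstBlocking <= BLOCKING_BOUND_MICROS) {
                        best = Math.min(best, baseFootprint - saved);
                    }
                }
                return best;
            }

            public static void main(String[] args) {
                List<Component> comps = List.of(
                        new Component("logging", 4096, 300),
                        new Component("diagnostics", 8192, 700), // exceeds bound
                        new Component("config-parser", 2048, 200));
                System.out.println(bestFootprint(comps, 65536) + " bytes");
            }
        }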

    Scheme 2003: Proceedings of the Fourth Workshop on Scheme and Functional Programming

    Get PDF
    Technical report. This report contains the papers presented at the Fourth Workshop on Scheme and Functional Programming. The purpose of the Scheme Workshop is to discuss experience with and future developments of the Scheme programming language, including the future of Scheme standardization, as well as general aspects of computer science loosely centered on the general theme of Scheme.