
    Hybrid Java Compilation Using Just-in-Time and Ahead-of-Time Compilation in Embedded Systems

    Thesis (Ph.D.) -- Seoul National University Graduate School, Department of Electrical and Computer Engineering, August 2015. Advisor: Soo-Mook Moon.

    Many embedded Java software platforms execute two types of Java classes: those installed statically on the client device and those downloaded dynamically from service providers at runtime. For higher performance, it would be desirable to compile static Java classes with an ahead-of-time compiler (AOTC) and to handle dynamically downloaded classes with a just-in-time compiler (JITC), providing a hybrid compilation environment. We propose a hybrid Java compilation approach and perform an initial case study with a hybrid environment, constructed simply by merging an existing AOTC and a JITC for the same JVM. Contrary to our expectations, the hybrid environment does not deliver performance in between that of full-JITC and full-AOTC; in fact, its performance is even lower than full-JITC for many benchmarks. We analyzed the result and found that a naive merge of JITC and AOTC may result in inefficiencies, especially due to calls between JITC methods and AOTC methods. We also observed that the distribution of JITC methods and AOTC methods is important, and experimented with various distributions to understand when a hybrid environment can deliver the desired performance.

    Android Java is executed by the Dalvik virtual machine (VM), which is quite different from traditional Java VMs such as Oracle's HotSpot VM. Dalvik employs register-based bytecode while HotSpot employs stack-based bytecode, requiring a different style of interpretation; Dalvik also uses trace-based just-in-time compilation (JITC), while HotSpot uses method-based JITC. It is therefore an open question how the Dalvik VM performs compared to the HotSpot VM. Unfortunately, there has been little comparative evaluation of the two VMs, so the performance of the Dalvik VM is not well understood; more importantly, it is also not well understood how the performance of the Dalvik VM affects the overall performance of Android applications (apps). We make an attempt to evaluate the Dalvik VM: we install both VMs on the same board and compare their performance using the EEMBC benchmarks. In the JITC mode, Dalvik is slower than HotSpot by more than 2.9 times, and its generated code size is not smaller than HotSpot's due to its worse code quality and trace-chaining code. We also investigated how real Android apps differ from Java benchmarks, to understand why the slow Dalvik VM does not seriously affect the performance of Android apps.

    We propose a bytecode-to-C ahead-of-time compilation (AOTC) for the DVM to accelerate pre-installed apps. We translated the bytecode of some of the hot methods used by these apps to C code, which is then compiled together with the DVM source code. AOTC-generated code works with the existing Android zygote mechanism, with correct garbage collection and exception handling. Thanks to off-line, method-based compilation using an existing compiler with full optimizations and Java-specific optimizations, AOTC can generate quality code while obviating runtime compilation overhead. On benchmarks, AOTC improves performance by 65%, and it performs better than the recently introduced ART, which also performs ahead-of-time compilation.

    We cannot AOTC all middleware and framework methods on DTV and Android devices for hybrid compilation. Through a case study on DTV, we found that we need to AOTC enough methods and to reduce method call overhead. We propose an AOTC method selection heuristic based on method call chains.
    We select hot methods and their call-chain methods using profile data. Our heuristic based on method call chains achieves better performance than other heuristics, as sketched below.

    Table of contents:
    Chapter 1 Introduction: 1.1 The need of hybrid compilation; 1.2 Outline of the Dissertation.
    Chapter 2 Hybrid Compilation for Java Virtual Machine: 2.1 The Approach of Hybrid Compilation; 2.2 The JITC and AOTC (2.2.1 JVM and the Interpreter; 2.2.2 The JITC; 2.2.3 The AOTC); 2.3 Hybrid Compilation Environment; 2.4 Analysis of the Hybrid Environment (2.4.1 Call Behavior of Benchmarks; 2.4.2 Call Overhead; 2.4.3 Application Methods and Library Methods; 2.4.4 Improving hybrid performance: 2.4.4.1 Reducing the JITC-to-AOTC call overhead, 2.4.4.2 Performance impact of the distribution of JITC methods and AOTC methods).
    Chapter 3 Evaluation of Dalvik Virtual Machine: 3.1 Android Platform; 3.2 Java VM and Dalvik VM (3.2.1 Bytecode ISA; 3.2.3 Just-in-Time Compilation (JITC)); 3.3 Experimental Results (3.3.1 Experimental Environment; 3.3.2 Interpreter Performance; 3.3.3 JITC Performance; 3.3.4 Trace Extension); 3.4 Behavior of Real Android Apps.
    Chapter 4 Ahead-of-Time Compilation for Dalvik Virtual Machine: 4.1 Android and Dalvik VM Execution (4.1.1 Android Execution Model; 4.1.2 Dalvik VM; 4.1.3 Dexopt and JITC in the Dalvik VM); 4.2 AOTC Architecture; 4.3 Design and Implementation of AOTC (4.3.1 Dexopt and Code Generation; 4.3.2 C Code Generation; 4.3.3 AOTC Method Call; 4.3.4 Garbage Collection; 4.3.5 Exception Handling; 4.3.6 AOTC Method Linking); 4.4 AOTC Code Optimization (4.4.1 Method Inlining; 4.4.2 Spill Optimization; 4.4.3 Elimination of Redundant Code); 4.5 Experimental Result (4.5.1 Experimental Environment; 4.5.2 AOTC Target Methods; 4.5.3 Performance Impact of AOTC; 4.5.4 DVM AOTC vs. ART).
    Chapter 5 Selecting Ahead-of-Time Compilation Target Methods for Hybrid Compilation: 5.1 Hybrid Compilation on DTV; 5.2 Hybrid Compilation on Android Device; 5.3 AOTC for Hybrid Compilation (5.3.1 AOTC Target Methods; 5.3.2 Case Study: Selecting on DTV); 5.4 Method Selection Using Call Chain; 5.5 Experimental Result (5.5.1 Experimental Environment; 5.5.2 Performance Impact).
    Chapter 6 Related Works.
    Chapter 7 Conclusion.
    Bibliography.
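    As a rough illustration of the call-chain idea, the following minimal sketch selects AOTC targets from profile data. The function names, data shapes, and thresholds are hypothetical, not the thesis's implementation; the intuition it captures is that compiling the callees on a hot method's call chains ahead of time removes costly JITC-to-AOTC boundary crossings on hot paths.

        # Hypothetical sketch of call-chain-based AOTC target selection.
        # Assumes: `profile` maps method -> execution count, and `calls`
        # maps method -> set of methods it invokes (a static call graph).

        def select_aotc_targets(profile, calls, hot_threshold, budget):
            """Pick hot methods, then add methods on call chains between
            them, so AOTC code calls AOTC code instead of crossing into
            JITC-compiled code."""
            hot = {m for m, count in profile.items() if count >= hot_threshold}
            selected = set(hot)

            # A callee that eventually reaches another hot method lies on a
            # hot call chain; compiling it removes a JITC<->AOTC transition.
            def reaches_hot(method, seen):
                if method in hot:
                    return True
                if method in seen:
                    return False
                seen.add(method)
                return any(reaches_hot(c, seen) for c in calls.get(method, ()))

            for m in hot:
                for callee in calls.get(m, ()):
                    if callee not in selected and reaches_hot(callee, set()):
                        selected.add(callee)

            # Respect a code-size budget: keep the hottest methods first.
            ranked = sorted(selected, key=lambda m: profile.get(m, 0), reverse=True)
            return ranked[:budget]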

    PolyBench/Python: Benchmarking Python Environments With Polyhedral Optimizations

    Python has become one of the most widely used and taught languages today. Its expressiveness, cross-compatibility and ease of use have made it popular in areas as diverse as finance, bioinformatics and machine learning. However, Python programs are often significantly slower to execute than an equivalent native C implementation, especially for computation-intensive numerical kernels. This work presents PolyBench/Python, implementing in Python the 30 kernels of PolyBench/C, one of the standard benchmark suites for polyhedral optimization. In addition to the benchmark kernels, a functional wrapper including mechanisms for performance measurement, testing, and execution configuration has been developed. The framework includes support for different ways to translate C-array codes into Python, offering insight into the tradeoffs of Python lists and NumPy arrays. The benchmark performance is thoroughly evaluated on different Python interpreters, and compared against its PolyBench/C counterpart to highlight the profitability (or lack thereof) of using Python for regular numerical codes.

    Funding: Ministerio de Ciencia e Innovación, PID2019-104184RB-I00 / AEI/10.13039/501100011033; U.S. National Science Foundation, CCF-1750399; Xunta de Galicia, ED431G 2019/0
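    To give a flavour of the translation tradeoff mentioned above (an illustrative sketch, not code from the PolyBench/Python sources), consider the atax kernel, which computes A^T(Ax). The list version mirrors the C loop nest statement by statement, while the NumPy version replaces the loops with whole-array operations that execute in native code:

        # Illustrative only: two ways to translate a C-array kernel into Python.
        import numpy as np

        def atax_lists(A, x):
            """Pure-Python translation: nested lists indexed like C arrays."""
            m, n = len(A), len(A[0])
            tmp = [0.0] * m
            y = [0.0] * n
            for i in range(m):
                for j in range(n):
                    tmp[i] += A[i][j] * x[j]
            for i in range(m):
                for j in range(n):
                    y[j] += A[i][j] * tmp[i]
            return y

        def atax_numpy(A, x):
            """NumPy translation: whole-array operations run in native code."""
            tmp = A @ x
            return A.T @ tmp

    On CPython the NumPy form is typically far faster for large inputs, while a JIT-compiling interpreter such as PyPy narrows the gap for the list form; quantifying exactly these effects is what the benchmark suite is for.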

    How to Do a Million Watchpoints: Efficient Debugging Using Dynamic Instrumentation

    Application debugging is a tedious but inevitable chore in any software development project. An effective debugger can make programmers more productive by allowing them to pause execution and inspect the state of the process, or monitor writes to memory to detect data corruption. The latter is a notoriously difficult category of bugs to diagnose and repair, especially in pointer-heavy applications. The debugging challenges will increase with the arrival of multicore processors, which require explicit parallelization of the user code to get any performance gains. Parallelization in turn can lead to more data-debugging issues, such as the detection of data races between threads. This paper leverages the increasing efficiency of runtime binary interpreters to provide a new concept of Efficient Debugging using Dynamic Instrumentation, or EDDI. The paper demonstrates for the first time the feasibility of using dynamic instrumentation on demand to accelerate software debuggers, especially when the available hardware support is lacking or inadequate. As an example, EDDI can simultaneously monitor millions of memory locations without crippling the host processing platform. It does this in software and hence provides a portable debugging environment. It is also well suited for interactive debugging because of the low associated overheads. EDDI provides a scalable and extensible debugging framework that can substantially increase the feature set of standard off-the-shelf debuggers.

    Funding: Singapore-MIT Alliance (SMA)
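    To illustrate the underlying idea (a conceptual sketch, not EDDI's actual design or API): a dynamic instrumentation tool can insert a check before every store instruction, and a coarse-grained filter keeps the common case cheap, which is what makes millions of exact watchpoints affordable in software. The page size, table layout, and hook names below are invented for illustration.

        # Conceptual sketch of software watchpoints via instrumentation.
        PAGE = 4096
        watched_pages = set()   # pages containing at least one watchpoint
        watchpoints = {}        # exact address -> callback

        def add_watchpoint(addr, callback):
            watchpoints[addr] = callback
            watched_pages.add(addr // PAGE)

        def on_store(addr, value):
            """Hook conceptually run before each instrumented store."""
            # Fast path: most stores touch pages with no watchpoints at all.
            if addr // PAGE not in watched_pages:
                return
            # Slow path: only now consult the exact watchpoint table.
            cb = watchpoints.get(addr)
            if cb is not None:
                cb(addr, value)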

    A virtualisation framework for embedded systems


    Enabling high performance dynamic language programming for micro-core architectures

    Micro-core architectures are intended to deliver high performance at a low overall power consumption by combining many simple central processing unit (CPU) cores, with an associated small amount of memory, onto a single chip. This technology is not only of great interest for embedded, Edge and IoT applications but also for High-Performance Computing (HPC) accelerators. However, micro-core architectures are difficult to program and exploit, not only because each technology is different, with its own idiosyncrasies, but also because each presents a different low-level interface to the programmer. Furthermore, micro-cores have very constrained amounts of on-chip scratchpad memory (often around 32KB), further hampering programmer productivity by requiring the programmer to manually manage the regular loading and unloading of data between the host and the device during program execution. To help address these issues, dynamic languages such as Python have been ported to several micro-core architectures, but these ports are often delivered as interpreters, with the associated performance penalty relative to natively compiled languages such as C.

    The research questions for this thesis target four areas of concern for dynamic programming languages on micro-core architectures: (RQ1) how to manage the limited on-chip memory for data, (RQ2) how to manage the limited on-chip memory for code, (RQ3) how to address the low runtime performance of virtual machines, and (RQ4) how to manage the idiosyncratic architectures of micro-cores. The focus of this work is to address these concerns whilst maintaining the programmer-productivity benefits of dynamic programming languages, using ePython as the research vehicle. Therefore, key areas of design (such as abstractions for offload) and implementation (novel compiler and runtime techniques for these technologies) are considered, resulting in a number of approaches that are applicable not only to the compilation of Python codes but also, more generally, to other dynamic languages on micro-core architectures.

    RQ1 was addressed by providing support for kernels with arbitrary data sizes through high-level programming abstractions that enable access to the memory hierarchies of micro-core devices, allowing the deployment of real-world applications, such as a machine learning code to detect cancer cells in full-sized scan images. A new abstract machine, Olympus, addressed RQ2 by supporting the compilation of dynamic languages, such as Python, to micro-core native code. Olympus enables ePython to close the kernel runtime performance gap with native C, matching C for the LINPACK and iterative Fibonacci benchmarks, and providing, on average, around 75% of native C runtime performance across four benchmarks running on a set of eight CPU architectures. Olympus also addresses RQ3 by providing dynamic function loading, supporting kernel codes larger than the on-chip memory whilst still retaining the runtime performance benefits of native code generation. Finally, RQ4 was addressed by the Eithne benchmarking framework, which not only enabled a single benchmarking code to be deployed, unchanged, across different CPU architectures, but also provided the underlying communications framework for Olympus. The portability of end-user ePython codes and the underlying Olympus abstract machine was validated by running a set of four benchmarks on eight different CPU architectures from a single codebase.
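    A hypothetical sketch of the kind of offload abstraction discussed above; the decorator, chunk size, and runtime behaviour are invented for illustration and are not ePython's actual API. The point it conveys is that a host-side runtime can stream data through the device's small scratchpad in fixed-size chunks, so a kernel can process arbitrarily large data (RQ1) without the programmer managing transfers by hand:

        # Hypothetical offload abstraction for a scratchpad-limited device.
        CHUNK = 4096  # elements per transfer; sized to fit the scratchpad

        def offload(kernel):
            """Decorator marking a function for execution on the device."""
            def run(data):
                results = []
                # The host runtime streams chunks to the device and back,
                # hiding the scratchpad limit from the programmer.
                for start in range(0, len(data), CHUNK):
                    chunk = data[start:start + CHUNK]
                    results.extend(kernel(chunk))  # stands in for a device call
                return results
            return run

        @offload
        def scale(chunk):
            return [2.0 * x for x in chunk]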

    Energy analysis and optimisation techniques for automatically synthesised coprocessors

    The primary outcome of this research project is the development of a methodology enabling fast, automated, early-stage power and energy analysis of configurable processors for system-on-chip platforms. Such capability is essential to the process of selecting energy-efficient processors during design-space exploration, when potential savings are highest. This has been achieved by developing dynamic and static energy consumption models for the constituent blocks within the processors. Several optimisations have been identified, specifically targeting the most significant blocks in terms of energy consumption: an instruction encoding mechanism reduces both the energy and area requirements of the instruction cache, and modifications to the multiplier unit reduce energy consumption during inactive cycles. Both techniques are demonstrated to offer substantial energy savings. The aforementioned techniques have undergone detailed evaluation and, based on the positive outcomes obtained, have been incorporated into Cascade, a system-on-chip coprocessor synthesis tool developed by Critical Blue, to provide automated analysis and optimisation of processor energy requirements. This thesis details the process of identifying and examining each method, along with the results obtained. Finally, a case study demonstrates the benefits of the developed functionality, from the perspective of someone using Cascade to automate the creation of an energy-efficient configurable processor for system-on-chip platforms.
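    As a generic illustration of such block-level models (a textbook-style formulation, not the thesis's exact equations), the energy of a candidate processor configuration can be estimated by summing a static term, proportional to execution time, and a dynamic term, proportional to activity, over the constituent blocks:

        E_{\mathrm{total}} = \sum_{b \in \mathrm{blocks}} \left( P_{\mathrm{leak},b}\, T + N_b\, E_{\mathrm{act},b} \right)

    where T is the total execution time, P_{leak,b} the leakage power of block b, N_b the number of activations of block b observed during simulation, and E_{act,b} the energy per activation. The per-block constants are characterised once per technology, so design-space exploration only needs fast early-stage simulation to obtain the activity counts.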

    New techniques for adaptive program optimization

    Adaptive optimization technology is a key ingredient in modern runtime systems. This technology aims at improving performance by making optimization decisions on the basis of a program's observed behavior. Application virtual machines indeed face different, and perhaps more compelling, issues compared to traditional static optimizers, as dynamic language features can force the deferral of the most effective optimizations until run time. In this thesis, we present novel ideas to improve adaptive optimization, focusing on two main problems: collecting fine-grained program profiles with low overhead to guide feedback-directed optimization, and supporting continuous optimization and deoptimization by diverting execution across dynamically generated code versions.

    We present two profiling techniques: the first works at the inter-procedural level to collect calling-context information for hot code portions, while the second captures cyclic-path profiles within a function's boundaries. Both techniques rely on efficient and elegant data structures, advancing the theory and practice of performance profiling.

    We then focus our attention on supporting continuous optimization through on-stack replacement (OSR) mechanisms. We devise a new OSR framework encoded entirely at the intermediate-representation level, which extends the best OSR practices with the ability to perform OSR at nearly any program location. Our techniques pave the road to aggressive optimizations and debugging techniques that were not supported by previous approaches. The main technical challenge is how to automatically generate compensation code to fix the program's state across an OSR transition between different code versions. We present a conceptual framework for OSR, distilling its essence to a core calculus with an operational semantics. Using bisimulation techniques, we describe how OSR can be correctly supported in the presence of common compiler optimizations, providing the first soundness results in this context.

    We implement our ideas in production systems such as Jikes RVM and the LLVM compiler toolchain, and evaluate their performance against a variety of prominent benchmarks. We investigate the end-to-end utility of our techniques in a series of case studies: we illustrate two possible applications of multi-iteration path profiling, and show how our OSR techniques advance the state of the art for MATLAB code optimization and for source-level debugging of optimized code. Part of the results of this thesis have been published in PLDI, OOPSLA, CGO, and Software: Practice and Experience.
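    As a rough illustration of calling-context profiling (a minimal sketch; the thesis's data structures are considerably more refined), a calling context tree (CCT) can be built from function entry and exit events inserted by the runtime, with hot contexts read off the per-node counters:

        # Minimal calling context tree (CCT) built from enter/leave events,
        # which a runtime would invoke from instrumented function prologues
        # and epilogues. Names and threshold are illustrative.

        class CCTNode:
            def __init__(self, name):
                self.name = name
                self.count = 0
                self.children = {}

        root = CCTNode("<root>")
        current = [root]  # stack of active contexts

        def enter(fn_name):
            node = current[-1].children.setdefault(fn_name, CCTNode(fn_name))
            node.count += 1
            current.append(node)

        def leave():
            current.pop()

        def hot_contexts(node=root, path=(), threshold=1000):
            """Yield (call chain, count) pairs above a hotness threshold."""
            if node.count >= threshold:
                yield path, node.count
            for child in node.children.values():
                yield from hot_contexts(child, path + (child.name,), threshold)

    Unlike a flat profile, each counter here is qualified by the full chain of callers, which is what lets feedback-directed optimization distinguish a method that is hot only under one caller from one that is hot everywhere.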

    Enabling software security mechanisms through architectural support

    Over the past decades, there has been a growing number of attacks compromising the security of computing systems. In the first half of 2020, data breaches caused by security attacks led to the exposure of 36 billion records containing private information, where the average cost of a data breach was $3.86 million. Over the years, researchers have developed a variety of software solutions that can actively protect computing systems against different classes of security attacks. However, such software solutions are rarely deployed in practice, largely due to their significant performance overhead, ranging from ~15% to multiple orders of magnitude. A hardware-assisted security extension can reduce the performance overhead of software-level implementations and provide a practical security solution. Hence, in recent years, there has been a growing trend in the industry to enforce security policies in hardware. Unfortunately, the current trend only implements dedicated hardware extensions for enforcing fixed security policies; as these policies are built into silicon, they cannot be updated at the pace at which security threats evolve.

    In this thesis, we propose a hybrid approach: developing and deploying both dedicated and flexible hardware-assisted security extensions. We incorporate an array of hardware engines as a security layer on top of an existing processor design. These engines take the form of Programmable Engines (PEs) and Specialized Engines (SEs). A PE is a minimally invasive and flexible design, capable of enforcing a variety of security policies as security threats evolve. In contrast, an SE, which requires targeted modifications to an existing processor design, is a dedicated hardware security extension. An SE is less flexible than a PE, but has lower overheads.

    We first propose a PE called PHMon, which can enforce a variety of security policies and can also assist with detecting software bugs and security vulnerabilities. We demonstrate the versatility of PHMon through five representative use cases: (1) a shadow stack, (2) a hardware-accelerated fuzzing engine, (3) information leak prevention, (4) hardware-accelerated debugging, and (5) a code coverage engine. We also propose two SEs as dedicated hardware extensions. Our first SE, called SealPK, provides an efficient and secure protection-key-based intra-process memory isolation mechanism for the RISC-V ISA. SealPK provides higher security guarantees than the existing hardware extension in Intel processors through three novel sealing features, which prevent an attacker from modifying sealed domains, sealed pages, and sealed permissions. Our second SE, called FlexFilt, provides an efficient capability to guarantee the integrity of isolation-based mechanisms by preventing the execution of various instructions in untrusted parts of the code at runtime.

    We demonstrate the feasibility of our PE and SEs by providing a practical prototype of our hardware engines interfaced with a RISC-V processor on an FPGA, and by providing the full Linux software stack for our design. Our FPGA-based evaluation demonstrates that PHMon improves the performance of fuzzing by 16X over the state-of-the-art software-based implementation, while a PHMon-based shadow stack has less than 1% performance overhead. An isolated shadow stack implemented by leveraging SealPK is 80X faster than an isolated implementation using mprotect, and FlexFilt incurs negligible performance overhead for filtering instructions.
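    To make the shadow-stack use case concrete, here is a minimal sketch of the policy itself; PHMon enforces this in hardware by observing the processor's execution, so the Python form and hook names are purely illustrative:

        # Illustrative shadow-stack policy: return addresses are recorded
        # on a protected side stack at each call and checked at each return.
        shadow_stack = []

        def on_call(return_address):
            shadow_stack.append(return_address)

        def on_return(return_address):
            expected = shadow_stack.pop()
            if return_address != expected:
                # A mismatch means the on-stack return address was
                # overwritten, e.g. by stack smashing; stop before control
                # transfers to the corrupted target.
                raise RuntimeError("shadow stack violation: control-flow hijack")

    Because the shadow copy lives outside the attacker-writable call stack, a corrupted return address is detected at the return itself, which is why this policy maps so naturally onto a hardware monitor with near-zero overhead.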