16 research outputs found

    A Retargetable System-Level DBT Hypervisor

    System-level Dynamic Binary Translation (DBT) provides the capability to boot an Operating System (OS) and execute programs compiled for an Instruction Set Architecture (ISA) different to that of the host machine. Due to their performance-critical nature, system-level DBT frameworks are typically hand-coded and heavily optimized, both for their guest and host architectures. While this results in good performance of the DBT system, the engineering costs of supporting a new architecture, or extending an existing one, are high. In this paper we develop a novel, retargetable DBT hypervisor, which includes guest-specific modules generated from high-level guest machine specifications. Our system not only simplifies retargeting of the DBT, but also delivers performance levels in excess of existing manually created DBT solutions. We achieve this by combining offline and online optimizations, and exploiting the freedom of a Just-In-Time (JIT) compiler operating in a bare-metal environment provided by a Virtual Machine (VM) hypervisor. We evaluate our DBT using both targeted micro-benchmarks and standard application benchmarks, and we demonstrate its ability to outperform the de facto standard QEMU DBT system. Our system delivers an average speedup of 2.21× over QEMU across SPEC CPU2006 integer benchmarks running in a full-system Linux OS environment, compiled for the 64-bit ARMv8-A ISA and hosted on an x86-64 platform. For floating-point applications the speedup is even higher, reaching 6.49× on average. We demonstrate that our system-level DBT system significantly reduces the effort required to support a new ISA, while delivering outstanding performance.

    Efficient cross-architecture hardware virtualisation

    Hardware virtualisation is the provision of an isolated virtual environment that represents real physical hardware. It enables operating systems, or other system-level software (the guest), to run unmodified in a “container” (the virtual machine) that is isolated from the real machine (the host). There are many use-cases for hardware virtualisation that span a wide range of end-users. For example, home-users wanting to run multiple operating systems side-by-side (such as running a Windows® operating system inside an OS X environment) will use virtualisation to accomplish this. In research and development environments, developers building experimental software and hardware want to prototype their designs quickly, and so will virtualise the platform they are targeting to isolate it from their development workstation. Large-scale computing environments employ virtualisation to consolidate hardware, enforce application isolation, migrate existing servers or provision new servers. However, the majority of these use-cases call for same-architecture virtualisation, where the architecture of the guest and the host machines match, a situation that can be accelerated by the hardware-assisted virtualisation extensions present on modern processors. There is nonetheless significant interest in virtualising the hardware of different architectures on a host machine, especially in the architectural research and development worlds. Typically, the instruction set architecture of a guest platform will be different to that of the host machine, e.g. an ARM guest on an x86 host will use an ARM instruction set, whereas the host will be using the x86 instruction set. Therefore, to enable this cross-architecture virtualisation, each guest instruction must be emulated by the host CPU, a potentially costly operation. This thesis presents a range of techniques for accelerating this instruction emulation, improving over a state-of-the-art instruction set simulator by 2.64x.
    Emulation of the guest platform’s instruction set is, however, not enough for full hardware virtualisation. In fact, this is just one challenge in a range of issues that must be considered. Another challenge is efficiently handling the way external interrupts are managed by the virtualisation system. This thesis shows that when employing efficient instruction emulation techniques, it is not feasible to arbitrarily divert control-flow without consideration being given to the state of the emulated processor. Furthermore, it is shown that the virtualisation environment can behave incorrectly if particular care is not given to the point at which control-flow is allowed to diverge. To solve this, a technique is developed that maintains efficient instruction emulation and correctly handles external interrupt sources. Finally, modern processors have built-in support for hardware virtualisation in the form of instruction set extensions that enable the creation of an abstract computing environment, indistinguishable from real hardware. These extensions enable guest operating systems to run directly on the physical processor, with minimal supervision from a hypervisor. However, these extensions are geared towards same-architecture virtualisation, and as such are not immediately well-suited for cross-architecture virtualisation. This thesis presents a technique for exploiting these existing extensions in a cross-architecture virtualisation setting, improving the performance of a novel cross-architecture virtualisation hypervisor over the state of the art by 2.5x.

    A virtualisation framework for embedded systems


    Hardware Accelerated Cross-Architecture Full-System Virtualization

    Hardware virtualization solutions provide users with benefits ranging from application isolation through server consolidation to improved disaster recovery and faster server provisioning. While hardware assistance for virtualization is supported by all major processor architectures, including Intel, ARM, PowerPC and MIPS, these extensions are targeted at virtualization of the same architecture, e.g. an x86 guest on an x86 host system. Existing techniques for cross-architecture virtualization, e.g. an ARM guest on an x86 host, still incur a substantial overhead for CPU, memory and I/O virtualization due to the necessity for software emulation of these mismatched system components. In this article we present a new hardware-accelerated hypervisor called CAPTIVE, employing a range of novel techniques which exploit existing hardware virtualization extensions to improve the performance of full-system cross-platform virtualization. We illustrate how (1) guest MMU events and operations can be mapped onto host memory virtualization extensions, eliminating the need for costly software MMU emulation, (2) a block-based DBT engine inside the virtual machine can improve CPU virtualization performance, (3) memory-mapped guest I/O can be efficiently translated to fast I/O-specific calls to emulated devices, and (4) the cost of asynchronous guest interrupts can be reduced. For an ARM-based Linux guest system running on an x86 host with Intel VT support we demonstrate application performance levels, based on SPEC CPU2006 benchmarks, of up to 5.88x over state-of-the-art QEMU and 2.5x on average, achieving a guest dynamic instruction throughput of up to 1280 MIPS and 915.52 MIPS on average.

    Lockdown: Dynamic Control-Flow Integrity

    Applications written in low-level languages without type or memory safety are especially prone to memory corruption. Attackers gain code execution capabilities through such applications despite all currently deployed defenses by exploiting memory corruption vulnerabilities. Control-Flow Integrity (CFI) is a promising defense mechanism that restricts open control-flow transfers to a static set of well-known locations. We present Lockdown, an approach to dynamic CFI that protects legacy, binary-only executables and libraries. Lockdown adaptively learns the control-flow graph of a running process using information from a trusted dynamic loader. The sandbox component of Lockdown restricts interactions between different shared objects to imported and exported functions by enforcing fine-grained CFI checks. Our prototype implementation shows that dynamic CFI results in low performance overhead. Comment: ETH Technical Report

    SimBench: A Portable Benchmarking Methodology for Full-System Simulators

    We acknowledge funding by the EPSRC grant PAMELA EP/K008730/1. Full-system simulators are increasingly finding their way into the consumer space for the purposes of backwards compatibility and hardware emulation (e.g. for games consoles). For such compute-intensive applications simulation performance is paramount. In this paper we argue that existing benchmark suites such as SPEC CPU2006, originally designed for architecture and compiler performance evaluation, are not well suited for the identification of performance bottlenecks in full-system simulators. While their large, complex workloads provide an indication as to the performance of the simulator on ‘real-world’ workloads, this does not give any indication of why a particular simulator might run an application faster or slower than another. In this paper we present SimBench, an extensive suite of targeted micro-benchmarks designed to run bare-metal on a full-system simulator. SimBench exercises dynamic binary translation (DBT) performance, interrupt and exception handling, memory-access performance, I/O and other performance-sensitive areas. SimBench is a cross-platform benchmarking framework and can be retargeted to new architectures with minimal effort. For several simulators, including QEMU, Gem5 and SimIt-ARM, and targeting ARM and Intel x86 architectures, we demonstrate that SimBench is capable of accurately pinpointing and explaining real-world performance anomalies, which are largely obfuscated by existing application-oriented benchmarks.

    From High Level Architecture Descriptions to Fast Instruction Set Simulators

    As computer systems become increasingly complex and diverse, so too do the architectures they implement. This leads to an increase in complexity in the tools used to design new hardware and software. One particularly important tool in hardware and software design is the Instruction Set Simulator, which is used to prototype new architectures and hardware features, verify hardware, and test and debug software. Many Architecture Description Languages exist which facilitate the description of new architectural or hardware features, and generate tools such as simulators. However, these typically suffer from poor performance, are difficult to test effectively, and may be limited in functionality. This thesis considers three objectives when developing Instruction Set Simulators: performance, correctness, and completeness, and presents techniques which contribute to each of these. Performance is obtained by combining Dynamic Binary Translation techniques with a novel analysis of high-level architecture descriptions. This makes use of partial evaluation techniques to improve both the translation system and the quality of the translated code, leading to a performance improvement of over 2.5x compared to a naïve implementation. This thesis also presents techniques which contribute to the correctness objective. Each possible behaviour of each described instruction is used to guide the generation of a test case. Constraint satisfaction techniques are used to determine the necessary instruction encoding and context for each behaviour to be produced. It is shown that this is a significant improvement over benchmark-driven testing, and this technique has led to the discovery of several bugs and inconsistencies in multiple state-of-the-art instruction set simulators. Finally, several challenges in ‘Full System’ simulation are addressed, contributing to both the performance and completeness objectives.
    Full System simulation generally carries significant performance costs compared with other simulation strategies. Crucially, instructions which access memory require virtual-to-physical address translation and can now cause exceptions. Both of these processes must be correctly and efficiently handled by the simulator. This thesis presents novel techniques to address this issue which provide up to a 1.65x speedup over a state-of-the-art solution.

    Simulation Native des Systèmes Multiprocesseurs sur Puce à l'aide de la Virtualisation Assistée par le Matériel (Native Simulation of Multiprocessor Systems-on-Chip Using Hardware-Assisted Virtualization)

    Integration of multiple heterogeneous processors into a single System-on-Chip (SoC) is a clear trend in embedded systems. Designing and verifying these systems requires high-speed and easy-to-build simulation platforms. Among the software simulation approaches, native simulation is a good candidate since the embedded software is executed natively on the host machine, resulting in high-speed simulations without requiring instruction set simulator development effort. However, existing native simulation techniques execute the simulated software in a memory space shared between the modeled hardware and the host operating system. This results in many problems, including address space conflicts and overlaps as well as the use of host machine addresses instead of those of the target hardware platform. This makes it practically impossible to natively simulate legacy code running on the target platform. To overcome these issues, we propose the addition of a transparent address space translation layer to separate the target address space from that of the host simulator. We exploit Hardware-Assisted Virtualization (HAV) technology for this purpose, which is now readily available on almost all general-purpose processors. Experiments show that this solution does not degrade the native simulation speed, while keeping the ability to accomplish software performance evaluation. The proposed solution is scalable as well as flexible, and we provide the necessary evidence to support our claims with multiprocessor and hybrid simulation solutions.
    We also address the simulation of cross-compiled Very Long Instruction Word (VLIW) executables, using a Static Binary Translation (SBT) technique to generate native code that does not require run-time translation or interpretation support. This approach is interesting in situations where either the source code is not available or the target platform is not supported by any retargetable compilation framework, which is usually the case for VLIW processors. The generated simulators execute on top of our HAV-based platform and model the Texas Instruments (TI) C6x series processors. Simulation results for VLIW binaries show a speed-up of around two orders of magnitude compared to cycle-accurate simulators.

    Dynamic Binary Translation for Embedded Systems with Scratchpad Memory

    Embedded software development has recently changed with advances in computing. Rather than fully co-designing software and hardware to perform a relatively simple task, nowadays embedded and mobile devices are designed as a platform where multiple applications can be run, new applications can be added, and existing applications can be updated. In this scenario, traditional constraints in embedded systems design (i.e., performance, memory and energy consumption and real-time guarantees) are more difficult to address. New concerns (e.g., security) have become important and increase software complexity as well. In general-purpose systems, Dynamic Binary Translation (DBT) has been used to address these issues with services such as Just-In-Time (JIT) compilation, dynamic optimization, virtualization, power management and code security. In embedded systems, however, DBT is not usually employed due to performance, memory and power overhead. This dissertation presents StrataX, a low-overhead DBT framework for embedded systems. StrataX addresses the challenges faced by DBT in embedded systems using novel techniques. To reduce DBT overhead, StrataX loads code from NAND-Flash storage and translates it into a Scratchpad Memory (SPM), a software-managed on-chip SRAM with limited capacity. SPM has an access latency similar to that of a hardware cache, but consumes less power and chip area. StrataX manages SPM as a software instruction cache, and employs victim compression and pinning to reduce retranslation cost and capture frequently executed code in the SPM. To prevent performance loss due to excessive code expansion, StrataX minimizes the amount of code inserted by DBT to maintain control of program execution. When a hardware instruction cache is available, StrataX dynamically partitions translated code among the SPM and main memory. With these techniques, StrataX has low performance overhead relative to native execution for MiBench programs.
    Further, it simplifies embedded software and hardware design by operating transparently to applications without any special hardware support. StrataX achieves sufficiently low overhead to make it feasible to use DBT in embedded systems to address important design goals and requirements.