16 research outputs found
A Retargetable System-Level DBT Hypervisor
System-level Dynamic Binary Translation (DBT) provides the capability to boot an Operating System (OS) and execute programs compiled for an Instruction Set Architecture (ISA) different to that of the host machine. Due to their performance critical nature, system-level DBT frameworks are typically hand-coded and heavily optimized, both for their guest and host architectures. While this results in good performance of the DBT system, engineering costs for supporting a new, or extending an existing architecture are high. In this paper we develop a novel, retargetable DBT hypervisor, which includes guest specific modules generated from high-level guest machine specifications. Our system simplifies retargeting of the DBT, but it also delivers performance levels in excess of existing manually created DBT solutions. We achieve this by combining offline and online optimizations, and exploiting the freedom of a Just-in-time (JIT) compiler operating in a bare-metal environment provided by a Virtual Machine (VM) hypervisor. We evaluate our DBT using both targeted micro-benchmarks as well as standard application benchmarks, and we demonstrate its ability to outperform the de-facto standard QEMU DBT system. Our system delivers an average speedup of 2.21× over QEMU across SPEC CPU2006 integer benchmarks running in a full-system Linux OS environment, compiled for the 64-bit ARMv8-A ISA and hosted on an x86-64 platform. For floating-point applications the speedup is even higher, reaching 6.49× on average. We demonstrate that our system-level DBT system significantly reduces the effort required to support a new ISA, while delivering outstanding performance.
Efficient cross-architecture hardware virtualisation
Hardware virtualisation is the provision of an isolated virtual environment that
represents real physical hardware. It enables operating systems, or other system-level
software (the guest), to run unmodified in a "container" (the virtual machine)
that is isolated from the real machine (the host).
There are many use-cases for hardware virtualisation that span a wide-range
of end-users. For example, home-users wanting to run multiple operating systems
side-by-side (such as running a Windows® operating system inside an OS
X environment) will use virtualisation to accomplish this. In research and development
environments, developers building experimental software and hardware
want to prototype their designs quickly, and so will virtualise the platform
they are targeting to isolate it from their development workstation. Large-scale
computing environments employ virtualisation to consolidate hardware, enforce
application isolation, migrate existing servers or provision new servers.
However, the majority of these use-cases call for same-architecture virtualisation,
where the architecture of the guest and the host machines match, a situation
that can be accelerated by the hardware-assisted virtualisation extensions
present on modern processors. But there is significant interest in virtualising
the hardware of different architectures on a host machine, especially in the
architectural research and development worlds.
Typically, the instruction set architecture of a guest platform will be different
to the host machine, e.g. an ARM guest on an x86 host will use an ARM instruction
set, whereas the host will be using the x86 instruction set. Therefore, to
enable this cross-architecture virtualisation, each guest instruction must be emulated
by the host CPU, a potentially costly operation. This thesis presents a
range of techniques for accelerating this instruction emulation, improving over
a state-of-the-art instruction set simulator by 2.64x. But emulation of the guest
platform's instruction set is not enough for full hardware virtualisation. In fact,
this is just one challenge in a range of issues that must be considered. Specifically,
another challenge is efficiently handling the way external interrupts are
managed by the virtualisation system. This thesis shows that when employing
efficient instruction emulation techniques, it is not feasible to arbitrarily
divert control-flow without consideration being given to the state of the emulated
processor. Furthermore, it is shown that it is possible for the virtualisation
environment to behave incorrectly if particular care is not given to the point
at which control-flow is allowed to diverge. To solve this, a technique is developed
that maintains efficient instruction emulation, and correctly handles
external interrupt sources.
Finally, modern processors have built-in support for hardware virtualisation
in the form of instruction set extensions that enable the creation of an abstract
computing environment, indistinguishable from real hardware. These extensions
enable guest operating systems to run directly on the physical processor,
with minimal supervision from a hypervisor. However, these extensions are
geared towards same-architecture virtualisation, and as such are not immediately
well-suited for cross-architecture virtualisation. This thesis presents a
technique for exploiting these existing extensions, and using them in a cross-architecture
virtualisation setting, improving the performance of a novel cross-architecture
virtualisation hypervisor over state-of-the-art by 2.5x.
Hardware Accelerated Cross-Architecture Full-System Virtualization
Hardware virtualization solutions provide users with benefits ranging from application isolation through server consolidation to improved disaster recovery and faster server provisioning. While hardware assistance for virtualization is supported by all major processor architectures, including Intel, ARM, PowerPC & MIPS, these extensions are targeted at virtualization of the same architecture, e.g. an x86 guest on an x86 host system. Existing techniques for cross-architecture virtualization, e.g. an ARM guest on an x86 host, still incur a substantial overhead for CPU, memory and I/O virtualization due to the necessity for software emulation of these mismatched system components. In this article we present a new hardware accelerated hypervisor called CAPTIVE, employing a range of novel techniques, which exploit existing hardware virtualization extensions for improving the performance of full-system cross-platform virtualization. We illustrate how (1) guest MMU events and operations can be mapped onto host memory virtualization extensions, eliminating the need for costly software MMU emulation, (2) a block-based DBT engine inside the virtual machine can improve CPU virtualization performance, (3) memory mapped guest I/O can be efficiently translated to fast I/O specific calls to emulated devices, and (4) the cost for asynchronous guest interrupts can be reduced. For an ARM-based Linux guest system running on an x86 host with Intel VT support we demonstrate application performance levels, based on SPEC CPU2006 benchmarks, of up to 5.88x over state-of-the-art QEMU and 2.5x on average, achieving a guest dynamic instruction throughput of up to 1280 MIPS and 915.52 MIPS on average.
Lockdown: Dynamic Control-Flow Integrity
Applications written in low-level languages without type or memory safety are
especially prone to memory corruption. Attackers gain code execution
capabilities through such applications despite all currently deployed defenses
by exploiting memory corruption vulnerabilities. Control-Flow Integrity (CFI)
is a promising defense mechanism that restricts open control-flow transfers to
a static set of well-known locations. We present Lockdown, an approach to
dynamic CFI that protects legacy, binary-only executables and libraries.
Lockdown adaptively learns the control-flow graph of a running process using
information from a trusted dynamic loader. The sandbox component of Lockdown
restricts interactions between different shared objects to imported and
exported functions by enforcing fine-grained CFI checks. Our prototype
implementation shows that dynamic CFI results in low performance overhead.
Comment: ETH Technical Report
SimBench: A Portable Benchmarking Methodology for Full-System Simulators
Full-system simulators are increasingly finding their way into the consumer space for the purposes of backwards compatibility and hardware emulation (e.g. for games consoles). For such compute-intensive applications simulation performance is paramount. In this paper we argue that existing benchmark suites such as SPEC CPU2006, originally designed for architecture and compiler performance evaluation, are not well suited for the identification of performance bottlenecks in full-system simulators. While their large, complex workloads provide an indication as to the performance of the simulator on "real-world" workloads, this does not give any indication of why a particular simulator might run an application faster or slower than another. In this paper we present SimBench, an extensive suite of targeted micro-benchmarks designed to run bare-metal on a full-system simulator. SimBench exercises dynamic binary translation (DBT) performance, interrupt and exception handling, memory-access performance, I/O and other performance-sensitive areas. SimBench is a cross-platform benchmarking framework and can be retargeted to new architectures with minimal effort. For several simulators, including QEMU, Gem5 and SimIt-ARM, and targeting ARM and Intel x86 architectures, we demonstrate that SimBench is capable of accurately pinpointing and explaining real-world performance anomalies, which are largely obfuscated by existing application-oriented benchmarks. We acknowledge funding by the EPSRC grant PAMELA EP/K008730/1.
Scalable Emulation of Heterogeneous Systems
The breakdown of Dennard's transistor scaling has driven computing systems toward application-specific accelerators, which can provide orders-of-magnitude improvements in performance and energy efficiency over general-purpose processors.
To enable the radical departures from conventional approaches that heterogeneous systems entail, research infrastructure must be able to model processors, memory and accelerators, as well as system-level changes, such as operating system or instruction set architecture (ISA) innovations, that might be needed to realize the accelerators' potential. Unfortunately, existing simulation tools that can support such system-level research are limited by the lack of fast, scalable machine emulators to drive execution.
To fill this need, in this dissertation we first present a novel machine emulator design based on dynamic binary translation that makes the following improvements over the state of the art: it scales on multicore hosts while remaining memory efficient, correctly handles cross-ISA differences in atomic instruction semantics, leverages the host floating point (FP) unit to speed up FP emulation without sacrificing correctness, and can be efficiently instrumented to drive, among other possible uses, the execution of a full-system, cross-ISA simulator with support for accelerators.
We then demonstrate the utility of machine emulation for studying heterogeneous systems by leveraging it to make two additional contributions. First, we quantify the trade-offs in different coupling models for on-chip accelerators. Second, we present a technique to reuse the private memories of on-chip accelerators when they are otherwise inactive to expand the system's last-level cache, thereby reducing the opportunity cost of the accelerators' integration.
From High Level Architecture Descriptions to Fast Instruction Set Simulators
As computer systems become increasingly complex and diverse, so too do the architectures
they implement. This leads to an increase in complexity in the tools used to design
new hardware and software. One particularly important tool in hardware and software
design is the Instruction Set Simulator, which is used to prototype new architectures and
hardware features, verify hardware, and test and debug software. Many Architecture
Description Languages exist which facilitate the description of new architectural or
hardware features, and generate tools such as simulators. However, these typically
suffer from poor performance, are difficult to test effectively, and may be limited in
functionality.
This thesis considers three objectives when developing Instruction Set Simulators:
performance, correctness, and completeness, and presents techniques which contribute
to each of these. Performance is obtained by combining Dynamic Binary Translation
techniques with a novel analysis of high level architecture descriptions. This makes use
of partial evaluation techniques in order to both improve the translation system, and to
improve the quality of the translated code, leading to a performance improvement of over
2.5x compared to a naïve implementation.
This thesis also presents techniques which contribute to the correctness objective.
Each possible behaviour of each described instruction is used to guide the generation
of a test case. Constraint satisfaction techniques are used to determine the necessary
instruction encoding and context for each behaviour to be produced. It is shown that
this is a significant improvement over benchmark-driven testing, and this technique
has led to the discovery of several bugs and inconsistencies in multiple state-of-the-art
instruction set simulators.
Finally, several challenges in "Full System" simulation are addressed, contributing
to both the performance and completeness objectives. Full System simulation generally
carries significant performance costs compared with other simulation strategies. Crucially,
instructions which access memory require virtual to physical address translation
and can now cause exceptions. Both of these processes must be correctly and efficiently
handled by the simulator. This thesis presents novel techniques to address this issue
which provide up to a 1.65x speedup over a state-of-the-art solution.
Simulation Native des Systèmes Multiprocesseurs sur Puce à l'aide de la Virtualisation Assistée par le Matériel (Native Simulation of Multiprocessor Systems-on-Chip Using Hardware-Assisted Virtualization)
Integration of multiple heterogeneous processors into a single System-on-Chip (SoC) is a clear trend in embedded systems. Designing and verifying these systems requires high-speed and easy-to-build simulation platforms. Among the software simulation approaches, native simulation is a good candidate since the embedded software is executed natively on the host machine, resulting in high-speed simulations without requiring instruction set simulator development effort. However, existing native simulation techniques execute the simulated software in a memory space shared between the modeled hardware and the host operating system. This results in many problems, including address space conflicts and overlaps as well as the use of host machine addresses instead of those of the target hardware platform. This makes it practically impossible to natively simulate legacy code running on the target platform. To overcome these issues, we propose the addition of a transparent address space translation layer to separate the target address space from that of the host simulator.
We exploit Hardware-Assisted Virtualization (HAV) technology for this purpose, which is now readily available on almost all general purpose processors. Experiments show that this solution does not degrade the native simulation speed, while keeping the ability to accomplish software performance evaluation. The proposed solution is scalable as well as flexible and we provide necessary evidence to support our claims with multiprocessor and hybrid simulation solutions. We also address the simulation of cross-compiled Very Long Instruction Word (VLIW) executables, using a Static Binary Translation (SBT) technique to generate native code that does not require run-time translation or interpretation support. This approach is interesting in situations where either the source code is not available or the target platform is not supported by any retargetable compilation framework, which is usually the case for VLIW processors. The generated simulators execute on top of our HAV based platform and model the Texas Instruments (TI) C6x series processors. Simulation results for VLIW binaries show a speed-up of around two orders of magnitude compared to cycle-accurate simulators.
Dynamic Binary Translation for Embedded Systems with Scratchpad Memory
Embedded software development has recently changed with advances in computing. Rather than fully co-designing software and hardware to perform a relatively simple task, nowadays embedded and mobile devices are designed as a platform where multiple applications can be run, new applications can be added, and existing applications can be updated. In this scenario, traditional constraints in embedded systems design (i.e., performance, memory and energy consumption and real-time guarantees) are more difficult to address. New concerns (e.g., security) have become important and increase software complexity as well.
In general-purpose systems, Dynamic Binary Translation (DBT) has been used to address these issues with services such as Just-In-Time (JIT) compilation, dynamic optimization, virtualization, power management and code security. In embedded systems, however, DBT is not usually employed due to performance, memory and power overhead.
This dissertation presents StrataX, a low-overhead DBT framework for embedded systems. StrataX addresses the challenges faced by DBT in embedded systems using novel techniques. To reduce DBT overhead, StrataX loads code from NAND-Flash storage and translates it into a Scratchpad Memory (SPM), a software-managed on-chip SRAM with limited capacity. SPM has access latency similar to that of a hardware cache, but consumes less power and chip area.
StrataX manages SPM as a software instruction cache, and employs victim compression and pinning to reduce retranslation cost and capture frequently executed code in the SPM. To prevent performance loss due to excessive code expansion, StrataX minimizes the amount of code inserted by DBT to maintain control of program execution. When a hardware instruction cache is available, StrataX dynamically partitions translated code among the SPM and main memory. With these techniques, StrataX has low performance overhead relative to native execution for MiBench programs. Further, it simplifies embedded software and hardware design by operating transparently to applications without any special hardware support. StrataX achieves sufficiently low overhead to make it feasible to use DBT in embedded systems to address important design goals and requirements.