Scalable Emulation of Heterogeneous Systems
The breakdown of Dennard's transistor scaling has driven computing systems toward application-specific accelerators, which can provide orders-of-magnitude improvements in performance and energy efficiency over general-purpose processors.
To enable the radical departures from conventional approaches that heterogeneous systems entail, research infrastructure must be able to model processors, memory and accelerators, as well as system-level changes---such as operating system or instruction set architecture (ISA) innovations---that might be needed to realize the accelerators' potential. Unfortunately, existing simulation tools that can support such system-level research are limited by the lack of fast, scalable machine emulators to drive execution.
To fill this need, in this dissertation we first present a novel machine emulator design based on dynamic binary translation that makes the following improvements over the state of the art: it scales on multicore hosts while remaining memory efficient, correctly handles cross-ISA differences in atomic instruction semantics, leverages the host floating point (FP) unit to speed up FP emulation without sacrificing correctness, and can be efficiently instrumented to---among other possible uses---drive the execution of a full-system, cross-ISA simulator with support for accelerators.
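The host-FPU fast path mentioned above can be illustrated with a small sketch. This is a hedged, illustrative reconstruction of the general technique (the names and the exact fast-path condition are assumptions, not the dissertation's actual implementation): when both inputs are zero or normal numbers and the result is unexceptional, the host FPU's IEEE 754 result can be used directly; NaN, infinite, or subnormal operands fall back to a slow path, which in a real emulator would be a soft-float routine that also tracks guest FP flags.

```c
#include <math.h>
#include <stdbool.h>

/* Illustrative sketch, not the dissertation's API. */
static bool fp_is_simple(float f) {
    int c = fpclassify(f);
    return c == FP_NORMAL || c == FP_ZERO;
}

/* Slow-path stand-in: a real emulator would run soft-float here and
 * update the guest's FP status flags. */
static float softfloat_add(float a, float b) {
    return a + b;
}

float emu_fadd(float a, float b) {
    if (fp_is_simple(a) && fp_is_simple(b)) {
        float r = a + b;          /* host FPU does the work */
        if (fp_is_simple(r))
            return r;             /* no exceptional conditions to model */
    }
    return softfloat_add(a, b);   /* NaN/inf/subnormal: slow path */
}
```

The appeal of this split is that the common case (normal arithmetic) runs at host-FPU speed, while correctness for exceptional values is preserved by the fallback.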
We then demonstrate the utility of machine emulation for studying heterogeneous systems by leveraging it to make two additional contributions. First, we quantify the trade-offs in different coupling models for on-chip accelerators. Second, we present a technique to reuse the private memories of on-chip accelerators when they are otherwise inactive to expand the system's last-level cache, thereby reducing the opportunity cost of the accelerators' integration.
Fast and Correct Load-Link/Store-Conditional Instruction Handling in DBT Systems
Dynamic Binary Translation (DBT) requires the implementation of load-link/store-conditional (LL/SC) primitives for guest systems that rely on this form of synchronization. When targeting e.g. x86 host systems, LL/SC guest instructions are typically emulated using atomic Compare-and-Swap (CAS) instructions on the host. Whilst this direct mapping is efficient, this approach is problematic due to subtle differences between LL/SC and CAS semantics. In this paper, we demonstrate that this is a real problem, and we provide code examples that fail to execute correctly on QEMU and a commercial DBT system, which both use the CAS approach to LL/SC emulation. We then develop two novel and provably correct LL/SC emulation schemes: (1) a purely software-based scheme, which uses the DBT system's page translation cache for correctly selecting between fast, but unsynchronized, and slow, but fully synchronized memory accesses, and (2) a hardware-accelerated scheme that leverages hardware transactional memory (HTM) provided by the host. We have implemented these two schemes in the Synopsys DesignWare® ARC® nSIM DBT system, and we evaluate our implementations against full applications and targeted micro-benchmarks. We demonstrate that our novel schemes are not only correct, but also deliver performance on par with, or better than, the widely used but broken CAS scheme.
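The semantic gap the paper exploits is the classic ABA problem. The single-threaded sketch below (illustrative names, not the paper's actual test cases) models the LL/SC exclusive monitor with a flag that any store clears: a real SC fails after an A→B→A interleaving because *some* store touched the location, whereas a CAS-based emulation only compares values and wrongly succeeds.

```c
#include <stdbool.h>
#include <stdint.h>

static uint32_t mem;           /* the guest memory word */
static bool monitor_armed;     /* exclusive monitor for the LL/SC model */

static uint32_t load_linked(void) {
    monitor_armed = true;
    return mem;
}

static void plain_store(uint32_t v) {  /* store by "another CPU" */
    mem = v;
    monitor_armed = false;             /* any store breaks the link */
}

static bool store_conditional(uint32_t v) {
    if (!monitor_armed) return false;  /* link broken: SC must fail */
    mem = v;
    monitor_armed = false;
    return true;
}

/* CAS-based emulation: succeeds iff the value still equals `expected`. */
static bool cas_emulated_sc(uint32_t expected, uint32_t v) {
    if (mem != expected) return false;
    mem = v;
    return true;
}

/* Returns true when the CAS emulation succeeds on an A->B->A
 * interleaving where a real SC correctly fails. */
bool aba_divergence(void) {
    /* scenario 1: real LL/SC */
    mem = 42;
    uint32_t seen = load_linked();            /* LL reads A (42) */
    plain_store(7);                           /* intruder: A -> B */
    plain_store(42);                          /* intruder: B -> A */
    bool sc_ok = store_conditional(seen + 1); /* correctly fails */

    /* scenario 2: CAS emulation, same interleaving */
    mem = 42;
    seen = mem;                               /* "LL" is a plain load */
    mem = 7; mem = 42;                        /* intruder: A -> B -> A */
    bool cas_ok = cas_emulated_sc(seen, seen + 1); /* wrongly succeeds */

    return !sc_ok && cas_ok;
}
```

Algorithms that encode meaning in the *history* of a location rather than its value (e.g. lock-free stacks with reused nodes) are exactly where this divergence breaks guest code.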
High-Performance RISC-V Emulation
Advisor: Edson Borin. Master's dissertation, Universidade Estadual de Campinas, Instituto de Computação. Abstract: RISC-V is an open ISA that has been attracting attention worldwide due to its fast growth and adoption. It is already supported by GCC, Clang and the Linux Kernel. Moreover, several emulators and simulators for RISC-V have arisen recently, but none of them with near-native performance. In this work, we investigate whether faster emulators for RISC-V can be created.
Since Dynamic Binary Translation (DBT), the most common and also the fastest technique for implementing an emulator, depends directly on good translation quality to achieve good performance, we investigate whether a high-quality translation of RISC-V binaries is feasible. Thus, in this work we implemented and evaluated an LLVM-based Static Binary Translation (SBT) engine to investigate whether or not it is possible to produce high-quality translations from RISC-V to x86 and ARM. Our experimental results indicate that our SBT engine is able to produce high-quality code when translating RISC-V binaries to x86 and ARM, with average overheads around 1.2x/1.3x when compared to native x86/ARM code, a better result than well-known RISC-V DBT engines such as RV8 and QEMU. Moreover, since the performance of DBT engines is strongly related to translation quality, our SBT engine evidences the opportunity to create RISC-V DBT emulators with higher performance than the current ones.
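The core idea of static binary translation can be shown with a toy example. This is a hedged illustration of the general approach, not the thesis's actual engine: guest registers become a native array, and each RISC-V instruction is translated ahead of time into one or two native statements, so no decode or dispatch happens at run time.

```c
#include <stdint.h>

/* Toy "translator output" for a RISC-V snippet that sums 1..n into x5
 * (assumes n >= 1):
 *     addi x5, x0, 0
 *   loop:
 *     add  x5, x5, x6
 *     addi x6, x6, -1
 *     bne  x6, x0, loop
 */
uint64_t translated_sum(uint64_t n) {
    uint64_t x[32] = {0};       /* RISC-V integer register file; x0 == 0 */
    x[6] = n;                   /* argument placed in x6 by the caller */

    x[5] = x[0] + 0;            /* addi x5, x0, 0 */
loop:
    x[5] = x[5] + x[6];         /* add  x5, x5, x6 */
    x[6] = x[6] + (uint64_t)-1; /* addi x6, x6, -1 */
    if (x[6] != x[0])           /* bne  x6, x0, loop */
        goto loop;

    return x[5];
}
```

Because the output is ordinary native code, the host compiler can optimize across guest instructions, which is the source of the near-native overheads the abstract reports.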
Efficient cross-architecture hardware virtualisation
Hardware virtualisation is the provision of an isolated virtual environment that
represents real physical hardware. It enables operating systems, or other system-level
software (the guest), to run unmodified in a “container” (the virtual machine)
that is isolated from the real machine (the host).
There are many use-cases for hardware virtualisation that span a wide-range
of end-users. For example, home-users wanting to run multiple operating systems
side-by-side (such as running a Windows® operating system inside an OS
X environment) will use virtualisation to accomplish this. In research and development
environments, developers building experimental software and hardware
want to prototype their designs quickly, and so will virtualise the platform
they are targeting to isolate it from their development workstation. Large-scale
computing environments employ virtualisation to consolidate hardware, enforce
application isolation, migrate existing servers or provision new servers.
However, the majority of these use-cases call for same-architecture virtualisation,
where the architecture of the guest and the host machines match—a situation
that can be accelerated by the hardware-assisted virtualisation extensions
present on modern processors. But, there is significant interest in virtualising
the hardware of different architectures on a host machine, especially in the
architectural research and development worlds.
Typically, the instruction set architecture of a guest platform will be different
to the host machine, e.g. an ARM guest on an x86 host will use an ARM instruction
set, whereas the host will be using the x86 instruction set. Therefore, to
enable this cross-architecture virtualisation, each guest instruction must be emulated
by the host CPU—a potentially costly operation. This thesis presents a
range of techniques for accelerating this instruction emulation, improving over
a state-of-the-art instruction set simulator by 2.64×. But emulation of the guest
platform’s instruction set is not enough for full hardware virtualisation. In fact,
this is just one challenge in a range of issues that must be considered. Specifically,
another challenge is efficiently handling the way external interrupts are
managed by the virtualisation system. This thesis shows that when employing
efficient instruction emulation techniques, it is not feasible to arbitrarily
divert control-flow without consideration being given to the state of the emulated
processor. Furthermore, it is shown that it is possible for the virtualisation
environment to behave incorrectly if particular care is not given to the point
at which control-flow is allowed to diverge. To solve this, a technique is developed
that maintains efficient instruction emulation, and correctly handles
external interrupt sources.
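The safe-point idea described above can be sketched as follows. This is a hedged, illustrative model (the names are assumptions, not the thesis's implementation): asynchronous interrupts are only *latched* when they arrive, and the dispatch loop diverts control-flow only between translated blocks, where the emulated register state is architecturally consistent; diverting mid-block could expose half-updated state.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t (*translated_block)(void); /* returns next guest PC */

static volatile bool irq_pending;  /* set by the emulated interrupt source */

void raise_irq(void) { irq_pending = true; }

uint32_t run(translated_block (*lookup)(uint32_t), uint32_t pc,
             uint32_t irq_vector, int max_blocks) {
    for (int i = 0; i < max_blocks; i++) {
        /* safe point: guest state is consistent between blocks */
        if (irq_pending) {
            irq_pending = false;
            pc = irq_vector;      /* divert to the guest's IRQ handler */
        }
        translated_block tb = lookup(pc);
        if (!tb) break;           /* no translation: halt (sketch only) */
        pc = tb();                /* execute one translated block */
    }
    return pc;
}

/* Tiny demo: guest code at PC 0 loops to itself until an IRQ diverts
 * execution to the handler at PC 100, which jumps to PC 200 (halt). */
static uint32_t block_loop(void)    { return 0; }
static uint32_t block_handler(void) { return 200; }

static translated_block demo_lookup(uint32_t pc) {
    if (pc == 0)   return block_loop;
    if (pc == 100) return block_handler;
    return 0;                     /* PC 200: no block, halt */
}

uint32_t demo(void) {
    irq_pending = false;
    raise_irq();                  /* interrupt arrives and is latched */
    return run(demo_lookup, 0, 100, 16);
}
```

Deferring delivery to block boundaries keeps the fast translated code free of per-instruction interrupt checks while still bounding delivery latency by the block length.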
Finally, modern processors have built-in support for hardware virtualisation
in the form of instruction set extensions that enable the creation of an abstract
computing environment, indistinguishable from real hardware. These extensions
enable guest operating systems to run directly on the physical processor,
with minimal supervision from a hypervisor. However, these extensions are
geared towards same-architecture virtualisation, and as such are not immediately
well-suited for cross-architecture virtualisation. This thesis presents a
technique for exploiting these existing extensions, and using them in a cross-architecture
virtualisation setting, improving the performance of a novel cross-architecture
virtualisation hypervisor over the state of the art by 2.5×.
Reproducible Host Networking Evaluation with End-to-End Simulation
Networking researchers are facing growing challenges in evaluating and
reproducing results for modern network systems. As systems rely on closer
integration of system components and cross-layer optimizations in the pursuit
of performance and efficiency, they are also increasingly tied to specific
hardware and testbed properties. Combined with a trend towards heterogeneous
hardware, such as protocol offloads, SmartNICs, and in-network accelerators,
researchers face the choice of either investing more and more time and
resources into comparisons to prior work or, alternatively, lowering the
standards for evaluation.
We aim to address this challenge by introducing SimBricks, a simulation
framework that decouples networked systems from the physical testbed and
enables reproducible end-to-end evaluation in simulation. Instead of
reinventing the wheel, SimBricks is a modular framework for combining existing
tried-and-true simulators for individual components, processor and memory, NIC,
and network, into complete testbeds capable of running unmodified systems. In
our evaluation, we reproduce key findings from prior work, including DCTCP
congestion control, NOPaxos in-network consensus acceleration, and the Corundum
FPGA NIC.
Native Simulation of Multiprocessor Systems-on-Chip Using Hardware-Assisted Virtualization
Integration of multiple heterogeneous processors into a single System-on-Chip (SoC) is a clear trend in embedded systems. Designing and verifying these systems require high-speed and easy-to-build simulation platforms. Among the software simulation approaches, native simulation is a good candidate since the embedded software is executed natively on the host machine, resulting in high-speed simulations without requiring instruction set simulator development effort. However, existing native simulation techniques execute the simulated software in a memory space shared between the modeled hardware and the host operating system. This results in many problems, including address space conflicts and overlaps, as well as the use of host machine addresses instead of those of the target hardware platform. This makes it practically impossible to natively simulate legacy code running on the target platform. To overcome these issues, we propose the addition of a transparent address space translation layer to separate the target address space from that of the host simulator.
We exploit the Hardware-Assisted Virtualization (HAV) technology for this purpose, which is now readily available on almost all general-purpose processors. Experiments show that this solution does not degrade the native simulation speed, while keeping the ability to accomplish software performance evaluation. The proposed solution is scalable as well as flexible, and we provide the necessary evidence to support our claims with multiprocessor and hybrid simulation solutions. We also address the simulation of cross-compiled Very Long Instruction Word (VLIW) executables, using a Static Binary Translation (SBT) technique to generate native code that does not require run-time translation or interpretation support. This approach is interesting in situations where either the source code is not available or the target platform is not supported by any retargetable compilation framework, which is usually the case for VLIW processors. The generated simulators execute on top of our HAV-based platform and model the Texas Instruments (TI) C6x series processors. Simulation results for VLIW binaries show a speed-up of around two orders of magnitude compared to cycle-accurate simulators.
Lasagne: a static binary translator for weak memory model architectures
Funding: This work was supported by a UK RISE Grant. The emergence of new architectures creates a recurring challenge: ensuring that existing programs still work on them. Manually porting legacy code is often impractical. Static binary translation (SBT) is a process in which a program's binary is automatically translated from one architecture to another while preserving its original semantics. However, existing SBT tools have limited support for advanced architectural features; importantly, they are currently unable to translate concurrent binaries. The main challenge arises from the mismatches between the memory consistency models specified by the different architectures, especially when porting existing binaries to a weak memory model architecture. In this paper, we propose Lasagne, an end-to-end static binary translator with precise translation rules between x86 and Arm concurrency semantics. First, we propose a concurrency model for Lasagne's intermediate representation (IR) and formally prove mappings between the IR and the two architectures. Memory ordering is preserved by introducing fences in the translated code. Finally, we propose optimizations focused on raising the level of abstraction of memory address calculations and reducing the number of fences. Our evaluation shows that Lasagne reduces the number of fences by up to about 65%, with an average reduction of 45.5%, significantly reducing their runtime overhead.
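The fence-insertion idea can be sketched with C11 atomics. This is a hedged illustration of the general technique, not Lasagne's actual translation rules: x86-TSO forbids load-load, load-store, and store-store reordering, which a translator targeting a weakly-ordered machine such as Arm can preserve by pairing each load with an acquire fence and each store with a release fence; the optimizations mentioned in the abstract would then try to prove most of these fences redundant and delete them.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Conservative per-access fences preserving x86-TSO ordering
 * on a weakly-ordered target (illustrative mapping). */
uint64_t translated_x86_load(_Atomic uint64_t *addr) {
    uint64_t v = atomic_load_explicit(addr, memory_order_relaxed);
    atomic_thread_fence(memory_order_acquire); /* keep load ordering */
    return v;
}

void translated_x86_store(_Atomic uint64_t *addr, uint64_t v) {
    atomic_thread_fence(memory_order_release); /* keep store ordering */
    atomic_store_explicit(addr, v, memory_order_relaxed);
}

/* Single-threaded smoke test of the translated accessors. */
uint64_t fence_demo(void) {
    static _Atomic uint64_t cell;
    translated_x86_store(&cell, 5);
    return translated_x86_load(&cell);
}
```

Note that TSO still permits store-load reordering, which is why this mapping need not place a full barrier between a store and a subsequent load; eliminating the remaining redundant fences is where most of the reported performance gain would come from.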
Simulating the Timing and Energy Consumption of Accelerated Computing
As the increase in the sequential processing performance of general-purpose central processing units has slowed down dramatically, computer systems have been moving towards increasingly parallel and heterogeneous architectures. Modern graphics processing units have emerged as one of the first affordable platforms for data-parallel processing. Due to their closed nature, it has been difficult for software developers to observe the performance and energy efficiency characteristics of the execution of applications on graphics processing units.
In this thesis, we have explored different tools and methods for observing the execution of accelerated processing on graphics processing units. We have found that hardware vendors provide interfaces for observing the timing of events that occur on the host platform and, to some extent, aggregated performance metrics of execution on the graphics processing units. However, more fine-grained details of execution are currently available only by using graphics processing unit simulators.
As a proof-of-concept, we have studied a functional graphics processing unit simulator as a tool for understanding the energy efficiency of accelerated processing. The presented energy estimation model and simulation method have been validated against a face detection application. The difference between the estimated and measured dynamic energy consumption in this case was found to be 5.4%. Functional simulators appear to be accurate enough to be used for observing the energy efficiency of graphics processing unit accelerated processing in certain use-cases.