25 research outputs found

    Full-System Simulation of Mobile CPU/GPU Platforms

    Graphics Processing Units (GPUs) critically rely on a complex system software stack comprising kernel- and userspace drivers and Just-In-Time (JIT) compilers. Yet existing GPU simulators typically abstract away details of the software stack and GPU instruction set. Partly, this is because GPU vendors rarely release sufficient information about their latest GPU products. However, it is also due to the lack of an integrated CPU/GPU simulation framework that is complete and powerful enough to drive the complex GPU software environment. This has led to a situation where research on GPU architectures and compilers is largely based on outdated or greatly simplified architectures and software stacks, undermining the validity of the generated results. In this paper, we develop a full-system simulation environment for a mobile platform, which enables users to run a complete and unmodified software stack for a state-of-the-art mobile Arm CPU and Mali-G71 GPU powered device. We validate our simulator against a hardware implementation and Arm's stand-alone GPU simulator, achieving 100% architectural accuracy across all available toolchains. We demonstrate the capability of our GPU simulation framework by optimizing an advanced Computer Vision application using simulated statistics unavailable with other simulation approaches or physical GPU implementations. We demonstrate that performance optimizations for desktop GPUs trigger bottlenecks on mobile GPUs, and show the importance of efficient memory use.

    Efficient Dual-ISA Support in a Retargetable, Asynchronous Dynamic Binary Translator

    Dynamic Binary Translation (DBT) allows software compiled for one Instruction Set Architecture (ISA) to be executed on a processor supporting a different ISA. Some modern DBT systems decouple their main execution loop from the built-in Just-In-Time (JIT) compiler, i.e. the JIT compiler can operate asynchronously in a different thread without blocking program execution. However, this creates a problem for target architectures with dual-ISA support such as ARM/THUMB, where the ISA of the currently executed instruction stream may differ from the one processed by the JIT compiler because of their decoupled operation and dynamic mode changes. In this paper, we present a new approach for dual-ISA support in such an asynchronous DBT system, which integrates ISA mode tracking and hot-swapping of software instruction decoders. We demonstrate how this can be achieved in a retargetable DBT system, where the target ISA is not hard-coded, but a processor-specific module is generated from a high-level architecture description. We have implemented ARM V5T support in our DBT and demonstrate execution rates of up to 1148 MIPS for the SPEC CPU 2006 benchmarks compiled for ARM/THUMB, achieving on average 192%, and up to 323%, of the speed of QEMU, which has been subject to intensive manual performance tuning and requires significant low-level effort for retargeting.

    Harmless, a Hardware Architecture Description Language Dedicated to Real-Time Embedded System Simulation

    Validation and verification of embedded systems through simulation can be conducted at many levels, from the simulation of a high-level application model to the simulation of the actual binary code using an accurate model of the processor. However, for real-time applications, the simulated execution time must be as close as possible to the execution time on the actual platform, and in this case the latter gives the closest results. The main drawback of simulating application software with an accurate model of the processor is that developing a handwritten simulator is a difficult and tedious task. This paper presents Harmless, a hardware Architecture Description Language (ADL) that mainly targets real-time embedded systems. Harmless is dedicated to generating simulators of the hardware platform in order to develop and test real-time embedded applications. Compared to existing ADLs, Harmless (1) offers a more flexible description of the Instruction Set Architecture (ISA), (2) allows the microarchitecture to be described independently of the ISA to ease its reuse, and (3) generates simulators that compare favorably with those generated by existing ADL toolsets.

    Hardware Accelerated Cross-Architecture Full-System Virtualization

    Hardware virtualization solutions provide users with benefits ranging from application isolation through server consolidation to improved disaster recovery and faster server provisioning. While hardware assistance for virtualization is supported by all major processor architectures, including Intel, ARM, PowerPC and MIPS, these extensions are targeted at virtualization of the same architecture, e.g. an x86 guest on an x86 host system. Existing techniques for cross-architecture virtualization, e.g. an ARM guest on an x86 host, still incur a substantial overhead for CPU, memory and I/O virtualization because these mismatched system components must be emulated in software. In this article, we present a new hardware accelerated hypervisor called CAPTIVE, which employs a range of novel techniques that exploit existing hardware virtualization extensions to improve the performance of full-system cross-platform virtualization. We illustrate how (1) guest MMU events and operations can be mapped onto host memory virtualization extensions, eliminating the need for costly software MMU emulation, (2) a block-based DBT engine inside the virtual machine can improve CPU virtualization performance, (3) memory mapped guest I/O can be efficiently translated to fast I/O specific calls to emulated devices, and (4) the cost for asynchronous guest interrupts can be reduced. For an ARM-based Linux guest system running on an x86 host with Intel VT support, we demonstrate application performance levels, based on SPEC CPU2006 benchmarks, of up to 5.88x over state-of-the-art QEMU and 2.5x on average, achieving a guest dynamic instruction throughput of up to 1280 MIPS and 915.52 MIPS on average.

    MPSoCBench: um framework para avaliação de ferramentas e metodologias para sistemas multiprocessados em chip (MPSoCBench: a framework for evaluating tools and methodologies for multiprocessor systems-on-chip)

    Advisor: Rodolfo Jardim de Azevedo. Thesis (doctorate) - Universidade Estadual de Campinas, Instituto de Computação.
    Abstract: Recent design methodologies and tools aim to enhance design productivity by providing a software development platform before the final details of the Multiprocessor System on Chip (MPSoC) architecture are defined. However, simulation can only be performed efficiently with a modeling and simulation engine that supports system behavior description at a high abstraction level. The lack of MPSoC virtual platforms that integrate both scalable hardware and software for creating and evaluating new methodologies and tools motivated us to develop MPSoCBench, a scalable set of MPSoCs including four different ISAs (PowerPC, MIPS, SPARC, and ARM) organized in platforms with 1, 2, 4, 8, 16, 32, and 64 cores, cross-compilers, IPs, interconnections, 17 parallel versions of software from well-known benchmarks, and power consumption estimation for the main components (processors, routers, main memory, and caches). An important demand in MPSoC designs is to address energy consumption constraints as early as possible. Because processor performance comes with a high power cost, there is an increasing interest in exploring the trade-off between power and performance, taking into account the target application domain. Dynamic Voltage and Frequency Scaling (DVFS) techniques adaptively scale the voltage and frequency levels of the CPU, allowing it to reach just enough performance to process the system workload while meeting throughput constraints, thereby reducing energy consumption. To explore this wide design space for energy efficiency and performance, both for hardware and software components, we added features to MPSoCBench for exploring DVFS and evaluated three mechanisms based on dynamic energy estimation and CPU usage rate.
    Doctorate. Computer Science. Doctor of Computer Science.

    Cache implementation in the ArchC project

    Advisors: Paulo Cesar Centoducatte, Rodolfo Jardim de Azevedo. Dissertation (master's) - Universidade Estadual de Campinas, Instituto de Computação.
    Abstract: The ArchC project aims to create an architecture description language, with the goal of building complete computer architecture simulators and toolchains. The goal of this work is to add support in ArchC for generating cache simulators. To achieve this, we carried out a detailed study of caches (types, organizations, configurations, etc.) and of the operation and code of ArchC. The result is a collection of parameterizable caches that can be added to architectures described in ArchC. The cache implementation is modular, with isolated code for the cache storage and for the operation policies. Correctness was verified by simulating many cache configurations and comparing the results with those of the dinero simulator. The resulting cache showed an overhead in simulation time varying between 10% and 60% when compared to a simulator without caches.
    Master's. Computer Science. Master of Computer Science.

    Performance analysis and optimizations of the ArchC simulators

    Advisors: Edson Borin, Rodolfo Jardim de Azevedo. Dissertation (master's) - Universidade Estadual de Campinas, Instituto de Computação.
    Abstract: Automatic generation has the great advantage of automating a process, reducing the time spent in that step and avoiding common mistakes. However, what is the advantage of reducing the time of one step if doing so may increase the time of the remaining steps? In digital circuit design, architecture description languages emerged and made possible tools that automatically generate simulators, compilers, and other tools, which are used to evaluate an architecture before any actual hardware exists. Automatically generated simulators are used to run applications and to check their behavior and that of the architecture being designed. But if the generated simulator is not efficient, the simulation time increases and can exceed the gain achieved by automatic generation, canceling its benefits. How can the efficiency of the generated simulator be checked in this case? A common option is to compare the generated simulator with other existing simulators. The alternative is to write a simulator manually for comparison. The first choice requires that the simulators be similar, and the second eliminates the purpose of automatic generation. In this context, we have developed a methodology for evaluating automatically generated simulators using code profiling. This allowed the identification of performance bottlenecks and, consequently, the development of optimizations in code generation. With the optimizations, we generated a simulator for the MIPS model that is 1.48 times better.
    Master's. Computer Science. Master of Computer Science. 01-P-3951/2011, 01-P-1965/2012. CAPE

    OpenISA, um conjunto de instruções híbrido (OpenISA, a hybrid instruction set)

    Advisor: Edson Borin. Thesis (doctorate) - Universidade Estadual de Campinas, Instituto de Computação.
    Abstract: OpenISA is designed as the interface of processors that aim to be highly flexible. This is achieved by means of three strategies. First, the ISA is empirically chosen to be easily translated to others, providing software flexibility in case a physical OpenISA processor is not available. In this case, there is no need to implement a virtual OpenISA processor in software, because the ISA is prepared to be statically translated to other ISAs. Second, the ISA is neither a concrete ISA nor a virtual ISA, but a hybrid one with the capability of admitting modifications to opcodes without impacting backwards compatibility. This mechanism allows future versions of the ISA to have real changes instead of simple extensions of previous versions, a common problem with concrete ISAs such as the x86. Third, the use of a permissive license allows the ISA to be freely used by any party interested in the project. In this PhD thesis, we focus on the user-level instructions of OpenISA. The thesis (1) discusses ISA alternatives, program distribution alternatives and the impact of each choice, (2) presents the features of OpenISA that are important for achieving its goals and (3) provides a thorough evaluation of the chosen ISA with respect to emulation performance on two popular host CPUs, one from Intel and another from ARM. We conclude that the version of OpenISA presented here can preserve close-to-native performance when translated to other hosts, working as a promising model for next-generation, flexible ISAs that can be easily extended while preserving backwards compatibility. Furthermore, we show how it can also serve as a user-level program distribution format.
    Doctorate. Computer Science. Doctor of Computer Science. 2011/09630-1. FAPES

    A virtualisation framework for embedded systems


    A Reconfigurable Processor for Heterogeneous Multi-Core Architectures

    A reconfigurable processor is a general-purpose processor coupled with an FPGA-like reconfigurable fabric. By deploying application-specific accelerators on such a system, performance can be improved for a wide range of applications. In this work, concepts are designed for the use of reconfigurable processors in multi-tasking scenarios and as part of multi-core systems.