Core level thermal estimation techniques for early design space exploration
The primary objective of this thesis is to develop a methodology for fast yet accurate temperature estimation during design space exploration. Power and temperature of modern-day systems have become important metrics in addition to performance. Static and dynamic power dissipation leads to an increase in temperature, which creates cooling and packaging issues. Furthermore, the transient thermal profile determines temperature gradients, hotspots and thermal cycles. Traditional solutions rely on cycle-accurate simulations of detailed micro-architectural structures and are slow. The thesis shows that periodic power estimation is the key bottleneck in such approaches. It also demonstrates an approach (FastSpot) that integrates accurate thermal estimation into existing host-compiled simulations. The developed methodology can incorporate different sampling-based thermal models. It achieves a 32000x increase in simulation throughput for temperature trace generation, while incurring low measurement errors (0.06 K transient, 0.014 K steady-state) compared to a cycle-accurate reference method.
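As a rough illustration of the kind of sampling-based thermal model such a methodology can incorporate, the sketch below integrates a single first-order RC thermal node over a sampled power trace. The function, parameter values and single-node structure are illustrative assumptions, not FastSpot's actual model.

```python
# Minimal sketch of a sampling-based thermal estimate: a single
# first-order RC node driven by per-interval power samples. Real tools
# use an RC network per core; all parameter values here are invented.

def thermal_trace(power_samples, dt, r_th=1.2, c_th=30.0, t_amb=318.0):
    """Integrate T' = (P - (T - T_amb)/R) / C over sampled power.

    power_samples : per-interval average power in watts
    dt            : sampling period in seconds
    r_th, c_th    : thermal resistance (K/W) and capacitance (J/K)
    t_amb         : ambient temperature in kelvin
    """
    t = t_amb
    trace = []
    for p in power_samples:
        # Explicit Euler step of the RC node; dt must be small
        # relative to the thermal time constant r_th * c_th.
        t += dt * (p - (t - t_amb) / r_th) / c_th
        trace.append(t)
    return trace

# Example: a 2 W burst followed by idle power.
temps = thermal_trace([2.0] * 500 + [0.3] * 500, dt=0.01)
print(f"peak {max(temps):.2f} K, final {temps[-1]:.2f} K")
```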
Accelerating host-compiled simulation by modifying IR code: industrial application in the spatial domain
Space applications rely on long and complex design processes, as they must deal with strict non-functional requirements such as criticality, timeliness, reliability and safety. The huge number of analyses and evaluations performed requires powerful simulation technologies combining high simulation speed and accuracy. Host-compiled simulation is a powerful approach to achieve fast, timed simulation of software running in complex embedded systems. However, in general there is still a need to improve the speed and accuracy of these solutions, and there is a lack of host-compiled approaches oriented to space applications. To address the first point, this paper presents an alternative to the standard solution of modeling the cross-compiled control flow in the host computer: modifying the compiler's intermediate representation instead. That way, the host binary naturally follows the cross-compiled binary flow, avoiding separate modeling and improving simulation speed while maintaining accuracy. Additionally, the paper focuses on the LEON processor, commonly used by the European Space Agency (ESA). This work has been funded by FEDER/Ministerio de Ciencia, Innovación y Universidades - Agencia Estatal de Investigación/TEC2017-86722-C4-3-R and the EC through the FP7-JTI 621429 EMC2 project.
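To make the approach concrete, here is a minimal sketch of basic-block-level timing annotation as used in host-compiled simulation: each block advances a simulated clock by a cost measured from the cross-compiled binary. In the paper this instrumentation is inserted automatically at the IR level; the class, block names and cycle counts below are invented.

```python
# Illustrative model of IR-level timing annotation: every basic block
# advances simulated time by a cost obtained from analysis of the
# cross-compiled binary. Block names and cycle costs are made up.

class HostCompiledSim:
    def __init__(self, block_cycles):
        self.block_cycles = block_cycles  # per-block cost from target analysis
        self.cycles = 0

    def annotate(self, block):
        # In a real flow this call is inserted into the compiler's
        # intermediate representation, not written by hand.
        self.cycles += self.block_cycles[block]

sim = HostCompiledSim({"entry": 12, "loop_body": 48, "exit": 7})

def program(sim, n):
    sim.annotate("entry")
    acc = 0
    for i in range(n):
        sim.annotate("loop_body")  # host code follows the target's control flow
        acc += i
    sim.annotate("exit")
    return acc

program(sim, 1000)
print(f"estimated target cycles: {sim.cycles}")
```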
Native Simulation of Multiprocessor Systems-on-Chip Using Hardware-Assisted Virtualization
Integration of multiple heterogeneous processors into a single System-on-Chip (SoC) is a clear trend in embedded systems. Designing and verifying these systems require high-speed and easy-to-build simulation platforms. Among software simulation approaches, native simulation is a good candidate, since the embedded software is executed natively on the host machine, resulting in high-speed simulation without requiring instruction set simulator development effort. However, existing native simulation techniques execute the simulated software in memory space shared between the modeled hardware and the host operating system. This results in many problems, including address space conflicts and overlaps, as well as the use of host machine addresses instead of those of the target hardware platform. This makes it practically impossible to natively simulate legacy code running on the target platform. To overcome these issues, we propose the addition of a transparent address space translation layer to separate the target address space from that of the host simulator. We exploit Hardware-Assisted Virtualization (HAV) technology for this purpose, which is now readily available on almost all general-purpose processors. Experiments show that this solution does not degrade native simulation speed, while keeping the ability to accomplish software performance evaluation. The proposed solution is scalable as well as flexible, and we provide the necessary evidence to support our claims with multiprocessor and hybrid simulation solutions. We also address the simulation of cross-compiled Very Long Instruction Word (VLIW) executables, using a Static Binary Translation (SBT) technique to generate native code that does not require run-time translation or interpretation support. This approach is interesting in situations where either the source code is not available or the target platform is not supported by any retargetable compilation framework, which is usually the case for VLIW processors. The generated simulators execute on top of our HAV-based platform and model the Texas Instruments (TI) C6x series processors. Simulation results for VLIW binaries show a speed-up of around two orders of magnitude compared to cycle-accurate simulators.
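The remapping idea behind the translation layer can be sketched in software, independent of the HAV machinery: target addresses stay unchanged and are transparently redirected to host storage. The page-map class below is only a conceptual illustration (the thesis performs the translation in hardware through guest page tables, with no per-access software cost), and the region layout is invented.

```python
# Software illustration of a transparent target-to-host address
# translation layer. The actual work does this with Hardware-Assisted
# Virtualization, so no lookup cost is paid per access; this
# dictionary-based page map only shows the remapping idea.

PAGE = 4096

class AddressSpace:
    def __init__(self):
        self.page_map = {}      # target page number -> host bytearray

    def map_region(self, target_base, size):
        for off in range(0, size, PAGE):
            self.page_map[(target_base + off) // PAGE] = bytearray(PAGE)

    def write8(self, target_addr, value):
        page = self.page_map[target_addr // PAGE]   # faults if unmapped
        page[target_addr % PAGE] = value & 0xFF

    def read8(self, target_addr):
        return self.page_map[target_addr // PAGE][target_addr % PAGE]

mem = AddressSpace()
mem.map_region(0x4000_0000, 64 * 1024)   # target RAM at its *target* address
mem.write8(0x4000_0010, 0xAB)            # legacy code keeps its own addresses
print(hex(mem.read8(0x4000_0010)))
```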
Inferring Energy Bounds via Static Program Analysis and Evolutionary Modeling of Basic Blocks
The ever-increasing number and complexity of energy-bound devices (such as
the ones used in Internet of Things applications, smartphones, and
mission-critical systems) pose an important challenge for techniques to
optimize their energy consumption and to verify that they will perform their
function within the available energy budget. In this work we address this
challenge from the software point of view and propose a novel parametric
approach to estimating tight bounds on the energy consumed by program
executions, bounds practical enough to be applied to energy verification and
optimization. Our approach
divides a program into basic (branchless) blocks and estimates the maximal and
minimal energy consumption for each block using an evolutionary algorithm. Then
it combines the obtained values according to the program control flow, using
static analysis, to infer functions that give both upper and lower bounds on
the energy consumption of the whole program and its procedures as functions on
input data sizes. We have tested our approach on (C-like) embedded programs
running on the XMOS hardware platform. However, our method is general enough to
be applied to other microprocessor architectures and programming languages. The
bounds obtained by our prototype implementation can be tight while remaining on
the safe side of budgets in practice, as shown by our experimental evaluation.
Comment: Pre-proceedings paper presented at the 27th International Symposium
on Logic-Based Program Synthesis and Transformation (LOPSTR 2017), Namur,
Belgium, 10-12 October 2017 (arXiv:1708.07854). Improved version of the one
presented at the HIP3ES 2016 workshop (v1): more experimental results (added
benchmark to Table 1, added figure for new benchmark, added Table 3),
improved Fig. 1, added Fig.
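As a concrete illustration of how per-block energy intervals can be composed along the control flow, the sketch below applies simple composition rules for sequences, branches and bounded loops. The block energies, program shape and numeric loop bounds are invented, and the paper's actual analysis infers symbolic bounds as functions of input data sizes via static resource analysis.

```python
# Sketch of composing per-block energy intervals along the control flow
# to bound whole-program energy. Loop bounds are plain numbers here and
# the program structure is invented for illustration.

def seq(*parts):                       # sequential composition
    return (sum(p[0] for p in parts), sum(p[1] for p in parts))

def branch(then_e, else_e):            # either arm may be taken
    return (min(then_e[0], else_e[0]), max(then_e[1], else_e[1]))

def loop(body_e, min_iter, max_iter):  # bounded iteration
    return (body_e[0] * min_iter, body_e[1] * max_iter)

# Per-block (min, max) energy in nanojoules, as an evolutionary search
# over block inputs might estimate them.
B_INIT, B_TEST, B_WORK, B_FIX, B_DONE = (5, 7), (1, 2), (20, 35), (8, 11), (3, 4)

def program_bounds(n):                 # bounds as a function of input size n
    body = seq(B_TEST, branch(B_WORK, B_FIX))
    return seq(B_INIT, loop(body, n, n), B_DONE)

print(program_bounds(100))             # (908, 3711) nJ for this toy program
```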
Simulation methodologies for mobile GPUs
GPUs critically rely on a complex system software stack comprising kernel- and user-space drivers and JIT compilers. Yet, existing GPU simulators typically abstract away details of the software stack and GPU instruction set. Partly, this is because GPU vendors rarely release sufficient information about their latest GPU products. However, this is also due to the lack of an integrated CPU-GPU simulation framework, which is complete and powerful enough to drive the complex GPU software environment. This has led to a situation where research on GPU architectures and compilers is largely based on outdated or greatly simplified architectures and software stacks, undermining the validity of the generated results. Making the situation even more dire, existing GPU simulation efforts are concentrated around desktop GPUs, making infrastructure for modelling mobile GPUs virtually non-existent, despite their surging importance in the GPU market. Still, mobile GPU designers are faced with the challenge of evaluating design alternatives involving hundreds of architectural configuration options and micro-architectural improvements under tight time-to-market constraints, to which currently employed design flows involving detailed, but slow simulations are not well suited. In this thesis we develop a full-system simulation environment for a mobile platform, which enables users to run a complete and unmodified software stack for a state-of-the-art mobile Arm CPU and Mali Bifrost GPU powered device, achieving 100% architectural accuracy across all available toolchains. We demonstrate the capability of our GPU simulation framework through a number of case studies exploring modern, mobile GPU applications, and optimize them using functional simulation statistics, unavailable with other approaches or hardware. Furthermore, we develop a trace-based performance model, allowing architects to rapidly model GPU configurations in early design space exploration.
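As a loose sketch of what a trace-based performance model looks like, the snippet below replays per-pass statistics gathered once from functional simulation against candidate GPU configurations using a simple bottleneck formula. The trace fields, numbers and formula are invented placeholders, not the thesis's actual model.

```python
# Minimal sketch of a trace-based GPU performance model: functional
# simulation emits per-workload statistics once, and candidate
# configurations are then evaluated analytically without re-simulation.

TRACE = [  # (shader instructions, bytes moved) per render pass, invented
    (4_000_000, 6_000_000),
    (9_000_000, 2_500_000),
]

def est_cycles(trace, cores, ipc_per_core, bytes_per_cycle):
    total = 0
    for instrs, mem_bytes in trace:
        compute = instrs / (cores * ipc_per_core)
        memory = mem_bytes / bytes_per_cycle
        total += max(compute, memory)   # bottleneck-style estimate per pass
    return total

# Rapid early design space exploration over shader core counts:
for cores in (4, 8, 16):
    cycles = est_cycles(TRACE, cores, ipc_per_core=2, bytes_per_cycle=16)
    print(f"{cores} cores -> {cycles:.0f} cycles")
```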
FAT-DBT engine (framework for application-tailored, co-designed dynamic binary translation engine)
Doctoral thesis in Electronics and Computer Engineering (PDEEC). Dynamic binary translation (DBT) has emerged as an execution engine that monitors,
modifies and possibly optimizes running applications for specific purposes.
DBT is deployed as an execution layer between the application binary and the
operating system or host machine, which creates opportunities for collecting
runtime information. Initially, DBT supported binary-level compatibility, but,
based on the collected runtime information, it also became popular for code
instrumentation, ISA virtualization and dynamic optimization purposes.
Building a DBT system brings many challenges, as it involves integrating
complex components and requires deep architecture-level knowledge. Moreover,
DBT incurs significant overheads, mainly due to code decoding and translation,
as well as execution and the emulation of general functionality. While DBT was
initially conceived with high-end architectures for performance-demanding
applications in mind, such challenges become even more evident when directing
it at embedded systems.
These make effective deployment very challenging due to their complexity,
tight memory constraints, and limited performance and power. Legacy support
and binary compatibility are topics of particular interest in such systems,
due to their broad dissemination in industrial environments and their
long-standing, wide utilization in sensing and monitoring processes, with
considerable maintenance and replacement costs.
To address such issues, this thesis intends to contribute a solution that
leverages an optimized and accelerated dynamic binary translator targeting
resource-constrained embedded systems while supporting legacy systems.
The developed work makes it possible to: (1) evaluate the potential of DBT for
legacy-support purposes on resource-constrained embedded systems; (2) achieve
a configurable DBT architecture specialized for resource-constrained embedded
systems; (3) address DBT translation, execution and emulation overheads
through the combination of software and hardware; and (4) promote DBT
utilization as a legacy-support tool for industry as an end-product.
permite a modificação e possível optimização de código executável para um determinado
propósito. A TBD é integrada nos sistemas como uma camada de execução
entre o código binário executável e o sistema operativo ou a máquina hospedeira
alvo, o que origina oportunidades de recolha de informação de execução.
A criação de um sistema de TBD traz consigo diversos desafios, uma vez que envolve
a integração de componentes complexos e conhecimentos aprofundados das
arquitecturas de processadores envolvidas. Ademais, a utilização de TBD gera diversos
custos computacionais indirectos, maioritariamente devido à descodificação
e tradução de código, bem como emulação de funcionalidades em geral. Considerando
que a TBD foi inicialmente pensada para sistemas de gama alta, os
desafios mencionados tornam-se ainda mais evidentes quando a mesma é aplicada
em sistemas embebidos. Nesta área os limitados recursos de memória e os exigentes
requisitos de desempenho e consumo energético,tornam uma implementação eficiente
de TBD muito difícil de obter. Compatibilidade binária e suporte a código
de legado são tópicos de interesse em sistemas embebidos, justificado pela ampla
disseminação dos mesmos no meio industrial para tarefas de sensorização e monitorização
ao longo dos tempos, reforçado pelos custos de manutenção adjacentes
à sua utilização.
Para endereçar os desafios descritos, nesta tese propõe-se uma solução para potencializar
a tradução binária dinâmica, optimizada e com aceleração, para suporte a
código de legado em sistemas embebidos de baixa gama.
O trabalho permitiu (1) avaliar o potencial da TBD quando aplicada ao suporte
a código de legado em sistemas embebidos de baixa gama; (2) a obtenção de
uma arquitectura de TBD configurável e especializada para este tipo de sistemas;
(3) reduzir os custos computacionais associados à tradução, execução e emulação,
através do uso combinado de software e hardware; (4) e promover a utilização na
industria de TBD como uma ferramenta de suporte a código de legado.This thesis was supported by a PhD scholarship from Fundação para a Ciência e
Tecnologia, SFRH/BD/81681/201
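To make the translation and caching overheads discussed above tangible, here is a minimal sketch of the central loop of a DBT engine: each guest basic block is translated on first execution and cached for reuse. The engine class, toy translator and guest program are invented for illustration; the thesis additionally offloads parts of this loop to hardware.

```python
# Sketch of a DBT engine's main loop: translate a source-ISA basic
# block on first execution, cache it, and run cached translations
# afterwards. Guest "ISA" and translator are toy stand-ins.

class DBTEngine:
    def __init__(self, translate_block):
        self.translate_block = translate_block
        self.tcache = {}                 # guest PC -> translated block

    def run(self, guest_pc, state):
        while guest_pc is not None:
            block = self.tcache.get(guest_pc)
            if block is None:            # decode/translate cost paid once
                block = self.translate_block(guest_pc)
                self.tcache[guest_pc] = block
            guest_pc = block(state)      # execute; returns next guest PC

def translate(pc):                       # stand-in translator, 2-block guest
    if pc == 0:
        def loop_block(st):
            st["r0"] -= 1
            return 0 if st["r0"] > 0 else 4
        return loop_block
    def halt_block(st):
        return None                      # fall off the end: stop execution
    return halt_block

state = {"r0": 10}
DBTEngine(translate).run(0, state)
print(state)                             # {'r0': 0}
```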
From Parallel Programs to Customized Parallel Processors
The need for fast time to market of new embedded processor-based designs calls for a rapid design methodology for the included processors. The call for such a methodology is even stronger in the context of so-called soft cores targeted at reconfigurable fabrics, where per-design processor customization is commonplace.
The C language has been commonly used as an input to hardware/software co-design flows. However, as C is a sequential language, its potential to generate parallel operations to utilize naturally parallel hardware constructs is far from optimal, leading to a customized processor design space with limited parallel resource scalability. In contrast, when utilizing a parallel programming language as an input, a wider processor design space can be explored to produce customized processors with varying degrees of utilized parallelism.
This thesis proposes a novel Multicore Application-Specific Instruction Set Processor (MCASIP) co-design methodology that exploits parallel programming languages as the application input format. In the methodology, the designer can explicitly capture the parallelism of the algorithm and exploit specialized instructions using a parallel programming language, instead of being at the mercy of the compiler or the hardware to extract the parallelism from a sequential input. The thesis proposes a multicore processor template based on the Transport Triggered Architecture, compiler techniques for the static parallelization of computation kernels with barriers, and a datapath-integrated hardware accelerator for low-overhead software synchronization. These contributions enable scaling the customized processors at both the instruction and task levels to efficiently exploit the parallelism in the input program, up to implementation constraints such as memory bandwidth or chip area. The different contributions are validated with case studies, comparisons and design examples.
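As an illustration of the kind of compiler technique involved in statically parallelizing barrier-synchronized kernels (a sketch of the general idea, not the thesis's exact transformation), the snippet below splits a kernel at its barrier and wraps each region in a loop over all work-items, preserving barrier semantics without threads; the kernel itself is invented.

```python
# Static parallelization of a barrier-synchronized kernel: the compiler
# splits the kernel at each barrier and wraps the resulting regions in
# loops over all work-items. Kernel and regions below are invented.

N = 8
scratch = [0] * N

def region_before_barrier(i, data):
    scratch[i] = data[i] * 2            # each work-item writes its own slot

def region_after_barrier(i, out):
    left = scratch[(i - 1) % N]         # safe: all writes happened "before"
    out[i] = left + scratch[i]

def kernel_all_workitems(data, out):
    for i in range(N):                  # region 1 for every work-item...
        region_before_barrier(i, data)
    # ...the barrier falls here, between the two loops...
    for i in range(N):                  # ...then region 2 for every work-item
        region_after_barrier(i, out)

out = [0] * N
kernel_all_workitems(list(range(N)), out)
print(out)
```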
Combining FPGA prototyping and high-level simulation approaches for Design Space Exploration of MPSoCs
Modern embedded systems are parallel, component-based, heterogeneous and finely tuned on the basis of the workload that must be executed on them. To improve design reuse, Application Specific Instruction-set Processors (ASIPs) are often employed as building blocks in such systems, as a solution capable of satisfying the required functional and physical constraints (e.g. throughput, latency, power or energy consumption, etc.), while providing, at the same time, high flexibility and adaptability. Composing a multi-processor architecture including ASIPs and mapping parallel applications onto it is a design activity that requires an extensive Design Space Exploration (DSE) process to result in cost-effective systems. The work described here aims at defining novel methodologies for the application-driven customization of such highly heterogeneous embedded systems. The issue is tackled at different levels, integrating different tools.
High-level event-based simulation is a widely used technique that offers speed and flexibility as its main strengths, but it needs, as a preliminary input and periodically during the iteration process, calibration data that must be acquired by means of more accurate evaluation methods. Typically, this calibration is performed using instruction-level cycle-accurate simulators that, however, turn out to be very slow, especially when complete multiprocessor systems must be evaluated or when the grain of the calibration is too fine, while FPGA approaches have been shown to perform better for these particular applications.
FPGA-based emulation techniques have been proposed in the recent past as an alternative to the software-based simulation approach, but some further steps are needed before they can be effectively exploited within architectural design space exploration. Firstly, some kind of technology awareness must be introduced, to enable the translation of the emulation results into a pre-estimation of a prospective ASIC implementation of the design. Moreover, when performing architectural DSE, a significant number of different candidate design points has to be evaluated and compared. In this case, if no countermeasures are taken, the advantages achievable with FPGAs, in terms of emulation speed, are counterbalanced by the overhead introduced by the time needed to go through the physical synthesis and implementation flow.
The FPGA-based prototyping platform developed here overcomes such limitations, enabling the use of FPGA-based prototyping for micro-architectural design space exploration of ASIP processors. In this approach, to increase the emulation speed-up, two different methods are proposed: the first is based on the automatic instantiation of additional hardware modules, able to reconfigure the prototype at runtime, while the second leverages manipulation of application binary code, compiled for a custom VLIW ASIP architecture, which is transformed into code executable on a different configuration. This makes it possible to prototype a whole set of ASIP solutions after one single FPGA implementation flow, mitigating the aforementioned overhead. A short overview of the tools used throughout the work will also be offered, covering basic aspects of the Intel-Silicon Hive ASIP development toolchain and a general description of the SESAME framework, along with a review of state-of-the-art simulation and prototyping techniques for complex multi-processor systems. Each proposed approach is validated through a real-world use case, confirming the validity of the solution.
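As a rough sketch of the exploration flow this enables, the loop below evaluates many candidate design points on a single prototype by reconfiguring it at runtime instead of re-running synthesis per point; the interfaces, cost numbers and figure of merit are all invented placeholders.

```python
# Sketch of the exploration loop enabled by a runtime-reconfigurable
# prototype: the FPGA implementation flow runs once, and each candidate
# design point is selected by writing configuration registers before
# the benchmark run. Everything below is a toy stand-in.

import itertools

def run_on_prototype(config, benchmark):
    # Placeholder for: write config registers, run the benchmark, read
    # cycle/area counters back from the emulated design.
    issue_slots, regfile = config
    cycles = benchmark["ops"] / min(issue_slots, benchmark["ilp"])
    area = 1000 * issue_slots + 4 * regfile
    return cycles, area

BENCH = {"ops": 1_000_000, "ilp": 3}
best = min(
    ((c, run_on_prototype(c, BENCH))
     for c in itertools.product((1, 2, 4, 8), (32, 64, 128))),
    key=lambda cr: cr[1][0] * cr[1][1],  # cycles * area as a toy figure of merit
)
print("best config (issue slots, regfile size):", best[0])
```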
OpenCL-based design methodology for application-specific processors
OpenCL is a programming language standard which enables the programmer to express the application by structuring its computation as kernels. The OpenCL compiler is given the explicit freedom to parallelize the execution of kernel instances at all levels of parallelism. In comparison to the traditional C programming language, which is sequential in nature, OpenCL enables higher utilization of the parallelism naturally available in hardware constructs, while still having a feasible learning curve for engineers familiar with the C language. This paper describes the methodology and compiler techniques involved in applying OpenCL as an input language for a design flow of application-specific processors. At the core of the methodology is a whole-program optimizing compiler that links together the host and kernel codes of the input OpenCL program and parallelizes the result on a customized statically scheduled processor. The OpenCL vendor extension mechanism is used to provide clean access to custom operations. The methodology is studied with a design case to verify the scalability of the implementation at the instruction level and to exemplify the use of custom operations. The case shows that the use of OpenCL allows producing scalable application-specific processor designs and makes it possible to gradually reach the performance of hand-tailored RTL designs by exploiting the OpenCL extension mechanism to access custom hardware operations of varying complexity.
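A toy emulation of the execution model described above: a kernel runs once per work-item over a global range, with a stand-in "custom operation" playing the role of a vendor-extension instruction backed by custom hardware. None of the names below are real OpenCL API; they are assumptions for illustration only.

```python
# Toy emulation of the OpenCL execution model: a kernel is invoked once
# per work-item over a global range, and a custom operation stands in
# for a vendor-extension instruction. All names are illustrative.

def custom_mac24(a, b, c):
    # On the target this would be a single custom datapath operation
    # exposed through the OpenCL vendor extension mechanism.
    return (a * b + c) & 0xFFFFFF

def kernel(gid, src_a, src_b, dst):
    dst[gid] = custom_mac24(src_a[gid], src_b[gid], dst[gid])

def enqueue_ndrange(kernel, global_size, *args):
    # The compiler is free to execute these instances in any parallel
    # order; a sequential loop is a valid schedule.
    for gid in range(global_size):
        kernel(gid, *args)

a, b, d = list(range(16)), [3] * 16, [1] * 16
enqueue_ndrange(kernel, 16, a, b, d)
print(d[:4])   # [1, 4, 7, 10]
```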