Core level thermal estimation techniques for early design space exploration
The primary objective of this thesis is to develop a methodology for fast yet accurate temperature estimation during design space exploration. Power and temperature of modern-day systems have become important metrics in addition to performance. Static and dynamic power dissipation leads to an increase in temperature, which creates cooling and packaging issues. Furthermore, the transient thermal profile determines temperature gradients, hotspots and thermal cycles. Traditional solutions rely on cycle-accurate simulations of detailed micro-architectural structures and are slow. The thesis shows that periodic power estimation is the key bottleneck in such approaches. It also demonstrates an approach (FastSpot) that integrates accurate thermal estimation into existing host-compiled simulations. The developed methodology can incorporate different sampling-based thermal models. It achieves a 32000x increase in simulation throughput for temperature trace generation, while incurring low measurement errors (0.06 K transient, 0.014 K steady-state) compared to a cycle-accurate reference method.
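As a rough illustration of the kind of sampling-based thermal model such a methodology can incorporate, the sketch below integrates a single first-order RC thermal node over a sampled power trace. The function, parameter values and single-node structure are illustrative assumptions, not FastSpot's actual model.

```python
# Minimal sketch of a sampling-based thermal estimate: a single
# first-order RC node driven by per-interval power samples. Real tools
# use an RC network per core; all parameter values here are invented.

def thermal_trace(power_samples, dt, r_th=1.2, c_th=30.0, t_amb=318.0):
    """Integrate T' = (P - (T - T_amb)/R) / C over sampled power.

    power_samples : per-interval average power in watts
    dt            : sampling period in seconds
    r_th, c_th    : thermal resistance (K/W) and capacitance (J/K)
    t_amb         : ambient temperature in kelvin
    """
    t = t_amb
    trace = []
    for p in power_samples:
        # Explicit Euler step of the RC node; dt must be small
        # relative to the thermal time constant r_th * c_th.
        t += dt * (p - (t - t_amb) / r_th) / c_th
        trace.append(t)
    return trace

# Example: a 2 W burst followed by idle power.
temps = thermal_trace([2.0] * 500 + [0.3] * 500, dt=0.01)
print(f"peak {max(temps):.2f} K, final {temps[-1]:.2f} K")
```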
Accelerating host-compiled simulation by modifying IR code: industrial application in the spatial domain
Space applications rely on long and complex design processes, as they must deal with strict non-functional requirements such as criticality, timeliness, reliability and safety. The huge number of analyses and evaluations performed requires powerful simulation technologies combining high simulation speed and accuracy. Host-compiled simulation is a powerful approach to achieve fast, timed simulation of software running in complex embedded systems. However, in general there is still a need to improve the speed and accuracy of these solutions, and there is a lack of host-compiled approaches oriented to space applications. To address the first point, this paper presents an alternative to the standard solution of modeling the cross-compiled control flow in the host computer: modifying the compiler's intermediate representation instead. That way, the host binary naturally follows the cross-compiled binary flow, avoiding separate modeling and improving simulation speed while maintaining accuracy. Additionally, the paper focuses on the LEON processor, commonly used by the European Space Agency (ESA). This work has been funded by FEDER/Ministerio de Ciencia, Innovación y Universidades - Agencia Estatal de Investigación/TEC2017-86722-C4-3-R and the EC through the FP7-JTI 621429 EMC2 project.
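To make the approach concrete, here is a minimal sketch of basic-block-level timing annotation as used in host-compiled simulation: each block advances a simulated clock by a cost measured from the cross-compiled binary. In the paper this instrumentation is inserted automatically at the IR level; the class, block names and cycle counts below are invented.

```python
# Illustrative model of IR-level timing annotation: every basic block
# advances simulated time by a cost obtained from analysis of the
# cross-compiled binary. Block names and cycle costs are made up.

class HostCompiledSim:
    def __init__(self, block_cycles):
        self.block_cycles = block_cycles  # per-block cost from target analysis
        self.cycles = 0

    def annotate(self, block):
        # In a real flow this call is inserted into the compiler's
        # intermediate representation, not written by hand.
        self.cycles += self.block_cycles[block]

sim = HostCompiledSim({"entry": 12, "loop_body": 48, "exit": 7})

def program(sim, n):
    sim.annotate("entry")
    acc = 0
    for i in range(n):
        sim.annotate("loop_body")  # host code follows the target's control flow
        acc += i
    sim.annotate("exit")
    return acc

program(sim, 1000)
print(f"estimated target cycles: {sim.cycles}")
```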
Native Simulation of Multiprocessor Systems-on-Chip Using Hardware-Assisted Virtualization
Integration of multiple heterogeneous processors into a single System-on-Chip (SoC) is a clear trend in embedded systems. Designing and verifying these systems require high-speed and easy-to-build simulation platforms. Among software simulation approaches, native simulation is a good candidate, since the embedded software is executed natively on the host machine, resulting in high-speed simulation without requiring instruction set simulator development effort. However, existing native simulation techniques execute the simulated software in memory space shared between the modeled hardware and the host operating system. This results in many problems, including address space conflicts and overlaps, as well as the use of host machine addresses instead of those of the target hardware platform. This makes it practically impossible to natively simulate legacy code running on the target platform. To overcome these issues, we propose the addition of a transparent address space translation layer to separate the target address space from that of the host simulator. We exploit Hardware-Assisted Virtualization (HAV) technology for this purpose, which is now readily available on almost all general-purpose processors. Experiments show that this solution does not degrade native simulation speed, while keeping the ability to accomplish software performance evaluation. The proposed solution is scalable as well as flexible, and we provide the necessary evidence to support our claims with multiprocessor and hybrid simulation solutions. We also address the simulation of cross-compiled Very Long Instruction Word (VLIW) executables, using a Static Binary Translation (SBT) technique to generate native code that does not require run-time translation or interpretation support. This approach is interesting in situations where either the source code is not available or the target platform is not supported by any retargetable compilation framework, which is usually the case for VLIW processors. The generated simulators execute on top of our HAV-based platform and model the Texas Instruments (TI) C6x series processors. Simulation results for VLIW binaries show a speed-up of around two orders of magnitude compared to cycle-accurate simulators.
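The remapping idea behind the translation layer can be sketched in software, independent of the HAV machinery: target addresses stay unchanged and are transparently redirected to host storage. The page-map class below is only a conceptual illustration (the thesis performs the translation in hardware through guest page tables, with no per-access software cost), and the region layout is invented.

```python
# Software illustration of a transparent target-to-host address
# translation layer. The actual work does this with Hardware-Assisted
# Virtualization, so no lookup cost is paid per access; this
# dictionary-based page map only shows the remapping idea.

PAGE = 4096

class AddressSpace:
    def __init__(self):
        self.page_map = {}      # target page number -> host bytearray

    def map_region(self, target_base, size):
        for off in range(0, size, PAGE):
            self.page_map[(target_base + off) // PAGE] = bytearray(PAGE)

    def write8(self, target_addr, value):
        page = self.page_map[target_addr // PAGE]   # faults if unmapped
        page[target_addr % PAGE] = value & 0xFF

    def read8(self, target_addr):
        return self.page_map[target_addr // PAGE][target_addr % PAGE]

mem = AddressSpace()
mem.map_region(0x4000_0000, 64 * 1024)   # target RAM at its *target* address
mem.write8(0x4000_0010, 0xAB)            # legacy code keeps its own addresses
print(hex(mem.read8(0x4000_0010)))
```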
Inferring Energy Bounds via Static Program Analysis and Evolutionary Modeling of Basic Blocks
The ever-increasing number and complexity of energy-bound devices (such as
the ones used in Internet of Things applications, smartphones, and
mission-critical systems) pose an important challenge for techniques to
optimize their energy consumption and to verify that they will perform their
function within the available energy budget. In this work we address this
challenge from the software point of view and propose a novel parametric
approach to estimating tight bounds on the energy consumed by program
executions, bounds practical enough to be applied to energy verification and
optimization. Our approach
divides a program into basic (branchless) blocks and estimates the maximal and
minimal energy consumption for each block using an evolutionary algorithm. Then
it combines the obtained values according to the program control flow, using
static analysis, to infer functions that give both upper and lower bounds on
the energy consumption of the whole program and its procedures as functions on
input data sizes. We have tested our approach on (C-like) embedded programs
running on the XMOS hardware platform. However, our method is general enough to
be applied to other microprocessor architectures and programming languages. The
bounds obtained by our prototype implementation can be tight while remaining on
the safe side of budgets in practice, as shown by our experimental evaluation.
Comment: Pre-proceedings paper presented at the 27th International Symposium
on Logic-Based Program Synthesis and Transformation (LOPSTR 2017), Namur,
Belgium, 10-12 October 2017 (arXiv:1708.07854). Improved version of the one
presented at the HIP3ES 2016 workshop (v1): more experimental results (added
benchmark to Table 1, added figure for new benchmark, added Table 3),
improved Fig. 1, added Fig.
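As a concrete illustration of how per-block energy intervals can be composed along the control flow, the sketch below applies simple composition rules for sequences, branches and bounded loops. The block energies, program shape and numeric loop bounds are invented, and the paper's actual analysis infers symbolic bounds as functions of input data sizes via static resource analysis.

```python
# Sketch of composing per-block energy intervals along the control flow
# to bound whole-program energy. Loop bounds are plain numbers here and
# the program structure is invented for illustration.

def seq(*parts):                       # sequential composition
    return (sum(p[0] for p in parts), sum(p[1] for p in parts))

def branch(then_e, else_e):            # either arm may be taken
    return (min(then_e[0], else_e[0]), max(then_e[1], else_e[1]))

def loop(body_e, min_iter, max_iter):  # bounded iteration
    return (body_e[0] * min_iter, body_e[1] * max_iter)

# Per-block (min, max) energy in nanojoules, as an evolutionary search
# over block inputs might estimate them.
B_INIT, B_TEST, B_WORK, B_FIX, B_DONE = (5, 7), (1, 2), (20, 35), (8, 11), (3, 4)

def program_bounds(n):                 # bounds as a function of input size n
    body = seq(B_TEST, branch(B_WORK, B_FIX))
    return seq(B_INIT, loop(body, n, n), B_DONE)

print(program_bounds(100))             # (908, 3711) nJ for this toy program
```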
Simulation methodologies for mobile GPUs
GPUs critically rely on a complex system software stack comprising kernel- and user-space drivers and JIT compilers. Yet, existing GPU simulators typically abstract away details of the software stack and GPU instruction set. Partly, this is because GPU vendors rarely release sufficient information about their latest GPU products. However, this is also due to the lack of an integrated CPU-GPU simulation framework, which is complete and powerful enough to drive the complex GPU software environment. This has led to a situation where research on GPU architectures and compilers is largely based on outdated or greatly simplified architectures and software stacks, undermining the validity of the generated results. Making the situation even more dire, existing GPU simulation efforts are concentrated around desktop GPUs, making infrastructure for modelling mobile GPUs virtually non-existent, despite their surging importance in the GPU market. Still, mobile GPU designers are faced with the challenge of evaluating design alternatives involving hundreds of architectural configuration options and micro-architectural improvements under tight time-to-market constraints, to which currently employed design flows involving detailed, but slow simulations are not well suited. In this thesis we develop a full-system simulation environment for a mobile platform, which enables users to run a complete and unmodified software stack for a state-of-the-art mobile Arm CPU and Mali Bifrost GPU powered device, achieving 100% architectural accuracy across all available toolchains. We demonstrate the capability of our GPU simulation framework through a number of case studies exploring modern, mobile GPU applications, and optimize them using functional simulation statistics, unavailable with other approaches or hardware. Furthermore, we develop a trace-based performance model, allowing architects to rapidly model GPU configurations in early design space exploration.
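As a loose sketch of what a trace-based performance model looks like, the snippet below replays per-pass statistics gathered once from functional simulation against candidate GPU configurations using a simple bottleneck formula. The trace fields, numbers and formula are invented placeholders, not the thesis's actual model.

```python
# Minimal sketch of a trace-based GPU performance model: functional
# simulation emits per-workload statistics once, and candidate
# configurations are then evaluated analytically without re-simulation.

TRACE = [  # (shader instructions, bytes moved) per render pass, invented
    (4_000_000, 6_000_000),
    (9_000_000, 2_500_000),
]

def est_cycles(trace, cores, ipc_per_core, bytes_per_cycle):
    total = 0
    for instrs, mem_bytes in trace:
        compute = instrs / (cores * ipc_per_core)
        memory = mem_bytes / bytes_per_cycle
        total += max(compute, memory)   # bottleneck-style estimate per pass
    return total

# Rapid early design space exploration over shader core counts:
for cores in (4, 8, 16):
    cycles = est_cycles(TRACE, cores, ipc_per_core=2, bytes_per_cycle=16)
    print(f"{cores} cores -> {cycles:.0f} cycles")
```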
FAT-DBT engine (framework for application-tailored, co-designed dynamic binary translation engine)
Doctoral thesis in Electronics and Computer Engineering (PDEEC). Dynamic binary translation (DBT) has emerged as an execution engine that monitors,
modifies and possibly optimizes running applications for specific purposes.
DBT is deployed as an execution layer between the application binary and the
operating system or host machine, which creates opportunities for collecting
runtime information. Initially, DBT supported binary-level compatibility, but,
based on the collected runtime information, it also became popular for code
instrumentation, ISA virtualization and dynamic optimization purposes.
Building a DBT system brings many challenges, as it involves integrating
complex components and requires deep architecture-level knowledge. Moreover,
DBT incurs significant overheads, mainly due to code decoding and translation,
as well as execution and the emulation of general functionality. While DBT was
initially conceived with high-end architectures for performance-demanding
applications in mind, such challenges become even more evident when directing
it at embedded systems.
These make effective deployment very challenging due to their complexity,
tight memory constraints, and limited performance and power. Legacy support
and binary compatibility are topics of particular interest in such systems,
due to their broad dissemination in industrial environments and their
long-standing, wide utilization in sensing and monitoring processes, with
considerable maintenance and replacement costs.
To address such issues, this thesis intends to contribute a solution that
leverages an optimized and accelerated dynamic binary translator targeting
resource-constrained embedded systems while supporting legacy systems.
The developed work makes it possible to: (1) evaluate the potential of DBT for
legacy-support purposes on resource-constrained embedded systems; (2) achieve
a configurable DBT architecture specialized for resource-constrained embedded
systems; (3) address DBT translation, execution and emulation overheads
through the combination of software and hardware; and (4) promote DBT
utilization as a legacy-support tool for industry as an end-product.
permite a modificação e possível optimização de código executável para um determinado
propósito. A TBD é integrada nos sistemas como uma camada de execução
entre o código binário executável e o sistema operativo ou a máquina hospedeira
alvo, o que origina oportunidades de recolha de informação de execução.
A criação de um sistema de TBD traz consigo diversos desafios, uma vez que envolve
a integração de componentes complexos e conhecimentos aprofundados das
arquitecturas de processadores envolvidas. Ademais, a utilização de TBD gera diversos
custos computacionais indirectos, maioritariamente devido à descodificação
e tradução de código, bem como emulação de funcionalidades em geral. Considerando
que a TBD foi inicialmente pensada para sistemas de gama alta, os
desafios mencionados tornam-se ainda mais evidentes quando a mesma é aplicada
em sistemas embebidos. Nesta área os limitados recursos de memória e os exigentes
requisitos de desempenho e consumo energético,tornam uma implementação eficiente
de TBD muito difícil de obter. Compatibilidade binária e suporte a código
de legado são tópicos de interesse em sistemas embebidos, justificado pela ampla
disseminação dos mesmos no meio industrial para tarefas de sensorização e monitorização
ao longo dos tempos, reforçado pelos custos de manutenção adjacentes
à sua utilização.
Para endereçar os desafios descritos, nesta tese propõe-se uma solução para potencializar
a tradução binária dinâmica, optimizada e com aceleração, para suporte a
código de legado em sistemas embebidos de baixa gama.
O trabalho permitiu (1) avaliar o potencial da TBD quando aplicada ao suporte
a código de legado em sistemas embebidos de baixa gama; (2) a obtenção de
uma arquitectura de TBD configurável e especializada para este tipo de sistemas;
(3) reduzir os custos computacionais associados à tradução, execução e emulação,
através do uso combinado de software e hardware; (4) e promover a utilização na
industria de TBD como uma ferramenta de suporte a código de legado.This thesis was supported by a PhD scholarship from Fundação para a Ciência e
Tecnologia, SFRH/BD/81681/201
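To make the translation and caching overheads discussed above tangible, here is a minimal sketch of the central loop of a DBT engine: each guest basic block is translated on first execution and cached for reuse. The engine class, toy translator and guest program are invented for illustration; the thesis additionally offloads parts of this loop to hardware.

```python
# Sketch of a DBT engine's main loop: translate a source-ISA basic
# block on first execution, cache it, and run cached translations
# afterwards. Guest "ISA" and translator are toy stand-ins.

class DBTEngine:
    def __init__(self, translate_block):
        self.translate_block = translate_block
        self.tcache = {}                 # guest PC -> translated block

    def run(self, guest_pc, state):
        while guest_pc is not None:
            block = self.tcache.get(guest_pc)
            if block is None:            # decode/translate cost paid once
                block = self.translate_block(guest_pc)
                self.tcache[guest_pc] = block
            guest_pc = block(state)      # execute; returns next guest PC

def translate(pc):                       # stand-in translator, 2-block guest
    if pc == 0:
        def loop_block(st):
            st["r0"] -= 1
            return 0 if st["r0"] > 0 else 4
        return loop_block
    def halt_block(st):
        return None                      # fall off the end: stop execution
    return halt_block

state = {"r0": 10}
DBTEngine(translate).run(0, state)
print(state)                             # {'r0': 0}
```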
From Parallel Programs to Customized Parallel Processors
The need for fast time to market of new embedded processor-based designs calls for a rapid design methodology for the included processors. The call for such a methodology is even stronger in the context of so-called soft cores targeted at reconfigurable fabrics, where per-design processor customization is commonplace.
The C language has been commonly used as an input to hardware/software co-design flows. However, as C is a sequential language, its potential to generate parallel operations to utilize naturally parallel hardware constructs is far from optimal, leading to a customized processor design space with limited parallel resource scalability. In contrast, when utilizing a parallel programming language as an input, a wider processor design space can be explored to produce customized processors with varying degrees of utilized parallelism.
This thesis proposes a novel Multicore Application-Specific Instruction Set Processor (MCASIP) co-design methodology that exploits parallel programming languages as the application input format. In the methodology, the designer can explicitly capture the parallelism of the algorithm and exploit specialized instructions using a parallel programming language, instead of being at the mercy of the compiler or the hardware to extract the parallelism from a sequential input. The thesis proposes a multicore processor template based on the Transport Triggered Architecture, compiler techniques for the static parallelization of computation kernels with barriers, and a datapath-integrated hardware accelerator for low-overhead software synchronization. These contributions enable scaling the customized processors at both the instruction and task levels to efficiently exploit the parallelism in the input program, up to implementation constraints such as memory bandwidth or chip area. The different contributions are validated with case studies, comparisons and design examples.
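As an illustration of the kind of compiler technique involved in statically parallelizing barrier-synchronized kernels (a sketch of the general idea, not the thesis's exact transformation), the snippet below splits a kernel at its barrier and wraps each region in a loop over all work-items, preserving barrier semantics without threads; the kernel itself is invented.

```python
# Static parallelization of a barrier-synchronized kernel: the compiler
# splits the kernel at each barrier and wraps the resulting regions in
# loops over all work-items. Kernel and regions below are invented.

N = 8
scratch = [0] * N

def region_before_barrier(i, data):
    scratch[i] = data[i] * 2            # each work-item writes its own slot

def region_after_barrier(i, out):
    left = scratch[(i - 1) % N]         # safe: all writes happened "before"
    out[i] = left + scratch[i]

def kernel_all_workitems(data, out):
    for i in range(N):                  # region 1 for every work-item...
        region_before_barrier(i, data)
    # ...the barrier falls here, between the two loops...
    for i in range(N):                  # ...then region 2 for every work-item
        region_after_barrier(i, out)

out = [0] * N
kernel_all_workitems(list(range(N)), out)
print(out)
```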
Combining FPGA prototyping and high-level simulation approaches for Design Space Exploration of MPSoCs
Modern embedded systems are parallel, component-based, heterogeneous and finely tuned on the basis of the workload that must be executed on them. To improve design reuse, Application Specific Instruction-set Processors (ASIPs) are often employed as building blocks in such systems, as a solution capable of satisfying the required functional and physical constraints (e.g. throughput, latency, power or energy consumption, etc.), while providing, at the same time, high flexibility and adaptability. Composing a multi-processor architecture including ASIPs and mapping parallel applications onto it is a design activity that requires an extensive Design Space Exploration (DSE) process to result in cost-effective systems. The work described here aims at defining novel methodologies for the application-driven customization of such highly heterogeneous embedded systems. The issue is tackled at different levels, integrating different tools.
High-level event-based simulation is a widely used technique that offers speed and flexibility as its main strengths, but it needs, as a preliminary input and periodically during the iteration process, calibration data that must be acquired by means of more accurate evaluation methods. Typically, this calibration is performed using instruction-level cycle-accurate simulators that, however, turn out to be very slow, especially when complete multiprocessor systems must be evaluated or when the grain of the calibration is too fine, while FPGA approaches have been shown to perform better for these particular applications.
FPGA-based emulation techniques have been proposed in the recent past as an alternative to the software-based simulation approach, but some further steps are needed before they can be effectively exploited within architectural design space exploration. Firstly, some kind of technology awareness must be introduced, to enable the translation of the emulation results into a pre-estimation of a prospective ASIC implementation of the design. Moreover, when performing architectural DSE, a significant number of different candidate design points has to be evaluated and compared. In this case, if no countermeasures are taken, the advantages achievable with FPGAs, in terms of emulation speed, are counterbalanced by the overhead introduced by the time needed to go through the physical synthesis and implementation flow.
The FPGA-based prototyping platform developed here overcomes such limitations, enabling the use of FPGA-based prototyping for micro-architectural design space exploration of ASIP processors. In this approach, to increase the emulation speed-up, two different methods are proposed: the first is based on the automatic instantiation of additional hardware modules, able to reconfigure the prototype at runtime, while the second leverages manipulation of application binary code, compiled for a custom VLIW ASIP architecture, which is transformed into code executable on a different configuration. This makes it possible to prototype a whole set of ASIP solutions after one single FPGA implementation flow, mitigating the aforementioned overhead. A short overview of the tools used throughout the work will also be offered, covering basic aspects of the Intel-Silicon Hive ASIP development toolchain and a general description of the SESAME framework, along with a review of state-of-the-art simulation and prototyping techniques for complex multi-processor systems. Each proposed approach is validated through a real-world use case, confirming the validity of the solution.
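As a rough sketch of the exploration flow this enables, the loop below evaluates many candidate design points on a single prototype by reconfiguring it at runtime instead of re-running synthesis per point; the interfaces, cost numbers and figure of merit are all invented placeholders.

```python
# Sketch of the exploration loop enabled by a runtime-reconfigurable
# prototype: the FPGA implementation flow runs once, and each candidate
# design point is selected by writing configuration registers before
# the benchmark run. Everything below is a toy stand-in.

import itertools

def run_on_prototype(config, benchmark):
    # Placeholder for: write config registers, run the benchmark, read
    # cycle/area counters back from the emulated design.
    issue_slots, regfile = config
    cycles = benchmark["ops"] / min(issue_slots, benchmark["ilp"])
    area = 1000 * issue_slots + 4 * regfile
    return cycles, area

BENCH = {"ops": 1_000_000, "ilp": 3}
best = min(
    ((c, run_on_prototype(c, BENCH))
     for c in itertools.product((1, 2, 4, 8), (32, 64, 128))),
    key=lambda cr: cr[1][0] * cr[1][1],  # cycles * area as a toy figure of merit
)
print("best config (issue slots, regfile size):", best[0])
```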
OpenCL-based design methodology for application-specific processors
OpenCL is a programming language standard which enables the programmer to express the application by structuring its computation as kernels. The OpenCL compiler is given the explicit freedom to parallelize the execution of kernel instances at all levels of parallelism. In comparison to the traditional C programming language, which is sequential in nature, OpenCL enables higher utilization of the parallelism naturally available in hardware constructs, while still having a feasible learning curve for engineers familiar with the C language. This paper describes the methodology and compiler techniques involved in applying OpenCL as an input language for a design flow of application-specific processors. At the core of the methodology is a whole-program optimizing compiler that links together the host and kernel codes of the input OpenCL program and parallelizes the result on a customized statically scheduled processor. The OpenCL vendor extension mechanism is used to provide clean access to custom operations. The methodology is studied with a design case to verify the scalability of the implementation at the instruction level and to exemplify the use of custom operations. The case shows that the use of OpenCL allows producing scalable application-specific processor designs and makes it possible to gradually reach the performance of hand-tailored RTL designs by exploiting the OpenCL extension mechanism to access custom hardware operations of varying complexity.
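A toy emulation of the execution model described above: a kernel runs once per work-item over a global range, with a stand-in "custom operation" playing the role of a vendor-extension instruction backed by custom hardware. None of the names below are real OpenCL API; they are assumptions for illustration only.

```python
# Toy emulation of the OpenCL execution model: a kernel is invoked once
# per work-item over a global range, and a custom operation stands in
# for a vendor-extension instruction. All names are illustrative.

def custom_mac24(a, b, c):
    # On the target this would be a single custom datapath operation
    # exposed through the OpenCL vendor extension mechanism.
    return (a * b + c) & 0xFFFFFF

def kernel(gid, src_a, src_b, dst):
    dst[gid] = custom_mac24(src_a[gid], src_b[gid], dst[gid])

def enqueue_ndrange(kernel, global_size, *args):
    # The compiler is free to execute these instances in any parallel
    # order; a sequential loop is a valid schedule.
    for gid in range(global_size):
        kernel(gid, *args)

a, b, d = list(range(16)), [3] * 16, [1] * 16
enqueue_ndrange(kernel, 16, a, b, d)
print(d[:4])   # [1, 4, 7, 10]
```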