72 research outputs found

    Dynamic Binary Translation for Embedded Systems with Scratchpad Memory

    Get PDF
    Embedded software development has recently changed with advances in computing. Rather than fully co-designing software and hardware to perform a relatively simple task, nowadays embedded and mobile devices are designed as a platform where multiple applications can be run, new applications can be added, and existing applications can be updated. In this scenario, traditional constraints in embedded systems design (i.e., performance, memory and energy consumption, and real-time guarantees) are more difficult to address. New concerns (e.g., security) have become important and increase software complexity as well. In general-purpose systems, Dynamic Binary Translation (DBT) has been used to address these issues with services such as Just-In-Time (JIT) compilation, dynamic optimization, virtualization, power management and code security. In embedded systems, however, DBT is not usually employed due to its performance, memory and power overhead. This dissertation presents StrataX, a low-overhead DBT framework for embedded systems. StrataX addresses the challenges faced by DBT in embedded systems using novel techniques. To reduce DBT overhead, StrataX loads code from NAND-Flash storage and translates it into a Scratchpad Memory (SPM), a software-managed on-chip SRAM with limited capacity. SPM has access latency similar to that of a hardware cache, but consumes less power and chip area. StrataX manages the SPM as a software instruction cache, and employs victim compression and pinning to reduce retranslation cost and capture frequently executed code in the SPM. To prevent performance loss due to excessive code expansion, StrataX minimizes the amount of code inserted by DBT to maintain control of program execution. When a hardware instruction cache is available, StrataX dynamically partitions translated code between the SPM and main memory. With these techniques, StrataX has low performance overhead relative to native execution for MiBench programs. Further, it simplifies embedded software and hardware design by operating transparently to applications without any special hardware support. StrataX achieves sufficiently low overhead to make it feasible to use DBT in embedded systems to address important design goals and requirements.
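
    The abstract does not spell out the implementation, but the idea of managing the SPM as a software instruction cache with pinning and a compressed victim buffer can be illustrated with a minimal sketch. Everything below (the class name, pin_threshold, the use of zlib as the compressor) is a hypothetical illustration, not StrataX's actual design.

```python
import zlib  # stand-in compressor; the real system's victim compression scheme is not specified


class SpmCache:
    """Hypothetical software instruction cache over a fixed-size SPM.

    Translated code fragments live in the SPM; fragments evicted to make room
    are kept compressed in a victim buffer in main memory so a later miss can
    decompress them instead of retranslating. Fragments that hit often enough
    are pinned and never evicted. Assumes every fragment fits in the SPM.
    """

    def __init__(self, capacity_bytes, pin_threshold=8):
        self.capacity = capacity_bytes
        self.used = 0
        self.fragments = {}   # guest PC -> translated code (bytes)
        self.hits = {}        # guest PC -> hit count
        self.pinned = set()   # guest PCs exempt from eviction
        self.victims = {}     # guest PC -> compressed evicted code
        self.pin_threshold = pin_threshold

    def lookup(self, pc, translate):
        """Return translated code for `pc`, decompressing or translating on a miss."""
        if pc in self.fragments:                      # SPM hit
            self.hits[pc] = self.hits.get(pc, 0) + 1
            if self.hits[pc] >= self.pin_threshold:   # hot fragment: pin it
                self.pinned.add(pc)
            return self.fragments[pc]
        if pc in self.victims:                        # victim hit: cheaper than retranslation
            code = zlib.decompress(self.victims.pop(pc))
        else:                                         # cold miss: run the translator
            code = translate(pc)
        self._insert(pc, code)
        return code

    def _insert(self, pc, code):
        while self.used + len(code) > self.capacity:
            self._evict_one()
        self.fragments[pc] = code
        self.used += len(code)

    def _evict_one(self):
        # Evict the least-hit unpinned fragment, compressing it into the victim buffer.
        candidates = [p for p in self.fragments if p not in self.pinned]
        if not candidates:                            # everything pinned: fall back to any fragment
            candidates = list(self.fragments)
        victim = min(candidates, key=lambda p: self.hits.get(p, 0))
        code = self.fragments.pop(victim)
        self.used -= len(code)
        self.victims[victim] = zlib.compress(code)
```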

    Disk Design-Space Exploration in Terms of System-Level Performance, Power, and Energy Consumption

    Get PDF
    To make the common case fast, most studies focus on the computation phase of applications, in which most instructions are executed. However, many programs spend significant time in the I/O-intensive phase due to I/O latency. To obtain a system with more balanced phases, we require greater insight into the effects of I/O configurations on the entire system, in both the performance and power dissipation domains. Due to the lack of public tools that capture the complete memory hierarchy, we developed SYSim. SYSim is a complete-system simulator aimed at complete memory hierarchy studies in both the performance and power consumption domains. In this dissertation, we used SYSim to investigate the system-level impacts of several disk enhancements and technology improvements on the detailed interaction in the memory hierarchy during the I/O-intensive phase. The experimental results are reported in terms of both total system performance and power/energy consumption. With SYSim, we conducted complete-system experiments and revealed intriguing behaviors including, but not limited to, the following: During the I/O-intensive phase, which consists of both disk reads and writes, the average system CPI tracks only the average disk read response time, and not the overall average disk response time, which is the widely accepted metric in disk drive research. In disk-read-dominated applications, disk prefetching is more important than increasing the disk RPM. On the other hand, in applications with both disk reads and writes, the disk RPM matters. The execution time can be improved by up to an order of magnitude by applying some disk enhancements. Using disk caching and prefetching can improve the performance by a factor of 2, and write-buffering can improve the performance by a factor of 10. Moreover, using disk caching/prefetching and write-buffering in conjunction can improve the total system performance by at least an order of magnitude. Increasing the disk RPM and the number of disks in a RAID disk system also yields impressive improvements in total system performance. However, employing such techniques requires careful consideration of the trade-offs in power/energy consumption.
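
    The observation that system CPI tracks the average disk read response time rather than the overall average disk response time follows from write-buffering hiding write latency from the CPU. The toy model below only illustrates that reasoning; the function name and all parameter values are hypothetical and not taken from SYSim.

```python
def estimated_cpi(base_cpi, instr_count, reads, writes,
                  avg_read_resp_s, avg_write_resp_s, cpu_hz,
                  write_buffered=True):
    """Toy model of the observation above: when writes are buffered, CPU stall
    cycles come almost entirely from disk reads, so system CPI tracks the
    average read response time rather than the overall average disk response
    time. All parameter names and values are illustrative."""
    read_stall = reads * avg_read_resp_s * cpu_hz
    write_stall = 0.0 if write_buffered else writes * avg_write_resp_s * cpu_hz
    return base_cpi + (read_stall + write_stall) / instr_count


# With buffering enabled, changing the write response time leaves the estimate
# unchanged, while halving the read response time (e.g. via prefetching)
# lowers it directly.
print(estimated_cpi(1.0, 1e9, 1e5, 4e5, 5e-3, 8e-3, 2e9))
```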

    Doctor of Philosophy

    Get PDF
    The bandwidth requirement for each link on a network-on-chip (NoC) may differ based on the topology and the traffic properties of the IP cores. The available bandwidth on an asynchronous NoC link will also vary depending on the wire length between sender and receiver. This work explores the benefit to NoC performance, area, and energy when this property is used to optimize the bandwidth of specific links based on the bandwidth required by a target SoC design. Three asynchronous routers were designed for implementing asynchronous NoCs. A simple routing scheme and a single-flit packet format lead to performance- and area-efficient router designs. Their performance was evaluated in consideration of link wire delay. A comprehensive analysis of pipeline latch insertion in asynchronous communication links is performed with regard to link bandwidth, and the optimal placement of pipeline latches for maximizing the bandwidth increase is described. Specific methods are proposed for performance, area, and energy optimization, respectively. Performance optimization is achieved by increasing the bandwidth of highly trafficked and highly utilized links in an NoC through the insertion of pipeline latches in those links. Wire area is reduced by halving the data-path width of links with low traffic and low utilization. Energy optimization is performed by using wider spacing between wires in links with high energy consumption. An analytical model for asynchronous link bandwidth estimation is presented and used to identify suitable links for each optimization method. The energy and latency characteristics of an asynchronous NoC are compared to those of a similarly designed synchronous NoC. The results indicate that the asynchronous network has lower energy, and that link-specific bandwidth optimization improves NoC performance. Applying the proposed optimization methods to an asynchronous NoC demonstrates performance enhancement, wire area reduction, and wire energy savings.
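
    The dissertation's analytical bandwidth model is not reproduced in the abstract, but a common first-order view is that an asynchronous link's cycle time is set by the handshake over the longest unpipelined wire segment, so inserting a pipeline latch shortens that segment and raises bandwidth. The sketch below illustrates that reasoning only; all names, the handshake factor, and the latch overhead are assumptions rather than the dissertation's model.

```python
def link_bandwidth_bits_per_s(wire_delay_s, flit_bits, latch_positions=(),
                              latch_overhead_s=50e-12, handshake_factor=2.0):
    """First-order asynchronous-link model (not the dissertation's exact one):
    the handshake must cross the longest stage between pipeline latches for
    both request and acknowledge, so the cycle time is roughly
    handshake_factor * longest_stage_delay plus a per-latch overhead.
    `latch_positions` are fractions of the wire length strictly between 0 and 1."""
    points = [0.0] + sorted(latch_positions) + [1.0]
    stage_delays = [(b - a) * wire_delay_s for a, b in zip(points, points[1:])]
    cycle = handshake_factor * max(stage_delays) + latch_overhead_s * len(latch_positions)
    return flit_bits / cycle


# A latch at the midpoint halves the longest stage, nearly doubling the
# bandwidth of a long link for a small latch overhead.
print(link_bandwidth_bits_per_s(1e-9, 32))          # unpipelined link
print(link_bandwidth_bits_per_s(1e-9, 32, (0.5,)))  # one midpoint latch
```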

    System-Level Power Estimation Methodology for MPSoC based Platforms

    Get PDF
    With the rise of deep-submicron silicon integration technologies, power consumption in multiprocessor systems-on-chip (MPSoCs) has become a primary factor in the design flow. Taking this key factor into account from the earliest design phases plays an essential role, since it increases component reliability and reduces the time-to-market of the final product. Shifting the design entry point up to the system level is the most important countermeasure adopted to manage the increasing complexity of Multiprocessor System on Chip (MPSoC) designs. The reason is that decisions taken at this level, early in the design cycle, have the greatest impact on the final design in terms of power and energy efficiency. However, taking decisions at this level is very difficult, since the design space is extremely wide and its exploration has so far been mostly a manual activity. Efficient system-level power estimation tools are therefore necessary to enable proper Design Space Exploration (DSE) based on power/energy and timing.
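
    As a rough illustration of what a system-level power estimate involves, the sketch below multiplies per-component energy-per-event figures by activity counts gathered from a high-level simulation. The component names and energy values are placeholders, not taken from the methodology described above.

```python
# Per-component energy-per-event figures multiplied by activity counts from a
# high-level simulation. Component names and energy values are placeholders.
ENERGY_PER_EVENT_J = {
    "cpu_instr":    50e-12,
    "l1_access":    20e-12,
    "bus_transfer": 80e-12,
    "dram_access":   2e-9,
}


def estimate_energy(event_counts):
    """Return total energy in joules for a dict of {component: event count}."""
    return sum(ENERGY_PER_EVENT_J[c] * n for c, n in event_counts.items())


print(estimate_energy({"cpu_instr": 1e9, "l1_access": 3e8,
                       "bus_transfer": 5e7, "dram_access": 1e7}))
```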

    Contribution to the Field of Low-Power Embedded and Pervasive Systems Design

    Get PDF
    The first part is devoted to power consumption estimation for software architectures. This work is a continuation of my PhD research and began with the SPICES project led by Dr Eric Senn. For our part, the goal of this project was to model and estimate, at a high level, the power consumption of operating system services. This work was the subject of the thesis of Saadia Dhouib (2006-2009), co-supervised by Eric Senn and Jean Philippe Diguet.
    The second part addresses the problem of data placement in memory for software architectures. The idea of this work was to enable an optimal placement of an application's data structures within a fixed memory hierarchy. This work marked the beginning of the collaboration with Marc Sevaux and André Rossi on these aspects and was continued in the thesis of Maria Soto (2008-2011).
    The third part presents work on estimating and optimizing the power consumption of interconnects in systems-on-chip (SoC). In an SoC, the energy consumed by the interconnects can become non-negligible, so it is essential to be able to optimize this consumption. To assess the proposed optimizations, an estimation model is needed, because design and (electrical-level) simulation times are prohibitive. This work was the subject of the thesis of Antoine Courtay (2005-2008), co-supervised by Olivier Sentieys and Nathalie Julien.
    Finally, the last part covers my most recent research on the design of pervasive systems for the maritime domain. This work spans several sub-topics, such as:
    - performance measurement for offshore racing; thesis of Ronan Douguet (2010-2014)
    - the use of augmented reality for navigation assistance; thesis of Jean Christophe Morgère (2011-2015)
    - real-time optimization of renewable energy for the sailboat of the future; thesis of Mathilde Tréhin (2013-?)
    - low-power algorithms and platforms for the design of a high-performance autopilot for sailing; thesis of Hugo Kerhascoet (2014-2017)
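
    The data-placement problem mentioned in the second part (assigning an application's data structures to a fixed memory hierarchy) can be illustrated with a simple greedy heuristic: put the structures with the most accesses per byte into the fastest memory that still has room. This is only a sketch of the problem; the cited theses use more sophisticated exact and metaheuristic methods, and all names and numbers below are invented.

```python
def greedy_placement(structures, memories):
    """Greedy illustration only: place the data structures with the highest
    accesses-per-byte into the fastest memory that still has room.
    `structures` is a list of (name, size_bytes, access_count); `memories` is
    a list of (name, capacity_bytes, cost_per_access) ordered fastest first.
    Structures that fit nowhere are simply left unplaced."""
    placement = {}
    free = {name: cap for name, cap, _ in memories}
    for name, size, accesses in sorted(structures,
                                       key=lambda s: s[2] / s[1], reverse=True):
        for mem, _, _ in memories:
            if free[mem] >= size:
                placement[name] = mem
                free[mem] -= size
                break
    return placement


print(greedy_placement(
    [("lut", 4096, 900000), ("frame", 65536, 120000), ("log", 32768, 500)],
    [("scratchpad", 8192, 1), ("sram", 65536, 4), ("dram", 1 << 24, 40)]))
```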

    Doctor of Philosophy

    Get PDF
    Portable electronic devices will be limited to the available energy of existing battery chemistries for the foreseeable future. However, systems-on-chip (SoCs) used in these devices are under demand to offer more functionality and increased battery life. A difficult problem in SoC design is providing energy-efficient communication between its components while maintaining the required performance. This dissertation introduces a novel energy-efficient network-on-chip (NoC) communication architecture. A NoC is used within complex SoCs due to its superior performance, energy usage, modularity, and scalability over traditional bus and point-to-point methods of connecting SoC components. This is the first academic research that combines asynchronous NoC circuits, a focus on energy-efficient design, and a software framework to customize a NoC for a particular SoC. Its key contribution is demonstrating that a simple, asynchronous NoC concept is a good match for low-power devices, and is a fruitful area for additional investigation. The proposed NoC is energy-efficient in several ways: simple switch and arbitration logic, low port radix, latch-based router buffering, a topology with the minimum number of 3-port routers, and the asynchronous advantages of zero dynamic power consumption while idle and the lack of a clock tree. The tool framework developed for this work uses novel methods to optimize the topology and router floorplan based on simulated annealing and force-directed movement. It studies link pipelining techniques that yield improved throughput in an energy-efficient manner. A simulator is automatically generated for each customized NoC, and its traffic generators use a self-similar message distribution, as opposed to a Poisson distribution, to better match application behavior. Compared to a conventional synchronous NoC, this design achieves comparable message latency with half the energy.
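
    The framework's topology and floorplan optimization is based on simulated annealing and force-directed movement. The sketch below shows only the simulated-annealing part on a toy problem (swapping router grid positions to minimize total Manhattan link length); the router names, cost function, and annealing schedule are illustrative assumptions, not the tool's actual formulation.

```python
import math
import random


def anneal_floorplan(routers, links, positions, steps=20000, t0=1.0, cooling=0.9995):
    """Toy simulated-annealing placer: the cost is total Manhattan link length
    and a move swaps the grid slots of two routers. The real framework also
    uses force-directed movement and many more constraints."""
    def cost(pos):
        return sum(abs(pos[a][0] - pos[b][0]) + abs(pos[a][1] - pos[b][1])
                   for a, b in links)

    pos, best, t = dict(positions), dict(positions), t0
    for _ in range(steps):
        a, b = random.sample(routers, 2)
        trial = dict(pos)
        trial[a], trial[b] = trial[b], trial[a]      # swap two router slots
        delta = cost(trial) - cost(pos)
        if delta < 0 or random.random() < math.exp(-delta / t):
            pos = trial
            if cost(pos) < cost(best):
                best = dict(pos)
        t *= cooling                                 # geometric cooling schedule
    return best


routers = ["r0", "r1", "r2", "r3"]
links = [("r0", "r1"), ("r1", "r2"), ("r2", "r3"), ("r3", "r0"), ("r0", "r2")]
init = {"r0": (0, 0), "r1": (1, 1), "r2": (0, 1), "r3": (1, 0)}
print(anneal_floorplan(routers, links, init))
```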

    Systematic Design Methods for Efficient Off-Chip DRAM Access

    No full text
    Typical design flows for digital hardware take, as their input, an abstract description of computation and data transfer between logical memories. No existing commercial high-level synthesis tool demonstrates the ability to map logical memory inferred from a high-level language to external memory resources. This thesis develops techniques for doing this, specifically targeting off-chip dynamic memory (DRAM) devices. These are a commodity technology in widespread use with standardised interfaces. In use, the bandwidth of an external memory interface and the latency of memory requests asserted on it may become the bottleneck limiting the performance of a hardware design. Careful consideration of this is especially important when designing with DRAMs, whose latency and bandwidth characteristics depend upon the sequence of memory requests issued by a controller. Throughout the work presented here, we pursue exact compile-time methods for designing application-specific memory systems, with a focus on guaranteeing predictable performance through static analysis. This contrasts with much of the surveyed existing work, which considers general-purpose memory controllers and optimized policies that improve performance in experiments run using simulation of suites of benchmark codes. The work targets loop nests within imperative source code, extracting a mathematical representation of the loop-nest statements and their associated memory accesses, referred to as the ‘Polytope Model’. We extend this mathematical representation to represent the physical DRAM ‘row’ and ‘column’ structures accessed when performing memory transfers. From this augmented representation, we can automatically derive DRAM controllers which buffer data in on-chip memory and transfer data in an efficient order. Buffering data and exploiting ‘reuse’ of data is shown to enable up to a 50× reduction in the quantity of data transferred to external memory. Reordering memory transactions by exploiting knowledge of the physical layout of the DRAM device allows up to a 4× improvement in the efficiency of those data transfers.
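
    The benefit of reordering transfers around the physical DRAM row structure can be illustrated by counting row activations for the same set of addresses in two different orders. The sketch below assumes a single bank with an open-row policy and a 2 KiB row; it illustrates why row-aware ordering helps, and does not reproduce the thesis's 50× or 4× results.

```python
def row_activations(addresses, row_bytes=2048):
    """Count DRAM row activations for a sequence of byte addresses, assuming a
    single bank with an open-row policy: a new activation is needed whenever
    the accessed row differs from the currently open one."""
    activations, open_row = 0, None
    for addr in addresses:
        row = addr // row_bytes
        if row != open_row:
            activations += 1
            open_row = row
    return activations


# Column-major traversal of a row-major 256x256 int32 array touches a new DRAM
# row on almost every access; transferring the same data in row order (as a
# controller derived from the augmented polytope representation could) opens
# each row only once.
N, WORD = 256, 4
naive = [(i * N + j) * WORD for j in range(N) for i in range(N)]  # column-major order
reordered = sorted(naive)                                         # row-major order
print(row_activations(naive), row_activations(reordered))
```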