6,816 research outputs found

    Low Power Processor Architectures and Contemporary Techniques for Power Optimization – A Review

    Get PDF
    The technological evolution has increased the number of transistors for a given die area significantly and increased the switching speed from few MHz to GHz range. Such inversely proportional decline in size and boost in performance consequently demands shrinking of supply voltage and effective power dissipation in chips with millions of transistors. This has triggered substantial amount of research in power reduction techniques into almost every aspect of the chip and particularly the processor cores contained in the chip. This paper presents an overview of techniques for achieving the power efficiency mainly at the processor core level but also visits related domains such as buses and memories. There are various processor parameters and features such as supply voltage, clock frequency, cache and pipelining which can be optimized to reduce the power consumption of the processor. This paper discusses various ways in which these parameters can be optimized. Also, emerging power efficient processor architectures are overviewed and research activities are discussed which should help reader identify how these factors in a processor contribute to power consumption. Some of these concepts have been already established whereas others are still active research areas. © 2009 ACADEMY PUBLISHER

    Design of multimedia processor based on metric computation

    Get PDF
    Media-processing applications, such as signal processing, 2D and 3D graphics rendering, and image compression, are the dominant workloads in many embedded systems today. The real-time constraints of those media applications have taxing demands on today's processor performances with low cost, low power and reduced design delay. To satisfy those challenges, a fast and efficient strategy consists in upgrading a low cost general purpose processor core. This approach is based on the personalization of a general RISC processor core according the target multimedia application requirements. Thus, if the extra cost is justified, the general purpose processor GPP core can be enforced with instruction level coprocessors, coarse grain dedicated hardware, ad hoc memories or new GPP cores. In this way the final design solution is tailored to the application requirements. The proposed approach is based on three main steps: the first one is the analysis of the targeted application using efficient metrics. The second step is the selection of the appropriate architecture template according to the first step results and recommendations. The third step is the architecture generation. This approach is experimented using various image and video algorithms showing its feasibility

    Developing an energy efficient real-time system

    Get PDF
    Increasing number of battery operated devices creates a need for energy-efficient real-time operating system for such devices. Designing a truly energy-efficient system is a multi-staged effort; this thesis consists of three main tasks that address different aspects of energy efficiency of a real-time system (RTS). The first chapter introduces an energy-efficient algorithm that alternates processor frequency using DVFS to schedule tasks on cores. Speed profiles is calculated for every task that gives information about how long a task would run for and at what processor speed. We pair tasks with similar speed profiles to give us a resultant merged speed profile that can be efficient scheduled on a cluster. Experiments carried out on ODROID-XU3 are compared with a reference approach that provides energy saving of up to 20%. The second chapter proposes power-aware techniques to segregate a task set over a heterogeneous platform such that the overall energy consumption is minimized. With the help of calculated speed profiles, second contribution of this work feasibly partitions a given task set into individual sets for a cluster based homogeneous platform. Various heuristics are proposed that are compared against a baseline approach with simulation results. The final chapter of this thesis focuses on the importance of having an underlying energy-efficient operating system. We discuss an energy-efficient way of porting a real-time operating system (RTOS), QP, over TMS320F28377S along with modifications to make the Operating System (OS) consume minimal energy for its operation --Abstract, page iii

    Dynamic code mapping for limited local memory systems

    Full text link
    Abstract—This paper presents heuristics for dynamic man-agement of application code on limited local memories present in high-performance multi-core processors. Previous techniques formulate the problem using call graphs, which do not capture the temporal ordering of functions. In addition, they only use a conservative estimate of the interference cost between functions to obtain a mapping. As a result previous techniques are unable to achieve efficient code mapping. Techniques proposed in this paper overcome both these limitations and achieve superior code mapping. Experimental results from executing benchmarks from MiBench onto the Cell processor in the Sony Playstation 3 demonstrate upto 29 % and average 12 % performance improve-ment, at tolerable compile-time overhead. I

    Automated design of domain-specific custom instructions

    Get PDF

    Eliminating stack overflow by abstract interpretation

    Get PDF
    ManuscriptAn important correctness criterion for software running on embedded microcontrollers is stack safety: a guarantee that the call stack does not overflow. Our first contribution is a method for statically guaranteeing stack safety of interrupt-driven embedded software using an approach based on context-sensitive dataflow analysis of object code. We have implemented a prototype stack analysis tool that targets software for Atmel AVR microcontrollers and tested it on embedded applications compiled from up to 30,000 lines of C. We experimentally validate the accuracy of the tool, which runs in under 10 sec on the largest programs that we tested. The second contribution of this paper is the development of two novel ways to reduce stack memory requirements of embedded software

    Automatic design of domain-specific instructions for low-power processors

    Get PDF
    This paper explores hardware specialization of low­ power processors to improve performance and energy efficiency. Our main contribution is an automated framework that analyzes instruction sequences of applications within a domain at the loop body level and identifies exactly and partially-matching sequences across applications that can become custom instructions. Our framework transforms sequences to a new code abstraction, a Merging Diagram, that improves similarity identification, clusters alike groups of potential custom instructions to effectively reduce the search space, and selects merged custom instructions to efficiently exploit the available customizable area. For a set of 11 media applications, our fast framework generates instructions that significantly improve the energy-delay product and speed­ up, achieving more than double the savings as compared to a technique analyzing sequences within basic blocks. This paper shows that partially-matched custom instructions, which do not significantly increase design time, are crucial to achieving higher energy efficiency at limited hardware areas

    Simplified vector-thread architectures for flexible and efficient data-parallel accelerators

    Get PDF
    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2010.This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.Cataloged from student submitted PDF version of thesis.Includes bibliographical references (p. 165-170).This thesis explores a new approach to building data-parallel accelerators that is based on simplifying the instruction set, microarchitecture, and programming methodology for a vector-thread architecture. The thesis begins by categorizing regular and irregular data-level parallelism (DLP), before presenting several architectural design patterns for data-parallel accelerators including the multiple-instruction multiple-data (MIMD) pattern, the vector single-instruction multiple-data (vector-SIMD) pattern, the single-instruction multiple-thread (SIMT) pattern, and the vector-thread (VT) pattern. Our recently proposed VT pattern includes many control threads that each manage their own array of microthreads. The control thread uses vector memory instructions to efficiently move data and vector fetch instructions to broadcast scalar instructions to all microthreads. These vector mechanisms are complemented by the ability for each microthread to direct its own control flow. In this thesis, I introduce various techniques for building simplified instances of the VT pattern. I propose unifying the VT control-thread and microthread scalar instruction sets to simplify the microarchitecture and programming methodology. I propose a new single-lane VT microarchitecture based on minimal changes to the vector-SIMD pattern.(cont.) Single-lane cores are simpler to implement than multi-lane cores and can achieve similar energy efficiency. This new microarchitecture uses control processor embedding to mitigate the area overhead of single-lane cores, and uses vector fragments to more efficiently handle both regular and irregular DLP as compared to previous VT architectures. I also propose an explicitly data-parallel VT programming methodology that is based on a slightly modified scalar compiler. This methodology is easier to use than assembly programming, yet simpler to implement than an automatically vectorizing compiler. To evaluate these ideas, we have begun implementing the Maven data-parallel accelerator. This thesis compares a simplified Maven VT core to MIMD, vector-SIMD, and SIMT cores. We have implemented these cores with an ASIC methodology, and I use the resulting gate-level models to evaluate the area, performance, and energy of several compiled microbenchmarks. This work is the first detailed quantitative comparison of the VT pattern to other patterns. My results suggest that future data-parallel accelerators based on simplified VT architectures should be able to combine the energy efficiency of vector-SIMD accelerators with the flexibility of MIMD accelerators.by Christopher Francis Batten.Ph.D

    Automated design of domain-specific custom instructions = Geautomatiseerd ontwerp van domeinspecifieke gespecialiseerde instructies

    Get PDF
    Cotutela Universitat Politècnica de Catalunya i Universiteit Gent, Faculteit Ingenieurswetenschappen en Architectuur Vakgroep Elektronica en InformatiesystemenIn the last years, hardware specialization has received renewed attention as chips approach a utilization wall. Specialized accelerators can take advantage of underutilized transistors implementing custom hardware that complements the main processor. However, specialization adds complexity to the design process and limits reutilization. Application-Specific Instruction Processors (ASIPs) balance performance and reusability, extending a general-purpose processor with custom instructions (CIs) specific for an application, implemented in Specialized Functional Units (SFUs). Still, time-to-market is a major issue with application-specific designs because, if CIs are not frequently executed, the acceleration benefits will not compensate for the overall design cost. Domain-specific acceleration increases the applicability of ASIPs, as it targets several applications that run on the same hardware. Also, reconfigurable SFUs and the automation of the CIs design can solve the aforementioned problems. In this dissertation, we explore different automated approaches to the design of CIs that extend a baseline processor for domain-specific acceleration to improve both performance and energy efficiency. First, we develop automated techniques to identify code sequences within a domain to create CI candidates. Due to the disparity among coding styles of different programs, it is difficult to find patterns that are represented by a unique CI across applications. Therefore, we propose an analysis at the basic block level that identifies equivalent CIs within and across different programs. We use the Taylor Expansion Diagram (TED) canonical representation to find not only structurally equivalent CIs, but also functionally similar ones, as opposed to the commonly applied directed acyclic graph (DAG) isomorphism detection. We combine both methods into a new Hybrid DAG/TED technique to identify more patterns across applications that map to the same CI. Then, we select a subset of the CI candidates that fits in the available SFU area. Because of the complexity of the problem, we propose four scoring heuristics to reduce the design space and smooth the potential performance speedup across applications. We include these methods in the FuSInG framework, and we estimate performance with hardware models on a set of media benchmarks. Results show that, when limiting core area devoted to specialization, the SFU customization with the largest speedups includes a mix of application and domain-specific custom instructions. If we target larger CIs to obtain higher speedups, reusability across applications becomes critical; without enough equivalences, CIs cannot be generalized for a domain. We aim to share partially common operations among CIs to accelerate more code, especially across basic blocks, and to reduce the hardware area needed for specialization. Hence, we create a new canonical representation across basic blocks, the Merging Diagram, to facilitate similarity detection and improve code coverage. We also introduce clustering-based partial matching to identify partially-similar domain-specific CIs, which generally leads to better performance than application-specific ones. Yet, at small areas, merging two CIs induces a high additional overhead that might penalize energy-efficiency. Thus, we also detect fragments of CIs and we join them with the existing merged clusters resulting in minimal extra overhead. Also, using speedup as the deciding factor for CI selection may not be optimal for devices with limited power budgets. For that reason, we propose a linear programming-based selection that balances performance and energy consumption. We implement these techniques in the MInGLE framework and evaluate them with media benchmarks. The selected CIs significantly improve the energy-delay product and performance, demonstrating that we can accelerate a domain covering more code while reducing the needed area for the CI implementation.La especialización de hardware ha recibido renovado interés debido al utilization wall, ya que transistores infrautilizados pueden implementar hardware a medida que complemente el procesador principal. Sin embargo, el proceso de diseño se complica y se limita la reutilización. Procesadores de instrucciones para aplicaciones específicas (ASIPs) equilibran rendimiento y reuso, extendiendo un procesador con instruciones especializadas (custom instructions ¿ CIs) para una aplicación, implementadas en unidades funcionales especializadas (SFUs). No obstante, los plazos de comercialización suponen un obstáculo en diseños específicos ya que, si las CIs no se ejecutan con frecuencia, los beneficios de la aceleración no compensan los costes de diseño. La aceleración de un dominio específico incrementa la aplicabilidad de los ASIPs, acelerando diferentes aplicaciones en el mismo hardware, mientras que una SFU reconfigurable y un diseño automatizado pueden resolver los problemas mencionados. En esta tesis, exploramos diferentes alternativas al diseño de CIs que extienden un procesador para acelerar un dominio, mejorando el rendimiento y la eficiencia energética. Proponemos primero técnicas automatizadas para identificar código acelerable en un dominio. Sin embargo, la identificación se ve dificultada por la diversidad de estilos entre diferentes programas. Por tanto, proponemos identificar en el bloque básico CIs equivalentes utilizando la representación canónica Taylor Expansion Diagram (TED). Con TEDs encontramos no sólo código estructuralmente equivalente, sino también con similitud funcional, en contraposición a la detección isomórfica de grafos acíclicos dirigidos (DAG). Combinamos ambos métodos en una nueva técnica híbrida DAG/TED, que identifica en varias aplicaciones más secuencias representadas por la misma CI. Tras esto, seleccionamos un subconjunto de CIs que puede ser contenido en el área de la SFU. Por la complejidad del problema, proponemos cuatro heurísticas de selección para reducir el espacio de búsqueda y homogeneizar el rendimiento de las aplicaciones. Incluimos estas técnicas en la infraestructura FuSInG y estimamos el rendimiento para un conjunto de benchmarks multimedia. Los resultados muestran que, al limitar el área de especialización, la configuración de la SFU con las mayores ganancias incluye una mezcla de CIs específicas tanto para una aplicación como para todo el dominio. Si nos centramos en CIs más grandes para obtener mayores ganancias, la reutilización se vuelve crítica; sin suficientes equivalencias las CIs no pueden ser generalizadas. Nuestro objetivo es que las CIs compartan parcialmente operaciones, especialmente a través de bloques básicos, y reducir el área de especialización. Por ello, creamos una representación canónica de CIs que cubre varios bloques básicos, Merging Diagram, para mejorar el alcance de la aceleración y facilitar la detección de similitud. Además, proponemos una búsqueda de coincidencias parciales basadas en clustering para identificar CIs de dominio específico parcialmente similares, las cuales derivan generalmente mejor rendimiento. Pero en áreas reducidas, la fusión de CIs induce un coste adicional que penalizaría la eficiencia energética. Así, detectamos fragmentos de CIs y los unimos con grupos de CIs previamente fusionadas con un coste extra mínimo. Usar el rendimiento como el factor decisivo en la selección puede no ser óptimo para disposivos con consumo de energía limitado. Por eso, proponemos un mecanismo de selección basado en programación lineal que equilibra rendimiento y consumo energético. Implementamos estas técnicas en la infraestructura MInGLE y las evaluamos con benchmarks multimedia. Las CIs seleccionadas mejoran notablemente la eficiencia energética y el rendimiento, demostrando que podemos acelerar un dominio cubriendo más código a la vez que reducimos el área de implementación.Postprint (published version
    corecore