39 research outputs found

    Automatic synthesis of reconfigurable instruction set accelerators

    Get PDF

    Energy Consumption Saving in Embedded Microprocessors Using Hardware Accelerators

    Get PDF
    This paper deals with the reduction of power consumption in embedded microprocessors. Computing power and energy efficiency are becoming the main challenges for embedded system applications. This is, in particular, the case of wearable systems. When the power supply is provided by batteries, an important requirement for these systems is a long service life. This work investigates a method for reducing microprocessor energy consumption based on the use of hardware accelerators. Their use reduces the execution time and allows the clock frequency to be decreased, thereby reducing the power consumption. To provide experimental results, the authors analyze a case study in the field of wearable devices for the processing of ECG signals. The experimental results show that the use of hardware accelerators significantly reduces the power consumption.
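
    The abstract's chain of reasoning (accelerator, hence fewer core cycles, hence lower clock, hence lower dynamic power) can be made concrete with the classic CMOS model P ≈ C·V²·f. Below is a minimal Python sketch of that arithmetic; all device numbers (capacitance, voltage, cycle counts, frequencies) are illustrative assumptions, not values from the paper.

```python
# Minimal sketch of the paper's energy argument, with made-up numbers:
# finishing the work in fewer core cycles lets the clock drop while
# still meeting the same deadline, and dynamic power scales with f.

def dynamic_power(c_eff, v_dd, freq_hz):
    """Classic CMOS dynamic-power model: P = C * V^2 * f."""
    return c_eff * v_dd**2 * freq_hz

# Hypothetical baseline: software-only ECG filtering at 48 MHz.
cycles_sw = 2_400_000            # cycles per ECG block (assumed)
f_sw = 48e6
t_sw = cycles_sw / f_sw
e_sw = dynamic_power(1e-9, 1.8, f_sw) * t_sw

# With an accelerator the same block takes far fewer core cycles
# (assumed 4x reduction), so the clock can drop to 12 MHz and the
# block still finishes in the same wall-clock time.
cycles_hw = 600_000
f_hw = 12e6
t_hw = cycles_hw / f_hw          # same deadline as the baseline
e_hw = dynamic_power(1e-9, 1.8, f_hw) * t_hw

print(f"energy ratio (accelerated / baseline): {e_hw / e_sw:.2f}")
```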

    Preliminary multicore architecture for Introspective Computing

    Get PDF
    Thesis (S.M.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2007. This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections. Includes bibliographical references (p. 243-245). This thesis creates a framework for Introspective Computing. Introspective Computing is a computing paradigm characterized by self-aware software. Self-aware software systems use hardware mechanisms to observe an application's execution so that they may adapt execution to improve performance, reduce power consumption, or balance user-defined fitness criteria over time-varying conditions in a system environment. We dub our framework Partner Cores. The Partner Cores framework builds upon tiled multicore architectures [11, 10, 25, 9], closely coupling cores such that one may be used to observe and optimize execution in another. Partner cores incrementally collect and analyze execution traces from code cores, then exploit knowledge of the hardware to optimize execution. This thesis develops a tiled architecture for the Partner Cores framework that we dub Evolve. Evolve provides a versatile substrate upon which software may coordinate core partnerships and various forms of parallelism. To do so, Evolve augments a basic tiled architecture with introspection hardware and programmable functional units. Partner Cores software systems on the Evolve hardware may follow the style of helper threading [13, 12, 6] or utilize the programmable functional units in each core to evolve application-specific coprocessor engines. This thesis work develops two Partner Cores software systems: the Dynamic Partner-Assisted Branch Predictor and the Introspective L2 Memory System (IL2). The branch predictor employs a partner core as a coprocessor engine for general dynamic branch prediction in a corresponding code core. The IL2 retasks the silicon resources of partner cores as banks of an on-chip, distributed, software L2 cache for code cores. The IL2 employs aggressive, application-specific prefetchers for minimizing cache miss penalties and DRAM power consumption. Our results and future work show that the branch predictor is able to sustain prediction for code core branch frequencies as high as one every 7 instructions with no degradation in accuracy; updated prediction directions are available in a low minimum of 20-21 instructions. For the IL2, we develop a pixel block prefetcher for the image data structure used in a JPEG encoder benchmark and show that a 50% improvement in absolute performance is attainable. by Jonathan M. Eastep. S.M.
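
    A rough intuition for the Dynamic Partner-Assisted Branch Predictor: the code core streams branch outcomes to a partner core that owns the predictor state, so updated predictions become visible only after a messaging delay (the 20-21 instructions cited above). The toy Python model below illustrates that decoupling; the table organization, delay value, and interface are assumptions for illustration, not the actual Evolve hardware.

```python
# Toy model of the Partner Cores idea: a "code core" streams branch
# outcomes to a "partner core", which maintains the predictor tables.
# Updates take effect only after a fixed message delay. Illustrative
# reconstruction only; not the thesis's actual mechanism.

from collections import deque

class PartnerPredictor:
    def __init__(self, table_bits=10, update_delay=20):
        self.table = [1] * (1 << table_bits)   # 2-bit saturating counters
        self.mask = (1 << table_bits) - 1
        self.in_flight = deque()               # (ready_time, pc, taken)
        self.update_delay = update_delay

    def predict(self, pc):
        return self.table[pc & self.mask] >= 2  # taken if counter >= 2

    def train(self, now, pc, taken):
        # Outcome travels to the partner core; it takes effect later.
        self.in_flight.append((now + self.update_delay, pc, taken))

    def tick(self, now):
        # Apply all updates whose messaging delay has elapsed.
        while self.in_flight and self.in_flight[0][0] <= now:
            _, pc, taken = self.in_flight.popleft()
            i = pc & self.mask
            self.table[i] = min(3, self.table[i] + 1) if taken \
                            else max(0, self.table[i] - 1)

bp = PartnerPredictor()
bp.train(0, pc=0x40, taken=True)
bp.tick(25)
print(bp.predict(0x40))   # True once the delayed update has landed
```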

    Microgrid - The microthreaded many-core architecture

    Full text link
    Traditional processors use the von Neumann execution model; some other processors in the past have used the dataflow execution model. A combination of the von Neumann and dataflow models has also been tried, and the resulting model is referred to as the hybrid dataflow execution model. We describe a hybrid dataflow model known as microthreading. It provides constructs for the creation of, synchronization of, and communication between threads in an intermediate language. The microthreading model is an abstract programming and machine model for many-core architectures. A particular instance of this model is named the microthreaded architecture, or the Microgrid. This architecture implements all the concurrency constructs of the microthreading model, including their management, in hardware. Comment: 30 pages, 16 figures
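
    As a software analogy for the microthreading constructs (bulk creation of an indexed family of threads, a single synchronization point for the whole family), consider the Python sketch below. It only illustrates the shape of the programming model; the real constructs are instructions in an intermediate language and are managed by the Microgrid hardware, not library calls.

```python
# Software analogy (not the Microgrid ISA): create a "family" of
# indexed threads in bulk and block on one sync for the whole family.

from concurrent.futures import ThreadPoolExecutor

def create_family(start, limit, step, body):
    """Create a family of threads over an index range; return a sync handle."""
    pool = ThreadPoolExecutor()
    futures = [pool.submit(body, i) for i in range(start, limit, step)]
    def sync():                      # blocks until the whole family completes
        results = [f.result() for f in futures]
        pool.shutdown()
        return results
    return sync

# Usage: a family computing partial sums, then a sync barrier.
data = list(range(100))
sync = create_family(0, 100, 10, lambda i: sum(data[i:i + 10]))
print(sum(sync()))   # 4950
```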

    Static resource models for code generation of embedded processors

    Get PDF
    xii+129 pages; 24 cm

    Automated design of domain-specific custom instructions = Geautomatiseerd ontwerp van domeinspecifieke gespecialiseerde instructies

    Get PDF
    Joint doctorate (cotutelle) between Universitat Politècnica de Catalunya and Universiteit Gent, Faculteit Ingenieurswetenschappen en Architectuur, Vakgroep Elektronica en Informatiesystemen. In recent years, hardware specialization has received renewed attention as chips approach a utilization wall. Specialized accelerators can take advantage of underutilized transistors, implementing custom hardware that complements the main processor. However, specialization adds complexity to the design process and limits reutilization. Application-Specific Instruction Processors (ASIPs) balance performance and reusability, extending a general-purpose processor with custom instructions (CIs) specific to an application, implemented in Specialized Functional Units (SFUs). Still, time-to-market is a major issue with application-specific designs because, if CIs are not frequently executed, the acceleration benefits will not compensate for the overall design cost. Domain-specific acceleration increases the applicability of ASIPs, as it targets several applications that run on the same hardware. Reconfigurable SFUs and the automation of CI design can also solve the aforementioned problems. In this dissertation, we explore different automated approaches to the design of CIs that extend a baseline processor for domain-specific acceleration, improving both performance and energy efficiency. First, we develop automated techniques to identify code sequences within a domain to create CI candidates. Owing to the disparity among coding styles of different programs, it is difficult to find patterns that are represented by a unique CI across applications. Therefore, we propose an analysis at the basic-block level that identifies equivalent CIs within and across different programs. We use the Taylor Expansion Diagram (TED) canonical representation to find not only structurally equivalent CIs but also functionally similar ones, as opposed to the commonly applied directed acyclic graph (DAG) isomorphism detection. We combine both methods into a new hybrid DAG/TED technique to identify more patterns across applications that map to the same CI. Then, we select a subset of the CI candidates that fits in the available SFU area. Because of the complexity of the problem, we propose four scoring heuristics to reduce the design space and smooth the potential performance speedup across applications. We include these methods in the FuSInG framework and estimate performance with hardware models on a set of media benchmarks. Results show that, when limiting the core area devoted to specialization, the SFU customization with the largest speedups includes a mix of application- and domain-specific custom instructions. If we target larger CIs to obtain higher speedups, reusability across applications becomes critical; without enough equivalences, CIs cannot be generalized for a domain. We aim to share partially common operations among CIs to accelerate more code, especially across basic blocks, and to reduce the hardware area needed for specialization. Hence, we create a new canonical representation across basic blocks, the Merging Diagram, to facilitate similarity detection and improve code coverage. We also introduce clustering-based partial matching to identify partially similar domain-specific CIs, which generally leads to better performance than application-specific ones. Yet, at small areas, merging two CIs induces a high additional overhead that might penalize energy efficiency. Thus, we also detect fragments of CIs and join them with the existing merged clusters, resulting in minimal extra overhead. Also, using speedup as the deciding factor for CI selection may not be optimal for devices with limited power budgets. For that reason, we propose a linear-programming-based selection that balances performance and energy consumption. We implement these techniques in the MInGLE framework and evaluate them with media benchmarks. The selected CIs significantly improve the energy-delay product and performance, demonstrating that we can accelerate a domain covering more code while reducing the area needed for the CI implementation.
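
    The selection step described above is essentially a knapsack problem: pick the subset of CI candidates whose total area fits the SFU budget while maximizing estimated benefit. The Python sketch below shows one plausible greedy heuristic (benefit per unit area); the candidate data and scoring function are hypothetical stand-ins for the dissertation's four heuristics and LP-based selector.

```python
# Greedy knapsack sketch of CI selection under an SFU area budget.
# Candidates and scoring are illustrative, not the FuSInG/MInGLE method.

from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    cycles_saved: int   # estimated cycles saved per invocation (assumed)
    exec_count: int     # profiled executions across the domain (assumed)
    area: float         # SFU area cost, arbitrary units (assumed)

def select_cis(candidates, area_budget):
    """Rank by total cycles saved per unit area, then pack greedily."""
    ranked = sorted(candidates,
                    key=lambda c: c.cycles_saved * c.exec_count / c.area,
                    reverse=True)
    chosen, used = [], 0.0
    for c in ranked:
        if used + c.area <= area_budget:
            chosen.append(c)
            used += c.area
    return chosen

pool = [Candidate("mac4", 6, 900_000, 2.0),
        Candidate("sad8", 14, 250_000, 5.0),
        Candidate("clip", 2, 1_200_000, 0.5)]
print([c.name for c in select_cis(pool, area_budget=3.0)])  # ['clip', 'mac4']
```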

    Automated design of domain-specific custom instructions

    Get PDF

    Locality analysis and its hardware implications for graph pattern mining

    Get PDF
    In this work, we have addressed the acceleration of GPM applications from the perspective offered by the NDP architecture. We have developed a new simulation tool based on the integration of two well-known simulators: ZSim (for the cores and the caches) and Ramulator (for the memory). We had to design this integration specifically because the implementation available for the joint use of both simulators does not take advantage of the techniques that ZSim uses to reduce the loss of precision. We have then implemented in the simulator a state-of-the-art GPM accelerator based on the NDP architecture (NDMiner). The simulation tool allows detailed profiling of NDMiner, which is very useful for identifying its weak points; in this way, the simulator helps in designing strategies that alleviate those bottlenecks and improve the accelerator's performance. After a series of simulation experiments, we have elaborated a set of concrete proposals to solve some of the problems detected and to improve NDMiner.
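
    The core of such a simulator integration is the handshake between the core model and the memory model: a cache miss becomes a request handed to the DRAM simulator with a completion callback, while both models advance in lockstep. The Python sketch below illustrates only that handshake, assuming a fixed DRAM latency; ZSim and Ramulator are C++ tools and their real interfaces differ.

```python
# Sketch of a core-simulator / DRAM-simulator handshake: requests carry
# completion callbacks, and both models tick in lockstep. Stand-ins
# only; not the actual ZSim or Ramulator APIs.

import heapq

class DramModel:
    """Stand-in for Ramulator: fixed-latency completion callbacks."""
    def __init__(self, latency=200):
        self.latency = latency
        self.pending = []                      # (finish_cycle, tiebreak, cb)

    def send(self, now, addr, callback):
        heapq.heappush(self.pending,
                       (now + self.latency, id(callback), callback))

    def tick(self, now):
        while self.pending and self.pending[0][0] <= now:
            _, _, cb = heapq.heappop(self.pending)
            cb()

class Core:
    """Stand-in for a ZSim core: stalls until its miss returns."""
    def __init__(self, dram):
        self.dram, self.stalled, self.cycle = dram, False, 0

    def run(self, miss_addrs):
        for addr in miss_addrs:
            self.stalled = True
            self.dram.send(self.cycle, addr,
                           lambda: setattr(self, "stalled", False))
            while self.stalled:                # lockstep: tick both models
                self.cycle += 1
                self.dram.tick(self.cycle)
        return self.cycle

print(Core(DramModel()).run([0x100, 0x200]))   # 400: two misses, 200 cycles each
```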

    ASC: A stream compiler for computing with FPGAs

    No full text
    Published version