11 research outputs found

    Interstellar: Using Halide's Scheduling Language to Analyze DNN Accelerators

    We show that DNN accelerator micro-architectures and their program mappings represent specific choices of loop order and hardware parallelism for computing the seven nested loops of DNNs, which enables us to create a formal taxonomy of all existing dense DNN accelerators. Surprisingly, the loop transformations needed to create these hardware variants can be precisely and concisely represented by Halide's scheduling language. By modifying the Halide compiler to generate hardware, we create a system that can fairly compare these prior accelerators. As long as proper loop blocking schemes are used, and the hardware can support mapping replicated loops, many different hardware dataflows yield similar energy efficiency with good performance. This is because the loop blocking can ensure that most data references stay on-chip with good locality and the processing units have high resource utilization. How resources are allocated, especially in the memory system, has a large impact on energy and performance. By optimizing hardware resource allocation while keeping throughput constant, we achieve up to 4.2X energy improvement for Convolutional Neural Networks (CNNs), 1.6X and 1.8X improvement for Long Short-Term Memories (LSTMs) and multi-layer perceptrons (MLPs), respectively. Comment: Published as a conference paper at ASPLOS 202

    Improving Compute & Data Efficiency of Flexible Architectures


    Architectural Exploration of Data Recomputation for Improving Energy Efficiency

    University of Minnesota Ph.D. dissertation. July 2017. Major: Electrical/Computer Engineering. Advisor: Ulya Karpuzcu. 1 computer file (PDF); viii, 99 pages. There are two fundamental challenges for modern computer system design. The first is accommodating the increasing demand for performance within a tight power budget. The second is ensuring correct progress despite the increasing possibility of faults occurring in the system. To address the first challenge, it is essential to track where the power goes. The energy consumption of data orchestration (i.e., storage, movement, communication) dominates the energy consumption of actual data production, i.e., computation. Oftentimes, recomputing data is more energy efficient than storing and retrieving pre-computed data, because it minimizes the prevalent power and performance overhead of data storage, retrieval, and communication. At the same time, recomputation can reduce the demand for communication bandwidth and shrink the memory footprint. In the first half of the dissertation, the potential of data recomputation in improving energy efficiency is quantified and a practical recomputation framework is introduced to trade computation for communication. To address the second challenge, scalable checkpointing and recovery mechanisms are needed. The traditional method to recover from a fault is to periodically checkpoint the state of the machine. Periodic checkpointing of the machine state makes rollback and restart of execution from a safe state possible upon detection of a fault. The energy overhead of checkpointing, however, as incurred by storage and communication of the machine state, grows with the frequency of checkpointing. Amortizing this overhead becomes especially challenging, considering the growth of expected error rates as an artifact of contemporary technology scaling.
Recomputation of data (which otherwise would be read from a checkpoint) can reduce both the frequency of checkpointing and the size of the checkpoints, thereby mitigating checkpointing overhead. In the second half, a quantitative characterization of recomputation-enabled checkpointing (based on the recomputation framework) is provided.

    Model-Based Design for Wireless Body Sensor Network Nodes

    Wireless body sensor networks (WBSNs) are a rising technology that allows constant and unobtrusive monitoring of the vital signals of a patient. The configuration of a WBSN node is critical for maximizing its lifetime while meeting the predefined performance during signal sensing, preprocessing, and wireless transmission to the base station. In this work, we propose a model-based optimization framework for WBSN nodes, which is centered on a detailed analytical characterization of the most energy-demanding components of this application domain. We also propose a multi-objective exploration algorithm to evaluate the node configurations and the corresponding performance tradeoffs. A case study is discussed to validate the proposed framework, showing that our model captures the behavior of real WBSNs and efficiently leads to the determination of the Pareto-optimal configurations.

    ENERGY-AWARE OPTIMIZATION FOR EMBEDDED SYSTEMS WITH CHIP MULTIPROCESSOR AND PHASE-CHANGE MEMORY

    Over the last two decades, the functions of embedded systems have evolved from simple real-time control and monitoring to more complicated services. Embedded systems equipped with powerful chips can provide the performance that computationally demanding information processing applications need. However, due to power constraints, the easy way to gain performance by scaling up chip frequencies is no longer feasible. Recently, low-power architecture designs have been the main trend in embedded system designs. In this dissertation, we present our approaches to energy-related issues in embedded system designs, such as thermal issues in the 3D chip multiprocessor (CMP), the endurance issue in phase-change memory (PCM), the battery issue in embedded system designs, the impact of inaccurate information in embedded systems, and the use of cloud computing to move workload to remote cloud computing facilities. We propose a real-time constrained task scheduling method to reduce peak temperature on a 3D CMP, including an online 3D CMP temperature prediction model and a set of algorithms for scheduling tasks to different cores in order to minimize the peak temperature on chip. To address the challenging issues in applying PCM in embedded systems, we propose a PCM main memory optimization mechanism through the utilization of the scratch pad memory (SPM). Furthermore, we propose an MLC/SLC configuration optimization algorithm to enhance the efficiency of the hybrid DRAM + PCM memory. We also propose an energy-aware task scheduling algorithm for parallel computing in mobile systems powered by batteries. When scheduling tasks in embedded systems, we make the scheduling decisions based on information such as the estimated execution time of tasks. Therefore, we design an evaluation method for the impact of inaccurate information on resource allocation in embedded systems.
Finally, in order to move workload from embedded systems to remote cloud computing facilities, we present a resource optimization mechanism for heterogeneous federated multi-cloud systems. We also propose two online dynamic algorithms for resource allocation and task scheduling that take resource contention into account.

    Optimization Techniques for Parallel Programming of Embedded Many-Core Computing Platforms

    Nowadays, many-core computing platforms are widely adopted as a viable solution to accelerate compute-intensive workloads at different scales, from low-cost devices to HPC nodes. It is well established that heterogeneous platforms including a general-purpose host processor and a parallel programmable accelerator have the potential to dramatically increase the peak performance/Watt of computing architectures. However, the adoption of these platforms further complicates application development, and it is widely acknowledged that software development is a critical activity for the platform design. The introduction of parallel architectures raises the need for programming paradigms capable of effectively leveraging an increasing number of processors, from two to thousands. In this scenario, the study of optimization techniques to program parallel accelerators is paramount for two main objectives: first, improving performance and energy efficiency of the platform, which are key metrics for both embedded and HPC systems; second, enforcing software engineering practices with the aim of guaranteeing code quality and reducing software costs. This thesis presents a set of techniques that have been studied and designed to achieve these objectives, going beyond the current state of the art. As a first contribution, we discuss the use of OpenMP tasking as a general-purpose programming model to support the execution of diverse workloads, and we introduce a set of runtime-level techniques to support fine-grain tasks on high-end many-core accelerators (devices with a power consumption greater than 10W). Then we focus our attention on embedded computer vision (CV), with the aim of showing how to achieve the best performance by exploiting the characteristics of a specific application domain.
To further reduce the power consumption of parallel accelerators beyond the current technological limits, we describe an approach based on the principles of approximate computing, which implies modification to the program semantics and proper hardware support at the architectural level.

    Design techniques for smart and energy-efficient wireless body sensor networks

    Unpublished thesis of the Universidad Complutense de Madrid, Facultad de Informática, Departamento de Arquitectura de Computadores y Automática, defended on 26/10/2012. Wireless body sensor networks (WBSNs) for monitoring, diagnosis, and emergency detection are gaining popularity and are set to profoundly change healthcare in the coming years. The use of these networks enables continuous supervision, contributing to the prevention and early diagnosis of disease while improving patient autonomy compared with other current monitoring systems. Building on this technology, this thesis proposes the development of an electrocardiogram (ECG) monitoring system that not only continuously displays the patient's ECG, but also analyzes it in real time and can report on the state of the heart through a mobile device. This information can also be sent to medical staff in real time. If a dangerous event occurs, the system detects it automatically and immediately informs the patient and the medical staff, enabling a rapid reaction in an emergency. To implement such a system, several real-time ECG processing algorithms are developed and optimized, including filtering, detection of characteristic points, and arrhythmia classification. This thesis also addresses improving the energy efficiency of the sensor network while meeting the fidelity and performance requirements of the application. To this end, design techniques are proposed to reduce energy consumption, making it possible to find an optimal trade-off between battery size and battery lifetime. If energy consumption can be reduced sufficiently, it would be possible to develop a network that operates permanently.
Therefore, sampling, processing, storage, and wireless transmission must be carried out so that all relevant data are delivered with the lowest possible energy consumption, thereby minimizing the battery size (which determines the overall size of the node) and the battery recharge frequency (another key factor for usability). Accordingly, to improve the energy efficiency of the ECG monitoring and analysis system proposed in this thesis, several solutions are studied at the medium access control and operating system levels.

    Minimizing energy consumption of banked memories using data recomputation
