140 research outputs found

    Domain-Specific Computing Architectures and Paradigms

    Full text link
    We live in an exciting era where artificial intelligence (AI) is fundamentally shifting the dynamics of industries and businesses around the world. AI algorithms such as deep learning (DL) have drastically advanced the state-of-the-art cognition and learning capabilities. However, the power of modern AI algorithms can only be enabled if the underlying domain-specific computing hardware can deliver orders of magnitude more performance and energy efficiency. This work focuses on this goal and explores three parts of the domain-specific computing acceleration problem; encapsulating specialized hardware and software architectures and paradigms that support the ever-growing processing demand of modern AI applications from the edge to the cloud. This first part of this work investigates the optimizations of a sparse spatio-temporal (ST) cognitive system-on-a-chip (SoC). This design extracts ST features from videos and leverages sparse inference and kernel compression to efficiently perform action classification and motion tracking. The second part of this work explores the significance of dataflows and reduction mechanisms for sparse deep neural network (DNN) acceleration. This design features a dynamic, look-ahead index matching unit in hardware to efficiently discover fine-grained parallelism, achieving high energy efficiency and low control complexity for a wide variety of DNN layers. Lastly, this work expands the scope to real-time machine learning (RTML) acceleration. A new high-level architecture modeling framework is proposed. Specifically, this framework consists of a set of high-performance RTML-specific architecture design templates, and a Python-based high-level modeling and compiler tool chain for efficient cross-stack architecture design and exploration.PHDElectrical and Computer EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/162870/1/lchingen_1.pd

    Doctor of Philosophy

    Get PDF
    dissertationWith the explosion of chip transistor counts, the semiconductor industry has struggled with ways to continue scaling computing performance in line with historical trends. In recent years, the de facto solution to utilize excess transistors has been to increase the size of the on-chip data cache, allowing fast access to an increased portion of main memory. These large caches allowed the continued scaling of single thread performance, which had not yet reached the limit of instruction level parallelism (ILP). As we approach the potential limits of parallelism within a single threaded application, new approaches such as chip multiprocessors (CMP) have become popular for scaling performance utilizing thread level parallelism (TLP). This dissertation identifies the operating system as a ubiquitous area where single threaded performance and multithreaded performance have often been ignored by computer architects. We propose that novel hardware and OS co-design has the potential to significantly improve current chip multiprocessor designs, enabling increased performance and improved power efficiency. We show that the operating system contributes a nontrivial overhead to even the most computationally intense workloads and that this OS contribution grows to a significant fraction of total instructions when executing several common applications found in the datacenter. We demonstrate that architectural improvements have had little to no effect on the performance of the OS over the last 15 years, leaving ample room for improvements. We specifically consider three potential solutions to improve OS execution on modern processors. First, we consider the potential of a separate operating system processor (OSP) operating concurrently with general purpose processors (GPP) in a chip multiprocessor organization, with several specialized structures acting as efficient conduits between these processors. Second, we consider the potential of segregating existing caching structures to decrease cache interference between the OS and application. Third, we propose that there are components within the OS itself that should be refactored to be both multithreaded and cache topology aware, which in turn, improves the performance and scalability of many-threaded applications

    A Survey of Phase Classification Techniques for Characterizing Variable Application Behavior

    Full text link
    Adaptable computing is an increasingly important paradigm that specializes system resources to variable application requirements, environmental conditions, or user requirements. Adapting computing resources to variable application requirements (or application phases) is otherwise known as phase-based optimization. Phase-based optimization takes advantage of application phases, or execution intervals of an application, that behave similarly, to enable effective and beneficial adaptability. In order for phase-based optimization to be effective, the phases must first be classified to determine when application phases begin and end, and ensure that system resources are accurately specialized. In this paper, we present a survey of phase classification techniques that have been proposed to exploit the advantages of adaptable computing through phase-based optimization. We focus on recent techniques and classify these techniques with respect to several factors in order to highlight their similarities and differences. We divide the techniques by their major defining characteristics---online/offline and serial/parallel. In addition, we discuss other characteristics such as prediction and detection techniques, the characteristics used for prediction, interval type, etc. We also identify gaps in the state-of-the-art and discuss future research directions to enable and fully exploit the benefits of adaptable computing.Comment: To appear in IEEE Transactions on Parallel and Distributed Systems (TPDS

    Optimized Biosignals Processing Algorithms for New Designs of Human Machine Interfaces on Parallel Ultra-Low Power Architectures

    Get PDF
    The aim of this dissertation is to explore Human Machine Interfaces (HMIs) in a variety of biomedical scenarios. The research addresses typical challenges in wearable and implantable devices for diagnostic, monitoring, and prosthetic purposes, suggesting a methodology for tailoring such applications to cutting edge embedded architectures. The main challenge is the enhancement of high-level applications, also introducing Machine Learning (ML) algorithms, using parallel programming and specialized hardware to improve the performance. The majority of these algorithms are computationally intensive, posing significant challenges for the deployment on embedded devices, which have several limitations in term of memory size, maximum operative frequency, and battery duration. The proposed solutions take advantage of a Parallel Ultra-Low Power (PULP) architecture, enhancing the elaboration on specific target architectures, heavily optimizing the execution, exploiting software and hardware resources. The thesis starts by describing a methodology that can be considered a guideline to efficiently implement algorithms on embedded architectures. This is followed by several case studies in the biomedical field, starting with the analysis of a Hand Gesture Recognition, based on the Hyperdimensional Computing algorithm, which allows performing a fast on-chip re-training, and a comparison with the state-of-the-art Support Vector Machine (SVM); then a Brain Machine Interface (BCI) to detect the respond of the brain to a visual stimulus follows in the manuscript. Furthermore, a seizure detection application is also presented, exploring different solutions for the dimensionality reduction of the input signals. The last part is dedicated to an exploration of typical modules for the development of optimized ECG-based applications

    A 64mW DNN-based Visual Navigation Engine for Autonomous Nano-Drones

    Full text link
    Fully-autonomous miniaturized robots (e.g., drones), with artificial intelligence (AI) based visual navigation capabilities are extremely challenging drivers of Internet-of-Things edge intelligence capabilities. Visual navigation based on AI approaches, such as deep neural networks (DNNs) are becoming pervasive for standard-size drones, but are considered out of reach for nanodrones with size of a few cm2{}^\mathrm{2}. In this work, we present the first (to the best of our knowledge) demonstration of a navigation engine for autonomous nano-drones capable of closed-loop end-to-end DNN-based visual navigation. To achieve this goal we developed a complete methodology for parallel execution of complex DNNs directly on-bard of resource-constrained milliwatt-scale nodes. Our system is based on GAP8, a novel parallel ultra-low-power computing platform, and a 27 g commercial, open-source CrazyFlie 2.0 nano-quadrotor. As part of our general methodology we discuss the software mapping techniques that enable the state-of-the-art deep convolutional neural network presented in [1] to be fully executed on-board within a strict 6 fps real-time constraint with no compromise in terms of flight results, while all processing is done with only 64 mW on average. Our navigation engine is flexible and can be used to span a wide performance range: at its peak performance corner it achieves 18 fps while still consuming on average just 3.5% of the power envelope of the deployed nano-aircraft.Comment: 15 pages, 13 figures, 5 tables, 2 listings, accepted for publication in the IEEE Internet of Things Journal (IEEE IOTJ

    Energy-efficient architectures for chip-scale networks and memory systems using silicon-photonics technology

    Full text link
    Today's supercomputers and cloud systems run many data-centric applications such as machine learning, graph algorithms, and cognitive processing, which have large data footprints and complex data access patterns. With computational capacity of large-scale systems projected to rise up to 50GFLOPS/W, the target energy-per-bit budget for data movement is expected to reach as low as 0.1pJ/bit, assuming 200bits/FLOP for data transfers. This tight energy budget impacts the design of both chip-scale networks and main memory systems. Conventional electrical links used in chip-scale networks (0.5-3pJ/bit) and DRAM systems used in main memory (>30pJ/bit) fail to provide sustained performance at low energy budgets. This thesis builds on the promising research on silicon-photonic technology to design system architectures and system management policies for chip-scale networks and main memory systems. The adoption of silicon-photonic links as chip-scale networks, however, is hampered by the high sensitivity of optical devices towards thermal and process variations. These device sensitivities result in high power overheads at high-speed communications. Moreover, applications differ in their resource utilization, resulting in application-specific thermal profiles and bandwidth needs. Similarly, optically-controlled memory systems designed using conventional electrical-based architectures require additional circuitry for electrical-to-optical and optical-to-electrical conversions within memory. These conversions increase the energy and latency per memory access. Due to these issues, chip-scale networks and memory systems designed using silicon-photonics technology leave much of their benefits underutilized. This thesis argues for the need to rearchitect memory systems and redesign network management policies such that they are aware of the application variability and the underlying device characteristics of silicon-photonic technology. We claim that such a cross-layer design enables a high-throughput and energy-efficient unified silicon-photonic link and main memory system. This thesis undertakes the cross-layer design with silicon-photonic technology in two fronts. First, we study the varying network bandwidth requirements across different applications and also within a given application. To address this variability, we develop bandwidth allocation policies that account for application needs and device sensitivities to ensure power-efficient operation of silicon-photonic links. Second, we design a novel architecture of an optically-controlled main memory system that is directly interfaced with silicon-photonic links using a novel read and write access protocol. Such a system ensures low-energy and high-throughput access from the processor to a high-density memory. To further address the diversity in application memory characteristics, we explore heterogeneous memory systems with multiple memory modules that provide varied power-performance benefits. We design a memory management policy for such systems that allocates pages at the granularity of memory objects within an application

    Gestión de jerarquías de memoria híbridas a nivel de sistema

    Get PDF
    Tesis inédita de la Universidad Complutense de Madrid, Facultad de Informática, Departamento de Arquitectura de Computadoras y Automática y de Ku Leuven, Arenberg Doctoral School, Faculty of Engineering Science, leída el 11/05/2017.In electronics and computer science, the term ‘memory’ generally refers to devices that are used to store information that we use in various appliances ranging from our PCs to all hand-held devices, smart appliances etc. Primary/main memory is used for storage systems that function at a high speed (i.e. RAM). The primary memory is often associated with addressable semiconductor memory, i.e. integrated circuits consisting of silicon-based transistors, used for example as primary memory but also other purposes in computers and other digital electronic devices. The secondary/auxiliary memory, in comparison provides program and data storage that is slower to access but offers larger capacity. Examples include external hard drives, portable flash drives, CDs, and DVDs. These devices and media must be either plugged in or inserted into a computer in order to be accessed by the system. Since secondary storage technology is not always connected to the computer, it is commonly used for backing up data. The term storage is often used to describe secondary memory. Secondary memory stores a large amount of data at lesser cost per byte than primary memory; this makes secondary storage about two orders of magnitude less expensive than primary storage. There are two main types of semiconductor memory: volatile and nonvolatile. Examples of non-volatile memory are ‘Flash’ memory (sometimes used as secondary, sometimes primary computer memory) and ROM/PROM/EPROM/EEPROM memory (used for firmware such as boot programs). Examples of volatile memory are primary memory (typically dynamic RAM, DRAM), and fast CPU cache memory (typically static RAM, SRAM, which is fast but energy-consuming and offer lower memory capacity per are a unit than DRAM). Non-volatile memory technologies in Si-based electronics date back to the 1990s. Flash memory is widely used in consumer electronic products such as cellphones and music players and NAND Flash-based solid-state disks (SSDs) are increasingly displacing hard disk drives as the primary storage device in laptops, desktops, and even data centers. The integration limit of Flash memories is approaching, and many new types of memory to replace conventional Flash memories have been proposed. The rapid increase of leakage currents in Silicon CMOS transistors with scaling poses a big challenge for the integration of SRAM memories. There is also the case of susceptibility to read/write failure with low power schemes. As a result of this, over the past decade, there has been an extensive pooling of time, resources and effort towards developing emerging memory technologies like Resistive RAM (ReRAM/RRAM), STT-MRAM, Domain Wall Memory and Phase Change Memory(PRAM). Emerging non-volatile memory technologies promise new memories to store more data at less cost than the expensive-to build silicon chips used by popular consumer gadgets including digital cameras, cell phones and portable music players. These new memory technologies combine the speed of static random-access memory (SRAM), the density of dynamic random-access memory (DRAM), and the non-volatility of Flash memory and so become very attractive as another possibility for future memory hierarchies. The research and information on these Non-Volatile Memory (NVM) technologies has matured over the last decade. These NVMs are now being explored thoroughly nowadays as viable replacements for conventional SRAM based memories even for the higher levels of the memory hierarchy. Many other new classes of emerging memory technologies such as transparent and plastic, three-dimensional(3-D), and quantum dot memory technologies have also gained tremendous popularity in recent years...En el campo de la informática, el término ‘memoria’ se refiere generalmente a dispositivos que son usados para almacenar información que posteriormente será usada en diversos dispositivos, desde computadoras personales (PC), móviles, dispositivos inteligentes, etc. La memoria principal del sistema se utiliza para almacenar los datos e instrucciones de los procesos que se encuentre en ejecución, por lo que se requiere que funcionen a alta velocidad (por ejemplo, DRAM). La memoria principal está implementada habitualmente mediante memorias semiconductoras direccionables, siendo DRAM y SRAM los principales exponentes. Por otro lado, la memoria auxiliar o secundaria proporciona almacenaje(para ficheros, por ejemplo); es más lenta pero ofrece una mayor capacidad. Ejemplos típicos de memoria secundaria son discos duros, memorias flash portables, CDs y DVDs. Debido a que estos dispositivos no necesitan estar conectados a la computadora de forma permanente, son muy utilizados para almacenar copias de seguridad. La memoria secundaria almacena una gran cantidad de datos aun coste menor por bit que la memoria principal, siendo habitualmente dos órdenes de magnitud más barata que la memoria primaria. Existen dos tipos de memorias de tipo semiconductor: volátiles y no volátiles. Ejemplos de memorias no volátiles son las memorias Flash (algunas veces usadas como memoria secundaria y otras veces como memoria principal) y memorias ROM/PROM/EPROM/EEPROM (usadas para firmware como programas de arranque). Ejemplos de memoria volátil son las memorias DRAM (RAM dinámica), actualmente la opción predominante a la hora de implementar la memoria principal, y las memorias SRAM (RAM estática) más rápida y costosa, utilizada para los diferentes niveles de cache. Las tecnologías de memorias no volátiles basadas en electrónica de silicio se remontan a la década de1990. Una variante de memoria de almacenaje por carga denominada como memoria Flash es mundialmente usada en productos electrónicos de consumo como telefonía móvil y reproductores de música mientras NAND Flash solid state disks(SSDs) están progresivamente desplazando a los dispositivos de disco duro como principal unidad de almacenamiento en computadoras portátiles, de escritorio e incluso en centros de datos. En la actualidad, hay varios factores que amenazan la actual predominancia de memorias semiconductoras basadas en cargas (capacitivas). Por un lado, se está alcanzando el límite de integración de las memorias Flash, lo que compromete su escalado en el medio plazo. Por otra parte, el fuerte incremento de las corrientes de fuga de los transistores de silicio CMOS actuales, supone un enorme desafío para la integración de memorias SRAM. Asimismo, estas memorias son cada vez más susceptibles a fallos de lectura/escritura en diseños de bajo consumo. Como resultado de estos problemas, que se agravan con cada nueva generación tecnológica, en los últimos años se han intensificado los esfuerzos para desarrollar nuevas tecnologías que reemplacen o al menos complementen a las actuales. Los transistores de efecto campo eléctrico ferroso (FeFET en sus siglas en inglés) se consideran una de las alternativas más prometedores para sustituir tanto a Flash (por su mayor densidad) como a DRAM (por su mayor velocidad), pero aún está en una fase muy inicial de su desarrollo. Hay otras tecnologías algo más maduras, en el ámbito de las memorias RAM resistivas, entre las que cabe destacar ReRAM (o RRAM), STT-RAM, Domain Wall Memory y Phase Change Memory (PRAM)...Depto. de Arquitectura de Computadores y AutomáticaFac. de InformáticaTRUEunpu

    A virtualisation framework for embedded systems

    Get PDF
    • …
    corecore