4,575 research outputs found

    A low-power, high-performance speech recognition accelerator

    Get PDF
    © 2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes,creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.Automatic Speech Recognition (ASR) is becoming increasingly ubiquitous, especially in the mobile segment. Fast and accurate ASR comes at high energy cost, not being affordable for the tiny power-budgeted mobile devices. Hardware acceleration reduces energy-consumption of ASR systems, while delivering high-performance. In this paper, we present an accelerator for largevocabulary, speaker-independent, continuous speech-recognition. It focuses on the Viterbi search algorithm representing the main bottleneck in an ASR system. The proposed design consists of innovative techniques to improve the memory subsystem, since memory is the main bottleneck for performance and power in these accelerators' design. It includes a prefetching scheme tailored to the needs of ASR systems that hides main memory latency for a large fraction of the memory accesses, negligibly impacting area. Additionally, we introduce a novel bandwidth-saving technique that removes off-chip memory accesses by 20 percent. Finally, we present a power saving technique that significantly reduces the leakage power of the accelerators scratchpad memories, providing between 8.5 and 29.2 percent reduction in entire power dissipation. Overall, the proposed design outperforms implementations running on the CPU by orders of magnitude, and achieves speedups between 1.7x and 5.9x for different speech decoders over a highly optimized CUDA implementation running on Geforce-GTX-980 GPU, while reducing the energy by 123-454x.Peer ReviewedPostprint (author's final draft

    Content Delivery Latency of Caching Strategies for Information-Centric IoT

    Full text link
    In-network caching is a central aspect of Information-Centric Networking (ICN). It enables the rapid distribution of content across the network, alleviating strain on content producers and reducing content delivery latencies. ICN has emerged as a promising candidate for use in the Internet of Things (IoT). However, IoT devices operate under severe constraints, most notably limited memory. This means that nodes cannot indiscriminately cache all content; instead, there is a need for a caching strategy that decides what content to cache. Furthermore, many applications in the IoT space are timesensitive; therefore, finding a caching strategy that minimises the latency between content request and delivery is desirable. In this paper, we evaluate a number of ICN caching strategies in regards to latency and hop count reduction using IoT devices in a physical testbed. We find that the topology of the network, and thus the routing algorithm used to generate forwarding information, has a significant impact on the performance of a given caching strategy. To the best of our knowledge, this is the first study that focuses on latency effects in ICN-IoT caching while using real IoT hardware, and the first to explicitly discuss the link between routing algorithm, network topology, and caching effects.Comment: 10 pages, 9 figures, journal pape

    Galactos: Computing the Anisotropic 3-Point Correlation Function for 2 Billion Galaxies

    Get PDF
    The nature of dark energy and the complete theory of gravity are two central questions currently facing cosmology. A vital tool for addressing them is the 3-point correlation function (3PCF), which probes deviations from a spatially random distribution of galaxies. However, the 3PCF's formidable computational expense has prevented its application to astronomical surveys comprising millions to billions of galaxies. We present Galactos, a high-performance implementation of a novel, O(N^2) algorithm that uses a load-balanced k-d tree and spherical harmonic expansions to compute the anisotropic 3PCF. Our implementation is optimized for the Intel Xeon Phi architecture, exploiting SIMD parallelism, instruction and thread concurrency, and significant L1 and L2 cache reuse, reaching 39% of peak performance on a single node. Galactos scales to the full Cori system, achieving 9.8PF (peak) and 5.06PF (sustained) across 9636 nodes, making the 3PCF easily computable for all galaxies in the observable universe.Comment: 11 pages, 7 figures, accepted to SuperComputing 201

    YodaNN: An Architecture for Ultra-Low Power Binary-Weight CNN Acceleration

    Get PDF
    Convolutional neural networks (CNNs) have revolutionized the world of computer vision over the last few years, pushing image classification beyond human accuracy. The computational effort of today's CNNs requires power-hungry parallel processors or GP-GPUs. Recent developments in CNN accelerators for system-on-chip integration have reduced energy consumption significantly. Unfortunately, even these highly optimized devices are above the power envelope imposed by mobile and deeply embedded applications and face hard limitations caused by CNN weight I/O and storage. This prevents the adoption of CNNs in future ultra-low power Internet of Things end-nodes for near-sensor analytics. Recent algorithmic and theoretical advancements enable competitive classification accuracy even when limiting CNNs to binary (+1/-1) weights during training. These new findings bring major optimization opportunities in the arithmetic core by removing the need for expensive multiplications, as well as reducing I/O bandwidth and storage. In this work, we present an accelerator optimized for binary-weight CNNs that achieves 1510 GOp/s at 1.2 V on a core area of only 1.33 MGE (Million Gate Equivalent) or 0.19 mm2^2 and with a power dissipation of 895 {\mu}W in UMC 65 nm technology at 0.6 V. Our accelerator significantly outperforms the state-of-the-art in terms of energy and area efficiency achieving 61.2 TOp/s/[email protected] V and 1135 GOp/s/[email protected] V, respectively

    Unravelling the Impact of Temporal and Geographical Locality in Content Caching Systems

    Get PDF
    To assess the performance of caching systems, the definition of a proper process describing the content requests generated by users is required. Starting from the analysis of traces of YouTube video requests collected inside operational networks, we identify the characteristics of real traffic that need to be represented and those that instead can be safely neglected. Based on our observations, we introduce a simple, parsimonious traffic model, named Shot Noise Model (SNM), that allows us to capture temporal and geographical locality of content popularity. The SNM is sufficiently simple to be effectively employed in both analytical and scalable simulative studies of caching systems. We demonstrate this by analytically characterizing the performance of the LRU caching policy under the SNM, for both a single cache and a network of caches. With respect to the standard Independent Reference Model (IRM), some paradigmatic shifts, concerning the impact of various traffic characteristics on cache performance, clearly emerge from our results.Comment: 14 pages, 11 Figures, 2 Appendice

    Exploring the design space for 3D clustered architectures

    Get PDF
    Journal Article3D die-stacked chips are emerging as intriguing prospects for the future because of their ability to reduce on-chip wire delays and power consumption. However, they will likely cause an increase in chip operating temperature, which is already a major bottleneck in modern microprocessor design. We believe that 3D will provide the highest performance benefit for high-ILP cores, where wire delays for 2D designs can be substantial. A clustered microarchitecture is an example of a complexity-effective implementation of a high-ILP core. In this paper, we consider 3D organizations of a single-threaded clustered microarchitecture to understand how floorplanning impacts performance and temperature. We first show that delays between the data cache and ALUs are most critical to performance. We then present a novel 3D layout that provides the best balance between temperature and performance. The best-performing 3D layout has 12% higher performance than the best-performing 2D layout

    EbbRT: a framework for building per-application library operating systems

    Full text link
    Efficient use of high speed hardware requires operating system components be customized to the application work- load. Our general purpose operating systems are ill-suited for this task. We present EbbRT, a framework for constructing per-application library operating systems for cloud applications. The primary objective of EbbRT is to enable high-performance in a tractable and maintainable fashion. This paper describes the design and implementation of EbbRT, and evaluates its ability to improve the performance of common cloud applications. The evaluation of the EbbRT prototype demonstrates memcached, run within a VM, can outperform memcached run on an unvirtualized Linux. The prototype evaluation also demonstrates an 14% performance improvement of a V8 JavaScript engine benchmark, and a node.js webserver that achieves a 50% reduction in 99th percentile latency compared to it run on Linux
    • …
    corecore