8,926 research outputs found

    Achieving both High Energy Efficiency and High Performance in On-Chip Communication using Hierarchical Rings with Deflection Routing

    Hierarchical ring networks, which hierarchically connect multiple levels of rings, have been proposed in the past to improve the scalability of ring interconnects, but past hierarchical ring designs sacrifice some of the key benefits of rings by introducing more complex in-ring buffering and buffered flow control. Our goal in this paper is to design a new hierarchical ring interconnect that maintains most of the simplicity of traditional ring designs (no in-ring buffering or buffered flow control) while achieving scalability as high as that of more complex buffered hierarchical ring designs. Our design, called HiRD (Hierarchical Rings with Deflection), includes features that allow us to mostly maintain the simplicity of traditional simple ring topologies while providing higher energy efficiency and scalability. First, HiRD has no buffering or buffered flow control within individual rings and requires only a small amount of buffering between the ring hierarchy levels. When inter-ring buffers are full, our design simply deflects flits so that they circle the ring and try again, which eliminates the need for in-ring buffering. Second, we introduce two simple mechanisms that provide an end-to-end delivery guarantee within the entire network without impacting the critical path or latency of the vast majority of network traffic. HiRD attains equal or better performance at better energy efficiency than multiple versions of both a previous hierarchical ring design and a traditional single ring design. We also analyze our design's characteristics and its injection and delivery guarantees. We conclude that HiRD is a compelling design point that allows higher energy efficiency and scalability while retaining the simplicity and appeal of conventional ring-based designs.
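    The deflection mechanism summarized above lends itself to a small illustration. The following Python sketch is a toy model under assumed parameters (the buffer depth, class and method names are invented for illustration), not the paper's implementation: a flit that wants to move between hierarchy levels enters the small inter-ring buffer if there is space, and is otherwise deflected to circle its current ring and retry later.

```python
from collections import deque

class TransferPoint:
    """Toy model of a HiRD-style ring transfer point (illustrative only)."""

    def __init__(self, buffer_depth=4):
        # Small inter-ring buffer; the rings themselves hold no buffers.
        self.inter_ring_buffer = deque(maxlen=buffer_depth)

    def on_flit_arrival(self, flit, wants_to_change_ring):
        """Return 'buffered' if the flit moves toward the other ring level,
        or 'deflected' if it must keep circling its current ring and retry."""
        if wants_to_change_ring:
            if len(self.inter_ring_buffer) < self.inter_ring_buffer.maxlen:
                self.inter_ring_buffer.append(flit)   # move up/down the hierarchy
                return "buffered"
            return "deflected"                        # buffer full: circle the ring
        return "stay-in-ring"

# Example: with a depth-4 buffer, the fifth transferring flit is deflected.
tp = TransferPoint(buffer_depth=4)
results = [tp.on_flit_arrival(f"flit{i}", True) for i in range(5)]
print(results)  # ['buffered', 'buffered', 'buffered', 'buffered', 'deflected']
```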

    Efficient Processing of Deep Neural Networks: A Tutorial and Survey

    Deep neural networks (DNNs) are currently widely used for many artificial intelligence (AI) applications including computer vision, speech recognition, and robotics. While DNNs deliver state-of-the-art accuracy on many AI tasks, this accuracy comes at the cost of high computational complexity. Accordingly, techniques that enable efficient processing of DNNs to improve energy efficiency and throughput without sacrificing application accuracy or increasing hardware cost are critical to the wide deployment of DNNs in AI systems. This article aims to provide a comprehensive tutorial and survey about the recent advances towards the goal of enabling efficient processing of DNNs. Specifically, it will provide an overview of DNNs, discuss various hardware platforms and architectures that support DNNs, and highlight key trends in reducing the computation cost of DNNs either solely via hardware design changes or via joint hardware design and DNN algorithm changes. It will also summarize various development resources that enable researchers and practitioners to quickly get started in this field, and highlight important benchmarking metrics and design considerations that should be used for evaluating the rapidly growing number of DNN hardware designs, optionally including algorithmic co-designs, being proposed in academia and industry. The reader will take away the following concepts from this article: understand the key design considerations for DNNs; be able to evaluate different DNN hardware implementations with benchmarks and comparison metrics; understand the trade-offs between various hardware architectures and platforms; be able to evaluate the utility of various DNN design techniques for efficient processing; and understand recent implementation trends and opportunities. Comment: Based on a tutorial on DNN hardware at eyeriss.mit.edu/tutorial.htm
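    As one concrete example of the kind of benchmarking metrics such a survey discusses, the sketch below counts multiply-accumulate operations (MACs) for a convolutional layer and converts them to a rough compute-energy estimate. The energy-per-MAC constant and the layer dimensions are illustrative assumptions, not values from the article.

```python
def conv_layer_macs(out_h, out_w, out_c, in_c, k_h, k_w):
    """MAC count of a standard convolution layer."""
    return out_h * out_w * out_c * in_c * k_h * k_w

# Placeholder energy cost per MAC (picojoules); real values depend on the
# technology node, precision, and especially data movement.
ENERGY_PER_MAC_PJ = 4.6  # assumed, illustrative only

macs = conv_layer_macs(out_h=56, out_w=56, out_c=64, in_c=64, k_h=3, k_w=3)
print(f"MACs: {macs:,}, rough compute energy: {macs * ENERGY_PER_MAC_PJ / 1e6:.1f} uJ")
```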

    FINN-R: An End-to-End Deep-Learning Framework for Fast Exploration of Quantized Neural Networks

    Convolutional Neural Networks have rapidly become the most successful machine learning algorithm, enabling ubiquitous machine vision and intelligent decisions even on embedded computing systems. While the underlying arithmetic is structurally simple, compute and memory requirements are challenging. One of the promising opportunities is leveraging reduced-precision representations for inputs, activations and model parameters. The resulting scalability in performance, power efficiency and storage footprint provides interesting design compromises in exchange for a small reduction in accuracy. FPGAs are ideal for exploiting low-precision inference engines, leveraging custom precisions to achieve the required numerical accuracy for a given application. In this article, we describe the second generation of the FINN framework, an end-to-end tool which enables design space exploration and automates the creation of fully customized inference engines on FPGAs. Given a neural network description, the tool optimizes for given platforms, design targets and a specific precision. We introduce formalizations of resource cost functions and performance predictions, and elaborate on the optimization algorithms. Finally, we evaluate a selection of reduced-precision neural networks ranging from CIFAR-10 classifiers to YOLO-based object detection on a range of platforms including PYNQ and AWS F1, demonstrating unprecedented measured throughput of 50 TOp/s on AWS F1 and 5 TOp/s on embedded devices. Comment: To be published in the ACM TRETS Special Edition on Deep Learning
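    A minimal sketch of the kind of reduced-precision representation such a flow targets, assuming plain symmetric uniform quantization (FINN-R's actual quantization schemes and APIs are not reproduced here): weights and activations are stored as small integers plus a scale factor, and the dot product runs entirely on integers before a single rescale.

```python
import numpy as np

def quantize_symmetric(x, bits):
    """Uniform symmetric quantization to a given bit width (illustrative)."""
    qmax = 2 ** (bits - 1) - 1                     # e.g. 1 for 2-bit, 7 for 4-bit
    scale = max(float(np.max(np.abs(x))), 1e-8) / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q.astype(np.int8), scale                # integer codes plus scale factor

# Weights at 2 bits, activations at 4 bits: the dot product uses small
# integers and is rescaled once at the end.
w, w_scale = quantize_symmetric(np.random.randn(64), bits=2)
a, a_scale = quantize_symmetric(np.random.rand(64), bits=4)
y = int(np.dot(w.astype(np.int32), a.astype(np.int32))) * w_scale * a_scale
print(y)
```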

    In-DRAM Bulk Bitwise Execution Engine

    Many applications heavily use bitwise operations on large bitvectors as part of their computation. In existing systems, performing such bulk bitwise operations requires the processor to transfer a large amount of data on the memory channel, thereby consuming high latency, memory bandwidth, and energy. In this paper, we describe Ambit, a recently-proposed mechanism to perform bulk bitwise operations completely inside main memory. Ambit exploits the internal organization and analog operation of DRAM-based memory to achieve low cost, high performance, and low energy. Ambit exposes a new bulk bitwise execution model to the host processor. Evaluations show that Ambit significantly improves the performance of several applications that use bulk bitwise operations, including databases. Comment: arXiv admin note: substantial text overlap with arXiv:1605.06483, arXiv:1610.09603, arXiv:1611.0998
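    Ambit's in-DRAM primitive is, per the underlying work, a bitwise majority across three simultaneously activated rows; AND and OR then follow by fixing one operand row to all zeros or all ones. The sketch below merely emulates that execution model on Python integers treated as bitvectors; it illustrates the semantics, not the DRAM-level mechanism.

```python
def maj(a, b, c):
    """Bitwise majority of three equally sized bitvectors (ints)."""
    return (a & b) | (b & c) | (a & c)

def bulk_and(a, b, width):
    # width unused here; kept for a symmetric interface with bulk_or.
    return maj(a, b, 0)                     # control row fixed to all zeros

def bulk_or(a, b, width):
    return maj(a, b, (1 << width) - 1)      # control row fixed to all ones

WIDTH = 8 * 1024                            # say, an 8 Kb DRAM row
a = int.from_bytes(bytes(range(256)) * 4, "big")
all_ones = (1 << WIDTH) - 1
assert bulk_and(a, all_ones, WIDTH) == a
assert bulk_or(a, 0, WIDTH) == a
print("bitwise AND/OR via majority verified on", WIDTH, "bit vectors")
```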

    CODA: Enabling Co-location of Computation and Data for Near-Data Processing

    Recent studies have demonstrated that near-data processing (NDP) is an effective technique for improving the performance and energy efficiency of data-intensive workloads. However, leveraging NDP in realistic systems with multiple memory modules introduces a new challenge. In today's systems, where no computation occurs in memory modules, the physical address space is interleaved at a fine granularity among all memory modules to help improve the utilization of processor-memory interfaces by distributing the memory traffic. However, this is at odds with efficient use of NDP, which requires careful placement of data in memory modules such that near-data computations and their exclusively used data can be localized in individual memory modules, while shared data is distributed among memory modules to reduce hotspots. To address this new challenge, we propose a set of techniques that (1) enable collections of OS pages to either be fine-grain interleaved among memory modules (as is done today) or to be placed contiguously on individual memory modules (as is desirable for NDP private data), and (2) decide whether to localize or distribute each memory object based on its anticipated access pattern and steer computations to the memory where the data they access is located. Our evaluations across a wide range of workloads show that the proposed mechanism improves performance by 31% and reduces remote data accesses by 38% over a baseline system that cannot exploit compute-data affinity characteristics. Comment: 14 pages, 16 figures
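    The placement decision described above can be illustrated with a small, hedged sketch: an object accessed almost exclusively from one memory module's near-data compute is placed contiguously on that module, while shared objects keep today's fine-grain interleaving. The profiling interface and the 90% threshold are assumptions for illustration, not the paper's mechanism.

```python
def choose_placement(access_counts, private_threshold=0.9):
    """access_counts: accesses to one memory object, keyed by memory module id.

    Returns ('localize', module) when one module dominates the object's
    accesses, else ('interleave', None) to spread traffic and avoid hotspots.
    """
    total = sum(access_counts.values())
    if total == 0:
        return ("interleave", None)
    module, hits = max(access_counts.items(), key=lambda kv: kv[1])
    if hits / total >= private_threshold:
        return ("localize", module)          # place pages contiguously near the compute
    return ("interleave", None)              # shared data: fine-grain interleave as today

print(choose_placement({0: 980, 1: 10, 2: 10}))   # ('localize', 0)
print(choose_placement({0: 300, 1: 350, 2: 350})) # ('interleave', None)
```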

    Ring-Mesh: A Scalable and High-Performance Approach for Manycore Accelerators

    An increasing number of works address the design challenges of fast, scalable solutions for a growing range of new applications. Many recent solutions aim at improving processing element capabilities to speed up execution in the machine learning application domain. However, only a few works have focused on the interconnection subsystem as a potential source of performance improvement. Wrapping many cores together offers excellent parallelism, but it brings other challenges (e.g., adequate interconnection). Scalable, power-aware interconnects are required to support such a growing number of processing elements, as well as modern applications. In this paper, we propose a scalable and energy-efficient Network-on-Chip architecture that fuses the advantages of rings and the 2D mesh without using any bridge router, in order to provide high performance. A dynamic adaptation mechanism allows the network to better adapt to application requirements. Simulation results show efficient power consumption (up to 141.3% saving when connecting 1024 cores) and 2x (on average) throughput growth with better scalability (up to 1024 processing elements) compared to the popular 2D mesh, tested across multiple statistical traffic pattern scenarios. Comment: 35 pages, accepted to the Journal of Supercomputing

    Evolutionary Optimisation of Real-Time Systems and Networks

    The design space of networked embedded systems is very large, posing challenges to the optimisation of such platforms when it comes to supporting applications with real-time guarantees. Recent research has shown that a number of inter-related optimisation problems have a critical influence on the schedulability of a system, i.e. whether all its application components can execute and communicate by their respective deadlines. Examples of such optimisation problems include task allocation and scheduling, communication routing and arbitration, memory allocation, and voltage and frequency scaling. In this paper, we advocate the use of evolutionary approaches to address such optimisation problems, aiming to evolve individuals of increased fitness over multiple generations of potential solutions. We refer to plentiful evidence that existing real-time schedulability tests can be used effectively to guide evolutionary optimisation, either by themselves or in combination with other metrics such as energy dissipation or hardware overheads. We then push that concept one step further and consider the possibility of using evolutionary techniques to evolve the schedulability tests themselves, aiming to support the verification and optimisation of systems which are too complex for state-of-the-art (manual) derivation of schedulability tests.
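    As an illustration of using an existing schedulability test to guide evolutionary search, the toy sketch below evolves a task-to-processor allocation and scores each candidate with the classic Liu and Layland utilisation bound applied per processor. The task set, GA parameters and fitness shaping are invented for illustration and are not the framework advocated in the paper.

```python
import random

TASKS = [(1, 4), (1, 5), (2, 10), (3, 12), (2, 8)]   # (WCET, period) pairs, assumed
N_PROCS = 2

def schedulable(alloc):
    """Liu & Layland utilisation bound, checked per processor."""
    for p in range(N_PROCS):
        tasks = [TASKS[i] for i, a in enumerate(alloc) if a == p]
        if not tasks:
            continue
        u, n = sum(c / t for c, t in tasks), len(tasks)
        if u > n * (2 ** (1 / n) - 1):
            return False
    return True

def fitness(alloc):
    # Reward schedulable allocations, then prefer balanced utilisation.
    util = [sum(c / t for i, (c, t) in enumerate(TASKS) if alloc[i] == p)
            for p in range(N_PROCS)]
    return (1.0 if schedulable(alloc) else 0.0) - (max(util) - min(util))

def evolve(generations=50, pop_size=20):
    pop = [[random.randrange(N_PROCS) for _ in TASKS] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]
        children = []
        for _ in range(pop_size - len(parents)):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, len(TASKS))      # one-point crossover
            child = a[:cut] + b[cut:]
            if random.random() < 0.2:                  # mutation
                child[random.randrange(len(TASKS))] = random.randrange(N_PROCS)
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)

best = evolve()
print(best, "schedulable:", schedulable(best))
```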

    Simulation Environment for Link Energy Estimation in Networks-on-Chip with Virtual Channels

    Network-on-chip (NoC) is the most promising design paradigm for the interconnect architecture of a multiprocessor system-on-chip (MPSoC). On the downside, a NoC has a significant impact on the overall energy consumption of the system. NoC simulators are highly relevant for design space exploration even at an early stage. Since links in a NoC consume up to 50% of its energy, realistic modelling of link energy consumption in NoC simulators is important. This work presents a simulation environment which, for the first time, implements a technique to precisely estimate the data-dependent link energy consumption in NoCs with virtual channels. Our model works at a high level of abstraction, making it feasible to estimate the energy requirements at an early design stage. Additionally, it enables the fast evaluation and early exploration of low-power coding techniques. The presented model is applicable to 2D and 3D NoCs. A case study for an image processing application shows that currently used link models underestimate the link energy consumption by up to a factor of four. In contrast, the technique presented in this paper estimates the energy precisely, with an error below 1% compared to results obtained by precise but computationally expensive bit-level simulation.
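    Data-dependent link energy is driven largely by bit toggles on the link wires, and with virtual channels the flits of different VCs are interleaved on the same physical link, which changes the switching activity. The sketch below counts toggles between physically consecutive flits in transmission order; the energy-per-toggle constant is an assumed placeholder, and coupling effects that a precise model would include are omitted.

```python
def toggles(a, b):
    """Bit transitions between two consecutive words on the link wires."""
    return bin(a ^ b).count("1")

def link_energy(transmitted_flits, energy_per_toggle_pj=0.35):
    """Data-dependent energy of one link.

    transmitted_flits: flits in their actual order on the physical wires,
    i.e. already interleaved across virtual channels.
    energy_per_toggle_pj is an assumed placeholder value.
    """
    t = sum(toggles(a, b) for a, b in zip(transmitted_flits, transmitted_flits[1:]))
    return t * energy_per_toggle_pj

# Interleaving two VCs changes the toggle count compared to sending each
# VC's flits back to back -- exactly the effect the simulator must capture.
vc0, vc1 = [0x00FF, 0x00F0], [0xFF00, 0x0F00]
interleaved = [vc0[0], vc1[0], vc0[1], vc1[1]]
back_to_back = vc0 + vc1
print(f"{link_energy(interleaved):.2f} pJ vs {link_energy(back_to_back):.2f} pJ")
```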

    Estimation of Optimized Energy and Latency Constraints for Task Allocation in 3D Network on Chip

    In Network-on-Chip (NoC) based systems, energy consumption is affected by the task scheduling and allocation schemes, which in turn affect the performance of the system. In this paper we evaluate previously proposed algorithms and introduce a new energy-efficient algorithm for 3D NoC architectures. Efficient dynamic and cluster-based approaches are proposed, along with optimization using a bio-inspired algorithm. The proposed algorithm has been implemented and evaluated on randomly generated benchmarks and real-life applications such as MMS, Telecom, and VOPD. The algorithm has also been tested with the E3S benchmark and compared with the existing Spiral and Crinkle mapping algorithms, showing a larger reduction in communication energy consumption and an improvement in system performance. Experimental analysis shows that the proposed algorithm reduces energy consumption by 49% on average, communication cost by 48%, and average latency by 34%. The cluster-based approach is mapped onto the NoC using the Dynamic Diagonal Mapping (DDMap), Crinkle, and Spiral algorithms, and DDMap is found to provide the best results. Compared with Crinkle and Spiral, cluster mapping with DDMap reduces energy by 14% and 9%, respectively. Comment: 20 pages, 17 figures, International Journal of Computer Science & Information Technology. arXiv admin note: substantial text overlap with arXiv:1404.251
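    The communication-energy objective that such mapping algorithms minimise is commonly expressed with a bit-energy model: the cost of sending one bit grows with the routers and links traversed along the path in the 3D mesh. The constants and the XYZ-routing assumption below are illustrative placeholders, not the paper's calibrated model.

```python
# Hypothetical per-bit energy constants (illustrative placeholders).
E_ROUTER_PJ = 0.8       # energy per bit per router traversed
E_HLINK_PJ = 0.4        # energy per bit per horizontal link
E_VLINK_PJ = 0.1        # energy per bit per vertical (TSV) link

def comm_energy_bit(src, dst):
    """Energy to move one bit from src to dst in a 3D mesh, assuming XYZ routing."""
    dx, dy, dz = (abs(s - d) for s, d in zip(src, dst))
    hops = dx + dy + dz
    return (hops + 1) * E_ROUTER_PJ + (dx + dy) * E_HLINK_PJ + dz * E_VLINK_PJ

def mapping_energy(task_graph, mapping):
    """task_graph: {(src_task, dst_task): bits}; mapping: task -> (x, y, z)."""
    return sum(bits * comm_energy_bit(mapping[s], mapping[d])
               for (s, d), bits in task_graph.items())

graph = {("t0", "t1"): 4096, ("t1", "t2"): 2048}
mapping = {"t0": (0, 0, 0), "t1": (1, 0, 0), "t2": (1, 1, 1)}
print(f"{mapping_energy(graph, mapping) / 1e6:.3f} uJ")
```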

    Energy and Latency Aware Application Mapping Algorithm & Optimization for Homogeneous 3D Network on Chip

    Energy efficiency is one of the most critical issues in the design of a System on Chip. In Network-on-Chip (NoC) based systems, energy consumption is influenced dramatically by the mapping of Intellectual Property (IP) cores, which affects the performance of the system. In this paper we evaluate previously proposed algorithms and introduce a new energy-efficient mapping algorithm for 3D NoC architectures. In addition, a hybrid method has been implemented using a bio-inspired optimization technique (particle swarm optimization). The proposed algorithm has been implemented and evaluated on randomly generated benchmarks and real-life applications such as MMS, Telecom, and VOPD. The algorithm has also been tested with the E3S benchmark and compared with the existing algorithms (Spiral and Crinkle), showing a larger reduction in communication energy consumption and an improvement in system performance. Compared with Spiral and Crinkle, experimental results show that the average reduction in communication energy consumption is 19% and 17%, respectively, while the reduction in communication cost is 24% and 21%, and the reduction in latency is 24% and 22%. When our method and the existing methods are optimized using the bio-inspired technique and compared, the average energy reduction is found to be 18% and 24%. Comment: 15 pages, 11 figures, CCSEA 201
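    The particle swarm optimization step can be sketched generically: particles hold continuous priority vectors that are decoded into task-to-core mappings (a common discretisation trick), and the swarm minimises a simplified communication cost. Everything below (task graph, mesh size, PSO constants, decoding scheme) is an assumption for illustration, not the paper's exact formulation.

```python
import random

# Toy task graph: {(src, dst): communication volume in bits} (assumed).
GRAPH = {(0, 1): 4096, (1, 2): 2048, (0, 3): 1024, (2, 3): 4096}
N_TASKS, MESH = 4, (2, 2, 2)   # 2x2x2 homogeneous 3D mesh

def decode(position):
    """Rank-based decoding: the i-th highest priority gets the i-th core."""
    cores = [(x, y, z) for x in range(MESH[0]) for y in range(MESH[1]) for z in range(MESH[2])]
    order = sorted(range(N_TASKS), key=lambda t: position[t], reverse=True)
    return {t: cores[i] for i, t in enumerate(order)}

def cost(position):
    """Communication cost = volume x Manhattan hop count (simplified energy proxy)."""
    m = decode(position)
    return sum(v * sum(abs(a - b) for a, b in zip(m[s], m[d])) for (s, d), v in GRAPH.items())

def pso(n_particles=15, iters=60, w=0.7, c1=1.5, c2=1.5):
    pos = [[random.random() for _ in range(N_TASKS)] for _ in range(n_particles)]
    vel = [[0.0] * N_TASKS for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    gbest = min(pbest, key=cost)
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(N_TASKS):
                r1, r2 = random.random(), random.random()
                vel[i][d] = (w * vel[i][d] + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            if cost(pos[i]) < cost(pbest[i]):
                pbest[i] = pos[i][:]
        gbest = min(pbest + [gbest], key=cost)
    return gbest

best = pso()
print("best mapping:", decode(best), "cost:", cost(best))
```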