    A scalable multi-core architecture with heterogeneous memory structures for Dynamic Neuromorphic Asynchronous Processors (DYNAPs)

    Neuromorphic computing systems comprise networks of neurons that use asynchronous events for both computation and communication. This type of representation offers several advantages in terms of bandwidth and power consumption in neuromorphic electronic systems. However, managing the traffic of asynchronous events in large scale systems is a daunting task, both in terms of circuit complexity and memory requirements. Here we present a novel routing methodology that employs both hierarchical and mesh routing strategies and combines heterogeneous memory structures for minimizing both memory requirements and latency, while maximizing programming flexibility to support a wide range of event-based neural network architectures, through parameter configuration. We validated the proposed scheme in a prototype multi-core neuromorphic processor chip that employs hybrid analog/digital circuits for emulating synapse and neuron dynamics together with asynchronous digital circuits for managing the address-event traffic. We present a theoretical analysis of the proposed connectivity scheme, describe the methods and circuits used to implement such a scheme, and characterize the prototype chip. Finally, we demonstrate the use of the neuromorphic processor with a convolutional neural network for the real-time classification of visual symbols being flashed to a dynamic vision sensor (DVS) at high speed. Comment: 17 pages, 14 figures

    CLAASIC: a Cortex-Inspired Hardware Accelerator

    This work explores the feasibility of specialized hardware implementing the Cortical Learning Algorithm (CLA) in order to fully exploit its inherent advantages. This algorithm, which is inspired by the current understanding of the mammalian neo-cortex, is the basis of the Hierarchical Temporal Memory (HTM). In contrast to other machine learning (ML) approaches, the structure is not application dependent and relies on fully unsupervised continuous learning. We hypothesize that a hardware implementation will be able not only to extend the already practical uses of these ideas to broader scenarios but also to exploit the hardware-friendly CLA characteristics. The proposed architecture will enable a degree of scalability that is infeasible for software solutions and will fully capitalize on one of the many CLA advantages: low computational requirements and reduced storage utilization. Compared to a state-of-the-art CLA software implementation, it could be possible to improve performance by 4 orders of magnitude and energy efficiency by up to 8 orders of magnitude. We propose to use a packet-switched network to tackle this. The paper addresses the fundamental issues of such an approach, proposing solutions to achieve scalability. We analyze cost and performance when using well-known architecture techniques and tools. The results obtained suggest that even with CMOS technology, under constrained cost, it might be possible to implement a large-scale system. We found that the proposed solutions save 90% of the original communication costs when running either synthetic or realistic workloads. Comment: Submitted for publication

    Optimizing Routerless Network-on-Chip Designs: An Innovative Learning-Based Framework

    Machine learning applied to architecture design presents a promising opportunity with broad applications. Recent deep reinforcement learning (DRL) techniques, in particular, enable efficient exploration in vast design spaces where conventional design strategies may be inadequate. This paper proposes a novel deep reinforcement framework, taking routerless networks-on-chip (NoC) as an evaluation case study. The new framework successfully resolves problems with prior design approaches being either unreliable due to random searches or inflexible due to severe design space restrictions. The framework learns (near-)optimal loop placement for routerless NoCs with various design constraints. A deep neural network is developed using parallel threads that efficiently explore the immense routerless NoC design space with a Monte Carlo search tree. Experimental results show that, compared with conventional mesh, the proposed deep reinforcement learning (DRL) routerless design achieves a 3.25x increase in throughput, 1.6x reduction in packet latency, and 5x reduction in power. Compared with the state-of-the-art routerless NoC, DRL achieves a 1.47x increase in throughput, 1.18x reduction in packet latency, and 1.14x reduction in average hop count albeit with slightly more power overhead. Comment: 13 pages, 15 figures

    Learning to update Auto-associative Memory in Recurrent Neural Networks for Improving Sequence Memorization

    Learning to remember long sequences remains a challenging task for recurrent neural networks. Register memory and attention mechanisms were both proposed to resolve the issue, but they either incur a high computational cost to retain memory differentiability or bias the RNN representation learning towards encoding shorter local contexts rather than encouraging long sequence encoding. Associative memory, which studies the compression of multiple patterns in a fixed size memory, has rarely been considered in recent years. Although some recent work tries to introduce associative memory in RNNs and mimic the energy decay process in Hopfield nets, it inherits the shortcoming of rule-based memory updates, and the memory capacity is limited. This paper proposes a method to learn the memory update rule jointly with the task objective to improve memory capacity for remembering long sequences. We also propose an architecture that uses multiple such associative memories for more complex input encoding. We observed some interesting facts when compared to other RNN architectures on some well-studied sequence learning tasks.

    A Survey and Evaluation of Data Center Network Topologies

    Data centers are becoming increasingly popular for their flexibility and processing capabilities in the modern computing environment. They are managed by a single entity (administrator) and allow dynamic resource provisioning, performance optimization as well as efficient utilization of available resources. Each data center consists of massive compute, network and storage resources connected with physical wires. The large scale nature of data centers requires careful planning of compute, storage, network nodes, interconnection as well as inter-communication for their effective and efficient operation. In this paper, we present a comprehensive survey and taxonomy of network topologies either used in commercial data centers or proposed by researchers working in this space. We also compare and evaluate some of those topologies using Mininet as well as the gem5 simulator for different traffic patterns, based on various metrics including throughput, latency and bisection bandwidth.

    Data Path Processing in Fast Programmable Routers

    The Internet is growing at a fast pace. Link speeds are surging toward 40 Gbps with the emergence of faster link technologies. New applications are coming up which require intelligent processing at the intermediate routers. Switches and routers are becoming the bottlenecks in fast communication. On one hand, faster links deliver more packets every second, and on the other hand, intelligent processing consumes more CPU cycles at the router. The conflicting goals of providing faster but computationally expensive processing call for new approaches in designing routers. This survey takes a look at the core functionalities performed by a router in its data path processing, such as packet classification, buffer memory management, switch scheduling and output link scheduling, and discusses the algorithms that aim to reduce the performance bound for these operations. An important requirement for routers is to provide Quality of Service guarantees. We propose an algorithm to guarantee QoS in Input Queued Routers. The hardware solution to speed up router operation was Application Specific Integrated Circuits (ASICs). But the inherent inflexibility of the method is a demerit, as network standards and application requirements are constantly evolving and demand a faster turnaround time to keep up with the changes. The promise of Network Processors (NP) is the flexibility of general-purpose processors together with the speed of ASICs. We study the architectural choices for the design of Network Processors and focus on some of the commercially available NPs. There is a plethora of NP vendors in the market. The discussion on the NP benchmarks sets the normalizing platform to evaluate these NPs. Comment: ECSL Technical Report #127, Stony Brook University

    Early Routability Assessment in VLSI Floorplans: A Generalized Routing Model

    Multiple design iterations are inevitable in the nanometer Integrated Circuit (IC) design flow until the desired printability and performance metrics are achieved. This starts with placement optimization aimed at improving routability, wirelength, congestion and timing in the design. In contrast, no such practice exists on a floorplanned layout during the early stage of the design flow. Recently, STAIRoute \cite{karb2} aimed to address that by identifying the shortest routing path of a net through a set of routing regions in the floorplan in multiple metal layers. Since the blocks in hierarchical ASIC/SoC designs do not use all the permissible routing layers for the internal routing corresponding to standard cell connectivity, the proposed STAIRoute framework is not effective for early global routability assessment. This leads to improper utilization of routing area, specifically in higher routing layers with fewer routing blockages, as the lack of placement of standard cells does not facilitate any routing of their interconnections. This paper presents a generalized model for early global routability assessment, HGR, by utilizing the free regions over the blocks beyond certain metal layers. The proposed (hybrid) routing model comprises (a) the junction graph model in STAIRoute, routing through the block boundary regions in lower routing layers, and (b) the grid graph model for routing in higher layers over the free regions of the blocks. Experiments with the latest floorplanning benchmarks exhibit an average reduction of 4%, 54% and 70% in netlength, via count, and congestion respectively when HGR is used over STAIRoute. Further, we conducted another experiment on an industrial design flow targeted for a 45 nm process, and the results are encouraging, with a ~3X runtime boost when early global routing is used in conjunction with the existing physical design flow. Comment: A draft of 24 pages aimed at ACM-TODAES Journal, with 10 figures and 5 tables

    The Road Ahead for Networking: A Survey on ICN-IP Coexistence Solutions

    In recent years, the current Internet has experienced an unexpected paradigm shift in the usage model, which has pushed researchers towards the design of the Information-Centric Networking (ICN) paradigm as a possible replacement of the existing architecture. Even though both Academia and Industry have investigated the feasibility and effectiveness of ICN, achieving the complete replacement of the Internet Protocol (IP) is a challenging task. Some research groups have already addressed the coexistence by designing their own architectures, but none of those is the final solution to move towards the future Internet considering the unaltered state of networking. To design such an architecture, the research community now needs a comprehensive overview of the existing solutions that have so far addressed the coexistence. The purpose of this paper is to reach this goal by providing the first comprehensive survey and classification of the coexistence architectures according to their features (i.e., deployment approach, deployment scenarios, addressed coexistence requirements and architecture or technology used) and evaluation parameters (i.e., challenges emerging during the deployment and the runtime behaviour of an architecture). We believe that this paper will finally fill the gap required for moving towards the design of the final coexistence architecture. Comment: 23 pages, 16 figures, 3 tables

    Optimal Placement of Cores, Caches and Memory Controllers in Network On-Chip

    Parallel programming is emerging fast and intensive applications need more resources, so there is a huge demand for on-chip multiprocessors. Access to the L1 caches beside the cores is the fastest after registers, but the size of private caches cannot increase because of design, cost and technology limits. Split I-cache and D-cache are therefore used with a shared LLC (last level cache). For a unified shared LLC, a bus interface is not scalable, and a distributed shared LLC (DSLLC) appears to be a better choice. Most papers assume a distributed shared LLC beside each core in the on-chip network. Many works assume that DSLLCs are placed in all cores; however, we will show that this design ignores the effect of traffic congestion in the on-chip network. Our work focuses on the optimal placement of cores, DSLLCs and even memory controllers to minimize the expected latency based on traffic load in a mesh on-chip network with a fixed number of cores and total cache capacity. We use analytical modeling to derive the intended cost function and then optimize the mean delay of the on-chip network communication. This work is intended to be verified using traffic patterns run on the CSIM simulator.
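
As a rough illustration of why placement matters in a mesh, the sketch below compares the mean hop count from every tile to its nearest cache tile for two placements. It is not the paper's cost function: it assumes Manhattan (XY) routing distance as a latency proxy and ignores congestion, and the names `mean_hops` and `cache_tiles` are hypothetical.

```python
from itertools import product


def mean_hops(width, height, cache_tiles):
    """Average Manhattan distance from each mesh tile to its nearest cache tile.

    Assumption: hop count under XY routing stands in for latency;
    traffic load and congestion are not modeled here.
    """
    tiles = list(product(range(width), range(height)))
    total = 0
    for x, y in tiles:
        # Distance to the closest cache tile for this core position.
        total += min(abs(x - cx) + abs(y - cy) for cx, cy in cache_tiles)
    return total / len(tiles)


# Compare a single cache slice in the corner versus near the center of a 4x4 mesh.
corner = mean_hops(4, 4, [(0, 0)])
center = mean_hops(4, 4, [(1, 1)])
print(corner, center)
```

In this toy setting the near-center placement yields a lower mean hop count than the corner placement; a congestion-aware model such as the one the abstract describes would weight these distances by traffic load.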

    Scalable NoC-based Neuromorphic Hardware Learning and Inference

    Bio-inspired neuromorphic hardware is a research direction that aims to approach the brain's computational power and energy efficiency. Spiking neural networks (SNN) encode information as sparsely distributed spike trains and employ a spike-timing-dependent plasticity (STDP) mechanism for learning. Existing hardware implementations of SNNs are limited in scale or do not have in-hardware learning capability. In this work, we propose a low-cost scalable Network-on-Chip (NoC) based SNN hardware architecture with fully distributed in-hardware STDP learning capability. All hardware neurons work in parallel and communicate through the NoC. This enables the chip-level interconnection, scalability and reconfigurability necessary for deploying different applications. The hardware is applied to learn MNIST digits as an evaluation of its learning capability. We explore the design space to study the trade-offs between speed, area and energy. We also discuss how to use this procedure to find the optimal architecture configuration.
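
For readers unfamiliar with STDP, a minimal sketch of the textbook pair-based rule follows. It is not taken from the paper's hardware: the exponential window, the function name `stdp_dw`, and all parameter values (`a_plus`, `a_minus`, `tau_plus`, `tau_minus`) are illustrative assumptions.

```python
import math


def stdp_dw(dt, a_plus=0.01, a_minus=0.012, tau_plus=20.0, tau_minus=20.0):
    """Weight change for one pre/post spike pair (pair-based STDP).

    dt = t_post - t_pre in milliseconds. All constants are assumed,
    illustrative values, not parameters from the abstract.
    """
    if dt > 0:
        # Pre-synaptic spike precedes post-synaptic spike: potentiation.
        return a_plus * math.exp(-dt / tau_plus)
    if dt < 0:
        # Post-synaptic spike precedes pre-synaptic spike: depression.
        return -a_minus * math.exp(dt / tau_minus)
    return 0.0


# Closely timed causal pairs strengthen the synapse more than distant ones.
print(stdp_dw(5.0), stdp_dw(50.0), stdp_dw(-5.0))
```

The update depends only on the relative timing of a spike pair at one synapse, so it is a local rule, which is in keeping with the fully distributed in-hardware learning the abstract describes.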