A scalable multi-core architecture with heterogeneous memory structures for Dynamic Neuromorphic Asynchronous Processors (DYNAPs)
Neuromorphic computing systems comprise networks of neurons that use
asynchronous events for both computation and communication. This type of
representation offers several advantages in terms of bandwidth and power
consumption in neuromorphic electronic systems. However, managing the traffic
of asynchronous events in large scale systems is a daunting task, both in terms
of circuit complexity and memory requirements. Here we present a novel routing
methodology that employs both hierarchical and mesh routing strategies and
combines heterogeneous memory structures for minimizing both memory
requirements and latency, while maximizing programming flexibility to support a
wide range of event-based neural network architectures, through parameter
configuration. We validated the proposed scheme in a prototype multi-core
neuromorphic processor chip that employs hybrid analog/digital circuits for
emulating synapse and neuron dynamics together with asynchronous digital
circuits for managing the address-event traffic. We present a theoretical
analysis of the proposed connectivity scheme, describe the methods and circuits
used to implement such a scheme, and characterize the prototype chip. Finally, we
demonstrate the use of the neuromorphic processor with a convolutional neural
network for the real-time classification of visual symbols being flashed to a
dynamic vision sensor (DVS) at high speed.
Comment: 17 pages, 14 figures
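As a rough illustration of the two-level routing idea described in this abstract, the sketch below prunes events at a hierarchical subscription table and then counts mesh hops for the surviving deliveries. The subscription-table structure and all names are hypothetical simplifications, not the chip's actual memory organization or circuits.

```python
# Sketch of hybrid hierarchical/mesh address-event routing, assuming a 2D grid
# of cores and a hypothetical tag-subscription table (illustrative only).

def mesh_hops(src, dst):
    """X-Y dimension-order routing distance between two cores on the mesh."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])

def route_event(src_core, tag, subscriptions):
    """Return the list of (core, hops) deliveries for a broadcast tag.

    `subscriptions` maps a tag to the set of cores whose local memory matches
    it -- the hierarchical level prunes non-subscribing cores, so the mesh
    only carries events that some destination actually needs.
    """
    targets = subscriptions.get(tag, set())
    return sorted((core, mesh_hops(src_core, core)) for core in targets)

subs = {7: {(0, 0), (1, 2)}}          # tag 7 is wanted by two cores
deliveries = route_event((0, 1), 7, subs)
```

Keeping the match memories local to each core is what trades memory for latency here: the mesh never sees a tag that no core subscribed to.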
CLAASIC: a Cortex-Inspired Hardware Accelerator
This work explores the feasibility of specialized hardware implementing the
Cortical Learning Algorithm (CLA) in order to fully exploit its inherent
advantages. This algorithm, which is inspired by the current understanding of
the mammalian neocortex, is the basis of Hierarchical Temporal Memory
(HTM). In contrast to other machine learning (ML) approaches, the structure is
not application dependent and relies on fully unsupervised continuous learning.
We hypothesize that a hardware implementation will be able not only to extend
the already practical uses of these ideas to broader scenarios but also to
exploit the hardware-friendly characteristics of the CLA. The proposed
architecture will enable a degree of scalability that is unfeasible for
software solutions and will fully capitalize on two key CLA advantages: low
computational requirements and reduced storage utilization. Compared to a
state-of-the-art CLA software implementation, it could be possible to improve
performance by 4 orders of magnitude and energy efficiency by up to 8 orders
of magnitude. We propose to use a packet-switched network to tackle this. The
paper addresses the fundamental issues of such an approach and proposes
solutions to achieve scalability. We analyze cost and performance when using well-known
architecture techniques and tools. The results obtained suggest that even with
CMOS technology, under constrained cost, it might be possible to implement a
large-scale system. We found that the proposed solutions enable a saving of 90%
of the original communication costs running either synthetic or realistic
workloads.
Comment: Submitted for publication
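The low computational requirements mentioned above come from the CLA's core step being integer match-counting over sparse binary inputs. The toy sketch below shows that overlap-and-inhibit step in its generic textbook form; the column synapse sets are illustrative, not from any real HTM model.

```python
# Toy sketch of a CLA/HTM spatial-pooler step: each column counts how many of
# its synapses match the sparse binary input, then a global top-k inhibition
# picks the winning columns. Columns and input bits are illustrative.

def spatial_pooler_step(input_bits, columns, k):
    """Return the sorted indices of the k columns with the highest overlap."""
    overlaps = []
    for idx, synapses in enumerate(columns):
        overlap = sum(1 for s in synapses if s in input_bits)
        overlaps.append((overlap, idx))
    overlaps.sort(reverse=True)           # highest overlap first
    return sorted(idx for _, idx in overlaps[:k])

cols = [{0, 2, 5}, {1, 3}, {2, 3, 4}]     # synapse bit positions per column
active = spatial_pooler_step({2, 3, 4}, cols, k=2)
```

Because each column's work is a small popcount followed by a comparison, the step maps naturally onto many simple parallel hardware units, which is the property the accelerator exploits.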
Optimizing Routerless Network-on-Chip Designs: An Innovative Learning-Based Framework
Machine learning applied to architecture design presents a promising
opportunity with broad applications. Recent deep reinforcement learning (DRL)
techniques, in particular, enable efficient exploration in vast design spaces
where conventional design strategies may be inadequate. This paper proposes a
novel deep reinforcement learning framework, taking routerless
networks-on-chip (NoC) as
an evaluation case study. The new framework successfully resolves problems with
prior design approaches being either unreliable due to random searches or
inflexible due to severe design space restrictions. The framework learns
(near-)optimal loop placement for routerless NoCs with various design
constraints. A deep neural network is developed using parallel threads that
efficiently explore the immense routerless NoC design space with Monte Carlo
tree search. Experimental results show that, compared with a conventional mesh,
the proposed deep reinforcement learning (DRL) routerless design achieves a
3.25x increase in throughput, 1.6x reduction in packet latency, and 5x
reduction in power. Compared with the state-of-the-art routerless NoC, DRL
achieves a 1.47x increase in throughput, 1.18x reduction in packet latency, and
1.14x reduction in average hop count, albeit with slightly more power overhead.
Comment: 13 pages, 15 figures
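The loop placements this framework searches over are constrained by a basic feasibility condition: with no routers to switch packets between rings, every pair of nodes must share at least one loop. The sketch below checks that condition for a toy node set; it illustrates the constraint only, not the paper's search framework.

```python
# Feasibility check for a routerless NoC loop placement: every ordered pair of
# nodes must co-occur on some loop (ring), since there are no routers to
# transfer packets between loops. Loop shapes below are illustrative.

def pairs_covered(n_nodes, loops):
    """True if each ordered node pair shares at least one loop."""
    covered = set()
    for loop in loops:
        for a in loop:
            for b in loop:
                if a != b:
                    covered.add((a, b))
    return len(covered) == n_nodes * (n_nodes - 1)

# A 2x2 mesh labelled 0..3: one outer ring already connects every pair,
# while two disjoint 2-node loops leave cross pairs unreachable.
ok = pairs_covered(4, [[0, 1, 3, 2]])
partial = pairs_covered(4, [[0, 1], [2, 3]])
```

A search procedure such as the one described would reject placements failing this check and then score the survivors on metrics like average hop count.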
Learning to update Auto-associative Memory in Recurrent Neural Networks for Improving Sequence Memorization
Learning to remember long sequences remains a challenging task for recurrent
neural networks. Register memory and attention mechanisms have both been
proposed to address the issue, but they either incur a high computational cost
to keep the memory differentiable, or bias RNN representation learning towards
encoding short local contexts rather than long sequences. Associative memory,
which studies the compression of multiple patterns into a fixed-size memory,
has rarely been considered in recent years. Although some recent work tries to
introduce associative memory into RNNs and to mimic the energy decay process
in Hopfield nets, it inherits the shortcoming of rule-based memory updates,
and its memory capacity is limited. This paper proposes a method to learn the
memory update rule jointly with the task objective to improve memory capacity
for remembering long sequences. We also propose an architecture that uses
multiple such associative memories for more complex input encoding. We observe
several interesting findings when comparing against other RNN architectures on
well-studied sequence learning tasks.
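For context on what "rule-based memory updates" means above, the sketch below is a minimal Hopfield-style associative memory with the fixed outer-product (Hebbian) write rule, the kind of baseline whose hand-designed update the paper proposes to replace with a learned one. Patterns are +/-1 vectors; this is a generic textbook sketch, not the paper's architecture.

```python
# Minimal rule-based Hopfield-style associative memory: fixed outer-product
# write, sign-threshold recall. Illustrative baseline only.

def store(patterns, n):
    """Hebbian write: W[i][j] = sum over patterns of x_i * x_j, zero diagonal."""
    w = [[0] * n for _ in range(n)]
    for x in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    w[i][j] += x[i] * x[j]
    return w

def recall(w, probe, steps=5):
    """Synchronous sign-threshold updates that settle toward a stored pattern."""
    x = list(probe)
    for _ in range(steps):
        x = [1 if sum(w[i][j] * x[j] for j in range(len(x))) >= 0 else -1
             for i in range(len(x))]
    return x

w = store([[1, -1, 1, -1]], 4)
out = recall(w, [1, -1, 1, 1])   # probe with one flipped bit
```

The capacity limit of this fixed rule (interference as more patterns are stored) is exactly the motivation for learning the update jointly with the task.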
A Survey and Evaluation of Data Center Network Topologies
Data centers are becoming increasingly popular for their flexibility and
processing capabilities in the modern computing environment. They are managed
by a single entity (administrator) and allow dynamic resource provisioning,
performance optimization as well as efficient utilization of available
resources. Each data center consists of massive compute, network and storage
resources connected with physical wires. The large scale nature of data centers
requires careful planning of compute, storage, network nodes, interconnection
as well as inter-communication for their effective and efficient operations. In
this paper, we present a comprehensive survey and taxonomy of network
topologies either used in commercial data centers, or proposed by researchers
working in this space. We also compare and evaluate some of those topologies
using mininet as well as gem5 simulator for different traffic patterns, based
on various metrics including throughput, latency and bisection bandwidth.
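Of the metrics listed above, bisection bandwidth is the one computable directly from the topology graph: the minimum number of links cut over all balanced bipartitions of the nodes. The brute-force sketch below only works for tiny graphs and is not the survey's evaluation method (which uses mininet and gem5), but it makes the metric concrete.

```python
# Brute-force bisection width of a small topology: minimum links cut over all
# balanced node bipartitions. Exponential in node count; toy graphs only.
from itertools import combinations

def bisection_width(nodes, edges):
    n = len(nodes)
    best = None
    for half in combinations(nodes, n // 2):
        left = set(half)
        cut = sum(1 for a, b in edges if (a in left) != (b in left))
        best = cut if best is None else min(best, cut)
    return best

ring4 = [(0, 1), (1, 2), (2, 3), (3, 0)]                    # 4-node ring
full4 = [(a, b) for a in range(4) for b in range(a + 1, 4)]  # complete graph
```

Multiplying this width by the per-link bandwidth gives the bisection bandwidth figure that data center topology comparisons typically report.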
Data Path Processing in Fast Programmable Routers
The Internet is growing at a fast pace. Link speeds are surging toward 40
Gbps with the emergence of faster link technologies, and new applications are
emerging that require intelligent processing at the intermediate routers.
Switches and routers are becoming the bottleneck in fast communication: on one
hand, faster links deliver more packets every second; on the other hand,
intelligent processing consumes more CPU cycles at the router. These
conflicting goals of providing faster but computationally expensive processing
call for new approaches in designing routers.
This survey takes a look at the core functionalities, like packet
classification, buffer memory management, switch scheduling and output link
scheduling performed by a router in its data path processing and discusses the
algorithms that aim to reduce the performance bound for these operations. An
important requirement for the routers is to provide Quality of Service
guarantees. We propose an algorithm to guarantee QoS in Input Queued Routers.
The traditional hardware solution to speed up router operation has been
Application Specific Integrated Circuits (ASICs). However, the inherent
inflexibility of this approach is a drawback, as network standards and
application requirements are constantly evolving and demand a fast turnaround
time to keep up with the changes. The
promise of Network Processors (NP) is the flexibility of general-purpose
processors together with the speed of ASICs. We will study the architectural
choices for the design of Network Processors and focus on some of the
commercially available NPs. There is a plethora of NP vendors in the market.
The discussion of NP benchmarks provides a common platform for evaluating
these NPs.
Comment: ECSL Technical Report #127, Stony Brook University
Early Routability Assessment in VLSI Floorplans: A Generalized Routing Model
Multiple design iterations are inevitable in nanometer Integrated Circuit
(IC) design flow until desired printability and performance metrics are
achieved. This starts with placement optimization aimed at improving
routability, wirelength, congestion and timing in the design. In contrast, no
such practice exists for a floorplanned layout during the early stage of the
design flow. Recently, STAIRoute \cite{karb2} aimed to address that by
identifying the shortest routing path of a net through a set of routing regions
in the floorplan across multiple metal layers. Since the blocks in
hierarchical ASIC/SoC designs do not use all the permissible routing layers
for the internal routing corresponding to standard cell connectivity, the
proposed STAIRoute framework is not effective for early global routability
assessment. This leads to improper utilization of the routing area,
specifically in higher routing layers with fewer routing blockages, as the
absence of placed standard cells does not facilitate any routing of their
interconnections.
This paper presents a generalized model for early global routability
assessment, HGR, by utilizing the free regions over the blocks beyond certain
metal layers. The proposed (hybrid) routing model comprises (a) the junction
graph model from STAIRoute for routing through the block boundary regions in
lower routing layers, and (b) the grid graph model for routing in higher
layers over the free regions of the blocks.
Experiments with the latest floorplanning benchmarks exhibit an average
reduction of , and in netlength, via count, and congestion
respectively when HGR is used over STAIRoute. Further, we conducted another
experiment on an industrial design flow targeted for process, and the
results are encouraging with X runtime boost when early global routing is
used in conjunction with the existing physical design flow.
Comment: A draft of 24 pages aimed at ACM-TODAES Journal, with 10 figures and
5 tables
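A minimal sketch of the grid-graph side of such a hybrid model is given below, assuming a unit cost per grid edge or via and a preferred-direction layer stack (odd layers horizontal, even layers vertical). The dimensions, costs, and layer assignment are illustrative assumptions, not HGR's actual model.

```python
# BFS shortest path on a (layer, x, y) routing grid with preferred-direction
# layers: odd layers route horizontally, even layers vertically; vias connect
# adjacent layers. Unit cost per move; toy dimensions.
from collections import deque

def grid_route(src, dst, layers, w, h):
    """Shortest path length in grid edges + vias, or -1 if unreachable."""
    dist = {src: 0}
    q = deque([src])
    while q:
        l, x, y = q.popleft()
        if (l, x, y) == dst:
            return dist[(l, x, y)]
        steps = [(l - 1, x, y), (l + 1, x, y)]          # vias up/down
        if l % 2 == 1:
            steps += [(l, x - 1, y), (l, x + 1, y)]     # horizontal layer
        else:
            steps += [(l, x, y - 1), (l, x, y + 1)]     # vertical layer
        for nl, nx, ny in steps:
            node = (nl, nx, ny)
            if 1 <= nl <= layers and 0 <= nx < w and 0 <= ny < h and node not in dist:
                dist[node] = dist[(l, x, y)] + 1
                q.append(node)
    return -1

# Moving 2 tracks in x and 1 in y from layer 1 needs two via hops to reach
# the vertical layer and back: 2 + 1 + 2 = 5 unit moves.
hops = grid_route((1, 0, 0), (1, 2, 1), layers=2, w=3, h=2)
```

A real early router would weight vias more heavily than track segments and overlay congestion costs, but the graph structure is the same.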
The Road Ahead for Networking: A Survey on ICN-IP Coexistence Solutions
In recent years, the current Internet has experienced an unexpected paradigm
shift in the usage model, which has pushed researchers towards the design of
the Information-Centric Networking (ICN) paradigm as a possible replacement of
the existing architecture. Even though both Academia and Industry have
investigated the feasibility and effectiveness of ICN, achieving the complete
replacement of the Internet Protocol (IP) is a challenging task.
Some research groups have already addressed the coexistence problem by
designing their own architectures, but none of these is the definitive
solution for moving towards the future Internet, given the largely unaltered
state of today's networking. To design such an architecture, the research
community now needs a comprehensive overview of the existing solutions that
have so far addressed coexistence.
The purpose of this paper is to reach this goal by providing the first
comprehensive survey and classification of the coexistence architectures
according to their features (i.e., deployment approach, deployment scenarios,
addressed coexistence requirements and architecture or technology used) and
evaluation parameters (i.e., challenges emerging during the deployment and the
runtime behaviour of an architecture). We believe that this paper fills the
gap on the way towards the design of the final coexistence architecture.
Comment: 23 pages, 16 figures, 3 tables
Optimal Placement of Cores, Caches and Memory Controllers in Network On-Chip
Parallel programming is advancing fast and intensive applications need more
resources, so there is a huge demand for on-chip multiprocessors. L1 caches
beside the cores are the fastest memories after registers, but the size of
private caches cannot grow because of design, cost and technology limits.
Split I-caches and D-caches are therefore used together with a shared
last-level cache (LLC). For a unified shared LLC, a bus interface is not
scalable, and a distributed shared LLC (DSLLC) appears to be a better choice.
Most papers assume a distributed shared LLC beside each core in the on-chip
network. Many works assume that DSLLCs are placed at all cores; however, we
show that this design ignores the effect of traffic congestion in the on-chip
network. In fact, our work focuses on the optimal placement of cores, DSLLCs
and even memory controllers to minimize the expected latency based on traffic
load in a mesh on-chip network with a fixed number of cores and a fixed total
cache capacity. We derive an analytical cost function and then optimize the
mean delay of on-chip network communication. The model is verified using
traffic patterns run on the CSIM simulator.
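As a toy version of the placement objective described above, the sketch below picks the tile for a single memory controller on an n x n mesh that minimizes the mean Manhattan hop distance to all tiles. Uniform traffic and a single controller are simplifying assumptions; the paper's model is traffic-weighted and covers cores and DSLLCs as well.

```python
# Toy placement objective: choose the mesh tile for a memory controller that
# minimizes mean Manhattan (hop) distance to all tiles, assuming uniform
# traffic. Stand-in for a traffic-weighted latency model.

def best_controller_tile(n):
    tiles = [(x, y) for x in range(n) for y in range(n)]

    def mean_hops(c):
        return sum(abs(c[0] - t[0]) + abs(c[1] - t[1]) for t in tiles) / len(tiles)

    return min(tiles, key=mean_hops)

center = best_controller_tile(3)   # central tiles win under uniform traffic
```

With non-uniform traffic the optimum shifts toward the heavy communicators, which is exactly why placements that ignore congestion can be suboptimal.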
Scalable NoC-based Neuromorphic Hardware Learning and Inference
Bio-inspired neuromorphic hardware is a research direction that seeks to
approach the brain's computational power and energy efficiency. Spiking neural
networks (SNN) encode information as sparsely distributed spike trains and
employ the spike-timing-dependent plasticity (STDP) mechanism for learning.
Existing hardware implementations of SNNs are limited in scale or do not have
in-hardware
learning capability. In this work, we propose a low-cost scalable
Network-on-Chip (NoC) based SNN hardware architecture with fully distributed
in-hardware STDP learning capability. All hardware neurons work in parallel and
communicate through the NoC. This enables chip-level interconnection,
scalability and reconfigurability necessary for deploying different
applications. The hardware is applied to learn MNIST digits as an evaluation of
its learning capability. We explore the design space to study the trade-offs
between speed, area and energy. How to use this procedure to find an optimal
architecture configuration is also discussed.
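The STDP mechanism mentioned above is, in its common pair-based textbook form, a weight change that decays exponentially in the gap between pre- and post-synaptic spike times: pre-before-post potentiates, post-before-pre depresses. The constants below are illustrative, not taken from this paper's hardware.

```python
# Pair-based STDP rule sketch: weight change for a single pre/post spike pair.
# Amplitudes and time constant are illustrative placeholders.
import math

def stdp_dw(t_pre, t_post, a_plus=0.1, a_minus=0.12, tau=20.0):
    """Weight change for one pre/post spike pair (times in ms)."""
    dt = t_post - t_pre
    if dt > 0:
        return a_plus * math.exp(-dt / tau)    # pre leads post: potentiation
    return -a_minus * math.exp(dt / tau)       # post leads pre: depression

ltp = stdp_dw(t_pre=10.0, t_post=15.0)   # positive change
ltd = stdp_dw(t_pre=15.0, t_post=10.0)   # negative change
```

Because the rule depends only on locally observable spike times, each hardware neuron can apply it independently, which is what makes the fully distributed in-hardware learning described here feasible.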