Design and performance optimization of asynchronous networks-on-chip
As digital systems continue to grow in complexity, the design of conventional synchronous systems is facing unprecedented challenges. The number of transistors on individual chips is already in the multi-billion range, and a greatly increasing number of components are being integrated onto a single chip. As a consequence, modern digital designs are under strong time-to-market pressure, and there is a critical need for composable design approaches for large complex systems.
In the past two decades, networks-on-chip (NoC’s) have been a highly active research area. In a NoC-based system, functional blocks are first designed individually and may run at different clock rates. These modules are then connected through a structured network for on-chip global communication. However, due to the rigidity of centrally-clocked NoC’s, there have been bottlenecks in system scalability, energy and performance, which cannot be easily solved with synchronous approaches. As a result, there has been significant recent interest in combining the notion of asynchrony with NoC designs. Since the NoC approach inherently separates the communication infrastructure, and its timing, from computational elements, it is a natural match for an asynchronous paradigm. Asynchronous NoC’s, therefore, enable a modular and extensible system composition for an ‘object-oriented’ design style.
The thesis aims to significantly advance the state of the art and viability of asynchronous and globally-asynchronous locally-synchronous (GALS) networks-on-chip, to enable high-performance and low-energy systems. The proposed asynchronous NoC’s are nearly entirely based on standard cells, which eases their integration into industrial design flows. The contributions are instantiated in three different directions.
First, practical acceleration techniques are proposed for optimizing the system latency, in order to break through the latency bottleneck in the memory interfaces of many on-chip parallel processors. Novel asynchronous network protocols are proposed, along with concrete NoC designs. A new concept, called ‘monitoring network’, is introduced. Monitoring networks are lightweight shadow networks used for fast-forwarding anticipated traffic information, ahead of the actual packet traffic. The routers are therefore allowed to initiate and perform arbitration and channel allocation in advance. The technique is successfully applied to two topologies which belong to two different categories – a variant mesh-of-trees (MoT) structure and a 2D-mesh topology. Considerable and stable latency improvements are observed across a wide range of traffic patterns, along with moderate throughput gains.
Second, for the first time, a high-performance and low-power asynchronous NoC router is compared directly to a leading commercial synchronous counterpart in an advanced industrial technology. The asynchronous router design shows significant performance improvements, as well as area and power savings. The proposed asynchronous router integrates several advanced techniques, including a low-latency circular FIFO for buffer design, and a novel end-to-end credit-based virtual channel (VC) flow control. In addition, a semi-automated design flow is created, which uses portions of a standard synchronous tool flow.
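As a rough illustration of the general idea behind credit-based flow control mentioned above, the following is a minimal behavioral sketch in Python. It is not the thesis's end-to-end virtual-channel scheme or its asynchronous circuit; the class names `CreditSender` and `CreditReceiver` are hypothetical, and the model assumes one credit per free downstream buffer slot.

```python
from collections import deque


class CreditReceiver:
    """Hypothetical downstream buffer with a fixed number of slots."""

    def __init__(self, depth):
        self.depth = depth
        self.buffer = deque()

    def accept(self, flit):
        # If the sender respects its credits, this can never overflow.
        assert len(self.buffer) < self.depth, "credit protocol violated"
        self.buffer.append(flit)

    def drain(self, sender):
        # Freeing a slot hands one credit back to the sender.
        flit = self.buffer.popleft()
        sender.credit_return()
        return flit


class CreditSender:
    """Hypothetical upstream side: sends only while it holds credits."""

    def __init__(self, credits):
        self.credits = credits  # assumed: one credit per free slot downstream

    def send(self, flit, receiver):
        if self.credits == 0:
            return False  # stall: no buffer space is known to be free
        self.credits -= 1
        receiver.accept(flit)
        return True

    def credit_return(self):
        self.credits += 1


receiver = CreditReceiver(depth=2)
sender = CreditSender(credits=2)
results = [sender.send(f, receiver) for f in ("a", "b", "c")]
first_out = receiver.drain(sender)
retry = sender.send("c", receiver)
```

In this sketch the third send stalls until the receiver drains a slot and returns a credit, which is the property that lets a sender avoid overflowing a remote buffer without a per-flit acknowledgement.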
Finally, a high-performance multi-resource asynchronous arbiter design is developed. This small but important component can be directly used in existing asynchronous NoC’s for performance optimization. In addition, this standalone design promises use in opening up new NoC directions, as well as for general use in parallel systems. In the proposed arbiter design, the allocation of a resource to a client is divided into several steps. Multiple successive client-resource pairs can be selected rapidly in pipelined sequence, and the completion of the assignments can overlap in parallel.
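The allocation problem that a multi-resource arbiter solves can be sketched behaviorally as follows. This is only an untimed software model under stated assumptions: it pairs waiting clients with free resources in first-come, first-served order, and it does not model the pipelined selection or overlapped completion described above. The class name `MultiResourceArbiter` and its methods are hypothetical.

```python
from collections import deque


class MultiResourceArbiter:
    """Untimed model: match waiting clients to free resources (FIFO, assumed)."""

    def __init__(self, resources):
        self.free = deque(resources)   # resources not currently granted
        self.waiting = deque()         # clients with an outstanding request
        self.grants = {}               # client -> resource currently held

    def request(self, client):
        self.waiting.append(client)
        return self._match()

    def release(self, client):
        self.free.append(self.grants.pop(client))
        return self._match()

    def _match(self):
        # Form as many client-resource pairs as currently possible.
        new_pairs = []
        while self.waiting and self.free:
            client = self.waiting.popleft()
            resource = self.free.popleft()
            self.grants[client] = resource
            new_pairs.append((client, resource))
        return new_pairs


arb = MultiResourceArbiter(["r0", "r1"])
g1 = arb.request("a")
g2 = arb.request("b")
g3 = arb.request("c")      # no free resource yet, so "c" waits
g4 = arb.release("a")      # "r0" is freed and handed to "c"
```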
In sum, the thesis provides a set of advanced design solutions for performance optimization of asynchronous and GALS networks-on-chip. These solutions are at different levels, from network protocols, down to router- and component-level optimizations, which can be directly applied to existing basic asynchronous NoC designs to provide a leap in performance improvement.
Mesh-of-Trees Interconnection Network for an Explicitly Multi-Threaded Parallel Computer Architecture
As the multiple-decade long increase in clock rates starts to
slow down, main-stream general-purpose processors evolve towards
single-chip parallel processing.
On-chip interconnection networks are essential components of such
machines, supporting the communication between processors and
the memory system.
This task is especially challenging for some easy-to-program
parallel computers, which are designed with performance-demanding
memory systems.
This study proposes an interconnection network, with
a novel implementation of the Mesh-of-Trees (MoT) topology.
The MoT network is evaluated relative to metrics such as wire area
complexity, total register
count, bandwidth, network diameter, single switch delay, maximum
throughput per area, trade-offs between
throughput and latency, and post-layout performance.
It is also compared with some other traditional
network topologies, such as mesh, ring, hypercube, butterfly, fat
trees, butterfly fat trees, and replicated butterfly
networks.
Concrete results show that MoT provides
higher throughput and lower latency especially when the input
traffic (or the on-chip parallelism) is high, at comparable
area cost.
The layout of the MoT network is evaluated using standard cell design
methodology. A prototype chip with an 8-terminal MoT network
was taped out at technology and tested.
In the context of an easy-to-program single-chip parallel processor,
MoT network is
embedded in the eXplicit Multi-Threading (XMT) architecture, and
evaluated by running parallel applications.
In addition to the basic MoT architecture,
a novel hybrid extension of MoT is proposed, which allows
significant area savings with a small reduction in throughput.
08371 Abstracts Collection -- Fault-Tolerant Distributed Algorithms on VLSI Chips
In September 2008, the Dagstuhl Seminar 08371 "Fault-Tolerant
Distributed Algorithms on VLSI Chips" was held in Schloss
Dagstuhl – Leibniz Center for Informatics. The seminar was devoted to
exploring whether the wealth of existing fault-tolerant distributed
algorithms research can be utilized for meeting the challenges of
future-generation VLSI chips. During the seminar, several participants
from both the VLSI and distributed algorithms disciplines presented
their current research, and ongoing work and possibilities for
collaboration were discussed. Abstracts of the presentations given
during the seminar as well as abstracts of seminar results and ideas
are put together in this paper. The first section describes the
seminar topics and goals in general. Links to extended abstracts or
full papers are provided, if available.
ARBITRATE-AND-MOVE PRIMITIVES FOR HIGH THROUGHPUT ON-CHIP INTERCONNECTION NETWORKS
An n-leaf pipelined balanced binary tree is used for
arbitration of order and movement of data from n input
ports to one output port. A novel arbitrate-and-move
primitive circuit for every node of the tree, which is based on
a concept of reduced synchrony that benefits from attractive
features of both asynchronous and synchronous designs, is
presented. The design objective of the pipelined binary tree
is to provide a key building block in a high-throughput
mesh-of-trees interconnection network for Explicit Multi
Threading (XMT) architecture, a recently introduced
parallel computation framework. The proposed reduced
synchrony circuit was compared with asynchronous and
synchronous designs of arbitrate-and-move primitives.
Simulations with 0.18 µm technology show that compared to
an asynchronous design, the proposed reduced synchrony
implementation achieves a higher throughput, up to 2 Giga-
Requests per second on an 8-leaf binary tree. Our circuit
also consumes less power than the synchronous design, and
requires less silicon area than both the synchronous and
asynchronous designs.
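The ordering behavior of an n-leaf binary arbitration tree can be illustrated with a small untimed Python model. This is a sketch under assumptions, not the reduced-synchrony circuit: each node is assumed to alternate between its two inputs when both hold data, and pipelining and timing are ignored. The function names `merge_two` and `arbitrate_tree` are hypothetical.

```python
def merge_two(left, right):
    """Two-input node: alternate between inputs while both have data (assumed policy)."""
    out = []
    i = j = 0
    take_left = True
    while i < len(left) or j < len(right):
        if i < len(left) and (take_left or j >= len(right)):
            out.append(left[i])
            i += 1
            take_left = False
        else:
            out.append(right[j])
            j += 1
            take_left = True
    return out


def arbitrate_tree(ports):
    """Merge n input ports (n a power of two) into one output order via a balanced tree."""
    if len(ports) == 1:
        return list(ports[0])
    mid = len(ports) // 2
    return merge_two(arbitrate_tree(ports[:mid]), arbitrate_tree(ports[mid:]))


order = arbitrate_tree([["a0", "a1"], ["b0"], ["c0"], []])
```

With four input ports, requests from the left and right subtrees interleave at the root, so no single port can monopolize the output while others have pending data.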
Speck: A Smart event-based Vision Sensor with a low latency 327K Neuron Convolutional Neuronal Network Processing Pipeline
Edge computing solutions that enable the extraction of high level information
from a variety of sensors are in increasingly high demand. This is due to the
increasing number of smart devices that require sensory processing for their
application on the edge. To tackle this problem, we present a smart vision
sensor System on Chip (SoC), featuring an event-based camera and a low power
asynchronous spiking Convolutional Neuronal Network (sCNN) computing
architecture embedded on a single chip. By combining both sensor and processing
on a single die, we can lower unit production costs significantly. Moreover,
the simple end-to-end nature of the SoC facilitates small stand-alone
applications as well as functioning as an edge node in a larger system. The
event-driven nature of the vision sensor delivers high-speed signals in a
sparse data stream. This is reflected in the processing pipeline, which focuses on
optimising highly sparse computation and minimising latency for 9 sCNN layers
to . Overall, this results in an extremely low-latency visual
processing pipeline deployed on a small form factor with a low energy budget
and sensor cost. We present the asynchronous architecture, the individual
blocks, the sCNN processing principle and benchmark against other sCNN capable
processors.
A High-Throughput, Low-Power Asynchronous Mesh-of-Trees Interconnection Network for the Explicit Multi-Threading (XMT) Parallel Architecture
This thesis presents an asynchronous (clockless) Mesh-of-Trees network that
consumes less power and area than the synchronous Mesh-of-Trees network,
while maintaining high throughput and low latency.
Two new asynchronous designs are proposed for the fundamental
pipelined components of the network (routing and arbitration),
which are optimized for power, area, latency and throughput.
Mixed-timing interfaces are added to create a mixed-timing network
which provides communication between synchronous and asynchronous
domains.
Two issues top the agenda of CPU design in the emerging many-core era:
programmers' productivity and power consumption.
Through its reliance on the richest available theory of parallel algorithms,
the eXplicit Multi-Threading (XMT) parallel architecture addresses
programmers' productivity.
The motivation for this work is to provide an effective
interconnection network for the XMT architecture
in terms of both performance and power consumption.
Performance of the XMT processor with the mixed-timing network is
measured for several applications
- …