Architectural Exploration and Design Methodologies of Photonic Interconnection Networks
Photonic technology is becoming an increasingly attractive solution to the problems facing today's electronic chip-scale interconnection networks. Recent progress in silicon photonics research has enabled the demonstration of all the necessary optical building blocks for creating extremely high-bandwidth density and energy-efficient links for on- and off-chip communications. From the feasibility and architecture perspective, however, photonics represents a dramatic paradigm shift from traditional electronic network designs due to fundamental differences in how electronics and photonics function and behave. As a result of these differences, new modeling and analysis methods must be employed in order to properly realize a functional photonic chip-scale interconnect design. In this work, we present a methodology for characterizing and modeling fundamental photonic building blocks which can subsequently be combined to form full photonic network architectures. We also describe a set of tools which can be utilized to assess the physical-layer and system-level performance properties of a photonic network. The models and tools are integrated in a novel open-source design and simulation environment called PhoenixSim. Next, we leverage PhoenixSim for the study of chip-scale photonic networks. We examine several photonic networks through the synergistic study of both physical-layer metrics and system-level metrics. This holistic analysis method enables us to provide deeper insight into architecture scalability since it considers insertion loss, crosstalk, and power dissipation. In addition to these novel physical-layer metrics, traditional system-level metrics of bandwidth and latency are also obtained. Lastly, we propose a novel routing architecture known as wavelength-selective spatial routing. This routing architecture is analogous to electronic virtual channels since it enables the transmission of multiple logical optical channels through a single physical plane (i.e., the waveguides). The available wavelength channels are partitioned into separate groups, and each group is routed independently in the network. Each partition is spectrally multiplexed, as opposed to temporally multiplexed in the electronic case. The wavelength-selective spatial routing technique benefits network designers by providing lower contention and increased path diversity.
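The partitioning step described in this abstract can be illustrated with a minimal sketch. It assumes contiguous, near-equal groups of channel indices and a hypothetical group-selection rule (`assign_group`); neither is taken from the thesis or from PhoenixSim, and both names are illustrative only.

```python
from typing import List


def partition_wavelengths(num_channels: int, num_groups: int) -> List[List[int]]:
    """Split channel indices 0..num_channels-1 into contiguous groups.

    Assumption: groups are contiguous and as equal in size as possible.
    """
    if num_groups <= 0 or num_channels < num_groups:
        raise ValueError("need at least one channel per group")
    base, extra = divmod(num_channels, num_groups)
    groups: List[List[int]] = []
    start = 0
    for g in range(num_groups):
        # The first `extra` groups take one additional channel.
        size = base + (1 if g < extra else 0)
        groups.append(list(range(start, start + size)))
        start += size
    return groups


def assign_group(src: int, dst: int, num_groups: int) -> int:
    """Hypothetical policy: derive the group from the source/destination pair."""
    return (src + dst) % num_groups


if __name__ == "__main__":
    groups = partition_wavelengths(8, 3)
    print(groups)
    print(groups[assign_group(1, 2, 3)])
```

Each group then acts like one logical channel sharing the same waveguides, which is the sense in which the abstract compares the scheme to electronic virtual channels.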
On-board B-ISDN fast packet switching architectures. Phase 2: Development. Proof-of-concept architecture definition report
For the next-generation packet switched communications satellite system with onboard processing and spot-beam operation, a reliable onboard fast packet switch is essential to route packets from different uplink beams to different downlink beams. The rapid emergence of point-to-point services such as video distribution, and the large demand for video conference, distributed data processing, and network management make the multicast function essential to a fast packet switch (FPS). The satellite's inherent broadcast features give the satellite network an advantage over the terrestrial network in providing multicast services. This report evaluates alternate multicast FPS architectures for onboard baseband switching applications and selects a candidate for subsequent breadboard development. Architecture evaluation and selection will be based on the study performed in phase 1, 'Onboard B-ISDN Fast Packet Switching Architectures', and other switch architectures which have become commercially available as large scale integration (LSI) devices.
Mesh-of-Trees Interconnection Network for an Explicitly Multi-Threaded Parallel Computer Architecture
As the multi-decade increase in clock rates starts to
slow down, mainstream general-purpose processors are evolving towards
single-chip parallel processing.
On-chip interconnection networks are essential components of such
machines, supporting the communication between processors and
the memory system.
This task is especially challenging for some easy-to-program
parallel computers, which are designed with performance-demanding
memory systems.
This study proposes an interconnection network, with
a novel implementation of the Mesh-of-Trees (MoT) topology.
The MoT network is evaluated relative to metrics such as wire area
complexity, total register
count, bandwidth, network diameter, single switch delay, maximum
throughput per area, trade-offs between
throughput and latency, and post-layout performance.
It is also compared with some other traditional
network topologies, such as mesh, ring, hypercube, butterfly, fat
trees, butterfly fat trees, and replicated butterfly
networks.
Concrete results show that MoT provides
higher throughput and lower latency especially when the input
traffic (or the on-chip parallelism) is high, at comparable
area cost.
The layout of the MoT network is evaluated using a standard cell design
methodology. A prototype chip with an 8-terminal MoT network
was taped out at technology and tested.
In the context of an easy-to-program single-chip parallel processor,
the MoT network is
embedded in the eXplicit Multi-Threading (XMT) architecture, and
evaluated by running parallel applications.
In addition to the basic MoT architecture,
a novel hybrid extension of MoT is proposed, which allows
significant area savings with a small reduction in throughput.
Scheduling algorithms for high-speed switches
The virtual output queued (VOQ) switching architecture was adopted for high speed switch implementation owing to its scalability and high throughput. An ideal VOQ algorithm should provide Quality of Service (QoS) with low complexity. However, none of the existing algorithms can meet these requirements.
Several algorithms for VOQ switches are introduced in this dissertation in order to improve upon existing algorithms in terms of implementation or QoS features. Initially, the earliest due date first matching (EDDFM) algorithm, which is stable for both uniform and non-uniform traffic patterns, is proposed. EDDFM has a lower probability of overdue cells than other existing maximum weight matching algorithms. Then, the shadow departure time algorithm (SDTA) and iterative SDTA (ISDTA) are introduced. The QoS features of SDTA and ISDTA are better than those of other existing algorithms with the same computational complexity. Simulations show that the performance of a VOQ switch using ISDTA with a speedup of 1.5 is similar to that of an output queued (OQ) switch in terms of cell delay and throughput. Later, the enhanced Birkhoff-von Neumann decomposition (EBVND) algorithm, based on the Birkhoff-von Neumann decomposition (BVND) algorithm, which can provide rate and cell delay guarantees, is introduced. Theoretical analysis shows that the performance of EBVND is better than that of BVND in terms of throughput and cell delay. Finally, the maximum credit first (MCF), the enhanced MCF (EMCF), and the iterative MCF (IMCF) algorithms are presented. These new algorithms perform similarly to BVND, yet are easier to implement in practice.
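For readers unfamiliar with the baseline that EBVND builds on, the following is a minimal sketch of the plain Birkhoff-von Neumann decomposition: a doubly stochastic rate matrix is written as a weighted sum of permutations, each of which is one crossbar configuration. It uses a brute-force permutation search, so it is only practical for very small switches, and it does not reproduce the dissertation's enhanced algorithm. The function name `bvn_decompose` is illustrative.

```python
from fractions import Fraction
from itertools import permutations
from typing import List, Tuple

Matrix = List[List[Fraction]]


def bvn_decompose(rates: Matrix) -> List[Tuple[Fraction, Tuple[int, ...]]]:
    """Decompose a doubly stochastic matrix into weighted permutations.

    Returns (weight, perm) pairs, where perm[i] is the output matched to
    input i. Assumes every row and column of `rates` sums to the same value;
    otherwise the permutation search below raises StopIteration.
    """
    n = len(rates)
    residual = [row[:] for row in rates]
    terms: List[Tuple[Fraction, Tuple[int, ...]]] = []
    while any(x > 0 for row in residual for x in row):
        # Find a permutation that only uses strictly positive entries.
        perm = next(
            p for p in permutations(range(n))
            if all(residual[i][p[i]] > 0 for i in range(n))
        )
        # The smallest entry on that permutation is its weight.
        weight = min(residual[i][perm[i]] for i in range(n))
        for i in range(n):
            residual[i][perm[i]] -= weight
        terms.append((weight, perm))
    return terms


if __name__ == "__main__":
    half, zero = Fraction(1, 2), Fraction(0)
    example = [[half, half, zero], [half, zero, half], [zero, half, half]]
    for weight, perm in bvn_decompose(example):
        print(weight, perm)
```

A scheduler can then serve each permutation for a share of time slots proportional to its weight, which is how the decomposition yields rate guarantees.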
PaST-NoC: A Packet-Switched Superconducting Temporal NoC
Temporal computing promises to mitigate the stringent area constraints and
clock distribution overheads of traditional superconducting digital computing.
To design a scalable, area- and power-efficient superconducting network on chip
(NoC), we propose packet-switched superconducting temporal NoC (PaST-NoC).
PaST-NoC operates its control path in the temporal domain using race logic
(RL), combined with bufferless deflection flow control to minimize area.
Packets encode their destination using RL and carry a collection of data pulses
that the receiver can interpret as pulse trains, RL, serialized binary, or
other formats. We demonstrate how to scale up PaST-NoC to arbitrary topologies
based on 2x2 routers and 4x4 butterflies as building blocks. As we show, if
data pulses are interpreted using RL, PaST-NoC outperforms state-of-the-art
superconducting binary NoCs in throughput per area by as much as 5x for long
packets. Comment: 14 pages, 18 figures, 2 tables. In press in IEEE Transactions on
Applied Superconductivity
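As background for the race logic (RL) mentioned in this abstract, here is a minimal functional sketch of the usual RL primitives, in which a value is represented by the arrival time of a pulse. It models behaviour only and is an assumption-laden illustration: it does not describe PaST-NoC's router circuits, packet format, or superconducting gate implementations, and all function names are hypothetical.

```python
import math

NEVER = math.inf  # represents a pulse that never arrives


def first_arrival(*times: float) -> float:
    """MIN in race logic: the output fires when the earliest input arrives."""
    return min(times)


def last_arrival(*times: float) -> float:
    """MAX in race logic: the output fires once every input has arrived."""
    return max(times)


def delay(time: float, amount: float) -> float:
    """Add a constant by passing the pulse through a fixed delay element."""
    return time + amount


def inhibit(inhibitor: float, data: float) -> float:
    """Pass `data` only if it arrives strictly before `inhibitor`."""
    return data if data < inhibitor else NEVER


if __name__ == "__main__":
    print(first_arrival(3, 5), last_arrival(3, 5))
    print(delay(3, 2), inhibit(4, 2), inhibit(2, 4))
```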
Social Insect-Inspired Adaptive Hardware
Modern VLSI transistor densities allow large systems to be implemented within a single chip. As technologies get smaller, fundamental limits of silicon devices are reached resulting in lower design yields and post-deployment failures. Many-core systems provide a platform for leveraging the computing resource on offer by deep sub-micron technologies and also offer high-level capabilities for mitigating the issues with small feature sizes. However, designing for many-core systems that can adapt to in-field failures and operation variability requires an extremely large multi-objective optimisation space. When a many-core reaches the size supported by the densities of modern technologies (thousands of processing cores), finding design solutions in this problem space becomes extremely difficult.
Many biological systems show properties that are adaptive and scalable. This thesis proposes a self-optimising and adaptive, yet scalable, design approach for many-core systems based on the emergent behaviours of social-insect colonies. In these colonies there are many thousands of individuals with low intelligence who contribute, without any centralised control, to complete a wide range of tasks to build and maintain the colony. The experiments presented translate biological models of social-insect intelligence into simple embedded intelligence circuits. These circuits sense low-level system events and use this information to manage the parameters of the many-core's Network-on-Chip (NoC) during runtime.
Centurion, a 128-node many-core, was created to investigate these models at large scale in hardware. The results show that, by monitoring a small number of signals within each NoC router, task allocation emerges from the social-insect intelligence models that can self-configure to support representative applications. It is demonstrated that emergent task allocation supports fault tolerance with no extra hardware overhead. The response-threshold decision making circuitry uses a negligible amount of hardware resources relative to the size of the many-core and is an ideal technology for implementing embedded intelligence for system runtime management of large-complexity single-chip systems
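The response-threshold decision making mentioned above is commonly modelled in the social-insect literature with a fixed-threshold response function, T(s) = s^n / (s^n + theta^n). The sketch below shows that general model only. What the stimulus would be in hardware (for example, a count of events seen at a NoC router) is an assumption here, and the thesis's actual circuitry and parameters are not reproduced.

```python
import random


def response_probability(stimulus: float, threshold: float, steepness: int = 2) -> float:
    """Probability that a node takes up a task at the given stimulus level.

    Implements s^n / (s^n + theta^n); a stimulus equal to the threshold
    gives a probability of exactly one half.
    """
    if stimulus <= 0:
        return 0.0
    s_n = stimulus ** steepness
    return s_n / (s_n + threshold ** steepness)


def decide(stimulus: float, threshold: float, rng: random.Random) -> bool:
    """Draw one stochastic task-switching decision."""
    return rng.random() < response_probability(stimulus, threshold)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    for s in (1.0, 5.0, 20.0):
        print(s, round(response_probability(s, 5.0), 3), decide(s, 5.0, rng))
```

Because each node needs only its local stimulus and its own threshold, no centralised controller is involved, which matches the decentralised behaviour the abstract describes.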
Comparing timed-division multiplexing and best-effort networks-on-chip
Best-effort (BE) networks-on-chip (NOCs) are usually preferred over time-division multiplexed (TDM) NOCs in multi-core platforms because they are work-conserving and have lower (zero-load) latency. On the other hand, BE NOCs are significantly more expensive to implement than TDM NOCs because of their virtual channel buffers, allocators/arbiters, and (credit-based) flow control; functionality that a TDM NOC avoids altogether. The objective of this paper is to compare the performance of BE and TDM NOCs, taking hardware cost into consideration. The networks are compared using graphs showing average latency as a function of offered load. For the BE NOCs, we use the BookSim simulator, and for the TDM NOCs, we derive a queuing theory model and an associated TDM NOC simulator. Through experiments with both router architectures, packet length, link width, and different traffic patterns, we show that for the same hardware cost, a TDM NOC can provide higher bandwidth and comparable latency. We also show that the packet length is the most important factor affecting the TDM period, which again is the primary factor affecting latency. The best TDM NOC design for BE traffic uses single flit packets, wide links/flits, and a router with two pipeline stages: link and router traversal.
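The link between packet length, TDM period, and latency can be made concrete with a deliberately simple sketch. It assumes one flit per cycle, a slot that lasts one packet, evenly spaced owned slots, and packets arriving at uniformly random times; this is not the queuing model derived in the paper, and the function names are illustrative.

```python
def tdm_period(num_slots: int, flits_per_packet: int) -> int:
    """TDM period in cycles, assuming each slot carries one whole packet."""
    if num_slots <= 0 or flits_per_packet <= 0:
        raise ValueError("slot count and packet length must be positive")
    return num_slots * flits_per_packet


def mean_slot_wait(num_slots: int, flits_per_packet: int, slots_owned: int = 1) -> float:
    """Mean wait in cycles for the next owned slot under the stated assumptions."""
    if not 0 < slots_owned <= num_slots:
        raise ValueError("slots_owned must be between 1 and num_slots")
    # Owned slots recur every period / slots_owned cycles; a random arrival
    # waits half of that interval on average.
    return tdm_period(num_slots, flits_per_packet) / (2 * slots_owned)


if __name__ == "__main__":
    for flits in (1, 4):
        print(flits, tdm_period(16, flits), mean_slot_wait(16, flits))
```

Under these assumptions the period, and with it the mean wait, grows linearly with packet length, which is consistent with the abstract's preference for single-flit packets.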
Design and Implementation of a Multi-Class Network Architecture for Hardware Neural Networks
This thesis describes the design and implementation of a network architecture that combines techniques from circuit-switched and packet-switched networks to offer two qualities of service: isochronous connections and packet-based best-effort connections. Isochronous connections use reserved network resources to guarantee lossless transmission and a low end-to-end delay with bounded variance. The synchronization of all network nodes and the computation of a compact reservation schedule are solved by efficient algorithms. Packet-based transmissions use the remaining bandwidth. The multiplexing of the two traffic classes is performed by a novel bypass switch that scales in the number of interfaces and in external bandwidth and requires no internal speedup. The network architecture is used in research within the FACETS project on large-scale artificial neural networks in hardware, to interconnect a distributed system of VLSI neural networks. Axonal connections between neurons are modeled using isochronous connections, whereas packet-based transmission forms the basis for a system-wide shared memory architecture. The part of the network that operates at runtime is implemented in programmable logic and runs at an external transmission rate of 3.125 Gbit/s. The thesis discusses the application-specific requirements on the network, as well as its design and reference implementation in programmable logic and software. Theoretical considerations of performance are verified by measurements and simulations. Although the network architecture was designed for the specific application of neural networks, it is a general solution for all network environments that require isochronous connections and packet switching with low complexity. The architecture is particularly suited for use in the next stage of hardware development in the FACETS project, for interconnecting artificial neural networks at wafer scale.