393 research outputs found

    On-board B-ISDN fast packet switching architectures. Phase 2: Development. Proof-of-concept architecture definition report

    Get PDF
    For the next-generation packet switched communications satellite system with onboard processing and spot-beam operation, a reliable onboard fast packet switch is essential to route packets from different uplink beams to different downlink beams. The rapid emergence of point-to-point services such as video distribution, and the large demand for video conference, distributed data processing, and network management makes the multicast function essential to a fast packet switch (FPS). The satellite's inherent broadcast features gives the satellite network an advantage over the terrestrial network in providing multicast services. This report evaluates alternate multicast FPS architectures for onboard baseband switching applications and selects a candidate for subsequent breadboard development. Architecture evaluation and selection will be based on the study performed in phase 1, 'Onboard B-ISDN Fast Packet Switching Architectures', and other switch architectures which have become commercially available as large scale integration (LSI) devices

    Mesh-of-Trees Interconnection Network for an Explicitly Multi-Threaded Parallel Computer Architecture

    Get PDF
    As the multiple-decade long increase in clock rates starts to slow down, main-stream general-purpose processors evolve towards single-chip parallel processing. On-chip interconnection networks are essential components of such machines, supporting the communication between processors and the memory system. This task is especially challenging for some easy-to-program parallel computers, which are designed with performance-demanding memory systems. This study proposes an interconnection network, with a novel implementation of the Mesh-of-Trees (MoT) topology. The MoT network is evaluated relative to metrics such as wire area complexity, total register count, bandwidth, network diameter, single switch delay, maximum throughput per area, trade-offs between throughput and latency, and post-layout performance. It is also compared with some other traditional network topologies, such as mesh, ring, hypercube, butterfly, fat trees, butterfly fat trees, and replicated butterfly networks. Concrete results show that MoT provides higher throughput and lower latency especially when the input traffic (or the on-chip parallelism) is high, at comparable area cost. The layout of MoT network is evaluated using standard cell design methodology. A prototype chip with 8-terminal MoT network was taped out at 90nm90nm technology and tested. In the context of an easy-to-program single-chip parallel processor, MoT network is embedded in the eXplicit Multi-Threading (XMT) architecture, and evaluated by running parallel applications. In addition to the basic MoT architecture, a novel hybrid extension of MoT is proposed, which allows significant area savings with a small reduction in throughput

    Scheduling algorithms for high-speed switches

    Get PDF
    The virtual output queued (VOQ) switching architecture was adopted for high speed switch implementation owing to its scalability and high throughput. An ideal VOQ algorithm should provide Quality of Service (QoS) with low complexity. However, none of the existing algorithms can meet these requirements. Several algorithms for VOQ switches are introduced in this dissertation in order to improve upon existing algorithms in terms of implementation or QoS features. Initially, the earliest due date first matching (EDDFM) algorithm, which is stable for both uniform and non-uniform traffic patterns, is proposed. EDDFM has lower probability of cell overdue than other existing maximum weight matching algorithms. Then, the shadow departure time algorithm (SDTA) and iterative SDTA (ISDTA) are introduced. The QoS features of SDTA and ISDTA are better than other existing algorithms with the same computational complexity. Simulations show that the performance of a VOQ switch using ISDTA with a speedup of 1.5 is similar to that of an output queued (OQ) switch in terms of cell delay and throughput. Later, the enhanced Birkhoff-von Neumann decomposition (EBVND) algorithm based on the Birkhoff-von Neumann decomposition (BVND) algorithm, which can provide rate and cell delay guarantees, is introduced. Theoretical analysis shows that the performance of EBVND is better than BVND in terms of throughput and cell delay. Finally, the maximum credit first (MCF), the Enhanced MCF (EMCF), and the iterative MCF (IMCF) algorithms are presented. These new algorithms have the similar performance as BNVD, yet are easier to implement in practice

    PaST-NoC: A Packet-Switched Superconducting Temporal NoC

    Full text link
    Temporal computing promises to mitigate the stringent area constraints and clock distribution overheads of traditional superconducting digital computing. To design a scalable, area- and power-efficient superconducting network on chip (NoC), we propose packet-switched superconducting temporal NoC (PaST-NoC). PaST-NoC operates its control path in the temporal domain using race logic (RL), combined with bufferless deflection flow control to minimize area. Packets encode their destination using RL and carry a collection of data pulses that the receiver can interpret as pulse trains, RL, serialized binary, or other formats. We demonstrate how to scale up PaST-NoC to arbitrary topologies based on 2x2 routers and 4x4 butterflies as building blocks. As we show, if data pulses are interpreted using RL, PaST-NoC outperforms state-of-the-art superconducting binary NoCs in throughput per area by as much as 5x for long packets.Comment: 14 pages, 18 figures, 2 tables. In press in IEEE Transactions on Applied Superconductivit

    Social Insect-Inspired Adaptive Hardware

    Get PDF
    Modern VLSI transistor densities allow large systems to be implemented within a single chip. As technologies get smaller, fundamental limits of silicon devices are reached resulting in lower design yields and post-deployment failures. Many-core systems provide a platform for leveraging the computing resource on offer by deep sub-micron technologies and also offer high-level capabilities for mitigating the issues with small feature sizes. However, designing for many-core systems that can adapt to in-field failures and operation variability requires an extremely large multi-objective optimisation space. When a many-core reaches the size supported by the densities of modern technologies (thousands of processing cores), finding design solutions in this problem space becomes extremely difficult. Many biological systems show properties that are adaptive and scalable. This thesis proposes a self-optimising and adaptive, yet scalable, design approach for many-core based on the emergent behaviours of social-insect colonies. In these colonies there are many thousands of individuals with low intelligence who contribute, without any centralised control, to complete a wide range of tasks to build and maintain the colony. The experiments presented translate biological models of social-insect intelligence into simple embedded intelligence circuits. These circuits sense low-level system events and use this manage the parameters of the many-core's Network-on-Chip (NoC) during runtime. Centurion, a 128-node many-core, was created to investigate these models at large scale in hardware. The results show that, by monitoring a small number of signals within each NoC router, task allocation emerges from the social-insect intelligence models that can self-configure to support representative applications. It is demonstrated that emergent task allocation supports fault tolerance with no extra hardware overhead. The response-threshold decision making circuitry uses a negligible amount of hardware resources relative to the size of the many-core and is an ideal technology for implementing embedded intelligence for system runtime management of large-complexity single-chip systems

    Comparing timed-division multiplexing and best-effort networks-on-chip

    Get PDF
    Best-effort (BE) networks-on-chips (NOCs) are usually preferred over time-division multiplexed (TDM) NOCs in multi-core platforms because they are work-conserving and have lower (zero-load) latency. On the other hand, BE NOCs are significantly more expensive to implement than TDM NOCs because of their virtual channel buffers, allocators/arbiters, and (credit-based) flow control; functionality that a TDM NOC avoids altogether. The objective of this paper is to compare the performance of BE and TDM NOCs, taking hardware cost into consideration. The networks are compared using graphs showing average latency as a function of offered load. For the BE NOCs, we use the BookSim simulator, and for the TDM NOCs, we derive a queuing theory model and an associated TDM NOC simulator. Through experiments with both router architectures, packet length, link width, and different traffic patterns, we show that for the same hardware cost, a TDM NOC can provide higher bandwidth and comparable latency. We also show that the packet length is the most important factor affecting the TDM period, which again is the primary factor affecting latency. The best TDM NOC design for BE traffic uses single flit packets, wide links/flits, and a router with two pipeline stages: link and router traversal.publishedVersionPeer reviewe

    Design and Implementation of a Multi-Class Network Architecture for Hardware Neural Networks

    Get PDF
    Die vorliegende Arbeit beschreibt den Entwurf und die Implementierung einer Netzwerkarchitektur, welche Techniken von leitungsvermittelnden und paketvermittelnden Netzwerken verbindet, um zwei verschiedene Dienstgüten anzubieten: isochrone Verbindungen und paketbasierte Verbindungen mit bestmöglicher Zustellung. Isochrone Verbindungen verwenden reservierte Netzwerkresourcen, um eine verlustfreie Übertragung sowie eine niedrige Ende-zu-Ende Verzögerung mit begrenzter Varianz zu garantieren. Die Synchronisierung aller Netzwerkknoten sowie die Berechnung einer kompakten Reservierungsbelegung werden durch effiziente Algorithmen gelöst. Paketbasierte Übertragungen verwenden die verbleibende Bandbreite. Das Multiplexen beider Verkehrsklassen wird von einem neuartigen Bypass-Switch geleistet, der skalierbar ist in der Anzahl der Schnittstellen sowie in der externen Bandbreite und ohne eine interne Beschleunigung auskommt. Die Netzwerkarchitektur kommt in der Forschung innerhalb des FACETS Projektes mit großskaligen künstlichen neuronalen Netzen in Hardware zum Einsatz, für die Vernetzung eines verteilten Systems aus VLSI neuronalen Netzen. Axonale Verbindungen zwischen Neuronen werden mit Hilfe von isochronen Verbindungen modelliert, wohingegen paketbasierte Übertragung die Grundlage für eine systemweite gemeinsame Speicherarchitektur bildet. Der zur Laufzeit ausgeführte Teil des Netzwerkes ist in programmierbarer Logik implementiert und arbeitet mit einer externen Übertragungsrate von 3.125 Gbit/s. Die Arbeit diskutiert die anwendungsbezogenen Anforderungen an das Netzwerk, sowie dessen Entwurf und Referenzimplementierung in programmierbarer Logik und Software. Theoretische Überlegungen über die Leistungsfähigkeit werden durch Messungen und Simulationen verifiziert. Obwohl die Netzwerkarchitektur für die spezielle Anwendung mit neuronalen Netzen entworfen wurde, stellt sie eine generelle Lösung für alle Netzwerkumgebungen dar, welche isochrone Verbindungen und Paketvermittlung mit niedriger Komplexität benötigen. Die Architektur ist insbesondere für den Einsatz in der nächsten Stufe der Hardwareentwicklung des FACETS Projektes zur Vernetzung künstlicher neuronaler Netze auf Wafer-Ebene geeignet
    corecore