697 research outputs found

    Mesh-of-Trees Interconnection Network for an Explicitly Multi-Threaded Parallel Computer Architecture

    Get PDF
    As the multiple-decade long increase in clock rates starts to slow down, main-stream general-purpose processors evolve towards single-chip parallel processing. On-chip interconnection networks are essential components of such machines, supporting the communication between processors and the memory system. This task is especially challenging for some easy-to-program parallel computers, which are designed with performance-demanding memory systems. This study proposes an interconnection network, with a novel implementation of the Mesh-of-Trees (MoT) topology. The MoT network is evaluated relative to metrics such as wire area complexity, total register count, bandwidth, network diameter, single switch delay, maximum throughput per area, trade-offs between throughput and latency, and post-layout performance. It is also compared with some other traditional network topologies, such as mesh, ring, hypercube, butterfly, fat trees, butterfly fat trees, and replicated butterfly networks. Concrete results show that MoT provides higher throughput and lower latency especially when the input traffic (or the on-chip parallelism) is high, at comparable area cost. The layout of MoT network is evaluated using standard cell design methodology. A prototype chip with 8-terminal MoT network was taped out at 90nm90nm technology and tested. In the context of an easy-to-program single-chip parallel processor, MoT network is embedded in the eXplicit Multi-Threading (XMT) architecture, and evaluated by running parallel applications. In addition to the basic MoT architecture, a novel hybrid extension of MoT is proposed, which allows significant area savings with a small reduction in throughput

    08371 Abstracts Collection -- Fault-Tolerant Distributed Algorithms on VLSI Chips

    Get PDF
    From September the 7textth7^{text{th}}, 2008 to September the 10textth10^{text{th}}, 2008 the Dagstuhl Seminar 08371 ``Fault-Tolerant Distributed Algorithms on VLSI Chips \u27\u27 was held in Schloss Dagstuhl~--~Leibniz Center for Informatics. The seminar was devoted to exploring whether the wealth of existing fault-tolerant distributed algorithms research can be utilized for meeting the challenges of future-generation VLSI chips. During the seminar, several participants from both the VLSI and distributed algorithms\u27 discipline, presented their current research, and ongoing work and possibilities for collaboration were discussed. Abstracts of the presentations given during the seminar as well as abstracts of seminar results and ideas are put together in this paper. The first section describes the seminar topics and goals in general. Links to extended abstracts or full papers are provided, if available

    ARBITRATE-AND-MOVE PRIMITIVES FOR HIGH THROUGHPUT ON-CHIP INTERCONNECTION NETWORKS

    Get PDF
    An n-leaf pipelined balanced binary tree is used for arbitration of order and movement of data from n input ports to one output port. A novel arbitrate-and-move primitive circuit for every node of the tree, which is based on a concept of reduced synchrony that benefits from attractive features of both asynchronous and synchronous designs, is presented. The design objective of the pipelined binary tree is to provide a key building block in a high-throughput mesh-of-trees interconnection network for Explicit Multi Threading (XMT) architecture, a recently introduced parallel computation framework. The proposed reduced synchrony circuit was compared with asynchronous and synchronous designs of arbitrate-and-move primitives. Simulations with 0.18m technology show that compared to an asynchronous design, the proposed reduced synchrony implementation achieves a higher throughput, up to 2 Giga- Requests per second on an 8-leaf binary tree. Our circuit also consumes less power than the synchronous design, and requires less silicon area than both the synchronous and asynchronous designs

    The MANGO clockless network-on-chip: Concepts and implementation

    Get PDF

    Speck: A Smart event-based Vision Sensor with a low latency 327K Neuron Convolutional Neuronal Network Processing Pipeline

    Full text link
    Edge computing solutions that enable the extraction of high level information from a variety of sensors is in increasingly high demand. This is due to the increasing number of smart devices that require sensory processing for their application on the edge. To tackle this problem, we present a smart vision sensor System on Chip (Soc), featuring an event-based camera and a low power asynchronous spiking Convolutional Neuronal Network (sCNN) computing architecture embedded on a single chip. By combining both sensor and processing on a single die, we can lower unit production costs significantly. Moreover, the simple end-to-end nature of the SoC facilitates small stand-alone applications as well as functioning as an edge node in a larger systems. The event-driven nature of the vision sensor delivers high-speed signals in a sparse data stream. This is reflected in the processing pipeline, focuses on optimising highly sparse computation and minimising latency for 9 sCNN layers to 3.36μs3.36\mu s. Overall, this results in an extremely low-latency visual processing pipeline deployed on a small form factor with a low energy budget and sensor cost. We present the asynchronous architecture, the individual blocks, the sCNN processing principle and benchmark against other sCNN capable processors

    A High-Throughput, Low-Power Asynchronous Mesh-of-Trees Interconnection Network for the Explicit Multi-Threading (XMT) Parallel Architecture

    Get PDF
    This thesis presents an asynchronous (clockless) Mesh-of-Trees network that consumes less power and area than the synchronous Mesh-of-Trees network, while maintaining high throughput and low latency. Two new asynchronous designs are proposed for the fundamental pipelined components of the network (routing and arbitration), which are optimized for power, area, latency and throughput. Mixed-timing interfaces are added to create a mixed-timing network which provides communication between synchronous and asynchronous domains. Two issues top the agenda of CPU design in the emerging many-core era: programmers' productivity and power consumption. Through its reliance on the richest available theory of parallel algorithms, the eXplicit Multi-Threading (XMT) parallel architecture addresses programmers' productivity. The motivation for this work is to provide an effective interconnection network for the XMT architecture in terms of both performance and power consumption. Performance of the XMT processor with the mixed-timing network is measured for several applications
    • …
    corecore