11 research outputs found

    Network-on-Chip

    Get PDF
    Limitations of bus-based interconnections related to scalability, latency, bandwidth, and power consumption for supporting the related huge number of on-chip resources result in a communication bottleneck. These challenges can be efficiently addressed with the implementation of a network-on-chip (NoC) system. This book gives a detailed analysis of various on-chip communication architectures and covers different areas of NoCs such as potentials, architecture, technical challenges, optimization, design explorations, and research directions. In addition, it discusses current and future trends that could make an impactful and meaningful contribution to the research and design of on-chip communications and NoC systems

    On Multicast in Asynchronous Networks-on-Chip: Techniques, Architectures, and FPGA Implementation

    Get PDF
    In this era of exascale computing, conventional synchronous design techniques are facing unprecedented challenges. The consumer electronics market is replete with many-core systems in the range of 16 cores to thousands of cores on chip, integrating multi-billion transistors. However, with this ever increasing complexity, the traditional design approaches are facing key issues such as increasing chip power, process variability, aging, thermal problems, and scalability. An alternative paradigm that has gained significant interest in the last decade is asynchronous design. Asynchronous designs have several potential advantages: they are naturally energy proportional, burning power only when active, do not require complex clock distribution, are robust to different forms of variability, and provide ease of composability for heterogeneous platforms. Networks-on-chip (NoCs) is an interconnect paradigm that has been introduced to deal with the ever-increasing system complexity. NoCs provide a distributed, scalable, and efficient interconnect solution for today’s many-core systems. Moreover, NoCs are a natural match with asynchronous design techniques, as they separate communication infrastructure and timing from the computational elements. To this end, globally-asynchronous locally-synchronous (GALS) systems that interconnect multiple processing cores, operating at different clock speeds, using an asynchronous NoC, have gained significant interest. While asynchronous NoCs have several advantages, they also face a key challenge of supporting new types of traffic patterns. Once such pattern is multicast communication, where a source sends packets to arbitrary number of destinations. Multicast is not only common in parallel computing, such as for cache coherency, but also for emerging areas such as neuromorphic computing. This important capability has been largely missing from asynchronous NoCs. This thesis introduces several efficient multicast solutions for these interconnects. In particular, techniques, and network architectures are introduced to support high-performance and low-power multicast. Two leading network topologies are the focus: a variant mesh-of-trees (MoT) and a 2D mesh. In addition, for a more realistic implementation and analysis, as well as significantly advancing the field of asynchronous NoCs, this thesis also targets synthesis of these NoCs on commercial FPGAs. While there has been significant advances in FPGA technologies, there has been only limited research on implementing asynchronous NoCs on FPGAs. To this end, a systematic computeraided design (CAD) methodology has been introduced to efficiently and safely map asynchronous NoCs on FPGAs. Overall, this thesis makes the following three contributions. The first contribution is a multicast solution for a variant MoT network topology. This topology consists of simple low-radix switches, and has been used in high-performance computing platforms. A novel local speculation technique is introduced, where a subset of the network’s switches are speculative that always broadcast every packet. These switches are very simple and have high performance. Speculative switches are surrounded by non-speculative ones that route packets based on their destinations and also throttle any redundant copies created by the former. This hybrid network architecture achieved significant performance and power benefits over other multicast approaches. The second contribution is a multicast solution for a 2D-mesh topology, which is more complex with higher-radix switches and also is more commonly used. A novel continuous-time replication strategy is introduced to optimize the critical multi-way forking operation of a multicast transmission. In this technique, a multicast packet is first stored in an input port of a switch, from where it is sent through distinct output ports towards different destinations concurrently, at each output’s own rate and in continuous time. This strategy is shown to have significant latency and energy benefits over an approach that performs multicast using multiple distinct serial unicasts to each destination. Finally, a systematic CAD methodology is introduced to synthesize asynchronous NoCs on commercial FPGAs. A two-fold goal is targeted: correctness and high performance. For ease of implementation, only existing FPGA synthesis tools are used. Moreover, since asynchronous NoCs involve special asynchronous components, a comprehensive guide is introduced to map these elements correctly and efficiently. Two asynchronous NoC switches are synthesized using the proposed approach on a leading Xilinx FPGA in 28 nm: one that only handles unicast, and the other that also supports multicast. Both showed significant energy benefits with some performance gains over a state-of-the-art synchronous switch

    Security of Electrical, Optical and Wireless On-Chip Interconnects: A Survey

    Full text link
    The advancement of manufacturing technologies has enabled the integration of more intellectual property (IP) cores on the same system-on-chip (SoC). Scalable and high throughput on-chip communication architecture has become a vital component in today's SoCs. Diverse technologies such as electrical, wireless, optical, and hybrid are available for on-chip communication with different architectures supporting them. Security of the on-chip communication is crucial because exploiting any vulnerability would be a goldmine for an attacker. In this survey, we provide a comprehensive review of threat models, attacks, and countermeasures over diverse on-chip communication technologies as well as sophisticated architectures.Comment: 41 pages, 24 figures, 4 table

    Carbon nanotubes as interconnect for next generation network on chip

    Get PDF
    Multi-core processors provide better performance when compared with their single-core equivalent. Recently, Networks-on-Chip (NoC) have emerged as a communication methodology for multi core chips. Network-on-Chip uses packet based communication for establishing a communication path between multiple cores connected via interconnects. Clock frequency, energy consumption and chip size are largely determined by these interconnects. According to the International Technology Roadmap for Semiconductors (ITRS), in the next five years up to 80% of microprocessor power will be consumed by interconnects. In the sub 100nm scaling range, interconnect behavior limits the performance and correctness of VLSI systems. The performance of copper interconnects tend to get reduced in the sub 100nm range and hence we need to examine other interconnect options. Single Wall Carbon Nanotubes exhibit better performance in sub 100nm processing technology due to their very large current carrying capacity and large electron mean free paths. This work suggests using Single Wall Carbon Nanotubes (SWCNT) as interconnects for Networks-on-Chip as they consume less energy and gives more throughput and bandwidth when compared with traditional Copper wires

    Improving the Scalability of High Performance Computer Systems

    Full text link
    Improving the performance of future computing systems will be based upon the ability of increasing the scalability of current technology. New paths need to be explored, as operating principles that were applied up to now are becoming irrelevant for upcoming computer architectures. It appears that scaling the number of cores, processors and nodes within an system represents the only feasible alternative to achieve Exascale performance. To accomplish this goal, we propose three novel techniques addressing different layers of computer systems. The Tightly Coupled Cluster technique significantly improves the communication for inter node communication within compute clusters. By improving the latency by an order of magnitude over existing solutions the cost of communication is considerably reduced. This enables to exploit fine grain parallelism within applications, thereby, extending the scalability considerably. The mechanism virtually moves the network interconnect into the processor, bypassing the latency of the I/O interface and rendering protocol conversions unnecessary. The technique is implemented entirely through firmware and kernel layer software utilizing off-the-shelf AMD processors. We present a proof-of-concept implementation and real world benchmarks to demonstrate the superior performance of our technique. In particular, our approach achieves a software-to-software communication latency of 240 ns between two remote compute nodes. The second part of the dissertation introduces a new framework for scalable Networks-on-Chip. A novel rapid prototyping methodology is proposed, that accelerates the design and implementation substantially. Due to its flexibility and modularity a large application space is covered ranging from Systems-on-chip, to high performance many-core processors. The Network-on-Chip compiler enables to generate complex networks in the form of synthesizable register transfer level code from an abstract design description. Our engine supports different target technologies including Field Programmable Gate Arrays and Application Specific Integrated Circuits. The framework enables to build large designs while minimizing development and verification efforts. Many topologies and routing algorithms are supported by partitioning the tasks into several layers and by the introduction of a protocol agnostic architecture. We provide a thorough evaluation of the design that shows excellent results regarding performance and scalability. The third part of the dissertation addresses the Processor-Memory Interface within computer architectures. The increasing compute power of many-core processors, leads to an equally growing demand for more memory bandwidth and capacity. Current processor designs exhibit physical limitations that restrict the scalability of main memory. To address this issue we propose a memory extension technique that attaches large amounts of DRAM memory to the processor via a low pin count interface using high speed serial transceivers. Our technique transparently integrates the extension memory into the system architecture by providing full cache coherency. Therefore, applications can utilize the memory extension by applying regular shared memory programming techniques. By supporting daisy chained memory extension devices and by introducing the asymmetric probing approach, the proposed mechanism ensures high scalability. We furthermore propose a DMA offloading technique to improve the performance of the processor memory interface. The design has been implemented in a Field Programmable Gate Array based prototype. Driver software and firmware modifications have been developed to bring up the prototype in a Linux based system. We show microbenchmarks that prove the feasibility of our design

    Energy Efficient Neocortex-Inspired Systems with On-Device Learning

    Get PDF
    Shifting the compute workloads from cloud toward edge devices can significantly improve the overall latency for inference and learning. On the contrary this paradigm shift exacerbates the resource constraints on the edge devices. Neuromorphic computing architectures, inspired by the neural processes, are natural substrates for edge devices. They offer co-located memory, in-situ training, energy efficiency, high memory density, and compute capacity in a small form factor. Owing to these features, in the recent past, there has been a rapid proliferation of hybrid CMOS/Memristor neuromorphic computing systems. However, most of these systems offer limited plasticity, target either spatial or temporal input streams, and are not demonstrated on large scale heterogeneous tasks. There is a critical knowledge gap in designing scalable neuromorphic systems that can support hybrid plasticity for spatio-temporal input streams on edge devices. This research proposes Pyragrid, a low latency and energy efficient neuromorphic computing system for processing spatio-temporal information natively on the edge. Pyragrid is a full-scale custom hybrid CMOS/Memristor architecture with analog computational modules and an underlying digital communication scheme. Pyragrid is designed for hierarchical temporal memory, a biomimetic sequence memory algorithm inspired by the neocortex. It features a novel synthetic synapses representation that enables dynamic synaptic pathways with reduced memory usage and interconnects. The dynamic growth in the synaptic pathways is emulated in the memristor device physical behavior, while the synaptic modulation is enabled through a custom training scheme optimized for area and power. Pyragrid features data reuse, in-memory computing, and event-driven sparse local computing to reduce data movement by ~44x and maximize system throughput and power efficiency by ~3x and ~161x over custom CMOS digital design. The innate sparsity in Pyragrid results in overall robustness to noise and device failure, particularly when processing visual input and predicting time series sequences. Porting the proposed system on edge devices can enhance their computational capability, response time, and battery life

    Parallel and Distributed Computing

    Get PDF
    The 14 chapters presented in this book cover a wide variety of representative works ranging from hardware design to application development. Particularly, the topics that are addressed are programmable and reconfigurable devices and systems, dependability of GPUs (General Purpose Units), network topologies, cache coherence protocols, resource allocation, scheduling algorithms, peertopeer networks, largescale network simulation, and parallel routines and algorithms. In this way, the articles included in this book constitute an excellent reference for engineers and researchers who have particular interests in each of these topics in parallel and distributed computing

    Fault-Tolerant Arbitration in Multichip Crossbar Switches

    No full text

    Fault-Tolerant Arbitration in Multichip Crossbar Switches

    No full text
    . A major problem in the design of large multichip crossbar networks is the latency associated with the centralized controller used to arbitrate requests. We present a solution to reduce this latency and also provide fault-tolerance by the use of multiple controllers that can operate concurrently. This paper concentrates on crossbars that are constructed using onesided crosspoint switches, as such designs can alleviate simultaneous switching noise. Two arbitration mechanisms are proposed that provide a range of tradeoffs in controller complexity and hardware overheads. We derive a lower bound on the number of busses required for any distributed control scheme and compare it with simulation results for a 64 port network. The results show that besides reducing arbitration latencies, distributed control can provide almost non-blocking operation with little extra hardware, and with improved properties of fault-tolerance and graceful degradation. 1 Introduction Crossbar switches are used..
    corecore