645 research outputs found

    Scalability of broadcast performance in wireless network-on-chip

    Networks-on-Chip (NoCs) are currently the paradigm of choice to interconnect the cores of a chip multiprocessor. However, conventional NoCs may not suffice to fulfill the on-chip communication requirements of processors with hundreds or thousands of cores. The main reason is that the performance of such networks drops as the number of cores grows, especially in the presence of multicast and broadcast traffic. This not only limits the scalability of current multiprocessor architectures, but also sets a performance wall that prevents the development of architectures that generate moderate-to-high levels of multicast. In this paper, a Wireless Network-on-Chip (WNoC) where all cores share a single broadband channel is presented. Such a design is conceived to provide low latency and ordered delivery for multicast/broadcast traffic, in an attempt to complement a wireline NoC that will transport the rest of the communication flows. To assess the feasibility of this approach, the network performance of WNoC is analyzed as a function of the system size and the channel capacity, and then compared to that of wireline NoCs with embedded multicast support. Based on this evaluation, preliminary results on the potential performance of the proposed hybrid scheme are provided, together with guidelines for the design of MAC protocols for WNoC. Peer Reviewed. Postprint (published version).

    Energy Scalability of On-Chip Interconnection Networks in Multicore Architectures

    On-chip interconnection networks (OCNs) such as point-to-point networks and buses form the communication backbone in systems-on-a-chip, multicore processors, and tiled processors. OCNs can consume significant portions of a chip's energy budget, so analyzing their energy consumption early in the design cycle becomes important for architectural design decisions. Although numerous studies have examined OCN implementation and performance, few have examined energy. This paper develops an analytical framework for energy estimation in OCNs and presents results based on both analytical models of communication patterns and real network traces from applications running on a tiled multicore processor. Our analytical framework supports arbitrary OCN topologies under arbitrary communication patterns while accounting for wire length, switch energy, and network contention. It is the first to incorporate the effects of communication locality and network contention, and use real traces extensively. This paper compares the energy of point-to-point networks against buses under varying degrees of communication locality. The results indicate that, for 16 or more processors, a one-dimensional and a two-dimensional point-to-point network provide 66% and 82% energy savings, respectively, over a bus assuming that processors communicate with equal likelihood. The energy savings increase for patterns which exhibit locality. For the two-dimensional point-to-point OCN of the Raw tiled microprocessor, contention contributes a maximum of just 23% of the OCN energy, using estimated values for channel, switch control logic, and switch queue buffer energy of 34.5 pJ, 17 pJ, and 12 pJ, respectively. Our results show that the energy-delay product per message decreases with increasing processor message injection rate.
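As a rough illustration of why dimensionality matters under uniform traffic, the sketch below counts the average hop distance between distinct nodes in a 16-node one-dimensional array and in a 4x4 two-dimensional mesh. It is not the paper's analytical framework and does not reproduce the 66% and 82% figures: the function name `mean_hops`, the choice of a 4x4 layout, and the idea of charging the sum of the three quoted energy values once per hop are all assumptions made for this example. Wire length and contention are ignored.

```python
from itertools import product

def mean_hops(dims):
    """Average Manhattan distance between distinct nodes of a mesh
    with the given dimensions, assuming every source picks every
    other node as destination with equal likelihood."""
    nodes = list(product(*(range(d) for d in dims)))
    total = 0
    pairs = 0
    for src in nodes:
        for dst in nodes:
            if src != dst:
                total += sum(abs(s - d) for s, d in zip(src, dst))
                pairs += 1
    return total / pairs

# Energy values quoted in the abstract for the Raw 2D OCN (picojoules).
CHANNEL_PJ = 34.5
SWITCH_CONTROL_PJ = 17.0
QUEUE_BUFFER_PJ = 12.0

# ASSUMPTION: each hop costs the sum of the three components.
# The abstract does not say how the paper combines them.
PER_HOP_PJ = CHANNEL_PJ + SWITCH_CONTROL_PJ + QUEUE_BUFFER_PJ

hops_1d = mean_hops((16,))    # 16 nodes in a line
hops_2d = mean_hops((4, 4))   # 16 nodes in a 4x4 mesh

print(f"1D: {hops_1d:.2f} hops, {hops_1d * PER_HOP_PJ:.1f} pJ per message")
print(f"2D: {hops_2d:.2f} hops, {hops_2d * PER_HOP_PJ:.1f} pJ per message")
```

Under these assumptions the two-dimensional layout needs fewer hops per message than the one-dimensional one, which is consistent in direction with the larger savings the abstract reports for the two-dimensional network.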

    Hyperdrive: A Multi-Chip Systolically Scalable Binary-Weight CNN Inference Engine

    Deep neural networks have achieved impressive results in computer vision and machine learning. Unfortunately, state-of-the-art networks are extremely compute and memory intensive, which makes them unsuitable for mW-devices such as IoT end-nodes. Aggressive quantization of these networks dramatically reduces the computation and memory footprint. Binary-weight neural networks (BWNs) follow this trend, pushing weight quantization to the limit. Hardware accelerators for BWNs presented up to now have focused on core efficiency, disregarding I/O bandwidth and system-level efficiency that are crucial for deployment of accelerators in ultra-low power devices. We present Hyperdrive: a BWN accelerator dramatically reducing the I/O bandwidth exploiting a novel binary-weight streaming approach, which can be used for arbitrarily sized convolutional neural network architecture and input resolution by exploiting the natural scalability of the compute units both at chip-level and system-level by arranging Hyperdrive chips systolically in a 2D mesh while processing the entire feature map together in parallel. Hyperdrive achieves 4.3 TOp/s/W system-level efficiency (i.e., including I/Os), 3.1x higher than state-of-the-art BWN accelerators, even if its core uses resource-intensive FP16 arithmetic for increased robustness.

    Efficient Cache Coherence on Manycore Optical Networks

    Ever since industry has turned to parallelism instead of frequency scaling to improve processor performance, multicore processors have continued to scale to larger and larger numbers of cores. Some believe that multicores will have 1000 cores or more by the middle of the next decade. However, their promise of increased performance will only be reached if their inherent scaling challenges are overcome. One such major scaling challenge is the viability of efficient cache coherence with large numbers of cores. Meanwhile, recent advances in nanophotonic device manufacturing are making CMOS-integrated optics a reality: interconnect technology which can provide significantly more bandwidth at lower power than conventional electrical analogs. The contributions of this paper are two-fold. (1) It presents ATAC, a new manycore architecture that augments an electrical mesh network with an optical network that performs highly efficient broadcasts. (2) It introduces ACKwise, a novel directory-based cache coherence protocol that provides high performance and scalability on any large-scale manycore interconnection network with broadcast capability. Performance evaluation studies using analytical models show that (i) a 1024-core ATAC chip using ACKwise achieves a speedup of 3.9× compared to a similarly-sized pure electrical mesh manycore with a conventional limited directory protocol; (ii) the ATAC chip with ACKwise achieves a speedup of 1.35× compared to the electrical mesh chip with ACKwise; and (iii) a pure electrical mesh chip with ACKwise achieves a speedup of 2.9× over the same chip using a conventional limited directory protocol.
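The three speedups quoted in this abstract can be cross-checked against each other. The sketch below assumes that all three were measured on the same workloads and therefore compose multiplicatively, which the abstract does not state. The variable names and the helper `chained_speedup` are introduced here for illustration only.

```python
# Speedups as quoted in the abstract.
MESH_ACKWISE_OVER_MESH_BASELINE = 2.9    # claim (iii)
ATAC_ACKWISE_OVER_MESH_ACKWISE = 1.35    # claim (ii)
ATAC_ACKWISE_OVER_MESH_BASELINE = 3.9    # claim (i)

def chained_speedup(*factors):
    """Multiply successive speedups to get the end-to-end speedup.
    ASSUMPTION: the factors refer to the same workloads and baseline chain."""
    result = 1.0
    for factor in factors:
        result *= factor
    return result

# Mesh baseline -> mesh with ACKwise -> ATAC with ACKwise.
implied = chained_speedup(MESH_ACKWISE_OVER_MESH_BASELINE,
                          ATAC_ACKWISE_OVER_MESH_ACKWISE)
relative_gap = abs(implied - ATAC_ACKWISE_OVER_MESH_BASELINE) / ATAC_ACKWISE_OVER_MESH_BASELINE

print(f"implied end-to-end speedup: {implied:.3f}")
print(f"relative gap to reported 3.9x: {relative_gap:.2%}")
```

The product of claims (ii) and (iii) lands within rounding of claim (i), so the three figures are mutually consistent. Most of the gain in this chain comes from the protocol change (2.9×) rather than from the optical network (1.35×).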