309 research outputs found

    DDRNoC: Dual Data-Rate Network-on-Chip

    Get PDF
    This article introduces DDRNoC, an on-chip interconnection network capable of routing packets at Dual Data Rate. The cycle time of current 2D-mesh Network-on-Chip routers is limited by their control as opposed to the datapath (switch and link traversal), which exhibits significant slack. DDRNoC capitalizes on this observation, allowing two flits per cycle to share the same datapath. Thereby, DDRNoC achieves higher throughput than a Single Data Rate (SDR) network. Alternatively, using lower voltage circuits, the above slack can be exploited to reduce power consumption while matching the SDR network throughput. In addition, DDRNoC exhibits reduced clock distribution power, improving energy efficiency, as it needs a slower clock than a SDR network that routes packets at the same rate. Post place and route results in 28nm technology show that, compared to an iso-voltage (1.1V) SDR network, DDRNoC improves throughput proportionally to the SDR datapath slack. Moreover, a low-voltage (0.95V) DDRNoC implementation converts that slack to power reduction offering the 1.1V SDR throughput at a substantially lower energy cost

    FastTrackNoC: A DDR NoC with FastTrack Router Datapaths

    Get PDF
    This paper introduces FastTrackNoC, a Network-on-Chip router architecture that reduces latency by bypassing its switch traversal (ST) stage. FastTrackNoC adds a fast-track path between the head of a particular virtual channel (VC) buffer at each input port and the link of the opposite output. This allows non-turning flits to bypass ST when the required router resources are available. FastTrackNoC combines ST bypassing with existing techniques for reducing latency, namely, pipeline bypassing of control stages, precomputed routing and lookahead control signaling, to allow at best a flit to proceed directly to link traversal (LT). FastTrackNoC is applied to a Dual Data Rate (DDR) router in order to maximize throughput. Post place and route results in 28nm technology show that (i) compared to the current state of the art DDR NoCs, FastTrackNoC offers the same throughput and reduces average packet latency by 11-32% requiring up to 5% more power and (ii) compared to current state of the art Single Data Rate (SDR) NoCs, FastTrackNoC reduces packet latency by 9-40% and achieves 16-19% higher throughput with 5% higher power at the SDR NoC saturation point

    Dual Data Rate Network-on-Chip Architectures

    Get PDF
    Networks-on-Chip (NoCs) are becoming increasing important for the performance of modern multi-core systems-on-chip. The performance of current NoCs is limited, among others, by two factors: their limited clock frequency and long router pipeline. The clock frequency of a network defines the limits of its saturation throughput. However, for high throughput routers, clock is constrained by the control logic (for virtual channel and switch allocation) whereas the datapath (crossbar switch and links) possesses significant slack. This slack in the datapath wastes network throughput potential. Secondly, routers require flits to go through a large number of pipeline stages increasing packet latency at low traffic loads. These stages include router resource allocation, switch traversal (ST) and link traversal (LT). The allocation stages are used to manage contention among flits attempting to simultaneously access switch and links, and the ST stage is needed to change the routing dimension. In some cases, these stages are not needed and then requiring flits to go through them increases packet latency. The aim of this thesis is to improve NoC performance, in terms of network throughput, by removing the slack in the router datapath, and in terms of average packet latency, by enabling incoming flits to bypass, when possible, allocation and ST stages. More precisely, this thesis introduces Dual Data-Rate (DDR) NoC architectures which exploit the slack present in the NoC datapath to operate it at DDR. This requires a clock with period twice the datapath delay and removes the control logic from the critical path. DDR datapaths enable throughput higher than existing single data-rate (SDR) networks where the clock period is defined by the control logic. Additionally, this thesis supplements DDR NoC architectures with varying levels of pipeline stage bypassing capabilities to reduce low-load packet latency. In order to avoid complex logic required for bypassing from all inputs to all outputs, this thesis implements and evaluates a simplified bypassing approach. DDR NoC routers support bypassing of the allocation stage for flits propagating an in-network straight hop (i.e. East to West, North to South and vice versa) and when entering or exiting the network. Disabling bypassing during XY-turns limits its benefits, but, for most routing algorithms under low traffic loads, flits encounter at most one XY-turn on their way to the destination. Bypassing allocation stage enables incoming flits to directly initiate ST, when required conditions are met, and propagate at one cycle per hop. Furthermore, DDR NoC routers allow flits to bypass the ST stage when propagating a straight hop from the head of a specific input VC. Restricting ST bypassing from a specific VC further simplifies check logic to have clock period defined by datapath delays. Bypassing ST requires dedicated bypass paths from non-local input ports to opposite output ports. It enables flits to propagate half a cycle per hop. This thesis shows that compared to current state-of-the-art SDR NoCs, operating router’s datapath at DDR improves throughput by up to 20%. Adding to a DDR NoC an allocation bypassing mechanism for straight hops reduces its packet latency by up to 45%, while maintaining the DDR throughput advantage. Enhancing allocation bypassing to include flits entering and exiting the network further reduces latency by another 24%. Finally, adding ST bypassing further reduces latency by another 32%. Overall, DDR NoCs offer up to 40% lower latency and about 20% higher throughput compared to the SDR networks

    Extending the performance of hybrid NoCs beyond the limitations of network heterogeneity

    Get PDF
    To meet the performance and scalability demands of the fast-paced technological growth towards exascale and Big-Data processing with the performance bottleneck of conventional metal based interconnects (wireline), alternative interconnect fabrics such as inhomogeneous three-dimensional integrated Network-on-Chip (3D NoC) and hybrid wired-wireless Network-on-Chip (WiNoC) have emanated as a cost-effective solution for emerging System-on-Chip (SoC) design. However, these interconnects trade-off optimized performance for cost by restricting the number of area and power hungry 3D routers and wireless nodes. Moreover, the non-uniform distributed traffic in chip multiprocessor (CMP) demands an on-chip communication infrastructure which can avoid congestion under high traffic conditions while possessing minimal pipeline delay at low-load conditions. To this end, in this paper, we propose a low-latency adaptive router with a low-complexity single-cycle bypassing mechanism to alleviate the performance degradation due to the slow 2D routers in such emerging hybrid NoCs. The proposed router transmits a flit using dimension-ordered routing (DoR) in the bypass datapath at low-loads. When the output port required for intra-dimension bypassing is not available, the packet is routed adaptively to avoid congestion. The router also has a simplified virtual channel allocation (VA) scheme that yields a non-speculative low-latency pipeline. By combining the low-complexity bypassing technique with adaptive routing, the proposed router is able balance the traffic in hybrid NoCs to achieve low-latency communication under various traffic loads. Simulation shows that, the proposed router can reduce applications’ execution time by an average of 16.9% compared to low-latency routers such as SWIFT. By reducing the latency between 2D routers (or wired nodes) and 3D routers (or wireless nodes) the proposed router can improve performance efficiency in terms of average packet delay by an average of 45% (or 50%) in 3D NoCs (or WiNoCs)

    SWIFT: A Low-Power Network-On-Chip Implementing the Token Flow Control Router Architecture With Swing-Reduced Interconnects

    Get PDF
    A 64-bit, 8 Ă— 8 mesh network-on-chip (NoC) is presented that uses both new architectural and circuit design techniques to improve on-chip network energy-efficiency, latency, and throughput. First, we propose token flow control, which enables bypassing of flit buffering in routers, thereby reducing buffer size and their power consumption. We also incorporate reduced-swing signaling in on-chip links and crossbars to minimize datapath interconnect energy. The 64-node NoC is experimentally validated with a 2 Ă— 2 test chip in 90 nm, 1.2 V CMOS that incorporates traffic generators to emulate the traffic of the full network. Compared with a fully synthesized baseline 8 Ă— 8 NoC architecture designed to meet the same peak throughput, the fabricated prototype reduces network latency by 20% under uniform random traffic, when both networks are run at their maximum operating frequencies. When operated at the same frequencies, the SWIFT NoC reduces network power by 38% and 25% at saturation and low loads, respectively

    A comparative study of arbitration algorithms for the Alpha 21364 pipelined router

    Full text link
    Interconnection networks usually consist of a fabric of interconnected routers, which receive packets arriving at their input ports and forward them to appropriate output ports. Unfortunately, network packets moving through these routers are often delayed due to conflicting demand for resources, such as output ports or buffer space. Hence, routers typically employ arbiters that resolve conflicting resource demands to maximize the number of matches between packets waiting at input ports and free output ports. Efficient design and implementation of the algorithm running on these arbiters is critical to maximize network performance.This paper proposes a new arbitration algorithm called SPAA (Simple Pipelined Arbitration Algorithm), which is implemented in the Alpha 21364 processor's on-chip router pipeline. Simulation results show that SPAA significantly outperforms two earlier well-known arbitration algorithms: PIM (Parallel Iterative Matching) and WFA (Wave-Front Arbiter) implemented in the SGI Spider switch. SPAA outperforms PIM and WFA because SPAA exhibits matching capabilities similar to PIM and WFA under realistic conditions when many output ports are busy, incurs fewer clock cycles to perform the arbitration, and can be pipelined effectively. Additionally, we propose a new prioritization policy called the Rotary Rule, which prevents the network's adverse performance degradation from saturation at high network loads by prioritizing packets already in the network over new packets generated by caches or memory.Mukherjee, S.; Silla Jiménez, F.; Bannon, P.; Emer, J.; Lang, S.; Webb, D. (2002). A comparative study of arbitration algorithms for the Alpha 21364 pipelined router. ACM SIGPLAN Notices. 37(10):223-234. doi:10.1145/605432.605421S223234371

    Low Power Processor Architectures and Contemporary Techniques for Power Optimization – A Review

    Get PDF
    The technological evolution has increased the number of transistors for a given die area significantly and increased the switching speed from few MHz to GHz range. Such inversely proportional decline in size and boost in performance consequently demands shrinking of supply voltage and effective power dissipation in chips with millions of transistors. This has triggered substantial amount of research in power reduction techniques into almost every aspect of the chip and particularly the processor cores contained in the chip. This paper presents an overview of techniques for achieving the power efficiency mainly at the processor core level but also visits related domains such as buses and memories. There are various processor parameters and features such as supply voltage, clock frequency, cache and pipelining which can be optimized to reduce the power consumption of the processor. This paper discusses various ways in which these parameters can be optimized. Also, emerging power efficient processor architectures are overviewed and research activities are discussed which should help reader identify how these factors in a processor contribute to power consumption. Some of these concepts have been already established whereas others are still active research areas. © 2009 ACADEMY PUBLISHER

    DDRNoC: Dual Data-Rate Network-on-Chip

    Get PDF
    This paper introduces DDRNoC, an on-chip interconnection network able to route packets at Dual Data Rate. The cycle time of current 2D-mesh Network-on-Chip routers is limited by their control as opposed to the datapath (switch and link traversal) which exhibits significant slack. DDRNoC capitalizes on this observation allowing two flits per cycle to share the same datapath. Thereby, DDRNoC achieves higher throughput than a Single Data Rate (SDR) network. Alternatively, using lower voltage circuits, the above slack can be exploited to reduce power consumption while matching the SDR network throughput. In addition, DDRNoC exhibits reduced clock distribution power, improving energy efficiency, as it needs a slower clock than a SDR network that routes packets at the same rate. Post place and route results in 28 nm technology show that, compared to an iso-voltage (1.1V) SDR network, DDRNoC improves throughput proportionally to the SDR datapath slack. Moreover, a low-voltage (0.95V) DDRNoC implementation converts that slack to power reduction offering the 1.1V SDR throughput at a substantially lower energy cost

    BOOM: Broadcast Optimizations for On-chip Meshes

    Get PDF
    Future many-core chips will require an on-chip network that can support broadcasts and multicasts at good power-performance. A vanilla on-chip network would send multiple unicast packets for each broadcast packet, resulting in latency, throughput and power overheads. Recent research in on-chip multicast support has proposed forking of broadcast/multicast packets within the network at the router buffers, but these techniques are far from ideal, since they increase buffer occupancy which lowers throughput, and packets incur delay and power penalties at each router. In this work, we analyze an ideal broadcast mesh; show the substantial gaps between state-of-the-art multicast NoCs and the ideal; then propose BOOM, which comprises a WHIRL routing protocol that ideally load balances broadcast traffic, a mXbar multicast crossbar circuit that enables multicast traversal at similar energy-delay as unicasts, and speculative bypassing of buffering for multicast flits. Together, they enable broadcast packets to approach the delay, energy, and throughput of the ideal fabric. Our simulations show BOOM realizing an average network latency that is 5% off ideal, attaining 96% of ideal throughput, with energy consumption that is 9% above ideal. Evaluations using synthetic traffic show BOOM achieving a latency reduction of 61%, throughput improvement of 63%, and buffer power reduction of 80% as compared to a baseline broadcast. Simulations with PARSEC benchmarks show BOOM reducing average request and network latency by 40% and 15% respectively
    • …
    corecore