251 research outputs found
Extending the performance of hybrid NoCs beyond the limitations of network heterogeneity
To meet the performance and scalability demands of the fast-paced technological growth towards exascale and Big-Data processing with the performance bottleneck of conventional metal based interconnects (wireline), alternative interconnect fabrics such as inhomogeneous three-dimensional integrated Network-on-Chip (3D NoC) and hybrid wired-wireless Network-on-Chip (WiNoC) have emanated as a cost-effective solution for emerging System-on-Chip (SoC) design. However, these interconnects trade-off optimized performance for cost by restricting the number of area and power hungry 3D routers and wireless nodes. Moreover, the non-uniform distributed traffic in chip multiprocessor (CMP) demands an on-chip communication infrastructure which can avoid congestion under high traffic conditions while possessing minimal pipeline delay at low-load conditions. To this end, in this paper, we propose a low-latency adaptive router with a low-complexity single-cycle bypassing mechanism to alleviate the performance degradation due to the slow 2D routers in such emerging hybrid NoCs. The proposed router transmits a flit using dimension-ordered routing (DoR) in the bypass datapath at low-loads. When the output port required for intra-dimension bypassing is not available, the packet is routed adaptively to avoid congestion. The router also has a simplified virtual channel allocation (VA) scheme that yields a non-speculative low-latency pipeline. By combining the low-complexity bypassing technique with adaptive routing, the proposed router is able balance the traffic in hybrid NoCs to achieve low-latency communication under various traffic loads. Simulation shows that, the proposed router can reduce applicationsâ execution time by an average of 16.9% compared to low-latency routers such as SWIFT. By reducing the latency between 2D routers (or wired nodes) and 3D routers (or wireless nodes) the proposed router can improve performance efficiency in terms of average packet delay by an average of 45% (or 50%) in 3D NoCs (or WiNoCs)
Recommended from our members
Indirect interconnection networks for high performance routers/switches
Routers form the backbone of the Internet; their kernel, structure, andconfiguration (scheduler) of the backplane (or switching fabrics) dominate the routersâperformance, scalability, reliability and cost. As higher performance is required with therapid development of the network applications, routerâs architecture has also evolvedfrom the shared backplane to switched backplane, which mainly uses the indirectinterconnection networks.The indirect interconnection networks include crossbar, MIN (multistageinterconnection networks) and some other irregular topologies. At present, most oftodayâs routers and switches are implemented on single crossbar with symmetric bufferarchitecture. In the first part of this dissertation, we introduce novel asymmetric bufferarchitecture for the crossbar in which a new port and a local shared bus are added. Wethen evaluate its performance and simulate under different bus arbitration and buffermanagement algorithms. Our studies indicate that we can get great improvement for thethroughput and low drop rate. Thus we could save a lot of expensive link bandwidth anddecrease the probability of congestion for the network.Single crossbar complexity increases at O(N2) in terms of crosspoint number,which become unacceptable for scalability as the port number (N) increases. A delta classself-routing MIN with complexity of O(NĂlog2N) has been widely used in the ATMswitches. But the reduction of crosspoint number results in considerable internal blocking.A number of scalable methods have been proposed to solve this problem. One of themuses more stages with recirculation architecture to reroute the deflected packets, whichgreatly increase the latency. In the second part of this dissertation, we propose aninterleaved multistage switching fabrics architecture and assess its throughput with ananalytical model and simulations. We compare this novel scheme with some previousparallel architectures and show its benefits. From extensive simulations under differenttraffic patterns and fault models, our interleaved architecture achieves better performancethan its counterpart of single panel fabric. Our interleaved scheme achieves speedups(over the single panel fabric) of 3.4 and 2.25 under uniform and hot-spot traffic patterns,respectively at maximum load (p=1). Moreover, the interleaved fabrics show greattolerance against internal hardware failures
An Energy and Performance Exploration of Network-on-Chip Architectures
In this paper, we explore the designs of a circuit-switched router, a wormhole router, a quality-of-service (QoS) supporting virtual channel router and a speculative virtual channel router and accurately evaluate the energy-performance tradeoffs they offer. Power results from the designs placed and routed in a 90-nm CMOS process show that all the architectures dissipate significant idle state power. The additional energy required to route a packet through the router is then shown to be dominated by the data path. This leads to the key result that, if this trend continues, the use of more elaborate control can be justified and will not be immediately limited by the energy budget. A performance analysis also shows that dynamic resource allocation leads to the lowest network latencies, while static allocation may be used to meet QoS goals. Combining the power and performance figures then allows an energy-latency product to be calculated to judge the efficiency of each of the networks. The speculative virtual channel router was shown to have a very similar efficiency to the wormhole router, while providing a better performance, supporting its use for general purpose designs. Finally, area metrics are also presented to allow a comparison of implementation costs
Heterogeneous Photonic Network-on-Chip with Dynamic Bandwidth Allocation
Advancements in the field of chip fabrication has facilitated in integrating more number of transistors in a given area which has lead to an era of multi-core processors. Future multi-core chips or chip multiprocessors (CMPs) will have hundreds of heterogeneous components including processing engines, custom logic, GPU units, programmable fabrics and distributed memory. Such multi-core chips are expected to run varied multiple parallel workloads simultaneously. Hence, different communicating cores will require different bandwidths leading to the necessity of a heterogeneous Network-on-Chip (NoC) architecture. Simply over-provisioning for performance will invariably result in loss of power efficiency. On the other hand, recent research has shown that photonic interconnects are capable of achieving high-bandwidth and energy-efficient on-chip data transfer. In this paper we propose a dynamic heterogeneous photonic NoC (d-HetPNOC) architecture with dynamic bandwidth allocation to achieve better performance and energy-efficiency compared to a homogeneous photonic NoC architecture with the same aggregate data bandwidth
Towards the practical design of performance-aware resilient wireless NoC architectures
Recently, an improved surface wave-enabled communication fabric has been proposed to solve the reliability issues of emerging hybrid wired-wireless Network-on-Chip (WiNoC) architectures. Thus, providing a promising solution to the performance and scalability demands of the fast-paced technological growth towards exascale and Big-Data processing on future System-on-Chip (SoC) design. However, WiNoCs trade-off optimized performance for cost by restricting the number of area and power hungry wireless nodes. Consequently, in this paper, we propose a low-latency adaptive router with a low-complexity single-cycle bypassing mechanism to alleviate the performance degradation due to the slow wired routers in such emerging hyhbrid NoCs. The proposed router is able to redistribute traffic in the network to alleviate average packet latency at both low and high traffic conditions. As a second contribution the paper presents an experimental evaluation of a practically implemented surface wave communication fabric. By reducing the latency between the wired nodes and wireless nodes the proposed router can improve performance efficiency in terms of average packet delay by an average of 50% in WiNoCs
High Peformance and Low Power On-Die Interconnect Fabrics.
Increasing power density with technology scaling has caused stagnation in operating frequency of modern day microprocessors. This has led designers to prefer multicore architectures over complex monolithic processors to keep up with the demand for rising computing throughput. Although processing units are getting smaller and simpler, the dramatic rise of their count on a single die has made the fabric that connects these processing units increasingly complex. These interconnect fabrics have become a bottleneck in improving overall system effciency. As a result, the design paradigm for multi-core chips is gradually shifting from a core-centric architecture towards an interconnect-centric architecture, where system efficiency is limited by the fabric rather than the processing ability of any individual core. This dissertation introduces three novel and synergistic circuit techniques to improve scalability of switch fabrics to make on-die integration of hundreds to thousands of cores feasible. 1) A matrix topology is proposed for designing a fully connected switch fabric that re-uses output buses for programming, and stores shue congurations at cross points. This significantly reduces routing congestion, lowers area/power, and improves per-
formance. Silicon measurements demonstrate 47% energy savings in a 64-lane SIMD processor fabricated in 65nm CMOS over a conventional implementation. 2) A novel approach to handle high radix arbitration along with data routing is proposed. It optimally uses existing cross-bar interconnect resources without requiring any additional overhead. Bandwidth exceeding 2Tb/s is recorded in a test prototype fabricated in 65nm. 3) Building on the later, a new circuit topology to manage and update priority adaptively within the switch fabric without incurring additional delay or area is then proposed. Several assist circuit techniques, such as a thyristor based sense amplifier and self regenerating bi-directional repeaters are proposed for high speed energy efficient signaling to and from the switch fabric to improve overall routing efficiency.
Using these techniques a 64 x 64 switch fabric with 128b data bus fabricated in 45nm achieves a throughput of 4.5Tb/s at single cycle latency while operating at 559MHz.Ph.D.Electrical EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/91506/1/sudhirks_1.pd
An efficient 2D router architecture for extending the performance of inhomogeneous 3D NoC-based multi-core architectures
To meet the performance and scalability demands of the fast-paced technological growth towards exascale and Big-Data processing with the performance bottleneck of conventional metal based interconnects, alternative interconnect fabrics such as inhomogeneous three dimensional integrated Network-on-Chip (3D NoC) has emanated as a cost-effective solution for emerging multi-core design. However, these interconnects trade-off optimized performance for cost by restricting the number of area and power hungry 3D routers. Consequently, in this paper, we propose a low-latency adaptive router with a low-complexity single-cycle bypassing mechanism to alleviate the performance degradation due to the slow 2D routers in inhomogeneous 3D NoCs. By combining the low-complexity bypassing technique with adaptive routing, the proposed router is able to balance the traffic in the network to reduce the average packet latency under various traffic loads. Simulation shows that, the proposed router can reduce the average packet delay by an average of 45% in 3D NoCs
Fabric-on-a-Chip: Toward Consolidating Packet Switching Functions on Silicon
The switching capacity of an Internet router is often dictated by the memory bandwidth required to bu€er arriving packets. With the demand for greater capacity and improved service provisioning, inherent memory bandwidth limitations are encountered rendering input queued (IQ) switches and combined input and output queued (CIOQ) architectures more practical. Output-queued (OQ) switches, on the other hand, offer several highly desirable performance characteristics, including minimal average packet delay, controllable Quality of Service (QoS) provisioning and work-conservation under any admissible traffic conditions. However, the memory bandwidth requirements of such systems is O(NR), where N denotes the number of ports and R the data rate of each port. Clearly, for high port densities and data rates, this constraint dramatically limits the scalability of the switch.
In an effort to retain the desirable attributes of output-queued switches, while significantly reducing the memory bandwidth requirements, distributed shared memory architectures, such as the parallel shared memory (PSM) switch/router, have recently received much attention. The principle advantage of the PSM architecture is derived from the use of slow-running memory units operating in parallel to distribute the memory bandwidth requirement. At the core of the PSM architecture is a memory management algorithm that determines, for each arriving packet, the memory unit in which it will be placed. However, to date, the computational complexity of this algorithm is O(N), thereby limiting the scalability of PSM switches.
In an effort to overcome the scalability limitations, it is the goal of this dissertation to extend existing shared-memory architecture results while introducing the notion of Fabric on a Chip (FoC). In taking advantage of recent advancements in integrated circuit technologies, FoC aims to facilitate the consolidation of as many packet switching functions as possible on a single chip. Accordingly, this dissertation introduces a novel pipelined memory management algorithm, which plays a key role in the context of on-chip output- queued switch emulation. We discuss in detail the fundamental properties of the proposed scheme, along with hardware-based implementation results that illustrate its scalability and performance attributes.
To complement the main effort and further support the notion of FoC, we provide performance analysis of output queued cell switches with heterogeneous traffic. The result is a flexible tool for obtaining bounds on the memory requirements in output queued switches under a wide range of traÂą c scenarios. Additionally, we present a reconfigurable high-speed hardware architecture for real-time generation of packets for the various traffic scenarios. The work presented in this thesis aims at providing pragmatic foundations for designing next-generation, high-performance Internet switches and routers
- âŠ