18 research outputs found

    A High Speed Hardware Scheduler for 1000-port Optical Packet Switches to Enable Scalable Data Centers

    Meeting the exponential increase in the global demand for bandwidth has become a major concern for today's data centers. The scalability of any data center is defined by the maximum capacity and port count of the switching devices it employs, limited by total pin bandwidth on current electronic switch ASICs. Optical switches can provide higher capacity and port counts, and hence, can be used to transform data center scalability. We have recently demonstrated a 1000-port star-coupler based wavelength division multiplexed (WDM) and time division multiplexed (TDM) optical switch architecture offering a bandwidth of 32 Tbit/s with the use of fast wavelength-tunable transmitters and high-sensitivity coherent receivers. However, the major challenge in deploying such an optical switch to replace current electronic switches lies in designing and implementing a scalable scheduler capable of operating on packet timescales. In this paper, we present a pipelined and highly parallel electronic scheduler that configures the high-radix (1000-port) optical packet switch. The scheduler can process requests from 1000 nodes and allocate timeslots across 320 wavelength channels and 4000 wavelength-tunable transceivers within a time constraint of 1 μs. Using the Opencell NanGate 45 nm standard cell library, we show that the complete 1000-port parallel scheduler algorithm occupies a circuit area of 52.7 mm², 4-8x smaller than that of a high-performance switch ASIC, with a clock period of less than 8 ns, enabling 138 scheduling iterations to be performed in 1 μs. The performance of the scheduling algorithm is evaluated in comparison to maximal matching from graph theory and conventional software-based wavelength allocation heuristics. The parallel hardware scheduler is shown to achieve similar matching performance and network throughput while being orders of magnitude faster.
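
    As a quick sanity check on the timing figures quoted above, the short Python sketch below computes the clock period implied by 138 scheduling iterations in 1 μs. It assumes one iteration per clock cycle, which the abstract does not state; the function and constant names are illustrative only.

```python
# Timing figures quoted in the abstract above.
SCHEDULING_BUDGET_NS = 1000.0   # 1 microsecond time constraint
ITERATIONS = 138                # scheduling iterations within the budget


def max_clock_period_ns(budget_ns: float, iterations: int) -> float:
    """Longest clock period that fits `iterations` cycles in `budget_ns`.

    Assumes one scheduling iteration per clock cycle (our assumption,
    not a statement from the paper).
    """
    if iterations <= 0:
        raise ValueError("iterations must be positive")
    return budget_ns / iterations


period_ns = max_clock_period_ns(SCHEDULING_BUDGET_NS, ITERATIONS)
print(f"implied clock period: {period_ns:.2f} ns")
```

    Under that assumption the implied period is roughly 7.25 ns, which is consistent with the stated clock period of less than 8 ns.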

    Design and Analysis of a Novel Low Complexity and Low Power Ping Lock Arbiter by using EGDI based CMOS Technique

    Network-on-chip (NoC) provides a solution to the complications of the on-chip interconnect architecture in multi-core systems. It mainly consists of routers, links, and network interfaces. An essential component of an on-chip router is the arbiter, which significantly impacts the performance of the router. The arbiter should provide fast and fair arbitration when it is placed on the critical path of the system, where it contributes to the Critical Path Delay (CPD). The main aim of this research work is to design a novel arbiter for an effective network scheduler in complex real-time applications, while keeping resource usage and power consumption very low. Previously, a novel gate-level Ping Lock Arbiter (PLA) was designed to overcome the limited fairness of the Improved Ping Pong Arbiter (IPPA) with less delay, but its chip size and power consumption are very high. To overcome this problem, an Effective Gate Diffusion Input (EGDI) logic based CMOS scheme is used to design a novel Compact Ping Lock Arbiter (CPLA). The proposed CPLA is compared with the existing PLA based on a static CMOS scheme. The comparison between the conventional and proposed arbiters analyzes area, delay, and power using Tanner Tool 14.1 at 250 nm and 45 nm. The results show that the proposed CPLA consumes less power than the existing Ping Lock Arbiter.

    Energy Implications of Photonic Networks With Speculative Transmission

    Speculative transmission has been proposed to overcome the high latency of setting up end-to-end paths through photonic networks for computer systems. However, speculative transmission has implications for the energy efficiency of the network: control circuits are more complex and power hungry, and failed speculative transmissions must be repeated. Moreover, in future chip multiprocessors (CMPs) with integrated photonic network end points, a large proportion of the additional energy will be dissipated on the CMP. This paper compares the energy characteristics of scheduled and speculative chip-to-chip networks for shared memory computer systems on the scale of a rack. For this comparison, we use a novel speculative control plane which reduces energy consumption by eliminating duplicate packets from the allocation process. In addition, we consider photonic power gating to reduce processor chip energy dissipation and the energy impact of the choice between semiconductor optical amplifier and ring resonator switching technologies. We model photonic network elements using values from the published literature and determine the power consumption of the allocator and network adapter circuits, implemented in a commercial low leakage 45 nm CMOS process. The power dissipated on the CMP using speculative networks is shown to be roughly double that of scheduled networks at saturation load and an order of magnitude higher at low loads.

    An in-depth look at prior art in fast round-robin arbiter circuits

    Arbiters are found where shared resources exist, such as busses, switching fabrics, and processing elements. Round-robin is a fair arbitration method, where requestors get near-equal shares of a common resource or service. Round-robin arbitration (RRA) finds use in network switches/routers and processor boards/systems as well as many other applications that have concurrency. Today's electronic systems require arbiters with hundreds of ports (e.g., switching fabrics with virtual I/O queues) and clock speeds near the limits of even the latest microelectronics fabrication processes/libraries. Achieving high clock speeds in the presence of a large number of ports is only possible with highly parallel arbiter architectures. This paper presents an in-depth literature survey of previous work on this problem. It looks at RRA work in the literature in a bigger context, then defines the typical RRA problem (RRA_typical), and specifically investigates work on fast architectures that solve the RRA_typical problem. There are five such works that are really competitive. This report takes a very in-depth look at these works. It explains each architecture and how/why it works from a unique perspective that cannot be found in the original publication of that architecture. It also proposes improvements to these architectures. We wrote generators for the improved versions of these architectures. We will share a summary of synthesis results in this report, although a detailed account of how these results were obtained and their analysis is the subject of another (upcoming) publication.
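
    To make the round-robin policy described above concrete, here is a minimal behavioral sketch in Python. It is a sequential software model for illustration only, not any of the fast parallel architectures the survey covers; the function name and the rotating-pointer convention are assumptions.

```python
from typing import List, Optional, Tuple


def round_robin_grant(requests: List[bool], pointer: int) -> Tuple[Optional[int], int]:
    """Grant one requester, searching circularly from `pointer`.

    Returns (granted_index, next_pointer). After a grant the pointer moves
    to the position just past the winner, so that requester gets the lowest
    priority next time. With no active request, nothing is granted and the
    pointer is unchanged.
    """
    n = len(requests)
    for offset in range(n):
        idx = (pointer + offset) % n
        if requests[idx]:
            return idx, (idx + 1) % n
    return None, pointer


# Three requesters (0, 2, 3) keep requesting; grants rotate among them.
requests = [True, False, True, True]
pointer = 0
grants = []
for _ in range(6):
    winner, pointer = round_robin_grant(requests, pointer)
    grants.append(winner)
print(grants)
```

    With persistent requests the grants cycle through the active requesters in order, which is the near-equal sharing the abstract refers to. A hardware arbiter would evaluate all ports in parallel rather than scanning them one at a time.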

    Efficient Scheduling for SDMG CIOQ Switches

    Combined input and output queuing (CIOQ) switches are being considered as high-performance switch architectures due to their ability to achieve 100% throughput and perfectly emulate output queuing (OQ) switch performance with a small speedup factor S. To realize a speedup factor S, a conventional CIOQ switch requires the switching fabric and memories to operate S times faster than the line rate. In this paper, we propose to use a CIOQ switch with space-division multiplexing expansion and grouped input/output ports (SDMG CIOQ switch for short) to realize speedup while only requiring the switching fabric and memories to operate at the line rate. The cell scheduling problem for the SDMG CIOQ switch is abstracted as a bipartite k-matching problem. Using fluid model techniques, we prove that any maximal size k-matching algorithm on an SDMG CIOQ switch with an expansion factor of 2 can achieve 100% throughput, assuming input line arrivals satisfy the strong law of large numbers (SLLN) and no input/output line is oversubscribed. We further propose an efficient and starvation-free maximal size k-matching scheduling algorithm, kFRR, for the SDMG CIOQ switch. Simulation results show that kFRR achieves 100% throughput for SDMG CIOQ switches with an expansion factor of 2 under two SLLN traffic models, uniform traffic and polarized traffic, confirming our analysis.
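
    The bipartite k-matching abstraction mentioned above can be illustrated with a simple greedy procedure. The Python sketch below is not kFRR and makes no fairness or starvation guarantees; it only shows what a maximal k-matching is, under the assumption that each input group and each output group may appear in at most k matched pairs. All names are hypothetical.

```python
from typing import Dict, List, Tuple


def greedy_maximal_k_matching(
    requests: List[Tuple[int, int]], k: int
) -> List[Tuple[int, int]]:
    """Greedily pick (input_group, output_group) pairs, at most k per group.

    The result is maximal: every rejected request has an endpoint that was
    already at capacity k, and loads never decrease, so no rejected request
    could be added afterwards.
    """
    in_load: Dict[int, int] = {}
    out_load: Dict[int, int] = {}
    matched: List[Tuple[int, int]] = []
    for inp, out in requests:
        if in_load.get(inp, 0) < k and out_load.get(out, 0) < k:
            matched.append((inp, out))
            in_load[inp] = in_load.get(inp, 0) + 1
            out_load[out] = out_load.get(out, 0) + 1
    return matched


# Each tuple is one queued cell; repeated tuples mean several cells
# waiting between the same pair of groups.
cells = [(0, 0), (0, 1), (0, 2), (1, 0), (1, 0)]
matching = greedy_maximal_k_matching(cells, k=2)
print(matching)
```

    Because the result depends on the order in which requests are scanned, a practical scheduler would rotate that order (for example with round-robin pointers) to avoid starving any port.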

    A Bandwidth Control Arbitration for SoC Interconnections Performing Applications With Task Dependencies

    Current Systems-on-Chip (SoCs) execute applications with task dependencies that compete for shared resources such as buses, memories, and accelerators. In such a structure, the arbitration policy becomes a critical part of the system to guarantee access and bandwidth suitable for the competing applications. Some strategies proposed in the literature to cope with these issues are Round-Robin, Weighted Round-Robin, Lottery, Time Division Multiple Access (TDMA), and combinations. However, a fine-grained bandwidth control arbitration policy is missing from the literature. We propose an innovative arbitration policy based on opportunistic access and a supervised utilization of the bus in terms of transmitted flits (transmission units) that settle the access and fine-grained control. In our proposal, every competing element has a budget. Opportunistic access grants the bus to a requester even if the component has spent all its flits. Supervised debt keeps a record of every flit a component transmits when it has no flits left to spend. Our proposal applies to interconnection systems such as buses, switches, and routers. The presented approach achieves deadlock-free behavior even for applications with task dependencies in the scenarios analyzed through cycle-accurate simulation models. The synergy between opportunistic and supervised debt techniques outperforms Lottery, TDMA, and Weighted Round-Robin in terms of bandwidth control in the experimental studies performed.

    On-the-Fly Adaptive Routing for dragonfly interconnection networks

    Adaptive deadlock-free routing mechanisms are required to handle variable traffic patterns in dragonfly networks. However, distance-based deadlock avoidance mechanisms typically employed in dragonflies increase the router cost and complexity as a function of the maximum allowed path length. This paper presents on-the-fly adaptive routing (OFAR), a routing/flow-control scheme that decouples the routing and the deadlock avoidance mechanisms. OFAR allows for in-transit adaptive routing with local and global misrouting, without imposing dependencies between virtual channels, and relies on a deadlock-free escape subnetwork to avoid deadlock. This model lowers latency, increases throughput, and adapts faster to transient traffic than previously proposed mechanisms. The low capacity of the escape subnetwork makes it prone to congestion. A simple congestion management mechanism based on injection restriction is considered to avoid such issues. Finally, reliability is considered by introducing mechanisms to find multiple edge-disjoint Hamiltonian rings embedded on the dragonfly, allowing the use of multiple escape subnetworks.