123 research outputs found

    Power Control for Crossbar-based Input-Queued Switches

    Abstract—We consider an N × N input-queued switch with a crossbar-based switching fabric implemented on a single chip. The power consumed by the crossbar chip for data transfer grows as NR³, where R is the maximum bit rate. Thus, as bit rates increase, power dissipation becomes an increasingly serious limit on crossbar scalability for high-performance switches. We propose to exploit Dynamic Voltage and Frequency Scaling (DVFS) techniques to control packet transmissions through each crosspoint of the switching fabric. Our power control operates independently of the packet scheduler and exploits knowledge of the traffic matrix obtained by on-line measurements. We propose a family of control algorithms to reduce the power consumption; the algorithms are particularly efficient in non-overloaded conditions. The actual potential of the proposed approach is also evaluated on a real design case synthesized in a 90 nm CMOS technology. Index Terms—Input-queued switch, power control, dynamic voltage and frequency scaling.
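    A minimal sketch of the kind of control loop this abstract describes may help: each crosspoint's rate is chosen from on-line traffic measurements, independently of the packet scheduler, with power growing cubically in the rate. The class name, the EWMA estimator, and the headroom margin are illustrative assumptions, not the authors' algorithm.

```python
# Illustrative sketch only: per-crosspoint DVFS rate selection from measured
# traffic. The cubic power model and the 'headroom' margin are assumptions.

class CrosspointDVFS:
    def __init__(self, max_rate, alpha=0.2, headroom=1.1):
        self.max_rate = max_rate      # R: maximum bit rate of this crosspoint
        self.alpha = alpha            # EWMA weight for on-line measurements
        self.headroom = headroom      # keep the rate slightly above the estimate
        self.estimate = 0.0           # estimated arrival rate at this crosspoint

    def update_measurement(self, measured_rate):
        # On-line traffic measurement: exponentially weighted moving average.
        self.estimate = (1 - self.alpha) * self.estimate + self.alpha * measured_rate

    def select_rate(self):
        # Runs independently of the packet scheduler: pick the lowest rate that
        # still covers the estimated load, clamped to the hardware maximum.
        return min(self.max_rate, self.headroom * self.estimate)

    def relative_power(self, rate):
        # Power grows cubically with the transmission rate (P proportional to R^3),
        # so even a modest rate reduction yields a large power saving.
        return (rate / self.max_rate) ** 3
```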

    Dynamic Voltage and Frequency Scaling Control for Crossbars in Input-Queued Switches

    The power consumption of chips in general, and of crossbar switching fabrics in particular, grows with the maximum sustainable throughput. Due to fast-increasing traffic demands, the performance scalability of crossbars is severely limited by the capability of cooling the hardware devices. Hence, reducing power consumption is an important design question for improving crossbar switching performance. We propose to leverage the Dynamic Voltage and Frequency Scaling (DVFS) hardware technique for the switching fabric. The main idea is to exploit temporary underloaded conditions to decrease the crossbar transmission rate while preserving maximum throughput. Unlike previous works, we consider a scenario in which the arrival rates are not known in advance. Our proposed architecture is based on a power controller which runs periodically and independently of the packet scheduler, and whose decisions are based on real-time estimation of the arrival rates. We discuss the performance tradeoff in terms of throughput, delays and power, and show the relevant performance gain due to the use of DVFS in controlling the crossbar.
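    Since the controller here runs periodically, independently of the scheduler, and relies on real-time estimation of unknown arrival rates, a small sketch of such a loop is given below. The counter-based estimator, the control period, and the discrete frequency ladder are assumptions for illustration only.

```python
# Illustrative sketch only: a periodic power controller decoupled from the packet
# scheduler. Counter names, the control period, and the discrete frequency ladder
# are assumptions, not the paper's design.

FREQUENCY_STEPS = [0.25, 0.5, 0.75, 1.0]   # normalized DVFS operating points

def control_step(byte_counters, period, capacity, state, alpha=0.3):
    """Run once per control period; the scheduler keeps running independently."""
    new_rates = {}
    for port, transferred in byte_counters.items():
        measured = transferred / period              # arrival rates are unknown in
        prev = state.get(port, 0.0)                  # advance, so estimate them
        est = (1 - alpha) * prev + alpha * measured  # from real-time counters
        state[port] = est
        # Pick the smallest operating point whose rate still covers the estimated
        # load, so maximum throughput is preserved while saving power at low load.
        load = est / capacity
        new_rates[port] = next((f for f in FREQUENCY_STEPS if f >= load), 1.0)
    return new_rates
```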

    Low latency optical switch for high performance computing with minimized processor energy load [Invited]

    Power density and cooling issues are limiting the performance of high performance chip multiprocessors (CMPs), and off-chip communications currently consume more than 20% of power for memory, coherence, PCI, and Ethernet links. Photonic transceivers integrated with CMPs are being developed to overcome these issues, potentially allowing low hop count switched connections between chips or data center servers. However, latency in setting up optical connections is critically important in all computing applications, and having transceivers integrated on the processor chip also pushes other network functions and their associated power consumption onto the chip. In this paper, we propose a low latency optical switch architecture that minimizes the power consumed on the processor chip for two scenarios: multiple-socket shared memory coherence networks and optical top-of-rack switches for data centers. The switch architecture reduces power consumed on the CMP using a control plane with a simplified send-and-forget server interface and a hybrid Mach–Zehnder interferometer and semiconductor optical amplifier integrated optical switch with electronic buffering. Results show that the proposed architecture offers a 42% reduction in head latency at low loads compared with a conventional scheduled optical switch, as well as increased performance for streaming and incast traffic patterns. Power dissipated on the server chip is shown to be reduced by more than 60% compared with a scheduled optical switch architecture with ring resonator switching. This work was supported by the UK Engineering and Physical Sciences Research Council (EPSRC) INTERNET program grant and an EPSRC Fellowship grant to Philip Watts. Published in the Journal of Optical Communications and Networking, Vol. 7, Issue 3, pp. A498–A510 (2015): http://dx.doi.org/10.1364/JOCN.7.00A49
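    The contrast between a scheduled control plane and the send-and-forget interface can be illustrated with a toy sketch: under scheduling the server waits for a grant before transmitting, while with send-and-forget it transmits immediately and contention is absorbed by electronic buffers at the switch. The MockNic class, its methods, and the timing units are hypothetical, not the paper's interface.

```python
# Illustrative sketch only: scheduled (request a grant, wait, then transmit)
# versus send-and-forget (transmit immediately; the switch buffers on contention).
# The MockNic API and timings are hypothetical placeholders.

class MockNic:
    def __init__(self, grant_round_trip=1.0):
        self.grant_round_trip = grant_round_trip   # arbiter round-trip time (arbitrary units)
        self.log = []

    def request_grant(self, dest):
        # Scheduled control plane: the grant round trip adds head latency and keeps
        # allocation logic, and its power, on or near the processor chip.
        self.log.append(("grant-request", dest))
        return self.grant_round_trip

    def transmit(self, dest, after=0.0):
        self.log.append(("tx", dest, after))
        return after

def scheduled_send(nic, dest):
    wait = nic.request_grant(dest)
    return nic.transmit(dest, after=wait)          # head latency includes the wait

def send_and_forget(nic, dest):
    return nic.transmit(dest)                      # no grant round trip on the server;
                                                   # contention is absorbed at the switch

nic = MockNic()
print("scheduled head latency:", scheduled_send(nic, dest=3))
print("send-and-forget head latency:", send_and_forget(nic, dest=3))
```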

    Scalable, accurate multicore simulation in the 1000-core era

    We present HORNET, a parallel, highly configurable, cycle-level multicore simulator based on an ingress-queued wormhole-router NoC architecture. The parallel simulation engine offers cycle-accurate as well as periodic synchronization; while preserving functional accuracy, this permits a tradeoff between perfect timing accuracy and higher speed at very good accuracy. When run on 6 separate physical cores on a single die, speedups exceed 5×, and when run on a two-die 12-core system with 2-way hyperthreading, speedups exceed 11×. Most hardware parameters are configurable, including memory hierarchy, interconnect geometry, bandwidth, crossbar dimensions, and parameters driving power and thermal effects. A highly parametrized table-based NoC design allows a variety of routing and virtual channel allocation algorithms out of the box, ranging from simple DOR routing to complex Valiant, ROMM, or PROM schemes, BSOR, and adaptive routing. HORNET can run in network-only mode using synthetic traffic or traces, directly emulate a MIPS-based multicore, or function as the memory subsystem for native applications executed under the Pin instrumentation tool. HORNET is freely available under the open-source MIT license at http://csg.csail.mit.edu/hornet/
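    As a concrete reference point for the simplest routing scheme listed above, the sketch below shows textbook dimension-ordered (XY) routing on a 2D mesh; it is a generic formulation, not HORNET's table-based implementation or API.

```python
# Illustrative sketch only: dimension-ordered (XY) routing on a 2D mesh, the
# simplest of the routing schemes the abstract lists. Generic textbook version,
# not HORNET's routing tables.

def dor_next_hop(current, dest):
    """Return the output direction for one hop: route fully in X, then in Y."""
    cx, cy = current
    dx, dy = dest
    if cx != dx:                       # first correct the X coordinate
        return "E" if dx > cx else "W"
    if cy != dy:                       # then correct the Y coordinate
        return "N" if dy > cy else "S"
    return "LOCAL"                     # arrived: eject to the local core

# Example: hop-by-hop path from (0, 0) to (2, 1).
pos = (0, 0)
moves = {"E": (1, 0), "W": (-1, 0), "N": (0, 1), "S": (0, -1)}
while (d := dor_next_hop(pos, (2, 1))) != "LOCAL":
    pos = (pos[0] + moves[d][0], pos[1] + moves[d][1])
    print(d, pos)
```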

    Energy Implications of Photonic Networks With Speculative Transmission

    Speculative transmission has been proposed to overcome the high latency of setting up end-to-end paths through photonic networks for computer systems. However, speculative transmission has implications for the energy efficiency of the network: in particular, control circuits are more complex and power hungry, and failed speculative transmissions must be repeated. Moreover, in future chip multiprocessors (CMPs) with integrated photonic network end points, a large proportion of the additional energy will be dissipated on the CMP. This paper compares the energy characteristics of scheduled and speculative chip-to-chip networks for shared memory computer systems on the scale of a rack. For this comparison, we use a novel speculative control plane which reduces energy consumption by eliminating duplicate packets from the allocation process. In addition, we consider photonic power gating to reduce processor chip energy dissipation and the energy impact of the choice between semiconductor optical amplifier and ring resonator switching technologies. We model photonic network elements using values from the published literature as well as determine the power consumption of the allocator and network adapter circuits, implemented in a commercial low leakage 45 nm CMOS process. The power dissipated on the CMP using speculative networks is shown to be roughly double that of scheduled networks at saturation load and an order of magnitude higher at low loads.
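    A toy power-accounting sketch can make the qualitative trade-off explicit: speculative control circuits add static power that is paid even at low load, and failed speculations inflate the dynamic energy through retries. All constants and the failure model below are placeholders, not values from the paper's 45 nm implementation.

```python
# Illustrative sketch only: a toy chip-power model for comparing scheduled and
# speculative transmission. Numbers and the failure model are placeholders.

def chip_power(p_static_ctrl, e_packet, load, p_fail=0.0):
    # Static power of the control circuits is paid regardless of load; failed
    # speculative transmissions must be repeated, inflating the dynamic term.
    dynamic = load * e_packet * (1.0 + p_fail)
    return p_static_ctrl + dynamic

for load in (0.1, 0.5, 0.9):
    scheduled = chip_power(p_static_ctrl=0.05, e_packet=1.0, load=load)
    speculative = chip_power(p_static_ctrl=0.50, e_packet=1.0, load=load,
                             p_fail=0.5 * load)   # collisions grow with load
    print(f"load={load}: scheduled={scheduled:.2f}, speculative={speculative:.2f}")
```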

    Host and Network Optimizations for Performance Enhancement and Energy Efficiency in Data Center Networks

    Modern data centers host hundreds of thousands of servers to achieve economies of scale. Such a huge number of servers creates challenges for the data center network (DCN) to provide proportionally large bandwidth. In addition, the deployment of virtual machines (VMs) in data centers raises the requirements for efficient resource allocation and fine-grained resource sharing. Further, the large number of servers and switches in the data center consume significant amounts of energy. Even though servers become more energy efficient with various energy saving techniques, the DCN still accounts for 20% to 50% of the energy consumed by the entire data center. The objective of this dissertation is to enhance DCN performance as well as its energy efficiency by conducting optimizations on both the host and network sides. First, as the DCN demands huge bisection bandwidth to interconnect all the servers, we propose a parallel packet switch (PPS) architecture that directly processes variable length packets without segmentation-and-reassembly (SAR). The proposed PPS achieves large bandwidth by combining the switching capacities of multiple fabrics, and it further improves the switch throughput by avoiding padding bits in SAR. Second, since certain resource demands of the VM are bursty and demonstrate stochastic nature, to satisfy both deterministic and stochastic demands in VM placement, we propose the Max-Min Multidimensional Stochastic Bin Packing (M3SBP) algorithm (sketched below). M3SBP calculates an equivalent deterministic value for the stochastic demands, and maximizes the minimum resource utilization ratio of each server. Third, to provide necessary traffic isolation for VMs that share the same physical network adapter, we propose the Flow-level Bandwidth Provisioning (FBP) algorithm. By reducing the flow scheduling problem to multiple stages of packet queuing problems, FBP guarantees the provisioned bandwidth and delay performance for each flow. Finally, while DCNs are typically provisioned with full bisection bandwidth, DCN traffic demonstrates fluctuating patterns; we therefore propose a joint host-network optimization scheme to enhance the energy efficiency of DCNs during off-peak traffic hours. The proposed scheme utilizes a unified representation method that converts the VM placement problem to a routing problem and employs depth-first and best-fit search to find efficient paths for flows.
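    The M3SBP step is concrete enough to sketch: each VM's stochastic demand is collapsed to an equivalent deterministic value and servers are chosen to maximize the minimum utilization ratio. The mean-plus-k-standard-deviations rule and the greedy placement below are illustrative assumptions, not the dissertation's exact algorithm.

```python
# Illustrative sketch only: the M3SBP idea as described in the abstract. The
# mean + k*std equivalence rule and the greedy placement are assumptions.

def equivalent_demand(mean, std, k=2.0):
    # Collapse a stochastic demand into a single deterministic value that the
    # actual demand stays below with high probability.
    return mean + k * std

def place_vms(vms, servers):
    """vms: list of per-dimension (mean, std) pairs; servers: list of capacity tuples."""
    used = [[0.0] * len(servers[0]) for _ in servers]
    placement = []
    for vm in vms:
        demand = [equivalent_demand(m, s) for (m, s) in vm]
        best, best_min_util = None, -1.0
        for i, cap in enumerate(servers):
            new_used = [u + d for u, d in zip(used[i], demand)]
            if any(nu > c for nu, c in zip(new_used, cap)):
                continue                      # VM does not fit on this server
            # Max-min objective: prefer the placement that maximizes the minimum
            # per-dimension utilization ratio of the chosen server.
            min_util = min(nu / c for nu, c in zip(new_used, cap))
            if min_util > best_min_util:
                best, best_min_util = i, min_util
        if best is None:
            raise RuntimeError("no server can host this VM")
        used[best] = [u + d for u, d in zip(used[best], demand)]
        placement.append(best)
    return placement

# Two servers with (CPU, memory) capacity; VM demands given as (mean, std) pairs.
servers = [(8.0, 16.0), (8.0, 16.0)]
vms = [[(2.0, 0.5), (4.0, 1.0)], [(1.0, 0.2), (2.0, 0.5)]]
print(place_vms(vms, servers))
```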