
    NaNet: a low-latency NIC enabling GPU-based, real-time low-level trigger systems

    Full text link
    We implemented NaNet, an FPGA-based PCIe Gen2 GbE/APElink NIC featuring GPUDirect RDMA capabilities and UDP protocol management offloading. NaNet is able to receive a UDP input data stream from its GbE interface and redirect it, without any intermediate buffering or CPU intervention, to the memory of a Fermi/Kepler GPU hosted on the same PCIe bus, provided that the two devices share the same upstream root complex. Synthetic benchmarks for latency and bandwidth are presented. We describe how NaNet can be employed in the prototype of the GPU-based RICH low-level trigger processor of the NA62 CERN experiment to implement the data link between the TEL62 readout boards and the low-level trigger processor. Results for the throughput and latency of the integrated system are presented and discussed. (Comment: Proceedings of the 20th International Conference on Computing in High Energy and Nuclear Physics, CHEP 2013.)
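
    Where the abstract says the datagram is redirected "without any intermediate buffering or CPU intervention", the contrast is with the standard receive path sketched below: every UDP payload is staged in host memory and then copied to the GPU explicitly. This is a minimal sketch of the path NaNet removes, not NaNet's own interface; the port number and buffer sizes are illustrative assumptions.

        /* Conventional, CPU-mediated UDP-to-GPU path (the hop NaNet removes).
         * NaNet's GPUDirect RDMA path writes payloads straight into GPU
         * memory instead of staging them in host RAM. */
        #include <arpa/inet.h>
        #include <cuda_runtime.h>
        #include <netinet/in.h>
        #include <string.h>
        #include <sys/socket.h>
        #include <unistd.h>

        #define PORT      5678   /* assumed data port                       */
        #define MAX_DGRAM 1472   /* max UDP payload on a 1500-byte-MTU link */

        int main(void)
        {
            int sock = socket(AF_INET, SOCK_DGRAM, 0);
            struct sockaddr_in addr;
            memset(&addr, 0, sizeof addr);
            addr.sin_family      = AF_INET;
            addr.sin_addr.s_addr = htonl(INADDR_ANY);
            addr.sin_port        = htons(PORT);
            bind(sock, (struct sockaddr *)&addr, sizeof addr);

            char host_buf[MAX_DGRAM];        /* host-side staging buffer  */
            void *gpu_buf;
            cudaMalloc(&gpu_buf, MAX_DGRAM); /* destination in GPU memory */

            for (;;) {
                ssize_t n = recv(sock, host_buf, sizeof host_buf, 0);
                if (n <= 0)
                    break;
                /* The extra copy NaNet eliminates: host RAM -> GPU memory. */
                cudaMemcpy(gpu_buf, host_buf, (size_t)n, cudaMemcpyHostToDevice);
            }
            cudaFree(gpu_buf);
            close(sock);
            return 0;
        }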

    Physically Dense Server Architectures.

    Full text link
    Distributed, in-memory key-value stores have emerged as one of today's most important data center workloads. Being critical for the scalability of modern web services, vast resources are dedicated to key-value stores in order to ensure that quality-of-service guarantees are met. These resources include: many server racks to store terabytes of key-value data, the power necessary to run all of the machines, networking equipment and bandwidth, and the data center warehouses used to house the racks. There is, however, a mismatch between the key-value store software and the commodity servers on which it is run, leading to inefficient use of resources. The primary cause of inefficiency is the overhead incurred from processing individual network packets, which typically carry small payloads and require minimal compute resources. Thus, one of the key challenges as we enter the exascale era is how to best adjust to the paradigm shift from compute-centric to storage-centric data centers. This dissertation presents a hardware/software solution that addresses the inefficiency issues present in the modern data centers on which key-value stores are currently deployed. First, it proposes two physical server designs, both of which use 3D-stacking technology and low-power CPUs to improve density and efficiency. The first 3D architecture---Mercury---consists of stacks of low-power CPUs with 3D-stacked DRAM. The second architecture---Iridium---replaces DRAM with 3D NAND Flash to improve density. The second portion of this dissertation proposes an enhanced version of the Mercury server design---called KeyVault---that incorporates integrated, zero-copy network interfaces along with an integrated switching fabric. In order to utilize the integrated networking hardware, as well as reduce the response time of requests, a custom networking protocol is proposed. Unlike prior works on accelerating key-value stores---e.g., by completely bypassing the CPU and OS when processing requests---this work only bypasses the CPU and OS when placing network payloads into a process' memory. The insight behind this is that because most of the overhead comes from processing packets in the OS kernel---and not the request processing itself---direct placement of a packet's payload is sufficient to provide higher throughput and lower latency than prior approaches. PhD, Computer Science and Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/111414/1/atgutier_1.pd
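
    To make the "direct payload placement" idea concrete, the following is a hedged sketch of a user-space worker polling a ring buffer that a zero-copy NIC is assumed to fill with request payloads, so the kernel never touches the small key-value packets. The ring layout and request format are hypothetical, not KeyVault's actual protocol.

        /* Hypothetical zero-copy request ring: the NIC is assumed to
         * deposit payloads directly into `ring` and set `ready`; the
         * worker consumes them with no kernel copy in between. */
        #include <stdint.h>
        #include <stdio.h>

        #define RING_SLOTS 1024
        #define VAL_MAX    64

        struct kv_req {                 /* hypothetical wire format           */
            uint8_t  op;                /* 0 = GET, 1 = SET                   */
            char     key[16];
            char     val[VAL_MAX];
            volatile uint8_t ready;     /* set by the NIC when the slot fills */
        };

        static struct kv_req ring[RING_SLOTS];  /* NIC-writable region        */

        static void serve_slot(struct kv_req *r)
        {
            if (r->op == 0)
                printf("GET %.16s\n", r->key);  /* hash-table lookup elided   */
            else
                printf("SET %.16s\n", r->key);  /* hash-table insert elided   */
            r->ready = 0;                       /* hand the slot back         */
        }

        int main(void)
        {
            for (unsigned i = 0; ; i = (i + 1) % RING_SLOTS)
                if (ring[i].ready)              /* payload landed here with   */
                    serve_slot(&ring[i]);       /* no OS involvement          */
        }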

    Network stack specialization for performance

    Full text link

    Architecting Efficient Data Centers.

    Full text link
    Data center power consumption has become a key constraint in continuing to scale Internet services. As our society’s reliance on “the Cloud” continues to grow, companies require an ever-increasing amount of computational capacity to support their customers. Massive warehouse-scale data centers have emerged, requiring 30 MW or more of total power capacity. Over the lifetime of a typical high-scale data center, power-related costs make up 50% of the total cost of ownership (TCO). Furthermore, the aggregate effect of data center power consumption across the country cannot be ignored. In total, data center energy usage has reached approximately 2% of aggregate consumption in the United States and continues to grow. This thesis addresses the need to increase computational efficiency to address this growing problem. It proposes a new class of power management techniques: coordinated full-system idle low-power modes to increase the energy proportionality of modern servers. First, we introduce the PowerNap server architecture, a coordinated full-system idle low-power mode which transitions in and out of an ultra-low-power nap state to save power during brief idle periods. While effective for uniprocessor systems, PowerNap relies on full-system idleness, and we show that such idleness disappears as the number of cores per processor continues to increase. We expose this problem in a case study of Google Web search, in which we demonstrate that coordinated full-system active power modes are necessary to reach energy proportionality and that PowerNap is ineffective because of a lack of idleness. To recover full-system idleness, we introduce DreamWeaver, architectural support for deep sleep. DreamWeaver allows a server to exchange latency for full-system idleness, allowing PowerNap-enabled servers to be effective, and provides a better latency-power savings tradeoff than existing approaches. Finally, this thesis investigates workloads which achieve efficiency through methodical cluster provisioning techniques. Using the popular memcached workload, this thesis provides examples of provisioning clusters for cost-efficiency given latency, throughput, and data set size targets. Ph.D., Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/91499/1/meisner_1.pd
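
    The energy-proportionality argument can be made concrete with a back-of-the-envelope model: average power is a utilization-weighted mix of the active and idle states, so lowering the idle floor is what closes the gap to the proportional ideal. The wattage figures below are illustrative assumptions, not measurements from the thesis.

        /* Average server power versus utilization, with and without a
         * PowerNap-style nap state.  The energy-proportional ideal is
         * power that scales linearly with utilization (u * p_peak). */
        #include <stdio.h>

        int main(void)
        {
            const double p_peak = 300.0; /* assumed power at full load (W)       */
            const double p_idle = 180.0; /* assumed conventional idle floor (W)   */
            const double p_nap  =  15.0; /* assumed PowerNap-style nap state (W)  */

            for (int i = 1; i <= 9; i += 2) {
                double u = i / 10.0;     /* average utilization                   */
                double conv = u * p_peak + (1.0 - u) * p_idle;
                double nap  = u * p_peak + (1.0 - u) * p_nap;
                printf("u=%.1f  conventional=%6.1f W  napping=%6.1f W  ideal=%6.1f W\n",
                       u, conv, nap, u * p_peak);
            }
            return 0;
        }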

    Management, Optimization and Evolution of the LHCb Online Network

    Get PDF
    The LHCb experiment is one of the four large particle detectors running at the Large Hadron Collider (LHC) at CERN. It is a forward single-arm spectrometer dedicated to testing the Standard Model through precision measurements of Charge-Parity (CP) violation and rare decays in the b quark sector. The LHCb experiment will operate at a luminosity of 2x10^32 cm^-2 s^-1; the proton-proton bunch crossing rate will be approximately 10 MHz. To select the interesting events, a two-level trigger scheme is applied: the first-level trigger (L0) and the high-level trigger (HLT). The L0 trigger is implemented in custom hardware, while the HLT is implemented in software running on the CPUs of the Event Filter Farm (EFF). The L0 trigger rate is defined at about 1 MHz, and the size of each event is about 35 kByte. It is a serious challenge to handle the resulting data rate (35 GByte/s). The Online system is a key part of the LHCb experiment, providing all the IT services. It consists of three major components: the Data Acquisition (DAQ) system, the Timing and Fast Control (TFC) system and the Experiment Control System (ECS). To provide the services, two large dedicated networks based on Gigabit Ethernet are deployed: one for the DAQ and another one for the ECS, referred to collectively as the Online network. A large network needs sophisticated monitoring for its successful operation. Commercial network management systems are quite expensive and difficult to integrate into the LHCb ECS. A custom network monitoring system has been implemented based on a Supervisory Control And Data Acquisition (SCADA) system called PVSS, which is used by the LHCb ECS; it is thus a homogeneous part of the LHCb ECS. In this thesis, it is demonstrated how a large-scale network can be monitored and managed using tools originally made for industrial supervisory control.

    The thesis is organized as follows: Chapter 1 gives a brief introduction to the LHC and B physics at the LHC, then describes all sub-detectors and the trigger and DAQ system of LHCb, from structure to performance. Chapter 2 first introduces the LHCb Online system and the dataflow, then focuses on the Online network design and its optimization. In Chapter 3, the SCADA system PVSS is introduced briefly; then the architecture and implementation of the network monitoring system are described in detail, including the front-end processes, the data communication and the supervisory layer. Chapter 4 first discusses packet sampling theory and one of the packet sampling mechanisms, sFlow, then demonstrates the applications of sFlow for network troubleshooting, traffic monitoring and anomaly detection. In Chapter 5, the upgrade of the LHC and LHCb is introduced, the possible architecture of the DAQ is discussed, and two candidate internetworking technologies (high-speed Ethernet and InfiniBand) are compared in different aspects for the DAQ. Three schemes based on 10 Gigabit Ethernet are presented and studied. Chapter 6 is a general summary of the thesis.
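
    Two of the numbers above follow from short arithmetic, sketched below: the 1 MHz L0 accept rate times the ~35 kByte event size gives the 35 GByte/s DAQ load, and an sFlow-style 1-in-N packet sample is scaled back up by N to estimate total traffic. The sampling ratio and sample count are assumed values for illustration.

        /* DAQ sizing and packet-sampling scale-up, from the figures
         * quoted in the abstract. */
        #include <stdio.h>

        int main(void)
        {
            /* DAQ throughput: L0 rate x event size. */
            const double rate_hz = 1.0e6;     /* L0 accept rate          */
            const double event_b = 35.0e3;    /* event size in bytes     */
            const double gbe_bps = 125.0e6;   /* 1 Gbit/s in bytes/s     */
            double daq_bps = rate_hz * event_b;   /* = 35 GByte/s        */
            printf("DAQ load: %.0f GByte/s, >= %.0f GbE links at line rate\n",
                   daq_bps / 1e9, daq_bps / gbe_bps);

            /* sFlow-style estimate: 1-in-N sampling scaled back up by N. */
            const long sampling_n = 2048;   /* assumed sampling ratio     */
            const long samples    = 12000;  /* assumed samples collected  */
            printf("sFlow estimate: ~%ld packets on the wire\n",
                   samples * sampling_n);
            return 0;
        }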

    A cross-stack, network-centric architectural design for next-generation datacenters

    Get PDF
    This thesis proposes a full-stack, cross-layer datacenter architecture based on in-network computing and near-memory processing paradigms. The proposed datacenter architecture is built atop two principles: (1) utilizing commodity, off-the-shelf hardware (i.e., processors, DRAM, and network devices) with minimal changes to their architecture, and (2) providing a standard interface to programmers for using the novel hardware. More specifically, the proposed datacenter architecture enables a smart network adapter to collectively compress/decompress the data exchanged between distributed DNN training nodes and to assist the operating system in performing aggressive processor power management. It also deploys specialized memory modules in the servers, capable of general-purpose computation and network connectivity. This thesis unlocks the potential of hardware and operating system co-design in architecting application-transparent, near-data processing hardware for improving the datacenter's performance, energy efficiency, and scalability. We evaluate the proposed datacenter architecture using a combination of full-system simulation, FPGA prototyping, and real-system experiments.
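
    As one concrete instance of what a smart network adapter could do to DNN training traffic, the sketch below applies uniform int8 quantization to a gradient buffer, cutting the bytes on the wire by 4x. This is a generic compression technique chosen for illustration, not necessarily the scheme the thesis implements.

        /* Uniform int8 quantization of a float32 gradient buffer: one
         * scale per buffer, 4x fewer payload bytes on the wire. */
        #include <math.h>
        #include <stdint.h>
        #include <stdio.h>

        /* Quantize n floats to int8 with a single scale; returns the scale. */
        static float quantize(const float *g, int8_t *q, size_t n)
        {
            float maxabs = 0.0f;
            for (size_t i = 0; i < n; i++)
                if (fabsf(g[i]) > maxabs)
                    maxabs = fabsf(g[i]);
            float scale = maxabs > 0.0f ? maxabs / 127.0f : 1.0f;
            for (size_t i = 0; i < n; i++)
                q[i] = (int8_t)lrintf(g[i] / scale);
            return scale;
        }

        int main(void)
        {
            float grad[8] = {0.02f, -0.5f, 0.13f, 0.9f,
                             -0.01f, 0.4f, -0.7f, 0.3f};
            int8_t wire[8];
            float scale = quantize(grad, wire, 8);
            /* Receiver-side decompression: g[i] ~= wire[i] * scale. */
            printf("scale=%g, payload shrinks from %zu to %zu bytes\n",
                   scale, sizeof grad, sizeof wire);
            return 0;
        }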

    Novel Operation Modes of Accelerated Neuromorphic Hardware

    Get PDF
    The hybrid operation mode relies on a combination of conventional computing resources and a neuromorphic, beyond-von-Neumann system to perform a joint real-time experiment. The interactive operation mode provides prompt feedback to the user and benefits from high experiment throughput. The performance of a custom transport-layer protocol connecting the accelerated neuromorphic system and the computer cluster is evaluated. Wire-speed performance is achieved between the host and eight FPGAs ((846.7 ± 1.2) MiB/s, 94% of wire speed), and between two hosts using 10-Gigabit Ethernet (> 99%) as well as 40GbE (> 99%) to explore scaling behavior. The software architecture to process neuronal network experiments at high rates is presented, including measurements which address the key performance indicators. During hybrid operation, the tight coupling between the two resources requires low-latency communication. Using a custom-developed software framework, the average one-way latency between two host computers connected via 10GbE is found to be (2.4 ± 0.2) μs, and (8.5 ± 0.4) μs to the neuromorphic system. A hybrid experiment is designed to demonstrate the hardware infrastructure and software framework. Starting from a conventional neuronal network simulation, the experiment is gradually migrated into a time-continuous experiment in which a host computer and the neuromorphic system interact in real time. Results of the intermediate steps and of the final, hybrid operation are evaluated.
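
    Latency figures like those above are typically derived from timestamped ping-pong exchanges; the sketch below measures a UDP round trip and halves it for a one-way estimate (the thesis's custom framework is assumed to handle clock synchronization more carefully). The peer address, port, and iteration count are assumptions.

        /* UDP ping-pong round-trip measurement; one-way latency is
         * approximated here as RTT / 2 against an assumed echo peer. */
        #include <arpa/inet.h>
        #include <netinet/in.h>
        #include <stdio.h>
        #include <string.h>
        #include <sys/socket.h>
        #include <time.h>
        #include <unistd.h>

        int main(void)
        {
            int sock = socket(AF_INET, SOCK_DGRAM, 0);
            struct sockaddr_in peer;
            memset(&peer, 0, sizeof peer);
            peer.sin_family = AF_INET;
            peer.sin_port   = htons(7777);                   /* assumed echo port */
            inet_pton(AF_INET, "192.0.2.1", &peer.sin_addr); /* example address   */
            connect(sock, (struct sockaddr *)&peer, sizeof peer);

            char buf[64] = {0};
            double total_us = 0.0;
            const int iters = 1000;                          /* assumed count     */
            for (int i = 0; i < iters; i++) {
                struct timespec t0, t1;
                clock_gettime(CLOCK_MONOTONIC, &t0);
                send(sock, buf, sizeof buf, 0);              /* peer echoes back  */
                recv(sock, buf, sizeof buf, 0);
                clock_gettime(CLOCK_MONOTONIC, &t1);
                total_us += (t1.tv_sec - t0.tv_sec) * 1e6 +
                            (t1.tv_nsec - t0.tv_nsec) / 1e3;
            }
            printf("mean RTT: %.2f us, one-way ~ %.2f us\n",
                   total_us / iters, total_us / iters / 2.0);
            close(sock);
            return 0;
        }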