102 research outputs found
NaNet: a low-latency NIC enabling GPU-based, real-time low level trigger systems
We implemented the NaNet FPGA-based PCIe Gen2 GbE/APElink NIC, featuring
GPUDirect RDMA capabilities and UDP protocol management offloading. NaNet is
able to receive a UDP input data stream from its GbE interface and redirect it,
without any intermediate buffering or CPU intervention, to the memory of a
Fermi/Kepler GPU hosted on the same PCIe bus, provided that the two devices
share the same upstream root complex. Synthetic benchmarks for latency and
bandwidth are presented. We describe how NaNet can be employed in the prototype
of the GPU-based RICH low-level trigger processor of the NA62 CERN experiment,
to implement the data link between the TEL62 readout boards and the low level
trigger processor. Results for the throughput and latency of the integrated
system are presented and discussed.
Comment: Proceedings for the 20th International Conference on Computing in
High Energy and Nuclear Physics (CHEP
Physically Dense Server Architectures.
Distributed, in-memory key-value stores have emerged as one of today's most
important data center workloads. Being critical for the scalability of modern
web services, vast resources are dedicated to key-value stores in order
to ensure that quality of service guarantees are met. These resources include:
many server racks to store terabytes of key-value data, the power necessary to
run all of the machines, networking equipment and bandwidth, and the data center
warehouses used to house the racks.
There is, however, a mismatch between the key-value store software and the
commodity servers on which it is run, leading to inefficient use of resources.
The primary cause of inefficiency is the overhead incurred from processing
individual network packets, which typically carry small payloads, and require
minimal compute resources. Thus, one of the key challenges as we enter the
exascale era is how to best adjust to the paradigm shift from compute-centric
to storage-centric data centers.
This dissertation presents a hardware/software solution that addresses the
inefficiency issues present in the modern data centers on which key-value
stores are currently deployed. First, it proposes two physical server
designs, both of which use 3D-stacking technology and low-power CPUs to improve
density and efficiency. The first 3D architecture---Mercury---consists of stacks
of low-power CPUs with 3D-stacked DRAM. The second
architecture---Iridium---replaces DRAM with 3D NAND Flash to improve density.
The second portion of this dissertation proposes an enhanced version of the
Mercury server design---called KeyVault---that incorporates integrated,
zero-copy network interfaces along with an integrated switching fabric. In order
to utilize the integrated networking hardware, as well as reduce the
response time of requests, a custom networking protocol is proposed. Unlike
prior works on accelerating key-value stores---e.g., by completely bypassing the
CPU and OS when processing requests---this work only bypasses the CPU and OS
when placing network payloads into a process' memory. The insight behind this is
that because most of the overhead comes from processing packets in the OS
kernel---and not the request processing itself---direct placement of a packet's
payload is sufficient to provide higher throughput and lower latency than prior
approaches.
PhD, Computer Science and Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies
http://deepblue.lib.umich.edu/bitstream/2027.42/111414/1/atgutier_1.pd
Architecting Efficient Data Centers.
Data center power consumption has become a key constraint in continuing to scale Internet services. As our society’s reliance on “the Cloud” continues to grow, companies require an ever-increasing amount of computational capacity to support their customers. Massive warehouse-scale data centers have emerged, requiring 30 MW or more of total power capacity. Over the lifetime of a typical high-scale data center, power-related costs make up 50% of the total cost of ownership (TCO). Furthermore, the aggregate effect of data center power consumption across the country cannot be ignored. In total, data center energy usage has reached approximately 2% of aggregate consumption in the United States and continues to grow.
This thesis addresses the need to increase computational efficiency to address this growing problem. It proposes a new class of power management techniques: coordinated full-system idle low-power modes to increase the energy proportionality of modern servers. First, we introduce the PowerNap server architecture, a coordinated full-system idle low-power mode which transitions in and out of an ultra-low power nap state to save power during brief idle periods. While effective for uniprocessor systems, PowerNap relies on full-system idleness and we show that such idleness disappears as the number of cores per processor continues to increase. We expose this problem in a case study of Google Web search in which we demonstrate that coordinated full-system active power modes are necessary to reach energy proportionality and that PowerNap is ineffective because of a lack of idleness. To recover full-system idleness, we introduce DreamWeaver, architectural support for deep sleep. DreamWeaver allows a server to exchange latency for full-system idleness, which makes PowerNap-enabled servers effective and provides a better latency-power savings tradeoff than existing approaches. Finally, this thesis investigates workloads which achieve efficiency through methodical cluster provisioning techniques. Using the popular memcached workload, this thesis provides examples of provisioning clusters for cost-efficiency given latency, throughput, and data set size targets.
Ph.D., Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies
http://deepblue.lib.umich.edu/bitstream/2027.42/91499/1/meisner_1.pd
Management, Optimization and Evolution of the LHCb Online Network
The LHCb experiment is one of the four large particle detectors running at the
Large Hadron Collider (LHC) at CERN. It is a forward single-arm spectrometer dedicated to testing the Standard Model through precision measurements of
Charge-Parity (CP) violation and rare decays in the b quark sector. The LHCb
experiment will operate at a luminosity of 2×10^32 cm^-2 s^-1, and the proton-proton
bunch crossing rate will be approximately 10 MHz. To select the interesting
events, a two-level trigger scheme is applied: the first level trigger (L0) and the
high level trigger (HLT). The L0 trigger is implemented in custom hardware,
while the HLT is implemented in software that runs on the CPUs of the Event Filter
Farm (EFF). The L0 trigger rate is defined at about 1 MHz, and the event size
for each event is about 35 kByte. It is a serious challenge to handle the resulting
data rate (35 GByte/s).
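The quoted data rate follows directly from the two figures above. A minimal sketch of the arithmetic, assuming decimal units (1 kByte = 1000 bytes); the constant and function names are illustrative and do not come from the thesis:

```python
# Back-of-the-envelope check of the DAQ input rate quoted in the abstract.
# Assumption: decimal units (1 kByte = 1000 bytes, 1 GByte = 1e9 bytes).
L0_TRIGGER_RATE_HZ = 1_000_000   # about 1 MHz of events accepted by L0
EVENT_SIZE_BYTES = 35_000        # about 35 kByte per event


def data_rate_bytes_per_s(rate_hz: int, event_size_bytes: int) -> int:
    """Return the aggregate data rate in bytes per second."""
    return rate_hz * event_size_bytes


rate = data_rate_bytes_per_s(L0_TRIGGER_RATE_HZ, EVENT_SIZE_BYTES)
print(f"{rate / 1e9:.0f} GByte/s")      # aggregate rate in bytes
print(f"{rate * 8 / 1e9:.0f} Gbit/s")   # same rate expressed in bits
```

Expressed in bits, the same rate is 280 Gbit/s, which gives a sense of why a network built from Gigabit Ethernet links needs many parallel links to carry it.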
The Online system is a key part of the LHCb experiment, providing all the
IT services. It consists of three major components: the Data Acquisition (DAQ)
system, the Timing and Fast Control (TFC) system and the Experiment Control
System (ECS). To provide the services, two large dedicated networks based on
Gigabit Ethernet are deployed: one for DAQ and another one for ECS, which are
referred to collectively as the Online network. A large network needs sophisticated monitoring for its successful operation. Commercial network management systems are
quite expensive and difficult to integrate into the LHCb ECS. A custom network
monitoring system has been implemented based on a Supervisory Control And
Data Acquisition (SCADA) system called PVSS which is used by LHCb ECS. It
is a homogeneous part of the LHCb ECS. In this thesis, it is demonstrated how
a large scale network can be monitored and managed using tools originally made
for industrial supervisory control.
The thesis is organized as follows:
Chapter 1 gives a brief introduction to LHC and the B physics on LHC,
then describes all sub-detectors and the trigger and DAQ system of LHCb from
structure to performance.
Chapter 2 first introduces the LHCb Online system and the dataflow, then
focuses on the Online network design and its optimization.
In Chapter 3, the SCADA system PVSS is introduced briefly, then the
architecture and implementation of the network monitoring system are described
in detail, including the front-end processes, the data communication and the
supervisory layer.
Chapter 4 first discusses the packet sampling theory and one of the packet
sampling mechanisms: sFlow, then demonstrates the applications of sFlow for
the network trouble-shooting, the traffic monitoring and the anomaly detection.
In Chapter 5, the upgrade of LHC and LHCb is introduced, the possible
architecture of DAQ is discussed, and two candidate internetworking technologies (high speed Ethernet and InfiniBand) are compared in different aspects for
DAQ. Three schemes based on 10 Gigabit Ethernet are presented and studied.
Chapter 6 is a general summary of the thesis.
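The packet sampling mentioned for Chapter 4 can be illustrated with a small sketch. It assumes simple 1-in-N random sampling, as used by sFlow, and shows how a total packet count is estimated from the samples together with the usual binomial approximation of the 95% relative error. The function names are hypothetical and this is not the thesis's implementation:

```python
import math


def estimate_total_packets(sampled_count: int, sampling_rate_n: int) -> int:
    """Scale a count of sampled packets up to an estimated total.

    Assumes each packet is sampled independently with probability 1/N.
    """
    return sampled_count * sampling_rate_n


def relative_error_95(sampled_count: int) -> float:
    """Approximate 95% relative error of the estimate.

    For a small sampling probability the standard deviation of the sample
    count c is about sqrt(c), so the relative error is about 1.96/sqrt(c).
    """
    if sampled_count <= 0:
        raise ValueError("need at least one sample")
    return 1.96 / math.sqrt(sampled_count)


# Example: 400 samples collected with 1-in-1000 sampling.
total = estimate_total_packets(400, 1000)
error = relative_error_95(400)
print(total)            # estimated number of packets seen by the port
print(f"{error:.1%}")   # approximate 95% relative error
```

The error depends only on the number of samples, not on the sampling rate, so a busy link can be monitored accurately even with sparse sampling.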
A cross-stack, network-centric architectural design for next-generation datacenters
This thesis proposes a full-stack, cross-layer datacenter architecture based on in-network computing and near-memory processing paradigms. The proposed datacenter architecture is built atop two principles: (1) utilizing commodity, off-the-shelf hardware (i.e., processor, DRAM, and network devices) with minimal changes to their architecture, and (2) providing a standard interface to the programmers for using the novel hardware. More specifically, the proposed datacenter architecture enables a smart network adapter to collectively compress/decompress data exchange between distributed DNN training nodes and assist the operating system in performing aggressive processor power management. It also deploys specialized memory modules in the servers, capable of performing general-purpose computation and network connectivity.
This thesis unlocks the potential of hardware and operating system co-design in architecting application-transparent, near-data processing hardware for improving datacenter performance, energy efficiency, and scalability. We evaluate the proposed datacenter architecture using a combination of full-system simulation, FPGA prototyping, and real-system experiments.
Novel Operation Modes of Accelerated Neuromorphic Hardware
The hybrid operation mode relies on a combination of conventional computing resources and a neuromorphic, beyond von Neumann system to perform a joint real-time experiment. The interactive operation mode provides prompt feedback to the user and benefits from high experiment throughput. The performance of a custom transport-layer protocol is evaluated connecting the accelerated neuromorphic system and the computer cluster. Wire-speed performance is achieved between host and eight FPGAs ((846.7 ± 1.2) MiB/s, 94% wire speed), and between two hosts using 10-Gigabit Ethernet (> 99%) as well as 40GbE (> 99%) to explore scaling behavior. The software architecture to process neuronal network experiments at high rates is presented including measurements which address the key performance indicators. During hybrid operation, the tight coupling between both resources requires low-latency communication. Using a custom-developed software framework, an average one-way latency between two host computers connected via 10GbE is found to be (2.4 ± 0.2) μs and (8.5 ± 0.4) μs to the neuromorphic system. A hybrid experiment is designed to demonstrate the hardware infrastructure and software framework. Starting from a conventional neuronal network simulation, the experiment is gradually migrated into a time-continuous experiment which interacts between a host computer and the neuromorphic system in real time. Results of the intermediate steps and the final, hybrid operation are evaluated.