Management, Optimization and Evolution of the LHCb Online Network
The LHCb experiment is one of the four large particle detectors running at the
Large Hadron Collider (LHC) at CERN. It is a forward single-arm spectrometer dedicated to testing the Standard Model through precision measurements of
Charge-Parity (CP) violation and rare decays in the b quark sector. The LHCb
experiment will operate at a luminosity of 2x10^32 cm^-2 s^-1, with a proton-proton
bunch crossing rate of approximately 10 MHz. To select the interesting
events, a two-level trigger scheme is applied: the first level trigger (L0) and the
high level trigger (HLT). The L0 trigger is implemented in custom hardware,
while the HLT is implemented in software running on the CPUs of the Event Filter
Farm (EFF). The L0 trigger rate is defined at about 1 MHz, and the size
of each event is about 35 kByte. It is a serious challenge to handle the resulting
data rate (35 GByte/s).
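As a sanity check, the 35 GByte/s figure is simply the product of the quoted L0 accept rate and event size:

```python
# Back-of-the-envelope check of the LHCb L0 output data rate,
# using the figures quoted in the abstract above.
l0_rate_hz = 1e6          # L0 trigger accept rate: ~1 MHz
event_size_bytes = 35e3   # average event size: ~35 kByte

data_rate = l0_rate_hz * event_size_bytes   # bytes per second
print(f"{data_rate / 1e9:.0f} GByte/s")     # -> 35 GByte/s
```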
The Online system is a key part of the LHCb experiment, providing all the
IT services. It consists of three major components: the Data Acquisition (DAQ)
system, the Timing and Fast Control (TFC) system and the Experiment Control
System (ECS). To provide the services, two large dedicated networks based on
Gigabit Ethernet are deployed: one for DAQ and another one for ECS, which are
referred to collectively as the Online network. A large network needs sophisticated monitoring for its successful operation. Commercial network management systems are
quite expensive and difficult to integrate into the LHCb ECS. A custom network
monitoring system has therefore been implemented based on the Supervisory Control And
Data Acquisition (SCADA) system PVSS, which is also used by the LHCb ECS, making
the monitoring system a homogeneous part of the LHCb ECS. In this thesis, it is demonstrated how
a large-scale network can be monitored and managed using tools originally made
for industrial supervisory control.
The thesis is organized as follows:
Chapter 1 gives a brief introduction to LHC and the B physics on LHC,
then describes all sub-detectors and the trigger and DAQ system of LHCb from
structure to performance.
Chapter 2 first introduces the LHCb Online system and the dataflow, then
focuses on the Online network design and its optimization.
In Chapter 3, the SCADA system PVSS is introduced briefly; then the
architecture and implementation of the network monitoring system are described
in detail, including the front-end processes, the data communication and the
supervisory layer.
Chapter 4 first discusses packet sampling theory and one of the packet
sampling mechanisms, sFlow, then demonstrates the application of sFlow to
network troubleshooting, traffic monitoring and anomaly detection.
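The core of sFlow is statistical: packets are sampled roughly 1-in-N at random, and totals are recovered by scaling the sampled counts back up by the sampling rate. A minimal sketch of that scaling step (variable names are illustrative, not taken from the thesis):

```python
import random

def sflow_estimate(packets, sampling_rate):
    """Estimate total packet and byte counts from a 1-in-N sFlow sample.

    `packets` is a list of packet sizes in bytes.  Each packet is
    sampled independently with probability 1/sampling_rate, and the
    sampled counts are scaled back up by the sampling rate."""
    sampled = [p for p in packets if random.random() < 1.0 / sampling_rate]
    est_packets = len(sampled) * sampling_rate
    est_bytes = sum(sampled) * sampling_rate
    return est_packets, est_bytes

random.seed(0)
traffic = [1500] * 100_000                  # 100k full-size Ethernet frames
pkts, byts = sflow_estimate(traffic, 256)   # typical 1-in-256 sampling
print(pkts, byts)                           # estimates close to 100000 pkts / 150 MB
```

The estimate is unbiased, and its relative error shrinks as the number of samples grows, which is why sampling at modest rates still yields usable traffic matrices on high-speed links.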
In Chapter 5, the upgrade of LHC and LHCb is introduced, the possible
architecture of the DAQ is discussed, and two candidate internetworking technologies (high-speed Ethernet and InfiniBand) are compared in different aspects for
DAQ. Three schemes based on 10 Gigabit Ethernet are presented and studied.
Chapter 6 is a general summary of the thesis.
Performance analysis and improvement of InfiniBand networks. Modelling and effective Quality-of-Service mechanisms for interconnection networks in cluster computing systems.
The InfiniBand Architecture (IBA) network has been proposed as a new
industrial standard with high-bandwidth and low-latency suitable for constructing
high-performance interconnected cluster computing systems. This architecture
replaces the traditional bus-based interconnection with a switch-based network for
the server Input-Output (I/O) and inter-processor communications. An efficient
Quality-of-Service (QoS) mechanism is fundamental to ensuring the important QoS
metrics, such as maximum throughput and minimum latency, without neglecting other
aspects such as delay guarantees, blocking probability, and mean queue
length.
Performance modelling and analysis has been and continues to be of great
theoretical and practical importance in the design and development of
communication networks. This thesis aims to investigate efficient and cost-effective
QoS mechanisms for performance analysis and improvement of InfiniBand
networks in cluster-based computing systems.
Firstly, a rate-based source-response link-by-link admission and congestion
control function with improved Explicit Congestion Notification (ECN) packet
marking scheme is developed. This function adopts the rate control to reduce
congestion of multiple-class traffic. Secondly, a credit-based flow control scheme is
presented to reduce the mean queue length and response time and to improve the throughput of the system. In order to evaluate the performance of this scheme, a new queueing
network model is developed. Theoretical analysis and simulation experiments show
that these two schemes are quite effective and suitable for InfiniBand networks.
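Credit-based flow control of the kind used on InfiniBand links can be illustrated in a few lines: the sender may transmit only while it holds credits, and the receiver returns a credit each time it drains a packet from its buffer. The following is a toy model for intuition, not the thesis's actual scheme:

```python
from collections import deque

class CreditLink:
    """Toy credit-based flow control: the sender consumes one credit per
    packet and stalls at zero; the receiver returns a credit when it
    frees a buffer slot, so the receive buffer can never overflow."""

    def __init__(self, buffer_slots):
        self.credits = buffer_slots   # one credit per free buffer slot
        self.buffer = deque()

    def send(self, packet):
        if self.credits == 0:
            return False              # sender stalls: no credit, no send
        self.credits -= 1
        self.buffer.append(packet)    # packet occupies a buffer slot
        return True

    def receive(self):
        packet = self.buffer.popleft()
        self.credits += 1             # credit returned to the sender
        return packet

link = CreditLink(buffer_slots=4)
sent = [link.send(i) for i in range(6)]
print(sent)            # [True, True, True, True, False, False]
link.receive()         # draining one packet frees a credit...
print(link.send(6))    # ...so the sender may transmit again
```

Because losses from buffer overrun are impossible by construction, the interesting performance questions shift to queue lengths and stall times, which is what the queueing model in the abstract analyses.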
Finally, to obtain a thorough understanding of the performance attributes
of the InfiniBand Architecture network, two efficient threshold-function flow control
mechanisms are proposed to enhance the QoS of InfiniBand networks: one is Entry
Threshold, which sets the threshold for each entry in the arbitration table, and the other is
Arrival Job Threshold, which sets the threshold based on the number of jobs in each
Virtual Lane. Furthermore, the principle of Maximum Entropy is adopted to analyse
these two new mechanisms with the Generalized Exponential (GE)-Type
distribution for modelling the inter-arrival times and service times of the input traffic.
Extensive simulation experiments are conducted to validate the accuracy of the
analytical models.
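The GE-type distribution mentioned above is commonly defined as a two-point mixture matching a given mean and squared coefficient of variation (SCV ≥ 1): with some probability the inter-arrival time is zero (a batch arrival), otherwise it is exponential. A minimal sampler sketch under that standard parameterization (assumed here, not taken from the thesis):

```python
import random

def ge_sample(mean, scv, rng=random):
    """Draw one inter-arrival time from a Generalized Exponential (GE)
    distribution with the given mean and squared coefficient of
    variation (scv >= 1).  With probability 1 - tau the gap is zero
    (batch arrival); otherwise it is exponential with rate tau/mean,
    so the overall mean and SCV match the requested values."""
    tau = 2.0 / (scv + 1.0)
    if rng.random() >= tau:
        return 0.0                        # batch arrival: zero gap
    return rng.expovariate(tau / mean)    # exponential branch

rng = random.Random(42)
samples = [ge_sample(mean=1.0, scv=4.0, rng=rng) for _ in range(200_000)]
avg = sum(samples) / len(samples)
print(round(avg, 2))   # close to the requested mean of 1.0
```

With scv=4 the mixture is zero 60% of the time, capturing the bursty, batched arrivals that make GE-type traffic a stress test for flow control mechanisms.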
Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs, and Performance Evaluation
TensorFlow has been the most widely adopted Machine/Deep Learning framework.
However, little exists in the literature that provides a thorough understanding
of the capabilities which TensorFlow offers for the distributed training of
large ML/DL models that need computation and communication at scale. Most
commonly used distributed training approaches for TF can be categorized as
follows: 1) Google Remote Procedure Call (gRPC), 2) gRPC+X: X=(InfiniBand
Verbs, Message Passing Interface, and GPUDirect RDMA), and 3) No-gRPC: Baidu
Allreduce with MPI, Horovod with MPI, and Horovod with NVIDIA NCCL. In this
paper, we provide an in-depth performance characterization and analysis of
these distributed training approaches on various GPU clusters including the Piz
Daint system (ranked 6th on the Top500 list). We perform experiments to gain novel insights along
the following vectors: 1) Application-level scalability of DNN training, 2)
Effect of Batch Size on scaling efficiency, 3) Impact of the MPI library used
for no-gRPC approaches, and 4) Type and size of DNN architectures. Based on
these experiments, we present two key insights: 1) Overall, No-gRPC designs
achieve better performance compared to gRPC-based approaches for most
configurations, and 2) The performance of No-gRPC is heavily influenced by the
gradient aggregation using Allreduce. Finally, we propose a truly CUDA-Aware
MPI Allreduce design that exploits CUDA kernels and pointer caching to perform
large reductions efficiently. Our proposed designs offer 5-17X better
performance than NCCL2 for small and medium messages, and reduce latency by
29% for large messages. The proposed optimizations help Horovod-MPI to achieve
approximately 90% scaling efficiency for ResNet-50 training on 64 GPUs.
Further, Horovod-MPI achieves 1.8X and 3.2X higher throughput than the native
gRPC method for ResNet-50 and MobileNet, respectively, on the Piz Daint
cluster.
Comment: 10 pages, 9 figures, submitted to IEEE IPDPS 2019 for peer review.
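The Allreduce step that dominates No-gRPC performance is typically the bandwidth-optimal ring algorithm. A plain-Python simulation of it (a pedagogical sketch; Horovod/NCCL/MPI implement this over real interconnects and GPUs):

```python
def ring_allreduce(grads):
    """Simulate ring-allreduce over `grads`, a list of equal-length
    gradient vectors (one per worker).  Each vector is split into one
    chunk per worker; p-1 reduce-scatter steps sum the chunks around
    the ring, then p-1 allgather steps circulate the finished sums,
    so every worker ends up holding the elementwise total."""
    p, n = len(grads), len(grads[0])
    assert n % p == 0, "toy version: length must divide evenly"
    chunk = n // p
    bufs = [[list(g[i * chunk:(i + 1) * chunk]) for i in range(p)]
            for g in grads]

    # Reduce-scatter: at step s, rank r receives chunk (r-s-1) mod p
    # from its ring predecessor and adds it to its own partial sum.
    for s in range(p - 1):
        sent = [[c[:] for c in b] for b in bufs]  # snapshot before exchange
        for r in range(p):
            c = (r - s - 1) % p
            bufs[r][c] = [a + b for a, b
                          in zip(bufs[r][c], sent[(r - 1) % p][c])]

    # Allgather: the finished chunks circulate once around the ring,
    # each rank overwriting its stale copy with the incoming sum.
    for s in range(p - 1):
        sent = [[c[:] for c in b] for b in bufs]
        for r in range(p):
            c = (r - s) % p
            bufs[r][c] = sent[(r - 1) % p][c][:]

    return [[x for c in b for x in c] for b in bufs]

grads = [[1, 2, 3, 4], [10, 20, 30, 40]]   # two workers' local gradients
print(ring_allreduce(grads))               # every worker ends with [11, 22, 33, 44]
```

Each rank sends and receives only 2(p-1)/p of the data per reduction, independent of the number of workers, which is why allreduce-based gradient aggregation scales so much better than a centralized parameter server.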
Enhancing HPC on Virtual Systems in Clouds through Optimizing Virtual Overlay Networks
Virtual Ethernet overlay provides a powerful model for realizing virtual distributed and parallel computing systems with strong isolation, portability, and recoverability properties. However, in extremely high throughput and low latency networks, such overlays can suffer from bandwidth and latency limitations, which is of particular concern in HPC environments. Through a careful and quantitative analysis, I identify three core issues limiting performance: delayed and excessive virtual interrupt delivery into guests, copies between host and guest data buffers during encapsulation, and the semantic gap between virtual Ethernet features and underlying physical network features. I propose three novel optimizations in response: optimistic timer-free virtual interrupt injection, zero-copy cut-through data forwarding, and virtual TCP offload. These optimizations improve the latency and bandwidth of the overlay network on 10 Gbps Ethernet and InfiniBand interconnects, resulting in near-native performance for a wide range of microbenchmarks and MPI application benchmarks.
RAS enhancements for RDMA communications
Ethernet as the communication medium in the enterprise data center has outlived all competing mediums and withstood the test of time with regard to speed and cost. The future is also poised for growth, with 40 and 100 Gbps speeds just over the horizon. The current state of the technology is being enhanced and extended with lossless features to allow for fabric convergence of Storage and Inter Process Communication (IPC) networks. It is under this medium that an increase in the adoption of Remote Direct Memory Access (RDMA) over Ethernet, using offloaded TCP/IP (iWARP) and RDMA over Converged Ethernet (RoCE) communication stacks to RDMA-capable NIC adapters (RNICs), is observed.
RDMA enables direct application-to-application communication over the network, resulting in numerous and significant benefits such as reduced CPU utilization, lower latency communications, increased energy efficiency, and reduced overall system requirements. However, with said benefits also comes increased software complexity in how RDMA interface users communicate. The RDMA communication semantics, which originate from the HPC domain, are heavily biased towards low-latency and high-bandwidth communications rather than Reliability, Availability, and Serviceability (RAS). As adoption increases and enterprise data centers begin to leverage RDMA over Ethernet, enhancements to the OS stack software architecture and the design of the components involved are required to address these deficiencies. Operating system interfaces, device drivers, adapter hardware design, and embedded firmware features must be viewed from a high-availability and maintainability point of view.
RAS enhancements for RDMA communications proposes the software architectural tradeoffs for enhancing the iWARP and RoCE RDMA implementations for communications in the enterprise data center, with new and traditional RAS features for existing communication stacks and devices. The architecture leverages software enhancements in traceability, availability, maintainability, serviceability, fault isolation and resource management, such that in the event of errors, the probability that the forensic data points needed to identify root cause are immediately and automatically available is increased.